Clustering of Web Search Results Based on Document Segmentation

Mohammad Hasan Haggag, Amal Aboutabl, Najla Mukhtar

Abstract


Clustering documents into accurate, compact clusters is increasingly important given the vast amount of information on the web. The problem is further complicated by the multi-topic nature of many documents. In this paper, we address the problem of clustering documents retrieved by a search engine, where each document covers multiple topics. Our approach segments each document and then clusters the segments of all documents using the Lingo algorithm. We evaluate cluster quality for both full-document clustering and segment clustering using the distance-based average intra-cluster similarity measure. Our results show that clustering document segments increases average intra-cluster similarity by approximately 75% compared with clustering the full documents retrieved by the search engine.
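
The abstract does not give the exact formula for the evaluation measure, so the following is only a minimal sketch of one common way to compute an average intra-cluster similarity. It assumes cosine similarity over raw term-frequency vectors and averages the pairwise similarity inside each cluster, then averages across clusters. The function names (cosine, intra_cluster_similarity, average_intra_cluster_similarity) and the whitespace tokenization are illustrative assumptions, not the authors' implementation, and the clustering step itself (Lingo) is not reproduced here.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def intra_cluster_similarity(texts):
    """Mean pairwise cosine similarity of the texts in one cluster.

    Assumption: a cluster with fewer than two members is treated as
    perfectly cohesive (similarity 1.0).
    """
    vectors = [Counter(t.lower().split()) for t in texts]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def average_intra_cluster_similarity(clusters):
    """Unweighted mean of the per-cluster similarities."""
    return sum(intra_cluster_similarity(c) for c in clusters) / len(clusters)


if __name__ == "__main__":
    # Hypothetical toy data: the same content grouped as whole documents
    # versus grouped as single-topic segments.
    full_docs = [["jaguar car engine jaguar cat habitat",
                  "jaguar cat jungle jaguar car dealer"]]
    segments = [["jaguar car engine", "jaguar car dealer"],
                ["jaguar cat habitat", "jaguar cat jungle"]]
    print(average_intra_cluster_similarity(full_docs))
    print(average_intra_cluster_similarity(segments))
```

Under these assumptions, the measure rises when each cluster holds texts about a single topic, which is the intuition behind clustering segments rather than multi-topic documents.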


DOI: 10.5539/cis.v6n3p89

This work is licensed under a Creative Commons Attribution 3.0 License.

Computer and Information Science   ISSN 1913-8989 (Print)   ISSN 1913-8997 (Online)
Copyright © Canadian Center of Science and Education
