Entropy-Based Measurement of Text Dissimilarity for Duplicate Detection

Identifying approximate similarity between a pair of strings is an essential step in data cleansing and data integration. Most existing approaches rely on generic or manually tuned distance metrics to estimate the similarity of potential duplicates, but they do not produce a similarity percentage for a pair of strings. In this paper we propose a method that uses entropy and information gain (IG) to measure the dissimilarity between a pair of strings and thereby improve data accuracy.


Introduction
Databases play an important role in today's IT-based economy. Many industries and systems depend on the accuracy of databases to carry out operations. The quality of the information stored in a database can therefore have significant cost implications for a system that relies on that information to function and conduct business. In an error-free system with perfectly clean data, constructing a comprehensive view of the data consists of linking (in relational terms, joining) two or more tables on their key fields. Unfortunately, data often lack a unique, global identifier that would permit such an operation. Furthermore, the data are neither carefully controlled for quality nor defined consistently across different data sources. Data quality is thus often compromised by many factors, including data entry errors.
When a database contains records collected from multiple information sources, it frequently includes field values and tuples that refer to the same entity but are not syntactically identical. Maintaining data quality is hence one of the most critical problems in data warehousing. Because data are entered into the system by different people, under different standards and at different levels, several data quality issues arise. One of the foremost is multiple representations of the same logical real-world entity, in other words duplicate records for the same entity created at data entry time. The two representations may be exactly or only approximately equal strings. Such duplicate records may be crucial at a given point in time, but they accumulate into a large amount of duplicate data in the system, which can affect the business.
Similarity is a complex concept that has been widely discussed in the linguistic, philosophical and information theory communities [Hatzivassiloglou et al., 1999]. [Frawley, 1992] discusses all semantic typing in terms of two mechanisms: the detection of similarities and of differences. An effective method for computing the similarity between short texts or sentences has many applications in natural language processing and related areas. In information retrieval, it is considered one of the best techniques for improving retrieval effectiveness [Park et al., 2005], and in image retrieval from the Web, using the short text surrounding an image can achieve higher retrieval precision than using the whole document in which the image is embedded [Coelho et al., 2004]. Text similarity is also beneficial for relevance feedback and text categorization [Ko et al., 2004; Liu and Guo, 2005], text summarization [Erkan and Radev, 2004; Lin and Hovy, 2003], word sense disambiguation [Lesk, 1986; Schutze, 1998], automatic evaluation of machine translation [Liu and Zong, 2004], evaluation of text coherence [Katarzyna and Szczepaniak, 2005; Lapata and Barzilay, 2005], and formatted document classification [Katarzyna and Szczepaniak, 2005].
We now point out some drawbacks of the existing methods, which fall into two types: technical and functional. From a technical point of view, simple similarity measures for string matching include character n-gram similarity, the Levenshtein distance [Vladimir I. Levenshtein, 1965] and the Jaro-Winkler measure [Winkler, 1999], in which the same penalty value is used regardless of the characters to be matched (or ignored). In general, these simple measures do not work well for generating a similarity percentage for a pair of strings. From a functional point of view, a major drawback of most existing systems is their dependence on a particular domain: once a similarity method has been designed for a specific application domain, it cannot easily be adapted to other domains, and most systems require at least some changes to move from one domain to another. This lack of adaptability does not correspond to human language usage, as sentence meaning may change, to varying extents, from domain to domain. To address this drawback, we aim to develop a method that is fully automatic, requires no user feedback, and can be used independently of the domain in applications that require a text similarity measure.
For this purpose, we introduce a new method based on entropy and information gain that takes two text parameters, C and R. The method automatically determines the dissimilarity percentage between a pair of strings. Entropy-based duplicate detection forms a generic dynamic truth table with a decision making algorithm and then calculates the entropy and information gain; we also discuss the application of the new method. An example is given to illustrate the efficiency of the proposed method.

Preliminaries
In this section, for convenience in formulating the proposed method, we introduce some preliminaries about entropy and information gain.

Entropy
Entropy is a measurement of the disorder of a system, used here as a measurement of text similarity. Systems tend to go from a state of order (low entropy) to a state of maximum disorder (high entropy). The entropy of a system is related to the amount of information it contains.

Information Gain
One of the most important components of the dissimilarity measurement is the criterion used to select which attribute will become a subset attribute for a given pair of strings. Several criteria exist; one of the best known is information gain.

Characteristics of Entropy Based Dissimilarity Calculation
To make the dissimilarity percentage easier to calculate, we first analyze the characteristics of the formulas in the proposed method. The method contains several kinds of component: an algorithm, formulas, truth table construction, entropy calculation, and information gain calculation. These characteristics are represented by the following steps:

• Apply the decision making algorithm to construct the truth table.

• Apply the entropy formula to calculate the entropy value.

• Apply the entropy value to ID3 to calculate the information gain.

Decision Making Algorithm
For (i=1 to n) Loop

Truth Table Construction
To calculate the logical representation of a pair of text strings, a common truth table is constructed. The construction takes into account which string of the pair is longer. Here Ci, i = 1, …, n, represents the i-th character of the column, and Rj, j = 1, …, m, represents the j-th character of the row in Table 1.
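As a rough illustration of such a truth table, the following Python sketch marks each position with 1 when the characters of C and R agree and 0 when they differ. The position-wise comparison, the treatment of extra characters in the longer string as mismatches, and the function name `truth_row` are assumptions made for illustration; the decision making algorithm is not spelled out in full in the text.

```python
from itertools import zip_longest


def truth_row(c: str, r: str) -> list[int]:
    """Return 1 where the characters of c and r match, 0 where they differ.

    Assumption: the comparison is position by position, and positions that
    exist only in the longer string count as mismatches (0).
    """
    # fillvalue=None never equals a real character, so padded positions give 0.
    return [int(ci == rj) for ci, rj in zip_longest(c, r, fillvalue=None)]


if __name__ == "__main__":
    # Sample strings taken from the experiment section.
    print(truth_row("LEVEL", "LEVAL"))  # [1, 1, 1, 0, 1]
```

For the sample strings this yields four 1s and one 0, which agrees with the counts reported in the experiment.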

Entropy Calculation
Entropy measures the homogeneity of a data sample. If the entropy is 0, the sample is completely homogeneous. If the entropy is 1, the sample is equally divided. For a sample with positive and negative elements, the entropy formula is

$$\mathrm{Entropy}(p, n) = -p \log_2(p) - n \log_2(n)$$

where p and n are the proportions of positive and negative elements in the sample.
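The formula can be sketched in Python as follows. The function name `binary_entropy` is hypothetical, p and n are taken to be proportions that sum to 1, and the usual convention that a proportion of 0 contributes nothing is assumed so that a completely homogeneous sample yields 0.

```python
import math


def binary_entropy(p: float, n: float) -> float:
    """Entropy(p, n) = -p*log2(p) - n*log2(n) for proportions p + n = 1."""
    total = 0.0
    for share in (p, n):
        # Convention (an assumption here): a share of 0 contributes nothing.
        if share > 0:
            total -= share * math.log2(share)
    return total


if __name__ == "__main__":
    print(binary_entropy(1.0, 0.0))  # homogeneous sample: 0.0
    print(binary_entropy(0.5, 0.5))  # equally divided sample: 1.0
```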

Information Gain
The information gain is based on the decrease in entropy after a dataset is split on an attribute. First, the entropy of the total dataset is calculated. Second, the dataset is split on the different attributes and the entropy of each branch is calculated. The branch entropies are then added proportionally to give the total entropy of the split, that is, the entropy value of the child dataset. This resulting entropy is subtracted from the entropy before the split, that is, the entropy value of the total dataset:

$$G = \left[\sum(\text{entropy values of total dataset}) - \sum(\text{entropy values of child dataset})\right] \times 100$$

The gain value G is the dissimilarity percentage of the pair of texts. Because the values are parameterized, the dissimilarity measure can be calculated at any time for any text values by changing only the parameters, with no need to modify the system.
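The split-based calculation can be sketched in Python in the standard ID3 form: the entropy of the whole dataset minus the proportionally weighted entropy of the branches. The function names, the toy labels, and the grouping into branches are hypothetical; the sketch shows the generic calculation only and should not be read as the exact procedure applied to strings in this paper.

```python
import math
from collections import Counter


def entropy(labels: list[int]) -> float:
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent: list[int], branches: list[list[int]]) -> float:
    """Entropy before the split minus the weighted entropy after it."""
    weighted = sum(len(b) / len(parent) * entropy(b) for b in branches)
    return entropy(parent) - weighted


if __name__ == "__main__":
    # Toy data (hypothetical): four labels split into two pure branches,
    # so the gain equals the entropy of the parent.
    parent = [1, 1, 0, 0]
    branches = [[1, 1], [0, 0]]
    gain = information_gain(parent, branches)
    print(f"gain = {gain:.3f}, as a percentage = {gain * 100:.1f}")
```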
The experimental results show that an automatic method can be used to select the parameter values of the system that produce the dissimilarity measurement for a pair of parameter texts, and that the flexibility introduced by the text parameters is useful for tuning the fuzzy automaton when dissimilarity must be calculated frequently for a particular problem. The system not only offers good accuracy without knowing a priori what kinds of errors the input parameter text might contain; its parameters can also be trained, treated as text, to better tune the accuracy of the fuzzy automaton for a particular problem.

Experiment Result
Table 2 shows how the truth table is constructed for the sample data "LEVEL" and "LEVAL". The length of string C is five; 0 occurs once in total and 1 occurs four times in total. If we apply these values to entropy-based duplicate detection (3) in all steps, the method (3) produces a dissimilarity percentage, which is simply the gain value G, of 20 percent.
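For reference, the entropy of the sample truth table can be worked out from these counts, taking p = 4/5 as the share of 1s and n = 1/5 as the share of 0s (reading p and n as proportions is an assumption). This covers the entropy step only; the subsequent gain calculation is not reproduced here.

```latex
\begin{aligned}
\mathrm{Entropy}\left(\tfrac{4}{5}, \tfrac{1}{5}\right)
  &= -\tfrac{4}{5}\log_2\tfrac{4}{5} - \tfrac{1}{5}\log_2\tfrac{1}{5} \\
  &\approx 0.2575 + 0.4644 \\
  &\approx 0.7219
\end{aligned}
```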

Conclusion
In this paper, we have proposed a method for identifying the dissimilarity percentage between a pair of texts. First, we introduced a decision making algorithm that calculates logical values for the strings C and R. Second, we calculated the entropy and information gain to find the dissimilarity percentage. The new method offers greater data accuracy without user feedback at the time of duplicate detection.

Table 1. Truth table construction for calculating the logical representation of a pair of text characters

Table 2. Truth table construction for sample data