Representation of Textual Documents by the Approach Wordnet and N-grams for the Unsupervised Classification (clustering) with 2d Cellular Automata: a Comparative Study

In this article we present a 2D cellular automaton (Class_AC) that addresses a text mining problem, namely unsupervised classification (clustering). Before experimenting with the cellular automaton, we vectorized our data by indexing text documents from the REUTERS 21578 collection using two representations: the Wordnet approach and the n-gram method. Our work is a comparative study of these two representations, the conceptual approach (Wordnet) and n-grams. Section 1 gives an introduction to biomimetics and text mining, Section 2 presents the representation of texts based on the Wordnet approach and on n-grams, Section 3 describes the cellular automaton for clustering, Section 4 presents the experiments and comparison results, and Section 5 gives a conclusion and perspectives.


Introduction
Biomimetics, in a literal sense, is the imitation of life. Biology has always been a source of inspiration for researchers in different fields, who have found an almost ideal model in the observation of natural phenomena and their adaptation to problem solving. Among these models are genetic algorithms, ant colonies, particle swarms, clouds of flying insects [Nicolas Monmarché 2003] and, of course, cellular automata, which we detail later in this article. The first of these methods are widely recognized and studied, whereas cellular automata are rarely used, in particular in the field of unsupervised classification. This motivated our use of the method in this field. It is known by the scientific community mainly as a tool for implementing machines (a cellular automaton (CA) is primarily a formal machine). We consider the cellular automaton as a biomimetic method. Researchers have long been inspired by nature: man has always dreamed of flying like a bird or swimming like a fish. Leonardo da Vinci (1452-1519) was a universal genius: an artist, philosopher and scientist, he was also the first true biomimetic researcher. After studying the flight of birds, he designed in 1505 flying machines, helicopters and parachutes. Unfortunately, the society of his time was not yet ready, and his ideas could not be turned into real products.
Since the 1950s, biomimetics has been steadily progressing and is now a major topic of current research.
Biomimetics is a scientific practice that tries to imitate or draw inspiration from natural systems. Examples include, among others, the imitation of fish skin to improve the aerodynamics of cars and other vehicles, and the algorithm inspired by ant colonies for finding the shortest path in a graph. Text mining is the combination of techniques and methods for the automatic processing of textual data in natural language. It is a multidimensional analysis of textual data that aims to analyze and discover knowledge and connections from the available documents. In text mining, similarities are used to produce synthetic representations of large collections of documents. Text mining comprises a series of steps that go from documents to text, from text to numbers, from numbers to analysis and from analysis to decision-making.

State of the art
To implement classification methods, we must choose a mode of representation for documents, because there is currently no learning method that can directly handle unstructured data (text) [Sebastiani, F 2002]. It is also necessary to choose a similarity measure and an algorithm for unsupervised classification.

a. Textual representation
A document (text) $d_i$ is represented by a numerical vector as follows:

$$d_i = (V_{1i}, V_{2i}, \ldots, V_{|T|i})$$

where $T$ is the set of terms (or descriptors) that appear at least once in the corpus ($|T|$ is the vocabulary size), and $V_{ki}$ is the weight (or frequency) of the $k$-th term in document $d_i$.
The simplest representation of textual documents is the "bag of words" representation [Aas, K.1999], which transforms texts into vectors where each element represents a word. This representation excludes any form of grammatical analysis and any notion of distance between words.
Another representation, called "bag of phrases", selects phrases (sequences of words in the texts, not the lexeme "phrases"), favouring those that are likely to carry meaning. Logically, such a representation should provide better results than the "bag of words" representation.
Another method is based on stemming, which seeks the root of a lexical term [Sahami M 1999], for example the infinitive for verbs and the singular for nouns.
Another representation, which has several advantages (mainly, it treats the text regardless of the language), is based on "n-grams" (an "n-gram" is a sequence of n consecutive characters).
There are different methods to compute the weight Vki, given that for each term it is possible to compute not only its frequency in the corpus but also the number of documents containing that term.
Most approaches [Sebastiani, F 2002] focus on the vector representation of text using the TF × IDF measure. TF stands for "Term Frequency": the number of occurrences of the term in a document. IDF stands for "Inverse Document Frequency": it decreases as the number of documents containing the term grows. The two are combined (by product) to assign a higher weight to terms that appear often in a document and rarely in the rest of the corpus.

b. Similarity measure
Several measures of similarity between documents have been proposed in the literature, in particular the Euclidean distance, the Manhattan distance and the cosine measure, which we detail in Section 3.

c. Unsupervised classification algorithm
The principle of unsupervised classification (clustering) is to group texts that seem similar (having common affinities) in the same class. Texts in different classes have different affinities.
The unsupervised classification of a set of documents is a highly combinatorial problem. Indeed, the number of possible partitions $P_n^k$ of $n$ documents into $k$ classes is given by the Stirling number of the second kind:

$$P_n^k = \frac{1}{k!} \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} j^n$$

The methods of unsupervised classification can be divided into two families: hierarchical classification methods and non-hierarchical classification methods (Figure 1).
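To give a feeling for how quickly the number of partitions grows, the Stirling number of the second kind can be evaluated directly from its standard closed form. The following is only an illustrative sketch; the function name `stirling2` is our own and does not come from the original study.

```python
from math import comb, factorial

def stirling2(n: int, k: int) -> int:
    """Number of ways to partition n labelled documents into k non-empty classes."""
    # Inclusion-exclusion form of the Stirling number of the second kind.
    total = sum((-1) ** (k - j) * comb(k, j) * j ** n for j in range(k + 1))
    return total // factorial(k)  # the sum is always divisible by k!

# Number of ways to split n documents into 3 classes for growing n.
for n in (5, 10, 20):
    print(n, stirling2(n, 3))
```

Even for a few dozen documents the count is far too large to enumerate, which is why heuristic clustering methods are used in practice.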
Unsupervised classification, or clustering, is a fundamental technique in data mining (structured or unstructured). Several methods have been proposed:
-Ascending hierarchical classification: successive agglomeration.
-Flat classification: the k-means algorithm: partition.
Some work in the field of classification:
-An overview of biomimetic algorithms for classification, carried out at the Computer Science Laboratory of the University of Tours, Ecole Polytechnique of the University of Tours [AZZAG H 2004].
-Visual search and classification of data by clouds of flying insects [Nicolas Monmarché 2003].
These are the works that most inspired ours.

Representation of texts based on the n grams
An n-gram is a sequence of n consecutive characters in a document. The set of n-grams (usually n = 2, 3, 4 or 5) is obtained by moving a window of n characters over the body of the text.
The window moves one character at a time, and at each step we take a snapshot. Together these snapshots give the set of all n-grams of the document. The frequencies of the n-grams found are then counted.
Note: a dash (-) represents the space between words in the text.
Since its introduction by Shannon in 1948, the concept of n-grams has been widely used in several areas such as speech recognition and information retrieval. Representing textual documents by n-grams has many advantages. N-grams capture knowledge of the most frequent words of each language, which facilitates language identification, and the method is independent of the language, whereas word-based systems, for example, are language-dependent.
Another advantage of representing texts with n-grams is tolerance to spelling mistakes and distortions, for example when a document is scanned with an OCR system, whose output is often imperfect.
For example, the word "character" can be read as "claracter". A word-based system will hardly recognize the word "character" or even its root, but an n-gram-based system can still take into account the other n-grams such as "aract", "racte", etc. (with n = 5). Some retrieval systems based on n-grams have retained their performance at a distortion rate of 30%, a rate at which no word-based system can function properly.
Finally, representing textual documents by n-grams requires no linguistic pre-processing: applying techniques such as stemming or stop-word removal to n-gram sequences does not improve performance.
For example, if a document contains many words derived from the same root, the frequencies of the corresponding n-grams increase without any prior linguistic processing.
For the phrase "the fisherman fishing", the 5-grams and the corresponding representative vector are shown in Figure 2.
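The sliding-window extraction described in this section can be sketched in a few lines. This is a minimal illustration, assuming that spaces are replaced by dashes as in the note above and that no padding is added at the ends of the text; the function name `char_ngrams` is hypothetical.

```python
from collections import Counter

def char_ngrams(text: str, n: int = 5) -> Counter:
    """Count character n-grams obtained by sliding a window one character at a time."""
    normalized = text.replace(" ", "-")  # a dash stands for the space between words
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))

profile = char_ngrams("the fisherman fishing", n=5)
for gram, freq in profile.most_common(3):
    print(gram, freq)
```

The resulting counter plays the role of the representative vector: each distinct 5-gram is a dimension and its count is the frequency.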

Representation of texts based on Wordnet Approach
We used in our experiments the Reuters 21578 corpus, a collection of 21578 English-language news documents. Before the clustering phase, the text documents must be indexed, i.e. vectorized without loss of semantics. The first step of indexing is pre-processing, which removes any symbol that is not a letter of the alphabet (full stops, commas, hyphens, numbers, etc.). This operation is motivated by the fact that these characters are not related to the content of the documents and omitting them does not change the meaning, so they can be neglected. The second step, called stopping, deletes all words that are too frequent (they do not help to distinguish between documents) or that play a purely functional role in the construction of sentences (articles, prepositions, etc.). As a result of stopping, the number of words in the collection, called the mass of words, is reduced by 50% on average. The words to eliminate, known as stop words, are collected in a stop list, which usually contains between 300 and 400 entries. Then comes stemming, which replaces every word in the document by its root, e.g. national, nationality and nationalization are replaced by their root "national", and conjugated verbs by their infinitives. Stemming does not affect the mass of words but reduces the average size of a document by 30%. We used the PORTER algorithm for this step. Then comes lemmatization using the WordNet approach. WordNet is a lexical database, commonly referred to as a lexicon. The words in WordNet are represented by their canonical form, called the lemma. This step prepares the next one, vectorization, which is crucial to indexing. Lemmatization replaces each word of the document by its SYNSET (set of synonyms in the lexicon).
The vectorization is carried out by the TF-IDF (Term Frequency / Inverse Document Frequency) method, which is derived from an information retrieval algorithm. The basic idea is to represent documents by vectors and to measure the closeness between documents by the angle between their vectors; this angle is assumed to represent a semantic distance. Each word of the bag is encoded by a scalar (number) called tf-idf, which gives a mathematical form to text documents:

$$\mathrm{tfidf}(i,j) = tf(i,j) \times idf(i), \qquad idf(i) = \log\frac{N}{N_i}$$

where:
• $tf(i,j)$ is the term frequency: the frequency of term $t_i$ in document $d_j$
• $idf(i)$ is the inverse document frequency: the logarithm of the ratio between the number $N$ of documents in the corpus and $N_i$, the number of documents containing the term $t_i$.
After vectorization, a document $d_i$ of the corpus is:

$$d_i = (x_1, x_2, \ldots, x_m)$$

where $m$ is the number of words in the $i$-th bag of words and $x_j$ is the tf-idf weight of the $j$-th word.
This indexing scheme gives more weight to words that appear with high frequency in only some documents. The underlying idea is that these words help to discriminate between texts on different subjects. The tf-idf has two fundamental limitations. The first is that longer documents typically receive higher weights because they contain more words, so term frequencies tend to be higher. The second is that the dependence on term frequency is too strong: if a word appears twice in a document d j , it does not necessarily mean that it is twice as important as in a document d k where it appears only once.
In our study we normalized the TF × IDF frequencies by the following formula:
The TF × IDF weighting has the following effects:
1) It reflects the importance of each word within the text.
2) A word that appears in all documents is not important for differentiating texts.
3) It highlights the relevance of words that are uncommon globally but common in certain documents.
TF × IDF encoding does not correct for the lengths of texts. To this end, the TFC coding is similar to TF × IDF but corrects for text length by cosine normalization, so as not to favour the longest texts.
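As an illustration of TF × IDF weighting followed by cosine (TFC) normalization, the sketch below builds unit-length vectors from already tokenized documents. It assumes raw counts for tf and a natural logarithm for idf; the exact normalization used in the study is not reproduced in the text, so this should be read as one common variant. The function name `tfc_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfc_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, unit-length tf-idf vectors)."""
    n_docs = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    # Number of documents containing each term (N_i in the text).
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = [tf[t] * math.log(n_docs / doc_freq[t]) for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        # Cosine normalization: divide by the Euclidean norm so length no longer matters.
        vectors.append([w / norm if norm else 0.0 for w in weights])
    return vocab, vectors

vocab, vectors = tfc_vectors([["oil", "price", "oil"], ["price", "rise"], ["price", "wheat"]])
print(vocab)
print(vectors[0])
```

In this toy corpus the term "price" appears in every document, so its idf and therefore its weight are zero, which matches effect 2) above.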

The cellular automaton for clustering
The cellular automaton we propose is a network of cells in a 2D space and belongs to the family (k, r), where k is the number of possible states of a cell, i.e. the cardinality of the state set, and r is the radius of the cell's neighbourhood. This automaton has 4 possible states (k = 4) and a neighbourhood radius of a single cell (r = 1). In other words, the neighbourhood used is a slightly modified Moore neighbourhood (the 8 cells surrounding the cell).
A cell of the automaton is thus dead, alive, isolated or contains a data item; the set of states of the automaton is {dead, alive, isolated, active}.
A dead cell contains the value 0, a living cell the value -1, an isolated cell the value -2, and an active cell contains a data item (the number of a document in the corpus).
We chose these values, and especially the value -1 for a living cell, to distinguish a living cell (which would otherwise hold the value 1) from an active cell containing document number 1. A cell thus contains a value in {-2, -1, 0, 1, 2, ..., N}, where N is the number of the last document of the corpus used.
If N is the number of documents to classify, the size of the 2D grid is m × m with m = 2 * (Int(sqrt(N)) + 1), where 2 is an empirical coefficient used to space the classes out in the grid.
Example: to classify 150 documents from the REUTERS 21578 corpus, a 13 x 13 grid is enough to hold the 150 documents, and a 26 x 26 grid is used to lay out the different classes of the 150 documents with spacing between them (Figure 3).
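The grid sizing rule and the state encoding can be written down directly. The sketch below is only illustrative: the constant and function names (`DEAD`, `ALIVE`, `ISOLATED`, `grid_side`, `empty_grid`) are our own, while the numeric codes and the formula for m are taken from the text.

```python
import math

# State codes from the text; an active cell holds a document number >= 1.
DEAD, ALIVE, ISOLATED = 0, -1, -2

def grid_side(n_docs: int, spacing: int = 2) -> int:
    """Side m of the grid: m = 2 * (Int(sqrt(N)) + 1), with 2 the empirical coefficient."""
    return spacing * (math.isqrt(n_docs) + 1)

def empty_grid(n_docs: int):
    """Create an m x m grid with every cell in the dead state."""
    m = grid_side(n_docs)
    return [[DEAD] * m for _ in range(m)]

grid = empty_grid(150)
print(len(grid), len(grid[0]))  # 26 26
```

For 150 documents this gives Int(sqrt(150)) + 1 = 13 and therefore a 26 x 26 grid, as in the example.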

The Neighbourhood
The neighbourhood used in the cellular automaton that we propose is a hybrid. It combines the Moore neighbourhood of radius 1, containing the 8 cells around the cell itself, with two reduced neighbourhoods of radius 1 that arise from the fact that the grid is planar (its edges do not wrap around). At the four corners of the grid, the neighbourhood contains only three (3) cells; for a cell (i, j) on the perimeter of the grid (excluding the corners), the neighbourhood is the set of five (5) cells of radius 1 surrounding it (Figure 4).
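The three cases (interior, border and corner cells) follow from clipping the Moore neighbourhood at the edges of the grid. A minimal sketch, assuming 0-based indices and a square m x m grid; the function name `neighbours` is hypothetical.

```python
def neighbours(i: int, j: int, m: int):
    """Moore neighbourhood of radius 1 on a planar (non-wrapping) m x m grid."""
    return [
        (i + di, j + dj)
        for di in (-1, 0, 1)
        for dj in (-1, 0, 1)
        if (di, dj) != (0, 0) and 0 <= i + di < m and 0 <= j + dj < m
    ]

m = 26
print(len(neighbours(0, 0, m)))    # corner cell: 3
print(len(neighbours(0, 5, m)))    # border cell: 5
print(len(neighbours(5, 5, m)))    # interior cell: 8
```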

The similarity matrix
We tested the classification with three types of similarity measure: the Euclidean distance, the Manhattan distance and the cosine measure. The similarity matrix is a symmetric matrix of dimension N × N, where N is the number of documents to classify; its diagonal is equal to zero for the Euclidean and Manhattan distances and equal to 1 for the cosine measure.
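The three measures and the resulting N × N matrix can be sketched as follows, using their standard definitions. The function names are our own, and the vectors in the usage example are toy values rather than data from the study.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0  # convention here: 0 for an all-zero vector

def similarity_matrix(vectors, measure):
    """Symmetric N x N matrix of pairwise values for the chosen measure."""
    n = len(vectors)
    return [[measure(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

vecs = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
print(similarity_matrix(vecs, euclidean)[0])
print(similarity_matrix(vecs, cosine)[2])
```

As stated above, the diagonal is 0 for the two distances and 1 for the cosine measure.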

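The Euclidean distance
The Euclidean distance between two document vectors $d_i$ and $d_j$ is

$$D(d_i, d_j) = \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^2}$$

where $x_{ik}$ denotes the tf-idf weight of the $k$-th term in $d_i$. The indices $i$ and $j$ are the indexes in the corpus of the documents to classify.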

Description of the algorithm Class_AC
-Index the documents of the corpus to classify.
-Vectorize each text document of the corpus by the TF-IDF method.
-Compute the similarity matrix from the resulting vectors: sim (i, j) = D (d i , d j ).
-Initialize all the cells of the automaton to the state 'dead' (state = 0).
At each iteration of the algorithm, the cells change state according to the transition rules of the cellular automaton, which seek to gather similar documents next to the active cells (those containing documents). The classification is then read off the grid (a data item may appear several times in the grid).

Simulation of the cellular automata
• At time t, the cell is dead.
• At time t+1, the cell is active (contains data) and its neighbourhood is alive.
• At time t, the cell is alive.
• At time t+1, after checking its neighbourhood and consulting the similarity matrix, the cell becomes active and contains the index of a similar document (if the distance is less than or equal to a predefined threshold).
These sequences are iterated until the cellular automaton contains only cells in the active, isolated and dead states; data clusters are separated by cells in the isolated state. (Stopping criterion: all documents have been processed.)

Experimentations
After testing the Class_AC algorithm on documents from the Reuters 21578 corpus, we obtained the following results in terms of number of classes and purity of the clusters for both methods of representing text documents, namely the conceptual representation based on WordNet and the representation by n-grams. Regarding the purity of a cluster, we used a similarity threshold on the distance between two documents: if this distance is less than or equal to the threshold, the documents are considered similar. For the cosine measure, the threshold is compared to the value | 1 - cos (Vi, Vj) |.
To assess cluster purity and classification error rate, we used two evaluation measures, the entropy and the F-measure. Both measures are based on two concepts, recall and precision, defined as follows:

$$recall(i,k) = \frac{N_{ik}}{N_{C_i}}, \qquad precision(i,k) = \frac{N_{ik}}{N_k}$$

where N is the total number of documents, i indexes the (predefined) classes, K is the number of clusters produced by the unsupervised classification, N Ci is the number of documents of class i, N k is the number of documents in cluster C k , and N ik is the number of documents of class i in cluster C k . The entropy and F-measure are calculated on a partition P as follows: The partition P that corresponds to the expected solution is the one that maximizes the F-measure or minimizes the associated entropy. (In our study P is the partition given by the classes obtained with the SOM method on the same number of documents.)
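Since the formulas for the entropy and the F-measure of a partition are not reproduced in the text, the sketch below uses definitions that are commonly found in the clustering literature: the F-measure is the class-size-weighted best F score over clusters, and the entropy is the cluster-size-weighted entropy of the class distribution inside each cluster (which depends on precision only). These definitions, the natural logarithm and the function name `evaluate` are assumptions, not necessarily the exact variants used in the study.

```python
import math
from collections import Counter

def evaluate(classes, clusters):
    """classes[d]: reference class of document d; clusters[d]: cluster assigned to d."""
    n = len(classes)
    n_class = Counter(classes)                  # N_Ci
    n_cluster = Counter(clusters)               # N_k
    n_joint = Counter(zip(classes, clusters))   # N_ik

    def f_score(i, k):
        n_ik = n_joint[(i, k)]
        if n_ik == 0:
            return 0.0
        recall, precision = n_ik / n_class[i], n_ik / n_cluster[k]
        return 2 * recall * precision / (recall + precision)

    f_measure = sum(n_class[i] / n * max(f_score(i, k) for k in n_cluster) for i in n_class)
    entropy = 0.0
    for k in n_cluster:
        for i in n_class:
            p = n_joint[(i, k)] / n_cluster[k]  # precision(i, k)
            if p > 0:
                entropy -= n_cluster[k] / n * p * math.log(p)
    return f_measure, entropy

print(evaluate(["a", "a", "b", "b"], [0, 0, 1, 1]))  # perfect clustering
print(evaluate(["a", "a", "b", "b"], [0, 0, 0, 0]))  # everything in one cluster
```

Under these definitions a perfect clustering reaches the maximum F-measure and zero entropy, while merging all documents into one cluster lowers the F-measure and raises the entropy.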

Definition of the threshold
Threshold 1: for the Euclidean and Manhattan distances, after normalization of the similarity matrix (distances in [0,1]), we allowed an error rate of 10% (threshold 1 = 0.1); for the cosine distance we tolerated 20%.
These threshold values were chosen after testing the classification by cellular automaton.
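The threshold test can be sketched as below. The text states that distances are normalized to [0,1] but not how; dividing by the largest entry of the matrix is an assumption made here for illustration, and the function names are hypothetical. The default thresholds 0.1 and 0.2 are the values given above.

```python
def normalize(matrix):
    """Rescale a distance matrix to [0, 1] by dividing by its largest entry (assumed scheme)."""
    largest = max(max(row) for row in matrix)
    return [[d / largest if largest else 0.0 for d in row] for row in matrix]

def similar_by_distance(d_norm: float, threshold: float = 0.1) -> bool:
    """Euclidean or Manhattan case: documents are similar if the normalized distance is small."""
    return d_norm <= threshold

def similar_by_cosine(cos_value: float, threshold: float = 0.2) -> bool:
    """Cosine case: the threshold is compared to |1 - cos(Vi, Vj)|."""
    return abs(1.0 - cos_value) <= threshold

distances = normalize([[0.0, 2.0, 0.1], [2.0, 0.0, 1.9], [0.1, 1.9, 0.0]])
print(similar_by_distance(distances[0][2]))  # True: 0.05 <= 0.1
print(similar_by_cosine(0.85))               # True: |1 - 0.85| <= 0.2
```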

Results
The experiment started with 50 documents, and the results showed that cellular automata are able to perform unsupervised classification of text documents (text clustering), because they effectively grouped similar documents. We then ran the experiment with 150 documents (the first 50 texts of three files of REUTERS 21578) and obtained conclusive results, as in the previous experiment, which led us to increase the size of the corpus to 350 documents in order to judge the quality of the classification after evaluation by entropy and F-measure.
In this study, for the representation of text documents by n-grams, we chose n = 5, i.e. the method of 5-grams.
We tested our cellular automaton on the Reuters 21578 corpus: we extracted 350 texts, indexed them and then computed the similarity matrix. The results obtained are grouped in the tables and figures below.
For each representation of documents, we obtained different classes with the three distances used, by varying the similarity threshold (Tables 1, 2, 3, 4, 5 and 6). The classes found are groupings of similar documents, guided by the threshold.

Interpretation
From Figure 5 we see that there is almost no difference between the two representations of documents in terms of number of classes. From Figures 6 and 7 we note that the n-gram method has an advantage over the Wordnet approach: for the F-measure, the n-gram curve is above that of Wordnet, and for the entropy, the n-gram curve is below that of Wordnet. We may therefore conclude that, for the cosine distance, the n-gram representation is much better than Wordnet.
Regarding the Euclidean distance, Figure 8 shows that the number of classes varies in roughly the same way for both methods of representation. According to Figure 9, which shows the F-measure curve, the Wordnet curve is well above the n-gram curve, but Figure 10, which shows the entropy curve, indicates that the n-gram curve is well below the Wordnet curve. It is therefore difficult to decide on the best representation of documents with the Euclidean distance: based on the F-measure, the Wordnet approach is better than n-grams, since the maximum F-measure is reached with Wordnet, whereas based on the entropy, the n-gram representation is better, since the minimum entropy is reached with n-grams. In conclusion, for this distance we opt for the textual representation by the WordNet approach, because we rely on the F-measure, which is based on both recall and precision, whereas the entropy is based only on precision.
Regarding the Manhattan distance, Figure 11 shows that the curves of the number of classes vary quite differently for the two methods of representation. From Figures 12 and 13 we note that the curves behave like those observed for the Euclidean distance. What was concluded for the Euclidean distance is therefore also valid for the Manhattan distance.
The colored boxes in the tables above mark the best classification threshold, chosen by minimum entropy (gray boxes) or maximum F-measure (blue boxes). The best values of the entropy or the F-measure are highlighted in yellow in the tables above. For the choice of the best classification we opted for the F-measure, because it is based on two concepts (recall and precision), as shown in formula (1).
In conclusion, a good classification requires a good representation of textual documents and a good choice of similarity metric, and a good representation requires choosing the right approach. Based on the results of our study, we can choose the representation of text documents by n-grams if the distance chosen is the cosine distance, and opt for the representation by the Wordnet approach if the distance chosen is the Euclidean or Manhattan distance. This is summarized in the following table.
In terms of time, the convergence of the algorithm is very fast (less than 1 second), as indicated in the table above for the Euclidean, Manhattan and cosine distances. What is reported in the literature on cellular automata is therefore confirmed in our study. We noticed that the execution time increases with the number of documents to classify.

Conclusion and perspectives
In conclusion, we proposed a first algorithm for unsupervised classification (clustering) using cellular automata, applied to documents represented by two methods (the Wordnet approach and n-grams). Our tests showed that this algorithm can solve a text mining problem, namely clustering.
The transition function used in our automaton changes cell states by forming groups (clusters) of documents that are similar up to a certain threshold. The methods of indexing text documents, such as TF-IDF, the n-gram approach and the Wordnet approach, helped us to convert documents into numbers so that the cellular automaton can be applied to numerical vectors. The passage from documents to text, from text to numbers, from numbers to analysis by cellular automata and from analysis to decision-making on the classification has thus been the subject of this article.
As future work, we are studying a 3D cellular automaton and its use for visualizing and navigating classes in 3D space. These algorithms will also be tested on other types of data, such as images and multimedia data in general.

-Hierarchical classification: tree of classes.
Distances between vectors $T_i$ and $T_j$ in multidimensional space are:

$$\cos(T_i, T_j) = \frac{T_i \cdot T_j}{\|T_i\| \, \|T_j\|}$$

where $T_i \cdot T_j$ represents the scalar product of the vectors $T_i$ and $T_j$, and $\|T_i\|$ and $\|T_j\|$ represent respectively the norms of $T_i$ and $T_j$.
If the cell C i,j dies when … then cell C i …

Figure 3. Example of grid for 150 documents

Table 1. Result of Cosine distance with Wordnet Approach

Table 2. Result of Cosine distance with N-grams Approach

Table 3. Result of Euclidean distance with Wordnet Approach

Table 4. Result of Euclidean distance with N-grams Approach

Table 5. Result of Manhattan distance with Wordnet Approach

Table 6. Result of Manhattan distance with N-grams Approach

Table 7. Choice of approach and distance