Context-sensitive Spelling Correction Using Google Web 1T 5-Gram Information

In computing, spell checking is the process of detecting, and sometimes suggesting corrections for, incorrectly spelled words in a text. Basically, a spell checker is a computer program that uses a dictionary of words to perform spell checking. The bigger the dictionary is, the higher the error detection rate. Because spell checkers are based on regular dictionaries, they suffer from the data sparseness problem: they cannot capture a large vocabulary of words including proper names, domain-specific terms, technical jargon, special acronyms, and terminologies. As a result, they exhibit a low error detection rate and often fail to catch major errors in the text. This paper proposes a new context-sensitive spelling correction method for detecting and correcting non-word and real-word errors in digital text documents. The approach hinges on data statistics from the Google Web 1T 5-gram data set, which consists of a large volume of n-gram word sequences extracted from the World Wide Web. Fundamentally, the proposed method comprises an error detector that detects misspellings, a candidate spellings generator based on a character 2-gram model that generates correction suggestions, and an error corrector that performs contextual error correction. Experiments conducted on a set of text documents from different domains and containing misspellings showed a high spelling error correction rate and a substantial reduction of both non-word and real-word errors. In a further study, the proposed algorithm is to be parallelized so as to lower the computational cost of the error detection and correction processes.


Introduction
For many years, computers were used solely to solve mathematical and computational problems. Nevertheless, during the last few years, this trend has changed radically as the rapid growth of the IT industry and advances in computing technologies gave birth to a new breed of computing applications. One of these applications is the processing of human languages, a field that is often known as natural language processing (NLP) or computational linguistics. Spell checking is a sub-field of computational linguistics whose function is to detect and sometimes correct words in a text that are not spelled correctly (Manning, Raghavan, & Schütze, 2008). In essence, a spell checker or spell corrector is a computer program, often integrated with a word processor, that performs spell checking and is mainly composed of three units: the error detector, which flags misspelled words by validating them against a dictionary or a lexicon of words; the candidate spellings generator, which provides alternative corrections for the detected errors; and the error corrector, which selects the best candidate as a replacement for the detected error.
At heart, a spell checker/corrector relies on a built-in dictionary of words to detect errors, and on a corpus-based probabilistic model to perform error correction. However, with the dynamic growth of new words and terminologies entering the language, closed and conventional dictionaries are no longer adequate to cover every single word in the vocabulary. Besides, traditional dictionaries rarely cover proper names, names of countries and regions, technical keywords, domain-specific terms, and acronyms. Consequently, there will be little data in these dictionaries to cover all words of the language, a problem known as data sparseness (Allison, Guthrie, & Guthrie, 2006). Basically, the richer the vocabulary of a dictionary or any source of text, the more accurate, in general, an NLP system is (Banko & Brill, 2001). Likewise, a study conducted by the computational linguistics community showed that a generic colossal corpus such as the web can most of the time overcome data sparseness problems (Kilgariff & Grefenstette, 2003). This paper proposes a new spelling error detection and correction technique for computer text documents, based on data statistics from Google Web 1T 5-gram data set (Google Inc., 2006), which encompasses a massive volume of n-gram words ranging from unigrams (1-grams), mostly suitable for dictionary implementation, to 5-gram word sequences emulating a universal text corpus, all extracted from the World Wide Web. Inherently, the proposed technique consists of several building blocks: an error detector that detects non-word errors using unigram statistics from Google Web 1T data set; a candidate spellings generator coupled with a character-based 2-gram model that generates candidate spellings for every detected error; and a context-sensitive error corrector that selects the best spelling candidate to replace the detected error using 5-gram statistics from Google Web 1T data set.

State of the Art
Practically, spelling errors in typewritten text vary between 1% and 3% (Grudin, 1983; Kukich, 1992), where 80% of them are usually caused by trivial editing operations such as insertion, deletion, substitution, and transposition (Damerau, 1964). Nonetheless, a different study (Peterson, 1986) pointed out that 94% of spelling errors are typically caused by such editing operations. In fact, spelling error detection and correction algorithms can be broken down into several types (Kukich, 1992): non-word error detection, which consists of detecting error words that are non-words, that is, words that cannot be found in a dictionary; isolated-word error correction, which consists of correcting non-word errors by looking at them in isolation, independently of their context; and context-dependent or context-sensitive error detection and correction, which consists of detecting and correcting errors according to their context in the sentence.
In fact, spelling error correction is not a new subject; it has been studied by researchers for decades. Several linguistic models and algorithms were proposed and experimented with; the most prominent ones are the noisy channel model, the n-gram model, the edit distance algorithm, and the context-sensitive error correction algorithms.

Noisy Channel Model
The concept behind the noisy channel model is to consider a spelling error as a noisy signal that has been distorted somehow during transmission. The quintessence of this approach is that if one could know how the original word was distorted, it is then easy to find the actual correction (Jurafsky & Martin, 2008). The noisy channel model is a special case of Bayesian inference (Bayes, 1963), which is principally a classification-type model that inspects some observations and ranks them into appropriate classes and categories. Bledsoe and Browning (1959), and Mosteller and Wallace (1964) were the first among other researchers to apply Bayesian inference to detect misspellings in computer-generated text.
Mathematically, the Bayesian model is a probabilistic model based on statistical assumptions and probability theory, namely the prior probability P(w) and the likelihood probability P(O|w), which are combined in the following equation:

$$w' = \underset{w}{\arg\max}\; P(O|w)\,P(w)$$

P(w) is called the prior probability and indicates the probability of w occurring in a specific corpus. P(O|w) is called the likelihood probability and denotes the probability of observing a misspelling O given that the correct word is w. O is the actual misspelled word and w is a potential spelling candidate. For every candidate, the product P(O|w)*P(w) is calculated; the candidate having the greatest product is selected as a correction for O and is denoted by w'.
The prior probability P(w) is straightforward as it is simply computed as $P(w) = \frac{C(w) + 0.5}{N + 0.5}$, where C(w) is the frequency or the number of occurrences of the word w in the corpus, and N is the total number of words in the corpus. In order to avoid zero counts for C(w), the value of 0.5 is added to the equation. On the other hand, the likelihood P(O|w) is harder to calculate than P(w), as it is difficult to determine exactly the probability of a word being misspelled in a particular way; however, it can be estimated by calculating the probability of possible wrongful insertions, deletions, substitutions, and transpositions in general.
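The selection rule above can be illustrated with a minimal Python sketch. It assumes the prior is smoothed as (C(w) + 0.5) / (N + 0.5) and that the likelihood values P(O|w) are already available; the word counts, corpus size, and likelihoods below are hypothetical toy numbers chosen for illustration, not values from any corpus.

```python
def prior(word, counts, total_words):
    """Smoothed prior P(w) = (C(w) + 0.5) / (N + 0.5)."""
    return (counts.get(word, 0) + 0.5) / (total_words + 0.5)


def best_correction(candidates, counts, total_words, likelihood):
    """Return w' = argmax over w of P(O|w) * P(w)."""
    return max(
        candidates,
        key=lambda w: likelihood[w] * prior(w, counts, total_words),
    )


# Hypothetical toy numbers, for illustration only.
counts = {"actress": 1000, "acres": 3000, "across": 8000}
total_words = 1_000_000
# Assumed values of P("acress" | w) for each candidate w.
likelihood = {"actress": 1e-4, "acres": 3e-5, "across": 1e-5}

print(best_correction(list(counts), counts, total_words, likelihood))
```

With these assumed numbers the candidate with the largest product is "actress", even though it has the smallest prior, because its assumed likelihood is the highest. A different likelihood estimate could reverse the ranking.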
Experiments conducted by Kernighan, Church and Gale (1990) proved that the Bayesian model is not perfect as it can fail to correct spelling errors in some cases, for instance, falsely correcting the spelling error "acress" as "acres", while the original word is "actress".

N-Gram Model
The n-gram model has been applied to many linguistic problems such as spelling correction, speech recognition, and word sequence prediction. Principally, the n-gram is a probabilistic model originally proposed by Markov (1913) and later applied by Shannon (1948), Chomsky (1956), and Chomsky (1957) to predict the next word in a particular sequence of words. In short, an n-gram is simply a collocation of words that is n words long. For instance, "the cat" is a 2-gram sequence also referred to as a bigram, "the cat is eating" is a 4-gram sequence, "the cat is eating food and drinking" is a 7-gram sequence, and so forth. Below are examples of 3-gram and 4-gram word sequences with their frequencies extracted from Google Web 1T n-gram data set (Google Inc., 2006).

3-gram word sequences:
 ceramics collectables collectibles (55)
 ceramics collectables fine (130)
 ceramics collected by (52)
 ceramics collectible pottery (50)
 ceramics collectibles cooking (45)

4-gram word sequences:
 serve as the incoming (92)
 serve as the incubator (99)
 serve as the independent (794)
 serve as the index (223)
 serve as the indication (72)
 serve as the indicator (120)

Unlike the prior probability P(w), which calculates the probability of a word w regardless of its surrounding words, the n-gram model calculates the conditional probability P(w|s) of a word w given the previous sequence of words s, that is, it predicts the next word based on the preceding n-1 words. For example, the conditional probability P(car|blue) is estimated from how often the whole sequence "blue car" occurs relative to the word "blue". Put differently, for the word "blue", the probability that the next word is "car" is to be computed.
Since it is too complicated to calculate the probability of a word given the whole previous sequence of words, the bigram or 2-gram model is used most of the time. It is denoted by $P(w_n|w_{n-1})$, the probability of a word $w_n$ given the previous word $w_{n-1}$. For a sequence of bigrams, the probability is calculated as follows:

$$P(w_1 w_2 \dots w_n) \approx \prod_{k=1}^{n} P(w_k|w_{k-1})$$

Several broader studies were conducted to improve the n-gram model from different perspectives; these include, but are not limited to, smoothing techniques suggested to solve the problem of zero frequency of n-grams that never occurred in a corpus (Jeffreys, 1948; Church & Gale, 1991), the weighted n-gram model that more accurately estimates the n-grams based on their location in the context (Kuhn & Mori, 1990), and the variable length n-gram model (Niesler & Woodland, 1996), which varies the length of grams to attain better overall system performance and compactness.
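As a small illustration of the bigram model, the following Python sketch estimates P(w_n|w_{n-1}) by relative frequency, that is, C(w_{n-1} w_n) / C(w_{n-1}), and multiplies these estimates over a word sequence. The toy token list is an assumption made for the example, no smoothing is applied, and the product starts from the second word because no sentence-start marker is modeled.

```python
from collections import Counter


def train_bigram_counts(tokens):
    """Count single words and adjacent word pairs in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_probability(prev, word, unigrams, bigrams):
    """Relative-frequency estimate P(word | prev) = C(prev word) / C(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]


def sequence_probability(words, unigrams, bigrams):
    """Product of P(w_k | w_{k-1}) over the sequence, from the second word on."""
    probability = 1.0
    for prev, word in zip(words, words[1:]):
        probability *= bigram_probability(prev, word, unigrams, bigrams)
    return probability


# Toy corpus, assumed for illustration only.
tokens = "the blue car passed the blue house near the red car".split()
unigrams, bigrams = train_bigram_counts(tokens)

print(bigram_probability("blue", "car", unigrams, bigrams))  # 0.5
print(sequence_probability(["the", "blue", "car"], unigrams, bigrams))
```

In this toy corpus "blue" occurs twice and is followed by "car" once, hence the estimate of 0.5. Any bigram that never occurs receives probability zero, which is exactly the situation the smoothing techniques cited above are meant to address.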

Minimum Edit Distance
The Minimum Edit Distance algorithm was first conceived by Wagner and Fischer (1974), and it is defined as the minimum number of edit operations needed to transform a string x into a string y. These operations are insertion, deletion, and substitution. In spelling correction, the purpose of the Minimum Edit Distance algorithm is to reduce the number of candidate spellings by eliminating the candidates with maximum edit distance, as they are considered to share fewer characters with the spelling error than other candidates. There exist different edit distance algorithms: Levenshtein (Levenshtein, 1966), Hamming (Hamming, 1950), and the Longest Common Subsequence (Allison & Dix, 1986) algorithms.
The Levenshtein algorithm employs a weighting mechanism that assigns a cost of 1 to every performed edit operation irrespective of its type (insertion, deletion, or substitution). For example, the Levenshtein edit distance between "cat" and "dog" is 3 (substituting c by d, a by o, and t by g). The Levenshtein edit distance between "ping" and "rings" is 2 (substituting p by r, and inserting s at the end of ping).
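The unit-cost Levenshtein distance can be computed with the usual dynamic programming recurrence. The Python sketch below is one common formulation that keeps only two rows of the table; the function name is chosen for this example.

```python
def levenshtein(source, target):
    """Minimum number of unit-cost insertions, deletions, and substitutions."""
    # previous[j] is the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # deletion from source
                current[j - 1] + 1,      # insertion into source
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("cat", "dog"))     # 3
print(levenshtein("ping", "rings"))  # 2
```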
The Hamming distance is yet another measure of the distance between two strings, defined only for strings of the same length. It is calculated by finding the minimum number of substitutions required to transform string x into string y. Practically, the Hamming distance between "ring" and "ping" is 1 (changing r to p), the Hamming distance between "334223" and "331227" is 2 (changing 4 to 1 and 3 to 7), and the Hamming distance between "ring" and "pings" is undefined because the strings are not of the same length.
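A minimal Python sketch of the Hamming distance follows. Raising an error for strings of unequal length is a design choice made for this example.

```python
def hamming(x, y):
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(x, y))


print(hamming("ring", "ping"))      # 1
print(hamming("334223", "331227"))  # 2
```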
Another popular technique for finding the distance between two words is the LCS, short for Longest Common Subsequence. The idea pivots around finding the longest common subsequence of two strings. A subsequence is a series of characters, not necessarily consecutive, that appear from left to right in a string. Accordingly, the longest common subsequence of two strings is the longest subsequence that appears in both of them. For example, if x=AABDDLPSTTACFM and y=BDADSAQPDSTAABCME, then the LCS is BDDPSTACM, which has length 9.
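One way to compute a longest common subsequence is the standard dynamic programming table over suffixes of the two strings, followed by a walk through the table to rebuild the subsequence. The Python sketch below follows that approach; when several subsequences share the maximum length, it returns just one of them.

```python
def longest_common_subsequence(x, y):
    """Return one longest common subsequence of x and y."""
    rows, cols = len(x), len(y)
    # table[i][j] holds the LCS length of the suffixes x[i:] and y[j:].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if x[i] == y[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to rebuild one optimal answer.
    result = []
    i = j = 0
    while i < rows and j < cols:
        if x[i] == y[j]:
            result.append(x[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(result)


print(longest_common_subsequence("AABDDLPSTTACFM", "BDADSAQPDSTAABCME"))
```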

Context-sensitive Spelling Error Correction
Context-sensitive spelling error correction is the task of detecting and correcting spelling errors that result in valid words, i.e. real-word errors. For instance, in the sentence "you should constantly backup your computer flies", the word "flies" is a real-word error most likely caused by a typographical mistake. Obviously, the writer did not intend to say that computers fly like planes, but most probably meant "computer files". This slight confusion produced a real-word error that is valid in the English dictionary, yet invalid with respect to the sentence in which it occurs. Context-sensitive spelling error correction tries to detect and correct such real-word errors by inspecting their grammatical and semantic contexts. Error correction based on grammatical or syntactic context attempts to apply grammatical rules to detect misspellings; for instance, the word "play" in the sentence "he play" is a grammatical error, since in the English language a third-person singular verb in the present tense must end with an "s". In contrast, error correction based on semantic context can correct the word "peace" into "piece" in the sentence "peace of cake". Since the words "peace" and "piece" are both valid nouns in the English language, such errors are hard for traditional, non-context-sensitive spell checkers to flag.
Mays, Damerau and Mercer (1991) proposed using the n-gram model to predict the actual correction for a real-word error. The idea centers around generating candidate spellings for every misspelled word by only applying simple edit operations such as insertion, deletion, and substitution, and then using n-gram statistics derived from a corpus to compute $P(w_n|w_{n-1})$. Church and Gale (1991) suggested the use of a noisy channel to predict the actual correction for a real-word error. The technique harnesses a 100-million-word corpus and n-gram statistics to correct errors according to their contextual information. Liu and Curran (2006) employed n-gram statistics to correct real-word errors using a big corpus of text collected by crawling the web. As a result, huge improvements were achieved due to the large volume and generality of web corpora. Carlson and Fette (2007) employed the same technique, but used a memory-based learner to correct cross-domain errors. The system was trained using n-gram data tokens extracted from the web. The experiments yielded high-precision real-word and non-word error correction.
Another approach was proposed by Demetriou, Atwell and Souter (1997), based on semantic knowledge and a large vocabulary to correct spelling errors. A semantic model was built based on semantic association between words in a text to largely decrease the semantic ambiguities in natural languages. Hodge and Austin (2003)

Candidate Spellings Generation Algorithm
The proposed candidate spellings generation algorithm builds a list of possible spelling corrections for every detected non-word error. Those candidate corrections are denoted by $C=\{c_{11}, c_{12}, c_{13}, \dots, c_{1r}, \dots, c_{m1}, c_{m2}, c_{m3}, \dots, c_{mq}\}$, where c denotes a particular candidate spelling, m denotes the total number of detected non-word errors, and r and q denote the total number of candidates generated for a particular detected error. In essence, the algorithm is based on a character-based 2-gram model which searches for unigrams in Google Web 1T data set having 2-gram character sequences in common with the error word.
For example, assuming that the original sentence to be validated is "case where only one sangle element is allowed to be stored" in which the word "single" was misspelled as "sangle", the non-word error "sangle" can be broken down into 2-gram character sequences as follows: sangle → sa, an, ng, gl, le.
Searching for unigrams in this list that share 2-gram character sequences with the error word "sangle" would give the following results:
 sa: salute sandbox sand sale sandwich salt sanitary
 an: tangle sanitary sandbox sand sandwich manangle
 ng: tangle single English angle tingle fringe ring
 gl: single singly tingle angle beagle tangle English
 le: single angle beagle unable tingle tangle disable

The top 10 words having the highest number of common 2-gram character sequences with the error word "sangle" are selected as candidate spellings, and they are:
 "tangle": It shares 4 sequences with "sangle"
 "angle": It shares 4 sequences with "sangle"
 "single": It shares 3 sequences with "sangle"
 "tingle": It shares 3 sequences with "sangle"
 "beagle": It shares 2 sequences with "sangle"
 "sand": It shares 2 sequences with "sangle"
 "sandbox": It shares 2 sequences with "sangle"
 "English": It shares 2 sequences with "sangle"
 "sanitary": It shares 2 sequences with "sangle"
 "sandwich": It shares 2 sequences with "sangle"

Choosing the top 10 unigrams ensures that the correction word is most of the time in the candidates list.
Unigrams having the same number of common 2-gram character sequences are prioritized according to their length with respect to the error word; for instance, "sangle" is made up of 6 characters, and hence words whose length is 6 are favored over those whose length is 5 or 7.
 Error Correction (candidates) // launches the error correction algorithm
}
Based on the above results, the generated candidate spellings are $C_{sangle}$ = {tangle, angle, single, tingle, beagle, sand, sandbox, English, sanitary, sandwich}. Now, the ultimate task is to select the best candidate to replace the error word "sangle", a task left for the third algorithm, namely the context-sensitive real-word error correction.
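The ranking idea described in this section can be sketched in Python as follows. This is an illustrative sketch and not the original listing: it scans a small in-memory word list instead of the Google Web 1T unigram file, the function names are chosen for the example, and breaking any remaining ties alphabetically is an added assumption.

```python
def char_bigrams(word):
    """Return the set of adjacent character pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def generate_candidates(error_word, unigrams, top_n=10):
    """Rank unigrams by the number of character 2-grams shared with the error."""
    error_bigrams = char_bigrams(error_word.lower())
    scored = []
    for word in unigrams:
        shared = len(error_bigrams & char_bigrams(word.lower()))
        if shared > 0:
            # Words closer in length to the error word are preferred on ties.
            length_gap = abs(len(word) - len(error_word))
            # Remaining ties fall back to alphabetical order (an assumption).
            scored.append((-shared, length_gap, word))
    scored.sort()
    return [word for _, _, word in scored[:top_n]]


# Small hypothetical vocabulary standing in for the unigram data set.
vocabulary = ["sand", "tangle", "single", "angle", "tingle", "beagle", "ring", "salt"]
print(generate_candidates("sangle", vocabulary, top_n=4))
# ['tangle', 'angle', 'single', 'tingle']
```

A linear scan is used here only for brevity. Over a vocabulary as large as the Web 1T unigrams, an index from each character 2-gram to the words containing it would be a more practical choice.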

The Context-sensitive Error Correction Algorithm
The proposed context-sensitive spelling error correction algorithm takes each generated candidate $c_{ik}$ together with the four words that precede the original error in the original text, leading to $S_k$ = "$w_{i-4}\ w_{i-3}\ w_{i-2}\ w_{i-1}\ c_{ik}$", where S denotes a 5-gram word sentence, w denotes a word preceding the original error, c denotes a particular candidate spelling for a particular error, i denotes the i-th word preceding the original error, and k denotes the k-th candidate spelling. Each constructed sentence $S_k$ is then compared with Google 5-gram word counts from Google Web 1T 5-gram data set. The candidate $c_{ik}$ that belongs to the sentence $S_k$ with the highest count is selected as a replacement for the originally detected error word. Back to the previous example, the list of S sentences can be outlined as follows:
 $S_1$ = "case where only one tangle"
 $S_2$ = "case where only one angle"
 $S_3$ = "case where only one single"
 $S_4$ = "case where only one tingle"
 $S_5$ = "case where only one beagle"
 $S_6$ = "case where only one sand"
 $S_7$ = "case where only one sandbox"
 $S_8$ = "case where only one English"
 $S_9$ = "case where only one sanitary"
 $S_{10}$ = "case where only one sandwich"

The candidate spelling $c_k$ (tangle, angle, single, etc.) in the sentence $S_k$ having the highest frequency in Google Web 1T 5-gram data set is selected as a correction for the error word "sangle". The proposed algorithm is context-sensitive as it relies on real-world word counts from Google data set initially extracted from the Internet. Therefore, even if the word "single" were misspelled as the real word "tingle", the algorithm should be able to correct it, since the sentence "case where only one tingle" is expected to occur far less often on the Internet than, for instance, "case where only one single".
Next is the pseudo-code for the proposed context-sensitive error correction algorithm.
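The selection step described above can be sketched in Python as follows. This is an illustrative sketch rather than the original listing: the 5-gram statistics are modeled as a plain dictionary from word tuples to counts, the function name is chosen for the example, and the counts shown are hypothetical placeholders, not values from the Google Web 1T data set.

```python
def correct_in_context(preceding_words, candidates, five_gram_counts):
    """Pick the candidate whose 5-gram with the four preceding words is most frequent."""
    context = tuple(preceding_words[-4:])
    best_candidate, best_count = None, -1
    for candidate in candidates:
        # A 5-gram missing from the table is treated as having a count of zero.
        count = five_gram_counts.get(context + (candidate,), 0)
        if count > best_count:
            best_candidate, best_count = candidate, count
    return best_candidate


# Hypothetical counts, for illustration only.
five_gram_counts = {
    ("case", "where", "only", "one", "single"): 120,
    ("case", "where", "only", "one", "angle"): 3,
}
candidates = ["tangle", "angle", "single", "tingle", "beagle"]
print(correct_in_context(["case", "where", "only", "one"], candidates, five_gram_counts))
```

Because the comparison is strict, ties are resolved in favor of the earlier candidate, so when no candidate's 5-gram is found the top-ranked candidate from the generation step is returned. This fallback is a choice made for the sketch.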

Experiments and Results
For evaluation purposes, 300 articles pertaining to various domains were used, including finance, business, IT, literature, political science, medicine, sports, and others. In total, they comprise 200,000 words including dictionary words, proper names, domain-specific terms and terminologies, acronyms, and technical jargon and expressions. Initially, those articles were error-free, as they did not contain any misspellings or linguistic mistakes; however, several words were randomly altered on purpose, yielding non-word and real-word errors in the text. These induced errors make up approximately 1% of the original text, that is, around 2,000 spelling errors. Table 2 gives the total number of words in the set of articles, in addition to the number of induced non-word and real-word errors. For comparison purposes, GNU Aspell (Atkinson, 2004) and Ghotit Dyslexia (Ghotit ltd., 2011) were used to spell check the test data. Aspell is a free-software, cross-platform spell checker that is the standard spell checker for the GNU software project and has been integrated into software applications such as Notepad++, Opera, gedit, and others. It is compatible with Unix-based operating systems, as well as Microsoft Windows. On the other hand, Ghotit Dyslexia is a proprietary contextual spell checker developed by Ghotit and mostly intended for people with dyslexia, dysgraphia, and other English writing difficulties. Ghotit is a Microsoft Word add-on that includes a context spell checker, a grammar checker, and an integrated word dictionary.
The results of executing Aspell to spell check the test data are given in Table 3, while the results for Ghotit are given in Table 4. In a head-to-head comparison, it is evident that the proposed method outperformed the other two existing solutions, as a higher number of non-word and real-word errors were detected and corrected successfully.
Particularly, the proposed method managed to correct 99% of total non-word errors and 70% of total real-word errors, yielding an error correction rate close to 93%. Only 7% of total errors were left either undetected or falsely corrected. In contrast, GNU Aspell yielded an error correction rate of 51%, while Ghotit yielded a 62% rate. The strong point of the proposed method was real-word (context-sensitive) error correction, where it outscored the Aspell spell checker by 800% (8 times more errors were corrected), and the Ghotit spell checker by 240% (2.4 times more errors were corrected). These results are primarily due to the large count of 5-gram tokens and their abundant statistics in the Google Web 1T data set harnessed by the proposed method. Moreover, since the data of Google Web 1T set are pulled from the Internet, it is rich in real data encompassing dictionary words, proper names, domain-specific terms and terminologies, acronyms, and technical jargon and expressions that can cover most of the words and their possible sequences in the language.

Conclusions and Future Work
This paper presented a novel context-sensitive approach for detecting and correcting non-word and real-word spelling errors in text documents. The proposed algorithm is based on Google Web 1T 5-gram data set, which houses a huge volume of word sequences originally extracted from Internet web pages. The goal of this new method was to improve the error correction rate of modern spell checkers, especially context-sensitive error correction. The proposed method excelled when put under test alongside other spell checkers, more particularly GNU Aspell and the proprietary Ghotit Dyslexia. In effect, 99% of non-word errors and 70% of real-word errors were corrected by the proposed method, while the closest competitor, namely Ghotit, reached approximately 70% for non-word errors and 29% for real-word errors. Overall, 93% of total errors were corrected by the proposed method, while Ghotit scored 62%. In a nutshell, the proposed method was able to correct 2.4 times more real-word errors than the best existing method. The major reason behind these results is the integration of Google Web 1T data set into the proposed algorithm, as it embraces a wide-ranging set of words and precise statistics about word associations that cover domain-specific terms, technical terminologies, acronyms, expressions, proper names, and almost every word in the language.
As for future work, a parallel algorithm is to be devised and experimented with; it can typically be implemented over multiprocessor machines or distributed computing infrastructures with the purpose of reducing the execution time and improving the performance of the error detection and correction processes.
Hodge and Austin (2003) proposed a supervised learning spell checking methodology based on the Hamming distance algorithm and on an n-gram model for detecting isolated word errors. The generated candidate spellings are ranked based on their Hamming distance and n-gram statistics. In due course, candidates having the highest score are selected as corrections for the detected errors.

The proposed error detection algorithm detects non-word errors $E=\{e_1, e_2, e_3, \dots, e_m\}$ in the original text $T=\{w_1, w_2, w_3, \dots, w_n\}$, where e is a misspelled word or simply an error word, m is the total number of detected errors, w is a word in the original text, and n is the total number of words in the text. The process starts by validating every word $w_i$ in T against Google's data set of unigrams; if $w_i$ is found, then $w_i$ is said to be correct and no correction is to take place. Otherwise, if the word $w_i$ is not found, then $w_i$ is said to be misspelled, and hence a correction is required. Google's data set of unigrams is already sorted alphabetically, and thus binary search can be employed to speed up error detection. Ultimately, a list of errors is generated and is denoted by $E=\{e_1, e_2, e_3, \dots, e_m\}$, where m is the total number of non-word errors detected in the original text. Below is the pseudo-code for the proposed error detection algorithm.

Proc Error Detection (T)
{
    W ← Split (T, " ")   // splits the text on every space and stores the returned words into an array W
    for (i ← 0 to N-1)   // iterates until all N words are validated
    {
        // searches for every W[i] in Google Web 1T unigrams data set
        flag ← BinarySearch (Google Unigrams Data Set, W[i])
        if (flag = true)
            continue   // W[i] is found in Google data set and thus it is spelled correctly; move to the next word
        else
            GenerateCandidates (W[i])   // W[i] is misspelled and requires correction; proceed with the candidate spellings generation algorithm
    }
}

The complete pseudo-code for the proposed candidate spellings generation algorithm is given below:
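As an illustration of the error detection procedure described above, the dictionary lookup can be sketched in Python with the standard `bisect` module performing the binary search. The short sorted word list stands in for the Google Web 1T unigram file and is an assumption of this example, as are the function names; tokenization is a plain whitespace split with no handling of punctuation or letter case.

```python
from bisect import bisect_left


def is_known(word, sorted_unigrams):
    """Binary search for a word in an alphabetically sorted list."""
    index = bisect_left(sorted_unigrams, word)
    return index < len(sorted_unigrams) and sorted_unigrams[index] == word


def detect_errors(text, sorted_unigrams):
    """Return the words of the text that are absent from the unigram list."""
    return [word for word in text.split() if not is_known(word, sorted_unigrams)]


# Hypothetical miniature unigram list, sorted as binary search requires.
unigrams = sorted([
    "allowed", "be", "case", "element", "is", "one",
    "only", "single", "stored", "to", "where",
])
sentence = "case where only one sangle element is allowed to be stored"
print(detect_errors(sentence, unigrams))  # ['sangle']
```

Each lookup takes a number of comparisons logarithmic in the size of the list, which is what makes a sorted unigram list attractive for this step.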

Table 2. Number of induced errors

Table 3. GNU Aspell test results

Applying the proposed method on the test data to detect and correct spelling errors resulted in 1,860 errors being corrected successfully, among which 1,581 were non-word errors and 279 were real-word errors. As a result, around 93% of the total errors were corrected; around 99% of total non-word errors were corrected; and around 70% of total real-word errors were corrected successfully. Table 5 outlines the obtained test results for the proposed method. Below are examples of successful and unsuccessful corrections observed during the execution of the proposed error correction method. It is worth noting that errors are marked by an underline and results are interpreted using a special notation in the form of [error-type; corrected error; intended word].
 … would like to ask you to voice your sopport for this bill … → [non-word error; support; support]
 … but the content of a computer is vulnerable to fee risks … → [real-word error; few; few]
 … medical errors effect us all whether we are involved or not … → [real-word error; affect; affect]
 … many of the best poems are found in too collections … → [real-word error; two; two]

Not Corrected 2%:
 … whether you hit the road in a sleek imported sporting car … → [real-word error; sporting; sports]
 … we fear the precaution of medication prior to tonsillectomy … → [real-word error; fear; feel]

Falsely Corrected 4%:
 … After all I slept near my door on the pavement … → [real-word error; dog; door]
 … I saw the ball running too fast … → [real-word error; bus; ball]