A New Model for Automatic Sentence Segmentation

This article presents the Context Overlapping Model (COM) for the task of Automatic Sentence Segmentation (ASS). Compared with HMM, COM expands the observation from a single word unit to an n-gram unit, and neighboring units share an overlapping part. Owing to the co-occurrence constraint and the transition constraint, COM reduces the search space and improves segmentation accuracy. In this research we treat ASS as a sequence labeling task and apply the 2-gram COM to it. The experimental results show that the overall correct rate in the open test reaches 90.11%, significantly higher than the 85.16% of the baseline model (second-order HMM).


Introduction
Automatic sentence segmentation (ASS) is an important step in Automatic Speech Recognition (ASR). Because morphological hints such as capitalization are lacking, the task is more difficult than sentence boundary detection (SBD). Moreover, there are always many misrecognized words, which makes ASS even more difficult. Little work has been done in this area, but recently it has gained more interest from the research community (Mikheev, 2003:216). CYBERPUNC (Beeferman, Berger, and Lafferty, 1998) is a system that aims to segment sentences in speech transcripts. It was designed to augment the standard trigram language model of a speech recognizer with information about sentence splitting. CYBERPUNC was evaluated on the WSJ corpus and achieved a precision of 75.6% and a recall of 65.6%. Apart from this experiment, few experiments on English have been reported for this task. Tanev and Mitkov (2000) evaluated a sentence segmentation system for Slavonic languages. This system used nine main end-of-sentence rules with a list of abbreviations and achieved 92% precision and 99% recall, measured on a text of 190 sentences. In Chinese, ASS is more difficult because the sentence boundary is linguistically ambiguous. So far there is no formal rule for defining a sentence in Chinese. Since Chinese sentences are segmented by commas as well as by periods, question marks, and exclamation marks, the comma is always regarded as a sentence boundary. Moreover, some research (Stevenson and Gaizauskas, 2000) shows that this task is difficult not only for computers but also for humans: human performance is 90% precision and 75% recall, and there are substantial disagreements among human annotators.
In our research we regard ASS as an annotation task. We convert the task of segmenting sentences into the task of determining the state of each word in the sequence. The states are terminal (denoted by T) and non-terminal (denoted by C). Given a word sequence w1w2…wn, we want to obtain a state sequence s1s2…sn. If the state of wi is T, then wi is the end of a sentence; if the state of wi is C, then there is no stop after wi. With this conversion, we can apply a general tagging model such as the Hidden Markov Model (HMM) to the task. HMM is widely used in part-of-speech tagging. In ASS, however, the number of states is much smaller than the number of PoS tags, and the state depends heavily on the neighboring words. Some experimental results show that HMM performs poorly in ASS because of its observation independence assumption.
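The conversion between a punctuated word sequence and a T/C state sequence can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `to_states` and `to_sentences` and the ASCII punctuation set are assumptions made for the example.

```python
# Assumed set of sentence-ending marks (illustrative; the real corpus uses Chinese punctuation).
TERMINALS = {",", ".", "?", "!", ";", ":"}

def to_states(tokens):
    """Drop terminal punctuation and tag each word T (sentence-final) or C."""
    words, states = [], []
    for tok in tokens:
        if tok in TERMINALS:
            if states:                # the word just before a stop becomes terminal
                states[-1] = "T"
        else:
            words.append(tok)
            states.append("C")
    return words, states

def to_sentences(words, states):
    """Invert the labeling: cut the word sequence after every T."""
    sentences, current = [], []
    for word, state in zip(words, states):
        current.append(word)
        if state == "T":
            sentences.append(current)
            current = []
    if current:                       # trailing words without a terminal tag
        sentences.append(current)
    return sentences
```

Segmenting a transcript then amounts to predicting the state list for an unpunctuated word list and passing both to `to_sentences`.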
To improve the performance of ASS, we create a new model based on HMM, named the Context Overlapping Model (COM). COM expands the observation from a single word unit to an n-gram unit, and neighboring units share an (n-1)-gram part. Through this overlapping part the model uses the neighboring observations to determine the current state. The comparison with HMM shows that COM significantly outperforms HMM in ASS.
The rest of this article is structured as follows. The second part introduces the COM model. The third part addresses how to estimate the parameters and handle data sparseness. The fourth part covers the evaluation criteria, and the fifth part presents the experiments and results. The final part offers some discussion and future work.

COM Model
COM is based on HMM. HMM is a generative model that defines a joint probability distribution p(X,Y), where X and Y are random variables ranging over observation sequences and their corresponding state sequences, respectively. HMM assumes that the observation element at any given time depends directly only on the state at that time.
COM differs from HMM with respect to the observation independence assumption. COM can be divided into different kinds according to the length of the observation unit. Here we present the formalism of the 2-gram COM, in which the observation unit is 2 words long. The formalisms of other n-gram COMs (n>2) can be derived from that of the 2-gram model.
In the 2-gram COM there is a basic state set $\{q^{1}, \ldots, q^{m}\}$. The corresponding state of a 2-gram observation unit $w_{i-1}w_i$ is a state unit $q_{i-1}^{k}q_i^{j}$, where $q_{i-1}^{k}$ is one of the basic states of $w_{i-1}$ and $q_i^{j}$ is one of the basic states of $w_i$. The number of candidate state units is therefore no more than the number of combinations of the states of $w_{i-1}$ and $w_i$.
The state sequence with the highest joint probability is found by $\hat{Q} = \arg\max_{Q} p(Q, S)$, where $Q$ denotes the state sequence, $S$ denotes the observation sequence, and $\hat{Q}$ denotes the final state sequence, whose joint probability is the highest. For the convenience of computation, we insert two "*B*" symbols, whose state is "B", at the beginning of the sequence and two "*E*" symbols, whose state is "E", at the end of the sequence. The above formula then becomes $\hat{Q} = \arg\max_{Q} \prod_{i} P_t(q_{i-1}q_i \mid q_{i-2}q_{i-1}) \, P_e(w_{i-1}w_i \mid q_{i-1}q_i)$. In this model there is an overlapping part between the neighboring observation units. $\hat{Q}$ is a sequence consisting of h+1 2-gram state units of the form $(q_0q_1)(q_1q_2)\ldots(q_{h-1}q_h)(q_hq_{h+1})$, with $q_0$ = B and $q_{h+1}$ = E. It is obvious that the final state sequence can be obtained from the above sequence.
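The search over overlapping 2-gram state units can be sketched with a Viterbi-style recursion. This is only an illustration under stated assumptions: the scoring callables `log_pt` (log transition probability between consecutive state units) and `log_pe` (log emission probability of a 2-gram observation given its state unit), the function name `com_viterbi`, and the toy scorers are all hypothetical and are not taken from the paper.

```python
import math

def com_viterbi(words, log_pt, log_pe, states=("T", "C")):
    """Best word-state sequence when both states and observations are 2-gram units."""
    padded = ["*B*", "*B*"] + list(words) + ["*E*", "*E*"]

    def allowed(word):
        if word == "*B*":
            return ("B",)
        if word == "*E*":
            return ("E",)
        return states

    # best maps a state unit (q_{i-1}, q_i) to (log score, state path so far)
    best = {("B", "B"): (0.0, ["B", "B"])}
    for i in range(2, len(padded)):
        obs = (padded[i - 1], padded[i])
        new_best = {}
        for prev_unit, (score, path) in best.items():
            shared = prev_unit[1]          # state shared by the two overlapping units
            for q in allowed(padded[i]):
                unit = (shared, q)
                cand = score + log_pt(prev_unit, unit) + log_pe(obs, unit)
                if unit not in new_best or cand > new_best[unit][0]:
                    new_best[unit] = (cand, path + [q])
        best = new_best
    score, path = max(best.values(), key=lambda item: item[0])
    return path[2:-2], score               # strip the B/E padding states

# Toy scorers, for illustration only.
def toy_log_pt(prev_unit, unit):
    return 0.0                             # uniform transitions

def toy_log_pe(obs, unit):
    if unit[1] in ("B", "E"):
        return 0.0
    want = "T" if obs[1] == "stop" else "C"
    return math.log(0.9 if unit[1] == want else 0.1)
```

Because each new unit must start with the last state of the previous unit, only compatible unit pairs are ever expanded, which is one way to read the reduced search space mentioned in the text.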

Parameters estimation and Evaluation Criteria
There are two main parameters to be estimated in COM: (1) $P_t$: state transition probability; (2) $P_e$: state emission probability.
We apply maximum likelihood estimation to estimate these parameters from the tagged corpus. The details of the estimation are not introduced here.
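Since the estimation details are omitted, the following is only a plausible sketch of maximum likelihood estimation by relative-frequency counting over 2-gram units. It assumes that the transition probability conditions a state unit on the preceding state unit and that the emission probability conditions a 2-gram observation on its state unit; the function name `estimate` and the data layout are hypothetical.

```python
from collections import Counter

def estimate(tagged_sentences):
    """tagged_sentences: list of sentences, each a list of (word, state) pairs."""
    trans, trans_ctx = Counter(), Counter()
    emit, emit_ctx = Counter(), Counter()
    for sent in tagged_sentences:
        # Pad with two begin and two end symbols, as described in the text.
        seq = [("*B*", "B")] * 2 + list(sent) + [("*E*", "E")] * 2
        prev_unit = None
        for (w1, q1), (w2, q2) in zip(seq, seq[1:]):
            unit, obs = (q1, q2), (w1, w2)
            emit[(obs, unit)] += 1
            emit_ctx[unit] += 1
            if prev_unit is not None:
                trans[(prev_unit, unit)] += 1
                trans_ctx[prev_unit] += 1
            prev_unit = unit
    # Relative frequencies are the maximum likelihood estimates.
    p_t = {key: n / trans_ctx[key[0]] for key, n in trans.items()}
    p_e = {key: n / emit_ctx[key[1]] for key, n in emit.items()}
    return p_t, p_e
```

Unseen units receive no entry in either table, which is where a back-off strategy would take over.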
Because the observation is expanded, the sparseness problem in n-gram COM is more serious than in HMM. COM applies a back-off strategy to deal with sparse data. The main idea is that if an n-gram (n>2) is not in the n-gram vocabulary, which is obtained from the training corpus, it is replaced by the (n-1)-gram. If … is not in the 2-gram vocabulary, then the state units of …

Corpus and Preprocessing
We apply COM to the Chinese ASS task. Because we have no speech transcripts and focus only on the performance of COM without using any speech information, we take a general Chinese corpus as the training and test corpus. The training and test data are all taken from the People's Daily of the year 2000, which has been segmented and manually assigned PoS tags by Peking University. The division of the corpus is displayed in Table 1.
Before training and tagging, the corpus is preprocessed. First, all named entities, such as personal names, location names, and organization names, and all digits are replaced by particular symbols. For example, personal names are all replaced by "*PerN*". Second, we convert the comma, period, question mark, exclamation mark, semicolon, and colon to a single tag "T" as the sentence terminal.
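The first preprocessing step can be sketched as a lookup from PoS tag to placeholder symbol. Only "*PerN*" is named in the text; the other placeholder symbols, the function name `normalize`, and the PoS tag keys (assumed here to follow the PKU tag set: nr for personal names, ns for locations, nt for organizations, m for numerals) are assumptions for illustration.

```python
# Only "*PerN*" comes from the text; the remaining symbols are hypothetical.
PLACEHOLDERS = {
    "nr": "*PerN*",   # personal name
    "ns": "*LocN*",   # location name (assumed symbol)
    "nt": "*OrgN*",   # organization name (assumed symbol)
    "m": "*Num*",     # numeral (assumed symbol)
}

def normalize(tagged_tokens):
    """Replace named entities and numerals by placeholder symbols.

    tagged_tokens: list of (word, pos) pairs from the segmented, PoS-tagged corpus.
    """
    return [PLACEHOLDERS.get(pos, word) for word, pos in tagged_tokens]
```

Collapsing open-class items in this way shrinks the 2-gram vocabulary, which should ease the sparseness problem for the expanded observation units.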
The baseline model is the second-order HMM, whose results are compared with those of the 2-gram COM.

Results
The results are shown in Table 2. COM significantly outperforms HMM in both precision and recall.
It is interesting to see that the gap between the C rates of the two models is only about 5 percent, whereas the gaps in the other measures are much bigger, at more than 15 percent. A possible reason is the unbalanced distribution of word states. In the test corpus the number of T states is 20134 and the number of non-terminal states, N, is 134164, so state N is far more frequent than state T. Even if we guessed all the states to be N, the overall correct rate would not be lower than 85%. For this reason, together with its observation independence assumption, HMM performs poorly in ASS. HMM assigns much more probability to state N in both the observation and transition probabilities, regardless of the neighboring words. COM, in contrast, takes the neighboring words into the model, so both the observation and transition probabilities are influenced by them. In this sense COM outperforms HMM in predicting sentence terminals.
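As a worked check of the all-N guess, using only the two counts reported for the test corpus, the overall correct rate of tagging every word as N would be:

```latex
\[
\frac{134164}{134164 + 20134} = \frac{134164}{154298} \approx 0.8695
\]
```

That is about 86.95%, which is consistent with the lower bound of 85% stated in the text.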

Discussion
COM is suitable not only for ASS. We have also applied it to Chinese word segmentation, part-of-speech tagging, and chunk detection, where COM also achieves satisfactory results. Compared with HMM, COM has the advantages of a smaller search space and a higher tagging precision. Compared with discriminative models such as CRF and Maximum Entropy, COM has the advantages of less training time and a comparable precision. All of these indicate that COM is a general, efficient, and robust model for sequence labeling.

Conclusion
In this paper we apply COM to the task of ASS and achieve better performance than HMM. COM is superior to HMM because it overcomes the limit of the observation independence assumption by expanding the observation unit and constructing an overlapping part between neighboring units. In further studies, we will explore possible applications of COM to other sequence labeling tasks in natural language processing.

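If $w_i$ is not in the unigram vocabulary, it is handled in the same way as in HMM.
We use the following criteria to evaluate the performance of COM in ASS.
(1) Overall correct rate (C): the proportion of all state tags that the model assigns correctly.
(2) Precision rate (P) of terminal tags: the ratio of Correct_Terminal_Tags to the number of terminal tags assigned by the model, where Correct_Terminal_Tags denotes the number of correct sentence stop tags produced by the tagging model.
(3) Recall rate (R) of terminal tags: the ratio of Correct_Terminal_Tags to Total_Terminal_Original_Tags, where Total_Terminal_Original_Tags denotes the total number of terminal tags in the original text.
(4) F score: combines P and R and denotes the average performance of the model.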
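The evaluation criteria can be computed from a gold state sequence and a predicted one, as in the sketch below. The function name `evaluate` is hypothetical, and the F score is assumed to be the usual harmonic mean 2PR/(P+R), since the exact formula is not given in the text.

```python
def evaluate(gold, pred, terminal="T"):
    """Return overall correct rate C, precision P, recall R and F score."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    correct = sum(g == p for g, p in zip(gold, pred))
    # Terminal tags that the model placed where the original text has one.
    correct_terminal = sum(g == p == terminal for g, p in zip(gold, pred))
    pred_terminal = sum(p == terminal for p in pred)
    gold_terminal = sum(g == terminal for g in gold)
    c = correct / len(gold) if gold else 0.0
    p = correct_terminal / pred_terminal if pred_terminal else 0.0
    r = correct_terminal / gold_terminal if gold_terminal else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0   # assumed harmonic-mean F score
    return c, p, r, f
```

Reporting P, R and F on the terminal tag alongside C matters here because the non-terminal state dominates the corpus, so C alone can look high even when few sentence ends are found.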
one state unit of the observation unit

Table 1. Division of corpus