Abstract Sentence Classification for Scientific Papers Based on Transductive SVM



Introduction
In recent years, the study of short text has become increasingly important in many fields, such as natural language processing, information retrieval and machine learning. Our work aims at classifying abstract sentences in scientific papers into four categories according to their content. An abstract usually consists of a background part, a goal part, a method part and a conclusion part, and these are also the most important parts of the full text of a scientific paper. We therefore consider it valuable to classify abstract sentences into these four categories, so that a large number of predicted sentences can support further uses such as frequent pattern mining, information extraction and writing assistance for scientific papers.
The abstract covers almost the whole content of a paper clearly and concisely, and its sentences usually carry a large amount of information in spite of their short length. In traditional text classification tasks, statistical machine learning methods usually perform well with the vector space model. Several studies suggest that the Support Vector Machine (SVM) is generally acknowledged to perform best in text classification among these statistical machine learning methods. However, short texts have quite sparse features, usually only a few words, so the traditional bag-of-words method runs into problems that lower its effectiveness. In our task, an abstract sentence usually has tens of words. In addition, our aim is to classify the abstract sentences into "background", "goal", "method" and "result", so some uncertain semantic factors may also influence the classification.
Because no suitable corpus existed, we built the corpus and labeled the instances ourselves. We carry out experiments with both supervised and semi-supervised methods on this limited data set, and improve the classification accuracy over several steps. After classification, the high-confidence predicted abstract sentences in each category can be selected for further research such as frequent pattern mining, information extraction and writing assistance for scientific papers. Figure 1 displays the process of our work. First, we build the corpus of abstract sentences in order to carry out the experiments, which includes acquiring the corpus, analyzing it and tagging its instances. Second, we conduct a group of experiments on feature selection in order to obtain a better feature vector for classification in our task. Third, we carry out the classification task with both a supervised method and a semi-supervised method.
This paper is organized as follows: Section 1 introduces the background and aim of this work; Section 2 reviews the related work; Section 3 introduces the data set on which we conduct the experiments; Section 4 discusses the feature selection step; Section 5 discusses the classification process and the experimental results; Section 6 presents the conclusion of our work.

Related Work
In this section we review some recent related literature on sentence classification and Transductive Support Vector Machine.

Sentence Classification
Research on sentence classification has been carried out in recent years. In earlier work, Naughton and Stokes et al. (2008) treated event detection as a sentence-level text classification problem. They concluded that the SVM consistently outperformed the Language Model technique in their task, and that the manual rule-based classification system was a powerful baseline that outperformed the SVM on half of the six event types. Lui (2012) used feature stacking to combine a variety of feature sets drawn from lexical and structural information at the sentence level as well as sequential information at the abstract level. The system attained a ROC area under the curve of 0.972 and 0.963 on two subsets of the test data and produced the winning entry to the ALTA 2012 Shared Task (Note 1). Molla (2012) found that cluster-based features improved the results for Naive Bayes classifiers but not for better-informed classifiers such as Max-Entropy and Logistic Regression in their participation in the ALTA 2012 Shared Task.

Transductive Support Vector Machine
After Joachims (1999) proposed the Transductive Support Vector Machine (TSVM), much machine learning research has built on it. Chen and Wang et al. (2008) introduced an application of TSVM to Chinese Semantic Role Labeling. They designed heuristics from the semantic perspective to improve the performance of TSVM; the results showed that TSVM outperformed SVM when little tagged data was available, and that TSVM performed better still with the heuristics. Miceli-Barone and Attardi (2012) presented a shift/reduce dependency parser that could handle unlabeled sentences in its training set by using a transductive SVM as its action selection classifier. They evaluated this parser on a domain adaptation task for the Italian language. TSVM has also been used in fields such as pixel classification (Chakraborty & Maulik, 2011; Maulik & Chakraborty, 2013).

The Building of the Data Set
The object of our study is mainly abstract sentences in scientific papers, because the abstract usually covers the whole content of a paper. The scientific papers are downloaded from the web site "http://www.sciencedirect.com/" and involve several fields. The abstracts are then extracted from the pages and split into sentences. From several studies and an investigation of our corpus, we find that the abstract sentences can be classified into the following four categories.
Class 1 is the background, usually including the general areas of the research, the specific research direction, the background and related work of the research, the significance and importance of the research, the usual research methods and basis of the area. For instance, the sentence "Feature selection is an indispensable preprocessing step for effective analysis of high dimensional data." belongs to this category. So does the following sentence "Finding an optimal feature subset for a problem in an outsized domain becomes intractable and many such feature selection problems have been shown to be NP-hard.".
Class 2 is the goal, usually including the sentences that directly point out the proposed idea, methods, concepts, etc., but not the simple repetition of the aim. In addition, it can also contain the structure of the paper. For instance, the sentence "This paper formulates the text feature selection problem as a combinatorial problem and proposes an Ant Colony Optimization (ACO) algorithm to find the nearly optimal solution for the same." belongs to this category. So does another sentence "In this study, we focused on pathway figures that illustrate signaling or metabolic pathways, because many of these are important in understanding disease mechanism(s).".
Class 3 is the method, usually including the process of the research, which methods and data are used, as well as the principle and the conditions of the experiment. For instance, the sentence "Documents from 20 newsgroup benchmark dataset were used for experimentation." belongs to this category. So does another sentence "Multivariate analyses were performed to analyze the subject's perceptions and to build conceptual models for telephone design.".
Class 4 is the result, usually including the observed results of the experiment, the analysis of the results, the conclusion obtained from the results, comparison with other results and the prospect of further work. For instance, the sentence "An F-score of 0.78 is obtained for labeling relevant coordinating constructions in an independent test set." belongs to this category. So does another sentence "Experiments showed that the performance of classifiers improved through adopting the proposed methodology.".
We then tagged the corpus of 4550 abstract sentences with the four labels. After splitting the sentences into words, we obtain 127718 words (including words, tokens, names and other special tokens) and a vocabulary of 10346 words. Table 1 displays the statistics of the labeled corpus for each class: "#sentence" refers to the number of sentences in each class, "proportion" to the proportion of sentences in each class, "#word" to the number of words in each class, and "vocabulary" to the vocabulary size of each class. Figure 2 displays the distribution of the words after lowercasing, stemming, lemmatizing and stop word removal.
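The per-class statistics described for Table 1 could be computed along the following lines. This is only a minimal sketch: the function name `corpus_statistics`, the `(label, sentence)` input format and the whitespace tokenization after lowercasing are assumptions for illustration, since the paper does not specify its tokenizer, and the toy sentences are not taken from the corpus counts reported above.

```python
from collections import defaultdict


def corpus_statistics(labeled_sentences):
    """Return per-class sentence count, proportion, word count and vocabulary size.

    labeled_sentences: iterable of (label, sentence) pairs.
    Tokenization is plain whitespace splitting after lowercasing (an assumption).
    """
    sentence_counts = defaultdict(int)
    word_counts = defaultdict(int)
    vocabularies = defaultdict(set)
    for label, sentence in labeled_sentences:
        tokens = sentence.lower().split()
        sentence_counts[label] += 1
        word_counts[label] += len(tokens)
        vocabularies[label].update(tokens)
    total = sum(sentence_counts.values())
    return {
        label: {
            "#sentence": sentence_counts[label],
            "proportion": sentence_counts[label] / total,
            "#word": word_counts[label],
            "vocabulary": len(vocabularies[label]),
        }
        for label in sentence_counts
    }


# Toy input, for illustration only.
toy_corpus = [
    ("background", "Feature selection is important"),
    ("method", "We used feature selection"),
    ("method", "Documents were used"),
]
for label, row in corpus_statistics(toy_corpus).items():
    print(label, row)
```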

Supervised Learning Method
Supervised statistical machine learning methods are widely used in text classification tasks, such as the KNN method, the Naive Bayes method, the ME model, the Support Vector Machine model, the Decision Tree method, the ANN model, etc. Among them, the Support Vector Machine performs best in most common text classification tasks. However, for short text classification, the Support Vector Machine also faces the problem of sparse features. In our task, the lack of training data is another factor that can affect the performance of the classifier.
We determine the feature selection modes based on the investigation above, namely lowercasing, no stop word removal, stemming, lemmatizing and bi-grams. We run a chi-square (CHI) test for the four categories on the training set and select the top 1000 words with the highest CHI value from each category. We then merge the four sets of words and finally obtain a 3758-dimension feature vector. We train the Support Vector Machine model on the training set with libsvm-3.16 (Note 3). With the RBF kernel and a grid search over the parameters, the accuracy rises to 70.2579%.
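The selection step described above can be illustrated with a small sketch. It assumes the common document-frequency form of the chi-square statistic, χ²(t, c) = N(AD − BC)² / ((A + C)(B + D)(A + B)(C + D)), where A and B count the documents inside and outside class c that contain term t, and C and D count those that do not. The paper does not state which variant it uses, and the function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def chi_square_scores(documents):
    """documents: list of (label, list_of_terms). Returns {label: {term: chi2}}."""
    n_docs = len(documents)
    docs_per_class = Counter(label for label, _ in documents)
    docs_with_term = Counter()
    docs_with_term_in_class = defaultdict(Counter)
    for label, terms in documents:
        for term in set(terms):  # document frequency: count each term once per document
            docs_with_term[term] += 1
            docs_with_term_in_class[label][term] += 1
    scores = {}
    for label, class_size in docs_per_class.items():
        scores[label] = {}
        for term, term_total in docs_with_term.items():
            a = docs_with_term_in_class[label][term]  # in class, has term
            b = term_total - a                        # other classes, has term
            c = class_size - a                        # in class, lacks term
            d = n_docs - class_size - b               # other classes, lacks term
            denominator = (a + c) * (b + d) * (a + b) * (c + d)
            scores[label][term] = (
                n_docs * (a * d - b * c) ** 2 / denominator if denominator else 0.0
            )
    return scores


def select_features(documents, top_k):
    """Take the top_k terms per class by chi-square and merge them into one sorted list."""
    selected = set()
    for term_scores in chi_square_scores(documents).values():
        ranked = sorted(term_scores, key=lambda t: (-term_scores[t], t))
        selected.update(ranked[:top_k])
    return sorted(selected)


# Toy documents, for illustration only.
toy_documents = [
    ("background", ["feature", "selection", "is", "important"]),
    ("background", ["selection", "is", "hard"]),
    ("method", ["we", "used", "documents"]),
    ("method", ["we", "used", "selection"]),
]
print(select_features(toy_documents, top_k=2))
```

Because the per-class lists overlap, the merged set is smaller than the number of classes times `top_k`, which is consistent with four top-1000 lists merging into 3758 dimensions rather than 4000.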

Semi-Supervised Learning Method
Semi-supervised learning methods can be used when the labeled training data is not enough to fit the distribution while a large amount of unlabeled data is available. These methods usually rely on the cluster assumption that the decision hyperplane should pass through a low-density region, in which few data points are located. Joachims (1999) proposed a semi-supervised learning method based on the Support Vector Machine model, the Transductive Support Vector Machine (TSVM). In his work he conducted several experiments on text classification and explained why TSVMs are especially well suited for text classification. Based on the theory he proposed, he developed SVMlight to solve the optimization problem. Generally speaking, a TSVM trains a classifier on both the labeled training data and the unlabeled data.
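For reference, the soft-margin optimization problem usually associated with the TSVM of Joachims (1999) can be written as below. The notation is an assumption, since the text above does not fix one: n labeled examples (x_i, y_i), k unlabeled examples x_j^* whose labels y_j^* in {-1, +1} are themselves optimization variables, and separate cost parameters C and C^* for the two groups.

```latex
\begin{aligned}
\min_{y_1^*,\dots,y_k^*,\;\mathbf{w},\,b,\;\xi,\;\xi^*}\quad
  & \tfrac{1}{2}\lVert \mathbf{w} \rVert^2
    + C \sum_{i=1}^{n} \xi_i
    + C^* \sum_{j=1}^{k} \xi_j^* \\
\text{subject to}\quad
  & y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i,
    \quad \xi_i \ge 0, \quad i = 1,\dots,n \\
  & y_j^*(\mathbf{w}\cdot\mathbf{x}_j^* + b) \ge 1 - \xi_j^*,
    \quad \xi_j^* \ge 0, \quad j = 1,\dots,k
\end{aligned}
```

The second constraint group is what separates this from an ordinary SVM: the margin must also hold for the unlabeled points under their assigned labels, which pushes the hyperplane away from regions where unlabeled data is dense.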
In our task, we carry out a group of experiments based on SVMlight-5.00 (Note 4). The training set contains the 2404 labeled instances and a large amount of unlabeled data. The testing set is still the 2146 labeled instances. The feature selection mode is the same as described in Section 2. After bringing in a total of 8821 unlabeled instances, the accuracy rises to 75.8621% on the test data set. The following figures (Figure 3, Figure 4, Figure 5 and Figure 6) show the trend of accuracy, precision, recall and F-score for each class as unlabeled instances are brought in step by step. The X-axis refers to the amount of training data: 1 represents the 2404 labeled instances alone, 2 represents the 2404 labeled instances with 2246 unlabeled instances, 3 represents the 2404 labeled instances with 4368 unlabeled instances, 4 represents the 2404 labeled instances with 6657 unlabeled instances, and 5 represents the 2404 labeled instances with 8821 unlabeled instances.
Vol. 6, No. 4; 2013
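As an illustration of how labeled and unlabeled instances can be combined in one training file, the sketch below writes examples in the SVMlight input format, where each line is a target followed by `index:value` pairs in increasing index order, and a target of 0 marks an unlabeled example for transductive learning. SVMlight itself is a binary classifier, so the one-vs-rest decomposition of the four classes is an assumption; the paper does not say how the multi-class problem was handled. The helper names and toy feature vectors are hypothetical.

```python
def to_svmlight_line(target, features):
    """Format one example as an SVMlight input line.

    target: +1 or -1 for labeled examples, 0 for unlabeled ones.
    features: {feature_index: value}, with indices starting at 1.
    """
    # Indices must appear in increasing order; zero-valued features are omitted.
    parts = [
        f"{index}:{value:g}"
        for index, value in sorted(features.items())
        if value != 0
    ]
    return " ".join([f"{target:+d}" if target else "0"] + parts)


def one_vs_rest_lines(labeled, unlabeled, positive_class):
    """Build training lines for one binary sub-problem (assumed one-vs-rest scheme).

    labeled: list of (label, features); unlabeled: list of features.
    """
    lines = [
        to_svmlight_line(1 if label == positive_class else -1, features)
        for label, features in labeled
    ]
    lines += [to_svmlight_line(0, features) for features in unlabeled]
    return lines


# Toy feature vectors, for illustration only.
labeled_examples = [("goal", {3: 1.0, 1: 0.5}), ("method", {2: 1.0})]
unlabeled_examples = [{1: 1.0}]
for line in one_vs_rest_lines(labeled_examples, unlabeled_examples, "goal"):
    print(line)
```

Under this assumed scheme, one such file would be written per class, and the four resulting binary classifiers would be combined at prediction time, for example by taking the class with the largest decision value.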