Hybrid learning of Syntactic and Semantic Dependencies

This paper presents our solution for the joint parsing of syntactic and semantic dependencies. The system uses a Maximum Entropy (ME) classifier, and a Mutual Information (MI) model is used for feature selection in dependency labeling. Results show that the MI model improves system performance and reduces training time.


Introduction
Since 2002, semantic role labeling, which focuses on recognizing and labeling semantic arguments, has received considerable interest because of its contribution to many Natural Language Processing (NLP) applications, such as information extraction, question answering, machine translation, and paraphrasing.
In the past four years, the Conference on Computational Natural Language Learning (CoNLL) has featured an associated shared task every year, which allows participants to train and test their Semantic Role Labeling (SRL) or syntactic systems on the same data sets and to share their experiences. In 2004 and 2005, the CoNLL shared tasks focused on SRL. In CoNLL-2006 and CoNLL-2007, the shared tasks were dedicated to syntactic dependency parsing.
In 2008, CoNLL consolidated the previous four years of work into a single task and proposed a new challenge: merging syntactic dependencies (extracted from the Penn Treebank) and semantic dependencies (extracted from both PropBank and NomBank) under a unique unified representation.
After more than seven years of work on semantic and syntactic analysis, many different learning techniques have been applied in this area, such as ME, Support Vector Machine (SVM), Relevance Vector Machine (RVM), SNoW, Tree Conditional Random Field (T-CRF), DT, and CPM.

Problem Definition
SRL concerns the recognition of semantic roles in a given sentence. As defined in the CoNLL-2004 shared task, the task involves analyzing the propositions expressed by some target verbs of the sentence. Based on PropBank predicate-argument structures, for each target verb, all the constituents in the sentence that fill a semantic role of the verb have to be recognized (Carreras & Màrquez, 2005).
In each sentence, a word (most often a verb) is specified as the predicate, and its accompanying arguments are recognized and labeled with different categories depending on the roles they play.
For example, the roles for the predicate receive (that is, the roleset of the predicate) are defined in the PropBank Frames scheme as:
V: verb
A0: receiver
A1: thing gotten
A2: received from
AM-MOD: modal
AM-NEG: negation
Work on the SRL problem has a long history. Since the CoNLL-2004 shared task, the SRL problem has been addressed with machine learning algorithms based on partial syntactic information, without using full parsers or external lexico-semantic knowledge bases (Blunsom, 2004). CoNLL-2005 followed the previous year's initiative, and the task still concerned the recognition of semantic roles for English. The novelty of that shared task was that complete syntactic trees were provided as input to improve SRL performance, and the training corpus was substantially enlarged. In CoNLL-2006 and CoNLL-2007, the shared tasks were dedicated to syntactic dependency parsing.
In 2008, CoNLL combined the previous four years of work into one task and proposed a new challenge: merging syntactic dependencies (extracted from the Penn Treebank) and semantic dependencies (extracted from both PropBank and NomBank) under a unique unified representation. Compared with the earlier CoNLL shared tasks, SRL this time uses a dependency-based representation, and predicates are no longer limited to verbs but also include nominal predicates. Researchers aim to show that the dependency-based representation is better suited to SRL than the constituent-based formalism, and that the merged representation is more helpful than the individual ones. After the merge, the combined representation is easier to adopt in most NLP technologies, and linear-time processing is a better fit for many applications.

Maximum Entropy Model
In Shannon's information theory, entropy is defined as a measure of the uncertainty associated with a random variable. The Maximum Entropy model makes an optimal choice among the states that satisfy the prior constraints by choosing the state with the maximum entropy; in other words, all unknown possibilities are treated as equally likely.
Maximum entropy restricts the model distribution to satisfy a set of constraints, each of which requires a feature to have the same expected value under the model as in the training data (Nigam, Lafferty, & McCallum, 1999). The motivating idea behind maximum entropy is to choose, among the distributions that satisfy these constraints, the most uniform one, that is, the one with the maximum entropy H(p), as shown in Eq. (3.1), where C denotes the set of distributions satisfying the constraints.

$$p^{*} = \arg\max_{p \in C} H(p) \tag{3.1}$$

The maximum entropy framework allows different features to be integrated into the same model without regard to the relationships among them. This advantage has led to the ME model being used in many areas of NLP. The ME model is defined in Eq. (3.2).

$$P(y \mid X) = \frac{1}{Z(X)} \exp\Big(\sum_{i} \lambda_i f_i(X, y)\Big) \tag{3.2}$$

where y is a class label (in our case: meaningful or not meaningful), X is an input vector containing predicates on the matched text, and Z(X) is a normalization term. Each feature function $f_i(X, y)$ maps an input vector and a class to a binary value. The parameters of the model are the feature weights $\lambda_i$. They are determined so as to maximize the conditional log-likelihood of the training data, $\sum_{n=1}^{N} \log P(y_n \mid X_n)$, where N is the number of training samples. We use Zhang Le's ME toolkit (Zhang, 2004) to train the ME model.
For semantic role labeling, the main task is to pick the right label from the set of semantic labels. $P(y \mid X)$ denotes the conditional probability of the output y given the context X, and $f_i(X, y)$ describes a feature constraint with weighting parameter $\lambda_i$.
The ME model chooses the distribution $P(y \mid X)$ for which H(p) attains its maximum value.
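The conditional probability of Eq. (3.2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the feature names, the label set and the weights below are hypothetical, binary features are represented by listing the ones that fire for the input X, and the code does not reproduce the interface of the toolkit used in the paper.

```python
import math

def me_probabilities(active_features, labels, weights):
    """Return P(y | X) for every label y under a maximum entropy model.

    active_features: names of the binary features that fire for input X
    labels:          candidate class labels
    weights:         dict mapping (feature_name, label) -> lambda_i
                     (missing entries count as weight 0)
    """
    # Unnormalised score exp(sum_i lambda_i * f_i(X, y)) for each label.
    scores = {
        y: math.exp(sum(weights.get((f, y), 0.0) for f in active_features))
        for y in labels
    }
    z = sum(scores.values())  # normalisation term Z(X)
    return {y: s / z for y, s in scores.items()}

# Hypothetical weights for two labels and two binary features.
weights = {
    ("pos_pair=IN-NNS", "A1"): 1.2,
    ("relation=Son", "A1"): 0.8,
    ("relation=Son", "null"): -0.5,
}
probs = me_probabilities(["pos_pair=IN-NNS", "relation=Son"], ["A1", "null"], weights)
best_label = max(probs, key=probs.get)
```

With no active features every label receives the same score, so the model falls back to the uniform distribution, which is the maximum entropy choice when nothing is known about the input.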

Corpus and Evaluation
Large corpora from which information about language can be extracted automatically are becoming critical tools for NLP researchers. Since no large-scale dependency treebank has been published, we adopt the corpus provided by the CoNLL shared task. It is generated from several well-known corpora: the Penn Treebank (Marcus, Marcinkiewicz, & Santorini, 1993), the Proposition Bank (PropBank) (Palmer, Kingsbury, & Gildea, 2005), and NomBank (Meyers, et al., 2004). These corpora were merged and converted into the dependency formalism used in the shared task evaluation (Surdeanu, Johansson, Meyers, Màrquez, & Nivre, 2008). The data format is defined in (Surdeanu, et al., 2008) as 12 columns: ID, FORM, LEMMA, GPOS, PPOS, SPLIT_FORM, SPLIT_LEMMA, PPOSS, HEAD, DEPREL, PRED, and ARG. HEAD is the syntactic head of the current token, and DEPREL is the syntactic dependency relation to the HEAD.
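As a rough illustration of the column layout, the sketch below reads one token line into a dictionary. The column names are the ones listed above; the tab separator, the "_" placeholder for empty fields, the possibility of more than one trailing ARG column, and the sample row itself are assumptions made for this example and should be checked against the shared task description.

```python
# Fixed columns, in the order given in the text; ARG values follow them.
COLUMNS = ["ID", "FORM", "LEMMA", "GPOS", "PPOS", "SPLIT_FORM",
           "SPLIT_LEMMA", "PPOSS", "HEAD", "DEPREL", "PRED"]

def parse_token(line):
    """Split one token line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")  # assumed: tab-separated
    token = dict(zip(COLUMNS, fields))
    token["ID"] = int(token["ID"])
    token["HEAD"] = int(token["HEAD"])       # 0 is assumed to mean the root
    token["ARG"] = fields[len(COLUMNS):]     # everything after PRED
    return token

# Hypothetical row, not taken from the corpus.
row = "2\treceived\treceive\tVBD\tVBD\treceived\treceive\tVBD\t0\tROOT\treceive.01\t_"
token = parse_token(row)
```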
The official evaluation of the 2008 shared task covers three aspects (Surdeanu, et al., 2008): syntactic dependencies, semantic dependencies, and the overall task. The labeled attachment score (LAS) is used to evaluate the syntactic dependencies and is defined as the percentage of tokens that have correct HEAD and DEPREL values. The labeled F1 score is used to evaluate the semantic dependencies. The macro average of these two scores is used to score the whole task (Surdeanu, et al., 2008).
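The three scores can be sketched as follows. This is a simplified reading of the definitions in the paragraph above, not the official scorer, which may treat some details differently; the data structures and the toy annotations are assumptions made for the example.

```python
def labeled_attachment_score(gold, predicted):
    """LAS: percentage of tokens whose (HEAD, DEPREL) pair is correct.

    gold, predicted: lists of (HEAD, DEPREL) tuples, one per token.
    """
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return 100.0 * correct / len(gold)

def labeled_f1(gold, predicted):
    """Labeled F1 over semantic dependencies.

    gold, predicted: sets of (predicate_id, argument_id, label) triples.
    """
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 100.0 * 2 * precision * recall / (precision + recall)

def macro_score(las, f1):
    """Overall score: macro average of the syntactic and semantic scores."""
    return (las + f1) / 2

# Toy annotations for a four-token sentence (hypothetical).
gold_syn = [(2, "SBJ"), (0, "ROOT"), (2, "OBJ"), (2, "P")]
pred_syn = [(2, "SBJ"), (0, "ROOT"), (2, "ADV"), (2, "P")]
gold_sem = {(2, 1, "A0"), (2, 3, "A1")}
pred_sem = {(2, 1, "A0"), (2, 3, "A2")}

las = labeled_attachment_score(gold_syn, pred_syn)  # 3 of 4 tokens correct
f1 = labeled_f1(gold_sem, pred_sem)                 # 1 of 2 triples correct
overall = macro_score(las, f1)
```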

System Architecture
Compared with traditional semantic role labeling, our system produces a joint, rich syntactic-semantic output, so that semantic role annotation and syntactic structure are obtained at the same time. The main components of our system are:
Syntactic parsing.
Predicate tagging.
Feature selection using Mutual Information.
Semantic dependency labeling.
In the syntactic parsing stage, the syntactic dependency tree is obtained. In the predicate tagging module, the predicates in the sentence are marked with tag numbers; 00 denotes a non-predicate. The results generated by these two stages serve as input to the last component. With the filtered features, semantic dependency labeling produces the classification of semantic roles.

Syntactic parsing
In our system, the first stage creates a labeled syntactic dependency parse y for an input sentence x, which includes the words and their parts of speech (POS). Inspired by the parsing model presented in (McDonald, Pereira, Ribarov, & Hajic, 2005), we equate the problem of dependency parsing with finding maximum spanning trees in directed graphs. The flow chart in Figure 1 presents the sequence of syntactic parsing.
Our previous work (Li et al., 2008) details the features used for syntactic parsing.
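To make the maximum spanning tree formulation concrete, the sketch below searches for the highest-scoring dependency tree of a very short sentence by exhaustive enumeration. It only illustrates the objective: the arc scores are hypothetical, the search is exponential in sentence length, and a practical parser would use an efficient algorithm such as Chu-Liu-Edmonds instead. Nothing here should be read as the parser actually used in our system.

```python
from itertools import product

def is_tree(heads):
    """Check that every word reaches the artificial root 0 without a cycle.

    heads[i - 1] is the head of word i (words are numbered from 1).
    """
    for start in range(1, len(heads) + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:      # revisited a word: cycle or self-loop
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

def best_tree(scores):
    """Return (heads, total) for the highest-scoring dependency tree.

    scores[h][d] is the score of the arc from head h to dependent d,
    with h in 0..n (0 = root) and d in 1..n.
    """
    n = len(scores) - 1
    best_total, best_heads = float("-inf"), None
    for heads in product(range(n + 1), repeat=n):  # one head per word
        if not is_tree(heads):
            continue
        total = sum(scores[h][d] for d, h in enumerate(heads, start=1))
        if total > best_total:
            best_total, best_heads = total, heads
    return best_heads, best_total

# Hypothetical arc scores for a three-word sentence (row = head, column = dependent).
scores = [
    [0, 1, 9, 1],  # root -> word 1, 2, 3
    [0, 0, 2, 1],  # word 1 -> ...
    [0, 8, 0, 7],  # word 2 -> ...
    [0, 3, 4, 0],  # word 3 -> ...
]
heads, total = best_tree(scores)
```

In this toy example word 2 attaches to the root and words 1 and 3 attach to word 2.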

Predicate Tagging
Usually, an event in a sentence can be described by a predicate and its arguments. The predicate, which is most often a verb, reveals the type of event. Arguments are the sentence constituents that hold semantic roles and combine with the predicate to complete the meaning of the event. Accordingly, the semantic task includes two subtasks: predicate tagging and semantic dependency labeling.
Predicate tagging is not merely a binary classification problem; we treat it as a multi-class classification problem, since the predicate type includes the rolesets, which correspond to different meanings. The framework for predicate tagging is shown in Figure 2.
In both PropBank (Palmer, et al., 2005) and NomBank (Meyers, et al., 2004), predicates are usually assigned several rolesets corresponding to different meanings. For example, the verb abandon has three rolesets, marked with the ordinal numbers 01, 02 and 03 as described below. Choosing a different roleset means choosing a different meaning, which affects the interpretation of the event. Consequently, these roleset numbers are treated as tags for predicates, from which statistical properties can be obtained.
The tag set is {01, 02, ..., 22}, which corresponds to the roleset numbers. A special case must be considered when the word is not a predicate: the tag 00 is added to indicate a non-predicate.
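The mapping from a predicate annotation to a tag can be sketched as follows. The assumption here, which should be verified against the data, is that the PRED column holds values of the form lemma.NN (for example abandon.01) and an underscore for non-predicates; the function name is hypothetical.

```python
# Tag inventory: 00 for non-predicates plus roleset numbers 01..22.
TAGS = ["00"] + [f"{i:02d}" for i in range(1, 23)]

def predicate_tag(pred_field):
    """Map a PRED value to its roleset tag; non-predicates get '00'."""
    if pred_field in ("", "_"):              # assumed placeholder for "no predicate"
        return "00"
    return pred_field.rsplit(".", 1)[-1]     # keep the roleset number after the dot

tags = [predicate_tag(p) for p in ["_", "abandon.01", "abandon.03", "_"]]
```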
The predicate tagging feature template includes three categories of features: content features, context features and compound features (shown in Table 1).
The content features give information about the current word. The context features describe the POS context of the current word through its neighbors' POS tags. From studying the corpus, we found that compound words such as short-term affect predicate tagging performance markedly. The feature template for compound words therefore includes several separate features, which are assigned null for non-compound words.
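The three feature categories could be extracted along the following lines. The concrete feature names, the one-word context window and the hyphen test for compounds are assumptions made for illustration; the actual template is the one given in Table 1.

```python
def predicate_features(tokens, i):
    """Build a feature dict for the word at index i.

    tokens: list of (form, lemma, pos) triples for one sentence.
    """
    form, lemma, pos = tokens[i]
    features = {
        # Content features: information about the current word.
        "form": form,
        "lemma": lemma,
        "pos": pos,
        # Context features: POS tags of the neighbouring words.
        "prev_pos": tokens[i - 1][2] if i > 0 else "<s>",
        "next_pos": tokens[i + 1][2] if i + 1 < len(tokens) else "</s>",
    }
    # Compound features: parts of a hyphenated word, null otherwise.
    parts = form.split("-") if "-" in form else []
    features["compound_first"] = parts[0] if parts else "null"
    features["compound_last"] = parts[-1] if parts else "null"
    return features

sentence = [("a", "a", "DT"), ("short-term", "short-term", "JJ"), ("loan", "loan", "NN")]
compound = predicate_features(sentence, 1)
plain = predicate_features(sentence, 2)
```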

Semantic Dependency labeling
As shown in Figure 3, this part of the system is based on the output of the syntactic parser and the predicate tagger described earlier. It finds the dependents of a given predicate in a sentence and labels each of them with one label from the set of semantic dependency labels. Suppose (p, d) is a pair consisting of a predicate and one of its possible dependents, T is the dependency tree generated by syntactic parsing, and L is the set of semantic dependency labels, in which null is included as a special tag for no semantic dependency between p and d. This task can also be treated as a classification problem, solved with the ME model and described by Eq. (5.1).

$$l^{*} = \arg\max_{l \in L} P(l \mid p, d, T) \tag{5.1}$$

For most language processing tasks, whichever machine learning algorithm is chosen, the selection of the feature set is always a crucial factor affecting system performance. We select features from three aspects: the predicate, the dependent, and the properties between predicate and dependent.
To obtain better performance, we carried out extensive analysis of our training data and its properties. Surveying the 39832 sentences and their 421039 semantic roles, we examined the high-frequency POS pairs between predicate and dependent, such as IN-NNS, RB-VBN, VBZ-VBN, RBS-VBN and CC-VB, together with their associated semantic roles. Partial co-occurrence statistics for POS pairs and semantic roles are presented in Figure 4.
The statistics show a clear association between POS pairs and semantic roles, and the statistical properties of the POS pairs differ across semantic roles. These characteristics show that the POS pair is a supporting feature for semantic role labeling.
At the same time, to discover the relevance between semantic dependency and syntactic dependency, a large number of word pairs were studied, including the family relationships of these words in the syntactic tree. The family dependency relationship is defined as an IS-A relationship between the current word and the predicate. The relationship set includes the Son, Father, Brother, Self and Grandson relationships. The details are presented in Figure 5. In total, 421039 semantic dependency relationships were involved. Of these, 59.09% have a direct syntactic dependency, which means the current word is a son node of the predicate. When the other family relationships are also taken into account, the covered portion amounts to 87.34% of the 421039 word pairs. This shows that syntactic dependency contributes greatly to semantic recognition.
After extensive statistics and analysis, our system selects 24 features in total, including POS pairs and family dependency relationships, as the feature template used to build the training model. The chosen features are listed in Table 2.
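One plausible way to read the family relationships off the syntactic tree is sketched below. Only the Son relationship is defined explicitly in the text (the current word is a son node of the predicate); the other definitions, the order in which they are tested, and the fallback value "Other" are our interpretation for this example and are not taken from Figure 5.

```python
def family_relation(heads, word, predicate):
    """Return the family relationship of `word` with respect to `predicate`.

    heads: dict mapping each token id to the id of its syntactic head (0 = root).
    """
    if word == predicate:
        return "Self"
    if heads[word] == predicate:
        return "Son"          # word depends directly on the predicate
    if heads[predicate] == word:
        return "Father"       # predicate depends directly on the word
    if heads[word] == heads[predicate]:
        return "Brother"      # both share the same head
    if heads[word] != 0 and heads[heads[word]] == predicate:
        return "Grandson"     # word's head depends on the predicate
    return "Other"            # hypothetical label for everything else

# Toy tree: 2 is the root; 1 and 3 depend on 2; 4 on 3; 5 on 4.
heads = {1: 2, 2: 0, 3: 2, 4: 3, 5: 4}
relations_to_2 = {w: family_relation(heads, w, 2) for w in heads}
```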

Mutual Information
In dependency labeling, different features carry different amounts of knowledge for distinguishing class labels. More features contain more knowledge, but the size of the feature template is not proportional to the recognition ability of the model: a large number of features does not always mean high-quality recognition. Too much information causes the features to interfere with one another and decreases performance. At the same time, a large number of feature templates is very time-consuming, which limits the practicality of the system. How can the feature set be reduced to an optimal subset without compromising classification accuracy?
Mutual Information (MI) can be used as a pruning criterion to choose a suitable subset from the larger set of candidate features. The mutual information represents the amount of uncertainty about the system output Y that is resolved by observing the system input X. The mutual information (Al-Ani, 2003) between X and Y is given in Eq. (6.2).

$$I(Y; X) = \sum_{y} \sum_{x} P(x, y) \log \frac{P(x, y)}{P(x)\,P(y)} \tag{6.2}$$

In our system, y is a semantic label from the label set, and x is one of the feature instances generated by a particular feature template. All the feature templates used in this system are listed in Table 2. $I(Y; X)$ is the mutual information of a feature template with all the semantic labels. The mutual information of every feature template is calculated. A larger value of mutual information between a feature template and the output means a closer association. Usually a pruning threshold is set during the phase of choosing feature templates, and the feature templates with higher mutual information are kept. After the calculation, the feature template numbers in Table 2 were reordered by MI value in descending order. The first 18 feature templates are: (9, 6, 8, 5, 23, 11, 15, 4, 19, 18, 14, 21, 2, 16, 22, 24, 1, 10)
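The mutual information of a feature template with the semantic labels can be estimated from co-occurrence counts, as in the sketch below. Probabilities are estimated by relative frequency and the natural logarithm is used; both choices, as well as the template names and the toy data, are assumptions made for this example.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Estimate I(Y; X) from a list of (x, y) observations.

    x is a feature instance produced by one template, y a semantic label.
    """
    n = len(pairs)
    joint = Counter(pairs)
    count_x = Counter(x for x, _ in pairs)
    count_y = Counter(y for _, y in pairs)
    # Sum over observed (x, y): P(x, y) * log(P(x, y) / (P(x) * P(y))).
    return sum(
        (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in joint.items()
    )

def rank_templates(instances_by_template, labels):
    """Order template names by decreasing mutual information with the labels."""
    scores = {
        name: mutual_information(list(zip(values, labels)))
        for name, values in instances_by_template.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: one template predicts the label perfectly, the other not at all.
labels = ["A0", "A0", "A1", "A1"]
templates = {
    "informative": ["a", "a", "b", "b"],
    "uninformative": ["a", "b", "a", "b"],
}
ranking = rank_templates(templates, labels)
```

A template whose instances determine the label completely reaches the entropy of the label distribution (log 2 for two equally frequent labels), while a template that is independent of the label scores zero.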

Predicate Tagging
The predicate identification model was tested on the devel, wsj, brown and wsj+brown data separately. The experimental results are shown in Table 3.
Performance on the devel and wsj data is better than on the brown data. This is because the training data comes from the same corpus (Penn Treebank) as the devel and wsj data, whereas the brown data comes from a different corpus (Brown). The results indicate a limitation of our system: identification results may be worse on other test corpora.

Semantic Dependency labeling
Using the MI model, features are chosen according to their contribution to the results. First, all features are grouped; then we keep the first 18 features and remove those with lower contribution. The procedure continues until the last group, with 7 features. The test results for the different feature combinations are shown in Figure 6, where F24 means that 24 features were selected, and so on.
Figure 6 shows that the best performance is obtained with the 10-feature combination. This result shows that more features do not give better performance.
Figure 7 shows the recognition result with the 10 selected features, which is correct. However, when 18 features were selected (shown in Figure 8), the predicate "say" was not identified. This example demonstrates that excessive features carry redundant information that may lower performance, whereas concise but more closely related information may provide better performance.
Using the MI model reduces the template from 24 features to 10 and cuts the training time sharply, while the performance with 10 features is better than with the other combinations.
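The construction of the nested feature groups can be sketched directly from the MI ranking reported in the Mutual Information section. Only the group sizes 18, 10 and 7 mentioned in the text are used here; any intermediate group sizes and the names `MI_RANKING` and `feature_groups` are assumptions of this example. The full 24-feature group is omitted because only the first 18 ranked templates are listed.

```python
# Feature template numbers in descending order of mutual information
# (the first 18 templates, as reported in the text).
MI_RANKING = [9, 6, 8, 5, 23, 11, 15, 4, 19, 18, 14, 21, 2, 16, 22, 24, 1, 10]

def feature_groups(ranking, sizes):
    """Return nested groups: the top-k templates for each requested size k."""
    return {f"F{k}": ranking[:k] for k in sizes}

groups = feature_groups(MI_RANKING, sizes=(18, 10, 7))
```

Each group is then used to train and evaluate a separate model, so that the smallest group that performs best can be kept.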
From the test results, we can also see that our system performs much better on the WSJ corpus than on the Brown corpus. The reason is that the syntactic parser is built on the WSJ corpus, so performance may be worse on other test corpora. The results also imply that errors in syntactic parsing and predicate tagging are probably amplified in semantic dependency labeling. To improve the performance of the whole system, the tight coupling between the two stages should be loosened in future research.

Conclusion
We present a semantic dependency system that includes a syntactic module, a predicate tagging module and a dependency labeling module. By analyzing a large amount of corpus data, we showed that there is an association between POS pairs and semantic roles, and we also found a relationship between semantic dependency and syntactic dependency. In the dependency labeling subtask, the results show that performance does not always improve when the number of features increases. The MI model is applied to reduce the feature set; evaluating different feature combinations shows that the reduced set gives better performance and faster training.

Figure 3. Framework for semantic dependency labeling

Figure 4. Examples of co-occurrence statistics for POS pairs and semantic roles

Table 1. Feature template for predicate tagging.