Unsupervised Coreference Resolution with Hypergraph Partitioning

Unsupervised coreference resolution obviates the need to annotate training data. However, unsupervised approaches have traditionally relied on mention-pair models, which consider information about only one pair of mentions at a time. This paper proposes using hypergraph partitioning to overcome this limitation. Mentions are modeled as vertices. By allowing a hyperedge to cover multiple mentions that share a common property, information beyond a mention pair can be captured. This paper introduces a hypergraph partitioning algorithm that divides mentions directly into equivalence classes representing individual entities. Evaluation on the ACE dataset shows that our unsupervised hypergraph-based approach outperforms previous unsupervised methods.


Introduction
Coreference resolution is the process of partitioning mentions into different real-world entities. It is a key component of many natural language processing (NLP) applications. In particular, because of its important role in information extraction (IE), coreference resolution was defined as an IE subtask and officially evaluated in the Message Understanding Conference (MUC) and Automatic Content Extraction (ACE) programs. So far, supervised-learning-based approaches have been widely applied to coreference resolution; these require a set of training data to build a classifier for coreference judgment (Soon et al., 2001; Ng and Cardie, 2002). However, coreference annotation is a difficult task that involves not only deep linguistic knowledge but also background knowledge related to the domain. For this reason, existing annotated coreference corpora are quite small compared with those for other NLP tasks (e.g., 599 documents in the ACE2005 corpus) and are limited to a few specific domains.
To deal with the lack of training data, several unsupervised approaches have been proposed that require no training data for coreference resolution and adapt to different domains. For example, Cardie and Wagstaff (1999) suggested recasting coreference resolution as a clustering problem, which tries to group noun phrases into different coreference clusters. They defined a distance function to measure the incompatibility of two mentions. Given a document, mentions are processed backwards one by one. Two mentions are placed into the same cluster if their distance is below a threshold and no mentions from their respective clusters are incompatible. Wagstaff (2002) further enhanced this method by adding more linguistic constraints (must-link and cannot-link) during clustering.
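The clustering procedure described above can be sketched in a few lines. This is only an illustrative sketch, not Cardie and Wagstaff's implementation: `toy_distance`, its numeric values, and the mention dictionaries are hypothetical, and an infinite distance is assumed to mark an incompatible pair.

```python
import math
from typing import Callable, Dict, List

Mention = Dict[str, object]


def toy_distance(a: Mention, b: Mention) -> float:
    """Hypothetical distance: infinite on a gender clash, zero on a head match."""
    if a["gender"] and b["gender"] and a["gender"] != b["gender"]:
        return math.inf  # incompatible pair
    if a["head"] == b["head"]:
        return 0.0
    return 3.0 if (a["pronoun"] or b["pronoun"]) else 10.0


def greedy_cluster(mentions: List[Mention],
                   distance: Callable[[Mention, Mention], float],
                   radius: float) -> List[int]:
    """Return one cluster id per mention, processing mentions backwards."""
    cluster_of = list(range(len(mentions)))  # every mention starts alone

    def members(cid: int) -> List[int]:
        return [i for i, c in enumerate(cluster_of) if c == cid]

    def compatible(ci: int, cj: int) -> bool:
        # No pair drawn from the two clusters may be incompatible.
        return all(math.isfinite(distance(mentions[a], mentions[b]))
                   for a in members(ci) for b in members(cj))

    for j in range(len(mentions) - 1, -1, -1):
        for i in range(j - 1, -1, -1):
            ci, cj = cluster_of[i], cluster_of[j]
            if (ci != cj and distance(mentions[i], mentions[j]) < radius
                    and compatible(ci, cj)):
                for m in members(cj):  # merge cluster cj into ci
                    cluster_of[m] = ci
    return cluster_of


mentions = [
    {"text": "Mr. Powell", "head": "powell", "gender": "male", "pronoun": False},
    {"text": "Powell", "head": "powell", "gender": None, "pronoun": False},
    {"text": "She", "head": "she", "gender": "female", "pronoun": True},
]
labels = greedy_cluster(mentions, toy_distance, radius=4.0)
print(labels)
```

With these toy values, "She" is merged with "Powell" first, and that merge then blocks "Powell" from joining "Mr. Powell", because merges are never undone.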
However, there are several problems with the previous clustering based unsupervised methods.
(1) As with many other learning-based approaches to coreference resolution (e.g., Soon et al. (2001) and Ng and Cardie (2002)), these methods adopt a mention-pair model. The distance function is based only on the information of the two given mentions. However, as individual mentions lack adequate information about the entities they refer to, the distance may not accurately represent the (in)compatibility of two mentions. For example, the compatibility between the mentions "Powell" and "She" depends on the gender of "Powell", which cannot be determined from the mention alone.
(2) As the clustering is agglomerative, a wrong linking decision cannot be undone and leads to cascading errors. Suppose we have three mentions "Mr. Powell", "Powell", and "She". If "She" is wrongly linked to "Powell", the cluster cannot be broken and will prevent the subsequent linking of "Powell" with "Mr. Powell".
To overcome the above problems, this paper proposes an unsupervised coreference resolution approach based on hypergraph partitioning. A hypergraph is a generalization of a graph in which an edge can connect more than two vertices (Berge, 1989). To model coreference resolution, mentions are viewed as vertices. A set of mentions is covered by a hyperedge if they share a specific common property. As a hyperedge can describe information shared by two or more mentions, it has a more powerful representation capability for knowledge than a traditional mention-pair feature. By using a partitioning algorithm, mentions are divided into equivalence partitions representing individual entities. The partitioning process can avoid the cascading errors of the clustering-driven unsupervised approaches. We evaluated our approach on the ACE data, and the experimental results show that it is effective for coreference resolution.
The remaining sections are organized as follows: Section 2 describes related work on coreference resolution and on hypergraphs and their applications. Section 3 introduces the basic concepts of hypergraphs and the partitioning algorithm. Section 4 describes the hypergraph-based model for coreference resolution. Section 5 gives the experimental results with some discussion. Finally, Section 6 summarizes the conclusions and presents future work.

Related work
Supervised-learning-based approaches are widely adopted in coreference resolution. The approach was first proposed using decision trees (McCarthy and Lehnert, 1995), and many other systems followed. A typical one is presented by Soon et al. (2001), in which coreference resolution is treated as a classification problem. A training or testing instance is formed by two mentions, with a feature vector describing their properties and relationships, including gender, number, person, semantic class, string match, appositive, name alias, and so on. At test time, a mention to be resolved is checked against its preceding mentions and is linked with the closest one that is classified as positive. The work was further enhanced by expanding the feature set and adopting a "best-first" linking strategy (Ng and Cardie, 2002). Such a mention-pair-based model only considers information related to the two mentions in question and can cause triangular contradiction errors at test time. Suppose we have three mentions "Mr. Powell", "Powell", and "she" in a document. The model tends to link "she" with "Powell" because of their proximity, and to link "Mr. Powell" with "Powell" because their head strings match. Merging the two pairs, however, leads to gender disagreement between "she" and "Mr. Powell".
Several researchers have proposed using graph theory to deal with triangular contradiction errors in coreference resolution. They convert a document to a graph in which mentions in the document are mapped to vertices. An edge connecting two vertices represents the coreference relationship between the two corresponding mentions. The weight of an edge accounts for the confidence of the coreference relationship and is derived from coreference classification. Graph partitioning algorithms such as BESTCUT (Nicolae and Nicolae, 2006) can then be used for global optimization. Similarly, Bell-Tree global searching (Luo et al., 2004) and triangular contradiction constraint learning with Conditional Random Fields (McCallum and Wellner, 2003) have been proposed for this problem. However, these are all supervised learning methods.
Hypergraphs have shown many advantages in clustering and classification problems (Zhou et al., 2006). In recent years, they have also been employed in NLP applications such as sentence parsing (Klein and Manning, 2001; Huang, 2008), word sense disambiguation (Klapaftis and Manandhar, 2007), and document clustering (Shinnou and Sasaki, 2007). However, to our knowledge, our work is the first effort to apply this technique to the coreference resolution task.

Basic concepts of hypergraph
Let $X = \{x_1, x_2, \ldots, x_n\}$ be a finite set, and let $H = \{E_1, E_2, \ldots, E_m\}$ be a family of subsets of $X$. The family $H$ is said to be a hypergraph on $X$ if

$$E_i \neq \emptyset \quad (i = 1, 2, \ldots, m)$$

$$\bigcup_{i=1}^{m} E_i = X$$

The number $|X| = n$ is the order of the hypergraph. The elements $x_1, x_2, \ldots, x_n$ are called vertices and the sets $E_1, E_2, \ldots, E_m$ are called hyperedges.
An example hypergraph is shown in Figure 1. An edge $E_i$ with $|E_i| > 2$ is drawn as a curve encircling all of its vertices. An edge $E_i$ with $|E_i| = 2$ is drawn as a line connecting its two vertices. An edge $E_i$ with $|E_i| = 1$ is drawn as a loop, as in a graph. If $|E_i| \le 2$ for all $i$, the hypergraph reduces to an ordinary graph. In a hypergraph, two vertices are said to be adjacent if there is a hyperedge $E_i$ that contains both of them. Two hyperedges are said to be adjacent if their intersection is not empty.

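The incidence matrix of a hypergraph $H = \{E_1, E_2, \ldots, E_m\}$ on $X$ is a matrix $A = (a_{ij})$ with $m$ rows that represent the hyperedges of $H$ and $n$ columns that represent the vertices of $H$, that is,

$$a_{ij} = \begin{cases} 1 & \text{if } x_j \in E_i \\ 0 & \text{otherwise} \end{cases}$$

Each (0, 1)-matrix is an incidence matrix of a hypergraph if no row or column contains only zeros. For illustration, the hypergraph of Figure 1 can be converted to an incidence matrix in this way.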
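To make the incidence-matrix definition concrete, the following minimal sketch builds the matrix for a small hypothetical hypergraph (not the one in Figure 1) and checks the "no all-zero row or column" condition. The function names and the example hyperedges are illustrative assumptions.

```python
from typing import List, Set


def incidence_matrix(n_vertices: int, hyperedges: List[Set[int]]) -> List[List[int]]:
    """Rows are hyperedges, columns are vertices 0..n_vertices-1."""
    return [[1 if v in edge else 0 for v in range(n_vertices)]
            for edge in hyperedges]


def is_hypergraph_matrix(matrix: List[List[int]]) -> bool:
    """True if no row and no column consists only of zeros."""
    rows_ok = all(any(row) for row in matrix)
    cols_ok = all(any(col) for col in zip(*matrix))
    return rows_ok and cols_ok


# Hypothetical example: 5 vertices, 3 hyperedges of sizes 3, 2 and 1.
edges = [{0, 1, 2}, {2, 3}, {4}]
A = incidence_matrix(5, edges)
for row in A:
    print(row)
print(is_hypergraph_matrix(A))
```

Dropping the hyperedge `{4}` would leave vertex 4 uncovered, so the last column would be all zeros and the check would fail.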
There exist quite a few hypergraph partitioning algorithms that have proved effective in different practical problems, such as partitioning circuit netlists, clustering categorical data, and segmenting images. In our study, we chose hMETIS (2.0pre1) (Note 1), which provides high-quality partitions at high speed. The algorithm in hMETIS is based on a multilevel hypergraph partitioning algorithm (Selvakkumaran and Karypis, 2006). We used the direct k-way partitioning scheme of hMETIS. The overall quality of the obtained partitioning can be computed using the following quality measures (Note 2):
3.1 Scaled Cost: defined for k-way partitioning as

$$\mathrm{ScaledCost} = \frac{1}{n(k-1)} \sum_{i=1}^{k} \frac{|E(P_i)|}{w(P_i)} \tag{5}$$

where $|E(P_i)|$ is the number of hyperedges that are incident but not fully contained inside the partition $P_i$ and, following the hMETIS manual, $n$ is the number of vertices and $w(P_i)$ is the total vertex weight of $P_i$.
The hMETIS program performs partitioning by minimizing the Scaled Cost while maximizing Absorption at the same time.
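As a rough illustration of the Scaled Cost measure, the sketch below assumes the form given in the hMETIS manual, the sum over partitions of |E(P_i)| / w(P_i) divided by n(k-1), with unit vertex weights so that w(P_i) is simply the partition size. The function name and the toy hypergraph are hypothetical, and hMETIS itself is not called.

```python
from typing import List, Set


def scaled_cost(hyperedges: List[Set[int]], parts: List[Set[int]]) -> float:
    """Scaled Cost of a k-way partition, assuming unit vertex weights."""
    k = len(parts)
    if k < 2:
        raise ValueError("Scaled Cost is only defined for k >= 2")
    n = sum(len(p) for p in parts)
    total = 0.0
    for part in parts:
        # Hyperedges that touch this partition but are not fully inside it.
        cut = sum(1 for e in hyperedges if e & part and not e <= part)
        total += cut / len(part)
    return total / (n * (k - 1))


edges = [{0, 1, 2}, {2, 3}, {4}]
parts = [{0, 1, 2}, {3, 4}]
cost = scaled_cost(edges, parts)
print(round(cost, 4))
```

Here only the hyperedge `{2, 3}` straddles the two partitions, so each partition contributes one cut hyperedge divided by its size.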

HyperGraph modeling for coreference resolution
To recast coreference resolution as a hypergraph partitioning problem, we view mentions as vertices and use various kinds of knowledge for coreference resolution to create hyperedges. In this section, we focus on several important aspects of the hypergraph model: designing hyperedges, choosing proper weights for hyperedges, performing partitioning, and setting the stopping criterion.

Hyperedges
Traditional learning-based coreference resolution systems represent knowledge in terms of features. For coreference resolution with hypergraph partitioning, however, we represent knowledge with hyperedges. As introduced in Section 3, a hyperedge is a special edge that can cover more than two vertices. We convert the mentions in a document to vertices in a hypergraph. Several mentions are considered to be covered by a hyperedge if they share a specific common property. In this way, we can capture information about multiple mentions at the same time, instead of a single mention pair as in traditional learning-based approaches to coreference resolution. Hence, the hypergraph provides a more powerful representation capability for knowledge.
In our study, we define the following types of hyperedges. For illustration, we use the text in Table 1 as an example.
1) FullString: This type of hyperedge covers the mention vertices that have the same string (excluding the determiners).
For example, in Table 1, mentions $m_2$ and $m_7$ have the same string, and so do mentions $m_6$ and $m_{10}$. We then have two FullString hyperedges that cover {$m_2$, $m_7$} and {$m_6$, $m_{10}$}, respectively.
2) Head: This type of hyperedge covers mentions with the same head string.
3) Gender: This type of hyperedge covers the mentions that have the same gender type. To consider only effective partitioning of mentions, there are only two types of gender (Note 3), Male and Female, so a hypergraph has at most two Gender hyperedges. A mention with a neuter gender (such as "it", "the president") is not covered by a hyperedge of type Male or Female. In Table 1, mentions $m_4$, $m_5$ and $m_9$ are male and thus will be covered by a hyperedge {$m_4$, $m_5$, $m_9$}.
8) TwoSentences: This type of hyperedge is similar to ThreeSentences, but it only considers mentions within a two-sentence window.
9) OneSentence: This type of hyperedge is similar to ThreeSentences, but it only considers mentions within the same sentence.
11) Appositive: This type of hyperedge covers the mentions that are in the same appositive structure.
12) CannotLink: As suggested by Wagstaff (2002), we enforce cannot-link constraints during partitioning. For this purpose, we create hyperedges to cover a pair of mentions <$m_i$, $m_j$> that are not likely to corefer. In our study, we consider the following constraints:
a. $m_j$ is an indefinite noun phrase.
b. $m_i$ and $m_j$ are three sentences apart and do not have the same head word.
c. $m_i$ and $m_j$ are pronouns and do not agree in number or gender.
d. $m_j$ is a pronoun and $m_i$ and $m_j$ are three sentences apart.
The hyperedges generated for Table 1 are in Table 2.
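The property-based hyperedges above all follow the same pattern: group mentions by a key and keep every group with at least two members. The sketch below illustrates this for FullString, Head, and Gender, plus the pronoun-agreement CannotLink constraint (c). The `Mention` fields, the determiner list, and the example mentions are hypothetical and do not reproduce Table 1.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, Hashable, List, Optional, Set

DETERMINERS = {"the", "a", "an", "this", "that", "these", "those"}  # assumed list


@dataclass
class Mention:
    idx: int
    text: str
    head: str
    gender: Optional[str] = None  # "male", "female", or None for neuter/unknown
    number: Optional[str] = None  # "singular" or "plural"
    is_pronoun: bool = False


def full_string_key(m: Mention) -> Optional[str]:
    words = [w for w in m.text.lower().split() if w not in DETERMINERS]
    return " ".join(words) or None


def head_key(m: Mention) -> str:
    return m.head.lower()


def gender_key(m: Mention) -> Optional[str]:
    return m.gender if m.gender in ("male", "female") else None


def group_hyperedges(mentions: List[Mention],
                     key: Callable[[Mention], Optional[Hashable]]) -> List[Set[int]]:
    """One hyperedge per key value shared by at least two mentions."""
    groups = defaultdict(set)
    for m in mentions:
        k = key(m)
        if k is not None:
            groups[k].add(m.idx)
    return [ids for ids in groups.values() if len(ids) >= 2]


def pronoun_cannot_link(mentions: List[Mention]) -> List[Set[int]]:
    """Constraint (c): two pronouns that disagree in number or gender."""
    return [{a.idx, b.idx} for a, b in combinations(mentions, 2)
            if a.is_pronoun and b.is_pronoun
            and (a.gender != b.gender or a.number != b.number)]


mentions = [
    Mention(1, "the president", "president", None, "singular"),
    Mention(2, "Mr. Powell", "Powell", "male", "singular"),
    Mention(3, "Powell", "Powell", None, "singular"),
    Mention(4, "he", "he", "male", "singular", True),
    Mention(5, "The President", "President", None, "singular"),
    Mention(6, "she", "she", "female", "singular", True),
]
print(group_hyperedges(mentions, full_string_key))
print(group_hyperedges(mentions, head_key))
print(group_hyperedges(mentions, gender_key))
print(pronoun_cannot_link(mentions))
```

Neuter or unknown mentions return `None` from `gender_key` and are therefore left out of both Gender hyperedges, matching the description of the Gender type.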

Weights for hyperedges
We classify the hyperedges into six categories based on their confidence level for a positive coreference determination, as shown in Table 3.
The hypergraph partitioning algorithm tends to separate vertices covered by a low-weight hyperedge and to retain in the same partition vertices covered by a high-weight hyperedge. Thus, we manually assign a higher weight to a hyperedge that covers mentions that are likely to corefer, and a lower weight to a hyperedge that covers mentions that are unlikely to corefer. Table 3 shows the weights for the different levels of hyperedges. The hyperedge CannotLink was assigned the lowest weight of zero (Note 5). The hyperedges FullString, Appositive, and NameAlias, which are strong indicators of a coreference relationship (Soon et al., 2001), were given the highest weight.
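Weighted hyperedges have to be serialized before a partitioner can read them. The sketch below writes the plain-text layout that the hMETIS manual describes, as far as we recall it: a header line with the number of hyperedges, the number of vertices, and a format flag `1` for weighted hyperedges, followed by one line per hyperedge holding its integer weight and its 1-based vertex ids. The layout should be checked against the manual, and the weights used here are hypothetical, not the values of Table 3.

```python
from typing import List, Set, Tuple


def to_hmetis(n_vertices: int, weighted_edges: List[Tuple[int, Set[int]]]) -> str:
    """Serialize weighted hyperedges; vertex ids are 1-based."""
    lines = [f"{len(weighted_edges)} {n_vertices} 1"]  # '1' = weighted hyperedges
    for weight, edge in weighted_edges:
        lines.append(" ".join([str(weight)] + [str(v) for v in sorted(edge)]))
    return "\n".join(lines) + "\n"


# Hypothetical weights for three hyperedge types.
WEIGHTS = {"FullString": 10, "Head": 5, "Gender": 2}
edges = [
    (WEIGHTS["FullString"], {1, 5}),
    (WEIGHTS["Head"], {2, 3}),
    (WEIGHTS["Gender"], {2, 4}),
]
text = to_hmetis(6, edges)
print(text, end="")
```

CannotLink hyperedges are left out of this example on purpose; how a zero weight should be encoded for the tool is not shown here.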

Coreference resolution with mention partitioning
Given a document, all the mentions are initially placed into one large cluster. As described above, we map each mention to a vertex in a hypergraph and find all the possible hyperedges for the vertices. We then invoke the hMETIS program to perform mention partitioning. The process is unsupervised. After the partitioning stops, each generated partition is taken as a coreferential cluster for a single entity, with all the mentions in the same cluster being coreferential with each other.
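Turning the partitioner's output back into coreference clusters is a simple grouping step. The sketch below assumes the output is a partition vector with one partition id per line, where the i-th line belongs to the i-th vertex (this is how we recall the hMETIS output file; verify against the manual). The function name and the sample vector are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List


def clusters_from_partition_vector(lines: Iterable[str]) -> List[List[int]]:
    """Group 1-based mention ids by the partition id found on each line."""
    ids = [line.strip() for line in lines if line.strip()]
    clusters = defaultdict(list)
    for vertex, part in enumerate(ids, start=1):
        clusters[int(part)].append(vertex)
    return [clusters[p] for p in sorted(clusters)]


sample = "0\n1\n1\n0\n2\n0\n".splitlines()
entities = clusters_from_partition_vector(sample)
print(entities)
```

Each inner list is then read as one entity, so mentions 1, 4, and 6 in this toy vector would be reported as coreferential.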

Stopping Criterion
One problem with mention partitioning is deciding when the process should stop. In other partitioning tasks, the number of target clusters is predefined. For our task, however, it is not possible to give a predefined cluster number, as the number of entities in a document is unknown before resolution. Therefore, we need to design a stopping criterion for partitioning.
As described in Section 3, for a given number of clusters k, the hMETIS program performs partitioning by minimizing the Scaled Cost (5) while maximizing Absorption (6) at the same time. After its internal optimization, hMETIS finds the best partition, with final Scaled Cost and Absorption values denoted $\mathrm{ScaledCost}_{final}(k)$ and $\mathrm{Absorption}_{final}(k)$, respectively.
Enlarging the target entity number k makes the former value increase and the latter decrease. Without any prior knowledge, we try to find a k that balances these two opposing trends. To this end, we define a stopping criterion based on the product of the final $\mathrm{ScaledCost}_{final}(k)$ and $\mathrm{Absorption}_{final}(k)$ values after optimization:

$$P(k) = \mathrm{ScaledCost}_{final}(k) \times \mathrm{Absorption}_{final}(k)$$

We prefer a partition with a high product of Scaled Cost and Absorption. For a given document, we put all the mentions in one cluster (k = 1) and perform partitioning repeatedly. The process stops at the round in which the value of P reaches its peak, or when each cluster contains only one mention. The generated clusters are output as the coreference resolution result. In our study, this stopping criterion achieved good results.
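The stopping rule can be sketched as a loop over k that keeps the last partition before the product P starts to fall. This is one possible reading of "reaches the peak" (the first local maximum). `partition_fn` is a hypothetical callback standing in for a partitioner run that returns the partition with its final Scaled Cost and Absorption; the numbers in the example are made up.

```python
from typing import Callable, Dict, List, Set, Tuple

Partition = List[Set[int]]
# Hypothetical callback: k -> (partition, final scaled cost, final absorption).
PartitionFn = Callable[[int], Tuple[Partition, float, float]]


def choose_partition(n_mentions: int,
                     partition_fn: PartitionFn) -> Tuple[int, Partition, float]:
    """Increase k until P = ScaledCost * Absorption drops, then keep the peak."""
    best = None
    for k in range(2, n_mentions + 1):  # k = n means one mention per cluster
        parts, scaled_cost, absorption = partition_fn(k)
        p = scaled_cost * absorption
        if best is not None and p < best[2]:
            break  # the previous round was the peak
        best = (k, parts, p)
    if best is None:  # fewer than two mentions: nothing to split
        return 1, [set(range(n_mentions))], 0.0
    return best


# Made-up values: Scaled Cost grows and Absorption shrinks as k increases.
FAKE_RUNS: Dict[int, Tuple[Partition, float, float]] = {
    2: ([{0, 1, 2}, {3, 4}], 0.10, 5.0),
    3: ([{0, 1}, {2}, {3, 4}], 0.20, 4.0),
    4: ([{0, 1}, {2}, {3}, {4}], 0.25, 2.0),
    5: ([{0}, {1}, {2}, {3}, {4}], 0.30, 0.0),
}
k, parts, p = choose_partition(5, FAKE_RUNS.__getitem__)
print(k, parts, p)
```

With these values P rises from 0.5 to 0.8 and then falls back to 0.5, so the loop stops and returns the three-cluster partition.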

Experimental Setup
In our study, we evaluated on the ACE-2 V1.0 corpus (NIST, 2003), which is divided into three domains: broadcast news (BNews), newspaper (NPaper), and newswire (NWire). As our approach is unsupervised, we did not use the training data and ran the system only on the test data. For the supervised systems used for comparison, however, the training and testing data were used together. The number of entities with more than one mention, as well as the number of mentions they contain, is summarized in Table 4.
For both training and resolution, an input raw document was processed by a pipeline of OpenNLP modules (Note 6), including a sentence boundary detector, a tokenizer, and a part-of-speech tagger. The boundaries of a mention are taken directly from the annotation in the corpus. Our experimental setup follows the official "diagnose" evaluation of ACE, in which coreference resolution is performed and evaluated on perfect mentions; this allows the utility of the hypergraph method to be validated with accurate mention features. We used the mention's head, boundaries, and semantic type information from the "gold" annotation. Other features, such as string matching, apposition, name alias, and distance, were all computed at run time. Following tradition, all results are reported using recall, precision, and F-measure based on the MUC-6 scoring algorithm (Vilain et al., 1995).
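For readers unfamiliar with the link-based MUC scoring of Vilain et al. (1995), the following is a compact sketch rather than the official scorer. Recall counts, for each key entity, how many links survive after the entity is split by the response clusters; precision swaps the roles of key and response. The function names and the toy clusters are illustrative.

```python
from typing import List, Set, Tuple


def muc_recall(key: List[Set[int]], response: List[Set[int]]) -> float:
    """Sum of (|S| - |p(S)|) over key sets S, divided by the sum of (|S| - 1)."""
    where = {m: i for i, cluster in enumerate(response) for m in cluster}
    num = den = 0
    for s in key:
        linked = {where[m] for m in s if m in where}
        unlinked = sum(1 for m in s if m not in where)  # treated as singletons
        num += len(s) - (len(linked) + unlinked)
        den += len(s) - 1
    return num / den if den else 0.0


def muc_scores(key: List[Set[int]],
               response: List[Set[int]]) -> Tuple[float, float, float]:
    """Return (recall, precision, balanced F-measure)."""
    r = muc_recall(key, response)
    p = muc_recall(response, key)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return r, p, f


key = [{1, 2, 3, 4}]
response = [{1, 2}, {3, 4}]
r, p, f = muc_scores(key, response)
print(round(r, 4), round(p, 4), round(f, 4))
```

In this toy case the response splits one four-mention entity in two, which loses one of the three required links (recall 2/3) while every proposed link is correct (precision 1).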

Results and discussion
Table 5 lists the performance of different coreference resolution systems.
For comparison, we first reimplemented the traditional unsupervised-learning-based system of Cardie and Wagstaff (1999) as a baseline. The first line of Table 5 shows the results of this system, which adopted the same clustering radius threshold (i.e., r = 4) as Cardie and Wagstaff (1999). Our reimplemented system (denoted by Cardie99r4) achieves a recall of 66.30% and a precision of 50.99%, giving an F-measure of 57.64%. This F-measure is higher than the result (52.8%) they reported on the MUC-6 data.
Cardie and Wagstaff (1999)'s radius value was fine-tuned for the MUC-6 data and is not necessarily optimal for the ACE data. In our experiments, we examined the performance of the reimplemented system under radius values from 1 to 10. We found that the system achieves the best result when r = 6 (Cardie99r6), with 67.34% recall, 51.73% precision, and 58.5% F-measure. The recall, precision, and F-measure results for each domain are consistently higher than those of Cardie99r4. This indicates that the performance of their system is significantly affected by the threshold value.
In the experiments, we were also interested in comparing the performance of the unsupervised system with that of a supervised system. For this purpose, we implemented the classical decision-tree-based coreference resolution system of Soon et al. (2001) (denoted by Soon01); the results are shown in the third line of Table 5. Compared with Cardie99r6, the system has a drop in recall (up to 11.83%) but achieves a large improvement in precision (up to 21.23%). Overall, it produces an average 55.51% recall, 72.96% precision, and 63.05% F-measure. The F-measure is 4.54% higher than that of Cardie99r6. This is in line with Wagstaff (2002)'s report that Cardie and Wagstaff (1999)'s unsupervised approach obtained an F-measure about 9% lower than the supervised system.
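As a quick arithmetic check, assuming the reported F-measure is the balanced harmonic mean of precision P and recall R (the usual convention with MUC scoring, although the text does not state it explicitly), the average Soon01 figures are mutually consistent:

```latex
F = \frac{2PR}{P + R}
  = \frac{2 \times 72.96 \times 55.51}{72.96 + 55.51}
  = \frac{8100.02}{128.47}
  \approx 63.05
```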
The fourth line of Table 5 summarizes the performance of our system with hypergraph partitioning. The system produces a recall of 79.37%, which is 12.03% higher than Cardie99r6, with only a 2.87% loss in precision. Overall, the F-measure is about 2% higher than that of Cardie99r6. The difference from the supervised system (Soon01) is reduced to 2.57%, which is encouraging considering that our approach did not use any training data.
One interesting finding in the table is that unsupervised approaches tend to produce a lower precision but a higher recall than supervised approaches. This is expected because our hypergraph method is based on top-down partitioning: mentions tend to be retained in the same cluster unless they show some inconsistency. By contrast, a supervised approach is based on bottom-up merging: mentions are merged only if some coreference indicator, such as string matching, name alias, or appositive, is satisfied. The merging is comparatively conservative and thus leads to a higher precision but a lower recall.
We also examined how much each type of hyperedge affects the resolution performance. Table 6 summarizes the contribution of each kind of hyperedge to our HyperGraph system. The last three columns show the gain or loss in recall, precision, and F-measure, respectively, from removing a particular hyperedge type while keeping the rest in the HyperGraph system.
As our approach is partitioning-driven, the low-weight hyperedges play an important role in dividing mentions, so we first consider how much the hyperedge CannotLink affects the resolution performance. The last line of Table 6 shows the loss of performance from removing CannotLink from the system: recall drops by 37.56% and precision by 6.14%. Overall, the F-measure decreases by 18.22%. Removing the other hyperedge types decreases the overall F-measure by smaller amounts: Semantic (1.27%), NameAlias (0.54%), Number (0.35%), ThreeSentences (0.24%), Gender (0.14%), HeadString (0.08%), and Appositive (0.05%).
Interestingly, when only FullString, Person, TwoSentences, or OneSentence is removed, the final F-measure increases by 0.25%, 0.11%, 0.30%, and 0.21%, respectively. In other words, each of these four kinds of hyperedges lowers the performance of the system that uses all features. On closer analysis, we found that the high-weight FullString hyperedges are partly duplicated by the medium-weight HeadString hyperedges. Moreover, Person is just one semantic type, so the Person=True hyperedges duplicate the Semantic=Person hyperedges. Such redundant hyperedges lower the final resolution result. Likewise, TwoSentences is duplicated to some extent by ThreeSentences and OneSentence, and OneSentence by ThreeSentences and TwoSentences.
Our results show that reducing feature redundancy is a practical problem for unsupervised coreference resolution with hypergraph partitioning. We also experimented with removing any two, any three, or all four of these kinds of hyperedges while keeping the rest in the HyperGraph system. The results were all worse than using all features, because the features overlap with one another in the hypergraph.

Conclusion
This paper presented an unsupervised learning approach for coreference resolution based on hypergraph partitioning. It converts a document to a hypergraph in which a vertex corresponds to a mention in the document. It uses a hyperedge to cover mentions that share a specific common property, which can capture information about multiple mentions instead of only two mentions as in the traditional approaches based on the mention-pair model. Our approach adopts a hypergraph partitioning algorithm to divide mentions into clusters, each representing a single entity. The partitioning process can avoid the cascading errors of the previous clustering-based unsupervised approaches.
In the paper, we described the resolution framework, the definition of hyperedges, and the stopping criterion for partitioning. The evaluation on the ACE data set shows that the hypergraph partitioning approach performs better than the previous clustering-based unsupervised approach (by up to 1.97% in F-measure), and the gap to the supervised approach is only 2.57% in F-measure.
Our current work focuses on the framework of coreference resolution with hypergraph partitioning. There are several directions for future work:
(1) For simplicity, we currently use only some common knowledge, represented as hyperedges, for coreference resolution. We would like to explore more effective knowledge, such as grammatical roles, context template information, and other features proposed in Ng and Cardie (2002).
(2) In the current system, the weights for hyperedges were all heuristically designed. We intend to try weight-learning mechanisms, e.g., a genetic algorithm.
(3) The stopping criterion has a big influence on the final resolution performance. However, our current stopping criterion was defined heuristically. We would like to incorporate more prior knowledge related to coreference resolution.
(4) Feature redundancy is another problem for hypergraph partitioning. We will try to treat it as a learning problem.

Notes
Note 1. http://glaros.dtc.umn.edu/gkhome/fetch/sw/hmetis/hmetis-2.0pre1.tar.gz
Note 2. These definitions can be extended in a straightforward manner for hypergraphs with weighted hyperedges, as described in http://glaros.dtc.umn.edu/gkhome/fetch/sw/hmetis/manual.pdf
Note 3. The gender type of a person name was obtained from a name-gender list provided by the corpora in the NLTK package, while the gender of a common noun (e.g., mother, son, president) was obtained from WordNet (if the gender of a mention, such as "the president", is not available in WordNet, we set the gender type to neuter).

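3.2 Absorption: defined as

$$\mathrm{Absorption} = \sum_{i=1}^{k} \sum_{e \in E,\; e \cap P_i \neq \emptyset} \frac{|e \cap P_i| - 1}{|e| - 1} \tag{6}$$

where $E$ is the set of hyperedges, $|e \cap P_i|$ is the number of vertices of a hyperedge $e$ in partition $P_i$, and $|e|$ is the number of vertices of $e$.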
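A small sketch of the Absorption measure may help. It assumes the usual form, in which every hyperedge e that touches partition P_i contributes (|e ∩ P_i| - 1) / (|e| - 1). Hyperedges with a single vertex are skipped to avoid dividing by zero, which is an assumption of this sketch, and the toy hypergraph is hypothetical.

```python
from typing import List, Set


def absorption(hyperedges: List[Set[int]], parts: List[Set[int]]) -> float:
    """Total absorption of a partition; higher means hyperedges stay together."""
    total = 0.0
    for part in parts:
        for e in hyperedges:
            inside = len(e & part)
            if inside and len(e) > 1:  # skip single-vertex hyperedges
                total += (inside - 1) / (len(e) - 1)
    return total


edges = [{0, 1, 2}, {2, 3}, {4}]
keeps_big_edge = absorption(edges, [{0, 1, 2}, {3, 4}])
splits_big_edge = absorption(edges, [{0, 1}, {2, 3, 4}])
print(keeps_big_edge, splits_big_edge)
```

A hyperedge that lies entirely inside one partition contributes 1, and a hyperedge with only one vertex in a partition contributes 0 there.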

4) Number: This type of hyperedge covers the mentions that have the same number type. A hypergraph may have two Number hyperedges, one for singular mentions and one for plural mentions.
5) Person: This type of hyperedge covers the mentions that have the same person type. There are only two Person hyperedges, one for mentions that are persons and one for non-persons.

Table 1. This is an example about tables

Table 2. An example of generated hyperedges

Table 3. Weights for four kinds of hyperedges

Table 4. Statistics of entities (length > 1) and contained mentions for the test data set in ACE

Table 5. Results of different systems for coreference resolution