Using Sketch Engine to Investigate Synonymous Verbs

Synonymy is an important yet intricate linguistic feature in the field of lexical semantics. Using the 100 million-word British National Corpus (BNC) as data and the software Sketch Engine (SkE) as the analytical tool, this study examines the usage differences between raise and increase, two synonymous verbs notorious for their complex semantic and syntactic usage patterns. In addition to examining the collocates of the verbs, the study also investigates the syntactic patterns that the verbs typically occupy in sentence structure and their functional implications. The data analysis yields an informative delineation of the internal semantic structure of the synonym set. The results also show the need for the corpus approach to go beyond collocational analysis in the study of synonymous verbs. The limitations of using SkE to extract and disambiguate synonyms are also addressed. The paper ends by discussing the pedagogical implications that this research may have when the results are introduced into the classroom.


Introduction
Synonymy, or semantic equivalence, is an important yet intricate linguistic feature in the field of lexical semantics. Synonyms are not completely interchangeable; rather, they differ in shades of meaning and vary in their connotations, implications, and register (DiMarco et al., 1993). Any natural language contains a considerable number of synonymous words. For historical reasons, English is particularly rich in synonyms, which enables English speakers "to convey meanings more precisely and effectively for the right audience and context" (Liu & Espino, 2012, p. 198), but also constitutes a thorny area for EFL (English as a Foreign Language) learners because of their subtle nuances and variations in meaning and usage.
It thus comes as no surprise that an important aspect of English linguistics is to find proper measures for automatically identifying and extracting synonyms (Peirsman, Geeraerts, & Speelman, 2015) and for distinguishing one word from its synonyms or near-synonyms (Hanks, 1996; Biber et al., 1998; Gries, 2001; Xiao & McEnery, 2006; Divjak, 2006; Gries & Otani, 2010; Liu, 2010). Although the two orientations of researching synonyms are equally important, this paper focuses on the second one. The main purposes of this study are methodological, in that I would like to discover what the relative strengths and weaknesses of using Sketch Engine to research synonyms are, and what their scope of applicability is.
The rest of this paper is structured as follows. In the next section, I give an overview of related work, first introducing corpus studies of lexical semantics and then discussing corpus-based automatic extraction and discrimination of synonymous words. Section 3 presents the corpus data and tools used in this study. The results are presented and discussed in Section 4, where I show the success of Sketch Engine in researching synonyms. The final section summarizes the major findings and gives pointers for future research.

Corpus Studies of Lexical Semantics
In the field of lexical semantics, there are a number of closely related key issues such as "How do we know what words mean? What evidence do we have? Is this evidence observable and objective? How can large text collections (corpora) be used to study what words mean?" (Stubbs, 2001, p. 4). For centuries, researchers, language teachers, and dictionary makers have used both their own intuitions and attested uses of words, often in the form of thousands of quotations from printed books. However, it is only since the mid-1980s that corpus methods have been able to provide evidence about word meaning by searching across large text collections.
The approach of using corpus evidence to study the meaning of words or phrases is often labeled corpus semantics or empirical semantics, and the most active and influential scholars are called neo-Firthian corpus linguists. The leading figure is John Sinclair, who may well have been one of the first people to bring Firth's ideas together with a corpus linguistic methodology (Stubbs, 1996). Other important neo-Firthians include Michael Hoey, Susan Hunston, Bill Louw, Michael Stubbs, Wolfgang Teubert and Elena Tognini-Bonelli (McEnery & Hardie, 2012, p. 122).
At the core of the neo-Firthian school of corpus linguistics is the search for units of meaning. The assumption that single words or lemmas are the main unit of meaning has underlain the construction of English-language dictionaries for hundreds of years. However, the work of Sinclair and associates provides a considerable amount of evidence that units of meaning are phraseological units rather than single words. Inspired by Firth's (1957, p. 179) maxim that "you shall know a word by the company it keeps", Sinclair paid much attention to the context in which a word is used. He firmly believed in the principle of 'trust the text' (Sinclair, 2004) and claimed that 'the language looks rather different when you look at a lot of it at once' (Sinclair, 1991, p. 100).
Reading concordances and calculating collocates from a corpus are the two important ways Sinclair used to study a lexical item in its context; hence his well-cited book is entitled Corpus, Concordance, Collocation (Sinclair, 1991). The concordance is the basic tool for anyone working with a corpus. Long before the emergence of corpus linguistics, concordances to major works such as the Bible and Shakespeare were already available. The computer has merely made concordances easy to compile. For Sinclair (1991, p. 32), "A concordance is a collection of the occurrences of a word-form, each in its own textual environment. In its simplest form, it is an index. Each word-form is indexed, and a reference is given to the place of each occurrence in a text." In corpus linguistics, a simple and effective convention called KWIC (Key Word In Context) has been widely used.
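The KWIC convention can be illustrated with a minimal Python sketch. It is not how any particular concordancer is implemented; the function name kwic and the character-based context width are assumptions made for illustration, and the sample sentence is one of the BNC examples quoted later in this paper.

```python
import re

def kwic(text, keyword, width=30):
    """Return one KWIC line per hit: the keyword in brackets,
    flanked by up to `width` characters of left and right context."""
    lines = []
    # Whole-word, case-insensitive match of the word-form (no lemmatization).
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right:<{width}}")
    return lines

sample = ("For example, if the government were to increase the rate of VAT "
          "gross turnover may increase (and with it the rent) while the "
          "tenant's net profit remains static.")

for line in kwic(sample, "increase"):
    print(line)
```

Because the search term is aligned in a central column, recurrent patterns to its left and right become easy to scan by eye, which is the point of the convention.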
Closely related to concordance is the notion of collocation. Firth (1957, p. 181) defines collocations of a given word as "statements of the habitual and customary places of that word". Nevertheless, Firth's research on collocation is largely intuition-based, which is in sharp contrast with most corpus linguists' belief that the only way to reliably identify the collocates of a given word is to study patterns of co-occurrence in a corpus. For example, Hunston (2002, p. 68) argues, "Collocation may be observed informally in any instance of language, but it is more reliable to measure it statistically, and for this a corpus is essential." The idea that Firth proposed is operationalized in Sinclair and associates' early work from 1970 (reprinted in 2004), which may be considered a methodological elaboration on the concordance. A collocation is a co-occurrence pattern that exists between two items that frequently occur in proximity to one another, but not necessarily adjacently or, indeed, in any fixed order. Closely related to collocation are the notions of node and collocate. A node is an item whose total pattern of co-occurrence with other words is under examination; a collocate is any one of the items which appears with the node within a specified span (Sinclair et al., 2004, p. 10). Collocates are also determined within particular spans: "Two other terms . . . are span and span position. In order that these may be defined, imagine that there exists a text with types A and B contained in it. Now, treating A as the node, suppose B occurs as the next token after A somewhere in the text. Then we call B a collocate at span position +1. If it occurs as the next but one token after A, it is a collocate at span position +2, and so on." (Sinclair et al., 2004, p. 34)
In order to test whether two words are significant collocates, four pieces of data are required: the length of the text in which the words appear, the number of times each of the two words appears in the text, and the number of times they occur together (Sinclair et al., 2004, p. 28). The optimal span is 4:4, as demonstrated in Sinclair's (1991, p. 170) definition of collocation: "Collocation is the co-occurrence of two or more words within a short space of each other in a text. The usual measure of proximity is a maximum of four words intervening". On the basis of Sinclair's work, Hoey (2005, p. 5) defines collocation as "a psychological association between words (rather than lemmas) up to four words apart and is evidenced by their occurrence together in corpora more often than is explicable in terms of random distribution".
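The notions of node, collocate and span can be made concrete with a short Python sketch. It is only an illustration under simplifying assumptions: the text is pre-tokenized by whitespace, word-forms are not lemmatized, and the function names collocates and cooccurrence_data are hypothetical. The token list is a lowercased BNC sentence quoted later in this paper.

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count the tokens found within `span` positions to the left
    or right of every occurrence of `node` (a 4:4 span by default)."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - span)
        # Left window plus right window, excluding the node position itself.
        window = tokens[start:i] + tokens[i + 1:i + 1 + span]
        counts.update(window)
    return counts

def cooccurrence_data(tokens, node, collocate, span=4):
    """Return the four figures a significance test starts from:
    text length, node frequency, collocate frequency, joint frequency."""
    freq = Counter(tokens)
    joint = collocates(tokens, node, span)[collocate]
    return len(tokens), freq[node], freq[collocate], joint

tokens = ("germany promises not to raise interest rates "
          "but refuses to lower them").split()

# 'rates' stands at span position +2 from the node 'raise'.
print(cooccurrence_data(tokens, "raise", "rates"))
# 'lower' stands at span position +6, so it falls outside a 4:4 span.
print(collocates(tokens, "raise")["lower"])
```

Widening the span (for example span=6) would bring lower into the window, which shows why the choice of span matters for which items count as collocates.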
The units Sinclair argues for, units which reach beyond the word and thus incorporate the collocations of words, are referred to either as extended units of meaning or as lexical items (Sinclair, 1996, 2004). Stubbs (2001, 2009) develops Sinclair's ideas into a systematic account of how the extended lexical units around a word may be studied by the successive analysis of collocations, colligations, semantic preferences and semantic or discourse prosodies. Colligation, semantic preference and discourse prosodies are all abstractions of collocation; that is, they are built upon a collocation analysis.
In sum, Sinclair and his associates have shown that lexical items tend to occur in particular linguistic contexts, e.g. they tend to co-occur or collocate with certain other words, phrases, and/or grammatical structures, and these distributional tendencies help define their meanings. Sinclair's pioneering work has shaped contemporary research on lexical semantics, leading to experimental and corpus approaches to synonymous words.

Corpus Approaches to Synonyms
Boosted by the advent of the computer era and the central ideas of corpus semantics, the past decades have witnessed significant advances in the study of synonymy. Based on the Brown Corpus, Miller & Charles (1991) found that the more two words are judged to be substitutable in the same linguistic context (i.e. the same location in a sentence), the more synonymous they are in meaning. Employing a "lexical substitutability" test in a corpus study of the near-synonyms ask for, request and demand, Church et al. (1994) produced the same finding: the substitutability of lexical items in the same linguistic context constitutes a good indicator of their semantic similarity. Gries (2001) quantifies the similarity between English adjectives ending in -ic or -ical (like economic and economical) on the basis of the overlap between their collocations. Gilquin (2003) investigates the difference between the English causative verbs get and have, Glynn (2007) compares intra- and extralinguistic factors in the contexts of hassle, bother and annoy, and Gries & Otani (2010) study the synonyms big, great and large and their antonyms little, small and tiny. Other sets of synonyms that have attracted attention include strong and powerful (Church et al., 1991), absolutely, completely and entirely (Partington, 1998), big, large and great (Biber et al., 1998), quake and quiver (Atkins & Levin, 1995), principal, primary, chief, main and major (Liu, 2010), and actually, genuinely, really, and truly (Liu & Espino, 2012).
One corpus-based approach to synonyms is sometimes labeled the corpus-based behavioral profile (BP) study. Generally, a BP study uses corpus data to examine the distributional patterns of lexical items, such as the linguistic contexts a word is typically used in and the words it usually collocates with, so as to identify its unique semantic and usage patterns. For instance, Hanks (1996) examined the syntactic and collocational patterns of the verbs urge and incite, including the types of subjects (such as animate or inanimate) and the types of complementation structures each verb typically takes (such as a simple object complement vs. a complement involving an object noun plus an infinitive complement, as in "Rice urged the president to resolve the issue"). He also investigated, among other things, the semantics of the complement structures (i.e., whether the instances of the typical complement structure of a verb are positive or negative in meaning). The results of the examination helped uncover the behavioral profiles of the verbs, which in turn revealed the primary and secondary meanings of each verb and differentiated it from its synonyms. For instance, the behavioral profile of the verb urge distinguishes it from near-synonyms like ask, request, and order, because the latter verbs do not share the same complement collocation patterns, among other profile features, with urge.
In recent years, Gries and associates (Divjak & Gries, 2006; Gries, 2001; Gries & Otani, 2010) have developed a more sophisticated BP approach for examining both adjectives and verbs. In this approach, they first imported all the relevant corpus data into a spreadsheet, then manually annotated all the linguistic and contextual features they considered relevant, and finally analyzed the annotated data using a statistical program designed specifically for BP research called "R script BP 1.0". The types of linguistic and contextual features they annotated for synonymous verbs included, among others, tense/aspect, the types of complements, and clause types. By examining the various distributional features of the synonyms, such corpus-based BP studies have been able to effectively identify the internal semantic structures of the synonym sets being examined, including the fine-grained semantic differences among the synonyms in each set, an important type of information in the study of synonymy that traditional research methods had difficulty uncovering.
Nevertheless, the BP approach developed by Gries and associates may be too complex for pedagogical purposes, and thus the scope of its application may be limited. This study, based on the leading corpus tool Sketch Engine, aims to introduce a simple method that can be widely used by researchers, language teachers and even EFL students.

Corpus Data: BNC
The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century (Aston & Burnard, 1998). The written part of the BNC (90%) includes, for example, extracts from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, and school and university essays, among many other kinds of text. The spoken part (10%) consists of orthographic transcriptions of unscripted informal conversations and spoken language collected in different contexts, ranging from formal business or government meetings to radio shows and phone-ins.
The BNC is, by nature, monolingual, synchronic, general and sample-based: it deals with modern British English; it covers British English of the late twentieth century; it includes many different styles and varieties instead of being limited to any particular subject field, genre or register; and it contains many samples, which allows for a wider coverage of texts within the 100 million word limit. The corpus is encoded according to the Guidelines of the Text Encoding Initiative (TEI) to represent both the output from CLAWS (an automatic part-of-speech tagger) and a variety of other structural properties of texts (e.g. headings, paragraphs, lists etc.). Full classification, contextual and bibliographic information is also included with each text in the form of a TEI-conformant header.

Corpus Tool and Analysis Procedure
The Sketch Engine (SkE) is a leading corpus tool, widely used in lexicography, language teaching, translation and the like (Kilgarriff et al., 2004). The name actually refers to two different things: the software and the web service. The web service includes, in addition to the core software, a large number of corpora pre-loaded and 'ready for use', and tools for creating, installing and managing users' own corpora. Corpora in SkE are often annotated with additional linguistic information, the most common being part-of-speech information (for example, whether something is a noun or a verb), which allows large-scale grammatical analyses to be carried out.
SkE has a number of core functions: Thesaurus, Wordlist, Concordance, Collocation, Word Sketch, and Sketch-Diff. I will introduce those that are relevant for this study.

Thesaurus
In Sketch Engine the automatic identification of synonymy is achieved by the tool Thesaurus. SkE prepares a 'distributional thesaurus' for a corpus, a thesaurus created on the basis of common collocation. If two words have many collocates in common, they will appear in each other's thesaurus entry. For example, if we find instances of both raise revenue and increase revenue, that is one small piece of evidence that the two verbs raise and increase are similar. We can say that they 'share' the collocate revenue (noun) in the OBJECT relation. In a very large computation, for all pairs of words, we compute how many collocates they share, and the ones that share most (after normalization) are the ones that appear in a word's thesaurus entry. The thesaurus entry for the verb raise is shown in Figure 1. The similar words of raise are clustered into three categories: need, increase, and spend.
Since some collocates of raise may have different forms (for example, rate and rates), in 'Attribute' we choose lemma. The span (the number of words left and right of the search word) is (-5, 5); the minimum frequency of each collocate is set to 10 and the minimum frequency in the given range (in our case -5, 5) to 5. Of the seven measures available for calculating the strength of collocation (among them T-score, MI, MI3, log likelihood, min. sensitivity, and logDice), I choose the default one, logDice, which is considered more reliable than the frequently used MI (mutual information) measure.
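The contrast between logDice and MI can be sketched in a few lines of Python. The formulas below are the ones commonly given for the two measures (logDice = 14 + log2(2·f_xy / (f_x + f_y)); MI = log2(f_xy·N / (f_x·f_y))); the function names and all frequencies are invented for illustration and are not taken from the BNC.

```python
import math

def log_dice(f_xy, f_x, f_y):
    """logDice: depends only on the joint and individual frequencies,
    so it is unaffected by corpus size; its theoretical maximum is 14."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

def mutual_information(f_xy, f_x, f_y, n):
    """MI: observed co-occurrence relative to what chance would predict
    in a corpus of n tokens."""
    return math.log2(f_xy * n / (f_x * f_y))

N = 100_000_000  # corpus size comparable to the BNC (assumed round figure)

# Invented counts: a frequent pairing versus a very rare collocate.
frequent = dict(f_xy=100, f_x=1000, f_y=1000)
rare = dict(f_xy=5, f_x=1000, f_y=5)

for label, c in (("frequent", frequent), ("rare", rare)):
    print(label,
          round(log_dice(**c), 2),
          round(mutual_information(n=N, **c), 2))
```

With these invented figures MI ranks the rare collocate above the frequent one, whereas logDice ranks them the other way round. This tendency of MI to reward low-frequency items is one reason often cited for preferring logDice.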

Word Sketch
The function that gives the Sketch Engine its name is the word sketch: a one-page summary of a word's grammatical and collocational behavior. Figure 6 shows the word sketch for increase (verb). Its collocates are grouped according to the grammatical relations in which they occur. In the first column, for example, a number of words such as wage, cent, population, pay, spending and tax are grouped into the category subject, i.e., they are used as the subject of increase.
Sketch-Diff is probably the most straightforward function when researching synonymous words. Figure 7 presents the interface of Sketch-Diff. When users click the button 'Show Diff', the software generates a summary list for the two synonymous words in terms of collocates arranged by grammatical categories.

The Frequencies of Raise and Increase
Concordance enables researchers to compare the frequencies of synonymous words. As shown in Table 1, while increase as a noun is much more frequent than raise, the two words as verbs are quite close in terms of frequency.
It is clear that the dominant collocates of increase are also nouns, which fall mainly into two categories:
• Business and economic terms: number, rate, size, risk, production, share, price, tax, population, output, income, sale, profit, productivity, cost, demand, spending, expenditure, volume, investment, export, budget;
• Abstract nouns: pressure, efficiency, concentration, value, awareness, power.
In addition to nouns, adverb collocates are also quite salient. Of the 50 collocates, 8 are adverbs: greatly, significantly, dramatically, substantially, rapidly, steadily, gradually and considerably. Words describing numbers or percentages (such as per, cent, million) also frequently collocate with increase. It seems that the dominant collocates of increase are nouns which have much to do with amount, number, or value.

The Syntactic Patterns of Raise and Increase
The syntactic patterns of the two verbs are based on the Word Sketch function of SkE, as demonstrated in Figure 5. In order to present a fine-grained comparison, I summarize the 17 patterns of raise and the 21 patterns of increase in Table 4 and Table 5. In the first example of Table 4, the word in red, eyebrows, functions as the object of raise. It has to be noted that although the syntactic patterns of the two verbs are similar in many ways, there are also apparent differences, which can easily be shown with the Sketch-Diff function of SkE.

Direct Comparison of Lexical and Grammatical collocates
The Sketch-Diff function of SkE allows users to visually compare and contrast synonymous words according to their salient collocational context. Figure 8 shows part of the result of clicking 'Show Diff' in Figure 7. In the figure, the greener a word is, the more closely it relates to raise; the redder a word is, the more closely it relates to increase. For example, it is more usual to say to raise or lower than to increase or lower and, similarly, more natural to say to increase or decrease than to raise or decrease. Evidently, although the two verbs raise and increase share a number of syntactic patterns, the collocates in each pattern differ considerably. In the 'and/or' pattern, for example, lower frequently collocates with raise but is never used with increase. On the other hand, decrease occurs 54 times with increase, but there is no occurrence of decrease with raise.
In the 'modifier' pattern, there are many words (such as fourfold, fivefold, vastly) that only collocate with increase. Even where some words (such as gradually, further, significantly, substantially, considerably) do collocate with raise, their occurrences with increase are much more frequent. It is thus no surprise that the collocation tokens for raise number only 2128, as against 3558 for increase.
In the 'object' pattern, the collocation tokens for raise are 15789 and only 11285 for increase, indicating that there are more words used as objects of raise. Words like eyebrow, arm, head, issue, question, doubt and matter collocate only with raise, not with increase. Words that collocate with both verbs have substantially different frequencies, as illustrated in Table 6. It is an interesting observation that possibility and likelihood are semantically similar, but the former mainly collocates with raise and the latter with increase (Seretan, 2011, pp. 15-17).
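The kind of contrast described above, where some collocates occur with only one of the two verbs, can be mimicked with a small Python sketch. This is not Sketch-Diff itself, which also scores and colors shared collocates; the function name sketch_diff is hypothetical. The count of 54 for decrease is the one reported above, whereas the count for lower with raise is not given in the text and is a placeholder.

```python
def sketch_diff(counts_a, counts_b):
    """Split collocates of one grammatical relation into those attested
    only with word A, only with word B, or with both."""
    words_a = {w for w, n in counts_a.items() if n > 0}
    words_b = {w for w, n in counts_b.items() if n > 0}
    return {
        "only_a": sorted(words_a - words_b),
        "only_b": sorted(words_b - words_a),
        "shared": sorted(words_a & words_b),
    }

# 'and/or' relation. 54 is reported in the text; the 1 for 'lower'
# is a placeholder standing in for an unreported non-zero count.
raise_and_or = {"lower": 1, "decrease": 0}
increase_and_or = {"lower": 0, "decrease": 54}

print(sketch_diff(raise_and_or, increase_and_or))
```

The same function could be applied relation by relation (object, modifier, subject and so on) to obtain a rough summary list for a pair of synonyms.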

Discussion
So far we have demonstrated how to use some core functions of SkE to research two synonymous verbs, raise and increase. Each function, as noted before, has its advantages and disadvantages. Concordance not only enables us to look at recurring patterns of the words under investigation, but also provides frequency information for the synonymous words, as demonstrated in Table 1. Concordance can also make invisible patterns visible, as pointed out by Tognini-Bonelli (2001, p. 18): "In an individual text, we can observe neither repeated syntagmatic relations nor any paradigmatic relations at all, but it is precisely these two things which concordances make visible". Because it gives access to many important language patterns in texts, the concordance is considered "at the centre of corpus linguistics" (Sinclair, 1991, p. 170).
Important as concordance is, given that the concordances of raise and increase together consist of nearly two hundred thousand concordance lines, it is more valuable to find a list of collocates which tend to occur near or next to the target item under investigation. Collocation thus plays a central role in the research of synonyms, as strongly articulated by Gries (2001, p. 82): the meaning of words can be defined "in terms of their significant collocates". Word Sketch enriches the traditional study of collocation by providing the syntactic patterns that hold between the node (raise or increase in our case) and the collocates, as demonstrated in Figure 6 and Tables 4 and 5.
On top of all these, Sketch-Diff seems to be the easiest and most straightforward method to distinguish one word or phrase from another. Nevertheless, Sketch-Diff alone is not sufficient to demonstrate the semantic and syntactic features of the words or phrases under investigation. To begin with, a summary list like Figure 8 is incomplete. Many important collocates may be missing. For example, the word rate is an important collocate for both raise and increase, as shown in the following two examples:
(1) Germany promises not to raise interest rates but refuses to lower them. (BNC-HLP)
(2) For example, if the government were to increase the rate of VAT gross turnover may increase (and with it the rent) while the tenant's net profit remains static. (BNC-J6R)
Further investigation indicates that while increase can take rate as its object (other attested examples include increased respiratory rate, increased the rate of soil evaporation, etc.), it is raise rather than increase that typically collocates with interest rate(s).
In addition to the above limitation, using Sketch-Diff alone will make users of SkE lose the opportunity to take an overall look at the collocates and syntactic patterns of synonyms as a whole.For example, we might have lost the opportunity to observe and categorize the noun collocates of raise and increase and then find the subtle differences between the two verbs in terms of noun collocates.
In a nutshell, while using Sketch-Diff function alone can give researchers a quick glance at the apparent differences between synonyms in the light of both collocations and syntactic patterns, it would be more rewarding to examine synonyms by way of other core functions of SkE.
It has to be pointed out that SkE is not without its limitations. One apparent limitation is its automatic extraction of similar words. In Figure 1, some of the similar words provided by the tool Thesaurus seem to have little similarity with raise. The measure for automatically identifying and extracting synonyms proposed in a recent study (Peirsman et al., 2015) might help SkE to improve its accuracy.
Another problem facing SkE is the accuracy of its grammatical annotation (part of speech). As demonstrated in Figure 9, some of the hits returned for raise as a noun are in fact typical verb uses.
Figure 9. Search hits for raise (noun) in BNC
At the present stage, SkE still cannot semantically annotate a corpus as another web-based corpus tool, Wmatrix, does. This is not a problem, of course, but a direction that the SkE team may wish to pursue in the near future.
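Why a distributional thesaurus can return words that are not true synonyms may be easier to see with a toy Python sketch. It uses a plain Jaccard overlap of (relation, collocate) pairs, which is an assumption made here for simplicity and not the similarity measure SkE actually uses. The collocates for raise and increase are taken from the lists discussed in this paper and treated as objects for illustration; the profile for spend is invented.

```python
def shared_collocate_similarity(profile_a, profile_b):
    """Jaccard overlap of two sets of (relation, collocate) pairs:
    shared pairs divided by all pairs attested for either word."""
    if not profile_a and not profile_b:
        return 0.0
    return len(profile_a & profile_b) / len(profile_a | profile_b)

def as_objects(words):
    """Helper: tag every collocate with the OBJECT relation."""
    return {("object", w) for w in words}

raise_profile = as_objects(
    ["money", "tax", "rate", "price", "revenue", "question", "eyebrow"])
increase_profile = as_objects(
    ["tax", "rate", "price", "revenue", "risk", "production"])
spend_profile = as_objects(["money", "time"])  # invented for illustration

print(shared_collocate_similarity(raise_profile, increase_profile))
print(shared_collocate_similarity(raise_profile, spend_profile))
```

Under these assumptions increase scores well above spend, but spend still receives a non-zero score simply because it shares the object money with raise. Shared collocation signals distributional similarity, which is not the same thing as synonymy.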

Conclusion
In view of its importance and intricacy, researching synonymy is a crucial task in the field of lexical semantics. This paper has introduced the leading corpus tool SkE and its advantages in investigating synonymous verbs.
The results show that different functions of SkE can make different contributions to the discrimination of raise and increase.
This study also has a number of pedagogical implications. In our teaching, we have noticed that students tend to confine their use of raise to a limited scope, such as raise your hands or raise your voice. Instead, they tend to overuse increase (as in increase money, increase interest rate, etc.) where raise might be more appropriate.
Studies in first language acquisition show that children memorize not only words in isolation but also, to a large extent, groups (or chunks) of words. These chunks are viewed as the building blocks of language. They are available to speakers as ready-made or prefabricated units, contributing to the fluency and naturalness of their utterances. Thus, if EFL teachers aim to help their students achieve a high degree of fluency and accuracy, they may wish to use examples extracted from a corpus, as in Tables 4 and 5.
In view of the fact that there is a huge number of synonyms in English, it would be unrealistic for teachers to teach each pair of them to students. It might be more promising to teach students how to use SkE to conduct their own research; as the Chinese saying goes, 'It's better to teach one fishing than to give him fish'.

Figure 1. Clusters of similar words of raise in BNC

Figure 2. Simple search form

Figure 3. Query type for searching the lemma raise as verb

Figure 4. Search hits for the verb raise in BNC

Figure 6. Word sketch for the verb increase in BNC

Figure 7. The interface of Sketch-Diff

Figure 8. Comparison of raise and increase in terms of collocational patterns

Table 1. Frequency of raise and increase in BNC (per million)

Table 2 and Table 3 list the top 50 collocates of the verbs raise and increase automatically generated by the software:

Table 2. The top 50 collocates of raise (verb) in BNC

As shown in Table 2, the dominant collocates of raise are nouns which can be grouped into four categories:
• Physical organs: eyebrow, head, hand, arm, leg, eye, knee;
• Physical items: voice, glass, hon;
• Business and economic terms: money, fund, tax, cash, revenue, rate, finance, capital, price;
• Abstract nouns: question, issue, awareness, objection, charity, standard, point, possibility.
Other collocates such as above, whether, by, about, his and her have much to do with the grammatical relation, which will be analyzed in the next section.

Table 3. The top 50 collocates of increase (verb) in BNC

Table 4. The syntactic behavior of raise (verb) in BNC

Table 5. The syntactic behavior of increase (verb) in BNC

Table 6. Frequency comparison of words used as objects of both raise and increase

Table 2 & 3. Nevertheless, it is neither found in the 'object' pattern nor in the 'subject' pattern generated by Sketch-Diff. Below are two examples in which rate is used as the object of both verbs.