Item Types: Their Effect on the Sensitivity of Multiple-Choice Cloze Tests

To evaluate the sensitivity of multiple-choice cloze (MCC) tests that use different types of items—syntactic, semantic, and connective—to assess reading ability, 170 English as a foreign language (EFL) students in a vocational college in Taiwan were recruited. The students were divided into two groups (level A and level B) based on their scores on 4 classroom reading comprehension tests. Both groups then took 9 MCC tests that included a total of 50 cloze questions. Connective items were most sensitive for assessing reading ability. Research results and pedagogical applications are discussed.


Introduction
The role of cloze tests as a measure of overall reading ability has been controversial. One main reason, among others, is that when scoring a cloze test, we often ignore the discrepant traits of the cloze items. Thus, I studied the construct validity of multiple-choice cloze (MCC) test scores. Specifically, I analyzed syntactic, semantic, and connective clues by comparing the MCC test scores of two different levels of students.

Statement of the Problem
The word "cloze" is derived from "closure", a term that Gestalt psychologists use to describe the human self-organizing tendency to form a whole. Cloze has been a hot issue in language testing and teaching since it was developed (Taylor, 1953). Interestingly enough, cloze is still unfamiliar to people outside the EFL field, and even some older editions of Microsoft Word's spell checker do not recognize the word "cloze" and want to replace it with "close". Over the past six decades, the most commonly discussed problem with cloze has been its sensitivity: Can cloze test long-range constraints and thus measure integrative language proficiency or reading comprehension, or does it just test a single component of language at a time? Some important pro-cloze studies include Taylor (1953); Cziko (1978); Oller (1979); Bachman (1982, 1985); Jonz (1990, 1991); Hale, Stansfield, Rock, Hicks, Butler, and Oller (1988); Abraham and Chapelle (1992); Wu (1994); Chatel (2001); Kobayashi (2002); and Ravand and Sardai (2017). In contrast, some important skeptical or anti-cloze studies include Alderson (1979); Shanahan and Kamil (1982); Ashby-Davis (1985); and Brown, Yamashiro, and Ogane (1999).

Main Reason for the Disagreement
The main reason for the disagreement among cloze studies is inconsistent results, caused by ignoring the discrepant traits of the types of cloze items. Traditionally, the score of a cloze passage is the unweighted sum of the scores of all its items. Bachman (1985) emphasized that "although research on the cloze test has offered differing evidence regarding what language abilities it measures, there is a general consensus among researchers that not all the deletions in a given cloze passage measure exactly the same abilities" (p. 535). If the items do not always measure the same dimension, then the total score will not always yield consistent results. This is why readers must be cautious when interpreting the scores of a cloze passage: discrepancies in the sensitivity of cloze tests might occur because of the deletion of different types of items. To explain the outcomes of cloze tests, readers must consider the traits of each cloze item, because different types of cloze items affect the passage in different ways.
The classification of types of cloze items stated above is used for conventional blank-filling cloze tests, but the present study focuses on MCC tests. A three-part classification was used in the present study: syntactic, semantic, and connective clues. Briefly, syntactic clues refer to within-clause grammatical points, semantic clues refer to lexical points, and connective clues refer to across-clause and across-sentence conjunctions and to transitional words and phrases (Table 2).

The item-type categories from the previous studies summarized in Table 1 include the following:

A category that taps propositional information at an interclausal level and emphasizes knowledge of syntax.

RV (Reading Comprehension/Vocabulary): Requires long-range constraints and a lexical choice.

GR (Grammar/Reading Comprehension): Taps knowledge of surface syntax and within-clause propositional information.

VF (Vocabulary/Reading Comprehension): Primarily deals with vocabulary and invokes reading comprehension within clause boundaries.

Relative openness: Based on long-range constraints.
Review of Related Literature
Some studies have investigated the concurrent validity of cloze and reported data to explain the sensitivity of cloze items. These studies show that cloze item types are important for explaining cloze scores.
Two important questions related to the focus of the present study will be briefly reviewed: (1) the different dimensions and functions of different types of cloze items, and (2) the association between cloze performance and student proficiency level.
Several studies have emphasized that different types of cloze items address different dimensions of language points. Bachman (1985) said that "not all words in a given text function at only one or at the same structural level, and it therefore seems unreasonable to expect all deletions to depend equally on the same level or range of context for closure" (p. 538). Cziko (1978) claimed that "the contextual information available to the reader can be of three types: syntactic constraints, semantic constraints and discourse constraints" (p. 473). Syntactic constraints are those provided by the preceding words and the syntactic rules of the language (e.g., in the sentence The boy … in the snow, the word The will be followed by a noun). Semantic constraints are those provided by the meaning of the preceding words (e.g., the words The boy at the beginning of a sentence will most likely be followed by a verb phrase describing something a boy is likely to do), and discourse constraints are those provided by the topic of the text (e.g., all the sentences in a reading about skiing will be in some way related to skiing) (p. 473). Jonz (1991) stated that "…cloze was sensitive to features of textual processing that ranged far beyond the narrow focus of local syntactic constraint" (p. 6). Chihara, Oller, Weaver, and Chavez-Oller (1994) pointed out that "short-range constraints are often confused with mere syntactic elements, and yet they often involve lexical connections that transcend traditional conceptions of phrase structure syntax" (p. 144). Abraham and Chapelle (1992) reported that "in the multiple-choice cloze the potential answers were limited…Thus the multiple-choice format made content words differentially easier than function words" (p. 472).
In addition, some studies emphasize the close relationship between student proficiency levels and cloze test scores. Bachman (1985) found that "the percentage of correct closures was higher for groups with higher language proficiency than for groups with lower proficiency" (p. 536). Chihara et al. (1994) found that "as subjects increase in proficiency, they become more able to benefit from discourse constraints (i.e., long-range constraints) ranging across sentence boundaries" (p. 142).
The present study aimed to answer this question: Are there language proficiency-based differences in student scores on MCC tests that use different types of cloze items (syntactic, semantic, and connective)?

Participant Characteristics
About 200 students were recruited. Most had completed 6 years of EFL classes in junior and senior high school, and about half of them had also finished one year of 2-hour/week English classes in their first year of college.
The students took four 15-minute, 20-question reading comprehension quasi-placement tests during the first four weeks of the study. Students were then grouped into two levels: Level A students were those who had correctly answered at least 14 of the 20 questions in the 4 reading comprehension tests, and Level B students were those who had correctly answered 7-13 questions. Students who had correctly answered fewer than 7 questions were excluded from analysis because their tests had little reference value. Finally, 170 students completed the study.

Materials
The 4 reading comprehension passages used in this study to determine student language proficiency levels were taken from the website of Taiwan's Testing Center for Technological and Vocational Education (TCTE) (http://www.tcte.edu.tw/down_exam.php). Each passage was 180-240 words, and the difficulty levels of the passages were between Flesch-Kincaid Grade Levels 8.4 and 11.6, according to the Automatic Readability Checker (http://www.readabilityformulas.com/free-readability-formula-tests.php).
The 9 cloze passages were also from the TCTE.Each passage was 160-201 words; 5-7 items followed each passage, and the difficulty levels of the passages were between Flesch-Kincaid Grade Levels 8.3 and 11.5.
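The Flesch-Kincaid Grade Level used above to gauge passage difficulty is computed from word, sentence, and syllable counts. A minimal sketch of the standard formula follows; the counts in the example are illustrative, not taken from the study's passages:

```python
def fk_grade_level(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    return round(0.39 * words / sentences + 11.8 * syllables / words - 15.59, 1)

# Illustrative counts for a 180-word passage with 12 sentences and 260 syllables
print(fk_grade_level(180, 12, 260))  # 7.3
```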

Procedures
Students who had taken the required one year of 2-hour/week English classes in their first year of college were randomly chosen for the study. At the outset of the experiment, all students took a reading comprehension test randomly chosen from the item bank of the entrance examination of Vocational and Technical Education in Taiwan. Passages were of comparable length and difficulty (about 180-240 words); difficulty levels were between Flesch-Kincaid Grade Levels 9.5 and 11.6, according to the Automatic Readability Checker.
Then, in the following four weeks, each student answered 10-15 MCC test questions per week in 15 minutes (a total of 50 cloze items). The 50 MCC test questions were graded and categorized into the three types of cloze item clues by 5 experienced EFL instructors blinded to the identities of the tested students. If the agreement of the 5 EFL instructors was less than 80%, the disagreements were discussed and the classifications were revised. Three dubious and complicated items were discarded and replaced with new items. Consequently, 20 of the 50 items were syntactic, 14 were semantic, and 16 were connective. Unequal variance (Welch's) t tests were used to investigate the differences between the three item types. The independent variable of the study was student level (Level A vs. Level B), and the dependent variables were the scores on the three item types.

Results
The unequal variance t test was used to determine the differences between the test scores of Levels A and B. In Table 3, for syntactic items, the t value was 2.740 with 115 degrees of freedom; for semantic items, the t value was 4.902 with 113 degrees of freedom; and for connective items, the t value was 7.493 with 130 degrees of freedom. All three differences were significant (p < 0.05). These results are expected, since the language proficiency of Level A and Level B students differs. The mean differences of the three item types (Level A minus Level B) were then compared.
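The unequal variance (Welch's) t test used above can be sketched from first principles. The two score lists below are illustrative only, not the study's data:

```python
import math
from statistics import mean, variance

def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two independent samples with unequal variances."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb  # s^2 / n for each group
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Illustrative scores only
level_a = [1, 2, 3, 4, 5]
level_b = [2, 4, 6, 8, 10]
t, df = welch_t(level_a, level_b)
print(round(t, 3), round(df, 2))  # -1.897 5.88
```

Because the Welch-Satterthwaite degrees of freedom depend on each group's variance, they vary across comparisons, which is consistent with the different df values reported for the three item types.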
In Table 4, the mean difference between syntactic and semantic items was not significant (t = 0.809); however, the mean differences between connective and syntactic items and between connective and semantic items were significant (t = 2.796 and 2.315, respectively). These results show that connective clues were the most sensitive.

Findings
The study's most important findings were that connective clues (i.e., across-clause, within-sentence clues and across-sentence, within-text clues) were the most sensitive indicators of language level and reading comprehension: the differences between connective clues and the other two types (syntactic and semantic) were statistically significant. There was, however, no statistically significant difference between syntactic and semantic clues. Although more work remains to be done, these findings suggest that as language proficiency increases, students better understand the discourse constraints across sentence boundaries. These findings confirm Bachman's (1985) statement that "to develop a test that could potentially measure textual relationships beyond the clausal level, it was necessary to identify criteria for classifying and selecting words to be deleted" (p. 538). These results are also consistent with other studies (e.g., Chihara et al., 1994; Jonz, 1990, 1991). They also support the pro-cloze hypothesis that cloze can measure integrated reading comprehension in addition to testing discrete points of language use. The findings of the study might help clarify the nature of cloze items and enable researchers, test makers, and test administrators to choose rationally rather than randomly which words and phrases should be deleted. This kind of knowledge should lead to better language tests and more accurate assessments.

Limitations
This study has some limitations. First, because of the homogeneity of the participants, readers must be cautious when interpreting the outcomes and conclusions. Second, only a three-part system was used, not the four- or five-part systems that some others (e.g., Bachman, 1982, 1985; Jonz, 1990, 1991) used. Third, the five experienced EFL teachers in this study reached only 80% agreement when judging the three item types. These item types might need to be redefined in future studies to make the findings more general and convincing. Fourth, because of the student homogeneity and the narrow range of difference in language proficiency levels in this study, the two levels are somewhat arbitrary. In future studies, a broader range of language proficiency levels would likely yield more accurate results.

Pedagogical Application of the Study
Several useful conclusions related to EFL teaching and testing can be drawn from the present study's findings.
First, the study supports the pro-cloze position that cloze, including the MCC test, can be used to measure global and integrated reading proficiency. Second, to increase reading proficiency, it is important to develop the reader's ability to grasp long-range contextual constraints. Third, when designing a rational-deletion cloze test, connective clues cannot be neglected. Fourth, when designing or interpreting MCC tests, EFL teachers should carefully choose item types. Fifth, experienced EFL teachers need to be involved when designing a rational-deletion MCC test. Sixth, properly constructed MCC tests can teach foreign language students who are characterized as word-by-word readers to perceive how individual words relate to one another and, therefore, to pay attention to contextual constraints to promote their integrated reading proficiency.

Recommendations for Future Research
The language levels of the participants in this study were low-to-intermediate and tended to be homogeneous. If advanced-level students were recruited in future studies, the outcomes and conclusions would be more convincing.
In addition, in similar studies, if four- or five-part systems (e.g., Bachman, 1985; Jonz, 1990; Hale et al., 1988) are used and the scores are factor analyzed to ascertain whether cloze procedure scores fit a model that reflects the nature of the cloze item types (Jonz, 1990), the outcomes and conclusions should be more meaningful. Because MCC tests are currently widely used in national entrance examinations in many countries and in internationally known English proficiency examinations such as TOEFL and TOEIC (Lee & Wu, 2018, p. 2), research on MCC item traits should be meaningful.

Note. L: level; Level A: 14-20; Level B: 7-13; syn.: syntactic clues; sem.: semantic clues; con.: connective clues.

Copyrights
Copyright for this article is retained by the author, with first publication rights granted to the journal.
This is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).

Table 2.
Similarity between the categories of the present study and some main studies in Table 1 above. **Identical or very similar; *only slightly similar.