Optimal Number of Gaps in C-Test Passages

This study addresses the issue of the optimal number of gaps in C-Test passages. An English C-Test battery containing four passages, each with 40 blanks, was given to 104 undergraduate students of English. The data were entered into an SPSS spreadsheet. From the complete data with 160 blanks, seven additional datasets were constructed. In the first dataset the scores on the first five gaps in each passage were aggregated and the rest of the gaps were ignored, as if each passage had only five gaps. In the second dataset the scores on the first ten gaps were aggregated. In each subsequent dataset five more gaps were added. The eight datasets were analyzed and their psychometric properties were compared. The results showed that as the number of gaps in each passage increases, the item discrimination, reliability and factorial validity of the test increase accordingly. The implications for C-Test application are discussed.


Introduction
In second language research, researchers usually need to measure learners' general language proficiency and use it as a moderator or control variable. The assessment of language proficiency is only a side activity, done along with the testing of the other variables selected for the study. This is usually a cumbersome process, as the testing time and resources at the researchers' disposal are very limited.
Considering its ease of application and scoring and the short time needed to administer it, the C-Test can be a valuable proficiency measure in second language research. Besides, three decades of research on the C-Test have demonstrated its validity and reliability as a measure of first and second language proficiency (see Dörnyei & Katona, 1992; Eckes & Grotjahn, 2006; Grotjahn, 1992, 1994, 1996, 2002, 2006, 2010; Sigott, 2004).
Apart from its application in second language research, the C-Test has been widely used in large-scale testing contexts for selection and placement purposes in Germany (see Grotjahn, Klein-Braley & Raatz, 2002).
The C-Test is a general language proficiency test comprised of 4-6 short independent passages. Starting from the second word in the second sentence, the second half of every second word is deleted. To avoid problems of local item dependence in statistical analyses of C-Tests, each passage is considered a polytomous item or super-item. The total score on each passage is entered into the analysis as if each passage were an independent Likert item (Baghaei, 2007, 2010; Eckes & Grotjahn, 2006; Sigott, 2004).
C-Test proponents Raatz and Klein-Braley (1985, 2002) suggested either 20 or 25 blanks in each C-Test passage, but never provided a psychometric grounding for their suggestion. There is no research in the C-Test literature that shows the optimal number of gaps in C-Test passages from a psychometric point of view. The purpose of the present study is to systematically investigate and monitor the psychometric characteristics of a C-Test with different numbers of gaps in each passage. The characteristics considered here are internal consistency reliability, item discrimination and factorial validity.

Method

Instruments, participants and procedure
For the purpose of this study a C-Test battery comprising four passages, each containing 40 blanks, was constructed. The test was given to 104 Iranian undergraduate EFL students in two universities. As mentioned before, in statistical analyses of C-Tests the data for the single gaps are not entered into the analysis. That is, the gaps are not considered items; each passage is considered a super-item or testlet. A C-Test battery that contains, say, four short passages in fact has four polytomously scored items. The number of gaps correctly filled in each passage is counted, and that is the score for that super-item. This is done to avoid the problem of local item dependence in C-Tests. Under classical test theory and item response theory, items should be locally independent of each other. This means that a correct or wrong reply to an item should not lead to a correct or wrong reply to another item. This assumption is violated in C-Tests and cloze tests if each single gap is treated as an independent item. This is why the number of correct gaps in each passage is aggregated and passage total scores are entered into the analysis. In other words, C-Test passages are analysed as Likert items which can have as many as 20-25 response options.
The data were entered into an SPSS spreadsheet for analysis. From the complete data with 160 blanks, seven other datasets were constructed. In the first dataset the scores on the first five gaps in each passage were aggregated and the rest of the gaps were ignored, as if each passage had only five gaps. In the second dataset the scores on the first ten gaps were aggregated. In each subsequent dataset five more gaps were added. This resulted in eight datasets, Datasets 1 to 8, with five, ten, fifteen, twenty, twenty-five, thirty, thirty-five and forty gaps in each C-Test passage respectively.
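The construction of the eight datasets described above can be sketched as follows. This is a minimal illustration using hypothetical binary gap-level data (the study's actual SPSS data are not reproduced here); the shapes (104 examinees, four passages, 40 gaps) follow the design of the study.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical gap-level data: 1 = gap filled correctly, 0 = incorrect.
# Shape: (104 examinees, 4 passages, 40 gaps per passage).
responses = rng.integers(0, 2, size=(104, 4, 40))

def build_dataset(responses, n_gaps):
    """Aggregate the first n_gaps gaps of each passage into a
    super-item (passage) score, ignoring the remaining gaps."""
    return responses[:, :, :n_gaps].sum(axis=2)  # (examinees, passages)

# Datasets 1-8: 5, 10, ..., 40 gaps per passage.
datasets = {k: build_dataset(responses, k) for k in range(5, 41, 5)}
```

Each resulting matrix holds four polytomous super-item scores per examinee, which is the form in which the passages enter the subsequent analyses.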
Along with the C-Test, a reading comprehension test comprising 14 multiple-choice items was also administered to the sample. The test comprised two passages and was taken from CAE practice tests. The Cronbach's alpha reliability of the reading test was .57, which is due to the small number of items. The datasets were analysed separately, considering each passage as a polytomous item, and their psychometric properties were compared.

Results
In each dataset, classical item facilities (p-values) and item discrimination indices for the passages (super-items) were calculated. Test-level comparisons were made by calculating Cronbach's alpha reliability, the mean of the sample on each test form, standard deviations, the correlation with reading, and factorial validity, i.e., the percentage of variance explained by the first factor. Tables 1 and 2 show the results.
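The two classical statistics central to the comparison, Cronbach's alpha and the (corrected) item discrimination of each super-item, can be computed as sketched below. The scores are hypothetical, generated from a common ability factor so that the passages correlate positively, and are not the study's data.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical super-item scores: 104 examinees x 4 passages,
# driven by a common ability factor plus passage-specific noise.
ability = rng.normal(size=(104, 1))
scores = ability + rng.normal(scale=0.7, size=(104, 4))

def cronbach_alpha(scores):
    """Cronbach's alpha for an examinees-by-items score matrix."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def item_discrimination(scores):
    """Corrected item-total correlation: each super-item against
    the total score of the remaining super-items."""
    total = scores.sum(axis=1)
    return np.array([np.corrcoef(scores[:, j], total - scores[:, j])[0, 1]
                     for j in range(scores.shape[1])])

alpha = cronbach_alpha(scores)
disc = item_discrimination(scores)
```

The corrected item-total correlation is used rather than the uncorrected one so that a passage's own score does not inflate its discrimination estimate.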

Table 1 about here
Table 1 shows that as the number of gaps in the passages increases, the discriminations of the super-items (passages) increase accordingly. However, the p-values of the passages show no pattern, which is reasonable. There is no reason why the difficulty indices of the first five gaps should be greater or smaller than those of the first ten gaps, unless we assume that test-takers get fatigued and lose concentration when they reach the gaps at the end of the passages, or that they become familiar with the test as they work through the texts, making the subsequent items easier for them. The fact that no falling or rising pattern in the p-values is observed means that forty gaps and four passages are well within the concentration span of the test-takers, and that there is no learning effect to influence the item difficulty indices.

Table 2 about here
Table 2 shows the test-level statistics for the datasets with different numbers of gaps in each passage. It is evident from the table that Cronbach's alpha reliability increases as the number of gaps in each passage increases. In order to make the means and standard deviations of the sample comparable across datasets, the percentage of correct replies for each examinee was computed rather than the sum scores. The means and standard deviations of these percent-correct scores were then computed and compared. As the table shows, there is no pattern in the means and standard deviations. Observing no pattern in the means parallels the earlier finding of no pattern in the p-values. This also indicates that the test-takers do not get tired as they work through the passages.
The eight datasets were analyzed with principal component analysis (PCA). Before performing PCA, the factorability of the data was checked for all the datasets. The Kaiser-Meyer-Olkin value for the dataset with five gaps in each passage was .69, which exceeds the recommended value of .60 (Kaiser, 1974). The value increases as the number of gaps increases. Bartlett's test of sphericity was significant for all eight datasets (p < .001). These results suggest the suitability of the data for factor analysis.
Principal component analysis resulted in one-factor solutions for all datasets. The fifth column, "% Variance", shows that the variance explained by the first factor increases as the number of gaps increases. Varimax rotation does not change the percentage of variance explained by the first factor in the datasets.
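The "% Variance" statistic can be obtained directly from the eigenvalues of the inter-passage correlation matrix, since in PCA the first component's share of variance is the largest eigenvalue divided by the sum of all eigenvalues. A sketch with hypothetical scores (not the study's data):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical super-item scores with a common factor (4 passages).
ability = rng.normal(size=(104, 1))
scores = ability + rng.normal(scale=0.7, size=(104, 4))

def first_factor_variance(scores):
    """Percent of variance explained by the first principal component
    of the inter-passage correlation matrix."""
    corr = np.corrcoef(scores, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)  # ascending order
    return 100 * eigvals[-1] / eigvals.sum()

pct = first_factor_variance(scores)
```

With four passages, the first eigenvalue of the correlation matrix is at least 1, so this percentage cannot fall below 25; values well above that, as reported in the study, indicate a strong common factor.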
As the last column shows, the correlation between the C-Tests with varying numbers of gaps and the reading comprehension test does not show a specific pattern and does not increase with the number of gaps. In fact, the highest correlation is observed with 15 gaps. This indicates that the criterion-related validity of the C-Test is somewhat independent of the number of gaps in the passages, and that the precision of predictions does not depend on the length of the passages.

Discussion and Conclusions
The results of the present study clearly show that as the number of gaps in C-Test passages increases, reliability and factorial validity increase accordingly up to a certain point. However, as the number of gaps increased from 35 to 40, no change in reliability was observed. Furthermore, the improvement in reliability in tests with 20 to 35 gaps was very small. The same pattern was observed for factorial validity: there is a great leap in factorial validity from five to 20 gaps, but from 20 gaps onward the change is very small. This indicates that as the number of gaps in each passage increases, we obtain more information about the examinees' proficiency. After a certain point, however, the information that extra gaps add becomes redundant and we do not learn more about the respondents. Grotjahn (1987) argues that with a small number of gaps we cannot measure text-level macro-skills; he states that 25 or even 30 blanks are needed to measure these skills. The results of the present study suggest that with 25 gaps in each passage the reliability and factorial validity are reasonably high for most purposes. For placement and research purposes, where only rough proficiency groupings are needed, C-Tests with 15 gaps can be used. This saves a great deal of administration and scoring time and leaves the researcher with plenty of time for the testing of the other variables of interest. However, for high-stakes testing contexts, where more precise measures are required, more gaps are needed.
One limitation of the study is that the datasets of five- to 35-gap passages were not actually separate test forms. That is, the data for these datasets came from the entire passages; the additional gaps were simply ignored in the analyses. Test-takers had the chance to take advantage of the entire passages when answering the items, which may have affected the results. The results might be different if C-Tests were constructed from very short passages containing only ten or fifteen gaps, without further text to help students in text-processing. Future research in this line should focus on constructing separate C-Test forms whose passages have different numbers of gaps and comparing their psychometric properties. In particular, comparing the results of such a study with those of the present study could be very informative as regards the effects of larger context on C-Test processing.
It is important to note that the results of this study can only be generalized to C-Tests which contain four passages. The number of passages can also have significant effects on the psychometric properties of the C-Test. It would be interesting to study the conjoint effect of the number of passages and the number of gaps in each passage; in other words, what combination of number of passages and number of gaps per passage optimizes the psychometric properties of the C-Test.

Table 1.
Passage difficulty and discrimination for the datasets

Table 2.
Test level statistics for the datasets