On the Factor Structure of a Reading Comprehension Test

To investigate the construct validity of a section of a high-stakes test, an exploratory factor analysis using principal components analysis was employed. The rotation used was varimax, with a suppression level of .30. Eleven factors were extracted from 35 reading comprehension items. The fact that these factors emerged speaks to the construct validity of the test. However, the problem of over-factoring was obvious. This may be attributable to the different testing paradigms on which the items were based: the test constructor opted for passages from TOEFL, FCE and IELTS books, with much alteration.


Introduction
The University of Tehran administers a proficiency test to PhD candidates on a yearly basis. The test can be considered a high-stakes one by virtue of the fact that almost 10,000 candidates take it. Admission tests for universities or other professional programs, certification exams, and citizenship tests are all high-stakes assessment situations (Roever, 2001). According to Messick (1988), if the validity of a test is not known, it might have undesirable consequences for society at large.
The purpose of the current study is to investigate the factor structure of the reading section of the University of Tehran English Proficiency Test (the UTEPT). While the UTEPT plays a key role in the academic lives of individuals, no in-depth study has ever been conducted regarding the validity of the test. The only study is that of Zand Karimi (2005). That study, however, has methodological flaws, not the least of which is the inappropriate use of Principal Components Analysis (PCA). Specifically, PCA was applied with no reference to loading patterns. The current study attempts to shed more light on the factor structure of the reading comprehension section of the test. Information on other sections can be found elsewhere (Rezaee and Salehi, 2008; Salehi and Rezaee, 2008).

Definitions of Construct Validity
Palmer and Groot (1981) view construct validation as a theory-testing procedure and distinguish it from all types of validity in which reference to a criterion is important. In their definition, the importance of exploratory factor analysis and confirmatory factor analysis is underscored. They maintain that:
In construct validation, one validates a test not against a criterion or another test, but against a theory. To investigate construct validity, one develops or adopts a theory which one uses as a provisional explanation of test scores until, during the procedure, the theory is either supported or falsified by the results of testing the hypotheses derived from it. (p. 4)
Hughes's (1989) definition has often been quoted by other researchers (e.g., McDonough, 1995):
A test, part of a test, or a testing technique is said to have construct validity if it can be demonstrated that it measures just the ability which it is supposed to measure. One might hypothesize, for example, that the ability to read involves a number of sub-abilities, such as the ability to guess the meaning of unknown words from the context in which they are met. (p. 26)
An interesting point about this definition is that it can be applied to language testing per se. What Hughes implies is that reading is a multi-faceted phenomenon. There are various sub-abilities involved in the reading process; inferencing, vocabulary, and topic identification are some of them.

Approaches to Construct Validation
There have been several approaches to test validation. The sketch offered by Alderson, Clapham, and Wall (1995) is the most appealing. The first approach they discuss is correspondence with theory: the test results are supposed to verify the theory. The authors remind us that the theory itself is not called into question. The second approach is internal correlations. If a test battery, such as a proficiency measure, is composed of sub-parts, then the correlations among these sub-parts should be low, so that evidence can be collected on their distinctness. The authors rightly mention that the correlation of any sub-part with itself is necessarily one, that is, perfect. To ensure that the test has construct validity, the sub-parts should also correlate positively with the total test. Another problem may then arise: correlating a sub-part with a total score that includes that sub-part may inflate the correlation. To solve this problem, the authors suggest excluding the sub-part from the total test before running the correlation. A third approach they touch upon is factor analysis, which is explained in the following sections. A fourth is the multitrait-multimethod (MTMM) approach, which is elaborated on in due course. The last approach is taking account of test bias by assessing the role of background knowledge, gender, race, etc.
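The correction for inflated sub-part/total correlations can be sketched in a few lines of Python. The sketch is illustrative only: the function name and the simulated scores are assumptions introduced here and are not part of the UTEPT analysis.

```python
import numpy as np

def part_total_correlations(scores):
    """Return (uncorrected, corrected) sub-part/total correlations.

    scores: 2-D array, rows = test takers, columns = sub-part scores.
    """
    scores = np.asarray(scores, dtype=float)
    total = scores.sum(axis=1)
    uncorrected, corrected = [], []
    for j in range(scores.shape[1]):
        part = scores[:, j]
        # Total that still contains the sub-part (inflated correlation).
        uncorrected.append(np.corrcoef(part, total)[0, 1])
        # Total with the sub-part removed (corrected correlation).
        corrected.append(np.corrcoef(part, total - part)[0, 1])
    return np.array(uncorrected), np.array(corrected)

# Simulated data (hypothetical): three sub-parts sharing one ability.
rng = np.random.default_rng(0)
ability = rng.normal(size=500)
scores = ability[:, None] + rng.normal(size=(500, 3))

uncorrected, corrected = part_total_correlations(scores)
print("uncorrected:", uncorrected.round(2))
print("corrected:  ", corrected.round(2))
```

With positively related sub-parts, the corrected coefficients come out lower than the uncorrected ones, which is the inflation the authors warn about.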
Of the approaches mentioned above, three need more elaboration than has already been given. The first is factor analysis, the second is the multitrait-multimethod design, and the last is the role of background knowledge in assessing bias in the testing process.

Factor Analysis
Baker (1989) maintains that "factorial analysis is broadly speaking, to simplify a variety of sets of scores (which we will call variables) for a given population" (p.62).
There are two major types of factor analysis: exploratory and confirmatory.
As for exploratory factor analysis, Bachman (1990) maintains, "In the exploratory mode, we attempt to identify the abilities, or traits that influence performance on tests by examining the correlations among a set of measures" (p. 260). Bachman (1990) offers the following insight about confirmatory factor analysis: "In the confirmatory mode, we begin with hypotheses about traits and how they are related to each other, and attempt to either confirm or reject these hypotheses by examining the observed correlations" (p. 260).

Multitrait Multimethod
The pioneers of MTMM designs were arguably Campbell and Fiske (1959). Palmer and Groot (1981) maintain that the design was applied to language testing by Stevenson (1981). What follows is an overview of the concept, then its theoretical underpinnings, and then research studies.
Test scores may be a function of both the trait and the method used to test it. For example, a trait may be tested by different methods, such as multiple-choice completion and simple completion. If two individuals with the same overall grammatical knowledge perform differently under the two test conditions, then the difference can be attributed to the influence of the methods. Essential to MTMM designs are the notions of convergent and divergent validity.
As for convergent validity, if a trait is tested by two methods, the correlation is expected to be high because the trait is the same in each method. So, if a group of testees take a grammar test in both multiple-choice and simple-completion form, the correlation is supposed to be high because grammar is being tested in each case, and any difference can be attributed to the effect of the method used.
On the other hand, divergent or discriminant validity is logically related to the convergence of scores. The difference between convergence and divergence can be illustrated with an example. Vocabulary and grammar are supposed to tap different constructs. A low correlation between the two speaks to the discriminant validity of the tests. Palmer and Groot (1981) rightly remind us that a high correlation between two apparently distinct traits may indicate that the two are related at a deeper level. For example, reading and writing are supposed to be distinct traits, and a low correlation is expected. A relatively high correlation, however, would show that the two draw on similar resources, such as vocabulary knowledge and world knowledge. As Palmer and Groot maintain, MTMM designs can be shown in a matrix. The example above is laid out as a matrix in Table 1.
As can be observed, the two traits (grammar and vocabulary) and the two methods (multiple choice and fill-in-the-blanks) are shown in the matrix. Correlational analysis can provide evidence for the convergent and discriminant validity of the tests. High correlations between test #1 and test #2 will provide evidence for the convergent validity of the grammar tests. By the same token, evidence of convergent validity for the vocabulary tests can be found in high correlations between test #3 and test #4. On the other hand, low correlations between test #1 and test #3, and between test #2 and test #4, speak to the discriminant validity of the tests.
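The logic of the matrix can be illustrated with simulated scores. In the following Python sketch, everything is hypothetical: the data are randomly generated, the effect sizes are arbitrary, and the assignment of methods to test numbers (tests #1 and #3 as multiple choice, tests #2 and #4 as fill-in-the-blanks) is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Hypothetical latent traits and method effects (all simulated).
grammar = rng.normal(size=n)
vocabulary = rng.normal(size=n)
multiple_choice = rng.normal(size=n)
fill_in = rng.normal(size=n)

def observed(trait, method):
    """Observed score = trait + a smaller method effect + random error."""
    return trait + 0.3 * method + 0.5 * rng.normal(size=n)

tests = np.column_stack([
    observed(grammar, multiple_choice),     # test #1 (assumed layout)
    observed(grammar, fill_in),             # test #2
    observed(vocabulary, multiple_choice),  # test #3
    observed(vocabulary, fill_in),          # test #4
])

# 4 x 4 correlation matrix: the MTMM matrix for this toy design.
r = np.corrcoef(tests, rowvar=False)

# Same trait, different methods: expected to be high.
convergent = {"grammar (#1, #2)": r[0, 1], "vocabulary (#3, #4)": r[2, 3]}
# Different traits, same method: expected to be low.
discriminant = {"#1 vs #3": r[0, 2], "#2 vs #4": r[1, 3]}

for label, value in {**convergent, **discriminant}.items():
    print(f"{label}: {value:.2f}")
```

In this toy design the same-trait correlations are clearly larger than the same-method, different-trait correlations, which is the pattern that would count as evidence of convergent and discriminant validity.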

The Role of Background Knowledge
Zumbo (1999), among others, holds that a construct validation study needs to take construct-irrelevant factors into account. Detecting differential item functioning (DIF) is one way of doing so.
Zumbo (ibid.) maintains that "DIF occurs when examinees from different groups show differing probabilities of success on (or endorsing) the item after matching on the underlying ability that the item is intended to measure" (p. 12).

The Research Question
The research question addressed in the current study is as follows: Do the test items in the 'Reading Comprehension' sections of the UTEPT distinctly measure various sub-skills?

Methodology

Participants
The participants in the present study were 3,398 testees chosen from the total population of 8,696 testees who took the UTEPT in February 2007. Outliers were discarded. The participants majored in different fields of study, including physics, chemistry, theology, etc. As for the number of participants needed in factor analysis studies, different scholars hold different views. For example, Kline (1994) suggests that the number of subjects should be twice the number of variables. Henson and Roberts (2006) maintain that "It is not uncommon to find rules of thumb in the factor analytic literature; it is less common, though, to find consistency in recommendations" (p. 402). They further refer to other scholars' recommendations. For example, they mention Stevens (1996) as suggesting that "the number of participants per variable is a more appropriate way to determine sample size (ranging from 5 to 20 participants per variable). Fewer participants are needed when component saturation is high" (p. 402).
The maxim "the larger the better" is a familiar one. The rule of thumb adopted here, however, is that there should be twice as many subjects as variables. In the current study, there were 100 variables, corresponding to the number of test items. With 3,398 participants, or roughly 34 per variable, the sample far exceeded this criterion. Sample size can play a crucial role. It is even said that with larger samples, the distinction between the various factor analytic techniques becomes insignificant (see Kline, 1994).
In SPSS, there is a convenient option for checking whether the sample is big enough: the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy. The sample is adequate if the value of KMO is greater than 0.5. The KMO value for the total test was .753, which is greater than .5, so the sample is adequate.
Variance is a crucial factor in test validation, primarily because it affects reliability: the higher the variance, the higher the reliability. There is a substantial body of evidence for the importance of reliability as a prerequisite for validity. Kline (1994) maintains that heterogeneity is a crucial issue in factor analysis. This assumption was met because the variance of this test was 39.873.
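For readers without access to SPSS, the overall KMO index can be approximated from its usual definition: the sum of squared off-diagonal correlations divided by that sum plus the sum of squared off-diagonal partial correlations. The Python sketch below follows that definition; the function name and the simulated data are assumptions made here, and the output is unrelated to the .753 reported for the UTEPT.

```python
import numpy as np

def kmo_overall(data):
    """Overall Kaiser-Meyer-Olkin measure for a cases-by-variables array."""
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    # Partial correlations, obtained from the inverse correlation matrix.
    scale = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(scale, scale)
    # Only off-diagonal elements enter the index.
    off_diagonal = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.sum(corr[off_diagonal] ** 2)
    p2 = np.sum(partial[off_diagonal] ** 2)
    return r2 / (r2 + p2)

# Simulated data (hypothetical): six variables sharing one common factor.
rng = np.random.default_rng(2)
common = rng.normal(size=(500, 1))
data = common + rng.normal(size=(500, 6))

kmo_value = kmo_overall(data)
print(f"KMO = {kmo_value:.3f}")
print("adequate" if kmo_value > 0.5 else "not adequate")
```

Values close to 1 indicate that partial correlations are small relative to the raw correlations, so the variables share enough common variance for factoring to be worthwhile.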
Taking into account the criteria in the literature, it becomes obvious that the current study meets the benchmarks as set by factor analysts in the social sciences.

The UTEPT
The test consists of 100 items. The three sections of the test are grammar, vocabulary, and reading comprehension. The grammar section has 35 items. The first 20 items are multiple-choice completion items. The next 15 items are error identification items; 10 further items (items 36 to 45) deal with grammar and vocabulary tested in context. The next section deals with vocabulary. This section is divided into two parts: part one has 10 items (items 46 to 55) and part two has 10 items (items 56 to 65). The last section is concerned with reading comprehension. This section has 35 items based on six passages.

Data Analysis
To answer the research question of the study, i.e., "Do the test items in the reading comprehension sections of the UTEPT distinctly measure various sub-skills?", an exploratory factor analysis was performed. This statistical procedure was used to extract factors in the reading comprehension section. The extraction method was Principal Components Analysis (PCA). The justifications are as follows:
1- It is mathematically simple (Kline, 1994). In other words, the algebra and computation of Principal Components Analysis are not complex.
2- The computational methods used make absolutely clear the basis of the assertions that factors account for variance and explain correlations (Kline, 1994).
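To make the procedure concrete, the steps of principal components extraction, varimax rotation, and suppression of loadings below .30 can be sketched in Python. This is a sketch under stated assumptions, not a reproduction of the SPSS analysis: the data are simulated, the function names are introduced here, and the retention rule (eigenvalues greater than 1) is an assumption, since the study does not report which rule was used.

```python
import numpy as np

def pca_loadings(data, n_components=None):
    """Unrotated principal component loadings from the correlation matrix."""
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    if n_components is None:
        # Assumed retention rule: keep components with eigenvalue > 1.
        n_components = int(np.sum(eigvals > 1.0))
    loadings = eigvecs[:, :n_components] * np.sqrt(eigvals[:n_components])
    return loadings, eigvals

def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation of a loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        column_ss = np.sum(rotated ** 2, axis=0)
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3 - rotated @ np.diag(column_ss) / p)
        )
        rotation = u @ vt
        new_criterion = np.sum(s)
        if new_criterion < criterion * (1 + tol):
            break                               # no further improvement
        criterion = new_criterion
    return loadings @ rotation

def suppress(loadings, cutoff=0.30):
    """Blank out loadings whose absolute value is below the cutoff."""
    return np.where(np.abs(loadings) >= cutoff, loadings, np.nan)

# Simulated data (hypothetical): seven items, two unrelated sub-skills.
rng = np.random.default_rng(3)
skills = rng.normal(size=(1000, 2))
pattern = np.array([[1, 0]] * 4 + [[0, 1]] * 3, dtype=float)
data = skills @ pattern.T + rng.normal(size=(1000, 7))

unrotated, eigvals = pca_loadings(data)
rotated = varimax(unrotated)
suppressed = suppress(rotated)
n_kept = unrotated.shape[1]
percent_variance = 100 * eigvals[:n_kept].sum() / data.shape[1]

print(np.round(suppressed, 2))
print(f"{n_kept} components, {percent_variance:.2f}% of total variance")
```

Because the rotation is orthogonal, each item's communality (the sum of its squared loadings) is the same before and after rotation; only the distribution of loadings across components changes.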

Results and Discussion
To answer the research question, the researcher operated on the assumption that reading is a trait which consists of sub-abilities (Hughes, 1989) and expected that factor analysis would yield some sub-abilities.
Exploratory factor analysis (Principal Components Analysis) yielded 11 factors. Loadings below .30 were suppressed. What follows is a list of the factors and their interpretations. Together, the 11 components accounted for 41.96 percent of the total variance. Table 2 shows the extracted factors and their loadings.

Factor One
Items 72, 83, 86, 90 and 94 loaded on factor one, which is a vocabulary factor; all five items are vocabulary items tested within the reading passages.

Factor Two
Items 81, 88, 89, 94 and 100 loaded on this factor. It is mostly a main idea factor. Item 94 does not seem to belong here. The reason is that the item is not factor pure and shares variance with factor one.

Factor Three
Items 71, 87, 95, 96, 97, and 99 loaded on this factor. Items 71, 87, 96 and 99 have the lowest factor loadings. Item 97 has the highest factor loading. Finally, item 95 has a low factor loading. Item 71 is not factor pure; it also loads on factor 5 and will be elaborated on later. Item 87 is a factor-pure item, but it does not have a high factor loading. Item 95 is not factor pure and shares variance with factor 7, but at the same time it has a moderate factor loading. Turning to item 96, it has a low factor loading and shares variance with factor 7, as item 95 does. The next item to be discussed is item 97, which has the largest factor loading of all the items in this study. The last item under this factor is item 99. It has a low factor loading and is not factor pure.
One might refer to this factor as one related to inference. A few points need to be made about the factor. First and foremost, items 99 and 71 have loaded on this factor. This is surprising because they are vocabulary items, and our expectation was that they would load on the first extracted factor. The second point pertains to item 97. This item has, as mentioned before, the largest factor loading of all the items collected under the factor. It has one peculiar characteristic: it taps topic identification, which is an exercise in inferencing.

Factor Four
This factor consists of items 66, 67, 68, and 69. Factor loadings are relatively high: .589, .582, .388, and .451, respectively. They are directly stated question items. All four items are based on a single passage. These items are easy ones, a claim confirmed by the performance of the testees on them. The facility values for these items are .61, .76.5, .61, and .33, respectively. Except for item 69, the items are relatively easy.
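A facility value is simply the proportion of test takers who answer an item correctly. The following Python sketch shows the computation on a small, made-up response matrix; the function name and the data are hypothetical and do not come from the UTEPT.

```python
import numpy as np

def facility_values(responses):
    """Proportion of correct answers per item.

    responses: cases-by-items array scored 1 (correct) or 0 (incorrect).
    """
    return np.asarray(responses, dtype=float).mean(axis=0)

# Made-up responses: four test takers (rows) by three items (columns).
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
]

values = facility_values(responses)
print(values)
```

The closer a value is to 1, the easier the item was for the group tested.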

Factor Five
Items 71, 79, 85, 93, and 98 loaded on this factor. Item 71 is not factor pure and loads on factor 3 as much as it does on this factor. Item 79 is not factor pure either and loads more on factor 11 than it does on this particular factor. Item 85 has a relatively high factor loading and is factor pure. By the same token, item 93 is factor pure and has a factor loading close to that of item 85. Our expectation is that this factor, whatever it is, is going to be related to these two items. The last item is not factor pure and cannot be expected to contribute to this factor.
Item 85 is a vocabulary item. Item 93 is not a vocabulary item; it is more related to reasoning ability than it is to simple vocabulary knowledge.

Factor Six
Items 76, 78, and 79 loaded on this factor. Item 76 is factor pure with a negative factor loading. Item 78 is factor pure with a high factor loading. Lastly, item 79 is not factor pure and also loads on factor 11. So, items 76 and 78 are the ones most likely to help us in naming the factor.
Item 78 has the highest factor loading and is a reference item. Probably, all the items are concerned with word paraphrases.

Factor Seven
Items 74, 92, 95, 96, and 98 loaded on this factor. The items can be analyzed in terms of factor pureness. Item 74 is not factor pure; it also shares variance with factor 8, and it loads more on factor 8 than it does on this factor (i.e., factor seven). So, not much weight can be placed on this item's contribution to the factor. Item 92 has the highest factor loading of all the variables (here, items). It is also a factor-pure item and has made the greatest contribution to the factor. Items 95 and 96 loaded on this factor as they did on factor three. Finally, item 98 loaded on this factor as it did on factor 5. So, emphasis needs to be placed on item 92 in coming up with a name for the factor.
It should come as no surprise that this item has the largest factor loading of all and is also a factor-pure item. The reason is that this item tests a grammatical point in the language; no other item in the section bears any resemblance to it.

Factor Eight
Items 69, 70, 73, 74, and 99 loaded on this factor. Item 69 is not factor pure and also loaded on another factor. Its impurity is evident in the fact that the item is incongruent with the other items in the set of directly stated questions. Next, consider item 70. This item has a large, albeit not the largest, factor loading and is likely to contribute substantially to the factor. Then there is item 73, which has the largest factor loading of all the items and is expected to make a good contribution to the extracted factor. The last two items are not factor pure, which means that they are not expected to be of any help in naming the factor.
Items 70 and 73 have appeared under the same factor for very good reasons. One is that they are both based on the same passage. More important, the items fall somewhere between inference and main idea questions, which place heavy demands on the test taker, and directly stated questions, which are less demanding. So, this factor can be safely called "understanding through paraphrase".

Factor Nine
Items 82, 83, and 84 loaded on this factor. Item 82 is factor pure. Item 83 is not; it also loads on factor one, so it is most probably a vocabulary item. Finally, item 84 is also factor pure and accounts for the most variance. Items 82 and 84 can be scrutinized to see if our prediction about item 83 is borne out.
Turning to our prediction about item 83, it behaved in the way we expected. As for items 82 and 84, both use the word "suggest" in their stems, leading us to conclude that the items tap "drawing conclusions".

Factor Ten
Items 75, 77 and 80 loaded on this factor. Items 75 and 77 are factor pure and are likely to make the greatest contribution to the factor, as opposed to item 80, which does not load on a single factor; it also loads on factor 5.
Items 75 and 77 are based on the same passage. Item 75 asks for a title for the passage. Item 77 indirectly serves the same function. Notice that in both items, the correct answer contains the word "knowledge". As a matter of fact, some manipulation of the items leads us to the conclusion that they have similar traits. In item 75, the key phrase is "knowledge in the higher education". In item 77, we can combine the stem with the correct choice and come up with the same proposition. In other words, "higher education furnishes the graduates with knowledge" is propositionally the same as "knowledge in higher education".

Factor Eleven
Items 79, 81 and 91 loaded on this factor. The first two items are not factor pure, so item 91 is held accountable for explaining the variance.
Item 91 has surprisingly loaded on this factor. This is a point where factor analysis should be combined with logic.

Discussion
There is a problem of overfactoring, because 35 items yielded 11 factors. This is, however, justifiable on the grounds that the reading passages were extremely heterogeneous by nature, meaning that the test constructor opted for an amalgamation of different orientations in language testing. It should be noted that this pertains to methods, not traits, but methods and traits are sometimes indistinguishable (Stevenson, 1981). In other words, if we operationally define the orientations (FCE, IELTS, TOEFL) as methods, then it is entirely possible that the methods or orientations introduced error into the process. As Bachman (1990) puts it, the performance of test takers on the test might be more a matter of engagement with methods than with traits. The factor analysis might be appropriate, but the distinction between traits and methods may become fuzzy.
Another obvious problem is underfactoring, in the sense that some sub-skills were not represented among the extracted factors. According to Messick, this is referred to as construct underrepresentation. For example, topic prediction was never represented in the reading comprehension questions. The UTEPT is a proficiency test, not an achievement test. The point is that a proficiency test should embody the constructs it is meant to measure, but in this study some constructs were underrepresented. Weir (1990) refers to a priori validation, whereby the test maker operates on preconceived assumptions. Put simply, the test maker is equipped with a table of specifications. Weir also talks about an a posteriori approach to validation, of which this study is an example. While the researcher could control the latter, he could not exercise any control over the former. With no table of specifications available to the researcher, it was very difficult to see which sub-skills were intended by the test constructor.
Another point should be made about the factor analysis. Carroll (1983) is of the opinion that each extracted factor should be represented by at least three variables. In this study, this criterion could not be fully met: only four of the 11 factors meet the criterion set by Carroll (1983). Of course, some factors could have had more variables if the researcher had accepted lower factor loadings. This is closely related to the overfactoring problem. It may be that the items differ greatly in their underlying traits.
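One way to check the three-variable criterion is to count, for each factor, the items that load on that factor alone. The Python sketch below does this using the item lists as printed in the factor-by-factor account above. Two assumptions are made here and are not stated in the study: that the criterion is applied to factor-pure items only, and that an item is factor pure when it is listed under exactly one factor (the lists are taken as printed, even where the prose mentions a further cross-loading, as with item 80).

```python
from collections import Counter

# Item lists per factor, transcribed from the account of each factor.
factor_items = {
    1: [72, 83, 86, 90, 94],
    2: [81, 88, 89, 94, 100],
    3: [71, 87, 95, 96, 97, 99],
    4: [66, 67, 68, 69],
    5: [71, 79, 85, 93, 98],
    6: [76, 78, 79],
    7: [74, 92, 95, 96, 98],
    8: [69, 70, 73, 74, 99],
    9: [82, 83, 84],
    10: [75, 77, 80],
    11: [79, 81, 91],
}

MIN_VARIABLES = 3  # minimum attributed to Carroll (1983) in the text

# Assumption: an item is factor pure if it is listed under one factor only.
occurrences = Counter(
    item for items in factor_items.values() for item in items
)
pure_counts = {
    factor: sum(1 for item in items if occurrences[item] == 1)
    for factor, items in factor_items.items()
}
meeting = [f for f, count in pure_counts.items() if count >= MIN_VARIABLES]

print("factor-pure items per factor:", pure_counts)
print("factors meeting the criterion:", meeting)
```

Under these assumptions, factors 1, 2, 4, and 10 qualify, which agrees with the count of four reported above, although the study does not say that the criterion was applied in this way.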

Conclusions
It can be concluded that factor analysis proved to be a robust tool in the investigation of construct validity. Principal Components Analysis (PCA) could easily delineate factors in the section. Eleven factors were extracted from 35 reading comprehension items. The results of the study should be treated with circumspection. First and foremost, there was the issue of overfactoring. It can be justified on the grounds that the test developer used different orientations in language testing: passages from IELTS, TOEFL and FCE were included. The passages are a melting pot of various approaches to language testing. Factor analysis is supposed to delineate underlying traits, not the methods employed. Apparently, the same traits were tested using different methods. The study calls for research on the comparability of TOEFL and IELTS. The performance of testees diverged across the two orientations. It remains to be seen whether the two orientations yield the same results; in this study, they did not. Second, the truncated sample could have altered the results.

Table 1. An Example of an MTMM Design

Table 2. Extracted Factors on Reading Comprehension Items by Principal Components Analysis