An Investigation into the Consequential Validity of a Diagnostic College English Speaking Test

This paper reports on the verification of the consequential validity of a Diagnostic College English Speaking Test (DCEST). A case study was conducted with 28 sophomore students from a national key university in China, who took part in seven sets of DCEST tests. The analysis of the DCEST scores of the students in the experiment group indicates that they made progress in their oral English proficiency over the two-month period. The survey data analysis reveals that the provision of diagnostic feedback is welcomed by a great majority of students, who think that the diagnostic feedback of the DCEST can reflect the strengths and weaknesses of their oral English ability. Results of both quantitative and qualitative data analyses provide supporting evidence for the consequential validity of the DCEST. Finally, the limitations and future research directions are discussed.


Introduction
Validity is defined by Messick (1989: 13) as "an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment". The concept of consequential validity was put forward by Messick (1996) as one aspect of construct validity. Messick (1996: 249) argues that "the consequential aspect appraises the value implications of score interpretation as a basis for action as well as the actual and potential consequences of test use, especially in regard to sources of invalidity related to issues of bias, fairness and distributive justice (Messick 1980; 1989), as well as to washback".
Validation is considered an essential component of language test development, for it can examine whether the test has achieved its intended purposes. Messick (1996) suggests that evidence and rationales for evaluating the intended and unintended consequences of score interpretation and use in both the short and long term, especially those associated with positive or negative washback effects on teaching and learning, should be collected to support the consequential aspect of construct validity. Furthermore, Weir (2005: 210-215) suggests that consequential validity can be considered from three perspectives: differential validity, washback, and effect on society. Differential validity deals with the construct under-representation or construct-irrelevant components of test scores that differentially affect the performance of different groups of test takers (American Educational Research Association et al. 1999). Washback examines the impact of tests on teaching and learning in a variety of settings. Effect on society refers to the effect of high-stakes tests on a wider community. Consequential validity has attracted more and more attention from designers of high-stakes tests and from English teaching in China in recent years (Gong 2012; Jin 2000, 2004; Yang and Gui 2007; Zhao 2010; Zhao and Fan 2012). However, fewer studies have been conducted to investigate the consequential validity of formative assessments in China, whose impact on English teaching and learning should not be neglected. The present study focuses on exploring the consequential validity of a Diagnostic College English Speaking Test (DCEST) in terms of its impact on oral English teaching and learning in real educational settings, as the DCEST is a formative assessment designed to diagnose students' oral English proficiency.
The specific questions addressed by the study are as follows:
1) What is the impact of DCEST on students' oral English proficiency?
2) What do students think of the test and the usefulness of the test feedback?
3) What kind of impact will the feedback exert on students' oral English learning?

Method
The Diagnostic College English Speaking Test (DCEST) is designed as a 15-minute face-to-face interview test for the purpose of identifying the strengths and weaknesses of students' English speaking ability at the tertiary level in China. It involves three task types: reading aloud, individual presentation and information gap (Zhao 2011). A checklist is designed for the examiner to record each test-taker's performance with respect to pronunciation, intonation, grammatical accuracy, grammatical complexity, vocabulary accuracy, vocabulary range, fluency, communicative strategy, coherence, and discourse size. In addition to a five-level composite grade reported to each test-taker, individualized feedback is provided detailing students' strengths and weaknesses.
As part of the a posteriori validation of the DCEST, a case study was conducted with 28 sophomore students from a national key university in China, who took part in seven sets of DCEST tests from April to June 2008. Apart from the student participants, one college English teacher and two doctoral students of applied linguistics were invited to help with data collection and analysis. A variety of instruments were employed to obtain various types of information for examining the consequential validity of the DCEST (see Table 1). To explore the impact of diagnostic feedback on students' oral English learning, both the control group and the experiment group took the same tests (DCEST 1 and DCEST 7) at the beginning and the end of the main study, and the experiment group took another five tests (DCEST 2 to DCEST 6) during the two-month experiment period. Students' evaluation of the usefulness of the DCEST feedback was gathered through the student questionnaire of feedback evaluation (SQFE) survey and through the student and teacher interviews at the end of the main study. The SQFE was composed of 14 questions, which were divided into two sections: an evaluation of the overall usefulness of the feedback and an evaluation of each parameter used to report the profile scores. All the questions used a five-point Likert scale. Following the SQFE survey, the researcher conducted face-to-face interviews with eight students in the experiment group and their college English teacher. Altogether, the raw data collected in the present study were 7 sets of test scores from the DCEST, 28 questionnaires and 9 interviews.

What is the Impact of DCEST on Students' Oral English Proficiency?
To explore the impact of diagnostic feedback on students' oral English learning, both the control group and the experiment group took the same tests (DCEST 1 and DCEST 7) at the beginning and the end of the main study, and the experiment group took another five tests (DCEST 2 to DCEST 6) during the two-month experiment period.
DCEST 1 scores of the control group (CG1) and those of the experiment group (EG1) were compared using the Independent Samples t-test. The result indicated that there was no significant mean difference between the two groups (p=0.203) (see Tables 2 and 3). However, the Independent Samples t-test of DCEST 7 scores of the two groups (CG2 and EG2) showed that there was a significant mean difference between them (p<0.001) (also see Tables 2 and 3). Further comparison of DCEST 1 and DCEST 7 scores of the experiment group (EG1 and EG2) also confirmed a significant mean difference between the two scores (p=0.006) (see Table 4).
In contrast, the Paired Samples t-test of the control group's DCEST 1 and DCEST 7 scores did not show a significant mean difference (p=0.325) (also see Table 4), indicating that the control group made little progress in their oral English proficiency over the two-month period of the main study. During this period the control group did not take any oral English test except the pre- and post-tests of the DCEST, nor did they receive any feedback from their English teacher about their strengths and weaknesses in oral English communication. Furthermore, the 10 analytic scores of the experiment group on DCEST 1 to DCEST 7 were compared to see in which aspects students had made more progress. Table 5 indicated that the experiment group students made progress in all the aspects of oral English concerned in the study, with the progress in the following three aspects being most substantial: pronunciation (improved by 0.86), intonation (improved by 0.57), and coherence (improved by 0.43). In addition to the comparison between the control and the experiment groups, a close look at the descriptive statistics of the seven total scores of the experiment group revealed that the mean score of each test was slightly higher than that of the previous one, with the only exception of DCEST 6 (see Table 6).
In other words, students in the experiment group were making generally steady progress in their oral English performance on the DCEST tests. DCEST 6 had a mean 0.28 points lower than that of DCEST 5. This could be explained by the fact that the students were somewhat distracted by their final exams, which were administered in late June when they took DCEST 6. A closer examination of the changes in students' test scores in the main study revealed that the three students of the advanced proficiency group exhibited steady progress over time. Two students from the intermediate proficiency group (Students A and B), however, made the most dramatic progress in their performance. Students in the lower-intermediate group showed changes in their scores in both upward and downward directions.
In sum, the analyses of the test scores suggested that the experiment group students on average improved their oral English proficiency, whereas the control group students showed little progress. This progress could be attributed to the constant provision of diagnostic feedback after each test session, which guided the students to improve their oral English in the right direction. However, it seemed too early to claim that such improvement was strong evidence of positive effects of the diagnostic feedback on students' learning. It could also be due to students' self-learning during the eight-week period. Therefore, the SQFE survey and the interview data were analyzed for further evidence of the usefulness of the diagnostic feedback provided.
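The two kinds of comparison used in this section (between groups on one test, and within a group across two tests) can be sketched as follows. This is a minimal illustration using only the Python standard library; the scores are hypothetical and are not the study's data, the function names are invented for the example, and the independent-samples version assumes equal variances, which the paper does not state. Only the t statistic and degrees of freedom are computed; obtaining a p-value would additionally require a t-distribution table or a statistics package.

```python
import math
import statistics


def independent_t(a, b):
    """Student's t for two independent samples (assumes equal variances)."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se, na + nb - 2


def paired_t(pre, post):
    """Paired-samples t on per-student differences (post minus pre)."""
    diffs = [y - x for x, y in zip(pre, post)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se, len(diffs) - 1


# Hypothetical total scores, NOT the study's data.
cg_pre = [34, 36, 38, 35, 37]   # control group, first test
eg_pre = [35, 36, 37, 36, 38]   # experiment group, first test
eg_post = [38, 39, 39, 40, 41]  # experiment group, last test

t_between, df_between = independent_t(cg_pre, eg_pre)
t_within, df_within = paired_t(eg_pre, eg_post)
print(f"first test, CG vs EG: t({df_between}) = {t_between:.2f}")
print(f"EG, first vs last:    t({df_within}) = {t_within:.2f}")
```

The paired version works on each student's own difference score, which is why it suits the within-group comparison of DCEST 1 and DCEST 7, whereas the independent version compares two separate sets of students.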

What do Students Think of the Test and the Usefulness of the Test Feedback?
The assumption was that students from the experiment group would have a good understanding of the usefulness of the feedback and make positive comments on the DCEST tests and the accompanying detailed feedback reports. The descriptive data of the experiment group students' evaluation of the feedback were summarized in Table 8.
Note: a=very helpful, b=quite helpful, c=so-so, d=not quite helpful, e=no help at all; a=5, b=4, c=3, d=2, e=1.
Questions 1-4 inquired about the overall usefulness of the feedback from several aspects. Question 1 was about the extent to which the feedback report can reflect students' general oral English proficiency. The majority of the students (85.7%) agreed that the feedback report on the whole can be a valid indicator of their oral English proficiency. Responses to Question 2 indicated that a great majority of the students (92.9%) thought that the feedback report can accurately describe the strengths of their oral English ability. With regard to Question 3, 85.7% of the students thought that the feedback report can provide useful diagnostic information on their weaknesses in oral English communication. Answers to Question 4 showed that the majority of the students (78.6%) agreed on the usefulness of the diagnostic feedback for their oral English learning.
Questions 5-14 focused on the evaluation of each feedback parameter. The means of the questions revealed that feedback on vocabulary accuracy was considered the most useful, followed by fluency and use of communicative strategies. Feedback on intonation was perceived as the least useful by the experiment group (see Table 9). However, the analysis of the experiment group students' seven DCEST scores showed a different picture (see Table 5). The test results showed that the experiment group students made the greatest progress in pronunciation, intonation, and coherence, which were quite different from the aspects they evaluated as the most useful. The discrepancy may be explained by the fact that it might take students longer to make improvements in those aspects they considered most useful.
Note: 1=not useful at all, 2=not quite useful, 3=so-so, 4=quite useful, 5=very useful.
Furthermore, the comparison of the three proficiency groups' evaluation of the usefulness of the 10 feedback parameters indicated that the intermediate group commented most favorably on the usefulness of the feedback, whereas the lower-intermediate group commented least favorably (see Table 10). The evaluations of the usefulness of each feedback parameter by the three proficiency subgroups were also investigated. Table 11 indicated that students of the lower-intermediate group regarded the feedback on using communicative strategies as the most useful. Students at the intermediate level also reported that the feedback on using communicative strategies was the most useful, and considered the feedback on vocabulary range the least useful, whereas students at the advanced level thought that the feedback on vocabulary accuracy was the most useful and the feedback on intonation was the least useful.
This section explored the effect of the feedback on students' test performance and students' evaluation of the usefulness of the feedback. For a better understanding of how the feedback would be used by students in oral English learning, the next section analyzes the Student Interview and Teacher Interview data.
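The way letter responses map onto the five-point coding and onto an agreement percentage can be illustrated with a short sketch. The coding (a=5 down to e=1) follows the note to Table 8; everything else is an assumption made for illustration: the answers are hypothetical rather than the study's data, the function name is invented, and treating options a and b together as "agreement" is one plausible reading that the paper does not spell out.

```python
from collections import Counter

# Coding taken from the note to Table 8: a=5 (very helpful) ... e=1 (no help at all).
LIKERT = {"a": 5, "b": 4, "c": 3, "d": 2, "e": 1}


def summarize(responses):
    """Return (mean coded score, percentage of respondents choosing a or b)."""
    coded = [LIKERT[r] for r in responses]
    # Assumption: a coded score of 4 or 5 counts as a favourable answer.
    favourable = sum(1 for score in coded if score >= 4)
    return sum(coded) / len(coded), 100 * favourable / len(coded)


# Hypothetical answers to a single question, NOT the study's data.
answers = list("aaaabbbbbbbbcd")
mean_score, pct_favourable = summarize(answers)
print(Counter(answers))
print(f"mean = {mean_score:.2f}, favourable = {pct_favourable:.1f}%")
```

Reporting the mean alongside the percentage is useful because two questions with the same share of favourable answers can still differ in how many respondents chose the top option.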

What Kind of Impact will the Feedback Exert on Students' Oral English Learning?
The interview data were transcribed and then subjected to a qualitative analysis through a hermeneutic process of reading, analyzing and re-reading. When asked about the usefulness and the impact of feedback on their oral English learning, some students commented that the feedback could raise their awareness of the linguistic problems in oral English communication. The following is one illustrative interview excerpt:
I think the feedback is quite useful; at least it makes me aware of my weaknesses in oral English ability… After knowing my problems, I pay special attention to them in learning. (Student B)
Some students thought that the feedback enabled them to make an overall evaluation of their oral English proficiency:
The feedback enables me to make a systematic and all-round evaluation of my oral English proficiency. (Student D)
The feedback is quite helpful. The score profile and feedback descriptors show me an objective and accurate picture of my oral English proficiency on both macro and micro levels. Then in my daily oral English learning, I will pay special attention to these problems, and in this way, I think I can make greater progress. (Student H)
Some students considered the feedback on some aspects more useful to their learning than the feedback on others. The following interview excerpts illustrate this point:
I was able to improve my grammatical and vocabulary accuracy, and I also paid more attention to fluency of my speech. I used to use a lot of fillers such as 'er', 'um', etc., now I am using them less frequently. (Student A)
I paid a lot of attention to the criterion of accuracy of pronunciation, then I made fewer mistakes, and I thought I made some progress in this aspect. (Student D)
However, comments on intonation were not as positive as those on accuracy of vocabulary and grammatical structure. The following excerpt gives some idea of why:
I don't think that I can make progress in intonation within a short period of time, and I spent little time practicing it. (Student B)
Furthermore, some students thought that the feedback not only diagnosed their oral English proficiency but also showed them the way forward in their oral English learning:
The feedback provides macro-level diagnostic information; it shows us the directions to move forward. For example, if the feedback tells you that your major problems lie in grammar and vocabulary, then you will have a clear learning objective. (Student H)
The feedback has raised my awareness of the importance of oral English learning and influenced positively my oral English learning methods. If I had no chance to communicate in English as I did in the DCEST, then I would not have known my problems. Feedback from the DCEST tests enables me to realize my weaknesses and then I know how to improve my oral English. (Student G)
In addition to these positive comments, concerns raised by several students are also worth mentioning and discussing. One student pointed out that the distinctions between some levels of the rating scale were too subtle:
Generally speaking, I think the feedback can reflect the strengths and weaknesses of my oral English proficiency. But I think distinctions between Level 3 and Level 4 and between Level 4 and Level 5 are too subtle. I hope that more information could be provided to distinguish these adjacent levels. (Student E)
Though efforts have been made to make the feedback descriptors as accurate as possible, there is room for improvement. One possible method is to refine the descriptors on the basis of students' actual test performances. This was pointed out as a recommendation in the final chapter. Some students also expressed their hope to be provided with specific guidance on the appropriate types of actions they need to take, in addition to the diagnosis of their difficulties and problems. One student commented:
The feedback reveals the problems of my oral English proficiency. But more importantly, I would like to have more information on how to overcome the problems and make improvements in these aspects. (Student F)
In sum, the above quantitative and qualitative data analyses indicated that a great majority of the students welcomed the diagnostic feedback report and agreed that it can provide accurate information on their weaknesses and strengths in oral English proficiency. Students' perceptions of each criterion and the accompanying feedback descriptor indicated that diagnostic information focusing on lexical-grammatical knowledge (such as vocabulary and grammatical accuracy), use of communicative strategies and coherence was considered more useful than feedback on other aspects.
Since the main study was conducted in the middle of a term, the college English teacher for the experiment and control group was not able to participate in the study to evaluate the impact of the feedback on oral English teaching due to the conflict of teaching plans.The researcher therefore invited the teacher to observe the test-taking process of the experiment group, and had a brief interview with the teacher for his comments and suggestions.
The teacher agreed that the diagnostic feedback would be useful for students to know about their strengths and weaknesses in oral English, but he pointed out that the effectiveness is largely dependent on how students would make use of it.The following interview excerpt illustrated this point.

The usefulness of the feedback depends on how students will use it in their oral English learning. (English teacher)
As for the impact of feedback on teaching, the teacher argued that the usefulness of the feedback for oral English teaching would depend on the teacher's pedagogical approach, the purpose of learning and the context of learning, as the following interview excerpt indicates:
The feedback may exert little impact on oral English teaching, because the speaking activities in classroom are very limited, and it is impossible to take all students' needs into account in oral English teaching. (English teacher)
The teacher's interview data implied that the effectiveness of the feedback may depend largely on the degree to which it is compatible with the teacher's teaching plan and students' learning attitude.

Discussion
The results of both quantitative and qualitative data analyses indicated that students who took the DCEST tests over a period of eight weeks and received feedback regularly made clear progress in their oral English proficiency. The survey data analysis revealed that the provision of diagnostic feedback was welcomed by a great majority of students, and they thought that the diagnostic feedback of the DCEST could reflect the strengths and weaknesses of their oral English ability. In all, the results of the DCEST score, SQFE, student interview (SI) and teacher interview (TI) analyses provided supporting evidence for the consequential validity of the DCEST.
As with any scholarly investigation, this study has its share of limitations. First, due to limited human resources and time constraints, the study was conducted with a small sample (N=28) over a two-month period. Considering the relatively small sample size, further studies with larger samples are necessary to generalize the findings beyond the participants in this study. Another limitation of the research is that the teacher's participation was restricted due to the conflict of teaching plans. Since the study began in the middle of a term, the college English teacher of the student participants was not able to participate in the study and use the test feedback in his oral English teaching. Hence, this study focused only on investigating students' evaluation of the usefulness of the feedback and the impact of feedback on students' oral English learning, without giving much attention to the impact of feedback on oral English teaching.
It is hoped that future research will be conducted on a larger sample, with students from a variety of majors and universities who could better represent Chinese undergraduates, for the purpose of confirming the generalizability of the results of the present study. In addition to student participants, college English teachers should also be invited to participate in future research to investigate the impact of feedback on oral English teaching over a period of time.

Table 2. Means and SDs of CG1 and EG1, CG2 and EG2

Table 5. Analytic mean scores of the experiment group in DCEST 1 to 7

Table 6. Means and SDs of seven DCEST test scores of the experiment group
To facilitate the analysis of the experiment group's improvement in oral English proficiency, the researcher calculated the mean of the seven DCEST total scores for each student in the experiment group, and categorized the students with a mean score below 35 as the lower-intermediate subgroup, those with a mean score between 35 and 40 as intermediate, and those with a mean score above 40 as advanced. The mean scores of the three subgroups across the seven tests were 34, 38 and 42.4 respectively (see Table 7).
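The subgroup rule described above can be written out as a small sketch. The thresholds (below 35, between 35 and 40, above 40) come from the text; the score series are hypothetical and are not the study's data, the function and student labels are invented for the example, and placing means of exactly 35 or exactly 40 in the intermediate subgroup is an assumption, since the paper does not say how boundary values were handled.

```python
from statistics import mean


def subgroup(scores):
    """Classify one student by the mean of their DCEST total scores."""
    avg = mean(scores)
    if avg < 35:
        return "lower-intermediate"
    if avg <= 40:  # assumption: boundary values 35 and 40 count as intermediate
        return "intermediate"
    return "advanced"


# Hypothetical seven-test score series, NOT the study's data.
students = {
    "S1": [32, 33, 33, 34, 34, 33, 35],
    "S2": [36, 37, 38, 38, 39, 38, 40],
    "S3": [41, 42, 42, 43, 43, 42, 44],
}
labels = {name: subgroup(scores) for name, scores in students.items()}
print(labels)
```

Because the grouping is based on the mean of all seven tests rather than on the first test alone, a student's later gains influence which subgroup they fall into, which is worth bearing in mind when comparing progress across subgroups.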

Table 7. Three proficiency subgroups in the experiment group

Table 8. Descriptive statistics of responses to SQFE

Table 9. Means and SDs of the usefulness of analytic feedback parameters

Table 10. Descriptive statistics of the three proficiency subgroups' overall evaluation of the usefulness of feedback

Table 11. The three proficiency subgroups' evaluation of the usefulness of each feedback parameter