Development and Validation of a Diagnostic Rating Scale for EFL Writing in China

Diagnostic assessment of EFL writing ability is useful yet seldom adopted for Chinese EFL students. In response to this need, this study designs and validates a diagnostic rating scale for EFL writing in China. The rating scale is adapted from China's Standards of English Language Ability (CSE) for an argumentative writing assignment given to College English III students at a key university in Eastern China. To collect validation data, four raters scored 67 compositions with the rating scale, and a multi-facet Rasch analysis was employed to investigate the scale's validity. Three facets, examinee, rater, and criteria, basically accord with the ideal requirements. A comparison of our validated rating scale with writing rating scales designed in other contexts demonstrates the importance of designing rating scales for a specific context. Additionally, our context-specific, CSE-based rating scale further corroborates the versatility of the CSE. This study provides a meaningful examination of the appropriate form of a rating scale for diagnostic assessment in China.


Introduction
As a major form of language assessment, diagnostic assessment has received increasing attention in the language assessment literature. Following this trend, this study designs a diagnostic rating scale for English writing ability based on China's Standards of English Language Ability (CSE) and investigates its validity through a multi-facet Rasch model analysis of the results of an argumentative writing task. Set in the context of a writing test for Chinese undergraduate EFL students at a key university in Eastern China, this study illustrates the importance of designing rating scales for a specific context and provides Chinese EFL teachers with a validated diagnostic rating scale for writing ability.
This paper first reviews previous literature on diagnostic assessment of writing ability, with particular attention to rating scales, theoretical frameworks of writing development, and the validation of rating scales. It then presents the research design in detail, followed by the analysis and discussion of the results and a statement of the findings. Implications and limitations are addressed in the final section.

Diagnostic Assessment for Writing Ability
Written language is one of the basic means of communication and verbal expression, enabling people from different cultures and backgrounds to participate in all aspects of today's global society. With the rapid development of technology and communication around the world, writing in a second or foreign language has become an important skill. Improving the ability to write well is therefore a need for all students in academic and second or foreign language courses, not only to generate new information but also to impart knowledge (Lee & Sawaki, 2009).
To write effectively, writers are expected to master a series of abilities and strategies, including coherence and logical construction; lacking these skills impedes writing quality. Timely feedback is thus necessary for future improvement, and the last few decades have witnessed growing interest in L2 writing feedback. However, accurate diagnostic information about the strengths and weaknesses of L2 learners' writing is hard to obtain from current feedback systems (Llosa et al., 2011).
Diagnostic assessment, as an alternative to existing feedback methods, has recently received more attention from educational experts because it can provide diagnostic information about students' strengths and weaknesses. Diagnostic assessment is defined as using tests "for the purpose of discovering a learner's specific strengths or weaknesses" to inform "decisions on future training, learning, or teaching" (ALTE, 1998), and it has been found effective in identifying students' problems and suggesting possible solutions because it integrates assessment and instruction (Pellegrino & Chudowsky, 2003).
Unfortunately, diagnostic assessment of EFL writing ability is seldom adopted for Chinese EFL students. As Ross (2008) and Matoush (2012) pointed out, teachers and researchers are concerned that traditional high-stakes, large-scale proficiency tests dominate Chinese EFL teaching and testing, while individualized learning needs have been largely neglected. Moreover, very few studies have explored diagnostic assessment of EFL writing ability in the Chinese context, so the need for practical studies is apparent. In response, this research focuses on the rating scale of a diagnostic assessment in the context of a writing test for Chinese undergraduate EFL students.

Rating Scales for Diagnostic Writing Assessments
As a crucial part of the instrumentation, the rating scale is a major concern of diagnostic assessment (Lee, 2015). There is a collective realization that rating scales designed for conventional tests are not suitable for diagnostic assessment (Alderson, 2005, 2015; Knoch, 2011; Upshur & Turner, 1995). For example, Weigle (2002) demonstrated that the analytic feedback given by raters in conventional tests is likely to be imprecise due to the "halo effect", that is, the interference of the overall impression with judgments of single aspects of the writing script. As a result, rating scales designed specifically for diagnostic assessment are necessary to accomplish its special purpose. Weigle (2002) proposed several questions to guide rating scale development, which can be summarized as "What type of rating scale is desired?", "Who is going to use the rating scale?", "What criteria should be used as the basis for the rating?", "What will the descriptors look like and how many scoring levels will be used?", and "How will scores be reported?". Given the context of diagnostic assessment, the first two questions are automatically answered (the scale should be analytic and assessor-oriented). The third question is elaborated in the following section, while the fourth and fifth are briefly addressed in the Methodology section. As further clarification, Knoch (2011) pointed out that the factors specified in these questions must be weighed as a whole: the format (descriptors and scoring levels) and theoretical orientation (criteria) of the rating scale must suit the context of use (e.g., the characteristics of the test takers and assessors).
To illustrate this point, when Bruce and Hamp-Lyons (2015) designed an assessment tool for a new cohort of Hong Kong Diploma of Secondary Education Examination students at City University of Hong Kong, they found opposing tensions between local and international rating scales, arguing that aligning scales based on local English for Academic Purposes needs with those of the CEFR or IELTS would cause trouble. Seen in this light, although several standards for writing assessment are available, such as the ACTFL guidelines and the CEFR, the contexts for which they were designed are far removed from that of writing tests for Chinese undergraduate EFL students, and as a result they may not function satisfactorily in the context of our research. Our research therefore has both the theoretical significance of providing a reference for designing future Chinese diagnostic assessment rating scales and the practical value of helping Chinese undergraduate EFL students with their English writing.

Theoretical Frameworks of Writing Development
It has been widely argued that the criteria in a diagnostic rating scale for writing should be based on a theory or model of language or writing development (Behizadeh & Engelhard, 2011; Dolgova & Siczek, 2019; Knoch, 2011; Read & von Randow, 2013). Furthermore, in a project examining how assessors inform their use of assessment in EAP courses at a Canadian university, Doe (2015) chose the Canadian Academic English Language (CAEL) assessment as the framework, explaining that students should be familiar with assessments based on the framework and raters with the framework itself. Therefore, the theoretical model used in our research should not only be comprehensive but also suitable for Chinese undergraduate EFL students and sufficiently familiar to the raters.
With the task-based, topic-based writing test as our research context in mind, we compared several available frameworks of English writing ability. The result was immediately clear. As the recent assessment framework developed by the National Education Examinations Authority, the CSE provides a comprehensive description of English proficiency based on English language education in China. Besides, English teachers and researchers, as assessors, are adequately familiar with rating scales developed from the CSE. Thus, the CSE best reflects our understanding of the model on which our rating scale should be based.
Although no study has explored an assessor-oriented rating scale for diagnostic writing assessment for Chinese students, relevant studies of assessment in China provide important information and cautions for designing a rating scale based on the CSE. For example, in a project designing a self-assessment scale, Pan, Song and Deng (2019) specified that CSE Bands 4−7 align with non-English-major undergraduate students.
Fulcher (2003) offered a provisional model consisting of four factors (rater, rating scale, task, and candidate) that influence a test taker's score, and indicated that the probable interactions between rating scales and the other factors require further research. Although former studies (Knoch, 2009, 2011) support the notion that rating scales should be designed with the specific testing context (e.g., task and candidate) in mind, few studies have explored the extent to which a different context (here, a test for Chinese undergraduate EFL students) influences the diagnostic rating scale designed for it. Through a qualitative comparison of the similarities and differences between our rating scale and rating scales designed in other contexts (e.g., Hawkey, 2001; Kuiken & Vedder, 2017; Madsen, 1983), the influence of context-specific factors on rating scales and the importance of designing rating scales for a specific context can be demonstrated (or disproved).

Validation of Diagnostic Rating Scales
Validation of a rating scale is needed since validity is considered a fundamental concern of testing (Messick, 1989) and is associated with the fairness and social responsibilities of a test (Kane, 2010). Multi-facet Rasch model analysis (MFRM) and generalizability theory (G-theory) are two mainstream validation approaches in recent studies (Beglar, 2009; Hsieh, 2013; Bochner, 2015). After comparison, multi-facet Rasch analysis was selected as the validation method, as MFRM provides finer-grained, more individual analyses and performs better than G-theory in analyzing how test bias occurs (Kim & Wilson, 2009; McNamara & Knoch, 2012; Sudweeks, Reeve, & Bradshaw, 2004). In more concrete terms, MFRM is an extended Rasch model in which test-setting factors (e.g., task difficulty, test-taker ability, rater severity) are treated as facets. In writing assessment, ratings are viewed as the outcome of interactions between different facets. With MFRM, analyses of the facets and their interactions are conducted on the same logit scale in a statistical software package called FACETS. In addition, the elements of each facet are examined and fit statistics for each element are provided, yielding a detailed picture of the validity of the assessment measure (Barkaoui, 2013; Linacre, 1993, 1994). Based on Knoch's (2007, 2009, 2011) validation framework, validating a rating scale requires exploring the validity of five dimensions: discrimination of the rating scale, rater separation, rater reliability, variation in ratings, and scale step functionality. Knoch (2007) compared the validation of two rating scales, the Diagnostic English Language Needs Assessment (DELNA) scale and a self-developed one, by examining these five dimensions, and found abundant evidence supporting the validity of the self-developed scale.
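The fit statistics mentioned above can be illustrated with the infit mean square, an information-weighted ratio of squared residuals to model variance, where values near 1 indicate ratings that fit the Rasch model. The function name and the toy numbers below are our own illustration, not FACETS output:

```python
def infit_mean_square(observed, expected, variance):
    """Infit mean square: summed squared residuals divided by summed
    model variance. Values near 1 indicate ratings fitting the model."""
    residual_sq = sum((x - e) ** 2 for x, e in zip(observed, expected))
    return residual_sq / sum(variance)

# Toy ratings whose residuals are roughly the size the model expects,
# so the statistic comes out close to 1 (acceptable fit)
obs = [3, 2, 4, 1, 3, 2]
exp = [2.1, 2.9, 3.2, 1.8, 2.2, 2.8]
var = [0.8, 0.7, 0.9, 0.6, 0.8, 0.7]
print(round(infit_mean_square(obs, exp, var), 2))
```

Values well below 1 suggest overly predictable (muted) ratings, and values well above 1 suggest noisy ratings, which is why FACETS reports this statistic for every element of every facet.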
Grounded in Knoch's validation framework, we adjusted our validation process to seven dimensions in accordance with the features of the FACETS software, as presented in the Results section.

Methodology
Based on the aforementioned literature, this research aims to investigate the validity of a CSE-adapted diagnostic rating scale for English writing ability.

Research Question
This study intends to answer the following research question: How valid is the CSE-adapted diagnostic assessment scale for English writing ability in the Chinese EFL context?

Participants
Sixty-seven students (seventeen females and fifty males) from different majors were selected from the course "College English III" at a key university in Eastern China to participate in the writing test task.

The Writing Task
The task adopted in this study is an argumentative writing assignment for College English III students. A statement and a question were given as the prompt, concerning the view that children today do not have much play time because parents believe in a busy schedule of study and arts classes. Participants performed the untimed task without being informed of the purpose of the study beforehand, to ensure the authenticity of their writing.

The Rating Scale
Based on China's Standards of English Language Ability, the rating scale consists of two columns: the left column presents the descriptors and the right column the marks. It assesses writing from three aspects: language quality, essay structure, and task completion. Language quality contains two dimensions, vocabulary and syntax; essay structure likewise contains two dimensions, discourse and coherence. CSE Bands 4−7 were chosen because they are reported to correspond to the ability of non-English-major college students in China (Pan, 2019).
As Alderson (2005) and Weigle (2002) have noted, current diagnostic rating scales often employ vague or impressionistic terms that may confuse raters and hence impair both the validity of the scale and the informativeness of the test. We therefore tried to confine the descriptors to clear, concrete, and objective (Raffaldini, 1988) terminology as far as possible. We set five bands for each aspect of the scale, in part because reliability has been reported to be highest for scales with 5−9 bands (Miller, 1956; Myford, 2002). We also decided that both the combined score and the sub-scores should be reported, since it is widely accepted that scores should offer as much feedback as possible to students (Alderson, 1991; Knoch, 2009).
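As an illustration of the scale's structure and score reporting, the five dimensions and five-band range described above can be represented as follows. This is a sketch with names and example scores of our own choosing, not the published scale:

```python
# Hypothetical layout: three aspects, five scored dimensions in total,
# each scored in bands 0-4 (bands 1-4 mapping onto CSE Bands 4-7)
CRITERIA = {
    "language quality": ["vocabulary", "syntax"],
    "essay structure": ["discourse", "coherence"],
    "task completion": ["task completion"],
}
BANDS = range(0, 5)

def report(sub_scores):
    """Return the sub-scores plus a combined score, as the scale prescribes."""
    expected = {d for dims in CRITERIA.values() for d in dims}
    assert set(sub_scores) == expected, "one score per dimension required"
    assert all(s in BANDS for s in sub_scores.values()), "scores must be bands 0-4"
    return {**sub_scores, "combined": sum(sub_scores.values())}

print(report({"vocabulary": 3, "syntax": 2, "discourse": 3,
              "coherence": 3, "task completion": 4}))
```

Reporting the sub-scores alongside the combined score is what makes the feedback diagnostic: each dimension's band points to a specific strength or weakness.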

Procedure
First, participants completed the untimed argumentative writing task and sent the electronic versions of their writings to the researchers. Before the official rating began, a pilot rating was carried out to ensure that the four raters interpreted the descriptors consistently and correctly. Once agreement on the ratings of five samples reached 80%, the four raters rated all sixty-seven compositions with the diagnostic rating scale. All four raters were postgraduate students in English linguistics and teaching assistants for the College English course, and all had received professional rater training. The completed ratings were then returned to the researchers.
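The 80% pilot agreement threshold can be computed as the share of identical scores across all rater pairs. A minimal sketch, using hypothetical pilot scores (the rater labels and numbers are ours):

```python
from itertools import combinations

def exact_agreement(ratings):
    """Percent exact agreement: the share of (rater pair, sample)
    comparisons in which both raters gave identical scores.

    ratings: dict mapping rater name -> list of scores on the same samples.
    """
    agree = total = 0
    for r1, r2 in combinations(ratings, 2):
        for a, b in zip(ratings[r1], ratings[r2]):
            agree += a == b
            total += 1
    return agree / total

# Hypothetical scores from four raters on five pilot essays
pilot = {
    "R1": [3, 2, 4, 1, 3],
    "R2": [3, 2, 4, 2, 3],
    "R3": [3, 2, 3, 1, 3],
    "R4": [3, 2, 4, 1, 3],
}
print(exact_agreement(pilot))  # 0.8, i.e., the 80% threshold is met
```

With four raters there are six pairs, so five samples yield thirty pairwise comparisons; twenty-four matches gives exactly 80%.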

Data Analysis
The software FACETS for Windows, version 3.80.0 (Linacre, 2017), was used for the multi-facet Rasch analysis. Data analysis involved two steps: first, a multi-facet model was specified; then, a multi-facet Rasch analysis of the scoring outcomes was conducted in FACETS to investigate the validity of the self-developed diagnostic rating scale.
Based on the research aim, a multi-facet model was specified:

log(P_nijk / P_nij(k-1)) = B_n - C_j - D_i - F_k

where P_nijk is the probability that rater j gives k points to student n on item i; P_nij(k-1) is the probability that rater j gives (k - 1) points to student n on item i; B_n is the writing ability of student n; C_j is the severity of rater j; D_i is the mean difficulty estimated for item i; and F_k is the difficulty of receiving k points relative to (k - 1) points.
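To make the model concrete, the category probabilities it implies can be computed by accumulating the adjacent-category logits and normalizing. The following sketch uses entirely hypothetical parameter values; the function name and numbers are our own illustration, not FACETS output:

```python
import math

def mfrm_category_probs(ability, severity, difficulty, thresholds):
    """Category probabilities under the adjacent-category (rating scale)
    model: log(P_k / P_{k-1}) = B_n - C_j - D_i - F_k.

    thresholds: step difficulties F_1..F_m, so categories 0..m are modelled.
    """
    logits = [0.0]          # log-numerator for category 0
    total = 0.0
    for f in thresholds:
        total += ability - severity - difficulty - f
        logits.append(total)
    denom = sum(math.exp(x) for x in logits)
    return [math.exp(x) / denom for x in logits]

# Hypothetical facets: a fairly able examinee, a slightly severe rater
probs = mfrm_category_probs(ability=1.5, severity=0.3, difficulty=0.0,
                            thresholds=[-1.0, -0.3, 0.4, 1.2])
print([round(p, 3) for p in probs])  # probabilities for score categories 0-4
```

By construction, the log-odds of adjacent categories recover B_n - C_j - D_i - F_k, which is exactly the relation the model above states.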

Results
The results are presented from seven aspects: overall statistics, Knoch's (2007, 2009) five components for validating a rating scale (discrimination of the rating scale, rater separation, rater reliability, variation in ratings, and scale step functionality), and the criteria facet. Figure 1 presents the overall status of all three facets, which are measured on the same vertical axis in logit units. Since the "examinee" facet is tagged as positive, a higher measure indicates a more proficient writer, and the "examinee" column shows a roughly normal distribution, meaning the examinees are differentiated appropriately by the rating scale. The "rater" facet is tagged as negative, indicating that the first rater (R1) and the fourth rater (R4) are the most severe and the third rater (R3) is the most lenient. As all the raters cluster closely together, implying no marked deviation among their ratings, inter-rater agreement is satisfactory. The "criteria" facet is tagged as negative as well.

Discrimination of the Rating Scale
The examinee separation statistics indicate that the rating scale discriminates well among examinees of different writing abilities.

Rater Separation and Rater Reliability
The separation and reliability indices for the rater facet fall within acceptable ranges, indicating that the four raters applied the scale consistently.

Variation in Ratings
The infit mean-square values of all four raters (e.g., .96 for rater 1) are close to 1, evidence that the ratings reflect examinees' abilities rather than rater idiosyncrasies.

Scale Step Functionality
The average measures of the score categories increase monotonically across bands 1−4, as required; band 0, however, does not function as expected.

Criteria
The infit mean-square values of the five criteria are within the acceptable range, indicating that all five criteria function well in measuring writing ability.

Discuss
Comparing our rating scale with rating scales designed in other contexts highlights the crucial role of context. For instance, when Kuiken and Vedder (2017) designed a rating scale for Dutch learners' EAP argumentative writing assessment, no descriptors were dedicated to the variety, difficulty, or complexity of vocabulary and syntax, whereas varied word choice and syntax are rewarded in our rating scale. This is because students in the Netherlands are taught to keep their texts easily comprehensible and smoothly readable (Kuiken & Vedder, 2017), while Chinese EFL undergraduate students are encouraged to use more varied and somewhat complex words and syntactic structures, as reflected in the CSE. These contrasting pedagogical approaches may be explained by the differing language distance between English and the learners' native languages. It is conceivable that a rating scale applied to an inappropriate task type or setting will fail to distinguish between writing samples at different levels and fail to provide informative feedback for test takers. The importance of designing rating scales for a specific context is thus demonstrated.
Moreover, this study suggests that our context-specific, CSE-based rating scale further corroborates the versatility of the CSE. Unlike another distinguished scale, the CEFR, which clearly specifies its intended uses, the CSE may be limited in this respect (Wang, 2018). However, thanks to its abundant subscales, its uses have been extensively explored, for example for writing (Pan, 2019), listening (Min, 2018), speaking (Jie, 2019), and interpreting (Wang, 2017). Pan (2019) developed a writing assessment scale for college EFL students using the language ability and language use strategy components of the CSE. Although Pan successfully validated the use of the CSE in writing assessment, that study employed the scale for students' self-assessment. The present study, to some extent, offers a new perspective on the utility of the CSE and supplements studies of its application to writing assessment, as our research concentrates on the extent to which EFL teachers can use a scale developed from the CSE to rate students' writing appropriately and validly.
Based on the results, the three facets (examinee, rater, and criteria) basically accord with the ideal requirements. Examinees' scores are normally distributed and examinees of different abilities can be well differentiated; both intra-rater and inter-rater reliability are satisfactory; and the five criteria function well in measuring writing ability. As for scale step functionality, however, band 0 appears to malfunction compared with the normal functioning of bands 1−4, which may result from the ambiguous status of band 0. In the scale, bands 1−4 correspond to CSE Bands 4−7, and examinees who fail to reach CSE Band 4 (i.e., CSE Bands 1−3) are all scored 0. With a single band covering several proficiency levels, it is unsurprising that band 0 malfunctions.

Conclusion
Based on the quantitative analysis above, this study designed a diagnostic rating scale for English writing ability based on the CSE, which fulfills the core requirements of diagnostic assessment and proves able to differentiate test takers. With the multi-facet Rasch model, the study investigated three facets (examinee, rater, and criteria), all of which fit the model satisfactorily. The scale discriminates examinees' writing abilities effectively, maintains intra-rater and inter-rater consistency and reliability, and measures examinees' writing skill reliably across the five criteria. It is thus feasible to design a diagnostic rating scale for assessing EFL students' writing ability.
This study investigated the validity of a rating scale specifically designed to diagnose EFL writing ability, contributing to the relatively new field of diagnostic assessment research. By developing a diagnostic scale, it provides a meaningful examination of the appropriate form of a diagnostic rating scale, and its experimental and validation procedures offer insights for the development of testing tools, especially rating scales, in diagnostic assessment. Furthermore, rather than simply examining the form or definition of diagnostic assessment, this research focuses on the relatively under-researched application of multi-facet models in EFL testing. These attempts also help to clarify factors and methodologies worth noting in future research on, and development of, diagnostic assessments of EFL writing.
However, several improvements should be noted for future research. First, this study involved only four raters; further research with more raters from diverse academic backgrounds is needed to determine whether similar conclusions would be drawn in the same context. Second, because only one essay topic was explored, more research is needed to establish whether raters would exhibit bias on other topics. Finally, qualitative methods such as interviews could be used to examine how raters' academic training shapes their perceptions of the scoring categories, and how those perceptions may contribute to scoring bias.