Establishing an Operational Model of Rating Scale Construction for English Writing Assessment

A Study on the Mechanism

Abstract
Rating scales for writing assessment are critical in that they directly determine the quality and fairness of such performance tests. However, in many EFL contexts, rating scales are made, to a certain extent, on the basis of the intuition of teachers, who strongly need a feasible and scientific route to guide their construction of rating scales. This study aims to design an operational model of rating scale construction with English summary writing as an example. Altogether 325 university English teachers, 4 experts in language assessment and 60 English majors in China participated in the study. Twenty textual attributes were extracted, through text analysis, from China's Standards of English Language Ability (CSE), the theoretical construct of summary writing, comments on sample summary writing essays from 8 English teachers and their personal judgement. The textual attributes were then investigated through a large-scale questionnaire survey. Exploratory factor analysis and expert judgement were employed to determine rating scale dimensions. Regression analysis and expert judgement were conducted to determine the weighting distribution across all dimensions. Based on such endeavors, a tentative operational model of rating scale construction was established, which can also be applied and adapted to develop rating scales in other writing assessments.


Introduction
Summary writing is a common practice particularly in higher education where students usually need to grasp and digest the main ideas and the basic structure of a huge amount of information from books or teachers' lectures (Friend, 2000). Therefore, summary writing assessment tasks are considered of great authenticity (Li, 2016) and have been included in many language assessments worldwide, for instance TOEFL, China's National Matriculation English Test (NMET), Test for English Majors in China (TEM) etc.
Despite the prevalence of summary writing in language assessments and studies concerning how students should be trained for summary writing assessment tasks (Jin, 2016) and the construct of summary writing (Li, 2014; Yu, 2007, 2008, 2009), little attention has been paid to investigating how summary writing should be rated or how its rating scales could be developed. In classroom teaching, teachers usually rely on three options to rate students' manuscripts: directly using an existing scale, adapting scales and creating a scale from scratch (Perlman, 2003). As a result, a valid rating scale appropriate for summary writing is urgently needed to guide the rating work.
Moreover, based on the document "Deepening the Reforms on the Educational Exams and the Enrollment Systems" issued by the State Council of China and supported by the Ministry of Education of China, the National Education Examinations Authority (NEEA) launched a research project to develop a proficiency scale of English, China's Standards of English Language Ability (CSE). The CSE aims to define the English proficiency of learners in China and provide references and guidelines for English learning, teaching and assessment (Liu, 2015). CSE consists of several subscales concerning skills of listening, reading, speaking, writing, translating, interpreting, etc., which could offer standards to be applied in summative and formative language assessment. It is, therefore, expected that CSE writing scales can be employed to guide the construction of rating scales for summary writing assessment tasks.
The present study, basing itself on students' real performance in summary writing and writing scales of the CSE, aims to establish an operational model to construct a rating scale using both quantitative and qualitative methods. The rating scale to be constructed could be applied after appropriate modifications in either large-scale writing tests or classroom formative assessments, thus enhancing the scoring validity and test fairness of summary writing. More importantly, based on the construction process, it is possible to establish an operational model concerning rating scale construction in writing, which can have empirical value for the construction of rating scales for English writing tasks other than summary writing.

Construct of Summary Writing
Summary writing is a task in which students read to write (Stawiarska, 2016), aiming to convey information in an efficient manner so that readers can learn the main idea and essential details through a piece much shorter than the original (Frey, Fisher, & Hernandez, 2003). The statement reveals significant elements in summary writing, e.g., source text, main idea and essential details, and has been widely echoed (Yang, 2014). Therefore, summary writing is the result of integration and interaction of reading and writing skills.
Such an interactive relationship reminds us that the accomplishment of summary writing lays a cognitive burden on summarizers when they carry out elaborate cognitive processing of source text information (Léonard, 2001). The cognitive burden never remains on the same level across all occasions but varies greatly due to many source-text-related factors, including text quality (Hidi & Anderson, 1986), writing styles (Kobayashi, 2002; Yu, 2009), text length (Kirkland & Saunders, 1991), text availability, i.e. how long summarizers can read the source text (Kirby & Pedwell, 1991; Stein & Kirby, 1992), and text structure (Lorch, 1989). Beyond the aforementioned factors, more microscopic investigations have been conducted concerning cognitive loads from summary writing tasks. Asención (2004) compared ESL and EFL learners in cognitive processing in response to summary writing tasks. Think-aloud protocols revealed that monitoring and planning occurred most frequently while organizing, selecting and connecting were much less frequent. It could be argued that although cognitive loads of summary writing generally include identifying, analyzing and synthesizing (Yang, 2014), the significance of such elements is by no means the same and might vary with different levels of proficiency in the target language.

Attributes of Rating Scales
Rating scales are defined as "rules that guide scoring" (Popham, 1997: 72) or "guidelines that clearly articulate performance expectations and proficiency levels" (Gezie et al., 2012: 422) or "a tool used in the process of assessing student work" (Dawson, 2017: 349). The above definitions are offered from the perspective of the function of rating scales. More specific definitions are offered, including "by using a number of descriptive bands for a particular skill, on a scale of competence…" (Kabir, 2012: 37). Representing constructs being tested (Knoch, 2011), rating scales play a significant role in terms of the validity of subjective writing assessments (Weigle, 2002), whose rating work should be conducted with rating scales to make the subjective rater decisions as fair and objective as possible.
etc. In addition, Zhao (2013) developed an analytic scale for measuring authorial voice strength in L2 argumentative writing. The scale is adapted from Hyland's (2008) interactional model of voice, but with more detailed descriptors of the six hierarchical levels in the scale added by the researcher. The construction process of theory-based rating scales, however, is rather subjective, based merely on designers' judgment, thus lacking persuasiveness. Other criticism holds that such scales may lead to reliability and validity problems (Brindley, 1998; Turner & Upshur, 2002), because scales based on general linguistic or assessment theories ignore specific and dynamic contexts of assessment tasks. Knoch (2011) asserts that the ideal option is to base the construction on the psycholinguistic development process of test takers. Such scales are called performance-driven scales, which attach great importance to observations of language performance as the foundation of descriptors (Fulcher, Davidson, & Kemp, 2011). Students' performance samples are selected and reviewed by experienced raters, teachers or other specialists, after which the samples, together with these people's verbal reports, are used to generate the basic verbal content of the scale (Jeffrey, 2015). However, since all descriptors come from analysis of students' essays in one particular assessment task, such scales are usually not used for scoring work in other tests, thus lacking generalizability.
An important branch of performance-driven scales requires collection of performance samples from test takers via identification of key performance attributes based on text analysis of students' essays (Fulcher, Davidson, & Kemp, 2011). What follows is the determination of the number of levels in the scale through discriminant analysis. The text attributes and different levels then constitute the essential structure of the rating scale (Knoch, 2007). Such scales are conventional ones in that they are composed of various dimensions and descriptors reflecting a continuum ranging from poor performance to excellent performance.
Recent years, however, have witnessed more endeavors in developing "empirically based, boundary-driven (EBB)" scales (Plakans, 2013). Such scales are quite unique in that they are "composed of a hierarchical set of articulated binary questions or descriptions" (Hirai & Koizumi, 2013: 400). Raters need to repeatedly make choices between "Yes" and "No" by answering binary questions about a performance and are therefore led by the scale from one step to another until finally a total score is achieved (North, 2003). EBB scales are typically performance-driven because the content and descriptors are developed based on analysis of samples of students' test performance (Hirai & Koizumi, 2013; North, 2003; Plakans, 2013; Turner & Upshur, 2002). However, EBB scales have demonstrated weakness in rating efficiency in that raters have to make several rounds of "Yes/No" choices before making decisions. Therefore, such scales are rarely employed in large-scale tests, and their feasibility is questioned from time to time.
In a word, theory-based scales are formed with full reliance upon related theories and might be too general and not task-specific enough. In writing assessments, however, the change of tasks naturally means the change of the assessed construct (Brindley, 1998). In this sense, theory-based rating scales are limited in their wider applicability. In contrast, descriptors of performance-driven scales are derived from the specific performance of test takers, thus guaranteeing the authenticity and suitability of rating scales for various writing tasks. Since theory-based and performance-driven approaches for constructing rating scales are powerful in their respective domains, neither of them should be excluded from rating scale construction. In addition, abundant studies have addressed the CSE from the following aspects: (1) elaborations of theoretical foundations and basic principles in the construction of the CSE as a whole and of its sub-scales (Liu, 2015; Liu & Han, 2018; Liu & Peng, 2017; Zeng & Fan, 2017); (2) investigations and analysis of the structures and content of the CSE (He & Chen, 2017; Kong & Wu, 2019); (3) validation of CSE scales (Fang & Yang, 2017); and (4) application of the CSE in foreign language pedagogy and assessment (Liu, 2017; Liu, 2019). Yet little has been done to explore its role in constructing rating scales. Therefore, the present study draws on the CSE and makes integrative use of theory-based and performance-driven approaches to construct a rating scale for summary writing tasks, aiming to answer the following questions:
(1) What are textual attributes to be included in the rating scale?
(2) What are the dimensions and their weighting of the rating scale?
(3) What is the operational model of rating scale construction for English writing assessment?

Participants
Participants were 60 juniors (9 males, 51 females) majoring in English in a university in China. They have been learning English for at least 12 years with good English proficiency. They were invited to accomplish a summary writing task, in which they summarized a source text of about 400 words using no more than 80 words. After writing, two summaries, which represented the intermediate level of proficiency of the group, were selected from all the 60 essays.
In addition, to determine appropriate source texts for summary writing, a pilot study was conducted in which another 4 junior English majors (not included in the 60) were invited to write summaries for all the 4 different source texts selected by the researchers. They were also asked to provide feedback on the quality and difficulty of the source texts so as to help determine which source text was the most appropriate for the present study.
In addition to students, two groups of teachers were enrolled in this study. The first group were English teachers (n=325) from 25 universities in China for a questionnaire survey concerning textual attributes of summary writing (See Table 1). The second group consisted of 8 English teachers randomly selected from the 325 teachers to read the two intermediate level samples of summary writing. After reading, the 8 teachers separately wrote their own comments on the quality of the two samples, which later served as one source of textual attributes of summary writing.

Selecting Source Text
Source texts for summary writing were selected from two perspectives, one being the genre, the other linguistic complexity. To begin with, narratives are relatively easier to summarize than expository or argumentative texts, due mainly to people's greater familiarity with them (Meyer & Freedle, 1984). Since student participants were English majors with good English proficiency, narrative texts were excluded. "Narrative and expository texts are common in studies of summarization in education, linguistics, and psychology. Few have employed argumentative texts" (Yu, 2009: 118). With the above considerations, the present study chose expository texts as the source text.
Readability and text length are important factors that have strong impact on the products of summary writing (Kirkland & Saunders, 1991), making it necessary to consider linguistic complexity while selecting source texts. This study chose source texts from past English for Postgraduate Admission Examination (EPAE) to ensure text quality considering EPAE's widely acknowledged high validity. Four passages were selected for further comparison. They were, respectively, Text 2 in Reading Comprehension of EPAE-2014, Text 4 in Reading Comprehension of EPAE-2006, the passage in Translation of EPAE-2007 and the passage in Translation of EPAE-2014. What followed was to decide, based on textual analysis of the passages and students' performances in the pilot summary writing of the 4 passages, which passage would be the most appropriate for further use in the study. Table 2 presents the results of analysis of linguistic complexity of the 4 passages. The 4 students who wrote the summaries of the four texts reacted positively to Text 2, which was considered to be more suitable for them in terms of difficulty in comprehension. As a result, Text 2 was finally chosen as the source text for the study.

Composing Summaries
All 60 students composed summary writing essays respectively based on the two texts. According to the National Matriculation English Test (NMET) of Shanghai and Zhejiang province in China, the lengths of the source text of summary writing and the summary essay are about 300 words and 60 words respectively (SMEEA, 2017; NEEA, 2015), the ratio being approximately 5:1. This study adopts the same ratio, and since the source texts are approximately 400 words in length, the summary length was limited to no more than 80 words. The time limit for the task was one hour, after which the 60 summary essays were collected and numbered from 1 to 60 to ensure anonymity.

Accumulating Textual Attributes
In order to make a CSE-based rating scale, the researchers investigated all CSE writing scales, from which descriptors related to summary writing were picked out. The descriptors were then further analyzed to extract elements directly related to the core of the construct of summary writing. Additionally, a comprehensive review of the construct of summary writing provided a theoretical base and a second source for textual attributes.
Since the rating scale is designed to be performance-driven, 8 teachers from among the total 325 were invited to read and provide commentary feedback on the quality of the two sample essays. The comments were made through think-aloud protocols (TAPs), i.e. the 8 teachers read the samples and wrote down whatever thoughts they had while reading, guided by no rating scales. After giving comments, the 8 teachers, based on their teaching experience and personal judgement, brainstormed as many textual attributes as possible that they believed should be included in the evaluation standards for summary writing.

Surveying with a Questionnaire
A questionnaire was designed to explore to what extent the textual attributes accumulated were appropriate for the rating scale of summary writing. It consisted of two sections, the first aiming to collect personal information about age, gender, professional title, rating experience, etc. The other section displayed all textual attributes to consult the 325 teachers for their attitudes towards whether the textual attributes should be included as descriptors in the rating scale. This section was presented in the form of a 5-point Likert scale (1=completely disagree, 2=basically disagree, 3=uncertain, 4=basically agree, 5=completely agree).
To facilitate the survey process and ensure easier access to respondents, the questionnaire was presented using "Questionnaire Star", a professional questionnaire online platform in China. The "Questionnaire Star" automatically collected all responses from the 325 teachers. However, the researchers examined the results and found that 14 respondents made the same choice for all items in the questionnaire. Therefore, these 14 copies were discarded and the number of copies of questionnaire for further research was 311. The Cronbach α of the questionnaire is 0.911, indicating very high internal reliability.
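The two screening steps described above (discarding respondents who gave the same answer to every item, then checking internal consistency) can be sketched as follows. This is a minimal illustration only: the function names and the toy response matrix are hypothetical, and the study itself reports these results without describing how they were computed.

```python
import numpy as np

def drop_straight_liners(responses):
    """Remove respondents (rows) who chose the same option for every item."""
    responses = np.asarray(responses, dtype=float)
    varied = responses.max(axis=1) != responses.min(axis=1)
    return responses[varied]

def cronbach_alpha(responses):
    """Cronbach's alpha for a respondents-by-items matrix."""
    responses = np.asarray(responses, dtype=float)
    k = responses.shape[1]                            # number of items
    item_variances = responses.var(axis=0, ddof=1)    # variance of each item
    total_variance = responses.sum(axis=1).var(ddof=1)  # variance of row totals
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

# Hypothetical toy data: 5 respondents x 4 Likert items (not the study's data).
toy = np.array([
    [5, 4, 5, 4],
    [3, 3, 3, 3],   # same choice for all items, so this row is discarded
    [4, 4, 5, 5],
    [2, 3, 2, 3],
    [5, 5, 4, 4],
])
cleaned = drop_straight_liners(toy)
alpha = cronbach_alpha(cleaned)
print(cleaned.shape, round(alpha, 3))
```

Applied to the real data, the first step would correspond to reducing the 325 returned questionnaires to the 311 retained ones.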
In order to determine and define the dimensions of the rating scale, exploratory factor analysis (EFA) was applied with SPSS 24.0 to categorize textual attributes into various dimensions. Meanwhile, qualitative expert judgement also played a significant role as supplement. The experts are 4 experienced university English teachers, each with a doctor's degree and an academic title of full professor. Their research focuses are language assessment and second language acquisition.

Regression Analysis of Questionnaire Data
The purpose was to determine the weightings for various dimensions of the rating scale. Based on EFA and expert judgement, a preliminary rating scale was constructed with 5 specific dimensions, each with 5 levels on the continuum of "Excellent-Good-Ordinary-Poor-Fail". For convenience, the full mark of the summary writing task was 100 points and each dimension shared the same weighting, i.e. 20 points. In addition, the 20 points for each dimension were divided evenly across the 5 different levels. Table 3 presents an outline of the preliminary rating scale constructed. To prepare data for regression analysis, the two researchers rated the 60 essays separately based on the preliminary rating scale. The first step was to determine at which level the essay was located for a particular dimension, i.e. a level score, and then to pick an exact point from the range at the determined level, i.e. a specific score. The second step was to repeat the first step four times, once for each of the other 4 dimensions. The researchers initially rated 6 essays out of the 60, after which their results were compared to ensure inter-rater reliability. Correlation analysis revealed that the rating of the two raters was reliable for further studies (r=0.92, p<0.05). The two raters then separately rated the remaining 54 essays following the same 2 steps. The average scores of the rating results of the two raters were used as the final scores for all the 60 summaries.
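The scoring procedure above can be illustrated with a short sketch. It assumes that the 20 points of a dimension are split into five consecutive integer bands of 4 points each (Table 3 is not reproduced here, so the exact band boundaries are an assumption), and it uses invented scores for six essays to show the inter-rater correlation check and the averaging of the two raters' scores.

```python
import numpy as np

LEVELS = ["Fail", "Poor", "Ordinary", "Good", "Excellent"]
POINTS_PER_DIMENSION = 20

def level_bands(points=POINTS_PER_DIMENSION, levels=LEVELS):
    """Split a dimension's points evenly into consecutive integer bands.

    The band layout (1-4, 5-8, ...) is an assumed reading of
    "divided evenly across the 5 different levels".
    """
    width = points // len(levels)
    return {name: (i * width + 1, (i + 1) * width) for i, name in enumerate(levels)}

def pearson_r(x, y):
    """Pearson correlation between two raters' scores."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])

# Hypothetical total scores given by two raters to six trial essays.
rater_a = np.array([72, 65, 80, 58, 90, 77])
rater_b = np.array([70, 68, 82, 55, 88, 75])

bands = level_bands()
r = pearson_r(rater_a, rater_b)
final_scores = (rater_a + rater_b) / 2   # average of the two raters
print(bands["Excellent"], round(r, 2), final_scores[0])
```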
As a result, for all the 5 dimensions of an essay, there were respectively 5 level scores and 5 specific scores. These scores were then added up to obtain the final total scores for each of the 60 essays. Regression analysis was conducted with the 5 level-scores as independent variables and the total scores as the dependent variable. Standardized regression coefficients (β) and Unstandardized coefficients (B) were employed as indicators of different degrees of significance of the 5 variables, thus helping to determine the ratio of distribution of weightings among the 5 dimensions in the rating scale. The 4 experts were invited again to provide comments and feedbacks as a supplement for final decisions of weighting distribution.
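As a rough illustration of how unstandardized (B) and standardized (β) coefficients relate, the sketch below fits an ordinary least squares model with NumPy and converts B to β by rescaling with the sample standard deviations. The data are simulated and the "true" coefficients are invented for the example; the study itself used its own rating data, and the software used for the regression is not stated in this section.

```python
import numpy as np

def regression_coefficients(level_scores, total_scores):
    """Return (B, beta) for an OLS regression of totals on dimension level scores."""
    X = np.asarray(level_scores, dtype=float)
    y = np.asarray(total_scores, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])      # add intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    b = coef[1:]                                         # unstandardized B
    beta = b * X.std(axis=0, ddof=1) / y.std(ddof=1)     # standardized beta
    return b, beta

# Simulated data: 60 essays x 5 dimensions, level scores from 1 to 5 (hypothetical).
rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(60, 5))
true_b = np.array([4.0, 3.0, 3.0, 5.0, 2.0])             # invented for illustration
y = 2.0 + X @ true_b                                     # noise-free totals

b, beta = regression_coefficients(X, y)
print(np.round(b, 3), np.round(beta, 3))
```

B is expressed in the units of the variables (here, points), whereas β is unit-free, which is why the two can rank the same predictors differently when their spreads differ.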

Accumulating Textual Attributes
Accumulation of textual attributes as descriptors of the rating scale was made with the CSE, construct of summary writing, teachers' comments through TAPs and personal judgement as major sources. From each source, we collected typical and representative attributes and then merged identical attributes into various independent attributes presented in Table 4. In order to adapt attributes for further large-scale questionnaire survey, attributes from across various sources were then merged to avoid repetition. For instance, B1, C3 and D9 focus on using paraphrased rather than copied language of the source text; A1, B2, C1, D1 all stress the content coverage of source texts, to name just a few. In addition, some attributes that contain over two aspects were split into several independent descriptors. For example, C5 can be further divided into such elements as "avoiding grammatical errors", "using flexible subordinate clauses" and "using appropriate words accurately", etc., which can then be revised and adapted to construct more specific descriptors.
After merging and splitting, 29 textual attributes were extracted, based on the researchers' inductive judgement (Table 5). Such decisions were therefore subjective and needed to be further evaluated.
The 29 attributes were then sent to the 4 experts in language assessments for consultation. They suggested items Q39, Q17, Q19, Q44 be deleted. To be specific, Q39 was too easy for students; Q17 and Q19 overlapped with Q32; and Q44 overlapped with Q30.

Table 4 (excerpt). Textual attributes collected from the construct of summary writing (B) and teachers' TAP comments (C)
B1: Restatement of the original text into writers' own words in showing only the author's main ideas (Doyle, 2012)
B2: …convey correct information efficiently so that readers learn the main idea and essential details through a much shorter piece (Nancy et al., 2003)
B3: A shortened form of a text giving main points from the original text and isolated from trivial details (Benzer, et al., 2016)
C1: He gave a complete summary of the major content of the source text with clear structures.
C2: Despite a few errors in vocabulary, overall it's satisfactory.
C3: The writer could use his own words to summarize.
C4: The summary is written with cohesion and connection between sentences.
C5: The writer can use grammar and vocabulary correctly and also diversified & flexible sentence patterns.
C6: The language isn't succinct. The writer has talked too much.
C7: The writer put too much emphasis on the first part of the source text.
C8: Punctuation is used properly.

As a result, 25 textual attributes remained, which were formally adopted for the questionnaire survey. Table 6 presents the descriptive statistics of the results. The means for most of the attributes are over 4 except Q34 (Mean=3.93), Q35 (Mean=3.66), Q36 (Mean=3.60), Q41 (Mean=3.76) and Q43 (Mean=3.60), indicating that the 5 attributes failed to win large-scale support among the 325 respondents. For confirmation, the 4 experts were consulted. They supported deleting the 5 attributes, namely writing tone, idiom use, writing style, rhetoric devices and complicated grammatical structure, which had little to do with the construct of summary writing. As a result, the 5 attributes were removed, leaving 20 attributes that constituted a bank of textual attributes for the main body of the rating scale.
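The screening by item means can be expressed as a simple threshold check. In the sketch below, the five means below 4 are those reported above; the remaining entries are hypothetical placeholders standing in for attributes whose means were over 4, and the cut-off of 4.0 is an assumed reading of "over 4". In the study, flagged attributes were removed only after the experts confirmed the decision.

```python
# Means reported in the text for the five weakly supported attributes.
reported_means = {"Q34": 3.93, "Q35": 3.66, "Q36": 3.60, "Q41": 3.76, "Q43": 3.60}

# Hypothetical placeholder means for two retained attributes (not from Table 6).
placeholder_means = {"Q26": 4.50, "Q29": 4.30}

def screen_by_mean(means, threshold=4.0):
    """Split attributes into those kept (mean > threshold) and those flagged."""
    kept = sorted(q for q, m in means.items() if m > threshold)
    flagged = sorted(q for q, m in means.items() if m <= threshold)
    return kept, flagged

kept, flagged = screen_by_mean({**reported_means, **placeholder_means})
print("flagged for expert review:", flagged)
```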

Categorizing Rating Dimensions
The 20 attributes went through exploratory factor analysis (KMO=.857; p<.05) to help determine the classification of textual attributes and make tentative decisions of the rating scale dimensions.
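For readers unfamiliar with the KMO statistic reported above, the following sketch computes the overall Kaiser-Meyer-Olkin measure from a data matrix: the sum of squared off-diagonal correlations divided by that sum plus the sum of squared off-diagonal partial correlations. The data are simulated from a single common factor purely for illustration; the study's own value was obtained with SPSS.

```python
import numpy as np

def kmo(data):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    data = np.asarray(data, dtype=float)
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    # Partial correlations, each controlling for all remaining variables.
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale
    off_diagonal = ~np.eye(corr.shape[0], dtype=bool)
    r2 = (corr[off_diagonal] ** 2).sum()
    p2 = (partial[off_diagonal] ** 2).sum()
    return r2 / (r2 + p2)

# Simulated responses: 200 respondents, 6 items driven by one common factor.
rng = np.random.default_rng(0)
factor = rng.normal(size=(200, 1))
items = factor + 0.5 * rng.normal(size=(200, 6))

kmo_value = kmo(items)
print(round(kmo_value, 3))
```

Values close to 1 indicate that the correlations among items are largely explained by shared variance, which is the usual precondition for factor analysis.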
With the results of exploratory factor analysis (Table 7), the attributes could therefore be safely categorized into 5 components as the dimensions of the potential rating scale. The categorization results are displayed in Table 8 with each potential dimension named by the researchers according to the shared features of attributes. Based on the 20 textual attributes and the dimensions, a preliminary rating scale of summary writing was tentatively established (See Appendix 1). Each dimension consists of 5 levels discriminated by specific indicators including "no", "less", "comparatively", "completely", etc., which demonstrate to what extent summarizers accomplish the task.

Table 8 (dimension; items; description)
…; …; The 4 textual attributes center on making connections between sentences and paragraphs to present summaries with clear structures.
Fidelity to source texts (FS); Q26, 28, 32, 33; The 4 textual attributes stress the necessity of excluding new content in summaries and complete coverage of points in source texts.
Linguistic accuracy (LA); Q29, 30, 31; The 3 attributes stress the significance of using language in a correct way in terms of vocabulary, grammar and language fluency.
Mechanism (MC); Q27, 40, 42, 45; The 4 attributes stress the need to write with normalized punctuation, clear and neat handwriting & normalized use of language in authentic use.

Determining Weighting Distribution Across Dimensions
Multiple linear regression analysis supplemented by expert judgement was conducted to help determine weighting distribution across dimensions. The standardized coefficient β is used as an indicator of the influence of the independent variables, i.e. the five dimensions, on the dependent variable, i.e. the "Writing scores" (Table 9). For further confirmation, the preliminary scheme of weighting distribution was presented to the 4 experts. They, however, only partially agreed with the proposed distribution of the potential scale and strongly recommended that the weighting distribution be appropriately rearranged. Expert A made the following comments.

MC is simply about punctuations, neatness of presence, etc. and is not closely connected with the summary work itself. So, I can't accept that MC shares with FS and CC the same weighting in the rating scale.
Expert A's attitude was echoed by Expert C who held that "apparently, MC is far less important than the other four". Consequently, a decision was made that the weighting of MC should be decreased, which gained positive feedback from the other two experts.
Moreover, all experts agreed with the current weighting of LA, holding that a summary in EFL contexts should never be considered of high quality with the presence of many grammatical or lexical errors. This highlights the status of LA allotted with 25% of the total weighting as the highest among all dimensions. In the same sense, LC currently takes 15% of the weighting, which seems inadequate. This can be further supported by opinions from Expert C.
This rating scale is constructed targeting college students, rather than primary or middle school students. They have gained good English proficiency. The language they use in summaries should not only be accurate but also complex, reflected by the use of advanced vocabulary, diversified sentence patterns, etc.

Expert D held that "summary writing is in essence a member of the family of English writing tasks". He made the following comments concerning the issue.

To accomplish summary writing tasks, summarizers not only need to put together all major points of information from source texts, but also make smooth connections among all points. This is because it is a passage rather than several isolated sentences summarizers write.
Expert D elevated the status of CC and suggested increasing its weighting, which can be realized, according to Expert B, by "appropriately adding the weighting deducted from MC." All four experts expressed the same attitude towards FS, holding that this is a dimension typical of summary writing and might be absent in rating scales for other writing tasks in that "summary writing is a highly condensed version of its source text (Expert A)". Important as this dimension is, the 4 experts believed that the current weighting for FS, 20%, is appropriate.
However, experts' suggestions only offered a general scheme of whether the weighting of a certain dimension should be increased, decreased or kept the same. There were no specific proposals concerning the extent to which dimension weightings are to be changed.
The preliminary scheme of weighting distribution based on the β coefficients could then be further revised via the B coefficients, which can also be used to compare the significance of different variables provided that they share the same units (Nardi, 2009). All variables in the study share the same unit of "points", indicating that the use of unstandardized B coefficients, namely B=4.891 (LA), B=3.846 (LC), B=3.839 (FS), B=4.914 (CC) and B=3.429 (MC), is acceptable in helping to decide the weighting distribution. Based on the B coefficients, it can be tentatively concluded that LA and CC receive roughly equal weighting, LC and FS receive lower but also roughly equal weighting, and MC seems to be the least important due to its comparatively much lower B coefficient.
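One simple way to see the pattern in the B coefficients is to express each as a share of their sum. The proportional normalization below is an assumption made purely for illustration; the revised scheme in the study combines the B coefficients with expert judgement and need not match these proportions.

```python
# Unstandardized B coefficients as reported in the text.
b_coefficients = {"LA": 4.891, "LC": 3.846, "FS": 3.839, "CC": 4.914, "MC": 3.429}

def proportional_shares(coefficients):
    """Express each coefficient as a percentage of the total (illustrative only)."""
    total = sum(coefficients.values())
    return {name: 100 * value / total for name, value in coefficients.items()}

shares = proportional_shares(b_coefficients)
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name}: {share:.1f}%")
```

The ordering produced this way groups LA with CC and LC with FS, with MC last, which is consistent with the tentative conclusion drawn above.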
With the above considerations, Table 10 presents the revised scheme of weighting distribution with symbols "↑", "↓" and "-" respectively meaning "adding", "decreasing" and "no change". For further confirmation and comments, the new scheme was sent to the 4 experts, all of whom offered positive feedback. Expert C, however, was a bit more cautious, suggesting that "although the new scheme seems more acceptable than the previous one, it is still a tentative decision and needs further validation to judge its appropriateness". In practical use, while scoring each dimension, raters are expected to initially locate summarizers' performance into a particular level of that dimension and then decide which specific score to be given within the score range of each of the levels.

Discussion & Conclusion
We now discuss the findings by returning to the research questions of the present study.
(1) What are textual attributes to be included in the rating scale? A rating scale, as an instrument for scoring performance-based assessments like English writing, can serve as the representation of the construct assessed through various textual attributes (Turner & Upshur, 2002). The process of accumulating textual attributes of summary writing in this study could be synthesized into a "3D Principle", meaning "Diversified coverage" of the scope of textual attributes, "Diversified sources" of textual attributes and "Diversified extraction approaches" in accumulation. As for diversified coverage, results show that the attributes collected cover a variety of aspects of the construct of summary writing. This is in line with the argument of Yu (2013) that attributes or descriptors collected for rating scales need to be concrete and diversified in content and format. The bank of textual attributes of summary writing covers a wide range, for instance, vocabulary use, sentence structure organization, cohesion, summary-source text relationship, etc.
The diversified sources of attributes pertain to endeavors of collection from sources like CSE, construct of summary writing, teachers' commentary feedback and personal judgement. Apparently, CSE and the construct provide some macroscopic attributes concerning the overall summary writing ability, represented for example, by key words "main idea", "most important points", "complete summary", "essential details" etc. In contrast, TAP comments and personal judgement offer more microscopic perspectives reflected by more specific definition of summary writing ability, for example, "using his own words (C3)", "vocabulary with complexity (D5)", etc. Therefore, macroscopic and microscopic perspectives can be supplements for each other to guarantee appropriate coverage of attributes. Extracting textual attributes from teachers' TAP comments on sample summaries, as the second source, has echoes in many studies (Chen & Liu, 2016;Jeffrey, 2015;Turner & Upshur, 1996) because individual teachers view students' summary writing from different perspectives, some of which might overlap but others of which could be put together, thus broadening the coverage of attributes. Jeffrey (2015) proposed the value of teachers' verbal comments on students' performance rather than the written ones in the present study. The difference, however, reminds us that both written and verbal feedbacks or comments can be used for exploration of attributes of writing tasks, which could expand the scope of extraction to avoid any possible omission. The final source of attributes is teachers' personal judgement, which is in line with suggestions by Perlman (2003) concerning rating scales development. Apparently this source resembles, to some extent, teachers' TAP comments, highlighting the important role played by teachers in developing rating scales. Teachers' voices are of more authenticity because textual attributes from such sources are based on students' actual writing performance.
Finally, diversified approaches for extracting textual attributes include analyzing quantitative questionnaire results and the qualitative researchers' extraction work and expert judgment. The researchers, by comparing, merging and splitting attributes, made preliminary processing work, leading to a tentative version of the bank of textual attributes. To justify and ensure the appropriateness of decisions concerning whether to remove or keep certain attributes, it is of great necessity to enroll expert judgement as a qualitative approach. Similar opinions could be found in Plakans (2013), who argued that data and statistical analysis can't be perfect without analysis of language experts including teachers and researchers.
(2) What are the rating dimensions and their weighting of the rating scale?
The 5 dimensions of the rating scale are LA, LC, CC, FS and MC, which are consistent with the construct of summary writing. Summary writing involves the integration of reading and writing skills (Stawiarska, 2016). The first step in writing a summary is reading for an accurate and comprehensive understanding of source texts (van Dijk & Kintsch, 1983). This does not simply mean equating the content of the summary with that of the source text; it places more demanding and challenging requirements on the complicated mental processing of source texts. Summarizers need to extract from source texts specific points of information, divided into major and secondary ones: the former should all be covered (Q32), while the latter should be omitted or at least integrated and reduced (Q28). Such attributes were put into the FS dimension because they all concern the matching relationship between summaries and source texts. This is in line with previous assertions about cognitive loads on summarizers, including selecting essential ideas across original paragraphs (Brown, Day, & Jones, 1983), selecting the information to be represented in a summary (Johnson, 1983) and working out the text thesis and major ideas (Li, 2016). Moreover, summary writing in this study was conducted in EFL contexts, which stress the involvement of language use as well as the logical relationship among sentences or paragraphs in rating scales. Therefore, traditional dimensions like LC, CC, LA and MC are included in the rating scale and present a more comprehensive and complete coverage of the construct of summary writing. Together with FS, the 5 dimensions further confirm that summary writing is a discourse in its own right, a discourse that requires evaluation criteria different from those of any other integrated writing task (Yu, 2013).
The present study, based on β coefficients in regression analysis, made a preliminary arrangement of weighting distribution (LA=25%, LC=15%, FS=CC=MC=20%). It was then revised by decreasing the weighting of MC to 10% and increasing that of LC to 25% based on B coefficients and expert judgement. Quantitative regression analysis is advocated as an effective and scientific method for "assigning appropriate weights to component parts of a rating scale" (Henning, 2001) since it can indicate the different significance of each dimension clearly through statistics. However, quantitative statistics alone are insufficient to guarantee the persuasiveness of the decisions. Therefore, after the regression analysis, this study turned to expert judgement for confirmation. Such a double check helps to ensure that the scheme has no obvious deficiency.
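To make the effect of the revision concrete, the following minimal sketch compares the preliminary and revised weighting schemes on one invented summary. The 0-10 rating per dimension, the 100-point full mark and the function name `composite_score` are assumptions made for illustration only; they are not taken from the rating scale itself.

```python
# Weighting schemes (in percent) as reported in the text.
PRELIMINARY = {"LA": 25, "LC": 15, "FS": 20, "CC": 20, "MC": 20}
REVISED = {"LA": 25, "LC": 25, "FS": 20, "CC": 20, "MC": 10}


def composite_score(weights, ratings, max_rating=10):
    """Combine per-dimension ratings into a total out of 100.

    Hypothetical helper: `ratings` holds one rating per dimension on an
    assumed 0..max_rating scale, and each weight is the number of points
    (out of 100) that the dimension contributes at full performance.
    """
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100 percent")
    if set(weights) != set(ratings):
        raise ValueError("ratings must cover exactly the weighted dimensions")
    return sum(weights[d] * ratings[d] for d in weights) / max_rating


# Invented ratings: strong language use, weaker mechanics.
sample_ratings = {"LA": 8, "LC": 8, "FS": 6, "CC": 6, "MC": 4}

preliminary_total = composite_score(PRELIMINARY, sample_ratings)
revised_total = composite_score(REVISED, sample_ratings)

print(preliminary_total, revised_total)
```

For this invented summary the revised scheme yields a higher total than the preliminary one, because weight has moved from MC, where the summary is weak, to LC, where it is strong. Different hypothetical ratings would of course shift the totals in other directions.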
In addition, more attention should be paid to the general guiding principles that direct our endeavors. The weighting in this scale is not evenly distributed: it emphasizes certain dimensions such as LA (25%) and LC (25%) and gives the smallest share to MC (10%). Other studies have adopted schemes different from the one in this study. Sasaki & Hirose (1999) constructed a rating scale for Japanese university students' expository writing, arguing that all dimensions should receive an equal share of weighting because "the explanatory power of each criterion can vary from composition to composition". Such equal weighting distribution has its roots in early work represented by Hamp-Lyons (1991). However, equal distribution has not gained much support in the academic world, where most researchers hold that proper rather than arbitrarily equal distribution of weighting is required in constructing rating scales (Zou, 2011).
The issue of weighting distribution is closely connected with the construct of the test task and, therefore, neither equal nor arbitrary distribution is encouraged. The scheme of weighting distribution in the present study is by no means fixed. Instead, the current scheme needs to be adapted according to factors like test takers' English proficiency, specific types of tests, etc. One typical example is NMET Shanghai, whose test takers are mostly graduates of senior high schools. They are quite different from summarizers in the present study in terms of language proficiency, and this difference has been taken into consideration for weighting distribution in the scales. The rating scale in NMET Shanghai, with a full mark of 15 points, consists simply of two dimensions, namely content and language, whose weightings are 10 points and 5 points respectively (SMEEA, 2017). However, summarizers in the present study are juniors majoring in English. As a result, the weighting scheme of the NMET version is by no means suitable for the present study, where more refined divisions should be made with diversified weighting distributions. Because summary writing here is an EFL writing task, the greater emphasis laid upon LA and LC is justified: language proficiency takes priority in EFL contexts, as confirmed by Expert A's words: "the current scheme of weighting distribution conforms to general cognition and has won wide recognition in the field of language teaching, learning or testing".

(3) What is the operational model of rating scale construction for English writing assessment?
The approach for rating scale construction in this study is a mixed one, taking into account the CSE, the construct of summary writing and students' actual performance, because a single approach to rating scale development is by no means sufficient (Knoch, 2011). All approaches (CSE-based, construct-based and performance-based) have their own strengths and weaknesses and should be used in combination rather than separately. Figure 1 presents an operational model for constructing the rating scale, which can also offer suggestions for the development of other rating scales in writing assessment. The performance-based approach has long been advocated for constructing rating scales (Fulcher, Davidson, & Kemp, 2011; Plakans, 2013). Unlike prototypical performance data-based approaches, which collect performance samples and identify key features through textual analysis, this study relies on teachers' commentary feedback on students' summaries together with all potential attributes based on their personal judgement. The 8 raters chosen to give commentary feedback worked separately without referring to any rating scales so as to elicit potential textual attributes of summary writing. This process could also be regarded as a "rater workshop" (Shaw & Weir, 2007), a method for constructing rating scales also adopted by Educational Testing Service (Cumming, Kantor, & Powers, 2001). It is also in line with the study of Jeffery (2015), in which teachers' summative comments on their students' performance were recorded and used to extract potential descriptors of rating scales. The accumulation of textual attributes from teachers' commentary feedback should be flexible and dynamic, in order to enlarge the coverage and enrich the diversity of attributes.
Rating scales in writing assessment have traditionally been developed on the basis of experts' intuitive judgement, lacking the support of test takers' authentic data. As a remedy, the rating scale constructed in the present study can either be directly applied by teachers and raters or be modified to suit various assessment contexts. Raters and teachers should be encouraged to strengthen their awareness that rating scales for writing assessment need to be developed based not on subjective and intuitive judgment, but on the integrative use of available resources, including investigations of the construct of the assessment task and teachers' written or verbal commentary feedback (Jeffrey, 2015) on students' authentic performances of the assessment task. This is to guarantee that rating scales are better targeted at specific users, since they are constructed in a way deeply rooted in real situations. Moreover, since the construction of rating scales for writing assessment is an iterative and ongoing process (Hirai & Koizumi, 2013; Weigle, 2002), full of "missteps and multiple revisions" (Plakans, 2013:161), it should be kept in mind that rating scales, after construction, have to undergo continuous rounds of validation and investigation for potential improvement.

Limitations and Future Directions
The first limitation is that the study covered only a limited sample of data, involving only 325 teachers and 60 students. The rating scale therefore rests on the perceptions and attitudes of a limited number of people, which may undermine its plausibility. A larger and more varied sample would be desirable for greater generalizability. Furthermore, the textual attributes were generalized from the CSE, the construct, TAP-based commentary feedback and personal judgment by the researchers themselves. This process is completely intuitive, and some attributes might have been neglected or misinterpreted. To avoid this potential problem, more experts could be invited to work jointly on text comparison and analysis.
As for future directions, the present study constructed a rating scale of summary writing based on data collected among junior English majors whose English proficiency is relatively high, indicating that the rating scale may not be suitable for summary writing tasks at other proficiency levels. Therefore, the study could be replicated with students of various proficiency levels including, for instance, non-English majors or middle school students. Further comparisons can also be made among rating scales for different proficiency levels to investigate their similarities and differences.
In addition, the present study focuses only on the construction of the rating scale, which still requires comprehensive validation that could offer suggestions for its improvement. Moreover, when involved in rating their own or their peers' performance, students, with rating scales at hand, might greatly improve their writing performance since they have opportunities to explore their own strengths and weaknesses based on full comprehension of the criteria in rating scales (Andrade, 2005; Becker, 2016). As a result, a rating scale for students could also be constructed with full consideration of students' language proficiency and cognitive characteristics.

Declaration of Interest
None.

Copyrights
Copyright for this article is retained by the author(s), with first publication rights granted to the journal.
This is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).