Unveiling the Scoring Validity of Two Chinese Automated Writing Evaluation Systems: A Quantitative Study

Full Text (PDF): https://ccsenet.org/journal/index.php/ijel/article/download/0/0/44594/47080
DOI: 10.5539/ijel.v11n2p68

Jian Wang; Lifang Bai

Abstract

Computer-Assisted Language Learning (CALL) is a burgeoning industry in China. A case in point is the extensive use of Automated Writing Evaluation (AWE) systems in college English writing instruction to reduce teachers’ workload. Notably, however, most teachers include automated scores in the formative evaluation of the relevant courses while paying scant attention to the scoring efficacy of these systems (Bai & Wang, 2018; Wang & Zhang, 2020). To obtain a clearer picture of the scoring validity of two commercially available Chinese AWE systems (Pigai and iWrite), the present study sampled 486 timed CET-4 (College English Test Band-4) essays written by second-year non-English majors from 8 intact classes. Four measures, namely the maximum score difference, agreement rate, Pearson’s correlation coefficient and Cohen’s Kappa, were computed to assess human-machine and machine-machine congruence. Quantitative linguistic features of the sample essays, including accuracy, lexical and syntactic complexity, and discourse features, were also extracted to investigate differences (or similarities) in the construct representation valued by the two systems and by human raters. Results show that (1) Pigai and iWrite largely agreed with each other but differed considerably from human raters in essay scoring; (2) essays with high human scores tended to receive low machine scores; and (3) the machines relied heavily on the quantifiable features, which, however, had limited impact on human raters.
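
For readers unfamiliar with the four congruence measures named in the abstract, the following is a minimal Python sketch of how each can be computed for two sets of paired scores. It is an illustration only, not the authors' procedure: the score lists are invented, the function names are hypothetical, and the abstract does not say which tolerance defines "agreement" or whether Cohen's Kappa was weighted. The sketch assumes exact agreement by default and the unweighted form of Kappa.

```python
from collections import Counter
from math import sqrt


def max_score_difference(a, b):
    """Largest absolute gap between paired scores."""
    return max(abs(x - y) for x, y in zip(a, b))


def agreement_rate(a, b, tolerance=0):
    """Share of essays whose two scores differ by at most `tolerance` points.

    tolerance=0 means exact agreement (an assumption of this sketch).
    """
    hits = sum(1 for x, y in zip(a, b) if abs(x - y) <= tolerance)
    return hits / len(a)


def pearson_r(a, b):
    """Pearson's product-moment correlation between paired scores."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def cohen_kappa(a, b):
    """Unweighted Cohen's Kappa, treating each score value as a category."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: product of the two raters' marginal proportions.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented scores for six essays (not data from the study).
human = [2, 8, 11, 14, 8, 11]
machine = [8, 8, 11, 11, 8, 11]

print("max difference:", max_score_difference(human, machine))
print("agreement rate:", agreement_rate(human, machine))
print("Pearson's r:", pearson_r(human, machine))
print("Cohen's Kappa:", cohen_kappa(human, machine))
```

The same functions apply to a machine-machine comparison by passing the two systems' score lists. Note that unweighted Kappa counts a near miss as a full disagreement, so a weighted variant may be preferable when scores are ordinal.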