Can LLMs Replace Human Raters? Evaluating EFL Learners’ English Summaries


• Makiko Kato

Abstract

This study investigates the feasibility of using Claude 3.5 Sonnet, a large language model (LLM), to evaluate English summaries written by Japanese university students, and compares its performance with that of human raters. The research is motivated by the 2024 introduction of a summarization task in the EIKEN Grade 2 exam, which has heightened the demand for effective assessment tools in English education. Traditional automatic metrics such as ROUGE and BLEU often fail to adequately capture linguistic features like paraphrasing. As an alternative, this study explores the use of LLMs for more nuanced evaluation. A total of 70 students wrote summaries of two English texts. These summaries were assessed using two analytic rubrics: one developed by Kato (2024), which includes four criteria (Integration, Language Use, Paraphrasing, and Content Accuracy), and another adapted from Li (2014), which features three criteria. Six trained human raters and Claude 3.5 Sonnet independently evaluated the student summaries. Statistical analyses, including t-tests and Pearson correlations, were conducted to compare scoring patterns. Results revealed no significant differences between human and LLM scores for Language Use and Content Accuracy, although significant differences emerged for Integration and Paraphrasing. Strong correlations were observed for Integration, with moderate correlations for the other criteria. Paraphrasing, however, emerged as a particular challenge, showing score discrepancies and notably weak correlations depending on the source text. Despite differences in rubric structure, scores derived from the two rubrics were highly correlated, indicating consistent evaluation trends. These findings suggest that Claude 3.5 Sonnet can reliably replicate human scoring trends for several aspects of summary quality, positioning it as a promising supplementary tool for assessment. Nonetheless, further refinement is needed to enhance its capacity to evaluate paraphrasing effectively. The study highlights the need for clearer operational definitions and more robust strategies for assessing paraphrasing in automated systems.
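To illustrate the kind of human-versus-LLM comparison described above, the following is a minimal sketch in Python of how scores on a single rubric criterion might be compared using a t-test and a Pearson correlation. The example arrays, the use of SciPy, and the paired-test design are assumptions made for illustration; they are not taken from the study's actual data or analysis procedure.

```python
# Hypothetical sketch: comparing human and LLM scores for one rubric criterion.
# The score arrays below are invented placeholders, not data from the study.
import numpy as np
from scipy import stats

# One score per summary, assumed to be averaged across raters where needed.
human_scores = np.array([3.5, 4.0, 2.5, 3.0, 4.5, 3.5, 2.0, 4.0])
llm_scores = np.array([3.0, 4.0, 3.0, 3.5, 4.0, 3.5, 2.5, 4.5])

# Paired t-test: do mean scores differ between human raters and the LLM
# for the same set of summaries?
t_stat, t_p = stats.ttest_rel(human_scores, llm_scores)

# Pearson correlation: do the two scoring sources rank summaries similarly?
r, r_p = stats.pearsonr(human_scores, llm_scores)

print(f"paired t-test: t = {t_stat:.2f}, p = {t_p:.3f}")
print(f"Pearson correlation: r = {r:.2f}, p = {r_p:.3f}")
```

In this sketch, a non-significant t-test together with a strong positive correlation would correspond to the pattern the abstract reports for criteria such as Language Use and Content Accuracy, whereas a weak correlation would mirror the difficulty observed for Paraphrasing.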



This work is licensed under a Creative Commons Attribution 4.0 License.