Effectiveness of the Automated Writing Evaluation Program on Improving Undergraduates’ Writing Performance

Automated Writing Evaluation (AWE) programs have gained increasing ground in ESL/EFL writing instruction because of their instructional features, such as instant automated scoring and real-time diagnostic corrective feedback on individual written drafts. However, little is known about how the automated feedback provided by an AWE program affects students’ writing performance in an authentic classroom, or about how to make the most of it to improve students’ writing performance effectively, especially for ESL/EFL undergraduate students. This paper offers an overview of research on the effectiveness of automated feedback through a literature review. Following the inclusion and exclusion criteria, eleven articles published in the past five years were included in the analytical synthesis. The literature review matrix used for the synthesis reveals research gaps in the previous literature on the levels of effectiveness of automated feedback, including the lack of delayed post-test designs, of attention to writing performance in terms of writing traits, and of attention to students’ writing strategies regarding the use of the AWE program. The conclusion highlights the need for future research to bridge these gaps by exploring the long-term, internalized impact of embedding automated feedback in an advanced teaching method on both students’ overall writing performance and their analytic writing scores.


Introduction
With the support of sophisticated language processing technology, i.e., Natural Language Processing (NLP) and Latent Semantic Analysis (Landauer, Foltz, & Laham, 1998), as well as innovations in artificial intelligence, Automated Writing Evaluation (AWE) programs have captured the attention of ESL/EFL writing researchers in recent years. The myriad existing AWE programs consist of two main systems. The first is the automated essay scoring (AES) or automated essay evaluation (AEE) system, which provides a holistic score, and sometimes analytic scores, based on weightings derived from several linguistic features extracted from the essay (Page, 2003; Liu & Kunnan, 2016), using, for example, corpus-based assessment (Lang, Li, & Zhang, 2019) or a machine-learning algorithm (Bunch, Vaughn, & Miel, 2016; Shermis, Burstein, Elliot, Miel, & Foltz, 2016) to produce scores that simulate those of human raters (Shermis & Hamner, 2013; Palermo & Thomson, 2018). The second is the timely, individualized, diagnostic artificial-intelligence feedback (Chapelle, Cotos, & Lee, 2015; Li, Link, & Hegelheimer, 2015) on writing samples. AWE programs also serve as interactive learning platforms. Most of them offer both built-in and "customizable prompt" options (Palermo & Wilson, 2020) for teachers to assign, along with a variety of ways for teachers to comment, such as general comments from a macro perspective and text-embedded comments from a micro perspective. Moreover, in programs such as MI Write and Pigai, student users can revise their drafts according to feedback received from the AWE system, the teacher, and even peers.
The writing practice via the AWE program is labeled as one essay with multiple drafts (Dikli & Bleyle, 2014; Palermo & Wilson, 2020). To address the validity of the functions mentioned above, a large body of prior scholarship has focused on the consistency or agreement between the AES/AEE system and human raters, on users' perceptions, and on users' strategies. However, little research has investigated the pedagogical value of automated feedback for improving students' overall writing competence (Stevenson & Phakiti, 2014; Wilson & Czik, 2016). Yet the instant diagnostic feedback on multiple linguistic features (e.g., grammar, vocabulary, organization, and mechanics) has potential pedagogical value for the English writing class, and this potential should be investigated thoroughly.
To address this matter, this literature review aims to identify the main indicators related to the effectiveness of automated feedback provided by the AWE program in improving undergraduate students' writing competence, in the hope of shedding some light on future research, particularly in the ESL/EFL context.

Accuracy of the Automated Feedback in AWE
The majority of previous scholarship explores the accuracy of automated feedback by investigating the consistency between the AWE system and human raters (e.g., Zhang, 2020; Ranalli, Link, & Chukharev-Hudilainen, 2017; Lang, Li, & Zhang, 2019), while the literature investigating its long-term effect on students' writing ability remains insufficient.
With regard to the accuracy of automated feedback, there is a strong consensus that the AWE system is weak in detecting deep-level errors, i.e., in content and organization, but superior in detecting surface-level errors, i.e., in grammar, syntax, and mechanics (Wang, 2020; Lang, Li, & Zhang, 2019). More specifically, since the AWE system offers real-time automated corrective feedback on each draft submission, sufficient diagnostic feedback helps students reduce the frequency of recurring errors and thereby improve their writing accuracy. However, in some cases, students found that even when they made all the suggested corrections, their automated score did not improve (Li, Link, & Hegelheimer, 2015). This suggests that revision in the writing process goes far beyond error correction at the microstructural level (Flower & Hayes, 1981). Regarding the weak deep-level feedback, Liu and Kunnan (2016) point out that the AWE program WriteToLearn hardly detects off-topic essays and that its automated feedback at the macrostructural level is general and vague. Moreover, in the Chinese EFL context, the AWE system fails to detect typical errors of Chinese EFL students that are rooted in negative transfer from Mandarin, i.e., Chinglish (Liu & Kunnan, 2016). In addition, Chinese EFL students feel frustrated because they cannot clearly understand the monolingual automated feedback, which is given in English, the target language (Ding, 2008).

Users' Perceptions on the Use of AWE
Users' perceptions of the AWE system are one of the powerful indicators of its potential and effectiveness (Wilson & Roscoe, 2020). The literature shows that interviews and questionnaires are the most frequent methods for eliciting the perceptions of either students or teachers. Since students are the main users, most studies focus on students' perceptions. On the whole, students hold a positive perception when the automated scores and artificial-intelligence feedback they receive are accurate and valid (Roscoe et al., 2017; Ranalli, 2018). That is, their perceptions of automated feedback depend mainly on how accurate that feedback is, which is also positively correlated with their perception of the automated score. In addition, students confirm that, compared to teacher feedback, automated feedback performs better in the aspects of language use, syntax, and mechanics (Wang, 2020). On the other hand, on the surface, inaccurate automated feedback can lead to negative attitudes and thus a low uptake rate (Liu & Kunnan, 2016), and students may feel that interaction with an artificial-intelligence rater lacks human social interaction (Wang, Shang, & Briody, 2013). On a deeper level, however, some scholars argue that these phenomena support combining the AWE program with teacher feedback in instruction. Zhang (2020) found that students' negative behaviors toward automated feedback do not automatically mean that they have not benefitted from using AWE. The meta-process of screening automated feedback reveals a higher level of corrective ability in students' writing after they receive automated feedback.
Teachers' perceptions of the use of AWE are rather mixed. Because AWE scoring lacks accuracy, some teachers are concerned that it may impede the improvement of students' writing accuracy (Li, Link, & Hegelheimer, 2015). On the other hand, other research has found that teachers do benefit from AWE systems. First, the automated scoring system reduces teachers' burdensome daily scoring work (Warschauer & Ware, 2006; Li, Link, & Hegelheimer, 2015; Stevenson & Phakiti, 2019; Palermo & Wilson, 2020), even in standardized language tests; for example, E-rater has been implemented in the GMAT (Lang, Li, & Zhang, 2019). Moreover, adequate automated feedback has the potential to allow teachers to focus more on providing higher-level feedback, such as feedback on content and organization (Jiang, Yu, & Wang, 2020; Link, Dursun, Karakaya, & Hegelheimer, 2014; Li, Link, & Hegelheimer, 2015), which makes up for the weakness of automated feedback.

Objective of the Literature Review
This review focuses on current scholarly works on the impact of the automated feedback provided by the AWE program on students' writing performance in the ESL/EFL context, in order to identify the most salient potentials to be highlighted and the problems to be addressed in future studies. Specifically, this review addresses the following research objective: to determine the levels of effectiveness of the AWE's automated feedback in improving students' writing performance.

Reviewing the Literature
This review followed the guidelines of the PRISMA-P (Preferred Reporting Items for Systematic reviews and Meta-Analyses-Protocols) statement (Moher et al., 2009); the PRISMA-P 2015 checklist contains 17 numbered items, including 26 sub-items, covering administrative information, introduction, and methods. In addition, to guard against selection bias in identifying the final literature, the inclusion and exclusion criteria proposed by Gough, Oliver, & Thomas (2012) were adopted to help the researchers rule out irrelevant studies; the authors selected scholarly works by "study design and the population, intervention/issue, comparison, outcome and context/time" (p.125) (see Table 1). A separate discussion was then developed from the results of the literature review matrix, showing what lessons can be learned from past studies to create possible and purposeful directions for future research.

Search Strategy
A structured search strategy was divided into two stages. The initial search stage of this literature review, undertaken in February 2021, included peer-reviewed articles published in English in the past five years that investigated AWE software in any aspect. In total, 664 articles were initially identified through the following steps: a) searching electronic databases (i.e., Web of Science, Taylor & Francis, JSTOR, and ProQuest); and b) following up on the reference sections of the identified articles.
The primary search results were identified by using the Boolean operator "OR" with the keywords automated writing evaluation, automated essay evaluation, and automated essay scoring. The search continued up to the saturation point, at which no new literature was found whether the complete terms or their widely accepted abbreviations were used. In the second stage, the data were screened step by step (Figure 1 illustrates the logic and data screening process of this review). During the data screening process, two coders were invited to screen the titles and abstracts against the reasons for inclusion and exclusion (see Table 1); these criteria were examined in the pilot study by testing Cohen's kappa coefficient k (Cohen, 1960), and inter-rater reliability was high (k = .88). Consequently, 655 articles were excluded, and the remaining 11 articles were included for synthesis. and System. Moreover, based on the reviewed articles, the effectiveness of automated feedback provided by the AWE program was examined at four levels: students' overall writing performance, analytic writing performance, revision ability, and writing knowledge. Of these 11 identified studies (N=11), three used a quantitative research design (N=3), seven used a mixed-method research design (N=7), and one employed an action research design (N=1). Since a controlled trial design was one of the inclusion criteria, no qualitative study was involved. With regard to the quantitative research design, nine of the eleven studies in the sample chose a quasi-experimental research design.
elt.ccsenet.org English Language Teaching Vol. 15, No. 7; 2022
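
The inter-rater agreement reported for the screening step is Cohen's kappa (Cohen, 1960), which corrects the raw agreement between two coders for the agreement expected by chance. The following Python sketch illustrates the calculation only; the function name `cohens_kappa` and the include/exclude decisions are hypothetical and are not data from this review.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal proportions per label.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical screening decisions for four abstracts (illustration only).
coder_1 = ["include", "include", "exclude", "exclude"]
coder_2 = ["include", "exclude", "exclude", "exclude"]
print(cohens_kappa(coder_1, coder_2))  # 0.5
```

In this toy example the coders agree on three of four abstracts (0.75), but half of that agreement is expected by chance, so kappa is 0.5; a value such as the .88 reported above indicates agreement well beyond chance.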

Data Abstraction and Analysis
To address the research objective (RO), the levels of effectiveness of the automated feedback provided by the AWE program in improving students' writing performance, as investigated by the 11 identified studies, were coded. The analysis was then reported on the basis of a literature review matrix (see Appendix), which was developed by extracting the following information from each journal article: author(s), year of publication, journal, the title of the article, underpinning theory or theoretical framework, research method, participants or sampling, instruments, and main findings.

Limitations
Even though this literature review was carried out rigorously to a large extent, the search strategy may be limited in some respects. First, the time scope of the search may cause publication bias: articles published after February 2021 are not included, so the findings may differ under a different time scope. Furthermore, although the search in four databases was exhaustive and continued until the data reached saturation, conference proceedings and grey literature were excluded. In addition, the inclusion and exclusion criteria concerning the sample, the research methods, and the research settings in which the AWE program was implemented narrow the scope of the review. Therefore, further research could expand the search scope by increasing the number of databases, publication categories, and research settings.

Findings and Discussion
After all the different themes were coded, four main indicators of the effectiveness of automated feedback provided by the AWE program emerged, i.e., students' overall writing performance, analytic writing performance, revision ability, and writing knowledge.

Students' Overall Writing Performance
Three studies (N=3) focus specifically on the influence of automated feedback on learners' overall writing performance. Wang (2019) investigated the effect of the Pigai program by comparing two treatment conditions at a vocational college in China: students in the control group (CG) received only teacher feedback, whereas those in the experimental group (EG) received both teacher feedback and automated feedback provided by the Pigai program. The results revealed that the experimental group outperformed the control group in writing performance after an 18-week intervention. Wang (2019) attributed this significant improvement in students' overall writing competence to their constant and independent revision in response to the real-time automated feedback, which reverses students' passive learning position. Parra and Calero (2019) conducted a quasi-experimental study in Ecuador to compare the effectiveness of two different AWE programs in facilitating teachers' instruction (i.e., experimental group 1 used Grammarly and experimental group 2 used Grammark). The findings show that after the 8-week intervention, both experimental groups improved their overall writing performance significantly from pre-test to post-test, and that the effects of Grammarly and Grammark on students' overall writing competence are positive and similar. In addition, they came to the same conclusion as Wang (2019) that the immediacy and privacy of the diagnostic automated feedback are particularly useful in developing learner autonomy. However, as far as the methodology is concerned, the lack of a control group means that alternative explanations for the conclusion drawn from this quasi-experimental research cannot be ruled out.
In another study, Lee (2020) adopted a test-retest design to test the long-term effect of automated feedback provided by AWE-Criterion on undergraduate students' writing development in South Korea. According to the descriptive statistical results, both participants improved their holistic scores across the pre- and post-tests, ranging from 10 to 40. It should be noted, however, that Lee (2020) reached these findings with a test-retest design, which is more appropriate for testing the reliability of an instrument than for use as a research design (Fraenkel, Wallen, & Hyun, 1993).

Students' Analytic Writing Performance
Two studies paid particular attention to students' analytic writing performance, which provides a specific lens for studying the effects of automated feedback on students' writing performance. Link, Mehrzad, and Rahimi (2020) investigated changes in students' writing performance under two conditions (i.e., control group: process-oriented writing approach (POWA); experimental group: POWA + Criterion) in terms of three analytic writing traits: syntactic and lexical (assessed by dependent clauses per T-unit and mean length of clause), accuracy (assessed by coordinate phrases per clause), and fluency (assessed by complex nominals per clause), across pre-, post-, and delayed tests at an Iranian university. The results show that students in the experimental group retained a notable enhancement only in accuracy in the long run. However, students in the control group made a more comprehensive enhancement: except for word frequency, the remaining 8 writing traits showed significantly different performances. It is an interesting phenomenon that the embedded use of Criterion with POWA led students to narrow their focus to grammatical performance.
In a similar vein, Saricaoglu (2019) conducted a quasi-experimental study to ascertain the effects of the automated feedback provided by ACDET, a newly developed AWE tool that specializes in analyzing causal discourse and providing formative feedback on causal explanations. The study examined changes in students' performance on written causal explanations between pre- and post-tests, analyzing in particular students' use of causal language features (i.e., conjunctions, adverbs, prepositions, adjectives, verbs, and nouns). Based on frequency counts of each indicator of the written causal explanations, the descriptive and comparative statistical results revealed students' improvement within assignments, while the changes in the total number of each causal language feature between pre- and post-tests uncovered the long-term effects of ACDET on students' ability to write causal explanations.

Students' Revision Ability
A few studies (N=5) focused on the impact of the AWE program on enhancing students' revision ability. To be more specific, since grammar problems are effectively 'treatable' by providing selective error feedback (Bitchener & Knoch, 2009), researchers think there is a potential effect of the AWE program in improving students' grammatical performance via revising.
The usual way of examining the effectiveness of the AWE program on students' grammatical performance is to count the errors in the first and last drafts that students submitted through the AWE system (see Li, Feng, & Saricaoglu, 2017; Saricaoglu & Bilki, 2021; Liao, 2016). For example, Li, Feng, and Saricaoglu (2017) investigated the short-term (within-assignment) and long-term (from pre-test to post-test) effects of AWE-Criterion on the development of grammatical accuracy among ESL students at two proficiency levels (i.e., intermediate-high and advanced-low) in the US. The researchers coded students' grammatical errors into 9 types: word choice, verb form, word form, articles, pronoun, run-on sentence, fragment, sentence structure, and subject-verb agreement. After calculating the error count for each error type within and across assignments according to the formula suggested by Chandler (2003), (error count / essay length) × 100, they set up a multilevel growth model to obtain the trajectory of students' writing accuracy across assignments. The results indicate that the automated feedback provided by Criterion has a positive impact on students' grammatical accuracy. In other words, Criterion has an advantage in developing students' self-revising skills. Moreover, they found that three error categories, namely fragment, run-on sentence, and subject-verb agreement, decreased significantly across the three assignments, indicating that Criterion has a positive impact on students' revision ability at the sentence level. Since the AWE program is potentially used by students on their own outside the classroom, Saricaoglu and Bilki (2021) investigated the impact of students' voluntary use of AWE-Criterion on their revision practice outside the classroom in two different courses (i.e., Introduction to Sociology (IS) and Introduction to Education (IE)) at a private Turkish university.
Without any teacher monitoring, students' use of Criterion was influenced by teachers' attitudes, which is in line with previous studies (see Roscoe et al., 2017; Li, Link, & Hegelheimer, 2015). A comparison of students' error reduction rates between the first and last drafts, within each assignment and across the two assignments, showed that, in addition to the significant decreases in the four error types observed in the low-usage group (IE) (i.e., Subject-Verb Agreement, Possessive, Missing Article, and Missing Comma), the high-usage group (IS) also made significant improvement in a number of aspects, which indicates that a high usage rate of Criterion has the potential to improve students' grammatical revision ability thoroughly. It is also worth noting that the error types were taken from Criterion's own categories and that alternative explanations for the results are usually associated with the accuracy of the automated feedback (Chapelle et al., 2015) provided by Criterion.
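
The normalization attributed to Chandler (2003) above, (error count / essay length) × 100, and the comparison of first and last drafts can be illustrated with a short Python sketch. The function names and the sample figures are hypothetical and are not taken from any of the reviewed studies; in particular, the relative reduction computed here is only one plausible reading of "error reduction rate", assumed for illustration.

```python
def errors_per_100_words(error_count, essay_length):
    """Normalized error rate: (error count / essay length) x 100."""
    if essay_length <= 0:
        raise ValueError("essay_length must be a positive word count")
    return 100 * error_count / essay_length


def relative_error_reduction(first_draft_rate, last_draft_rate):
    """Share of the first-draft error rate removed by the last draft.

    Assumed definition for illustration: (first - last) / first.
    Returns 0.0 when the first draft has no errors to reduce.
    """
    if first_draft_rate == 0:
        return 0.0
    return (first_draft_rate - last_draft_rate) / first_draft_rate


# Hypothetical drafts of one 300-word assignment (illustration only).
first_rate = errors_per_100_words(error_count=12, essay_length=300)
last_rate = errors_per_100_words(error_count=6, essay_length=300)
print(first_rate, last_rate)                            # 4.0 2.0
print(relative_error_reduction(first_rate, last_rate))  # 0.5
```

Normalizing by essay length keeps drafts of different lengths comparable, which matters when students lengthen or shorten their essays while revising.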
Another way to investigate students' revision ability is to count their revision behaviors. Link, Mehrzad, and Rahimi (2020) coded students' revision behaviors into 6 types: no change, remove, add, delete, change, and transpose. According to the descriptive statistical results, 24% of the automated feedback provided by Criterion led to no change in students' writing, which is much higher than the percentage of teacher feedback that resulted in no change (12%). Whether this negative behavior is linked to the accuracy of the automated feedback or to other factors requires researchers to incorporate qualitative research methods to reveal the facts behind the numbers.
It is worth noting that two other studies (see Huang & Renandya, 2020; Hou, 2020) investigated the effects of the AWE program on changes in students' writing performance from pre-test to post-test by comparing the holistic scores of students' first drafts with those of their last revised drafts. In this situation, it is not appropriate to generalize conclusions about students' revision ability to their overall writing ability, because in most authentic writing tests students' writing competence mainly refers to their expression of the knowledge acquired about a topic (Woolfolk, 2013). In addition, there is usually little time left for students to revise their writing. In other words, students' writing performance involves more than revising their drafts.

Students' Writing Knowledge
Since the AWE program offers learners multiple opportunities to revise their written pieces, which boosts their internalization of grammar knowledge and writing knowledge, one study investigated the impact of AWE-Grammarly on students' learning of the passive structure in writing. Qassemzadeh and Soleimani (2016) conducted a quasi-experimental study comparing the writing knowledge test scores (i.e., multiple-choice tests) of two groups of learners, one receiving teacher feedback and the other receiving AWE-Grammarly feedback, across pre-, post-, and delayed post-tests. The results revealed that students who received the AWE-Grammarly intervention underperformed those who received only teacher feedback in the post-test but outperformed them in the delayed post-test. This suggests that, compared to teacher feedback, the AWE program has an advantage in promoting students' grammatical performance in the long run. On the other hand, it also indicates that the use of AWE is not always positive enough to support self-regulated learning.

Conclusion
In this literature review, the authors have reviewed a number of studies published within the past five years on the levels of effectiveness of the automated feedback provided by the AWE program in improving ESL/EFL students' writing performance in higher education. In general, the implementation of the AWE program shows four potentials. First, diagnostic automated feedback is a crucial learning resource that supplements teachers' instruction and improves students' overall writing performance. Secondly, the sufficiency and immediacy of the automated feedback help students revise their drafts independently and in private, which can foster learner autonomy. Moreover, automated feedback has an outstanding advantage in improving the accuracy of students' writing. In addition, automated feedback helps students internalize writing knowledge, such as the passive structure.
Further research on the effects of the automated feedback provided by the AWE program is recommended in the following directions. First, future research is needed to fill the gap in delayed test designs by testing AWE's effects on students' overall performance under an advanced teaching method. Secondly, studies are called for to investigate its effects on students' analytic writing performance. Finally, specific attention needs to be paid to the influence of automated feedback on changes in students' writing strategies.