The Validity and Reliability of Rhythm Measurements in Automatically Scoring the English Rhythm Proficiency of Chinese EFL Learners

This article examines the validity of rhythm measurements in capturing the rhythmic features of Chinese English. It also explores the reliability of the valid measurements when they are applied to automatically scoring the English rhythm proficiency of Chinese EFL learners. Two experiments were carried out. First, thirty English majors and five native English speakers read ten English sentences. The participants were divided into four proficiency groups according to human scoring, and seven previously proposed rhythm measurements were then compared across the four groups. One-way ANOVA results showed that five rhythm measurements were valid for distinguishing the English rhythm patterns of the four proficiency groups. Second, based on the valid measurements, an automatic scoring experiment for English rhythm proficiency was conducted using Multiple Regression. The correlation coefficient between the autoscores and the scores given by experienced teachers reached 0.866, indicating high reliability of the objective evaluation of the English rhythm proficiency of Chinese EFL learners.


Introduction
Acoustic-phonetic rhythm measurements have successfully identified the rhythmic features of languages in L1 studies. Recently, researchers have begun to apply these measurements in L2 studies and have found that some of them are valid for recognizing the rhythmic characteristics of second languages. Furthermore, some researchers have tried to employ the valid measurements in autoscoring EFL learners' rhythm proficiency. Up to now, however, few empirical studies have examined Chinese English and Chinese EFL learners. Thus, this article examines the validity of rhythm measurements in capturing the rhythmic features of Chinese English. It also explores the reliability of the valid rhythm measurements when applied to automatically scoring the rhythm proficiency of Chinese EFL learners.

Rhythm
Rhythm is one of the three aspects of prosody, along with stress and intonation. According to Zhang (2002), rhythm refers to the basic recurrence of elements or features in alternation with opposite or different elements or features. Speech rhythm is essentially a tendency for stressed syllables to occur at more or less regular intervals of time.
Every language has its own characteristic rhythm. Following Pike (1945) and Abercrombie (1967), the languages of the world can be classified into three rhythmic categories from the perspective of human perception: stress-timed, syllable-timed and mora-timed. These categories are defined by hypotheses about units of equal duration. Roach (1982) believes that stress-timed languages exhibit more nearly equal intervals between stresses or rhythmic feet, syllable-timed languages display near isochrony between successive syllables, and mora-timed languages have nearly isochronous morae.
Speech rhythm in English is said to be stress-timed. Wang (2002) claims that English rhythm is influenced by factors such as stress, linking, assimilation, elision, and weak forms. As for Chinese, Gui (1985) holds that its speech rhythm is syllable-timed: every word is pronounced distinctly except for a few weak auxiliary words, and the duration of each syllable is relatively equal. Owing to negative transfer from the first language, the English spoken by many Chinese EFL learners is an intermediate language whose rhythm tends to be syllable-timed rather than stress-timed. The most distinctive feature of Chinese English lies in stress. Lin (2007) points out that most Chinese EFL learners often misplace word stress and tend to give equal stress to every word in an English sentence. In addition, many Chinese EFL learners handle weak forms and linking poorly, so their English contains more pauses and hesitations than that of native speakers. In sum, Chinese EFL learners are not skilled at adjusting English rhythm through stress, linking, assimilation, elision, and weak forms. Hence the English they speak is more syllable-timed, and its syllable structures appear to be less complex than those of native English speakers.

Rhythm Measurements
Phoneticians have recently become interested in acoustic-phonetic measurements of the rhythmic structure of languages, with the aim of deriving a tendency towards stress- or syllable-timing from the measurements. For convenience of description, the seven rhythm measurements successfully proposed in L1 studies can be classified into three kinds according to their measuring method: raw interval measurements (RIM), rate-normalized interval measurements (NIM) and Pairwise Variability Indices (PVI).
Based on the observation that stress-timed languages have a more complex and variable syllable structure than syllable-timed languages, Ramus, Nespor, and Mehler (1999) propose three measurements of the temporal characteristics of vocalic and consonantal intervals. These three raw interval measurements (RIM) are the proportion of vocalic intervals (%V) (Note 1), the standard deviation of vocalic intervals (ΔV) and the standard deviation of consonantal intervals (ΔC). Among them, %V is predicted to be larger in syllable-timed languages than in stress-timed languages, while ΔV and ΔC are predicted to be smaller.
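The three raw interval measurements can be sketched in a few lines of Python. This is a minimal illustration, not the study's own Matlab or Excel procedure: the function name `raw_interval_measures`, the example durations, and the choice of the population standard deviation (`pstdev`) are all assumptions made here.

```python
from statistics import pstdev

def raw_interval_measures(vocalic, consonantal):
    """Return (%V, deltaV, deltaC) for one utterance.

    vocalic / consonantal: lists of interval durations in seconds.
    """
    total = sum(vocalic) + sum(consonantal)
    percent_v = 100.0 * sum(vocalic) / total   # share of vocalic time
    delta_v = pstdev(vocalic)                  # SD of vocalic intervals
    delta_c = pstdev(consonantal)              # SD of consonantal intervals
    return percent_v, delta_v, delta_c

# Hypothetical durations (seconds); not data from the study.
v = [0.10, 0.20, 0.10, 0.20]
c = [0.05, 0.15, 0.05, 0.15]
percent_v, delta_v, delta_c = raw_interval_measures(v, c)
```

Whether the sample or the population standard deviation is used changes ΔV and ΔC slightly, so the same choice should be applied to every speaker being compared.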
Later, some studies (Barry, Andreeva, Russo, Dimitrova, & Kostadinova, 2003; Dellwo & Wagner, 2003) found that ΔC varied considerably with speech rate, at least in some languages including English and German. If this is the case, speech rate normalization (Note 2) of the target utterances is needed when ΔC and ΔV are to be used. Hence, Dellwo (2006) puts forward a rate-normalized version of consonantal variability, VarcoC. VarcoV, the rate-normalized standard deviation of vocalic interval duration, was soon added by White and Mattys (2007) to complete the inventory of Rate-normalized Interval Measurements (NIM). The formulas are as follows, where meanC and meanV are the mean durations of consonantal and vocalic intervals.
Calculation formula of VarcoC:
$$\mathrm{VarcoC} = \frac{100 \times \Delta C}{\mathrm{mean}C} \tag{1}$$
Calculation formula of VarcoV:
$$\mathrm{VarcoV} = \frac{100 \times \Delta V}{\mathrm{mean}V} \tag{2}$$
Like ΔV and ΔC, both are predicted to be smaller in syllable-timed languages than in stress-timed languages.
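The effect of rate normalization can be illustrated with a short Python sketch. It assumes the usual definition of the Varco measures (100 times the standard deviation divided by the mean); the function name `varco` and the durations are hypothetical, and the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def varco(durations):
    """Rate-normalized variability: 100 * SD / mean of interval durations."""
    return 100.0 * pstdev(durations) / mean(durations)

# Hypothetical consonantal intervals (seconds) at a normal tempo,
# and the same utterance spoken twice as fast.
normal = [0.05, 0.15, 0.05, 0.15]
fast = [d / 2 for d in normal]

varco_normal = varco(normal)
varco_fast = varco(fast)
delta_normal = pstdev(normal)
delta_fast = pstdev(fast)
```

In this toy case the raw standard deviation halves when the tempo doubles, while the Varco value stays the same, which is the property that motivates the normalized measures.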
Additionally, based on the observation that stressed and unstressed vowels vary widely in duration in stress-timed languages whereas vowel durations vary less in syllable-timed languages, Low, Grabe, and Nolan (2000) introduce the normalized Pairwise Variability Index (nPVI). Later, Grabe and Low (2002) add a second index, the raw Pairwise Variability Index (rPVI), based on the variability of consonantal intervals. The formulas are as follows, where $d_k$ is the duration of the kth interval and $m$ is the number of intervals.
Calculation formula of nPVI:
$$\mathrm{nPVI} = 100 \times \left[ \sum_{k=1}^{m-1} \left| \frac{d_k - d_{k+1}}{(d_k + d_{k+1})/2} \right| \Big/ (m-1) \right] \tag{3}$$
Calculation formula of rPVI:
$$\mathrm{rPVI} = \left[ \sum_{k=1}^{m-1} \left| d_k - d_{k+1} \right| \Big/ (m-1) \right] \tag{4}$$
Owing to factors such as frequent vowel reduction and linking within words, the Pairwise Variability Indices (PVIs) are predicted to be larger in stress-timed languages than in syllable-timed languages.
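The two indices differ only in whether each successive difference is divided by the mean of the pair. A minimal Python sketch follows; the function names and the example durations are illustrative assumptions. In the study, nPVI is applied to vocalic intervals and rPVI to consonantal intervals.

```python
def rpvi(d):
    """Raw PVI: mean absolute difference between successive intervals."""
    m = len(d)
    return sum(abs(d[k] - d[k + 1]) for k in range(m - 1)) / (m - 1)

def npvi(d):
    """Normalized PVI: each difference is divided by the pair mean, times 100."""
    m = len(d)
    total = sum(
        abs((d[k] - d[k + 1]) / ((d[k] + d[k + 1]) / 2))
        for k in range(m - 1)
    )
    return 100.0 * total / (m - 1)

# Hypothetical interval durations (seconds).
alternating = [0.10, 0.30, 0.10, 0.30]   # short/long alternation
even = [0.20, 0.20, 0.20, 0.20]          # equal durations

npvi_alt = npvi(alternating)
rpvi_alt = rpvi(alternating)
npvi_even = npvi(even)
```

A sequence that alternates short and long intervals yields a high index, and a sequence of equal intervals yields zero, matching the prediction that PVI values are larger for stress-timed rhythm.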

Previous Empirical Studies on Rhythm Measurements
The above rhythm measurements have been successfully applied in monolingual studies (e.g., Grabe & Low, 2002; Dellwo, 2006; White & Mattys, 2007; Ramus et al., 1999) to identify different first languages. However, they appear to be less consistent in differentiating non-native speech. Low, Grabe, and Nolan (2000) found that nPVI rather than rPVI could differentiate Singapore English from British English. Stockmal, Markus, and Bond (2005) reported that ΔC and rPVI significantly distinguished Latvian spoken by native speakers from Latvian spoken by Russian learners of different proficiencies, while %V, ΔV and nPVI showed no significant difference among these groups. White and Mattys (2007), with native English speakers and Spanish learners of English as their participants, found that %V and VarcoV were more useful for detecting non-native speech rhythm than ΔV, ΔC, nPVI and rPVI. These diverse results show that no single set of rhythm measurements distinguishes all non-native rhythm characteristics. Vowel-based measurements appear to be more suitable for detecting the rhythm features of Singapore English and Spanish English, while consonant-based measurements are more effective in capturing those of Russian Latvian. Thus far, however, few empirical studies other than Chen and Wang (2013) have focused on the non-native English spoken by Chinese EFL learners. This issue deserves investigation for two reasons. On the one hand, the results of rhythm measurements on non-native rhythms are not uniform. On the other hand, more and more Chinese people are learning English, so it would be valuable to study the characteristics of English rhythm in this growing group.
Furthermore, some phoneticians have recently tried to employ acoustic-phonetic rhythm measurements in autoscoring the quality of EFL learners' oral language. Chung, Jang, W. Yun, I. Yun, and Sa (2008) were the first to use rhythm measurements to autoscore the pronunciation accuracy of English speech produced by Korean learners of English. On this basis, Jang (2008) further improved the experiment and explicitly proposed an autoscoring method, Multiple Regression, for English oral proficiency. The results of these experiments are not fully convincing, because rhythm characteristics alone are not enough to reflect overall proficiency in English pronunciation. Nevertheless, these attempts illustrate the potential of rhythm measurements for automatic scoring. Given that no empirical study has explored autoscoring for Chinese EFL learners, this article tries to improve the existing experimental method in order to autoscore the English rhythm proficiency of Chinese EFL learners. It may also provide a theoretical foundation for a Computer-Assisted Language Learning System for the English rhythm of Chinese EFL learners.
In order to achieve the above purposes, the current study seeks to answer the following questions:
1) Which rhythm measurements successfully proposed in L1 studies are valid for capturing the English rhythm patterns of Chinese EFL learners with different proficiencies?
2) How reliable are the valid rhythm measurements in automatically scoring the English rhythm proficiencies of Chinese EFL learners?

Participants
Thirty English majors from four different grades in the Faculty of English Language and Culture (FELC), Guangdong University of Foreign Studies, were selected. There were 23 female students and 7 male students. In order to provide a stress-timed rhythm baseline, five native speakers from Britain were also chosen. Two of them were male and the other three were female.

Instruments
The reading materials (see Appendix A) were ten sentences ranging from 3 to 9 words in length. In consideration of the students' English level, the selected sentences contained no infrequent words and no complex sentence structures. All of the sentences were declarative, a type widely used in spoken English. In order to properly exhibit the characteristics of English rhythm, each sentence included at least one factor that can adjust the rhythm, such as linking, elision, assimilation or weak forms.
In addition, the present study used the phonetic research tool Praat to segment the recordings of the 35 subjects. Matlab and Excel were used to calculate the results of the measurements from the segmented information. Finally, the data were analyzed with SPSS 17.0 and Matlab.

Recording
The subjects were asked to read ten sentences. They were given time to read the sentences before the recordings were made. The non-native recordings were made in a recording studio and the native recordings were made in a quiet room. All the recordings were of good quality and had little noise.

Human Scoring
The English utterances spoken by Chinese EFL learners do not all share the same rhythm pattern. A study by Feng (2010) found that Chinese EFL learners with different rhythm proficiencies had different rhythmic patterns: the better their English rhythm, the more stress-timed it is. It was therefore necessary to divide the learners into several representative groups according to their rhythm proficiency before the experiment.
The present study adopted the Absolute Scales scoring approach proposed by Diekerson (1997). Under this approach, scoring is based on one standard regardless of the students' years of learning English, their improvement, and so on. According to Wang (2002), stress is the basis of English rhythm, and English rhythm is also embodied in other factors such as linking, assimilation, elision, and weak forms. Thus a standard for scoring rhythm proficiency was drawn up (see Table 1).

Table 1 (continued). C: Poor stress; adjusts English rhythm without using any techniques; quite similar to the syllable-timed rhythm.
The recordings of the 30 Chinese EFL learners were scored by one experienced Chinese university English teacher and one British university teacher. The two teachers separately gave each student an overall score according to the English rhythm displayed in the ten recorded sentences. Although an evaluation standard had been put forward, it was still necessary to test the reliability of the scores in case the two teachers interpreted the standard inconsistently. Thus, a Pearson correlation and a t-test were conducted in SPSS 17.0.
The correlation between the two teachers' scores was significant, r = .693, indicating that the two teachers applied a consistent standard when scoring. In addition, a paired-samples t-test was run to see whether the two teachers' evaluations differed significantly. The result showed that the difference between the two teachers was not significant, t(29) = 1.682, p = .103. Thus the scores were considered reliable enough for the following data analysis.
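The two inter-rater checks can be written out by hand as follows. This is a sketch with hypothetical scores, not the teachers' actual ratings, and it computes only the statistics themselves; the p-values reported above came from SPSS. With 30 learners the paired test has 29 degrees of freedom, as in the reported t(29).

```python
from math import sqrt
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def paired_t(x, y):
    """Return (t, df) for a paired-samples t-test on two raters' scores."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Hypothetical scores from two raters (3 = A, 2 = B, 1 = C).
rater1 = [3, 2, 1, 3, 2, 1]
rater2 = [3, 2, 2, 3, 1, 1]

r = pearson_r(rater1, rater2)
t, df = paired_t(rater1, rater2)
```

A high correlation shows that the raters rank learners similarly, while a non-significant paired t shows that neither rater is systematically stricter; the two checks answer different questions, which is why both are reported.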
To allow a parallel comparison, three proficiency groups (Level A, Level B and Level C) were each designed to contain five speakers, matching the size of the native-speaker group. Accordingly, fifteen non-native speakers, each of whom had been assigned the same level by both teachers, were selected, five for each proficiency group.

Segmenting and Calculating
The recordings were segmented in Praat. Every consonant and vowel in the sentences was segmented, and the segmented information was saved in Excel. Next, the results of the rhythm measurements for every subject were calculated in Excel and Matlab from the segmented data.
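Interval-based measurements in the tradition of Ramus et al. (1999) operate on vocalic and consonantal intervals, that is, on runs of adjacent vowels or adjacent consonants, rather than on single segments. The sketch below shows one way to derive such intervals from a segment list. The `(label, duration)` input format and the function name `to_intervals` are assumptions for illustration; the study's actual Excel layout is not described, and the handling of pauses is left out.

```python
from itertools import groupby

def to_intervals(segments):
    """Merge adjacent same-type segments into vocalic/consonantal intervals.

    segments: list of (label, duration) pairs, label "V" or "C".
    Returns (vocalic_durations, consonantal_durations).
    """
    vocalic, consonantal = [], []
    for label, run in groupby(segments, key=lambda s: s[0]):
        duration = sum(d for _, d in run)   # total length of the run
        (vocalic if label == "V" else consonantal).append(duration)
    return vocalic, consonantal

# Hypothetical segmentation of one utterance: C C V C V V C
segments = [("C", 0.06), ("C", 0.04), ("V", 0.12), ("C", 0.08),
            ("V", 0.05), ("V", 0.07), ("C", 0.09)]
vocalic, consonantal = to_intervals(segments)
```

The resulting duration lists are the inputs that the interval, Varco and PVI formulas expect.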

Analyzing the Data
After the data were collected, one-way ANOVA was conducted by using SPSS 17.0 to find out the valid measurements which can distinguish the native and non-native English rhythm patterns.
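For readers who want to see what the one-way ANOVA computes, the F statistic can be sketched as below. The values are hypothetical and only mirror the design of four groups with five speakers each, which gives the degrees of freedom (3, 16) reported in the results; the p-value lookup that SPSS performs is omitted.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Variation of values around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical values of one measurement for four groups of five speakers.
groups = [
    [5, 6, 7, 6, 6],   # native speakers
    [4, 5, 6, 5, 5],   # Level A
    [3, 4, 5, 4, 4],   # Level B
    [2, 3, 4, 3, 3],   # Level C
]
f_stat, df_b, df_w = one_way_anova(groups)
```

A large F means that the group means differ by more than the spread within groups would suggest, which is the sense in which a measurement is treated as valid here.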
With the valid measurements found in experiment one, an automatic scoring experiment was then performed using Multiple Regression. This technique has been widely applied in the automatic assessment of essay writing (e.g., Page, 1994; Page & Petersen, 1995) and was first employed in automatic scoring of oral English by Jang (2008). In its general form, the equation can be written as
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$$
where $\hat{y}$ is the predicted score, $x_1, \ldots, x_n$ are the features and $\beta_0, \ldots, \beta_n$ are the regression coefficients. In the present study, the valid measurements were regarded as the features used to score overall rhythm ability.
The coefficients were estimated by Partial Least Squares in Matlab. Twelve samples from the four groups were randomly chosen as training data for estimating the regression coefficients (βs).
The remaining eight samples from the four groups were used as testing data and entered into the Multiple Regression equation. After the autoscores for these eight samples were calculated, a Pearson correlation was run between the autoscoring results and the human scores.
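The train-then-test procedure can be sketched as follows. Note the substitution: the study estimated the coefficients with Partial Least Squares in Matlab, whereas this sketch uses ordinary least squares through the normal equations, chosen only because it fits in a few lines of plain Python. All data, the two-feature setup and the function names are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(features, scores):
    """Estimate [b0, b1, ...] by ordinary least squares (normal equations)."""
    x = [[1.0] + list(row) for row in features]   # prepend intercept column
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(x, scores)) for i in range(k)]
    return solve(xtx, xty)

def predict(betas, row):
    """Apply the regression equation to one feature row."""
    return betas[0] + sum(b * v for b, v in zip(betas[1:], row))

# Hypothetical training data: two features, scores built as 1 + 2*x1 - x2.
train_x = [(1, 0), (2, 1), (3, 1), (4, 3), (5, 2), (6, 5)]
train_y = [1 + 2 * a - b for a, b in train_x]

betas = fit_ols(train_x, train_y)
auto = predict(betas, (2, 2))   # held-out sample; its true score is 3
```

With five correlated rhythm features and only twelve training samples, plain least squares can be unstable, which is one plausible reason for preferring Partial Least Squares in a setting like this one.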

Verification of Raw Interval Measurements (RIM)
The average values of the Ramus measurements for the four groups are given in Table 2. The table shows that the percent vocalic interval of native speakers is lower than that of Level B and Level C but higher than that of Level A. As English has more vowel reduction than Chinese, %V in native speakers was predicted to be lower than in Chinese EFL learners, who are influenced by first language transfer and have a more syllable-timed rhythm. Thus, %V does not conform to the prediction. However, ΔC and ΔV accord with the prediction: the mean values are largest in native speakers and decrease as the proficiency level gets lower.
Note. %V = percent vocalic intervals, ΔC = SD of consonantal intervals, ΔV = SD of vocalic intervals.
In order to explore whether the group means of the measurements that fit the prediction differ statistically, one-way ANOVA was run. The mean results of consonantal interval variability (ΔC) were significantly different, F(3, 16) = 6.133, p = .006, showing that the difference between proficiency groups was statistically significant. The vocalic interval variability (ΔV) also reached statistical significance, F(3, 16) = 15.055, p < .001, indicating a significant difference among the four groups.
To sum up, two of the three Raw Interval Measurements, ΔC and ΔV, are consistent with the prediction, whereas the percent vocalic interval is not. Moreover, both the consonantal interval variability and the vocalic interval variability differ significantly among the four proficiency levels. Therefore these two measurements are valid for capturing different rhythm proficiencies.

Verification of Rate-normalized Interval Measurements (NIM)
The results of the rate-normalized interval measurements, VarcoC and VarcoV, are given in Table 3. The values of both measurements pattern as predicted: they are highest in native speakers and become lower as the proficiency level declines. The two measurements were then submitted to one-way ANOVAs with proficiency group as the between-subjects factor. Although the lower proficiency groups showed lower VarcoC values, the difference among groups did not reach statistical significance, F(3, 16) = 1.686, p = .210. VarcoV, however, appears to be valid. On the one hand, it is consistent with the prediction. On the other hand, the differences among the four levels did reach statistical significance, F(3, 16) = 5.162, p = .011.
From the above analysis, we can see that among the Rate-normalized Interval Measurements, VarcoV but not VarcoC is effective in distinguishing the rhythm patterns of different proficiency levels.

Verification of Pairwise Variability Indices (PVI)
The Grabe measurements nPVI and rPVI are given in Table 4. The values of nPVI and rPVI were predicted to be higher in stress-timed rhythm than in syllable-timed rhythm. As shown in Table 4, the native speakers, namely the highest level, showed higher values of nPVI and rPVI than the Chinese EFL learners, and the values decrease as the learners' level goes down. Next, the two measurements were submitted to one-way ANOVAs with speaker group as the between-subjects factor. The values of nPVI differed significantly among groups, F(3, 16) = 4.435, p = .019. In addition, rPVI, the variability of consonantal intervals, also differed significantly among groups, F(3, 16) = 5.815, p = .007, being higher for native speakers and lower for the Chinese EFL learners.
In summary, both nPVI and rPVI are in keeping with the prediction, and their average values differ significantly among the four groups. Hence, the Grabe measurements are useful for capturing the different rhythm patterns of the four proficiency levels.

Autoscoring of Rhythm by the Multiple Regressions
From the above data analysis, ΔV, ΔC, VarcoV, nPVI and rPVI were found to be valid for distinguishing the English rhythm patterns of the four proficiency groups. These five measurements were regarded as five features for scoring overall rhythm ability. The equation could be established once the regression coefficients (βs) had been estimated. The remaining eight samples from the four levels were used as testing data, as shown in Table 5. For convenience of calculation, the proficiency scales were converted into numbers: "Native speaker", "A", "B" and "C" became "4", "3", "2" and "1" respectively. These raw data were entered into the equation for rhythm autoscoring. The autoscores for the eight samples were, in order, 3.941 (4), 2.672 (3), 3.515 (4), 2.789 (3), 3.379 (3), 3.167 (3), 0.771 (1), and 3.341 (3).
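The conversion between proficiency labels and numbers, and the rounding of continuous autoscores back to a level, can be sketched as below. The autoscores are the eight values reported in the text, and the bracketed numbers there appear to be these rounded levels. The function name `to_level` and the clamping to the 1 to 4 range are assumptions added here for illustration.

```python
# Scale conversion as described in the text.
SCALE_TO_NUMBER = {"Native speaker": 4, "A": 3, "B": 2, "C": 1}
NUMBER_TO_SCALE = {v: k for k, v in SCALE_TO_NUMBER.items()}

# Autoscores for the eight test samples, as reported in the text.
autoscores = [3.941, 2.672, 3.515, 2.789, 3.379, 3.167, 0.771, 3.341]

def to_level(score):
    """Round to the nearest level and clamp to the 1..4 scale (assumption)."""
    return min(4, max(1, round(score)))

levels = [to_level(s) for s in autoscores]
labels = [NUMBER_TO_SCALE[n] for n in levels]
```

Rounding turns the regression output into the same categorical scale the teachers used, at the cost of discarding how close a speaker is to a level boundary.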
In order to assess the reliability of the autoscoring results, a Pearson correlation was run between the scores calculated by computer and the scores given by experienced teachers on the basis of perception. The correlation between the autoscores and the teachers' scores was statistically significant, r = .866, p = .005, indicating that the objective evaluation is closely related to the subjective evaluation.

Conclusion
The present study has several implications. On the one hand, it provides acoustic-phonetic evidence that the English syllable structures of Chinese EFL learners differ from those of native speakers. The syllable structures of students with lower rhythm proficiency are less complex than those of more proficient students and native speakers. In order to improve Chinese EFL learners' English rhythm, it is important for teachers to instill in students the pronunciation rules for consonants and vowels, such as linking, assimilation, elision, and weak forms, and to urge them to practice accordingly. On the other hand, the high reliability of the autoscoring of English rhythm proficiency by Multiple Regression makes a modest contribution to the study of Computer-Assisted Language Learning Systems for English rhythm. To some extent, it offers a phonetic theoretical foundation for an objective evaluation system for the English rhythm of Chinese EFL learners.
The limitations of the present study cannot be neglected, and they mainly concern the participants. First, each group contains only five participants, which may not be enough to represent the characteristics of every proficiency level. Second, only twelve samples were used as training data for Multiple Regression, which is too few to build a robust model. Third, the participants are not balanced by gender: more female than male participants were included.

Table 1.
Standard for rhythm evaluation
A: Accurate stress; adept with pronunciation techniques such as linking, assimilation, elision, and weak forms; exhibits the stress-timed rhythm well.
B: One or two inaccurate stresses; sometimes uses techniques such as linking, assimilation, elision, and weak forms; influenced by syllable-timed rhythm to some extent.

Table 2.
Average values of RIM for four groups of speakers

Table 3.
Average values of NIM for four groups of speakers

Table 4.
Average values of PVI for four groups of speakers

Table 5.
Testing data for automatic scoring of rhythm ability