A Study on the Oral Disfluencies Developmental Traits of EFL Students — A Report Based on Canonical Correlation Analysis

This paper follows 9 non-English-major EFL students and collects their oral productions in 4 successive oral exams over 2 years. Canonical correlation analysis in SPSS is used to study how disfluency traits develop as language acquisition progresses. We find that, as language acquisition develops, the total number of disfluencies does not decrease as expected but stays constant for a period of time, while the proportions of specific disfluency phenomena change significantly: pauses decrease and self-repairs increase. In addition, grammatical accuracy and language complexity have opposite effects on disfluency traits. In the first year, disfluencies appeared mainly as pauses and repetitions because EFL students paid more attention to grammatical accuracy; in the second year, disfluencies featured more self-repairs and fewer pauses because students shifted their attention to language complexity. We also find that, despite the strong correlations between them, language acquisition accounts for only part of the developmental traits of disfluencies; other factors, such as psychological or social elements, may also play a role.


Introduction
For EFL students, one major goal is to improve their oral English. However, the majority of Chinese EFL students cannot communicate fluently even after several years of English learning, which falls far short of their expectations. In daily classes, instead of giving students specific advice on their individual oral problems, teachers tend to offer general and vague instructions, such as "pay attention to the accuracy of language" or "try to improve your pronunciation". Instructions of this kind are of little benefit to students. Students are not clear about their oral problems, nor do they notice their own improvement in oral English. It is therefore very common for Chinese EFL students to give up oral practice after a period of time, without any sense of achievement. This phenomenon is closely related to the lack of understanding of, and research on, oral disfluencies in foreign languages.
Oral disfluencies generally refer to the non-fluent parts of oral productions (Shriberg, 1994). The term may also refer to disjointed or relatively slow stretches of speech in communication (Starkweather, 1987). From these definitions, we find that disfluencies appear not only as broken language but also as self-repairs, language errors and so on. Dollaghan and Campbell (1992) studied the "disfluencies traits" system and classified disfluencies into 4 groups: pauses, repetitions, self-repairs and orphans. This paper adopts these 4 groups and treats disfluencies as the oral outputs that make oral productions disfluent or unnatural. Dollaghan and Campbell (1992) suggested that each group of disfluencies is an independent phenomenon that reflects a corresponding language learning process. Many scholars have studied disfluencies (Baars, Motley, & MacKay, 1975; Dell & Reich, 1981; Fromkin, 1971; Garrett, 1975; Lee, 1974; Pearl & Bernthal, 1980; Wall & Myers, 1984). Most of them studied a certain aspect of disfluency traits by analyzing the associations between language proficiency and disfluencies. However, most of these studies examined disfluency traits at a specific time rather than their longitudinal development. In addition, most studies focused on changes in disfluency traits under a single language acquisition phenomenon, such as syntax (Gordon & Peterson, 1986; Colburn & Mysak, 1982), rather than several phenomena together. As a result, canonical correlation analysis was seldom used in this research.
In this paper, we study the developmental traits of oral disfluencies longitudinally by analyzing their correlations with language acquisition development. Because more than one language acquisition phenomenon is considered, canonical correlation analysis is applied.

Methodology
Canonical correlation analysis, introduced by Harold Hotelling in 1936, is a way of making sense of cross-covariance matrices. Suppose we have two sets of variables, x1, ..., xn and y1, ..., ym, and there are correlations among the variables. Canonical correlation analysis enables us to find linear combinations of the x's and of the y's that have maximum correlation with each other. Each such pair of linear combinations is called a pair of canonical variates, and the correlation between the two variates of a pair is the canonical correlation coefficient. We need to study several pairs of canonical variates to understand the correlations between the two sets of variables. The first pair has the largest coefficient. The second pair has the second largest coefficient and is uncorrelated with the first pair. The third pair, the fourth pair and the others are found in the same way. Taken together, these pairs of canonical variates carry nearly all of the correlation information between the two sets of variables, although one or two pairs are usually enough to show the correlations in general.
In this experiment, we randomly selected 9 non-English-major EFL college freshmen at Dalian University of Technology and followed their productions in 4 successive oral English tests in 2011-2012. Every student produced a 3-minute speech in each test, on a topic chosen by drawing lots. Next, we transcribed the tape recordings and labeled the disfluency signals and the language acquisition development indicators. Finally, we used canonical correlation analysis in SPSS 13.0 to analyze these data and find the connections between the disfluency phenomena variables and the language acquisition development variables.
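The idea described above can be illustrated with a minimal sketch in Python with NumPy. This is not the SPSS procedure used in this study; the function name `canonical_correlations` and the synthetic data are assumptions made for illustration only. The sketch assumes that both data matrices have more rows than columns and full column rank.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations and weights for X (n x p) and Y (n x q).

    Hypothetical helper: assumes n > p, n > q and full column rank.
    """
    X = X - X.mean(axis=0)          # center each variable
    Y = Y - Y.mean(axis=0)
    Qx, Rx = np.linalg.qr(X)        # orthonormal basis of the X column space
    Qy, Ry = np.linalg.qr(Y)        # orthonormal basis of the Y column space
    # Singular values of Qx^T Qy are the canonical correlations,
    # returned in decreasing order.
    U, s, Vt = np.linalg.svd(Qx.T @ Qy)
    k = min(X.shape[1], Y.shape[1])
    A = np.linalg.solve(Rx, U[:, :k])     # weights of the x's, one column per pair
    B = np.linalg.solve(Ry, Vt.T[:, :k])  # weights of the y's, one column per pair
    return s[:k], A, B

if __name__ == "__main__":
    # Synthetic data only: 4 x-variables and 11 y-variables sharing one latent factor.
    rng = np.random.default_rng(0)
    z = rng.normal(size=(200, 1))
    X = np.hstack([z + 0.5 * rng.normal(size=(200, 1)), rng.normal(size=(200, 3))])
    Y = np.hstack([z + 0.5 * rng.normal(size=(200, 1)), rng.normal(size=(200, 10))])
    r, A, B = canonical_correlations(X, Y)
    print("canonical correlations:", np.round(r, 3))
```

Each column of `A` and `B` gives the weights of one pair of canonical variates. Successive pairs are mutually uncorrelated, which matches the description of the second and later pairs above.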

Disfluencies Phenomena Variables
According to the "disfluencies traits system" of Dollaghan and Campbell (1992), disfluencies can be divided into 4 categories. The disfluency phenomena variables are therefore pause ratios, repetition ratios, self-repair ratios and orphan ratios.
X1 = pause ratio. Pauses in this paper are silent intervals longer than 0.3 seconds within sentences or between sentences (Raupach, 1987).
X2 = repetition ratio. Repetitions are repeated parts that occur within the same sentence and are adjacent to each other (pauses may occur between them).
X3 = self-repair ratio. Self-repairs are revisions of errors in syntactic frames, lexical structures, tenses or pronunciation.
X4 = orphan ratio. Orphans are intrusions of material seemingly unrelated to the topic.
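How the four ratios might be computed from labeled transcripts can be sketched as follows. The sketch takes each ratio as the count of that disfluency divided by the total utterance, following the note to Figure 1. The function names, the gap durations and the counts are hypothetical and are not data from this study.

```python
PAUSE_THRESHOLD_S = 0.3  # pause threshold stated in the text (Raupach, 1987)

def count_pauses(gaps_s, threshold=PAUSE_THRESHOLD_S):
    """Count silent gaps (in seconds) strictly longer than the threshold."""
    return sum(1 for gap in gaps_s if gap > threshold)

def disfluency_ratios(counts, total_utterance):
    """Map each disfluency category to count / total utterance."""
    if total_utterance <= 0:
        raise ValueError("total_utterance must be positive")
    return {name: n / total_utterance for name, n in counts.items()}

if __name__ == "__main__":
    # Hypothetical labels for one 3-minute speech.
    gaps = [0.12, 0.45, 0.30, 0.80, 1.10]
    counts = {
        "pause": count_pauses(gaps),
        "repetition": 4,
        "self_repair": 2,
        "orphan": 1,
    }
    print(disfluency_ratios(counts, total_utterance=200))
```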

Language Acquisition Development Variables
The criteria for spoken language proficiency vary among researchers (Galloway 1987; McNamara 1996). Higgs and Clifford (1980) proposed the Relative Contribution Model (RCM) and suggested that different factors contribute differently to overall language proficiency at different levels. In the RCM, vocabulary and pronunciation are the most important factors at the beginning levels. At the higher levels, fluency and grammar make their contributions. At the highest levels, all four factors and the sociolinguistic factor work together to produce the greatest language proficiency. This paper adopts the RCM and sets the language acquisition development (proficiency) variables from the perspectives of vocabulary, grammatical accuracy, grammatical complexity and pronunciation.
Y1 = simple sentence ratio (the number of simple sentences to total utterance)
Y2 = compound sentence ratio (the number of compound sentences to total utterance)
Y3 = complex sentence ratio (the number of complex sentences to total utterance)
Y4 = non-complete sentence ratio (the number of non-complete sentences to total utterance)
Y5 = type-token ratio (the number of different words to the total number of words)
Y6 = new semantic content ratio (the utterance of non-repeated complete language to the total utterance). For example, consider the passage "...Yeah, we should... we should care... just as it is, we should care about what we said. It is very important. We can't tell... we can't tell... we can't tell about some personally things". The new semantic contents are "we should care about what we said", "it's very important" and "we can't tell about some personally (personal) things". The utterance of these new contents totals 25, while the utterance of the whole passage totals 40, so the new semantic content ratio is 25/40 = 62.5%. If EFL students simply repeat certain opinions in their oral productions, the repeated parts are not counted as new semantic content. The new semantic content ratio reflects, to some extent, how well students control their topics.
Y7 = the number of phrases per T-unit
Y8 = the ratio of error-free T-units to the total number of T-units. A T-unit is defined as an independent clause together with the clauses attached to it (Hunt, 1970).
Y9 = the number of clauses per T-unit
Y10 = the average phonological complexity of vocabulary. Phonological complexity is scored as follows: a word with fewer than 3 phonemes = 1; a word with 3-4 phonemes = 2; a word with more than 4 phonemes = 3. In this way, each word receives a score, and the average score of each oral production can be calculated (Masterson and Kamhi, 1992).
Y11 = phonological accuracy, the ratio of the number of words pronounced correctly to the total number of words.
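Three of these measures follow simple arithmetic rules and can be sketched in Python. The function names are hypothetical, the lowercasing of tokens in the type-token ratio is an assumption, and phoneme counts are taken as given inputs because the text does not specify how they were obtained.

```python
def type_token_ratio(tokens):
    """Y5: number of different words / total number of words."""
    words = [t.lower() for t in tokens]  # assumption: case-insensitive word types
    return len(set(words)) / len(words)

def new_semantic_content_ratio(new_utterance, total_utterance):
    """Y6: non-repeated complete utterance / total utterance."""
    return new_utterance / total_utterance

def phonological_score(n_phonemes):
    """Score one word: fewer than 3 phonemes = 1, 3-4 phonemes = 2, more than 4 = 3."""
    if n_phonemes < 3:
        return 1
    if n_phonemes <= 4:
        return 2
    return 3

def mean_phonological_complexity(phoneme_counts):
    """Y10: average score over all words in one oral production."""
    scores = [phonological_score(n) for n in phoneme_counts]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    # The counts 25 and 40 are the figures from the worked example in the text.
    print(new_semantic_content_ratio(25, 40))
    # Hypothetical phoneme counts for a five-word utterance.
    print(mean_phonological_complexity([2, 3, 5, 4, 6]))
```

With the figures from the example above, `new_semantic_content_ratio(25, 40)` reproduces the 62.5% reported in the text.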

Disfluencies Productions in these 4 Successive Oral Tests
After labeling the disfluency signals and calculating the measures, we obtain the data on disfluency traits shown in Table 1 and Figure 1.
Note: mean length of utterance = total utterance / total number of disfluencies. In Table 1, total utterance and total number of words are the totals produced by the 9 students in 27 minutes, while mean rate of speech and mean length of utterance are averages over the 9 students in each oral test.
As shown in Table 1, the total number of disfluencies produced within the same 3 minutes increases from the first test to the second, stays nearly constant from the second test to the third and finally decreases in the fourth test. Figure 1 also shows clear changes in the ratios of specific disfluency phenomena. The ratios of pauses, repetitions and orphans keep decreasing, while the ratio of self-repairs increases across the 4 tests. In the first and second oral tests, pauses and repetitions are the main forms of disfluency, while in the last two tests self-repairs replace pauses and repetitions as the most prominent disfluency. The third tendency (from Table 1) is that the mean speech rate increases markedly while the mean length of utterance increases only slightly. As the mean speech rate increases, the utterance per minute increases accordingly, and the total number of disfluencies rises too, so the mean length of utterance changes only slightly. For listeners, little progress is noticeable in the students' first three oral productions. In the last test, however, listeners can notice the difference in language expression and can sense the improvement in oral English proficiency.

Findings from Canonical Correlation Analysis
In SPSS, there is no ready-made menu for canonical correlation analysis, so we enter the CANCORR macro commands in a syntax window (File-New-Syntax).
From the correlation matrices for each set in Table 2 and Table 3, we find that the variables within each set are correlated. For instance, the 4 variables in set 1 are correlated, and their coefficients are moderate, neither too large nor too small. Each variable therefore represents one kind of disfluency and cannot be replaced by another. If the coefficient between two variables is too large, it is necessary to consider combining the two variables or deleting one of them (Tao Zhou, 2010). Although the coefficients between Y7 and Y8, Y7 and Y9, and Y8 and Y9 in set 2 are large enough to raise the question of combination, we keep these 3 variables unchanged because their contents do not really overlap. All in all, the variables in the two sets are well selected, and each represents a distinct aspect of its set.
From Table 4, we find that the direct coefficients between variables of the two sets are not large, except for those between self-repairs (X3) and new semantic content (Y6) and between self-repairs (X3) and the number of phrases per T-unit (Y7) (R=0.6285 and R=0.6092). In Table 5, by contrast, the first to third canonical coefficients (R=0.922, R=0.762 and R=0.718) are larger than any simple coefficient in Table 4. This shows that the combined canonical correlations are stronger than the simple correlations among variables. That is, the language acquisition variables taken as a whole have a stronger impact on the developmental traits of disfluencies. The significance tests in Table 5 show that, at α=0.05, the first and second canonical coefficients are significant, while the third and fourth are not. The correlations between the two sets of variables are therefore reflected by the first two pairs of canonical variates.
In order to eliminate the influence of the different dimensions and units of the raw variables, we adopt standardized canonical coefficients (Table 6) and set up the linear models in Table 7.
Table 7. Canonical correlation models
V2 = -0.716Y1 - 0.276Y2 - 0.1.1Y3 - 0.179Y4 + 0.280Y5 - 0.216Y6 - 0.386Y7 - 1.396Y8 - 0.324Y9 - 0.008Y10 - 0.126Y11
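A linear model of this kind is evaluated by standardizing each raw variable and taking the weighted sum with the standardized canonical coefficients. The following is a minimal sketch under stated assumptions: the function names, the sample columns and the coefficients are hypothetical illustrations, not the values of Table 6 or Table 7.

```python
from statistics import mean, stdev

def standardize(column):
    """Convert raw values to z-scores using the sample standard deviation."""
    m, s = mean(column), stdev(column)
    return [(value - m) / s for value in column]

def canonical_variate(columns, coefficients):
    """Score each observation as sum(coefficient * z-score) over the variables.

    columns: one list of raw values per variable, all of equal length.
    coefficients: standardized canonical coefficients in the same variable order.
    """
    if len(columns) != len(coefficients):
        raise ValueError("one coefficient is needed per variable")
    z_columns = [standardize(col) for col in columns]
    n_obs = len(z_columns[0])
    return [
        sum(c * z[i] for c, z in zip(coefficients, z_columns))
        for i in range(n_obs)
    ]

if __name__ == "__main__":
    # Hypothetical raw ratios for two variables over four observations.
    raw = [[0.10, 0.12, 0.08, 0.06], [0.02, 0.03, 0.05, 0.07]]
    print(canonical_variate(raw, coefficients=[0.6, -0.9]))
```

Because the inputs are z-scores, variables measured in different units contribute on a comparable scale, which is the reason given above for using standardized coefficients.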
In the first pair of canonical variates, U1 represents the first variate of disfluencies and V1 represents the first variate of language acquisition (proficiency). The coefficients of X1 and X3 are 0.603 and -0.890, the largest in absolute value, so U1 is expressed mainly by these two variables. Similarly, V1 is expressed mainly by Y7, Y8 and Y9. That is, we can study the correlations of the first pair of canonical variates by studying the correlations among X1, X3, Y7, Y8 and Y9. In the same way, the correlations of the second pair of canonical variates are shown by the correlations among X1, X4, Y1 and Y8.
Since the first canonical coefficient is the largest, pauses (X1) and self-repairs (X3) are the most important variables. Both pauses and self-repairs have their strongest correlations with Y7 (the number of phrases per T-unit), whose coefficient is -2.852; the next strongest are with Y8 (error-free T-units to the total number of T-units), whose coefficient is 2.098, followed by Y9 (the number of clauses per T-unit), whose coefficient is -0.927.
The first pair of canonical variates shows the correlations among pauses, self-repairs, grammatical accuracy and language complexity (lexical complexity and syntactic complexity). Self-repairs are correlated positively with language complexity and negatively with grammatical accuracy. Pauses are correlated negatively with language complexity and positively with grammatical accuracy. These correlations reveal that self-repairs and pauses develop in opposite directions. From the perspective of disfluency traits, however, an improvement in either language complexity or grammatical accuracy brings an increase in some kind of disfluency. This is in accordance with the findings of Bernstein Ratner (1977): when language proficiency and grammatical ability develop in an unbalanced way, students' attention shifts, and oral disfluencies result.
Different language phenomena lead to different changes in disfluencies: either more pauses and fewer self-repairs, or fewer pauses and more self-repairs. These correlations tell us that, during these 2 years, students cannot control language accuracy and language complexity at the same time.
(1) When the number of phrases or chunks increases, the number of self-repairs rises as well. This rapid, consciously monitored correction reduces the time spent selecting accurate and complex language, so it reduces the production of pauses. We can infer from this tendency that, as language proficiency improves, the number of self-repairs will eventually go down. Thus the use of chunks or phrases will decrease the oral disfluencies of EFL students. This is similar to the findings of other researchers (Ping Yuan, 2010; Yan Chen & Qingqing Zhao, 2010).
(2) When grammatical accuracy improves, the number of self-repairs goes down naturally because fewer revisions are needed. This gain in accuracy is obtained by using more pauses to buy time for selecting correct language.
The correlations of the second pair of canonical variates are expressed primarily by the correlations among pauses (X1, coefficient -0.881), Y8 (error-free T-units to the total number of T-units, coefficient -1.396) and Y1 (simple sentence ratio, coefficient -0.716). Pauses, simple sentences and grammatical accuracy are positively correlated. The more simple sentences are used, the more pauses occur. Likewise, pauses help students secure grammatically accurate forms, and language accuracy improves as a result.
In canonical correlation analysis, we also examine the proportion of variance in one set that is explained by the canonical variates of the opposite set. Because the variables we select have the largest coefficients, each variate explains not only the variance of its own set but also variance in the opposite set. The higher the coefficient, the more variance information is explained (Huixuan Gao, 2002). As shown in Table 8, the two variates of set 1 (the disfluency set) explain a cumulative 57% of their own variance and a cumulative 27.8% of the opposite set's variance. The two variates of set 2 (the language acquisition set) explain a cumulative 36.9% of their own variance and a cumulative 39.4% of the opposite set's variance. This shows that the oral disfluency traits of EFL students are correlated with their language acquisition development. However, the language acquisition development variables alone cannot explain all disfluency traits; other factors, such as learning strategies, confidence, learning motivation and the social environment of learning, also have considerable effects on the development of disfluency traits.
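The quantities reported in a redundancy analysis of this kind are commonly computed from canonical loadings, that is, the correlations between each original variable and a canonical variate. The sketch below uses the standard definitions and is an illustration only: the function names and the loadings are hypothetical, and it does not reproduce the figures in Table 8.

```python
def variance_explained(loadings):
    """Proportion of a set's variance explained by one of its own canonical
    variates: the mean of the squared loadings on that variate."""
    return sum(value * value for value in loadings) / len(loadings)

def redundancy(loadings, canonical_r):
    """Proportion of a set's variance explained by the opposite variate of the
    same pair: own-variate proportion times the squared canonical correlation."""
    return variance_explained(loadings) * canonical_r ** 2

if __name__ == "__main__":
    # Hypothetical loadings of four variables on one variate; 0.922 is the
    # first canonical coefficient reported in the text.
    loadings = [0.7, -0.5, 0.6, 0.2]
    print(variance_explained(loadings), redundancy(loadings, 0.922))
```

Cumulative figures over several pairs are the sums of the per-pair values. Because the squared canonical correlation is at most 1, the redundancy of a set never exceeds the variance explained by its own variates.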

Discussion
From the canonical correlation analyses above, we find that the oral disfluency traits of these 9 non-English-major EFL college students are strongly connected with their language acquisition development. The combined canonical correlations are stronger than the simple correlations among variables. In other words, the language acquisition variables working together account for disfluencies better than any single variable does. The specific analyses reveal the following clear correlations: pauses are closely connected with simple sentence ratios, language complexity and grammatical accuracy, and self-repairs are closely associated with language complexity and grammatical accuracy. Moreover, pauses and self-repairs develop in opposite directions. As language proficiency improves, language complexity goes up, and so does the number of self-repairs. By the same token, as grammatical accuracy increases, the number of pauses increases as well. These correlations remind us that an increase in disfluencies is an unavoidable part of language acquisition. It is not necessarily a sign that language proficiency is declining. On the contrary, it can be regarded as a benign indicator that language acquisition is improving. As language proficiency improves further in the next stage, oral disfluencies will gradually go down, and oral expression will become natural and smooth.
We also find that, despite the strong correlations between disfluency traits and language acquisition development, part of the disfluency traits cannot be explained. To study the disfluency traits of EFL students comprehensively, we therefore have to take other factors into account, such as learning motivation, learning strategies, self-confidence and the social environment in which students live.

Figure 1. The weight changes of disfluency phenomena in the 4 tests
Note: The weight (ratio) of each kind of disfluency = the number of that kind of disfluency / the total utterance

Table 1. Disfluencies and other measures related to disfluencies
When entering these commands, note that in a CANCORR macro program the INCLUDE statement reads in the macro for canonical correlation, and the location of the macro file varies with the installation directory. In addition, both the INCLUDE and the CANCORR statements must end with a full stop (.). After entering the program, select Run->All from the menu to execute it; we then get the following canonical correlation analysis results (shown in Table 2 and Table 3):

Table 5. Canonical coefficients and tests that the remaining correlations are zero

Table 6. Standardized canonical coefficients for the 2 sets

Table 8. Canonical redundancy analysis