Developments in Research on Formant Tracking Algorithms

Formants are an important part of the phonetic characteristics of speech, and a reliable formant tracking algorithm is the basis for phonetic research. Following the development of formant tracking algorithms, this paper focuses on the linear prediction coding (LPC) method and the model matching method and analyzes their respective advantages and disadvantages. It concludes that model matching based on the hidden dynamic model is the likely direction of future formant tracking technology.


Introduction
During speech production, airflow passes through the vocal tract and excites its resonances, producing a group of resonance frequencies called formant frequencies, or formants for short. Formants are important parameters for distinguishing between vowels. An algorithm that locates and marks the trajectory of the formant frequencies over time is called a formant tracking algorithm. Formant trajectories also reflect the individual characteristics of the speaker. Formant parameter extraction and tracking algorithms have been widely used in speaker recognition, speech synthesis, and speech coding and transmission, and they are important research topics in speech signal processing.
Because formants are so important to speech signal processing and have such broad application prospects, many scholars have studied formant parameter extraction and tracking in recent years, and new algorithms continue to appear. A review of the literature shows that these algorithms fall into two classes: the LPC method and the model matching method. Each has its own advantages and disadvantages. Factoring the linear prediction polynomial gives accurate center frequencies and bandwidths for the formants, but in LPC it is difficult to decide which formant each peak belongs to. The model matching method avoids this problem, but the model must be trained on a large amount of data, and the training result depends on the quantity and type of the training data. The two methods are briefly introduced and analyzed below.

Linear prediction coding method
S. S. McCandless first used the LPC method to extract the frequencies and amplitudes of the first three formants in vowel segments (S.S. McCandless, 1974, p.135-141). A property of LPC spectral estimation is that the LPC spectrum closely matches the signal spectrum in regions where the signal energy is high, i.e. near the spectral peaks, whereas the two spectra differ significantly in regions where the signal energy is low, i.e. near the spectral valleys. The formants can therefore be identified by picking the peaks of the LPC spectrum. In the ideal case, the first three formants of the speech are the first three peaks of the LPC spectrum. D. Broad et al improved the LPC analysis algorithm by using LPC cepstral coefficients to extract the formant parameters (D.J. Broad, 1989, p.2013-2017). Compared with McCandless' LPC spectrum estimation algorithm, the improved algorithm was more robust when extracting formants from vowel segments.
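The peak-picking idea can be illustrated with a minimal Python sketch. It is not McCandless' algorithm: the sampling rate, the three resonances and the helper names (poly_from_resonances, lpc_spectrum, pick_peaks) are assumptions made for illustration, and the predictor polynomial is synthesized from known resonances instead of being estimated from speech.

```python
import cmath
import math

FS = 8000.0  # assumed sampling rate in Hz


def poly_from_resonances(resonances, fs=FS):
    """Build A(z) as a product of second-order sections, one per (freq, bandwidth) pair."""
    a = [1.0]
    for freq, bw in resonances:
        r = math.exp(-math.pi * bw / fs)       # pole radius from bandwidth
        theta = 2.0 * math.pi * freq / fs      # pole angle from center frequency
        section = [1.0, -2.0 * r * math.cos(theta), r * r]
        new = [0.0] * (len(a) + 2)
        for i, ai in enumerate(a):             # polynomial multiplication
            for j, sj in enumerate(section):
                new[i + j] += ai * sj
        a = new
    return a


def lpc_spectrum(a, n_bins=400, fs=FS):
    """Return (freqs, power) with power = 1 / |A(e^{jw})|^2 on [0, fs/2)."""
    freqs, power = [], []
    for k in range(n_bins):
        f = k * (fs / 2.0) / n_bins
        w = 2.0 * math.pi * f / fs
        a_of_w = sum(c * cmath.exp(-1j * w * n) for n, c in enumerate(a))
        freqs.append(f)
        power.append(1.0 / abs(a_of_w) ** 2)
    return freqs, power


def pick_peaks(freqs, power):
    """Frequencies of the local maxima of the sampled spectrum."""
    return [freqs[k] for k in range(1, len(power) - 1)
            if power[k - 1] < power[k] >= power[k + 1]]


# Hypothetical vowel-like resonances: (center frequency, bandwidth) in Hz.
a = poly_from_resonances([(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)])
peaks = pick_peaks(*lpc_spectrum(a))
print(peaks)
```

With clean, well-separated resonances the three peaks fall within one or two grid steps of the synthesized formants. Merged or spurious peaks would break this one-to-one correspondence.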
A. M. De Lima Araujo et al combined the Mel frequency scale, which matches the perceptual characteristics of the human ear, with LPC analysis to estimate the first formant F1 and the second formant F2 of speech signals (A.M.De Lima Araujo, 1998, p.207-212). Traditional LPC algorithms had to choose the order of the linear predictor according to the number of formants to be extracted, whereas this algorithm could obtain F1 and F2 with a fixed predictor order. D. Talkin proposed an automatic formant tracking method that used dynamic programming and imposed frequency continuity constraints (D.Talkin, 1987, p.S55). First, candidate formant frequencies are obtained by finding the roots of the linear prediction polynomial. Then a stationarity function is defined as the frequency continuity constraint. Finally, a modified Viterbi algorithm finds the minimum-cost mapping between the formant candidates in the current frame and those in the previous frame, subject to the stationarity function, which links the formant tracks across frames. The key to this algorithm is designing a suitable stationarity function for the frequency continuity constraint. However, experiments showed that it was very difficult to design a reasonable stationarity function (G. Kopec, 1986, p.709-729).
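The dynamic programming step can be sketched as follows. This is a plain Viterbi-style search over per-frame candidates, not Talkin's modified algorithm. The cost terms are assumptions: an absolute frequency jump stands in for the stationarity function, and a distance to a nominal frequency serves as a hypothetical local cost. The candidate values are invented for illustration.

```python
def track_formant(candidates, nominal, local_weight=0.2, jump_weight=1.0):
    """Pick one candidate per frame, minimizing local cost plus frame-to-frame jumps.

    candidates: list of frames, each a list of candidate frequencies in Hz.
    nominal:    assumed typical frequency of the formant being tracked.
    """
    cost = [[local_weight * abs(f - nominal) for f in candidates[0]]]
    back = [[0] * len(candidates[0])]
    for t in range(1, len(candidates)):
        row_cost, row_back = [], []
        for f in candidates[t]:
            # Continuity cost: a simple stand-in for a stationarity function.
            options = [cost[t - 1][i] + jump_weight * abs(f - g)
                       for i, g in enumerate(candidates[t - 1])]
            best = min(range(len(options)), key=options.__getitem__)
            row_cost.append(options[best] + local_weight * abs(f - nominal))
            row_back.append(best)
        cost.append(row_cost)
        back.append(row_back)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    path = []
    for t in range(len(candidates) - 1, -1, -1):
        path.append(candidates[t][j])
        j = back[t][j]
    return path[::-1]


# Hypothetical root-derived candidates; frame 2 contains a spurious peak at 1500 Hz.
frames = [
    [500.0, 1800.0, 2500.0],
    [510.0, 1780.0, 2520.0],
    [520.0, 1500.0, 1760.0, 2500.0],
    [515.0, 1750.0, 2490.0],
]
f2_track = track_formant(frames, nominal=1500.0)
print(f2_track)  # [1800.0, 1780.0, 1760.0, 1750.0]
```

A frame-by-frame choice based on the local cost alone would select the spurious 1500 Hz candidate in the third frame. The continuity term prevents this, which also shows how strongly the result depends on how that term is designed.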
In summary, because the order of the LPC predictor is fixed in advance, the number of complex-conjugate pole pairs, and hence of spectral peaks, is at most half the order. Generally, the bandwidth of a spurious peak is larger than that of a formant, so extracting the formants amounts to deciding which peak belongs to which formant. The method is computationally simple, and for clean vowel signals it can accurately determine the center frequencies and bandwidths of the formants by factoring the linear prediction polynomial. But if the speech signal is corrupted by noise, spurious peaks and merged peaks appear in the spectrum, which makes it very difficult to decide which formant each peak belongs to and degrades the accuracy of the formant tracks. This is the essential deficiency of the LPC method in formant tracking.
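The relation between the roots of the prediction polynomial and the formant parameters can be made concrete with a short Python sketch. For a pole z = r e^{j theta} and sampling rate fs, the center frequency is theta fs / (2 pi) and the bandwidth is -(fs / pi) ln r. The poles below are synthesized from assumed values rather than obtained from a root finder, and the bandwidth threshold max_bw is a hypothetical choice.

```python
import cmath
import math

FS = 8000.0  # assumed sampling rate in Hz


def make_pole(freq, bw, fs=FS):
    """Pole of a resonance with the given center frequency and bandwidth (Hz)."""
    return cmath.rect(math.exp(-math.pi * bw / fs), 2.0 * math.pi * freq / fs)


def pole_to_formant(pole, fs=FS):
    """Convert one pole to (center frequency, bandwidth) in Hz."""
    freq = cmath.phase(pole) * fs / (2.0 * math.pi)
    bw = -math.log(abs(pole)) * fs / math.pi
    return freq, bw


def formants_from_poles(poles, fs=FS, max_bw=400.0):
    """Keep one pole per conjugate pair and drop wide-bandwidth (likely spurious) ones."""
    found = []
    for p in poles:
        if p.imag <= 0.0:  # skip real poles and lower-half-plane conjugates
            continue
        freq, bw = pole_to_formant(p, fs)
        if bw <= max_bw:
            found.append((freq, bw))
    return sorted(found)


# An order-8 predictor has at most 4 conjugate pole pairs. Here three of them are
# formant-like and one is a hypothetical wide-bandwidth extra pole.
pairs = [(700.0, 80.0), (1200.0, 90.0), (1900.0, 900.0), (2600.0, 120.0)]
poles = []
for freq, bw in pairs:
    p = make_pole(freq, bw)
    poles.extend([p, p.conjugate()])

formants = formants_from_poles(poles)
print([(round(f), round(b)) for f, b in formants])
```

In practice the poles would come from a polynomial root finder applied to the estimated predictor coefficients. A fixed bandwidth threshold is only one simple way to assign peaks, and it becomes unreliable once noise produces merged or narrow spurious peaks.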

Model matching method
The model matching method avoids the LPC method's susceptibility to spurious and merged peaks, and it has been a research focus in formant parameter extraction and tracking in recent years. The method has developed from HMM-based models to HDM-based models.
In 1975, Baker proposed using the hidden Markov model (HMM) to track formants (J.K. Baker, 1975, p.24-29), but the experiment failed. Ten years later, G. Kopec was the first to apply an HMM-based formant tracking method successfully (G. Kopec, 1985, p.1113-1116). He divided formant tracking into two independent problems, detection and estimation. Formant detection decides whether a formant is present in each speech frame, and formant estimation assigns a frequency value to each detected formant. Both detection and estimation used the Viterbi algorithm to search for the optimal HMM state sequence. This algorithm tracked formants statistically, but it could track only one formant.
After that, G. Kopec proposed an improved formant tracking algorithm based on HMM and vector quantization (VQ) (G. Kopec, 1986, p.709-729). Compared with the earlier method (G. Kopec, 1985, p.1113-1116), the new algorithm was improved in two respects. First, by providing two tracking modes, a single-formant mode and a multi-formant mode, it could track several formants simultaneously, which removed the restriction to a single formant track. Second, it replaced the Viterbi algorithm with the forward-backward (F-B) algorithm for detection and estimation, because the Viterbi algorithm produces a single state sequence rather than a probability distribution, which causes two problems. The first problem concerns detection. Viterbi-based detection controls the probabilities of false detections and missed peaks through thresholds, and these thresholds cannot be adjusted directly during tracking, so the approach is not flexible enough. Moreover, if detection and estimation are carried out simultaneously, detection performance depends on the number of states used to represent the formant parameters. As the density of states in the frequency space increases, the probability of any single state decreases, so the probability that the true state is detected gradually falls as the number of states grows. The HMM formant tracking algorithm based on the F-B algorithm avoids this problem. The second problem is that a formant track obtained with the Viterbi algorithm is a sequence of discrete frequency values defined by the model states, whereas the track obtained with the F-B algorithm is a weighted average over the discrete model states, so the latter is smoother than the former. P. Zolfaghari (P. Zolfaghari, 1996, p.1229-1232) and J. Darch (J.Darch, 2005, p.941-944) proposed formant tracking methods based on the Gaussian mixture model (GMM). Because a GMM is a continuous-density HMM with a single state, these methods can still be regarded as HMM-based formant tracking.
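The difference between a Viterbi track and a forward-backward track described above can be seen in a small Python sketch. The four-state frequency grid, the transition weights, the Gaussian emission model and the observations are all assumptions chosen for illustration; they are not Kopec's model. The forward-backward recursion is unscaled, which is adequate only for very short sequences.

```python
import math

STATES = [400.0, 500.0, 600.0, 700.0]  # hypothetical quantized formant values (Hz)


def make_transitions(n, stay=0.6, step=0.2, other=0.01):
    """Row-normalized transitions that favor staying or moving to a neighbor."""
    rows = []
    for i in range(n):
        row = [stay if i == j else step if abs(i - j) == 1 else other
               for j in range(n)]
        total = sum(row)
        rows.append([x / total for x in row])
    return rows


def emission(obs, state, sigma=50.0):
    """Unnormalized Gaussian likelihood of a frequency measurement."""
    return math.exp(-0.5 * ((obs - state) / sigma) ** 2)


def viterbi(obs, trans):
    """Single best state sequence: every output value lies on the state grid."""
    n = len(STATES)
    score = [math.log(emission(obs[0], s) / n) for s in STATES]
    back = []
    for o in obs[1:]:
        prev, score, ptr = score, [], []
        for j in range(n):
            i = max(range(n), key=lambda i: prev[i] + math.log(trans[i][j]))
            score.append(prev[i] + math.log(trans[i][j])
                         + math.log(emission(o, STATES[j])))
            ptr.append(i)
        back.append(ptr)
    j = max(range(n), key=score.__getitem__)
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    return [STATES[j] for j in reversed(path)]


def posterior_mean_track(obs, trans):
    """Forward-backward: posterior-weighted average of the state values per frame."""
    n, T = len(STATES), len(obs)
    emis = [[emission(o, s) for s in STATES] for o in obs]
    alpha = [[e / n for e in emis[0]]]
    for t in range(1, T):
        alpha.append([emis[t][j] * sum(alpha[t - 1][i] * trans[i][j] for i in range(n))
                      for j in range(n)])
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(trans[i][j] * emis[t + 1][j] * beta[t + 1][j] for j in range(n))
                   for i in range(n)]
    track = []
    for t in range(T):
        gamma = [alpha[t][i] * beta[t][i] for i in range(n)]
        track.append(sum(g * s for g, s in zip(gamma, STATES)) / sum(gamma))
    return track


trans = make_transitions(len(STATES))
observations = [450.0, 460.0, 540.0, 560.0]
vit_track = viterbi(observations, trans)
fb_track = posterior_mean_track(observations, trans)
print(vit_track)
print([round(x, 1) for x in fb_track])
```

The Viterbi output can only take the grid values, while the forward-backward output falls between them and changes more gradually from frame to frame.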
HMM-based formant tracking has two deficiencies. First, when the algorithm estimates the state of the formant track at a given time, it uses only the continuity of the track as the constraint for selecting the formant, which easily leads to tracking errors. To address this, Minkyu Lee et al proposed an improved method (M.Lee, 2005, p.741-750). When estimating the state of the track at a given time, they combined phoneme information derived from the text of the utterance with track continuity as the constraint for selecting the formant, which improved tracking accuracy. D. T. Toledano et al proposed a similar improvement (D.T. Toledano, 2006, p.511-522). These methods significantly improved accuracy when tracking the formants of speaker-dependent, text-dependent speech, but they brought no obvious improvement for speaker-independent, text-independent speech. Second, the algorithm needs a large amount of data to train the model, and the final tracking result is determined by the type and quantity of the training data. In different operating environments, however, there is no way to confirm that the training data are sufficiently representative. This deficiency is inherent to the HMM. The HMM is a general statistical model that is widely used in many different domains, and applying it in a particular domain requires domain-specific training data. In other words, the HMM is a data-driven model that includes no mechanism describing how the data are generated. Therefore, in real environments with noise and interference, HMM-based formant tracking cannot meet practical requirements. To overcome this deficiency, a new model should describe speech signals using not only the observed data but also the mechanism of speech production.
In the late 1990s, L. Deng proposed a dynamic speech modeling method that combined phonological and phonetic characteristics (L.Deng, 1998, p.299-323). This method takes coarticulation and the transitions between neighboring phones into account when modeling speech signals. It regards the speech production system as a hidden dynamic system in which each phone corresponds to a target vector, i.e. when a phone is pronounced, the articulators of the vocal cords and vocal tract move toward a target state or shape according to a "program". This modeling method, designed specifically for speech, considers the generation mechanism of speech signals and departs from purely data-driven modeling. At about the same time, Richards et al proposed a similar speech modeling method, which they named the hidden dynamic model (HDM) (H.B.Richards, 1999, p.357-360). To describe the dynamic structure of speech, Richards mapped the hidden space to the acoustic feature space with a nonlinear multilayer perceptron (MLP) and trained the model parameters (the target vectors and the MLP weights) with a chosen training algorithm.
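The target-directed behavior of the hidden state can be illustrated with a minimal Python sketch. The first-order recursion z[t] = phi * z[t-1] + (1 - phi) * target is an assumed simplified form of such dynamics, not necessarily the exact model of Deng or of Richards et al, and the targets, durations and the value of phi are hypothetical.

```python
def target_directed_track(targets, durations, phi=0.8, start=500.0):
    """Hidden trajectory that moves toward each phone's target in turn.

    targets:   one target value per phone (Hz), an assumed scalar "target vector".
    durations: number of frames spent in each phone.
    phi:       assumed time constant; larger values mean slower movement.
    """
    z = start
    track = []
    for target, n_frames in zip(targets, durations):
        for _ in range(n_frames):
            z = phi * z + (1.0 - phi) * target  # step toward the current target
            track.append(z)
    return track


# A long phone with target 700 Hz followed by a very short phone with target 300 Hz.
track = target_directed_track(targets=[700.0, 300.0], durations=[30, 3])
print(round(track[29], 1), round(track[-1], 1))
```

The long phone reaches its target, whereas the short phone ends well before it does. This undershoot is one way a generative model of this kind can account for coarticulation without learning it purely from data.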
The HDM was applied successfully to speech recognition (L.Deng, 1998, p.299-323 & L.Deng, 2000, p.3036-3048), and many new formant tracking algorithms were developed on this basis. For example, I. Bazzi proposed a formant tracking algorithm based on expectation maximization (EM) (I.Bazzi, 2003, p.464-467). The algorithm has two parts. The first obtains the mapping between the formants and the acoustic observations: a nonparametric nonlinear predictor maps the formant parameters to Mel frequency cepstral coefficients (MFCC), and a prediction codebook is built from the result. The second handles the residual of the speech signal: the EM algorithm trains the residual parameters, and the optimal formant parameters are searched for in the prediction codebook. By adding target-directed constraints, L. Deng et al proposed a nonlinear predictor for tracking vocal tract resonances (VTRs) (L.Deng, 2006, p.425-434). This predictor maps the formant parameters to LPC cepstral coefficients (LPCC) rather than MFCC, and because LPCC has a good separability property, it improves computational efficiency.
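A codebook search of this kind can be sketched in Python under strong simplifications. For an all-pole model, the n-th cepstral coefficient is a sum over resonances of (2/n) exp(-pi n b / fs) cos(2 pi n f / fs), which is the sense in which the mapping separates across resonances. Everything else here is assumed for illustration: the fixed bandwidths, the grid of (F1, F2) values, the small perturbation standing in for a residual, and a plain squared-error search in place of EM training.

```python
import math

FS = 8000.0                 # assumed sampling rate in Hz
N_CEP = 12                  # assumed number of cepstral coefficients
BANDWIDTHS = (80.0, 100.0)  # assumed fixed bandwidths for F1 and F2 (Hz)


def predict_cepstrum(freqs, bws=BANDWIDTHS, fs=FS, n_cep=N_CEP):
    """All-pole cepstrum: each resonance contributes an independent additive term."""
    return [sum((2.0 / n) * math.exp(-math.pi * n * b / fs)
                * math.cos(2.0 * math.pi * n * f / fs)
                for f, b in zip(freqs, bws))
            for n in range(1, n_cep + 1)]


# Quantize the (F1, F2) space on a coarse grid and precompute the prediction codebook.
codebook = {(f1, f2): predict_cepstrum((f1, f2))
            for f1 in range(300, 1000, 100)
            for f2 in range(1000, 2600, 100)}


def search(observed):
    """Return the codebook entry whose predicted cepstrum is closest to the observation."""
    def sq_error(entry):
        return sum((o - c) ** 2 for o, c in zip(observed, codebook[entry]))
    return min(codebook, key=sq_error)


# Hypothetical observation: the cepstrum of (700, 1200) Hz plus a small perturbation.
observed = [c + 0.001 * (-1) ** n
            for n, c in enumerate(predict_cepstrum((700, 1200)))]
best = search(observed)
print(best)  # (700, 1200)
```

The cost of the search grows with the number of grid points, while a coarser grid limits how accurately the formants can be located.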
Both algorithms first quantize the formant parameter space, then map the quantized formant parameters to MFCC or LPCC to form the prediction codebook, and finally select the optimal formant parameters by training the residual parameters. Quantizing the parameter space requires choosing a quantization resolution: too many quantization levels entail heavy computation, and too few reduce tracking accuracy. To address this, L. Deng proposed a further improvement. He treated the formant parameters as continuous-valued hidden state variables and used Kalman filtering and smoothing to track the VTRs, which removed the problem caused by quantizing the frequency space (L.Deng, 2007, p.13-23). This method introduces additional prior information on the VTRs into the tracking process. Because the prior information captures the temporal characteristics of the VTR trajectories during speech production, the method can accurately track not only VTRs with clear spectral peaks but also VTRs in speech segments without clear formant structure (such as stops and fricatives). But the method has a deficiency: it has to linearize the nonlinear predictor. Kalman filtering is one implementation of Bayesian filtering and is the optimal linear filter under the minimum mean-square error criterion, so it cannot be applied directly to nonlinear models. Linearizing the nonlinear speech model not only increases computation, but the linearized model often fails to represent the true nonlinear model.
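The filtering and smoothing steps can be sketched in Python for the simplest possible case. The model below is a scalar random walk observed directly in noise, which is already linear, so no linearization is needed. It is an assumed toy model and not the linearized cepstral predictor of the cited work. The noise variances and the alternating test signal are hypothetical.

```python
def kalman_smooth(obs, q=100.0, r=2500.0, mean0=500.0, var0=1e6):
    """Scalar Kalman filter followed by a Rauch-Tung-Striebel smoother.

    Assumed model: x[t] = x[t-1] + w,  y[t] = x[t] + v,
    with Var(w) = q and Var(v) = r.
    """
    means, variances, pred_means, pred_vars = [], [], [], []
    m, p = mean0, var0
    for y in obs:
        # Predict: a random walk keeps the mean and inflates the variance.
        m_pred, p_pred = m, p + q
        # Update: blend prediction and measurement with the Kalman gain.
        k = p_pred / (p_pred + r)
        m = m_pred + k * (y - m_pred)
        p = (1.0 - k) * p_pred
        means.append(m)
        variances.append(p)
        pred_means.append(m_pred)
        pred_vars.append(p_pred)
    # Backward pass: revise each estimate using the frames that follow it.
    smoothed = means[:]
    for t in range(len(obs) - 2, -1, -1):
        g = variances[t] / pred_vars[t + 1]
        smoothed[t] = means[t] + g * (smoothed[t + 1] - pred_means[t + 1])
    return means, smoothed


# Hypothetical measurements: a 500 Hz resonance with an alternating +/-40 Hz error.
observations = [500.0 + (40.0 if t % 2 == 0 else -40.0) for t in range(50)]
filtered, smoothed = kalman_smooth(observations)
print(round(filtered[-1], 1), round(smoothed[25], 1))
```

Because the state is continuous, no frequency grid is required. The ratio of q to r plays the role of the prior on how quickly the resonance is allowed to move.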
To avoid linearizing the nonlinear predictor, researchers have in recent years applied particle filtering to HDM-based formant tracking. Particle filtering is another way to implement Bayesian filtering. It approximates the posterior probability distribution with a set of weighted random particles, and because it is not restricted to linear Gaussian conditions, it is widely used in the control field. Yanli Zheng et al were the first to apply particle filtering to formant tracking (Yanli, 2004, p.565-568). This method handles the nonlinear model directly, with no need to linearize the nonlinear speech model, and it is easy to implement. However, Yanli Zheng et al offered only an exploratory idea, and the speech model they used was a simplified HDM without target orientation, so the tracking accuracy still needs to be improved.
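A bootstrap particle filter with a nonlinear observation function can be sketched in Python as follows. It is a generic illustration and not the method of Yanli Zheng et al. The observation function (a single cosine term), the random-walk state model, the noise levels, the particle count and the synthetic trajectory are all assumptions.

```python
import math
import random

FS = 8000.0  # assumed sampling rate in Hz


def observe(freq):
    """Hypothetical nonlinear observation: one cosine term of a cepstrum-like mapping."""
    return math.cos(2.0 * math.pi * freq / FS)


def particle_filter(observations, n_particles=500, process_std=20.0,
                    obs_std=0.01, seed=0):
    """Bootstrap particle filter for a random-walk frequency state."""
    rng = random.Random(seed)
    particles = [rng.uniform(200.0, 1200.0) for _ in range(n_particles)]
    estimates = []
    for y in observations:
        # Propagate each particle through the assumed random-walk dynamics.
        particles = [p + rng.gauss(0.0, process_std) for p in particles]
        # Weight by the likelihood; the nonlinear function is used as it is.
        weights = [math.exp(-0.5 * ((y - observe(p)) / obs_std) ** 2)
                   for p in particles]
        total = sum(weights)
        if total == 0.0:  # guard against complete weight underflow
            weights = [1.0] * n_particles
            total = float(n_particles)
        # Posterior mean estimate, then multinomial resampling.
        estimates.append(sum(w * p for w, p in zip(weights, particles)) / total)
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


# Synthetic slowly rising resonance observed through the nonlinear function in noise.
noise = random.Random(1)
truth = [700.0 + 5.0 * t for t in range(30)]
observations = [observe(f) + noise.gauss(0.0, 0.01) for f in truth]
estimates = particle_filter(observations)
print(round(truth[-1], 1), round(estimates[-1], 1))
```

No Jacobian or other linearization of the observation function appears anywhere in the filter. The price is a computational cost proportional to the number of particles and an estimate that varies with the random seed.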

Conclusions
Formant tracking technology for speech signals continues to develop, and many scholars at home and abroad are working on it and have proposed numerous methods and algorithms. Formant tracking has progressed from LPC analysis to HMM-based model matching and then to HDM-based model matching. HDM-based formant tracking is now receiving more and more attention from researchers. Of course, the HDM model matching method still has open problems, for example whether the model describes the characteristics of speech signals accurately enough, and how to improve tracking accuracy while simplifying computation. As research progresses, these problems should gradually be solved, and the HDM model matching method is expected to play an important role in formant tracking of speech signals.