Linear dynamic models for speech recognition: Making the model fit the data

Joe Frankel

At last year's postgraduate conference I reported on a recognition system we were building that used linear dynamic models (LDMs) to characterise different phones from a mixture of acoustic and articulatory features. In this talk I will present some ideas we are working on to improve the modelling power of the LDM.

The premise of an LDM is that there is some process which unfolds over time but is not directly observed; the observations are seen as a noise-corrupted representation of these underlying dynamics. The model consists of two parts. The first equation describes the evolution of the process from one time step to the next as a linear transformation with the addition of some random variation. The second equation projects these trajectories up into the observation domain through another linear transformation with added Gaussian noise. The observation noise gives a measure of how imperfectly the observations represent the underlying, 'true' process.
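Written out, and with symbol names chosen here for illustration rather than taken from the talk, the two equations of a standard LDM take roughly the following state-space form, where x_t is the hidden state, y_t the observation, F the evolution matrix and H the projection matrix:

    x_t = F x_{t-1} + w_t,    w_t ~ N(m_w, Q)    (state evolution)
    y_t = H x_t + v_t,        v_t ~ N(m_v, R)    (observation)

The noise terms w_t and v_t supply the random variation and the Gaussian observation noise described above.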

We are looking into two possible extensions to the model. Firstly, we seek to make the representation of the noise more sophisticated. Replacing the single Gaussian with a mixture model would give the dynamic portion of the model more freedom, since one underlying trajectory could then account for a number of different sequences of observations.
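In the notation above, this first extension would amount to replacing the single-Gaussian observation noise with a mixture (again a sketch, not a fixed formulation):

    v_t ~ sum_{m=1..M} lambda_m N(mu_m, R_m),    with sum_m lambda_m = 1

so that a single state trajectory x_t can give rise to observations drawn from any of the M mixture components.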

Secondly, we are investigating the possibility of using more than one evolution matrix per model. From one time frame to the next, the model could decide which transformation gave the better prediction of the next observation. This would allow a mixture of trajectories to be associated with each (phone) model, and would perhaps mirror what we know of articulatory motion, where the context of a phone affects the accelerations needed to move toward the following target.
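One way to sketch this second extension, with s_t a hypothetical switching variable that selects among K candidate evolution matrices:

    x_t = F_{s_t} x_{t-1} + w_t,    s_t in {1, ..., K}

where s_t would be set at each frame according to which F_k best predicts the next observation, so that a single phone model can follow one of several trajectories.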