Predicting segmental duration using Bayesian networks

Olga Goubanova

We present a new probabilistic approach, based on Bayesian networks, to modelling segmental duration in a text-to-speech system. Segment duration is influenced by a number of contextual factors, such as segment identity, stress, accent, the local context within a syllable, and the position of the target segment within the syllable, word, and utterance. These factors interact with each other in complex ways. In addition, speech databases are often imbalanced with respect to the frequencies of different factor combinations. A model of segment duration should account for both problems. We propose a probabilistic Bayesian belief network (BBN) approach to address data sparsity and factor interaction. In our work, we model segment duration as a hybrid Bayesian network consisting of discrete and continuous nodes; each node in the network represents a linguistic factor that affects segmental duration.

The interaction between the factors is represented as conditional dependence relations in the graphical model. For the current research we used a database of over 1000 prosodically rich sentences from speakers of American and RP English. We compared the BBN model with the sums-of-products (SoP) model by van Santen and with the CART model implemented in the Festival text-to-speech system. All three models were trained and tested on the same data. The results show that our new model outperforms the CART model and performs comparably to the SoP model. We believe, however, that our model has other advantages over SoP: for instance, it is much easier to configure and to extend with new features, which should make it easier to adapt to new languages.
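To make the idea of a hybrid network more concrete, the sketch below shows one possible building block: a continuous duration node whose Gaussian parameters depend on the values of its discrete parents. It is a minimal illustration under stated assumptions, not the model described above. The factor names (`stress`, `position`), the toy durations, and the helper functions `fit_conditional_gaussian` and `predict_duration` are all hypothetical, and the global-mean fallback for unseen factor combinations is only a crude stand-in for how a full network would handle sparse data.

```python
from collections import defaultdict
from statistics import mean, pstdev


def fit_conditional_gaussian(samples):
    """Estimate (mean, std) of duration for each discrete parent configuration.

    samples: iterable of (parents, duration_ms), where parents is a tuple of
    discrete factor values. Factor names and values here are hypothetical.
    """
    groups = defaultdict(list)
    for parents, duration in samples:
        groups[parents].append(duration)
    # One Gaussian per observed combination of parent values.
    return {parents: (mean(d), pstdev(d)) for parents, d in groups.items()}


def predict_duration(params, parents, fallback):
    """Return the conditional mean, or `fallback` for unseen combinations.

    The fallback is an assumption made for this sketch only.
    """
    mu, _sigma = params.get(parents, (fallback, 0.0))
    return mu


# Invented toy data: ((stress, position), duration in ms).
toy = [
    (("stressed", "final"), 120.0),
    (("stressed", "final"), 140.0),
    (("unstressed", "medial"), 60.0),
    (("unstressed", "medial"), 80.0),
]

params = fit_conditional_gaussian(toy)
global_mean = mean(d for _, d in toy)

seen = predict_duration(params, ("stressed", "final"), global_mean)
unseen = predict_duration(params, ("stressed", "medial"), global_mean)
print(seen, unseen)
```

A table keyed on full parent configurations grows exponentially with the number of factors, which is exactly where sparse data hurts; a network structure that encodes conditional independences between factors is one way to keep the number of estimated parameters manageable.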