|KLATT 1987, p. 749|
The precise acoustic aspects of a complex articulatory model that might account for naturalness (spectral zero movements, period-to-period changes in the glottal waveform, pitch-synchronous formant motions, natural voiceless-voiced-voiceless transitions, etc.) are not yet known. Moreover, the considerably greater computational cost of articulatory synthesis precludes the use of these models in practical systems at the present time.
5. Automatic analysis/resynthesis of natural waveforms
Waveform encoding techniques will not be considered in this review (see, for example, Lee and Lochovsky, 1983), but we should perhaps note the Texas Instruments "Speak & Spell" toy (Wiggins, 1980), which used linear prediction encoding (Itakura and Saito, 1968; Atal and Hanauer, 1971; Markel, 1972; Makhoul, 1973) to store and play back a set of words at a storage cost of about 1000 bits/s of speech (example 13 of the Appendix). This inexpensive device has had a major impact on the technology of presenting "canned" messages to the public. Linear prediction representations of speech waveforms are based on the idea that, at least in the absence of source excitation, the next sample of a speech waveform can be estimated as a weighted sum of the 10 or so previous samples, the weights being the linear predictor coefficients. If the source waveform can be found by other means, and if the predictor coefficients are updated every 10 ms or so on the basis of analysis of a speech waveform, reasonably good approximations to the original waveform can be derived from this low-bit-rate representation.
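The prediction idea just described can be sketched concretely. The following fragment is only an illustration (not code from any system discussed here): it estimates predictor coefficients by the autocorrelation method via the Levinson-Durbin recursion, using a toy two-pole signal with known coefficients so that the estimates can be checked, and then predicts a sample from the preceding ones.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations for the prediction
    polynomial a = [1, a1, ..., ap]; x[n] is predicted as
    -sum_{j=1}^{p} a[j] * x[n-j]."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i-1:0:-1])
        k = -acc / err                     # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)               # residual prediction error
    return a, err

# Toy signal: a noise-excited two-pole resonance with known coefficients,
# x[n] = 1.5*x[n-1] - 0.9*x[n-2] + e[n].
np.random.seed(0)
N = 2000
e = np.random.randn(N)
x = np.zeros(N)
for n in range(2, N):
    x[n] = 1.5 * x[n-1] - 0.9 * x[n-2] + e[n]

# Autocorrelation sequence up to the predictor order (here 2).
r = np.array([np.dot(x[:N-k], x[k:]) for k in range(3)])
a, err = levinson_durbin(r, 2)
# a[1] is close to -1.5 and a[2] close to 0.9: the generating
# coefficients are recovered, so the next sample can be predicted as
pred = -(a[1] * x[-2] + a[2] * x[-3])      # estimate of x[-1]
```

In a real coder the coefficients would be re-estimated on successive 10-ms frames of windowed speech rather than on one long stationary signal; the stationary toy signal is used only so the recovered coefficients have a known target.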
In a text-to-speech application, it is necessary to employ an analysis/resynthesis procedure that allows the natural speech samples to be modified in fundamental frequency, amplitude, and duration, and that perhaps also permits some parameter smoothing at boundaries between waveform pieces. Linear prediction analysis of speech appears to be an excellent representation for these purposes (Olive and Spickenagle, 1976). It is even possible to reconstruct a waveform that is perceptually nearly indistinguishable from the original if multipulse excitation (Atal and Remde, 1982) is used to correct some of the errors that occur when the vocal tract is not all-pole and when the glottal source waveform is not like an impulse train (example 14 of the Appendix).
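The reason an LPC representation permits such modifications is that source and filter are kept separate: to change the fundamental frequency one simply drives the stored all-pole filter with an impulse train at a new period. A minimal numpy sketch of this idea (the filter coefficients are illustrative toy values, not taken from any analyzed utterance):

```python
import numpy as np

def allpole_synthesize(a, excitation):
    """Run an excitation signal through the all-pole LPC filter 1/A(z),
    where a = [1, a1, ..., ap]."""
    p = len(a) - 1
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for j in range(1, p + 1):
            if n - j >= 0:
                acc -= a[j] * y[n - j]
        y[n] = acc
    return y

def impulse_train(n_samples, f0, fs):
    """Idealized glottal source: one unit impulse per pitch period."""
    period = int(round(fs / f0))
    exc = np.zeros(n_samples)
    exc[::period] = 1.0
    return exc

fs = 10000
a = np.array([1.0, -1.5, 0.9])   # toy two-pole "vocal tract" (one resonance)
# Same filter, two different fundamental frequencies:
low = allpole_synthesize(a, impulse_train(400, 100.0, fs))
high = allpole_synthesize(a, impulse_train(400, 150.0, fs))
```

As the section goes on to note, this decoupling is also where the trouble lies: the stored filter itself carries f0-dependent estimation errors, so driving it at a new f0 does not reproduce the formants faithfully.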
However, a problem with this approach arises when going from duplicating a natural utterance to the more difficult task of creating new sentences by concatenating pieces of speech. The main difficulty has to do with changing the fundamental frequency: it turns out that the predictor equations, in the autocorrelation form, do not estimate formant frequencies and bandwidths accurately. This is no problem if one uses the same f0 during resynthesis, because the error is undone, but if a new f0 is employed, the first formant may be in error by ±8% or more (Atal and Schroeder, 1975; Klatt, 1986a), and formant bandwidths can be seriously deviant. Additional losses to naturalness occur if lengthening or shortening a segment does not quite produce the right vowel quality, or if smoothing at segment boundaries results in too rapid a change in synthesis parameters. Finally, the advantages of multipulse excitation with respect to naturalness more or less disappear in text-to-speech applications. Considering all of these limitations, it is my opinion that linear prediction resynthesis at f0 values other than those in the original recording may not have the potential quality of a formant synthesizer controlled by rule.
Other analysis-synthesis procedures have also shown an ability to reproduce speech with considerable fidelity. It has even been possible to mimic a high-pitched female singing voice by summing together, for each period, formant-like damped sinusoid waveforms that are time-windowed in such a way as to prevent superposition effects between periods (Rodet, 1984). Again, the problem with any synthesis-by-rule effort based on this type of waveform representation will be to preserve naturalness as rules are developed to create sentences in terms of the primitives of the representation.
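The grain-summation idea can be sketched as follows. This is only a minimal illustration of the principle, with hypothetical formant and bandwidth values; Rodet's actual CHANT system is considerably more refined. Each pitch period contributes one short, damped sinusoid per formant, smoothed at onset and truncated at the period boundary so that successive periods do not superimpose.

```python
import numpy as np

def formant_grain(f_formant, bandwidth, fs, dur_samples):
    """One formant 'grain': a damped sinusoid whose decay rate sets the
    formant bandwidth, with a short raised-cosine attack window."""
    t = np.arange(dur_samples) / fs
    grain = np.exp(-np.pi * bandwidth * t) * np.sin(2 * np.pi * f_formant * t)
    attack = min(dur_samples, int(0.0005 * fs))      # ~0.5-ms onset ramp
    grain[:attack] *= 0.5 * (1 - np.cos(np.pi * np.arange(attack) / attack))
    return grain

def grain_synthesize(f0, formants, bandwidths, fs, n_samples):
    """Sum one set of formant grains per pitch period; each grain is
    truncated at the period boundary to avoid inter-period overlap."""
    period = int(round(fs / f0))
    out = np.zeros(n_samples)
    for start in range(0, n_samples, period):
        length = min(period, n_samples - start)
        for f, bw in zip(formants, bandwidths):
            out[start:start + length] += formant_grain(f, bw, fs, length)
    return out

fs = 16000
# Illustrative values loosely suggesting a high-pitched sung vowel:
y = grain_synthesize(f0=440.0, formants=[800.0, 1150.0, 2900.0],
                     bandwidths=[80.0, 90.0, 120.0], fs=fs, n_samples=1600)
```

The design choice worth noting is that the "filter" here is realized directly in the time domain as additive grains, so there is no recursive filter state to manage across period boundaries; this is what makes the per-period windowing straightforward.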
This section on speech synthesizer models has come to four main conclusions: (1) modern formant synthesizers of several different configurations are capable of imitating many male speakers nearly perfectly, (2) some of the simplifications in a formant synthesizer result in unsatisfactory imitations of breathy high-pitched vowels that frequently occur adjacent to voiceless consonants in the speech of women and children, (3) linear prediction analysis/resynthesis is a powerful method for duplicating an utterance with high fidelity, but there are limitations on its applicability to general text synthesis, and (4) an articulatory model is likely to be the ultimate solution to the objective of natural intelligible speech synthesis by machine, but computational costs and lack of data upon which to base rules prevent immediate application of this approach.
B. Acoustic properties of phonetic segments
In order to generate speech using, e.g., a formant synthesizer, it is necessary to develop rules to convert sequences of discrete phonetic segments to time-varying control parameters. Such rules depend on data obtained by acoustic analysis of speech. Perceptual data establishing the sufficiency or relative potency of individual acoustic cues are also of considerable value in determining a rule strategy. Therefore, we first review briefly the development of a body of knowledge concerning the acoustic-phonetic characteristics of the phonetic segments of English. Many of the references to be cited appear in the Lehiste (1967) reprint collection.
The investigation of acoustic cues having the greatest importance for different speech sounds began with the use of the sound spectrograph machine at Bell Telephone Laboratories (Koenig et al., 1946; Potter, 1946; Potter et al., 1947; Joos, 1948). The machine produced acoustic pictures of speech. The most useful type of picture for phonetics research was the broadband sound spectrogram -- an example of which is shown in Fig. 15. A broadband spectrogram is a plot of frequency versus time in which blackness represents the energy present within a 300-Hz bandwidth, as averaged over about 2-3 ms. The display was designed to represent formants as slowly changing horizontal dark bands, and to indicate f0 as the inverse of the temporal spacing between vertical striations (at least for