NMAH | Smithsonian Speech Synthesis History Project (dk

have been applied, and erase stress marks after all stress rules have been applied, etc. An alternative, analogous to three-dimensional phonology, is to maintain all forms of representation in parallel (Halle, 1985).

In one sense, rule compilers are an answer to the problem that rule programs written in conventional programming languages nearly always attain a rigidity and opacity that eventually prohibits their developers from making improvements. Rule compilers discourage ad hoc fixes and encourage distinctions between levels of description. Indirect support for this view comes from my own work. I have twice found it necessary to re-program the Klattalk text-to-speech system from scratch within a slightly new conceptualization, using a better programming language each time. Nevertheless, I view existing rule compilers as somewhat constraining compared with general programming languages such as "C," and so thus far I have resisted the temptation to make use of them.

A second advantage of rule compilers is the ability to develop a text-to-speech system for a new language much more rapidly than when language-specific code and general synthesis strategies are intertwined. This is clearly true when a new team of researchers wishes to build from an existing system (as evidenced by the difficulties that both Speech Plus and Digital Equipment Corporation have had in subcontracting software modification efforts to create systems for other languages), but this need not be the case when the system is well understood (Klatt and Aoki, 1984).

4. Concatenation systems

Other laboratory synthesis-by-rule programs include several that attempt to take pieces of natural speech as building blocks to reconstitute an arbitrary utterance. The recorded chunks cannot be whole words, for the reasons identified earlier. However, smaller units might work.

The syllable is a linguistically appealing unit, but there are over 10 000 different syllables in English. The phoneme is another linguistically well-motivated unit, of which there are about 40 in English. However, all efforts to string together phoneme-sized chunks of speech have failed because of the well-known coarticulatory effects between adjacent phonemes that cause substantial changes to the acoustic manifestations of a phoneme depending on context (Harris, 1953). Coarticulatory influences tend to be minimal at the acoustic center of a phoneme, which prompted Peterson et al. (1958) to propose the "diphone," i.e., the acoustic chunk from the middle of one phoneme to the middle of the next phoneme, as a more satisfactory unit (Fig. 23). There are thus about 40 times 40, or 1600, different diphone possibilities, although not all occur (Peterson et al., 1958; Sivertsen, 1961). It may be necessary to include several different versions of each diphone to handle distinctions between stressed and unstressed syllables, to include allophones that can occur in different structural environments, and perhaps to include some larger VCV units which Sivertsen (1961) called syllable dyads. In addition, one must be able to change the duration and fundamental frequency contour on a diphone, or perhaps store multiple variants of each diphone with differing prosody. Wang and Peterson (1958) estimated that as many as 8000 diphones may be necessary, but current systems seem able to function with an inventory of about 1000 diphones.
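The bookkeeping behind the diphone unit can be made concrete with a minimal sketch. The function names, the phoneme labels, and the use of a "#" symbol to mark silence at the utterance edges are illustrative assumptions, not part of any system described here.

```python
def to_diphones(phonemes, silence="#"):
    """Return the diphone labels that span a phoneme sequence.

    Each diphone runs from the middle of one phoneme to the middle of
    the next. Padding with a silence symbol (an assumption of this
    sketch) supplies the utterance-initial and utterance-final units.
    """
    padded = [silence] + list(phonemes) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def max_inventory_size(n_phonemes):
    """Upper bound on distinct diphones: every ordered phoneme pair."""
    return n_phonemes * n_phonemes


# Hypothetical labels for the word "bill"
bill_units = to_diphones(["b", "ih", "l"])
upper_bound = max_inventory_size(40)
```

A word of N phonemes thus needs N + 1 stored units under this padding convention, and the 1600 figure is only an upper bound, since not every ordered pair occurs in English.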

In order to illustrate the advantages of the diphone approach over synthesis-by-rule programs, consider the task of plosive-vowel synthesis. In the rule programs described above, simple theories were used to generate a plosive before different vowels. In the diphone approach, each plosive-vowel transition is a special case, so no general theory or list of exceptions is required.

A potential disadvantage of the diphone approach is that discontinuities may appear right in the middle of vowels if the two abutting diphones do not reach the same vowel target, as might be the case for, e.g., the word "bill" in the lower panel of Fig. 23, or for "wet," because the [w] lip rounding and velarization effects can extend well into the vowel. Some sort of smoothing at diphone boundaries minimizes the perceptual consequences of actual formant discontinuities, but a mismatch of vowel quality between the two halves is not as easy to compensate for. Nor is it possible to create vowel-vowel coarticulation across an intervening consonant, or adjust vowel targets according to stress or phonetic environment. These may be second-order effects of less importance than a segmental intelligibility gain achieved by diphone concatenation, but we simply do not know.
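One simple form such boundary smoothing could take is sketched here, assuming each diphone is stored as a formant track (one frequency value in Hz per frame). The linear ramp, the `span` parameter, and the function name are assumptions made for illustration; the text does not specify the smoothing used by any particular system.

```python
def smooth_join(left, right, span=3):
    """Concatenate two formant tracks, spreading the jump at the join.

    The gap between the last frame of `left` and the first frame of
    `right` is split in half, and each half is ramped linearly over
    the `span` frames nearest the boundary, so both tracks meet at
    the midpoint of the original discontinuity.
    """
    span = min(span, len(left), len(right))
    if span == 0:
        return list(left) + list(right)

    half_gap = (right[0] - left[-1]) / 2.0
    out_left, out_right = list(left), list(right)

    # Ramp the tail of the left track up (or down) toward the midpoint.
    for k in range(1, span + 1):
        out_left[len(left) - span + k - 1] += (k / span) * half_gap

    # Ramp the head of the right track away from the midpoint.
    for k in range(span):
        out_right[k] -= (1.0 - k / span) * half_gap

    return out_left + out_right


# Two halves of a vowel whose first-formant targets disagree by 100 Hz
joined = smooth_join([400.0] * 4, [500.0] * 4, span=2)
```

This removes the abrupt formant jump, but it only interpolates between two mismatched targets. It cannot restore the vowel quality that either half failed to reach, which is the limitation noted above.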

Efforts to build synthesis-by-rule programs based on the diphone have had considerable success (Dixon and Maxey, 1968; Olive, 1977). The first diphone system, demonstrated at the 1967 M.I.T. Conference on Speech Communication and Processing, was based on a set of stylized stored parameter tracks to control a formant synthesizer (Dixon and Maxey, 1968). The authors spent many years in a trial-and-error effort to optimize a diphone inventory for this purpose (Estes et al., 1964), and eventually produced a system that
