NMAH | Smithsonian Speech Synthesis History Project (dk

lead to advances in semantic representation that can be adapted to text synthesis as well.

III. HARDWARE IMPLEMENTATION

A laboratory text-to-speech system, or a development system, is best implemented on a large general-purpose digital computer. The flexibility and nearly unlimited computational resources outweigh disadvantages of non-real-time output. However, practical commercial systems must realize real-time operation at a reasonable cost/performance tradeoff, while simultaneously providing additional features such as a flexible user interface and telephonics for many commercial applications. Solutions may require specially designed chip sets (Gagnon, 1978; Goldhor and Lund, 1983) or circuit boards containing off-the-shelf components rich in computer power and memory (Groner et al., 1982; Bruckert et al., 1983).

One important design consideration is the sampling rate and resultant high-frequency cutoff of the output speech. Since many business applications require the telephone, some systems limit the frequency response to that of telephone bandwidth -- 3.4 kHz, or the 4.0-kHz limit imposed by the 8-kHz sampling rate of standard codec digital transmission of speech (Groner et al., 1982; Olive and Liberman, 1985). DECtalk, on the other hand, produces information at frequencies up to 5 kHz in order to maximize intelligibility over a loudspeaker in, e.g., handicapped applications, such as a reading machine for the blind.
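The relation between sampling rate and high-frequency cutoff can be checked with a short calculation. The following is a minimal sketch, not part of any system described here; the function name `nyquist_limit_hz` is hypothetical. It uses only the sampling theorem: a signal sampled at a given rate can represent frequencies up to half that rate.

```python
def nyquist_limit_hz(sampling_rate_hz: float) -> float:
    """Return the highest frequency (Hz) representable at a given sampling rate.

    This is the theoretical upper bound; a practical system rolls off
    somewhat below it (e.g., the 3.4-kHz telephone bandwidth).
    """
    if sampling_rate_hz <= 0:
        raise ValueError("sampling rate must be positive")
    return sampling_rate_hz / 2.0


# 8-kHz codec sampling rate cited in the text
codec_limit = nyquist_limit_hz(8_000)
# 10 000 samples per second, the DECtalk output rate given later in this section
dectalk_limit = nyquist_limit_hz(10_000)

print(f"codec limit:   {codec_limit / 1000:.1f} kHz")
print(f"DECtalk limit: {dectalk_limit / 1000:.1f} kHz")
```

The two results correspond to the 4.0-kHz and 5-kHz figures quoted above.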

My own experiences may help illustrate hardware issues. In order to transform the Klattalk software into a real-time device, it was necessary for me to find a commercial partner with the appropriate skills and deep pockets. Fortunately, Digital Equipment Corporation was willing to underwrite the development costs. We signed a license agreement in 1982 (Klatt, 1987a), and a product, DECtalk, was announced some 18 months later (Bruckert et al., 1983).

The DECtalk hardware, Fig. 33, was capable of implementing the complete existing Klattalk software; no engineering compromises were necessary. Software added by Digital engineers controlled the user interface to a host computer. Host computer commands were defined to permit initiation or reception of telephone calls, and to permit the host to suddenly halt speaking, or to monitor the instant when a particular word in a sentence has been spoken.

The hardware shown in Fig. 33 includes (1) a Motorola MC68000 general-purpose digital computer that processes text corresponding to one clause at a time, producing a set of synthesizer control parameters every 6.4 ms, and (2) a Texas Instruments TMS-32010 signal processing chip that converts control parameters to difference equation constants, and simulates the digital formant synthesizer in order to produce 10 000 12-bit waveform samples per second. Memory requirements are modest. The 6000-word exceptions dictionary places the greatest demands on memory; it occupies about half of the read-only memory shown in the figure. DECtalk can be controlled by any computer or by an ordinary computer terminal since the communication link is via a standard RS-232 port.
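The division of labor between the two processors can be illustrated with a rough sketch. The code below is not the DECtalk firmware, which this text does not reproduce. It assumes the textbook two-pole digital resonator commonly used in formant synthesizers, and the formant frequency and bandwidth values are illustrative only. The function names are hypothetical. The sketch shows (a) how many waveform samples fall within one 6.4-ms parameter frame, and (b) how a formant frequency and bandwidth might be converted to the constants of a second-order difference equation.

```python
import math

SAMPLE_RATE_HZ = 10_000   # 10 000 samples per second, from the text
FRAME_PERIOD_S = 0.0064   # control parameters updated every 6.4 ms, from the text


def samples_per_frame() -> int:
    """Number of waveform samples computed per control-parameter frame."""
    return round(SAMPLE_RATE_HZ * FRAME_PERIOD_S)


def resonator_constants(freq_hz: float, bandwidth_hz: float,
                        sample_rate_hz: float = SAMPLE_RATE_HZ):
    """Constants (a, b, c) for y[n] = a*x[n] + b*y[n-1] + c*y[n-2].

    Assumed textbook form of a two-pole resonator with unity gain at 0 Hz.
    """
    t = 1.0 / sample_rate_hz
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c


def run_resonator(samples, a, b, c):
    """Apply the difference equation to a sequence of input samples."""
    y1 = y2 = 0.0  # y[n-1] and y[n-2], initially at rest
    out = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


# Illustrative (assumed) first-formant values: 500 Hz center, 60 Hz bandwidth.
a, b, c = resonator_constants(500.0, 60.0)
frame = run_resonator([1.0] + [0.0] * (samples_per_frame() - 1), a, b, c)
print(samples_per_frame(), "samples per frame")
print("impulse response, first 4 samples:", [round(v, 4) for v in frame[:4]])
```

In a cascade formant synthesizer several such resonators would be chained, with their constants recomputed each time a new frame of control parameters arrives. The constants depend only on the frame's parameters, so they need to be computed once per frame rather than once per sample.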

The only disappointment was that the price of the original DECtalk system turned out to be about four times our early estimate of $1000, and this placed the device outside the reach of many potential handicapped users. A recent redesign of the main DECtalk board to contain less "integrated circuit glue" has resulted in the DECtalk 3.0 system that is improved in several performance areas and is less expensive to manufacture, so there is still hope that an acceptable price might be achieved. Board size, about 8 x 10 x 0.7-in. sans power and loudspeaker, is now satisfactory for portability, but lower power consumption is a goal that will have to be met in the future.

Today's technology is such that, I am told, it would be possible to put the entire text-to-speech algorithm on a single wafer-sized integrated circuit chip. However, this is not likely to happen until the demand is sufficient to justify chip design costs. Instead, it appears that future versions of the hardware may move toward greater flexibility by replacing all of the read-only memory with RAM that can be downloaded with new code as algorithms are improved.

IV. PERCEPTUAL EVALUATION OF TEXT-TO-SPEECH SYSTEMS

Text-to-speech systems can be evaluated and compared with respect to intelligibility, naturalness, and suitability for particular applications. One can measure the intelligibility of individual phonemes, words, or words in sentence context, and one can even estimate listening comprehension and
