patients that have to be accessed by phone when a doctor is not near a computer terminal. Attempts to use text-to-speech capabilities in novel ways have led to a computer system that tracks compliance in an experimental hypertension treatment program at Boston University Medical Center (Friedman, private communication). The computer calls each patient every day, and uses DECtalk to ask whether medication has been taken and whether any adverse side effects have occurred. The computer then calls a doctor if the patient's telephone keypad response indicates a problem.
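The call logic of such a system is simple enough to sketch. The following Python fragment is a schematic reconstruction for illustration only, not the Boston University implementation; the interface functions (place_call, speak, get_keypad_digit, notify_doctor) are hypothetical stand-ins for whatever software drives the DECtalk unit and the telephone line.

    # Schematic daily compliance call; all interface functions are
    # hypothetical placeholders for a real TTS/telephony back end.
    def daily_compliance_call(patient, place_call, speak, get_keypad_digit,
                              notify_doctor):
        """Call one patient, ask two yes/no questions, escalate on trouble."""
        if not place_call(patient["phone"]):
            notify_doctor(patient, "patient could not be reached")
            return
        speak("Have you taken your medication today? "
              "Press 1 for yes, 2 for no.")
        took_dose = get_keypad_digit() == "1"
        speak("Have you noticed any side effects? "
              "Press 1 for yes, 2 for no.")
        side_effects = get_keypad_digit() == "1"
        if not took_dose:
            notify_doctor(patient, "missed dose reported")
        elif side_effects:
            notify_doctor(patient, "side effects reported")
        speak("Thank you. Goodbye.")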

Another potential application is an expert system for medical consultation between a doctor and a computerized data base. Those involved in artificial intelligence research have begun to amass large data bases on relations between symptoms and diseases. They hope ultimately to build systems that can reason logically, suggest additional tests, and diagnose disease as well as the average practitioner -- taking advantage of the superb memory capabilities of computers to consider rare clusters of symptoms that many doctors have not encountered in their practice. Text-to-speech telephone access could make such systems widely accessible and inexpensive.
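A toy sketch may make the flavor of such a consultation concrete. The data base entries and the scoring rule below are invented for illustration and are not modeled on any particular medical expert system.

    # Toy symptom-disease data base (entries invented for illustration).
    DISEASE_SYMPTOMS = {
        "disease A": {"fever", "cough", "fatigue"},
        "disease B": {"fever", "rash"},
        "disease C": {"cough", "wheezing"},
    }

    def rank_diagnoses(observed):
        """Score each disease by the fraction of its symptoms observed."""
        scores = {d: len(s & observed) / len(s)
                  for d, s in DISEASE_SYMPTOMS.items()}
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    def suggest_test(observed):
        """Suggest an unobserved symptom separating the top two candidates."""
        (top, _), (second, _) = rank_diagnoses(observed)[:2]
        separating = (DISEASE_SYMPTOMS[top] ^ DISEASE_SYMPTOMS[second]) - observed
        return min(separating) if separating else None

    print(rank_diagnoses({"fever", "cough"}))  # disease A leads
    print(suggest_test({"fever", "cough"}))    # e.g., ask about fatigue

A real system would of course need probabilistic weighting, far larger tables, and clinical validation; the point here is only the architecture of data base plus inference plus text-to-speech delivery over the telephone.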

VI. CONCLUSIONS

Text-to-speech conversion is a new technology with a rapidly changing set of capabilities and potential applications. The best of the current systems are quite intelligible, but suffer from a number of deficiencies that are often grouped under the catch-all term "lack of naturalness." In this article, we have identified many areas where rules and table values can be incrementally improved in the future to achieve more natural and more intelligible speech output from text-to-speech systems. As a consequence, these systems should become more acceptable to a wide range of users.

We have also identified several more basic problems that impede progress in certain areas of the text-to-speech conversion process (and that adversely affect progress in other areas of speech science and technology as well). The first has to do with fitting spectral data obtained from female voices into the framework of current formant synthesizer models. For breathy vowels, the fit is not particularly good (recall Fig. 13), and it appears that some of the spectral deviations caused by tracheal coupling have perceptual importance.

It may be worthwhile to speculate on ways in which this problem might be resolved. Ideally, a new formant synthesizer model will be suggested that is slightly more complex, but still practical to implement. For example, an extra pole, or pole-zero pair might be made available to match extra spectral prominences that are observed. In this scenario, a way will be found to relate speech data from female voices to model parameters, so that a data collection effort will result in effective rules for controlling the new synthesizer model.
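In synthesizer terms, the proposal amounts to cascading one additional resonator and antiresonator with the usual formant chain. The Python sketch below computes magnitude responses for such a cascade; the formant values, and especially the tracheal pole-zero values, are illustrative placeholders rather than measured data.

    import numpy as np

    def resonator(f, F, B):
        """Magnitude response at frequencies f (Hz) of a conjugate pole
        pair with center frequency F and bandwidth B, unity gain at 0 Hz."""
        s = 2j * np.pi * f
        p = -np.pi * B + 2j * np.pi * F
        return np.abs(p * np.conj(p) / ((s - p) * (s - np.conj(p))))

    def antiresonator(f, F, B):
        """Conjugate zero pair: the inverse of a resonator."""
        return 1.0 / resonator(f, F, B)

    f = np.linspace(1.0, 5000.0, 500)

    # All-pole cascade for an /a/-like vowel (formant values illustrative).
    modal = (resonator(f, 700, 90) * resonator(f, 1220, 110)
             * resonator(f, 2600, 170))

    # Extra pole-zero pair standing in for a tracheal resonance near 1.6 kHz.
    breathy = modal * resonator(f, 1600, 200) * antiresonator(f, 1500, 150)

    i = int(np.argmin(np.abs(f - 1600.0)))
    print("extra prominence near 1.6 kHz: %+.1f dB"
          % (20 * np.log10(breathy[i] / modal[i])))

The hard part, as the scenario above assumes, is not adding the parameters but finding rules to control them from data.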

I suspect that the solution will not be that simple. If so, we may have to wait for speech science to provide better answers to some basic questions. The first point to note is that the acoustic theory of speech production, whether simplified or made complex by the introduction of better models of the larynx, trachea, and source-filter interactions, is not intended to be a model of the parameters directly controlled when we speak, nor of the parameters directly involved in the perceptual decoding of speech. The theory is a description of the acoustic behavior of a mechanical system. Therefore, efforts to relate observed spectral data from real female talkers to formant frequencies and other acoustic parameters of the theory have no a priori reason to succeed, and actually stand a good chance of failure, in part because there are too many model parameters compared with available spectral details (especially for talkers with high fundamental frequencies). Are we in a situation where it is possible to collect spectral data, yet be unable to relate it unambiguously to the underlying generation process, or to the processes of speech perception or articulation?
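The undersampling half of this difficulty can be stated numerically. In the sketch below (formant values again illustrative), an all-pole envelope is observed only at harmonic frequencies; at a low fundamental the strongest harmonic coincides with the first formant, while at a high fundamental it can fall 100 Hz away, and nothing in the sampled values distinguishes the competing formant estimates.

    import numpy as np

    def envelope_db(f, formants=((700, 90), (1220, 110), (2600, 170))):
        """All-pole envelope in dB for (frequency, bandwidth) pairs in Hz."""
        mag = np.ones_like(f)
        for F, B in formants:
            p = -np.pi * B + 2j * np.pi * F
            s = 2j * np.pi * f
            mag = mag * np.abs(p * np.conj(p) / ((s - p) * (s - np.conj(p))))
        return 20.0 * np.log10(mag)

    for f0 in (100.0, 300.0):  # low- vs high-pitched voice
        harmonics = np.arange(f0, 1000.0, f0)  # harmonics in the F1 region
        apparent_f1 = harmonics[np.argmax(envelope_db(harmonics))]
        print("F0 = %3.0f Hz: strongest harmonic below 1 kHz at %3.0f Hz"
              " (true F1 = 700 Hz)" % (f0, apparent_f1))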

If this characterizes the present state of speech science, and I think it does, then the real bottleneck is the absence of a satisfactory perceptual theory to account for listeners' behavior in terms of observable spectral or waveform details. That we are far from such a theory is obvious, but how to go about attaining one is less clear. Attempts to mimic the steps believed to occur during the encoding stages of peripheral auditory processing are attractive as a first step, but it is unlikely that this encoding alone will be able to explain all of the fundamental perceptual skills that come naturally to humans, but not to speech recognition devices.

Even the simplest of objectives, such as being able to categorize static critical-band spectra of vowels on the basis of a distance metric (Bladon and Lindblom, 1981), or to relate pairs of vowel spectra in terms of phonetic similarity (Klatt, 1982c), are well beyond our capabilities and understanding. Figure 34 shows pairs of critical-band spectra of vowels similar to /a/ that illustrate some of the difficulties encountered by a Euclidean metric. Spectral changes that affect peak locations are phonetically more important than other changes, even for low-pitched male voices synthesized to conform to the all-pole model of the vocal tract transfer function. But efforts to interpret critical-band spectra in terms of peak locations are thwarted in higher-pitched voices because individual harmonics are resolved, and breathy vowels introduce unexpected extra peaks. So long as we cannot always interpret spectral data from high-pitched voices in terms of formant parameters, or characterize the perceptual implications of spectral details, it is very likely that a synthetic female voice will remain an elusive goal, as may some aspects of the perceived naturalness of all male and female voices created by rules.
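As a concrete, and deliberately crude, version of this difficulty, the sketch below compares critical-band spectra with a Euclidean metric, in the spirit of the Bladon and Lindblom (1981) comparison though not their implementation; the Bark conversion follows Zwicker and Terhardt (1980), and all formant and tilt values are illustrative. With these illustrative numbers, a small change in spectral tilt, which moves no peaks, can come out roughly comparable in Euclidean distance to an 80-Hz first-formant shift, although only the latter is phonetically important.

    import numpy as np

    def bark(f):
        """Zwicker and Terhardt (1980) Hz-to-Bark approximation."""
        return 13.0 * np.arctan(7.6e-4 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

    def envelope_db(f, formants):
        """All-pole spectral envelope in dB for (frequency, bandwidth) pairs."""
        mag = np.ones_like(f)
        for F, B in formants:
            p = -np.pi * B + 2j * np.pi * F
            s = 2j * np.pi * f
            mag = mag * np.abs(p * np.conj(p) / ((s - p) * (s - np.conj(p))))
        return 20.0 * np.log10(mag)

    def critical_band_spectrum(f, level_db, n_bands=18):
        """Mean level in successive 1-Bark bands (a crude filter bank)."""
        z = bark(f)
        return np.array([level_db[(z >= b) & (z < b + 1)].mean()
                         for b in range(n_bands)])

    f = np.linspace(50.0, 5000.0, 2000)
    formants = [(700, 90), (1220, 110), (2600, 170)]  # /a/-like, illustrative
    base = critical_band_spectrum(f, envelope_db(f, formants))

    # Change 1: shift F1 by 80 Hz -- a clearly audible phonetic change.
    shifted = critical_band_spectrum(
        f, envelope_db(f, [(780, 90), (1220, 110), (2600, 170)]))

    # Change 2: a 0.5 dB/octave tilt -- peaks unmoved, phonetically minor.
    tilted = critical_band_spectrum(
        f, envelope_db(f, formants) - 0.5 * np.log2(f / 50.0))

    print("distance, F1 700 -> 780 Hz: %.1f dB" % np.linalg.norm(base - shifted))
    print("distance, 0.5 dB/oct tilt:  %.1f dB" % np.linalg.norm(base - tilted))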

The second set of fundamental problems that we have identified arises when contemplating the creation of "natural" rule systems that manipulate articulatory structures. Where are the data that might facilitate creation of realistic models and model behavior? The acoustic consequences of any articulation depend on the cross-sectional area of the tube that is formed, and precision of specification is most important in locations of narrow constrictions. However, x-ray data, which are sparse, give only rough outlines in two dimensions, from which cross-sectional area must be inferred. And x-ray data do not characterize the masses and
 
