SSSHP Contents | Labs

 KLATT 1987, p. 783 
 

differ. It is less easy to tell whether individual differences are perceptually important, but if one has some idea of discrimination limits, the perceptual salience of various speech cues, and the articulatory basis of acoustic discrepancies, then good guesses can be made as to the specific rules needed in the future. In this sense, all of the systems are amenable to incremental improvements so long as their designers have sufficient patience to follow this cookbook method of uncovering acoustic deficiencies.

Part of this process might even be automated. Holmes (1984) describes an effort to automatically time-align a sentence with its synthetic imitation produced by rule, and then incrementally adjust formant frequency table values in the Holmes et al. (1964) rule program until natural and synthetic utterances are maximally similar. If the rules are correctly formulated and complete, such optimization procedures should result in improved imitations of other sentences as well. However, before such optimization efforts realize their full potential, many additional rules appear to be needed at the segmental level, e.g., to derive nuances of vowel quality change as a function of stress and phonetic environment. In the absence of an adequate rule framework, automatic training can at best tune the existing table values; it cannot supply the missing rules, no matter how much data are supplied.
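The kind of loop described for Holmes (1984) can be illustrated with a toy sketch in Python. It is not a reconstruction of the actual procedure: the frame representation, the `synthesize` stand-in for a rule program, the table layout, the step size, and the use of dynamic time warping with a greedy coordinate search are all assumptions chosen for brevity.

```python
from typing import Dict, List, Tuple

Frame = Tuple[float, ...]  # hypothetical frame: formant values in Hz, e.g. (F1, F2)


def frame_dist(a: Frame, b: Frame) -> float:
    """Squared Euclidean distance between two frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dtw_cost(nat: List[Frame], syn: List[Frame]) -> float:
    """Total frame distance along the best monotonic time alignment (classic DTW)."""
    inf = float("inf")
    n, m = len(nat), len(syn)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = frame_dist(nat[i - 1], syn[j - 1]) + step
    return cost[n][m]


def synthesize(table: Dict[str, List[float]], phones: List[str],
               frames_per_phone: int = 3) -> List[Frame]:
    """Toy stand-in for a rule program: hold each phone's table targets
    for a fixed number of frames (no transitions, no durations by rule)."""
    out: List[Frame] = []
    for p in phones:
        out.extend([tuple(table[p])] * frames_per_phone)
    return out


def tune_table(table: Dict[str, List[float]], phones: List[str],
               natural: List[Frame], step: float = 10.0, sweeps: int = 50):
    """Greedy coordinate search: nudge one table value at a time and keep
    only the changes that lower the time-aligned distance to the natural frames."""
    table = {p: list(v) for p, v in table.items()}  # work on a copy
    best = dtw_cost(natural, synthesize(table, phones))
    for _ in range(sweeps):
        improved = False
        for p in table:
            for k in range(len(table[p])):
                for delta in (step, -step):
                    table[p][k] += delta
                    c = dtw_cost(natural, synthesize(table, phones))
                    if c < best:
                        best, improved = c, True
                    else:
                        table[p][k] -= delta  # revert an unhelpful nudge
        if not improved:
            break
    return table, best


if __name__ == "__main__":
    # Made-up "natural" data generated from made-up target values.
    target = {"a": [700.0, 1200.0], "i": [300.0, 2300.0]}
    phones = ["a", "i"]
    natural = synthesize(target, phones, frames_per_phone=4)
    start = {"a": [650.0, 1100.0], "i": [350.0, 2200.0]}
    tuned, cost = tune_table(start, phones, natural)
    print(tuned, cost)
```

The sketch also makes the limitation concrete: the search can only move values that already exist in the table, so any effect the toy rule program does not model (here, transitions and context-dependent vowel quality) stays out of reach however many sentences are used for tuning.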

Text-to-speech programs and research may begin to have an influence on the way phonologists and phoneticians view phonetics and phonemic theory. These linguists have traditionally been reluctant to ascribe psychological reality to the phoneme, preferring to rely on distributional properties of observed sounds as a basis for theorizing (see Fry, 1974 for a good review). To the extent that speech generation programs begin to look like models of human behavior, their representations of language processes and units may become the cornerstones of new linguistic theories. If a synthesis-by-rule program can attract theoretical linguists to the problems inherent in specification of feature implementation rules, and thereby better couple their insights to the problem of allophonic variation, acoustic-phonetic detail, and timing of phonetic events, it is possible that real progress can be made in both engineering and linguistics. At the very least, it can be expected that these programs, in modifiable form, will become a part of the experimental facilities of modern phonetics laboratories, and will influence future generations of students in ways that are hard to predict.

In a similar vein, it is difficult to estimate the impact on the general public of computers that speak and listen. Talking machines may be just a passing fad, but the potential for new and powerful services is so great that this technology could have far-reaching consequences, not only on the nature of normal information collection and transfer, but also on our attitudes toward the distinction between man and computer.

It is sometimes said not only that speech synthesis is easier than automatic speech recognition, but also that the field is so mature that the remaining problems are minor and scientifically uninteresting. I hope that this review has tended to dispel this view by pointing to specific areas where basic knowledge is lacking and significant progress can still be made.
 

ACKNOWLEDGMENTS

Preparation of this review was supported in part by an NIH grant. I am very grateful to Ignatius Mattingly, John Holmes, Jared Bernstein, Osamu Fujimura, Stefanie Shattuck-Hufnagel, and David Pisoni for numerous suggestions based on an earlier draft.
 

APPENDIX: DEMONSTRATION

[ Ed: The speech synthesis examples are on tape SSSHP 32.
Selections from the examples are online at Indiana University:

( http://www.cs.indiana.edu/rhythmsp/ASA/Contents.html )

Index to Demonstrations 1 - 5 (below), 6 - 19, 20 - 36 ]
 

The enclosed 33 1/3-rpm recording contains illustrations of some of the milestones in the development of systems for text-to-speech conversion. For convenience in locating and listening to examples as they are described in the text, it may be desirable to transfer the recording onto a cassette tape. The assistance of H. David Maxey, Michael Hecker, John Holmes, Patrick Nye, Joe Olive, and James Flanagan in assembling these materials is gratefully acknowledged. My thanks also go to Kenneth Stevens, who served as narrator.

The record has been inserted inside the back cover of this issue.
 

Part A: Development of speech synthesizers

The objective of early research on speech synthesis was to test whether a given synthesizer design was capable of high-quality imitations of human voices.

  1. The VODER of Homer Dudley, 1939.  Dudley of AT&T Bell Laboratories designed a speech synthesizer known as the "Voder" (Dudley et al., 1939). It was demonstrated at the 1939 World's Fair in New York.

  2. The Pattern Playback designed by Franklin Cooper, 1951.  The Haskins Laboratories Pattern Playback (Cooper et al., 1951) was designed to permit converting back into sound the patterns observed on broadband sound spectrograms.

  3. PAT, the "Parametric Artificial Talker" of Walter Lawrence, 1953.  Lawrence (1953) of the Signals Research and Development Establishment, Christchurch, England, designed the "PAT" ("Parametric Artificial Talker") parallel formant synthesizer. It was first demonstrated at a conference in London in 1952.

  4. The "OVE" cascade formant synthesizer of Gunnar Fant, 1953.  Fant (1953) of the Royal Institute of Technology in Stockholm, Sweden designed a cascade formant synthesizer ("OVE I"). It was demonstrated at the same London conference in 1952.

  5. Copying a natural sentence using Walter Lawrence's PAT formant synthesizer, 1962.  Tony Anthony [Ed: James "Tony" Anthony] and Walter Lawrence attempted to match a natural recording using an updated version of PAT (Anthony and Lawrence, 1962). It was demonstrated at the 1962 Stockholm Speech Communication Conference. Compare with the OVE II version of the same utterance, next.

Smithsonian Speech Synthesis History Project
National Museum of American History | Archives Center