NOTES
- For an application requiring a limited set of sentences with
known structure, such as telephone numbers, some success has been
achieved in concatenating "vocoded" whole words. This is because
it is possible to smooth vocoder parameters at word boundaries,
modify durations, and impose a sentence fundamental frequency
contour on the word string (Rabiner et al., 1971; Olive and
Nakatani, 1974). Also, Cooper et al. (1984) describe an early plan
to concatenate recorded words in a reading machine for the blind,
an application where the motivation of the listeners might overcome
weaknesses of the presentation, but the approach was subsequently
abandoned in favor of synthesis by rule.
- Postvocalic devoicing and flapping are actually late rules,
occurring after vowel durations are computed. The proper ordering
of rules is an important issue in the design of text-to-speech
systems.
- Looking at the same data, we might not agree with their
intuitions.
- The parameter k was assumed to be 0.5 by Delattre et al.
(1955) and by Holmes et al. (1964).
- Actually, Peterson et al. (1958) proposed the term "dyad" as
a set of diphones all having essentially the same articulatory
trajectory from the middle of one segment to the middle of the
next, but differing in prosodic values such as duration and
fundamental frequency contour. Hank Truby
[Ed: Henry M. Truby] was the first to use
the term "diphone" by separating out prosody as independent
variables in synthesis, and calling the remaining phonetic
transition (as represented by synthesizer control data) a "diphone."
As the term diphone has spread in usage, some authors allow it to
refer to larger synthesis units such as consonant clusters when
needed to maximize synthesis fidelity (Dixon and Maxey, 1968), but
we will restrict the term here to mean a transition between adjacent
phonetic segments.
- For example, English vowels can be divided into tense
inherently long vowels and lax short vowels (House, 1961).
- The number of distinguishable stress levels at the lexical
and phrasal levels continues to be an area of linguistic dispute;
see Vanderslice and Ladefoged (1971) and Coker et al. (1973) for
extremal positions.
- In phonological theory, there is usually a distinction made
between a rule that changes a feature or segment discretely, and
a feature implementation algorithm that is subject to low-level
physiological constraints, contextual influences, and graded
behavior. Thus a parameter adjustment rule needed in speech
synthesis probably should correspond to the feature implementation
level of description (e.g., voice onset time is slightly longer
for high versus low vowels even though glottal timing commands
might be the same in two situations), whereas allophone selection
rules should correspond to actual rule-governed changes to motor
commands, as reflected by a change to some segmental feature.
- Not all phonological simplifications preserve boundary
information; for example [h] deletion and flapping result in
an inability to distinguish between "but her" and "butter."
- If errors were independent, words correct would be approximately
equal to phonemes correct to the sixth or seventh power, times the
probability of getting the stress correct.
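The independence approximation in the note above can be spelled out
numerically. The sketch below uses invented accuracy figures purely
for illustration; the function name is ours, not from the original:

```python
# Illustration of the independence approximation: if phoneme errors
# were independent, word accuracy is roughly (phoneme accuracy)^n
# for an n-phoneme word, times the probability of placing stress
# correctly.  The figures below are invented, not measured data.

def word_accuracy(phoneme_correct, n_phonemes, stress_correct):
    """Approximate word accuracy under the independence assumption."""
    return (phoneme_correct ** n_phonemes) * stress_correct

# e.g., 97% phonemes correct and 95% stress correct give only about
# 79% words correct for six-phoneme words:
print(round(word_accuracy(0.97, 6, 0.95), 3))   # 0.791
print(round(word_accuracy(0.97, 7, 0.95), 3))   # 0.768
```

The steep exponent is the point of the note: even high per-phoneme
accuracy compounds into a much lower whole-word score.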
- It is perhaps unfair to evaluate this system against a random
sample of words because it was intended to be used in the context
of a large morpheme dictionary, and therefore would be activated
only for rare words -- words that may be more regular in their
pronunciation.
- Use of the solid curve is equivalent to assuming that another
million-word text sample will contain exactly the same 50 000 words,
whereas it is likely that a different set of rare words will be
found in the new text.
- It is surprising how outdated this corpus has become if the
goal is to obtain a lexicon representative of modern textual material;
Allen and Finkel removed more than 15% of the items as outmoded or
too parochial when they were collecting morphemes by hand. We would
all benefit from a modern replication of the Kucera and Francis task,
especially now that it is practical to examine much larger data
bases than only a million words.
- In theory, every time a new rule was added to the morph
decomposition process, it was necessary to go back and check the
entire lexicon for accidental incorrect decompositions.
- An even less sensitive test is the diagnostic rhyme test
(Voiers, 1983) which involves a single pair of alternative responses
for each familiar CVC word.
- Most of these "errors" can be attributed to problems with
phonemic symbolization; phonetically trained listeners typically
perform at better than 99% correct on the same task (Rabiner,
1969).
- Multipulse linear prediction was designed to make possible
the detailed modeling of the voicing source waveform, but in fact
it is simply a method of introducing zeros into the representation
of any speech sound. It appears that multipulse has little advantage
for voiced segments in text-to-speech systems because the rule
system imposes an f0 contour different from that observed in the
original natural speech recording. However, multipulse may be able
to better approximate, e.g., the coherent release of plosive bursts
(Maeda, 1987).
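The pulse-placement idea behind multipulse excitation can be
sketched with a toy greedy search: repeatedly choose the pulse
position and amplitude whose filtered response best reduces the
remaining error. This is only a simplified sketch, not Atal and
Remde's actual analysis-by-synthesis procedure; the function name
and toy signals are invented for illustration:

```python
def multipulse_excitation(target, h, n_pulses):
    """Toy greedy multipulse search (a simplified sketch).

    target: waveform to approximate; h: impulse response of the LPC
    synthesis filter; returns pulse positions and amplitudes whose
    filtered sum approximates the target.
    """
    residual = [float(x) for x in target]
    n, m = len(residual), len(h)
    positions, amplitudes = [], []
    for _ in range(n_pulses):
        best = (0, 0.0, -1.0)           # (position, amplitude, error drop)
        for p in range(n):
            k = min(m, n - p)           # truncate h near the frame end
            c = sum(residual[p + i] * h[i] for i in range(k))
            e = sum(h[i] * h[i] for i in range(k))
            if e > 0 and c * c / e > best[2]:
                best = (p, c / e, c * c / e)
        p, a, _ = best
        positions.append(p)
        amplitudes.append(a)
        for i in range(min(m, n - p)):  # subtract this pulse's contribution
            residual[p + i] -= a * h[i]
    return positions, amplitudes

# toy demo: the target is the filter response to a single pulse of
# amplitude 2.0 at sample 3, which one search step recovers exactly
h = [1.0, 0.5, 0.25]
target = [0.0] * 20
for i in range(3):
    target[3 + i] = 2.0 * h[i]
print(multipulse_excitation(target, h, 1))   # ([3], [2.0])
```

Because the pulses are free to fall anywhere, the excitation can
represent signal components a pure all-pole model cannot, which is
the sense in which multipulse "introduces zeros" into the
representation.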
- Pisoni and Koen (1981) obtained similar results, although the
difference between natural and synthetic speech was greater, perhaps
because the MITalk system that they used is not quite as
intelligible.
- Carlson and Granström (1976) had noted the same kind of listener
adaptation without feedback in an earlier experimental evaluation.
With feedback, listeners can improve considerably in performance on
intelligibility tests, even with poor quality synthetic speech
(Schwab et al., 1986).
- For example, Xerox Corp. has retrofitted a number of Kurzweil
Reading Machines for the blind that are located in public libraries
with the more intelligible Prose-2000 text-to-speech board. Digital
Equipment Corporation has offered a special price for DECtalk units
sold to handicapped individuals and to manufacturers of devices for
the handicapped, resulting in more than a million dollars in price
reductions on units sold to this population.
BIBLIOGRAPHY
Aho, A., and Ullman, J. (1972). The Theory of Parsing, Translation,
and Compiling (Prentice-Hall, Englewood Cliffs, NJ).
Ainsworth, W. A. (1973). "A System for Converting English Text into
Speech," IEEE Trans. Audio Electroacoust. AU-21, 288-290.
Allen, D. R., and Strong, W. J. (1985). "A Model for the Synthesis
of Natural Sounding Vowels," J. Acoust. Soc. Am. 78, 58-69.
Allen, J. (1976). "Synthesis of Speech from Unrestricted Text,"
Proc. IEEE 64, 422-433.
Allen, J., Hunnicutt, S., and Klatt, D. H. (1987). From Text to
Speech: The MITalk System (Cambridge U.P., Cambridge, UK).
Allen, J., Hunnicutt, S., Carlson, R., and Granström, B. (1979).
"MITalk-79: The MIT Text-to-Speech System," J. Acoust. Soc. Am.
Suppl. 1 65, S130.
Ananthapadmanabha, T. V. (1984). "Acoustic Analysis of Voice Source
Dynamics," Speech Transmission Laboratory, Royal Institute of
Technology, Stockholm, Sweden, QPSR 2-3, 1-24.
Anderson, M., Pierrehumbert, J., and Liberman, M. (1984).
"Synthesis by Rule of English Intonation Patterns," Proc. Int. Conf.
Acoust., Speech Signal Process. ICASSP-84, 2.9.1-2.9.4.
Anthony, J., and Lawrence, W. (1962). "A Resonance Analogue
Speech Synthesizer," Proc. 4th Int. Cong. Acoust., Copenhagen,
Denmark.
Armstrong, L. E., and Ward, I. C. (1931). A Handbook of
English Intonation (Cambridge U.P., Cambridge, England), 2nd ed.
ASHA (1981). Position Paper of the Ad Hoc Committee on
Communication Processes for Nonspeaking Persons, American Speech and
Hearing Association, Rockville, MD.
Atal, B. S., and Hanauer, S. L. (1971). "Speech Analysis
and Synthesis by Linear Prediction of the Speech Wave," J. Acoust.
Soc. Am. 50, 637-655.
Atal, B. S., and Remde, J. R. (1982). "A New Model of
LPC Excitation for Producing Natural-Sounding Speech at Low Bit
Rates," Proc. Int. Conf. Acoust., Speech Signal Process. ICASSP-82,
614-617.
Atal, B. S., and Schroeder, M. R. (1975). "Recent Advances in
Predictive Coding: Applications to Speech Synthesis," in Speech
Communication, edited by G. Fant (Almqvist and Wiksell, Uppsala,
Sweden), Vol. I, pp. 27-31.
Atkinson, R. C. (1972). "Teaching Children to Read Using a Computer,"
Am. Psychol. 27, 169-178.
Baer, T. (1981). "Observation of Vocal Fold Vibration:
Measurement of Excised Larynges," in Vocal Fold Physiology, edited by K. N.
Stevens and M. Hirano (University of Tokyo Press, Tokyo, Japan),
pp. 119-136.
Barnwell, T. P. (1971). "An Algorithm for Segment
Duration in a Reading Machine Context," Research Laboratory of
Electronics, Tech. Report 479, M.I.T., Cambridge, MA.
Bassak, G. (1980). "Phoneme-Based Speech Chip Needs
Less Memory," Electronics 53, 43-44.
Bell, T. (1983). "Talk to Me," Personal Computing 7,
120-206 (September 1983).
Bernstein, J. (1986). "Voice Identity and Attitude,"
Proceedings Speech Tech '86 (Media Dimensions, New York), pp. 213-215.