NOTES
- For an application requiring a limited set of sentences with
known structure, such as telephone numbers, some success has been
achieved in concatenating "vocoded" whole words. This is because
it is possible to smooth vocoder parameters at word boundaries,
modify durations, and impose a sentence fundamental frequency
contour on the word string (Rabiner et al., 1971; Olive and
Nakatani, 1974). Also, Cooper et al. (1984) describe an early plan
to concatenate recorded words in a reading machine for the blind,
an application where the motivation of the listeners might overcome
weaknesses of the presentation, but the approach was subsequently
abandoned in favor of synthesis by rule.
- Postvocalic devoicing and flapping are actually late rules,
occurring after vowel durations are computed. The proper ordering
of rules is an important issue in the design of text-to-speech
systems.
- Looking at the same data, we might not agree with their
intuitions.
- The parameter k was assumed to be 0.5 by Delattre et al.
(1955) and by Holmes et al. (1964).
- Actually, Peterson et al. (1958) proposed the term "dyad" as
a set of diphones all having essentially the same articulatory
trajectory from the middle of one segment to the middle of the
next, but differing in prosodic values such as duration and
fundamental frequency contour. Hank Truby
[Ed: Henry M. Truby] was the first to use
the term "diphone" by separating out prosody as independent
variables in synthesis, and calling the remaining phonetic
transition (as represented by synthesizer control data) a "diphone."
As the term diphone has spread in usage, some authors allow it to
refer to larger synthesis units such as consonant clusters when
needed to maximize synthesis fidelity (Dixon and Maxey, 1968), but
we will restrict the term here to mean a transition between adjacent
phonetic segments.
- For example, English vowels can be divided into tense
inherently long vowels and lax short vowels (House, 1961).
- The number of distinguishable stress levels at the lexical
and phrasal levels continues to be an area of linguistic dispute;
see Vanderslice and Ladefoged (1971) and Coker et al. (1973) for
extremal positions.
- In phonological theory, there is usually a distinction made
between a rule that changes a feature or segment discretely, and
a feature implementation algorithm that is subject to low-level
physiological constraints, contextual influences, and graded
behavior. Thus a parameter adjustment rule needed in speech
synthesis probably should correspond to the feature implementation
level of description (e.g., voice onset time is slightly longer
for high versus low vowels even though glottal timing commands
might be the same in two situations), whereas allophone selection
rules should correspond to actual rule-governed changes to motor
commands, as reflected by a change to some segmental feature.
- Not all phonological simplifications preserve boundary
information; for example [h] deletion and flapping result in
an inability to distinguish between "but her" and "butter."
- If errors were independent, words correct would be approximately
equal to phonemes correct to the sixth or seventh power, times the
probability of getting the stress correct.
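The independence approximation in the note above can be spelled out
numerically. The sketch below uses invented accuracy figures purely
for illustration; the function name is ours, not from the original:

```python
# Illustration of the independence approximation: if phoneme errors
# were independent, word accuracy is roughly (phoneme accuracy)^n
# for an n-phoneme word, times the probability of placing stress
# correctly.  The figures below are invented, not measured data.

def word_accuracy(phoneme_correct, n_phonemes, stress_correct):
    """Approximate word accuracy under the independence assumption."""
    return (phoneme_correct ** n_phonemes) * stress_correct

# e.g., 97% phonemes correct and 95% stress correct give only about
# 79% words correct for six-phoneme words:
print(round(word_accuracy(0.97, 6, 0.95), 3))   # 0.791
print(round(word_accuracy(0.97, 7, 0.95), 3))   # 0.768
```

The steep exponent is the point of the note: even high per-phoneme
accuracy compounds into a much lower whole-word score.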
- It is perhaps unfair to evaluate this system against a random
sample of words because it was intended to be used in the context
of a large morpheme dictionary, and therefore would be activated
only for rare words -- words that may be more regular in their
pronunciation.
- Use of the solid curve is equivalent to assuming that another
million-word text sample will contain exactly the same 50 000 words,
whereas it is likely that a different set of rare words will be
found in the new text.
- It is surprising how outdated this corpus has become if the
goal is to obtain a lexicon representative of modern textual material;
Allen and Finkel removed more than 15% of the items as outmoded or
too parochial when they were collecting morphemes by hand. We would
all benefit from a modern replication of the Kucera and Francis task,
especially now that it is practical to examine much larger data
bases than only a million words.
- In theory, every time a new rule was added to the morph
decomposition process, it was necessary to go back and check the
entire lexicon for accidental incorrect decompositions.
- An even less sensitive test is the diagnostic rhyme test
(Voiers, 1983) which involves a single pair of alternative responses
for each familiar CVC word.
- Most of these "errors" can be attributed to problems with
phonemic symbolization; phonetically trained listeners typically
perform at better than 99% correct on the same task (Rabiner,
1969).
- Multipulse linear prediction was designed to make possible
the detailed modeling of the voicing source waveform, but in fact
it is simply a method of introducing zeros into the representation
of any speech sound. It appears that multipulse has little advantage
for voiced segments in text-to-speech systems because the rule
system imposes an f0 contour different from that observed in the
original natural speech recording. However, multipulse may be able
to better approximate, e.g., the coherent release of plosive bursts
(Maeda, 1987).
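The pulse-placement idea behind multipulse excitation can be
sketched with a toy greedy search: repeatedly choose the pulse
position and amplitude whose filtered response best reduces the
remaining error. This is only a simplified sketch, not Atal and
Remde's actual analysis-by-synthesis procedure; the function name
and toy signals are invented for illustration:

```python
def multipulse_excitation(target, h, n_pulses):
    """Toy greedy multipulse search (a simplified sketch).

    target: waveform to approximate; h: impulse response of the LPC
    synthesis filter; returns pulse positions and amplitudes whose
    filtered sum approximates the target.
    """
    residual = [float(x) for x in target]
    n, m = len(residual), len(h)
    positions, amplitudes = [], []
    for _ in range(n_pulses):
        best = (0, 0.0, -1.0)           # (position, amplitude, error drop)
        for p in range(n):
            k = min(m, n - p)           # truncate h near the frame end
            c = sum(residual[p + i] * h[i] for i in range(k))
            e = sum(h[i] * h[i] for i in range(k))
            if e > 0 and c * c / e > best[2]:
                best = (p, c / e, c * c / e)
        p, a, _ = best
        positions.append(p)
        amplitudes.append(a)
        for i in range(min(m, n - p)):  # subtract this pulse's contribution
            residual[p + i] -= a * h[i]
    return positions, amplitudes

# toy demo: the target is the filter response to a single pulse of
# amplitude 2.0 at sample 3, which one search step recovers exactly
h = [1.0, 0.5, 0.25]
target = [0.0] * 20
for i in range(3):
    target[3 + i] = 2.0 * h[i]
print(multipulse_excitation(target, h, 1))   # ([3], [2.0])
```

Because the pulses are free to fall anywhere, the excitation can
represent signal components a pure all-pole model cannot, which is
the sense in which multipulse "introduces zeros" into the
representation.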
- Pisoni and Koen (1981) obtained similar results, although the
difference between natural and synthetic speech was greater, perhaps
because the MITalk system that they used is not quite as
intelligible.
- Carlson and Granström (1976) had noted the same kind of listener
adaptation without feedback in an earlier experimental evaluation.
With feedback, listeners can improve considerably in performance on
intelligibility tests, even with poor quality synthetic speech
(Schwab et al., 1986).
- For example, Xerox Corp. has retrofitted a number of Kurzweil
Reading Machines for the blind that are located in public libraries
with the more intelligible Prose-2000 text-to-speech board. Digital
Equipment Corporation has offered a special price for DECtalk units
sold to handicapped individuals and to manufacturers of devices for
the handicapped, resulting in more than a million dollars in price
reductions on units sold to this population.
BIBLIOGRAPHY
Aho, A., and Ullman, J. (1972). The Theory of Parsing, Translation,
and Compiling (Prentice-Hall, Englewood Cliffs, NJ).
Ainsworth, W. A. (1973). "A System for Converting English Text into
Speech," IEEE Trans. Audio Electroacoust. AU-21, 288-290.
Allen, D. R., and Strong, W. J. (1985). "A Model for the Synthesis
of Natural Sounding Vowels," J. Acoust. Soc. Am. 78, 58-69.
Allen, J. (1976). "Synthesis of Speech from Unrestricted Text,"
Proc. IEEE 64, 422-433.
Allen, J., Hunnicutt, S., and Klatt, D. H. (1987). From Text to
Speech: The MITalk System (Cambridge U.P., Cambridge, UK).
Allen, J., Hunnicutt, S., Carlson, R., and Granström, B. (1979).
"MITalk-79: The MIT Text-to-Speech System," J. Acoust. Soc. Am.
Suppl. 1 65, S130.
Ananthapadmanabha, T. V. (1984). "Acoustic Analysis of Voice Source
Dynamics," Speech Transmission Laboratory, Royal Institute of
Technology, Stockholm, Sweden, QPSR 2-3, 1-24.
Anderson, M., Pierrehumbert, J., and Liberman, M. (1984).
"Synthesis by Rule of English Intonation Patterns," Proc. Int. Conf.
Acoust., Speech Signal Process. ICASSP-84, 2.9.1-2.9.4.
Anthony, J., and Lawrence, W. (1962). "A Resonance Analogue
Speech Synthesizer," Proc. 4th Int. Cong. Acoust., Copenhagen,
Denmark.
Armstrong, L. E., and Ward, I. C. (1931). A Handbook of
English Intonation (Cambridge U.P., Cambridge, England), 2nd ed.
ASHA (1981). Position Paper of the Ad Hoc Committee on
Communication Processes for Nonspeaking Persons, American Speech and
Hearing Association, Rockville, MD.
Atal, B. S., and Hanauer, S. L. (1971). "Speech Analysis
and Synthesis by Linear Prediction of the Speech Wave," J. Acoust.
Soc. Am. 50, 637-655.
Atal, B. S., and Remde, J. R. (1982). "A New Model of
LPC Excitation for Producing Natural-Sounding Speech at Low Bit
Rates," Proc. Int. Conf. Acoust., Speech Signal Process. ICASSP-82,
614-617.
Atal, B. S., and Schroeder, M. R. (1975). "Recent Advances in
Predictive Coding: Applications to Speech Synthesis," in Speech
Communication, edited by G. Fant (Almqvist and Wiksell, Uppsala,
Sweden), Vol. I, pp. 27-31.
Atkinson, R. C. (1972). "Teaching Children to Read Using a Computer,"
Am. Psychol. 27, 169-178.
Baer, T. (1981). "Observation of Vocal Fold Vibration:
Measurement of Excised Larynges," in Vocal Fold Physiology, edited by K. N.
Stevens and M. Hirano (University of Tokyo Press, Tokyo, Japan),
pp. 119-136.
Barnwell, T. P. (1971). "An Algorithm for Segment
Duration in a Reading Machine Context," Research Laboratory of
Electronics, Tech. Report 479, M.I.T., Cambridge, MA.
Bassak, G. (1980). "Phoneme-Based Speech Chip Needs
Less Memory," Electronics 53, 43-44.
Bell, T. (1983). "Talk to Me," Personal Computing 7,
120-206 (September 1983).
Bernstein, J. (1986). "Voice Identity and Attitude,"
Proceedings Speech Tech '86 (Media Dimensions, New York), pp. 213-215.