part-of-speech alternatives for most words of the sentence from the
morpheme decomposition routine, and assumed tentatively that all
unanalyzable words were nouns. The syntactic analysis proceeded
left-to-right, attempting to add as many words as possible to each
phrasal constituent. A backup algorithm suggested by Lorinda Cherry
at Bell Laboratories sought possible verbs if it turned out that
this process failed to recover a verb, as would be the case when a
noun/verb ambiguity like "permit" was present in a sentence such
as "Police permit mopeds." While the performance of this parser was
never extensively tested, examination of some sample texts (Allen et
al., 1987, pp. 89-92) suggests that it works reasonably well, but
produces several inappropriate pauses and pseudopauses at falsely
detected boundaries.
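The noun-default tagging and verb-backup pass described above can be sketched as follows. This is a minimal illustration, not the actual MITalk routines: the part-of-speech table is a tiny hypothetical sample, and the backup heuristic (promote the first noun/verb-ambiguous word that follows a noun) is an assumed simplification of Cherry's algorithm.

```python
# Hypothetical sketch of the noun-default / verb-backup strategy.
# The lexicon and backup heuristic are illustrative assumptions.
POS = {
    "police": {"noun", "verb"},
    "permit": {"noun", "verb"},
    "mopeds": {"noun"},
}

def tag_sentence(words):
    # Tentatively assume any unanalyzable word is a noun.
    tags = [sorted(POS.get(w.lower(), {"noun"})) for w in words]
    # First pass: prefer the noun reading of every word.
    chosen = ["noun" if "noun" in t else t[0] for t in tags]
    # Backup pass (after Cherry): if no verb was recovered, promote the
    # first noun/verb-ambiguous word that follows a noun to a verb.
    if "verb" not in chosen:
        for i, t in enumerate(tags):
            if i > 0 and "verb" in t and chosen[i - 1] == "noun":
                chosen[i] = "verb"
                break
    return list(zip(words, chosen))
```

On the example above, the backup pass recovers "permit" as the verb of "Police permit mopeds" even though the first pass tagged every word as a noun.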
If a parts-of-speech categorization is not available for most words,
the simplest parsing strategy would be to use function words such as
prepositions, conjunctions, and articles to find obvious phrase
boundaries, leaving the remaining boundaries undetected. This is
the strategy employed in the Prose-2000 and in the Infovox SA-101.
The Votrax Type-n-Talk appears to use only punctuation marks as
parsing cues.
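The function-word strategy amounts to little more than a table lookup; a minimal sketch follows. The word lists are small illustrative samples, and the rule suppressing a boundary between two adjacent function words is an assumption, not a documented feature of the Prose-2000 or SA-101.

```python
# Minimal sketch of the function-word parsing strategy: hypothesize a
# phrase boundary before each preposition, conjunction, or article,
# leaving all other boundaries undetected. Word lists are samples only.
FUNCTION_WORDS = {
    "in", "on", "at", "with", "of",   # prepositions
    "and", "or", "but",               # conjunctions
    "a", "an", "the",                 # articles
}

def phrase_boundaries(words):
    """Return word indices at which a phrase boundary is hypothesized."""
    return [i for i, w in enumerate(words)
            if i > 0
            and w.lower() in FUNCTION_WORDS
            and words[i - 1].lower() not in FUNCTION_WORDS]
```

For "She hit the old man with the umbrella," this places boundaries before "the" and "with" but, as the text notes, cannot detect boundaries that lack a function-word cue.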
DECtalk employs not only function words, but also a moderate-sized
dictionary of verbs that unambiguously indicate the beginning of a
verb phrase (Klatt, 1975a). Detection of the beginning of a verb
phrase in a long clause permits DECtalk to break the intonation
contour into two rise-fall "hat-pattern" units that help the listener
parse the sentence. However, it is better to miss a noun-phrase/
verb-phrase boundary than to insert prosodic boundary gestures
(fall-rise intonation contour and lengthening of a phrase-final
syllable) at locations where they do not belong. In an earlier
experimental system that assumed that any word that could be a
verb was a verb, listeners were distracted and often confused by
extra prosodic boundaries, while the absence of a prosodic gesture
just sounded like the speaker was talking too fast. DECtalk also
provides a simple mechanism for a user to indicate a phrase boundary
when one is missed -- the [ ) ] symbol can be inserted between the
words in question. DECtalk does not try to disambiguate noun/verb
ambiguities; the most frequent pronunciation is given unless the
user requests the second most frequent pronunciation by attaching
a special symbol to the front of the orthography.
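The verb-dictionary mechanism described above can be sketched as splitting a clause into two intonational units at the first unambiguous verb. The verb list below is an illustrative sample, not DECtalk's actual dictionary, and the split rule is an assumed simplification of its hat-pattern logic.

```python
# Hypothetical sketch of the DECtalk-style strategy: a small dictionary
# of unambiguous verbs marks the start of a verb phrase, so a long
# clause can be broken into two rise-fall "hat-pattern" units.
UNAMBIGUOUS_VERBS = {"sat", "gave", "bought", "wrote"}

def split_hat_patterns(words):
    """Split a clause into [noun-phrase, verb-phrase] word lists at the
    first unambiguous verb; return the whole clause if none is found."""
    for i, w in enumerate(words):
        if i > 0 and w.lower() in UNAMBIGUOUS_VERBS:
            return [words[:i], words[i:]]
    return [words]
```

Because the dictionary contains only unambiguous verbs, a clause with no such verb is simply left as one unit, consistent with the preference for missing a boundary over inserting a false one.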
DECtalk and other text-to-speech systems make a large number of
syntactic errors that lead to noticeable misphrasings. In the
future, syntactic routines will be expected to provide better
detection of the following:
- phrasal constituency -- particularly the locations of left-branching
constituents and non-adjacent sister constituents that should probably
be marked by prosodic gestures,
- internal structure and compounding relations within long
noun/adjective strings,
- when to "pop" from an embedded clause that is not terminated
by a comma,
- how to determine the nature of conjoined units on either side of
a conjunction so as to be able to insert a syntactic break when
appropriate,
- syntactic deletion sites where some sort of prosodic gesture
should be synthesized to indicate the location of the missing
material (Cooper et al., 1978),
- how to detect tags and parenthetical material such as "This is the
answer, he told us," that are usually said in a noninflected
way,
- resolution of part-of-speech ambiguity, for (1) words that can
be either an unstressed preposition or a stressed verbal particle,
such as "on" in "He takes on hard jobs," (2) instances where "that"
is functioning as a (stressed) demonstrative, e.g., "I know (that)
THAT book is red," rather than as an unstressed clause introducer,
as in "I know that books are red," and (3) instances of compounds
that are pronounced with reduced stress on the second word, such
as "He lived in Baker House" (this is largely a lexical/semantics
problem).
D. Semantic analysis
Semantic and pragmatic knowledge is needed to disambiguate
sentences like the ones the New Yorker is fond of reprinting. For
example, in a sentence such as "She hit the old man with the
umbrella," there may be a pseudopause (a slowing down of speaking
rate and a fall-rise in pitch) between the words "man" and "with"
if the woman held the umbrella, but not if the old man did.
Similarly, a "rocking chair" will have the word "chair" destressed
if the combination of adjective and noun has been associated by
frequent use into a single compound-noun entity. Emphasis or
contrastive stress may be applied to an important word depending on
the meaning: "The OLD man sat in a rocker" (not the younger man).
Finally, words that have lost their importance in a dialog, either
because of prior occurrence of the word or by anaphoric reference,
should be destressed.
No text-to-speech system is capable of dealing automatically with
any of these issues. DECtalk employs the simplest possible solution
by providing the user with an input inventory of symbols to
facilitate user specification of the locations of missing
pseudopauses (the [ ) ] symbol), unmarked compound words
(spell as "rocking-chair"), and emphasis (precede the
emphasized word by an emphasis symbol ["]).
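The three user annotations just described can be collected into a short illustration. The exact symbol placement within each string is a sketch based on the description above, not verified DECtalk input syntax.

```python
# Illustrative DECtalk-style annotated inputs using the three mechanisms
# described in the text (placement of symbols is an assumption):
examples = [
    'She hit the old man ) with the umbrella.',  # [ ) ] marks a missed pseudopause
    'He sat in a rocking-chair.',                # hyphen marks an unmarked compound
    'The "old man sat in a rocker.',             # ["] precedes the emphasized word
]
```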
It is possible to think of applications where the computer is not
simply attempting to speak ASCII text, but may know a great deal
about the meaning of the message, perhaps having formulated the
text from a deep-structure semantic representation in, e.g., a data
base information retrieval application (Young and Fallside, 1979).
In such cases, one would want to take advantage of the ability to
mark for emphasis important words when forming the input to the
text-to-speech system. Hirschberg and Pierrehumbert (1986) provide
an excellent review of the factors influencing the intonational
structuring of discourse.
In the future, systems that have available parts-of-speech
information from a large morpheme lexicon can be expected to
develop better syntactic analysis routines that are particularly
suited to the problems of text synthesis. Perhaps computer science
efforts to produce expert systems will