part-of-speech alternatives for most words of the sentence from the
morpheme decomposition routine, and assumed tentatively that all
unanalyzable words were nouns. The syntactic analysis proceeded
left-to-right, attempting to add as many words as possible to each
phrasal constituent. A backup algorithm suggested by Lorinda Cherry
at Bell Laboratories sought possible verbs if it turned out that
this process failed to recover a verb, as would be the case when a
noun/verb ambiguity like "permit" was present in a sentence such
as "Police permit mopeds." While the performance of this parser was
never extensively tested, examination of some sample texts (Allen et
al., 1987, pp. 89-92) suggests that it works reasonably well, but
produces several inappropriate pauses and pseudopauses at falsely
detected boundaries.
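The noun-default tagging and verb-backup pass described above can be sketched as follows. This is a minimal illustration, not the actual MITalk routines: the part-of-speech table is a tiny hypothetical sample, and the backup heuristic (promote the first noun/verb-ambiguous word that follows a noun) is an assumed simplification of Cherry's algorithm.

```python
# Hypothetical sketch of the noun-default / verb-backup strategy.
# The lexicon and backup heuristic are illustrative assumptions.
POS = {
    "police": {"noun", "verb"},
    "permit": {"noun", "verb"},
    "mopeds": {"noun"},
}

def tag_sentence(words):
    # Tentatively assume any unanalyzable word is a noun.
    tags = [sorted(POS.get(w.lower(), {"noun"})) for w in words]
    # First pass: prefer the noun reading of every word.
    chosen = ["noun" if "noun" in t else t[0] for t in tags]
    # Backup pass (after Cherry): if no verb was recovered, promote the
    # first noun/verb-ambiguous word that follows a noun to a verb.
    if "verb" not in chosen:
        for i, t in enumerate(tags):
            if i > 0 and "verb" in t and chosen[i - 1] == "noun":
                chosen[i] = "verb"
                break
    return list(zip(words, chosen))
```

On the example above, the backup pass recovers "permit" as the verb of "Police permit mopeds" even though the first pass tagged every word as a noun.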
If a parts-of-speech categorization is not available for most words,
the simplest parsing strategy would be to use function words such as
prepositions, conjunctions, and articles to find obvious phrase
boundaries, leaving the remaining boundaries undetected. This is
the strategy employed in the Prose-2000 and in the Infovox SA-101.
The Votrax Type-n-Talk appears to use only punctuation marks as
parsing cues.
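The function-word strategy amounts to little more than a table lookup; a minimal sketch follows. The word lists are small illustrative samples, and the rule suppressing a boundary between two adjacent function words is an assumption, not a documented feature of the Prose-2000 or SA-101.

```python
# Minimal sketch of the function-word parsing strategy: hypothesize a
# phrase boundary before each preposition, conjunction, or article,
# leaving all other boundaries undetected. Word lists are samples only.
FUNCTION_WORDS = {
    "in", "on", "at", "with", "of",   # prepositions
    "and", "or", "but",               # conjunctions
    "a", "an", "the",                 # articles
}

def phrase_boundaries(words):
    """Return word indices at which a phrase boundary is hypothesized."""
    return [i for i, w in enumerate(words)
            if i > 0
            and w.lower() in FUNCTION_WORDS
            and words[i - 1].lower() not in FUNCTION_WORDS]
```

For "She hit the old man with the umbrella," this places boundaries before "the" and "with" but, as the text notes, cannot detect boundaries that lack a function-word cue.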
DECtalk employs not only function words, but also a moderate-sized
dictionary of verbs that unambiguously indicate the beginning of a
verb phrase (Klatt, 1975a). Detection of the beginning of a verb
phrase in a long clause permits DECtalk to break the intonation
contour into two rise-fall "hat-pattern" units that help the listener
parse the sentence. However, it is better to miss a noun-phrase/
verb-phrase boundary than to insert prosodic boundary gestures
(fall-rise intonation contour and lengthening of a phrase-final
syllable) at locations where they do not belong. In an earlier
experimental system that assumed that any word that could be a
verb was a verb, listeners were distracted and often confused by
extra prosodic boundaries, while the absence of a prosodic gesture
just sounded like the speaker was talking too fast. DECtalk also
provides a simple mechanism for a user to indicate a phrase boundary
when one is missed -- the [ ) ] symbol can be inserted between the
words in question. DECtalk does not try to disambiguate noun/verb
ambiguities; the most frequent pronunciation is given unless the
user requests the second most frequent pronunciation by attaching
a special symbol to the front of the orthography.
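The verb-dictionary mechanism described above can be sketched as splitting a clause into two intonational units at the first unambiguous verb. The verb list below is an illustrative sample, not DECtalk's actual dictionary, and the split rule is an assumed simplification of its hat-pattern logic.

```python
# Hypothetical sketch of the DECtalk-style strategy: a small dictionary
# of unambiguous verbs marks the start of a verb phrase, so a long
# clause can be broken into two rise-fall "hat-pattern" units.
UNAMBIGUOUS_VERBS = {"sat", "gave", "bought", "wrote"}

def split_hat_patterns(words):
    """Split a clause into [noun-phrase, verb-phrase] word lists at the
    first unambiguous verb; return the whole clause if none is found."""
    for i, w in enumerate(words):
        if i > 0 and w.lower() in UNAMBIGUOUS_VERBS:
            return [words[:i], words[i:]]
    return [words]
```

Because the dictionary contains only unambiguous verbs, a clause with no such verb is simply left as one unit, consistent with the preference for missing a boundary over inserting a false one.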
DECtalk and other text-to-speech systems make a large number of
syntactic errors that lead to noticeable misphrasings. In the
future, syntactic routines will be expected to provide better
detection of the following:
- phrasal constituency -- particularly the locations of left-branching
constituents and non-adjacent sister constituents that should probably
be marked by prosodic gestures,
- internal structure and compounding relations within long
noun/adjective strings,
- when to "pop" from an embedded clause that is not terminated
by a comma,
- how to determine the nature of conjoined units on either side of
a conjunction so as to be able to insert a syntactic break when
appropriate,
- syntactic deletion sites where some sort of prosodic gesture
should be synthesized to indicate the location of the missing
material (Cooper et al., 1978),
- how to detect tags and parenthetical material such as "This is the
answer, he told us," that are usually said in a noninflected
way,
- resolution of part-of-speech ambiguity, for (1) words that can
be either an unstressed preposition or a stressed verbal particle,
such as "on" in "He takes on hard jobs," (2) instances where "that"
is functioning as a (stressed) demonstrative, e.g., "I know (that)
THAT book is red," rather than as an unstressed clause introducer,
as in "I know that books are red," and (3) instances of compounds
that are pronounced with reduced stress on the second word, such
as "He lived in Baker House" (this is largely a lexical/semantics
problem).
D. Semantic analysis
Semantic and pragmatic knowledge is needed to disambiguate
sentences like the ones the New Yorker is fond of reprinting. For
example, in a sentence such as "She hit the old man with the
umbrella," there may be a pseudopause (a slowing down of speaking
rate and a fall-rise in pitch) between the words "man" and "with"
if the woman held the umbrella, but not if the old man did.
Similarly, a "rocking chair" will have the word "chair" destressed
if the combination of adjective and noun has been associated by
frequent use into a single compound-noun entity. Emphasis or
contrastive stress may be applied to an important word depending on
the meaning: "The OLD man sat in a rocker" (not the younger man).
Finally, words that have lost their importance in a dialog, either
because of prior occurrence of the word or by anaphoric reference,
should be destressed.
No text-to-speech system is capable of dealing automatically with
any of these issues. DECtalk employs the simplest possible solution
by providing the user with an input inventory of symbols to
facilitate user specification of the locations of missing
pseudopauses (the [ ) ] symbol), unmarked compound words
(spell as "rocking-chair"), and emphasis (precede the
emphasized word by an emphasis symbol ["]).
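The three user annotations just described can be collected into a short illustration. The exact symbol placement within each string is a sketch based on the description above, not verified DECtalk input syntax.

```python
# Illustrative DECtalk-style annotated inputs using the three mechanisms
# described in the text (placement of symbols is an assumption):
examples = [
    'She hit the old man ) with the umbrella.',  # [ ) ] marks a missed pseudopause
    'He sat in a rocking-chair.',                # hyphen marks an unmarked compound
    'The "old man sat in a rocker.',             # ["] precedes the emphasized word
]
```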
It is possible to think of applications where the computer is not
simply attempting to speak ASCII text, but may know a great deal
about the meaning of the message, perhaps having formulated the
text from a deep-structure semantic representation in, e.g., a data
base information retrieval application (Young and Fallside, 1979).
In such cases, one would want to take advantage of the ability to
mark for emphasis important words when forming the input to the
text-to-speech system. Hirschberg and Pierrehumbert (1986) provide
an excellent review of the factors influencing the intonational
structuring of discourse.
In the future, systems that have available parts-of-speech
information from a large morpheme lexicon can be expected to
develop better syntactic analysis routines that are particularly
suited to the problems of text synthesis. Perhaps computer science
efforts to produce expert systems will