NMAH | Smithsonian Speech Synthesis History Project (dk

evoked or the detailed acoustic effects of the transformation. For example, only the fronting/backing anticipation of the interacting vowels in a VCV sequence described by Öhman has been implemented as a change to the F2 trajectory. The list of discrete allophones inserted and manipulated by internal Klattalk rules, shown in Table III, is rather small. All other allophonic variation is created by modifying synthesizer control parameter data directly rather than by defining a discrete symbol.

In summary, it is likely that this area of allophonic detail and prosodic specification is one of the weaker aspects of rule systems, and contributes significantly to the perception of unnaturalness attributed to synthetic speech. Incremental improvements that are made to these rules on the basis of comparisons between rule output and natural speech cannot help but lead to improved performance of text-to-speech systems.

II. TEXT-TO-PHONEMES CONVERSION

Having considered the steps required to go from an abstract linguistic description to synthetic speech, we now turn to the problem of deriving this description from text. The recognition of printed characters, as required in, e.g., a reading machine for the blind, is beyond the scope of this review. We will assume that an ASCII representation of each input sentence is available as input to the text analysis module of a text-to-speech system. From considerations outlined in the previous section, it is clear that the text analysis routines have a formidable task. Ideally, the input is to be analyzed in such a way as to:

reformat everything encountered (e.g., digits, abbreviations) into words and punctuation,
parse the sentence to establish the surface syntactic structure,
find the semantically determined locations of contrastive and emphatic stress,
derive a phonemic representation for each word,
assign a (lexical) stress pattern to each word.

For example, the ASCII string for a typical input sentence, shown below, was processed by rules of Klattalk to derive an abstract linguistic representation consisting of phonemes, stress, and syntactic symbols. First, the word-formatting module transformed the numeral "23" into the words "twenty-three."

INPUT TEXT:
	The 23 protesters were arrested.
REFORMATTED INTO WORDS:
	The twenty-three protesters were arrested.
(PARTIAL) SYNTACTIC ANALYSIS:
	The twenty-three protesters ) were arrested.
SEMANTIC ANALYSIS:
	None.
(PARTIAL) MORPHEMIC ANALYSIS:
	The twenty-three protest-er-s ) were arrest-ed.
PHONEMIC CONVERSION AND LEXICAL STRESS ASSIGNMENT:
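
The word-formatting step of the example above (digits rewritten as words) can be illustrated with a short sketch. This is not Klattalk's actual procedure, which the text does not describe; the function names `number_to_words` and `reformat_digits`, the limit of three digits, and the hyphenation style are assumptions made only for illustration.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999 (hypothetical helper)."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def reformat_digits(text: str) -> str:
    """Replace each stand-alone run of one to three digits with words.

    Longer digit strings are left untouched in this sketch.
    """
    return re.sub(r"\b\d{1,3}\b",
                  lambda m: number_to_words(int(m.group())),
                  text)


print(reformat_digits("The 23 protesters were arrested."))
```

A complete word-formatting module would also need rules for abbreviations, dates, decimals, and larger numbers, none of which are attempted here.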

A crude syntactic analysis of the sentence is then performed based on locations of any orthographic commas, as well as the syntactic role of function words and verbs that are detected during the dictionary matching process. In the sample text just above, the verb "were" is detected and marked as the beginning of the verb phrase through use of the [ ) ] symbol. The end of a declarative sentence is indicated by the period symbol. The most important aspects of syntactic structure are the locations of clause boundaries, and the location of the boundary between the noun phrase and the verb phrase, although there are other syntactic factors that affect the rhythm and intonation of longer sentences. Liberal use of commas in text would help a great deal in formulating natural phrasing; their presence is generally a reliable cue, but unfortunately their absence does not indicate the absence of an intonational phrase boundary.
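
As a rough illustration of this kind of crude analysis, the sketch below inserts the [ ) ] symbol before the first verb it recognizes from a small word list. The list `KNOWN_VERBS` and the function `mark_verb_phrase` are hypothetical; the actual Klattalk rules also use commas and other function-word cues that are not modeled here.

```python
# Hypothetical, deliberately tiny list of verbs that a dictionary
# match might flag as starting a verb phrase.
KNOWN_VERBS = {"is", "are", "was", "were", "has", "have", "had",
               "will", "would", "can", "could"}


def mark_verb_phrase(sentence: str) -> str:
    """Insert ')' before the first recognized verb that is not sentence-initial."""
    words = sentence.split()
    for i, word in enumerate(words):
        # Strip trailing punctuation before comparing with the verb list.
        if i > 0 and word.lower().strip(".,;:!?") in KNOWN_VERBS:
            return " ".join(words[:i] + [")"] + words[i:])
    return sentence  # no verb recognized: leave the sentence unmarked


print(mark_verb_phrase("The twenty-three protesters were arrested."))
```

Only the first detected verb is marked, which matches the single noun phrase/verb phrase boundary in the sample sentence but would be too simple for multi-clause input.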

There is no semantic analysis in Klattalk or any other present-day text-to-speech system. Every sentence is spoken in a sort of semantically "neutral" way, i.e., without emphatic or contrastive stress, unless the user indicates an important word by placing the phonemic [ " ] symbol before it in the orthography.

Next, a phonemic representation is obtained for the words in the manner shown in Fig. 30. Each word is compared with entries in a small pronunciation dictionary. If no match is found, the word is broken into smaller pieces (morphemes) by attempting to remove common suffixes such as "-ed," "-ing," etc. It may be necessary to add a silent "e" or to reconstitute the "y" in order to recover the true form of the root, as in "biting = bite + ing." Then the remaining root is again compared with entries in the phonemic dictionary. If there is still no match, a set of letter-to-phoneme rules is invoked to predict the pronunciation. In this sentence, two words had affixes removed, five words/roots were found in the dictionary, and the remaining one was processed by letter-to-sound rules. No errors were made. The morpheme "protest" was found to have two alternative pronunciations in the dictionary, one with primary stress on the first syllable and the other with primary stress on the second syllable, but a selectional restriction associated with the "-er" suffix caused correct selection of the noun form.
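
The lookup cascade just described (whole-word match, suffix stripping with root repair, then fallback to rules) can be sketched as follows. The suffix list, the toy root dictionary, and the names `candidate_roots` and `analyze_word` are assumptions for illustration; the sketch reports only which stage resolved the word and does not produce phonemes or apply selectional restrictions.

```python
from typing import Iterator, List, Set, Tuple

SUFFIXES = ["ing", "ed", "er", "s"]  # hypothetical, far from complete


def candidate_roots(stem: str) -> Iterator[str]:
    """Yield possible root spellings for a stem left after suffix removal."""
    yield stem
    yield stem + "e"               # restore silent e: "bit" -> "bite"
    if stem.endswith("i"):
        yield stem[:-1] + "y"      # reconstitute y: "tri" -> "try"


def analyze_word(word: str, dictionary: Set[str]) -> Tuple[str, List[str], str]:
    """Return (root, suffixes, stage) where stage names the step that succeeded."""
    word = word.lower()
    if word in dictionary:
        return word, [], "dictionary"
    stem, suffixes = word, []
    while True:
        for suffix in SUFFIXES:
            if stem.endswith(suffix) and len(stem) > len(suffix) + 1:
                stripped = stem[:-len(suffix)]
                break
        else:
            break                  # no further suffix can be removed
        suffixes.insert(0, suffix)
        for root in candidate_roots(stripped):
            if root in dictionary:
                return root, suffixes, "dictionary"
        stem = stripped
    # Nothing matched: the whole word goes to letter-to-sound rules.
    return word, [], "letter-to-sound rules"


# Toy dictionary of roots, invented for this example.
toy_dictionary = {"the", "were", "protest", "arrest", "bite", "try"}
for w in ["protesters", "arrested", "biting", "twenty"]:
    print(w, analyze_word(w, toy_dictionary))
```

With this toy dictionary, "protesters" resolves to the root "protest" with suffixes "-er" and "-s", and "biting" resolves to "bite" through silent-e restoration, while "twenty" falls through to the rule stage simply because it was left out of the invented dictionary.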
