NMAH | Smithsonian Speech Synthesis History Project (dk

A part of the phonemic conversion process concerns the derivation of a stress pattern for the syllables of a word. Stress must be predicted if the word is not in the system vocabulary, or if the orthographic word is broken down into root plus affixes and an affix changes the stress pattern given for the root. The stress level of a syllable will be indicated by inserting a stress symbol just prior to the vowel in the phonemic representation. Absence of a stress symbol means that the syllable is unstressed.
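As a minimal sketch of the notation just described, the following Python fragment inserts a stress symbol immediately before the vowel of each stressed syllable and leaves unstressed syllables unmarked. The vowel inventory, the ASCII stress marks (`'` and `` ` ``), the numeric stress levels, and the function name `mark_stress` are all assumptions made for illustration; they are not taken from the system discussed here.

```python
# Hypothetical inventory: a few ARPAbet-style vowel symbols (assumption).
VOWELS = {"AA", "AE", "AH", "AX", "EH", "IH", "IY", "OW", "UW"}

# Hypothetical stress marks: 0 = unstressed (no mark), 1 = primary, 2 = secondary.
STRESS_MARK = {0: "", 1: "'", 2: "`"}


def mark_stress(syllables, levels):
    """Insert a stress symbol just before the vowel of each syllable.

    syllables: list of syllables, each a list of phoneme symbols
               (assumed to contain exactly one vowel each).
    levels:    one stress level per syllable.
    """
    if len(syllables) != len(levels):
        raise ValueError("need one stress level per syllable")
    out = []
    for phones, level in zip(syllables, levels):
        mark = STRESS_MARK[level]
        for phone in phones:
            # Only stressed syllables receive a symbol; absence means unstressed.
            if phone in VOWELS and mark:
                out.append(mark)
            out.append(phone)
    return out


# "banana" with primary stress on the second syllable (illustrative transcription).
marked = mark_stress([["B", "AX"], ["N", "AE"], ["N", "AX"]], [0, 1, 0])
print(" ".join(marked))
```

Keeping the mark adjacent to the vowel means that a later rule can find the stressed nucleus without re-deriving syllable boundaries.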

A. Text formatting

A practical text-to-speech system has to be prepared to encounter words containing nonalphabetic characters, digit strings and unpronounceable ASCII characters. MITalk was one of the first systems to include algorithms for handling special cases such as how to speak digits in different formats (Allen et al., 1979), e.g., "$35.61, 35.61, 2000, the year 1971, 10:15 p.m." This system also expanded many common abbreviations into full word equivalents. Commercial systems, which must be prepared to deal with more exotic material such as embedded escape sequences and other nonalphabetic characters, have adopted two general strategies. The Infovox SA-101 and the Prose-2000 provide the user with a set of logical switches which determine what to do with certain types of nonalphabetic strings. For example, " - " is translated to either "dash" or "minus" depending on the state of a switch. DECtalk, on the other hand, ignores escape characters, and usually spells out words containing nonalphabetic characters. The reasoning is that it is impossible to do the right thing in general, and the correct option for a particular application should be determined by a host computer. Even a simple strategy, such as interpreting a tab as an indicator of a new paragraph that should begin with a higher fundamental frequency, is not a safe assumption in arbitrary text; DECtalk therefore requires that a host computer insert a special "new paragraph" symbol in the text instead whenever tabs can be interpreted as new paragraphs. O'Malley et al. (1986) point out that many abbreviations are ambiguous, but can be disambiguated in particular applications. For example "N." is spoken as a letter in a name, as "North" in a street address, and as "New" in a state abbreviation, but these are easy fields to distinguish in a properly structured data base.
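The two strategies above, a user-settable switch and field-dependent expansion, can be sketched roughly as follows. This is an illustrative Python fragment only: the field names (`name`, `street`, `state`), the `dash_as_minus` switch, and the function names are hypothetical and do not reproduce the interface of any of the systems mentioned.

```python
# Hypothetical field names; a real application would use its own data-base schema.
N_EXPANSIONS = {
    "name": "N",        # spoken as a letter in a personal name
    "street": "North",  # street address
    "state": "New",     # state abbreviation
}


def expand_abbreviation(token, field):
    """Expand "N." according to the field it came from; pass other tokens through."""
    if token == "N.":
        return N_EXPANSIONS.get(field, token)
    return token


def normalize(tokens, field, dash_as_minus=False):
    """Map tokens to speakable words.

    dash_as_minus models a logical switch that selects how an isolated
    hyphen is spoken ("dash" or "minus").
    """
    words = []
    for token in tokens:
        if token == "-":
            words.append("minus" if dash_as_minus else "dash")
        else:
            words.append(expand_abbreviation(token, field))
    return words


print(normalize(["N.", "Main", "Street"], field="street"))
print(normalize(["5", "-", "3"], field="name", dash_as_minus=True))
```

The point of the sketch is that the ambiguity is resolved by information the host application already has (the field, the switch setting), not by anything the synthesizer could infer from the characters alone.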

B. Letter-to-phoneme conversion

One issue in the preparation of rules and data structures for synthesis is how to best represent phonemes, allophones, stress, and syntactic symbols. Dictionaries generally do not agree on a standard representation, although the International Phonetic Association publishes one standard, and The Journal of the Acoustical Society of America employs a similar standard set of phonemic symbols that are used here in the examples. However, computers often require a representation that can be printed within the limitations of the ASCII character set. There is no agreement on either the set of phonetic symbols to be represented or the phonetic/alphabetic correspondences in this situation. The problem does not really require solution until such time as researchers wish to share data bases consisting of dictionaries or rules, and even then the most important issue is clear definition since computers are very good at symbol translation if they know what each symbol is intended to mean.
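To illustrate why clear definition matters more than agreement on a single notation, here is a minimal Python sketch of table-driven symbol translation. The table is a small, illustrative subset of ARPAbet-style ASCII symbols with their usual IPA equivalents; it is an assumption for this example and not the symbol set of any particular system. The names `ASCII_TO_IPA` and `translate` are likewise hypothetical.

```python
# Illustrative subset only: ARPAbet-style ASCII symbols and usual IPA equivalents.
ASCII_TO_IPA = {
    "IY": "i",
    "AE": "æ",
    "AA": "ɑ",
    "AX": "ə",
    "K": "k",
    "T": "t",
    "S": "s",
}


def translate(symbols, table):
    """Translate a sequence of symbols, refusing any symbol that is not defined."""
    unknown = [s for s in symbols if s not in table]
    if unknown:
        # Translation is only safe when every symbol has a stated meaning.
        raise KeyError(f"undefined symbols: {unknown}")
    return [table[s] for s in symbols]


print("".join(translate(["K", "AE", "T"], ASCII_TO_IPA)))
```

Once each symbol has a stated meaning, converting a shared dictionary from one notation to another is a mechanical lookup; the failure case is an undefined symbol, which the sketch reports instead of guessing.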

In my research, I have found it convenient to work with two different computer representations. One is case insensitive (upper case and lower case letters are equivalent) and requires two letters to represent vowels and some consonants. It is easy to type and easy to learn, so it is the way that words are input to Klattalk in phonemic form. The representation is nearly identical to the ARPAbet (Shoup, 1980). The second representation consists of a single ASCII character per phonetic symbol and so is an efficient way to store dictionaries and compare strings. Both representations can be parsed without the need for spaces between phonetic elements -- in fact, "space" is the symbol used to indicate a word boundary. The two-character representation is defined and explained in Conroy et al. (1986, pp. 79-97), while the one-character set is described in Minow and Klatt (1983, Chap. 4). They are reprinted in Tables IV and V. The following are the only somewhat nonstandard symbols allowed in the abstract representation for a sentence: (1) there are two variants of schwa, /ə/ and /ɨ/, although the one to be used in any context is largely determined by the adjacent phonetic segments, (2) there is a separate symbol /yu/ for the more usual /y/ + /u/ because the fronting of /u/ in this environment would otherwise have to be done by a special rule, (3) there is a mapping of stressed /ɝ/ and unstressed /ɚ/ onto a single symbol, since this will cause no confusion and will make possible a slight saving in table space, (4) a silence phoneme is defined which is inserted by rule at
