
 KLATT 1987, p. 776 

cognitive load, see Table VI. There are only a few studies that have attempted quantitative evaluations of text-to-speech systems to date; much of the data on the capabilities and limitations of the current technology comes from work performed at Indiana University by David Pisoni and his colleagues (Pisoni et al., 1985).

A. Intelligibility of isolated words

The measurement of intelligibility can be performed in many different ways. Since consonants have been more difficult to synthesize than vowels, the modified rhyme test (House et al., 1965) is often used, in which the listener selects among six familiar words that differ only by an initial consonant or a final consonant. This is not a very severe test of system performance since the response alternatives may exclude a confusion that would be made if a blank answer sheet were used, but the test does facilitate rapid presentation to naive subjects and automatic scoring of answer sheets.[15] If possible, an open response, including perhaps a rating of goodness of each item, should be used with such a test in order to better determine systematic error patterns and deficiencies, especially if there are relatively few errors.
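The automatic scoring of closed-response answer sheets mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the trial layout, and the word set and listener responses are made up for the example and are not taken from House et al. (1965) or from any study cited here.

```python
from typing import Iterable, Tuple

# One trial: (target word, the six printed alternatives, listener's choice).
Trial = Tuple[str, Tuple[str, ...], str]


def score_closed_response(trials: Iterable[Trial]) -> float:
    """Return percent correct over a set of closed-response trials."""
    total = 0
    correct = 0
    for target, alternatives, response in trials:
        # In a closed-response test both the target and the answer
        # must be among the printed alternatives.
        if target not in alternatives:
            raise ValueError(f"target {target!r} is not among the alternatives")
        if response not in alternatives:
            raise ValueError(f"response {response!r} is not among the alternatives")
        total += 1
        if response == target:
            correct += 1
    if total == 0:
        raise ValueError("no trials to score")
    return 100.0 * correct / total


# Hypothetical word set differing only in the final consonant.
ALTERNATIVES = ("bat", "bad", "back", "bass", "ban", "bath")

# Hypothetical listener data: three correct choices and one confusion.
example_trials = [
    ("bat", ALTERNATIVES, "bat"),
    ("ban", ALTERNATIVES, "ban"),
    ("bath", ALTERNATIVES, "bass"),
    ("back", ALTERNATIVES, "back"),
]

pct = score_closed_response(example_trials)
print(pct)  # 75.0
```

Because the listener can only pick one of the six printed words, a confusion with any word outside the set can never show up in this score, which is the limitation of the closed-response format noted in the text.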

Logan et al. (1986) evaluated the intelligibility of eight text-to-speech systems by presenting listeners with a recording of the modified rhyme test words. The results are summarized in Table VII. Also included are comparable data obtained earlier with the Haskins text-to-speech system (Cooper et al., 1984). Systems are rank-ordered according to performance. When percent correct is fairly high, a good way to compare systems is to use percent error (simply 100 minus percent correct) because relative changes in percent error better reflect the difficulty of comprehension and the difficulty of making improvements. The average spacing between perceptual errors in running text is approximated by the reciprocal of the percent error values given in the table. Looked at in this way, the expected spacing of perceptual errors for DECtalk is about (100% / 3%), or one segmental misperception about every 33 syllables of text. The error rate for the Prose-2000 is about twice that of DECtalk, while it appears that Type-n-Talk is seriously flawed (see also Cochran, 1986).
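The percent-error arithmetic in this paragraph can be written out as a short sketch. The function names are hypothetical, and the 97% correct figure is simply the value implied by the roughly 3% error quoted for DECtalk; the 94% figure is an illustrative stand-in for "about twice the error", not a value read from Table VII.

```python
def percent_error(percent_correct: float) -> float:
    """Percent error is simply 100 minus percent correct."""
    if not 0.0 <= percent_correct <= 100.0:
        raise ValueError("percent_correct must lie in [0, 100]")
    return 100.0 - percent_correct


def syllables_per_error(percent_correct: float) -> float:
    """Approximate number of syllables between segmental misperceptions,
    taken as the reciprocal of the error rate (100% / percent error)."""
    err = percent_error(percent_correct)
    if err == 0.0:
        return float("inf")  # no errors observed, so no finite spacing
    return 100.0 / err


# About 3% error, as quoted in the text for DECtalk.
dectalk_interval = syllables_per_error(97.0)
print(round(dectalk_interval))  # 33

# Illustrative only: doubling the error rate halves the spacing.
doubled_interval = syllables_per_error(94.0)
print(round(doubled_interval, 1))  # 16.7
```

Comparing 97% with 94% correct looks like a small difference, but in terms of percent error it is a factor of two, which is why percent error is the more informative scale when scores are high.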

When Logan et al. (1986) ran the same vocabulary used in the modified rhyme test, but with open response, the error rate went up quite a bit -- typically 3 to 4 times the closed-response error rate -- but the relative rankings of systems did not change. Open response, however, had the advantage that systematic error tendencies could be detected and (hopefully) corrected. For example, DECtalk 1.8 had a problem with nasals adjacent to high front vowels -- a problem that was then corrected in DECtalk 3.0. The test used is perhaps not ideal for detection of all likely consonantal confusions because the words are not particularly well balanced phonetically, and there are no consonant clusters or unstressed syllables. Other word lists address some of these deficiencies (Lehiste and Peterson, 1959; Nusbaum et al., 1984a), but there is a clear need for better diagnostic instruments in the evaluation of text-to-speech systems.
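As a rough illustration of the "3 to 4 times" relationship described here, the sketch below turns a closed-response error rate into a plausible open-response range. The function name and the default multipliers are assumptions drawn only from the wording of this paragraph; the result is a rule-of-thumb range, not a prediction for any particular system.

```python
from typing import Tuple


def open_response_error_range(
    closed_error_pct: float,
    low_factor: float = 3.0,
    high_factor: float = 4.0,
) -> Tuple[float, float]:
    """Scale a closed-response percent error by the 'typically 3 to 4 times'
    factor quoted in the text, capping each bound at 100%."""
    if not 0.0 <= closed_error_pct <= 100.0:
        raise ValueError("closed_error_pct must lie in [0, 100]")
    if low_factor > high_factor:
        raise ValueError("low_factor must not exceed high_factor")
    low = min(100.0, low_factor * closed_error_pct)
    high = min(100.0, high_factor * closed_error_pct)
    return low, high


# A closed-response error of about 3% (the DECtalk figure quoted earlier).
open_range = open_response_error_range(3.0)
print(open_range)  # (9.0, 12.0)
```

Since the multiplier is roughly the same across systems, applying it leaves the rank order unchanged, which is consistent with the observation that the relative rankings did not change under open response.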

The intelligibility of several linear-prediction-based systems has been studied by Pols and Olive (1983). They presented consonant-vowel-consonant (CVC) nonsense syllables to high school students after a brief introduction to phonemic representations. The syllables were either (1) natural speech digitized at 10 000 12-bit samples/s, (2) 10-pole linear-prediction coded versions of these syllables, or (3) syllables synthesized using the Olive (1977) LP diphone concatenation scheme. The results are shown in Table VIII. This is a very difficult task for naive, unpracticed subjects, as indicated by the relatively low 93% phoneme recognition performance for natural speech.[16] Two points of interest are that (1) linear prediction coded speech can suffer a serious reduction in intelligibility, even when there is no effort to
 

Smithsonian Speech Synthesis History Project
National Museum of American History | Archives Center