
 KLATT 1987, p. 778 
 

paragraphs. Results of answering multiple-choice questions about the content of the paragraphs are shown in Table XI. The text-to-speech systems performed about equally well, suggesting that the test is not sensitive enough to compare systems, and that the limit on performance is the memory capacity of these college students rather than the difficulty of comprehending synthetic speech. Pisoni also observed that subjects typically got better on the second half of the test when listening to synthetic speech, even though there was no feedback of correct answers.[19] On the second half, listening subjects performed about as well as the readers.

One might conclude that current text-to-speech systems produce quite satisfactory speech since there is no measurable decrement in listening comprehension after a familiarization period. Thus synthetic speech should be a viable method of presenting information over an auditory channel in most applications. Such conclusions are perhaps premature because (1) similar experiments have not been performed over the telephone, or with less-educated subjects, and (2) multiple-choice tests and recall measures may not be sensitive enough to reveal differences in perceptual processing between natural and synthetic speech. Pisoni (1982) used a reaction-time experiment to show that listeners do indeed devote somewhat more time to speech perception when exposed to synthetic speech as compared with natural speech, and Manous et al. (1985) measured a decrement in accuracy and speed of response for text-to-speech systems versus natural speech using a more sensitive comprehension test in which listeners had to immediately respond "true" or "false" to each sentence they heard. The capacity of short-term memory for earlier items in a list can also be reduced when listening to synthetic speech (Luce et al., 1983).

In summary, studies have shown that there is a wide range of performance between text-to-speech systems in terms of segmental intelligibility. Measured in terms of error rate, a system with a 3% error rate is twice as good as one with a 6% error rate, at least in terms of the average time interval between misperceptions in running text. Language is sufficiently redundant that these differences in segmental intelligibility often appear to be slight, but this is not the case when listening to unfamiliar names or difficult material. Furthermore, errors are usually the result of deviations of synthesizer parameters from values seen in natural speech. To the extent that error rate reflects a tendency for misspecification of parameters in general, it is also an indicator of how unnatural the speech is likely to sound.
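The "twice as good" comparison can be checked with a small calculation. The sketch below is only an illustration, not a model taken from the studies cited: it assumes misperceptions occur independently at a fixed rate per item, so the mean spacing between errors is the reciprocal of the error rate. The function names and the presentation rate of 12 items per second are hypothetical placeholders.

```python
def mean_items_between_errors(error_rate: float) -> float:
    """Expected number of items from one misperception to the next.

    Assumes each item is misperceived independently with probability
    error_rate (an illustrative simplification).
    """
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error_rate must be in (0, 1]")
    return 1.0 / error_rate


def mean_seconds_between_errors(error_rate: float, items_per_second: float) -> float:
    """Convert the mean spacing in items to a mean time interval."""
    if items_per_second <= 0.0:
        raise ValueError("items_per_second must be positive")
    return mean_items_between_errors(error_rate) / items_per_second


# Hypothetical presentation rate, chosen only to give the numbers a scale.
ASSUMED_ITEMS_PER_SECOND = 12.0

interval_3_percent = mean_seconds_between_errors(0.03, ASSUMED_ITEMS_PER_SECOND)
interval_6_percent = mean_seconds_between_errors(0.06, ASSUMED_ITEMS_PER_SECOND)

# The ratio does not depend on the assumed presentation rate.
interval_ratio = interval_3_percent / interval_6_percent

print(f"3% error rate: {interval_3_percent:.2f} s between misperceptions")
print(f"6% error rate: {interval_6_percent:.2f} s between misperceptions")
print(f"ratio: {interval_ratio:.2f}")
```

Under these assumptions the ratio of the two intervals is 2 whatever presentation rate is chosen, which is the sense in which halving the error rate doubles the average time between misperceptions.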

D. Naturalness

Naturalness is a multi-dimensional subjective attribute that is not easy to quantify. Any of a large number of possible deficiencies can cause synthetic speech to sound unnatural to varying degrees. Fortunately, systems can be compared for relative subjective naturalness with a high degree of inter-subject and test-retest agreement (IEEE, 1969; Munson and Karlin, 1962). A standard procedure is to play pairs of test sentences synthesized by each system to be compared, and obtain judgments of preference (Logan and Pisoni, 1986). As long as the sentences being compared are the same, and the sentences are played without a long wait in between, valid data can be obtained. It is more difficult to compare systems that have been heard on different days or with different synthetic materials since extraneous factors can add an unpredictable amount of "noise" into listener preference judgment data (Nusbaum et al., 1984b).
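One way to summarize paired-preference data of this kind is a two-sided sign test on the counts of judgments favoring each system. The sketch below is a generic illustration and not the analysis used in the studies cited above; the function name and the example counts are hypothetical, and trials with no stated preference are assumed to have been discarded beforehand.

```python
from math import comb


def sign_test_p_value(prefer_a: int, prefer_b: int) -> float:
    """Two-sided exact sign test for paired preference counts.

    Null hypothesis: on each trial a listener is equally likely to
    prefer system A or system B.
    """
    if prefer_a < 0 or prefer_b < 0:
        raise ValueError("counts must be non-negative")
    trials = prefer_a + prefer_b
    if trials == 0:
        raise ValueError("at least one preference judgment is required")
    larger = max(prefer_a, prefer_b)
    # Probability of a split at least this lopsided in one direction.
    one_sided = sum(comb(trials, i) for i in range(larger, trials + 1)) / 2**trials
    # Double for the two-sided test, capping at 1.
    return min(1.0, 2.0 * one_sided)


# Hypothetical counts: 32 judgments favor system A, 8 favor system B.
p_value = sign_test_p_value(32, 8)
print(f"two-sided p-value: {p_value:.6f}")
```

A small p-value indicates that the observed split would be unlikely if listeners had no real preference. It says nothing about the extraneous session-to-session factors mentioned above, which is why the same sentences should be compared within a single session.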

Naturalness should not be confused with intelligibility. Some of the low bit rate linear-prediction systems sound like slightly distorted recordings of natural speech (which is what they are), and so are judged fairly natural, but they test out to have rather poor intelligibility scores (Nixon et al., 1985). On the other hand, intelligibility and naturalness ratings of text-to-speech systems appear to be fairly highly correlated.

E. Suitability for a particular application

Text-to-speech devices are being introduced in a wide range of applications. A sampling of commercial uses appears in Table XII. Noncommercial applications are described in Sec. V. These devices are not good enough to fully replace a human, but they are likely to be well received by the general public if they are part of an application that offers a new service, or provides direct access to information stored on a computer, or permits easier or cheaper access to a present service because more telephone lines can be handled at a given cost. Both intelligibility and naturalness are considered important factors to the success of any application, but it is interesting to note that one large commercial concern is planning an application that will use DECtalk set up to speak in a monotone, purposely trying to indicate to the customer that he/she is talking to a smart computer rather than to a poor imitation of a human. What is important at this early stage in the exposure of the public to synthetic speech is to avoid applications that might lead to user frustration and generate negative attitudes toward all devices that "talk like a computer." For example, intelligibility over the telephone
 

Smithsonian Speech Synthesis History Project