NMAH | Smithsonian Speech Synthesis History Project (im

might be criticized because the set of phonetic features, on which their much more principled account of phonological capacity depends, as yet lacks a fully satisfactory and explicit basis in phonetic capacity (Abramson and Lisker 1970). Obviously, it is very desirable to state clearly, when a certain component is being investigated, how this component is assumed to depend on other components.

Third, the ultimate check of a hypothesis concerning any or all of the components is of course the intuition of the native speaker (Chomsky 1965: 21). However, the only reliable way to consult his intuition is to present him with speech which we have made sure conforms to our current phonetic or phonological hypothesis and find out whether he considers it well-formed. To do this, however, we need carefully controlled speech stimuli (Lisker et al. 1962; Mattingly 1971).

Synthesis by rule is a technique which seems to meet these requirements. With the computer we can simulate our phonological and phonetic formulations rigorously; errors of form and logic come to light all too quickly. We are compelled to be explicit about the assumptions we make about other components; if they are simplistic or inadequate we will not be allowed to forget the fact. And we can check the native speaker's intuition directly by producing controlled synthetic speech.

Let us briefly consider what an ideal speech synthesis by rule system would be like. It would, in the first place, simulate all the components we have just discussed. Phonetic capacity would be represented by a synthesizer and computer programs controlling it which are capable of generating just those sounds which can be distinguished in production and perception by the speaker-hearer; phonological competence, by the rules of some language, stated in a form which would be an acceptable input to the system; phonological capacity, by a part of the computer program itself, which would impose severe limitations on the form or substance of the rules; and phonetic skill, by an additional set of rules specific to some particular speaker. The combined effect of all components should be such as to restrict the possible utterances to just those which are well-formed speech in a particular language (assuming appropriate syntactic and semantic constraints) from one particular speaker.
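The division of labor among these four components can be illustrated with a deliberately tiny sketch. It is not a description of any actual synthesis by rule system. The feature set, the single first-formant parameter, its range, and the names `Rule`, `check_rule` and `synthesize` are all hypothetical, chosen only to show where each component would sit: a fixed check on the form of rules stands in for phonological capacity, the language rules for phonological competence, the speaker rules for phonetic skill, and the clamped output range for phonetic capacity.

```python
from dataclasses import dataclass

# Every name and number here is a hypothetical illustration, not taken from the source.
ALLOWED_FEATURES = {"voiced", "nasal", "high"}  # stand-in for phonological capacity
F1_RANGE = (200.0, 1000.0)                      # stand-in for phonetic capacity, in Hz


@dataclass(frozen=True)
class Rule:
    feature: str     # phonetic feature the rule responds to
    f1_shift: float  # change applied to the first-formant target, in Hz


def check_rule(rule: Rule) -> Rule:
    """Phonological capacity: reject rules whose form the system does not allow."""
    if rule.feature not in ALLOWED_FEATURES:
        raise ValueError(f"unknown feature: {rule.feature}")
    return rule


def synthesize(features, language_rules, speaker_rules, base_f1=500.0):
    """Apply language rules, then speaker rules, then clamp to the synthesizer range."""
    f1 = base_f1
    for rule in list(language_rules) + list(speaker_rules):
        check_rule(rule)               # every rule must have an acceptable form
        if rule.feature in features:   # a rule applies only if its feature is present
            f1 += rule.f1_shift
    low, high = F1_RANGE
    return min(max(f1, low), high)     # the synthesizer cannot leave its range


# Hypothetical usage: one language rule and one speaker-specific rule.
language_rules = [Rule("high", -200.0)]
speaker_rules = [Rule("nasal", 50.0)]
f1_target = synthesize({"high", "nasal"}, language_rules, speaker_rules)
```

In this toy arrangement a rule that mentions a feature outside the allowed set is rejected before anything is synthesized, which is the sense in which the program itself, rather than the rule writer, limits the form of the rules.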

For each component, moreover, we would want to include all those aspects, and only those, which are relevant to the capacity and competence underlying the production and perception of speech. Suppose, for instance (contrary to our present expectations) that, from a psychological standpoint, speech production proved to be only a matter of transmitting certain cues definable in acoustic terms and invariantly related to phonetic features, and that speech perception consisted simply in detecting these cues. Our 'neural vocal tract simulation' could then be just a terminal analog synthesizer. There would then be no reason for including neuromotor commands, gestures or shape change in a parsimonious synthesis by rule system, because these matters would be irrelevant to phonetic capacity. They might continue to be of great interest from the standpoint of the physiologist and acoustician interested in speech, but would have no claim on the linguist's attention.
