of the speech synthesis research of the past thirty years has been
prompted by interest in vocoding (i.e. voice coding). The channel
capacity (equivalently, the bandwidth in the radio spectrum) required
for transmission of speech is many times greater than it ought to
be, considering the amount of information, in Shannon's sense, which
is carried by the speech signal. Since the channel capacity available
for radio and cable communications is limited, many schemes have
been devised to 'compress' speech by analyzing the speech wave and
transmitting only the information needed to synthesize an intelligible
version at the receiving end. For example, in Dudley's (1939) original
Vocoder, built at Bell Telephone Laboratories, the spectrum of
telephone speech (250-3000 Hz) is analyzed by a bank of 10 filters.
The smoothed, rectified output of each filter represents the energy
in a certain part of the spectrum as a function of time. Another
circuit tracks Fo, the fundamental frequency (for voiceless excitation,
the output of this circuit is zero). The vocoder transmits the outputs
of the Fo tracker and of the filters. Since these functions vary
relatively slowly, the channel capacity needed for all 11 functions
is far less than the unprocessed speech signal would require. To
synthesize the speech, the frequency of a buzz source is varied
according to the Fo function (a hiss source is used when this function
has zero value). The buzz or hiss excites each of a set of filters
matching those used in the analysis, and the amplitude of the output
from each synthesizing filter is determined by the function for the
corresponding analyzing filter. Summing the outputs of the synthesis
filters yields an intelligible version of the original speech.
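As a rough illustration of the analysis and synthesis steps just described, the following Python sketch implements a filter-bank vocoder in software. It is not a reconstruction of Dudley's circuitry: the equal-width band edges, the biquad band-pass design, the 25 Hz smoothing cutoff, the hiss level, and all function names (`band_edges`, `analyze`, `synthesize`, and so on) are assumptions made for the example, and the Fo contour is taken as given rather than tracked.

```python
import math
import random


def band_edges(lo=250.0, hi=3000.0, n_bands=10):
    """Equal-width band edges (assumed; the text gives only range and count)."""
    step = (hi - lo) / n_bands
    return [lo + i * step for i in range(n_bands + 1)]


def bandpass(x, f_lo, f_hi, fs):
    """Second-order band-pass biquad with unity gain at the centre frequency."""
    f_c = math.sqrt(f_lo * f_hi)              # geometric centre of the band
    q = f_c / (f_hi - f_lo)
    w0 = 2.0 * math.pi * f_c / fs
    alpha = math.sin(w0) / (2.0 * q)
    b0, b2 = alpha, -alpha                    # b1 is zero for this design
    a0, a1, a2 = 1.0 + alpha, -2.0 * math.cos(w0), 1.0 - alpha
    x1 = x2 = y1 = y2 = 0.0
    y = []
    for s in x:
        out = (b0 * s + b2 * x2 - a1 * y1 - a2 * y2) / a0
        x2, x1 = x1, s
        y2, y1 = y1, out
        y.append(out)
    return y


def envelope(x, fs, cutoff=25.0):
    """Rectify, then smooth with a one-pole low-pass (cutoff is an assumption)."""
    a = math.exp(-2.0 * math.pi * cutoff / fs)
    state, env = 0.0, []
    for s in x:
        state = a * state + (1.0 - a) * abs(s)
        env.append(state)
    return env


def analyze(signal, fs, edges):
    """Return one slowly varying energy function per analysis filter."""
    return [envelope(bandpass(signal, lo, hi, fs), fs)
            for lo, hi in zip(edges, edges[1:])]


def excitation(f0, fs, seed=0):
    """Buzz (pulse train at Fo) where Fo > 0, hiss (noise) where Fo == 0."""
    rng = random.Random(seed)
    phase, out = 0.0, []
    for f in f0:
        if f > 0:
            phase += f / fs
            if phase >= 1.0:
                phase -= 1.0
                out.append(1.0)
            else:
                out.append(0.0)
        else:
            out.append(0.1 * rng.uniform(-1.0, 1.0))  # hiss level is arbitrary
    return out


def synthesize(envs, f0, fs, edges):
    """Excite matching filters, scale each by its envelope, and sum."""
    src = excitation(f0, fs)
    out = [0.0] * len(src)
    for env, (lo, hi) in zip(envs, zip(edges, edges[1:])):
        band = bandpass(src, lo, hi, fs)
        for i, (b, e) in enumerate(zip(band, env)):
            out[i] += b * e
    return out
```

The ten envelopes returned by `analyze`, together with the Fo contour passed to `synthesize`, correspond to the 11 slowly varying functions that the vocoder transmits.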
A second type of vocoder is the formant vocoder (Munson and
Montgomery 1950). In a formant vocoder, the analyzer tracks the
excitation state, Fo, and the frequencies and amplitudes of the
lowest three formants of the original speech, and transmits these
functions; in the synthesizer, resonant circuits representing the
three formants are appropriately excited, and the transmitted
functions also determine the frequency and the amplitude for each
resonator. The saving in channel capacity is greater than for a
filter-bank vocoder, but correct analysis is much more difficult.
Both filter-bank and formant synthesizers have proved to be of value
for phonetic and phonological research as well as for communications.
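The synthesizer half of a formant vocoder can be sketched in the same spirit. The example below assumes three two-pole digital resonators connected in parallel, each driven by the same buzz or hiss source and scaled by its transmitted amplitude. The parallel arrangement, the fixed bandwidths, the frame layout, and the names `resonator_coeffs` and `synthesize_formants` are all assumptions for illustration; the text does not specify them.

```python
import math
import random


def resonator_coeffs(freq, bw, fs):
    """Coefficients of a two-pole resonator with unity gain at 0 Hz.

    Difference equation: y[n] = a*x[n] + b*y[n-1] + c*y[n-2].
    """
    r = math.exp(-math.pi * bw / fs)
    b = 2.0 * r * math.cos(2.0 * math.pi * freq / fs)
    c = -r * r
    a = 1.0 - b - c
    return a, b, c


def synthesize_formants(frames, fs, frame_len,
                        bandwidths=(60.0, 90.0, 150.0), seed=0):
    """Synthesize from frames of (Fo, [(F1, A1), (F2, A2), (F3, A3)]).

    Fo == 0 marks a voiceless frame. Bandwidths (Hz) are assumed constants.
    """
    rng = random.Random(seed)
    phase = 0.0
    state = [[0.0, 0.0] for _ in range(3)]    # [y[n-1], y[n-2]] per resonator
    out = []
    for f0, formants in frames:
        coeffs = [resonator_coeffs(f, bw, fs)
                  for (f, _), bw in zip(formants, bandwidths)]
        for _ in range(frame_len):
            if f0 > 0:                         # buzz: one pulse per pitch period
                phase += f0 / fs
                if phase >= 1.0:
                    phase -= 1.0
                    src = 1.0
                else:
                    src = 0.0
            else:                              # hiss: uniform noise
                src = rng.uniform(-1.0, 1.0)
            sample = 0.0
            for k, ((a, b, c), (_, amp)) in enumerate(zip(coeffs, formants)):
                y = a * src + b * state[k][0] + c * state[k][1]
                state[k][1], state[k][0] = state[k][0], y
                sample += amp * y
            out.append(sample)
    return out
```

Each frame here carries seven numbers (Fo plus three frequency and amplitude pairs), which illustrates why the formant representation is more compact than the filter-bank one, provided the analyzer can track the formants reliably.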
Besides vocoding, there are certain other possible applications for
synthetic speech. If it is necessary for a machine to communicate
with its user -- a computer operator or a student undergoing
computer-assisted instruction -- and heavy demands are already
being made on his visual attention, spoken messages may be the
solution. But if fast random access to a large inventory of
messages is required, storage of natural speech becomes cumbersome,
for speech makes the same exorbitant demands on storage capacity
as it does on channel capacity (Atkinson and Wilson 1968).
Synthetic speech, if it could be stored in some kind of minimal
representation, would be an attractive alternative. Still another
application is a reading machine for the blind. In such a device,
printed text must be converted to spoken output with the aid of a
dictionary in which written and spoken elements