Our ideal system is not concerned with performance as such. Even
though our model is dynamic and the output is audible, the process
of synthesis is a derivation according to rules, not a life-like
imitation of a speaker's actual speech behavior. The output is
acceptable to the hearer because it follows the rules, not just
because, on the one hand, it is intelligible, despite errors and
deviations, or on the other, because it is highly natural-sounding --
though one might expect that the output of an ideal system would be
natural-sounding, if not physically naturalistic. Here our emphasis
differs somewhat from that of Ladefoged (1967) and Kim (1966), who
share our conviction that it is important to do synthesis by rule,
but for whom linguistic and phonetic theory 'must lead to the
specification of actual utterances by individual speakers of each
language; this is physical phonetics' (Ladefoged 1967: 58). From our
point of view it is not physical realism but psychological
acceptability which is the proper evidence for correctness at the
phonological and phonetic levels, just as it is on the syntactic
level.
In the preceding discussion, we have deliberately generalized the
concept of 'synthesis by rule' to embrace phonology and phonetics.
It would be possible to generalize still further, to include syntax
and semantics in a synthesis by rule system. But while computer
simulations of syntactic and semantic rules are certainly desirable,
the motivation for coupling them to a phonological and phonetic
synthesis by rule system is less compelling, primarily because a
set of syntactic rules can in practice be evaluated more or less
independently of the associated phonology and
phonetics.
4. CURRENT WORK IN SYNTHESIS BY RULE
We turn now to an assessment of the progress which has been made
toward the ideal which has just been sketched. The first thing to
be said is that most of the activity and most of the progress so
far fall under the heading of phonetic capacity. Since the other
components all depend, directly or indirectly, on phonetic capacity,
this is just as it should be. Moreover, since we want to assess the
role of the different stages of the speech chain in phonetic capacity,
it is good that, in the present state of our knowledge, the research
has been pluralistic: different types of systems have been developed
in which the contribution of different stages has been emphasized.
This has been difficult to do because appropriate data on which to
base investigations at stages before the acoustic stage are hard to
collect. At present, most of the work has been at the acoustic stage;
the relationship between vocal-tract shape and acoustic output is quite well
understood and several synthesis-by-rule systems operating on
vocal-tract shape have been developed; systems which represent the
movements of the actual articulators are beginning to show results;
and some work has been done at the neuromotor command
stage. 8
__________
8. There is, of course, another way to synthesize speech by rule,
and that is to compile an utterance from an inventory of shorter
segments, themselves either natural or synthetic. Such approaches
may have practical value, but from a theoretical standpoint they
merely serve to remind us that there is no simple correspondence
between phones and segments of the acoustic signal. See the
discussion in Liberman et al. 1959. Systems in which speech
is compiled from natural segments have been described in Harris
1953, Peterson et al. 1958, and Cooper et al. 1969.
Systems using synthetic segments are described in Estes et al.
1964, Dixon and Maxey 1968, and Cooper et al. 1969.