A Personal Narrative
When I started my graduate work in Germanic Linguistics in the Department of Modern Languages and Linguistics at Cornell University in 1972, I had not the slightest inkling of the career path that lay ahead of me. Like most linguistics majors at the time, I envisioned an academic career, but I soon became disillusioned by the terrible job market and the prospect of ending up selling shoes or driving a cab. At the same time, I took my first computer course. I had never touched a computer before, but, despite the tedious process of punching programs on cards and the day-long turnaround times to get programs executed (only to find a bug and repeat the painful process), I instantly became enthralled with the idea of combining my interests in linguistics and computers. I switched my major to general linguistics and added a minor in computer science.
When it came time to embark upon my Ph.D. thesis work in 1974, I decided to add my interest in phonetics to the mix, opting to explore, through speech synthesis, a certain hypothesis I had about sound change and, more generally, the interface between phonology and phonetics. I was fortunate in having available to me the Cornell Phonetics Lab (see History of the Cornell Phonetics Laboratory), which housed a DEC PDP-11/40 computer with an OVE IIId speech synthesizer. The lab even had a paper terminal, which, slow as it was, certainly beat punch cards! A fully equipped lab of this sort was a rare commodity at the time. I had almost sole use of the lab, since there was virtually no other serious work in phonology or phonetics at Cornell at that time, and the lab became my home away from home for many years.
I started my Ph.D. work by implementing a program to test my hypotheses in the programming language SLIP, a list-processing extension to FORTRAN IV. Despite SLIP's list-processing capabilities, it quickly became evident that the language was poorly suited to formulating the kinds of linguistic rules involved in synthesis; concepts that could be expressed in a few lines of standard linguistic notation often required pages of SLIP code. Every time I revised my hypotheses, a major programming effort was required to test them.
To overcome these barriers, I decided to shift my thesis topic away from the development of a particular linguistic theory, and instead focus on developing a generalized tool with which linguists could easily test a wide range of phonological and phonetic theories. Toward this end, I developed SRS (Speech Research System), which included a special linguistically-oriented notation for expressing synthesis rules. With this interactive system, linguists could efficiently express and test synthesis rules for a variety of languages.
In 1978, I ventured out of the Phonetics Lab to attend my first conference, a meeting of the Acoustical Society of America in Providence, Rhode Island. There, I presented a talk about SRS and demonstrated its first words. After the talk, a number of researchers and developers in phonetics and speech synthesis invited me to their institutions or suggested I apply for jobs. Clearly my decision to move from Germanic linguistics into speech synthesis research had been wise.
One person, from System Development Corporation in Virginia, actually offered me a job. Until that time, I had not considered a commercial career. Since I was not finished with my Ph.D. work, and still uncertain of my career path, I declined the job, accepting some part-time consulting work for the company instead.
The consulting opportunity gave me the best of all worlds. I could continue my Ph.D. work, get a feel for what the commercial world was about, and even make some money while I was at it! When I got my Ph.D. in 1979, I decided to continue along the path I was on. I remained at the Phonetics Lab, where I had various part-time positions teaching and doing research in the area of speech synthesis, and at the same time consulted for a number of companies. Much of the SRS rule development that I did was funded through my consulting activities.
Between 1979 and 1983, I used SRS to develop a set of text-to-speech synthesis rules for English. In 1980, I taught a class in speech synthesis in which my students used SRS to develop rudimentary rule sets for German, Dutch, and Spanish. In 1981, the first of my three children was born, and my home away from home became a combination lab/nursery-changing table, playpen and all. (Despite hearing more synthetic than natural speech in her formative first six months, I'm pleased to report that my daughter has developed quite normally.) In 1983, I did my first serious work on the development of synthesis rules for a language other than English, collaborating with Dr. Mary Beckman (then a graduate student at Cornell and now a linguistics professor at Ohio State University), and Dr. Osamu Fujimura (then head of the Linguistics and Speech Analysis Department at Bell Labs and now a professor in Speech and Hearing Science at Ohio State University) on the development of SRS-based synthesis rules for Japanese from a Romanized input.
Despite SRS's linguistically oriented rule formalism and flexible interactive environment, as I learned more and more about the nature of speech through my work with the system, it became clear to me that the particular linguistic framework built into it was preventing the development of more sophisticated models for higher-quality speech synthesis. Although at the phonetic level SRS used different "streams" for different synthesizer parameters, the parameter values and segment durations all had to be set in relation to a single, linear string of phoneme-sized segments at the phonological (abstract linguistic) level. Since SRS was biased toward this particular approach, data analysis and rule formulation absorbed the bias as well. As my research uncovered preferable alternatives, the need for a more adaptable synthesis rule development tool became apparent.
The clearest requirements for this tool were (a) a multi-tiered data structure that could make explicit the relationships between all relevant (user-definable) phonological units (e.g., phrases, words, syllables, phonemes) and quantitative phonetic values, and (b) a flexible rule formalism for manipulating this structure. In response to these needs, in 1983, I began the development of the Delta System, in collaboration with two computer scientists, Jim Kadin and Kevin Karplus. The Delta System was designed to combine the best features of general-purpose programming languages and specialized rule development tools like SRS.
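To make the first requirement concrete, here is a small illustrative sketch (in Python, with invented names; this is not actual Delta code). It models user-definable tiers of phonological units linked by shared synchronization points, so that a rule can ask, for example, which phonemes fall within a given word:

```python
# Illustrative sketch only -- not actual Delta code. Tiers hold units
# (words, syllables, phonemes, etc.) whose edges are anchored to shared
# sync-point indices, making cross-tier relationships explicit.
from dataclasses import dataclass, field

@dataclass
class Unit:
    label: str   # e.g. "cat", "k"
    start: int   # sync point where the unit begins
    end: int     # sync point where the unit ends

@dataclass
class Tier:
    name: str
    units: list = field(default_factory=list)

    def add(self, label, start, end):
        self.units.append(Unit(label, start, end))

    def units_within(self, other: Unit):
        """Units of this tier aligned inside `other` via shared sync points."""
        return [u for u in self.units
                if u.start >= other.start and u.end <= other.end]

# A tiny utterance: "a cat", with words, syllables, and phonemes aligned
words, syllables, phonemes = Tier("word"), Tier("syllable"), Tier("phoneme")
words.add("a", 0, 1);      words.add("cat", 1, 4)
syllables.add("a", 0, 1);  syllables.add("cat", 1, 4)
phonemes.add("ə", 0, 1)
phonemes.add("k", 1, 2); phonemes.add("æ", 2, 3); phonemes.add("t", 3, 4)

cat = words.units[1]
print([u.label for u in phonemes.units_within(cat)])  # ['k', 'æ', 't']
```

Quantitative phonetic tracks (formant values, durations) could be attached as further tiers anchored to the same sync points, which is the sense in which such a structure relates abstract units to phonetic values.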
Looking back now, it is clear how exceedingly naive I was about the resources that would be required to develop the Delta System. I had imagined that I could continue my easy lifestyle, obtaining sufficient funding for the Delta development through part-time consulting work and part-time teaching positions at Cornell, while at the same time continuing my basic research in the area of synthesis. While ultimately I did succeed in completing the system's development, it was a 10-year process that required revenues and hours far beyond what I could ever have imagined, and led me more and more into the commercial world.
In 1983, I began doing business as "Eloquent Technology," continuing my consulting work and bootstrapping the development of the Delta System through a combination of consulting revenue, private investments, SRS and Delta licensing revenue, loans, grants, contracts, and other revenue-generating activities. Three years later, the Department of Modern Languages and Linguistics came to the realization that I wasn't going away, and gave me a long-term position as a half-time Senior Research Associate. In 1988, Eloquent Technology was incorporated, and I hired my first employees, who worked out of my house.
With the Delta System, I was able to explore new models for synthesis, including the phone-and-transition model for expressing generalizations about the timing of formant patterns. This model captures the acoustic regularities underlying speech better than the more conventional SRS-type models on which most present-day rule-based synthesis systems are based, and it leads to a natural division of speech generation rules between those that are universal to all languages and those that are specific to a group of dialects or to a particular dialect.
Between 1990 and 1996, I was the principal investigator or project director on 13 grants/contracts in the area of speech synthesis (one to Cornell and 12 to Eloquent Technology, Inc.). In the various grant projects, my collaborators and I researched a large number of phonetically diverse languages and five dialects of English; developed a modular approach to multi-language and multi-dialect text-to-speech synthesis; developed synthesis rules for a number of languages; and optimized Delta and the rules developed with it for eventual productization. In our modular approach, language-universal components generate the phonological and acoustic properties common to all languages, dialect-universal components generate the properties common to all dialects of the language in question, and dialect-specific components fill in the rest.
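As an illustration of the layered idea only (the function and property names below are invented for the sketch, not ETI's actual components), the modular approach can be pictured as a pipeline of rule layers applied from most general to most specific:

```python
# Hypothetical sketch of the layered architecture -- invented names, not
# ETI's actual rule components. Each layer adds the properties at its
# level of generality; later (more specific) layers build on earlier ones.

def universal_rules(utt):
    # properties assumed common to all languages (e.g. phrase-final lengthening)
    utt["final_lengthening"] = True
    return utt

def english_rules(utt):
    # properties shared by all dialects of English
    utt["stress_timed"] = True
    return utt

def us_english_rules(utt):
    # dialect-specific details (e.g. flapping of /t/ between vowels)
    utt["flapping"] = True
    return utt

def apply_layers(text, layers):
    utt = {"text": text}
    for layer in layers:   # most general first, most specific last
        utt = layer(utt)
    return utt

utt = apply_layers("butter", [universal_rules, english_rules, us_english_rules])
```

Adding a new dialect then means writing only the final layer; adding a new language means writing its dialect-universal layer plus one layer per dialect, while the universal layer is reused throughout.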
In 1995, Eloquent Technology, Inc. (ETI), with six employees, was bursting at the seams in its basement quarters, and moved to an outside office. On August 26, 1996, the company formed a strategic partnership with IBM, which acquired certain portions of the technology developed by ETI, and ultimately incorporated it into its ViaVoice line of speech products. This day was momentous not only for ETI, but for me personally, as finally, after more than fifteen years, I was able to start collecting a salary that I didn't have to put back into research and development. Between 1996 and early 1998, with a staff of nine, ETI developed complete text-to-speech systems for five additional languages/dialects (German, UK English, Italian, Castilian Spanish, and Parisian French) and incorporated them into the ETI-Eloquence product, thereby illustrating the power of the extensive technology foundation on which the product was based.
Since the release of its first multi-language version of ETI-Eloquence in 1998, ETI has added many new features to the system, improved the quality of the languages and dialects, and added new languages and dialects, including Brazilian Portuguese, Finnish, Japanese, Mandarin Chinese, Canadian French, Mexican Spanish, and Korean. ETI has also optimized the system for minimal memory utilization, so that it will be more generally useful for embedded applications.
In January, 2001, Eloquent Technology, Inc. merged with SpeechWorks International, Inc., a publicly-traded company headquartered in Boston. The Eloquent group of approximately fifteen has remained intact in Ithaca. Its technology will now be more broadly marketed, and will be integrated into SpeechWorks solutions of various kinds.