CARNEGIE MELLON PRONOUNCING DICTIONARY

CMUDICT

Author

Robert L. Weide (weide@cs.cmu.edu)

Peter Jansen (pjj@cs.cmu.edu) for questions regarding the combination process.

Date: 9-7-94

-------------------------------------------

Information extracted and edited by Jean-Louis Duchet
COLEX project, Université de Nantes and FORELL-AIT, Université de Poitiers, May 1997.

--------------------------------------------

Copyright

The Carnegie Mellon Pronouncing Dictionary [cmudict.0.2] is Copyright 1993 by Carnegie Mellon University. Use of this dictionary, for any research or commercial purpose, is completely unrestricted. If you make use of or redistribute this material, we would appreciate acknowledgement of its origin.

Contents

The CMUDICT directory contains a pronunciation dictionary of American English (cmudict.0.1.Z is the first one we put out, cmudict.0.3.Z is the latest and most up-to-date) containing approximately 100k words and their transcriptions; lists of the words are in cmulex.0.1.Z and cmulex.0.3.Z. We use these dictionaries at CMU in our speech understanding systems. The phone set for this dictionary contains 39 phones, which can be found in phoneset.0.3.

Note: The number of entries makes it possible to include proper nouns to an extent rarely achieved by any other dictionary.

Symbols

Example of an entry:

DAFFODILS D AE1 F AH0 D IH2 L Z

Stress is indicated by means of a numeral [012] attached to a vowel:

0 = no stress

1 = primary stress

2 = secondary stress
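
Because only vowel symbols carry a digit, the stress pattern of a word can be read off mechanically. The following is a minimal Python sketch, assuming the transcription is available as a space-separated string like the one in the entry above; the function name stress_pattern is illustrative and not part of the dictionary distribution.

```python
def stress_pattern(transcription):
    """Return the stress digits of a transcription, one per vowel, in order."""
    # Only vowel symbols end in a digit (0, 1 or 2); consonants carry none.
    return "".join(p[-1] for p in transcription.split() if p[-1] in "012")


print(stress_pattern("D AE1 F AH0 D IH2 L Z"))  # 102
```

For DAFFODILS the result reads as primary stress on the first syllable, no stress on the second, and secondary stress on the third.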

Alternate transcriptions are identified with a numeral in parentheses as part of the lexical entry.

Example:

DUPLICATED D UW1 P L IH0 K EY2 T IH0 D

DUPLICATED(2) D Y UW1 P L AH0 K EY2 T IH0 D
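
To group alternate transcriptions under a single word, the parenthesised numeral has to be separated from the headword. The sketch below shows one way to do this in Python; the name split_headword is hypothetical, and numbering the unmarked entry as variant 1 is an assumption of this sketch rather than something the dictionary states.

```python
import re

# A headword optionally followed by a variant number in parentheses,
# e.g. "DUPLICATED" or "DUPLICATED(2)".
_HEADWORD = re.compile(r"^(?P<word>.+?)(?:\((?P<variant>\d+)\))?$")


def split_headword(head):
    """Return (word, variant) for a non-empty headword string."""
    m = _HEADWORD.match(head)
    if m is None:
        raise ValueError("empty or malformed headword: %r" % head)
    # Assumption: an entry without a numeral is treated as variant 1.
    variant = int(m.group("variant")) if m.group("variant") else 1
    return m.group("word"), variant


print(split_headword("DUPLICATED"))     # ('DUPLICATED', 1)
print(split_headword("DUPLICATED(2)"))  # ('DUPLICATED', 2)
```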

Each entry word is followed by two spaces, then by the phonetic transcription, in which sounds are separated by single spaces. Only vowels may be represented by three characters, the last one being a numeral indicating the stress level. The end of the transcription is indicated by ASCII character 10, the line feed (^10 is the search code to be used with MS-Word), which appears as a blank square with most screen fonts. It is sometimes (erratically?) separated from the last character of the transcription by three spaces.
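
The layout described above can be parsed with a few lines of Python. This is only a sketch under the stated format (headword, spaces, space-separated sounds, optional trailing spaces, line feed); the function name parse_entry is illustrative.

```python
def parse_entry(line):
    """Split one dictionary line into (headword, list of sounds)."""
    # Drop the terminating line feed (ASCII 10) and the spaces that
    # sometimes precede it.
    line = line.rstrip()
    # The headword ends at the first space; split() then absorbs the
    # second separating space along with the single spaces between sounds.
    head, _, transcription = line.partition(" ")
    return head, transcription.split()


word, sounds = parse_entry("DAFFODILS  D AE1 F AH0 D IH2 L Z   \n")
print(word, sounds)
```

Splitting on whitespace rather than on exactly two spaces keeps the sketch tolerant of copies of the file in which the spacing has been altered.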

Phonetic symbols and their IPA equivalents

The same symbol is used for [ə] and [ʌ], which are treated phonemically as two stress-conditioned distributional variants of one and the same phoneme: only the stress digit makes it possible to distinguish between them: AH0 and AH1 or AH2. Similarly [ɚ] and [ɝ] are represented by ER0 and ER1 (or ER2).

The difference symbolized by AH0 and AH2 in the following pair of examples reflects the difference between [ə] (unstressed) and [ʌ] (with primary or secondary stress):

PUNCTILIOUS P AH0 NG K T IH1 L IY0 AH0 S

PUNCTUALITY P AH2 NG K CH UW0 AE1 L IH0 T IY0

Similarly the difference between IY0 and IH0 reflects the phonetic difference between unstressed tense [i] (at least potentially syllabic) in -ious, and unstressed lax [ɪ] in -ity.

Vowels

AA: [ɑ]; AY: [aɪ]
AE: [æ]; IH: [ɪ]; AW: [aʊ]
AH0: [ə]; IY: [i]; EY: [eɪ]
AH1: [ʌ]; OW: [oʊ]
AO: [ɔ]
UH: [ʊ]; ER0: [ɚ]
EH: [ɛ]; UW: [u]; ER1: [ɝ]

Consonants

P B M T D N K G S Z F V W are obvious enough, as is HH for [h].

The other consonant symbols are:
NG: [ŋ]; SH: [ʃ]; CH: [tʃ];
Y: [j]; ZH: [ʒ]; JH: [dʒ]
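
The correspondences in this section can be turned into a small converter. The Python sketch below is illustrative only: the IPA values in the tables are the conventional ones for these symbols and are an assumption of the sketch, the name to_ipa is hypothetical, stress marks are not rendered, and only the symbols listed on this page are covered (any other symbol is simply lower-cased).

```python
# Conventional IPA values for the symbols listed above (an assumption of
# this sketch, not data taken from the dictionary files).
VOWELS = {
    "AA": "ɑ", "AE": "æ", "AO": "ɔ", "EH": "ɛ", "IH": "ɪ", "IY": "i",
    "UH": "ʊ", "UW": "u", "AY": "aɪ", "AW": "aʊ", "EY": "eɪ", "OW": "oʊ",
}
CONSONANTS = {"NG": "ŋ", "SH": "ʃ", "CH": "tʃ", "ZH": "ʒ", "JH": "dʒ", "Y": "j"}


def to_ipa(sound):
    """Convert one dictionary symbol (e.g. "AH0", "NG", "P") to IPA."""
    # Separate the stress digit, present only on vowels.
    if sound[-1] in "012":
        base, stress = sound[:-1], sound[-1]
    else:
        base, stress = sound, None
    # AH and ER are the two symbols whose value depends on stress.
    if base == "AH":
        return "ə" if stress == "0" else "ʌ"
    if base == "ER":
        return "ɚ" if stress == "0" else "ɝ"
    if base in VOWELS:
        return VOWELS[base]
    if base in CONSONANTS:
        return CONSONANTS[base]
    # The "obvious" consonants (P, B, M, ...) are taken at face value.
    return base.lower()


print("".join(to_ipa(s) for s in "P AH0 NG K T IH1 L IY0 AH0 S".split()))
```

Applied to the PUNCTILIOUS and PUNCTUALITY examples, the first AH comes out as [ə] in one word and [ʌ] in the other, which is the stress-conditioned distinction described earlier.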

Sample file

The dictionary file for the letter L (128Kb) may be downloaded.

It is in text only format.


Jean-Louis Duchet, Laboratoire FORELL,
Equipe d'analyse informatique des textes,
Université de Poitiers.