Acoustic model A model describing the probabilistic behavior of the encoding of the linguistic information in a speech signal. LVCSR systems use acoustic units corresponding to phones or phones in context. The predominant approach uses continuous density hidden Markov models (HMMs) to represent context-dependent phones.
Allophone A pronunciation variant of a phoneme in a particular context, such as the realization of the phoneme /t/ in type (aspirated /t/), butter (flapped /t/), or hot (final unreleased /t/). Triphones and quinphones are two common models of allophones used by speech recognizers.
Automatic Language Recognition Process by which a computer identifies the language being spoken in a speech signal.
Automatic Speaker Recognition Process by which a computer identifies the speaker from a speech signal.
Backoff Mechanism for smoothing the probability estimates of rare events by relying on less specific models (acoustic or language models).
Confidence score Posterior probability associated with a hypothesis (e.g. a recognized word, an identified speaker, ...). For a speech recognizer, the sum of the word confidence scores is an estimate of the number of correct words. Confidence scores are commonly evaluated by computing the NCE metric.
Filler word Hesitation or filled-pause words such as uhm, euh, ...
FIR filter A Finite Impulse Response (FIR) filter produces an output that is the weighted sum of the current and past inputs.
HMM Hidden Markov Models (or Probabilistic functions of Markov chains)
IIR filter An Infinite Impulse Response (IIR) filter produces an output that is the weighted sum of the current and past inputs, and past outputs.
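The difference between the two filter types above can be shown as difference equations. A minimal sketch (the coefficient values below are arbitrary, chosen only to illustrate the finite vs. infinite impulse response):

```python
def fir_filter(x, b):
    """FIR: y[n] = sum_k b[k] * x[n-k] -- weighted sum of current and past inputs."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, bk in enumerate(b):
            if n - k >= 0:
                acc += bk * x[n - k]
        y.append(acc)
    return y

def iir_filter(x, b, a):
    """IIR: adds feedback terms a[m] * y[n-1-m] on past outputs."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, bk in enumerate(b):
            if n - k >= 0:
                acc += bk * x[n - k]
        for m, am in enumerate(a):
            if n - 1 - m >= 0:
                acc += am * y[n - 1 - m]
        y.append(acc)
    return y

impulse = [1.0, 0.0, 0.0, 0.0]
# Two-tap moving average (FIR): response dies out after the taps are exhausted.
print(fir_filter(impulse, [0.5, 0.5]))    # [0.5, 0.5, 0.0, 0.0]
# One-pole smoother (IIR): the feedback makes the impulse response decay forever.
print(iir_filter(impulse, [1.0], [0.5]))  # [1.0, 0.5, 0.25, 0.125]
```

Feeding an impulse through each filter makes the naming concrete: the FIR output is finite, while the IIR output decays geometrically without ever reaching zero.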
Language model A language model captures the regularities in the spoken language and is used by the speech recognizer to estimate the probability of word sequences. One of the most popular methods is the so-called n-gram model, which attempts to capture the syntactic and semantic constraints of the language by estimating the frequencies of sequences of n words.
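The n-gram estimation above can be sketched with maximum-likelihood bigram counts on a toy corpus (the function name and sentence markers `<s>`/`</s>` are illustrative conventions, not from a particular toolkit; a real system would add smoothing such as backoff):

```python
from collections import Counter

def bigram_probs(sentences):
    """Maximum-likelihood bigram estimates P(w2|w1) = count(w1,w2) / count(w1),
    with <s> and </s> marking sentence boundaries."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(words[:-1])                 # contexts (everything but </s>)
        bigrams.update(zip(words[:-1], words[1:]))  # adjacent word pairs
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

probs = bigram_probs(["the cat sat", "the dog sat"])
print(probs[("the", "cat")])  # 0.5 -- "the" is followed by "cat" in 1 of 2 cases
print(probs[("cat", "sat")])  # 1.0
```

Unseen bigrams get probability zero under this raw estimate, which is exactly the problem that smoothing and backoff (see the Backoff entry) address.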
Lattice A word lattice is a weighted acyclic graph where word labels are assigned either to the graph's edges (or links) or to its vertices (or nodes). Acoustic and language model weights are associated with each edge, and a time position is associated with each vertex.
Lexicon or pronunciation dictionary A list of words with pronunciations. For a speech recognizer it includes all words known by the system, where each word has one or more pronunciations with associated probabilities.
LVCSR Large Vocabulary Continuous Speech Recognition (large vocabulary means 20k words or more). The size of the recognition vocabulary affects the processing requirements.
MAP estimation (Maximum A Posteriori) A training procedure that attempts to maximize the posterior probability Pr(M|X,W) of the model parameters, which are therefore seen as random variables (X is the speech signal, W is the word transcription, and M represents the model parameters).
MAP decoding A decoding procedure (speech recognition) which attempts to maximize the posterior probability Pr(W|X,M) of the word transcription given the speech signal X and the model M.
MLE (Maximum Likelihood Estimation) A training procedure (the estimation of the model parameters) that attempts to maximize the training data likelihood given the model f(X|W,M) (X is the speech signal, W is the word transcription, and M is the model).
MMIE (Maximum Mutual Information Estimation) A discriminative training procedure that attempts to maximize the posterior probability of the word transcription Pr(W|X,M) (X is the speech signal, W is the word transcription, and M is the model). This training procedure is also called Conditional Maximum Likelihood Estimation.
MFCC Mel Frequency Cepstrum Coefficients. The Mel scale approximates the sensitivity of the human ear. Note that there are many other frequency scales "approximating" the human ear (e.g. the Bark scale).
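One widely used analytic form of the Mel scale (there are several variants in the literature; this is the common 2595/700 formula, not the only one) can be sketched as:

```python
import math

def hz_to_mel(f_hz):
    """Common analytic Mel-scale formula: roughly linear below 1 kHz,
    logarithmic above, mirroring the ear's decreasing frequency resolution."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(round(hz_to_mel(1000)))  # ~1000: the constants are chosen so 1 kHz maps near 1000 mel
print(round(hz_to_mel(8000)))  # far less than 8000: high frequencies are compressed
```

In MFCC extraction this mapping determines the spacing of the filterbank: filters are placed at equal intervals in mel, hence increasingly wide intervals in Hz.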
MLP Multi-Layer Perceptron is a class of artificial neural network. It is a feedforward network mapping some input data to some desired output representation. It is composed of three or more layers with nonlinear activation functions (usually sigmoids).
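A minimal forward pass through such a network can be sketched as follows (the weights below are arbitrary illustrative values, and the helper names are not from any particular library):

```python
import math

def sigmoid(z):
    """Logistic activation, squashing any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer: h = sigmoid(W1 x + b1), y = sigmoid(w2 . h + b2)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

# 2 inputs -> 2 hidden units -> 1 output, always in (0, 1) due to the sigmoid
y = mlp_forward([0.5, -1.0],
                [[1.0, -2.0], [0.5, 0.5]], [0.0, 0.1],  # hidden weights and biases
                [1.0, -1.0], 0.2)                       # output weights and bias
print(0.0 < y < 1.0)  # True
```

The nonlinearity in the hidden layer is what distinguishes an MLP from a linear model; without it, stacking layers would collapse into a single linear map.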
OOV word Out Of Vocabulary word -- Each OOV word causes more than one recognition error (usually between 1.5 and 2 errors). An obvious way to reduce the error rate due to OOVs is to increase the size of the vocabulary.
%OOV Out Of Vocabulary word rate.
Percent of correct words The percentage of reference words that are correctly recognized. This measure can be used to evaluate speech recognizers whenever insertion errors can be ignored. It is defined as %WAcc + %Ins, where %Ins is 100 times the number of inserted words divided by the number of reference words.
Perplexity The relevance of a language model is often measured in terms of test set perplexity, defined as pow(Prob(text|language-model),-1/n), where n is the number of words in the test text. The test perplexity depends on both the language being modeled and the model. It gives a combined estimate of how good the model is and how complex the language is.
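The formula above can be computed directly from per-word model probabilities; a minimal sketch (done in log space, as real implementations must be, to avoid numerical underflow on long texts):

```python
import math

def perplexity(word_probs):
    """PP = P(text)^(-1/n) where P(text) is the product of the per-word
    probabilities the model assigns; computed in log space for stability."""
    n = len(word_probs)
    log_p = sum(math.log(p) for p in word_probs)
    return math.exp(-log_p / n)

# A model assigning each word probability 1/100 has perplexity 100,
# as if it were choosing uniformly among 100 equally likely words:
print(round(perplexity([0.01] * 50)))  # 100
```

This is why perplexity is often described as the average branching factor the model faces: lower perplexity means the model is, on average, less uncertain about the next word.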
Phone Symbol used to represent the pronunciations in the lexicon for a speech recognizer or a speech synthesis system. The number of phones can be somewhat smaller or larger than the number of phonemes in the language. The phone set is chosen to optimize the system accuracy.
Phoneme An abstract representation of the smallest phonetic unit in a language which conveys a distinction in meaning. For example the sounds /d/ and /t/ are separate phonemes in English because they distinguish words such as do and to. To illustrate phoneme differences across languages, the two /u/-like vowels in the French words tu and tout are not distinct phonemes in English, whereas the two /i/-like vowels in the English words seat and sit are not distinct phonemes in French.
Pitch or F0 The pitch is the fundamental frequency of a (periodic or nearly periodic) speech signal. In practice, the pitch period can be obtained from the position of the maximum of the autocorrelation function of the signal. See also degree of voicing, periodicity and harmonicity. (In psychoacoustics the pitch is a subjective auditory attribute).
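The autocorrelation method mentioned above can be sketched on a synthetic signal (the function name and the 50-400 Hz search range are illustrative choices; real pitch trackers add windowing, normalization, and voicing decisions):

```python
import math

def estimate_f0(signal, sample_rate, f0_min=50.0, f0_max=400.0):
    """Pick the lag maximizing the autocorrelation within the plausible
    pitch-period range, and convert that lag back to a frequency."""
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    best_lag, best_r = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        r = sum(signal[i] * signal[i - lag] for i in range(lag, len(signal)))
        if r > best_r:
            best_r, best_lag = r, lag
    return sample_rate / best_lag

# A 100 Hz sine sampled at 8 kHz has a period of exactly 80 samples:
sr = 8000
sig = [math.sin(2 * math.pi * 100 * n / sr) for n in range(800)]
print(estimate_f0(sig, sr))  # 100.0
```

Restricting the lag search to a plausible pitch range is essential: the autocorrelation also peaks at multiples of the period, which is the source of the octave errors pitch trackers are known for.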
PLP analysis Perceptual Linear Prediction features are derived as follows: Compute the perceptual power spectral density (Bark scale); perform equal loudness preemphasis and take the cube root of the intensity (intensity-loudness power law); apply the IDFT to get the equivalent of the autocorrelation function; fit a linear prediction (LP) model and transform the result into cepstral coefficients (LPCC analysis).
Quinphone (or pentaphone) Phone in context where the context usually includes the 2 left phones and the 2 right phones.
Recording channel Means by which the audio signal is recorded (direct microphone, telephone, radio, etc.)
Sampling Rate Number of samples per second used to code the speech signal (usually 16000, i.e. 16 kHz for a bandwidth of 8 kHz). Telephone speech is sampled at 8 kHz. 16 kHz is generally regarded as sufficient for speech recognition and synthesis. The audio standards use sample rates of 44.1 kHz (Compact Disc) and 48 kHz (Digital Audio Tape). Note that signals must be filtered prior to sampling, and the maximum frequency that can be represented is half the sampling frequency. In practice a higher sample rate is used to allow for non-ideal filters.
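The need to filter before sampling can be illustrated numerically: once sampled at 8 kHz, a 7 kHz tone is indistinguishable from a (sign-flipped) 1 kHz tone, since 7000 = 8000 - 1000 folds back into the representable band.

```python
import math

sr = 8000                    # 8 kHz sampling, so the Nyquist frequency is 4 kHz
f_low, f_high = 1000, 7000   # 7 kHz is above Nyquist and aliases onto 1 kHz

low  = [math.sin(2 * math.pi * f_low  * n / sr) for n in range(16)]
high = [math.sin(2 * math.pi * f_high * n / sr) for n in range(16)]

# The two sampled sequences are exact negatives of each other: without an
# anti-aliasing filter before sampling, the high tone corrupts the low band.
print(all(abs(h + l) < 1e-9 for h, l in zip(high, low)))  # True
```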
Sampling Resolution Number of bits used to code each signal sample. Speech is normally stored in 16 bits. Telephony quality speech is sampled at 8 kHz with a 12 bit dynamic range (stored in 8 bits with a non-linear function, i.e. A-law or U-law). The dynamic range of the ear is about 20 bits.
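The non-linear coding mentioned above can be sketched with the continuous mu-law companding curve (note this is the idealized formula; the actual G.711 telephony codec approximates it piecewise-linearly, so the helper names and this exact form are illustrative):

```python
import math

MU = 255.0  # standard mu value used for 8-bit telephony

def mu_compress(x):
    """Continuous mu-law companding curve for x in [-1, 1]: expands small
    amplitudes so an 8-bit code covers a wider dynamic range."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

# About 1% of the input range maps to roughly 23% of the output range:
print(round(mu_compress(0.01), 3))
# The mapping is invertible, so quiet samples keep their precision:
print(round(mu_expand(mu_compress(0.25)), 6))  # 0.25
```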
Spectrogram A spectrogram is a plot of the short-term power of the signal in different frequency bands as a function of time.
Speech Analysis Feature vector extraction from a windowed signal (20-30ms). It is assumed that speech has short-time stationarity and that a feature vector representation captures the needed information (depending on the task) for future processing. The most popular sets of features are cepstral coefficients obtained with a Mel Frequency Cepstral (MFC) analysis or with a Perceptual Linear Prediction (PLP) analysis.
Speaker diarization Speaker diarization, also called speaker segmentation and clustering, is the process of partitioning an input audio stream into homogeneous segments according to speaker identity. Speaker partitioning is a useful preprocessing step for an automatic speech transcription system. By clustering segments from the same speaker, the amount of data available for unsupervised speaker adaptation is increased, which can significantly improve the transcription performance. One of the major issues is that the number of speakers is unknown a priori and needs to be automatically determined.
Triphone (or Phone in context) A context-dependent HMM phone model (the context usually includes the left and right phones).
Voicing The degree of voicing is a measure of the degree to which a signal is periodic (also called periodicity, harmonicity or HNR). In practice, the degree of periodicity can be obtained from the relative height of the maximum of the autocorrelation function of the signal.
Word Accuracy The word accuracy (WAcc) is a metric used to evaluate speech recognizers. The percent word accuracy is defined as %WAcc = 100 - %WER. It should be noted that the word accuracy can be negative. The Word Error Rate (WER, see below) is a more commonly used metric and should be preferred to the word accuracy.
Word Error Rate The word error rate (WER) is the most commonly used metric to evaluate speech recognizers. It is a measure of the average number of word errors taking into account three error types: substitution (the reference word is replaced by another word), insertion (a word is hypothesized that was not in the reference) and deletion (a word in the reference transcription is missed). The word error rate is defined as the sum of these errors divided by the number of reference words. Given this definition the percent word error can be more than 100%. The WER is roughly proportional to the cost of correcting the recognizer output.
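The error counts above are obtained by aligning the hypothesis to the reference with the standard edit-distance dynamic program; a minimal sketch (scoring tools such as NIST sclite add normalization and per-speaker reporting on top of this core):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / len(reference),
    with the error counts given by the minimum edit distance alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: minimum edits turning the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                 # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                 # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution (or match)
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution (cat -> hat) and one deletion (down) against 4 reference words:
print(word_error_rate("the cat sat down", "the hat sat"))  # 0.5
```

Because the denominator is the reference length while insertions have no matching reference word, a hypothesis much longer than the reference can push the WER above 100%, as noted in the entry above.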