Speech-to-Text Conversion


Speech-to-text conversion is the process of converting spoken words into written text. This process is also often called speech recognition. Although the two terms are almost synonymous, speech recognition is sometimes used to describe the wider process of extracting meaning from speech, i.e. speech understanding. The term voice recognition should be avoided, as it is often associated with the process of identifying a person from their voice, i.e. speaker recognition.

How does it work?

All speech recognition systems rely on at least two models: an acoustic model and a language model. In addition, large vocabulary systems use a pronunciation model. It is important to understand that there is no such thing as a universal speech recognizer. To get the best transcription quality, all of these models can be specialized for a given language, dialect, application domain, type of speech, and communication channel.
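The way the acoustic model and the language model combine is commonly written as the decision rule below. This is the standard textbook formulation, given here as an illustration and not as a description of any particular product. The symbols are notation introduced for this example: X is the sequence of acoustic observations, W is a candidate word sequence, and \hat{W} is the transcription the recognizer outputs.

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid X) \;=\; \arg\max_{W} \; p(X \mid W)\, P(W)
```

Here p(X | W) is the score given by the acoustic model and P(W) is the probability given by the language model. In a large vocabulary system, the pronunciation model supplies the link between each word in W and the sound units that the acoustic model scores.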

Like any other pattern recognition technology, speech recognition cannot be error free. The accuracy of a speech transcript is highly dependent on the speaker, the style of speech, and the environmental conditions. Speech recognition is harder than people commonly think, even for a human being. Humans are used to understanding speech, not to transcribing it, and only speech that is well formulated can be transcribed without ambiguity.

From the user's point of view, a speech recognition system can be categorized based on its use: command and control, dialog system, text dictation, audio document transcription, etc. Each use has specific requirements in terms of latency, memory constraints, vocabulary size, and adaptive features.


The VoxSigma software suite offers large vocabulary multilingual speech-to-text capabilities with state-of-the-art accuracy. It has been specifically designed for professional users who need to transcribe large quantities of audio and video documents, such as broadcast data, either in batch mode or in real time. It can also be used to analyze call-center data.

The complete speech-to-text conversion process is done in three steps. The speech recognition software first identifies the audio segments containing speech, then recognizes the language being spoken if it is not known a priori, and finally converts the speech segments to time-coded text. VoxSigma includes adaptive features allowing the transcription of noisy speech, such as speech with background music. The result is a fully annotated XML document including speech and non-speech segments, speaker labels, words with time codes, and high-quality confidence scores. This XML file can be directly indexed by a search engine, or alternatively can be converted into plain text.
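As an illustration of the last point, the sketch below converts an annotated XML transcript into plain text, optionally dropping low-confidence words. The XML layout is an assumption: the element and attribute names (transcript, segment, word, type, speaker, conf) are invented for this example and are not the actual VoxSigma schema, and the helper to_plain_text is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated transcript. Element and attribute names are
# invented for illustration and are NOT the actual VoxSigma schema.
SAMPLE_XML = """
<transcript lang="en">
  <segment type="nonspeech" start="0.00" end="0.50"/>
  <segment type="speech" speaker="S1" start="0.50" end="2.10">
    <word start="0.50" end="0.82" conf="0.97">hello</word>
    <word start="0.90" end="1.40" conf="0.42">wurld</word>
  </segment>
  <segment type="speech" speaker="S2" start="2.30" end="3.00">
    <word start="2.30" end="2.70" conf="0.91">good</word>
    <word start="2.75" end="3.00" conf="0.88">morning</word>
  </segment>
</transcript>
"""


def to_plain_text(xml_string, min_conf=0.0):
    """Return one line per speech segment, prefixed with the speaker label.

    Words whose confidence score is below min_conf are left out.
    """
    root = ET.fromstring(xml_string.strip())
    lines = []
    for segment in root.iter("segment"):
        # Skip non-speech segments (silence, music, noise).
        if segment.get("type") != "speech":
            continue
        words = [
            word.text
            for word in segment.iter("word")
            if float(word.get("conf")) >= min_conf
        ]
        if words:
            lines.append(f"{segment.get('speaker')}: {' '.join(words)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_plain_text(SAMPLE_XML))
    print(to_plain_text(SAMPLE_XML, min_conf=0.5))
```

Filtering on the confidence score is one possible use of the per-word annotations. A search index might instead keep every word and store the score alongside it.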

The VoxSigma software suite is offered as a Web service via a REST API over HTTPS, allowing customers to quickly reap the benefits of regular improvements to the technology and take advantage of additional features offered by the online environment. The services are available 24/7/365 with failover servers and geographic redundancy.

Vocapia Research also offers services to adapt, tune, or create specific models or systems tailored to your needs. Tailoring models to your application is the best way to get the best possible results. High accuracy is essential to maximize your ROI because, to a first approximation, the cost of using a speech-to-text system is proportional to the system's error rate. Using a system with 80% accuracy (i.e. 20% error) may therefore cost almost twice as much as using a system with 90% accuracy (i.e. 10% error). The same holds for systems with 90% and 95% accuracy: although the difference in error rate is only 5 percentage points, the first system makes twice as many errors as the second.
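The arithmetic behind this comparison can be checked with a short sketch. It assumes, as stated above, that cost is proportional to error rate. The function name relative_cost is introduced here for illustration only.

```python
def relative_cost(accuracy_a_pct, accuracy_b_pct):
    """Return how many times more system A costs than system B.

    Assumes cost is proportional to error rate, so the result is simply
    the ratio of the two error rates. Accuracies are given in percent.
    """
    error_a = 100 - accuracy_a_pct
    error_b = 100 - accuracy_b_pct
    return error_a / error_b


if __name__ == "__main__":
    for acc_a, acc_b in [(80, 90), (90, 95)]:
        ratio = relative_cost(acc_a, acc_b)
        print(f"{acc_a}% vs {acc_b}% accuracy: {ratio:.1f}x the errors")
```

Both comparisons give a ratio of 2, even though the gap in accuracy is 10 points in the first case and only 5 points in the second. The ratio of error rates is what matters, not the difference in accuracy.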