Speech Recognition Mar 27 · 2 min read

Speech Recognition Accuracy Comparison Test 2023

Speech Recognition (SR), also known as Automatic Speech Recognition (ASR), uses a combination of hardware and software to process captured sound and convert it to text.

What is Speech Recognition?

Speech Recognition (SR), also known as Automatic Speech Recognition (ASR), is a system that processes captured audio and converts it to text. It is the first step toward letting users control devices and systems by speaking instead of using conventional inputs such as keystrokes or buttons.

Why Speech Recognition?

Phone conversations are still the main channel of interaction between people and businesses, but analyzing conversations manually requires a vast amount of time and effort. Today, speech analysis software built on automatic speech recognition (ASR) technology makes this process easier. ASR transcribes recordings automatically (speech-to-text), which takes far less time and effort.

SR is the core technology behind Conversational AI solutions such as virtual assistants and voice-enabled IVR systems. Companies of all sizes and across industries now use conversational solutions powered by SR to improve the experience of their customers and employees.

What Do We Work on?

Recently, speech technologies have been moving from deep neural network-based Hybrid modeling to end-to-end (E2E) modeling. While E2E models achieve state-of-the-art results in most benchmarks in terms of SR accuracy, Hybrid models are still used in a large proportion of commercial SR systems.

SESTEK is an R&D center with 100+ engineers. We closely follow state-of-the-art technologies and upgrade our products to deliver the best solutions for our customers.

For this reason, we ran a study in which we trained our models with the new technology, compared the resulting versions, and measured their performance.

Difference Between Hybrid and E2E

Traditional Hybrid speech recognition systems train separate modules independently, such as the acoustic model, the language model, and the phonetic dictionary, and combine these modules when decoding the input audio recording. E2E, on the other hand, has a much simpler training pipeline and performs decoding through a single neural network. This reduces training and decoding time and allows joint optimization with downstream processing, such as natural language understanding (NLU).
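To make "combining these modules during decoding" more concrete, here is a toy sketch of how a Hybrid decoder might pick a transcript by adding a weighted language model score to an acoustic score. The candidate sentences, the log-probability values, and the names `hybrid_best` and `lm_weight` are all made up for illustration; they do not describe SESTEK's engine or any specific toolkit.

```python
# Toy illustration only: all scores below are invented log-probabilities.
# The acoustic model slightly prefers the wrong sentence; the language
# model strongly prefers the right one.
acoustic_log_prob = {
    "recognize speech": -4.1,
    "wreck a nice beach": -3.9,
}
language_log_prob = {
    "recognize speech": -2.0,
    "wreck a nice beach": -6.5,
}


def hybrid_best(acoustic, language, lm_weight=1.0):
    """Return the candidate with the highest combined score.

    Combined score = acoustic score + lm_weight * language model score.
    """
    return max(
        acoustic,
        key=lambda text: acoustic[text] + lm_weight * language[text],
    )


with_lm = hybrid_best(acoustic_log_prob, language_log_prob, lm_weight=1.0)
without_lm = hybrid_best(acoustic_log_prob, language_log_prob, lm_weight=0.0)
print(with_lm)
print(without_lm)
```

In this sketch the result depends on a hand-set weight between two independently trained scores, which is the kind of tuning step an E2E model avoids by learning everything in one network.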

As for the disadvantages of Hybrid systems, optimizing each module on its own does not guarantee that the combined system used during decoding is also optimal. Training each module may also require different expertise. For example, building a phonetic dictionary may require an expert in linguistics.

E2E has been able to eliminate these disadvantages of Hybrid systems.

SR Accuracy Test

Word Error Rate (WER) is the standard metric for comparing SR accuracy. It is expressed as a percentage and is derived by comparing a reference transcript with the transcript the SR engine produces for the same audio. A lower WER indicates a more accurate transcript.

WER = (substitutions + insertions + deletions) / number of words spoken (that is, words in the reference transcript)
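As a minimal sketch of the formula above, the snippet below counts the minimum number of substitutions, insertions, and deletions with a word-level edit distance and divides by the number of reference words. The function name `word_error_rate`, the lowercasing and whitespace tokenization, and the sample sentences are assumptions made for this example; production benchmarks usually apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction (multiply by 100 for a percentage)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the minimum number of edits needed to turn the
    # reference words seen so far into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word is missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution ("account" -> "count") and
# one deletion ("today") against a 6-word reference.
wer = word_error_rate(
    "please check my account balance today",
    "please check my count balance",
)
print(f"WER: {wer:.1%}")
```

Because insertions count as errors, WER can exceed 100% when the SR output contains many more words than the reference.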

For our tests, we used 1-hour call center recordings in English from 2 different industries, transcribed them, and calculated the final word error rates over the data set.
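One common way to arrive at a single WER for a whole data set is to pool the errors: sum the edits over all recordings and divide by the total number of reference words. The sketch below assumes this pooling approach; the article does not state which aggregation was used in the test, and the helper names and sample transcripts are invented for illustration.

```python
from typing import List, Sequence, Tuple


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Minimum substitutions + insertions + deletions between word lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            )
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Sequence[Tuple[str, str]]) -> float:
    """Pooled WER over (reference, hypothesis) pairs, as a fraction."""
    total_errors = 0
    total_words = 0
    for reference, hypothesis in pairs:
        ref = reference.split()
        total_errors += edit_distance(ref, hypothesis.split())
        total_words += len(ref)
    if total_words == 0:
        raise ValueError("no reference words in the data set")
    return total_errors / total_words


# Hypothetical transcripts: a long recording with no errors and a
# short one with two inserted words.
pairs = [
    ("i would like to pay my bill", "i would like to pay my bill"),
    ("thank you", "thank you very much"),
]
pooled = corpus_wer(pairs)
print(f"Pooled WER: {pooled:.1%}")
```

Pooling weights each recording by its length, so a short recording with a high error rate does not dominate the result the way it would in a simple average of per-recording WERs.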

SESTEK was benchmarked against major SR providers and consistently achieved the lowest WER in this test.

Speech Recognition Accuracy Comparison

Disclaimer: These results do not mean that we are definitively better than the other vendors. Speech recognition involves calculating and optimizing millions of parameters over a vast search space, and the process is highly stochastic (it follows a pattern that can be analyzed statistically but not predicted precisely). A vendor's SR engine can outperform the others on one recording and perform differently on another.

Debi Çakar, Product Analyst, Product Management Team, SESTEK

