Speech Recognition Jun 03 · 2 min read

Speech Recognition Accuracy Test 2024 – Arabic Edition

Introducing SESTEK's Speech Recognition Accuracy Test 2024 – Arabic Edition, where SESTEK is benchmarked against leading SR providers. This test highlights SESTEK's superior performance and reliability in Arabic speech recognition.

We are thrilled to unveil our latest benchmarking results for Arabic Speech Recognition (SR) services. In our comprehensive evaluation, we compared our Arabic SR solutions with those from providers such as Google, Azure, AWS, Whisper, and Speechmatics. This assessment used two datasets: a publicly available dataset featuring diverse native Arabic speakers, and a dataset of customer service representative phone calls.

The Dialect Challenge in Speech-to-Text

Creating an effective SR engine demands sophisticated algorithms and models capable of translating complex audio into text. This conversion requires an in-depth understanding of language nuances, including accents and dialects.

A primary hurdle for SR technology is the variability of regional dialects, especially in Arabic. Systems trained primarily on standardized linguistic data often fail to accurately transcribe speech that diverges from the norm.

While Modern Standard Arabic (MSA) serves as the formal language in most official settings across the Middle East and North Africa (MENA), the everyday spoken language differs greatly. Regional dialects vary widely in pronunciation, grammar, and vocabulary. To handle these variations, SR systems must be trained on extensive datasets that cover a variety of dialects, which improves transcription accuracy on real-world speech.

Our accuracy tests employed the Word Error Rate (WER) method, a common metric for evaluating SR systems. WER is the number of word-level errors in the SR output (substitutions, deletions, and insertions) divided by the total word count of the accurate "ground truth" transcription, expressed as a percentage. The lower the WER, the better the engine.
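To make the metric concrete, here is a minimal Python sketch of a WER calculation. It is an illustration only, not the scoring code used in this benchmark: the function names are hypothetical, and it assumes plain whitespace tokenization with no text normalization, whereas a real evaluation would typically normalize both transcripts first.

```python
from typing import Sequence


def edit_distance(reference: Sequence[str], hypothesis: Sequence[str]) -> int:
    """Minimum number of word substitutions, deletions, and insertions
    needed to turn the reference into the hypothesis (Levenshtein distance)."""
    # At the start of row i, previous[j] is the distance between the first
    # i-1 reference words and the first j hypothesis words.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]  # i reference words vs. empty hypothesis: i deletions
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from output
            insertion = current[j - 1] + 1  # extra word in the output
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER as a fraction: word-level errors divided by reference word count."""
    # Assumption: whitespace tokenization, no normalization of either text.
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# One substitution (fox -> dog) and one insertion (high) against 5 reference words.
print(word_error_rate("the quick brown fox jumps",
                      "the quick brown dog jumps high"))  # 0.4
```

Multiplying the result by 100 gives the percentage form described above. Because insertions count as errors, WER can exceed 100% when the output contains many extra words.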

Test Datasets

The benchmark was conducted using the following datasets:

1. Arabic MediaSpeech Dataset

Context: Publicly available dataset sourced from Al Arabiya, France 24 Arabic, and BBC News.

Subset: Random 1-hour subset used for tests (results as of April 15, 2024).

Results:

[Figure: Speech Recognition accuracy rate]

2. Customer Service Representative Phone Calls

Context: Real-life telephone conversations in the Egyptian dialect.

Technique: Fine-tuning was done for a specific domain and customer.

Results:

[Figure: Speech Recognition accuracy rate]

The following models were used in the tests:

AssemblyAI Uni-1 (nano)
Google's latest-short
Speechmatics enhanced
Whisper Large-v3

The Impact of Fine-Tuning

Our test highlights the critical role of fine-tuning in enhancing SR system accuracy. By training on extensive datasets that include a range of dialects and refining acoustic models to better handle these variations, SR systems can improve transcription accuracy for non-standard language varieties. This is essential for ensuring reliable SR performance in practical applications where audio quality and background noise may vary.

Wrapping Up

At SESTEK, we have been developing SR engines for different languages over the last 20 years. We have extensive expertise in the customer service vertical, and we are pleased with our near-zero error rate for the Arabic language.

This benchmark also underscores the substantial benefits that fine-tuning offers for specific dialects, revealing notable variability in accuracy across different SR providers. As we continue to confront the unique complexities of the Arabic language, the need for ongoing technological enhancements remains clear. Through dedicated fine-tuning and advancements, we aim to set new standards in Arabic speech recognition accuracy.

Disclaimer: The speech recognition process includes calculating and optimizing millions of parameters over a vast search space. It is highly stochastic (that is, it follows a pattern that can be analyzed statistically but not predicted precisely). A vendor’s SR engine can perform better than others for a specific recording, but the same engine can perform differently for other recordings.

Author: Debi Çakar, SESTEK Product Team
