Siri and Alexa’s ‘all business’ sibling will take ASR to another level

Adding Automatic Speech Recognition (ASR) to your business now offers tangible ROI for a small initial investment

Look beyond the new ways Siri and Alexa can help us and you’ll see a clock counting down the impending arrival of their ‘all business’ sibling. It will not be long until some marketing exec christens the next iteration of Automatic Speech Recognition (ASR) technology, but until then there’s time to look at how ASR has developed for business use.

ASR, or speech-to-text technology, has been climbing the accuracy mountain for over 50 years and, thanks to deep learning, has finally reached a peak where it can support important business functions.

During 2014 transcription accuracy jumped to almost 90%. Over a period of months in that year Google, Microsoft, AWS and IBM traded blows about the new standards in accuracy they were setting. But this was a phoney war, as the results were achieved in ‘laboratory’ conditions and didn’t reflect real-life situations. Testing the same systems in the true-to-life conditions of a business-to-customer call can push the Word Error Rate (WER) up to somewhere between 20% and 25%.
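For readers unfamiliar with the metric, WER is the number of word substitutions, deletions and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that standard definition; the function name and the sample sentences are assumptions made for this example, not output from any of the engines mentioned.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words (word-level Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missed)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: two of ten reference words are misrecognised.
reference = "i would like to check the balance on my account"
hypothesis = "i would like to check the ballots on my count"
print(f"WER: {word_error_rate(reference, hypothesis):.0%}")
```

A lower WER is better, so roughly 90% accuracy corresponds to a WER of about 10%.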

However, the pursuit of ‘business quality’ accuracy continued.


A crucial part of every business-to-customer phone call is the environmental noise that is present in both channels of any conversation. This can vary from low-intensity office noise to loud music or, even worse, lively conversations taking place near the customer’s phone. This variability in conditions poses a huge difficulty for ASR which, like every technology under the artificial intelligence (AI) umbrella, depends on data to improve its performance and assumes that all possible conditions are, to some degree, present in the training data.

This is why the key driver for ASR accuracy is not entirely the volume of data. If it were, Google or AWS would offer the best-performing ASR tools. They do not, because a significant success driver is data quality, combined with training data drawn from a relevant environmental and semantic space (i.e. matched training conditions).

This window of opportunity has seen the development of ASR tuned for a specific vertical that outperforms the best results of the generic engines offered by Google and AWS.

ImpacTech’s personalized solution

Automatic speech recognition (ASR) is the machine-based translation of spoken language to written text. In a conversational setting, ASR must be able to recognize an unlimited set of words (i.e. open vocabulary) spoken in a natural way (not dictated). In addition, a complete ASR solution must be able to distinguish the turns of the speakers in the conversation.
Figure 1: A general representation of ImpacTech’s ASR system.

ImpacTech’s deep ASR system offers real time and open vocabulary recognition of conversational speech optimized for each target domain, ensuring that even small sets of matched training data are utilized to their maximum potential. Furthermore, once the system is put in production, the incoming stream of data is utilized in a Continual Learning fashion in order to ensure a regular adaptation and further customization of the whole system.

As shown in the figure above, the input phone call is first passed through a Voice Activity Detection (VAD) module that detects which participant in the conversation is talking. Once the VAD signals the start of speech in either of the two channels, our ASR system starts transcribing. The VAD is trained to ignore any ‘noise’ (secondary conversations, music and other office noises happening in the background) and to focus on the key participants.
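As a rough illustration of this gating step, the sketch below marks speech regions in each channel with a plain energy threshold. It is a deliberately simple stand-in and not ImpacTech’s trained VAD, which is described here as a learned model able to ignore background conversations and music; the function name, frame size and threshold are assumptions chosen for the example.

```python
from typing import Dict, List, Tuple


def detect_speech(
    samples: List[float], frame_size: int = 160, threshold: float = 0.01
) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) pairs, end exclusive, where the mean
    energy of a frame exceeds the threshold. Assumed parameters, not tuned."""
    segments: List[Tuple[int, int]] = []
    start = None
    n_frames = len(samples) // frame_size
    for f in range(n_frames):
        frame = samples[f * frame_size:(f + 1) * frame_size]
        energy = sum(s * s for s in frame) / frame_size
        if energy > threshold and start is None:
            start = f                      # speech begins
        elif energy <= threshold and start is not None:
            segments.append((start, f))    # speech ends
            start = None
    if start is not None:                  # speech ran to the end of the audio
        segments.append((start, n_frames))
    return segments


# Synthetic two-channel call: each channel is analysed separately, so the
# segment list also says which participant was talking.
call: Dict[str, List[float]] = {
    "agent": [0.0] * 320 + [0.5] * 480 + [0.0] * 160,
    "customer": [0.0] * 800 + [0.5] * 160,
}
for channel, audio in call.items():
    print(channel, detect_speech(audio))
```

An energy threshold cannot tell a nearby conversation from the caller, which is exactly the limitation a trained VAD is meant to address.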

In the next step, the Acoustic Modelling (AM) and Language Modelling (LM) modules work together to provide fully time-stamped output, relative to the beginning of the conversation, at two levels: per word and per dialogue turn within each transcribed conversation. This output can be used in a subsequent layer as input to various AI modules that exploit natural language understanding (NLU) methods to draw conclusions on topics of interest.
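One way to picture this two-level output is as a list of time-stamped words that can be rolled up into dialogue turns. The sketch below is only an assumed representation for illustration: the class names, field names and grouping rule are hypothetical and do not describe ImpacTech’s actual output schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float   # seconds from the beginning of the conversation
    end: float
    speaker: str   # e.g. the channel the word was recognised on


@dataclass
class Turn:
    speaker: str
    start: float
    end: float
    text: str


def group_into_turns(words: List[Word]) -> List[Turn]:
    """Merge consecutive words from the same speaker into one dialogue turn."""
    turns: List[Turn] = []
    for word in sorted(words, key=lambda w: w.start):
        if turns and turns[-1].speaker == word.speaker:
            last = turns[-1]
            last.end = max(last.end, word.end)
            last.text += " " + word.text
        else:
            turns.append(Turn(word.speaker, word.start, word.end, word.text))
    return turns


words = [
    Word("hello", 0.0, 0.4, "agent"),
    Word("there", 0.5, 0.9, "agent"),
    Word("hi", 1.2, 1.4, "customer"),
]
for turn in group_into_turns(words):
    print(f"[{turn.start:.1f}-{turn.end:.1f}] {turn.speaker}: {turn.text}")
```

Keeping both levels lets a downstream module search for a single word by time or reason about a whole turn.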

An acoustic model is a statistical model that represents the speech generation process and the relationships that exist between the acoustic signal and the various linguistic units that make up speech, such as syllables or phonemes. The AMs are trained from audio recordings and their text transcriptions.
A language model is a probabilistic model that is used to predict the next word in a sentence. It inherently takes into account the grammatical and syntactic attributes of a language and provides a probability to each possible spoken utterance recognized by the AM. The LMs are trained from text transcriptions.
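To make the idea of next-word prediction concrete, the sketch below trains a tiny bigram model from a handful of transcribed sentences. It is a toy example of the general technique, without smoothing, and is far simpler than the deep models discussed in this article; the function names and the three-sentence corpus are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows the previous one."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()  # <s> marks sentence start
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probability(counts, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev was never seen."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0


# Hypothetical in-domain transcriptions.
corpus = ["check my balance", "check my account", "close my account"]
model = train_bigram_model(corpus)
print(next_word_probability(model, "my", "account"))
print(next_word_probability(model, "my", "balance"))
```

Because the counts come from in-domain text, words that are common in the target vertical receive higher probability, which is one reason matched training data matters.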

On top of state-of-the-art deep architectures in the VAD and ASR systems, ImpacTech offers customized statistical models, tailored to the needs of each customer. A smart data selection module evaluates the importance and richness of each transcribed conversation, and the most interesting ones are kept and utilized in an adaptive fashion to optimize all statistical models. The more the system is used the better it learns the environment and the domain, leading to a continuous improvement of results.

With as little as 10 hours of target speech for an initial customization, this system can be adapted to any specific vertical, using either pre-recorded audio or live phone conversations, and optimized for speed or accuracy according to the specific transcription needs of a business.


Figure 2: Compliance analysis on top of ASR output

And when transcription is no longer a time-consuming, labour-intensive task, so many business opportunities open up.

The Microsoft State of Global Customer Service survey reported that 40% of people still prefer using phone or voice channels when dealing with contact centers. And when you consider that 80% of all business communications take place over the phone, the potential is huge.

When converted to text, speech is searchable and easier to check for keyword usage. In finance, for example, this significantly increases compliance efficiency by making it easier to identify regulatory infringements (see Figure 2 above).
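A keyword check of this kind can be sketched in a few lines once the call exists as text. The example below is a minimal illustration only: the function name is hypothetical, and the flagged phrases are invented for the example rather than taken from any real regulatory list.

```python
import re
from typing import Dict, List


def find_flagged_phrases(transcript: str, phrases: List[str]) -> Dict[str, List[int]]:
    """Return each flagged phrase found in the transcript, mapped to the
    character offsets where it occurs (case-insensitive, whole words)."""
    hits: Dict[str, List[int]] = {}
    for phrase in phrases:
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        offsets = [match.start() for match in pattern.finditer(transcript)]
        if offsets:
            hits[phrase] = offsets
    return hits


# Hypothetical transcript and watch list.
transcript = (
    "This is a guaranteed return, and I repeat, "
    "guaranteed return with no risk."
)
watch_list = ["guaranteed return", "no risk", "insider"]
print(find_flagged_phrases(transcript, watch_list))
```

With word-level timestamps in the ASR output, each hit could also be mapped back to the moment in the recording where it was spoken.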

This ASR system also improves the customer experience (CX) when customers contact a business. The first point of contact is often an interactive voice response (IVR) system. When used with ASR technology, it enables a caller to perform self-service tasks, such as checking account balances or authenticating their identity before speaking with an agent, and to state the reason for the call so that they can be connected to the appropriate agent.

However, this is a one-dimensional view of the potential of automatic transcription as speech contains so much more information than just text.

Our ASR system also makes that information available for natural language processing (NLP) and NLU.



Sentiment analysis is the interpretation and classification of emotions (positive, negative and neutral) within text data using text analysis techniques. Sentiment analysis allows businesses to identify customer sentiment toward products, brands or services in online conversations and feedback.
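The three-way classification described above can be illustrated with a very small word-list approach. This is a toy sketch of the general idea, not the method used in any production system: the word lists, the function name and the scoring rule are all assumptions made for the example, and real sentiment models are considerably more sophisticated.

```python
# Hypothetical word lists, chosen only for this illustration.
POSITIVE = {"great", "happy", "thanks", "helpful", "love"}
NEGATIVE = {"angry", "terrible", "cancel", "unhappy", "problem"}


def classify_sentiment(text: str) -> str:
    """Label text as positive, negative or neutral by counting listed words."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for utterance in [
    "Thanks, that was really helpful!",
    "I am unhappy and want to cancel.",
    "Please send the invoice.",
]:
    print(classify_sentiment(utterance), "|", utterance)
```

Applied to each dialogue turn of a transcript, a classifier like this gives a per-turn view of how the conversation is going.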

When combined with sentiment analysis, the transcription becomes more valuable because conversational intent can be determined. A business can understand how a customer feels, which enables deeper engagement and is valuable to marketing, sales and customer support teams.

The potential efficiency gains in sales are highly significant. Lead management is optimized by identifying the most engaged leads and how close a deal is to being closed. It is also easier to identify engagement opportunities.

In marketing the combination of speech to text and sentiment analysis can be used for competitive insight and market research, and to accurately measure campaign success.

Together, ASR and sentiment analysis extract feedback about customer support to accurately measure the efficiency of the customer support process.

When this technology is fully embedded in a business it can also drive product adoption by interpreting discussions around products and brands. These insights can play a role in product development and brand protection through the analysis of these conversations.


More than any other modern technology ASR is a game-changer for call centres. Their role has expanded beyond simply increasing revenue. Nowadays, the goal is to create and sustain the highest level of satisfaction throughout the customer journey.

With ASR, quality control and agent effectiveness can be measured efficiently and in real time. Bench-marking can be set quantitatively by identifying the level every agent needs to achieve in attentiveness and responsiveness. It also speeds up response time to problems because they are easier to identify.

A business can build a knowledge base from real-life interactions, and create best practices based on analysis of the data. Problems like dispute resolution can be handled more efficiently by using the knowledge base to build response methods based on the best previous outcomes.

But perhaps most importantly, this ASR system has such a low barrier to entry that it is accessible to businesses of all sizes, and it offers a far simpler first step towards digital transformation, with less organisational disruption, than most other routes involving AI technology.
