How Automated Speech Recognition (ASR) Can Be Used in Video Marketing: Q&A With Kathryn Lye of Speechmatics


“The ‘automatic’ nature of ASR means that speech within content can be exposed for subsequent natural language processing (NLP) tools to recognize keywords, frequently used words and other unique elements that users might use as search terms.”

Kathryn Lye, vice president of marketing at Speechmatics, discusses how automated speech recognition (ASR) technology has changed the digital media landscape. She talks about how ASR works with audio files and can be used in video marketing.

In this edition of MarTalk Connect, Lye showcases how marketers can increase ROI by using ASR for video marketing strategies in 2020. She also sheds light on how brands can build voice-to-text capabilities in sync with data security parameters, and more.

Key takeaways from this Q&A on automated speech recognition technology:

  • Learn how to adopt speech recognition technology in video marketing
  • Top three advantages of using automated speech recognition technology
  • The latest in automated speech recognition technology and the future of video marketing

Here are the edited excerpts of our conversation with Kathryn Lye of Speechmatics on interactive video marketing:

Kathryn, please tell us about your career path, what your role is at Speechmatics, and how you bring about exceptional results in your role.

My career has predominantly been spent in tech and high-growth B2B tech start-ups. Although I started in what would be classed as a generalist role, I’ve very much found myself immersed in the digital marketing world (albeit with a product marketing slant) before moving into management and leadership roles. My recent roles have been very much about building out a function and team and increasingly attempting to humanize a very digital world. I think the key is always ensuring the customer is front of mind and working with technology to provide engaging experiences at scale, then using any insight derived from that data to learn, continuously improve those experiences, and bring increasing business value. In my mind, get this right and revenue will follow.

In what ways has automated speech recognition (ASR) technology changed the digital media landscape?

The exponential growth of media content means that challenges in the digital media landscape are being exacerbated. People are consuming content in an increasing number of ways, and businesses are having to adapt and think of better ways to manage their content assets, whether it’s a video streaming platform, archived content for a broadcaster, or recorded interviews for a journalist. Content is king, and while diversity of content and choice is a powerful benefit, identifying the right content and enhancing its searchability and discoverability arguably deliver far more value. Likewise, knowing what’s being said about brands, for example, in video content, is important, and we’re likely to see marketers demanding this from their media monitoring solutions.

ASR providers give content owners and managers the ability to unlock the information held within their media content. Historically, only the title of the content could be used for indexing, with little information available about the content itself, so searching could prove challenging. ASR transforms the speech within the content into text, providing a searchable archive that helps users identify the exact content they are trying to find.

The ‘automatic’ nature of ASR means that speech within content can be exposed for subsequent NLP tools to recognize keywords, frequently used words, and other unique elements that users might use as search terms. This hands-off capability of ASR means that the extraction of this data has minimal impact on the organizations that deploy ASR.
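As a rough illustration of what a downstream step might do with an ASR transcript, the sketch below counts the most frequent non-trivial words so they can be used as candidate search terms. It is a toy stand-in for an NLP tool, not any vendor's method: the function name `top_keywords`, the stop-word list, and the sample transcript are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}


def top_keywords(transcript: str, n: int = 5) -> list[tuple[str, int]]:
    """Return the n most frequent words in a transcript, ignoring stop words."""
    # Lowercase the text and keep runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return counts.most_common(n)


if __name__ == "__main__":
    sample = "Captions help video. Video captions matter, and video wins."
    print(top_keywords(sample, 2))
```

The resulting word counts could be stored as metadata alongside the title, which is what makes the spoken content searchable.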

The extraction of voice data within media content that ASR can expose not only enhances searchability but in the case of consumer-facing solutions, enables powerful personalization. Key trends, words, associated genres, etc. can be used alongside other data points like abandonment rates and engagement time of content to deliver better-targeted personalization, recommendations, and intelligent engagements.

Essentially ASR enables content owners to expose extensive levels of metadata within their content to enhance the capability to index, find, and personalize their content to deliver better experiences for the users of the content.

What are the top 4 challenges media companies face when adopting speech recognition technology?

  1. Is the technology accurate enough for use?
  2. Will the technology improve the experience expected of their customers?
  3. How difficult is the technology to integrate and deploy into production?
  4. Will the technology provide operational efficiencies?

From our research, 53% of media organizations have already integrated speech technology within their solutions as they believe it generates significant competitive advantage (60%) and better customer experience (33%). A further 20% of companies said that speech technology was a priority for their business in the next 5 years, and another 20% said that they are currently considering the adoption of speech technology.

Integrating any technology comes with challenges, and speech is no different. Media companies find that the complexity of deploying speech technology into production is a key challenge (53%), along with the technology not yet reaching a suitable accuracy level, so that a combination of human and machine is required (53%). The complexity of deploying voice technology is a constant challenge, which means resources must be allocated to ensure successful integration. It also requires the speech technology provider to have processes, procedures, documents, support, and training in place so that the deployment process is as easy as possible for their customers.

Just 29% of businesses said that the cost was a key challenge, indicating that the media market is concerned about accuracy and output over price. The concerns around costs are minimal because media companies understand and value that deploying voice comes with operational efficiencies and reduced cost, and so a return on investment is both realistic and achievable.

How does ASR work with audio files? How accurate is the technology at this point?

While ASR has become increasingly popular in recent years, through the growth of smart speakers and other smart devices, ASR technology is nothing new. The adoption of consumer devices, however, has put this technology into the public eye, and this demand has forced the rapid advancement in machine learning practices to deliver better and better word error rate accuracy across more and more languages.

One of the challenges around ASR is that the transformation from speech to text seems easy, and when done right, it is a seamless and elegant operation. However, it is an incredibly difficult task due to the complexities of languages and the way in which they are spoken. Accents and dialects add further layers of complexity, as does understanding context. For example, ‘o’ the letter, ‘0’ the number, and ‘oh’ the word all sound the same but mean different things. Another example is “I read a red book,” where the transcription engine must not only transcribe each word but also use contextual analysis to know which version of the word to place in the transcript.

Accuracy comes in a variety of forms; however, Word Error Rate is the most common measurement. While accurate words are vital to transcription, other elements, such as proper punctuation, deliver significant value, especially when it comes to human readability.
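For readers who want to see how Word Error Rate is typically calculated, here is a minimal sketch using the common definition: the number of word substitutions, deletions, and insertions needed to turn the ASR output into the reference transcript, divided by the number of words in the reference. The function name and the lowercase, whitespace-based tokenization are assumptions for illustration; real evaluations usually apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between hypothesis and reference, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missing from output)
                curr[j - 1] + 1,                  # insertion (extra word in output)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One wrong word out of five gives a WER of 0.2.
    print(word_error_rate("I read a red book", "I read a read book"))
```

Note that this measure treats every word error equally and ignores punctuation and capitalization, which is one reason it does not fully capture readability.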

With the growing trend in video marketing, how can ASR technology be used for video captioning?

High-quality transcription with ASR is the first step to automate and speed up the production of captions. With more and more content being produced, content providers are under increasing pressure to not only make their content accessible but to remain on the right side of compliance in some industries. The expectation around the ability to consume content through social media platforms with no sound, but with captions, is the norm now, and so the demand to create these captions has grown in response. Historically, human transcribers and stenographers were the only way to do this and, while they are still very much an important element within the journey from video to captions, the ability to offload the initial bulk transcription task enables these stenographers to deliver more captions, with better quality and more quickly.
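To make the step from transcript to captions concrete, the sketch below turns timed transcript segments into a SubRip (.srt) caption file, a widely used plain-text caption format. The input shape, a list of (start seconds, end seconds, text) tuples, is an assumption made for this example; actual ASR services return timing information in their own formats, and a human editor would normally still review the result.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Build SRT caption text from (start, end, text) segments (hypothetical input shape)."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text.strip()}")
    # Caption blocks are separated by a blank line.
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    demo = [(0.0, 2.5, "Content is king."), (2.5, 5.0, "But don't forget your captions.")]
    print(to_srt(demo))
```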

What are your top 3 tips for marketers to increase ROI when using ASR for video marketing strategies in 2020?

  1. Choosing the right tool is crucial. Think about your requirements and test the tools with your content. In my experience, choosing the wrong tool will cost you time which you likely don’t have.
  2. Transcriptions are a requirement for good captions. Given changes in how people expect to consume content, captions are not optional; they should form a logical part of your workflow.
  3. Well-captioned content will improve experience and engagement. If captions are present, there’s more chance your video content will be consumed by more people. If you don’t provide captions, people will certainly scroll past.

What is your number one tip for brands just starting out on a video content strategy? What should they prioritize?

Firstly, and just like any other piece of content, think about the story you want to tell and what you want the outcome to be. What is it that you want people to think, feel, or do? Remember, video is typically best consumed in small, digestible chunks, so break up your story into snippets of 30 seconds or less. Consider how best to convey the message: are talking heads appropriate, or a short animation, a screen capture, something else, or a combination?

Budget can be a huge factor, and video can be costly to commission. Once you know what outcome you’re looking for and the message you’re attempting to deliver, think hard about what should go to a professional videographer for production and post-production vs. what you can shoot yourself. Most of all, try stuff; you’ll soon learn what works and what doesn’t. But don’t forget your captions.

What role does data security play in automatic speech recognition? How can brands build voice-to-text capabilities in sync with data security parameters?

Data security has never been such a hot topic. It plays a huge role in ASR, and organizations need to go into relationships with their ASR partners with their eyes wide open and understand how their providers’ data security policies can impact them. Providers that support on-premises deployment options have the upper hand when it comes to data security, as this capability means voice data captured and transcribed remains with the brand or customer of the solution. When using cloud-only solutions, which typically are very secure, brands will need to consider any potential implications of ‘handing over’ their data to the ASR provider.

Is ASR more suited to B2B or B2C marketers? What are the best practices for both spaces?

It’s important for both. We’ve probably seen it more in B2C so far, but B2B will follow suit. When it comes to captioning, as marketers, I hope we can self-regulate rather than having legislation imposed, which we’re seeing in some markets. Captions are just good practice, and tools are increasingly available, so there’s no reason we can’t.

The bit that’s not really being talked about is media monitoring or the management of content assets. Communications teams place high value in knowing when and where their brands are being spoken about but, so far as I can tell, there’s been little expectation or pressure on the platform providers to include voice. With advances in technology, there’s no reason for people not to consider voice as part of their mix. If you want or need people to be able to search content, voice should be part of the mix, especially if you create a high volume of content that you may wish to surface, repurpose, or reference at a later date.

Regardless of B2B or B2C, we’re marketing to people and ultimately the technology is there to support us and enable us to scale and be more productive.

What does 2020 have in store for the automated speech recognition technology landscape? 

ASR providers are putting extensive effort into the reduction of word error rate (WER), and soon the battlegrounds over differentiation will shift to new areas and metrics. Because of this, ASR providers will be forced to look to more customer-relevant terms for accuracy, whether that be accurate identification of speakers, readability, accurate placement of punctuation, or something else.

There will be more focus on the real-world application of the technology, which will also reduce the importance of WER as the default measure. The conversation will shift from something that has seemed somewhat academic to something that is more meaningful. For example, the ability to accurately transcribe video or audio recorded in noisy locations or with lots of background noise, or audio from low-quality recording devices will be a focus as everyone becomes a creator of content. This will be harder to test and compare between providers, but the industry will need to find a way.

New applications will be found, and this will continue to fuel growth in the market. I think the market will continue to change with new providers appearing and others being acquired, which will happen (and we’ve seen it happen with Sonos acquiring Snips). How this plays out is yet to be seen, but there’s no doubt there’ll be increasing investment in the technology and more reasons for organizations to innovate and unlock value from their voice data.

Thank you, Kathryn, for sharing your insights on automated speech recognition technology. We hope to talk to you again.

About Kathryn Lye:

Kathryn is VP marketing at Speechmatics, an award-winning provider of Automatic Speech Recognition (ASR). As an experienced B2B marketing professional, Kathryn has a proven proactive and results-driven approach. Seeking out continuous improvements to marketing performance, she is an advocate for digital marketing and the use of technology to scale and improve the productivity of marketing teams while optimizing the buyer journey.

About Speechmatics:

Speechmatics leads the market as an any-context speech recognition engine for companies to rapidly build innovative applications. Analysts have recognized Speechmatics as a pioneer in machine learning voice engineering. It enables companies to build applications that detect and transcribe voice in any context in real-time. Its neural networks consider acoustics, languages, dialects, multiple speakers, punctuation, capitalization, context, and implicit meanings. In 2019, Speechmatics received the Queen’s Award for Enterprise Innovation.


MarTalk Connect is an interview series where marketing technology companies that are making a difference connect with us and share their stories. Join us as we talk to them about their product journeys, insights on the categories they serve, and some bonus pro-tips.


If you are a martech company and wish to connect with us to tell your story, write to [email protected]

Found this interview interesting and want to know more about the future of automated speech recognition? Follow us on Twitter, Facebook, and LinkedIn.