Gladia believes real-time processing is the next frontier of audio transcription APIs

French startup Gladia, which offers a speech-recognition application programming interface (API), has raised $16 million in a Series A funding round. Essentially, Gladia’s API lets you turn any audio file into text with a high level of accuracy and low turnaround time.
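For a rough sense of what this looks like in practice, here is a minimal sketch of calling such a speech-to-text API from Python. The endpoint, header and field names are illustrative placeholders, not Gladia's documented interface.

```python
import requests

# Hypothetical endpoint and field names -- placeholders for illustration,
# not Gladia's documented interface.
API_URL = "https://api.example-transcription.com/v1/transcribe"
API_KEY = "your-api-key"

with open("interview.mp3", "rb") as audio_file:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"audio": audio_file},
    )

response.raise_for_status()
print(response.json()["transcript"])  # full text of the recording
```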

While Amazon, Microsoft and Google all offer speech-to-text APIs as part of their cloud-hosting product suites, they don’t perform as well as newer models offered by specialized startups.

There has been tremendous progress in this field over the past couple of years, especially after the release of Whisper by OpenAI. Gladia competes with other well-funded companies in the space, such as AssemblyAI, Deepgram and Speechmatics.

Gladia originally offered a fine-tuned version of Whisper’s speech-to-text model with some much-needed improvements. For instance, the startup supports diarization out of the box: it can detect when there are multiple speakers in a conversation and split both the recording and the transcribed text depending on who’s talking.
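To illustrate what diarization buys you, here is a sketch of how a client might consume speaker-labeled output. The utterance structure shown is an assumption about what such a response could look like, not the actual schema.

```python
# Hypothetical diarized output -- the exact response schema is an assumption.
utterances = [
    {"speaker": 0, "start": 0.0, "end": 4.2, "text": "Thanks for joining the call."},
    {"speaker": 1, "start": 4.5, "end": 9.1, "text": "Happy to be here, let's dive in."},
    {"speaker": 0, "start": 9.4, "end": 12.0, "text": "Great, first question."},
]

# Print the transcript turn by turn, labeled by speaker.
for u in utterances:
    print(f"[{u['start']:>6.1f}s] Speaker {u['speaker']}: {u['text']}")
```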

Gladia supports 100 languages and a wide variety of accents. This reporter can confirm that it works, as we’ve been using Gladia to transcribe some interviews, and accents weren’t an issue.

The startup offers its speech-to-text model as a hosted API that users can leverage in their own applications and services. Over 600 companies use Gladia, including several meeting recorders and note-taking assistants like Attention, Circleback, Method Financial, Recall, Sana and Veed.io.

That particular use case is interesting, because many companies have to chain API calls. They first turn speech into text, which they then feed into a large language model (LLM), such as GPT-4o or Claude 3.5 Sonnet, to extract knowledge from large walls of text.
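In practice, that chaining looks something like the sketch below: one call to a speech-to-text API, then a second call to an LLM. The transcription endpoint is a placeholder; the LLM step follows the OpenAI Python SDK's chat-completions interface, but the model choice and prompt are assumptions for illustration.

```python
import requests
from openai import OpenAI


def transcribe(path: str) -> str:
    """Step 1: speech to text via a hosted STT API (endpoint is a placeholder)."""
    with open(path, "rb") as f:
        resp = requests.post(
            "https://api.example-transcription.com/v1/transcribe",
            headers={"Authorization": "Bearer <stt-api-key>"},
            files={"audio": f},
        )
    resp.raise_for_status()
    return resp.json()["transcript"]


def summarize(transcript: str) -> str:
    """Step 2: feed the wall of text to an LLM to extract the key points."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Summarize this meeting in five bullet points:\n\n{transcript}",
        }],
    )
    return completion.choices[0].message.content


print(summarize(transcribe("meeting.mp3")))
```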

With the new funding, Gladia wants to simplify that pipeline by integrating audio intelligence and LLM-based tasks into a single API call. For instance, a customer could get a conversation summary condensed into a handful of bullet points without having to rely on a third-party LLM API.
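Collapsed into a single call, that same workflow might look like the sketch below. The `summarization` flag and the response fields are assumptions about how a combined endpoint could be exposed, not documented parameters.

```python
import requests

# Hypothetical combined endpoint: transcription plus LLM-based summarization
# in one request. Parameter and field names are assumptions.
with open("meeting.mp3", "rb") as f:
    resp = requests.post(
        "https://api.example-transcription.com/v1/transcribe",
        headers={"Authorization": "Bearer <api-key>"},
        files={"audio": f},
        data={"summarization": "true", "summary_format": "bullets"},
    )

resp.raise_for_status()
result = resp.json()
print(result["transcript"])  # full text, as before
print(result["summary"])     # bullet-point summary, no second API call needed
```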

The other issue that Gladia is looking to solve is latency. You may have seen some demos of real-time audio conversations with an AI-based calling agent (11x has a good demo on its website), and these systems have to be able to transcribe in near real time to make such conversations sound as human-like as possible.

“We realized that real time wasn’t very good in terms of quality in the market in general. And people had a weird use case. They were doing real-time processing, and then they were grabbing the audio and running it in batch. We wondered: ‘Why are you doing this?’ They told us: ‘The quality isn’t good in real-time processing, so we transcribe it in batch afterwards,’” co-founder and CEO Jean-Louis Quéguiner (pictured above; right) told TechCrunch.

Gladia chose to tackle this problem, and it can currently transcribe a live conversation with a latency of under 300 milliseconds. The company claims that the real-time processing is now more or less as good as the default, asynchronous batch transcription API, but it’s hard for us to judge without some proper testing. As Quéguiner says, the startup is aiming for “batch quality with real-time capabilities.”
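Real-time transcription of this kind typically runs over a WebSocket: the client streams small audio chunks and receives partial and final transcripts back as they are produced. The sketch below uses the `websockets` Python library; the URL, message format and event types are assumptions for illustration, not Gladia's streaming protocol.

```python
import asyncio
import json

import websockets  # pip install websockets

# Hypothetical streaming endpoint and message format -- assumptions only.
WS_URL = "wss://api.example-transcription.com/v1/live?token=<api-key>"
CHUNK_SIZE = 3200  # roughly 100 ms of 16 kHz, 16-bit mono PCM


async def stream_file(path: str) -> None:
    async with websockets.connect(WS_URL) as ws:

        async def send_audio():
            with open(path, "rb") as f:
                while chunk := f.read(CHUNK_SIZE):
                    await ws.send(chunk)      # raw audio frames
                    await asyncio.sleep(0.1)  # pace roughly like a live mic
            await ws.send(json.dumps({"event": "stop"}))

        async def receive_transcripts():
            async for message in ws:
                event = json.loads(message)
                if event.get("type") == "partial":
                    print("[partial]", event["text"])  # low-latency interim result
                elif event.get("type") == "final":
                    print("[final]  ", event["text"])  # finalized utterance

        await asyncio.gather(send_audio(), receive_transcripts())


asyncio.run(stream_file("audio.raw"))
```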

AI calling agents aside, you could imagine a call center using those real-time capabilities to help human agents find relevant information in the middle of a call. “Our single API is compatible with all existing tech stacks and protocols, including SIP, VoIP, FreeSwitch and Asterisk,” co-founder and CTO Jonathan Soto (pictured above; left) said in a statement.

XAnge is leading the Series A funding round. Illuminate Financial, XTX Ventures, Athletico Ventures, Gaingels, Mana Ventures, Motier Ventures, Roosh Ventures and Soma Capital also participated.

Gladia believes we are on the brink of a “ChatGPT moment” for audio applications. GPT technology has been around for years, but ChatGPT really popularized LLMs with its consumer chat-like interface.

As Apple and Google start including transcription models in iOS and Android, consumers will start to understand the value of automated transcription within the apps they use. Developers will then likely integrate audio features into their products, and that’s where API providers like Gladia come in.
