Google is improving its AI-powered chatbot Gemini so that it can better understand the world around it — and the people conversing with it.
At the Google I/O 2024 developer conference on Tuesday, the company previewed a new experience in Gemini called Gemini Live, which lets users have “in-depth” voice chats with Gemini on their smartphones. Users can interrupt Gemini while the chatbot’s speaking to ask clarifying questions, and it’ll adapt to their speech patterns in real time. And Gemini can see and respond to users’ surroundings, either via photos or video captured by their smartphones’ cameras.
“With Live, Gemini can better understand you,” Sissie Hsiao, GM for Gemini experiences at Google, said during a press briefing. “It’s custom-tuned to be intuitive and have a back-and-forth, actual conversation with [the underlying AI] model.”
Gemini Live is in some ways the evolution of Google Lens, Google’s long-standing computer vision platform for analyzing images and videos, and Google Assistant, Google’s AI-powered, speech-generating and -recognizing virtual assistant across phones, smart speakers and TVs.
At first glance, Live doesn’t seem like a drastic upgrade over existing tech. But Google claims it taps newer techniques from the generative AI field to deliver superior, less error-prone image analysis — and combines these techniques with an enhanced speech engine for more consistent, emotionally expressive and realistic multi-turn dialogue.
“It’s a real-time voice interface and [has] extremely powerful multimodal capabilities combined with long context,” Oriol Vinyals, principal scientist at DeepMind, Google’s AI research division, told TechCrunch in an interview. “You could imagine how that combination will feel very powerful.”
The technical innovations driving Live stem in part from Project Astra, a new initiative within DeepMind to create AI-powered apps and “agents” for real-time, multimodal understanding.
“We’ve always wanted to build a universal agent that will be useful in everyday life,” Demis Hassabis, CEO of DeepMind, said during the briefing. “Imagine agents that can see and hear what we do, better understand the context we’re in and respond quickly in conversation, making the pace and quality of interactions feel much more natural.”
Gemini Live — which won’t launch until later this year — can answer questions about things within view (or recently within view) of a smartphone’s camera, like which neighborhood a user might be in or the name of a part on a broken bicycle. Pointed at a portion of computer code, Live can explain what that code does. Or, asked about where a pair of glasses might be, Live can say where it last “saw” the glasses.
Live is also designed to serve as a virtual coach of sorts, helping users rehearse for events, brainstorm ideas and so on. Live can suggest which skills to highlight in an upcoming job or internship interview, for instance, or give public speaking advice.
“Gemini Live can provide information more succinctly and answer more conversationally than, for example, if you’re interacting in just text,” Hsiao said. “We think that an AI assistant should be able to solve complex problems … and also feel very natural and fluid when you engage with it.”
Gemini Live’s ability to “remember” is made possible by the architecture of the model underpinning it: Gemini 1.5 Pro, the current flagship in Google’s Gemini family of generative AI models (with other “task-specific” generative models playing a lesser role). It has a longer-than-average context window, meaning it can take in and reason over a lot of data — about an hour of video (RIP, smartphone batteries) — before crafting a response.
“That’s hours of video that you could have interacting with the model, and it would remember all that has happened before,” Vinyals said.
Live is reminiscent of the generative AI behind Meta’s Ray-Ban glasses, which similarly can look at images captured by a camera and interpret them in near-real time. Judging from the pre-recorded demo reels Google showed during the briefing, it’s also quite similar — conspicuously so — to OpenAI’s recently revamped ChatGPT.
One key difference between the new ChatGPT and Gemini Live is that Gemini Live won’t be free. Once it launches, Live will be exclusive to Gemini Advanced, a more sophisticated version of Gemini that’s gated behind the Google One AI Premium Plan, priced at $20 per month.
Perhaps in a jab at Meta, one of Google’s demos showed a person wearing AR glasses equipped with a Gemini Live-like app. Google — doubtless keen to avoid another dud in the eyewear department — declined to say whether those glasses or any glasses powered by its generative AI would come to market in the near future.
Vinyals didn’t completely shut down the idea, though. “We’re still prototyping and, of course, showcasing [Astra and Gemini Live] to the world,” he said. “We’re seeing the reaction from folks that can try it, and that will inform where we go.”
Other Gemini updates
Beyond Live, Gemini is getting a range of upgrades to make it more useful day-to-day.
Gemini Advanced users in more than 150 countries and over 35 languages can take advantage of Gemini 1.5 Pro’s larger context window to have the chatbot analyze, summarize and answer questions about long documents (up to 1,500 pages). (While Live is arriving later in the year, Gemini Advanced users can interact with Gemini 1.5 Pro starting today.) Documents can now be imported from Google Drive or uploaded directly from a mobile device.
Later this year for Gemini Advanced users, the context window will grow even larger — to 2 million tokens — and bring with it support for uploading videos (up to two hours in length) to Gemini and having Gemini analyze big codebases (more than 30,000 lines of code).
Google claims that the large context window will improve Gemini’s image understanding. For example, given a photo of a fish dish, Gemini will be able to suggest a comparable recipe. Or, given a math problem, Gemini will provide step-by-step instructions on how to solve it.
And it’ll help Gemini plan trips.
In the coming months, Gemini Advanced will gain a new “planning experience” that creates custom travel itineraries from prompts. Taking into account things like flight times (from emails in a user’s Gmail inbox), meal preferences and information about local attractions (from Google Search and Maps data), as well as the distances between those attractions, Gemini will generate an itinerary that updates automatically to reflect any changes.
In the more immediate future, Gemini Advanced users will be able to create Gems, custom chatbots powered by Google’s Gemini models. Along the lines of OpenAI’s GPTs, Gems can be generated from natural language descriptions — for example, “You’re my running coach. Give me a daily running plan” — and shared with others or kept private. No word on whether Google plans to launch a storefront for Gems like OpenAI’s GPT Store; hopefully we’ll learn more as I/O goes on.
Soon, Gems and Gemini proper will be able to tap an expanded set of integrations with Google services, including Google Calendar, Tasks, Keep and YouTube Music, to complete various labor-saving tasks.
“Let’s say you have a flier from your kid’s school, and there’s all these events that you want to add to your personal calendar,” Hsiao said. “You’ll be able to take a picture of this flier and ask the Gemini app to create these calendar entries directly onto your calendar. This is going to be a great time saver.”
Given generative AI’s tendency to get summaries wrong and generally go off the rails (plus Gemini’s not-so-glowing early reviews), take Google’s claims with a grain of salt. But if the improved Gemini and Gemini Advanced actually perform as Hsiao describes — and that’s a big if — they could be great time savers indeed.