Google is improving its AI-powered chatbot Gemini so that it can better understand the world around it — and the people conversing with it.
At the Google I/O 2024 developer conference on Tuesday, the company previewed a new experience in Gemini called Gemini Live, which lets users have “in-depth” voice chats with Gemini on their smartphones. Users can interrupt Gemini while the chatbot’s speaking to ask clarifying questions, and it’ll adapt to their speech patterns in real time. And Gemini can see and respond to users’ surroundings, either via photos or video captured by their smartphones’ cameras.
“With Live, Gemini can better understand you,” Sissie Hsiao, GM for Gemini experiences at Google, said during a press briefing. “It’s custom-tuned to be intuitive and have a back-and-forth, actual conversation with [the underlying AI] model.”
Gemini Live is in some ways the evolution of Google Lens, Google’s long-standing computer vision platform for analyzing images and videos, and Google Assistant, Google’s AI-powered, speech-generating and -recognizing virtual assistant across phones, smart speakers and TVs.
At first glance, Live doesn’t seem like a drastic upgrade over existing tech. But Google claims it taps newer techniques from the generative AI field to deliver superior, less error-prone image analysis — and combines these techniques with an enhanced speech engine for more consistent, emotionally expressive and realistic multi-turn dialogue.
“It’s a real-time voice interface and [has] extremely powerful multimodal capabilities combined with long context,” Oriol Vinyals, principal scientist at DeepMind, Google’s AI research division, told TechCrunch in an interview. “You could imagine how that combination will feel very powerful.”
The technical innovations driving Live stem in part from Project Astra, a new initiative within DeepMind to create AI-powered apps and “agents” for real-time, multimodal understanding.
“We’ve always wanted to build a universal agent that will be useful in everyday life,” Demis Hassabis, CEO of DeepMind, said during the briefing. “Imagine agents that can see and hear what we do, better understand the context we’re in and respond quickly in conversation, making the pace and quality of interactions feel much more natural.”
Gemini Live — which won’t launch until later this year — can answer questions about things within view (or recently within view) of a smartphone’s camera, like which neighborhood a user might be in or the name of a part on a broken bicycle. Pointed at a portion of computer code, Live can explain what that code does. Or, asked about where a pair of glasses might be, Live can say where it last “saw” the glasses.
Live is also designed to serve as a virtual coach of sorts, helping users rehearse for events, brainstorm ideas and so on. Live can suggest which skills to highlight in an upcoming job or internship interview, for instance, or give public speaking advice.
“Gemini Live can provide information more succinctly and answer more conversationally than, for example, if you’re interacting in just text,” Hsiao said. “We think that an AI assistant should be able to solve complex problems … and also feel very natural and fluid when you engage with it.”
Gemini Live’s ability to “remember” is made possible by the architecture of the model underpinning it: Gemini 1.5 Pro, the current flagship in Google’s Gemini family of generative AI models (with other “task-specific” generative models playing a lesser role). It has a longer-than-average context window, meaning it can take in and reason over a lot of data — about an hour of video (RIP, smartphone batteries) — before crafting a response.
“That’s hours of video that you could have interacting with the model, and it would remember all that has happened before,” Vinyals said.
Live is reminiscent of the generative AI behind Meta’s Ray-Ban glasses, which similarly can look at images captured by a camera and interpret them in near-real time. Judging from the pre-recorded demo reels Google showed during the briefing, it’s also quite similar — conspicuously so — to OpenAI’s recently revamped ChatGPT.
One key difference between the new ChatGPT and Gemini Live is that Gemini Live won’t be free. Once it launches, Live will be exclusive to Gemini Advanced, a more sophisticated version of Gemini that’s gated behind the Google One AI Premium Plan, priced at $20 per month.
Perhaps in a jab at Meta, one of Google’s demos showed a person wearing AR glasses equipped with a Gemini Live-like app. Google — doubtless keen to avoid another dud in the eyewear department — declined to say whether those glasses or any glasses powered by its generative AI would come to market in the near future.
Vinyals didn’t completely shut down the idea, though. “We’re still prototyping and, of course, showcasing [Astra and Gemini Live] to the world,” he said. “We’re seeing the reaction from folks that can try it, and that will inform where we go.”
Other Gemini updates
Beyond Live, Gemini is getting a range of upgrades to make it more useful day-to-day.
Gemini Advanced users in more than 150 countries and over 35 languages can take advantage of Gemini 1.5 Pro’s larger context window to have the chatbot analyze, summarize and answer questions about long documents (up to 1,500 pages). (While Live is arriving later in the year, Gemini Advanced users can interact with Gemini 1.5 Pro starting today.) Documents can now be imported from Google Drive or uploaded directly from a mobile device.
Later this year for Gemini Advanced users, the context window will grow even larger — to 2 million tokens — and bring with it support for uploading videos (up to two hours in length) to Gemini and having Gemini analyze big codebases (more than 30,000 lines of code).
Google claims that the large context window will improve Gemini’s image understanding. For example, given a photo of a fish dish, Gemini will be able to suggest a comparable recipe. Or, given a math problem, Gemini will provide step-by-step instructions on how to solve it.
And it’ll help Gemini plan trips.
In the coming months, Gemini Advanced will gain a new “planning experience” that creates custom travel itineraries from prompts. Taking into account things like flight times (from emails in a user’s Gmail inbox), meal preferences and information about local attractions (from Google Search and Maps data), as well as the distances between those attractions, Gemini will generate an itinerary that updates automatically to reflect any changes.
In the more immediate future, Gemini Advanced users will be able to create Gems, custom chatbots powered by Google’s Gemini models. Along the lines of OpenAI’s GPTs, Gems can be generated from natural language descriptions — for example, “You’re my running coach. Give me a daily running plan” — and shared with others or kept private. No word on whether Google plans to launch a storefront for Gems like OpenAI’s GPT Store; hopefully we’ll learn more as I/O goes on.
Soon, Gems and Gemini proper will be able to tap an expanded set of integrations with Google services, including Google Calendar, Tasks, Keep and YouTube Music, to complete various labor-saving tasks.
“Let’s say you have a flier from your kid’s school, and there’s all these events that you want to add to your personal calendar,” Hsiao said. “You’ll be able to take a picture of this flier and ask the Gemini app to create these calendar entries directly onto your calendar. This is going to be a great time saver.”
Given generative AI’s tendency to get summaries wrong and generally go off the rails (plus Gemini’s not-so-glowing early reviews), take Google’s claims with a grain of salt. But if the improved Gemini and Gemini Advanced actually perform as Hsiao describes — and that’s a big if — they could be great time savers indeed.