Gemini breaks new ground with a faster model, longer context, AI agents and more

In December, we introduced our first natively multimodal model, Gemini 1.0, available in three sizes: Ultra, Pro, and Nano. Shortly after, we launched 1.5 Pro, which featured improved performance and an innovative long context window capable of handling 1 million tokens.

Developers and enterprise clients have been utilizing 1.5 Pro in remarkable ways, finding its long context window, multimodal reasoning abilities, and overall performance to be incredibly valuable.

Feedback from our users indicated that some applications require lower latency and reduced costs. This feedback drove us to continue innovating, leading to the introduction of Gemini 1.5 Flash: a lightweight model designed for fast, efficient deployment at scale.

Both 1.5 Pro and 1.5 Flash are now available in public preview, each supporting a 1 million token context window via Google AI Studio and Vertex AI. Additionally, 1.5 Pro is now accessible with a 2 million token context window for developers on the waitlist and Google Cloud customers.

We are also unveiling enhancements across the Gemini model family, announcing the next generation of open models, Gemma 2, and sharing our progress on the future of AI assistants with Project Astra.

Updates to the Gemini Model Family

The New 1.5 Flash: Optimized for Speed and Efficiency

1.5 Flash is the latest member of the Gemini model family and the fastest model served through the API. It’s designed for high-volume, high-frequency tasks, offering a cost-efficient solution while maintaining our breakthrough long context window.
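
To make this concrete, here is a minimal sketch of calling 1.5 Flash through the Gemini API's Python SDK (the `google-generativeai` package); the model identifier and prompt are illustrative, and an API key from Google AI Studio is assumed:

```python
import google.generativeai as genai

# Authenticate with an API key created in Google AI Studio.
genai.configure(api_key="YOUR_API_KEY")

# Select the lightweight Flash model (the identifier shown is illustrative).
model = genai.GenerativeModel("gemini-1.5-flash-latest")

# A typical high-volume, high-frequency task: quick summarization.
response = model.generate_content(
    "Summarize in one sentence: Gemini 1.5 Flash is a lightweight, "
    "cost-efficient model that keeps the 1 million token context window."
)
print(response.text)
```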

Although 1.5 Flash is lighter than 1.5 Pro, it’s highly capable of multimodal reasoning across vast amounts of information, delivering impressive quality for its size.

1.5 Flash is particularly effective in summarization, chat applications, image and video captioning, and data extraction from lengthy documents and tables. This is due to its training via a process known as “distillation,” where essential knowledge and skills from a larger model are transferred to a smaller, more efficient model.
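
Google hasn’t published Flash’s training details, but distillation as a general technique is well understood. The following is a minimal PyTorch sketch of the core idea, not Google’s recipe: a small “student” model is trained to match the softened output distribution of a larger “teacher”:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label loss: push the student toward the teacher's distribution.

    Illustrative sketch only -- not Google's actual training setup.
    """
    # Soften both distributions with a temperature, then measure the
    # KL divergence from the student to the teacher.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher,
                    reduction="batchmean") * temperature ** 2

# Toy usage: a batch of 4 examples over a 10-way output space.
teacher_logits = torch.randn(4, 10)                       # frozen teacher
student_logits = torch.randn(4, 10, requires_grad=True)   # trainable student
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```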

To learn more about 1.5 Flash, explore our updated Gemini 1.5 technical report on the Gemini technology page, where you can also find information about its availability and pricing.

Significant Improvements to 1.5 Pro

Over the past few months, we have made substantial enhancements to 1.5 Pro, our top-performing model for a broad range of tasks.

Beyond extending its context window to 2 million tokens, we’ve upgraded its abilities in code generation, logical reasoning, planning, multi-turn conversations, and understanding audio and images through advancements in data and algorithms. These improvements are reflected in strong results on public and internal benchmarks.

1.5 Pro now follows increasingly complex and nuanced instructions, including those that specify product-level behavior related to role, format, and style. We’ve also enhanced control over the model’s responses for specific use cases, such as customizing the persona and response style of a chat agent or automating workflows with multiple function calls. Users can now guide model behavior by setting system instructions.
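
As a sketch of what system instructions look like in practice, using the Gemini API's Python SDK (the persona and prompt below are hypothetical):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# A system instruction fixes product-level behavior (role, format,
# and style) across every turn of the conversation.
model = genai.GenerativeModel(
    "gemini-1.5-pro-latest",
    system_instruction=(
        "You are a support agent for a hypothetical photo-editing app. "
        "Answer in two sentences or fewer, in a friendly tone."
    ),
)

chat = model.start_chat()
print(chat.send_message("How do I crop an image?").text)
```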

We’ve introduced audio understanding in the Gemini API and Google AI Studio, enabling 1.5 Pro to reason over both the visual and audio tracks of videos uploaded to Google AI Studio. We are also integrating 1.5 Pro into various Google products, including Gemini Advanced and Workspace apps.
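
A minimal sketch of that audio understanding through the same SDK, using the File API upload flow (the file name is a placeholder):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro-latest")

# Upload a local recording via the File API, then reason over it
# alongside a text prompt ("interview.mp3" is a placeholder path).
audio_file = genai.upload_file(path="interview.mp3")
response = model.generate_content(
    [audio_file, "List the main topics discussed in this recording."]
)
print(response.text)
```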

For further details on 1.5 Pro, check out our updated Gemini 1.5 technical report and the Gemini technology page.

Gemini Nano: Expanding Multimodal Capabilities

Gemini Nano is moving beyond text-only inputs to understand images as well. Starting with Pixel, applications using Gemini Nano with Multimodality will be able to perceive the world the way people do: not just through text, but also through sight, sound, and spoken language.

Read more about Gemini 1.0 Nano on Android.

Next Generation of Open Models

Today, we’re excited to share updates to Gemma, our open model family built from the same research and technology that created the Gemini models.

We’re introducing Gemma 2, our next generation of open models focused on responsible AI innovation. Gemma 2 features a new architecture designed for exceptional performance and efficiency and will be available in various sizes.

The Gemma family is also expanding with PaliGemma, our first vision-language model inspired by PaLI-3. Additionally, we’ve enhanced our Responsible Generative AI Toolkit with the LLM Comparator, a tool for evaluating model response quality.

Read more on the Developer blog.

Progress in Developing Universal AI Agents

As part of Google DeepMind’s mission to build AI responsibly to benefit humanity, we’ve always aimed to develop universal AI agents that can assist in everyday life. Today, we’re sharing our progress in building the future of AI assistants with Project Astra (advanced seeing and talking responsive agent).

To be truly useful, an agent must understand and respond to the complex and dynamic world as people do — absorbing and remembering what it sees and hears to grasp context and take action. It must also be proactive, teachable, and personal, allowing users to communicate naturally without lag or delay.

While we’ve made significant strides in developing AI systems that can comprehend multimodal information, achieving real-time conversational response remains a challenging engineering task. Over the past few years, we’ve focused on improving how our models perceive, reason, and converse to ensure that interactions feel natural and fluid.

Building on Gemini

Leveraging the advancements of Gemini, we’ve developed prototype agents that can process information faster by continuously encoding video frames, integrating video and speech inputs into an event timeline, and caching this information for efficient recall.
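
Google hasn’t shared Astra’s implementation, so purely as an illustration of the idea described above, an event timeline that interleaves encoded video frames and speech and caches them for quick recall might be organized like this (every name here is hypothetical):

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Event:
    timestamp: float   # seconds since the session started
    kind: str          # "frame" or "speech"
    embedding: list    # encoded representation of the frame or utterance

@dataclass
class EventTimeline:
    """Hypothetical sketch: a bounded cache of interleaved video/speech events."""
    max_events: int = 1000
    events: deque = field(default_factory=deque)

    def add(self, event: Event) -> None:
        # Continuously fold incoming frame/speech encodings into one ordered
        # stream, evicting the oldest events so recall stays cheap.
        self.events.append(event)
        if len(self.events) > self.max_events:
            self.events.popleft()

    def recall(self, since: float) -> list:
        # Efficient recall: everything observed after a given moment.
        return [e for e in self.events if e.timestamp >= since]
```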

By utilizing our advanced speech models, we’ve enhanced the agents’ vocal capabilities, enabling them to produce a wider range of intonations. These agents can better understand the context in which they are used and respond swiftly during conversations.

With these technological advancements, it’s easy to envision a future where people have access to an expert AI assistant through devices like phones or glasses. Some of these capabilities will be integrated into Google products, like the Gemini app and web experience, later this year.

Continued Exploration

We’ve made incredible progress so far with our family of Gemini models, and we’re always striving to advance the state of the art even further. By investing in a relentless pipeline of innovation, we can explore new ideas at the frontier while unlocking new and exciting Gemini use cases.