AI2's Molmo shows open source can meet, and beat, closed multimodal models

The common wisdom is that companies like Google, OpenAI, and Anthropic, with bottomless cash reserves and hundreds of top-tier researchers, are the only ones that can make state-of-the-art foundation models. But as one among them famously noted, they “have no moat,” and AI2 showed that today with the release of Molmo, a multimodal AI model that matches their best while also being small, free, and truly open source.

To be clear, Molmo (multimodal open language model) is a visual understanding engine, not a full-service chatbot like ChatGPT. It doesn’t have an API, it’s not ready for enterprise integration, and it doesn’t search the web for you or for its own purposes. You can think of it as the part of those models that sees an image, understands it, and can describe or answer questions about it.

Molmo (coming in 72B, 7B, and 1B-parameter variants), like other multimodal models, is capable of identifying and answering questions about almost any everyday situation or object. How do you work this coffee maker? How many dogs in this picture have their tongues out? Which options on this menu are vegan? What are the variables in this diagram? It’s the kind of visual understanding task we’ve seen demonstrated with varying levels of success and latency for years.

What’s different is not necessarily Molmo’s capabilities (which you can see in the demo below, or test here), but how it achieves them.