Hugging Face claims its new AI models are the smallest of their kind



A team at AI dev platform Hugging Face has released what they’re claiming are the smallest AI models that can analyze images, short videos, and text.

The models, SmolVLM-256M and SmolVLM-500M, are designed to work well on “constrained devices” like laptops with less than about 1GB of RAM. The team says that they’re also ideal for developers trying to process large amounts of data very cheaply.

SmolVLM-256M and SmolVLM-500M have just 256 million and 500 million parameters, respectively. (Parameters roughly correspond to a model’s problem-solving abilities, such as its performance on math tests.) Both models can perform tasks like describing images or video clips and answering questions about PDFs and the elements within them, including scanned text and charts.
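For a rough sense of why models of this size can fit on constrained devices, the sketch below estimates how much memory the weights alone would occupy at a few common numeric precisions. It uses the nominal parameter counts from the model names. The precisions listed are illustrative assumptions, not settings confirmed by Hugging Face, and the estimate ignores activations, image inputs, and runtime overhead, so real memory use will be higher.

```python
# Back-of-the-envelope estimate of weight memory for the two models.
# Assumptions (not from Hugging Face): nominal parameter counts taken from
# the model names, and a hypothetical set of storage precisions.

MODELS = {
    "SmolVLM-256M": 256_000_000,
    "SmolVLM-500M": 500_000_000,
}

# Bytes needed to store one parameter at each (assumed) precision.
PRECISIONS = {
    "float32": 4,
    "bfloat16": 2,
    "int8": 1,
}


def weight_memory_gb(num_params: int, bytes_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


def footprint_table() -> dict:
    """Map each model name to its estimated weight memory per precision."""
    return {
        name: {
            precision: weight_memory_gb(params, size)
            for precision, size in PRECISIONS.items()
        }
        for name, params in MODELS.items()
    }


if __name__ == "__main__":
    for name, row in footprint_table().items():
        for precision, gigabytes in row.items():
            print(f"{name} @ {precision}: ~{gigabytes:.2f} GB for weights")
```

Under these assumptions, storing the weights at two bytes per parameter takes roughly half a gigabyte for the smaller model and about one gigabyte for the larger one, which is in line with the team’s framing around devices with about 1GB of RAM.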

To train SmolVLM-256M and SmolVLM-500M, the Hugging Face team used The Cauldron, a collection of 50 “high-quality” image and text datasets, and Docmatix, a set of file scans paired with detailed captions. Both were created by Hugging Face’s M4 team, which develops multimodal AI technologies.

Benchmarks comparing the new SmolVLM models to other multimodal models. Image Credits: SmolVLM

The team claims that both SmolVLM-256M and SmolVLM-500M outperform a much larger model, Idefics 80B, on benchmarks including AI2D, which tests the ability of models to analyze grade-school-level science diagrams. SmolVLM-256M and SmolVLM-500M are available on the web as well as for download from Hugging Face under an Apache 2.0 license, a permissive license that allows commercial use, modification, and redistribution as long as the license and copyright notices are retained.

Small models like SmolVLM-256M and SmolVLM-500M may be inexpensive and versatile, but they can also contain flaws that aren’t as pronounced in larger models. A recent study from Google DeepMind, Microsoft Research, and the Mila research institute in Quebec found that many small models perform worse than expected on complex reasoning tasks. The researchers speculated that this could be because smaller models recognize surface-level patterns in data, but struggle to apply that knowledge in new contexts.



