How many penguins are in this wildlife video? Can you track the orange ball in the cat video? Which teams are playing, and who scored? Can you give me step-by-step instructions from this cooking video?
Molmo 2, an open-source AI vision model developed by the Allen Institute for AI (Ai2), can answer questions like these. The model analyzes short video clips, tracks specific objects, and describes the events captured on screen. Molmo 2 outperforms other open-source models in short video analysis and tracking, and it rivals closed systems such as Google’s Gemini 3 in video tracking.
In a recent demonstration at the Ai2 offices in Seattle, researchers showed Molmo 2 analyzing a diverse range of video content:
- In a soccer clip, Molmo 2 identified a defensive mistake that led to a goal.
- In a baseball video, the model recognized the teams and players and explained its reasoning based on uniforms and stadium branding.
- When presented with a cooking video, Molmo 2 generated a structured recipe with ingredients and detailed instructions.
- The model accurately counted and identified each flip performed by a dancer in a video.
- In a tracking demo, Molmo 2 followed four penguins as they moved around the frame and maintained a consistent ID for each one.
- For a racing clip, the model tracked specific cars and identified the correct vehicle based on the query.
Big year for Ai2
The release of Molmo 2 marks a significant milestone for Ai2, a nonprofit organization founded by the late Microsoft co-founder Paul Allen. Ai2 has received substantial funding, partnered with leading institutions on AI research initiatives, and developed open models for text, images, and now video.
Ai2 focuses on advancing AI research and making its innovations freely available, with the aim of collaborating with the community to drive progress in the field.
As part of its commitment to open AI systems, Ai2 has also introduced Bolmo, an experimental text model that processes language at the character level.
Expanding into video analysis
Molmo 2 is Ai2’s foray into video analysis, a tool for understanding and interpreting video content. Unlike closed systems, the open model lets developers customize and optimize it for specific applications.
Molmo 2 relies on high-quality human annotations and a smaller training dataset, which lets it perform well while remaining efficient and accessible.
As Ai2 continues to expand its AI capabilities, Molmo 2 gives it a foundation for future work in video understanding.
A work in progress
Molmo 2 also has limitations that the Ai2 team is working to address. Its tracking is currently optimized for a limited number of objects, and it has room to improve in crowded scenes and long-form video.
Despite these challenges, Ai2 remains dedicated to pushing the boundaries of AI research and developing open models that empower developers and researchers worldwide.