Summary:
1. Cohere, a Canadian AI company, has introduced Command A Vision, a visual model tailored for enterprise use cases.
2. The model is designed to extract insights from visual data, such as diagrams, charts, and scanned documents, to aid in decision-making.
3. Command A Vision outperformed comparable models in benchmark tests, demonstrating its strength at analyzing unstructured data for businesses.
Article:
In the realm of AI-powered analysis and Deep Research features, demand is growing for models and services that simplify document processing for businesses. Cohere, a Canadian AI company, has responded by unveiling Command A Vision, a vision-language model crafted specifically for enterprise applications. Built on the foundation of the company's Command A model, the new model has 112 billion parameters and aims to unlock valuable insights from visual data, enabling businesses to make data-driven decisions through document optical character recognition (OCR) and image analysis.
Command A Vision is designed to tackle demanding enterprise vision tasks, from interpreting complex product manuals with intricate diagrams to analyzing real-world photographs for risk detection. It can read and analyze the visual data types enterprises use most, including graphs, charts, diagrams, scanned documents, and PDFs, making it a versatile tool for businesses.
One of the key advantages of Command A Vision is its efficiency in processing visual data: like its text-only counterpart, it requires only two or fewer GPUs. The model also retains the text capabilities of Command A, so it can read text within images and understand at least 23 languages. Cohere emphasizes that Command A Vision not only reduces the total cost of ownership for enterprises but is also fully optimized for retrieval use cases, making it a valuable asset for businesses seeking to streamline their operations.
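To make the multimodal workflow concrete, here is a minimal sketch of how an image and an OCR-style prompt might be combined into a single chat-style request. The endpoint shape, message format, and model name are illustrative assumptions in an OpenAI-style layout, not Cohere's documented API; consult Cohere's own reference before sending real requests.

```python
import base64
import json

def build_vision_request(image_bytes: bytes, question: str,
                         model: str = "command-a-vision") -> str:
    """Build a JSON chat payload pairing one image with a text prompt.

    The payload shape and model name are hypothetical, shown only to
    illustrate how OCR/image-analysis prompts are typically structured.
    """
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                # Image supplied inline as a base64 data URL.
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                # The enterprise task, e.g. extracting fields from a scan.
                {"type": "text", "text": question},
            ],
        }],
    }
    return json.dumps(payload)

req = build_vision_request(b"\x89PNG...",
                           "Extract all line items from this scanned invoice.")
```

The actual HTTP call is omitted; the point is that a scanned document and its extraction instruction travel together in one message, which is what makes the retrieval and OCR use cases described above a single round trip.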
Cohere's approach to architecting Command A models, including the visual model, follows an LLaVA-style architecture that transforms visual features into soft vision tokens; images are divided into tiles, and the tokens for each tile are fed into the Command A text tower, a dense, 111-billion-parameter textual LLM, so a single image can consume up to 3,328 tokens. Training the visual model proceeds in three stages: vision-language alignment, in which the model learns to map image-encoder features into the language model's embedding space; supervised fine-tuning (SFT); and post-training reinforcement learning with human feedback (RLHF).
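The 3,328-token figure can be illustrated with some back-of-envelope arithmetic. The per-tile token count and maximum tile count below are assumptions borrowed from common LLaVA-style tiling schemes, not numbers Cohere has published; they are chosen only because they reproduce the stated cap.

```python
# Assumptions (not from the article): each tile encodes to 256 soft vision
# tokens, and an image is split into at most 12 tiles plus one global
# whole-image tile, as in common LLaVA-style tiling schemes.
TOKENS_PER_TILE = 256
MAX_TILES = 12

def image_token_budget(n_tiles: int) -> int:
    """Soft vision tokens consumed by an image split into n_tiles tiles,
    plus one global (whole-image) tile."""
    n_tiles = min(n_tiles, MAX_TILES)  # tiling is capped
    return (n_tiles + 1) * TOKENS_PER_TILE

print(image_token_budget(12))  # prints 3328
```

Under these assumed constants, a fully tiled image hits exactly the 3,328-token maximum the article reports, while a small image that needs only one tile costs a fraction of that budget.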
In benchmark tests, Command A Vision surpassed other models with similar visual capabilities, outscoring competitors such as OpenAI's GPT-4.1, Meta's Llama 4 Maverick, and Mistral's Pixtral Large and Mistral Medium 3 on benchmarks including ChartQA, OCRBench, AI2D, and TextVQA. With an average score of 83.1%, Command A Vision outperformed its counterparts, underscoring its strength at extracting information from the graphical documents enterprises commonly use.
As the importance of Deep Research continues to grow, the need for models capable of analyzing unstructured data becomes more pronounced. Cohere's Command A Vision offers a solution tailored to the needs of businesses, providing an open-weights release for enterprises looking to transition away from closed or proprietary models. With developer interest already piqued, Command A Vision stands as a promising tool for enterprises seeking to enhance their data analysis capabilities and streamline their workflows.