Nous Research, an artificial intelligence startup prominent in the open-source AI movement, quietly released Hermes 4, a new family of large language models that it claims rivals top proprietary systems in performance while giving users unusually broad control and minimal content restrictions.
The launch signifies a significant advancement in the ongoing debate between open-source AI supporters and tech giants over the control of advanced AI capabilities. Unlike models from industry leaders like OpenAI, Google, or Anthropic, Hermes 4 is designed to handle almost any request without the typical safety precautions found in commercial AI systems.
“Hermes 4 builds on our legacy of user-aligned models with expanded test-time compute capabilities,” Nous Research announced on X (formerly Twitter). “Special attention was given to making the models creative and interesting to interact with, unencumbered by censorship, and neutrally aligned while maintaining state of the art level math, coding, and reasoning performance for open weight models.”
How Hermes 4’s ‘hybrid reasoning’ mode outperforms ChatGPT and Claude on math benchmarks
Hermes 4 introduces what Nous Research calls “hybrid reasoning,” allowing users to toggle between fast responses and deeper, step-by-step thinking processes. When activated, the models generate their internal reasoning within special <think> tags before providing a final answer — similar to OpenAI’s o1 reasoning models but with full transparency into the AI’s thought process.
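In practice, the transparency comes from the reasoning trace being plain text in the model's output: a client can split the completion into the hidden reasoning and the final answer. Here is a minimal sketch, assuming only that the trace is wrapped in `<think>...</think>` tags as described above; the helper function itself is illustrative, not part of any Nous API.

```python
import re

def split_reasoning(completion):
    """Split a Hermes-style completion into (reasoning trace, final answer)."""
    match = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    if match is None:
        # Fast mode: no reasoning trace, the whole output is the answer.
        return "", completion.strip()
    reasoning = match.group(1).strip()
    answer = completion[match.end():].strip()  # everything after the trace
    return reasoning, answer

sample = "<think>2+2 is 4, doubled is 8.</think>The answer is 8."
reasoning, answer = split_reasoning(sample)
```

With reasoning mode off, the same parser simply returns the full output as the answer, so downstream code does not need to know which mode was used.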
The technical accomplishment is significant: in testing, Hermes 4's largest model, at 405 billion parameters, scored 96.3% on the MATH-500 benchmark in reasoning mode and 81.9% on the challenging AIME'24 mathematics competition, matching or surpassing many expensive proprietary systems.
“The challenge is making thinking traces useful and verifiable without runaway reasoning,” noted AI researcher Rohan Paul on X, highlighting one of the technical breakthroughs in the release.
Notably, Hermes 4 excelled on "RefusalBench," a test created by Nous Research to measure how frequently AI systems decline to answer questions, where a higher score indicates fewer refusals. In reasoning mode, the model scored 57.1%, significantly outperforming GPT-4o (17.67%) and Claude Sonnet 4 (17%).
Inside DataForge and Atropos: The breakthrough training systems behind Hermes 4’s capabilities
Behind the impressive capabilities of Hermes 4 lies a sophisticated training infrastructure developed by Nous Research over several years. The models were trained using two innovative systems: DataForge, a graph-based synthetic data generator, and Atropos, an open-source reinforcement learning framework.
DataForge generates training data through “random walks” on directed graphs, converting simple pre-training data into complex instruction-following examples. For example, it can turn a Wikipedia article into a rap song and then create questions and answers based on that transformation.
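The random-walk idea can be sketched in a few lines: treat each node in a directed graph as a text transformation, and let a walk from the source node compose transformations until it reaches a terminal node. The graph, node names, and transforms below are toy stand-ins to show the mechanism, not DataForge's actual implementation.

```python
import random

# Toy directed graph: each node applies a text transformation, and edges
# define which transformation may follow which. Names and transforms are
# illustrative only, not DataForge's real graph.
TRANSFORMS = {
    "source":    lambda t: t,
    "summarize": lambda t: "Summary: " + t,
    "as_song":   lambda t: "[Verse] " + t,
    "make_qa":   lambda t: "Q: What does this describe?\nA: " + t,
}
EDGES = {
    "source":    ["summarize", "as_song"],
    "summarize": ["make_qa"],
    "as_song":   ["make_qa"],
    "make_qa":   [],  # terminal node: a finished instruction-following example
}

def random_walk(text, seed=None):
    """Walk the graph from 'source', applying each visited node's transform."""
    rng = random.Random(seed)
    node = "source"
    while EDGES[node]:
        node = rng.choice(EDGES[node])
        text = TRANSFORMS[node](text)
    return text

example = random_walk("A Wikipedia paragraph on photosynthesis.", seed=0)
```

Because every walk ends at a question-answer node, plain pre-training text comes out the other end as an instruction-following example, which is the transformation the article describes.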
Atropos functions as multiple specialized training environments where AI models practice specific skills such as mathematics, coding, tool use, and creative writing, receiving feedback only upon producing correct solutions. This approach ensures that only high-quality responses are included in the training data through “rejection sampling.”
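The rejection-sampling step amounts to generating several candidate solutions per problem and keeping only those that a verifier accepts. A minimal sketch, using a deliberately noisy stand-in solver and an exact checker; the generator and verifier here are hypothetical, not Atropos's actual interfaces.

```python
import random

def rejection_sample(problem, generate, verify, n_candidates=8):
    """Generate several candidate answers, keep only those the verifier accepts."""
    candidates = [generate(problem) for _ in range(n_candidates)]
    return [c for c in candidates if verify(problem, c)]

# Stand-in "math environment": a solver that is sometimes wrong,
# and a verifier that checks the answer exactly.
def noisy_solver(problem):
    a, b = problem
    return a + b + random.choice([0, 0, 0, 1])  # wrong roughly 25% of the time

def check(problem, answer):
    a, b = problem
    return answer == a + b

kept = rejection_sample((3, 4), noisy_solver, check)
# Every surviving candidate equals 7; wrong attempts never reach the dataset.
```

Only verified-correct responses survive into the training set, which is how the feedback-on-correct-solutions loop described above filters the data.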
“Nous used these environments to generate the dataset for Hermes 4!” explained Tommy Shaughnessy, a venture capitalist at Delphi Ventures who has invested in Nous Research. “All in the dataset contains 3.5 million reasoning samples and 1.6 million non-reasoning samples! Hermes was trained on RL data, not just static datasets of question and answer!”
The training process required 192 Nvidia B200 GPUs and 71,616 GPU hours for the largest model — a significant but not unprecedented computational investment that demonstrates how specialized techniques can compete with the massive scale of tech giants.
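For a sense of scale, the quoted figures imply a fairly short wall-clock run, assuming all 192 GPUs ran concurrently for the whole job (an assumption, since the article does not state the schedule):

```python
gpu_count = 192
gpu_hours = 71_616

# Back-of-envelope: if all GPUs run the whole time, wall-clock time is
# total GPU hours divided by GPU count.
wall_clock_hours = gpu_hours / gpu_count  # 373.0 hours
wall_clock_days = wall_clock_hours / 24   # roughly 15.5 days
```

Roughly two weeks on 192 B200s: substantial, but far from the months-long runs associated with frontier-scale pre-training, which is the contrast the article draws.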