For decades, the compute industry has relied on Moore’s Law, and successfully so. The principle that the number of transistors on a chip doubles roughly every two years has been the bedrock of the digital age.
However, the era of Moore’s Law is ending just as demand for compute is growing faster than ever. Transistor scaling is approaching its physical limits at the nanoscale, while the advent of generative AI is driving multibillion-parameter models and training clusters that require hundreds of thousands, or even millions, of chips for a single model.
The underlying battleground of compute is changing: rather than finding new ways to wring performance out of a single chip, the industry must fundamentally rethink how hundreds, thousands, and even millions of chips work together at the system and rack level.
Amdahl’s Law for AI Scale Success
At rack scale, Amdahl’s Law reminds us that a workload’s overall speedup is capped by whatever fraction of it cannot be parallelized, and in large clusters that fraction is dominated by communication, synchronization, and data movement between chips. Even the most advanced GPUs cannot deliver their theoretical performance without addressing these system-level challenges: interconnects must shuffle data between chips at blistering speeds, cooling systems must extract tens of kilowatts of heat per rack, and power delivery architectures must reliably feed thousands of processors running at near-constant peak load.
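As a rough illustration (the 5 percent serialized fraction below is an assumed figure, not a measurement of any real workload), a few lines of Python show how quickly the non-parallelizable share of a training step caps the return on adding more GPUs:

```python
def amdahl_speedup(serial_fraction: float, n_processors: int) -> float:
    """Maximum speedup when only (1 - serial_fraction) of the work parallelizes."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

# Assumed example: 5% of each training step cannot be parallelized
# (communication, synchronization, data movement between chips).
for n in (8, 1_000, 100_000):
    print(f"{n:>7,} GPUs -> {amdahl_speedup(0.05, n):5.1f}x speedup (hard cap: 20x)")
```

Under that assumption, 100,000 GPUs deliver barely more speedup than 1,000; only shrinking the serialized fraction itself, which is exactly what better interconnects, cooling, and power delivery do, moves the ceiling.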
We can draw lessons from the past. During the mainframe and minicomputer eras, processor improvements alone were initially enough to deliver performance gains. As workloads ballooned in complexity, however, differentiation came from the shift to systems-level orchestration.
The answer was client-server architectures and virtualization, which ultimately led to what we now know as cloud computing. In the AI era, this pattern is repeating: true efficiency and performance gains will emerge only when every component of a rack-scale system is co-optimized. This is more than a technical nuance; it is a radical inflection point in how computing infrastructure is built, scaled, and monetized.
Leading industry incumbents have already recognized this shift. Nvidia has acquired Mellanox, Cumulus Networks, and Augtera, with Enfabrica rumored to be next. The company is building a formidable networking stack to complement its GPUs and deliver holistic rack-level solutions. More recently, AMD acquired ZT Systems, a rack-level infrastructure and data center systems provider, to internalize systems design expertise critical for AI.
Where Startups Fit In
Despite heavyweight players working to consolidate and vertically integrate at the rack scale, several unique gaps remain that hyperscalers and chip incumbents cannot – or likely will not – address alone. These gaps are ripe for startup disruption.
Interconnects are the backbone of system- and rack-level communication, where even minor bottlenecks between compute nodes can cripple performance and increase latency.
Meeting the unprecedented bandwidth demands of all-to-all communication across thousands of GPUs requires novel interconnect solutions that balance cost, speed, and energy efficiency.
A critical dimension of this evolution is photonics, both on-chip and off-chip. Co-packaged optics and integrated photonics are reshaping switch and compute node integration by placing optical interfaces directly beside or within chips, cutting power consumption while boosting bandwidth density.
Meanwhile, multipoint-to-multipoint photonic networks are emerging as a path to truly scalable all-to-all GPU communication, enabling larger clusters and unprecedented efficiency for AI workloads.
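To put rough numbers on that bandwidth demand, consider a back-of-envelope sketch; the model size, precision, GPU count, step-time budget, and plain data-parallel setup below are all illustrative assumptions rather than measurements of any particular system:

```python
# Rough sketch: in plain data-parallel training, a ring all-reduce makes each GPU
# move roughly twice the full gradient volume every step. All figures are assumed.

def ring_allreduce_bytes_per_gpu(gradient_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU sends (and receives) in one ring all-reduce."""
    return 2 * (n_gpus - 1) / n_gpus * gradient_bytes

params = 70e9                # assumed 70B-parameter model
grad_bytes = params * 2      # bf16 gradients: 2 bytes per parameter
step_time_s = 1.0            # assumed time budget to hide communication in one step

traffic = ring_allreduce_bytes_per_gpu(grad_bytes, n_gpus=1024)
print(f"~{traffic / 1e9:.0f} GB moved per GPU per step")
print(f"~{traffic * 8 / step_time_s / 1e12:.1f} Tbit/s per GPU to overlap it fully")
```

Even this simplified setup lands in the terabit-per-second range per GPU, which is why power-efficient optical links and photonic fabrics have become central to the rack-scale conversation.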
Startups are driving much of this innovation, as evidenced by recent deals such as Ciena’s acquisition of Nubis Communications, a TDK Ventures portfolio company, and Credo’s purchase of Hyperlume.
In addition to advances in connectivity and bandwidth, hardware and software must be tightly paired and intelligently orchestrated to unlock true performance. Rack-aware AI solutions, for instance, show tremendous promise by adapting software to hardware topology, architecture, and bandwidth instead of forcing hardware to conform to software constraints.
Meta has already embraced this approach, designing “AI Zones” within its data centers that leverage specialized rack training switches (RTSWs) and custom algorithms to optimize GPU communication for large-scale language model training.
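As a minimal sketch of what that hardware-software pairing can look like, assume a hypothetical two-level scheme (reduce within a rack over fast local links, then a smaller collective across racks); this is an illustration, not Meta’s actual algorithm or switch configuration:

```python
# Hypothetical "rack-aware" orchestration sketch: group GPUs by rack so that
# bandwidth-hungry collectives run over fast intra-rack links first and only
# one leader per rack crosses the slower inter-rack fabric.

from collections import defaultdict

def build_rack_groups(gpu_to_rack: dict[str, str]) -> dict[str, list[str]]:
    """Group GPU ids by the rack they live in."""
    groups: dict[str, list[str]] = defaultdict(list)
    for gpu, rack in gpu_to_rack.items():
        groups[rack].append(gpu)
    return dict(groups)

def plan_hierarchical_allreduce(groups: dict[str, list[str]]) -> list[str]:
    """Two-level plan: reduce inside each rack, then all-reduce one leader per rack."""
    plan = [f"intra-rack reduce on {rack}: {sorted(gpus)}" for rack, gpus in groups.items()]
    leaders = sorted(sorted(gpus)[0] for gpus in groups.values())
    plan.append(f"inter-rack all-reduce among leaders: {leaders}")
    return plan

# Toy topology with two racks of two GPUs each.
topology = {"gpu0": "rackA", "gpu1": "rackA", "gpu2": "rackB", "gpu3": "rackB"}
for step in plan_hierarchical_allreduce(build_rack_groups(topology)):
    print(step)
```

The design choice is the point: the software plan follows the physical topology, so the scarce inter-rack bandwidth carries only a fraction of the traffic.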
Finally, there is a monumental opportunity in power management, distribution, and cooling: the industry must reliably deliver, and then dissipate, the tens of kilowatts that each AI rack draws in today’s data centers.
An Investor’s Perspective
For investors, the signal is clear: the next wave of AI infrastructure winners will not be defined solely by who makes the fastest chip, but by who enables rack-scale performance. History offers precedent.
Just as Cisco and Arista rose to prominence by solving campus and data center networking, and VMware defined an era through virtualization and orchestration, the coming decade will crown system-level innovators as indispensable to AI’s infrastructure backbone.
The AI “chip wars” are evolving into “system wars.” In that transition, the greatest opportunities and returns will accrue to those who can engineer at scale.