Shortly after Amazon CEO Andy Jassy revealed AWS’s major $50 billion cloud deal with OpenAI, Amazon invited me, mostly at its expense, to a private tour of the chip development lab central to the agreement. Industry watchers are keeping a close eye on Amazon’s Trainium chip, developed at this facility, for its potential to lower the cost of AI inference and challenge Nvidia’s near monopoly. Intrigued, I decided to attend.
My tour guides were the lab’s director, Kristopher King, and director of engineering Mark Carroll, along with the team’s PR representative, Doron Aronson, who arranged the visit. AWS has been Anthropic’s main cloud platform since the AI lab’s inception, a partnership robust enough to endure Anthropic’s subsequent collaboration with Microsoft and Amazon’s expanding relationship with OpenAI.
The OpenAI deal makes AWS the exclusive provider of the model maker’s new AI agent builder, Frontier, an important piece of OpenAI’s business if agents take off the way Silicon Valley expects. Whether that exclusivity holds as announced remains to be seen, especially given reports that Microsoft may believe the deal violates its own agreement with OpenAI, which grants Microsoft access to all of OpenAI’s models and technology.
What attracts OpenAI to AWS is the commitment to supply it with 2 gigawatts of Trainium computing capacity, a massive pledge considering that Anthropic and Amazon’s Bedrock service already consume Trainium chips faster than Amazon can produce them. Some 1.4 million Trainium chips are deployed across three generations, and Anthropic’s Claude runs on more than 1 million Trainium2 chips. Originally designed to make model training faster and cheaper, Trainium has since been optimized for inference, the industry’s current performance bottleneck.
Trainium2 handles most inference traffic on Amazon’s Bedrock, the service enterprise customers use to build AI applications on top of multiple models. “Our customer base is expanding as fast as we can get capacity out there,” King said, hinting at Bedrock’s potential to rival AWS’s massive EC2 compute cloud. Beyond offering an alternative to Nvidia, Amazon claims its new chips, running in the new Trn3 UltraServers, deliver equal performance at up to 50 percent lower cost than conventional cloud servers. Alongside Trainium3, AWS also developed new Neuron switches that speed chip-to-chip communication in a mesh configuration and cut latency.
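For a sense of what building on Bedrock looks like from the customer’s side, here is a minimal sketch using boto3’s Bedrock runtime Converse API; the region, model ID, and prompt are illustrative assumptions, and the silicon serving the request, Trainium or otherwise, is invisible at this layer.

import boto3

# Bedrock's runtime client handles model invocation; the region here is an assumption.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example model ID; Bedrock hosts many models
    messages=[{"role": "user", "content": [{"text": "Explain why inference cost matters."}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])

Swapping in a different model is just a change to modelId, which is part of Bedrock’s pitch to enterprises that want to hedge across model providers.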
Amazon’s chip design team even earned public praise from Apple in 2024, which highlighted the ARM-based Graviton CPUs and the Inferentia inference chip and noted Trainium’s arrival. These chips exemplify Amazon’s strategy: identify market demands, then build competitive in-house alternatives.
The challenge with adopting new chips is the switching cost: applications written for Nvidia’s chips have to be re-architected to run anywhere else. Trainium now supports PyTorch, however, so many applications can move over with a simple one-line change and a recompilation, which is how AWS hopes to chip away at Nvidia’s market dominance. AWS also announced a partnership with Cerebras Systems that integrates its inference chip with Trainium servers for enhanced AI performance.
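To make that one-line-change claim concrete, here is a rough sketch of how PyTorch code typically targets Trainium through the Neuron SDK’s PyTorch/XLA integration; the exact setup varies by Neuron release, so treat this as an illustration rather than AWS’s official recipe.

import torch
import torch_xla.core.xla_model as xm  # installed as part of the torch-neuronx / PyTorch XLA stack

model = torch.nn.Linear(1024, 1024)

# The essence of the one-line change: target the XLA device Neuron exposes for
# Trainium instead of a CUDA device (on an Nvidia machine this would be torch.device("cuda")).
device = xm.xla_device()

model = model.to(device)
x = torch.randn(8, 1024).to(device)
y = model(x)

# XLA executes lazily; mark_step() flushes the pending graph so it is compiled and run on the chip.
xm.mark_step()
print(y.shape)

The recompilation happens behind that last step: the first execution traces the graph and hands it to the Neuron compiler, which is where most of the real porting effort tends to surface.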
Amazon goes beyond chip development by designing the servers that house them. This includes networking components and “Nitro,” a hardware-software combo providing virtualization technology, along with state-of-the-art liquid cooling systems and server sleds. All these efforts aim to optimize cost and performance.
Amazon’s custom chip unit grew out of its January 2015 acquisition of Israeli chip designer Annapurna Labs, which has kept its name and identity; the Annapurna logo is everywhere in the Austin office. The lab sits in a lively district known as “The Domain,” and it is where “silicon bring-up” happens: intense, often overnight efforts to make sure newly fabricated chips actually work.
Trainium3 moved from air-cooled prototypes to liquid cooling, and the team had to engineer fixes on the fly for problems that surfaced during bring-up sessions. The lab thrives on that kind of troubleshooting, a reminder of the high-stakes environment that pervades chip development.
The lab’s equipment, which includes a welding station for intricate hardware repairs, lets engineers like Arvind Srinivasan run thorough tests and component analyses. One striking feature is a display of sleds: the trays that hold Trainium AI chips, Graviton CPUs, and custom boards, and that are integral to running Anthropic’s Claude.
Beyond the lab, AWS operates a private data center, housed in a co-location facility, that is dedicated to testing and quality assurance and kept separate from customer workloads. The site combines tight security with specialized cooling systems, and it hums with high-tech activity.
The racks hold freshly minted hardware such as the Trn3 UltraServer, designed for maximum efficiency with slot-based configurations and Neuron switches arranged between sleds, all cooled by a closed-loop liquid system that carries environmental benefits of its own.
Amazon’s chip team’s profile has soared, built on focused innovation and hard-won bring-up victories. CEO Andy Jassy consistently promotes the lab’s achievements, emphasizing Trainium’s growing financial significance and acknowledging it openly during major agreements like the OpenAI deal.
