At its re:Invent conference, AWS today announced the general availability of its Trainium2 (T2) chips for training and deploying large language models (LLMs). These chips, which AWS first announced a year ago, are four times as fast as their predecessors, with a single Trainium2-powered EC2 instance with 16 T2 chips providing up to 20.8 petaflops of compute performance. In practice, AWS says, that means running inference for Meta’s massive Llama 405B model on Amazon’s Bedrock LLM platform will offer “3x higher token-generation throughput compared to other available offerings by major cloud providers.”
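For the curious, here is roughly what calling a model like Llama 405B through Bedrock looks like with the boto3 SDK. This is a minimal sketch; the model ID and region are our assumptions based on AWS’s published naming, not details from this announcement.

```python
# Minimal sketch: invoking a Llama model through Amazon Bedrock with boto3.
# The model ID and region are assumptions based on AWS's published naming;
# check the Bedrock console for what's actually enabled in your account.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")  # assumed region

response = client.converse(
    modelId="meta.llama3-1-405b-instruct-v1:0",  # assumed Bedrock model ID
    messages=[
        {"role": "user", "content": [{"text": "Summarize the Trainium2 launch in one sentence."}]}
    ],
    inferenceConfig={"maxTokens": 256, "temperature": 0.5},
)

print(response["output"]["message"]["content"][0]["text"])
```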
These new chips will also be deployed in what AWS calls ‘EC2 Trn2 UltraServers.’ These instances feature 64 interconnected Trainium2 chips that can scale up to 83.2 peak petaflops of compute. An AWS spokesperson told us the 20.8-petaflops figure is for dense models at FP8 precision, while the 83.2-petaflops figure is for sparse models at FP8.
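Those figures line up if you assume compute scales linearly with chip count. Here is the quick back-of-the-envelope math; the per-chip number is our own derivation, not an official AWS spec.

```python
# Back-of-the-envelope check of AWS's quoted figures, assuming compute
# scales linearly with chip count. The per-chip value is derived here,
# not an official AWS spec.
instance_pflops = 20.8      # Trn2 instance: 16 chips, dense FP8
chips_per_instance = 16

per_chip_pflops = instance_pflops / chips_per_instance
print(per_chip_pflops)      # 1.3 petaflops per Trainium2 chip

ultraserver_chips = 64
print(per_chip_pflops * ultraserver_chips)  # 83.2, matching the UltraServer figure
```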
AWS notes that the UltraServers use its NeuronLink interconnect to tie all of those Trainium2 chips together.
The company is working with Anthropic, the LLM provider AWS has placed its (financial) bets on, to build a massive cluster of these UltraServers with “hundreds of thousands of Trainium2 chips” to train Anthropic’s models. This new cluster, AWS says, will be five times as powerful (in terms of exaflops of compute) as the cluster Anthropic used to train its current generation of models and, AWS also notes, “is expected to be the world’s largest AI compute cluster reported to date.”
Overall, those specs are an improvement over Nvidia’s current generation of GPUs, which remain in high demand and short supply. They are dwarfed, however, by what Nvidia has promised for its next-gen Blackwell chips (with up to 720 petaflops of FP8 performance in a rack with 72 Blackwell GPUs), which should arrive — after a bit of a delay — early next year.
Trainium3: 4x faster, coming in 2025
Maybe that’s why AWS also used this moment to announce its next generation of chips, Trainium3. AWS expects another 4x performance gain for its UltraServers with Trainium3, and it promises to deliver this next iteration, built on a 3-nanometer process, in late 2025. That’s a very fast release cycle, though it remains to be seen how long the Trainium3 chips will stay in preview and when they’ll get into the hands of developers.
“Trainium2 is the highest performing AWS chip created to date,” said David Brown, vice president of Compute and Networking at AWS, in the announcement. “And with models approaching trillions of parameters, we knew customers would need a novel approach to train and run those massive models. The new Trn2 UltraServers offer the fastest training and inference performance on AWS for the world’s largest models. And with our third-generation Trainium3 chips, we will enable customers to build bigger models faster and deliver superior real-time performance when deploying them.”
The Trn2 instances are now generally available in AWS’ US East (Ohio) region (with other regions launching soon), while the UltraServers are currently in preview.
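For anyone who wants to kick the tires, here is a minimal sketch of launching a Trn2 instance in that region with boto3. The instance type name follows AWS’s naming for the earlier Trn1 generation and the AMI ID is a placeholder, so treat both as assumptions and check EC2’s own listings before running this.

```python
# Minimal sketch: launching a Trn2 instance in US East (Ohio) with boto3.
# The instance type name is an assumption modeled on Trn1 naming
# (trn1.32xlarge); the AMI ID is a placeholder for a Neuron-ready AMI.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")  # US East (Ohio)

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder: use a Neuron DLAMI
    InstanceType="trn2.48xlarge",     # assumed Trn2 instance type name
    MinCount=1,
    MaxCount=1,
)

print(response["Instances"][0]["InstanceId"])
```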