A host of chip developers are hungry to take a bite out of Nvidia's (NVDA) dominant position in the market for server accelerators used to handle the demanding task of training AI/deep learning models.
At last week's Hot Chips industry conference, Intel (INTC) , Huawei and startup Cerebras Systems each announced or shared details about powerful new AI training accelerators. Of the three, it was Cerebras that made the biggest waves.
Cerebras' Big Reveal
Cerebras' Wafer Scale Engine (WSE), revealed for the first time at Hot Chips, is easily the biggest processing chip ever developed. The WSE features a chip die size of 46,225 square millimeters (mm2), and thus requires an entire 12-inch (300mm) chip wafer to manufacture. For comparison, Nvidia's flagship Tesla V100 server GPU -- which launched in mid-2017, is widely deployed for AI training and other high-performance computing (HPC) workloads and remains the biggest GPU ever developed -- features a die size of "only" 815 mm2.
The WSE, which aims to replace not a single V100 but dozens of them, packs 400,000 processing cores optimized for training AI/deep learning models. It also contains a massive 18GB of high-speed SRAM, and delivers an eye-popping 9 petabytes per second (that's equal to 9,000 terabytes per second) of memory bandwidth. And since a chip this big will inevitably have some defective cores, the WSE features redundant cores and connections to route around them.
Cerebras' Wafer Scale Engine (WSE) placed next to Nvidia's Tesla V100 GPU. Source: Cerebras.
By any standard, the WSE is a very impressive work of engineering. Packing this much processing power and memory on a single chip, as opposed to splitting it up between many chips communicating via high-speed interconnects, is bound to yield benefits in terms of performance and power efficiency, and perhaps also price/performance.
However, the WSE does consume a giant 15 kilowatts (kW) of power -- 50 times more than a single V100. It also relies on Taiwan Semiconductor's (TSM) relatively old 16-nanometer (16nm) manufacturing process node -- the V100 uses TSMC's 12nm node, and newer chips from Apple (AAPL) , AMD (AMD) and others use a 7nm node -- and requires liquid cooling. And while Cerebras says that unnamed customers (possibly cloud giants) are already using the WSE, it hasn't yet shared the chip's price (one has to assume it will be significant) or when it will be available in volume.
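The scale gap between the WSE and the V100 can be sanity-checked with quick arithmetic on the figures above (a back-of-envelope sketch, not vendor benchmarks):

```python
# Published specs cited in this article; ratios are illustrative only.
wse_die_mm2, v100_die_mm2 = 46_225, 815
wse_power_w, v100_power_w = 15_000, 300  # 15 kW vs. the V100's ~300 W TDP

die_ratio = wse_die_mm2 / v100_die_mm2    # roughly 56.7x the silicon area
power_ratio = wse_power_w / v100_power_w  # the 50x power figure cited above

print(f"Die area: {die_ratio:.1f}x, Power draw: {power_ratio:.0f}x")
```

In other words, the WSE delivers its claimed advantages by being more than 50 times larger and more power-hungry than the chip it's compared against, which is why a direct one-to-one comparison with the V100 is of limited use.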
Intel used Hot Chips to share additional details about its first accelerator meant specifically for AI training. The chip, codenamed Spring Crest and now officially known as the NNP-T, is a product of Intel's Nervana Systems AI chip unit, which it acquired in 2016.
The NNP-T is a more conventional product than the WSE: It features a 688 mm2 die size, and (like the V100) will come with 32GB of HBM2 memory that's housed on nearby chips. However, Intel did share competitive specs in areas such as performance, memory bandwidth and power efficiency, while continuing to highlight its efforts to support a number of different AI-related developer tools and software frameworks.
But like the WSE, the NNP-T relies on TSMC's 16nm manufacturing node. And it's running a little late: Whereas Intel said last year that Spring Crest would launch during the second half of 2019, it now says the chip will sample to select cloud providers this year ahead of a broader launch at some point in 2020. It's worth adding that the NNP-T relies on high-speed PCIe Gen 4 connections for I/O. And while AMD's just-launched, second-gen Epyc server CPUs support PCIe Gen 4, the first Intel server CPUs to support it (the company's upcoming Ice Lake server CPU family) aren't expected to ship in volume until the second half of 2020.
Towards the end of last week, China's Huawei officially launched a powerful AI training chip known as the Ascend 910, after previously sharing details about it at Hot Chips. Huawei claims the Ascend 910, which relies on TSMC's cutting-edge 7nm+ manufacturing process along with HBM2 memory, consumes 310 watts of power and can deliver up to 256 teraflops of deep learning performance. Nvidia, for its part, has said the 300-watt Tesla V100 can deliver up to 125 teraflops of performance.
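Taken at face value, those vendor-claimed peak numbers imply the Ascend 910 roughly doubles the V100 on both raw throughput and performance per watt (a quick sketch using the figures above; real-world training throughput depends heavily on software and model choice):

```python
# Vendor-claimed peak figures cited in this article; not measured results.
ascend_tflops, ascend_watts = 256, 310
v100_tflops, v100_watts = 125, 300

ascend_eff = ascend_tflops / ascend_watts  # ~0.83 TFLOPS per watt
v100_eff = v100_tflops / v100_watts        # ~0.42 TFLOPS per watt

print(f"Ascend 910 claims {ascend_tflops / v100_tflops:.1f}x the V100's peak "
      f"throughput at {ascend_eff / v100_eff:.1f}x its performance per watt")
```

The caveat is that peak teraflops rarely translate directly into training speed; Nvidia's software stack, discussed below, is a big reason why.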
Since this is Huawei that we're talking about, political considerations loom large here. It's likely that many U.S. cloud giants and enterprises will be reluctant to use Huawei's silicon for AI-related projects in the current political climate. On the flip side, Chinese tech giants such as Alibaba, Tencent and Baidu could be pressured or at least encouraged by Beijing to use Huawei's chips.
Where Nvidia Stands
As Nvidia frequently points out, the company's competitive strengths in both AI training and the broader HPC accelerator market go well beyond its chip engineering work. Among other things, it has built a massive developer ecosystem for its CUDA GPU programming model, created dozens of domain-specific software libraries that run on top of CUDA, and has also worked closely with cloud giants and other developers to improve the performance of various AI models and software frameworks when run on its GPUs.
Thanks to such efforts, the V100's performance when running popular neural networks in fields such as image recognition and natural-language processing has improved considerably since the GPU first began shipping in mid-2017. Nonetheless, given how Intel, Huawei and especially Cerebras are pushing the envelope, there is now a little more pressure on Nvidia to roll out a compelling successor to the Tesla V100.
During Nvidia's Aug. 15 earnings call, CEO Jensen Huang suggested it could still be a little while before the V100's successor arrives. He noted that large-scale deployments of new data center GPUs take time, and said he expects the Volta architecture that the V100 is based on "to be successful all the way through next year" -- though it's quite possible that Nvidia launches a new flagship GPU next year while keeping the V100 around.
Nvidia has remained tight-lipped about what the V100's successor will look like. However, the chip is generally expected to rely on a next-gen GPU architecture known as Ampere, and to be built using a 7nm process. Also, given what Nvidia shared at Hot Chips about an experimental project involving a multi-chip accelerator that performs AI inference (the running of trained AI models against new data and content), the V100's successor could conceivably feature (like AMD's latest desktop and server CPUs) multiple processing chips housed within the same chip package.
Given how dominant Nvidia currently is in the AI training accelerator market, and given how large its chip and software investments in the space remain, it would be a mistake to assume that Cerebras, Intel or Huawei -- or for that matter, other rivals such as AMD and Google -- will cause major share losses in the near-term. However, at a time when Nvidia's flagship offering is more than 24 months old, their progress is definitely worth keeping an eye on.