In a talk at this week’s Hot Chips event at Stanford University, Bill Dally, NVIDIA’s chief scientist and senior vice president of research, previewed a deep neural network (DNN) accelerator chip designed for efficient execution of natural language processing tasks.
The 5nm prototype achieves 95.6 TOPS/W and 1,711 inferences/s/W on BERT with only 0.7% accuracy loss, demonstrating a practical accelerator design for energy-efficient inference with transformers.
He explored a half dozen other techniques for tailoring hardware to specific AI tasks, often by defining new data types or operations.
Dally described ways to simplify neural networks, pruning synapses and neurons in an approach called structural sparsity, first adopted in NVIDIA A100 Tensor Core GPUs.
“We’re not done with sparsity,” he said. “We need to do something with activations and can have greater sparsity in weights as well.”
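To illustrate the idea behind structural sparsity: the 2:4 pattern supported by A100 Tensor Cores keeps at most two nonzero values in every group of four weights, zeroing the smallest-magnitude entries so hardware can skip them. A minimal NumPy sketch of that pruning step (`prune_2_4` is a hypothetical helper for illustration, not NVIDIA’s API):

```python
import numpy as np

def prune_2_4(weights):
    """Zero the 2 smallest-magnitude values in every group of 4 weights.

    A sketch of 2:4 structured-sparsity pruning; assumes the weight
    count is divisible by 4. Not NVIDIA's actual implementation.
    """
    w = weights.reshape(-1, 4).copy()
    # Indices of the two smallest-magnitude entries in each group of four
    idx = np.argsort(np.abs(w), axis=1)[:, :2]
    np.put_along_axis(w, idx, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.arange(1, 9, dtype=float).reshape(2, 4)
print(prune_2_4(w))  # [[0. 0. 3. 4.]
                     #  [0. 0. 7. 8.]]
```

Because every group of four retains exactly two nonzeros, the sparse matrix can be stored compactly and the zeroed multiplications skipped, which is what the Tensor Core hardware exploits.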
In a separate talk, Kevin Deierling, NVIDIA’s vice president of networking, described the unique flexibility of NVIDIA BlueField DPUs and NVIDIA Spectrum networking switches for allocating resources based on changing network traffic or user rules.
“Today with generative AI workloads and cybersecurity, everything is dynamic, things are changing constantly,” Deierling said. “So we’re moving to runtime programmability and resources we can change on the fly.”