The history of modern software development has been a dance between what hardware can provide and what software demands. Over the decades, the steps of this dance have taken us from the original Intel 8086, with what we now consider very basic functionality, to today’s versatile processors, which provide virtualization support, encrypted memory, and expanded instruction sets that power the most demanding application stacks.
The dance swings both ways. Sometimes our software has to stretch to take advantage of the capabilities of the next generation of silicon, and sometimes it has to squeeze every last ounce of available performance from what it already has. We’re now seeing the arrival of a new generation of hardware that combines familiar CPUs with system-level accelerators capable of running complex AI models on both client hardware and servers, on-premises and in the public cloud.
You’ll find AI accelerators not only in familiar Intel and AMD processors, but also in Arm’s latest-generation Neoverse server-class designs, which combine these features with low power demands (as do Qualcomm’s mobile and laptop offerings). It’s an attractive combination of features for hyperscale clouds like Azure, where low power and high density can help keep costs low while allowing growth to continue.
At the same time, system-level accelerators promise an interesting future for Windows, making it possible to run built-in AI assistants on local hardware instead of in the cloud as Microsoft continues to improve the performance of its Phi series of small language models.
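As a rough illustration of what running a small language model locally looks like, here’s a minimal sketch using the Hugging Face transformers library and Microsoft’s publicly available phi-2 checkpoint as a stand-in. It runs on whatever device PyTorch picks and doesn’t target any particular NPU, so treat it as illustrative rather than as the actual Windows integration.

```python
# Minimal sketch: running a small language model locally instead of calling a cloud API.
# Assumes the transformers and torch packages and the public microsoft/phi-2 checkpoint;
# this is illustrative only, not Microsoft's Windows AI assistant plumbing.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # a small language model from Microsoft's Phi series
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # downloads the model weights

prompt = "Summarize why on-device AI accelerators matter:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```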
Azure Boost: Silicon for virtualization offload
Ignite 2023 saw Microsoft announce its own custom silicon for Azure, hardware expected to begin shipping to customers in 2024. Microsoft has been using custom silicon and FPGAs in its own services for some time; its Zipline hardware compression and Project Brainwave FPGA-based AI accelerators are good examples. The newest addition is Azure Boost, which offloads virtualization processes from the hypervisor and host operating system to speed up storage and networking for Azure VMs. Azure Boost also includes the Cerberus built-in supply chain security chipset.
Azure Boost aims to give your virtual machine workloads access to as much of the available CPU as possible. Instead of spending CPU cycles compressing data or managing security, dedicated hardware takes over those tasks, allowing Azure to run more customer workloads on the same hardware. Running systems at high utilization is key to public cloud economics, and any hardware investment that helps do that will pay off quickly.
Maia 100: Silicon for large language models
Large language models (and generative AI in general) are enormously compute-intensive: OpenAI used Microsoft’s GPU-based supercomputer to train its GPT models. Even on a system like Microsoft’s, large foundation models such as GPT-4, with over a trillion parameters, require months of training. The next generation of LLMs will need even more compute, for both training and inference. And if we’re building applications around these LLMs using retrieval-augmented generation (RAG), we’ll need additional capacity to create embeddings for our source content and to provide the underlying vector-based search.
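To make that extra capacity concrete, here’s a minimal sketch of the two RAG steps just mentioned: creating embeddings for source content and running a vector search over them. It assumes the sentence-transformers package and the small public all-MiniLM-L6-v2 model purely as stand-ins for whatever embedding model a production system would use.

```python
# Minimal RAG sketch: embed source documents, then answer a query with vector search.
# Assumes the sentence-transformers and numpy packages; the model named here is a
# small public embedding model used only as a stand-in.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Azure Boost offloads storage and networking work from the host CPU.",
    "Maia 100 is Microsoft's custom accelerator for AI training and inference.",
    "Project Brainwave used FPGAs to accelerate specific machine learning models.",
]

# Step 1: create embeddings for the source content (done once, then kept in a vector index).
doc_vectors = model.encode(documents, normalize_embeddings=True)

# Step 2: embed the query and find the nearest document by cosine similarity.
query_vector = model.encode(["Which hardware speeds up VM networking?"], normalize_embeddings=True)
scores = doc_vectors @ query_vector.T  # cosine similarity, since the vectors are normalized
print(documents[int(np.argmax(scores))])
```

Every document in the corpus has to pass through the embedding model at least once, and every query passes through it again at run time, which is exactly the kind of sustained inference load that calls for dedicated accelerators rather than spare CPU cycles.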
GPU-based supercomputers are a significant investment, even if Microsoft can recoup some of the capital costs from subscribers. Operating costs are also high, driven by cooling, power, bandwidth, and storage requirements. We can therefore expect these resources to be limited to a small number of data centers where sufficient space, power, and cooling are available.
But if large-scale AI is to be a successful differentiator for Azure against competitors like AWS and Google Cloud, it will need to be ubiquitous and cost-effective. This will require new silicon (for both training and inference) that can be run at higher densities and lower power than today’s GPUs.
Azure’s Project Brainwave FPGAs use programmable silicon to implement key algorithms. While they worked well, they were single-purpose devices that served as accelerators for specific machine learning models. You could develop a variant that supports complex neural networks, but it would require implementing large arrays of simple processors to handle the multidimensional vector arithmetic that drives these semantic models, and that is beyond the capabilities of most FPGA technologies.
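To give a sense of the scale of that vector arithmetic, here’s a small sketch of the kind of matrix multiplication a single transformer layer performs over and over. The dimensions are made-up round numbers rather than any specific model’s, but they show why general-purpose matrix hardware is needed.

```python
# Sketch of the core arithmetic in a transformer layer: large matrix multiplications.
# The dimensions below are illustrative round numbers, not those of any particular model.
import numpy as np

seq_len, d_model = 2048, 4096            # tokens in context, model width
x = np.random.rand(seq_len, d_model)     # activations flowing into one layer
w = np.random.rand(d_model, d_model)     # a single projection weight matrix

y = x @ w                                # one of several matmuls per layer

# Rough FLOP count for just this one multiplication: 2 * seq_len * d_model * d_model
flops = 2 * seq_len * d_model * d_model
print(f"{flops / 1e9:.1f} GFLOPs for a single projection")  # ~68.7 GFLOPs
```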