
Machine learning and artificial intelligence (AI) have already penetrated so deeply into our life and work that you might have forgotten what interactions with machines used to be like. We used to ask only for precise quantitative answers to questions conveyed with numeric keypads, spreadsheets, or programming languages: “What is the square root of 10?” “At this rate of interest, what will be my gain over the next five years?”
But in the past 10 years, we’ve become accustomed to machines that can answer the kind of qualitative, fuzzy questions we’d only ever asked of other people: “Will I like this movie?” “How does traffic look today?” “Was that transaction fraudulent?”
Deep neural networks (DNNs), systems that learn how to respond to new queries when they’re trained with the right answers to very similar queries, have enabled these new capabilities. DNNs are the primary driver behind the rapidly growing global market for AI hardware, software, and services, valued at
US $327.5 billion this year and expected to pass $500 billion in 2024, according to the International Data Corporation.
Convolutional neural networks first fueled this revolution by providing superhuman image-recognition capabilities. In the last decade, new DNN models for natural-language processing, speech recognition, reinforcement learning, and recommendation systems have enabled many other commercial applications.
But it’s not just the number of applications that’s growing. The size of the networks and the data they need are growing, too. DNNs are inherently scalable—they provide more reliable answers as they get bigger and as you train them with more data. But doing so comes at a cost. The number of computing operations needed to train the best DNN models
grew 1 billionfold between 2010 and 2018, meaning a huge increase in energy consumption And while each use of an already-trained DNN model on new data—termed inference—requires much less computing, and therefore less energy, than the training itself, the sheer volume of such inference calculations is enormous and increasing. If it’s to continue to change people’s lives, AI is going to have to get more efficient.
We think changing from digital to analog computation might be what’s needed. Using nonvolatile memory devices and two fundamental physical laws of electrical engineering, simple circuits can implement a version of deep learning’s most basic calculations that requires mere thousandths of a trillionth of a joule (a femtojoule). There’s a great deal of engineering to do before this tech can take on complex AIs, but we’ve already made great strides and mapped out a path forward.
The biggest time and energy costs in most computers occur when lots of data has to move between external memory and computational resources such as CPUs and GPUs. This is the “von Neumann bottleneck,” named after the classic computer architecture that separates memory and logic. One way to greatly reduce the power needed for deep learning is to avoid moving the data—to do the computation out where the data is stored.
DNNs are composed of layers of artificial neurons. Each layer of neurons drives the output of those in the next layer according to a pair of values—the neuron’s “activation” and the synaptic “weight” of the connection to the next neuron.
Most DNN computation is made up of what are called vector-matrix-multiply (VMM) operations—in which a vector (a one-dimensional array of numbers) is multiplied by a two-dimensional array. At the circuit level these are composed of many multiply-accumulate (MAC) operations. For each downstream neuron, all the upstream activations must be multiplied by the corresponding weights, and these contributions are then summed.
Most useful neural networks are too large to be stored within a processor’s internal memory, so weights must be brought in from external memory as each layer of the network is computed, each time subjecting the calculations to the dreaded von Neumann bottleneck. This leads digital compute hardware to favor DNNs that move fewer weights in from memory and then aggressively reuse these weights.
A radical new approach to energy-efficient DNN hardware occurred to us at IBM Research back in 2014. Together with other investigators, we had been working on crossbar arrays of nonvolatile memory (NVM) devices. Crossbar arrays are constructs where devices, memory cells for example, are built in the vertical space between two perpendicular sets of horizontal conductors, the so-called bitlines and the wordlines. We realized that, with a few slight adaptations, our memory systems would be ideal for DNN computations, particularly those for which existing weight-reuse tricks work poorly. We refer to this opportunity as “analog AI,” although other researchers doing similar work also use terms like “processing-in-memory” or “compute-in-memory.”
There are several varieties of NVM, and each stores data differently. But data is retrieved from all of them by measuring the device’s resistance (or, equivalently, its inverse—conductance). Magnetoresistive RAM (MRAM) uses electron spins, and flash memory uses trapped charge. Resistive RAM (RRAM) devices store data by creating and later disrupting conductive filamentary defects within a tiny metal-insulator-metal device. Phase-change memory (PCM) uses heat to induce rapid and reversible transitions between a high-conductivity crystalline phase and a low-conductivity amorphous phase.
Flash, RRAM, and PCM offer the low- and high-resistance states needed for conventional digital data storage, plus the intermediate resistances needed for analog AI. But only RRAM and PCM can be readily placed in a crossbar array built in the wiring above silicon transistors in high-performance logic, to minimize the distance between memory and logic.
We organize these NVM memory cells in a two-dimensional array, or “tile.” Included on the tile are transistors or other devices that control the reading and writing of the NVM devices. For memory applications, a read voltage addressed to one row (the wordline) creates currents proportional to the NVM’s resistance that can be detected on the columns (the bitlines) at the edge of the array, retrieving the stored data.
To make such a tile part of a DNN, each row is driven with a voltage for a duration that encodes the activation value of one upstream neuron. Each NVM device along the row encodes one synaptic weight with its conductance. The resulting read current is effectively performing, through Ohm’s Law (in this case expressed as “current equals voltage times conductance”), the multiplication of excitation and weight. The individual currents on each bitline then add together according to Kirchhoff’s Current Law. The charge generated by those currents is integrated over time on a capacitor, producing the result of the MAC operation.
These same analog in-memory summation techniques can also be performed using flash and even SRAM cells, which can be made to store multiple bits but not analog conductances. But we can’t use Ohm’s Law for the multiplication step. Instead, we use a technique that can accommodate the one- or two-bit dynamic range of these memory devices. However, this technique is highly sensitive to noise, so we at IBM have stuck to analog AI based on PCM and RRAM.
Unlike conductances, DNN weights and activations can be either positive or negative. To implement signed weights, we use a pair of current paths—one adding charge to the capacitor, the other subtracting. To implement signed excitations, we allow each row of devices to swap which of these paths it connects with, as needed.
With each column performing one MAC operation, the tile does an entire vector-matrix multiplication in parallel. For a tile with 1,024 × 1,024 weights, this is 1 million MACs at once.
In systems we’ve designed, we expect that all these calculations can take as little as 32 nanoseconds. Because each MAC performs a computation equivalent to that of two digital operations (one multiply followed by one add), performing these 1 million analog MACs every 32 nanoseconds represents 65 trillion operations per second.
We’ve built tiles that manage this feat using just 36 femtojoules of energy per operation, the equivalent of 28 trillion operations per joule. Our latest tile designs reduce this figure to less than 10 fJ, making them 100 times as efficient as commercially available hardware and 10 times better than the system-level energy efficiency of the latest custom digital accelerators, even those that aggressively sacrifice precision for energy efficiency.
It’s been important for us to make this per-tile energy efficiency high, because a full system consumes energy on other tasks as well, such as moving activation values and supporting digital circuitry.
There are significant challenges to overcome for this analog-AI approach to really take off. First, deep neural networks, by definition, have multiple layers. To cascade multiple layers, we must process the VMM tile’s output through an artificial neuron’s activation—a nonlinear function—and convey it to the next tile. The nonlinearity could potentially be performed with analog circuits and the results communicated in the duration form needed for the next layer, but most networks require other operations beyond a simple cascade of VMMs. That means we need efficient analog-to-digital conversion (ADC) and modest amounts of parallel digital compute between the tiles. Novel, high-efficiency ADCs can help keep these circuits from affecting the overall efficiency too much. Recently, we unveiled a high-performance PCM-based tile using a new kind of ADC that helped the tile achieve better than 10 trillion operations per watt.
A second challenge, which has to do with the behavior of NVM devices, is more troublesome. Digital DNNs have proven accurate even when their weights are described with fairly low-precision numbers. The 32-bit floating-point numbers that CPUs often calculate with are overkill for DNNs, which usually work just fine and with less energy when using 8-bit floating-point values or even 4-bit integers. This provides hope for analog computation, so long as we can maintain a similar precision.
Given the importance of conductance precision, writing conductance values to NVM devices to represent weights in an analog neural network needs to be done slowly and carefully. Compared with traditional memories, such as SRAM and DRAM, PCM and RRAM are already slower to program and wear out after fewer programming cycles. Fortunately, for inference, weights don’t need to be frequently reprogrammed. So analog AI can use time-consuming write-verification techniques to boost the precision of programming RRAM and PCM devices without any concern about wearing the devices out.
That boost is much needed because nonvolatile memories have an inherent level of programming noise. RRAM’s conductivity depends on the movement of just a few atoms to form filaments. PCM’s conductivity depends on the random formation of grains in the polycrystalline material. In both, this randomness poses challenges for writing, verifying, and reading values. Further, in most NVMs, conductances change with temperature and with time, as the amorphous phase structure in a PCM device drifts, or the filament in an RRAM relaxes, or the trapped charge in a flash memory cell leaks away.
There are some ways to finesse this problem. Significant improvements in weight programming can be obtained by using two conductance pairs. Here, one pair holds most of the signal, while the other pair is used to correct for programming errors on the main pair. Noise is reduced because it gets averaged out across more devices.
We tested this approach recently in a multitile PCM-based chip, using both one and two conductance pairs per weight. With it, we demonstrated excellent accuracy on several DNNs, even on a recurrent neural network, a type that’s…
