The success of machine learning rests on scalability

As ML takes hold, engineers will need to design systems that can dynamically adjust the processing resource they deliver.
November 14, 2019

Provided by Arm

Steve Roddy is Vice President, Machine Learning Group, at Arm.

Change is constant, and artificial intelligence (AI) and machine learning (ML) are changing everything, all over again. For developers trying to bring new products and services to market that leverage AI and ML, the challenges are compounded by the fact that the technology landscape is still developing.

The traditional embedded sector exhibits a linear relationship between the need for more processing performance and the way that performance is used. With AI and ML, by contrast, there is a disparity between the software and the hardware platforms it will run on, because ML is constantly changing.

The move away from sequential thinking

ML is likely to be used everywhere, just as traditional embedded software is used today. However, unlike traditional code, which is written line by line in a sequential pattern (even if auto-generation is used), ML will be deployed as models, created by frameworks that learn. Models will, in a very real sense, be birthed. And like any form of offspring, you can never really be sure of just what you will be getting until it arrives.

For developers, then, the predictable nature of embedded software will disappear or change significantly. Tools are being developed that help predict how a model will operate, or that impose certain restrictions on the way the model is formed so that it complies with the platform, but these are nascent and in no way a panacea. It is likely that adapting to the constraints of the system will lead to a loss in accuracy. The nature of ML is that it delivers the accuracy needed on the hardware provided. It follows, then, that if the hardware is able to adapt, you can avoid compromising on accuracy.

There’s no one-size-fits-all for machine learning. Engineers need to design systems that offer scalable performance and can adjust the type of processing resource they deliver based on the task at hand.

The way a model performs on a fixed hardware platform will also change. The predictable nature of embedded software has long been a mainstay of design; indeed the idea that the code’s characteristics will change after it has been deployed is the stuff of engineers’ nightmares. Embedded systems are developed within performance parameters, an envelope based on power, cost, heat dissipation, size, weight, and any number of measurables that can be traded off against each other to meet defined targets. This is essentially how embedded development has always been done, but it isn’t the way it will be done in the future.

Scalability is the new norm

Instead, engineers will need to design systems that offer scalable performance and that are able to dynamically adjust the type of processing resource they deliver based on the task at hand. This is different to what embedded engineers may be comfortable with right now. For some years embedded processors have had the ability to vary their operating frequency and supply voltage based on workload. Essentially, a processor’s core can run slower when it isn’t busy; scaling back the main clock frequency directly translates to fewer transistors switching on and off per second, which saves power. When the core really needs to get busy, the clock frequency is scaled up, increasing the throughput. Supply voltage and clock frequency are related: a lower clock frequency allows a lower supply voltage, and reducing both together amplifies the power saving. This kind of scaling isn’t going to be enough to deliver the power and performance needed in the embedded devices now being developed to run ML models.
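Why lowering voltage and frequency together amplifies the saving can be sketched with the standard first-order model of switching power, P = a · C · V² · f. The capacitance, voltage, and frequency figures below are hypothetical, chosen only to make the arithmetic easy to follow; they do not describe any particular processor, and the model ignores leakage.

```python
def dynamic_power(capacitance_f, voltage_v, frequency_hz, activity=1.0):
    """First-order switching power in watts: P = activity * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


# Hypothetical operating points for one core (illustrative numbers only).
full = dynamic_power(1e-9, 1.0, 1.0e9)           # full speed, full voltage
freq_only = dynamic_power(1e-9, 1.0, 0.5e9)      # half the clock, same voltage
freq_and_volt = dynamic_power(1e-9, 0.8, 0.5e9)  # half the clock, 80% voltage

for label, watts in [
    ("full speed", full),
    ("frequency scaled", freq_only),
    ("frequency and voltage scaled", freq_and_volt),
]:
    print(f"{label}: {watts:.2f} W ({watts / full:.0%} of full)")
```

Because voltage enters the model squared, halving the clock alone halves the switching power, while also dropping the voltage to 80% brings it down to roughly a third. Throughput, however, still falls with the clock, which is why this kind of scaling trades performance for power rather than adding capability.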

That’s because the way we measure performance is going to change. Right now, processors are typically measured in terms of operations per second; we’re now measuring that in teraops, or trillions of operations per second (TOPS). Using TOPS to measure the performance of a processor executing inferences won’t make as much sense as it does when executing sequential code, because the way the model runs isn’t directly comparable to regular embedded software. ML processors will be measured on the accuracy they achieve when delivering a given number of inferences per second for a given amount of power. We don’t have a standard metric for that yet, but we can say that simply increasing the clock frequency is not guaranteed to meet the inferences/s target and will likely bust the power budget without improving accuracy.
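One way to picture a measure that combines accuracy, inferences per second, and power is a simple selection rule: keep only the configurations that meet the accuracy and throughput targets within the power budget, then prefer the one that delivers the most inferences per joule. The sketch below is an illustration under assumptions, not a standard metric; the `OperatingPoint` type, the `pick_point` helper, and every number in it are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class OperatingPoint:
    """One hypothetical hardware configuration running a given model."""
    name: str
    accuracy: float          # fraction of correct inferences, 0.0 to 1.0
    inferences_per_s: float
    power_w: float

    @property
    def inferences_per_joule(self) -> float:
        return self.inferences_per_s / self.power_w


def pick_point(
    points: Sequence[OperatingPoint],
    min_accuracy: float,
    min_rate: float,
    power_budget_w: float,
) -> Optional[OperatingPoint]:
    """Return the most energy-efficient point that meets every target."""
    feasible = [
        p for p in points
        if p.accuracy >= min_accuracy
        and p.inferences_per_s >= min_rate
        and p.power_w <= power_budget_w
    ]
    return max(feasible, key=lambda p: p.inferences_per_joule, default=None)


# Illustrative figures only; not measurements of any real device.
points = [
    OperatingPoint("cpu_base", accuracy=0.92, inferences_per_s=20, power_w=1.0),
    OperatingPoint("cpu_fast_clock", accuracy=0.92, inferences_per_s=40, power_w=3.5),
    OperatingPoint("npu", accuracy=0.92, inferences_per_s=60, power_w=1.5),
]

choice = pick_point(points, min_accuracy=0.90, min_rate=30, power_budget_w=2.0)
print(choice.name if choice else "no configuration meets the targets")
```

In this made-up data the faster clock reaches the throughput target but exceeds the 2 W budget, and its accuracy is no better than the base configuration, which mirrors the point made in the text.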

The path to inference is littered with variables

Why? The reason lies in the way ML models work. With many layers of probability to go through, there are just as many variables that can change the path through those layers. The real world will have a much greater impact on the way ML models execute, with far more variability than linear sequential embedded code. Take natural language processing and speech recognition as an example: the speaker’s voice and cadence both play a role in the efficacy of the model, and there may also be interplay between these parameters that results in a different experience under various conditions. Simply increasing the processor’s speed in this case may not return the desired outcome.

Furthermore, one of the defining features of ML is its ability to learn. Even if reinforcement learning isn’t applied in the device itself, it is still possible that data will be fed back to a mainframe where the model may be tweaked based on the results observed. Even without this feedback it is likely that the model will be improved over time, purely because of the way ML is still evolving. This would lead to a new model being created and deployed (using over-the-air updates, for example), which will then have potentially entirely different processing requirements, functioning differently under the same or similar conditions.  

The changing nature of ML models means that, while current CPU architectures can be and are being used for ML, today’s architectures almost certainly can’t provide the optimal way of executing them. Yes, models can run on CPUs using all the usual ALU features found in most processors. They can also benefit from highly parallel architectures, such as GPUs, that feature many instances of these features working side by side, but it’s already clear that GPUs are not the best way to execute ML models. In fact, we already have examples of neural processing units, and the semiconductor industry is hard at work developing entirely new architectures for executing ML models more efficiently. At some point, either the hardware or the software becomes fixed in order to let the other move forward. The right way to address this is to commit to a common software framework that can be used across compatible but scalable hardware platforms, so that both evolve together.
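The idea of one software layer sitting on top of interchangeable hardware can be sketched as a tiny runtime that tries an ordered list of preferred backends and uses the first one present on the device. Everything here is hypothetical: the `Runtime` class, the backend names, and the stand-in `relu` kernel are illustrative only and are not the API of Arm NN or of any other framework.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A kernel takes a list of input values and returns a list of outputs.
Kernel = Callable[[Sequence[float]], List[float]]


class Runtime:
    """Hypothetical common layer: same calling code, whichever backend exists."""

    def __init__(self) -> None:
        self._backends: Dict[str, Kernel] = {}

    def register(self, name: str, kernel: Kernel) -> None:
        self._backends[name] = kernel

    def run(
        self, inputs: Sequence[float], preferred: Sequence[str]
    ) -> Tuple[str, List[float]]:
        # Walk the preference list and use the first backend that is present.
        for name in preferred:
            kernel = self._backends.get(name)
            if kernel is not None:
                return name, kernel(inputs)
        raise RuntimeError("no preferred backend is available")


def relu(values: Sequence[float]) -> List[float]:
    """Stand-in for a model layer; a real backend would run a whole network."""
    return [max(0.0, v) for v in values]


runtime = Runtime()
runtime.register("cpu", relu)  # this hypothetical device has only a CPU

used, out = runtime.run([-1.0, 2.0], preferred=["npu", "gpu", "cpu"])
print(used, out)
```

The calling code states a preference order once. Adding an accelerator to a later version of the hardware would mean registering another backend, not rewriting the application.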

Flexible heterogeneous architectures

By doing this, the scalability needed to support AI and ML can be extended from the core of the network to the very edge, without locking the architecture down to a fixed platform. Project Trillium is Arm’s heterogeneous ML compute platform composed of cores and software. Arm is expanding Project Trillium to address ML at every point in the network. The common software platform here is Arm’s neural network software libraries, Arm NN, that can run across Arm processor platforms and are also compatible with leading third-party neural network frameworks. The hardware includes the existing Arm Cortex-A and Arm Mali GPU processors that are being enhanced for AI and ML, as well as totally new processors for ML acceleration.

In terms of scalability, ML can and does run on processors as small and resource-constrained as the Cortex-M class, and as feature-rich as the Mali GPUs. However, true scalability is needed to meet all the needs of ML from the core to the edge, which is where the next step in processor evolution comes in. Neural processing units, or NPUs, represent the new generation of processor architecture that will support ML in more applications.

Only Arm offers this level of scalability across the ML landscape. Choosing scalable architectures that can be composed of MCUs, CPUs, GPUs and NPUs will help future-proof hardware platforms against new software applications that haven’t even been conceived yet.

There are many unknowns: what ML models we’ll be creating in the future, how much compute power they will need to deliver the desired accuracy, and how quickly computer scientists will be able to improve models so they need less power. All of these considerations have a direct impact on the underlying hardware. The only thing we do know is that meeting end users’ changing expectations requires a flexible and scalable platform.
