The push toward disaggregation and customization in hardware is starting to be mirrored on the software side, where operating systems are becoming smaller and more targeted, supplemented with additional software that can be optimized for different functions.

There are two main causes for this shift. The first is rising demand for highly optimized and increasingly heterogeneous designs, which are significantly more efficient when coupled with smaller, more targeted OSes. The second is the rollout of AI/ML just about everywhere, which doesn’t benefit from monolithic operating systems that are designed to allocate resources and manage permissions, either for a single device or across a distributed network. In both cases, connecting other software to a single OS through application programming interfaces adds power and performance overhead.

“ML inference code can run on a variety of processing elements, each with different power and performance profiles,” said Steve Roddy, chief marketing officer at Quadric. “You might have a situation where in one state of operation of a mobile phone the OS might decide to run ML Workload A on the dedicated ML processor — whether that be the GPNPU, NPU, or offload accelerator — but in another scenario it might choose to run that same workload on a different resource.”

In some designs, there is no OS at all. Instead, frameworks like PyTorch and TensorFlow may have enough flexibility to handle resource allocation themselves, which is more efficient but more design-intensive. Yet that approach also provides the flexibility of migrating software independent of the hardware, which has significant power and performance benefits.

“In the industry it’s hard to understate the importance of ease of adoption,” observed Steven Woo, fellow and distinguished inventor at Rambus. “Nobody’s going to buy a more efficient car if the price is having to sit backwards. You need to think about fitting within existing business practices and supply chain behavior. Your ideal is to be essentially plug-and-play within each of those environments. When you upgrade to a new computer, you’d like a program to run without having to recompile it.”

With AI/ML, the key components are a computationally resource-intensive training phase and a generally less intensive inference phase. With inferencing, the workloads are specified in high-level graph code and are not necessarily hardwired to a specific type of processor. That frees up the OS to handle other functions. But it also provides the option of sizing the OS to the workload.

Consider a shopping recommendation application, for example, which suggests different chairs based on a user’s queries. The recommendation engine might run best and at lowest power on a dedicated ML processor. But when the user clicks an icon to show a recommended chair superimposed in the user’s living room, via the phone camera, the compute resources need to be adjusted to the task. The camera overlay of the virtual chair becomes the highest-priority task and needs to run on a bigger ML processor, or multiple smaller ML processors working in parallel, while the recommendation engine might need to be switched over to one or more CPUs to maintain performance.
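The hand-off described in the chair example can be sketched as a small priority-based placement routine. This is only an illustration under stated assumptions: the `Workload` class, the `place` function, the resource names ("npu", "cpu"), and the priorities and capacities are all hypothetical, and a real OS scheduler would also weigh power, memory, and deadlines.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    priority: int      # higher value is placed first (hypothetical scale)
    preferred: list    # processing elements to try, best first

def place(workloads, capacity):
    """Greedy placement: each workload, in priority order, takes the first
    preferred processing element that still has a free slot."""
    free = dict(capacity)              # copy so the caller's dict is untouched
    placement = {}
    for w in sorted(workloads, key=lambda w: w.priority, reverse=True):
        for pe in w.preferred:
            if free.get(pe, 0) > 0:
                free[pe] -= 1
                placement[w.name] = pe
                break
        else:
            placement[w.name] = None   # nothing available for this workload
    return placement

# Assumed device: one ML processor slot and two CPU slots.
capacity = {"npu": 1, "cpu": 2}
browsing = [Workload("recommender", 1, ["npu", "cpu"])]
ar_view = browsing + [Workload("camera_overlay", 10, ["npu", "cpu"])]

print(place(browsing, capacity))   # recommender gets the ML processor
print(place(ar_view, capacity))    # overlay takes it; recommender moves to a CPU
```

While the user is only browsing, the recommender holds the ML processor. Once the higher-priority camera overlay appears, the same routine moves the recommender to a CPU.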

As those types of workload-specific demands grow, it’s not clear whether monolithic operating systems can adapt to this level of specialization. So while SoCs are being disaggregated into heterogeneous and often customized components, such as chiplets, the same trend is underway on the software side.

“Major processor companies are increasingly deploying specialized instruction set processors in markets that previously relied solely upon scaling the speed and number of universal processors,” Roddy said. “Both Intel (‘accelerated computing’) and AMD (‘APU’) have introduced server and laptop processors with heterogenous instruction sets, much like the mobile phone processor vendors have utilized CPU + GPU + Vision processor (VPU) + ISP + Audio DSP architectures for more than a decade. Now, the ‘traffic cop’ OS at the heart of these latest beefy SoCs has to map a variety of specialized workloads across three or four or more processor targets.”

The more efficient approach is to have multiple smaller OSes, or to leverage that functionality elsewhere. Whether this happens, or whether OSes are retrofitted to handle additional tasks and slim down resources for others, remains to be seen. But either is a significant change.

“That complexity feels like an extension of what existing operating systems do today, not a wholesale replacement,” Roddy said. “We don’t see any reason why today’s leading OS code bases will be eclipsed any time soon as long as they evolve. They don’t need to be ‘replaced’ in any sense, but they will need to get better at juggling tasks across heterogeneous resources.”

Rethinking OSes with AI
AI itself is one possible answer to faster allocation. Currently, one of the ways operating systems allocate resources is to guess the user’s next move, through a prediction mechanism like pre-fetch in search, in which the OS determines what to bring up next based on prior use patterns. The problem is that the current heuristics for anticipating next moves are written by humans, who must try to imagine and code for all possible scenarios. This is an argument for why OSes should better accommodate the functions of AIs — and for how AI itself can help.

“If you’re an AI person, you see the results of these heuristics and say to yourself, ‘This is a machine trying to predict what’s happening in the future,’” said Martin Snelgrove, CTO at Untether AI. “That’s what AI does for a living. Next-gen operating systems will throw out all of these hand-made heuristics and replace them with neural nets. Your machine should be looking at your pattern of usage. For example, it should discover that you always bring up a Word document right after you’ve terminated a Zoom call. It should then be using any available space to get Word mostly pre-loaded about 20 minutes into a Zoom call, because it knows your calls usually last about that long.”
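To make the idea of learning from usage patterns concrete, here is a minimal sketch. It deliberately swaps in a much simpler technique than the neural nets Snelgrove describes: it counts which application has followed which in the past and predicts the most frequent successor. The class name, method names, and the sample history are all hypothetical.

```python
from collections import Counter, defaultdict

class NextAppPredictor:
    """Counts app-to-app transitions and predicts the most common successor.
    A stand-in for a learned model, not a neural net."""

    def __init__(self):
        self.transitions = defaultdict(Counter)

    def observe(self, history):
        # Record every consecutive (previous app, next app) pair.
        for prev, nxt in zip(history, history[1:]):
            self.transitions[prev][nxt] += 1

    def predict(self, current):
        counts = self.transitions.get(current)
        if not counts:
            return None                # no data for this app yet
        return counts.most_common(1)[0][0]

predictor = NextAppPredictor()
predictor.observe(["zoom", "word", "mail", "zoom", "word", "zoom", "browser"])
print(predictor.predict("zoom"))       # the app to consider pre-loading
```

An OS could use a prediction like this to decide what to pre-load. A learned model would additionally capture timing, such as how long a call usually lasts, which simple counting does not.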

Neural nets further the argument for smaller OSes. “Your operating system doesn’t need to be a gigabyte anymore, because all you’re doing is expressing the structure of the neural net, which isn’t a lot of code,” Snelgrove said. “The metaphor in Unix is that everything is a file. In AI, the key metaphor is that everything is an actor in the sense that the net receives inputs, thinks about them, and produces outputs. In Unix you can pipe things in, there can be filters, there can be inputs and outputs. In an AI system, a file can be an actor that just receives and sends the same thing. But most things will be active. So if you have a network whose job it is to paint all the roses pink and all the gardens green, that’s just an actor, and you give it images and it gives you images back.”
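The actor metaphor can be sketched in a few lines, by analogy with Unix pipes. In this hypothetical example each actor is a plain function that receives an "image" (here just a list of labels) and returns one. The painting functions are simple stand-ins for neural nets, and `pipe` and all other names are assumptions made for illustration.

```python
def file_actor(image):
    """A 'file' in this metaphor: receives and sends the same thing."""
    return image

def paint_roses_pink(image):
    # Stand-in for a network that recolors roses.
    return ["pink" if px == "rose" else px for px in image]

def paint_gardens_green(image):
    # Stand-in for a network that recolors gardens.
    return ["green" if px == "garden" else px for px in image]

def pipe(*actors):
    """Chain actors so each one's output feeds the next, like a Unix pipeline."""
    def run(image):
        for actor in actors:
            image = actor(image)
        return image
    return run

painter = pipe(file_actor, paint_roses_pink, paint_gardens_green)
print(painter(["rose", "garden", "sky"]))
```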

Looking ahead
The pragmatic thinking about increasing extensibility, rather than creating a completely new operating system, permeates IBM’s approach. In 1974, the company introduced the Multiple Virtual Storage (MVS) OS. In the following decades, it went through several updates, culminating in the current 64-bit z/OS.

But in 2022, IBM shifted direction somewhat. It embedded its Telum inference accelerator on the same silicon as its processor cores, sharing memory and level 3 cache with the host, in order to reduce latency and allow for more real-time AI within z/OS applications.

Enterprise applications, it turns out, are where AI dreams go to die. The training sets may be vast, the models may be accurate, but what high-volume transactional workload customers need most is real-time inference.

“We had discussions, conducted surveys, and did a lot of research to understand the main challenges that clients faced in actually leveraging AI in mission-critical enterprise workloads,” said Elpida Tzortzatos, fellow and CTO for AI on IBM zSystems. “What we found was they couldn’t get the response times and throughput they needed.”

For example, financial institutions that wished to do fraud detection in real-time often could only spot-check because of the drag on throughput. [1] For AI to fulfill its promises, a fraud prediction needs to come back before the transaction completes. Otherwise, the system hasn’t done enough to protect the institution’s end customers, Tzortzatos said.

In addition, IBM’s customers wanted to be able to easily consume AI without slowing down their applications or transactional workloads. “We heard, ‘I want to be able to embed AI into my existing applications without having to re-architect my systems or re-factor these applications. At the same time, I need to be able to meet stringent SLAs of one millisecond response times,’” she noted.

Fig. 1 The AI ecosystem, as seen by IBM. Source: IBM

All of this experience led Tzortzatos to recognize that the industry needs to continue to evolve and optimize operating systems for AI inference and training. “I don’t think it will be a completely new operating system,” she said. “Operating systems are going to evolve. They’re going to become more composable, where you can plug in those newer technologies and not impact the applications that are running on those operating systems.”

For now, rather than creating a new OS, it seems that commercial AI/ML will continue to rely on frameworks, as well as the ONNX exchange format, which allows developers to work in nearly any framework. The resulting code will be compatible with most inference engines. And if all goes as it should, the results will run on the current installed base of enterprise-scale OSes, like Linux, Unix, and z/OS.

At the same time, specialized AI/ML hardware may completely eliminate the need for an OS. “Already, part of the work of an OS is being done in accelerators. A data center GPU has its own scheduler and manages the work that needs to be done,” said Roel Wuyts, manager of the ExaScience Life Lab at imec.

When working with accelerators, an OS is a moot point, Cerebras CEO Andrew Feldman said. “By the time you get to the accelerator, you want to bypass the OS. It doesn’t help you with accuracy. The OS is designed to allocate hardware resources, and our compiler does that instead. We want the user writing in a language that the ML world is familiar with, and we don’t want them ever thinking about the challenge of distributing that work across 850,000 programmable elements. As a result, they never think of the machine at all. They just write their TensorFlow or PyTorch. The compiler puts it in and we don’t have to worry about any latency or a reduction in speed brought on by an OS allocating hardware resources.”

And at the other end of the scale, for single, embedded devices, such as smart cameras, the status quo is fine. “Do we need a new OS for machine learning AI at the lower level? No, that is regular compute work,” said Wuyts. “We can describe it and we can schedule it, and we already do that quite well.”

However, when you go beyond a single device into networked, distributed systems, a fresh approach could be quite valuable. Wuyts proposed using the data from a smart camera with another application, say a face-detecting security system in a train station, which needs to have the capability to zoom in and search a database of known terrorists. That scenario makes a developer’s life extremely difficult because of the interplay of different applications, devices, and networked demands.

“For that situation, you could see a new kind of OS that is distributed. We already looked at that a bit in the past. If you have a neural network or machine learning network, you would actually try to run some parts on your smaller devices and some parts more in the back end,” Wuyts said.

A developer should be spared having to think about partitioning and communication between disparate components, he said. “Instead, that can be done by this operating system, in the same way that I currently write a Windows application or Unix application, I have a scheduler that can take care of most [basic partitioning], but lets me take control if I want to do something fancy. For those new types of applications that are now becoming used a lot, it makes sense to run them as distributed machine learning applications.”
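One way to picture the partitioning such a distributed OS might automate is as a search for the best point at which to cut a layered network between the device and the back end. The sketch below is a simplified illustration, not a description of any imec system. The `best_split` function, the latency model (device compute, plus transfer, plus back-end compute), and all numbers are assumptions.

```python
def best_split(layer_cost, cut_size, device_speed, backend_speed, bandwidth):
    """Return (k, latency): layers [0, k) run on the device, the rest on the
    back end. cut_size[k] is the data sent when cutting after k layers, so
    cut_size[0] is the raw input and cut_size[-1] is the final result."""
    if len(cut_size) != len(layer_cost) + 1:
        raise ValueError("cut_size needs one entry per possible cut point")

    def latency(k):
        on_device = sum(layer_cost[:k]) / device_speed
        transfer = cut_size[k] / bandwidth
        on_backend = sum(layer_cost[k:]) / backend_speed
        return on_device + transfer + on_backend

    k = min(range(len(layer_cost) + 1), key=latency)
    return k, latency(k)

# Hypothetical four-layer network: early layers shrink the data considerably.
layer_cost = [4, 4, 8, 8]          # work per layer, arbitrary units
cut_size = [100, 20, 10, 10, 1]    # data crossing the network at each cut
split, total = best_split(layer_cost, cut_size,
                          device_speed=1, backend_speed=8, bandwidth=10)
print(split, total)
```

With these made-up numbers, running only the first layer on the device gives the lowest estimated latency, because that layer shrinks the data before it crosses the network. A developer who wants to "do something fancy" could override the chosen split.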

Quadric’s Roddy noted there is a significant distinction between AI/ML workloads that the OS manages versus how much AI/ML will be included in the OS itself. “Will the Linux kernel include an ML inference graph that determines task priority and resource allocations rather than relying on deterministic or heuristic code to manage the system? The external workloads to be managed (i.e. a camera function in a smartphone app) will be large, heavyweight ML graphs that run best on a GPU or GPNPU. But even if the Linux kernel task manager adds an ML inference component, it won’t be a monstrous multi-TOP/s network. Rather, it would be something far more lightweight that can run on modern applications CPUs running the rest of the OS kernel.”

Roddy noted that machine learning potentially could enhance power management in devices, for instance by learning the behavior of each user in order to tune the power/performance profile of a business executive’s cell phone differently than a high school student’s. “But that hypothetical is an enhancement of an existing OS function, not a radical re-write of the underlying functionality of an OS.”

Conclusion
One potential scheme several sources discussed is that, in order to meet energy needs, training might be done, as it is now, on classical OSes, while inference might be performed with a smaller “OS” that’s essentially just a job scheduler. This also presumes a world in which, in order to save energy, there will be less reliance on compute in the cloud and more on edge devices.

On the other hand, said Steven Latre of imec, “I would assume and predict that in the future, training and inference is going to be much more of a natural combination, and not that separate as we have right now, where we train it once and we deploy it somewhere else.” That said, he still sees a future in which AI will naturally bifurcate.

“There are two complementary directions in which AI is evolving. Large-scale compute AI, which can be linked with other types of scientific computing, HPC types of workloads. With those very large models, the issue at a hardware level is bottlenecks that might arise in communication and compute. So the main challenge is to do the scheduling in the right way to alleviate all the different bottlenecks,” Latre said. “In the more edge situations, it’s a completely different approach, where the concerns are less bottleneck-oriented, but more energy constrained with more latency constraints because they’re in more real-time situations. So that probably also warrants not just a single OS, but different types of OSs for different situations.”

Discussing the possibilities of neural networks, Snelgrove offered this food for thought: “If you look up von Neumann’s original paper [2], it starts with pictures of neurons. If we’re at ENIAC now, it’s the 1950s. We shouldn’t think what it’s going to look like in the 2020s, we should just try to get to 1970. From there, we can discuss 1990.”

References

1. Sechuga, G. Preventing Fraud with AI and IBM Telum Processor: The Value of Investing in IBM z16
IBM Blog, 2022
https://community.ibm.com/community/user/ibmz-and-linuxone/blogs/gregory-sechuga/2022/07/05/preventing-fraud-with-ai-and-ibm-telum-processor

2. von Neumann, J. First Draft of a Report on the EDVAC, Contract between the US Army Ordnance Department and the University of Pennsylvania. 6/30/1945.
https://web.archive.org/web/20130314123032/http://qss.stanford.edu/~godfrey/vonNeumann/vnedvac.pdf

Source: https://semiengineering.com/disaggregating-operating-systems/