The power consumption of a device is influenced by every stage of the design, development, and implementation process, but identifying opportunities to save power can no longer be just about making hardware more efficient.
Tools and methodologies are in place for most of the power-saving opportunities, from RTL down through implementation, and portions of the semiconductor industry already are using them. Both are considered mature, and so are the standards for defining power intent.
Huge opportunities remain for additional power and energy savings, but many of those involve questioning system-level decisions that have been blindly accepted for generations and across many implementation nodes. Some of those decisions need to be reconsidered because they are preventing the construction of larger and more complex designs.
“There are three horsemen in the mix — power, energy, and thermal,” says Rob Knoth, product management director in the Digital & Signoff Group at Cadence. “They’ve always been there, and power is probably the most prominent, but energy has come to the forefront over the last few years. Now we’re seeing thermal show up. All of them are interesting because you can attack them at specific points in your flow with specific tools.”
And therein lies a problem. “The architect’s dilemma is that you need low-level information to make early estimates,” says Frank Schirrmeister, vice president of solutions and business development at Arteris IP. “This dilemma has never been resolved and probably will not be resolved in my business lifetime. In order to make architectural decisions as early as possible, we need a set of information, a set of tools, and a set of abilities to support these decisions. We need these decisions as early as possible, but they also need to reflect the implementation effects as accurately as possible.”
To add to that, power cannot be presented as a single number. Some people are concerned about total energy, because that may impact battery life. Others are more concerned about peak power because that can cause operational problems on a chip, or power over time, which may create thermal problems.
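To make the distinction concrete, the following minimal Python sketch computes three views of the same power trace. The trace values, the sample period, and the function names are hypothetical and chosen only for illustration. The sliding-window average is used here as a rough stand-in for sustained power that might matter thermally, not as a thermal model.

```python
from typing import Sequence


def total_energy_j(power_w: Sequence[float], dt_s: float) -> float:
    """Total energy: sum of power samples times the sample period."""
    return sum(power_w) * dt_s


def peak_power_w(power_w: Sequence[float]) -> float:
    """Highest instantaneous power sample in the trace."""
    return max(power_w)


def max_window_average_w(power_w: Sequence[float], window: int) -> float:
    """Highest average power over any run of `window` consecutive samples."""
    if not 1 <= window <= len(power_w):
        raise ValueError("window must be between 1 and the trace length")
    running = sum(power_w[:window])
    best = running
    for i in range(window, len(power_w)):
        # Slide the window forward by one sample.
        running += power_w[i] - power_w[i - window]
        best = max(best, running)
    return best / window


# Hypothetical trace in watts, sampled every millisecond.
trace = [1.0, 1.0, 5.0, 1.0, 1.0, 3.0, 3.0, 3.0]
dt = 0.001

energy = total_energy_j(trace, dt)          # what a battery budget sees
peak = peak_power_w(trace)                  # what the supply network sees
sustained = max_window_average_w(trace, 3)  # rough proxy for heating
print(energy, peak, sustained)
```

In this made-up trace the instantaneous peak occurs at the third sample, while the highest sustained average comes from the final three samples, which is why a single number cannot describe all three concerns.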
To do the analysis, you need to know exactly how the system is going to be used. “Imagine you have an SoC with 100 different blocks,” says Ninad Huilgol, founder and CEO of Innergy Systems. “They’re all interacting together, and you don’t know how they’re going to produce a power density peak beforehand. When you have a simulation that is running, they all interact together to suddenly produce a power density peak.”
Various markets are focusing on different aspects. “Edge AI, or edge intelligence, has different care-abouts and different questions than a data center hyperscaler compute-type application,” says Cadence’s Knoth. “Both of them, however, are going to be pushing certain aspects of the technology, some of which reinforce each other, some of which are separate. Edge is going to care more about certain aspects of energy because of the battery life. And it’s critical to think about what you run in software versus what you run in hardware. What do you communicate back to your base station for them to run and send back to you? There are some very tricky problems where the IoT industry is uniquely suited to lead and to innovate. It doesn’t mean they’re the only leader. The people that are developing massive hyperscale compute data centers are leading in a whole different class. Frequently, they’re the ones pushing hardest, because you look at the massive amount of infrastructure dollars that are required to field that compute.”
RTL and implementation techniques
Power-saving techniques have been applied at the RTL and implementation levels for a number of years, but further power and energy savings are possible. At the implementation level, newer technologies are adding problems that, if not addressed, will lead to wasted power.
“Technologies have conspired to make it much more difficult to supply voltage reliably,” says Marc Swinnen, director of product marketing at Ansys. “You are going to have some voltage drop, and often people just build in a margin, saying I may see up to 100 millivolts drop. My timing then has to assume that every cell could be that much slower. Obviously not every cell is going to see that maximum voltage drop, so the more accurately you can model the actual voltage drop, the more accurately you can design your power distribution network to avoid this error, and you can back off from this voltage drop margin. You’re trying to drop that margin and that can have a huge impact.”
At the RT level, clock gating and power gating have been in use for a long time. While they reduce the power and energy spent on a defined task, they do nothing to help identify whether that task was itself optimal in terms of power for the function being performed.
“We have a term called ideal power,” says Knoth. “It is an attempt to identify wasted activity. For example, if you have a block where the clock is free running, and it’s actually under reset, you could have gated that clock. We can analyze the toggles going on inside that block, add up the power due to those toggles from that hierarchy, and then display those in a report that shows where power is wasted. Using this methodology, we saw hardware engineers improving what they’re doing from a design methodology perspective. There’s a whole bunch of other deeper scrubbing techniques that can be used.”
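The kind of accounting Knoth describes can be sketched in a few lines of Python. This is not Cadence's implementation. The record layout, block names, and energy values are all hypothetical, and the only rule applied is the one from his example: clock activity while a block is held in reset is counted as waste.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class CycleSample:
    """One observed clock cycle for one block (hypothetical record layout)."""
    block: str
    in_reset: bool
    clock_active: bool
    toggle_energy_pj: float  # energy attributed to toggles in this cycle


def wasted_energy_by_block(samples: Iterable[CycleSample]) -> Dict[str, float]:
    """Sum toggle energy for cycles where the clock ran while in reset."""
    wasted: Dict[str, float] = {}
    for s in samples:
        if s.in_reset and s.clock_active:
            wasted[s.block] = wasted.get(s.block, 0.0) + s.toggle_energy_pj
    return wasted


# Hypothetical samples: the DMA clock free-runs under reset, the UART's is gated.
samples = [
    CycleSample("dma", in_reset=True, clock_active=True, toggle_energy_pj=2.0),
    CycleSample("dma", in_reset=True, clock_active=True, toggle_energy_pj=2.0),
    CycleSample("dma", in_reset=False, clock_active=True, toggle_energy_pj=2.5),
    CycleSample("uart", in_reset=True, clock_active=False, toggle_energy_pj=0.0),
    CycleSample("uart", in_reset=False, clock_active=True, toggle_energy_pj=1.0),
]

report = wasted_energy_by_block(samples)
print(report)
```

A real flow would apply many more rules than this single reset check, but the reporting structure (attribute toggle energy to a hierarchy, then flag the portion that did no useful work) is the same idea.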
Looking into the RTL can provide other possible power savings. “A power artist will suggest edits to your RTL by looking at how you do things,” says Ansys’ Swinnen. “It could be that you have implemented a function this way, but if you implement the same function a different way, you will save power and achieve the same function. There is a library of optimizations that will automatically scan through RTL and identify each of the places it can upgrade the RTL to a more power-efficient implementation. It will tell you how much power it would save based on estimates and will actually implement those if you approve.”
Few people would dispute that the earlier tradeoffs can be evaluated, the bigger the impact they can have. “The broader your scope, the more parties you bring to the table, the more you step back and look at it earlier, the more you start seeing bigger opportunities,” says Knoth. “These are bigger trends that go beyond making the one widget you’re producing better. You really have to look at how that widget fits inside the gizmo, which fits inside the product in the data center that gets connected to the hydroelectric power plant or the solar farm.”
The problem is that without estimates that are accurate enough, bad decisions are also possible. “As designs have become larger and more complex, it has become increasingly difficult to produce accurate estimates,” says Schirrmeister. “For example, you need floor-planning information to estimate how many registers are needed in a path across silicon, because propagating signals across large chip sizes is incredibly difficult and cannot be done in one clock cycle. For a NoC, we try to optimize the number of registers, which has an impact on the power consumption and the amount of interconnect you carry around on chip. We annotate, from the .lib, all the way back to NoC generation, early estimates of how long the path will be. Will it have to be refined later on? Absolutely. The multi-dimensional reality of the problem makes it very hard, especially where there are vertical dependencies.”
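As a rough illustration of why floorplan information feeds into register counts, the sketch below estimates how many pipeline registers a path needs from its length. It assumes wire delay grows linearly with distance, which is a simplification, and it ignores setup time and clock skew. The delay-per-millimeter figure and the function name are hypothetical, and this is not how any particular NoC tool produces its estimate.

```python
import math


def pipeline_registers(path_mm: float, delay_ps_per_mm: float, clock_ghz: float) -> int:
    """Estimate registers needed so each path segment fits in one clock cycle.

    Assumes delay is linear in path length (hypothetical simplification).
    """
    if clock_ghz <= 0:
        raise ValueError("clock_ghz must be positive")
    period_ps = 1000.0 / clock_ghz
    # Number of clock cycles the signal needs to cross the path.
    cycles = math.ceil(path_mm * delay_ps_per_mm / period_ps)
    # A path that fits in one cycle needs no intermediate registers.
    return max(cycles - 1, 0)


# Hypothetical 12 mm route at an assumed 100 ps/mm.
print(pipeline_registers(12.0, 100.0, clock_ghz=1.0))
print(pipeline_registers(12.0, 100.0, clock_ghz=2.0))
```

Even this crude estimate shows the dependency Schirrmeister points to: the register count, and therefore part of the power, cannot be known until there is at least an early guess at how long the path will be.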
Thermal analysis requires looking at long timeframes and realistic workloads, which most likely means running actual software. “Most of the industry uses their RTL code mapped to an emulator, runs real software workloads on that platform, and gets vectors out from which they do a power estimation,” says Knoth. “With multiple iterations a day, they can tune the software to more effectively use the power features in the hardware. Overnight, they’re able to make tweaks to the hardware. Now you have this system-level co-optimization where you’re hunting down wasted power and ensuring you’re creating the most optimal system possible.”
The industry always has looked for ways to insert abstract models instead of using RTL, both because it may run faster and because the analysis can be performed before RTL is ready. “Analyzing power consumption of software execution has been relegated to emulation platforms until now,” says Innergy’s Huilgol. “One technique that can help is building power models of the hardware that could be simulated in software environments. These models can provide accurate feedback about both average and instantaneous power consumption of various hardware modules as software runs. This enables hardware and software co-optimization for power before tape-out.”
Similar approaches were taken for functional verification of hardware and software in the past, and now attempts are being made to apply that to power. “We are not inventing black magic, and we can’t fight physics,” says Huilgol. “But you don’t need to run detailed power simulations all the time. We take a tiny sampling at the block level, combine those together and run it at subsystem level, system level, emulation, software, etc. There are two aspects to power. One is data path, and the other is control path. We account for mainly the control path, but when there are data-path dependencies, there is a facility in our models to make them data-path aware. These are statistical power models that operate on a transaction model. How do you improve the resolution? You can have smaller cycles or single cycles. But if your resolution is 15 cycles, or more, quite large transactions, there will be some statistical error that is captured.”
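The general idea of a transaction-level statistical power model can be sketched as follows. This is not Innergy's method. It is a minimal illustration in which the transaction names, energy samples, and function names are hypothetical: each transaction type is characterized once by its average energy, and that average is then reused to estimate power for a sequence of transactions observed in a time window.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def characterize(samples: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Build a model: average energy (pJ) per transaction type."""
    by_type: Dict[str, List[float]] = defaultdict(list)
    for txn_type, energy_pj in samples:
        by_type[txn_type].append(energy_pj)
    return {txn_type: mean(values) for txn_type, values in by_type.items()}


def estimate_power_mw(model: Dict[str, float], txns: Iterable[str], window_ns: float) -> float:
    """Estimate average power over a window from the transactions seen in it.

    Energy in pJ divided by time in ns gives power in mW.
    """
    if window_ns <= 0:
        raise ValueError("window_ns must be positive")
    energy_pj = sum(model[t] for t in txns)
    return energy_pj / window_ns


# Hypothetical block-level samples from a detailed power simulation.
block_samples = [("read", 10.0), ("read", 12.0), ("write", 20.0), ("write", 22.0)]
model = characterize(block_samples)

# Reuse the model at a higher level, where only transactions are visible.
power = estimate_power_mw(model, ["read", "write", "read"], window_ns=10.0)
print(model, power)
```

Because each transaction is replaced by its average, any variation within a transaction type is lost, which is one source of the statistical error mentioned in the quote.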
Rethinking the past
In the past, Moore’s Law made it quite easy to migrate from one node to the next, making use of extra gates, higher performance, and lower power. That meant that continuity across time was important, especially to ensure that existing software would continue to run on new hardware.
Over time, that has baked in some inefficiencies that will be difficult to break free from. “A lot of things weren’t possible in the past,” says Knoth. “Perhaps it was because the process node couldn’t fit all the compute in the semiconductor that would be deployed on the edge. But now it can. Perhaps you didn’t have the tools to do the analysis with the right accuracy in the right amount of time, or because the packaging technology wasn’t available. But every now and then you have to take a breath, step back, revisit the landscape, and ask, ‘Did we correctly optimize this equation, or did we just do the best we could?’ At times we need to put on our scientist cap and not be afraid to question some of those fundamental principles that we’ve codified.”
It’s important to consider the complexity of integration. “There are two levels of complexity — the application complexity going up on the top, and then the implementation complexity going down on the semiconductor technology,” says Schirrmeister. “That is the number of transistors we’re dealing with. Because you have the application complexity, with the number of functions increasing as much as it has, and continuing to increase, you have to deal with things like shared memory, coherency, and so forth. If you don’t have cache, you always have to move things around. Cache coherency was a solution to a problem that introduces a new problem.”
Processors have been driven by performance. “Adding a branch predictor or speculative execution to a processor will increase the number of gates in the circuit, thus increasing both dynamic and static power consumption,” says Russell Klein, program director for the Catapult HLS team at Siemens EDA. “But those features increase the performance of the computation running on the processor. So power definitely goes up, but energy, which is power multiplied by time needed to perform the computation, may go up or down. It depends on the ratio of performance increase to the power increase. If, say, power goes up 20% but performance improves by only 10%, total energy for the computation increases.”
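Klein's arithmetic can be checked with a short Python sketch. It assumes that a performance improvement of a given fraction shortens the run time by the corresponding factor, so relative energy is the power ratio divided by the speedup. The function name is hypothetical.

```python
def relative_energy(power_increase: float, perf_increase: float) -> float:
    """Energy of the new design relative to the old one.

    Energy = power x time, and time scales as 1 / (1 + perf_increase),
    so the ratio is (1 + power_increase) / (1 + perf_increase).
    """
    if perf_increase <= -1.0:
        raise ValueError("perf_increase must be greater than -1")
    return (1.0 + power_increase) / (1.0 + perf_increase)


# The case from the text: power up 20%, performance up 10%.
worse = relative_energy(0.20, 0.10)

# A contrasting hypothetical case: power up 20%, performance up 50%.
better = relative_energy(0.20, 0.50)

print(f"{worse:.3f} {better:.3f}")
```

A result above 1.0 means the computation costs more energy than before, and a result below 1.0 means it costs less, even though power went up in both cases.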
Power, energy, and thermal cannot always be optimized in a simple manner. “It may seem counterintuitive, but increasing performance can reduce average energy consumption for some workloads,” says Maurice Steinman, vice president of engineering for Lightelligence. “Such workloads can benefit from the so-called ‘race to idle,’ where deep power savings states can be entered for extended durations if work can be completed faster. Consider workloads that maintain a predictable (but less than 100% utilization) compute demand profile, say 25% of available performance. One approach may reduce operating frequency to 25% (and accordingly reduce operating voltage). The device would now remain fully active, but at reduced power. Another approach would endeavor to complete the work quickly thus enabling drastic power savings — 25% on, 75% off, where off could require zero or near-zero energy consumption thus resulting in lower average power than constant operation at 25% clock rate. It may even be advantageous to overclock/overvoltage to further increase the off time to more than 75%.”
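The two approaches Steinman describes can be compared with a simple model. The sketch below is a minimal illustration under stated assumptions, not a characterization of any real device: dynamic power is assumed to scale with frequency times voltage squared, leakage is assumed to scale linearly with voltage, the cost of entering and leaving the idle state is ignored, and all wattages and scale factors are hypothetical.

```python
def avg_power_slow_and_steady(dyn_w: float, leak_w: float,
                              freq_scale: float, volt_scale: float) -> float:
    """Always on, at reduced frequency and voltage.

    Assumes dynamic power ~ f * V^2 and leakage ~ V (simplifications).
    """
    return dyn_w * freq_scale * volt_scale ** 2 + leak_w * volt_scale


def avg_power_race_to_idle(dyn_w: float, leak_w: float,
                           duty: float, idle_w: float) -> float:
    """Full speed for `duty` of the time, deep idle for the rest."""
    return duty * (dyn_w + leak_w) + (1.0 - duty) * idle_w


# Hypothetical leaky device: voltage can only drop to 80% at 25% frequency.
slow_leaky = avg_power_slow_and_steady(1.0, 0.5, freq_scale=0.25, volt_scale=0.8)
race_leaky = avg_power_race_to_idle(1.0, 0.5, duty=0.25, idle_w=0.01)

# Same comparison with much lower leakage.
slow_tight = avg_power_slow_and_steady(1.0, 0.05, freq_scale=0.25, volt_scale=0.8)
race_tight = avg_power_race_to_idle(1.0, 0.05, duty=0.25, idle_w=0.01)

print(slow_leaky, race_leaky)
print(slow_tight, race_tight)
```

Under these assumptions, race to idle wins when leakage is a large share of active power and loses when leakage is small, which is consistent with the claim that the benefit applies to some workloads and devices rather than all of them.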
Balancing hardware and software
One of the largest balancing acts related to system complexity and power is establishing the hardware/software boundary. “Any function implemented in software is going to be orders of magnitude slower than the equivalent function implemented in hardware,” says Siemens’ Klein. “Anything in software is, by definition, not optimal. Highly optimized software on a very efficient processor can’t approach the efficiency of even a bad hardware implementation.”
Partitioning decisions are becoming easier, says Klein. “What should be left in software, what should be done on a processor, and what makes more sense to create a custom hardware accelerator that is a sidecar for that processor? That is where you start seeing huge 100X, 1,000X type of time or power reductions, depending upon where you’re optimizing your system.”
As performance improvements become more difficult, those kinds of approaches become essential. “Bottom line, bigger processors are less energy-efficient, so getting a bigger processor to address your performance needs only makes sense if you don’t care about power,” says Klein. “The right answer is to move the heavy lifting off the CPU and into a bespoke accelerator.”
That approach has seen increasing popularity. “Dedicated hardware accelerators and co-processors can increase a system’s performance due to diminished performance gains by moving to more advanced nodes,” says Andy Jaros, vice president for IP sales and marketing at Flex Logix. “Dedicated accelerators alleviate the processing burden on CPUs from expending tremendous compute cycles to execute complex algorithms. Utilizing eFPGAs for those dedicated hardwired accelerators provides needed power efficiency, yet still maintains programmability when the workload changes.”
Whenever you can specialize, there are huge opportunities for gains. “Today it has become a lot easier to specialize a processor by adding instructions,” says Schirrmeister. “Most of these instruction customizations are done for the purpose of low power. I’ve seen cases where an added instruction in the processor allowed you to stay in half the memory. That’s huge from a power perspective. But while you’re doing that in the isolated island, the overall complexity of what you’re trying to do has increased.”
Or you can move that function all the way into hardware. “The other solution is to offload computationally complex operations into bespoke accelerators,” says Klein. “High-level synthesis (HLS) is the easy way to do this. It’s still hardware design, so you still have to have smart engineers to make it work. But with HLS you are starting from a software C or C++ algorithm. There is no interpretation of the algorithm, which is a manual process that is slow and error-prone. And a golden reference is readily available in the form of the original function from software, which makes verification a lot easier.”
All of these choices are becoming easier. “In the past, the big problem with making a decision at the architecture level was that you had to reassess this decision later on in the project, but the flows were not connected,” says Schirrmeister. “For cases like configurable processors and the NoC, the flows have been connected. If you go back, it takes time to rerun the tools, but it’s no longer people having to manually verify the architectural decision. Automated generation allows you to run through more data points.”
Optimizing for power, energy, or thermal alone is hard enough. The need to address all three is growing, and while they are interconnected, it is not always clear which should be optimized or how. Those decisions can only be made by looking at the entire system. In the past, modeling, analysis, and design flows made this more difficult, especially when the problem crossed the hardware/software boundary, but more tools are appearing. The task is still not easy, but as industry awareness grows and more people want to tackle the problem, better tools and flows will become available.