Circuit aging is emerging as a first-order design challenge as engineering teams look for new ways to improve reliability and ensure the functionality of chips throughout their expected lifetimes.
The need for reliability is obvious in data centers and automobiles, where a chip failure could result in downtime or injury. It also is increasingly important in mobile and consumer electronics, which are being used for applications such as in-home health monitoring or for navigation, and where the cost of the devices has been steadily rising. But aging also needs to be assessed in the context of variation models from the foundries, different use cases that may stress various components in different ways, and different power and thermal profiles, all of which makes it harder to accurately predict how a chip will behave over time.
“The foundries provide the SPICE models first,” said Sharad Mehrotra, vice president, R&D, in the Digital & Signoff Group at Cadence. “Then we, as an industry, figure out how to do the library characterization, and work to fold that into a static timing analysis methodology. That flows into implementation, where those tools are also cognizant of the variability effects, so the right choices can be made for device sizes, Vt (threshold voltage) types, and so on, to achieve the PPA given the variability constraints. Similar things are happening now with device aging as that becomes a first order concern, especially for applications like automotive and HPC. These chips are going through a lot of stress in the data center environment, so it’s important for the data center providers to be able to predict how these chips will perform over time, not just when they’re fresh out of the factory.”
Fig. 1: Reliability and PPA interdependence. Source: Cadence/Arm/Arm DevSummit
These issues are particularly pronounced at the most advanced nodes, and in complex heterogeneous packages. “As we’re pushing devices harder, we are trying to squeeze every last ounce of compute power out of our designs,” said Lee Harrison, director of product marketing in the Tessent group at Siemens Digital Industries Software. “We’ve also got various challenges in terms of temperature and voltage. It’s the age-old problem of whenever you’re designing a chip, you’re always right up against the timing margin. There’s never a huge amount of slack there to start with, so we’re always pushing the bounds of the technology, trying to get things to go as fast as possible.”
But maximizing performance accelerates aging, which in turn reduces reliability. How to balance that equation is a challenge. “There are different points of view on aging in ICs,” noted André Lange, group manager at Fraunhofer IIS’ Engineering of Adaptive Systems Division. “First, technology guys want to know what happens on a microscopic scale. They want to draw conclusions from the information on how to reduce aging in their devices, i.e., on how to make them more reliable. Nevertheless, at some point in time, there will be nothing that can be improved further. This is where the second stage starts — knowing about the remaining uncertainty and aging. This is typically investigated during technology qualification.”
For instance, AEC-Q100 requires foundries to investigate the impacts of hot carrier injection, bias-temperature instability, or time-dependent dielectric breakdown.
“Designers have to deal with this remaining aging and degradation,” said Lange. “We see that foundries continue adding reliability models with improved accuracy into their PDKs to allow designers to investigate the reliability of their designs according to the requirements of their applications. While reliability of ICs has been an important topic in automotive for years, we see its importance increasing in other market segments, such as industrial, medical, and even consumer. Designers pay attention to the impacts of aging more and more. However, challenges remain to be addressed, including verification effort, availability of models, definition of mission profiles, and setting reasonable stress conditions for validation.”
One solution is to add more timing slack to minimize the effects of circuit aging on timing. “This can be achieved by introducing more pessimism to account for the aging effects,” said Ashraf Takla, CEO of Mixel. “For example, additional de-rating factors can be added by using an aging-aware static timing analysis flow.”
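The effect of such a de-rating factor can be sketched with a toy slack calculation. Everything here is an invented illustration, assuming a simple multiplicative derate on path delay; the 5% derate, 20 ps margin, and function name are not from any tool's flow:

```python
# Toy sketch of an aging derate in a setup-slack check. The 5% derate and
# 20 ps margin are invented illustration values, not tool or foundry numbers.
def aged_slack_ps(clock_period_ps, path_delay_ps, aging_derate=1.05, margin_ps=20.0):
    """Derate a fresh path delay for end-of-life aging and return setup slack."""
    aged_delay_ps = path_delay_ps * aging_derate
    return clock_period_ps - aged_delay_ps - margin_ps

# A path with 10 ps of fresh slack becomes a violation once the derate is applied.
fresh_slack = 1000.0 - 970.0 - 20.0          # +10 ps at time zero
end_of_life = aged_slack_ps(1000.0, 970.0)   # negative: the aged path fails
```

The pessimism cuts both ways: a larger derate protects against end-of-life failures, but it also forces the implementation tools to close timing against a tighter effective budget, costing power or area.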
High-temperature and long-lifetime applications such as automotive, and high-speed applications such as AI, accelerate aging effects. That makes simulation essential.
“More margin needs to be built-in to account for deterioration of performance with aging,” Takla said. “Safe operating area (SOA) verification is also mandatory to ensure all devices are operating within the maximum allowed limits of the technology. In some cases, metal aging is of concern, and then extensive EM verification is also required.”
Different use cases
How quickly devices age often comes down to different usage patterns. Heavy usage generally accelerates aging, but how a heterogeneous device will actually be used isn’t always obvious in advance.
“The more the circuits switch, the more they age, which poses challenges for timing because one part of the circuit may age faster than another because it gets used more,” noted Marc Swinnen, product marketing director at Ansys. “Today’s methodologies typically look at aging across the board, like the entire chip ages and everything ages equally, but that’s not reality. Things age differentially, and that’s a difficult thing to put into a flow.”
That also impacts thermal gradients. Localized temperatures need to be calculated, along with chip- or package-wide temperatures and joule self-heating, which occurs when the currents cause local heating on particular wires. “They can get much warmer than the surrounding circuits, and the heat bleeds out. That feeds into the overall aging equation,” Swinnen explained.
This is another methodology problem. “The models for aging transistors, how they age over time given all the information — those exist, and they have for a while, and they are pretty well established,” he said. “The foundries do that, but that is not essentially the problem. The problem is how to apply this aging information to a 200 million-instance design, especially since there’s differential aging.”
From a physics perspective, excessive heat, excessive activity, and higher voltages tend to accelerate aging, but not equally. For example, bias temperature instability and hot carrier injection happen over time, but slowly, while electromigration seems to accelerate. The trick is being able to detect those issues and make adjustments at the correct time.
“To detect them, process monitors can be used,” said Adam Cron, distinguished architect at Synopsys. “We have temperature sensors to keep track of what’s happening locally inside the design. Path margin monitors can be sprinkled around, especially in the active areas if we know where that area is, to help find the issue and then adjust Vmin across time to handle the difference or change frequency, or things like that, to handle aging and delay. Then, toward the end, maybe a logic BiST periodically could be used to catch the catastrophic electromigration issues like shorts and opens.”
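One way a monitor-driven adjustment like the one Cron describes could work is sketched below. This is a hypothetical control loop, not a Synopsys API; the margin threshold, voltage step, and ceiling are made-up illustration values:

```python
# Hypothetical control loop: raise the operating voltage floor (Vmin) when an
# on-chip path margin monitor reports shrinking timing slack due to aging.
# All thresholds and step sizes below are invented for illustration.
def adjust_vmin_mv(current_vmin_mv, monitor_margin_ps,
                   margin_floor_ps=50.0, step_mv=10.0, vmax_mv=900.0):
    """Return the new Vmin after one reading from a path margin monitor."""
    if monitor_margin_ps < margin_floor_ps:
        # Aged paths are eating into the margin: nudge Vmin up to restore speed,
        # but never beyond the technology's maximum allowed voltage.
        return min(current_vmin_mv + step_mv, vmax_mv)
    return current_vmin_mv
```

The same reading could instead trigger a frequency reduction, as Cron notes; the point is that the monitor turns aging from a static sign-off assumption into a quantity the system can react to in the field.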
Addressing aging issues
Aging has both physical and electrical aspects. “For example, voltage thresholds are changing and source-drain channels are starting to collapse,” said Pradeep Thiagarajan, principal product manager in Siemens’ Custom IC Verification Division. “There are many other phenomena that affect different devices in different ways, but essentially three main effects stand out — HCI, NBTI (negative bias temperature instability), and PBTI (positive bias temperature instability).”
Major foundries are aware of these effects and have taken steps to account for them. “A lot of the foundries are now supporting fairly accurate device aging models, which is how the device degrades over time when it’s stressed at a certain temperature and at a certain voltage,” Cadence’s Mehrotra said. “These device models can be brought into a reliability simulation in SPICE. Then, if the same chain of events happens from there and we are able to comprehend it, we can use library characterization and put together a static timing analysis (STA) methodology. Once we know how to do STA with that, then we can optimize it further upstream in our implementation tools, ECO, post-route, and so on.”
Fig. 2: Aging phenomena. Source: Cadence/Arm/Arm DevSummit
Some of this is dealt with through a standardized aging model, which accounts for HCI, NBTI, and PBTI, as well as other aging effects. “Without that, all the foundries and IDMs had to rely heavily on their own homegrown aging models, which represented the process technology behavior,” Siemens’ Thiagarajan said. “Also, the EDA vendors offer different proprietary aging solutions. The foundries needed to support all these multiple solutions to fulfill their customer needs, and they needed to support various model interfaces to integrate the aging models into circuit simulators due to the lack of a common industry interface solution. The EDA suppliers were also required to support their own unique interface for every foundry, and because of this, there was a non-standard approach, which added complexity and increased support costs for the supplier as well as the end user. This really drove home the need for an industry-standard aging platform that enables aging modeling, aging simulation, and analysis to support any degradation mechanism. This led to the OMI (open model interface).”
OMI started as an investigation by the Si2 Compact Model Coalition (CMC) in 2013. The first version of OMI was released five years later, based on TSMC’s TMI interface. The interface provides users with the flexibility to customize the CMC standard models to fit their own applications, but without touching the actual native implementation of the CMC standard models. The initial goal was to enable foundries, IDMs, and EDA vendors to support a single common standard interface.
Things didn’t work out quite as planned, however. Every foundry has a different approach, and no single standard exists for how different models are expressed. “In addition to OMI, TMI and URI are other types of models in use today,” Mehrotra said. “The important thing is that a simulator can consume all of these models, then do a reliability simulation with them. This part of the methodology is getting quite accurate and well calibrated against silicon data that the foundries have. So, the foundry part is fairly well established. What is not as well established is the methodology of consuming these foundry models.”
In a SPICE simulation, this is straightforward. “You put together a certain stress condition, temperature, duty cycle, calendar age, voltage,” Mehrotra said. “These are the four parameters for, let’s say, NBTI or BTI aging. You do a simulation saying, ‘After this calendar year, this number of calendar years, with these stress conditions, this is what the device aging will be.’ The tricky part is that the device is not aging uniformly over the lifecycle of operation, so you might be stressing it at a certain voltage and temperature over one part of it, then you recover and do a different operating characteristic, and so on. A single simulation will not give you the perfect answer. The challenge is dealing with a variable mission profile for the device. Mission profile is a term of art, which says how the different stress conditions are changing over time. How do you deal with a variable mission profile? How do you deal with what we think of as a piecewise linear or piecewise constant aging profile, which is a certain amount of time at this condition, certain amount of recovery, certain amount of time at that condition, recovery and so on? Then, at the end of it, what does my device aging look like? Can you model all of this correctly in your STA so that you have the right aging degradation for each instance in your design? And then, can you do STA based on that?”
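A piecewise-constant mission profile can be made concrete with a toy calculation. The sketch below accumulates threshold-voltage shift across (voltage, temperature, time) stress segments using a simplified BTI-style power-law model; the model form is a common textbook approximation, and every coefficient here is a placeholder, not a foundry-calibrated value:

```python
import math

# Illustrative only: a piecewise-constant mission profile evaluated with a
# simplified BTI-style power law, dVth = A * exp(-Ea/kT) * V^gamma * t^n.
# All coefficients are placeholders, not foundry-calibrated values.
K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K

def bti_shift_mv(voltage_v, temp_k, hours, a=50.0, ea_ev=0.1, gamma=3.0, n=0.2):
    """Toy Vth shift (mV) after `hours` at a constant voltage and temperature."""
    return a * math.exp(-ea_ev / (K_BOLTZMANN_EV * temp_k)) * voltage_v**gamma * hours**n

def mission_profile_shift_mv(segments, n=0.2):
    """Accumulate Vth shift over (voltage_v, temp_k, hours) stress segments.

    Uses an effective-time trick: each segment resumes the degradation curve
    from the shift already accumulated, instead of restarting at t = 0.
    """
    shift = 0.0
    for voltage_v, temp_k, hours in segments:
        rate = bti_shift_mv(voltage_v, temp_k, 1.0, n=n)  # shift after 1 hour here
        # Time at this stress that would have produced the current shift.
        t_eff = (shift / rate) ** (1.0 / n) if shift > 0 else 0.0
        shift = bti_shift_mv(voltage_v, temp_k, t_eff + hours, n=n)
    return shift
```

Even this toy version shows why a single simulation falls short: the shift after a hot, high-voltage segment followed by a cool, low-voltage one differs from any single constant-stress run of the same total duration. (Recovery effects, which Mehrotra also mentions, are not modeled here at all.)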
Ansys’ Swinnen added that the traditional methodology has been to create libraries for zero day, and 1, 5, and 10 years, to establish timing. “Then you can time the circuit across different ages in its life, but that assumes that everything ages equally,” he said. “You need to input the expected activity of each block — like that the transmit block is always active, but some other exceptional block only rarely gets activated, so that’s not going to age much. That means any path going between the two blocks is going to see its source transistors age much more than its receiving transistors, which means the setup and hold starts getting tricky. It’s not that they both tracked each other’s aging characteristics; they’ve gotten out of sync. A methodology is needed that can capture the activity per block or per region, and then ascribe the right library to those elements, then time it with that differential set of timing characteristics for each block.”
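The setup problem Swinnen describes can be illustrated with a toy slack check that applies separate aging derates to the launching and capturing sides of a path. All delays, derate values, and the setup requirement below are invented examples:

```python
# Hedged illustration: why a single chip-wide aging derate can hide a setup
# violation when the launching block ages faster than the capturing one.
# All delays, derates, and the setup requirement are invented example numbers.
def setup_slack_ps(period_ps, data_path_ps, capture_clock_ps,
                   data_derate, clock_derate, setup_req_ps=30.0):
    """Setup slack with independent aging derates on data and capture-clock paths."""
    arrival = data_path_ps * data_derate
    required = period_ps + capture_clock_ps * clock_derate - setup_req_ps
    return required - arrival

# Uniform 3% aging on both paths: the path still meets timing.
uniform = setup_slack_ps(1000.0, 950.0, 40.0, 1.03, 1.03)
# Differential aging (busy data path +8%, idle clock path +1%): now it fails.
differential = setup_slack_ps(1000.0, 950.0, 40.0, 1.08, 1.01)
```

With a uniform derate, the launch and capture sides drift together and much of the degradation cancels; when one side ages faster, the skew lands directly on the slack, which is exactly the case a chip-wide library swap cannot capture.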
In principle this is do-able. “It’s just that in practice the flows are not necessarily in place yet,” Swinnen said. “It depends on how much effort you want to put into this to get a usable aging flow. Also, where do you capture this aging activity? And temperature is involved, as well. You need to know the average temperature of this block versus the average temperature of that block, and that’s a whole thermal analysis you need to do. In principle, it’s solvable. It’s just complex, and a lot of data has to be pulled together into the same place and timed, and that’s been the problem.”
Additionally, Siemens’ Thiagarajan said the first step is a fresh simulation without any stress from voltage or temperature effects. “From that, you get the baseline from a transient analysis. The second step is to apply stress on top of it such as an extreme temperature and voltage condition required for a certain application. Once that is run, you can then see the degradation of the device based on the aging model that was used, and you can extrapolate the age from that analysis and feed that back. The third step is to run the actual simulation at the advanced age value to see the expected degradation of a signal profile at the right time, or the fall time, or the amplitude, or any skew in the signal on a clock or a DC signal. As part of this, the self-heating aspect of devices also must be accounted for, because there can be localized heating for each MOS device. So that is modeled, as well. Then, you need to assess how the change in temperature would affect the aging.”
While still evolving, the majority of pieces are in place today for mature process nodes that are well understood. But much work still needs to be done at future nodes, and in heterogeneous designs in advanced packages.
“Even for some of the advanced finFET process nodes that are production-tested and understood, the pieces are there,” Thiagarajan said. “However, for the newer advanced nodes that we’re now looking into, which really push the boundaries of physics with smaller channel lengths as we approach 3nm and 2nm, these will need another set of innovative eyes to be able to model the aging correctly, and then find ways to increase the lifetime of these devices.”
But whether this will improve, or become more difficult as designs add more customization, options, and features, remains to be seen. At least for now, EDA companies are taking a hard look at what’s needed over the lifetimes of chips. The challenge will be to scale it, and so far, no one is talking about how to do that.