There is a tenuous balance between the number of corners a design team must consider, the cost of analyzing them, and the margins inserted to cover them, and that tradeoff is becoming increasingly difficult to manage. If too many corners of a chip are explored, it might never reach production. If too few are explored, yield may suffer. And if too much margin is added, the device may not be competitive.
In a semiconductor chip, a corner is the extreme of a parameter that can affect operation, such as temperature or voltage. A device has to be verified over the possible range of that parameter so that all manufactured devices will be operational. Over time, devices have become more complex, such as with the introduction of finFETs. They also have become larger, causing more variation across the chip. And they have become more heterogeneous, utilizing more voltage domains and integrating more dies within a package.
Due to a combination of all these factors, the number of corners that must be considered is increasing, and the range of those corners is becoming a greater percentage of the total range. Each of these presents a different axis, and the combination of those axes has to be explored. This is creating what some term a corner explosion.
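The axis-combination problem can be made concrete in a few lines: every signoff corner is one point in the cross product of the axes, so each additional axis multiplies the total count. The axes and values below are illustrative only, not taken from any specific foundry PDK.

```python
from itertools import product

# Hypothetical corner axes -- the values are illustrative, not from any
# specific foundry process design kit.
process = ["ss", "sf", "tt", "fs", "ff"]    # device process corners
voltage = [0.675, 0.75, 0.825]              # supply extremes + nominal (V)
temperature = [-40, 25, 125]                # junction temperature (C)
rc_extraction = ["cworst", "cbest", "rcworst", "rcbest", "typical"]

# Every signoff corner is one point in the cross product of the axes.
corners = list(product(process, voltage, temperature, rc_extraction))
print(len(corners))  # 5 * 3 * 3 * 5 = 225

# Adding one more axis (e.g. aging: fresh vs. end-of-life) doubles the count.
aged = list(product(corners, ["fresh", "eol"]))
print(len(aged))  # 450
```

The count grows multiplicatively, not additively, which is why each new axis hurts so much.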
“When we talk about the process corners, we’re talking about two main things — random variation and design robustness,” says Hitendra Divecha, product management director in the Digital & Signoff Group at Cadence. “Most of the challenges are random variations, which are manufacturing issues. The corners started exploding because you have to take care of the pessimism introduced as a result of the addition of multiple layers, and the whole manufacturing process becoming more complex in nature.”
There are more extreme corners, too. “There are two parts to this,” says Rob Aitken, a Synopsys fellow. “Yes, there are more corners that are getting created. But the bigger problem is that the corners are getting larger in terms of the total range. Is one of these axes more important or more difficult than the other? What’s the net effect of this? When we think about voltages getting so low that everything is close to Vt, then any kind of variation you get from fabrication is a much larger percentage of what the total range was.”
Time can be another compounding factor. “You have variation from one location on a die to another die,” says Pradeep Thiagarajan, principal product manager at Siemens Digital Industries. “You’ve got variation between dies within a wafer. And then you’ve got variation across wafers for a given lot. Furthermore, there are many lots that are being fabricated at different points in time, with different levels of process maturity. Designers need to accommodate all of these variations, which span area and time, in their designs to ensure robust functionality before taping out.”
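A common simplification treats these variation sources as statistically independent, so their variances add and the total sigma is the root-sum-square of the individual sigmas. A minimal sketch, with invented numbers for a single parameter such as threshold voltage:

```python
import math

# Illustrative standard deviations for one parameter (e.g. threshold
# voltage, in mV). The numbers are invented for this sketch.
sigma_lot   = 8.0   # lot-to-lot
sigma_wafer = 5.0   # wafer-to-wafer within a lot
sigma_die   = 4.0   # die-to-die within a wafer
sigma_local = 3.0   # device-to-device within a die

# If the sources are independent, their variances add, so the total
# sigma is the root-sum-square of the individual sigmas.
sigma_total = math.sqrt(sigma_lot**2 + sigma_wafer**2
                        + sigma_die**2 + sigma_local**2)
print(round(sigma_total, 2))  # 10.68
```

Note that the largest source dominates: shrinking the local sigma barely moves the total until the lot-level spread is brought under control.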
Most recently, smaller geometries are creating real issues related to device aging. “Aging adds a further wrinkle in the local variability aspect,” says Synopsys’ Aitken. “When you start out fresh from a fab, there’s some variation across wafers, and there’s variation between devices, and all of that is accounted for by various models. But when you consider aging, you’re adding workload-dependent effects, and those affect different parts of your circuit differently in ways that you will be challenged to predict in advance.”
Design robustness has become a multidimensional nightmare. “Design teams have to create better products, but they have to ensure the design is robust enough to sustain all of these variations (see figure 1),” says Cadence’s Divecha. “If they are overly pessimistic, they will tend to over-design, which might put them in a situation where their competition is making a better-performing product. If they become too optimistic, they are essentially getting into the area where the product might not yield.”
Fig. 1: Multiple dimensions of design robustness. Source: Cadence
These problems will continue to get worse. “As you go down Moore’s ladder, there are more effects that were secondary or tertiary,” says Marc Swinnen, director of product marketing at Ansys. “These are becoming close to first-order effects that you need to consider.”
The voltage issue
With each process node, voltages are lowered. While this has a number of advantages, such as reducing power, it also has disadvantages. “We are seeing the library variants where the main operational voltage is very close to switching thresholds,” says Divecha. “This causes a lot more variation because currents are smaller, and outputs don’t switch typically until the input waveforms are in the tail region. This results in delay variation increases. Ultra-low voltage operation is significantly contributing to overall variability.”
But it becomes even more problematic. “Several generations ago, voltage drop was a tractable problem and you could keep all the cells within a reasonably narrow voltage margin,” says Ansys’ Swinnen. “But now, voltage — especially with dynamic voltage drop — has gotten more extreme. Just as you have temperature variation across the chip, you now have voltage variation across the chip, and some cells will be slower because they see a very weak voltage supply while others will be faster. That adds to your corners. Now you have to do one for every temperature and process, and now you also have to do one for voltage.”
The desire to reduce power has added yet another problem. “Different regions or subsystems within a chip or package could have completely different voltage domains,” says Siemens’ Thiagarajan. “This means they may have completely different ranges, with a different nominal voltage, or they could actually have the same nominal voltage and range, but they could be powered from a different independent power supply. These independent power supplies could be in different extremes. So even within a chip, if one IP is interfacing with another IP on the same die, and if they are using different voltage domains, you’ve introduced another level to the problem, which needs to be simulated. Even if you do all the corners, it only gives you the extreme cases. And what you ultimately have to do is a statistical analysis.”
Consideration for aging
While aging happens in all semiconductor devices, it has become a concern only recently as geometries have become much smaller. Today, a semiconductor manufacturer will provide libraries that mimic the effects of aging. These libraries then can be used to see how the device may perform after 5 or 10 years.
However, this approach has issues. “If you manufacture a chip and sell it to someone else who is using it, you are effectively guessing what they will be doing with it,” says Aitken. “That means you have to account for it, either by margining it, or by guessing, or by giving a property that says this thing can age by X percent and it should still work. They then leave it as an exercise to whoever buys it to determine its robustness.”
Without knowing how the device is to be used, you are severely hamstrung. “If you have differential aging across your chip, that means that on a path that goes from a heavily used area to a lightly used area, your hold and your setup are going to vary over time,” says Swinnen. “It’s not like they’re both going to get slower, or both are going to get faster. One could get slower while the other one stays fast. You could have a hold time violation or something like that.”
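The scenario Swinnen describes can be illustrated with toy numbers: if the capture clock branch ages while the launch/data path does not, a positive hold slack can flip into a violation. All delays below, and the 20% aging slowdown, are invented for illustration.

```python
# Toy hold-slack check. All delays are in ns; the numbers and the 20%
# aging slowdown are invented for illustration.
def hold_slack(launch_clk, data_delay, capture_clk, t_hold):
    # Data launched at the same edge must remain stable for t_hold
    # after the capture clock edge arrives.
    return (launch_clk + data_delay) - (capture_clk + t_hold)

# Fresh silicon: launch and capture clock branches are matched.
fresh = hold_slack(0.50, 0.10, 0.50, 0.05)
print(round(fresh, 3))   # 0.05 -- passes

# Differential aging: the capture branch runs through a heavily used
# region and slows by 20%; the lightly used launch/data path does not.
aged = hold_slack(0.50, 0.10, 0.50 * 1.20, 0.05)
print(round(aged, 3))    # -0.05 -- hold violation
```

The failure appears even though no single path got "worse" in absolute terms; only the relative skew between the two branches changed.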
Differential aging means you need to simulate with some idea of what the activity will be. Activity is increasingly becoming a factor in all stages of the development process, from architecture through to design and manufacturing, and now into operational usage.

Is it a reasonable compromise? “Doing the modeling in the libraries is an attempt to reduce the number of corners, but not mimicking exactly what’s going to happen in the real world,” says Divecha. “Most companies are handling this with over-design today, and EDA vendors are developing statistical models where we provide customers with a way to characterize a library that includes various stress parameters. Instead of doing everything corner-wise, we are going more and more into the statistical world.”
Advanced packaging adds a new dimension. There are now multiple dies mounted either on each other or on a substrate. “If there is a critical path that originated in one die, goes into another die, and then finally ends in a third, how do you make sure that works given the variations across those dies?” asks Sathish Balasubramanian, head of product management and marketing for AMS at Siemens. “People take the critical path — and it’s a very manual effort of running SPICE — and they do variation within a die, given a margin to the boundary. And then they make sure that across a given spectrum of PVT they are considering, the entire path falls into that range.”
Common design techniques are minimizing this today for 2.5D approaches. “Outside of localized clusters, most of the communication between elements in the system is effectively asynchronous, in that it takes some number of cycles to happen, and the system isn’t waiting,” says Aitken. “The most obvious way to build stuff in 3D is to avoid synchronous transfer, but you may want to take advantage of the large number of inter-die interconnects and do something with them. That potentially gives you a big benefit, but it comes at a corner cost that no one has figured out the right answer to. Perhaps you use some kind of ML corner reduction scheme that says there’s 10,000 of them, but only 50 or so matter. Or maybe you will do die-matching, where you are going to make sure that if I pick die to stack on top of each other, they sit in the same corner space. Then I can neglect a whole bunch of those 10,000 corners and just care about these other ones. Both of those are legitimate approaches, but I don’t think anybody has settled on either of them at this point.”
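A very simple, non-ML proxy for the corner-reduction idea Aitken mentions is to keep only the corners that set the worst-case delay for at least one critical path; signing off that subset still covers every path's worst case. The delay data below is randomly generated purely for illustration.

```python
import random

random.seed(0)

# Hypothetical data: delay of each critical path at each corner.
n_corners, n_paths = 200, 30
delays = [[random.gauss(1.0, 0.1) for _ in range(n_paths)]
          for _ in range(n_corners)]

# Greedy reduction: keep only the corners that produce the worst-case
# delay on at least one path. Signing off just the kept corners still
# covers the worst case of every path.
worst_corner = {p: max(range(n_corners), key=lambda c: delays[c][p])
                for p in range(n_paths)}
kept = sorted(set(worst_corner.values()))
print(len(kept))  # at most n_paths, far fewer than 200
```

Real ML-based schemes go further, predicting which corners will dominate before full characterization, but the covering principle is the same.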
This may add a new attribute that has to be tracked for chiplets. “We may have to consider lot-to-lot variation,” says Divecha. “This is on top of wafer-to-wafer variation, die-to-die variation, and intra-die variation. We need to come up with a model, which is statistical in nature, to provide to designers so that when they do sign-off, they don’t have to look at all these corners. In the past, foundries pushed back because they thought it exposed too much of their IP and that would become available to their competition, but today this is becoming essential.”
There are a few techniques to control the number of corners. “It’s a combination of EDA tools, plus the library, plus the library characterization, coupled with an understanding by the designers of what it is that they’re asking for,” says Aitken. “Everyone has their favorite way of reducing the corners. One angle that has been looked at is using AI/ML to confirm you are making the right decisions, and identifying the corners that you care about. Then you can extend the capability so you can take a library that is characterized at such and such a point, or at these various points, and use ML to generate new library corners that you care about.”
Understanding the design is important. “One technique is corner domination,” says Swinnen. “Some corners dominate others, meaning that this one’s always going to be worse than whatever combination of that one, so we can drop that one and just consider this one. That can reduce the number of corners you have to look at, but you still end up with a lot of corners. Another approach with voltage drop is to do a Monte Carlo analysis of the space, where you capture a statistical probability. You get a curve showing the probability of it meeting the requirements. Engineers don’t like it much because it doesn’t give them a yes/no answer. But even looking at timing numbers from a library, it may say it’s a 4ns delay, but there is a distribution. We may have arbitrarily chosen three sigma as the cutoff point, but it’s always a distribution. That’s how you tame the combinational explosion. You just explore it statistically.”
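The statistical approach Swinnen describes can be sketched as a Monte Carlo run over a toy path: instead of a yes/no corner answer, it yields a pass probability and a delay distribution from which a three-sigma cutoff can be read. The per-cell delay numbers below are invented for illustration.

```python
import random
import statistics

random.seed(42)

# Monte Carlo sketch of a 10-stage path whose per-cell delay is nominally
# 0.4 ns with sigma 0.02 ns (invented numbers). We estimate the
# probability of meeting a 4.2 ns timing requirement.
N, stages, t_req = 20_000, 10, 4.2
samples = [sum(random.gauss(0.4, 0.02) for _ in range(stages))
           for _ in range(N)]

mean = statistics.fmean(samples)      # sample mean of path delay
sigma = statistics.pstdev(samples)    # sample standard deviation
p_pass = sum(d <= t_req for d in samples) / N
print(round(mean, 3), round(sigma, 4), round(p_pass, 4))

# The three-sigma point often used as an (arbitrary) signoff cutoff:
print(round(mean + 3 * sigma, 3))
```

Note that independent per-stage variation partially averages out along the path: the path sigma grows only with the square root of the stage count, not linearly, which is exactly the pessimism a pure worst-case corner analysis would miss.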
Another emerging solution is to make the designs more variation tolerant. “If you want to build an adaptable system that’s able to adjust its voltage and/or its frequency on the fly, and tailor that to what a monitor is telling you, there’s a whole realm of things that you need to do to make sure that it works,” says Aitken. “You have to make sure you have positioned enough monitors and that you’re monitoring the right things, and that you’re adapting accordingly. Essentially, the lesson is you want to push as close to the failure cliff as you can get, but you don’t want to go over it. Getting over it and trying to recover from that is a much harder problem than just never failing in the first place.”
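A minimal sketch of such an adaptive loop, assuming a hypothetical on-chip timing monitor whose margin grows with supply voltage: the controller shaves the supply while margin is comfortable and backs off as it nears the guardband, settling close to the cliff without going over. All values are invented; millivolt integers keep the arithmetic exact.

```python
def monitor_margin(vdd_mv):
    # Hypothetical monitor: margin (mV-equivalent of slack) grows with
    # supply; the circuit fails below 620 mV. Purely illustrative.
    return 2 * (vdd_mv - 620)

def adapt(vdd_mv, guardband=100, step=5):
    margin = monitor_margin(vdd_mv)
    if margin < guardband:       # too close to the failure cliff: back off
        return vdd_mv + step
    if margin > 2 * guardband:   # comfortably safe: shave supply, save power
        return vdd_mv - step
    return vdd_mv                # inside the target band: hold

vdd = 800                        # start at a conservative 800 mV
for _ in range(50):
    vdd = adapt(vdd)
print(vdd)  # 720 -- near the cliff, but with the guardband intact
```

The key design choice is the dead band between `guardband` and `2 * guardband`, which stops the controller from oscillating around a single setpoint.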
But there is a limit to how much verification can be done. “Verification teams cannot simulate every possible functional scenario for every different aspect of their design or across different ends of the chip,” says Thiagarajan. “They can only invest so much in their test methodology, and there are always going to be usage scenarios, along with temperature considerations, self-heating situations, voltage situations, that cannot all be simulated entirely. Otherwise, you would never make your tape-out.”
The right solution has to consider economics. “Someone developing an IoT design may be perfectly fine running the minimum, the smallest number of corners that are required,” says Divecha. “They can just do margining and get away with that. At the other extreme, customers doing mobile or high-performance, these guys are looking at 300, 400 corners that they sign off with, and that’s because of the requirements placed on these devices. They have to think about their own costs, and the cost of computing. That means the number of CPUs required to do a whole bunch of analysis, memory requirements. Some customers can’t necessarily afford that.”
Engineers have to explore a rapidly growing number of corners across an increasing number of axes, even as the range of those parameters widens. Advanced designs already are well beyond what can be dealt with by margining alone. While the industry is exploring a range of techniques to ensure the most important corners are identified and analyzed, there are no longer any absolutes in this space. Adaptable systems may be the only viable way forward.