Chiplets are beginning to impact chip design, even though they are not yet mainstream and no commercial marketplace exists for this kind of hardened IP.
There are ongoing discussions about silicon lifecycle management, the best way to characterize and connect these devices, and how to deal with such issues as uneven aging and thermal mismatch. In addition, a big effort is underway to improve observability of chiplets over time, something that is particularly important as these devices are used in safety-critical and mission-critical applications.
All of these issues need to be solved to enable widespread adoption, and the chip industry has recognized that the slowing of Moore’s Law combined with fixed reticle sizes will require changes in how chips are designed, manufactured, and packaged. It’s physically impossible to fit all the functions required for many applications into a single SoC, and the goal now is for an orderly, predictable, and repeatable approach to disaggregating many of these components. In theory, this will allow devices to be more easily customized, speed time to market, and avoid expensive scaling of components that don’t require it, such as analog functions.
Achieving that goal will require solving some complex and thorny issues, however. For one thing, it will require much better observability, monitoring, and analysis of what goes into a package. While the concept of putting multiple chips into a package dates back to the 1990s with multi-chip modules, with chiplets the dies typically are smaller and thinner, and the dynamics of how to characterize, test, and observe them have changed significantly.
“Historically we have called these multi-chip modules, which are very popular in the wireless world today,” said Nilesh Kamdar, RF/microwave portfolio lead at Keysight. “You pick up any smartphone, and the wireless piece of the smartphone is a front-end module that’s 20 to 30 chips squeezed together on a space that’s smaller than a fingernail. That has been happening in the industry for at least a decade, if not longer. Aerospace and other high-frequency applications also have required this kind of integration, so we’ve done this in the past.”
Fig. 1. Multi-chip RF module layout shown in Keysight’s PathWave ADS. Source: Keysight
The big shift now underway involves a much broader application for this approach, as well as improvements in the chiplet design, and standard ways to connect, test, and measure what’s happening inside the chiplets themselves, as well as the advanced package that surrounds them.
“Automotive is a great example of what’s changing,” Kamdar said. “At a recent conference, a vice president from a major OEM spoke about not being able to take a data center and put it in the backseat of a car, because that’s what it takes to have an autonomous vehicle today. If you do more integration — and if you somehow make the board disappear, and everything is squeezed together — maybe vertically we can squeeze that into the backseat of a car. There are many similar use cases. The power need for chiplets might be lower if you consider physical data centers. There are a lot of benefits, and that’s what’s driving chiplets today.”
However, these changes are deceptively complex, and the industry may need to go back to school on this. “Building chiplets at scale is such a different model that we all need to reassess our skills,” he said. “We need to reassess how organizations are set up and how architecture happens. We need to reassess the role of a system designer. They may have been looking at things in a different way, and they may have been saying, ‘I’m a system designer. I design the spec for the systems. I break it down into smaller level components, individual ICs, and distribute the specs. I walk away and come back six months later and see how everyone did.’ Maybe that’s not even possible. Maybe there are multiple system designers that need to exist up and down the chain. Those are the kinds of conversations that need to happen. Among the biggest players in the industry, those conversations are happening already, but not everywhere.”
The number of potential interactions in the context of a multi-chiplet design is significant, and in many cases, design-specific. “If you believe in multi-die, if you believe in chiplets, you’ve got to believe it will only exacerbate the whole [design and integration] problem,” said Shekhar Kapoor, senior director of product line management for the Synopsys EDA Group. “Chiplets are going to come from many places, many sources. There are going to be a lot of choices, a lot of options for everyone. The biggest issue is the current usage around all this. Large companies are doing this in a bespoke and customized fashion. But if you go broad and go with standardization, how do you know a chiplet coming in is going to fit into the environment, in the product that you are trying to build?”
Despite a focus on standards such as UCIe and Bunch of Wires, there are still nuances in terms of how the individual chiplets are characterized in the context of a system. “How do you really know the profile of it? That’s where even more monitoring will come into the picture, which is almost like a signature,” Kapoor said. “You can read it and know if it is ideal for your environment. Given that the industry is moving toward more chiplet enablement, this will be a core consideration. More requirements will emerge, more standards will emerge, so then you can see whether something is going to be a fit or not.”
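The "signature" idea Kapoor describes can be pictured as a characterized profile that is checked against the host environment before integration. The sketch below is purely illustrative; the field names, units, and thresholds are invented for the example and are not part of any standard.

```python
# Illustrative only: a hypothetical "chiplet signature" fit check.
# Field names and limits are invented for this sketch, not from UCIe.
from dataclasses import dataclass

@dataclass
class ChipletProfile:
    max_junction_temp_c: float   # characterized thermal limit
    power_w: float               # typical power draw
    ucie_data_rate_gtps: float   # supported UCIe lane rate

@dataclass
class HostEnvironment:
    cooling_limit_c: float       # worst-case package temperature
    power_budget_w: float
    required_rate_gtps: float

def fits(profile: ChipletProfile, env: HostEnvironment) -> list[str]:
    """Return a list of mismatches; an empty list means the chiplet fits."""
    issues = []
    if profile.max_junction_temp_c < env.cooling_limit_c:
        issues.append("thermal: die limit below worst-case package temperature")
    if profile.power_w > env.power_budget_w:
        issues.append("power: draw exceeds budget")
    if profile.ucie_data_rate_gtps < env.required_rate_gtps:
        issues.append("link: UCIe rate too low")
    return issues
```

In practice such a profile would come from the kind of characterized models and monitor data the article discusses, rather than a hand-filled record.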
There are other challenges to overcome, as well.
“With chiplets, all the high-speed signals are within the package, so observability is much more challenging,” said Sue Hung Fung, product line marketing manager for UCIe at Cadence. “This can be done through link error checks, eye scans, BiST, etc., in order to get to a known good die (KGD). Test methods are all built around this. In addition, having good monitors for the health of the link would be valuable, and there are new and different proposals from different vendors.”
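One way to picture the KGD screening Hung Fung mentions is a pass/fail decision combining a BiST result with per-lane eye-scan margins. This is a hedged sketch; the thresholds and function shape are illustrative, not drawn from any test spec.

```python
# Hedged sketch: screening for a known good die (KGD) from per-lane
# eye-scan results and a BiST pass flag. Thresholds are illustrative.
def known_good_die(eye_widths_ui, eye_heights_mv, bist_pass,
                   min_width_ui=0.4, min_height_mv=40.0):
    """A die qualifies only if BiST passes and every lane's eye is open
    enough, in both width (unit intervals) and height (millivolts)."""
    if not bist_pass:
        return False
    return (all(w >= min_width_ui for w in eye_widths_ui) and
            all(h >= min_height_mv for h in eye_heights_mv))
```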
The key is to monitor signal quality in the context of the rest of the components in a package, which becomes more difficult as more functions are disaggregated into chiplets.
“Can we monitor the signals and quality of these during data transfer? Training techniques are done prior to mission mode to improve data transfer robustness,” Hung Fung noted. “Retraining is not desired, as it can cause data interrupts. We need to be able to continuously monitor and report each lane, and detect any events that may cause failure before a failure occurs. Preventing and repairing system failures includes detecting marginally failing lanes and applying redundant lane remapping or other repair methods. Training and continuous monitoring of these internal chiplet signals are the challenges for analyzing the link behavior.”
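The monitor-and-repair flow described above can be sketched as a per-lane error counter plus a remap to a spare lane once a lane looks marginal, avoiding a full retrain. The lane counts and threshold below are hypothetical.

```python
# Sketch of per-lane mission-mode monitoring with redundant-lane repair.
# Lane counts and the error threshold are hypothetical, for illustration.
class LaneMonitor:
    def __init__(self, num_lanes, num_spares, error_threshold=100):
        self.errors = [0] * (num_lanes + num_spares)
        self.active = list(range(num_lanes))                    # lanes carrying data
        self.spares = list(range(num_lanes, num_lanes + num_spares))
        self.threshold = error_threshold

    def record_error(self, lane):
        """Called by the link layer whenever a lane reports an error."""
        self.errors[lane] += 1

    def repair_pass(self):
        """Remap any marginal active lane to a spare before it hard-fails.
        Returns the (failing_lane, spare_lane) remappings performed."""
        remapped = []
        for i, lane in enumerate(self.active):
            if self.errors[lane] >= self.threshold and self.spares:
                spare = self.spares.pop(0)
                self.active[i] = spare
                remapped.append((lane, spare))
        return remapped
```

A real PHY would do this in hardware against a spec-defined repair protocol; the point here is only the shape of the decision loop.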
Working groups for UCIe are looking to standardize some of this observability in order to have an open link ecosystem. But adding observability and monitoring also can vary greatly by vertical segment.
Randy Fish, director of product line management for the Synopsys EDA Group, explained that because there’s been no standard approach and few commercial suppliers for observability solutions, almost all the solutions have been bespoke. “If you go into any of the leading semis, they do something,” he said. “The question is, are there functions around multi-die that will force us to standardize to have a cohesive or coherent infrastructure for monitoring and debug — basically to see what is going on, especially if you’re getting multiple die from multiple suppliers. Some of the multi-die solutions are going into automotive, and there they do care about things like aging and what is happening to these die. And as we know, it’s advanced nodes. It’s not like you have six die of mature 10-year-old technologies. These are advanced nodes that don’t have a long history. For that reason, there are a number of factors forcing this to happen.”
Chiplets also introduce some interesting contrasts. Keysight’s Kamdar noted that during a recent CEO panel, one of the panelists said chiplets have a unique dichotomy. “On the one hand, each chiplet could be an independent IP that you can procure from an IP vendor and integrate it into your system for relatively low cost, relatively easily. However, the entire stack up that you’re trying to build suddenly forces you to know everything. Previously, you could have just said, ‘I need six things. I’m going to buy five off the shelf from an IP vendor and they will figure out what it needs. I’m going to focus on one.’ But now you may not be successful in doing that. You may actually need to know how to do all six, and figure out how all of that happens because the complexity of the problem just went up. This may force the industry to initially only allow the big players to figure this out. It may take a lot longer for the smaller vendors to be able to be successful in this environment.”
Nevertheless, it will take more than one company to accelerate chiplet integration and adoption.
“Keysight attended the TSMC Symposium, which is a more public event, and as a follow-up there was a workshop with just the partners that are part of the 3D Fabric Alliance,” Kamdar said. “TSMC started and ended the whole day by talking about how we all need to work on this together, which was echoed by other participants, including AMD and Qualcomm. Speakers from both companies said not a single EDA company knows how to solve the chiplet problem on their own. The entire industry must work together.”
That’s easier said than done, however. “The biggest problem is to establish an ecosystem,” said Andy Heinig, head of department for efficient electronics in Fraunhofer IIS’ Engineering of Adaptive Systems Division. “Packaging, chip design, or IP are not the biggest issue. It’s more about getting everything in place. This is the real challenge today. Then we can discuss performance, because any chiplet solution will have to compete against existing solutions. It’s a bit of a chicken-and-egg problem, though, because you need performance — especially because the chips are more expensive than an SoC solution — but you can’t focus on performance because you have to get it running first.”
Another key concern with chiplets is heat dissipation. This is part of the characterization, but it also is highly dependent on use cases, packaging choices, and the overall system-in-package architecture.
“For chiplets, design margin is slim given optimal PPA targets (aggressive pJ/bit and beachfront density), which is crucial when designing chiplet PHYs,” said Rishi Chugh, vice president of product marketing, IP Group at Cadence. “Reliability is key, and so is observability to screen the KGD, as well as get it to commercial operational success. Data integrity schemes like CRC (cyclic redundancy check), eye scan, BiST, and monitoring circuits are implemented in the design for robustness, and the design must be over-provisioned against failure mechanisms to ensure the data line is resilient.”
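The CRC scheme Chugh mentions works the same way at any scale: the transmitter appends a checksum to each data block, and the receiver recomputes it to detect corruption. The minimal sketch below uses CRC-32 as a stand-in for whatever polynomial a given link-layer spec actually defines.

```python
# Minimal illustration of a CRC data-integrity check on a data block.
# CRC-32 stands in for the polynomial a real link-layer spec would define.
import zlib

def protect(payload: bytes) -> bytes:
    """Transmitter side: append a 4-byte CRC to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")

def check(block: bytes) -> bool:
    """Receiver side: recompute the CRC and compare with the stored one."""
    payload, crc = block[:-4], block[-4:]
    return zlib.crc32(payload) == int.from_bytes(crc, "little")
```

A mismatch flags the block for retransmission or, if errors on a lane persist, feeds the lane-repair mechanisms discussed earlier.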
An entire chapter of the UCIe specification is dedicated to initialization and training, which covers the observability aspects of the protocol, Chugh added.
Additionally, there is debate as to whether the fundamentals of adding observability into a system are the hardest part, or whether the change in thinking around these concepts is more difficult.
“It’s actually not that complicated in contrast to other things, because it’s ‘just’ another block to be connected to. There’s observation, and we have capabilities to trace things,” said Frank Schirrmeister, vice president of solutions and business development at Arteris IP. “Users are already requesting things like looking at the registers from a software perspective. So now the challenge becomes making those registers available in the NoC. From a NoC perspective, there are the protocols themselves, like CHI, ACE, AMBA, OCP, or others, and those are the mechanisms of the language — how they talk and how they interact. Within the NoC, with the more complex protocols, there are things happening over multiple cycles, so you need to wait for responses, you put things in the pipeline.”
This is similar to speculative execution in processors. “We talk about these credits, like how long do I have to wait for a response and so forth,” Schirrmeister explained. “Those are all parts of the protocols. In the NoC, you need to understand issues like how deep are the buffers? When am I actually waiting for data? It is partly performance. Then, for the observability, you can connect to the data, and the sensors might use their own network, depending on how you want to configure it. In the case of on-chip monitors, you need to decide whether to put this on a special observability bus, for example. There’s always a discussion about how much debug I actually have. At the end of the day, it’s ‘just’ another interconnect of those components, and you need to decide how to export it out of the chip, and so forth. How much you store on chip is all a question of how much silicon real estate I’m willing to spend for that.”
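The credit mechanism Schirrmeister refers to can be sketched as a toy model: a sender may only issue a transfer while it holds a credit, meaning the receiver has buffer space, and credits return as the receiver drains its buffer. This is a simplified illustration of the general idea, not the flow control of any particular NoC protocol.

```python
# Toy model of credit-based flow control on a NoC link: one credit per
# free receive-buffer slot. Simplified; not any specific protocol (CHI,
# ACE, etc.), which add channels, ordering, and credit classes.
from collections import deque

class CreditedLink:
    def __init__(self, buffer_depth):
        self.credits = buffer_depth      # one credit per free buffer slot
        self.buffer = deque()

    def send(self, flit) -> bool:
        """Returns False (sender must stall) when no credits remain."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.buffer.append(flit)
        return True

    def drain(self):
        """Receiver consumes a flit and returns a credit to the sender."""
        flit = self.buffer.popleft()
        self.credits += 1
        return flit
```

Buffer depth is exactly the trade-off he describes: deeper buffers hide latency but cost silicon area, the same budget question that limits on-chip debug storage.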
That becomes particularly important when stitching together chiplets. “How do I ensure I’ve got enough space for this computational entity that’s just looking like data, which isn’t actually adding any value to the immediate function?” asked Gajinder Panesar, chief architect at Picocom. “Also, I may not be a monitoring expert, but I know I need it. So I need something that says, ‘Just press that button.’ You have an environment, we are doing the designing, and ‘this’ happens. Ideally, we should be observing the behavior of CPU performance and then dynamically tuning certain aspects of the core to get better performance.”
One of the pieces that has yet to be developed is dynamic control of these devices, and the adjustments that can be made over their lifetimes.
“Say we have all the ability to model everything up front,” said Lee Harrison, director of product marketing in the Tessent group at Siemens Digital Industries Software. “We’ve got all of the monitors built in to do the in-system side of things, but it’s closing that loop. For the newer geometries, there’s still quite a bit of learning to be done to really optimize how we can tweak the various parameters of the device to extend that reliability. The piece that closes that loop to in-life system is where there’s a huge amount of value. Yet there is still work to be done.”
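"Closing the loop" that Harrison describes means feeding monitor readings back into operating parameters over the device's life. The sketch below is a hedged illustration: the path-delay monitor, guard bands, and voltage step are all invented for the example, not taken from any product.

```python
# Hedged sketch of a closed in-life loop: read an on-die path-delay
# monitor and nudge the supply voltage to compensate for aging.
# Monitor, guard band, and step size are invented for illustration.
def adjust_supply(read_delay_ps, vdd_mv, target_ps=100.0,
                  guard_ps=5.0, step_mv=5, vdd_max_mv=900):
    """Raise Vdd slightly when the monitored path delay approaches the
    timing target (aging-induced slowdown); otherwise leave it alone."""
    if read_delay_ps > target_ps - guard_ps and vdd_mv + step_mv <= vdd_max_mv:
        return vdd_mv + step_mv   # compensate for degradation
    return vdd_mv                 # within margin, or at the voltage ceiling
```

In a multi-chiplet package, each die could age differently, so such a loop would run per chiplet against per-die monitor data, which is part of what makes standardized monitor access valuable.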
Changing responsibility roles
Commercial chiplets add another thorny issue, which is who’s responsible when something unexpected is observed or goes wrong.
“If I’m a chipmaker, I make the chip and I may go through an OSAT for testing,” said Paul Karazuba, vice president of marketing at Expedera. “I may use ASE as a packaging house, but I’m selling it with my name and my warranty. It’s going to get interesting when we have chiplets. In all of our meetings concerning chiplets, the question is always raised about who is going to be responsible for what. Let’s say I was to make an AI chiplet and I sell into a system and package with six other companies’ chiplets. Which company is going to be the warranty? Which company will do the service on it? There is no real consensus right now.”
The working idea is that the company whose name is on the outside of the package will be responsible, Karazuba said. “That company is probably going to be the one that has end responsibility for the service to its customers, but it brings in another layer of service that chiplet makers need to provide, and that’s going to be interesting. The fear is the circa-2000 Intel/Microsoft/Dell triangle of everyone pointing fingers at each other. That’s an unspoken fear in the industry right now.”
And maybe it’s not one of the chiplets. What happens if a substrate or a physical interconnect is defective?
“From a testing point of view, a chiplet may test perfectly well,” Karazuba said. “But when there’s a physical interconnect problem, how does the chiplet maker understand that versus the multichip module maker? It’s going to be interesting. The only way to solve these issues is going to be trial and error. We can design as many legal contracts as we want to as semiconductor makers, but we’re in uncharted waters here and things are going to have to be adjusted. Support models are going to have to be adjusted to reflect the new reality of monolithic silicon not being a primary vehicle of the semiconductor sale.”
Synopsys’ Kapoor already has seen this reflected in the ecosystem. “There always have been ecosystems, but your active ecosystem, from wherever you are, was maybe the next circle around it. If you are doing design, you’d only be concerned about the foundry design rules and design rule manual. That’s changing when you talk about chiplets. Even with the design, now you’re thinking about test even more than you ever did before. You’re talking to Advantest and Teradyne. Even though you’re just a designer, you have to figure out what you need to put in from an ATPG point of view and how it’s going to be tested. The relevant ecosystem size is increasing.”
Still, the industry has no choice but to solve these issues. “We’ve talked about the chiplet marketplace. You’ll be able to pull a die out and have it ready,” Kapoor said. “We’re still far away from that, but the steps are becoming clearer about what we need to achieve. Connectivity is fundamental. UCIe standards are a must, and with that comes the protocols and the rules from a connectivity point of view that you have to establish. Next would be very clearly defined models. The challenges we’re talking about are impacted by thermals, in particular, and power. Some standards already exist around that, and we will go from connectivity to the characterized model, so we can more reliably use that. Then we’ll need some sort of signature, which is where we can see, from a testability point of view, the lifespan and how all die are going to be changing differently.”
The inputs into all of this will come from chip and system monitors, which will also need to be based on standards.