System-in-Package Design
Pete Ehrett Todd Austin Valeria Bertacco
University of Michigan, Ann Arbor
{wpehrett, austin, valeria}@umich.edu
978-3-9819263-2-3/DATE19/©2019 EDAA
Fig. 2: Assembly-time customization with SiPterposer. The bridge chiplet (Fig. 4b) connects two cores on different rails (denoted by green arrows). The blown fuses enable the cores to communicate in isolation from accelerator pairs 0/1 and 2/3, and vice versa.

Fig. 4: Bridge chiplet examples: (a) intra-rail buffering bridge; (b) 2-rail mesh bridge. (a) is an active bridge that connects two adjacent clusters on the same rail via bidirectional buffers; if attached across a set of blown fuses, it can act as a repeater for long interconnect paths. (b) is a passive bridge suitable for constructing a small mesh by directly connecting parts of two adjacent rails; this pattern can extend across additional rails to enable larger designs.
Our evaluation finds that SiPterposer has significant economic advantages over traditional 2.5D methods while providing increased reliability and assembly-time flexibility, which could substantially lower the cost of custom silicon.

II. RELATED WORK

Recent work in the SiP space has detailed the economic and technological advantages of building large chips in a ‘disintegrated’ fashion – dividing them into multiple independently-fabricated chiplets and then integrating them on an interposer [7]. Most such work assumes the use of a custom interposer, and those that attempt to provide greater flexibility either continue to impose some design restrictions or else require active logic within the interposer [4].

The materials and mechanical reliability of microbumps (µbumps) in chiplet/interposer bonds have been studied extensively [8]. [9] presented an empirical study of defect rates across an image sensor bonded to a substrate using a large array of fine-pitch µbumps. There has also been extensive prior work on correcting bonding defects between layers of a 3D chip, most of which focuses on replacing defective through-silicon vias (TSVs) with redundant ones [10]. [11] proposed adding ECC to correct TSV link defects. However, all these are designed either to minimize the number of TSVs in a 3D system or to provide error correction far stronger than SiPs demand. Few, if any, address µbump defects in general or 2.5D integration in particular. By contrast, this paper takes a targeted approach by accounting for 2.5D-specific design considerations and realistic defect models.

Finally, in the NoC space, interconnect reliability has been explored broadly, with various works discussing routing around failed components [12] or applying ECC to correct transient faults and crosstalk [13]. To our knowledge, however, these methods have not been leveraged in the SiP space, and no prior work has applied ECC to tolerate SiP assembly defects.

III. SYSTEM OVERVIEW

SiPterposer structure. To achieve our goals of generality, flexibility, and resilience, we propose constructing SiPs using a generic, fully-passive interposer based on a simple internal wiring pattern consisting of long, straight data rails (see Fig. 2). Each rail consists of a large number of parallel wires that span the entire width of the interposer but are not directly connected either to each other or to wires in any other rail. µbumps are connected to each wire at regular intervals, and a group of µbumps, one per wire in the rail, comprises a cluster. A chiplet may span and connect to one or more of the clusters, either on the same rail or on different rails.

For our analysis throughout the remainder of this paper, we define each cluster to support a 512-bit data connection. Because µbumps are much larger than interwire distances, they are offset slightly from the centers of the wires to which they attach, forming a two-dimensional grid (Fig. 3). In our design, the µbumps are a typical 20µm wide with 40µm pitch [8] and are configured in a 16x32 grid, resulting in cluster dimensions of ~0.7mm by ~1.4mm. Further, we partition each cluster into eight 64-bit logical links (akin to one node in a full-mesh network with 64-bit full-duplex links). Different sets of chiplets may communicate simultaneously by using disjoint subsets of the available links. Fig. 3 illustrates this layout/partitioning scheme.

Fig. 3: µbump clusters comprise a grid of 512 µbumps, each attached to a distinct interposer wire. The top shows a simplified view of the internal structure; the white dots indicate which µbump connects to each wire. Each wire may be fused between each cluster on a rail.

Portions of interposer wiring between each cluster of µbumps may be separated during the assembly process, causing them to act as small electronic fuses (Figs. 2 and 3). Although we refer to these wire segments as ‘fuses’, there is no need to add specialized fuse components or other discontinuities to the fabric; existing technology can fuse link wires at the pitch we propose [14]. At assembly-time, blowing all fuses at a specific point in a rail can completely disconnect a set of chiplets from others. Alternatively, blowing fuses to disconnect only a subset of the logical links within a rail would allow, for example, system-wide broadcasts over one link, while other links carry out communication between adjacent chiplets. Blowing fuses can also enable a chiplet to act as a repeater (Fig. 4a) or as a node in a mesh with different parts of a cluster connecting different network edges (Fig. 5).

Other electrical- and protocol-level considerations either can be handled with standard techniques or are design-dependent.
Power distribution, for instance, may use typical VLSI [15] and SiP [16] methods, while interfaces between clock domains are handled within the chiplets using existing NoC methodologies [17]. The proposed physical structure may also be tessellated to produce a different interposer size without costly redesign. Since SiPterposer is based on a passive interposer, communication protocols are defined and handled chiplet-side. These can range from AMBA buses to packet-switched networks to fully-custom designs. For the rest of this work, all chiplets are assumed to use a packet-based network protocol.

Bridge chiplets. In addition, we propose the use of dedicated, generic bridge chiplets which, in conjunction with the interposer structure, enable assembly-time customization of the system’s connectivity. These bridge chiplets can be designed in a handful of different patterns and then mass-produced. Simple units (e.g., Fig. 4b) include only passive wiring; these ‘bridge’ electrical gaps by directly connecting wires in one data rail to another. Other bridges may be active devices, deployed horizontally across portions of a rail separated by blown fuses to act as a buffer in the middle of that rail (e.g., Fig. 4a), create clock boundaries, or be full-blown routing devices. To illustrate the reusability of a small set of bridges across many different designs, we limit our selection for the remainder of this work to Fig. 4b’s passive unit (and its derivatives).

Arbitrary topology construction. By blowing fuses in the interposer wiring and connecting bridge chiplets across disconnected regions, we can create any arbitrary interconnect topology that a chip may require, as follows:
1) Align the network graph to a Manhattan layout.
2) Rotate the graph so as many edges as possible run along the axis of SiPterposer’s internal wiring, lowering overhead by reducing the number of bridge chiplets required.
3) Map the nodes in the graph to chiplets and arrange them as blocks atop SiPterposer’s µbump clusters according to their logical layout in the network graph.
4) For each edge in the graph:
   a) If possible, map the edge to an unused subset of interposer/bridge wires already in the design.
   b) If too few bridge wires: (i) add or extend a bridge, or (ii) time-multiplex access via chiplet-side logic, akin to virtual channels (VCs) in a NoC.
   c) If too few interposer wires: (i) move the nodes connected by the edge to another rail (adjusting previously mapped edges accordingly), or (ii) time-multiplex access.
   d) Blow fuses at the endpoints of the new link, to reduce wire loading and permit other parts of the newly separated interposer wires to be used freely for other edges.
As an example, we illustrate building a simple 2x2 mesh using this process, in Fig. 5.

Importantly, because SiPterposer’s electrical structure does not require an active interposer, it may be implemented on any desired substrate material – Si, organic, glass, etc. (our evaluation assumes Si). Furthermore, it may easily be layered with other chiplet placement or system configuration techniques (e.g., [18]) as part of a holistic design methodology.

Defect-tolerance. The overall SiP concept is quite promising, but it also introduces a new point of failure into chip fabrication via the chiplet-to-interposer assembly process. Prior work estimates this process yield at 99%-99.5% per 1024-µbump chiplet – a loss that, even in a system with relatively few chiplets, can be responsible for as much as 26% of the total manufacturing cost [6]. Rather than trying to reduce the incidence of these defects, we instead propose tolerating them by adding a module to each chiplet to provide lightweight ECC on each interposer bond. Although this requires both additional wires to carry parity information and chiplet-side encode/decode logic, it requires no active logic on the interposer, preserving its simplicity and its low manufacturing cost.

In our design, each µbump cluster provides a 512-bit data connection to an interposer rail, partitioned into eight 64-bit logical links. We further divide each 64-bit link into four 16-bit sublinks and then apply a form of ECC to each sublink, either Hamming single-error-correction (SEC) or Bose-Chaudhuri-Hocquenghem double-error-correction (DEC). SEC requires 5 parity bits per sublink (672 total µbumps per cluster), while DEC requires 10 parity bits per sublink (832 µbumps per cluster). The sublinks within each logical link are interleaved to better resist physically-adjacent defects [13].

The nature of chip warpage during die bonding [8] inspired us to also design a third, defect-pattern-aware coding method. Since warpage-induced mechanical stress causes bonding defects to occur most often at a chip’s edges, we suggest a hybrid concentric coding structure, with four logical links in the center of the chiplets (using SEC) and four links along their more-defect-prone edges (using DEC). As our evaluation shows, this hybrid approach provides a better yield-overhead balance than either SEC-only or DEC-only.

IV. EVALUATION

We evaluated SiPterposer’s defect-tolerance and overall performance by simulating assembly of a hypothetical 48-chiplet system. First, we determined the whole-chip assembly yield for varying defect rates, ECC, and bonding defect patterns. Second, we synthesized our ECC hardware and used the results with whole-system models to simulate SiPterposer’s impact on network performance and overall chip power/area overhead.
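
As a rough check on the coding parameters given in Section III, the sketch below recomputes the per-cluster µbump counts for SEC and DEC and estimates the chance that a single cluster bonds without an uncorrectable sublink. It is only an illustration under stated assumptions, not the simulation used in this evaluation: it covers a single cluster, assumes independent bond failures with one uniform per-µbump probability p, uses a closed-form binomial sum rather than Monte Carlo trials, and treats a sublink as correctable whenever at most t of its bonds fail. All function and variable names are hypothetical; the DEC parity count is taken from the text rather than derived.

```python
from math import comb

LINKS_PER_CLUSTER = 8   # from the text: eight 64-bit logical links per cluster
SUBLINKS_PER_LINK = 4   # from the text: four 16-bit sublinks per link
DATA_BITS = 16          # data bits per sublink


def hamming_sec_parity_bits(data_bits):
    """Smallest r satisfying the Hamming bound 2**r >= data_bits + r + 1."""
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def cluster_ubumps(parity_bits):
    """µbumps per cluster if every sublink carries `parity_bits` parity bits."""
    return LINKS_PER_CLUSTER * SUBLINKS_PER_LINK * (DATA_BITS + parity_bits)


def sublink_ok_prob(n_bumps, t, p):
    """P(at most t of n_bumps bonds fail), each failing independently with prob. p."""
    return sum(
        comb(n_bumps, k) * p ** k * (1 - p) ** (n_bumps - k)
        for k in range(t + 1)
    )


def cluster_yield(parity_bits, t, p):
    """P(every sublink in one cluster is correctable) under the uniform assumption."""
    n_sublinks = LINKS_PER_CLUSTER * SUBLINKS_PER_LINK
    return sublink_ok_prob(DATA_BITS + parity_bits, t, p) ** n_sublinks


SEC_PARITY = hamming_sec_parity_bits(DATA_BITS)  # 5, matching the text
DEC_PARITY = 10                                  # taken from the text, not derived

if __name__ == "__main__":
    p = 1e-3  # illustrative per-µbump failure probability (an assumption)
    print("SEC µbumps per cluster:", cluster_ubumps(SEC_PARITY))  # 672
    print("DEC µbumps per cluster:", cluster_ubumps(DEC_PARITY))  # 832
    print("cluster yield, no ECC:", cluster_yield(0, 0, p))
    print("cluster yield, SEC:   ", cluster_yield(SEC_PARITY, 1, p))
    print("cluster yield, DEC:   ", cluster_yield(DEC_PARITY, 2, p))
```

With no parity bits and t = 0 the estimate reduces to (1 - p)^512, the probability that all 512 data bonds in a cluster are good. The hybrid scheme and the edge-weighted and empirical defect patterns would need position-dependent failure probabilities, which this sketch does not attempt.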
A. Chiplet Bond Resilience with ECC

We began our evaluation of defect-tolerance by creating a worst-case scenario for the error correction schemes we propose, configuring our 48 chiplets into a fully-connected system. Each chiplet uses exactly one µbump cluster, and every µbump on a given chiplet is directly connected to the corresponding µbump on every other chiplet (assuming no defects). We defined a failed chip as one in which there exists an uncorrectable fault in any link between any pair of chiplets. We assumed known-good-dies in our simulations in order to isolate the effects of coding on µbump bonding defects.

To model assembly defects, we assigned each µbump bond an independent failure probability based on its physical position within a cluster, relative to one of three potential defect patterns. The first pattern, uniform, assumes that each µbump bond has an equal chance of failure. The second pattern, edge-weighted, incorporates the effect of die warpage via a linear increase in bond failure probability with a µbump’s distance from the center of a chiplet, from a baseline at the center to 10x that value at the outermost corner. The third pattern, empirical, simulates real-world failures using data derived from [9].

For each coding method and defect pattern, we conducted Monte Carlo simulations (100K trials) of chip assembly to calculate whole-chip assembly yields while sweeping the base per-µbump failure probability. For the edge-weighted and empirical defect patterns, we normalized the failure probability of an overall chiplet bond to that of a chiplet bond having a uniform defect pattern with the same base per-µbump failure probability. Fig. 6 compares each coding method vs. a system with no error correction. In general, there is little effect on chip assembly yield with increasing defect pattern complexity, from uniform to edge-weighted to empirical. Hybrid coding is an exception: its defect-tolerance increases significantly on the edge-weighted defect pattern. It performs even better with an empirical defect distribution, since this pattern’s defects are even more heavily biased towards the edges of each chiplet.

B. Interconnect Performance

To evaluate SiPterposer’s network performance and electrical characteristics, we modeled two systems-in-package – one synthetic, one inspired by real-world SoCs – and evaluated their overheads vs. SoC and traditional-SiP equivalents.

Mesh network, synthetic traffic. Our first, synthetic, system is an 8x6 full-mesh network containing 48 identical 4mm² chiplets with 1W nominal power consumption, in which each chiplet has a 64-bit full-duplex data connection to each of its neighbors. This is representative of a homogeneous multicore chip, and constitutes a worst-case scenario for SiPterposer, since the large number of links required in the topology increases the number of bridge chiplets needed, which, in turn, accentuates our proposed system’s overheads.

We established two baseline systems: a monolithic SoC, and a traditional SiP using a fixed-topology passive interposer. The latter is necessary because the novelty of this work lies in the methods we propose for constructing SiPs; thus, it is important to evaluate SiPterposer against both traditional SoCs and non-reusable interposer designs. All active logic in each system is assumed to use a 45nm process node; interposers use 65nm global wire widths. Since the interposers and bridges are passive, the three systems differ only in encode/decode logic overhead and link wire dimensions.

First, we constructed HDL models of our SEC and DEC ECC modules. We then integrated these into an open-source NoC router model [19] and synthesized the modified router with Synopsys Design Compiler using an IBM 45nm library to determine the ECC modules’ area and power overheads; these are summarized in Table I. As the ECC modules were inserted directly before/after the input/output buffers, off the critical path of the router, they exposed no additional timing overhead to the design. Hybrid coding would entail area and power consumption exactly between the SEC and DEC routers.

TABLE I: Router synthesis results (with ECC)
                            Baseline   SEC      DEC      Hybrid
Power (mW)                  3.09       4.92     13.25    9.09
Area (µm²)                  2108       4973     15752    10363
Area overhead - 8x6 mesh    –          0.07%    0.34%    0.21%
Area overhead - SoC         –          0.04%    0.22%    0.13%

Next, we evaluated network performance and power overhead using a combination of BookSim [20], ORION 2.0 [21], and LTSPICE. Since adding ECC to the routers introduced no additional latency, and because we assume equal link widths across the three example systems, performance overhead could only come from added delay from longer inter-chiplet links on SiPterposer. Using LTSPICE with ORION’s wire models and [22]’s µbump models, we computed the delay of the longest link as 62.2ps, small enough to not impact network timing.

TABLE II: Network params
Network clk         2GHz
Routing fn          dor (mesh), min (SoC)
VCs/buffer size     3/8
Router pipeline     4 cycles
Link traversal      1 cycle
Pkt size (flits)    16 (mesh), 1 (SoC)

TABLE III: Chipset params
Chiplet area    72mm²
Chiplet pwr     4W total (56mW/mm²)
CPU clk         900MHz
Chiplet clk     500MHz
Memory          2GB/500MHz
Vid/img size    1080p

Finally, we constructed a model of each interconnect (SoC, SiP, and SiPterposer) in BookSim and analyzed the link wire power with ORION for uniform random traffic at varying