On Micro-architectural Mechanisms for Cache Wearout Reduction

Abstract:

Hot carrier injection (HCI) and bias temperature instability (BTI) are two of the main deleterious effects that increase a transistor’s threshold voltage over the lifetime of a microprocessor. This voltage degradation causes slower transistor switching and can eventually result in faulty operation. HCI manifests itself when transistors switch from logic “0” to “1” and vice versa, whereas BTI is the result of a transistor maintaining the same logic value for an extended period of time. These failure mechanisms are especially acute in the transistors used to implement the SRAM cells of first-level (L1) caches, which are frequently accessed, so they are critical to performance, and they are continuously aging. This paper focuses on microarchitectural solutions to reduce transistor aging effects induced by both HCI and BTI in the data array of L1 data caches. First, we show that the majority of cell flips are concentrated in a small number of specific bits within each data word. In addition, we build upon previous studies showing that logic “0” is the most frequently written value in a cache, by identifying which cells hold a given logic value for a significant amount of time. Based on these observations, this paper introduces a number of architectural techniques that spread the flips evenly across memory cells and reduce the amount of time that logic “0” values are stored in the cells by switching OFF specific data bytes. Experimental results show that the threshold voltage degradation savings range from 21.8% to 44.3% depending on the application. The proposed architecture is analyzed in terms of logic size, area, and power consumption using the Tanner tool.

Existing System:

Modern day computer systems have benefited from being designed and manufactured using an ever-increasing budget of transistors on very reliable integrated circuits. However, as technology moves forward, such a “free lunch” is over as increasingly smaller technology nodes pose significant reliability challenges. Not only do variations in the manufacturing process make the resulting transistors unreliable at low voltage operation, but they take less and less time to wear out, decreasing their lifetimes (from tens of years in current systems to 1–2 years or fewer in the near future) and making them more prone to failures in the field. Thus, lifetime reliability must be treated as a major design constraint. This concern holds for all kinds of computing devices, ranging from server processors to embedded systems, such as tablets and mobiles, where lifetime is an assertive requirement and the market share strongly depends on their reliability.

The two main phenomena that speed up aging are referred to as hot carrier injection (HCI) and bias temperature instability (BTI). The former effect increases with transistor activity over the lifetime of the processor; that is, when a transistor flips from being ON to OFF and vice versa, leading to threshold voltage (Vth) degradation, which in turn causes an increase in transistor switching delay and can result in timing violations and faulty operation when the critical paths become longer than the processor’s clock period. Overall, HCI is accentuated in the microprocessor components with frequent switching. On the other hand, BTI accelerates transistor degradation when a transistor is kept ON for a long time, and takes two forms: negative BTI (NBTI), which affects pMOS transistors when a “0” is applied to the gate, and positive BTI (PBTI), which affects nMOS when a “1” is applied.

A significant fraction of the transistors in most modern chip multiprocessors are used to implement SRAM storage along the cache hierarchy. Therefore, it is important to target these structures to slow down aging. The first-level (L1) data cache is a prime candidate, since it is regularly written, yet stores data for significant amounts of time. Besides, its availability is critical to system performance. The SRAM cell transistors are stressed by HCI when the stored logic value flips, and by BTI when it is retained for a long period without flipping (i.e., a duty cycle). Note that these situations are strongly related to each other. Thus, a given technique designed to exclusively attack BTI might exacerbate HCI as a side effect, and vice versa.

Prior architectural research has analyzed cache degradation mainly due to BTI effects. There have been some attempts to diminish BTI aging by periodically inverting the stored logic values in the cells, by implementing redundant cell regions in the cache, and by reducing the cache supply voltage. Gunadi et al. propose a tentative approach to combat BTI and HCI by balancing the cache utilization. However, the cache contents are flushed from time to time, which might incur significant performance degradation. Unlike the previous works, we extensively analyze the data patterns of the stored contents in L1 data caches in terms of how they affect BTI and HCI, and based on the results of this analysis, we propose micro-architectural mechanisms to extend the cache data array lifetime by reducing the Vth degradation, or simply dVth, caused by both phenomena, without incurring performance losses.

Disadvantages:

  • threshold voltage degradation is high
  • duty cycle distribution is unbalanced

Proposed System:

This paper makes two main contributions. First, we characterize the cell flips and the duty cycle patterns that high-performance applications cause to each specific memory cell. We find that most applications exhibit regular flip and duty cycle patterns, although they are not always uniformly distributed, which exacerbates the HCI and BTI effects on a small number of cells within the 512-bit cache lines. Results also confirm previous work claiming that most applications write a significant number of near-zero and zero data values into the cache. This behavior has been exploited in the past to address static energy consumption and performance with data compression. Unlike these works, this paper takes advantage of such behavior to mitigate aging.

Second, based on the previous characterization study, we devise microarchitectural techniques that exploit such a behavior to mitigate aging. The proposal provides a homogeneous degradation of the different cell transistors belonging to the same cache line. For this purpose, the devised techniques aim to reduce cell aging from bit flips and duty cycle and pursue two objectives: 1) to spread the bit flips evenly across the memory cells and 2) to balance the duty cycle distribution of the cells. To accomplish the former objective, we propose to progressively shift the bytes of the incoming data lines according to a given rotation shift value that is regularly updated. To attain the latter objective, the mechanism is enhanced to power OFF those memory cells storing a zero byte value. The result is a switch-OFF or sleep-state, in which all the cell transistor gate terminals are isolated from electric field stress, thus allowing a partial recovery from BTI.
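The second objective can be illustrated with a minimal sketch that marks which bytes of a 512-bit (64-byte) line are zero and are therefore candidates for switch-OFF. The function name `zero_byte_mask` and the example line are hypothetical and only serve the illustration; the actual hardware detection logic is not described in this summary.

```python
LINE_BYTES = 64  # 512-bit cache line, as stated in the text

def zero_byte_mask(line: bytes) -> list:
    """Return one flag per byte: True where the byte is zero,
    i.e. where its eight cells could be put in the sleep state."""
    if len(line) != LINE_BYTES:
        raise ValueError("expected a 64-byte line")
    return [b == 0 for b in line]

# Example: sixteen 32-bit little-endian words, each holding the near-zero value 5.
line = (5).to_bytes(4, "little") * 16
mask = zero_byte_mask(line)
switched_off = sum(mask)  # three zero bytes per word
```

In this example, three of every four bytes are zero, so 48 of the 64 bytes in the line would be eligible for the sleep state.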

Figure 1: Implementation of a 6T SRAM cell. The labeled transistors refer to the inverter loop of the cell.

The proposed approaches attack aging in the data array. Given that the tag array is much smaller than the data array, resilient technologies could be used to address tag wearout. For example, resilient 8T cells introduce a 19% area overhead compared with typical 6T cells. According to CACTI, implementing the tags with 8T cells results in just a 1.95% area overhead for a 16-kB L1 cache.

To help microprocessor architects understand how the distribution of logic values (i.e., “0” or “1”) among the cache cells, as well as the bit flips caused by write operations, affects wearout, this section summarizes the implementation of a typical SRAM cell and explains how it suffers from BTI and HCI effects.

As shown in Fig. 1, each cached bit is implemented with an SRAM memory cell consisting of six transistors (6T). The labeled transistors form an inverter loop that holds the stored logic value; this paper uses these labels to refer to these transistors. The remaining pass transistors, controlled by the word line signal, allow read and write operations to the cell through the bitline (BL) and its complementary bitline (written as BL with an overbar).

When the SRAM cell is under a “0” duty cycle, that is, when the cell is stable and storing a “0,” the pMOS transistor TP1 and the nMOS transistor TN2 are under stress and they suffer from NBTI and PBTI, respectively. On the contrary, under a “1” duty cycle, transistors TP2 and TN1 are affected by NBTI and PBTI, respectively. The wearout effects induced by each type of duty cycle are complementary, meaning that, for a given duty cycle, the pair of transistors not under stress are partially under recovery from BTI degradation. Thus, if every cache cell experiences a balanced distribution (i.e., 50%) of “0” and “1” duty cycles, wearout effects due to BTI are minimized and evenly distributed among the inverter loop transistors. Moreover, this reduces the probability of the circuit failing due to static noise margin (SNM) changes.
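The relation between duty cycle and stressed transistors can be sketched as follows. This is only an illustrative model: it assumes the stored value is sampled at equal time intervals, and the helper names `zero_duty_cycle` and `stress_share` are hypothetical. The transistor pairs follow the description above.

```python
# Transistors under BTI stress for each stored value, as described in the text:
# (pMOS suffering NBTI, nMOS suffering PBTI).
STRESSED = {0: ("TP1", "TN2"), 1: ("TP2", "TN1")}

def zero_duty_cycle(samples):
    """Fraction of equally spaced time samples in which the cell stores a 0."""
    samples = list(samples)
    return samples.count(0) / len(samples)

def stress_share(samples):
    """Fraction of time each inverter-loop transistor spends under stress."""
    d0 = zero_duty_cycle(samples)
    share = {}
    for value, frac in ((0, d0), (1, 1.0 - d0)):
        for name in STRESSED[value]:
            share[name] = frac
    return share

biased = [0] * 9 + [1]   # cell holds "0" for 90% of the samples
balanced = [0, 1] * 5    # 50% "0" and 50% "1"

print(stress_share(biased))
print(stress_share(balanced))
```

With the biased trace, TP1 and TN2 carry most of the stress while TP2 and TN1 mostly recover; with the balanced trace, all four transistors are stressed for half of the time.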

On the other hand, HCI affects all SRAM cell transistors on a write operation if the logic value flips, regardless of the type of transistor. This effect can be mitigated by avoiding bit flips during write operations. In addition, in order to minimize the chances of SRAM cell faults due to HCI wearout, the remaining bit flips must be evenly distributed among the cells.
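As a rough illustration of how flips concentrate in a few bit positions, the sketch below counts per-bit flips over a sequence of writes to one word. The function `count_flips` and the counter-like write trace are assumptions made for the example and are not taken from the paper’s methodology.

```python
def count_flips(writes, width=32):
    """Count, per bit position, how often consecutive writes to the same
    word flip that bit (a flip is where old XOR new has a 1)."""
    flips = [0] * width
    for old, new in zip(writes, writes[1:]):
        changed = old ^ new
        for bit in range(width):
            flips[bit] += (changed >> bit) & 1
    return flips

# A counter-like variable written 16 times with the values 0..15.
flips = count_flips(list(range(16)))
print(flips[:5])
```

For this trace the least significant bit flips on every write while the upper bits never flip, which is the kind of uneven distribution that the proposed techniques aim to spread across cells.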

To sum up, the inverter loop transistors are continuously aging regardless of whether the cell stores “0” or “1,” or is transitioning. This fact makes such transistors particularly sensitive to wear out. Note that the nMOS pass transistors just age when the SRAM cell is being accessed, which represents a very small fraction of the overall execution time, making them much less aging-sensitive than the inverter loop.

Hardware Implementation and Operation:

1) Hardware Components and Area Overhead: The BW mechanism can be implemented with 16 4-to-1 multiplexers, one for each data word within the incoming line. Fig. 2 shows one of the multiplexers and its associated inputs used in the write circuit. Label Bi refers to the different data bytes from the word, B0 and B3 being the LSB and MSB, respectively. Each data input consists of the data bytes ordered according to one of the four possible shift functions. The multiplexer is controlled by the CBW0 and CBW1 control bits that correspond to the current shift function. For the read circuit, another 16 4-to-1 multiplexers can be used for the requested line; however, the order of the data inputs differs from those of the write circuit, since in this operation, the contents must be realigned instead of shifted. These multiplexers are only used when reading and writing a given line; thus, they are shared among all the lines in the data array.

Figure 2: Write circuit for the BW mechanism.
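The write and read multiplexers can be modeled behaviorally as below. This is a sketch under stated assumptions: the four shift functions are taken to be byte rotations by 0 to 3 positions within a 4-byte word, the rotation direction is chosen arbitrarily, and all function names are hypothetical.

```python
WORD_BYTES = 4  # B0 (LSB) .. B3 (MSB)

def mux4(inputs, cbw1, cbw0):
    """4-to-1 multiplexer: select one of four inputs with two control bits."""
    return inputs[(cbw1 << 1) | cbw0]

def write_inputs(word):
    """Four candidate byte orderings for the write mux (assumed rotations)."""
    return [word[-s:] + word[:-s] if s else list(word) for s in range(WORD_BYTES)]

def read_inputs(word):
    """Read-mux inputs: the opposite rotations, which realign the bytes."""
    return [word[s:] + word[:s] for s in range(WORD_BYTES)]

def _words(line):
    return [list(line[i:i + WORD_BYTES]) for i in range(0, len(line), WORD_BYTES)]

def write_line(line, cbw1, cbw0):
    """Apply the current shift function to every word of the incoming line."""
    return [b for w in _words(line) for b in mux4(write_inputs(w), cbw1, cbw0)]

def read_line(stored, cbw1, cbw0):
    """Realign every word of a stored line back to its original byte order."""
    return [b for w in _words(stored) for b in mux4(read_inputs(w), cbw1, cbw0)]

line = list(range(64))             # 16 words of 4 bytes
stored = write_line(line, 0, 1)    # shift function 1
print(stored[:4])                  # first word after shifting
print(read_line(stored, 0, 1)[:4]) # first word after realignment
```

Because the read inputs are ordered as the inverse of the write inputs, reading with the same control bits that were used for the write returns the original line.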

2) Control Bit Inversion: Both HCI and BTI phenomena should be evaluated not only in the data array bits but also in the additional control bits added by our mechanisms and implemented as SRAM cells. Recall that the CBW0 and CBW1 bits make up a 2-bit counter and they are updated between regular shift phases of 8M processor cycles, which results in an implicitly balanced (i.e., near-optimal) duty cycle distribution in such bits. However, the CpBW bit is set to “1” when the associated line is written for the first time within a phase, and set to “0” every time a new phase starts. Our evaluation shows that such writes normally come soon after the phase begins, causing a highly biased “1” duty cycle in these bits, which exacerbates BTI in transistors TP2 and TN1.
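The contrast between the counter bits and the CpBW bit can be made concrete with a small model. It assumes that “8M” means 8,000,000 cycles and that the 2-bit counter simply increments once per phase; both are assumptions for illustration, as are the function names.

```python
PHASE_CYCLES = 8_000_000  # "8M processor cycles", read here as 8 * 10**6 (assumption)

def shift_function(cycle):
    """(CBW1, CBW0) at a given cycle, assuming the counter increments each phase."""
    count = (cycle // PHASE_CYCLES) % 4
    return (count >> 1) & 1, count & 1

def cpbw_one_duty(first_write_cycle):
    """Fraction of a phase in which CpBW holds "1", given the cycle (relative
    to the phase start) at which the line is first written."""
    return (PHASE_CYCLES - first_write_cycle) / PHASE_CYCLES

# Over four consecutive phases each counter bit is "1" in exactly two of them.
phases = [shift_function(p * PHASE_CYCLES) for p in range(4)]
print(phases)

# A line first written 1,000,000 cycles into the phase.
print(cpbw_one_duty(1_000_000))
```

Under these assumptions the counter bits see a 50% duty cycle, whereas a CpBW bit whose line is written early in the phase spends most of the phase holding “1”.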

3) Read/Write Operations: To clarify how the BW and SZB schemes work together, Fig. 3 plots a cache block diagram with both mechanisms represented as gray boxes. On a cache read hit, after the way multiplexer selects the target line from the selected set, its contents and the associated control bits are forwarded to the SZB read circuit. Once the SZB tristate buffers have forwarded the zero bytes, the BW multiplexers realign the bytes and serve the original line to the processor. Note that, on a read operation, there is no need to restore the power to those memory cells that originally would hold zero bytes.

Figure 3: Block diagram of the L1 data cache access, including the proposed components (gray boxes).
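The read-hit sequence described above (SZB forwarding of zero bytes followed by BW realignment) can be sketched as a two-stage function. The per-byte zero flags, the use of 0xFF to stand for the undefined contents of unpowered cells, and the rotation direction are all assumptions made for this illustration.

```python
def szb_restore(stored, zero_flags):
    """SZB stage: output a zero for every byte flagged as switched off,
    ignoring whatever the unpowered cells would return."""
    return [0 if off else b for b, off in zip(stored, zero_flags)]

def bw_realign(line, shift, word_bytes=4):
    """BW stage: undo the per-word byte rotation applied on the write."""
    out = []
    for i in range(0, len(line), word_bytes):
        w = line[i:i + word_bytes]
        out.extend(w[shift:] + w[:shift])
    return out

def read_hit(stored, zero_flags, shift):
    """Read path: zero bytes are forwarded first, then bytes are realigned."""
    return bw_realign(szb_restore(stored, zero_flags), shift)

# Original word [5, 0, 0, 0] written with shift 1 is held as [0, 5, 0, 0];
# the three zero bytes are powered off, modeled here as garbage (0xFF).
stored = [0xFF, 5, 0xFF, 0xFF]
zero_flags = [True, False, True, True]
print(read_hit(stored, zero_flags, shift=1))
```

The cells flagged as zero are never consulted in this model, which matches the observation that a read does not require restoring their power.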

Advantages:

  • threshold voltage degradation is reduced
  • duty cycle distribution is balanced

Software implementation:

  • Tanner tool