A Way-Filtering-Based Dynamic Logical–Associative Cache Architecture for Low-Energy Consumption
Last-level caches (LLCs) help improve performance but suffer from energy overhead because of their large sizes. An effective solution to this problem is to selectively power down several cache ways, which, however, reduces cache associativity and performance and thus limits its effectiveness in reducing energy consumption. To overcome this limitation, we propose a new cache architecture that can logically increase cache associativity of way-powered-down LLCs. Our proposed scheme is designed to be dynamic in activating an appropriate number of cache ways in order to eliminate the need for static profiling to determine an energy-optimized cache configuration. The experimental results show that our proposed dynamic scheme reduces the energy consumption of LLCs by 34% and 40% on single- and dual-core systems, respectively, compared with the best performing conventional static cache configuration. The overall system energy consumption including CPU, L2 cache, and DRAM is reduced by 9.2% on quad-core systems. The logic size, area, and power consumption of the proposed architecture are analyzed using Xilinx ISE 14.2.
Many circuit-level techniques for reducing the leakage energy of the cache memory have been proposed. The gated-Vdd technique proposed by Powell et al. is used in many cache designs, but it incurs data loss. The drowsy cache scheme reduces leakage energy in unused individual cache lines without data loss. The DRG-cache scheme also reduces the leakage energy of a cache memory without data loss. Its hardware complexity is lower than that of the drowsy cache, but its energy reduction is smaller. However, these two techniques are difficult to apply to real circuits because of process variations, which can make low-voltage SRAM cells faulty.
Many architectural studies have also been conducted to reduce the leakage energy of the cache memory. The most well-known technique is selective cache ways, which was developed to decrease dynamic energy but is also effective in reducing leakage energy. This technique selectively disables a subset of cache ways to reduce cache energy, as shown in Fig. 1(a). This approach is quite effective because many programs do not require the entire cache capacity. This fact has been exploited by many studies to reduce energy consumption or to increase performance. Many researchers have tried to reduce the number of ways with little performance degradation. Determining how many cache ways are required is very important. Hwang and Li distinguished repeated and fresh misses to determine the appropriate associativity in a cache set. A dynamic cache resizing technique for multicore systems has recently been proposed. Qureshi and Patt used utility monitors to compute an appropriate cache size when cache resizing is performed. We implement their cache size estimation scheme to compare it with our proposed cache architecture.
Figure 1: Conceptual diagrams of (a) selective cache ways and (b) selective cache sets
The selective-sets scheme was proposed to mitigate the performance degradation of selective cache ways. It turns OFF some cache sets while maintaining cache associativity, as shown in Fig. 1(b). It suffers from the following drawbacks. First, its set mapping changes when cache upsizing occurs. Cache indexing is based on a modular hash operation. Thus, downsizing (a decrease in the number of cache sets) does not change the cache set mapping, but upsizing (an increase in the number of cache sets) requires cache set remapping. To deal with this problem, the scheme flushes all cache sets when cache upsizing occurs. This process causes a large performance overhead because cold misses occur after flushing. Moreover, this process is not suitable for multicore systems because it flushes all sets, including the cache contents of the other core. If the programs running on the other core have high temporal locality and/or deadline requirements, this process incurs critical performance degradation and/or deadline misses. In many-core systems in particular, this process causes catastrophic performance degradation because a flush triggered by just one core affects the performance of all other cores. Second, the implementation overhead of the selective-sets scheme is larger than that of the selective-ways scheme. Changing the number of cache sets also changes the number of tag bits. Thus, the selective-sets scheme must use a tag array that is as large as required by the smallest number of supported cache sets, i.e., the selective-sets scheme requires larger tags than the selective-ways scheme, which also means that the tag access delay of selective-sets is longer than that of selective-ways. The finer power-down granularity of the selective-sets scheme also increases its hardware overhead. In summary, the selective-sets scheme is not suitable for real systems due to its lack of generality.
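The remapping problem described above can be illustrated with a small sketch. It assumes the simplest modular index (block address modulo the number of sets) and a toy cache of 4 or 8 sets; the function names are hypothetical and are not taken from the selective-sets work.

```python
def set_index(block_addr: int, num_sets: int) -> int:
    """Modular set indexing: the set is the block address modulo the set count."""
    return block_addr % num_sets


def misplaced_after_resize(block_addrs, old_sets: int, new_sets: int):
    """Blocks whose resident set (chosen under old_sets) differs from the set
    that a lookup would probe under new_sets."""
    return [a for a in block_addrs
            if set_index(a, old_sets) != set_index(a, new_sets)]


blocks = list(range(16))

# Downsizing 8 -> 4 sets: sets 4..7 are turned off, so only blocks resident
# in the surviving sets 0..3 need to be checked.
survivors = [a for a in blocks if set_index(a, 8) < 4]
stale_after_downsize = misplaced_after_resize(survivors, 8, 4)

# Upsizing 4 -> 8 sets: every resident block must be checked.
stale_after_upsize = misplaced_after_resize(blocks, 4, 8)

print(stale_after_downsize)  # no surviving block changes its set
print(stale_after_upsize)    # blocks that would now be looked up in the wrong set
```

In this toy setting, every block that survives downsizing is still found at the same index, whereas after upsizing half of the resident blocks sit in a set that a lookup no longer probes, which is why the scheme resorts to flushing.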
For this reason, modern real processors such as Intel Atom N270 and ARM Cortex-A9/15 adopt the selective-ways scheme rather than the selective-sets scheme. Thus, our scheme is based on selective-ways to easily apply it to commercial processors and to eliminate the flushing overheads.
- Power consumption is high
We propose a way-filtering (WF)-based logical–associative LLC architecture to reduce the energy consumption of LLCs. This architecture logically increases the associativity of LLCs when one to three cache ways are activated, and thus improves performance and reduces energy consumption. To further decrease tag way energy consumption, we utilize a partial tag-based WF scheme. In addition, a sequential logical way accessing and indexing scheme is proposed to support multiple LLC logical way accesses when the partial tag-based way filter reports multiple logical way hits in one physical way. To make our proposed WF-based logical–associative LLC architecture practical, we propose a dynamic resizing algorithm that eliminates the need for static cache profiling to determine an energy-optimized LLC configuration. “Energy-optimized configuration” means the configuration that consumes the least energy. Starting from one LLC data way, our dynamic resizing algorithm activates more or fewer LLC data ways, using the approximate standard deviation of cache misses across LLC logical sets as a metric for cache way demand. A logical way is one of the internal ways into which a physical cache way is divided. A logical set is the set of cache lines that share the same index across the logical ways.
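The resizing decision can be sketched as follows. The text does not state how the standard deviation is approximated, what thresholds are used, or how the metric maps to a resizing direction, so everything below is an assumption made for illustration: the mean absolute deviation stands in for the standard deviation, a large spread of per-logical-set misses is taken to signal demand for one more data way, and `grow_at`, `shrink_at`, and `max_ways` are hypothetical parameters.

```python
def approx_std(miss_counts):
    """Mean absolute deviation of per-logical-set miss counts.

    Assumption: used here as a cheap stand-in for the standard deviation
    (no squaring or square root); the source does not specify the formula.
    """
    mean = sum(miss_counts) / len(miss_counts)
    return sum(abs(m - mean) for m in miss_counts) / len(miss_counts)


def next_active_ways(active_ways, miss_counts, grow_at, shrink_at, max_ways):
    """Return the number of data ways to keep active for the next interval.

    Assumption: a spread above grow_at activates one more way, a spread
    below shrink_at deactivates one, and at least one way always stays on.
    """
    spread = approx_std(miss_counts)
    if spread > grow_at and active_ways < max_ways:
        return active_ways + 1
    if spread < shrink_at and active_ways > 1:
        return active_ways - 1
    return active_ways


# Start from one data way, as in the text; the miss counts are made up.
ways = 1
ways = next_active_ways(ways, [0, 8, 0, 8], grow_at=2.0, shrink_at=0.5, max_ways=4)
print(ways)
```

Changing the count by one way per interval keeps each step small and reversible; an actual controller could equally well use a different step size or metric.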
Non-filtering Parallel Logical–Associative Cache:
A sequential–associative cache architecture is proposed for direct-mapped caches so that applying sequential association to direct-mapped LLCs is possible. However, this architecture suffers from the following problems. Typical LLCs adopt high associativity to minimize memory accesses, and thus, the sequential-associativity should be high. However, the conventional sequential cache association incurs performance loss in highly sequential–associative direct-mapped LLCs due to low sequential way prediction accuracy and large penalty of wrong predictions. In addition, this architecture cannot support power down of each sequential cache way to reduce leakage energy consumption. Thus, we propose a new cache architecture that can be applied to set-associative LLCs. We call this cache architecture “logical–associative cache.”
Figure 2: NF parallel logical–associative cache for a four-way cache. (a) and (b) All tag ways are accessed in parallel. The solid arrows indicate tag way accesses and the dotted arrows indicate data way accesses when a tag hit occurs. LW stands for logical way
Our first idea for designing the “logical–associative cache” architecture is to activate all tag ways so that they can be accessed in parallel, while activating only some of the data ways to reduce their leakage energy consumption. Increasing cache associativity to support parallel access is a very effective way to increase performance. We compared the execution time, energy consumption, EDP, and LLC misses of a two-way 1024-set LLC configuration and a 16-way 128-set LLC configuration. These two cache configurations have the same capacity (256 kbytes). The results are presented in Table I. On average, the 16-way 128-set cache shows 7.4% better performance, 5.4% less energy, and 10.9% less EDP compared with the two-way 1024-set cache. Thus, for a fixed capacity, increasing the cache associativity is a good way to increase system performance. Our first idea to increase parallelism is shown in Fig. 2, where the tag ways are physically set-associative, whereas the data ways are logically associative within the activated data ways.
Proposed Way-Filtering-Based Logical–Associative LLC Architecture:
To reduce the energy consumption in the tag ways of LLCs, we apply a partial tag matching scheme. It extracts a few bits from the tag bits to identify a cache miss early. Fig. 3 shows our proposed partial tag-based way filter. A partial tag consists of a few least significant bits from the original tag bits and a few most significant bits from the index bits. Because the cache lines in a cache way are logically divided, the most significant bits of the index (3 bits for eight logical ways) are used as part of the partial tag. This is because the number of logical cache sets becomes smaller than the number of cache sets when logical cache ways are applied. A partial tag-based way filter is allocated to each cache way, and it is powered down when its corresponding cache way is turned OFF. Experimentally, we find that a 4-bit partial tag and eight logical ways give the best results.
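A minimal sketch of how such a partial tag could be formed and used for filtering is given below. The line size and set count are hypothetical, and the sketch assumes one reading of the text: the 4-bit partial tag counts only the low tag bits, and the 3 index MSBs are appended to it. The names `CacheGeometry`, `split`, `partial_tag`, and `filter_candidates` are illustrative and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheGeometry:
    line_bytes: int = 64        # hypothetical line size
    num_sets: int = 1024        # hypothetical sets per physical way
    logical_ways: int = 8       # from the text
    partial_tag_bits: int = 4   # from the text; assumed to count tag bits only

    @property
    def offset_bits(self) -> int:
        return self.line_bytes.bit_length() - 1

    @property
    def index_bits(self) -> int:
        return self.num_sets.bit_length() - 1

    @property
    def lw_bits(self) -> int:
        return self.logical_ways.bit_length() - 1


def split(addr: int, g: CacheGeometry):
    """Split an address into (tag, index MSBs, logical set index)."""
    index = (addr >> g.offset_bits) & (g.num_sets - 1)
    tag = addr >> (g.offset_bits + g.index_bits)
    low_index_bits = g.index_bits - g.lw_bits
    logical_set = index & ((1 << low_index_bits) - 1)
    index_msbs = index >> low_index_bits
    return tag, index_msbs, logical_set


def partial_tag(addr: int, g: CacheGeometry) -> int:
    """Low tag bits concatenated with the index MSBs."""
    tag, index_msbs, _ = split(addr, g)
    low_tag = tag & ((1 << g.partial_tag_bits) - 1)
    return (low_tag << g.lw_bits) | index_msbs


def filter_candidates(filter_row, addr: int, g: CacheGeometry):
    """Logical ways whose stored partial tag matches the address.

    filter_row holds one stored partial tag (or None if invalid) per logical
    way of the address's logical set. An empty result means a certain miss in
    this physical way; a non-empty result still needs a full tag comparison.
    """
    want = partial_tag(addr, g)
    return [lw for lw, stored in enumerate(filter_row) if stored == want]
```

With this layout, a lookup that returns more than one candidate corresponds to the multiple-logical-way-hit case, in which the candidate logical ways would be checked one after another.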
Figure 3: Overall diagram of our proposed WF-based logical–associative cache architecture. LW stands for logical way
- Power consumption is low
- Xilinx ISE