A Dual-Clock VLSI Design of H.265 Sample Adaptive Offset Estimation for 8k Ultra-HD TV Encoding
Sample adaptive offset (SAO) is a newly introduced in-loop filtering component in H.265/High Efficiency Video Coding (HEVC). While SAO contributes a notable coding-efficiency improvement, the estimation of SAO parameters dominates the complexity of in-loop filtering in HEVC encoding. This paper presents an efficient VLSI design for SAO estimation. Our design features a dual-clock architecture that processes statistics collection (SC) and parameter decision (PD), the two main functional blocks of SAO estimation, with high- and low-speed clocks, respectively. By addressing the heterogeneous data flows of SC and PD, this strategy reduces the overall area by 56%. To further improve area and power efficiency, algorithm-architecture co-optimizations are applied, including a coarse range selection (CRS) and an accumulator bit-width reduction (ABR). CRS shrinks the range of finely processed bands for band offset estimation. ABR further reduces the area by narrowing the accumulators of SC. Together they achieve another 25% area reduction. The proposed VLSI design is capable of 8k at 120-frames/s encoding. It occupies 51k logic gates, only one-third of the circuit area of state-of-the-art implementations.
While 1080p High Definition (HD) and 4k ultra-HD formats dominate today's high-end video applications, 8k ultra-HD is being promoted as the next-generation video specification. With up to 7680×4320 pixels per frame at 120 frames/s, 8k is expected to deliver an extremely high-quality visual experience. To transmit such a huge data throughput (TP) over the communication channel, deep compression from the latest video coding technology, High Efficiency Video Coding (H.265/HEVC), plays a crucial role. The implementation of the corresponding video codecs, however, is challenged by the combination of the ultrahigh TP requirement and an increased complexity per pixel. Compared with the previous H.264/Advanced Video Coding (AVC) standard, H.265/HEVC doubles coding efficiency by employing a number of new coding tools. In particular, a sample adaptive offset (SAO) component is newly introduced as one of the in-loop filters (ILFs), contributing up to 18% Bjøntegaard delta (BD)-rate reduction. In H.264/AVC, the deblocking filter (DBF) is the only ILF.
Its VLSI implementation has been discussed in many previous works. In H.265/HEVC, DBF has been simplified, and SAO dominates the complexity of ILF, especially in a video encoder. Several previous works have discussed SAO's implementation. Joo et al. proposed utilizing the intraprediction mode to predict the edge offset (EO) type, so that the number of EO types could be reduced to save encoding time. Choi and Joo evaluated several algorithm-level improvements for SAO. Chen et al. developed a low-complexity SAO algorithm based on class combination, band offset (BO) predecision, and merge separation category. Rediess et al. discussed architectures for statistics collection (SC) and parameter decision (PD), the two main components of SAO. Mody et al. designed an SAO estimation architecture supporting 4k at 60-frames/s encoding. Zhu et al. developed a fast SAO estimation algorithm and its VLSI architecture supporting 8k encoding. The complete implementations of SAO estimation both require a relatively large circuit area, which still leaves plenty of room for improvement.
This paper aims at designing an efficient VLSI architecture for SAO estimation in H.265/HEVC. To achieve high area efficiency, we propose three techniques.
- Dual-Clock SAO Architecture: The highly heterogeneous data flows of SC and PD in SAO give each part a completely different preferred working frequency, and this difference is the main obstacle to an efficient implementation. This technique addresses the heterogeneous data flows by driving SC and PD at high- and low-speed clocks, respectively, so that the two parts can be integrated efficiently. It reduces the overall area by 56%, from 156k to 68k gates.
- Coarse Range Selection (CRS) for BO: Based on an analysis of band distributions in each coding tree block (CTB), and of the hardware resources needed for finding the best bands, this technique estimates the range of bands before SC with an accuracy of 60%–80% and shrinks the range of finely processed bands from 32 to 8 to reduce the circuit area.
- Accumulator Bit-Width Reduction (ABR): By exploiting the mutual-exclusion relationship among categories/bands in the accumulation process of SC, this technique applies early termination to accumulators that reach a threshold, further reducing their circuit area.
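The band-histogram idea behind CRS can be sketched in software. This is a hypothetical illustration only: the function name, the subsampling stride, and the two-pass (coarse-then-fine) structure are our assumptions, not the exact method of the proposed design.

```python
# Hypothetical sketch of coarse range selection (CRS) for BO.
# Assumes 8-bit samples split into 32 bands, of which a contiguous
# window of 8 bands is selected for fine processing.

def coarse_band_range(samples, num_bands=32, window=8, stride=4):
    """Subsample the CTB, build a coarse band histogram, and return
    the start index of the `window` contiguous bands holding the most
    subsampled pixels; fine SC is then restricted to that range."""
    hist = [0] * num_bands
    for s in samples[::stride]:          # coarse pass: every 4th sample
        hist[s >> 3] += 1                # band index = sample // 8 for 8-bit video
    best_start, best_count = 0, -1
    for start in range(num_bands - window + 1):
        count = sum(hist[start:start + window])
        if count > best_count:
            best_start, best_count = start, count
    return best_start                    # fine SC covers bands [best_start, best_start + window - 1]
```

Only the 8 selected bands then need full Sum/Count accumulators, which is where the area saving comes from.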
SAO aims at reducing the distortion of the reconstructed pictures by adaptively adding offsets to the reconstructed samples at both the encoder and decoder. The SAO parameters, i.e., how the offsets should be generated and applied, are signaled at the coding tree unit (CTU) level. The offset to be applied depends on the classification of the target sample. There are two kinds of classifiers: EO and BO. The sample classification of EO depends on the comparison between the current sample and its neighboring ones, while the sample classification of BO depends on the value of the current sample itself. The optimal classifier and the offsets for each CTU are found during the encoding process, called SAO estimation, which comprises the SC and PD phases, as shown in Fig. 1.
Figure 1: Overview of SAO. Details of the proposed SC engine and PD engine are in Figs. 2 and 3, respectively.
Heterogeneous Data Flows of SC and PD:
The main obstacle to an efficient SAO implementation comes from the highly heterogeneous data flows of SC and PD. The SC for each EO or BO classifier comprises a large number of simple iterations. On the other hand, PD involves significantly fewer iterations (56 or less per CTU), each of which is much more complex.
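The per-sample simplicity of SC can be illustrated with a sketch for one horizontal EO class. This is an illustrative software model under our own assumptions, not the paper's hardware description: each sample costs only a classification, one addition to a Sum accumulator (SAcc), and one increment of a Count accumulator (CAcc).

```python
# Illustrative sketch of statistics collection (SC) for the
# horizontal EO class: one trivial accumulate per sample, repeated
# thousands of times per CTU.

def collect_eo_stats(orig, recon):
    """orig/recon: equal-length rows of samples. Classify each interior
    reconstructed sample against its horizontal neighbors and
    accumulate per-category Sum/Count statistics."""
    sums = [0] * 5    # SAcc per category 0..4
    counts = [0] * 5  # CAcc per category 0..4
    for i in range(1, len(recon) - 1):
        a, c, b = recon[i - 1], recon[i], recon[i + 1]
        if c < a and c < b:
            cat = 1
        elif (c < a and c == b) or (c == a and c < b):
            cat = 2
        elif (c > a and c == b) or (c == a and c > b):
            cat = 3
        elif c > a and c > b:
            cat = 4
        else:
            cat = 0
        sums[cat] += orig[i] - recon[i]   # sum of reconstruction errors
        counts[cat] += 1
    return sums, counts
```

PD, in contrast, consumes these few accumulated statistics in a handful of complex rate-distortion decisions per CTU.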
Figure 2: Architecture of the 2×2 SC engine, detailing the dark gray part of Fig. 1. SAcc and CAcc abbreviate the accumulators of Sum and Count.
The serial characteristics of SC and the large number of iterations involved, however, make SC inefficient to parallelize. As shown in the gray parts of Figs. 2 and 3, the hardware components of these parts grow quadratically in area as N increases. Meanwhile, the function of SC permits a short critical path, so a high frequency is preferred. A high working frequency is not preferred in PD, however, because: 1) few clock cycles are needed to perform the limited number of iterations for each CTU and 2) each iteration involves complex computation that results in a long critical path. This large difference in preferred working frequency thus becomes the key challenge in integrating SC and PD efficiently.
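A back-of-envelope cycle budget makes the frequency gap concrete. The numbers below are illustrative assumptions (64×64 CTUs, one cycle per 2×2 samples in SC, 56 PD iterations per CTU as stated above), not the exact configuration of the proposed design.

```python
# Illustrative cycle budget for 8k at 120 frames/s.

PIXELS_PER_SEC = 7680 * 4320 * 120          # ~3.98 Gpixels/s
CTUS_PER_SEC = PIXELS_PER_SEC // (64 * 64)  # assuming 64x64 CTUs

sc_clock_hz = PIXELS_PER_SEC / 4            # SC: 1 cycle per 2x2 samples
pd_clock_hz = CTUS_PER_SEC * 56             # PD: at most 56 iterations per CTU

print(f"SC needs ~{sc_clock_hz / 1e6:.0f} MHz, PD only ~{pd_clock_hz / 1e6:.0f} MHz")
```

Under these assumptions SC must run roughly an order of magnitude faster than PD needs to, which is exactly the mismatch the dual-clock architecture exploits.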
Figure 3: Proposed architecture of the PD engine, detailing the light gray part of Fig. 1.