A Framework for the Automatic Vectorization of Parallel Sort on x86-based Processors

Abstract:

The continued growth in the width of vector registers and the evolving library of intrinsics on modern x86 processors make manual optimization for data-level parallelism tedious and error-prone. In this paper, we focus on parallel sorting, a building block for many higher-level applications, and propose a framework for the Automatic SIMDization of Parallel Sorting (ASPaS) on x86-based multi- and many-core processors. That is, ASPaS takes any sorting network and a given instruction set architecture (ISA) as inputs and automatically generates vector code for that sorting network. After formalizing the sort function as a sequence of comparators and the transpose and merge functions as sequences of vector-matrix multiplications, ASPaS maps these functions to operations from a selected “pattern pool” that is based on the characteristics of parallel sorting, and then generates the vector code with real ISA intrinsics. The performance evaluation on Intel Ivy Bridge and Haswell CPUs and the Knights Corner MIC shows that the sorting code automatically generated by ASPaS can outperform widely used sorting tools, achieving up to a 5.2x speedup over the single-threaded implementations from STL and Boost and up to a 6.7x speedup over the multi-threaded parallel sort from Intel TBB.
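
To make the comparator formalism concrete, the sketch below shows a 4-element sorting network expressed as a fixed, data-independent sequence of compare-exchange operations in plain C. This illustrative code is our own and is not taken from the ASPaS paper.

    /* A comparator, the basic building block of a sorting network:
     * it orders one pair of elements. Because the comparator sequence
     * is fixed and data-independent, it maps naturally to SIMD lanes. */
    static inline void comparator(int *x, int *y) {
        if (*x > *y) { int t = *x; *x = *y; *y = t; }
    }

    /* A 4-input sorting network written as a comparator sequence. */
    void sort4(int a[4]) {
        comparator(&a[0], &a[1]);
        comparator(&a[2], &a[3]);
        comparator(&a[0], &a[2]);
        comparator(&a[1], &a[3]);
        comparator(&a[1], &a[2]);
    }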

Existing System:

Writing efficient vectorized (SIMD) code by hand is a time-consuming and error-prone activity. First, vectorizing existing (complex) codes requires expert knowledge of how to restructure algorithms to exploit their SIMDization potential. Second, a comprehensive understanding of the actual vector intrinsics is needed. The intrinsics for data management and movement are as important as those for computation, because programmers often need to rearrange data in the vector units before sending it to the ALU. Unfortunately, the flexibility of the data-reordering intrinsics is restricted, as directly supporting an arbitrary permutation is impractical [2]. As a consequence, programmers must resort to combinations of data-reordering intrinsics to attain a desired computational pattern. Third, the vector instruction set architectures (ISAs) continue to evolve and expand, which, in turn, leads to potential portability issues.
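
As an illustration of such composition, the AVX sketch below (our own example, not from the paper) reverses eight packed floats, a permutation that AVX hardware without AVX2 cannot express in a single cross-lane float shuffle, so two data-reordering intrinsics must be chained.

    #include <immintrin.h>

    /* Reverse the 8 floats in an AVX register. AVX (without AVX2's
     * vpermps) offers no single arbitrary cross-lane float shuffle,
     * so two data-reordering intrinsics must be combined. */
    static inline __m256 reverse8(__m256 v) {
        __m256 swapped = _mm256_permute2f128_ps(v, v, 0x01);        /* swap the two 128-bit lanes */
        return _mm256_permute_ps(swapped, _MM_SHUFFLE(0, 1, 2, 3)); /* reverse within each lane */
    }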

Proposed System:

We focus on the sorting primitive and propose a framework, the Automatic SIMDization of Parallel Sorting (a.k.a. ASPaS), to automatically generate efficient SIMD code for parallel sorting on x86-based multicore and manycore processors, i.e., CPUs and MICs, respectively. ASPaS takes any sorting network and a given ISA as inputs and automatically produces vectorized sorting code as the output. The generated code adopts a bottom-up approach, first sorting and then merging segmented data. Since the vectorized sort function leaves partially sorted data scattered across different segments, ASPaS gathers the sorted data into contiguous regions through a transpose function before the merge stage, as sketched below.
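
The sketch below, our own illustration using SSE intrinsics rather than actual ASPaS-generated code, shows this sort-then-transpose idea for four 4-element segments: each comparator operates on four segments at once, one per SIMD lane, and a register transpose then makes each sorted segment contiguous before merging.

    #include <xmmintrin.h>

    /* Vectorized compare-exchange: one comparator applied to four
     * independent segments at once, one segment per SIMD lane. */
    static inline void cmp_exchange(__m128 *a, __m128 *b) {
        __m128 lo = _mm_min_ps(*a, *b);
        __m128 hi = _mm_max_ps(*a, *b);
        *a = lo;
        *b = hi;
    }

    /* Sort four 4-element segments held column-wise in r0..r3, then
     * transpose so each sorted segment lands in a contiguous row. */
    void sort_transpose_4x4(__m128 r0, __m128 r1, __m128 r2, __m128 r3,
                            float *out) {
        cmp_exchange(&r0, &r1);
        cmp_exchange(&r2, &r3);
        cmp_exchange(&r0, &r2);
        cmp_exchange(&r1, &r3);
        cmp_exchange(&r1, &r2);
        _MM_TRANSPOSE4_PS(r0, r1, r2, r3); /* gather segments into rows */
        _mm_storeu_ps(out + 0,  r0);
        _mm_storeu_ps(out + 4,  r1);
        _mm_storeu_ps(out + 8,  r2);
        _mm_storeu_ps(out + 12, r3);
    }

In the full ASPaS flow, analogous but wider sequences are generated automatically for the target ISA, and a vectorized merge stage then combines the sorted segments.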

Conclusion:

We propose the ASPaS framework to automatically generate vectorized sorting code for x86-based multicore and manycore processors. ASPaS formalizes the sorting and merging networks as sequences of comparing and reordering operators of a domain-specific language (DSL). Based on the characteristics of these operators, ASPaS first creates an ISA-friendly pool containing the requisite data-comparing and data-reordering primitives, then builds the operator sequences from those primitives, and finally maps them to real ISA intrinsics. In addition, the ASPaS-generated code exhibits an efficient memory access pattern and thread-level parallelism. The generated code can outperform compiler-optimized code and, at the same time, yields the highest performance among multiple library sorting tools on the Ivy Bridge, Haswell, and Knights Corner architectures.

REFERENCES:

[1] S. Maleki, Y. Gao, M. Garzaran, T. Wong, and D. Padua, “An Evaluation of Vectorizing Compilers,” in ACM Int. Conf. Parallel Arch. Compil. Tech. (PACT), 2011.

[2] Intel. Intel Xeon Phi Coprocessor System Software Developers Guide. Intel Corporation. Document ID: 488596.

[3] X. Huo, B. Ren, and G. Agrawal, “A Programming System for Xeon Phis with Runtime SIMD Parallelization,” in ACM Int. Conf. Supercomput. (ICS), 2014.

[4] D. S. McFarlin, V. Arbatov, F. Franchetti, and M. Püschel, “Automatic SIMD Vectorization of Fast Fourier Transforms for the Larrabee and AVX Instruction Sets,” in ACM Int. Conf. Supercomput. (ICS), 2011.

[5] D. Unat, X. Cai, and S. B. Baden, “Mint: Realizing CUDA Performance in 3D Stencil Methods with Annotated C,” in ACM Int. Conf. Supercomput. (ICS), 2011.

[6] N. Maruyama, T. Nomura, K. Sato, and S. Matsuoka, “Physis: An Implicitly Parallel Programming Model for Stencil Computations on Large-scale GPU-accelerated Supercomputers,” in Int. Conf. High Perf. Comput., Netw., Storage and Anal. (SC), 2011.

[7] A. R. Benson and G. Ballard, “A Framework for Practical Parallel Fast Matrix Multiplication,” in ACM SIGPLAN Symp. Principles Pract. Parallel Program. (PPoPP), 2015.

[8] S. W. Al-Haj Baddar and K. E. Batcher, Designing Sorting Networks: A New Paradigm. Springer Sci. & Bus. Media, 2011.

[9] K. E. Batcher, “Sorting Networks and Their Applications,” in ACM Spring Joint Computer Conf., 1968.

[10] T. N. Hibbard, “An Empirical Study of Minimal Storage Sorting,” Commun. ACM, 1963.

[11] R. C. Bose and R. J. Nelson, “A Sorting Problem,” J. ACM, 1962.

[12] P. Plauger, M. Lee, D. Musser, and A. A. Stepanov, The C++ Standard Template Library, 1st ed. Prentice Hall PTR, 2000.

[13] B. Schling, The Boost C++ Libraries. XML Press, 2011.

[14] J. Reinders, Intel Threading Building Blocks, 1st ed. O’Reilly & Associates, Inc., 2007.

[15] K. Hou, H. Wang, and W.-c. Feng, “ASPaS: A Framework for Automatic SIMDization of Parallel Sorting on x86-based Manycore Processors,” in ACM Int. Conf. Supercomput. (ICS), 2015.