A GPU-aware Parallel Index for Processing High-dimensional Big Data

Abstract:

The problem of the curse of dimensionality for processing large high-dimensional datasets has been an open challenge. Numerous research efforts have been proposed for improving query performance in high-dimensional space through hierarchical indexing using the R-tree or its variants and exploring parallel processing of the R-tree on GPUs. Despite these existing efforts, the curse of dimensionality remains to be a grand challenge since the existing methods deteriorate drastically as the dimensionality of datasets increases. To cope with this problem, we present a novel GPU-aware parallel indexing method called G-tree, which offers consistent and stable performance in high-dimensional space. The rationale of the G-tree is to combine the efficiency of the R-tree in low-dimensional space with the massive parallel processing potential of GPUs by introducing a new data structure and three new optimization techniques to better utilize the GPU memory structure for accelerating both index search and index node access on GPUs. The first two optimizations promote effective parallelism utilization in GPU memory access. We dedicate the third optimization to further speed up the G-tree index by conducting progressive filtering using our dimension filters. We evaluate the validity of the G-tree approach by extensive experiments on high-dimensional datasets, showing that the G-tree outperforms the existing state-of-the-art techniques.

Existing  System:

We see a pressing demand for innovative methods and algorithms that can effectively address the problem of the curse of dimensionality in processing large high-dimensional datasets. To truly address the curse of dimensionality problem, we aim at designing and implementing innovative data access methods and algorithms that do not deteriorate in performance when the dimensionality of the datasets becomes large. Concretely, we argue that it is important to rethink the big data processing framework for high-dimensional data by exploring innovative data structures and optimizations for maximizing both data and computation parallelism.

Proposed System:

We present a graphics processing unit (GPU)-aware parallel indexing technique called G-tree, and its parallel computational framework. Our experimental results show that the G-tree can deliver superior performance compared to the existing state-of-the-art methods for high-dimensional data. We broadly classify state-of-the-art research in multi-dimensional data processing into three categories. The first category is focused primarily on improving the search performance of the R-tree in high-dimensional data spaces.

Conclusion:

We have presented a GPU-aware parallel index scheme, called G-tree, for large-scale high-dimensional query processing. The rationale of the G-tree design is to combine the efficiency of the R-tree in lower-dimensional space with the parallel computing capability of GPUs in higher dimensionality. We employ three design strategies. First, we introduce a new data structure (a structure of arrays) to better utilize the GPU memory structure, accelerating both index search and index node access on GPUs by promoting effective parallelism utilization in GPU memory access. Second, unlike previous approaches, the G-tree by design introduces a BFS base lookup without a queue or a stack, enabling the G-tree to be more efficient at handling datasets of high dimensionality. Third, we further speed up search and access of the G-tree index by introducing the selective dimensionality filters, which select a small subset of all dimensions to generate the candidate set of index nodes, on which we perform a parallel search based on the SoA using the BFS as the traversal order, without resorting to a queue or a stack. Extensive experiments on both synthetic and real datasets show that the G-tree approach outperforms the existing representative algorithms, such as the X-tree, the SR-tree, and the Parallelized R-tree by several orders of magnitude. Our ongoing work includes update operations and further optimizations to support multiple-query processing on GPUs, such as identifying cases where we can incorporate the L1 and L2 caches in recent generations of GPUs to further reduce global memory access latency.

REFERENCES

[1] V. Gaede and O. Günther, “Multidimensional access methods,” ACM Comput. Surv., vol. 30, no. 2, pp. 170–231, Jun. 1998.

[2] A. Guttman, “R-trees: A Dynamic Index Structure for Spatial Searching,” in Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, 1984, pp. 47– 57.

[3] S. Berchtold, C. Böhm, and H.-P. Kriegal, “The pyramid-technique: towards breaking the curse of dimensionality,” in Proceedings of the 1998 ACM SIGMOD international conference on Management of data – SIGMOD ’98, 1998, vol. 27, no. 2, pp. 142– 153.

[4] C. Kim et al., “Designing Fast Architecture-sensitive Tree Search on Modern Multicore/Many-core Processors,” ACM Trans. Database Syst., vol. 36, no. 4, p. 22:1–22:34, 2011.

[5] J. Kim, S. Hong, and B. Nam, “A Performance Study of Traversing Spatial Indexing Structures in Parallel on GPU,” in 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems, 2012, pp. 855–860.

[6] S. You, J. Zhang, and L. Gruenwald, “Parallel Spatial Query Processing on GPUs Using R-trees,” in Proceedings of the 2Nd ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data, 2013, pp. 23–31.

[7] T. K. Sellis, N. Roussopoulos, and C. Faloutsos, “The R+ -Tree: A Dynamic Index for Multi-Dimensional Objects,” in Proceedings of the 13th International Conference on Very Large Data Bases, 1987, pp. 507–518.

[8] N. Beckmann et al., “The R*-tree: an efficient and robust access method for points and rectangles,” in Proceedings of the 1990 ACM SIGMOD international conference on Management of data – SIGMOD ’90, 1990, vol. 19, no. 2, pp. 322–331.

[9] S. Berchtold, D. A. Keim, and H.-P. Kriegel, “The X-tree: An Index Structure for High-Dimensional Data,” in Proceedings of the 22th International Conference on Very Large Data Bases, 1996, pp. 28–39.

[10] N. Katayama and S. Satoh, “The SR-tree: An Index Structure for High-dimensional Nearest Neighbor Queries,” in Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, 1997, pp. 369–380.

[11] J. T. Robinson, “The K-D-B-tree: A Search Structure for Large Multidimensional Dynamic Indexes,” in Proceedings of the 1981 ACM SIGMOD International Conference on Management of Data, 1981, pp. 10–18.

[12] D. A. White and R. Jain, “Similarity indexing with the SS-tree,” in Proceedings of the Twelfth International Conference on Data Engineering, pp. 516–523.

[13] I. Kamel and C. Faloutsos, “On Packing R-trees,” in Proceedings of the Second International Conference on Information and Knowledge Management, 1993, pp. 490–499.

[14] D. Comer and Douglas, “Ubiquitous B-Tree,” ACM Comput. Surv., vol. 11, no. 2, pp. 121–137, Jun. 1979.

[15] R. Zhang, B. C. Ooi, and K.-L. Tan, “Making the pyramid technique robust to query types and workloads,” in Proceedings. 20th International Conference on Data Engineering, pp. 313–324