A Deep Learning-Based Data Minimization Algorithm for Fast and Secure Transfer of Big Genomic Datasets

Abstract:

In the age of Big Genomics Data, institutions such as the National Human Genome Research Institute (NHGRI) are challenged in their efforts to share large volumes of data between researchers, a process that has been plagued by unreliable transfers and slow speeds. These problems arise from the throughput bottlenecks of traditional transfer technologies. Two factors that affect the efficiency of data transmission are the channel bandwidth and the amount of data. Increasing the bandwidth is one way to transmit data efficiently, but it is not always possible due to resource limitations. Another way to maximize channel utilization is to decrease the number of bits needed to transmit a dataset. Traditionally, big genomic data are transmitted between two geographical locations using general-purpose protocols such as the hypertext transfer protocol (HTTP) and the file transfer protocol (FTP). In this paper, we present a novel deep learning-based data minimization algorithm that 1) minimizes the datasets during transfer over the carrier channels, and 2) protects the data from man-in-the-middle (MITM) and other attacks by changing the binary representation (content encoding) of the same dataset several times: we assign different codewords to the same character in different parts of the dataset. Our data minimization strategy exploits the limited alphabet of DNA sequences and modifies the binary representation (codeword) of dataset characters using a deep learning-based convolutional neural network (CNN), which ensures that the shortest codewords are assigned to the highest-frequency characters at different time slots during the transfer. This algorithm transmits big genomic DNA datasets with minimal bits and latency, yielding an efficient and expedient process. Our heuristic model, simulation, and real implementation results indicate that the proposed data minimization algorithm is up to 99 times faster and more secure than the content-encoding scheme currently used in HTTP, and 96 times faster than FTP, on the tested datasets. The protocol, developed in C#, will be made available to the wider genomics community and domain scientists.

Existing System:

Low-cost, high-throughput instruments and cloud-based services have solved big data generation and processing challenges to a certain extent, but they do not efficiently solve the data transfer speed and security problems witnessed in transferring big data between two or more places (e.g., between two biology laboratories or between lab, cloud, and lab), which still results in a bottleneck due to the use of traditional transfer protocols such as HTTP [9] and FTP [10].

Proposed System:

We present a novel deep learning-based data minimization algorithm that 1) minimizes the datasets during transfer over the carrier channels, and 2) protects the data from man-in-the-middle (MITM) and other attacks by changing the binary representation (content encoding) of the same dataset several times: we assign different codewords to the same character in different parts of the dataset. Our data minimization strategy exploits the limited alphabet of DNA sequences and modifies the binary representation (codeword) of dataset characters using a deep learning-based convolutional neural network (CNN), which ensures that the shortest codewords are assigned to the highest-frequency characters at different time slots during the transfer. This algorithm transmits big genomic DNA datasets with minimal bits and latency, yielding an efficient and expedient process; a minimal sketch of the per-block codeword rotation follows.
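To make the per-block codeword rotation concrete, the C# sketch below assigns the shortest prefix codewords to the most frequent DNA characters within each block, so the same character can receive different codewords in different parts of a dataset. The block granularity, the codeword set, and all class and method names here are our illustrative assumptions; the sketch deliberately omits the CNN that predicts character frequencies and shows only the encoding step.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;

    // Hypothetical illustration only: per-block prefix coding over the
    // DNA alphabet. Block size, codeword set, and naming are our
    // assumptions, not the authors' released protocol.
    class BlockCodewordEncoder
    {
        // Shortest codewords first; given to the most frequent symbols.
        static readonly string[] Codewords = { "0", "10", "110", "111" };

        public static string EncodeBlock(string block)
        {
            // Rank the four DNA symbols by their frequency in this block.
            char[] ranked = "ACGT"
                .OrderByDescending(c => block.Count(b => b == c))
                .ToArray();

            // Re-derive the symbol-to-codeword table for every block, so
            // the same character can map to a different codeword elsewhere.
            var table = new Dictionary<char, string>();
            for (int i = 0; i < ranked.Length; i++)
                table[ranked[i]] = Codewords[i];

            // Emit the bit string for this block (as text, for clarity).
            var bits = new StringBuilder();
            foreach (char c in block)
                bits.Append(table[c]);
            return bits.ToString();
        }

        static void Main()
        {
            // Two blocks with different statistics: 'A' gets the one-bit
            // codeword in the first block, 'G' gets it in the second.
            Console.WriteLine(EncodeBlock("AAAACGTA"));
            Console.WriteLine(EncodeBlock("GGGGACTG"));
        }
    }

Because the symbol-to-codeword table is re-derived for every block, a table recovered from one part of the stream does not decode the rest, which is the intuition behind the MITM protection described above.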

Conclusion:

In this paper, we implemented a novel deep learning-based data minimization algorithm that integrates with transfer protocols to reduce the size of big genomic datasets during the transfer phase and then transfer the data securely in less time. The implementation results illustrate that the proposed data minimization algorithm reduces transfer time 99-fold compared to the standard content encoding of HTTP, and 96-fold compared to FTP, on the tested datasets. We used the GZIP and MFCompress algorithms as optional compression algorithms, alongside our data minimization algorithm, to assess how the transfer protocol behaves in terms of transfer time and size. We also showed that our data minimization algorithm provides the best size reduction, reduces transfer time, and transfers big genomic datasets securely.

References:

[1] C. Mora, D. P. Tittensor, S. Adl, A. G. Simpson, and B. Worm, “How many species are there on Earth and in the ocean?” PLoS Biology, vol. 9, no. 8, p. e1001127, 2011.

[2] F. S. Collins and H. Varmus, “A new initiative on precision medicine,” New England Journal of Medicine, vol. 372, no. 9, pp. 793–795, 2015.

[3] T. C. Carter and M. M. He, “Challenges of identifying clinically actionable genetic variants for precision medicine,” Journal of Healthcare Engineering, vol. 2016, 2016.

[4] N. Drake et al., “Cloud computing beckons scientists.” Nature, vol. 509, no. 7502, pp. 543–544, 2014.

[5] R. Spencer, “The square kilometre array: The ultimate challenge for processing big data,” in IET Seminar on Data Analytics 2013: Deriving Intelligence and Value from Big Data. IET, 2013, pp. 1–26.

[6] M. Schatz, “The next 20 years of genome research,” bioRxiv, p. 020289, 2015.

[7] Census, “World population.”

[8] J. Hadfield and N. Loman, “Next generation genomics: world map of high-throughput sequencers,” 2014.

[9] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee, “Hypertext transfer protocol – HTTP/1.1,” Internet Engineering Task Force (IETF), Tech. Rep., 1999.

[10] J. Postel and J. Reynolds, “File transfer protocol,” Internet Engineering Task Force (IETF), 1985.

[11] S. Deorowicz and S. Grabowski, “Compression of DNA sequence reads in FASTQ format,” Bioinformatics, vol. 27, no. 6, pp. 860–862, 2011.

[12] A. J. Cox, M. J. Bauer, T. Jakobi, and G. Rosone, “Large-scale compression of genomic sequence databases with the Burrows–Wheeler transform,” Bioinformatics, vol. 28, no. 11, pp. 1415–1419, 2012.

[13] S. W. Hodson, S. W. Poole, T. M. Ruwart, and B. W. Settlemyer, “Moving large data sets over high-performance long distance networks,” Citeseer, Tech. Rep., 2011.

[14] National Institutes of Health et al., “An overview of the Human Genome Project,” 2005.

[15] R. A. Gibbs, J. W. Belmont, P. Hardenbol, T. D. Willis, F. Yu, H. Yang, L.-Y. Ch’ang, W. Huang, B. Liu, Y. Shen et al., “The International HapMap Project,” 2003.

[16] W. S. Bush and J. H. Moore, “Genome-wide association studies,” PLoS Computational Biology, vol. 8, no. 12, p. e1002822, 2012.

[17] M. D. Mailman, M. Feolo, Y. Jin, M. Kimura, K. Tryka, R. Bagoutdinov, L. Hao, A. Kiang, J. Paschall, L. Phan et al., “The NCBI dbGaP database of genotypes and phenotypes,” Nature Genetics, vol. 39, no. 10, pp. 1181–1186, 2007.

[18] J. Kaye, C. Heeney, N. Hawkins, J. De Vries, and P. Boddington, “Data sharing in genomics – re-shaping scientific practice,” Nature Reviews Genetics, vol. 10, no. 5, p. 331, 2009.

[19] K. Holtman and A. Mutz, “Transparent content negotiation in HTTP,” Internet Engineering Task Force (IETF), Tech. Rep., 1998.

[20] J. C. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy, “Potential benefits of delta encoding and data compression for HTTP,” in ACM SIGCOMM Computer Communication Review, vol. 27, no. 4. ACM, 1997, pp. 181–194.