Pangenome Search and Storage

Guillaume Holley (started October 2013)

The advent of High Throughput Sequencing (HTS) technologies raises a major concern about storage and indexing of data produced by these technologies. In particular, large-scale sequencing projects generate an unprecedented volume of genomic sequences ranging from tens to several thousands of genomes per species. These collections contain highly similar and redundant sequences, also known as pan-genomes.

In the first and second year of this project, an alignment-free and reference-free data structure for indexing a pan-genome has been developed. A pan-genome is represented as a colored de Bruijn graph in which vertices represent colored k-mers, words of length k associated with the identity of the strains in which they occur. For storing and traversing such a graph, we proposed the Bloom Filter Trie (BFT), a trie that compresses and indexes strings of fixed-length k and their colors. The BFT is based on the burst trie and integrates Bloom filters to navigate in the trie and accelerate the graph traversal. The specificity of our work compared to existing pan-genome tools resides in four main characteristics:

  • It is reference-free as good quality reference genomes are available only for a small fraction of species,
  • it is alignment-free as sequence alignment methods are computationally expensive,
  • assembled and unassembled genomes are considered as input, and
  • the data structure is fully incremental and can be updated with new data.

Experimental results prove better performance compared to another state-of-the-art data structure. An API written in C is provided to access all functionalities of the the data structure as well as colored de Bruijn graph operations such that methods can be designed on top of the BFT. The BFT source code and documentation are available here. The data structure was presented at the 15th International Workshop on Algorithms in Bioinformatics in Atlanta, USA. An extended version of the paper was published in the journal Algorithms for Molecular Biology in 2016.

Pan-genomes are also ideal for compression. However, HTS-specific compression tools, developed to counter the massive growth of sequencing data and reduce storage costs, are not designed to process them. These methods can only compress genomes individually such that the compression ratio of a pan-genome does not improve with newly compressed genomes. During the third year of this project, a new alignment-free and reference-free compression method has been developed. It addresses the problem of pan-genome compression by encoding the sequences of a pan-genome as a guided de Bruijn graph. The novelty of this method is its ability to incrementally update archives with new genome sequences without full decompression of the archive. This method can compress both single-end and paired-end read sequences of any length using all symbols of the IUPAC nucleotide code. On a large P. aeruginosa dataset, this method outperforms all other tested tools with a 30 % compression ratio improvement in single-end mode compared to the best performing state-of-the-art HTS-specific compression method in our experiments.

Supervisors: Jens Stoye (Bielefeld University), Roland Wittler (Bielefeld University), S. Cenk Sahinalp (Simon Fraser University), Faraz Hach (Simon Fraser University)

Image source : “Ralstonia syzygii, the Blood Disease Bacterium and Some Asian R. solanacearum Strains Form a Single Genomic Species Despite Divergent Lifestyles” by Remenant B and al.

researchprojects/researchguillaumeholley.txt · Last modified: 2016/11/15 21:45 by guillaumeholley