Tina Zekic (started November 2014)
With the increasing number of available genome sequences, the concept of a pan-genome arose, representing a set of genomes that belong to different strains of the same species. A pan-genome is composed of three parts: a core genome, representing sequences shared among all strains of a species; a dispensable genome, containing sequences shared among a subset of strains; and a singleton genome, representing strain-specific sequences. Recently, a data structure for the memory-efficient storage of a pan-genome has been developed. The so-called Bloom Filter Trie (BFT) stores a pan-genome as a colored de Bruijn graph by storing all k-mers of the input sequences together with the set of genomes they originate from.
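As a rough illustration of the coloring idea and the three-way partition described above, the following Python sketch maps each k-mer to the set of genomes containing it and then sorts the k-mers into core, dispensable and singleton. It uses a plain dictionary, not a Bloom Filter Trie, and ignores reverse complements; the function names `colored_kmers` and `classify` and the toy sequences are assumptions made for this example only.

```python
from collections import defaultdict


def colored_kmers(genomes, k):
    """Map each k-mer to the set of genome names (its colors) containing it.

    Simplification: forward strand only, no canonical k-mers.
    """
    colors = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(name)
    return dict(colors)


def classify(colors, n_genomes):
    """Split k-mers into core, dispensable and singleton by color count."""
    parts = {"core": set(), "dispensable": set(), "singleton": set()}
    for kmer, names in colors.items():
        if len(names) == n_genomes:
            parts["core"].add(kmer)         # present in every genome
        elif len(names) == 1:
            parts["singleton"].add(kmer)    # specific to one genome
        else:
            parts["dispensable"].add(kmer)  # shared by a proper subset
    return parts


# Hypothetical toy input: three short "genomes".
genomes = {"A": "ACGTA", "B": "ACGTT", "C": "ACGGA"}
colors = colored_kmers(genomes, k=3)
parts = classify(colors, n_genomes=len(genomes))
```

In this toy input, `ACG` occurs in all three sequences, `CGT` in two of them, and the remaining k-mers in exactly one.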
The goal of this project is to extend this data structure with methods for the functional analysis of a pan-genome. The first step is the identification of the core genome. Matching sequences can be interrupted by SNPs or other variations, which result in branching nodes and additional paths in the graph. Therefore, we first introduce a formal definition of a core genome in a de Bruijn graph.
For a set of input genomes, we consider the core to be composed of sequences shared by at least a required number of genomes, called the quorum and denoted by q. To model variations and include them in the core genome, we introduce the concept of a core bubble and its generalization, a super core bubble. A distance d limits the maximum distance allowed between two core paths, where a core path is a path containing only core nodes. In addition to bubbles, we define connected and disconnected core paths to cover the remaining cases.
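The quorum criterion can be sketched in a few lines of Python. The example below assumes a dictionary from k-mers to color sets as input, marks a k-mer as core when at least q genomes contain it, and then lists maximal runs of consecutive core k-mers along one sequence as a linear stand-in for core paths. The names `quorum_core` and `core_runs` and the toy data are hypothetical, and bubbles, super core bubbles and the distance d are deliberately not modeled, since their formal definitions are not given here.

```python
def quorum_core(colors, q):
    """Return the k-mers whose color set contains at least q genomes."""
    return {kmer for kmer, names in colors.items() if len(names) >= q}


def core_runs(seq, k, core):
    """Return inclusive (start, end) k-mer index ranges of maximal runs
    of consecutive core k-mers along seq."""
    runs, start = [], None
    n = len(seq) - k + 1
    for i in range(n):
        if seq[i:i + k] in core:
            if start is None:
                start = i              # a new run begins here
        elif start is not None:
            runs.append((start, i - 1))  # run ended at the previous k-mer
            start = None
    if start is not None:
        runs.append((start, n - 1))    # run reaches the end of seq
    return runs


# Hypothetical toy coloring for 3-mers of the sequence "ACGTACG".
colors = {
    "ACG": {"A", "B", "C"},
    "CGT": {"A", "B"},
    "GTA": {"A"},
    "TAC": {"A"},
}
core = quorum_core(colors, q=2)
runs = core_runs("ACGTACG", k=3, core=core)
```

Here two runs of core k-mers are separated by two non-core k-mers; deciding whether such runs belong to one core region is where a distance threshold like d would come in.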
Supervisors: Jens Stoye (Bielefeld University), Roland Wittler (Bielefeld University), Faraz Hach (Simon Fraser University)