Tina Zekic (started November 2014)

With the increasing number of available genome sequences, the concept of a *pan-genome* arose, representing a set of genomes belonging to different strains of the same species.
A pan-genome is composed of three parts: a *core* genome, representing sequences shared among all strains of a species; a *dispensable* genome, containing sequences shared among a subset of strains;
and a *singleton* genome, representing strain-specific sequences.
Recently, a data structure for the memory-efficient storage of a pan-genome has been developed. The so-called Bloom Filter Trie (BFT) stores a pan-genome as a colored de Bruijn graph: it stores all k-mers of
the input sequences, each together with the set of genomes it originates from.
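
As a rough illustration of the colored de Bruijn graph idea (not of the BFT itself, which uses a far more compact encoding), the sketch below maps every k-mer to the set of genomes containing it using a plain Python dictionary, and then splits the k-mers into core, dispensable, and singleton sets. The function names `color_kmers` and `partition_kmers`, the toy sequences, and the choice k = 3 are assumptions made for this example; reverse complements are ignored.

```python
from collections import defaultdict


def color_kmers(genomes, k):
    """Map each k-mer to the set of genome names (colors) it occurs in."""
    colors = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(name)
    return dict(colors)


def partition_kmers(colors, n_genomes):
    """Split k-mers into core (all genomes), singleton (one genome),
    and dispensable (everything in between)."""
    core, dispensable, singleton = set(), set(), set()
    for kmer, names in colors.items():
        if len(names) == n_genomes:
            core.add(kmer)
        elif len(names) == 1:
            singleton.add(kmer)
        else:
            dispensable.add(kmer)
    return core, dispensable, singleton


# Hypothetical toy input: three short "strains" of one species.
genomes = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "ACGGTC"}
colors = color_kmers(genomes, k=3)
core, dispensable, singleton = partition_kmers(colors, len(genomes))
```

In this toy input, only `ACG` occurs in all three sequences, `CGT` is shared by `s1` and `s2`, and every other k-mer is specific to a single sequence.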

The goal of this project is to extend this data structure with methods for the functional analysis of a pan-genome. The first step is the identification of the core genome. Matching sequences can be interrupted by SNPs or structural variations, resulting in branching nodes and additional paths in the graph. Therefore, we first introduce a formal definition of a core genome in a de Bruijn graph.

For a set of input genomes, we define the core as the sequences shared by at least a required number of genomes, called the *quorum* and denoted by q. In order to model variations and include them in the core genome, we introduce the concept of a *core bubble* and its generalization, a *super core bubble*. A parameter d bounds the distance allowed between two core paths, where a core path is a path containing only
core nodes. Besides bubbles, we define *connected* and *disconnected core paths* in
order to cover the remaining cases.
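
The quorum idea can be sketched on k-mers as follows. This is a simplified stand-in under stated assumptions, not the project's formal definition: core nodes are taken to be k-mers present in at least q genomes, and a "core run" is a maximal stretch of consecutive core k-mers along one input sequence. Core bubbles, super core bubbles, and the distance d are not modeled here. The names `kmer_colors`, `core_nodes`, and `core_runs` and the toy sequences are hypothetical.

```python
def kmer_colors(genomes, k):
    """Map each k-mer to the set of genome names containing it."""
    colors = {}
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            colors.setdefault(seq[i:i + k], set()).add(name)
    return colors


def core_nodes(colors, q):
    """K-mers present in at least q genomes (the quorum)."""
    return {kmer for kmer, names in colors.items() if len(names) >= q}


def core_runs(seq, k, core):
    """Maximal runs of consecutive core k-mers along seq,
    returned as (first, last) k-mer start positions."""
    runs, start = [], None
    n = len(seq) - k + 1
    for i in range(n):
        if seq[i:i + k] in core:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, n - 1))
    return runs


# Hypothetical toy input: s2 differs from s1 and s3 by a single SNP (A -> G).
genomes = {"s1": "GATTACAGC", "s2": "GATTGCAGC", "s3": "GATTACAGC"}
colors = kmer_colors(genomes, k=3)

strict_core = core_nodes(colors, q=3)   # k-mers shared by all three genomes
relaxed_core = core_nodes(colors, q=2)  # k-mers shared by at least two genomes

strict_runs = core_runs(genomes["s1"], 3, strict_core)
relaxed_runs = core_runs(genomes["s1"], 3, relaxed_core)
```

With q = 3 the SNP splits the sequence `s1` into two core runs separated by three non-core k-mers, whereas with q = 2 the whole of `s1` forms a single core run. Gaps of the first kind are the situations that the bubble definitions above are meant to capture.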

*Supervisors: Jens Stoye (Bielefeld University), Roland Wittler (Bielefeld University), Faraz Hach (Simon Fraser University)*

researchprojects/researchtinazekic.txt · Last modified: 2016/11/15 13:45 by tinazekic