User Tools

Site Tools

Applying succinct data structures to massive sequence clustering

Robert Müller (started October 2016)

The advent of high-throughput sequencing (HTS) technologies has drastically changed how research is conducted in the life sciences. Clustering is a common task in many bioinformatics pipelines for the resulting massive data sets comprising millions of sequences. Intense research on related problems such as similarity joins and approximate dictionary matching led to a variety of tools, most of whom focus on speeding up the computation in order to handle the continuously growing data sets. However, they usually fall under the restriction that all data (structures) have to be held in the faster, but relatively small and expensive main memory (compared to disk space). While today even small laboratories can produce data on HTS-scale, they very often do not have the technical or financial capabilities to constantly upgrade their computational infrastructure in order to process these data sets. As cloud computing is not always an option, the goal of this project is to facilitate the processing of HTS-scale data sets on current lab infrastructures by augmenting state-of-the-art clustering tools with succinct data structures.

More specifically, we aim at examining the effect of introducing succinct data structures into clustering tools on their runtime and space consumption. Since succinct data structures often hide their dominant costs in lower-order terms and constants, we are particularly interested in the trade-off between space and runtime. We will analyse the effect of relaxing the space restriction of succinct data structures and study how much additional space we have to spend on top of the information-theoretical minimum in order to obtain practically superior tools. To that end, existing data structures have to be adapted or fine-tuned, and, potentially, new advanced data structures have to be developed. While the focus will be on tools working with the commonly accepted edit distance, we will also consider alternative, biologically motivated notions of similarity. Furthermore, we will investigate the accessibility of the used succinct data structures to practical optimisations such as parallelisation (broadword programming, GPU computations, threading etc.).

Supervisor: Markus Nebel (Bielefeld University)

researchprojects/researchrobert.txt · Last modified: 2016/11/15 13:45 by rmueller