Efficient Grouping and Cluster Validity Measures for NGS Data

Markus Lux

Insights into large-scale omics data require computational methods that reliably and efficiently recognize latent structures and patterns. A particular example is metagenomics, which faces the problem of detecting clusters that represent the species present in a given sample (binning). Although powerful technologies exist for identifying known taxa, de-novo binning is still in its infancy. Similarly, in single-cell genome assembly, a major problem is contamination of the sample with foreign genetic material. From a computational point of view, in both metagenomics and single-cell genome assembly, genomes can be represented as clusters; hence, grouping and cluster validity measures can be employed to detect and separate genomes. In my research project, I adapt, compare and evaluate novel machine learning techniques in the context of such data.

Typically, genomic data are represented in a high-dimensional vector space, which makes them inherently difficult for classical learning techniques, as these suffer from the curse of dimensionality. It is therefore crucial to use subspace projections or strong regularization. Additionally, the number of clusters is unknown, and clustering results depend heavily on the chosen parameters, whose number can be large. To yield valuable results, it is also beneficial to incorporate biological side information. This is compounded by the fact that such data sets are often large, rendering standard algorithms inapplicable due to quadratic or worse runtime or memory complexity.

Supervisor: Barbara Hammer (Bielefeld University)

researchprojects/researchmarkuslux.txt · Last modified: 2016/06/30 08:53 by markuslux