Workshop on Computational Data Science in Bioinformatics 2017

Organized by: Faculty of Technology, GRK 1906 DiDy
Place: Bielefeld University, rooms V2-105/115
Date: November 20-22, 2017


Monday, November 20th

9h30 Jochen Kruppa The interactive visualization of k-mer distributions in virus and bacteria genome sequences to reveal specific genetic regions
13h00 Susanne Gerber
15h30 Marcel Schulz Data science approaches for learning gene regulatory networks

Tuesday, November 21st

9h30 Annalisa Marsico Statistical Models of post-transcriptional gene regulation
14h00 Tobias Marschall Towards haplotype-resolved genome assembly – or how to solve multiple jigsaw puzzles simultaneously

Wednesday, November 22nd

9h30 Stephan Schiffels Unlocking human history - Computational methods for demographic inference from genome sequences
14h00 Alexander Schönhuth


The interactive visualization of k-mer distributions in virus and bacteria genome sequences to reveal specific genetic regions

by Jochen Kruppa

Bioinformatics methods often incorporate the frequency distribution of nucleobases or k-mers in DNA or RNA sequences, for example as part of metagenomic or phylogenetic analysis. The frequency matrix, with sequences in the rows and nucleobases in the columns, is multi-dimensional and therefore hard to visualize. Here, we present the R package 'kmerPyramid', which displays each sequence, based on its nucleobase or k-mer distribution projected onto the space of principal components, as a point within a 3-dimensional, interactive pyramid (Kruppa et al., 2017). Using the computer mouse, the user can rotate the pyramid's axes, zoom in and out, and identify individual points. Additionally, the package provides the related frequency distribution matrices of about 2,000 bacteria and 5,000 viruses, calculated from NCBI GenBank. The 'kmerPyramid' is particularly suited to intra- and inter-species comparisons. We show its application to clustering genetic regions, such as coding and non-coding DNA sequences, to visualizing genomic islands in bacterial genomes, and to detecting low-complexity regions in a genome. We are also able to visualize the direct comparison of two sequences using higher-order k-mers. This feature might guide a later motif search. The kmerPyramid is based on principal component analysis (PCA), which is used to project the multi-dimensional matrix of nucleobase and k-mer frequencies into 3-dimensional space. PCA, as a method for dimension reduction, has already been demonstrated to preserve relevant information when exploring these frequencies (Dodsworth et al., 2013; Podar et al., 2013; Imelfort et al., 2014). The kmerPyramid package is available on GitHub (
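The projection described in this abstract can be illustrated with a minimal Python sketch. It is not the kmerPyramid implementation (which is an R package); the function names `kmer_frequencies` and `pca_project` and the toy sequences are assumptions made purely for illustration. The sketch builds a frequency matrix with sequences in the rows and k-mers in the columns, then projects it onto the first three principal components via a singular value decomposition.

```python
from itertools import product

import numpy as np


def kmer_frequencies(seq, k=1):
    """Relative frequency of every DNA k-mer (alphabet ACGT) in seq."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # windows containing N or other symbols are skipped
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def pca_project(freq_matrix, n_components=3):
    """Project the rows of freq_matrix onto the leading principal components."""
    centered = freq_matrix - freq_matrix.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Hypothetical toy sequences, not taken from the package's GenBank data.
sequences = [
    "ATATATATATTTAAATATAT",
    "GCGCGGCCGCGCGGGCCCGC",
    "ACGTACGTACGTACGTACGT",
    "AAAAACCCCCAAAAACCCCC",
    "GGGTTTGGGTTTGGGTTTGT",
]
freqs = np.array([kmer_frequencies(s, k=1) for s in sequences])
coords = pca_project(freqs, n_components=3)  # one 3-D point per sequence
```

For k = 1 the four nucleobase frequencies of a sequence sum to one, so every row lies in a three-dimensional simplex and three principal components retain all of the variation. For larger k the projection becomes a genuine reduction and some information is lost.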

Data science approaches for learning gene regulatory networks

by Marcel Schulz

Deciphering the gene regulatory mechanisms that control the establishment and maintenance of cellular programs is an essential task in computational biology and systems medicine. In recent years, a number of key technologies have been developed to measure genes and their surrounding epigenomic environment in great detail. However, the integration of these different omics data types poses a number of statistical challenges.

In this talk I will highlight some of our recent work to improve the estimation of regulatory networks from epigenomics data. I will present our mathematical approach for prediction of transcription factor regulation from paired gene expression and epigenomics time series data using hidden Markov models. In addition, I will introduce a new statistical method for the inference of competing endogenous RNA interaction networks from expression data and will show an application for the prediction of prognostic cancer biomarkers using these networks.

Statistical Models of post-transcriptional gene regulation

by Annalisa Marsico

RNA-binding proteins (RBPs) and non-coding RNAs function in coordination with each other to control post-transcriptional regulation (PTR). In human cells, hundreds of RBPs and thousands of non-coding RNAs have been annotated, but the detailed functions of only a few have been explored so far. There is therefore a great need for in silico methods that assist this task, starting from the modeling of recent high-throughput data, such as CLIP-seq data, to shed light on the mechanisms of PTR. I will present some of the machine learning methods developed in our lab to determine the precise location of RBP binding sites (Krakau et al., bioRxiv 2017), characterize their RNA sequence-structure preferences (Heller et al., NAR 2017) and predict long non-coding RNA functions and mechanisms of action. I will also demonstrate how the characterization of RBPs, lncRNAs and their interactions is the first step towards a better understanding of diseases associated with changes in PTR, by presenting an application of our tools to the identification and characterization of biomarkers from immune response experimental data.

Towards haplotype-resolved genome assembly – or how to solve multiple jigsaw puzzles simultaneously

by Tobias Marschall

Genome assembly is like a one-dimensional jigsaw puzzle: given many short sequence fragments, the task is to reconstruct the sequence corresponding to the whole genome. This classic bioinformatics problem has been studied for decades, and yet pressing and fundamental challenges remain unsolved. First, many species of interest (including humans) are diploid or even polyploid and hence harbor multiple similar yet distinct copies of each chromosome (called haplotypes). Second, state-of-the-art assembly projects routinely employ multiple technologies, which produce massive amounts of data and come with technology-specific errors and uncertainties. Haplotype-resolved genome assembly using data from multiple technologies is hence a significant data integration task. In my presentation, I highlight recent progress on multiple statistical and algorithmic methods that serve as building blocks towards this goal. Furthermore, I sketch a way forward for solving this problem in the mid-term future.

Unlocking human history - Computational methods for demographic inference from genome sequences

by Stephan Schiffels

In recent years, the number of publicly available human genomes from diverse populations has increased by several orders of magnitude. Particularly in conjunction with ancient DNA, these large data sets present an opportunity for population genetic research to investigate our human past in unprecedented detail. Here I will present several studies that showcase how to exploit these data with new high-resolution methods. In particular, I will introduce MSMC, a method based on coalescent hidden Markov models, and rarecoal, a method to efficiently model the rare joint site frequency spectrum across multiple populations. The results of these studies provide new insights into deep human history ~50,000 years ago, the peopling of the Americas ~15,000 years ago, and the early medieval Anglo-Saxon migrations into England. Building upon these developments, I will point out future directions for genetic data analyses in the era of population-scale ancient and modern sequencing data sets, which are increasingly available today.

cds.1510324743.txt.gz · Last modified: 2017/11/10 14:39 by tischulz