This shows you the differences between two versions of the page.
cds [2017/11/08 12:49] tischulz Added some abstracts and titles. |
cds [2017/11/19 16:33] jensstoye [Workshop on Computational Data Science in Bioinformatics 2017] |
||
---|---|---|---|
Line 7: | Line 7: | ||
| Organized by: | Faculty of Technology, GRK 1906 DiDy | | | Organized by: | Faculty of Technology, GRK 1906 DiDy | | ||
- | | Place:| Bielefeld University, rooms V2-105/115 | | + | | Place:| Bielefeld University, main building, room V2-105/115 | |
| Date:| November 20-22, 2017 | | | Date:| November 20-22, 2017 | | ||
\\ | \\ | ||
Line 16: | Line 16: | ||
**Monday, November 20th** | **Monday, November 20th** | ||
| 9h30 | Jochen Kruppa | The interactive visualization of k-mer distributions in virus and bacteria genome sequences to reveal specific genetic regions | | | 9h30 | Jochen Kruppa | The interactive visualization of k-mer distributions in virus and bacteria genome sequences to reveal specific genetic regions | | ||
- | | 13h00 | Susanne Gerber | | | + | | 13h00 | Susanne Gerber | Big Data integration and trans-Omics analysis for reconstructing relevant pathways and networks underlying neurodegenerative diseases | |
- | | 15h30 | Marcel Schulz | | | + | | 15h30 | Marcel Schulz | Data science approaches for learning gene regulatory networks | |
Line 26: | Line 26: | ||
**Wednesday, November 22nd** | **Wednesday, November 22nd** | ||
- | | 9h30 | Stephan Schiffels | | | + | | 9h30 | Stephan Schiffels | Unlocking human history - Computational methods for demographic inference from genome sequences | |
- | | 14h00 | Alexander Schönhuth | | | + | | 14h00 | Alexander Schönhuth | Genome Data Science | |
\\ | \\ | ||
Line 35: | Line 35: | ||
//by Jochen Kruppa// | //by Jochen Kruppa// | ||
- | Bioinformatics methods often incorporate the frequency distribution of nulecobases or k-mers in DNA or RNA sequences, for example as part of metagenomic or phylogenetic analysis. Because the frequency matrix, with sequences in the rows and nucleobases in the columns, is multi-dimensional and therefore hard to visualize. Here, we present the R-package ?kmerPyramid? that allows to display each sequence, based on its nucleobase or k-mer distribution projected to the space of principal components, as a point within a 3-dimensional, interactive pyramid (Kruppa et al., 2017). Using the computer mouse, the user can turn the pyramid?s axes, zoom in and out and identify individual points. Additionally, the package provides the related frequency distribution matrices of about 2.000 bacteria and 5.000 viruses, respectively, calculated from NCBI GenBank. The ?kmerPyramid? can particularly be used for intra- and inter species comparisons. We show the application of clustering genetic regions, like coding and non-coding DNA sequences, the visualization of genomic islands in bacteria genomes, and the detection of low complexity regions in a genome. We are also able to visualize the direct comparison of two sequences considering higher k-mers. This feature might be a guidance for later motif search. The kmerPyramid is based on principal component analysis (PCA) that is used to project the multi-dimensional matrix of nucleobase and k-mer frequencies in the 3-dimensional space. PCA, as a method for dimension reduction, has already been demonstrated to preserve relevant information when exploring these frequencies (Dodsworth et al., 2013; Podar et al., 2013; Imelfort et al., 2014). The kmerPyramid package is available on GitHub (https://github.com/jkruppa/kmerPyramid). | + | Bioinformatics methods often incorporate the frequency distribution of nucleobases or k-mers in DNA or RNA sequences, for example as part of metagenomic or phylogenetic analysis. 
The frequency matrix, with sequences in the rows and nucleobases in the columns, is multi-dimensional and therefore hard to visualize. Here, we present the R package 'kmerPyramid', which displays each sequence, based on its nucleobase or k-mer distribution projected onto the space of principal components, as a point within a 3-dimensional, interactive pyramid (Kruppa et al., 2017). Using the mouse, the user can rotate the pyramid's axes, zoom in and out, and identify individual points. Additionally, the package provides the corresponding frequency distribution matrices of about 2,000 bacteria and 5,000 viruses, calculated from NCBI GenBank. The 'kmerPyramid' is particularly suited to intra- and inter-species comparisons. We show applications to clustering genetic regions, such as coding and non-coding DNA sequences, visualizing genomic islands in bacterial genomes, and detecting low-complexity regions in a genome. We can also visualize the direct comparison of two sequences using k-mers with larger k. This feature may guide a later motif search. The kmerPyramid is based on principal component analysis (PCA), which is used to project the multi-dimensional matrix of nucleobase and k-mer frequencies into 3-dimensional space. PCA, as a method for dimension reduction, has already been shown to preserve relevant information when exploring these frequencies (Dodsworth et al., 2013; Podar et al., 2013; Imelfort et al., 2014). The kmerPyramid package is available on GitHub (https://github.com/jkruppa/kmerPyramid). |
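As a rough illustration of the idea described in this abstract (not the kmerPyramid R package itself, whose interface is not shown on this page), the following Python sketch builds a frequency matrix with sequences in the rows and k-mers in the columns and projects it onto the first three principal components. The function names `kmer_frequencies` and `pca_project` and the toy sequences are assumptions made for this example only.

```python
from itertools import product

import numpy as np


def kmer_frequencies(seq, k=1):
    """Relative frequency of every DNA k-mer in seq (columns in lexicographic order)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # windows containing N etc. are skipped
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def pca_project(freqs, n_components=3):
    """Project the rows of freqs onto the leading principal components (via SVD)."""
    centered = freqs - freqs.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Hypothetical toy sequences; real input would be genomes or genomic regions.
sequences = ["ACGTACGTGGCC", "AAAATTTTAAAT", "GGGCCCGGGCGC",
             "ATATATCGCGAT", "TTTTGGGGAAAC"]
freqs = np.array([kmer_frequencies(s, k=1) for s in sequences])
coords = pca_project(freqs, n_components=3)  # one 3-D point per sequence
```

For k = 1 the four nucleobase frequencies of each sequence sum to one, so the centered matrix has at most three independent directions and three components retain all of the variation. For larger k the projection is a genuine reduction.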
+ | |||
+ | \\ | ||
+ | |||
+ | **Big Data integration and trans-Omics analysis for reconstructing relevant pathways and networks underlying neurodegenerative diseases** | ||
+ | |||
+ | //by Susanne Gerber// | ||
+ | |||
+ | The risks and costs of neurodegenerative diseases grow constantly as average life expectancy increases. Owing to population ageing, and according to WHO estimates, the current net cost of 160 billion USD worldwide for diseases such as Alzheimer's disease will almost double during the next ten years. However, despite decades of research and despite the considerable progress achieved in identifying risk genes, relevant epigenetic modifications, potent biomarkers, environmental/latent risk factors, and dozens of disease-associated single nucleotide polymorphisms (SNPs), the key conditional factors for the onset of several neurodegenerative diseases, e.g. Alzheimer's disease (AD), are still unknown. Also, the question of whether there are commonalities in the various (patho)physiological processes associated with neurodegeneration remains unanswered. A practical consequence of this lack of progress is that current medication for Alzheimer's disease is no more effective now than it was 20 years ago. It has, however, become generally accepted that the underlying mechanisms are polyfactorial and depend on multiple (partly unknown) genetic and non-genetic variables, epigenetics and cellular component factors at different scales. | ||
+ | |||
+ | I will introduce the work of my research group, as well as that of other colleagues in the field. | ||
+ | |||
+ | This work addresses the point at which the huge collections of disease-related data on various levels (genomics, epigenomics, transcriptomics and proteomics data layers) have to be integrated and subjected to advanced computational and statistical methods on high-performance computing facilities. | ||
+ | |||
+ | By using multi-omic measurement data, combined with the co-design of new, more advanced computational data integration methods for supercomputing facilities, this research aims to reconstruct the global biochemical networks across multiple omic layers, and even across different diseases. | ||
+ | |||
+ | \\ | ||
+ | |||
+ | **Data science approaches for learning gene regulatory networks** | ||
+ | |||
+ | //by Marcel Schulz// | ||
+ | |||
+ | Deciphering the gene regulatory mechanisms that control the establishment and maintenance of cellular programs is an essential task in computational biology and systems medicine. In recent years, a number of key technologies have been developed to measure genes and their surrounding epigenomic environment in great detail. However, the integration of these different omics data types poses a number of statistical challenges. | ||
+ | |||
+ | In this talk I will highlight some of our recent work to improve the estimation of regulatory networks from epigenomics data. I will present our mathematical approach for prediction of transcription factor regulation from paired gene expression and epigenomics time series data using hidden Markov models. | ||
+ | In addition, I will introduce a new statistical method for the inference of competing endogenous RNA interaction networks from expression data and will show an application for the prediction of prognostic cancer biomarkers using these networks. | ||
\\ | \\ | ||
Line 56: | Line 81: | ||
\\ | \\ | ||
+ | |||
+ | **Unlocking human history - Computational methods for demographic inference from genome sequences** | ||
+ | |||
+ | //by Stephan Schiffels// | ||
+ | |||
+ | In recent years, the number of publicly available human genomes from diverse populations has increased by several orders of magnitude. Particularly in conjunction with ancient DNA, these large data sets present an opportunity for population genetic research to investigate our human past in unprecedented detail. Here I will present several studies that showcase how to exploit these data with new high-resolution methods. In particular, I will introduce MSMC, a method based on coalescent hidden Markov models, and rarecoal, a method to efficiently model the rare joint site frequency spectrum across multiple populations. The results of these studies provide new insights ranging from deep human history ~50,000 years ago and the peopling of the Americas ~15,000 years ago all the way to the early medieval Anglo-Saxon migrations into England. Building upon these developments, I will point out future directions for genetic data analyses in the era of population-scale ancient and modern sequencing data sets, which are increasingly available today. | ||
+ | |||
+ | \\ | ||
+ | |||
+ | **Genome Data Science** | ||
+ | |||
+ | //by Alexander Schönhuth// | ||
+ | |||
+ | Modern sequencing technologies have confronted biology, and | ||
+ | genomics in particular, with a deluge of data. | ||
+ | The consequences are enormous, not only with regard to the | ||
+ | resulting opportunities in terms of life expectancy and quality of life, | ||
+ | but also with regard to the challenges that fall within | ||
+ | data science. In my talk I will address two currently | ||
+ | dominant topic areas. | ||
+ | |||
+ | First, I will discuss how cliques in | ||
+ | genome assembly graphs can be enumerated quickly in order to | ||
+ | use them to reconstruct viral genomes. This approach to | ||
+ | reconstructing viral genomes is new. The results show that | ||
+ | this data-mining-oriented, strictly data-driven approach | ||
+ | has decisive advantages over (less data-driven) | ||
+ | state-of-the-art methods. | ||
+ | |||
+ | Second, I will discuss how DNA sequence, and | ||
+ | sequence in general, can be represented with the help of Hilbert curves | ||
+ | in order to classify it with deep convolutional neural networks. | ||
+ | Convolutional neural networks have recently achieved great success, | ||
+ | particularly in image analysis. The idea is to reproduce such success | ||
+ | in DNA sequence analysis. Owing to their characterizing | ||
+ | properties, Hilbert curves have the potential to turn sequence | ||
+ | into images such that the strengths of convolution are optimally | ||
+ | exploited, which is reflected in the corresponding | ||
+ | results. | ||
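To make the Hilbert-curve idea in this abstract more concrete, here is a minimal Python sketch that lays a DNA sequence out along a Hilbert curve and stores it as a small multi-channel image. It is only an assumption about how such a representation could look: the abstract does not specify the encoding, so the one-hot channel per nucleotide, the zero padding, and the names `hilbert_d2xy` and `sequence_to_image` are illustrative choices, not the speaker's method.

```python
import numpy as np


def hilbert_d2xy(n, d):
    """Map position d along a Hilbert curve to (x, y) on an n x n grid (n a power of 2)."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate/flip the current quadrant
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def sequence_to_image(seq, alphabet="ACGT"):
    """One-hot encode seq into an (n, n, 4) array, following the Hilbert curve."""
    n = 1
    while n * n < len(seq):  # smallest power-of-two grid that fits the sequence
        n *= 2
    image = np.zeros((n, n, len(alphabet)), dtype=np.float32)
    for d, base in enumerate(seq.upper()):
        channel = alphabet.find(base)
        if channel >= 0:  # unknown symbols leave the pixel empty
            x, y = hilbert_d2xy(n, d)
            image[y, x, channel] = 1.0
    return image


image = sequence_to_image("ACGTTGCAACGTAGCT")  # 16 bases fill a 4 x 4 grid
```

The property that makes this layout attractive for convolutional filters is locality: consecutive sequence positions always land on neighbouring pixels, so a small 2-D filter sees a contiguous stretch of the sequence.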
+ | |||
+ | |||
+ |