This shows you the differences between two versions of the page.
Next revision | Previous revision | ||
researchprojects:researchjiayu [2016/05/11 18:32] 127.0.0.1 external edit |
researchprojects:researchjiayu [2016/11/15 14:15] (current) jiayu |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | + | ==== High performance cloud platform for comparative genomic analyses ==== | |
+ | [[:JiaYu|YU Jia]] (started October 2014) \\ | ||
+ | {{:919edgar_apache.png?nolink&400 |}} | ||
+ | Next-generation sequencing (NGS) technology now lets researchers analyze multiple genomes of several related bacterial strains. EDGAR (Efficient Database framework for comparative Genome Analysis using BLAST score Ratios) is a software platform for bacterial genome analysis maintained by Justus Liebig University Giessen. It includes all bacterial genomes publicly available from NCBI and combines a statistical approach with sequence alignment to provide orthology analysis. It also provides multiple visualization features for comparative genomics. | ||
- | ==== Pangenome Data Analysis Platform ==== | + | As NGS technology develops, more genomes are released, and EDGAR has reached a processing bottleneck. To keep its popular features while giving it a high-performance back-end, we proposed improvements in three areas. On the algorithmic level, we test and benchmark the available alignment algorithms and applications in order to replace the conventional BLAST sequence alignment. On the computational level, we use a state-of-the-art parallel computing architecture to give EDGAR a scalable, fault-tolerant and efficient infrastructure. On the database level, a NoSQL database model with a compatible data schema will be used to improve data I/O. |
- | [[:JiaYu|YU Jia]] (started October 2014) \\\\ | + | |
+ | We first selected the two most recently published sequence alignment applications, GHOSTX and DIAMOND, to compare with BLAST. Both claim to be thousands of times faster than BLAST. After carefully examining their alignment results, we found that GHOSTX does not provide the required sensitivity: its output was missing a significant number of alignments and was therefore not compatible with the statistical orthology analysis of EDGAR. DIAMOND, on the other hand, is sensitive enough to produce results practically identical to those of BLAST, and it fits EDGAR's analysis. The benchmark of the three alignment applications indicates that DIAMOND has the best performance on the same dataset in the same environment. Therefore, we chose DIAMOND as the replacement for BLAST in EDGAR. | ||
- | | {{919edgar_apache.png|}}| The NGS has brought us a high potential to sequence a full genome in a short time. Therefore there are more and more strains are sequenced which are possible for pan-genome computation. Considering genome data volume is normally big, pan-genome computation would need massive processing power and storage resources. \\\\Bielefeld University and Justus Liebig University Giessen has collaboratively developed a powerful pan-genome computation platform **EDGAR**. But to face nowadays genome data flood, it has to be more efficient.|\\For my project, the first goal could be figuring a way to deal with such massive genome data. We will try to deploy **EDGAR** in a fully distributed mode for the possibility to horizontally scale-out.\\\\There is an alternative first step of my project, which is trying to add more functionalities into our existing **EDGAR** platform, for instance phylogenetic analysis, new machine learning algorithms, etc.\\\\//Supervisors: Alexander Goesmann (Bielefeld University), Alexander Sczyrba (Bielefeld University)// \\\\ | + | We then worked on the parallelization of DIAMOND, the second aspect of the EDGAR enhancement. We developed a new application, HAMOND, which integrates Apache Hadoop parallelism with DIAMOND to provide scalability, fault tolerance and high speed. The EDGAR-HAMOND integration proved to be significantly faster, and its results closely matched the previous ones. HAMOND is also compatible with any in-house Hadoop cluster and with the Amazon cloud platform. The development of HAMOND is finished and the publication is ready to be submitted. |
+ | The next step is to change the database model. Since we are inclined to use a NoSQL database, the current MySQL data schema of EDGAR will be redesigned. | ||
- | |Image source: |[[http://wwww.cebitec.uni-bielefeld.de/edgar.cebitec.uni-bielefeld.de/|EDGAR login]]| | + | //Supervisors: Alexander Goesmann (Justus Liebig University Giessen), Alexander Sczyrba (Bielefeld University), Fiona Brinkman (Simon Fraser University)// |
- | ||[[http://wwww.apache.org/|The Apache Software Foundation]]| | + |
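
The sensitivity check described above for GHOSTX and DIAMOND can be illustrated with a minimal Python sketch. It assumes that both the reference tool (BLAST) and the candidate tool wrote BLAST-style tabular output, with the query ID in the first column and the subject ID in the second. The function names and the toy records are hypothetical and are not taken from EDGAR or HAMOND; the project's actual comparison may have used different criteria.

```python
import io


def read_hit_pairs(handle):
    """Collect (query, subject) ID pairs from tab-separated alignment output."""
    pairs = set()
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        pairs.add((fields[0], fields[1]))
    return pairs


def recovered_fraction(reference_pairs, candidate_pairs):
    """Fraction of reference hit pairs that the candidate tool also reports."""
    if not reference_pairs:
        return 1.0  # nothing to recover
    shared = reference_pairs & candidate_pairs
    return len(shared) / len(reference_pairs)


# Hypothetical toy records (12 tab-separated columns each), for illustration only.
reference_output = io.StringIO(
    "geneA\tgeneX\t98.0\t300\t6\t0\t1\t300\t1\t300\t1e-150\t540\n"
    "geneB\tgeneY\t91.5\t280\t24\t0\t1\t280\t1\t280\t1e-120\t450\n"
    "geneC\tgeneZ\t45.0\t250\t137\t2\t5\t250\t3\t248\t1e-30\t120\n"
    "geneD\tgeneW\t38.2\t200\t123\t3\t1\t198\t4\t201\t1e-12\t70\n"
)
candidate_output = io.StringIO(
    "geneA\tgeneX\t98.0\t300\t6\t0\t1\t300\t1\t300\t1e-150\t541\n"
    "geneB\tgeneY\t91.5\t280\t24\t0\t1\t280\t1\t280\t1e-120\t449\n"
    "geneC\tgeneZ\t45.0\t250\t137\t2\t5\t250\t3\t248\t1e-30\t119\n"
)

reference_pairs = read_hit_pairs(reference_output)
candidate_pairs = read_hit_pairs(candidate_output)
fraction = recovered_fraction(reference_pairs, candidate_pairs)
print(f"{fraction:.2f}")  # prints 0.75
```

A low recovered fraction on real data would point to missing alignments of the kind reported for GHOSTX. The sketch compares only which query-subject pairs are present; a fuller check would also compare scores, since EDGAR's analysis is based on BLAST score ratios.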