Differences

researchprojects:researchjiayu [2016/05/23 07:35]
rwittler
researchprojects:researchjiayu [2016/11/15 14:15] (current)
jiayu
Line 1: Line 1:
 +==== High performance cloud platform for comparative genomic analyses ====
 +[[:JiaYu|YU Jia]] (started October 2014) \\
  
 +{{:919edgar_apache.png?nolink&400 |}}
  
 +Next-generation sequencing (NGS) technology now allows researchers to analyze multiple genomes of several related bacterial strains. EDGAR (Efficient Database framework for comparative Genome Analysis using BLAST score Ratios) is a software platform for bacterial genome analysis maintained by Justus Liebig University Giessen. It includes all publicly available bacterial genomes from NCBI and combines a statistical approach with sequence alignment to provide orthology analysis. It also provides several visualization features for comparative genomics.
  
-==== Pangenome Data Analysis Platform ====
-[[:JiaYu|YU Jia]] (started October 2014) \\
 +As NGS technology develops, more genomes are released, and EDGAR has reached a processing bottleneck. To keep its popular features while giving it a high-performance back end, we propose improvements at three levels. On the algorithmic level, we test and benchmark the available alignment algorithms and applications, with the aim of replacing the conventional BLAST sequence alignment. On the computational level, we use a parallel computing architecture to give EDGAR a scalable, fault-tolerant and efficient infrastructure. On the database level, a NoSQL database model with a compatible data schema will be used to improve data I/O.
  
-| {{919edgar_apache.png|}}| NGS has made it possible to sequence a full genome in a short time. As a result, more and more strains are being sequenced and become available for pan-genome computation. Because genome data volumes are normally large, pan-genome computation needs massive processing power and storage resources. \\ Bielefeld University and Justus Liebig University Giessen have collaboratively developed a powerful pan-genome computation platform, **EDGAR**. But to cope with today's flood of genome data, it has to become more efficient.|\\ For my project, the first goal could be to find a way to deal with such massive genome data. We will try to deploy **EDGAR** in a fully distributed mode so that it can scale out horizontally.\\ An alternative first step of my project is to add more functionality to our existing **EDGAR** platform, for instance phylogenetic analysis, new machine learning algorithms, etc.\\ //Supervisors: Alexander Goesmann (Bielefeld University), Alexander Sczyrba (Bielefeld University)//
 +We first selected the two most recently published sequence alignment applications, GHOSTX and DIAMOND, for comparison with BLAST. Both claim to be thousands of times faster than BLAST. After carefully examining their alignment results, we found that GHOSTX does not provide the required sensitivity: it misses a significant number of alignments and is therefore not compatible with the statistical orthology analysis of EDGAR. DIAMOND, on the other hand, is sensitive enough to provide results that are practically identical to those of BLAST, and it fits EDGAR's analysis. Our benchmark of the three alignment applications indicates that DIAMOND has the best performance on the same dataset in the same environment. Therefore, we chose DIAMOND as the alternative to BLAST in EDGAR.
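
One way to quantify how many BLAST alignments another tool recovers is to compare the sets of (query, subject) pairs in their tabular outputs. The following is a minimal sketch, not the evaluation code used in the project. It assumes that both tools write BLAST-style tab-separated output with the query ID in the first column and the subject ID in the second. The function names and the toy data are hypothetical.

```python
import csv
import io


def read_hit_pairs(handle):
    """Collect (query, subject) ID pairs from BLAST-style tabular output."""
    pairs = set()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        pairs.add((row[0], row[1]))
    return pairs


def recovered_fraction(reference_pairs, candidate_pairs):
    """Fraction of reference hit pairs that the candidate tool also reports."""
    if not reference_pairs:
        return 1.0  # nothing to recover
    return len(reference_pairs & candidate_pairs) / len(reference_pairs)


# Toy data invented for illustration; these are not results from the project.
blast_out = "q1\ts1\t98.0\nq1\ts2\t75.0\nq2\ts3\t88.0\nq3\ts4\t60.0\n"
tool_out = "q1\ts1\t97.5\nq2\ts3\t88.0\nq3\ts9\t55.0\n"

blast_pairs = read_hit_pairs(io.StringIO(blast_out))
tool_pairs = read_hit_pairs(io.StringIO(tool_out))
print(recovered_fraction(blast_pairs, tool_pairs))
```

A low fraction would correspond to the kind of missing alignments reported for GHOSTX, while a value close to 1 would match the behaviour described for DIAMOND. A fuller comparison would also look at scores and alignment coordinates, not only at which pairs are present.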
  
 +We then worked on the parallelization of DIAMOND, the second aspect of the EDGAR enhancement. We developed a new application, HAMOND, which integrates Apache Hadoop parallelism with DIAMOND to provide scalability, fault tolerance and high speed. The EDGAR-HAMOND integration proved to be significantly faster while producing results highly identical to the previous ones. HAMOND is also compatible with any in-house Hadoop cluster and with the Amazon cloud platform. The development of HAMOND is finished and the publication is ready to be submitted.
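
The general idea behind parallelizing an alignment run can be illustrated by splitting the query sequences into independent chunks, each of which can be aligned against the same database by a separate worker. The sketch below only shows this data-parallel split in plain Python. It is not HAMOND's implementation and does not use Hadoop. The function names and the example sequences are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_round_robin(records, n_chunks):
    """Distribute records over n_chunks lists that can be processed independently."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks


# Toy protein sequences invented for illustration.
fasta = """>p1
MKT
AYI
>p2
MSD
>p3
MTT
""".splitlines()

chunks = split_round_robin(read_fasta(fasta), 2)
for idx, chunk in enumerate(chunks):
    print(idx, [header for header, _ in chunk])
```

Because each query is aligned independently of the others, the per-chunk results can simply be concatenated afterwards. A framework such as Hadoop adds scheduling, data distribution and retry of failed tasks on top of this basic split.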
  
-|Image source: |[[http://wwww.cebitec.uni-bielefeld.de/edgar.cebitec.uni-bielefeld.de/|EDGAR login]]|
-||[[http://wwww.apache.org/|The Apache Software Foundation]]|
 +The next step is to change the database model. Since we intend to use a NoSQL database, the current MySQL data schema of EDGAR will be redesigned.
  
 +//Supervisors: Alexander Goesmann (Justus Liebig University Giessen), Alexander Sczyrba (Bielefeld University), Fiona Brinkman (Simon Fraser University)//
researchprojects/researchjiayu.1463988938.txt.gz · Last modified: 2016/05/23 07:35 by rwittler