This shows you the differences between two versions of the page.

researchprojects:researchjiayu [2016/11/15 13:25]
researchprojects:researchjiayu [2016/11/15 14:15] (current)
Line 1:
- ==== Pangenome Data Analysis Platform ====
+ ==== High performance cloud platform for comparative genomic analyses ====
 [[:JiaYu|YU Jia]] (started October 2014) \\
 +{{:919edgar_apache.png?nolink&400 |}}
-| {{:919edgar_apache.png|}}| NGS has given us the ability to sequence full genomes in a short time. As a result, more and more strains are being sequenced and become available for pan-genome computation. Because genome data volumes are typically large, pan-genome computation needs massive processing power and storage resources. \\ Bielefeld University and Justus Liebig University Giessen have collaboratively developed a powerful pan-genome computation platform, **EDGAR**. But to cope with today's flood of genome data, it has to become more efficient.|\\ For my project, the first goal could be to figure out a way to deal with such massive genome data. We will try to deploy **EDGAR** in a fully distributed mode so that it can scale out horizontally.\\ An alternative first step of my project is to add more functionality to our existing **EDGAR** platform, for instance phylogenetic analysis, new machine learning algorithms, etc.\\ //Supervisors: Alexander Goesmann (Bielefeld University), Alexander Sczyrba (Bielefeld University)//
 +NGS (next-generation sequencing) technology now allows researchers to analyze multiple genomes of several related bacterial strains. EDGAR (Efficient Database framework for comparative Genome Analysis using BLAST score Ratios) is a software platform for bacterial genome analysis maintained by Justus Liebig University Giessen. It includes all bacterial genomes publicly available from NCBI and combines a statistical approach with sequence alignment to provide users with orthology analysis. It also provides multiple comparative genomic visualization features.
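The "BLAST score ratio" idea in EDGAR's name can be illustrated with a minimal sketch. It assumes the common definition (the bit score of a hit divided by the bit score of the query aligned against itself) and a reciprocal check against a cutoff. The function names, the data layout and the cutoff are hypothetical, and the sketch is not EDGAR's implementation; according to the text above, EDGAR determines its orthology criterion statistically.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]


def score_ratios(hits: Dict[Pair, float],
                 self_scores: Dict[str, float]) -> Dict[Pair, float]:
    """Normalize each hit's bit score by the query's self-hit bit score.

    hits maps (query, subject) to a bit score; self_scores maps a query
    to the bit score of that query aligned against itself.
    """
    ratios = {}
    for (query, subject), bits in hits.items():
        self_bits = self_scores.get(query)
        if not self_bits:
            # No self-hit available (or a zero score): ratio is undefined.
            continue
        ratios[(query, subject)] = bits / self_bits
    return ratios


def reciprocal_pairs(ratios_ab: Dict[Pair, float],
                     ratios_ba: Dict[Pair, float],
                     cutoff: float) -> Set[Pair]:
    """Keep pairs whose ratio reaches the cutoff in both directions.

    The cutoff is a hypothetical parameter chosen by the caller.
    """
    pairs = set()
    for (a, b), ratio in ratios_ab.items():
        if ratio >= cutoff and ratios_ba.get((b, a), 0.0) >= cutoff:
            pairs.add((a, b))
    return pairs
```

Normalizing by the self-hit score makes scores comparable across genes of different lengths, which is why a single cutoff can be applied to all pairs.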
 +As NGS technology develops, more genomes are released, and EDGAR itself has hit a processing bottleneck. To keep its popular features while giving it a high-performance back end, we proposed improvements in three areas. On the algorithmic level, we test and benchmark the available alignment algorithms and applications to eventually replace the conventional BLAST sequence alignment. On the computational level, we use a cutting-edge parallel computation architecture to provide EDGAR with a scalable, fault-tolerant and efficient infrastructure. On the database level, a NoSQL database model with a compatible data schema will be used to improve data I/O.
 +We first selected the two most recently published sequence alignment applications, GHOSTX and DIAMOND, to compare with BLAST. Both claim to be thousands of times faster than BLAST. After carefully examining their alignment results, we found that GHOSTX cannot provide the required sensitivity: its output was missing a significant number of alignments and was therefore not compatible with the statistical orthology analysis of EDGAR. DIAMOND, on the other hand, is sensitive enough to provide results essentially identical to those of BLAST, and it fits EDGAR's analysis. The benchmark of the three alignment applications indicates that DIAMOND has the best performance on the same dataset in the same environment. Therefore, we chose DIAMOND as the replacement for BLAST in EDGAR.
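One way to quantify how many alignments a faster tool misses relative to BLAST is sketched below. It assumes both tools wrote tab-separated tabular output with the query ID in the first column and the subject ID in the second (the layout of BLAST's tabular format 6, which DIAMOND can also emit). The function names are hypothetical, and the comparison actually used in this project is not described in the text.

```python
from typing import Iterable, Set, Tuple


def read_hit_pairs(lines: Iterable[str]) -> Set[Tuple[str, str]]:
    """Collect (query, subject) pairs from tab-separated alignment output."""
    pairs = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not a valid hit line
        pairs.add((fields[0], fields[1]))
    return pairs


def recovered_fraction(reference: Set[Tuple[str, str]],
                       candidate: Set[Tuple[str, str]]) -> float:
    """Fraction of reference hit pairs that the candidate tool also reports."""
    if not reference:
        return 1.0  # nothing to recover
    return len(reference & candidate) / len(reference)
```

With real files, `read_hit_pairs(open(path))` works directly because a file object iterates over its lines. A low fraction indicates the kind of missing alignments reported for GHOSTX above.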
 +We then worked on the parallelization of DIAMOND, the second aspect of the EDGAR enhancement. We developed a new application, HAMOND, which integrates Apache Hadoop parallelism with DIAMOND to provide scalability, fault tolerance and high speed. The EDGAR-HAMOND integration proved to be significantly faster while producing results highly identical to the previous ones. HAMOND is also compatible with any in-house Hadoop cluster and with the Amazon cloud platform. The development of HAMOND is finished, and the publication is ready to be submitted.
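The text does not describe how HAMOND distributes work, so the following is only a generic sketch of the data-parallel idea behind running an aligner on a cluster: the query FASTA is split into independent chunks, each chunk can be aligned on a different worker, and the per-chunk results are concatenated afterwards. The function name and the chunk size parameter are hypothetical.

```python
from typing import Iterable, Iterator, List


def split_fasta(lines: Iterable[str],
                records_per_chunk: int) -> Iterator[List[str]]:
    """Yield lists of FASTA lines, each holding at most records_per_chunk records.

    A record starts at a header line beginning with '>' and runs until
    the next header. Records are never split across chunks.
    """
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    chunk: List[str] = []
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if count == records_per_chunk:
                yield chunk  # current chunk is full, start a new one
                chunk, count = [], 0
            count += 1
        if count:  # ignore stray lines before the first header
            chunk.append(line)
    if chunk:
        yield chunk
```

Because each query sequence is aligned independently of the others, the chunks need no communication with one another, which is what makes this kind of workload a natural fit for a map-style framework.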
 +The next step would be the change of database model. Since we are inclined to use a NoSQL database, the current MySQL data schema of EDGAR will be redesigned.
 +//Supervisors: Alexander Goesmann (Justus Liebig University Giessen), Alexander Sczyrba (Bielefeld University), Fiona Brinkman (Simon Fraser University)//
researchprojects/researchjiayu.1479216347.txt.gz · Last modified: 2016/11/15 13:25 by jiayu