YU Jia (started October 2014)
Next-generation sequencing (NGS) technology now allows researchers to analyze multiple genomes of several related bacterial strains. EDGAR (Efficient Database framework for comparative Genome Analysis using BLAST score Ratios) is a software platform for bacterial genome analysis maintained by Justus Liebig University Giessen. It includes all bacterial genomes publicly available from NCBI and combines a statistical approach with sequence alignment to provide orthology analysis. It also offers multiple visualization features for comparative genomics.
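To illustrate the idea behind the "BLAST score ratios" in EDGAR's name, the following is a minimal Python sketch, not EDGAR's actual code. It assumes that a score ratio is the bit score of a hit divided by the bit score of the query aligned against itself, and that two genes are called orthologs when they are reciprocal best hits whose ratios reach a cutoff. The function names and the default cutoff of 0.3 are hypothetical; in practice the cutoff would come from the statistical analysis.

```python
def score_ratio(hit_score, self_score):
    """Normalize a hit's bit score by the query's self-alignment bit score."""
    if self_score <= 0:
        raise ValueError("self_score must be positive")
    return hit_score / self_score


def best_hits(scores, self_scores, cutoff):
    """Map each query to its best subject whose score ratio reaches the cutoff.

    scores:      dict mapping (query, subject) -> bit score
    self_scores: dict mapping query -> bit score of the query against itself
    """
    best = {}
    for (query, subject), bits in scores.items():
        ratio = score_ratio(bits, self_scores[query])
        # Keep the hit only if it passes the cutoff and beats the current best.
        if ratio >= cutoff and (query not in best or ratio > best[query][1]):
            best[query] = (subject, ratio)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, self_a, self_b, cutoff=0.3):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit.

    The default cutoff is an illustrative assumption, not EDGAR's value.
    """
    best_ab = best_hits(a_vs_b, self_a, cutoff)
    best_ba = best_hits(b_vs_a, self_b, cutoff)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)
```

Normalizing by the self-hit score makes scores comparable across genes of different lengths, which is why a single cutoff can be applied to all pairs in this sketch.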
As NGS technology develops, more genomes are released, and EDGAR has reached a processing bottleneck. To keep its popular features while giving it a high-performance back end, we proposed improvements in three areas. On the algorithmic level, we test and benchmark the available alignment algorithms and applications in order to eventually replace the conventional BLAST sequence alignment. On the computational level, we use a state-of-the-art parallel computing architecture to provide EDGAR with a scalable, fault-tolerant and efficient infrastructure. On the database level, a NoSQL database model with a compatible data schema will be used to improve data I/O.
We first selected two recently published sequence alignment applications, GHOSTX and DIAMOND, for comparison with BLAST. Both claim to be thousands of times faster than BLAST. After carefully examining their alignment results, we found that GHOSTX does not provide the required sensitivity: its output was missing a significant number of alignments and was therefore not compatible with EDGAR's statistical orthology analysis. DIAMOND, on the other hand, is sensitive enough to produce results that closely match those of BLAST, and it fits EDGAR's analysis. The benchmark of the three alignment applications indicates that DIAMOND performs best on the same dataset in the same environment. We therefore chose DIAMOND as the replacement for BLAST in EDGAR.
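One way such a comparison of alignment results could be made is sketched below in Python. This is an illustration, not the procedure used in the project: it assumes both tools write BLAST-style tab-separated output with the query ID in the first column and the subject ID in the second, and it reports the fraction of reference query-subject pairs that the candidate tool also reports. The function names are hypothetical.

```python
def read_pairs(lines):
    """Collect (query, subject) pairs from tab-separated alignment lines."""
    pairs = set()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        pairs.add((fields[0], fields[1]))
    return pairs


def recovered_fraction(reference, candidate):
    """Fraction of reference pairs that also appear in the candidate set."""
    if not reference:
        return 1.0
    return len(reference & candidate) / len(reference)
```

A tool that misses many alignments yields a low recovered fraction against the BLAST reference, while a tool that closely matches BLAST yields a value near 1. A fuller comparison would also look at scores and alignment coordinates, not only at which pairs are reported.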
We then worked on the parallelization of DIAMOND, the second aspect of the EDGAR enhancement. We developed a new application, HAMOND, which integrates Apache Hadoop parallelism with DIAMOND to provide scalability, fault tolerance and high speed. The EDGAR-HAMOND integration proved to be significantly faster while producing results highly consistent with the previous ones. HAMOND is also compatible with any in-house Hadoop cluster and with the Amazon cloud platform. The development of HAMOND is finished, and the publication is ready to be submitted.
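The general idea of distributing an alignment job can be sketched as follows. This is not HAMOND's implementation and makes no use of the Hadoop API; it only illustrates, under the assumption that query sequences can be aligned independently of one another, how a FASTA query set might be split into chunks of whole records that separate workers could each align against the same reference database. The function names and the chunk size are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def chunk_records(records, chunk_size):
    """Group records into lists of at most chunk_size whole records."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    # Emit the final, possibly shorter, chunk.
    if chunk:
        yield chunk
```

Because each chunk contains only complete records, the per-chunk alignment outputs can simply be concatenated afterwards, and a failed chunk can be rerun on its own without repeating the rest of the job.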
The next step is to change the database model. Since we intend to use a NoSQL database, the current MySQL data schema of EDGAR will be redesigned.
Supervisors: Alexander Goesmann (Justus Liebig University Giessen), Alexander Sczyrba (Bielefeld University), Fiona Brinkman (Simon Fraser University)