Monday, July 16, 2012
July Houston Hadoop Meetup - Genomic data analysis with Hadoop
Dianhui (Dennis) Zhu presented "Genomic data analysis with Hadoop". He talked about using the Hadoop framework to do pattern search in genomic sequence datasets. The talk was based on his three-year project at Baylor, which started using Hadoop a year ago. Dennis is a Senior Scientific Programmer at HGSC.
Dianhui covered the following topics:
1. Setting up a Hadoop test cluster with 4 nodes.
2. Code walkthrough and unit testing with Mockito and MRUnit.
3. Live demo: running the Hadoop application on the 4-node cluster.
The interesting technical problem Dennis showed was how to break a sequence into chunks before it reaches the Mapper. This is usually trivial in regular applications, but it is quite hard with the unlimited, unstructured data of a genome. The audience went through the actual code, asked many questions, and wanted to compare it with existing open source projects.
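To make the chunking problem concrete, here is a minimal sketch in plain Java. It is not Dennis's code, and the talk notes do not say which approach he used. It assumes one common technique: cutting the sequence into fixed-size chunks that overlap by one less than the pattern length, so a match that straddles a chunk boundary still appears whole in at least one chunk. The names SequenceChunker, Chunk, chunkSize and patternLength are hypothetical, and the sketch works on an in-memory string rather than on HDFS input.

```java
import java.util.ArrayList;
import java.util.List;

public class SequenceChunker {

    /** One chunk of the sequence, plus the offset where it starts in the full sequence. */
    public static final class Chunk {
        public final int offset;
        public final String bases;

        public Chunk(int offset, String bases) {
            this.offset = offset;
            this.bases = bases;
        }
    }

    /**
     * Splits a sequence into chunks of at most chunkSize characters.
     * Consecutive chunks overlap by (patternLength - 1) characters, so any
     * window of patternLength characters lies entirely inside some chunk.
     * Hypothetical helper for illustration only.
     */
    public static List<Chunk> chunk(String sequence, int chunkSize, int patternLength) {
        if (patternLength < 1 || chunkSize < patternLength) {
            throw new IllegalArgumentException("need 1 <= patternLength <= chunkSize");
        }
        int overlap = patternLength - 1;
        int step = chunkSize - overlap; // always >= 1 given the check above

        List<Chunk> chunks = new ArrayList<>();
        for (int start = 0; start < sequence.length(); start += step) {
            int end = Math.min(start + chunkSize, sequence.length());
            chunks.add(new Chunk(start, sequence.substring(start, end)));
            if (end == sequence.length()) {
                break; // the last chunk already reaches the end of the sequence
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Each printed line is what a Mapper could receive as (offset, chunk).
        for (Chunk c : chunk("ACGTACGTAC", 4, 2)) {
            System.out.println(c.offset + "\t" + c.bases);
        }
    }
}
```

In a real Hadoop job this logic would more likely live in a custom InputFormat and RecordReader, with the chunk offset as the key, so that a Mapper can report match positions relative to the whole sequence. Matches found inside the overlap region can be reported twice, so a later step would have to deduplicate them.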
Indeed, here is an article on the Cloudera blog, http://www.cloudera.com/blog/2009/10/analyzing-human-genomes-with-hadoop/, and it refers to the Crossbow open source project, http://bowtie-bio.sourceforge.net/crossbow/index.shtml. It will be interesting to see how that compares to Dennis's work.