Monday, July 16, 2012

July Houston Hadoop Meetup - Genomic data analysis with Hadoop


Dianhui (Dennis) Zhu presented "Genomic data analysis with Hadoop". He talked about using the Hadoop framework for pattern search in genomic sequence datasets. The talk is based on his three-year project at Baylor, which started using Hadoop a year ago. Dennis is a Senior Scientific Programmer at HGSC.

Dianhui covered the following topics:

1. Setting up a Hadoop test cluster with 4 nodes.
2. Code walkthrough and unit testing with Mockito and MRUnit.
3. Live demo: running the Hadoop application on the 4-node cluster.

The interesting technical problem Dennis showed was how to break a sequence into chunks before it reaches the Mapper. This is usually trivial in regular applications, but it is quite hard with the unbounded, unstructured data of a genome. The audience went through the actual code, asked many questions, and wanted to compare it to existing open source projects.
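To illustrate why chunking is tricky for pattern search, here is a minimal sketch in Python. It is not Dennis's code, and it does not use Hadoop. The function names `chunk_sequence` and `find_pattern` are hypothetical, and the overlapping-window approach is only one common way to handle the problem: if chunks do not overlap, a pattern that straddles a chunk boundary is missed, so each chunk here shares `len(pattern) - 1` characters with the next one.

```python
def chunk_sequence(sequence, chunk_size, overlap):
    """Yield (offset, chunk) pairs; consecutive chunks share `overlap` characters."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    for offset in range(0, len(sequence), step):
        yield offset, sequence[offset:offset + chunk_size]
        # Stop once a chunk has reached the end of the sequence.
        if offset + chunk_size >= len(sequence):
            break


def find_pattern(sequence, pattern, chunk_size):
    """Return sorted start positions of `pattern`, searching chunk by chunk.

    Each chunk plays the role of one mapper input (an assumption made
    for illustration). The overlap of len(pattern) - 1 guarantees that
    every match lies entirely inside at least one chunk.
    """
    overlap = len(pattern) - 1
    hits = set()  # a set removes matches reported twice by neighboring chunks
    for offset, chunk in chunk_sequence(sequence, chunk_size, overlap):
        start = chunk.find(pattern)
        while start != -1:
            hits.add(offset + start)  # convert to a position in the full sequence
            start = chunk.find(pattern, start + 1)
    return sorted(hits)


print(find_pattern("ACGTACGTTTACGA", "TACG", chunk_size=6))
```

In a real Hadoop job the chunks would come from a custom input format rather than from an in-memory string, and the duplicate removal would typically happen on the reduce side.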

Indeed, here is an article on the Cloudera blog, http://www.cloudera.com/blog/2009/10/analyzing-human-genomes-with-hadoop/, and it refers to the Crossbow open source project, http://bowtie-bio.sourceforge.net/crossbow/index.shtml. It will be interesting to see how that compares to Dennis's work.

Sunday, July 8, 2012

FreeEed™ is now in the cloud!





SHMcloud™ Press Release 7/9/12

eDiscovery processing: text extraction, culling, and delivery of native files, text, and metadata in CSV format.

Special introductory offer until August 15: $1 per machine-hour. How fast is that? At a recent show we processed 100 GB of Enron data in 1 hour for under $100, as seen here.

How can you get started?
Just go here, download the SHMcloud™ Player, and start!


Phone: 713-568-9753