Monday, March 3, 2008

An open-source eDiscovery engine?

Derek Gottfrid of The New York Times was able to upload 4 TB of data and create 1.5 TB of PDFs out of it in 24 hours. He did it for $800 paid for the upload and $240 paid for processing (my estimate). How?

On his blog he tells us how he did it. In brief, he used 100 Amazon EC2 machines running Hadoop (an open-source implementation of Google's MapReduce) together with his own scripts, which were already running on his machine. Essentially, he cloned his machine 100 times on EC2, and Hadoop took care of running all the clones concurrently.
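To give a feel for how an existing script can be handed to Hadoop, here is a rough sketch of a mapper in the style of Hadoop Streaming, where each mapper receives lines of input and emits tab-separated key/value lines. This is not Derek's code, and I do not know that he used Streaming at all; the `convert_to_pdf` helper and the file names are hypothetical stand-ins for whatever script was already doing the conversion.

```python
import os
import sys


def convert_to_pdf(path):
    # Hypothetical placeholder: a real job would run the existing
    # conversion script here. This stub only computes the output name.
    base, _ext = os.path.splitext(path)
    return base + ".pdf"


def map_lines(lines):
    # Each input line is assumed to hold one input file path.
    # The mapper emits "input_path<TAB>output_path" records.
    for line in lines:
        path = line.strip()
        if path:
            yield f"{path}\t{convert_to_pdf(path)}"


def main():
    # When used as a streaming mapper, read stdin and write stdout.
    for record in map_lines(sys.stdin):
        print(record)
```

A script used as a mapper would call `main()` at the bottom. Hadoop would then start many copies of it across the machines and feed each copy its own share of the input, which is the "running them all concurrently" part.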

Bravo, Derek!

I only wonder how long the upload took. I asked him on the blog. And by the way, compare this to the $1,000,000 it would cost if done by an eDiscovery vendor at the low price of $250 per gigabyte (4 TB is roughly 4,000 GB). Now, I know that discovery vendors also make the data searchable and put it in a format suitable for upload to a litigation tool, but that can surely be added for the remaining balance of $998,960.
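For anyone who wants to check the arithmetic, here is a small sketch of the comparison. The $800, $240, and $250 per gigabyte figures are the ones quoted in this post, and 4 TB is taken as 4,000 GB. The breakdown of the $240 into 100 machines for 24 hours at $0.10 per machine-hour is only an assumption that happens to be consistent with that figure.

```python
# Figures quoted in the post.
DATA_GB = 4000             # 4 TB uploaded, taking 1 TB = 1000 GB
UPLOAD_COST = 800          # dollars, as quoted
PROCESSING_COST = 240      # dollars, the post's estimate
VENDOR_RATE_PER_GB = 250   # dollars per GB

# Assumption (not stated in the post): a possible breakdown of the
# processing estimate. The rate is kept in cents to stay in integers.
MACHINES = 100
HOURS = 24
ASSUMED_CENTS_PER_MACHINE_HOUR = 10


def cloud_cost():
    # Total paid for the EC2 and Hadoop run.
    return UPLOAD_COST + PROCESSING_COST


def vendor_cost():
    # What a vendor would charge for the same volume of data.
    return DATA_GB * VENDOR_RATE_PER_GB


def remaining_balance():
    # Budget left over for search and litigation-tool formatting.
    return vendor_cost() - cloud_cost()


def assumed_processing_cost():
    # Machine-hours times the assumed rate, converted to dollars.
    return MACHINES * HOURS * ASSUMED_CENTS_PER_MACHINE_HOUR // 100


print("cloud:", cloud_cost())
print("vendor:", vendor_cost())
print("remaining:", remaining_balance())
print("assumed processing:", assumed_processing_cost())
```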

Does anybody want to join me on this project?

Art: Fernando Botero - Man Reading a Paper 1996

1 comment:

Mark Kerzner said...

OK, so I did start it. Look here:

http://code.google.com/p/ediscovery/