Wednesday, September 7, 2011

FreeEed used to process the complete Enron data

The Enron data set is publicly available; in particular, EDRM provides it in a convenient format. The data set was processed with FreeEed, and the results were made available for reference and for feedback. Each PST produced (1) a zip archive containing the native files, the complete text of each email with all its attachments, and exceptions, if any; (2) a CSV file with the metadata; and (3) a short summary report.

Here are some statistics that tell users what to expect. On a high-CPU EC2 machine, each gigabyte took about one hour to process on average, at a cost below $1 per gigabyte. The processing was done on one machine and took about 2 days. The time could have been shortened to under 4 hours by using 25-50 machines, but at this stage we were more interested in watching and debugging the process than in optimizing it.
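
For a rough feel of these numbers, here is a back-of-the-envelope sketch in Python. It is not part of FreeEed. The total of 48 GB is an assumption derived from "about 2 days" at roughly one hour per gigabyte, the function name `estimate` is made up for illustration, and the calculation assumes the work splits evenly across machines with no startup or coordination overhead, which real runs will not quite achieve.

```python
# Back-of-the-envelope estimate only; not FreeEed code.
HOURS_PER_GB = 1.0      # from the post: about one hour per gigabyte
COST_PER_GB_USD = 1.0   # from the post: an upper bound ("below $1 per Gig")


def estimate(total_gb, machines):
    """Return (wall_clock_hours, cost_upper_bound_usd) under ideal scaling."""
    machine_hours = total_gb * HOURS_PER_GB
    # Assumption: the PST files divide evenly across machines.
    wall_clock_hours = machine_hours / machines
    # Total machine-hours stay the same, so the cost bound does not
    # depend on how many machines share the work.
    cost_upper_bound = total_gb * COST_PER_GB_USD
    return wall_clock_hours, cost_upper_bound


if __name__ == "__main__":
    TOTAL_GB = 48  # assumption: ~2 days on one machine at ~1 hour per GB
    for machines in (1, 25, 50):
        hours, cost = estimate(TOTAL_GB, machines)
        print(f"{machines:>2} machines: ~{hours:.1f} h, under ${cost:.0f}")
```

Under these assumptions, both 25 and 50 machines land well inside the "under 4 hours" figure, while the total cost stays the same because the number of machine-hours does not change.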

While processing was going on, we were also fixing the bugs we observed, mainly in the Tika parsers, and the Tika team fixed some bugs with a turnaround time of under one day. There is more work to do and more re-processing in sight, but the main takeaway is that FreeEed is mature and stable and can be relied upon for processing. Now is the time to take it to the next level by creating a Windows/Mac/Linux thin client and using Amazon EC2 for processing, which will make eDiscovery processing easily available to non-geek users.

