Monday, September 26, 2011

FreeEed is close to running on Windows; also watch for Continuous Integration

I am adding Windows single-node processing, and it is going more easily than I expected. Since Hadoop requires too much setup on Windows, I instead detect the Windows platform in the code and run the processing from beginning to end without Hadoop. There is a lot of useful functionality in FreeEed, such as parsing all file formats with Tika and culling with Lucene, so even without Hadoop much of its usefulness remains.
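To illustrate the idea, here is a minimal sketch of a platform check in Java. It is not FreeEed's actual code: the class name `PlatformCheck`, the method names, and the mode labels "local" and "hadoop" are all made up for this example. The only real API used is the standard `os.name` system property.

```java
import java.util.Locale;

public class PlatformCheck {

    /** Returns true when the given os.name value looks like Windows. */
    public static boolean isWindows(String osName) {
        return osName != null
                && osName.toLowerCase(Locale.ROOT).startsWith("windows");
    }

    /**
     * Picks a processing mode for the platform.
     * "local" and "hadoop" are hypothetical labels used only in this sketch.
     */
    public static String chooseMode(String osName) {
        return isWindows(osName) ? "local" : "hadoop";
    }

    public static void main(String[] args) {
        // os.name is a standard JVM system property, e.g. "Windows 7" or "Linux".
        String osName = System.getProperty("os.name");
        System.out.println(osName + " -> " + chooseMode(osName));
    }
}
```

Passing the OS name in as a parameter, rather than reading the property inside the check, keeps the decision easy to test on any machine.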



What gives me particular pleasure is watching how every new code iteration is automatically tested by the Jenkins continuous integration server. It feels good to see that the code's build and correct operation are verified automatically. I will add more tests to the build, to better exercise all of the software's capabilities.

Sunday, September 18, 2011

Symantec: Files, Databases Overtake E-Mail in E-Discovery

Respondents gave a surprising answer to a question about how frequently various types of ESI are requested during legal and regulatory processes. Files and documents are requested in 67 percent of situations, followed by application and database records at 61 percent, and e-mail at 58 percent, they said. Microsoft SharePoint records are requested 51 percent of the time, while messaging formats such as instant messaging, texts, and BlackBerry PIN messages are needed 44 percent of the time. Data from social media trailed, being needed for 41 percent of ESI requests.

Continue reading...

Wednesday, September 7, 2011

FreeEed used to process the complete Enron data

From the beginning
To the end
The Enron data set is publicly available. In particular, EDRM provides this data in a convenient format. The data set was processed with FreeEed, and the results were made available for reference and for feedback. Each PST produced (1) a zip archive containing the native files, the complete text of each email with all attachments, and exceptions, if any; (2) a CSV file with the metadata; and (3) a short summary report.

Here are some interesting statistics, which will tell users what to expect. Using the high-CPU EC2 machine, each gigabyte took on average one hour to process. The cost of processing was below $1 per gigabyte. The processing was done on one machine and took about 2 days. The time could have been shortened to under 4 hours by using 25-50 machines, but at this time we were interested in watching and debugging the process, not in optimizing it.
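The scaling estimate can be checked with simple arithmetic. The sketch below assumes that the work splits evenly across machines with no startup or coordination overhead, which is an idealization. The figure of 48 gigabytes is only what "about 2 days at one hour per gigabyte" implies, not a measured size, and the class and method names are hypothetical.

```java
public class ProcessingEstimate {

    /**
     * Estimated wall-clock hours, assuming the work divides evenly across
     * machines with no overhead (an idealized assumption).
     */
    public static double wallClockHours(double gigabytes,
                                        double hoursPerGigabyte,
                                        int machines) {
        if (machines < 1) {
            throw new IllegalArgumentException("machines must be >= 1");
        }
        return gigabytes * hoursPerGigabyte / machines;
    }

    public static void main(String[] args) {
        double gigabytes = 48.0;       // assumption: implied by ~2 days at 1 GB/hour
        double hoursPerGigabyte = 1.0; // average rate reported in the post

        for (int machines : new int[] {1, 25, 50}) {
            System.out.printf("%d machine(s): about %.2f hours%n",
                    machines,
                    wallClockHours(gigabytes, hoursPerGigabyte, machines));
        }
    }
}
```

Under these assumptions, both 25 and 50 machines come in well below the 4-hour mark, which leaves room for the overhead that a real cluster run would add.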

While processing was going on, we were also fixing the bugs we observed, mainly in the Tika parsers, and the Tika team fixed some bugs with a turnaround time of under one day. There is more work to do and more re-processing in sight, but the main takeaway is that FreeEed is mature and stable and can be relied upon for processing. Now is the time to take it to the next level by creating a Windows/Mac/Linux thin client and using Amazon EC2 for processing, which will make eDiscovery processing easily available to a non-geek user.

Thursday, September 1, 2011

The first Enron tests are in

The first 5 Enron mailboxes are in and can be found here.

The processing is done on Amazon machines, and the results are pushed to the S3 cloud, so that the transfer costs are 0. As planned, FreeEed just reprocesses all projects and puts the data in the cloud, and the links on the web site do not have to change.

For each project, you get a zip of the native files, the metadata in CSV format, and a short text report.

ejusdem generis

I am reading this wonderful letter by Mike Godwin (PDF), which he wrote when he served as top lawyer for the Wikimedia Foundation, addressed to the FBI over Wikipedia's use of their seal.

I am mostly enjoying the style, but I am also looking up the terms, and I came across this one, ejusdem generis, which is explained here and is used to interpret loosely written statutes. Well, amazingly enough, this is the Talmudic rule of prat-uklal (a specific example followed by a general category), and it is explained here!

Back to the letter, which is wonderful in style!