Sunday, August 28, 2011

FreeEed is ready to do Enron processing

The command-line mode of FreeEed has been updated with the latest changes that came out of the GUI work. You can now use the program equally well as a command-line utility and as a GUI tool. Why is this important? For any serious project, such as processing all of the Enron data, you want to be able to script FreeEed and integrate it into a larger workflow.

Version 2.7.5 has been uploaded to the FreeEed site, and it is ready to take on the Enron set (on an Amazon machine, of course). And why is this so nice? Having FreeEed process a large set serves as regression testing: on every new release, we can re-process the set (which will grow as other data sources are added) and verify that the quality of processing did not go down and, in fact, improved.

The list of improvements can always be found here, but here it is for your reading pleasure:

  • Smaller FreeEed download
  • Capability to read remote resources as a data source using URI notation. The URI syntax is documented, and the program takes you to the right web page for help. You can use FTP with a user name and password, and a lot of other things: anything that is a valid URI and that the hosting site actually allows you to download. For example, if you want to process an Enron file from the EDRM site, you just give its URI as http://duaj3yp6waei2.cloudfront.net/edrm-enron-v2_bailey-s_pst.zip and FreeEed downloads the file.
  • Processing of dozens of archive formats: http://truezip.java.net/kick-start/no-maven.html
  • Processing of archives inside of archives recursively
  • Command-line running is restored and can be used for scripting large jobs
  • Option -enron to process Enron data set (specific test script)
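
To make the remote-URI and archive-inside-archive items more concrete, here is a minimal sketch of the idea in Python, using only the standard library. It is not FreeEed's code (FreeEed is a Java program and relies on TrueZIP for archives); the function names fetch and walk_archive are hypothetical, and the sketch handles only zip files, recognized by their .zip extension.

```python
import io
import zipfile
from urllib.request import urlopen


def fetch(uri: str) -> bytes:
    """Download a resource given as a URI (http, ftp, file, ...).

    Hypothetical helper: reads the whole resource into memory,
    which is only reasonable for small files.
    """
    with urlopen(uri) as response:
        return response.read()


def walk_archive(data: bytes, prefix: str = ""):
    """Yield (path, content) for every regular file in a zip archive,
    descending recursively into zip files found inside it.

    Nested paths are joined with "!/" purely as an illustrative convention.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no content of their own
            payload = archive.read(info)
            path = prefix + info.filename
            if info.filename.lower().endswith(".zip"):
                # An archive inside the archive: recurse into it
                yield from walk_archive(payload, path + "!/")
            else:
                yield path, payload
```

Assuming the EDRM link above is still live and the file fits in memory, usage would look like walk_archive(fetch("http://duaj3yp6waei2.cloudfront.net/edrm-enron-v2_bailey-s_pst.zip")), which yields each contained file together with its path inside the nested archives.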

What is next? Processing 'em on a Hadoop cluster, of course!
