Tuesday, August 30, 2011

FreeEed processing is made verifiably stable

FreeEed relies on Tika to process emails. In fact, one code line tells Tika to extract all attachments in whatever formats they happen to be, and process these also, adding the extracted text to the total.

Well, today I found a bug in Tika (it was not closing those attachments) that led FreeEed to crash. But also today the nice Tika programmers Mike McCandless, Nick Burch, and Jukka Zitting fixed the bug! So FreeEed is happily churning through those gigabytes of Enron PST on the EC2 machines now.

Sunday, August 28, 2011

FreeEed is ready to do Enron processing

The command-line mode of FreeEed has been updated with the latest changes that came out of the GUI work. Now you can use the program equally well as a command-line utility and as a GUI tool. Why is this important? For any serious project, such as processing all of the Enron data, you want to be able to script the work of FreeEed and integrate it into a larger whole.

Version 2.7.5 has been uploaded to the FreeEed site, and it is ready to take on the Enron set (on an Amazon machine, of course). And why is this so nice? Having FreeEed process a large set serves as regression testing: on every new release, we can re-process the set (which will grow as other data sources are added) and verify that the quality of processing did not go down and in fact improved.
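
Such a regression check can be sketched in a few lines of Java. This is only an illustration under assumptions: the RunStats class and its three counts are hypothetical, invented for this example, and say nothing about what FreeEed actually reports after a run.

```java
import java.util.ArrayList;
import java.util.List;

final class RegressionCheck {

    /** Hypothetical per-run summary; the real output of a processing run may look different. */
    static final class RunStats {
        final int filesProcessed;
        final int filesWithText;
        final int failures;

        RunStats(int filesProcessed, int filesWithText, int failures) {
            this.filesProcessed = filesProcessed;
            this.filesWithText = filesWithText;
            this.failures = failures;
        }
    }

    /** Describes every count that got worse; an empty list means no regression was found. */
    static List<String> regressions(RunStats previous, RunStats current) {
        List<String> problems = new ArrayList<>();
        if (current.filesProcessed < previous.filesProcessed) {
            problems.add("files processed dropped from "
                    + previous.filesProcessed + " to " + current.filesProcessed);
        }
        if (current.filesWithText < previous.filesWithText) {
            problems.add("files with extracted text dropped from "
                    + previous.filesWithText + " to " + current.filesWithText);
        }
        if (current.failures > previous.failures) {
            problems.add("failures rose from "
                    + previous.failures + " to " + current.failures);
        }
        return problems;
    }
}
```

A release script could call regressions(lastRelease, thisRelease) after re-processing the set and refuse to publish when the list is not empty. Once new data sources are added to the set, raw counts stop being comparable, so ratios would probably be a better choice at that point.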

The list of improvements can always be found here, but here it is for your reading pleasure:

  • Smaller FreeEed download
  • Capability to read remote resources as a data source using URI notation. The URI syntax is documented, and the program takes you to the right web page for help. You can include ftp with user name and password, and a lot of other things: anything that is a valid URI and that the site where it resides will actually allow you to download. For example, if you want to process an Enron file from the EDRM site, you just give its URI as http://duaj3yp6waei2.cloudfront.net/edrm-enron-v2_bailey-s_pst.zip and FreeEed downloads the file.
  • Processing of dozens of archive formats: http://truezip.java.net/kick-start/no-maven.html
  • Processing of archives inside of archives recursively
  • Command-line running is restored and can be used for scripting large jobs
  • Option -enron to process Enron data set (specific test script)
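
The recursive handling of archives inside archives can be sketched with nothing but the JDK. Take this as an illustration under assumptions, not as FreeEed's code: the list above points to TrueZIP for the many archive formats, while this sketch swaps in java.util.zip and therefore handles zip only. The names NestedZipWalker and listFiles are made up for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class NestedZipWalker {

    /**
     * Lists the files in a zip stream, descending into every entry whose
     * name ends in ".zip". Nested paths are reported as outer.zip!/inner.txt.
     */
    static List<String> listFiles(InputStream in, String prefix) throws IOException {
        List<String> names = new ArrayList<>();
        // Not closed here: closing would also close the enclosing stream,
        // which the caller still needs. The caller closes the outermost stream.
        ZipInputStream zin = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zin.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            String path = prefix + entry.getName();
            if (entry.getName().toLowerCase(Locale.ROOT).endsWith(".zip")) {
                // Reading from zin now yields only the bytes of this entry,
                // so it can be parsed as a zip stream of its own.
                names.addAll(listFiles(zin, path + "!/"));
            } else {
                names.add(path);
            }
        }
        return names;
    }
}
```

Calling listFiles(new FileInputStream("outer.zip"), "") returns the plain files at every nesting level. A real processor would hand each file's bytes to the text extractor at the point where this sketch only records the name.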
What is next? Processing 'em on a Hadoop cluster, of course!

Sunday, August 21, 2011


The first sentence sets the stage and excites the interest that will carry you through the 82-page article: "Nothing causes litigators greater anxiety than the possibility of doing, or failing to do, something during a civil case that waives attorney–client privilege or work-product protection."

However, as a technologist, I can't resist thinking that we are simply glorifying a technical problem and making a legal problem out of it. Imagine that computer search and analysis technology were as precise as humans, or more so: would the problem exist at all? No! Or at least, not to that degree.

It is an old adage, but before Google, say, just 15 years ago, one would say, "Cataloging of all of the world's information, and making an answer to any question, even based on key words, available to anyone in under a second? What nonsense!" Nevertheless, today we expect that as a routine. More than that, Google's Page promises a brain implant that would whisper the answers to your thoughts.

So the challenge for FreeEed is clearly stated: after (1) making it work routinely with large data sets, such as Enron's email, for massive statistical verification, and (2) putting the processing into the compute cloud, making it available from any PC or tablet computer, the next step is to (3) automate the privilege review to make it usable and useful. Being open source and not a black box, it will inspire more trust, cooperation, and verification. Being a low-cost solution, it will be useful.

How to read data sources in eDiscovery

In addition to local files, FreeEed needs to read such resources as

ftp:// with user name and password

and so on. As it turns out, the regular Java URL classes do all that! Why Oracle/Sun does not publicize this, I have no idea, but the following three lines are all that you need to read from anywhere on any network:

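// dir holds the location as a URL string, e.g. an ftp:// or http:// address
URL url = new URL(dir);
URLConnection con = url.openConnection();
// Wrap the raw stream in a buffer; close it when done reading
BufferedInputStream in = new BufferedInputStream(con.getInputStream());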

Hooray for Java, and FreeEed now has it - watch for the next release.
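
For anyone who wants to try this outside of FreeEed, here is a minimal, self-contained sketch of the same idea. The class and method names (RemoteSourceReader, readAll) are made up for the example, and it goes through java.net.URI before asking for the URL, which is my choice here rather than anything the three lines above require. It also reads the whole resource into memory, which is fine for a demo but not for a multi-gigabyte PST file.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;

final class RemoteSourceReader {

    /** Reads the whole resource named by a URI string (file:, http:, ftp:, ...) into memory. */
    static byte[] readAll(String location) throws IOException {
        URLConnection con = URI.create(location).toURL().openConnection();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // try-with-resources (Java 7) closes the stream even if reading fails
        try (InputStream in = new BufferedInputStream(con.getInputStream())) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        return out.toByteArray();
    }
}
```

The same call works for a local file given as a file: URI, which makes it easy to test without a network. For large inputs, the loop would copy into a FileOutputStream instead of a byte array.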

Friday, August 19, 2011

Open source software for lawyers getting traction

Open source software for lawyers will be discussed at the upcoming ILTA conference; here is a link to an article by Evan Koblentz about it. Evidently it is popular, since it has been "slashdotted." FreeEed is mentioned there.

It all started with Evan Koblentz's ground-breaking article "Open Source Could Change the Future of E-Discovery".

Thursday, August 18, 2011

Google highlights trouble in detecting web-based malware

Google issued a new study on Wednesday detailing how it is becoming more difficult to identify malicious websites and attacks, with antivirus software proving to be an ineffective defense against new ones.

Continue reading...

Art: Henri De Toulouse-Lautrec - Two Knights in Armor

Wednesday, August 17, 2011

I am on Java 7 now!


It took me some going back and forth, and pulling out some hair, but it's done. I am on Java 7.

The trick was probably not in the JDK itself, but in the switch: change java and JAVA_HOME, then force your build environment and your IDE (NetBeans) to pick up the changes. In the IDE, I had to pretend to rename and refactor packages, then bring them back. Finally, I got it, so now my pom.xml lists both source and target as Java 7, that is,


and even minor language changes work, like this switch on Strings:

switch (mycase) {
  case "aha!": {
    // ... handle "aha!" ...
    break;
  }
  case "uhu!": {
    // ... handle "uhu!" ...
    break;
  }
}

There may be more glitches, but based on my experience, you can resolve all.
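
If you want to check that your own setup really compiles as Java 7, a complete String switch is a quick test. The sketch below is only an illustration: the class name, the describe method, and the returned words are made up for this example.

```java
final class StringSwitchDemo {

    /** Maps an exclamation to a word using a switch on String (needs Java 7 or later). */
    static String describe(String mycase) {
        // Matching follows String.equals, so it is case-sensitive;
        // a null argument throws NullPointerException.
        switch (mycase) {
            case "aha!":
                return "discovery";
            case "uhu!":
                return "agreement";
            default:
                return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("aha!")); // prints "discovery"
        System.out.println(describe("hmm"));  // prints "unknown"
    }
}
```

With source and target below Java 7, the compiler rejects the switch, so a successful build tells you the new settings were picked up.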

Tuesday, August 16, 2011

Houston eDiscovery Masters Series

It was awesome to meet so many top lawyers and eDiscovery consultants -- at happy hour! Many of them knew about FreeEed and could spell it right away. Those who did not, understood the idea of open source eDiscovery with all its implications in a moment.

Monday, August 8, 2011

Free Computer Forensics Toolkit (SIFT) from SANS

Below is a reprint of the litsupport group announcement. For FreeEed, this comes at a very opportune time: eDiscovery needs integrated forensics capabilities, and the Linux-based SIFT looks like a perfect match.

August 04, 2011, Washington DC

The SANS Institute reported today a comparison of the capabilities of the recently upgraded SIFT forensics toolkit with the most popular commercial forensics tools. Although the commercial tools maintain advantages over SIFT in some areas, the free SIFT tool exceeds the capabilities of the commercial tools in other areas. “Even if SIFT cost tens of thousands of dollars,” says Alan Paller, director of research at SANS, “it would be a very competitive product.” At no cost, it should be part of the portfolio in every organization that has skilled forensics analysts.

The comparison:

Capability | SIFT Workstation 2.1 | Guidance Software EnCase v6 | AccessData FTK 3
Memory Analysis X Limited Limited
Super Timeline Generation and Analysis X
Mobile Device Support Limited
Automated Windows Registry Parsing X Limited Limited
File System Parsing X X X
Network Forensics X
Malware Analysis X

Some testimonials about the SIFT Workstation

"The SIFT Workstation has quickly become my "go to" tool when conducting an exam. The powerful open source forensic tools in the kit on top of the versatile and stable Linux operating system make for quick access to most everything I need to conduct a thorough analysis of a computer system." - Ken Pryor, GCFA, Robinson, IL Police Department

"Configuring a forensic analysis platform on your workstation can take a lot of time, and installing/setting up applications can be a pain at times. The SANS SIFT workstation has done the heavy lifting already with a wealth of useful, relevant tools - things like volatility, sleuthkit (with autopsy and ptk), pyflag and (my personal favorite) log2timeline. It gives the best of both worlds, both CLI and GUI. The best thing is, you don't need a dongle or have to worry about licensing, since it's all free/open source! SIFT is an excellent platform
for analysis and I have found it to be very beneficial during investigations." - Frank McClain, GCFA, GCIH, CHFI

For more information regarding the SIFT Workstation 2.1 release or to download it, the link you should use is: http://computer-forensics.sans.org/community/downloads

Rob Lee, Lead Digital Forensics and Incident Response -
SANS institute rlee@sans.org
801 4th Street SE
Washington DC 20003

Sunday, August 7, 2011

The Potential Market Size of Big Data in coming 2-3 years?

Michael Segel, Solutions Architect at Nokia, had this to say:

It's big. :-)

Ok seriously... relatively speaking, the potential is unlimited.
Even though you have a finite number of companies, the permutations and multiple projects that would fit in to this 'big data' space make the opportunity huge.

Focusing on Hadoop, you have HDFS for storage, Map/Reduce for parallel processing, HBase/Hive for persistence, and then other components in the ecosystem like Oozie for control, Flume, Scribe for data acquisition... All of these components allow a company to look at their information in a different light. You can use HDFS for storage, add in HBase, and not use Map/Reduce. You can use Map/Reduce on small data sets that are computationally intensive, so you don't need HBase/Hive.

Think of Hadoop as being a new tool kit that lets you do things you couldn't do before.

This is why it's a disruptive technology.

The reason you are seeing a huge growth in Hadoop and derivative technology is that it's currently transitioning from leading edge to mainstream. Enough early adopters have shown success that mainstream Fortune 1000 companies are starting to look at it.
Add to this start-ups and SMBs, where the low cost and shortened runways make adoption of Hadoop easier.

Your limitations are human ones. Potential adopters' vision and budget are your only real constraints.


I think it summarizes this so well that I could not resist re-posting.

Friday, August 5, 2011

Malpractice Suit Targets Quality of BigLaw’s Temporary Lawyers

A legal malpractice suit filed against McDermott Will & Emery raises questions about outside lawyers hired to help screen documents for clients.

An amended malpractice suit filed last week by J-M Manufacturing, also known as JM Eagle, claims McDermott’s outside contract lawyers “negligently performed their duties” while screening documents, the Wall Street Journal (sub. req.) reports. The newspaper says the suit “is seen in the industry as an important case concerning the quality of work performed by a growing cadre of temp lawyers who are paid as little as $25 to $30 an hour to review documents.” To continue reading, click here.

My comment: Interesting. Mind-numbing work, and the computers should be better at such routine tasks. Once the first stage of eDiscovery with the open source FreeEed is complete, the next one is exactly that - help for automated review. Encouraging article.

FreeEed news: V 2.5, distribution and manual

We are working hard on improving FreeEed, and it's moving well, thanks in part to our users' feedback. We also got an official tester on the team, so that should improve the quality of our releases.

You can always find the latest version on the FreeEed web site, and this page now has a manual with screen shots.

What's next? Firstly, we want to start processing the Enron data, posting the results on our web site. Secondly, we are rushing to allow you to run FreeEed on any computer, be it Windows, Mac, or Linux, while harnessing the computational power of Amazon for processing. Can't wait for this to happen!

Thursday, August 4, 2011

NameNode HA from HortonWorks

While in Santa Clara, I attended the "NameNode HA" presentation by Sanjay Radia and was very much impressed by it. Back in Houston, I retold what I had heard at the Houston Hadoop Meetup, but was at a loss to explain exactly what is new in the HortonWorks project. I asked Sanjay, who graciously provided the details. With a little editing, it goes as follows.

To answer your question, Mark, there were no previous offerings like that, and certainly not from Cloudera. Cloudera does not have such an offering, and that is why they are working with us.

The only solution currently in use is Avatar NN, which is used at Facebook for doing manual failover.

So, what is different? Firstly, we deal with all the corner cases for fencing. Secondly, our NN HA is hot, not cold or warm. And thirdly, the NN does not have the necessary states such as Active and Standby; you can try to fake the safe mode to do that.

Our doc on the jira gives the details.

Art: Raja Ravi Varma - A Student