Sunday, March 27, 2011

Open source eDiscovery - preparing for beta release

We have had good interest and feedback so far (thank you, everybody!), and we are preparing for a beta release, where the software actually does some useful work.

I have put a number of documentation pages on the project's wiki. I have also added basic culling based on the text of the documents.

The two enhancements that remain for beta are

  • organizing metadata into standard fields (that is in addition to the fact that all metadata, standard or not, is extracted); and
  • processing emails from a PST file.
I have decided to stop putting the documentation into text files in the project itself and to put it on the GitHub wiki instead. This means being tied to GitHub and relying on it, but so what? It is very convenient for editing.


Thursday, March 24, 2011

How important is blogging for a lawyer?

Not important at all, but really, very important. How can both be true? One asks a trusted friend for a lawyer, and the friend gives him the reference... based on that lawyer's blog. Continue reading...

Wednesday, March 23, 2011

FreeEed is reading processing parameters from a file

FreeEed now reads processing parameters from a file, has a number of default settings, and saves each run into a dated parameter file. More precisely,

  • All defaults are read from default.freeeed.properties, which is provided with the distribution;
  • They can be overridden by any parameter file (such as my.freeeed.properties) given with the option -param_file "filename";
  • Command-line parameters take precedence over all.

That is simple and powerful: you don't need to know about parameters at all, and it will give you all the defaults. Or you can change every one of them.
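
The precedence described above can be sketched in a few lines of Python. This is only an illustration, not FreeEed's actual code: the helper names (parse_properties, resolve_parameters) and the keys (input, culling, reducers) are hypothetical, and the parser handles only plain key=value lines.

```python
from collections import ChainMap


def parse_properties(text):
    """Parse simple key=value lines; skip blank lines and '#' comments."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        params[key.strip()] = value.strip()
    return params


def resolve_parameters(defaults, param_file=None, command_line=None):
    """Merge the three layers; maps listed first in the ChainMap win."""
    return dict(ChainMap(command_line or {}, param_file or {}, defaults))


# Hypothetical keys and values, for illustration only.
defaults = parse_properties("input=sample_data\nculling=\nreducers=1")
user_file = parse_properties("# my.freeeed.properties\nculling=email")
command_line = {"reducers": "4"}

params = resolve_parameters(defaults, user_file, command_line)
print(params)
```

With no parameter file and no command-line options, resolve_parameters(defaults) simply returns the defaults, which matches the "you don't need to know about parameters at all" case.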

In addition, when the program is run, it saves the parameters from the current run into a file that has a date in its name, such as freeeed.parameters.110323_175926.properties.
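
Here is a rough Python sketch of how such a dated file name could be built. It assumes the stamp in the example is yymmdd_HHMMSS (which agrees with the date of this post), and the function names are hypothetical, not taken from the FreeEed source.

```python
from datetime import datetime


def dated_parameter_filename(now=None):
    """Build a name like freeeed.parameters.110323_175926.properties."""
    now = now or datetime.now()
    # Assumed stamp layout: two-digit year, month, day, then hour, minute, second.
    stamp = now.strftime("%y%m%d_%H%M%S")
    return "freeeed.parameters.{}.properties".format(stamp)


def format_properties(params):
    """Render parameters as sorted key=value lines, one per line."""
    return "".join("{}={}\n".format(k, v) for k, v in sorted(params.items()))


name = dated_parameter_filename(datetime(2011, 3, 23, 17, 59, 26))
print(name)
print(format_properties({"reducers": "4", "culling": "email"}))
```

Writing format_properties(params) to the file named by dated_parameter_filename() at the start of each run would leave a record of exactly which settings were used.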

Tuesday, March 22, 2011

Plans for open source eDiscovery FreeEed

Here is what is needed to make the package more competitive:

  • Add culling, PST (email) processing, deduping, and PDF/TIFF creation;
  • Process all of the Enron email (about 150 mailboxes) and announce each milestone: the first mailbox, the first 10, the first 100, and all of them;
  • Add advanced text processing, similar to predictive coding, which is the current direction of eDiscovery.
That will take a few months, but gradually the package will become more and more useful.

Sunday, March 20, 2011

New open source software for eDiscovery

I have created and published open source software for eDiscovery, called FreeEed. It works on your computer, on a Hadoop cluster, or on Amazon EC2 cloud.

The project is hosted on GitHub here. The discussion group for it is here.

The software is in a working state, but it is an early release, which follows the common open source approach of "commit early, commit often." At this time, I am looking for feedback on what the next incremental improvement steps can be.

The software has been tested in Ubuntu. It works in local mode. It will work on a private or Amazon EC2 cluster.

Wednesday, March 16, 2011

At SXSW by pure luck!


High tech and 2000 bands in one place... a very charged atmosphere, but relaxed at the same time.

Monday, March 14, 2011

New cloud company has appeal

A cloud computing company out of Iceland plans to be Amazon EC2-compatible, cheaper, and easier to use. Incidentally, they also plan to leave the other cloud computing contenders - Microsoft and Salesforce - behind. It is called Greenqloud.

Their marketing has a warm touch, but of course the developers will be the decisive force. They seem to know how to appeal to developers as well, with the "cool and easy" feeling.

Art: photo of Texas pendulous clouds.

Sunday, March 13, 2011

Pat Kerr, the CFO of SHMsoft

There is a lot that can be said about Pat: that he is tenacious, persistent, fearless, down-to-earth, or that he flies in the sky. But I think the most important thing is that he is so sure of himself that he is not afraid when people make fun of him.

As a case in point, he could not find any serious photo of himself. Here is his latest attempt.

Friday, March 11, 2011

New open source software for eDiscovery

I have created and published open source software for eDiscovery, called FreeEed. The project is hosted on GitHub here. The discussion group for it is here.

The software is in a working state, but it is an early release, which follows the common open source approach of "commit early, commit often." At this time, I am looking for feedback on what the next incremental improvement steps can be.

The software has been tested in Ubuntu, but it may work in Windows. It works in local mode or on a cluster and is scalable: the same code will work on a cluster without any change.

Tuesday, March 8, 2011

Open source eDiscovery restarted

A couple of weeks ago I restarted FreeEed, an open source eDiscovery project. The initial announcement can be found here, but that was two years ago. Now I have changed everything except the project name.

The project is now on GitHub, which is the new black for developers. It does complete processing and bears no resemblance to the old version, which relied on Google Desktop search.

Moreover, it is intended to run locally, on a private Hadoop cluster, or on the Amazon EC2 cloud. Thus, it can be as small or as big as you want.

Really, it is not at all hard to do this. Look, for example, at this company, talking about open source eDiscovery solutions, and you will see that the choice of components - Hadoop, Lucene, Mahout, Tika, Solr, Nutch - is pretty standard. With the road thus outlined, it is indeed strange that no one had done it so far. So there you go: I have done it now.

In the past two weeks I was already able to accomplish about half the work. It is just a matter of combining the right libraries in the right way. Truth be told, the 90-10 rule applies here too: the most fun and usefulness are in the first 90% of the functionality, which takes 10% of the time. The rest are the details.

One can follow the project by joining the Google Group here, or on GitHub, where developers can have free accounts for easy cooperation.

Sunday, March 6, 2011

Armies of Expensive Lawyers, Replaced by Cheaper Software?

The article in the New York Times claims just that: software replacing lawyers. But, like many newspaper articles, it mixes a few things up.

As anyone reading this probably knows, the eDiscovery market is about 4B, and it is not shrinking. Of the usual costs of eDiscovery, lawyers get 75% and processing costs constitute only 25%. As this blog frequently points out, lawyers are creatures of habit more than others, and this habit is good law, so it is not going away any time soon, even in the presence of statistics that are pro-computer.

Then where are the authors of the article correct? In their praise of the new tools for analytical eDiscovery, even if not in their assessment of the tools' impact. For internal use, these tools give lawyers definite advantages and help them find important leads. Whether a judge will accept the classification of a document as privileged when it is done by computer review is another matter: that will require a change in attitude, an improvement in technology, and perhaps a review of applicable laws.

However, with the driving force coming from customers who are not willing to pay high bills for what is essentially human search, where automation would be welcome, all sides may eventually be forced to accept this. Why do we trust Google on search results? Because it is not humanly possible to read all the relevant information ourselves.

Art: Pierre Auguste Renoir - The Thinker Aka Seated Young Woman