Wednesday, June 15, 2011

Search in eDiscovery

After you process all the files you have collected, the next step is searching through the data and slicing and dicing it in every way possible. This may be done at different times by both the defendant and the plaintiff. Case analysts need to answer different "what if" questions and explore different scenarios. But with millions of files, SQL solutions become terribly slow. If you have seen that happen, please raise your hand.

So here is how I formulated this question on the HBase and Cassandra user groups, because I needed an answer for FreeEed, my open source eDiscovery tool:

Imagine I need to store, say, 10M-100M documents, each having, say, 100 fields (author, creation date, access date, etc.), and then I want to ask questions like "give me all documents whose author is like abc**, whose creation date is any time in 2010 and whose access date is in 2010-2011, and so on, perhaps 10-20 conditions, matching a list of some keywords."

What's best, Lucene, Katta, HBase CF with secondary indices, or plain scan and compare of every record?
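
To make the last option concrete, here is a minimal sketch of "plain scan and compare of every record" in Python. It is only an illustration, not FreeEed code: the field names (`author`, `created`, `accessed`, `text`), the sample documents, and the reading of "matching a list of some keywords" as "contains at least one keyword" are all my assumptions.

```python
from datetime import date

def matches(doc, author_prefix, created_year, accessed_years, keywords):
    """Return True if one document (a dict of fields) meets every condition."""
    author = doc.get("author", "")
    created = doc.get("created")
    accessed = doc.get("accessed")
    text = doc.get("text", "").lower()
    first_year, last_year = accessed_years
    return (
        author.startswith(author_prefix)
        and created is not None
        and created.year == created_year
        and accessed is not None
        and first_year <= accessed.year <= last_year
        # Assumption: a document qualifies if it contains ANY of the keywords.
        and any(k.lower() in text for k in keywords)
    )

def scan(docs, **conditions):
    """Plain scan: compare every record against the conditions, O(N) per query."""
    return [d["id"] for d in docs if matches(d, **conditions)]

# Hypothetical sample data, three documents instead of 10M-100M.
docs = [
    {"id": 1, "author": "abcde", "created": date(2010, 3, 1),
     "accessed": date(2011, 5, 2), "text": "Merger contract draft"},
    {"id": 2, "author": "abxyz", "created": date(2010, 7, 9),
     "accessed": date(2010, 8, 1), "text": "Contract notes"},
    {"id": 3, "author": "abcfg", "created": date(2009, 1, 5),
     "accessed": date(2010, 2, 2), "text": "Contract"},
]

hits = scan(docs, author_prefix="abc", created_year=2010,
            accessed_years=(2010, 2011), keywords=["contract", "invoice"])
print(hits)
```

Every query touches every document, so the cost grows linearly with the collection size. That is the baseline that the index-based options (Lucene, Katta, secondary indices) are meant to beat.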


Well, the very nice and knowledgeable people on these groups came up with a number of suggestions:

  • I can't claim it's the best, but I'd say Solr or Katta (Joey Echeverria)
  • 'Add search to HBase' - HBASE-3529 is in development. (Jason Rutherglen)
  • I, for one, am interested in learning more about elasticsearch with HBase after reading the article over at StumbleUpon (Matt Davies)
  • I think this is the key line: "ElasticSearch is the search analogue to HBase that frees us from some restrictions that Solr imposes".  It is quite true; however, if search is inside of HBase, one gets the same thing.  Solr does have serious limitations in terms of scaling etc.  I think ES has done a great job there, though this could have been done with Solr just as easily, e.g., upgrade Solr with the same functionality and remove the need for schemas.  Solr does allow schema-less use, e.g., with dynamic fields. (Jason Rutherglen)
  • I'd say give Lily a spin. Currently, we rely on Solr for search. In the next few months, we'll take a good look at "HBase-native" secondary indexes as well (Steven Noels)
  • There's also DrawnToScale (M. C. Srivas)
  • I store over 500M documents in HBase, and index using Solr with dynamic fields.  This gives you tremendous flexibility to do the type of queries you are looking for -- and to make them simple and intuitive via a faceted interface. However, there was quite a bit of software that we had to write to get things going, and I can neither release all of it as open source, nor support other people using it.  If I had to start again, I would seriously look at solutions like ElasticSearch and Lily (David Buttler)
  • Check out Solandra (Jake Luciani)
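
To give a feel for the Solr-based suggestions above, here is a rough sketch of how my question might be written as a Solr query string. The field names (`author_s`, `created_dt`, `accessed_dt`, `text_t`) are hypothetical; they follow the dynamic-field suffix convention from Solr's example schema, which a real installation may or may not use. The helper `solr_query` is mine, not part of any library, and it does no escaping of special characters.

```python
def solr_query(author_prefix, created_year, accessed_years, keywords):
    """Build a Lucene/Solr query string from the example conditions.

    Assumes hypothetical dynamic fields: *_s (string), *_dt (date), *_t (text).
    """
    first_year, last_year = accessed_years
    clauses = [
        # Prefix match on the author field.
        f"author_s:{author_prefix}*",
        # Inclusive date ranges use the [start TO end] syntax.
        f"created_dt:[{created_year}-01-01T00:00:00Z TO {created_year}-12-31T23:59:59Z]",
        f"accessed_dt:[{first_year}-01-01T00:00:00Z TO {last_year}-12-31T23:59:59Z]",
        # Any of the keywords in the text field.
        "text_t:(" + " OR ".join(keywords) + ")",
    ]
    return " AND ".join(clauses)

q = solr_query("abc", 2010, (2010, 2011), ["contract", "invoice"])
print(q)
```

The point is that 10-20 conditions collapse into one query string that the search engine answers from its index, instead of comparing every record.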
Thank you, all! Now I need to investigate all suggestions and select the one most fitting for eDiscovery.
