Tuesday, February 14, 2012

Using FreeEed for Early Case Assessment

FreeEed, an open source tool for eDiscovery, runs on your Windows, Mac, or Linux workstation. What's more, since it is built on Hadoop, it can also run on a Hadoop cluster of tens or hundreds of machines, whether in your own computing center or on Amazon EC2. It produces a csv file with the metadata (the load file) and an archive containing the native documents, the extracted text, and any exceptions that happen during processing.

This is all very well, you may say, but what do I do with a load file? Enter Hive! Hive is an open source tool, and it runs in the same environments where FreeEed runs. You can load the csv metadata file, no matter how large it is, and then run SQL queries against your results. Any question about your data that you can formulate as a SQL query, Hive can answer.
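For example, a typical early case assessment question, "how many custodians are represented in this collection?", becomes a one-liner against the custodian column of the load file (defined below):

select count(distinct custodian) from load_file;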

How does Hive do that? It runs on the same Hadoop cluster and harnesses the same machines that you just used to run FreeEed. It runs locally too, in case you want to test it on a workstation. Strictly speaking, its language is called HiveQL, a SQL dialect with extensions that go beyond standard SQL, so really you can do much more.
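As one sketch of such an extension: if the load file carried an email_to column with semicolon-separated recipients (a hypothetical column, just for illustration), HiveQL's lateral view and explode could fan it out into one row per recipient, something plain SQL cannot express:

-- email_to is an assumed column; split and explode are built-in Hive functions
select recipient, count(*)
from load_file
lateral view explode(split(email_to, ';')) r as recipient
group by recipient;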

The caveat? Since Hive runs a MapReduce job to find the answer, even a simple query takes a minute or so to run; more data will require more time. On the other hand,

  • No amount of data is a problem, precisely because the cluster computes the answer;
  • Hive is open source and easily available;
  • The HiveQL language is powerful and is used by analysts to solve Big Data problems daily.
Let's get to the code (complete Hive scripts ship with FreeEed).

Create the Hive table:

create table load_file (Hash string, UPI string, File_Name string, Custodian string, .... )
row format delimited
fields terminated by '\t'
stored as textfile;

Load metadata into the Hive table we just created:

load data local inpath 'freeeed_output/output/part-r-00000'
overwrite into table load_file;

By the way, to run the script outside the Hive shell, you use this command:

hive -f scripts/hive_load_table.sql
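
For a quick one-off query, there is also the -e flag, which takes the query right on the command line:

hive -e 'select count(*) from load_file;'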

Let's see what we've done so far: we created the table and loaded the metadata into Hive. Now let's start the Hive shell, show all tables, then describe our 'load_file' table.
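
The shell session looks roughly like this (output trimmed, with the columns following the create statement above):

hive> show tables;
OK
load_file
hive> describe load_file;
OK
hash        string
upi         string
file_name   string
custodian   string
....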


Now let us run some simple queries.

How many rows do I have?

hive> select count (*) from load_file;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201202101358_0020, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201202101358_0020
Kill Command = /usr/lib/hadoop-0.20/bin/hadoop job  -Dmapred.job.tracker=localhost:8021 -kill job_201202101358_0020
2012-02-14 21:06:20,180 Stage-1 map = 0%,  reduce = 0%
2012-02-14 21:06:23,188 Stage-1 map = 100%,  reduce = 0%
2012-02-14 21:06:32,220 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201202101358_0020
OK
2323

For the rest of the queries, we will leave out the Hadoop messages.

Who is writing, and how many emails did each person write?

select email_from, count(email_from) from load_file group by email_from;

"Abrams  Clement " 1
"Adams  Suzanne " 1
"Anderson  Diane " 11
"Apollo  Beth " 1
"Aronowitz  Alan " 9
"Bailey  Derek " 1
....
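
To see the heaviest correspondents first, sort by the count; the column alias works just as in SQL:

select email_from, count(email_from) as cnt
from load_file
group by email_from
order by cnt desc;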

How else can you use this FreeEed/Hive functionality? Investigators and compliance auditors may find it useful, and companies can make it the engine for their enterprise search.
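As one sketch of what an investigator's query might look like: assuming the load file also carries a date_sent column in YYYY-MM-DD format (the exact columns depend on your FreeEed configuration, and the custodian 'Smith' is made up for the example), narrowing the collection to one custodian's documents from a given year is a single filter:

-- date_sent and its format are assumptions for this sketch
select file_name, email_from
from load_file
where custodian = 'Smith'
and date_sent between '2001-01-01' and '2001-12-31';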
