SHMsoft blog: October 2011

Sunday, October 23, 2011

FreeEed for Windows Release Candidate 1 is available for download

FreeEed, the open source software for eDiscovery, now runs on Windows in addition to Linux. You can download it at http://freeeed.org/download. We are still testing it on the Mac.

The software version is 2.9.5, and it contains multiple bug fixes and enhancements. The new release is still free and will always be free, even though it contains a closed-source version of the PST extractor, which is needed for it to work on Windows. No additional software needs to be installed.

In a few weeks, when the software is officially released, it will contain a licensed version of this extractor, free for all users.

As always, your feedback is welcome.

Thursday, October 6, 2011

Loading inner maps in Hive

Sometimes you want to load maps that contain other maps into Hive, that is, a structure like this:

map <string,map<string,string>>

Hive allows this, and in fact it allows even deeper levels of nesting. The question, however, is how to tell Hive where your inner maps end, since neither the CREATE TABLE nor the LOAD DATA INPATH statement has a parameter for it. Well, there is an undocumented default: '\004' and '\005' are the separators for inner maps.

Here is how your data has to be formatted (shown as an image, because the separators are non-printable control characters)
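Because the separators do not display, generating a line with a small script may be easier than typing it. Below is a minimal sketch in Python, not part of the original setup. It assumes the delimiters declared in the table definition in this post ('\002' between an outer key and its value, '\003' between outer map entries), and it assumes that '\004' separates inner map entries while '\005' separates an inner key from its value. The function names and the sample keys are only for illustration.

```python
# Separators written as octal escapes, matching the Hive notation.
MAP_KEY_SEP = "\002"      # outer key / value (MAP KEYS TERMINATED BY)
ENTRY_SEP = "\003"        # outer map entries (COLLECTION ITEMS TERMINATED BY)
INNER_ENTRY_SEP = "\004"  # inner map entries (assumed Hive default)
INNER_KEY_SEP = "\005"    # inner key / value (assumed Hive default)


def encode_inner(inner):
    # Join "key<005>value" pairs of one inner map with <004>.
    return INNER_ENTRY_SEP.join(k + INNER_KEY_SEP + v for k, v in inner.items())


def encode_row(complex_map):
    # Join "key<002>inner map" entries of the outer map with <003>.
    return ENTRY_SEP.join(
        key + MAP_KEY_SEP + encode_inner(inner)
        for key, inner in complex_map.items()
    )


def write_rows(path, rows):
    # One table row per line; the file can then be copied to HDFS.
    with open(path, "w", newline="\n") as f:
        for r in rows:
            f.write(encode_row(r) + "\n")


row = encode_row({
    "key1": {"innerkey11": "innervalue11", "innerkey21": "innervalue21"},
    "key2": {"innerkey12": "innervalue12", "innerkey22": "innervalue22"},
})

# repr() makes the control characters visible as \x02, \x03, \x04, \x05.
print(repr(row))
```

Calling write_rows with a local file name and a list of such dictionaries would produce a file that can be copied to HDFS and loaded.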

and this is how you define your table

CREATE EXTERNAL TABLE map_table
(
complex_map map <string, map<string,string>>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\003'
MAP KEYS TERMINATED BY '\002'
STORED AS TextFile;

and load it as follows

LOAD DATA INPATH "/mydata" OVERWRITE INTO TABLE map_table;

and you can have the following queries:

hive> select * from map_table;
Result:
{"key1":{"innerkey11":"innervalue11","innerkey21":"innervalue21"},"key2":{"innerkey12":"innervalue12","innerkey22":"innervalue22"}}

also

hive> select complex_map["key1"] from map_table;
Result:
{"innerkey11":"innervalue11","innerkey21":"innervalue21"}

and even

hive> select complex_map["key1"]["innerkey11"] from map_table;
Result:
innervalue11
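To see why these queries return what they do, it can help to decode one line by hand. The following Python sketch is not Hive's actual parser. It only illustrates splitting on one delimiter per nesting level, under the same assumptions as the table definition in this post plus '\004' and '\005' for the inner map. The function names and the sample line are for illustration.

```python
def decode_inner(text):
    # Inner map: entries split on \004, key and value split on \005.
    return dict(pair.split("\005", 1) for pair in text.split("\004"))


def decode_row(line):
    # Outer map: entries split on \003, key and inner map split on \002.
    result = {}
    for entry in line.split("\003"):
        key, inner = entry.split("\002", 1)
        result[key] = decode_inner(inner)
    return result


# One line of the data file, with the separators written as octal escapes.
line = (
    "key1\002innerkey11\005innervalue11\004innerkey21\005innervalue21"
    "\003"
    "key2\002innerkey12\005innervalue12\004innerkey22\005innervalue22"
)

complex_map = decode_row(line)
print(complex_map["key1"])                # counterpart of complex_map["key1"]
print(complex_map["key1"]["innerkey11"])  # counterpart of the two-level lookup
```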

Now, it is true that this relies on Hive's default encoding for inner-level maps, which is not documented, but it is an encoding that is unlikely to change.

Credits: this post talks about it, and it was brought to my attention by Steven Wong of Netflix.

Sunday, October 2, 2011

Email archiving and retention with Hadoop

Cloudera explores a specific use case for Apache Hadoop, one that is not commonly recognized, but is gaining interest behind the scenes. It has to do with converting, storing, and searching email messages using the Hadoop platform for archival purposes.

To read further...

Implications: you can set up a system for legal hold, email archiving and retention, and subsequent eDiscovery, all on an open source, scalable platform. And it will likely work better than home-grown proprietary platforms, because Hadoop is designed for scalability and has been proven at some of the world's largest companies.

Art: Winslow Homer, Woman and Elephant
