The open source software for eDiscovery, FreeEed, now works in Windows, in addition to Linux. It can be downloaded here,
http://freeeed.org/download. We are testing it on the Mac.
The software version is 2.9.5, and it contains multiple bug fixes and enhancements. The new release is still free and will always be free, even though it contains a closed source version of the PST extactor,
which is needed for it to work in Windows. No other additional software needs to be installed.
In a few weeks, when the software is officially released, it will contain a licensed version of this extractor, free for all users.
As always, your feedback is welcome.
Sunday, October 23, 2011
Thursday, October 6, 2011
Loading inner maps in Hive
Sometimes you would want to load maps which contain maps into Hive. I mean, this structure
map <string,map<string,string>>
Hive allows you this. In fact, it allows even deeper levels of mapping. However, the question is, how do you tell it where your inner maps end, since this is not one of the parameters in the LOAD DATA INPATH statement. Well, there is an undocumented default, and that is, '\004' and '\005' for inner maps.
Here is how your data has to be formatted (using an image, to show the non-ascii separators)
and this is how you define your table
CREATE EXTERNAL TABLE map_table
(
complex_map map <string, map<string,string>>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\003'
MAP KEYS TERMINATED BY '\002'
STORED AS TextFile;
and load it as follows
LOAD DATA INPATH "/mydata" OVERWRITE INTO TABLE map_table;
and you can have the following queries:
hive> select * from map_table;
Result:
{"key1":{"innerkey11":"innervalue11","innerkey21":"innervalue21"},"key2":{"innerkey12":"innervalue12","innerkey22":"innervalue22"}}
also
hive> select complex_map["key1"] from map_table;
Result:
{"innerkey11":"innervalue11","innerkey21":"innervalue21"}
and even
hive> select complex_map["key1"]["innerkey11"] from map_table;
Result:
innervalue11
Now, it is true, this uses the default coding for inner-level maps for Hive load, and that's not documented, but it IS the coding which is unlikely to change.
Credits: this post talks about it, and it was brought to my attention by Steven Wong of Netflix.
map <string,map<string,string>>
Hive allows you this. In fact, it allows even deeper levels of mapping. However, the question is, how do you tell it where your inner maps end, since this is not one of the parameters in the LOAD DATA INPATH statement. Well, there is an undocumented default, and that is, '\004' and '\005' for inner maps.
Here is how your data has to be formatted (using an image, to show the non-ascii separators)
and this is how you define your table
CREATE EXTERNAL TABLE map_table
(
complex_map map <string, map<string,string>>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\003'
MAP KEYS TERMINATED BY '\002'
STORED AS TextFile;
and load it as follows
LOAD DATA INPATH "/mydata" OVERWRITE INTO TABLE map_table;
and you can have the following queries:
hive> select * from map_table;
Result:
{"key1":{"innerkey11":"innervalue11","innerkey21":"innervalue21"},"key2":{"innerkey12":"innervalue12","innerkey22":"innervalue22"}}
also
hive> select complex_map["key1"] from map_table;
Result:
{"innerkey11":"innervalue11","innerkey21":"innervalue21"}
and even
hive> select complex_map["key1"]["innerkey11"] from map_table;
Result:
innervalue11
Now, it is true, this uses the default coding for inner-level maps for Hive load, and that's not documented, but it IS the coding which is unlikely to change.
Credits: this post talks about it, and it was brought to my attention by Steven Wong of Netflix.
Sunday, October 2, 2011
Email archiving and retention with Hadoop
Cloudera explores a specific use case for Apache Hadoop, one that is not commonly recognized, but is gaining interest behind the scenes. It has to do with converting, storing, and searching email messages using the Hadoop platform for archival purposes.
To read further...
Implications: you can set up a system for legal hold, email archiving and retention, and subsequent eDiscovery, all on open source scalable platform. And, it will likely work better than home-grown proprietary platforms, because Hadoop is designed for scalability and has been proven at world's largest companies.
Art: Winslow Homer : Woman and Elephant
To read further...
Implications: you can set up a system for legal hold, email archiving and retention, and subsequent eDiscovery, all on open source scalable platform. And, it will likely work better than home-grown proprietary platforms, because Hadoop is designed for scalability and has been proven at world's largest companies.
Art: Winslow Homer : Woman and Elephant
Subscribe to:
Posts (Atom)