Thursday, October 6, 2011

Loading inner maps in Hive

Sometimes you would want to load maps which contain maps into Hive. I mean, this structure

map <string,map<string,string>>

Hive allows you this. In fact, it allows even deeper levels of mapping. However, the question is, how do you tell it where your inner maps end, since this is not one of the parameters in the LOAD DATA INPATH statement. Well, there is an undocumented default, and that is, '\004' and '\005' for inner maps.

Here is how your data has to be formatted (using an image, to show the non-ascii separators)

and this is how you define your table

CREATE EXTERNAL TABLE map_table
(
complex_map map <string, map<string,string>>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\003'
MAP KEYS TERMINATED BY '\002'
STORED AS TextFile;


and load it as follows

LOAD DATA INPATH "/mydata" OVERWRITE INTO TABLE map_table;

and you can have the following queries:

hive> select * from map_table;
Result:
{"key1":{"innerkey11":"innervalue11","innerkey21":"innervalue21"},"key2":{"innerkey12":"innervalue12","innerkey22":"innervalue22"}}

also

hive> select complex_map["key1"] from map_table;
Result:
{"innerkey11":"innervalue11","innerkey21":"innervalue21"}

and even

hive> select complex_map["key1"]["innerkey11"] from map_table;
Result:
innervalue11

Now, it is true, this uses the default coding for inner-level maps for Hive load, and that's not documented, but it IS the coding which is unlikely to change.

Credits: this post talks about it, and it was brought to my attention by Steven Wong of Netflix.

No comments: