Mark Kerzner(mark@shmsoft.com), Greg Keller (greg@r-hpc.com), Ivan Lazarov (ivan.lazarov@shmsoft.com)
Abstract
A Hadoop cluster stores its most vital information in the RAM of the NameNode server. Although this architecture is essential for fast operation, it represents a single point of failure. To mitigate this, the NameNode’s memory is regularly flushed to hard drives; consequently, after a failure it can take many hours to restore the cluster to operation. Another limitation is imposed on Hadoop by the size of the RAM on the NameNode server: Hadoop can store only as much data (files and blocks) as there are descriptors that fit in the NameNode’s memory.
The new Hadoop architecture described in this paper removes the size limitation and greatly improves uptime by running the Hadoop NameNode on a persistent memory device, the Kove (www.kove.com) XPD. Its advantages are: no limit on the number of files, no limit on the size of the cluster storage, and faster restore times. The performance of the ruggedized Hadoop cluster is on par with the standard Hadoop configuration.
The Hadoop XPD driver software that achieves this is open source and freely available for download.
Background
The Hadoop NameNode keeps all of its block and file information in memory. This is done for efficiency, but it is naturally a single point of failure (SPOF). There are multiple approaches to mitigating this SPOF, ranging from NameNode HA in Hadoop 2 to a distributed NameNode.
In Hadoop 1, the complete memory snapshot is periodically written to local hard drives, usually two or three of them. The snapshots thus act as a cold spare, but the restoration process leads to many hours of downtime.
In Hadoop 2, there is a high availability (HA) mode. In this mode two servers act as NameNodes, one active and the other in standby. Although HA mode effects failover in only about 5 seconds, this comes at a price: another server has to run alongside the main NameNode, the architecture grows more complex, and, since the standby node acts as a write-only NameNode, the traffic that the DataNodes generate doubles.
In other architectures, such as MapR, the NameNode is distributed. MapR stands in a class by itself, being a closed-source C++ re-implementation of Hadoop. However, a distributed NameNode architecture has its own complexities and is not free from limitations.
Another limitation of Hadoop stems from the limited size of the NameNode RAM: Hadoop can only store as much data (files and blocks) as fits in that memory. The limitation on the number of files is addressed by federation, present in Hadoop 2. Under the federated architecture, each NameNode serves its own namespace while drawing on the same pool of DataNodes. The problem of billions of small files, however, is not addressed even in Hadoop 2.
We have created a simple solution that obviates the problems listed above by running the NameNode on persistent memory, provided by the Kove XPD device.
The advantages of this implementation are manifold.
1. Persistent memory is not erased in the event of a power failure. Therefore, when the power comes back on, or after maintenance has been performed on the server, the Hadoop cluster comes back instantly.
2. The size of the Kove XPD is virtually unlimited, and these devices can be scaled well beyond a terabyte. As a result, the NameNode can store much more information, lifting the limitation on the number of files stored in Hadoop and obviating the need for federation.
3. Excellent ROI. Although Kove devices contain costly memory modules, with the mass production that would result from their use in Hadoop the device price will go down significantly.
Implementation
Our implementation includes, in addition to the Kove XPD hardware, special software drivers that make the Hadoop cluster use the XPD in place of memory. This allows us to achieve the advantages listed above.
The implementation is the result of careful research. First, we formulated and defined the problems with the current Hadoop architecture; this summary is found in Appendix A.
Second, we looked at a few possible implementations; these are also described in Appendix A. Our final choice was to implement the driver using the EHCache library. EHCache is a popular, open-source, standards-based cache which is robust, proven, and full-featured. More details on the software side of the implementation are found in the same appendix.
The project is called nn_kove (for NameNode on Kove) and can be found on GitHub at https://github.com/markkerzner/nn_kove.
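To make the approach concrete, below is a minimal, hypothetical sketch of storing a piece of block metadata through EHCache. It is not the actual nn_kove code: the cache name blocksMap is an assumption, and the ehcache.xml file is presumed to point the cache's disk store at the XPD mount.

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

public class NameNodeCacheSketch {
    public static void main(String[] args) {
        // ehcache.xml is assumed to point the disk store at the XPD mount,
        // so entries that overflow "to disk" really land in persistent memory.
        CacheManager manager = CacheManager.create("ehcache.xml");
        Cache blocks = manager.getCache("blocksMap"); // hypothetical cache name

        // Store and retrieve a (block id -> metadata) entry.
        blocks.put(new Element(42L, "block metadata goes here"));
        Element e = blocks.get(42L);
        System.out.println(e.getObjectValue());

        manager.shutdown();
    }
}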
Testing
For testing we used NNBench and a combination of TeraGen/TeraSort. The results of the runs are shown in Appendix B.
Our main goals in these tests were to
- Verify the correctness of the Hadoop operation.
- Make sure that the execution time does not increase significantly under the new architecture. NameNode memory operations constitute a minor part of the overall Hadoop load, so an increase in their execution time by a factor of less than 2 was considered acceptable in view of the advantages.
- Observe the performance of Hadoop on Kove under load.
All of these goals have been achieved. The time taken by the NameNode operations did not exceed double the base time.
Although this NameNode performance is acceptable and influences the overall cluster performance only slightly, we are currently working on improvements. For our first implementation we treated the Kove XPD as a KDSA block device, since this is a stable and simple approach. By doing direct writes through a Java-to-C interface instead, we plan to improve the performance by a factor of two.
Conclusion
Running the Hadoop NameNode on a Kove XPD persistent memory device improves cluster uptime and removes the limitation on the number of files that is characteristic of the standard Hadoop architecture.
The software driver that achieves this has been created and is maintained by the product lab company SHMsoft, Inc., and its training/implementation partner Elephant Scale. The software is open-sourced under the Apache 2.0 license and can be freely downloaded here: https://github.com/markkerzner/nn_kove. For further information please write to info@shmsoft.com or info@elephantscale.com.
Appendix A. Outline of research
The NameNode stores the following data in memory (simplified view); these are the major actors:
- machine -> blockList (DataNodeMap, DatanodeDescriptor, BlockInfo)
- block -> machineList (BlocksMap, BlockInfo)
These structures are also referenced within the FSImage structures (INode, FSDirectory) and in some additional structures such as CorruptReplicasMap, recentInvalidateSets, PendingBlockInfo, ExcessReplicateMap, PendingReplicationBlocks, and UnderReplicatedBlocks.
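As a rough illustration (not the actual Hadoop classes, whose descriptors carry much more state), the two mappings above can be pictured as a pair of Java maps:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for the real Hadoop descriptor classes.
class DatanodeDescriptor { String hostName; }
class BlockInfo { long blockId; long numBytes; }

class NameNodeMetadataSketch {
    // machine -> blockList (cf. DataNodeMap / DatanodeDescriptor / BlockInfo)
    Map<DatanodeDescriptor, List<BlockInfo>> blocksPerDatanode =
            new HashMap<DatanodeDescriptor, List<BlockInfo>>();
    // block -> machineList (cf. BlocksMap / BlockInfo)
    Map<Long, List<DatanodeDescriptor>> datanodesPerBlock =
            new HashMap<Long, List<DatanodeDescriptor>>();
}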
Class diagram displaying some of the affected classes.
Considerations for storing NameNode data on the Kove XPD instead of in memory
Data exchange with the Kove device requires the use of a special buffer registered with its API. Registration takes time (about 100 microseconds), as does copying data to and from the buffers (each read or write takes about 1 microsecond). For example, a batch of 1,000 reads through a single registered buffer costs roughly 100 µs + 1,000 × 1 µs ≈ 1.1 ms.
The two ways to use XPD are
A. Use RMI calls into XPD memory with C++/Java wrappers.
B. Mount the XPD as a hard drive and use EHCache to map memory operations to this “hard drive”, which is really RAM.
We have chosen approach B as the simpler and more stable one. To this end, we created our own storage class, wrapping EHCache, to use instead of the LightWeightGSet. It is then plugged into the BlocksMap.
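A minimal sketch of this idea, assuming a simple key-value view of the block store, is shown below. The class name EhcacheBlockStore is hypothetical; the real nn_kove storage class replaces the LightWeightGSet and therefore follows the corresponding GSet-style contract expected by the BlocksMap.

import java.io.Serializable;
import net.sf.ehcache.Cache;
import net.sf.ehcache.Element;

public class EhcacheBlockStore {
    private final Cache cache; // backed by a disk store on the XPD mount

    public EhcacheBlockStore(Cache cache) {
        this.cache = cache;
    }

    public void put(long blockId, Serializable blockInfo) {
        cache.put(new Element(blockId, blockInfo));
    }

    public Serializable get(long blockId) {
        Element e = cache.get(blockId);
        return e == null ? null : (Serializable) e.getObjectValue();
    }

    public boolean remove(long blockId) {
        return cache.remove(blockId);
    }

    public int size() {
        return cache.getSize();
    }
}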
EHCache is configured through the hadoop-env.xml file, which sets the location of the ehcache.xml configuration file. It is also important to set overflowToDisk="true".
Some of the Hadoop classes involved were not Serializable and had to be refactored to make them so.
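For illustration, here is a hedged programmatic equivalent of that configuration (the project itself configures EHCache through ehcache.xml); the disk-store path under /xpdmount, the cache name, and the sizing values are assumptions.

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.config.CacheConfiguration;
import net.sf.ehcache.config.Configuration;
import net.sf.ehcache.config.DiskStoreConfiguration;

public class XpdCacheSetup {
    public static Cache createBlockCache() {
        Configuration conf = new Configuration();
        // Place the EHCache disk store on the XPD block-device mount, so that
        // "disk" overflow actually lands in persistent Kove memory.
        conf.addDiskStore(new DiskStoreConfiguration().path("/xpdmount/ehcache"));

        CacheManager manager = CacheManager.create(conf);

        // Keys and values stored on the disk tier must be Serializable,
        // which is why some Hadoop classes had to be refactored.
        CacheConfiguration cacheConf = new CacheConfiguration("blocksMap", 100000);
        cacheConf.setOverflowToDisk(true); // the setting highlighted above
        cacheConf.setEternal(true);        // block records must never expire

        Cache cache = new Cache(cacheConf);
        manager.addCache(cache);
        return cache;
    }
}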
Appendix B. Performance testing
We have tested our XPD Hadoop driver on a cluster of three Dell 720 servers, provided for this purpose by Dell and hosted at R-HPC.
We ran the standard tests for measuring the performance of the Hadoop NameNode, because that is the part of the Hadoop cluster that we changed. Below are the results of these runs, for two configurations (EHCache on Kove, and regular Hadoop on disk), with two tests each.
The results show that the modification does not noticeably affect total throughput while permitting the desired enhancements: the NNBench average Create/Write/Close time was 213 ms on Kove versus 111 ms on the regular configuration (a factor of about 1.9), and the TeraSort job completed in 1 hr 43 min versus 1 hr 39 min.
Test results
There are four groups of test results given below.
============================= EHCACHE + KOVE ================================
---- terasort ----
hadoop jar hadoop-examples-1.1.2.jar teragen 1000000000 /user/hduser/terasort-input
hadoop jar hadoop-examples-1.1.2.jar terasort /user/hduser/terasort-input /user/hduser/terasort-output
13/11/15 07:17:20 INFO mapred.JobClient: Job complete: job_201311150348_0002
13/11/15 07:17:20 INFO mapred.JobClient: Counters: 30
13/11/15 07:17:20 INFO mapred.JobClient: Job Counters
13/11/15 07:17:20 INFO mapred.JobClient: Launched reduce tasks=1
13/11/15 07:17:20 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=8738068
13/11/15 07:17:20 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/11/15 07:17:20 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/11/15 07:17:20 INFO mapred.JobClient: Launched map tasks=1490
13/11/15 07:17:20 INFO mapred.JobClient: Data-local map tasks=1490
13/11/15 07:17:20 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=5972097
13/11/15 07:17:20 INFO mapred.JobClient: File Input Format Counters
13/11/15 07:17:20 INFO mapred.JobClient: Bytes Read=100006094848
13/11/15 07:17:20 INFO mapred.JobClient: File Output Format Counters
13/11/15 07:17:20 INFO mapred.JobClient: Bytes Written=100000000000
13/11/15 07:17:20 INFO mapred.JobClient: FileSystemCounters
13/11/15 07:17:20 INFO mapred.JobClient: FILE_BYTES_READ=520831386294
13/11/15 07:17:20 INFO mapred.JobClient: HDFS_BYTES_READ=100006263218
13/11/15 07:17:20 INFO mapred.JobClient: FILE_BYTES_WRITTEN=622912987615
13/11/15 07:17:20 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=100000000000
13/11/15 07:17:20 INFO mapred.JobClient: Map-Reduce Framework
13/11/15 07:17:20 INFO mapred.JobClient: Map output materialized bytes=102000008940
13/11/15 07:17:20 INFO mapred.JobClient: Map input records=1000000000
13/11/15 07:17:20 INFO mapred.JobClient: Reduce shuffle bytes=102000008940
13/11/15 07:17:20 INFO mapred.JobClient: Spilled Records=6106187817
13/11/15 07:17:20 INFO mapred.JobClient: Map output bytes=100000000000
13/11/15 07:17:20 INFO mapred.JobClient: Total committed heap usage (bytes)=299606343680
13/11/15 07:17:20 INFO mapred.JobClient: CPU time spent (ms)=16102510
13/11/15 07:17:20 INFO mapred.JobClient: Map input bytes=100000000000
13/11/15 07:17:20 INFO mapred.JobClient: SPLIT_RAW_BYTES=168370
13/11/15 07:17:20 INFO mapred.JobClient: Combine input records=0
13/11/15 07:17:20 INFO mapred.JobClient: Reduce input records=1000000000
13/11/15 07:17:20 INFO mapred.JobClient: Reduce input groups=1000000000
13/11/15 07:17:20 INFO mapred.JobClient: Combine output records=0
13/11/15 07:17:20 INFO mapred.JobClient: Physical memory (bytes) snapshot=340621791232
13/11/15 07:17:20 INFO mapred.JobClient: Reduce output records=1000000000
13/11/15 07:17:20 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1200548696064
13/11/15 07:17:20 INFO mapred.JobClient: Map output records=1000000000
13/11/15 07:17:20 INFO terasort.TeraSort: done
Hadoop job: 0002_1384515239881_hduser
=====================================
Job tracker host name: job
job tracker start time: Tue May 18 18:45:50 CDT 1976
User: hduser
JobName: TeraSort
JobConf: hdfs://localhost:54310/xpdmount/app/hadoop/tmp/mapred/staging/hduser/.staging/job_201311150348_0002/job.xml
Submitted At: 15-Nov-2013 05:33:59
Launched At: 15-Nov-2013 05:34:00 (0sec)
Finished At: 15-Nov-2013 07:17:20 (1hrs, 43mins, 20sec)
Status: SUCCESS
Task Summary
============================
Kind Total Successful Failed Killed StartTime FinishTime
Setup 1 1 0 0 15-Nov-2013 05:34:00 15-Nov-2013 05:34:02 (1sec)
Map 1490 1490 0 0 15-Nov-2013 05:34:02 15-Nov-2013 06:48:56 (1hrs, 14mins, 54sec)
Reduce 1 1 0 0 15-Nov-2013 05:37:46 15-Nov-2013 07:17:18 (1hrs, 39mins, 32sec)
Cleanup 1 1 0 0 15-Nov-2013 07:17:18 15-Nov-2013 07:17:20 (1sec)
============================
---- nn bench ----
hadoop jar hadoop-test-1.1.2.jar nnbench -operation create_write -maps 2 -reduces 1 -blockSize 1 -bytesToWrite 20 -bytesPerChecksum 1 -numberOfFiles 500 -replicationFactorPerFile 1
13/11/16 11:39:38 INFO hdfs.NNBench: -------------- NNBench -------------- :
13/11/16 11:39:38 INFO hdfs.NNBench: Version: NameNode Benchmark 0.4
13/11/16 11:39:38 INFO hdfs.NNBench: Date & time: 2013-11-16 11:39:38,66
13/11/16 11:39:38 INFO hdfs.NNBench:
13/11/16 11:39:38 INFO hdfs.NNBench: Test Operation: create_write
13/11/16 11:39:38 INFO hdfs.NNBench: Start time: 2013-11-16 11:36:52,938
13/11/16 11:39:38 INFO hdfs.NNBench: Maps to run: 2
13/11/16 11:39:38 INFO hdfs.NNBench: Reduces to run: 1
13/11/16 11:39:38 INFO hdfs.NNBench: Block Size (bytes): 1
13/11/16 11:39:38 INFO hdfs.NNBench: Bytes to write: 20
13/11/16 11:39:38 INFO hdfs.NNBench: Bytes per checksum: 1
13/11/16 11:39:38 INFO hdfs.NNBench: Number of files: 500
13/11/16 11:39:38 INFO hdfs.NNBench: Replication factor: 1
13/11/16 11:39:38 INFO hdfs.NNBench: Successful file operations: 1000
13/11/16 11:39:38 INFO hdfs.NNBench:
13/11/16 11:39:38 INFO hdfs.NNBench: # maps that missed the barrier: 0
13/11/16 11:39:38 INFO hdfs.NNBench: # exceptions: 0
13/11/16 11:39:38 INFO hdfs.NNBench:
13/11/16 11:39:38 INFO hdfs.NNBench: TPS: Create/Write/Close: 12
13/11/16 11:39:38 INFO hdfs.NNBench: Avg exec time (ms): Create/Write/Close: 213.425
13/11/16 11:39:38 INFO hdfs.NNBench: Avg Lat (ms): Create/Write: 71.185
13/11/16 11:39:38 INFO hdfs.NNBench: Avg Lat (ms): Close: 142.076
13/11/16 11:39:38 INFO hdfs.NNBench:
13/11/16 11:39:38 INFO hdfs.NNBench: RAW DATA: AL Total #1: 71185
13/11/16 11:39:38 INFO hdfs.NNBench: RAW DATA: AL Total #2: 142076
13/11/16 11:39:38 INFO hdfs.NNBench: RAW DATA: TPS Total (ms): 213425
13/11/16 11:39:38 INFO hdfs.NNBench: RAW DATA: Longest Map Time (ms): 155134.0
13/11/16 11:39:38 INFO hdfs.NNBench: RAW DATA: Late maps: 0
13/11/16 11:39:38 INFO hdfs.NNBench: RAW DATA: # of exceptions: 0
13/11/16 11:39:38 INFO hdfs.NNBench:
-------------------------------------------------------------------------------------------------------------------
========================== REGULAR HADOOP + DISK ==========================
---- terasort ----
hadoop jar hadoop-examples-1.1.2.jar teragen 1000000000 /user/hduser/terasort-input
hadoop jar hadoop-examples-1.1.2.jar terasort /user/hduser/terasort-input /user/hduser/terasort-output
13/11/16 14:07:55 INFO mapred.JobClient: Job complete: job_201311161214_0002
13/11/16 14:07:55 INFO mapred.JobClient: Counters: 30
13/11/16 14:07:55 INFO mapred.JobClient: Job Counters
13/11/16 14:07:55 INFO mapred.JobClient: Launched reduce tasks=1
13/11/16 14:07:55 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=8120417
13/11/16 14:07:55 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/11/16 14:07:55 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/11/16 14:07:55 INFO mapred.JobClient: Launched map tasks=1490
13/11/16 14:07:55 INFO mapred.JobClient: Data-local map tasks=1490
13/11/16 14:07:55 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=5721277
13/11/16 14:07:55 INFO mapred.JobClient: File Input Format Counters
13/11/16 14:07:55 INFO mapred.JobClient: Bytes Read=100006094848
13/11/16 14:07:55 INFO mapred.JobClient: File Output Format Counters
13/11/16 14:07:55 INFO mapred.JobClient: Bytes Written=100000000000
13/11/16 14:07:55 INFO mapred.JobClient: FileSystemCounters
13/11/16 14:07:55 INFO mapred.JobClient: FILE_BYTES_READ=557178887904
13/11/16 14:07:55 INFO mapred.JobClient: HDFS_BYTES_READ=100006263218
13/11/16 14:07:55 INFO mapred.JobClient: FILE_BYTES_WRITTEN=659260080692
13/11/16 14:07:55 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=100000000000
13/11/16 14:07:55 INFO mapred.JobClient: Map-Reduce Framework
13/11/16 14:07:55 INFO mapred.JobClient: Map output materialized bytes=102000008940
13/11/16 14:07:55 INFO mapred.JobClient: Map input records=1000000000
13/11/16 14:07:55 INFO mapred.JobClient: Reduce shuffle bytes=102000008940
13/11/16 14:07:55 INFO mapred.JobClient: Spilled Records=6462535872
13/11/16 14:07:55 INFO mapred.JobClient: Map output bytes=100000000000
13/11/16 14:07:55 INFO mapred.JobClient: Total committed heap usage (bytes)=299611914240
13/11/16 14:07:55 INFO mapred.JobClient: CPU time spent (ms)=14961890
13/11/16 14:07:55 INFO mapred.JobClient: Map input bytes=100000000000
13/11/16 14:07:55 INFO mapred.JobClient: SPLIT_RAW_BYTES=168370
13/11/16 14:07:55 INFO mapred.JobClient: Combine input records=0
13/11/16 14:07:55 INFO mapred.JobClient: Reduce input records=1000000000
13/11/16 14:07:55 INFO mapred.JobClient: Reduce input groups=1000000000
13/11/16 14:07:55 INFO mapred.JobClient: Combine output records=0
13/11/16 14:07:55 INFO mapred.JobClient: Physical memory (bytes) snapshot=342367383552
13/11/16 14:07:55 INFO mapred.JobClient: Reduce output records=1000000000
13/11/16 14:07:55 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1199751053312
13/11/16 14:07:55 INFO mapred.JobClient: Map output records=1000000000
13/11/16 14:07:55 INFO terasort.TeraSort: done
Hadoop job: 0002_1384626535866_hduser
=====================================
Job tracker host name: job
job tracker start time: Tue May 18 18:46:01 CDT 1976
User: hduser
JobName: TeraSort
JobConf: hdfs://localhost:54310/dev/shm/app/hadoop/tmp/mapred/staging/hduser/.staging/job_201311161214_0002/job.xml
Submitted At: 16-Nov-2013 12:28:55
Launched At: 16-Nov-2013 12:28:56 (0sec)
Finished At: 16-Nov-2013 14:07:54 (1hrs, 38mins, 58sec)
Status: SUCCESS
Task Summary
============================
Kind Total Successful Failed Killed StartTime FinishTime
Setup 1 1 0 0 16-Nov-2013 12:28:56 16-Nov-2013 12:28:58 (1sec)
Map 1490 1490 0 0 16-Nov-2013 12:28:58 16-Nov-2013 13:38:32 (1hrs, 9mins, 34sec)
Reduce 1 1 0 0 16-Nov-2013 12:32:31 16-Nov-2013 14:07:52 (1hrs, 35mins, 21sec)
Cleanup 1 1 0 0 16-Nov-2013 14:07:52 16-Nov-2013 14:07:54 (1sec)
============================
---- nn bench ----
hadoop jar hadoop-test-1.1.2.jar nnbench -operation create_write -maps 2 -reduces 1 -blockSize 1 -bytesToWrite 20 -bytesPerChecksum 1 -numberOfFiles 500 -replicationFactorPerFile 1
13/11/16 14:46:27 INFO hdfs.NNBench: -------------- NNBench -------------- :
13/11/16 14:46:27 INFO hdfs.NNBench: Version: NameNode Benchmark 0.4
13/11/16 14:46:27 INFO hdfs.NNBench: Date & time: 2013-11-16 14:46:27,625
13/11/16 14:46:27 INFO hdfs.NNBench:
13/11/16 14:46:27 INFO hdfs.NNBench: Test Operation: create_write
13/11/16 14:46:27 INFO hdfs.NNBench: Start time: 2013-11-16 14:44:54,721
13/11/16 14:46:27 INFO hdfs.NNBench: Maps to run: 2
13/11/16 14:46:27 INFO hdfs.NNBench: Reduces to run: 1
13/11/16 14:46:27 INFO hdfs.NNBench: Block Size (bytes): 1
13/11/16 14:46:27 INFO hdfs.NNBench: Bytes to write: 20
13/11/16 14:46:27 INFO hdfs.NNBench: Bytes per checksum: 1
13/11/16 14:46:27 INFO hdfs.NNBench: Number of files: 500
13/11/16 14:46:27 INFO hdfs.NNBench: Replication factor: 1
13/11/16 14:46:27 INFO hdfs.NNBench: Successful file operations: 1000
13/11/16 14:46:27 INFO hdfs.NNBench:
13/11/16 14:46:27 INFO hdfs.NNBench: # maps that missed the barrier: 0
13/11/16 14:46:27 INFO hdfs.NNBench: # exceptions: 0
13/11/16 14:46:27 INFO hdfs.NNBench:
13/11/16 14:46:27 INFO hdfs.NNBench: TPS: Create/Write/Close: 23
13/11/16 14:46:27 INFO hdfs.NNBench: Avg exec time (ms): Create/Write/Close: 110.878
13/11/16 14:46:27 INFO hdfs.NNBench: Avg Lat (ms): Create/Write: 63.344
13/11/16 14:46:27 INFO hdfs.NNBench: Avg Lat (ms): Close: 47.403
13/11/16 14:46:27 INFO hdfs.NNBench:
13/11/16 14:46:27 INFO hdfs.NNBench: RAW DATA: AL Total #1: 63344
13/11/16 14:46:27 INFO hdfs.NNBench: RAW DATA: AL Total #2: 47403
13/11/16 14:46:27 INFO hdfs.NNBench: RAW DATA: TPS Total (ms): 110878
13/11/16 14:46:27 INFO hdfs.NNBench: RAW DATA: Longest Map Time (ms): 84662.0
13/11/16 14:46:27 INFO hdfs.NNBench: RAW DATA: Late maps: 0
13/11/16 14:46:27 INFO hdfs.NNBench: RAW DATA: # of exceptions: 0