Sunday, February 2, 2014

Big Data, Hadoop, and NoSQL Testing

By Mark Kerzner and Sujee Maniyam, Elephant Scale LLC


In this paper we discuss best practices and real world testing strategies for Big Data, Hadoop, and NoSQL. The subjects of testing and software correctness take an even more important role in the world of Big Data, and that is why taking them into account throughout the project lifetime, from design to implementation and to maintenance is paramount. We discuss the maven project organization, the test modules, the use of the mock frameworks, and the TestSuite design pattern. All these serve to factor out extensive copy/pasting into the framework, and in this way to make the projects less error-prone and to improve code quality.

Table of contents
  • Introduction
  • Project organization for test-ability
  • JUnit single unit tests
  • Test modules
  • A word on Scala, Scalding and Kiji
  • System integration testing
  • Conclusion: lessons and further direction

Software testing is one of the most important yet often neglected parts of the software development. For this reason, developers have created a list of 20 most popular responses to give when their software fails the tests. Here they are:

20. "That's weird..."
19. "It's never done that before."
18. "It worked yesterday."

17. "How is that possible?"
16. "It must be a hardware problem."
15. "What did you type in wrong to get it to crash?"
14. "There is something funky in your data."
13. "I haven't touched that module in weeks!"
12. "You must have the wrong version."
11. "It's just some unlucky coincidence."
10. "I can't test everything!"
9. "THIS can't be the source of THAT."
8. "It works, but it hasn't been tested."
7. "Somebody must have changed my code."
6. "Did you check for a virus on your system?"
5. "Even though it doesn't work, how does it feel?
4. "You can't use that version on your system."
3. "Why do you want to do it that way?"
2. "Where were you when the program blew up?"
1. "It works on my machine"

(thanks to the anonymous author whose post still floats on the web and who proves that things basically don’t change in software development and testing).

More seriously, when I was preparing to do testing for one of the largest software companies in the world, the book most popular with them at this time was “Testing Complex and Embedded Systems” by Kim H. Pries and Jon M. Quigley. It is still a great book, and it starts by buttressing that software testing should be an integral part of the development cycle, not a last-week effort. It also teaches you how to look for the cases that the developers usually get wrong, such as the first and the last elements in the lists and loops, marginal conditions, and the importance and system and integrated testing.

With the advent of Apache Hadoop and Big Data, things have changed somewhat. You are not that much interested in the marginal cases as you are in the following issues:
  1. Statistical code correctness: you may not be able to always get 100% of the processing results right, but you need to guarantee them with a specific SLA.
  2. Automated testing: there are more and more tools that help you avoid doing testing manually.
  3. Performance tuning.
In this article, we will address the three key areas listed above.

Project organization for test-ability

Since most of the advanced software developments use maven and git, or tools similar to these in their power, we will use them in our examples. The lessons though are transferable to other development environment.

Here is the general rules of the project organization that reflects the best practices today:
  1. You complete project should have a parent-modules architecture in Maven. The two modules that will always be present are the “application” and “tests”, although you may have more than one applications, and very possibly more than one test.
  2. Your single-unit tests are based on JUnit and reside with the “application” module. Usually we try to make sure that the ‘mvn clean install’ on the ‘application’ project never more than a few seconds. We want developers to run it as often as can.
  3. The more complete tests reside in the separate ‘tests’ module. ‘tests’ should run for a few minutes or less.
  4. For bigger integration tests, you may have another module, such as ‘integration-tests’, and these may take up to half an hour or sometimes longer.
  5. Use a mocking framework to isolate your ‘tests’ module from other subsystems. Complete integration tests will go into ‘integration-tests’ module.
  6. Include the distribution-building code into your tests. That is, your distribution (for example, jars in the case of Java)  should have the application code, but not your test or your mock-out code. It should also contain everything needed to run, so make sure that you start the application and verify the results, as part of your tests.
We will explain every rule and give the real-world examples of its use below. Here is how your projects might look in an IDE

We are using our FreeEed project, which is an open source application for Big Data search, concentrating on legal (eDiscovery). We do what we preach, and try to use the best practices in our project. Besides, the projects is generic enough to illustrate a Hadoop search application.

JUnit single unit tests

The rule of thumb is that all of your single unit tests should complete within a few seconds. Then the developer working on this project will not hesitate to do a complete build and test while he or she is working on that project, and this will lead to early indications of any potential problem.

Get good test coverage: write a test for any function of significance. In test-driven development, you actually write tests before you create the implementations for your methods. In this way, the tests become the design document.

Even if you write your JUnit tests after you start writing the implementations, they are still very useful as design documents: they give any new developer an immediate feel for what is expected of the class he is looking at.

Below are a couple JUnit test which can serve as a good example:

public void testGetPlatform() {
    assertTrue(PlatformUtil.isNix() || PlatformUtil.isWindows());
public void testRunUnixCommand() {
    if (PlatformUtil.isNix()) {
       try {
           List out = PlatformUtil.runUnixCommand("ls");
       } catch (Exception e) {
           fail("No exceptions expected!");

While doing the single unit tests, you can already start using mocking. We will cover mocking in more detail later, but your tests might like like this:

Mapper.Context context = mock(Mapper.Context.class);
ArgumentCaptor arg1 = ArgumentCaptor.forClass(Text.class);
ArgumentCaptor arg2 = ArgumentCaptor.forClass(MapWritable.class);
doNothing().when(context).write(arg1.capture(), arg2.capture());
EmlFileProcessor emlProcessor = new EmlFileProcessor("test-data/02-loose-files/docs/eml/1.eml", context, null);
emlProcessor.process(false, null);
Text hashkey = arg1.getValue();
MapWritable map = arg2.getValue();
Map emlLine = TestUtil.flatten(map);
assertEquals("bob_smith", emlLine.get("Custodian"));

In this code we mock out the Mapper.Context object, so that we can verify the operation of the code without actually involving MapReduce. We want to do it, because our mapper writes the results out. So to verify them, we would have to read and decode this output file. That’s more work and more complications in the testing than we want, and therefore we just collect the results in the mock object, and then get it back from this object. Note how we use the ArgumentCaptor facility of Mockito framework for that:

ArgumentCaptor arg1 = ArgumentCaptor.forClass(Text.class);
ArgumentCaptor arg2 = ArgumentCaptor.forClass(MapWritable.class);

As the number of your tests grows, and if you are using continuous testing, like Jenkins, then your stats will look something like the picture below

It is acceptable to put small integration tests in the main module - as long as these do not make the complete tests run too long, as we mentioned above.

Test modules

More extensive tests should be put in a separate module(s). This approach has the following advantages:

  • Test code is not compiled into the production distribution (jars). There is less code to copy around, and less possibility of executing test code in production.
  • Long tests are executed separately, not as part of regular development environment. This leads to improved developer productivity.
Building a production jar

Another best practice is to build the complete production executable as part of the test module. Then this executable can be tested in a separate JVM. Test the return status and the result of the run command. You code would like something like below

final StringBuilder commandBuilder = new StringBuilder();
commandBuilder.append("java -cp ");
commandBuilder.append(yourAppPath).append("/your-jar.jar ");
for (final String argument : arguments) {
    commandBuilder.append(" ").append(argument);
return CommandRunner.waitFor(commandBuilder.toString());

In the test modules you may have to mock out larger subsystems, but normally you don’t have to use mocking - you are doing integration testing, and you have all the time you need for the tests to complete.

TestSuite design pattern

Tests, with the characteristic variety of conditions, are often repetitive. This means that it is something that a computer is good at, but that humans may mess up. A good cure for this is the use of TestSuite design pattern.

First you design the base class, which implements all of the methods needed to test a set of related cases with varying input parameters. You also define the “run()” method, which executes all of the test methods in the base class.

Then every set of conditions becomes an object instance of your TestSuite. All you have to do is instantiate it and run it - and it will execute all the tests that you want. If, for a particular test case, you need to override some of the test methods - so be it! You simply override this method in your instantiation of the TestSuite, and presto! - you got yourself a variation of the test suite for this particular case.

A word on Scala, Scalding and Kiji

Scala, Scalding and Kiji have this in common: by raising to the higher level of abstraction, they intend lessen the amount of code that the developer writes, and in this way eliminate a large percentage of errors.

Let’s give the basic definitions.

Scala is an object-functional programming and scripting language for general software applications, statically typed, designed to concisely express solutions in an elegant, type-safe and lightweight manner.

Cascading (we need this definition to go on) is a framework to simplify Hadoop development. Since Hadoop uses functional programming, it makes most sense to express it in terms of functional programming terminology, and this is what Cascading does for you - define the common MapReduce operations in a cleaner, more readable way.

Scalding is the marriage of Cascading and and Scala. Since both are striving for conciseness and simplicity, that is a natural union. Here is how the WordCount example of Hadoop looks in Scalding. Anybody who has written low-level boilerplate code in Java for MR will surely appreciate this code snippet:

import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    .flatMap('line -> 'word) { line : String => line.split("""\s+""") }
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )

Kiji is an open source framework developed by WibiData, in order to simplify HBase/Hadoop development. It allows you to create HBase schemas (recall that there is no high-level explicit mechanism for this in HBase, but rather your tables live in your imagination or in diagrams like MindMap.

In Kiji you have a language to define your tables. This imposes some limitations, but gives you back the writing of best-practices HBase applications with the correct , maintainable structure of the tables.

So, how do you test all that?

Remember that for Scala there are multiple test frameworks which make writing your JUnit test cases a breeze. With a little effort, you can make your IDE debug and even step through the code.

Most of these components are clients side. This means that a regular job is submitted, for example, by Kiji framework, to your Hadoop cluster. Thus, all of the advice before applies here also.

Do not forget to divide your projects into application and test(s) modules, for all the benefits described above.

System integration testing

In here you may be entering the world of pain, but with a little preparation it is bearable.

Here are the golden rules of Big Data system integration testing.
  1. Create two or more test modules, with increasing load and execution time.
  2. Spin your clusters of Hadoop or HBase nodes as part of the test, and then shut them down (the latter may have to be done by the operator, due to possible test breaking, which would leave clusters running).
  3. Generate a lot of test data.
  4. Do performance testing early.
  5. Install proper monitoring. Optimizing in the dark is a dangerous and less-than-perfect strategy.
  6. Using continuous integration (CI) and automated builds.
  7. We sincerely wish you good luck!
Conclusion: lessons and further direction

Having done a number of test projects, we came away with the following lessons.
  • Don’t treat tests as an afterthought.
  • Do make them part of the over design and planning.
  • Test every part.
  • Automate and script as much as possible.

Testing Complex and Embedded Systems by Kim H. Pries and Jon M. Quigley.
Guide to Reliable Distributed Systems: Building High-Assurance Applications and Cloud-Hosted Services by Kenneth Birman
Hadoop (TM) is a trademark of Apache Software Foundation

No comments: