Introducing Big Data with Apache Hadoop

The revenue gained with Big Data solutions rose by 66% up to 73.5 billion euro world-wide and 59% up to 6.1 billion in Germany over the past year. One of the core technologies used is Hadoop which creates the base for a broad and rich eco-system containing distributed databases, data and graph processing libraries, query and workflow engines and much more. In one of our former blog posts, we have described how we use Hadoop for storing log messages. Since then, a lot has happened in the Hadoop universe and ecosystem. With the start of our new Big Data series, we want to cover those changes and show best practices in the Big Data world.

When we talk about “Big Data”, naturally the question arises what Big Data means in terms of projects which we have carried out already.

Big Data at mgm technology partners

We have been able to gather Big Data experience in several industries like automotive, market data analysis, e-commerce and governmental organizations. During those projects, we handled web graphs containing more than 100 billion edges with additional information summing totaling up to more than 150 terabytes. We have created systems which are able to ingest tens of thousands of data entries per second. In a current project we are helping one of our customers with the transition of a big monolithic landscape to an easily scalable, more flexible and cheaper open-source solution. This system will be able to handle thousands of different data input formats and enables our customer to shorten the necessary production times for reports from several days down to a few hours. Another important field where we support customers is the visualization of large data sets to improve insights and give structure to data. Furthermore, we have realized dozens of projects where we created systems for interactive querying and searching through terabyte-scale data.

Do you have similar use-cases? Do you want to find out how we can help you? Feel free to contact us by simply calling us and arrange a meeting. You can also get in touch with us by mailing to

Key Questions about Big Data

Now let us talk about Big Data in general because it is a widely used buzzword nowadays. A lot of our customers are curious about Big Data and in the following we will answer some frequently asked questions.

What is Big Data?

Big Data refers to any collection of data sets which are too large to be stored and processed entirely on traditional data processing systems. Nowadays, Big Data technologies are capable of real-time batch and stream processing to calculate valuable information of the provided data. This data is not necessarily structured as in RDBMS. In many cases it is semi-structured or even unstructured. All of this leads to the defining characteristics or dimensions of big data (“3 V’s”):

  • Variety describes many different types of data and the degrees how well-structured they can be,
  • Velocity refers to the high-speed ingest and processing of the data with different strategies,
  • Volume represents a vast and ever-growing amount of data that can be processed.
1. The 3 V’s of Big Data.

Which Value can Big Data generate?

One of the main advantages of Big Data technologies is the ability to analyze all the data one has. At first, you start collecting data. You will execute queries and smaller algorithms on it to gain a better understanding of the possibilities you have. After you have “played” a bit with the data you will develop ideas how to turn your data into value. At that point, you already have historic data and you can analyze it.

A good example for turning data into value is the energy supply industry in North America. Smart meters and Big Data enable them to analyze peaks in their energy network. This leverages their ability to prevent blackouts and reduce the total energy network load which ultimately leads to less maintenance. Furthermore, they are able to better determine when to buy and when to sell energy to other companies.

One of our customers has used Big Data technologies to significantly enhance the flexibility of their reports for market data analysis. Furthermore, they have been able to reduce the time from gaining data to a finalized report from two weeks to merely one day thanks to the scalable architecture and algorithms.

When should I use Big Data technologies?

Whether a Big Data scenario exists or not isn’t only related to the size of your data but also to the things you do with it. The following listing contains some more general examples of Big Data scenarios:

  • Terabytes of data are stored
  • Rapid data growth – keeping all available data
  • Data is streaming in continuously
  • Billions of single data points, e.g. time series data
  • Heterogeneous and / or changing or unknown data formats

Why should I use Hadoop?

Hadoop enables you to create Big Data applications with low license fees. Furthermore, your Big Data applications are able to run on commodity server hardware while providing amazing scalability. If you use Hadoop, you will use software which is developed at the Apache Software Foundation in cooperation with some of the biggest players in Big Data like Yahoo, Microsoft, Facebook, LinkedIn, Twitter and many more. Moreover, a broad and rich eco-system around Hadoop exists and still is continuing to grow. A lot of those projects are open-source and under the Apache License 2 which provides the right of software customization.

Can I migrate my data to Hadoop?

Yes, you can! Hadoop offers a big variety of data stores based on the distributed file system HDFS. Plain HDFS files or distributed databases like Apache HBase, Apache Accumulo, Apache Cassandra and many more will be the right place for your data and business case. Our team regularly evaluates the technology landscape and can help you find the most suitable distributed storage architecture and assist your migration process.

Apache Hadoop 1.x to 2.x

Let’s talk about Apache Hadoop. In a previous article we have explained Hadoop 1.x and its two main components the MapReduce framework and the distributed file system HDFS. In Hadoop 2.x those two are extended by the YARN framework. Hadoop’s three main components are now:

  • Hadoop Distributed Filesystem (HDFS): This is a file system which spans over all servers in the Hadoop cluster and provides a single namespace for storing files. It is optimized for reading and writing large files. The files are broken down into blocks with a default size of 64 MB. Each block is stored on multiple servers to increase the fault-tolerance and read performance of the file system. This is called replication and it is advisable to never use a replication factor less than three. The Namenode knows the location of all data blocks. After getting this information, data access is done by connecting directly to the server where the relevant blocks are stored, which is called Datanode. As a result, network bandwidth is minimized and the IO load is distributed throughout the cluster. HDFS is inspired by the Google File System.
  • Yet Another Resource Negotiator (YARN): YARN provides a global ResourceManager (RM) for Hadoop which administers the ApplicationMaster (AM) of each application started in the cluster. An application could be a classical MapReduce job or a number of other tasks ordered in a DAG (directed acyclic graph). An AM negotiates the resources it is allowed to use with the RM. At every cluster node there is a NodeManager (NM) which forms the data-computation framework of the cluster and manages locally running Containers started by the corresponding AM. Like the MapReduce framework, other frameworks and services are implementing their own AM. The RM/AM architecture enables everyone to create their own distributed processing paradigm if necessary.
  • Map/Reduce: This is Hadoop’s processing framework for vast amounts of data. A MapReduce job usually splits the input data into smaller independent chunks. This data is processed by the map tasks. The output is afterwards sorted and combined into the overall output files. This is the reduce step. Afterwards, the processed data is usually stored in the HDFS or a distributed database. The MapReduce framework is fully integrated into YARN and uses the RM, the NMs, an MRAppMaster and multiple map and reduce containers per application. The MapReduce principle is an exceptionally strong pattern for batch processing of very big data sets in the cluster.
2. Overview for the interaction of the Apache Hadoop components.

Figure 2 shows what has happened after YARN was added to Hadoop and how it enhances Hadoop’s capabilities of data processing. The necessity of YARN emerged by the need for a broader cluster use such as real-time applications or tasks that didn’t fit into the MapReduce paradigm, for instance iterative graph algorithms. It is even possible to write your own distributed applications with YARN by creating your own AM.

3. Brief overview how the different components of Hadoop interact with each other.

As a result of this evolution a lot of companies developed additional frameworks and technologies. Those increase the capabilities of Hadoop greatly. Most of those applications like Apache Accumulo or Apache Spark have become projects of the Apache Software Foundation. Nevertheless, there are several others like the workflow engine Azkaban which was developed and open-sourced by LinkedIn to provide it to the general public.

Distributed Databases

When the HDFS reaches its limits, distributed NoSQL databases provide additional functionality. They complement distributed filesystems with possibilities for accessing data, changing and historicizing of values, caches, automatic dividing and joining files filled with data and comprehensive APIs for finding and storing data.

Well-known representatives of this category can be found among the Apache projects. Apache HBase is the most mature distributed database and uses the HDFS as distributed filesystem. Its maturity not only comes from its rich set of features and very active community but especially from its excellent maintainability and integration in all Hadoop distributions. In case of network problems, HBase values consistency over availability and is optimized for batch reads and high ingestion rates.

Another distributed database which values consistency over availability is Apache Accumulo. Like HBase, Accumulo is optimized for very high ingestion rates. However, the main advantages of Accumulo can be found in its support of entirely schema-free tables, access management on cellular basis and the possibility of easy tests for client code due to the provided mock implementation.

If a system is required which values availability in case of network partition higher than consistency or when a high single value read performance is necessary, Apache Cassandra is a good choice. This distributed database does not use HDFS like Accumulo and HBase but its own distributed filesystem CFS. Nevertheless, this filesystem can be used in parallel with the HDFS on the same cluster.

4. Parallel use of the HDFS and the Cassandra File System (CFS) on the same cluster.

Real-time Processing with Hadoop

Real-time frameworks like Apache Storm and Apache Spark have gained popularity in the last years. Those applications break with the batch-oriented MapReduce paradigm. Instead, they use streaming, micro-batching and in-memory techniques to provide access to real-time analytics.

Apache Storm was developed by Twitter for stream processing which helps them with real-time analytics of their tweets. It is fully integrated in their infrastructure where other Big Data technologies reside. A lot of other companies like Yahoo, Spotify or Baidu are using Storm for event processing as well.

Another potent real-time framework is Apache Spark which was originally developed in the AMPLab at UC Berkeley. Apache Spark introduced in-memory data structures which are used cluster-wide. One of the main concerns of Apache Spark is unification. It provides access to included SQL, stream processing, graph processing and machine learning frameworks. Especially the capability of SQL queries on Hadoop is an eye catcher. Thanks to its capabilities, a lot of different companies like Taobao, Amazon and Groupon are using Apache Spark.

Even though Apache Spark is receiving a lot of attention in the Hadoop eco-system these days, one definitely has to take a closer look at Apache Flink. The creation of Apache Flink is the result of the DFG (Deutsche Forschungsgemeinschaft) project Stratosphere which started in 2009 and was initially developed at one of the German BigData competence centres at the Technical University of Berlin. Flink supports a wide field of application and is usable for dozens of BigData scenarios. Its extensions support SQL queries, graph processing, machine learning and stream processing.

It is a fact that a lot of companies have real-time requirements, but there are a lot that are still using MapReduce and its batch capabilities. After YARN’s invention, it was possible to enhance the performance of MapReduce job graphs with the use of Apache Tez. Apache Tez, mainly developed by Hortonworks, one of the biggest Hadoop vendors, enables developers to create cascading DAG-like job flows. It is possible to concatenate several Map and Reduce steps without self-managed intermediate output. The use of memory makes this approach handier and a lot faster than the standard MapReduce paradigm.

Query Engines

While traditional relational database systems can rely on SQL as rich query language for automated evaluation or manual data analysis this is not the case for distributed databases. In general, they provide small and limited query engines by themselves. Especially for use-cases like Business Intelligence or interactive data exploration this is not enough. That’s the reason why several SQL-on-Hadoop solutions exist.

One of the best known SQL-on-Hadoop projects is Apache Hive. Hive provides access to distributed data with its SQL-like query language HiveQL. As first step, the data which should be queried by Hive has to be transferred into a Hive table. The data is then accessible and the queries written with HiveQL to query the data are translated into a series of MapReduce jobs which are executed on the given Hive-table. This makes Hive well-suited for queries involving very large data sets but cumbersome for real-time querying. Often the performance of Hive was not good enough. This problem was addressed by Hortonworks starting in 2013 with the Stinger initiative which aimed for faster Hive queries and more core features of SQL in HiveQL. The story is continued by which wants to provide sub-second queries by the use of Apache Spark as computation framework.

Presto, another SQL-on-Hadoop solution has been open-sourced by Facebook. One of its core features is the perfect integration of Presto with Facebook’s distributed database Apache Cassandra. The design goal of Presto is to provide real-time access to the data, which was a result of Hive being too slow for Facebook’s purposes. Nevertheless, Hortonworks’ efforts to close the gap may be fruitful and only in-depth evaluations of the frameworks can determine the faster engine. We may address this topic in later blog posts.

Recently, another notable representative of SQL-on-Hadoop engines emerged: Apache Drill. The unique selling point of Drill is the possibility of Day-Zero-Analytics. This enables the user to infer the schema of the data while already accessing it. There is no need for transferring or importing data into a store or a special format. The only restriction is the necessity of a Drill-supported file format. Besides general file formats like TSV, JSON or Parquet, Drill also provides out-of-the-box support for HBase. If you want to create your own data source for Drill, you can use a plugin mechanism provided by Drill. The recent announcement of Drill as Apache top level project shows the impact Drill will bring to the SQL-on-Hadoop world.


Over the past years, the stability and reliability of Hadoop clusters has grown significantly. Due to the many contributors from different companies under the Apache Software Foundation, a wide variety of use cases has been addressed by Hadoop eco-system technologies and a lot of problems are identified and fixed very fast. These developments show that Hadoop 2.x compared to Hadoop 1.x lifts Big Data to the next level and makes the Hadoop eco-system a solid and production-ready technology stack for business.
Even though we have described a couple of core technologies and use cases, there are many more in the big Hadoop ocean out there. The ecosystem also contains

As one can see, Hadoop has a rich eco-system of libraries and frameworks that supports different use cases for cluster computation. We pointed out that Hadoop 2.x leverages Big Data business and the capability to turn data into value.

In future posts, we will explore advantages and disadvantages of that rich technology stack and discuss scenarios where certain products should be used over others. Furthermore, we will give insight which topics could be solved with Apache Hadoop and what we have learned while doing so.