Hadoop Tag

The Hadoop Ecosystem: HDFS, YARN, Hive, Pig, HBase and growing…

Hadoop is the leading open-source software framework for scalable, reliable, distributed computing. With the world producing data in the zettabyte range, there is a growing need for cheap, scalable, reliable and fast computing to process and make sense of it all. The ideas underlying Hadoop originated at Google, which found no software on the market that fit its needs and published its designs for a distributed file system and for MapReduce; Hadoop is the open-source implementation of those ideas. Indexing the web and analysing search patterns required deep, computationally intensive analytics to help Google improve its user behaviour algorithms. Hadoop is built for exactly this kind of work: it runs on a large number of machines that share the workload. Moreover, Hadoop replicates data across machines, so processing is not disrupted if one or several machines stop working. Hadoop has been extensively developed over the years, with new technologies and features added to the existing software to create the ecosystem we have today.


HDFS, or Hadoop Distributed File System, is the primary storage system used by Hadoop. It is the key tool for managing Big Data and supporting analytic applications in a scalable, cheap and rapid way. Hadoop usually runs on low-cost commodity machines, where server failures are fairly common. To cope with this high-failure environment, the file system distributes copies of the data across different servers in different server racks, making the data highly available. Moreover, when HDFS takes in data it breaks it down into smaller blocks that are assigned to different nodes in the cluster, which allows for parallel processing and increases the speed at which the data is handled.
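The block-and-replica idea can be sketched in a few lines of Python. This is a toy illustration only: the function names, the tiny block size and the round-robin placement are assumptions made for the example, and HDFS's real placement policy is more sophisticated.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the input into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes_by_rack: dict[str, list[str]],
                   replication: int = 3) -> dict[int, list[tuple[str, str]]]:
    """Assign each block to `replication` distinct nodes (toy policy, not HDFS's own)."""
    racks = sorted(nodes_by_rack)
    longest = max(len(nodes_by_rack[rack]) for rack in racks)
    # Interleave nodes rack by rack so that neighbours in the list sit on different racks.
    ordered = [
        (rack, nodes_by_rack[rack][i])
        for i in range(longest)
        for rack in racks
        if i < len(nodes_by_rack[rack])
    ]
    if replication > len(ordered):
        raise ValueError("not enough nodes for the requested replication")
    placement = {}
    for block_id in range(num_blocks):
        start = block_id % len(ordered)
        placement[block_id] = [
            ordered[(start + k) % len(ordered)] for k in range(replication)
        ]
    return placement


# Hypothetical two-rack cluster and a 1000-byte "file" split into 256-byte blocks.
cluster = {"rack1": ["node1", "node2"], "rack2": ["node3", "node4"]}
data = b"x" * 1000
blocks = split_into_blocks(data, block_size=256)
placement = place_replicas(len(blocks), cluster)

for block_id, replicas in placement.items():
    print(block_id, len(blocks[block_id]), replicas)
```

Because every block lives on several nodes in more than one rack, losing a single machine (or, in this toy cluster, a whole rack) still leaves at least one copy of each block available, and different blocks can be processed on different nodes at the same time.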


Hadoop YARN (Yet Another Resource Negotiator) is the cluster resource management and job scheduling layer introduced in Hadoop 2. It is often described as the successor to the original Hadoop MapReduce, although strictly it replaces only MapReduce’s built-in resource management: MapReduce, the programming model for processing and generating large sets of data, still exists and now runs as one application among many on top of YARN. The original MapReduce is no longer sufficient on its own in today’s environment. MapReduce was created 10 years ago, and as the volume of data grew dramatically so did its processing times, which range from minutes to hours. Secondly, programming MapReduce jobs is a time-consuming and complex task that requires extensive training. Lastly, MapReduce does not fit all business scenarios, as it was created for the single purpose of indexing the web. YARN provides many benefits over its predecessor. It scales better because application life-cycle management is distributed, and it supports multiple versions of the MapReduce API in a single cluster. It allows for faster processing and, coupled with the in-memory capabilities of other software such as Apache Spark, comes close to real-time processing. YARN also supports many processing frameworks, removing the dependence on MapReduce and making Hadoop more flexible for different use cases.


Apache Hive is a data warehouse management and analytics system built for Hadoop. Hive was initially developed by Facebook, but soon became an open-source project and has been used by many other companies ever since. Hive uses a SQL-like query language called HiveQL, whose queries are converted into MapReduce, Apache Tez or Spark jobs.


Apache Pig is a platform for analysing large sets of data. It includes a high-level scripting language called Pig Latin that automates much of the manual coding required when writing MapReduce jobs in Java. Apache Pig is somewhat similar to Apache Hive, though some users say that it is easier to transition to Hive than to Pig if you come from an RDBMS and SQL background. However, both platforms have a place in the market. Hive is better optimised for standard queries and is easier to pick up, whereas Pig is better for tasks that require more customisation.


Apache HBase is a non-relational database that runs on top of HDFS. This schema-less database supports in-memory caching via a block cache and uses Bloom filters to skip files that cannot contain the requested row. Together these provide near real-time access to large datasets, and HBase is especially useful for sparse data, which is common in many Big Data use cases. However, it is not a replacement for a relational database: it does not speak SQL and does not support joins or cross-record transactions.
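To show why a Bloom filter speeds up lookups, here is a minimal Python sketch. The class, its parameters and the hashing scheme are assumptions chosen for illustration and are not HBase's implementation; the point is only that the filter can answer "definitely not here" without touching the data on disk.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key: str):
        # Derive several bit positions per key by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means the key was never added; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


# Hypothetical row keys stored in one data file.
file_filter = BloomFilter()
for row_key in ("user#1001", "user#1002"):
    file_filter.add(row_key)

print(file_filter.might_contain("user#1001"))  # an added key is always reported
```

A store can keep one such filter per data file in memory and only read the files whose filter says the row key might be present.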


Hadoop has become the low-cost, industry-standard ecosystem for securely analysing high-volume data from a variety of enterprise sources. We specialise in helping organisations leverage this advanced stack to rapidly understand their data landscape and deliver faster, more insightful reporting and analytics. Contact us for more details.


Introduction to Apache Spark

New technologies continue to emerge, enabling faster data processing and advanced analytics. The Hadoop platform was a great breakthrough in this space, as it solved many of the storage and retrieval challenges for very large and varied datasets by dividing the data and processing it across multiple machines. This was faster, more cost-effective and less prone to failure than a traditional RDBMS. Though Hadoop was a big step forward and made it easier to store, process and retrieve data in a schemaless environment, it is already 10 years old and is not well suited to multi-pass computations. With Hadoop, the output of a job has to be written to storage after each step, and the replication and disk I/O involved slow things down. Apache Spark solves this problem by supporting multi-step data pipelines and allowing jobs to run in memory.


It is claimed that Apache Spark can run programs up to 100 times faster in memory and 10 times faster on disk than Hadoop MapReduce. As with many Apache projects, it prides itself on simplicity and compatibility. It provides simplified code for developers and offers APIs in Java, Scala and Python. Spark is also not limited to running on top of Hadoop; it can run on other platforms such as Mesos and EC2, or as a standalone cluster.


Apache Spark has some great high-level libraries that work very well with its “Lightning-fast cluster computing”. These currently include Spark SQL, Spark Streaming, MLlib and GraphX. Spark SQL lets users ETL their data from formats such as JSON or Parquet and query it via SQL or HiveQL. Spark Streaming builds on Spark’s speed to let users process data in near real time; it represents the incoming stream as a sequence of resilient distributed datasets (RDDs). MLlib is a machine learning library that offers a range of algorithms for extracting meaning from the data, and GraphX is a library for graph processing and graph-parallel computation.


All in all, Apache Spark is one of the fastest big data analytics engines on the market: it is widely compatible, easy to use and packs a lot of features into one solution.


For more data management solutions and news please book a meeting with us.


An introduction to Apache Drill and why it is useful

With the rapid growth of data and the shift towards rapid development solutions, much data is now stored in non-relational stores such as Hadoop and MongoDB. The infrastructure built on relational databases, which have been in use for decades, cannot keep up with the volume and scope of data being captured. At the same time, SQL is a very good and very widely used method for extracting and analysing data. In short, it will not be replaced by hierarchical query techniques such as XPath any time soon.


In the well-known case of Google, the data the company captured in the early 2000s was simply too large for a traditional database to handle. Google developed an innovative algorithm that divided its data into smaller, more manageable sets spread across multiple machines, processed each set separately, and then brought the results back together. Google called this algorithm MapReduce, and it became the basis for an open-source platform called Hadoop.
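The divide, process and combine pattern described above can be illustrated with the classic word-count example. The sketch below runs in a single Python process, so it only mimics the data flow; the function names and sample chunks are assumptions for illustration, and a real cluster would run the map and reduce steps on many machines.

```python
from collections import defaultdict


def map_phase(chunk: str) -> list[tuple[str, int]]:
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks) -> dict[str, list[int]]:
    """Shuffle: group the values emitted by all mappers by key."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped: dict[str, list[int]]) -> dict[str, int]:
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for a block of data held on a different machine.
chunks = ["big data big cluster", "data cluster data"]
counts = reduce_phase(shuffle(map_phase(chunk) for chunk in chunks))
print(counts)
```

Each call to `map_phase` depends only on its own chunk, which is what allows the work to be spread across machines before the results are brought together in the reduce step.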


Hadoop is one of many frameworks that have been developed to allow massively parallel computing for fast and cost-efficient results. With the massive increase in data being captured from new sources, businesses started using old and new frameworks together. The challenge this creates is how to link up information from different sources and in different formats so that the right data can be extracted for an ad-hoc business case and yield valuable results quickly. Google solved this with an innovation it called Dremel. The open-source community, in tribute, then created Apache Drill. Drill solves this relational and non-relational problem by enabling the user to query data across different frameworks and formats, delivering low-latency results that can be interpreted with familiar tools and a familiar language.

[Figure: visual representation of how Apache Drill works]

Apache Drill has a few differentiating features that give it a competitive edge. It is schema-free: users can quickly query raw data without the time-consuming and costly task of schema creation and without significant IT involvement. Apache Drill is considered to be one of the fastest query engines on the market today. There is no need for data loading, and it has specialised memory management that reduces the memory footprint and eliminates garbage collections. It also supports locality-aware execution, which reduces network traffic when Drill is co-located with the datastore. Lastly, Apache Drill has been developed with the user in mind. The software is easy to install and supports all major operating systems. It leverages users’ existing SQL skills, so there is no need to learn a new language. It also integrates with popular business intelligence tools such as Tableau, QlikView, MicroStrategy and more.


Overall, Apache Drill is a software solution that users can implement to leverage their traditional relational data assets alongside newer NoSQL sources in a quick and convenient way, while continuing to use familiar tools and a familiar language.


At Data to Value we have been using Drill to access GPS track data stored in its native format (GPX) in order to create derived data such as speed and heading. This is great, as it enables us to retain the data in a standard format for use in GPX-compliant tools whilst also generating the custom derived data that we need to analyse in BI and other visualisation tools.
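As an illustration of what deriving speed and heading from consecutive track points involves, here is a minimal Python sketch. It is not the Drill query we use; the function names, the `(lat, lon, seconds)` tuple layout and the spherical-Earth assumption are all choices made for this example.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, assuming a spherical Earth


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Initial compass heading in degrees (0 = north, 90 = east)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    y = math.sin(dlam) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlam)
    return (math.degrees(math.atan2(y, x)) + 360) % 360


def speed_and_heading(p1, p2):
    """Each point is (lat, lon, seconds); returns (metres per second, degrees)."""
    elapsed = p2[2] - p1[2]
    if elapsed <= 0:
        raise ValueError("track points must be in increasing time order")
    distance = haversine_m(p1[0], p1[1], p2[0], p2[1])
    return distance / elapsed, initial_bearing_deg(p1[0], p1[1], p2[0], p2[1])


# Two hypothetical track points one hour apart, one degree of longitude along the equator.
speed, heading = speed_and_heading((0.0, 0.0, 0), (0.0, 1.0, 3600))
print(f"{speed:.2f} m/s, heading {heading:.1f} degrees")
```

Applied to each pair of neighbouring points in a track, this yields a speed and heading series that can sit alongside the original GPX data.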
