
Introduction to Data Lineage


Gartner Magic Quadrant for Operational DBMS


Gartner has released its Magic Quadrant for operational database management systems. As expected, the leaders in this market in terms of completeness of vision and ability to execute are well-established vendors such as Oracle, Microsoft and Amazon Web Services.

 

Although robust, well-established solutions are still the preferred choice in the market, Gartner indicates that open-source solutions are making their way into the market as well. Gartner predicts that by 2018 more than 70% of new in-house solutions will be developed using open-source technology and 50% of existing RDBMS deployments will be in the process of being converted. Gartner argues that open-source systems are much more cost-efficient than traditional solutions and are rapidly developing tools and capabilities to rival the market leaders.


 


Introduction to Apache Spark


New technologies continue to emerge, enabling faster data processing and advanced analytics. The Hadoop platform was a great breakthrough in this space: it solved many of the storage and retrieval challenges for very large and varied datasets by dividing the data and processing it across multiple machines. This was faster, more cost-effective, and less prone to failure than traditional RDBMSs. Hadoop was a big step forward and made it easier to store, process and retrieve data in a schemaless environment, but it is already 10 years old and its MapReduce model is poorly suited to multi-pass computations. With Hadoop, the output of a job has to be written to storage after each step, and the replication and storage overhead slows things down. Apache Spark addresses this problem by supporting multi-step data pipelines and allowing jobs to run in memory.
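To make the difference concrete, here is a toy sketch in plain Python (it does not use the Hadoop or Spark APIs). It runs the same two-step pipeline twice: once writing each step's output to disk and reading it back, roughly as a chain of MapReduce jobs would, and once handing the intermediate result straight to the next step in memory. The function names and the temporary-file layout are illustrative assumptions, not part of either platform.

```python
import json
import os
import tempfile


def step_square(values):
    """First pass: square every value."""
    return [v * v for v in values]


def step_keep_even(values):
    """Second pass: keep only the even values."""
    return [v for v in values if v % 2 == 0]


def run_with_disk(values, steps, workdir):
    """Write each step's output to a file and reload it before the next step."""
    writes = 0
    data = values
    for i, step in enumerate(steps):
        path = os.path.join(workdir, f"step_{i}.json")
        with open(path, "w") as f:
            json.dump(step(data), f)  # intermediate result hits storage
        writes += 1
        with open(path) as f:
            data = json.load(f)  # next step must read it back
    return data, writes


def run_in_memory(values, steps):
    """Pass each step's output directly to the next step."""
    data = values
    for step in steps:
        data = step(data)
    return data


steps = [step_square, step_keep_even]
with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_writes = run_with_disk(list(range(6)), steps, workdir)
memory_result = run_in_memory(list(range(6)), steps)
```

Both runs produce the same result, but the disk-based version pays for one write and one read per step. On a real cluster that cost also includes replicating each intermediate file, which is the overhead an in-memory pipeline avoids.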

 

Apache Spark is reported to run programs up to 100 times faster in memory and 10 times faster on disk than Hadoop alone. As with many Apache projects, it prides itself on simplicity and compatibility. It gives developers a simpler programming model and offers APIs for Java, Scala and Python. Spark is also not limited to running on top of Hadoop; it can run on Mesos, on Amazon EC2, or as a standalone platform.

 

Apache Spark has some great features that work very well with its “Lightning-fast cluster computing”. Its high-level libraries currently include Spark SQL, Spark Streaming, MLlib and GraphX. Spark SQL lets users extract, transform and load (ETL) their data from formats such as JSON or Parquet and query it via SQL or Hive. Spark Streaming builds on Spark’s speed and allows users to process data in near real time. It represents incoming data as a stream of resilient distributed datasets (RDDs) that are processed in small batches. MLlib is a machine learning library that provides various algorithms for extracting meaningful patterns from the data, and GraphX is a library for graph processing and graph-parallel computation.
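The idea of processing a stream as a sequence of small batches can be sketched in plain Python. This is a toy illustration, not the Spark Streaming API: the `micro_batches` and `count_per_batch` helpers, the event format and the two-second interval are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive windows of `interval` seconds."""
    batches = defaultdict(list)
    for ts, value in events:
        batches[int(ts // interval)].append(value)
    # Return the batches in time order; windows with no events are skipped.
    return [batches[k] for k in sorted(batches)]


def count_per_batch(events, interval):
    """Run the same small computation (a value count) on every batch."""
    return [dict(Counter(batch)) for batch in micro_batches(events, interval)]


events = [(0.5, "error"), (1.2, "ok"), (1.9, "error"), (2.1, "ok"), (3.7, "ok")]
counts = count_per_batch(events, interval=2)
```

Each batch is an ordinary small dataset, so the same code that handles batch data can handle streaming data one window at a time.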

 

All in all, Apache Spark is one of the fastest big data analytics engines on the market: it is widely compatible, easy to use and packs a lot of features into one solution.

 

For more data management solutions and news please book a meeting with us.
