Interest in Big Data and Data Science as disciplines, toolsets, roles and philosophies is at an all-time high. Job boards, software tools, specialist consultancies and even university courses are all rapidly expanding to embrace what Harvard Business Review has described as “the sexiest job of the 21st century”. Cynics argue that the discipline is simply the modern application of Statistics and nothing new. Proponents point to the difference in scale, the innovative use of technology and the use of newer techniques such as machine learning to go beyond the data-centric nature of statistics into modelling and mining knowledge itself.
Ways of achieving a Big Data capability without the infrastructure and expense.
Adoption is growing rapidly across some sectors, with online retail, healthcare, social media and manufacturing amongst those accelerating what McKinsey suggest will be a substantial skills shortage by 2018. Across other sectors, however, adoption is less pronounced and is held back by a variety of factors. The Economist, in a report on Big Data adoption, suggests “a company’s biggest hindrance to gaining value from big data is often itself”, with the two largest inhibitors being a lack of suitable software and a lack of in-house skills. On their own, however, these do not explain slow adoption: the tools and professional landscape are currently more diverse, mature and inexpensive than ever.
Prior to the Big Data era, many firms embarked on enterprise data warehousing and business intelligence projects, which were often expensive and prolonged, with returns that were difficult to quantify. These were often driven by insight rather than efficiency objectives, much as Data Science and Big Data solutions are often promoted today. However, the two approaches are sometimes unfairly treated as synonymous, with the challenges of one tarnishing the other. Whilst Data Warehousing and BI tools are well geared to answering what Donald Rumsfeld would describe as known unknowns, Data Science and Big Data approaches are much better at providing insight on unknown unknowns. Both approaches still have a very valid place in the Information Management toolkit, addressing somewhat different use cases.
So what possible solutions exist for organisations looking to gain insight from large volumes of varied data without the implementation risk? One approach that is currently evolving is the use of service-based models, with both startups and large players offering a growing number of options inspired by cloud computing. For processing power and advanced machine learning capabilities there are a number of smaller startups now offering services, such as Wise.io, Datumbox and BigML, which offers pricing based on storage and predictions. For more sophisticated and context-based requirements, IBM have also started to rent out the processing infrastructure of their famous Jeopardy-winning Watson supercomputer. This has tackled some impressive and diverse use cases, ranging from cancer research to cognitive cooking. Not everyone, however, requires the full infrastructure to be provided; for some the challenge is more around expertise and leveraging existing best practice. This is an area that startup Algorithmia aim to tackle by creating a marketplace for algorithms accessible via both an API and a code library.
Clearly there is an ever-growing number of options for those wishing to benefit from Big Data Analytics, Machine Learning and other related Data Science capabilities without the distraction of managing complex infrastructure. There are, however, also a number of risks to understand, and prerequisites to meet, in order to maximise the benefits of the service-based approach. Given that many of the solutions are third-party hosted, cloud-based platforms, many of the traditional concerns in this area need to be considered. In addition to these long-debated privacy, information security and support issues, there are also intellectual property considerations that have to be weighed carefully when reusing algorithms from a marketplace. More important, however, is the consideration that to really get the most value from any kind of third-party analysis toolset you need well-understood data quality, definitions and coverage for the data inputs. Without these important prerequisites the unknown unknowns will always remain just that: unknown unknowns.
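The point about coverage can be made concrete with a simple check run before any data is handed to a third-party service. What follows is only a minimal sketch in Python, assuming the records arrive as a list of dictionaries. The function names, the sample fields and the 90% threshold are illustrative assumptions and are not taken from any of the services mentioned above.

```python
from typing import Any, Dict, List


def field_coverage(records: List[Dict[str, Any]], fields: List[str]) -> Dict[str, float]:
    """Return, for each field, the fraction of records holding a usable value.

    A value counts as missing if the key is absent, None or an empty string
    (an assumption; real data sets may need a richer definition of "missing").
    """
    total = len(records)
    coverage = {}
    for field in fields:
        present = sum(1 for record in records if record.get(field) not in (None, ""))
        coverage[field] = present / total if total else 0.0
    return coverage


def fields_below(coverage: Dict[str, float], threshold: float = 0.9) -> List[str]:
    """List the fields whose coverage falls below the (hypothetical) threshold."""
    return sorted(name for name, value in coverage.items() if value < threshold)


if __name__ == "__main__":
    # Made-up sample records, purely for illustration.
    sample = [
        {"customer_id": 1, "postcode": "AB1 2CD"},
        {"customer_id": 2, "postcode": ""},
        {"customer_id": 3},
    ]
    report = field_coverage(sample, ["customer_id", "postcode"])
    print(report)
    print("Needs attention:", fields_below(report))
```

A report like this does not fix poor data, but it does show which inputs are too sparse to trust before paying for predictions built on them.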