Blog

Season's Greetings & Number 1 Data Consultancy on Google

  |   Blog

2016 has been another fantastic year at Data to Value. We've really enjoyed working with a number of new and existing clients across a range of sectors including Renewable Energy, Banking, Hedge Funds, Non-Governmental Organisations and Government departments. Thanks again to all our clients, staff, partners and meetup group attendees for making it such an enjoyable and interesting year, and best wishes for a pleasant Christmas break and a happy 2017!

 

On another note, we are also pleased to report that Data to Value has reached the number 1 Data Consultancy spot on Google (excluding paid ads). Big thanks to our partners, customers and supporters for making this happen!

Read More

Ride London Data Visualisation 2016

  |   Blog


Congratulations to all Ride London participants! We enjoyed watching and supporting Prudential Ride London 2016 in the sunshine, and in response to numerous requests to repeat the dashboard we produced in 2015 we are pleased to say we have created two dashboards for 2016:

 

 

 

We hope people enjoy taking a slightly different look at the results. It's probably worth mentioning a few caveats, however:

 

  • The results have nothing officially to do with Prudential Ride London and are based on a snapshot taken a couple of days after the event. They may therefore be out of date; for official, up-to-date results please go to the official website.

 

  • Some cyclists were diverted. Their times are included in the visualisation but are not separated out, so like-for-like split-time comparisons are not always possible.

 

  • Given the data available we had to recalculate pace and average kph stats. These are slightly out of kilter with the official stats, we believe due to differences in decimal rounding (a sketch of the calculation is shown below).
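
For the curious, here is a minimal sketch of the kind of recalculation involved. The distance and finish time in the example are made-up illustrations, not official Ride London figures:

```python
def speed_and_pace(distance_km: float, finish_time_hms: str):
    """Recalculate average speed (kph) and pace (min/km) from a finish time."""
    h, m, s = (int(part) for part in finish_time_hms.split(":"))
    hours = h + m / 60 + s / 3600
    avg_kph = distance_km / hours
    pace_min_per_km = (hours * 60) / distance_km
    return round(avg_kph, 2), round(pace_min_per_km, 2)

# Example with made-up values: 100 miles is roughly 160.9 km
print(speed_and_pace(160.9, "05:30:00"))  # -> (29.25, 2.05)
```

Rounding at different points in this calculation (for example, rounding the pace before deriving kph, or vice versa) is enough to produce small differences from the published figures.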
Read More

The Panama Papers – how did they pull off history’s biggest data leak?

  |   Blog

Find out how Data to Value’s Graph Data software partners Neo4j and Linkurious have been used in the Panama Papers investigation.


Recently there has been a lot of interest around the newly published Panama Papers. This giant trove of data is said to contain a whopping 11.5 million documents, or 2.6TB of data, completely dwarfing previous leaks like the 1.7GB WikiLeaks scandal or the 30GB Ashley Madison leak. It took two years, more than 400 journalists and cutting-edge technology solutions to process all of this information and gain valuable insight.

 

The data was leaked from one of the world's leading firms in offshore entity incorporation, Mossack Fonseca, and gradually transferred via encrypted chat to a German journalist working at the Süddeutsche Zeitung (SZ). The real work began shortly after the data started pouring in: SZ was unable to make sense of data of that size, so it contacted the International Consortium of Investigative Journalists (ICIJ) to find a way of handling these millions of documents. The ICIJ were very efficient and very prudent when handling the data. The data and its copies were stored on drives encrypted with the open-source software VeraCrypt. Apache Solr was chosen as the main search server, coupled with Apache Tika, a toolkit that detects and extracts metadata and text from over a thousand different file types. This made it possible to search different file types, such as PDFs, Word documents and emails, in a seamless and near real-time way. A custom UI built with Blacklight was put on top of the solution for ease of use. Once built, any of the more than 400 journalists needed only a link and a randomly generated password to start discovering interesting data.
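
To give a flavour of that pipeline, here is a minimal sketch using the Python wrappers for Apache Tika and Solr. This is not the ICIJ's actual code; the Solr URL, core name and field names are assumptions for illustration:

```python
import pysolr               # pip install pysolr
from tika import parser     # pip install tika (needs Java for the Tika server)

# Hypothetical Solr core; the real investigation's configuration was not published.
solr = pysolr.Solr("http://localhost:8983/solr/leak_docs", always_commit=True)

def index_document(doc_id: str, path: str) -> None:
    """Extract text and metadata with Tika and push the result into Solr."""
    parsed = parser.from_file(path)            # handles PDFs, Word files, emails, ...
    solr.add([{
        "id": doc_id,
        "filename": path,
        "content": parsed.get("content") or "",
    }])

index_document("doc-0001", "example.pdf")

# Near real-time keyword search across every indexed file type.
for hit in solr.search("content:shell AND content:company", rows=10):
    print(hit["id"], hit["filename"])
```

Tika handles the format detection and text extraction, while Solr provides the near real-time keyword search the journalists relied on.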

 

To make sense of the highly connected and complex data, the investigators decided to ask for the help of two of our software partners, Neo4j and Linkurious. Using Neo4j, the world's leading graph database, made it easy to find and analyse complex connections, as graphs use structures of nodes, relationships (edges) and properties to define and store data. Linkurious, a graph visualisation platform, helped the journalists navigate this ocean of data, uncovering unique insights into the offshore banking world and showing the relationships between banks, clients, offshore companies and their lawyers.
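
As an illustration of why the graph model suits this kind of investigation, here is a minimal sketch using the official Neo4j Python driver. The node labels and relationship types (Officer, Entity, Intermediary, OFFICER_OF, INTERMEDIARY_OF) are assumptions chosen for this example rather than a documented schema:

```python
from neo4j import GraphDatabase  # pip install neo4j

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Find every offshore entity linked to a named officer, plus the intermediary
# (agent or law firm) that set the entity up -- a typical "follow the connections" query.
CYPHER = """
MATCH (o:Officer {name: $name})-[:OFFICER_OF]->(e:Entity)
OPTIONAL MATCH (i:Intermediary)-[:INTERMEDIARY_OF]->(e)
RETURN o.name AS officer, e.name AS entity, i.name AS intermediary
LIMIT 25
"""

with driver.session() as session:
    for record in session.run(CYPHER, name="Some Person"):
        print(record["officer"], "->", record["entity"], "via", record["intermediary"])

driver.close()
```

Expressing the question as a pattern of nodes and relationships, rather than as a chain of table joins, is what made this kind of exploration tractable for non-technical journalists.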

 

The entire dataset of the Panama Papers is expected to be released in early May. For more interesting articles about finding meaning in data, visit our website and follow us on LinkedIn or Twitter.

Read More

The General Data Protection Regulation in a nutshell

  |   Blog

After more than three years of discussion, the EU General Data Protection Regulation (GDPR) framework has finally been agreed. In the UK this regulation will replace the current 1998 Data Protection Act. As with most major legislative change it will not be enforced immediately, and will likely become compulsory in the first half of 2018. The main intent of the GDPR is to give individuals more control over their personal data, impose stricter rules on companies handling it and make sure companies embrace new technology to process the influx of data produced. Here are the major changes in the new legislation:

 

  • Expanded territorial reach

Companies based outside the EU but targeting customers in the EU will be subject to the GDPR, which is not the case now.

 

  • Consent

Consent to the processing of personal data must be freely given, specific, informed and unambiguous. Consent is not freely given if a person is unable to refuse it without detriment.

 

  • Accountability and privacy by default

The GDPR places great emphasis on the accountability of data controllers to demonstrate compliance. They will be required to maintain certain documentation, conduct impact assessments for riskier processing and employ data protection practices by default, such as data minimisation.

 

  • Notification of a data breach

Data controllers must notify the Data Protection Authorities as quickly as possible, where feasible within 72 hours of discovering the breach.

 

  • Sanctions

This new legislation allows the Data Protection Authorities to impose higher fines of up to 4% of annual worldwide turnover. The maximum fines can be applied for violations related to international data transfers or breaches of processing principles, such as the conditions for consent. Other violations can be fined up to 2% of annual worldwide turnover.

 

  • Role of data processors

Data processors will now have direct obligations to implement technical and organisational measures to ensure data protection; this could include appointing a Data Protection Officer where required.

 

  • One stop shop

This legislation will be applicable in all EU states without the need for implementing national legislation. Having a single set of rules will benefit businesses, as they will not need to comply with multiple authorities, streamlining the process and saving an estimated €2.3 billion a year.

 

  • Removal of notification requirement

Some data controllers will be glad to hear that the requirement to notify or seek approval from a Data Protection Authority is being removed in many circumstances, a decision made to save funds and time. Instead of notification, the new regulation requires data controllers to put appropriate practices in place for large-scale processing, including the use of new technology.

 

  • Right to be forgotten

This is one of the most useful changes for the average person managing their data protection risks. A person will be able to require their data to be deleted when there is no legitimate reason for an organisation to retain it. When this is requested, the organisation must also take appropriate steps to inform any third parties that hold links to or copies of the data and ask them to delete it.

 

This new regulation has clearly been created in acknowledgement that people produce far more sensitive data than ever before. Managing data on a large scale can be risky for organisations if they do not plan an appropriate strategy and update their systems to handle the influx; this kind of negligence can lead to data breaches or leaks.

 

Data to Value are data specialists who can help you stay ahead of the curve when it comes to regulatory compliance. We use the latest NoSQL technologies to rapidly assess your data quality, identify and resolve problem areas, and set up alerts that promptly notify you of any new issues. Engage Data to Value to help you ensure compliance with the law, mitigate the risk of regulatory fines and maintain a good reputation. For more information, contact us directly at info@datatovalue.co.uk.

Read More

Interactive Data Governance workshop

  |   Blog

We were delighted to co-host this informative workshop with our software partners Semanta. Nigel Higgs kicked off with a short introduction to the session and introductions around the room, to break the ice and find out what problems attendees are having with Data Governance.

 

Tomas Barta presented some great case studies of past and present approaches to Data Governance, with open discussion around the room about common challenges and pitfalls. Tomas then introduced Semanta's approach to tackling these issues in an engaging and easy-to-understand way. This was followed by an informative practical session where everyone was hard at work learning to improve Data Governance for their organisation.

 

For more information about Data Governance read our blog: Outside In Data Governance – a value driven approach. For more news and our latest events sign up for our newsletter.

 




 

Read More

Top 21 Open Data sources

  |   Blog

Data is everywhere, created and used by just about anyone. The days when companies or individuals had to pay significant sums of money to access useful and interesting datasets are long gone. Here is our list of the top 21 free data sources available online, with a short example after the list of how to query one of them programmatically.

 

1. Data.gov.uk the UK government’s open data portal including the British National Bibliography – metadata on all UK books and publications since 1950.

 

2. Data.gov Search through 194,832 USA data sets on topics ranging from education to agriculture.

 

3. US Census Bureau latest population, behaviour and economic data in the USA.

 

4. Socrata – software provider that works with governments to provide open data to the public, it also has its own open data network to explore.

 

5. European Union Open Data Portal thousands of datasets about a broad range of topics in the European Union.

 

6. European Data Portal is a European portal that harvests metadata from public sector portals throughout Europe. EDP therefore focuses on data made available by European countries. In addition, EDP also harvests metadata from ODP.

 

7. DBpedia a crowd-sourced community effort to create a public database of all Wikipedia entries.

 

8. The New York Times a searchable archive of all New York Times articles from 1851 to today.

 

9. Dataportals.org datasets from all around the world collected in one place.

 

10. The World Factbook information prepared by the CIA about what seems like every country in the world.

 

11. NHS Health and Social Care Information Centre data sets from the UK National Health Service.

 

12. Healthdata.gov detailed USA healthcare data covering loads of health related topics.

 

13. UNICEF statistics about the situation of children and women around the world.

 

14. World Health Organisation statistics concerning nutrition, disease and health.

 

15. Amazon Web Services a large repository of interesting data sets, including the Human Genome Project, NASA's database and an index of 5 billion web pages.

 

16. Google Public Data Explorer search through the repositories already mentioned as well as lesser-known open data sources.

 

17. Gapminder a collection of datasets from the World Health Organisation and World Bank covering economic, medical and social statistics.

 

18. Google Trends analyse how search behaviour has shifted over the years.

 

19. Google Finance real-time finance data that goes back as far as 40 years.

 

20. UCI Machine Learning Repository a collection of databases for the machine learning community.

 

21. National Climatic Data Center the world's largest archive of climate data.
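
As a quick example of putting one of these sources to work, the sketch below searches Data.gov.uk, which exposes the standard CKAN API. The endpoint path and search term are assumptions you may need to adjust:

```python
import requests  # pip install requests

# CKAN's dataset search endpoint, as exposed by Data.gov.uk (assumed path).
URL = "https://data.gov.uk/api/action/package_search"

response = requests.get(URL, params={"q": "air quality", "rows": 5}, timeout=30)
response.raise_for_status()

for dataset in response.json()["result"]["results"]:
    print(dataset["title"])
```

Most of the government portals in this list (Data.gov, the European Data Portal and others) are CKAN-based, so the same pattern should carry over with little change.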

 

For more interesting articles, projects and events, visit our news section or contact us directly.

Read More

Outside In Data Governance – a value driven approach

  |   Blog

Originally written by Nigel Higgs on LinkedIn Pulse.

 

 

Those of us who have been in the data sphere a while, and in and around Data Governance, will have seen the pitch-decks, watched the webinars, read the blogs and attended the conferences. Some of us will have hired the staff, taken sage advice from expensive consultants and kicked off programmes to get the organisation up the Data Governance maturity curve. It's almost like a religion: Data Governance is so clearly the answer, so why can't everybody in the organisation see it? It's a no-brainer. Unfortunately, speaking as a Data Governance practitioner of far too many years, I can honestly say that I have yet to see a fully functioning enterprise-wide Data Governance implementation. Look, I appreciate that could be down to my incompetence, but I know this is not an isolated or unique sentiment. Lots of peers, colleagues and people far smarter than me have been preaching the benefits of data administration, data architecture, data governance, or whatever it will be called next, for many years, and yet many of them struggle to come up with success stories. In fact, when pressed they often don't have any!

 

 

So why so much denial? Einstein is reputed to have said something along the lines of 'the definition of insanity is to keep doing the same thing and expect the outcome to be different'. It is also reputed to be the most wrongly attributed and quoted platitude on the planet! But hey, this is a LinkedIn post and, like most of my writings, nobody will read it.

 

 

What's that got to do with Data Governance? Well, 'Outside In Data Governance' is about approaching the problem from a different angle. There is little doubt that the problem Data Governance is trying to solve is very real. Very few organisations know what data they have, what it means, where it is, who is responsible for it or what its quality is.

 

 

But how to solve the problem? What I typically hear is that you need to write a policy, form committees, define processes and assign roles, and then everything will be working like clockwork within months: data governed, quality data delivered to users and the organisation flying up the data maturity curve. But is that what happens? Does the story painted in the pitch-decks become reality? Sadly, it very rarely, if ever, does.

 

 

What is needed is a value driven approach. Start with who we are doing Data Governance for: we are doing it for the business users. Then ask what they are interested in: something that makes their lives easier right now. So 'Outside In Data Governance' starts with a single business report and works back from there. Answer those fundamental questions (what, where, who and how good?) about the fields and outputs on the report and make that knowledge accessible. You could do this with a simple Excel-based approach, or maybe a wiki or SharePoint, but pretty soon you will need some tooling to really make it scalable and responsive to increasing demands for more reports to be included in the scope. There are ways to do this in a 'proof of concept' environment and demonstrate the benefits before committing to spend. A friend of mine is fond of saying 'it's easier to ask for forgiveness than for permission'. In this case he is right. There are browser-based tools that sit outside your firewall and can offer this try-before-you-buy approach.

 

 

This is what a value driven and lean approach is all about. If what you do at this small scale doesn't get traction, then what makes you think a £250k project will end up any better? Start small, ensure you get honest feedback from users at every iteration of your solution and focus on delivering value. If you bring the data users with you, they will demand that the capability is extended. Beat the Einstein quote and start from the 'Outside In'.

 

 

To learn more about Lean Data Governance, sign up for our free Data Governance workshop led by Tomas Barta, the Co-Founder and Lead Designer of SEMANTA – one of the most innovative software companies in the Data Governance market.

Read More

Open Data in Government

  |   Blog

Open data has several definitions, but our preferred one at Data To Value is from the Open Data Institute: 'Open data is data that anyone can access, use and share.' Simple really, but there is a follow-on: 'For data to be considered open, it must be published in an accessible format, with a licence that permits anyone to access, use and share it.'

 

The growing open data movement is using these principles to make what is traditionally considered internal data more readily available to anyone who wishes to access it, use it and manipulate it in any way, shape or form.

 

At a national level, in 2013 the UK, along with the other G8 countries, signed up to the Open Data Charter and committed to five key principles:

 

  • Open Data by Default
  • Quality and Quantity
  • Useable by All
  • Releasing Data for Improved Governance
  • Releasing Data for Innovation

 

This level of driver is helping government departments, agencies and local authorities become more transparent and accountable, while enabling tech entrepreneurs to create disruptive technology that benefits society. A 2013 study by Deloitte estimates that the economic benefit of public sector information is worth around £1.8 billion, with social benefits amounting to £5 billion. The study highlights that the use and re-use of public information helps organisations and individuals in the following ways:

 

  • Fuel innovation – develop new products and services.
  • Increase the accountability of public service providers, improve engagement rates of individuals in the democratic function, increase transparency and help better policymaking.
  • Reduce barriers to entry for markets with information asymmetry.
  • Inform people about the social issues happening around them.

 

Some case studies include:

 

  • Publishing Open Data on cardiac surgery, which had positive impacts on mortality rates, with an economic value of £400 million p.a.
  • Providing Open Data streams – the clear benefit is evident in many mobile apps, for example those tracking congestion zones and helping users find alternative routes, or providing live transport information such as when the next bus is due.
  • The use of live weather data to identify whether users are in danger of storms, floods, snow or other hazards has an estimated economic value of between £15 million and £58 million p.a.

 

Open Data is beginning to affect us all and it is important to use sound Data Management principles to achieve public trust and to deliver the maximum benefit.

For more information on how Data To Value could help develop your Open Data Strategy, click here.

Read More

Introduction to data quality

  |   Blog

How many times have you heard managers and colleagues complain about the quality of the data in a particular report, system or database? People often describe poor quality data as unreliable or not trustworthy. Defining exactly what high or low quality data is, why it is a certain quality level and how to manage and improve it is often a trickier task.

 

Within the Data Quality Management community there is a generally held view that the quality of a dataset depends on whether it meets defined requirements. Managers often define these requirements as outcomes such as higher sales, lower costs or fewer defects. Whilst this is important, it doesn't help practitioners at the coalface codify rules and other tests designed to measure the quality of a dataset. For this, a more specific definition of requirements, such as completeness or uniqueness levels, is required. An example requirement statement could be "All clients should have a name and address populated in our CRM system".

 

 

Measuring Data Quality

 

Data Quality dimensions are often used by practitioners to generically group different types of tests that typically span different project requirements. Whilst there is some disagreement on the number of dimensions and the terms used for them, many practitioners use definitions such as those below:

 

  • Completeness – requires that a particular column, element or class of data is populated and does not feature null values or values in place of nulls (e.g. N/As).

 

  • Consistency – something that tests whether one fact is consistent with another e.g. gender and title in a CRM database.

 

  • Uniqueness – are all the entities or attributes within a dataset unique?

 

  • Integrity – are all the relationships populated for a particular entity – for example its parent or child entities?

 

  • Conformity – does the data conform to the right conventions and standards? For example, a value may be correct but not follow the recognised format or standard.

 

  • Accuracy – the hardest dimension to test for as this often requires some kind of manual checking by a Subject Matter Expert (SME).

 

Dimensions are used not only as a checklist to confirm that the best mix of rules has been implemented to test the quality of a dataset; they are also often used to aggregate data quality scores for tracking trends and MIS. Many more complex measurement methods also exist which help translate individual pass/fail results into more business-friendly cost, risk and revenue calculations.
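
To make this concrete, here is a minimal sketch of how a couple of these dimensions translate into executable rules, using pandas and a made-up CRM extract:

```python
import pandas as pd

crm = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "name":        ["Alice Ltd", "Bob & Co", "Bob & Co", None],
    "email":       ["a@x.com", "N/A", "bob@y.com", "dora@z.com"],
})

# Completeness: populated and not a value-in-place-of-null such as "N/A".
completeness = (crm["email"].notna() & ~crm["email"].isin(["N/A", "n/a", ""])).mean()

# Uniqueness: no duplicate customer identifiers.
uniqueness = 1 - crm["customer_id"].duplicated().mean()

print(f"email completeness: {completeness:.0%}, customer_id uniqueness: {uniqueness:.0%}")
```

Scores like these can then be rolled up by dimension to feed the trend tracking and MIS mentioned above.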

 

 

Improving Data Quality

 

A different set of skills and tools is often used for improving data quality after it has been measured. A good Data Quality Analyst tends to exhibit a mix of skills typically found in Data Analysts, Data Scientists and Business Analysts, amongst others. At a strategic level, a good understanding of corporate culture, architecture, technology and other factors is often important. However, a number of essential technical skills are also required when dealing with the data itself. These include parsing, standardising, record linkage/matching, data scrubbing/cleansing, data profiling and data auditing/monitoring. These skills are used extensively when conducting projects such as data migrations, where data quality improvements need to be achieved in tight timescales.

 

Parsing

Parsing is the process of analysing data and determining whether a string conforms to one or a few main patterns. Parsing is fairly easy to automate if a dataset has a recognisable or predictable format.
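
As a small illustration, the sketch below parses strings against a simplified UK postcode pattern; the regular expression is an illustrative assumption, not the full official specification:

```python
import re

# Simplified, illustrative UK postcode pattern -- not the full official rules.
POSTCODE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}$")

for value in ["EC1A 1BB", "sw1a2aa", "not a postcode"]:
    print(value, "->", bool(POSTCODE.match(value.upper().strip())))
```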

 

Standardising

Once the main formats are recognised and parsing is complete, the next step is to standardise the dataset. This is done by correcting the data in a pre-defined way that is consistent and clear throughout the whole dataset.
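
Continuing the postcode example, here is a minimal standardisation sketch that rewrites every parsed value into one pre-defined format (upper case, single space before the final three characters – an assumed house convention):

```python
def standardise_postcode(raw: str) -> str:
    """Rewrite a parsed postcode into a single canonical format."""
    compact = raw.upper().replace(" ", "")
    return f"{compact[:-3]} {compact[-3:]}"   # e.g. "sw1a2aa" -> "SW1A 2AA"

print([standardise_postcode(v) for v in ["sw1a2aa", "EC1A1BB", "ec1a 1bb"]])
# ['SW1A 2AA', 'EC1A 1BB', 'EC1A 1BB']
```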

 

Record linkage/matching

Record linkage or matching describes the process of identifying and linking duplicate records that refer to the same real-world entity but may not be completely identical in the datasets, for instance the same product entered as "Leather chair – black" and "Chair, Blk. – Leather".
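
A minimal matching sketch using only the Python standard library is shown below; production matching tools use far more sophisticated techniques (token normalisation, blocking, probabilistic scoring), and the 0.4 similarity threshold is an arbitrary assumption:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude similarity score on lower-cased, alphabetic tokens sorted into order."""
    def norm(s: str) -> str:
        letters = "".join(c for c in s.lower() if c.isalpha() or c == " ")
        return " ".join(sorted(letters.split()))
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

score = similarity("Leather chair - black", "Chair, Blk. - Leather")
print(round(score, 2), "-> likely the same product" if score > 0.4 else "-> probably different")
```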

 

Data scrubbing/cleansing

Data scrubbing or cleansing describes the process of amending or removing data that is incorrect, incomplete, improperly formatted and/or duplicated. Typically a software tool uses rules and algorithms to amend specific types of mistake, saving the data quality professional a significant amount of time.
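
Here is a small sketch of rule-driven cleansing with pandas; the rules themselves (treating "N/A" as missing, dropping exact duplicates, defaulting a missing country to "UNKNOWN") are assumptions for illustration:

```python
import pandas as pd

raw = pd.DataFrame({
    "name":    ["Alice Ltd", "Alice Ltd", "Bob & Co"],
    "country": ["UK", "UK", "N/A"],
})

cleansed = (
    raw.mask(raw.eq("N/A"))            # values-in-place-of-nulls become real nulls
       .drop_duplicates()              # remove exact duplicate records
       .fillna({"country": "UNKNOWN"}) # apply an agreed default
)
print(cleansed)
```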

 

Data profiling, auditing and monitoring

Data profiling is the process of analysing and gathering information about the data. This information can be used for specific data quality metrics and helps determine whether the metadata accurately describes the source data. Data profiling is one of the main tools used for data auditing; it helps assess the fitness of data for a specific purpose, which in turn ties in with long-term data monitoring that helps prevent serious issues.
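
Profiling can start as simply as summarising each column. The sketch below builds a basic profile from a made-up extract:

```python
import pandas as pd

# Hypothetical extract; in practice this would come from the source system.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "name":        ["Alice Ltd", "Bob & Co", None, "Dora plc"],
    "country":     ["UK", "UK", "FR", "N/A"],
})

profile = pd.DataFrame({
    "dtype":      df.dtypes.astype(str),   # declared type of each column
    "null_count": df.isna().sum(),         # how many values are missing
    "distinct":   df.nunique(),            # how many distinct values appear
})
print(profile)
```

Comparing a profile like this against the documented metadata is often the quickest way to spot where the two have drifted apart.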

 

 

Hopefully the above gives a flavour of some of the skills and techniques involved in Data Quality Management. For more in-depth coaching, please consider our upcoming Data Quality Management fundamentals course in London this March.

Read More