Quality data is the key to success for machine learning


The old phrase “garbage in, garbage out” is very appropriate in the new age of machine learning tools. An article in the Harvard Business Review highlights the importance for businesses of getting not just quality data but also the “right” data, to ensure the success of initiatives using machine learning.

The article emphasises that to properly train a predictive model, historical data must meet exceptionally broad and high-quality standards. First, the data must be right: correct, properly labelled and so forth. But it must also be the “right” data: lots of unbiased data, covering the entire range of inputs for which one aims to develop the predictive model.

The article suggests that most data quality work focuses on one criterion or the other, but for machine learning, it is necessary to work on both simultaneously.
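
As a rough illustration of working on both criteria at once, the sketch below runs a correctness check and a coverage check over a handful of made-up records. Everything in it is an assumption for the sake of the example and does not come from the Harvard Business Review article: the field names, the label set, the age range and the bin width are all hypothetical. It also does not attempt to measure bias, which needs domain judgement beyond a simple count.

```python
from collections import Counter

# Hypothetical training records: one input feature (age) and a label.
records = [
    {"age": 23, "label": "churn"},
    {"age": 35, "label": "stay"},
    {"age": 41, "label": None},      # missing label
    {"age": -5, "label": "stay"},    # impossible value
    {"age": 38, "label": "churn"},
]

VALID_LABELS = {"churn", "stay"}   # assumed label set
AGE_RANGE = (18, 80)               # assumed range the model should cover
BIN_WIDTH = 20                     # assumed bin size for the coverage check


def is_correct(record):
    """Criterion 1: the record is correct and properly labelled."""
    low, high = AGE_RANGE
    return record["label"] in VALID_LABELS and low <= record["age"] <= high


def uncovered_bins(clean_records, min_per_bin=1):
    """Criterion 2: list the input ranges that have too few examples."""
    low, high = AGE_RANGE
    counts = Counter((r["age"] - low) // BIN_WIDTH for r in clean_records)
    n_bins = (high - low) // BIN_WIDTH + 1
    return [
        (low + b * BIN_WIDTH, min(low + (b + 1) * BIN_WIDTH - 1, high))
        for b in range(n_bins)
        if counts[b] < min_per_bin
    ]


clean = [r for r in records if is_correct(r)]
gaps = uncovered_bins(clean)
print(f"{len(clean)} of {len(records)} records pass the correctness check")
print("Age ranges with no usable data:", gaps)
```

Running the two checks together matters because cleaning can open up gaps: a dataset that seems to span the whole input range may stop doing so once the incorrect records are removed.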

For businesses, the first task is to clarify the objectives of the machine learning exercise and assess whether they have the right data to support those objectives. Objectives could include cost reduction, better customer knowledge or improved production efficiency.

It is important for management to ensure that one person has responsibility for the data. This person should possess intimate knowledge of the data, including its strengths and weaknesses, and should enforce standards for data quality. They must also lead ongoing efforts to find and eliminate causes of error.

The Harvard Business Review author emphasises the need to pair data scientists and experienced business people when preparing the data and training the model. They note that business people have “dealt with bad data forever, and you need to build their expertise into your predictive model”.

Getting this right is difficult, but machine learning has incredible power, and businesses need to learn to tap it. Poor data quality can “cause that power to be delayed, denied, or misused, fully justifying every ounce of the effort”.

