3 Techniques for Preparing Data for Predictive Modeling

“A customer with a billing balance of $0 is certain to cancel service.” So begins the tale of bad data preparation and its effects on Predictive Analytics.

The story goes something like this: a data scientist uses predictive modeling to determine which customers will leave next month. The data scientist works from historical data, including information from the company’s billing system, which holds the current balance for each customer. When this field is used as an input to the model, along with many others, the model treats a $0 balance as a primary indicator of future customer churn.

What’s the problem?

The $0 balance does not come before the customer leaves. It usually comes from a business process in which the balance value is “zeroed out” shortly after a customer leaves, so the model is learning from information that would not exist at the moment a prediction is needed. The data scientist has made a grave error in data preparation, and it is one of several errors that serve as cautionary tales.
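One way to guard against this kind of leakage is to build each feature from what was known as of a cutoff date, rather than from the current value in the source system. The sketch below assumes a billing history with one dated row per balance change; the records, the field layout, and the balance_as_of helper are all hypothetical and only illustrate the idea.

```python
from datetime import date

# Hypothetical billing history rows: (customer_id, recorded_on, balance).
balance_history = [
    ("c1", date(2023, 1, 15), 120.0),
    ("c1", date(2023, 2, 20), 0.0),   # zeroed out after c1 left in February
    ("c2", date(2023, 1, 10), 80.0),
    ("c2", date(2023, 2, 12), 95.0),
]


def balance_as_of(history, cutoff):
    """Return each customer's latest balance recorded on or before cutoff."""
    latest = {}
    # Walk the rows in date order so later rows overwrite earlier ones.
    for customer_id, recorded_on, balance in sorted(history, key=lambda r: r[1]):
        if recorded_on <= cutoff:
            latest[customer_id] = balance
    return latest


# Features for predicting February churn may only use data up to January 31.
cutoff = date(2023, 1, 31)
features = balance_as_of(balance_history, cutoff)
print(features)
```

With this cutoff, the zeroed-out February balance for c1 never reaches the model, because it was recorded after the point at which the prediction would have been made.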

Getting data preparation right makes all the difference in Predictive Analytics. You don’t want to miss the following three considerations during data preparation.

NULL means 0, right?

Often, a dataset will have null values. A column of data will have values for many records, and then you notice blanks or “nulls.” What’s going on? Asking this question is the beginning of any good data preparation. Curiosity is a prerequisite for predictive modeling.

Technical solutions exist for filling these missing values by imputing a likely value. But don’t be in such a hurry to “clean” data by filling in values or deleting cases. Often, the lack of data is an important fact for predictive modeling. In a public data set on adverse drug effects, for example, many patients have a weight of 0.

The right question to ask is, “Why don’t we know the weight of these patients?” It turns out that having a weight of 0 strongly correlates with a particularly bad adverse effect: in those cases, there is simply no time to weigh the patient.
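A simple way to keep that signal is to record the missingness as its own field before any imputation takes place. The sketch below is illustrative only: the patient records are made up, and the weight_kg and severe field names and both helper functions are assumptions, not part of the public data set mentioned above.

```python
# Made-up patient records; a weight of 0 or None means "not recorded".
patients = [
    {"weight_kg": 0, "severe": True},
    {"weight_kg": None, "severe": True},
    {"weight_kg": 72.0, "severe": False},
    {"weight_kg": 65.5, "severe": False},
    {"weight_kg": 0, "severe": False},
    {"weight_kg": 80.1, "severe": True},
]


def add_missing_flag(records, field, missing_values=(None, 0)):
    """Copy each record and add '<field>_missing' instead of imputing a value."""
    flagged = []
    for record in records:
        row = dict(record)
        row[f"{field}_missing"] = record.get(field) in missing_values
        flagged.append(row)
    return flagged


def rate_by_flag(rows, flag, outcome):
    """Share of rows with a true outcome, split by the value of the flag."""
    totals = {True: 0, False: 0}
    hits = {True: 0, False: 0}
    for row in rows:
        totals[row[flag]] += 1
        hits[row[flag]] += int(row[outcome])
    return {k: hits[k] / totals[k] for k in totals if totals[k]}


flagged = add_missing_flag(patients, "weight_kg")
rates = rate_by_flag(flagged, "weight_kg_missing", "severe")
print(rates)
```

Comparing the outcome rate between the flagged and unflagged groups is a quick check on whether the blanks carry information worth keeping as a model input.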

No keys, no join? Nope.

In traditional database architectures, we use keys to link two tables of data. Which transactions belong to which customers? A unique customer ID enables you to match up a customer with transactions. Combining these tables is called a join.

But what happens when you do not have keys? A common example is matching company names. Predictive modeling can greatly benefit from outside data sources. Obviously, these data sources do not use your proprietary keys for your customers.

Company names follow certain conventions. It helps to run standard pre-processing steps that strip company entity types such as Inc, LLC, and GmbH. Also, look for other fields, such as website domain, which must be unique and can therefore serve as an identifier.
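The pre-processing step might look something like the sketch below, which normalizes company names and then uses the result as a stand-in key. The suffix list is an assumed, non-exhaustive sample, and the customer IDs, company names, and function names are hypothetical. Two different companies can collapse to the same normalized name, so matches made this way deserve a manual review.

```python
import re

# Assumed, non-exhaustive set of entity-type tokens to strip.
ENTITY_SUFFIXES = {"inc", "llc", "gmbh", "ltd", "corp", "co"}


def normalize_company(name):
    """Lowercase, drop punctuation, and strip trailing entity-type tokens."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    # Always keep at least one token so a name never normalizes to nothing.
    while len(tokens) > 1 and tokens[-1] in ENTITY_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def match_on_name(internal, external):
    """Pair internal customer IDs with external records by normalized name."""
    index = {normalize_company(name): cid for cid, name in internal.items()}
    matches = {}
    for ext_name, record in external.items():
        key = normalize_company(ext_name)
        if key in index:
            matches[index[key]] = record
    return matches


# Hypothetical internal customers and an outside data source.
internal = {"C001": "Acme, Inc.", "C002": "Globex GmbH"}
external = {"ACME LLC": {"employees": 250}, "Initech Corp": {"employees": 40}}

matches = match_on_name(internal, external)
print(matches)
```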

Hierarchies and categories matter.

When using predictive modeling techniques, we do not know at the beginning what the technique will consider important. For example, if we have a list of products, and we want to know what products are purchased together, it may be much more interesting to know the families of the products.

For example, thousands of individual ball bearing SKUs may simply be noise in your data, whereas categories like “deep groove ball bearings” or “angular contact ball bearings” may be much more interesting for predictive modeling techniques.
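To illustrate, the sketch below counts which items are purchased together, first at the SKU level and then after rolling each SKU up to its family. The SKU codes, the mapping, and the orders are invented for the example, and pair_counts is a hypothetical helper rather than a standard technique from any particular library.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mapping from SKU to product family.
sku_family = {
    "BB-6204": "deep groove ball bearings",
    "BB-6205": "deep groove ball bearings",
    "BB-7204": "angular contact ball bearings",
    "BB-7205": "angular contact ball bearings",
}

# Invented orders, each a list of SKUs purchased together.
orders = [
    ["BB-6204", "BB-7204"],
    ["BB-6205", "BB-7205"],
    ["BB-6204", "BB-6205"],
]


def pair_counts(baskets, mapping=None):
    """Count co-purchased pairs, optionally after mapping items to categories."""
    counts = Counter()
    for basket in baskets:
        items = {mapping[i] for i in basket} if mapping else set(basket)
        # Sorting gives each pair one canonical order before counting.
        counts.update(combinations(sorted(items), 2))
    return counts


sku_pairs = pair_counts(orders)
family_pairs = pair_counts(orders, sku_family)
print(sku_pairs)
print(family_pairs)
```

At the SKU level every pair in this toy data appears only once, while the family-level count shows the two bearing families being bought together repeatedly. That is the kind of pattern individual SKUs can hide.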

Need help? The experts at Syntelli Solutions understand the value of data preparation and predictive modeling.