Overcoming Challenges in Predictive Modeling Part 1: Lack of Sufficient Data

By Doug Wing  |  January 22, 2014


The results of a recent ISO and Earnix survey of 269 insurance professionals clearly indicate that two primary factors keep the majority of insurers from developing more predictive analytics: a lack of sufficient data and a shortage of skilled modelers. I’ll discuss the lack of sufficient data in this post and follow up with another focusing on the shortage of skilled modelers.

Data is the fuel needed for any successful predictive modeling exercise. Building models requires two forms of data: the variables statisticians use as predictors, known as independent variables, and loss events, known as target variables. Carriers can add independent variables simply by purchasing information such as weather, demographic data, or anything else they believe may be predictive. For example, nearly every insurer can add credit variables to a modeling data set. In contrast, target variables, such as loss events, are extremely difficult to acquire. Carriers typically aren’t in the business of sharing their loss experience with other carriers, and only the largest carriers have enough data of their own to build reliable predictive models.

Just how much data does a company need? That depends. If you follow the best practices of predictive modeling and split your data into separate training, validation, and holdout data sets, you’ll need more data, because each subset must be large enough to be useful on its own. You don’t want to build a model and validate it on the same data set: a model tuned to that data will look more accurate than it really is, which is the classic overfitting problem.
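As a rough illustration of the three-way split described above, here is a minimal Python sketch. The function name `split_policies`, the 60/20/20 proportions, and the fixed random seed are assumptions made for the example, not recommendations from the survey or a prescribed industry standard.

```python
import random


def split_policies(records, train_frac=0.6, validation_frac=0.2, seed=0):
    """Shuffle records and split them into training, validation, and holdout lists.

    The fractions are illustrative assumptions; whatever remains after the
    training and validation slices becomes the holdout set.
    """
    if train_frac <= 0 or validation_frac <= 0 or train_frac + validation_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a holdout set")

    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle keeps the split reproducible

    n_train = round(len(shuffled) * train_frac)
    n_validation = round(len(shuffled) * validation_frac)

    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    holdout = shuffled[n_train + n_validation:]
    return training, validation, holdout


# Hypothetical data: 1,000 policy identifiers standing in for policy records.
policies = list(range(1000))
training, validation, holdout = split_policies(policies)
print(len(training), len(validation), len(holdout))
```

Note that only the training slice is available for fitting, so a carrier that splits its data this way fits its model on roughly 60 percent of its loss experience. That is why following the practice raises the amount of data required.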

If your goal is just to get a generalized linear model (GLM) to converge, you can accomplish it with minimal loss experience. However, getting a model to converge doesn’t make it a good predictor of future events. Following best practices to produce tight confidence intervals on your validation data set requires more loss experience. If your goal is to produce a model that generalizes well on a holdout data set (and that should be your goal), then you need more data still to ensure your predictions are reliable and reproducible.
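To give a feel for how confidence intervals tighten as loss experience grows, the sketch below uses the standard normal approximation for a proportion. The 5 percent claim frequency and the policy counts are made-up figures, and the helper name `ci_half_width` is hypothetical. A real GLM would have intervals on each coefficient, but the same square-root relationship with sample size applies.

```python
import math


def ci_half_width(claim_frequency, n_policies, z=1.96):
    """Approximate half-width of a 95% confidence interval for a claim frequency.

    Uses the normal approximation to the binomial: z * sqrt(p * (1 - p) / n).
    """
    return z * math.sqrt(claim_frequency * (1 - claim_frequency) / n_policies)


# Illustrative assumption: a 5% observed claim frequency at several book sizes.
for n in (1_000, 10_000, 100_000, 1_000_000):
    half_width = ci_half_width(0.05, n)
    print(f"{n:>9,} policies: 5.00% +/- {half_width:.2%}")
```

Because the width shrinks with the square root of the number of policies, halving the interval takes about four times as much data.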

The biggest mistake an insurer can make isn’t failing to build a model but rather successfully building a bad model. Many companies have built models on insufficient data sets and implemented the results without validating them on a holdout data set. The results were costly when the models failed to generalize to the new risks they were used to evaluate in the real world.

Similarly, if you want to create better, more predictive variables, you’ll need additional data. Identifying new independent variables (for example, credit, weather, and so on) will require significant amounts of target variable data. To create variable transformations and find interactions through exploratory data analysis, statisticians and modelers need enough data to detect the small but predictive relationships. Without sufficient loss experience, you could acquire as many independent variables and data sources as you like, but you’d be no closer to building a successful model.
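The point about small relationships can be made concrete with a textbook sample-size approximation for comparing two proportions. In the sketch below, the function name `policies_needed_per_group`, the claim frequencies, and the 5 percent significance and 80 percent power settings are all assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def policies_needed_per_group(freq_a, freq_b, alpha=0.05, power=0.80):
    """Approximate policies needed in each group to detect freq_a vs. freq_b.

    Standard two-proportion formula with a two-sided test:
    n = (z_alpha + z_power)^2 * (pa*(1-pa) + pb*(1-pb)) / (pa - pb)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = freq_a * (1 - freq_a) + freq_b * (1 - freq_b)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (freq_a - freq_b) ** 2)


# Hypothetical question: does a new variable separate risks with a 5.0% claim
# frequency from risks with a 5.5% (or only 5.25%) claim frequency?
n_half_point = policies_needed_per_group(0.05, 0.055)
n_quarter_point = policies_needed_per_group(0.05, 0.0525)
print(n_half_point, n_quarter_point)
```

Under these assumptions, halving the size of the effect you want to detect roughly quadruples the number of policies required, which is why subtle predictors are out of reach for small books of business.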

In conclusion, insurance professionals need to recognize not only that insufficient data is a problem but also that modeling with it creates real risks.

Doug Wing

Douglas Wing, assistant vice president of Analytic Products for ISO Insurance Programs and Analytic Services, is responsible for the ISO Risk Analyzer® suite of predictive analytic tools. He leads ISO’s initiatives to enhance its offerings through analytics and predictive modeling across all lines of insurance. Before joining Verisk, Doug was in actuarial research and development at Liberty Mutual.