How to Pick a Better Model

By Hernan L. Medina, CSPA, CPCU

Your predictive analytics team has just built a new model. Now all you need to do is approve implementation, right? Not quite. Although it’s reasonable to expect that the model will lead to better business performance, you should review the evidence before implementing it. Data scientists have many ways to evaluate and compare models. To help you make better-informed decisions, here are some of the concepts underlying two of these best practices.

How well does the predictive model generalize?

A model can be said to generalize well when it can make accurate predictions for new data. Let’s consider pure premium (the portion of the insurance rate needed to pay claims and loss adjustment expenses, or LAE). Suppose I use 2016 personal auto physical damage claim history to build a model for predicting pure premium as follows: For each policy, the model’s prediction equals the claim amount plus claim adjustment expense that the policy had in 2016. So, if “policy A” had no claims, the prediction is $0; if “policy B” had one claim with $10,000 in loss and LAE, the prediction is $10,000, and so on. This model does a perfect job on the training sample (the 2016 data used to develop the model). It predicts 2016 pure premiums with 100 percent accuracy. However, there’s a chance policy A will have a claim in 2017; its pure premium could be greater than $0. Additionally, it’s highly unlikely policy B would have exactly one claim with exactly $10,000 loss and LAE again in 2017; it may have no claims at all, or it could have a less expensive or more expensive claim. Thus, this model would perform very poorly with new data. A model does not generalize well if it does very well on training data but poorly on new data.
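
For readers who like to see the idea in code, here is a minimal sketch in Python of the “memorizing” model described above (the 2017 figures are invented purely for illustration): it reproduces the 2016 experience exactly but misses badly on new data.

```python
losses_2016 = {"policy_A": 0.0, "policy_B": 10_000.0}   # training data: 2016 loss + LAE
losses_2017 = {"policy_A": 4_000.0, "policy_B": 0.0}    # new data, unknown when the model is built

def memorizing_model(policy_id):
    """Predict pure premium as the policy's own 2016 loss + LAE."""
    return losses_2016[policy_id]

# Perfect on the training sample ...
print([memorizing_model(p) - losses_2016[p] for p in losses_2016])   # [0.0, 0.0]

# ... but badly wrong on new data.
print([memorizing_model(p) - losses_2017[p] for p in losses_2017])   # [-4000.0, 10000.0]
```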

What can happen when a company implements a model that does not generalize well?

Continuing the example above, suppose the portfolio had 100 policies in 2016. Also, assume 95 policies had no claims, and each of the remaining five policies had a claim of $10,000. For simplicity, let’s ignore profit and expenses other than LAE. Using the model above, the 2017 rate for the 95 policies with no claims in 2016 would be $0, and the 2017 rate for the other five policies with one claim in 2016 would be $10,000. The five policyholders receiving a $10,000 premium renewal bill would likely look to find insurance elsewhere, and the company would be left with the 95 policies it’s insuring for $0 premium.

Clearly, if one or more of the 95 policies in this example has a claim in 2017, the company will not have collected enough premium to pay the claims. This exaggerated example illustrates the following point: When a company implements a model that does not generalize well, it risks losing policyholders whose model price is too high. The company may also retain many policyholders whose model price is too low. These two effects can lead to a decrease in profitability or even a net loss.
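
A rough sketch of the arithmetic, with a hypothetical 2017 outcome (the assumption that two of the remaining policies have a claim is mine, purely for illustration):

```python
# Hypothetical 2017 outcome for the 100-policy example above.
n_no_claim_2016, n_claim_2016 = 95, 5
rate_no_claim, rate_claim = 0.0, 10_000.0    # 2017 rates produced by the memorizing model

# The five policyholders billed $10,000 leave; the insurer keeps the 95 priced at $0.
premium_lost = n_claim_2016 * rate_claim                    # $50,000 that will never be collected
premium_collected = n_no_claim_2016 * rate_no_claim         # $0

# Assume, say, two of the remaining 95 policies have a $10,000 claim in 2017.
losses_2017 = 2 * 10_000.0
underwriting_result = premium_collected - losses_2017       # -$20,000: a net loss
print(f"premium ${premium_collected:,.0f}, losses ${losses_2017:,.0f}, result ${underwriting_result:,.0f}")
```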

How can one achieve a model that generalizes well?

A model generally predicts an expected value for each group of policies having the same features, also called predictors or independent variables. A pure premium model would estimate the same expected value or average pure premium for all policies in a group with the same predictor values. If another policy had almost the same predictor values, but one of them was different, the model may estimate a higher or lower pure premium.

However, as the number of predictors or independent variables used to build a model increases, smaller groups of policies share the same predictor values. At some point, it may seem as if the model were using each policy’s pure premium to predict the pure premium for only that policy, almost like the example above. As the number of predictors expands relative to the size of the modeling data set, there’s an increasing danger the model will not generalize well. A model that fits the training data too closely at the expense of accuracy on new data is said to be overfitted, and an overfitted model may not be suitable for implementation.
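
The effect is easy to demonstrate on synthetic data. The sketch below (my own illustration, not drawn from the article) fits a simple model and a highly flexible one to a small sample; the flexible model matches the training noise more closely but does worse on fresh data.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    """Hypothetical data: a linear relationship plus noise."""
    x = rng.uniform(-1.0, 1.0, n)
    y = 2.0 * x + rng.normal(0.0, 0.3, n)
    return x, y

x_train, y_train = simulate(20)      # small training sample
x_test, y_test = simulate(1000)      # "new" data

for degree in (1, 9):                # simple model vs. highly flexible model
    coefs = np.polyfit(x_train, y_train, degree)
    mse_train = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {mse_train:.3f}, test MSE {mse_test:.3f}")
```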

Avoiding overfitted models is usually the responsibility of the predictive analytics team, which typically follows several best practices to do so. One of these is to obtain a larger modeling data set, if possible. In statistics, the law of large numbers states that averages based on a sample converge (get closer and closer) to the population average as the sample size increases. Thus, all else being equal, the more policies in your modeling data set, the more accurate your model’s predictions should be. Other best practices include splitting the data, reviewing statistics, and examining graphs. Reviewing statistics is beyond the scope of this article; a few comments on best practices for splitting the data and examining graphs follow.
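
The law-of-large-numbers point can be illustrated with simulated claims (the 5 percent frequency and $10,000 severity below are assumptions that simply mirror the earlier example):

```python
import numpy as np

rng = np.random.default_rng(42)
frequency, severity = 0.05, 10_000.0          # assumed, echoing the example above
true_pure_premium = frequency * severity      # $500

for n_policies in (100, 10_000, 1_000_000):
    claims = rng.binomial(1, frequency, n_policies) * severity
    print(f"{n_policies:>9,} policies: average pure premium ${claims.mean():,.0f} "
          f"(true value ${true_pure_premium:,.0f})")
```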

Splitting the data

Data scientists often split the data into three subsets. They may use different names for these subsets, including training, testing, and holdout or validation data sets—there’s no official naming convention. Training data may be used to fit several models, which are then evaluated and compared using the testing data. The final selected model is then evaluated on the holdout or validation data. A key point is to evaluate models using data other than the data used to build them. Sometimes the available data set is not large enough to split three ways; with small data sets, the analytics team may use a technique called k-fold cross-validation.

For example, in 10-fold cross-validation, the data is split into ten subsets. Nine subsets are used to train the model, and one is used to test it. This procedure is repeated ten times, so that each of the ten subsets is used exactly once as the testing subset. When modeling data spans several years, data scientists will often choose training data from one set of years and test with a different set of years. Additionally, once the model is in production, data from the latest period (quarter or year, depending on volume) can be used as a test data set to monitor the model’s performance.
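
A minimal sketch of both approaches, using scikit-learn (one common tool; the 70/20/10 proportions and the synthetic data are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 5))           # hypothetical predictors
y = rng.gamma(1.0, 500.0, size=1_000)     # hypothetical pure premiums

# Three-way split: 70% training, 20% testing, 10% holdout.
X_rest, X_holdout, y_rest, y_holdout = train_test_split(X, y, test_size=0.10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=2 / 9, random_state=0)

# 10-fold cross-validation when the data set is too small to split three ways.
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    # Fit a candidate model on X[train_idx], y[train_idx];
    # evaluate it on X[test_idx], y[test_idx].
    pass
```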

Examining graphs

Analytics teams review many types of graphs to gauge model performance. Some apply only to specific types of models and will not be discussed here. Other plots, such as lift charts, are applicable more generally. A closer look at lift charts follows, in the context of pure premium models.

Pure premium lift charts are a graphical representation of a model’s ability to separate policyholders with low expected pure premium from those with high expected pure premium. To construct them, data scientists sort the data by predicted pure premium and then divide it into groups having equal exposure (or as close to equal exposure as possible). Then, for each group, they calculate the group’s relativity to the overall average: the group’s average actual pure premium divided by the overall average actual pure premium. Thus, groups containing low-cost policyholders will have a relativity below one, and those containing high-cost policyholders will have a relativity above one.
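
A minimal sketch of that calculation, assuming one exposure per policy so that “equal exposure” reduces to equal-sized groups:

```python
import numpy as np

def lift_relativities(predicted, actual, n_groups=10):
    """Sort policies by predicted pure premium, split them into (nearly) equal-sized
    groups, and return each group's actual-to-overall-average relativity."""
    order = np.argsort(predicted)
    actual_sorted = np.asarray(actual, dtype=float)[order]
    overall_avg = actual_sorted.mean()
    return [group.mean() / overall_avg for group in np.array_split(actual_sorted, n_groups)]
```

Relativities below one in the low-predicted groups and above one in the high-predicted groups indicate the model separates low-cost from high-cost policyholders.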

Lift charts and overfitted models

A model that shows a lot of lift on training data but significantly less lift on testing or holdout data is quite likely overfitted. For example, in the hypothetical charts that follow, there’s a significant difference between the spread of relativities for the training data and the spread for the testing data. This suggests the model may not generalize well.

[Charts: Hypothetical Lift Chart - Training Data; Hypothetical Lift Chart - Testing Data]

Lift charts and models that generalize well

When a model generalizes well, the lift on training and test (or holdout) samples is similar. For example, in the hypothetical charts that follow, approximately the same spread of relativities exists for the testing data as for the training data. This indicates the model generalizes well.

[Charts: Hypothetical Lift Chart - Training Data; Hypothetical Lift Chart - Testing Data]
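
Continuing the lift_relativities sketch above, one quick check is to compare the spread of relativities on the two samples (pred_train, actual_train, pred_test, and actual_test are hypothetical arrays of predicted and actual pure premiums):

```python
rel_train = lift_relativities(pred_train, actual_train)
rel_test = lift_relativities(pred_test, actual_test)
print("training spread:", max(rel_train) - min(rel_train))
print("testing spread: ", max(rel_test) - min(rel_test))
# A much wider spread on the training data than on the testing data suggests
# overfitting; similar spreads suggest the model generalizes well.
```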

Lift charts and performance improvement

A model that shows more lift than the current rating plan can likely lead to better business results. For example, in the hypothetical charts that follow, which are based on holdout data, the spread of relativities for the current rating plan is not as wide as the spread for the new model. This implies the new model could help the company price some policyholders more accurately, thus helping to avoid adverse selection and improve business results.

[Charts: Holdout Data - Hypothetical Current Rate Plan; Holdout Data - Hypothetical New Model]

In summary, models can be overfitted to the training data, and they may not work well on other data (they may not generalize well). Implementing a model that does not generalize well is a waste of money, as it’s unlikely to price more accurately and improve business results. The larger the number of predictors relative to the size of the modeling data set, the larger the risk of overfitting. Having a sufficiently large data set can help avoid overfitting. Other ways to reduce the risk of overfitting include splitting the data into training, test, and holdout samples and testing models on data that was not used to build them. Lift charts can be useful illustrations of how well a model generalizes from training to test or holdout data, and they can also provide a useful comparison to the current rating plan.

Hernan L. Medina, CPCU, is senior principal data scientist at ISO Solutions, a Verisk business.