Predicting Car Insurance Claims

Data Cleaning and Logistic Regression

Author

Ibrahima Fikry Diallo

Published

September 21, 2024

Introduction

Machine learning has become increasingly important in the insurance industry, which heavily relies on data for decision-making.

In this project, I analyzed customer data to help predict whether a claim will be made against a policy. The goal is to identify the single most predictive feature for building a logistic regression model that would allow the company to estimate this likelihood accurately.

Insurance companies invest considerable resources into refining their pricing strategies and improving claim predictions. Since car insurance is mandatory in many countries, the market is very large, and these predictions can have a significant impact on business outcomes.


Investigating and cleaning the data

Client Data Description

We have been supplied with customer data as a CSV file called car_insurance.csv, along with a table detailing the column names and descriptions below.

Column Description
id Unique client identifier
age Client’s age:
0: 16-25
1: 26-39
2: 40-64
3: 65+
gender Client’s gender:
0: Female
1: Male
driving_experience Years the client has been driving:
0: 0-9
1: 10-19
2: 20-29
3: 30+
education Client’s level of education:
0: No education
2: High school
3: University
income Client’s income level:
0: Poverty
1: Working class
2: Middle class
3: Upper class
credit_score Client’s credit score (between zero and one)
vehicle_ownership Client’s vehicle ownership status:
0: Paying off finance
1: Owns their vehicle
vehicle_year Year of vehicle registration:
0: Before 2015
1: 2015 or later
married Client’s marital status:
0: Not married
1: Married
children Client’s number of children
postal_code Client’s postal code
annual_mileage Number of miles driven by the client each year
vehicle_type Type of car:
0: Sedan
1: Sports car
speeding_violations Total number of speeding violations received by the client
duis Number of times the client has been caught driving under the influence of alcohol
past_accidents Total number of previous accidents the client has been involved in
outcome Whether the client made a claim on their car insurance:
0: No claim
1: Made a claim

Reading the dataset

library(readr)
library(gt)
library(dplyr)


# reading the dataset
Cars <- read_csv("car_insurance.csv")

# creating custom table using gt library
Cars %>% head(10) %>% gt() %>%
    tab_header(title = md("**customer data**"),
               subtitle = md("First 10 elements"))
1. readr: Fast and user-friendly package for reading tabular data into R.
2. gt: Elegant and user-friendly package for creating and customizing tables in R.
3. dplyr: Essential toolkit for data manipulation with intuitive functions.

customer data

First 10 elements

id age gender race driving_experience education income credit_score vehicle_ownership vehicle_year married children postal_code annual_mileage vehicle_type speeding_violations duis past_accidents outcome
569520 3 0 1 0 2 3 0.6290273 1 1 0 1 10238 12000 0 0 0 0 0
750365 0 1 1 0 0 0 0.3577571 0 0 0 0 10238 16000 0 0 0 0 1
199901 0 0 1 0 2 1 0.4931458 1 0 0 0 10238 11000 0 0 0 0 0
478866 0 1 1 0 3 1 0.2060129 1 0 0 1 32765 11000 0 0 0 0 0
731664 1 1 1 1 0 1 0.3883659 1 0 0 0 32765 12000 0 2 0 1 1
877557 2 0 1 2 2 3 0.6191274 1 1 0 1 10238 13000 0 3 0 3 0
930134 3 1 1 3 2 3 0.4929436 0 1 1 1 10238 13000 0 7 0 3 0
461006 1 0 1 0 3 1 0.4686893 0 1 0 1 10238 14000 0 0 0 0 1
68366 2 0 1 2 3 1 0.5218149 0 0 1 0 10238 13000 0 0 0 0 0
445911 2 0 1 0 2 3 0.5615310 1 0 0 1 32765 11000 0 0 0 0 1

View data types

# Display the structure of the dataset in a readable format
str(Cars, vec.len = 1, give.attr = FALSE)
spc_tbl_ [10,000 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ id                 : num [1:10000] 569520 ...
 $ age                : num [1:10000] 3 0 ...
 $ gender             : num [1:10000] 0 1 ...
 $ race               : num [1:10000] 1 1 ...
 $ driving_experience : num [1:10000] 0 0 ...
 $ education          : num [1:10000] 2 0 ...
 $ income             : num [1:10000] 3 0 ...
 $ credit_score       : num [1:10000] 0.629 ...
 $ vehicle_ownership  : num [1:10000] 1 0 ...
 $ vehicle_year       : num [1:10000] 1 0 ...
 $ married            : num [1:10000] 0 0 ...
 $ children           : num [1:10000] 1 0 ...
 $ postal_code        : num [1:10000] 10238 ...
 $ annual_mileage     : num [1:10000] 12000 16000 ...
 $ vehicle_type       : num [1:10000] 0 0 ...
 $ speeding_violations: num [1:10000] 0 0 ...
 $ duis               : num [1:10000] 0 0 ...
 $ past_accidents     : num [1:10000] 0 0 ...
 $ outcome            : num [1:10000] 0 1 ...

Missing values per column

colSums(is.na(Cars))

                 id                 age              gender                race 
                  0                   0                   0                   0 
 driving_experience           education              income        credit_score 
                  0                   0                   0                 982 
  vehicle_ownership        vehicle_year             married            children 
                  0                   0                   0                   0 
        postal_code      annual_mileage        vehicle_type speeding_violations 
                  0                 957                   0                   0 
               duis      past_accidents             outcome 
                  0                   0                   0 

library(DataExplorer)

plot_missing(Cars)

1. DataExplorer: Automated and easy-to-use package for exploratory data analysis and reporting in R.

Figure: Missing values per column in %.

Handling missing values

The variables with missing values, annual_mileage and credit_score, are both continuous, and the proportion of missing data is small, making the mean a potentially appropriate central estimate for imputation.

By imputing missing values rather than removing entire rows, we retain a larger portion of the dataset for analysis, which ensures that the model is trained on as much information as possible.

Let us first check the distribution of these variables.

Distribution of credit_score

library(ggplot2)
library(hrbrthemes)

Cars %>%
    ggplot(aes(x = credit_score)) +
    geom_density(fill = "#151931", color = "#e9ecef", alpha = 0.9) +
    ggtitle("Distribution of credit_score") +
    theme_ipsum()
1. ggplot2: Powerful and flexible package for creating advanced and customizable data visualizations in R.
2. hrbrthemes: Minimal and modern ggplot2 themes for creating visually appealing charts in R.

Figure: Approximately normally distributed credit_score.

summary(Cars$credit_score)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0534  0.4172  0.5250  0.5158  0.6183  0.9608     982 

Distribution of annual_mileage

Cars %>%
    ggplot(aes(x = annual_mileage)) +
    geom_density(fill = "#69b3a2", color = "#e9ecef", alpha = 0.8) +
    ggtitle("Distribution of annual_mileage") +
    theme_ipsum()

Figure: Approximately normally distributed annual_mileage.

summary(Cars$annual_mileage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   2000   10000   12000   11697   14000   22000     957 

The variables are approximately normally distributed, allowing the mean to represent the central tendency without significantly skewing the results.
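
Had either distribution been noticeably skewed, the median would be the more robust central estimate. For reference, a hypothetical median-based version of the same imputation (not used here) would read:

# Alternative (not used here): median imputation, which is more robust to skew
Cars$credit_score[is.na(Cars$credit_score)] <- median(Cars$credit_score, na.rm = TRUE)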

Fill missing values with the mean

Cars$credit_score[is.na(Cars$credit_score)] <- mean(Cars$credit_score, na.rm = TRUE)
Cars$annual_mileage[is.na(Cars$annual_mileage)] <- mean(Cars$annual_mileage, na.rm = TRUE)
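
As a quick sanity check, re-running the missing-value count should now return zero for both imputed columns:

# Confirm that no missing values remain in the imputed columns
colSums(is.na(Cars[c("credit_score", "annual_mileage")]))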

Encoding categorical variables?

Upon examining the dataset, we found that the categorical variables, including gender, race, driving_experience, education, and vehicle_type, were already numerically encoded. Since these features are stored as consistent integer codes, additional encoding (e.g., one-hot encoding) was not required, and the dataset is ready for analysis and modeling as-is.

For example:
- gender: 0 for Female, 1 for Male
- vehicle_ownership: 0 for Paying off finance, 1 for Owns their vehicle
- education: 0 for No education, 2 for High school, 3 for University

This simplified the preprocessing steps and allowed the focus to shift towards handling other aspects, such as scaling continuous variables and addressing any potential missing values.
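
That said, if we wanted the categorical nature of these columns to be explicit, a minimal sketch (assuming the column names above) could convert them to factors with dplyr:

# Optional: treat the coded columns as factors so R handles them as categorical.
# Not required for the single-feature models below, which use the integer codes directly.
categorical_cols <- c("gender", "driving_experience", "education",
                      "income", "vehicle_type")
Cars_factored <- Cars %>%
    mutate(across(all_of(categorical_cols), as.factor))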


Modeling

Building the models

About Logistic Regression Model:
Logistic regression is a widely used statistical method for binary classification problems—where the outcome is either “yes” (claim made) or “no” (no claim).

How it works:
Logistic regression models the relationship between the dependent variable (the outcome) and one or more independent variables (features) by estimating probabilities. The model outputs a probability value between 0 and 1, which can then be classified into binary categories (e.g., claim or no claim).
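
Concretely, for a single feature x, the model estimates

\[ P(\text{outcome} = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \]

where the coefficients are estimated from the data by maximum likelihood; fitted probabilities above 0.5 are classified as a claim.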

Why it’s suitable:
In our case, the outcome variable is binary (whether a claim is made or not), making logistic regression an ideal choice. It helps us evaluate how changes in each feature influence the likelihood of a claim, giving us a clear understanding of the predictors’ importance.

For this project, we use logistic regression to predict the probability of a car insurance claim from individual features. In this section, we systematically analyze the impact of each feature on the claim outcome, with the goal of evaluating how well each one predicts whether a claim will be made.

library(glue)
library(yardstick)

# Create a dataframe to store features
features_df <- data.frame(features = c(names(subset(Cars, select = -c(id, outcome)))))

# Empty vector to store accuracies
accuracies <- c()

# Loop through features
for (col in features_df$features) {
        
        # Create a model
        model <- glm(glue('outcome ~ {col}'), data = Cars, family = 'binomial')
        
        # Get prediction values for the model
        predictions <- round(fitted(model))
        
        # Calculate accuracy
        accuracy <- length(which(predictions == Cars$outcome)) / length(Cars$outcome)
        
        # Add accuracy to features_df
        features_df[which(features_df$features == col), "accuracy"] <- accuracy
}
1. glue: Simple and flexible string interpolation using embedded R code.
2. yardstick: Comprehensive package for measuring model performance with a wide range of evaluation metrics in R (loaded here, though the accuracy below is computed manually).
3. Feature selection: We create a dataframe that stores all the relevant features from the dataset, excluding the identifier id and the target variable outcome. This allows us to focus on the predictors without biasing the model.
4. Model creation: For each feature, we build a logistic regression model. Using glm() with a binomial family, we fit a series of models where the outcome is predicted from one feature at a time, isolating the effect of each individual feature on claim outcomes.
5. Prediction and accuracy calculation: After fitting each model, we generate fitted probabilities and round them to classify the outcome (claim or no claim). We then calculate accuracy by comparing the predictions to the actual outcomes: the proportion of correct predictions out of the total number of cases.
6. Storing results: Once we obtain the accuracy for each feature, we store it in the dataframe, which lets us compare the predictive power of each feature side by side and identify which ones contribute the most to accurate predictions.
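
To make the loop concrete, here is what a single iteration amounts to for one feature (using driving_experience as an example):

# One iteration of the loop, written out for a single feature
model <- glm(outcome ~ driving_experience, data = Cars, family = "binomial")
predictions <- round(fitted(model))
mean(predictions == Cars$outcome)  # accuracy of this single-feature model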

Finding the feature with the highest accuracy

We calculate accuracy as the proportion of correct predictions out of the total number of observations in the dataset. Specifically, we count how many times the model’s prediction (either 0 or 1) matches the actual outcome and divide this by the total number of cases.

\[ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Observations}} \]
# Find the feature with the largest accuracy
best_feature <- features_df$features[which.max(features_df$accuracy)]
best_accuracy <- max(features_df$accuracy)

# Create best_feature_df
best_feature_df <- data.frame(best_feature, best_accuracy)

# Run in a new cell to check your solution
best_feature_df
        best_feature best_accuracy
1 driving_experience        0.7771

After evaluating the predictive power of each feature, we found that driving_experience emerged as the most accurate predictor of car insurance claims, with an accuracy of 0.7771 (approximately 77.7%).

This means that using only the driving_experience variable, the model correctly predicted whether a claim was made or not in about 77.7% of the cases. This high accuracy suggests that driving experience plays a significant role in determining the likelihood of filing a claim. It aligns with the intuition that more experienced drivers may have a better understanding of road safety and accident prevention, leading to fewer insurance claims.