Machine learning has become increasingly important in the insurance industry, which heavily relies on data for decision-making.
In this project, I analyzed customer data to help predict whether a claim will be made against a policy. The goal is to identify the most impactful single feature for building a logistic regression model, allowing the company to estimate this likelihood accurately.
Insurance companies invest considerable resources into refining their pricing strategies and improving claim predictions. Since car insurance is mandatory in many countries, the market is very large, and these predictions can have a significant impact on business outcomes.
Investigating and cleaning the data
Client Data Description
We have been supplied with customer data as a CSV file called car_insurance.csv, along with a table detailing the column names and descriptions, shown below.
| Column | Description |
|--------|-------------|
| id | Unique client identifier |
| age | Client’s age: 0: 16-25, 1: 26-39, 2: 40-64, 3: 65+ |
| gender | Client’s gender: 0: Female, 1: Male |
| driving_experience | Years the client has been driving: 0: 0-9, 1: 10-19, 2: 20-29, 3: 30+ |
| education | Client’s level of education: 0: No education, 2: High school, 3: University |
| income | Client’s income level: 0: Poverty, 1: Working class, 2: Middle class, 3: Upper class |
| credit_score | Client’s credit score (between zero and one) |
| vehicle_ownership | Client’s vehicle ownership status: 0: Paying off finance, 1: Owns their vehicle |
| vehicle_year | Year of vehicle registration: 0: Before 2015, 1: 2015 or later |
| married | Client’s marital status: 0: Not married, 1: Married |
| children | Client’s number of children |
| postal_code | Client’s postal code |
| annual_mileage | Number of miles driven by the client each year |
| vehicle_type | Type of car: 0: Sedan, 1: Sports car |
| speeding_violations | Total number of speeding violations received by the client |
| duis | Number of times the client has been caught driving under the influence of alcohol |
| past_accidents | Total number of previous accidents the client has been involved in |
| outcome | Whether the client made a claim on their car insurance: 0: No claim, 1: Made a claim |
Reading the dataset
```r
library(readr)
library(gt)
library(dplyr)

# Read the dataset
Cars <- read_csv("car_insurance.csv")

# Create a custom table using the gt package
Cars %>%
  head(10) %>%
  gt() %>%
  tab_header(
    title    = md("**Customer data**"),
    subtitle = md("First 10 rows")
  )
```
1. readr: Fast and user-friendly package for reading tabular data into R.
2. gt: Elegant and user-friendly package for creating and customizing tables in R.
3. dplyr: Essential toolkit for data manipulation with intuitive functions.
**Customer data** (first 10 rows):

| id | age | gender | race | driving_experience | education | income | credit_score | vehicle_ownership | vehicle_year | married | children | postal_code | annual_mileage | vehicle_type | speeding_violations | duis | past_accidents | outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 569520 | 3 | 0 | 1 | 0 | 2 | 3 | 0.6290273 | 1 | 1 | 0 | 1 | 10238 | 12000 | 0 | 0 | 0 | 0 | 0 |
| 750365 | 0 | 1 | 1 | 0 | 0 | 0 | 0.3577571 | 0 | 0 | 0 | 0 | 10238 | 16000 | 0 | 0 | 0 | 0 | 1 |
| 199901 | 0 | 0 | 1 | 0 | 2 | 1 | 0.4931458 | 1 | 0 | 0 | 0 | 10238 | 11000 | 0 | 0 | 0 | 0 | 0 |
| 478866 | 0 | 1 | 1 | 0 | 3 | 1 | 0.2060129 | 1 | 0 | 0 | 1 | 32765 | 11000 | 0 | 0 | 0 | 0 | 0 |
| 731664 | 1 | 1 | 1 | 1 | 0 | 1 | 0.3883659 | 1 | 0 | 0 | 0 | 32765 | 12000 | 0 | 2 | 0 | 1 | 1 |
| 877557 | 2 | 0 | 1 | 2 | 2 | 3 | 0.6191274 | 1 | 1 | 0 | 1 | 10238 | 13000 | 0 | 3 | 0 | 3 | 0 |
| 930134 | 3 | 1 | 1 | 3 | 2 | 3 | 0.4929436 | 0 | 1 | 1 | 1 | 10238 | 13000 | 0 | 7 | 0 | 3 | 0 |
| 461006 | 1 | 0 | 1 | 0 | 3 | 1 | 0.4686893 | 0 | 1 | 0 | 1 | 10238 | 14000 | 0 | 0 | 0 | 0 | 1 |
| 68366 | 2 | 0 | 1 | 2 | 3 | 1 | 0.5218149 | 0 | 0 | 1 | 0 | 10238 | 13000 | 0 | 0 | 0 | 0 | 0 |
| 445911 | 2 | 0 | 1 | 0 | 2 | 3 | 0.5615310 | 1 | 0 | 0 | 1 | 32765 | 11000 | 0 | 0 | 0 | 0 | 1 |
View data types
```r
# Display the structure of the dataset in a readable format
str(Cars, vec.len = 1, give.attr = FALSE)
```
spc_tbl_ [10,000 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ id : num [1:10000] 569520 ...
$ age : num [1:10000] 3 0 ...
$ gender : num [1:10000] 0 1 ...
$ race : num [1:10000] 1 1 ...
$ driving_experience : num [1:10000] 0 0 ...
$ education : num [1:10000] 2 0 ...
$ income : num [1:10000] 3 0 ...
$ credit_score : num [1:10000] 0.629 ...
$ vehicle_ownership : num [1:10000] 1 0 ...
$ vehicle_year : num [1:10000] 1 0 ...
$ married : num [1:10000] 0 0 ...
$ children : num [1:10000] 1 0 ...
$ postal_code : num [1:10000] 10238 ...
$ annual_mileage : num [1:10000] 12000 16000 ...
$ vehicle_type : num [1:10000] 0 0 ...
$ speeding_violations: num [1:10000] 0 0 ...
$ duis : num [1:10000] 0 0 ...
$ past_accidents : num [1:10000] 0 0 ...
$ outcome : num [1:10000] 0 1 ...
Next, we check for missing data. The output below shows the number of missing values per column:
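The code cell that produced this check is not preserved in the post; a minimal base-R equivalent that yields the same named-vector output is:

```r
# Count missing (NA) values in each column
colSums(is.na(Cars))
```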
id age gender race
0 0 0 0
driving_experience education income credit_score
0 0 0 982
vehicle_ownership vehicle_year married children
0 0 0 0
postal_code annual_mileage vehicle_type speeding_violations
0 957 0 0
duis past_accidents outcome
0 0 0
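If percentages are preferred instead of counts, one option is:

```r
# Share of missing values per column, in percent
round(colMeans(is.na(Cars)) * 100, 2)
```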
Handling missing values
The variables with missing values, annual_mileage and credit_score, are continuous, and the proportion of missing data is small (just under 10% of rows each), making the mean an appropriate central estimate for imputation.

By imputing missing values rather than removing entire rows, we retain a larger portion of the dataset for analysis, which ensures that the model is trained on as much information as possible.

Let us first check the distribution of these variables.
Distribution of credit_score
```r
library(ggplot2)
library(hrbrthemes)

Cars %>%
  ggplot(aes(x = credit_score)) +
  geom_density(fill = "#151931", color = "#e9ecef", alpha = 0.9) +
  ggtitle("Distribution of credit_score") +
  theme_ipsum()
```
1. ggplot2: Powerful and flexible package for creating advanced and customizable data visualizations in R.
2. hrbrthemes: Minimal and modern ggplot2 themes for creating visually appealing charts in R.
(Figure: density plot of credit_score, which is approximately normally distributed.)
summary(Cars$credit_score)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0534 0.4172 0.5250 0.5158 0.6183 0.9608 982
Distribution of annual_mileage
```r
Cars %>%
  ggplot(aes(x = annual_mileage)) +
  geom_density(fill = "#69b3a2", color = "#e9ecef", alpha = 0.8) +
  ggtitle("Distribution of annual_mileage") +
  theme_ipsum()
```
(Figure: density plot of annual_mileage, which is approximately normally distributed.)
summary(Cars$annual_mileage)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
2000 10000 12000 11697 14000 22000 957
Both variables are approximately normally distributed, so the mean represents their central tendency well, and imputing it will not significantly skew the results.
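The imputation step itself is not shown in the post; a minimal sketch of mean imputation with dplyr (assuming Cars as loaded above) is:

```r
# Replace missing values with each column's mean
Cars <- Cars %>%
  mutate(
    credit_score   = ifelse(is.na(credit_score),
                            mean(credit_score, na.rm = TRUE),
                            credit_score),
    annual_mileage = ifelse(is.na(annual_mileage),
                            mean(annual_mileage, na.rm = TRUE),
                            annual_mileage)
  )
```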
Upon examining the dataset, we found that the categorical variables, including gender, race, driving_experience, education, and vehicle_type, were already numerically encoded. Since these features are represented as integers, additional encoding (e.g., one-hot encoding) was not required, and the dataset was ready for analysis and modeling as supplied. For example:

- gender: 0 for Female, 1 for Male
- vehicle_ownership: 0 for Paying off finance, 1 for Owns their vehicle
- education: 0 for No education, 2 for High school, 3 for University
This simplified the preprocessing and allowed the focus to shift to other steps, such as scaling continuous variables and handling the missing values described above.
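As a quick, illustrative check (not part of the original analysis), the integer coding can be confirmed by listing the distinct values of a few of these columns:

```r
# List the distinct codes used by selected categorical columns
sapply(Cars[, c("gender", "education", "income", "vehicle_type")],
       function(x) sort(unique(x)))
```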
Modeling
Building the models
About the logistic regression model: logistic regression is a widely used statistical method for binary classification problems, where the outcome is either “yes” (claim made) or “no” (no claim).
How it works: Logistic regression models the relationship between the dependent variable (the outcome) and one or more independent variables (features) by estimating probabilities. The model outputs a probability value between 0 and 1, which can then be classified into binary categories (e.g., claim or no claim).
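Concretely, for a single feature \(x\), the fitted model estimates

\[
P(\text{outcome} = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}},
\]

where \(\beta_0\) and \(\beta_1\) are coefficients estimated from the data; a probability of 0.5 or more is classified as a claim.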
Why it’s suitable: In our case, the outcome variable is binary (whether a claim is made or not), making logistic regression an ideal choice. It helps us evaluate how changes in each feature influence the likelihood of a claim, giving us a clear understanding of the predictors’ importance.
For this project, we use a logistic regression model to predict the probability of a car insurance claim based on individual features. In this section, we systematically analyze the impact of each individual feature on the claim outcome, evaluating how well each one alone predicts whether a claim will be made.
```r
library(glue)
library(yardstick)

# Create a dataframe to store the candidate features
features_df <- data.frame(
  features = names(subset(Cars, select = -c(id, outcome)))
)

# Loop through the features
for (col in features_df$features) {
  # Fit a logistic regression model on this feature alone
  model <- glm(glue("outcome ~ {col}"), data = Cars, family = "binomial")
  # Round the fitted probabilities into class predictions (threshold 0.5)
  predictions <- round(fitted(model))
  # Accuracy: proportion of predictions that match the actual outcomes
  accuracy <- length(which(predictions == Cars$outcome)) / length(Cars$outcome)
  # Store the accuracy alongside the feature
  features_df[which(features_df$features == col), "accuracy"] <- accuracy
}
```
1. glue: Simple and flexible string interpolation using embedded R code.
2. yardstick: Comprehensive package for measuring model performance with a wide range of evaluation metrics in R (see the note below this list).
3. Feature selection: First, we create a dataframe holding all the relevant features, excluding the identifier id and the target variable outcome, so the predictors can be evaluated without leaking the target into the model.
4. Model creation: For each feature, we build a logistic regression model with glm() and a binomial family, predicting the outcome from one feature at a time. This isolates the effect of each individual feature on claim outcomes.
5. Prediction and accuracy calculation: After fitting each model, we round the fitted probabilities to classify each case as claim or no claim, then compute accuracy as the proportion of predictions that match the actual outcomes.
6. Storing results: The accuracy of each feature is stored in the dataframe, allowing a side-by-side comparison of predictive power to identify which features contribute most to accurate predictions.
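Note that yardstick is loaded but not actually used in the loop above. For reference, a sketch of the same accuracy computation using its accuracy_vec() helper, applied to the last model fitted in the loop (accuracy_vec() expects factor inputs):

```r
# Equivalent accuracy via yardstick for one fitted model
accuracy_vec(
  truth    = factor(Cars$outcome, levels = c(0, 1)),
  estimate = factor(round(fitted(model)), levels = c(0, 1))
)
```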
Finding the feature with the largest accuracy
We calculate accuracy as the proportion of correct predictions out of the total number of observations in the dataset. Specifically, we count how many times the model’s prediction (either 0 or 1) matches the actual outcome and divide this by the total number of cases.
\[
\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Observations}}
\]
```r
# Find the feature with the highest accuracy
best_feature <- features_df$features[which.max(features_df$accuracy)]
best_accuracy <- max(features_df$accuracy)

# Combine into a one-row dataframe
best_feature_df <- data.frame(best_feature, best_accuracy)

# Inspect the result
best_feature_df
```
After evaluating the predictive power of each feature, we found that driving_experience was the most accurate single predictor of car insurance claims, with an accuracy of 0.7771 (about 77.7%).

In other words, using only the driving_experience variable, the model correctly predicted whether a claim was made in about 77.7% of cases. This suggests that driving experience plays a significant role in determining the likelihood of filing a claim, and it aligns with the intuition that more experienced drivers may have a better understanding of road safety and accident prevention, leading to fewer insurance claims.
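A natural next step, not shown here, is to fit the chosen single-feature model on its own and inspect the estimated coefficients:

```r
# Fit and inspect the best single-feature model (illustrative)
best_model <- glm(outcome ~ driving_experience, data = Cars, family = "binomial")
summary(best_model)
```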