Predicting Car Insurance Claims

Data Cleaning and Logistic Regression

Author

Ibrahima Fikry Diallo

Published

September 21, 2024

Introduction

Machine learning has become increasingly important in the insurance industry, which heavily relies on data for decision-making.

In this project, I analyzed customer data to help predict whether a claim will be made against a policy. The goal is to identify the single most predictive feature for building a logistic regression model that would allow the company to estimate this likelihood accurately.

Insurance companies invest considerable resources into refining their pricing strategies and improving claim predictions. Since car insurance is mandatory in many countries, the market is very large, and these predictions can have a significant impact on business outcomes.


Investigating and cleaning the data

Client Data Description

We have been supplied with customer data as a CSV file called car_insurance.csv, along with a table detailing the column names and descriptions below.

Column Description
id Unique client identifier
age Client’s age:
0: 16-25
1: 26-39
2: 40-64
3: 65+
gender Client’s gender:
0: Female
1: Male
driving_experience Years the client has been driving:
0: 0-9
1: 10-19
2: 20-29
3: 30+
education Client’s level of education:
0: No education
2: High school
3: University
income Client’s income level:
0: Poverty
1: Working class
2: Middle class
3: Upper class
credit_score Client’s credit score (between zero and one)
vehicle_ownership Client’s vehicle ownership status:
0: Paying off finance
1: Owns their vehicle
vehicle_year Year of vehicle registration:
0: Before 2015
1: 2015 or later
married Client’s marital status:
0: Not married
1: Married
children Client’s number of children
postal_code Client’s postal code
annual_mileage Number of miles driven by the client each year
vehicle_type Type of car:
0: Sedan
1: Sports car
speeding_violations Total number of speeding violations received by the client
duis Number of times the client has been caught driving under the influence of alcohol
past_accidents Total number of previous accidents the client has been involved in
outcome Whether the client made a claim on their car insurance:
0: No claim
1: Made a claim

Reading the dataset

library(readr)
library(gt)
library(dplyr)


# reading the dataset
Cars <- read_csv("car_insurance.csv")

# creating custom table using gt library
Cars %>% head(10) %>% gt() %>%
    tab_header(title = md("**customer data**"),
               subtitle = md("First 10 elements"))
1. readr: Fast and user-friendly package for reading tabular data into R.
2. gt: Elegant and user-friendly package for creating and customizing tables in R.
3. dplyr: Essential toolkit for data manipulation with intuitive functions.

customer data

First 10 elements

id age gender race driving_experience education income credit_score vehicle_ownership vehicle_year married children postal_code annual_mileage vehicle_type speeding_violations duis past_accidents outcome
569520 3 0 1 0 2 3 0.6290273 1 1 0 1 10238 12000 0 0 0 0 0
750365 0 1 1 0 0 0 0.3577571 0 0 0 0 10238 16000 0 0 0 0 1
199901 0 0 1 0 2 1 0.4931458 1 0 0 0 10238 11000 0 0 0 0 0
478866 0 1 1 0 3 1 0.2060129 1 0 0 1 32765 11000 0 0 0 0 0
731664 1 1 1 1 0 1 0.3883659 1 0 0 0 32765 12000 0 2 0 1 1
877557 2 0 1 2 2 3 0.6191274 1 1 0 1 10238 13000 0 3 0 3 0
930134 3 1 1 3 2 3 0.4929436 0 1 1 1 10238 13000 0 7 0 3 0
461006 1 0 1 0 3 1 0.4686893 0 1 0 1 10238 14000 0 0 0 0 1
68366 2 0 1 2 3 1 0.5218149 0 0 1 0 10238 13000 0 0 0 0 0
445911 2 0 1 0 2 3 0.5615310 1 0 0 1 32765 11000 0 0 0 0 1

View data types

# Display the structure of the dataset in a readable format
str(Cars, vec.len = 1, give.attr = FALSE)
spc_tbl_ [10,000 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ id                 : num [1:10000] 569520 ...
 $ age                : num [1:10000] 3 0 ...
 $ gender             : num [1:10000] 0 1 ...
 $ race               : num [1:10000] 1 1 ...
 $ driving_experience : num [1:10000] 0 0 ...
 $ education          : num [1:10000] 2 0 ...
 $ income             : num [1:10000] 3 0 ...
 $ credit_score       : num [1:10000] 0.629 ...
 $ vehicle_ownership  : num [1:10000] 1 0 ...
 $ vehicle_year       : num [1:10000] 1 0 ...
 $ married            : num [1:10000] 0 0 ...
 $ children           : num [1:10000] 1 0 ...
 $ postal_code        : num [1:10000] 10238 ...
 $ annual_mileage     : num [1:10000] 12000 16000 ...
 $ vehicle_type       : num [1:10000] 0 0 ...
 $ speeding_violations: num [1:10000] 0 0 ...
 $ duis               : num [1:10000] 0 0 ...
 $ past_accidents     : num [1:10000] 0 0 ...
 $ outcome            : num [1:10000] 0 1 ...

Missing values per column

colSums(is.na(Cars))

                 id                 age              gender                race 
                  0                   0                   0                   0 
 driving_experience           education              income        credit_score 
                  0                   0                   0                 982 
  vehicle_ownership        vehicle_year             married            children 
                  0                   0                   0                   0 
        postal_code      annual_mileage        vehicle_type speeding_violations 
                  0                 957                   0                   0 
               duis      past_accidents             outcome 
                  0                   0                   0 

library(DataExplorer)

plot_missing(Cars)

1. DataExplorer: Automated and easy-to-use package for exploratory data analysis and reporting in R.

Figure: Missing values per column in %.

Handling missing values

The variables with missing values, annual_mileage and credit_score, are both continuous, and the proportion of missing data is small, making the mean a potentially appropriate central estimate for imputation.

By imputing missing values rather than removing entire rows, we retain a larger portion of the dataset for analysis, which ensures that the model is trained on as much information as possible.

Let us first check the distribution of these variables.

Distribution of credit_score

library(ggplot2)
library(hrbrthemes)

Cars %>%
    ggplot(aes(x = credit_score)) +
    geom_density(fill = "#151931", color = "#e9ecef", alpha = 0.9) +
    ggtitle("Distribution of credit_score") +
    theme_ipsum()
1. ggplot2: Powerful and flexible package for creating advanced and customizable data visualizations in R.
2. hrbrthemes: Minimal and modern ggplot2 themes for creating visually appealing charts in R.

Figure: Approximately normally distributed credit_score.

summary(Cars$credit_score)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0534  0.4172  0.5250  0.5158  0.6183  0.9608     982 

Distribution of annual_mileage

Cars %>%
    ggplot(aes(x = annual_mileage)) +
    geom_density(fill = "#69b3a2", color = "#e9ecef", alpha = 0.8) +
    ggtitle("Distribution of annual_mileage") +
    theme_ipsum()

Figure: Approximately normally distributed annual_mileage.

summary(Cars$annual_mileage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   2000   10000   12000   11697   14000   22000     957 

The variables are approximately normally distributed, allowing the mean to represent the central tendency without significantly skewing the results.
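
Had either distribution been noticeably skewed, the median would be the more robust central estimate. For reference, a hypothetical median-based version of the same imputation (not used here) would read:

# Alternative (not used here): median imputation, which is more robust to skew
Cars$credit_score[is.na(Cars$credit_score)] <- median(Cars$credit_score, na.rm = TRUE)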

Fill missing values with the mean

Cars$credit_score[is.na(Cars$credit_score)] <- mean(Cars$credit_score, na.rm = TRUE)
Cars$annual_mileage[is.na(Cars$annual_mileage)] <- mean(Cars$annual_mileage, na.rm = TRUE)
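
As a quick sanity check, re-running the missing-value count should now return zero for both imputed columns:

# Confirm that no missing values remain in the imputed columns
colSums(is.na(Cars[c("credit_score", "annual_mileage")]))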

Encoding categorical variables?

Upon examining the dataset, we found that the categorical variables, including gender, race, driving_experience, education, and vehicle_type, were already numerically encoded. Since these features are stored as consistent integer codes, additional encoding (e.g., one-hot encoding) was not required, and the dataset is ready for analysis and modeling as-is.

For example:
- gender: 0 for Female, 1 for Male
- vehicle_ownership: 0 for Paying off finance, 1 for Owns their vehicle
- education: 0 for No education, 2 for High school, 3 for University

This simplified the preprocessing steps and allowed the focus to shift towards handling other aspects, such as scaling continuous variables and addressing any potential missing values.
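
That said, if we wanted the categorical nature of these columns to be explicit, a minimal sketch (assuming the column names above) could convert them to factors with dplyr:

# Optional: treat the coded columns as factors so R handles them as categorical.
# Not required for the single-feature models below, which use the integer codes directly.
categorical_cols <- c("gender", "driving_experience", "education",
                      "income", "vehicle_type")
Cars_factored <- Cars %>%
    mutate(across(all_of(categorical_cols), as.factor))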


Modeling

Building the models

About Logistic Regression Model:
Logistic regression is a widely used statistical method for binary classification problems—where the outcome is either “yes” (claim made) or “no” (no claim).

How it works:
Logistic regression models the relationship between the dependent variable (the outcome) and one or more independent variables (features) by estimating probabilities. The model outputs a probability value between 0 and 1, which can then be classified into binary categories (e.g., claim or no claim).
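
Concretely, for a single feature x, the model estimates

\[ P(\text{outcome} = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \]

where the coefficients are estimated from the data by maximum likelihood; fitted probabilities above 0.5 are classified as a claim.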

Why it’s suitable:
In our case, the outcome variable is binary (whether a claim is made or not), making logistic regression an ideal choice. It helps us evaluate how changes in each feature influence the likelihood of a claim, giving us a clear understanding of the predictors’ importance.

For this project, we use logistic regression to predict the probability of a car insurance claim from individual features. In this section, we systematically analyze the impact of each feature on the claim outcome, with the goal of evaluating how well each one predicts whether a claim will be made.

library(glue)
library(yardstick)

# Create a dataframe to store features
features_df <- data.frame(features = c(names(subset(Cars, select = -c(id, outcome)))))

# Empty vector to store accuracies
accuracies <- c()

# Loop through features
for (col in features_df$features) {
        
        # Create a model
        model <- glm(glue('outcome ~ {col}'), data = Cars, family = 'binomial')
        
        # Get prediction values for the model
        predictions <- round(fitted(model))
        
        # Calculate accuracy
        accuracy <- length(which(predictions == Cars$outcome)) / length(Cars$outcome)
        
        # Add accuracy to features_df
        features_df[which(features_df$features == col), "accuracy"] <- accuracy
}
1. glue: Simple and flexible string interpolation using embedded R code.
2. yardstick: Comprehensive package for measuring model performance with a wide range of evaluation metrics in R (loaded here, though the accuracy below is computed manually).
3. Feature selection: We create a dataframe that stores all the relevant features from the dataset, excluding the identifier id and the target variable outcome. This allows us to focus on the predictors without biasing the model.
4. Model creation: For each feature, we build a logistic regression model. Using glm() with a binomial family, we fit a series of models where the outcome is predicted from one feature at a time, isolating the effect of each individual feature on claim outcomes.
5. Prediction and accuracy calculation: After fitting each model, we generate fitted probabilities and round them to classify the outcome (claim or no claim). We then calculate accuracy by comparing the predictions to the actual outcomes: the proportion of correct predictions out of the total number of cases.
6. Storing results: Once we obtain the accuracy for each feature, we store it in the dataframe, which lets us compare the predictive power of each feature side by side and identify which ones contribute the most to accurate predictions.
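
To make the loop concrete, here is what a single iteration amounts to for one feature (using driving_experience as an example):

# One iteration of the loop, written out for a single feature
model <- glm(outcome ~ driving_experience, data = Cars, family = "binomial")
predictions <- round(fitted(model))
mean(predictions == Cars$outcome)  # accuracy of this single-feature model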

Finding the feature with the highest accuracy

We calculate accuracy as the proportion of correct predictions out of the total number of observations in the dataset. Specifically, we count how many times the model’s prediction (either 0 or 1) matches the actual outcome and divide this by the total number of cases.

\[ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Observations}} \]
# Find the feature with the largest accuracy
best_feature <- features_df$features[which.max(features_df$accuracy)]
best_accuracy <- max(features_df$accuracy)

# Create best_feature_df
best_feature_df <- data.frame(best_feature, best_accuracy)

# Run in a new cell to check your solution
best_feature_df
        best_feature best_accuracy
1 driving_experience        0.7771

After evaluating the predictive power of each feature, we found that driving_experience emerged as the most accurate predictor of car insurance claims, with an accuracy of 0.7771 (approximately 77.7%).

This means that using only the driving_experience variable, the model correctly predicted whether a claim was made or not in about 77.7% of the cases. This high accuracy suggests that driving experience plays a significant role in determining the likelihood of filing a claim. It aligns with the intuition that more experienced drivers may have a better understanding of road safety and accident prevention, leading to fewer insurance claims.