ANALYZING PERFORMANCE GAPS BETWEEN DEPARTMENTS

Exploratory Data Analysis

Author

Ibrahima Fikry Diallo

I. INTRODUCTION

I.1 Business problem: Unveiling departmental performance gaps

A company aims to analyze and understand performance gaps across its various departments. This involves exploring the relationships between qualitative variables (such as department affiliation and employee satisfaction) and quantitative variables (like performance scores and work hours). The ultimate goal is to identify areas for improvement and foster a more productive and equitable work environment.

I.2 Project objectives: A Data-driven approach to HR insights

The primary objectives of this analysis are to:

  1. Conduct a thorough Exploratory Data Analysis (EDA) to understand the distributions and characteristics of key employee metrics.
  2. Investigate potential relationships between employee satisfaction, performance scores, and other relevant variables.
  3. Present findings in a clear, concise, and visually appealing manner suitable for both technical and non-technical stakeholders.
  4. Provide data-driven insights that can inform strategic decision-making within the company, ultimately leading to a more productive and equitable work environment.

I.3 Data Overview: A comprehensive look at Employee Data

The dataset includes 100,000 rows of employee information such as Departments, satisfaction levels, performance scores, hours worked, and more.

Click to expand details about the Dataset

Dataset Summary

This dataset contains 100,000 rows of data capturing key aspects of employee performance, productivity, and demographics in a corporate environment. It includes details related to the employee’s job, work habits, education, performance, and satisfaction. The dataset is designed for various purposes such as HR analytics, employee churn prediction, productivity analysis, and performance evaluation.

Key Variables:

  • Employee_ID: Unique identifier for each employee.
  • Department: The department in which the employee works (e.g., Sales, HR, IT).
  • Gender: Gender of the employee (Male, Female, Other).
  • Age: Employee’s age (between 22 and 60).
  • Job_Title: The role held by the employee (e.g., Manager, Analyst, Developer).
  • Hire_Date: The date the employee was hired.
  • Years_At_Company: The number of years the employee has been working for the company.
  • Education_Level: Highest educational qualification (High School, Bachelor, Master, PhD).
  • Performance_Score: Employee’s performance rating (1 to 5 scale).
  • Monthly_Salary: The employee’s monthly salary in USD, correlated with job title and performance score.
  • Work_Hours_Per_Week: Number of hours worked per week.
  • Projects_Handled: Total number of projects handled by the employee.
  • Overtime_Hours: Total overtime hours worked in the last year.
  • Sick_Days: Number of sick days taken by the employee.
  • Remote_Work_Frequency: Percentage of time worked remotely (0%, 25%, 50%, 75%, 100%).
  • Team_Size: Number of people in the employee’s team.
  • Training_Hours: Number of hours spent in training.
  • Promotions: Number of promotions received during their tenure.
  • Employee_Satisfaction_Score: Employee satisfaction rating (1.0 to 5.0 scale).
  • Resigned: Boolean value indicating if the employee has resigned.

The dataset can be accessed at the following link: Employee Performance Evaluation Dataset.


II. EXPLORATORY DATA ANALYSIS (EDA)

Libraries will be loaded progressively as needed to demonstrate their specific utility within each step.

II.1 Loading the Dataset

This section focuses on loading the raw employee performance data and a preliminary inspection.

Show code
# Load essential data manipulation and visualization libraries
library(tidyverse) # Includes dplyr for data manipulation and ggplot2 for plotting

# Load the employee dataset 
df_employee <- read.csv("Extended_Employee_Performance_and_Productivity_Data.csv")

# Display the first few rows of the dataset as an interactive table
# This helps in understanding the column names and initial data types.
head(df_employee) 

The dataset contains 100,000 individual employee records (rows) and is described by 20 distinct attributes (columns).

II.2 Data Structure and Preprocessing

II.2.1 Data cleaning: Checking for missing values

Before proceeding with detailed analysis, it’s crucial to assess the completeness of our dataset by checking for any missing values.

Show code
# Load the naniar package for missing data visualization
library(naniar)

# This plot provides a quick overview of which variables, if any, contain missing data.
gg_miss_var(df_employee) +
  labs(title = "Missing Values by Variable") +
  theme_minimal() # Consistent theme for plots

A critical finding from this initial phase is the complete absence of any missing values across all 20 columns of the dataset.

Implication: There is no need for data imputation or handling incomplete observations, which often introduces complexities or potential biases.

II.2.2 Variable formatting and type conversion

To ensure accurate analysis and visualization, variables are converted to appropriate data types (factors, numeric, logical, date/time).

Show code
# Convert nominal categorical variables to factors
df_employee$Department <- as.factor(df_employee$Department)
df_employee$Gender <- as.factor(df_employee$Gender)
df_employee$Job_Title <- as.factor(df_employee$Job_Title)

# Convert ordinal categorical variables to ordered factors
# Levels are explicitly defined to reflect natural hierarchy.
df_employee$Education_Level <- factor(df_employee$Education_Level,
                                      levels = c("High School", "Bachelor", "Master", "PhD"),
                                      ordered = TRUE)

df_employee$Remote_Work_Frequency <- factor(df_employee$Remote_Work_Frequency,
                                            levels = c("0", "25", "50", "75", "100"),
                                            ordered = TRUE)

df_employee$Performance_Score <- factor(df_employee$Performance_Score,
                                        levels = 1:5,
                                        ordered = TRUE)

# Convert 'Resigned' to a logical (boolean) type for clearer interpretation in analysis
df_employee$Resigned <- as.logical(df_employee$Resigned)

# Convert 'Hire_Date' to a proper date/time object for temporal analysis
df_employee$Hire_Date <- as.POSIXct(df_employee$Hire_Date, format = "%Y-%m-%d %H:%M:%S")

# Ensure all numerical variables are explicitly set as numeric
numerical_vars <- c("Age", "Years_At_Company", "Monthly_Salary", "Work_Hours_Per_Week",
                    "Projects_Handled", "Overtime_Hours", "Sick_Days", "Team_Size",
                    "Training_Hours", "Promotions", "Employee_Satisfaction_Score")

df_employee[numerical_vars] <- lapply(df_employee[numerical_vars], as.numeric)

The variables are now appropriately typed in R, which facilitates effective analysis:

  • Factor (6 columns): Department, Gender, Job_Title, Education_Level (ordered), Performance_Score (ordered), Remote_Work_Frequency (ordered).
  • Logical (1 column): Resigned. This boolean variable, indicating employee turnover, is central to our retention analysis.
  • Numeric (11 columns): Age, Years_At_Company, Monthly_Salary, Work_Hours_Per_Week, Projects_Handled, Overtime_Hours, Sick_Days, Team_Size, Training_Hours, Promotions, Employee_Satisfaction_Score. These represent quantitative measures.
  • POSIXct (1 column): Hire_Date. This timestamp variable is crucial for calculating accurate employee tenure.

II.2.3 Key variable insights: A descriptive summary

A detailed summary of each variable type provides a foundational understanding of the employee population and key HR metrics.

Summary of the Dataframe
Show code
# Load the skimr package for a compact and visually appealing summary of data frames
library(skimr)

# Generate a summary report for all variables, excluding the Employee_ID as it's just an identifier.
skim(df_employee %>% select(-Employee_ID))
Data summary
Name df_employee %>% select(-E…
Number of rows 100000
Number of columns 19
_______________________
Column type frequency:
factor 6
logical 1
numeric 11
POSIXct 1
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Department 0 1 FALSE 9 Mar: 11216, Fin: 11200, Ope: 11181, IT: 11131
Gender 0 1 FALSE 3 Mal: 48031, Fem: 48001, Oth: 3968
Job_Title 0 1 FALSE 7 Spe: 14507, Man: 14325, Tec: 14285, Ana: 14261
Education_Level 0 1 TRUE 4 Bac: 50041, Hig: 30004, Mas: 14904, PhD: 5051
Performance_Score 0 1 TRUE 5 1: 20120, 2: 20013, 3: 19999, 4: 19940
Remote_Work_Frequency 0 1 TRUE 5 25: 20235, 75: 20173, 100: 20049, 0: 19837

Variable type: logical

skim_variable n_missing complete_rate mean count
Resigned 0 1 0.1 FAL: 89990, TRU: 10010

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Age 0 1 41.03 11.24 22 31.00 41 51.00 60 ▇▇▇▇▇
Years_At_Company 0 1 4.48 2.87 0 2.00 4 7.00 10 ▇▅▅▅▂
Monthly_Salary 0 1 6403.21 1372.51 3850 5250.00 6500 7500.00 9000 ▅▇▇▇▅
Work_Hours_Per_Week 0 1 44.96 8.94 30 37.00 45 53.00 60 ▇▇▇▇▇
Projects_Handled 0 1 24.43 14.47 0 12.00 24 37.00 49 ▇▇▇▇▇
Overtime_Hours 0 1 14.51 8.66 0 7.00 15 22.00 29 ▇▇▇▇▇
Sick_Days 0 1 7.01 4.33 0 3.00 7 11.00 14 ▇▇▇▇▇
Team_Size 0 1 10.01 5.50 1 5.00 10 15.00 19 ▇▇▆▇▇
Training_Hours 0 1 49.51 28.89 0 25.00 49 75.00 99 ▇▇▇▇▇
Promotions 0 1 1.00 0.82 0 0.00 1 2.00 2 ▇▁▇▁▇
Employee_Satisfaction_Score 0 1 3.00 1.15 1 2.01 3 3.99 5 ▇▇▇▇▇

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
Hire_Date 0 1 2014-09-07 08:03:05 2024-09-03 08:03:05 2019-09-21 08:03:05 3650

Key Observations from the Data Summary:

  • Logical Variable (Resigned): This key variable shows an overall company resignation rate of 10.01% (10,010 employees resigned out of 100,000). This baseline rate underscores the importance of our employee retention analysis.

  • Numerical Variables:

    • Age: Employees range from 22 to 60 years, with an average age of approximately 41 years (SD: 11.24). This suggests a demographically diverse workforce.
    • Years_At_Company: The average tenure is around 4.5 years (SD: 2.87), with a maximum tenure of 10 years. This distribution will be crucial for understanding how tenure correlates with resignation risk.
    • Monthly_Salary: Salaries range from approximately $3,850 to $9,000, averaging around $6,400 (SD: 1372.51). This spread indicates variance in compensation across roles or experience levels.
    • Work_Hours_Per_Week: The average work week is about 45 hours (SD: 8.94), with most employees working between 30-60 hours.
    • Overtime_Hours: Employees average about 14.5 hours of overtime (SD: 8.66). This is a significant figure that will be explored for its potential impact on performance and satisfaction.
    • Employee_Satisfaction_Score: Scores range from 1 to 5, with an average of 3.00 (SD: 1.15), suggesting a neutral overall satisfaction level which warrants further investigation.
    • Promotions: The average number of promotions is 1 (SD: 0.82), indicating common career progression within the company.
  • Factor Variables:

    • Department: The company has 9 departments, with Marketing, Finance, Operations, and IT being the largest, each comprising roughly 11-12% of the workforce.
    • Gender: The dataset shows a balanced gender distribution (approximately 48% Male, 48% Female, and 4% Other).
    • Job_Title: There are 7 distinct job titles, with “Specialist”, “Manager”, “Technician”, and “Analyst” being the most common, each representing about 14-15% of roles.
    • Education_Level: “Bachelor’s” degree holders are the largest group (50%), followed by “High School” (30%), “Master’s” (15%), and “PhD” (5%).
    • Performance_Score: Scores from 1 to 5 are relatively evenly distributed, with each score level representing approximately 20% of the employees. This implies a normal distribution of performance across the company.
    • Remote_Work_Frequency: The distribution across the 5 categories (0%, 25%, 50%, 75%, 100%) is also relatively even, providing ample data to assess the impact of remote work.

II.2.4 Key visualizations: A glimpse into Employee demographics

This section presents initial visualizations to highlight key distributions and patterns within the employee dataset.

This initial data exploration reveals a high-quality, comprehensive HR dataset comprising 100,000 employee records across 20 distinct variables. Crucially, the dataset is entirely free of missing values, ensuring a clean foundation for all subsequent analyses. The variables capture a rich array of employee attributes, from demographics and tenure to compensation, performance, work habits, and well-being. A key initial insight is the company’s 10.01% employee resignation rate, highlighting the critical importance of retention analysis.

III. THEMATIC ANALYSIS: Deep dive into key HR questions

This section delves into specific thematic areas, leveraging the preprocessed dataset to uncover deeper insights related to employee retention, performance, and satisfaction.

III.1 EMPLOYEE RETENTION & TURNOVER: Unpacking Resignation drivers

WHAT FACTORS PREDICT RESIGNATION?

The following analyses explore various employee attributes to identify potential correlations with resignation status, helping to understand the underlying drivers of turnover.

Based on these initial outputs, factors like Gender, Education Level, Age, and Monthly Salary do not appear to be strong individual predictors of resignation. Surprisingly, Employee Satisfaction Score also does not show a clear relationship with turnover.

This means we need to look deeper into other factors and their combinations.


DOES REMOTE WORK FREQUENCY IMPACT RETENTION?

This analysis investigates the relationship between the frequency of remote work and employee resignation rates.

Show code
resignation_data_remote <- df_employee %>% # Renamed variable to avoid conflict
  group_by(Remote_Work_Frequency) %>%
  summarise(
    Total_Employees = n(),
    Resigned_Count = sum(Resigned == "Yes"),
    Resignation_Rate = (Resigned_Count / Total_Employees) * 100,
    .groups = 'drop'
  ) %>%
  arrange(desc(Resignation_Rate))

# Add highlight column to distinguish the highest bar
resignation_data_remote <- resignation_data_remote %>%
  mutate(Highlight = ifelse(Resignation_Rate == max(Resignation_Rate), "Highest", "Other"))

# Create plot
remote_work_plot <- ggplot(resignation_data_remote,
                          aes(x = Remote_Work_Frequency,
                              y = Resignation_Rate,
                              fill = Highlight)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = sprintf("%.2f%%", Resignation_Rate)),
            vjust = -0.5, size = 4.5, fontface = "bold", color = "#151931") +
  geom_text(aes(y = 1, label = paste0("n=", format(Total_Employees, big.mark = ","))),
            size = 3.5, color = "white", vjust = 1.2, fontface = "bold") +
  scale_fill_manual(values = c("Highest" = "#faca80", "Other" = "#151931"), guide = "none") +
  labs(
    title = "<span style='font-size:18pt'>Partially remote staff (75% remote)<br>show highest attrition (10.28%)</span>", 
    subtitle = "Lowest for fully office-based staff",
    x = "Remote Work Frequency (%)",
    y = "Resignation Rate (%)",
    caption = paste("Total employees analyzed:", format(sum(resignation_data_remote$Total_Employees), big.mark = ","))
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_markdown(hjust = 0.5, face = "bold", margin = margin(b = 5)),
    plot.subtitle = element_text(hjust = 0.5, color = "gray40", size = 12, margin = margin(b = 15)),
    plot.caption = element_text(hjust = 0, color = "#666666", margin = margin(t = 10)),
    axis.text.x = element_text(face = "bold", size = 11, color = "#151931"),
    axis.title.y = element_text(angle = 90, vjust = 0.5, margin = margin(r = 10)),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    plot.margin = margin(20, 20, 20, 20)
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1)),
                     limits = c(0, max(resignation_data_remote$Resignation_Rate) * 1.2))

# Display the plot
remote_work_plot

There’s a subtle but interesting trend here. Employees who work fully in-office (0% remote) have a slightly lower resignation rate (9.48%) compared to those who engage in remote work at any frequency, where rates are generally just above 10%. While the differences are not dramatic, it suggests that fully in-office roles might have a marginally higher retention rate. This warrants further investigation to understand if this is due to the nature of the roles, company culture, or other confounding factors. For example, some roles might inherently require more in-office presence and also be more stable.

ARE LOWER-PERFORMING EMPLOYEES MORE LIKELY TO LEAVE?

This analysis examines the relationship between an employee’s performance score and their likelihood of resignation.

Show code
# Calculate resignation rates by Performance_Score
performance_data_resignation <- df_employee %>% # Renamed variable to avoid conflict
  group_by(Performance_Score) %>%
  summarise(
    Total_Employees = n(),
    Resigned_Count = sum(Resigned == "Yes"),
    Resignation_Rate = (Resigned_Count / Total_Employees) * 100,
    .groups = 'drop'
  ) %>%
  arrange(desc(Resignation_Rate))

# Identify the highest resignation rate and mark it
performance_data_resignation <- performance_data_resignation %>%
  mutate(Highlight = ifelse(Resignation_Rate == max(Resignation_Rate), "Highest", "Other"))

# Plot
performance_plot_resignation <- ggplot(performance_data_resignation,
                         aes(x = Performance_Score,
                             y = Resignation_Rate,
                             fill = Highlight)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = sprintf("%.2f%%", Resignation_Rate)),
            vjust = -0.5, size = 4.5, fontface = "bold", color = "#151931") +
  geom_text(aes(y = 1, label = paste0("n=", format(Total_Employees, big.mark = ","))),
            size = 3.5, color = "white", vjust = 1.2, fontface = "bold") +
  scale_fill_manual(values = c("Highest" = "#faca80", "Other" = "#151931"), guide = "none") +
  labs(
    title = "<span style='font-size:18pt'>There is no clear indication that lower-performing<br> employees are significantly more likely to leave</span>",
    subtitle = "Employees with the highest score (5) have the lowest resignation rate",
    x = "Performance Score (1 = Lowest, 5 = Highest)",
    y = "Resignation Rate (%)",
    caption = paste("Total employees analyzed:", format(sum(performance_data_resignation$Total_Employees), big.mark = ","))
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_markdown(hjust = 0.5, face = "bold", margin = margin(b = 5)),
    plot.subtitle = element_text(hjust = 0.5, color = "gray40", margin = margin(b = 15)),
    plot.caption = element_text(hjust = 0, color = "#666666", margin = margin(t = 10)),
    axis.text.x = element_text(face = "bold", size = 11, color = "#151931"),
    axis.title.y = element_text(angle = 90, vjust = 0.5, hjust = 0.5, margin = margin(r = 10)),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    plot.margin = margin(20, 20, 20, 20)
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1)),
                     limits = c(0, max(performance_data_resignation$Resignation_Rate) * 1.2))

# Display the plot
performance_plot_resignation

This is another counter-intuitive finding. There is no clear indication that lower-performing employees (Scores 1 or 2) are significantly more likely to leave than higher-performing employees (Scores 4 or 5). In fact, employees with a Performance Score of 5 (the highest) have the lowest resignation rate, although by a small margin. This suggests that the company isn’t experiencing “bad turnover” where only low performers leave, but rather turnover is happening across the performance spectrum. This could mean that the company’s retention challenges are systemic rather than tied to individual performance levels.

DOES TENURE (Years_At_Company) CORRELATE WITH RESIGNATION RISK?

This analysis segments employees by their years at the company to identify how tenure influences resignation rates.

Show code
# Create tenure bins and calculate resignation rates
resignation_by_tenure <- df_employee %>%
  mutate(
    tenure_bin = cut(Years_At_Company,
      breaks = c(-Inf, 1, 3, 5, 7, 10, Inf), # Define clear tenure bins
      labels = c("0-1 year", "1-3 years", "3-5 years", "5-7 years", "7-10 years", "10+ years"),
      right = FALSE # Intervals are [lower, upper)
    )
  ) %>%
  group_by(tenure_bin) %>%
  summarise(
    total_employees = n(),
    resigned_count = sum(Resigned == "Yes"),
    resignation_rate = (resigned_count / total_employees) * 100,
    .groups = 'drop'
  )

# Create plot for tenure and resignation rates
tenure_plot <- ggplot(resignation_by_tenure,
                     aes(x = tenure_bin, 
                         y = resignation_rate,
                         fill = resignation_rate)) +
  geom_col(width = 0.8) +
  geom_text(
    aes(label = sprintf("%.2f%%", resignation_rate)),
    vjust = -0.5,
    size = 4.5,
    fontface = "bold",
    color = "#151931"
  ) +
  geom_text(
    aes(y = 1, label = paste0("n=", format(total_employees, big.mark = ","))),
    size = 3.5,
    color = "white",
    vjust = 0.7,
    fontface = "bold"
  ) +
  scale_fill_gradient(low = "#151931", high = "#faca80", guide = "none") + # Gradient fill
  labs(
    title = "<span style='font-size:18pt'>Highest attrition (15.9%) among oldest employees<br> (10+ years)</span>", 
    subtitle = "Resignation rates generally increases with tenure",
    x = "Years at Company", 
    y = "Resignation Rate (%)",
    caption = paste("Total employees analyzed:", format(sum(resignation_by_tenure$total_employees), big.mark = ","))
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_markdown(hjust = 0.5, face = "bold", margin = margin(b = 5)),
    plot.subtitle = element_text(hjust = 0.5, color = "gray40", margin = margin(b = 15)),
    plot.caption = element_text(hjust = 0, color = "#666666", margin = margin(t = 10)),
    axis.text.x = element_text(angle = 0, hjust = 0.5, size = 11, color = "#151931"),
    axis.title.y = element_text(angle = 90, vjust = 0.5, hjust = 0.5, margin = margin(r = 10)),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    plot.margin = margin(20, 20, 20, 20),
    plot.background = element_rect(fill = "white", color = NA)
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15)),
                     limits = c(0, max(resignation_by_tenure$resignation_rate) * 1.2))

# Display the plot
tenure_plot

This is a crucial insight!

  • Contrary to a common belief that new hires are most volatile, the highest resignation rate is observed among the most tenured employees (10+ Years), reaching almost 16%. This is a significant outlier compared to the ~10% rate for all other tenure bands. This group, although small in total number (44 employees), shows a distinct pattern. This could indicate issues related to career stagnation, lack of new challenges, or retirement for long-serving employees.
  • For the majority of the workforce (0-10 years), the resignation rate remains fairly consistent around 10%, indicating that the “honeymoon phase” or early career challenges aren’t leading to disproportionately high turnover in this dataset.
SUMMARY EMPLOYEE RETENTION & TURNOVER:
  1. Demographics (Gender, Education, Age) and Base Salary: These factors do not appear to be strong individual predictors of resignation. Turnover rates are largely consistent across different groups defined by these attributes.
  2. Employee Satisfaction: Surprisingly, overall Employee_Satisfaction_Score does not show a clear relationship with resignation. This suggests that dissatisfaction might not be captured by this single metric, or employees are leaving for reasons beyond general satisfaction.
  3. Remote Work Frequency: Employees who work fully in-office (0% remote) have a slightly lower resignation rate than those with any level of remote work. While the difference is small, it’s worth noting.
  4. Performance Score: Lower-performing employees are not more likely to leave. Turnover happens across all performance levels, and surprisingly, the highest performers have the lowest (though marginally) resignation rate. This indicates that the company is not just shedding low performers.
  5. Tenure (Years_At_Company): This is the most impactful factor so far. The highest resignation risk is observed among employees with 10+ years of tenure, showing a significantly elevated resignation rate compared to other tenure groups. Turnover for employees with less than 10 years of tenure is relatively stable.

III.2 PERFORMANCE & PRODUCTIVITY: Uncovering drivers of success

WHAT DRIVES PERFORMANCE SCORES?

This section explores various factors that might influence employee performance scores, examining departmental performance, educational attainment, and the relationship between salary and performance.

This scatter plot visualizes the relationship between monthly salary and performance score, including a linear regression line to indicate the trend.

Show code
# Ensure Performance_Score is numeric for correlation calculation and plotting
df_employee <- df_employee %>% mutate(Performance_Score_Numeric = as.numeric(Performance_Score))

# Calculate Pearson correlation coefficient and key statistics
correlation <- cor(df_employee$Monthly_Salary, df_employee$Performance_Score_Numeric, use = "complete.obs")
salary_stats_performance <- df_employee %>% # Renamed variable
  summarise(
    median_salary = median(Monthly_Salary),
    avg_performance = mean(Performance_Score_Numeric, na.rm = TRUE)
  )

# Create enhanced scatter plot
performance_salary_plot <- ggplot(df_employee,
                                 aes(x = Monthly_Salary,
                                     y = Performance_Score_Numeric)) +
  # Add points, with slight transparency for overlapping points
  geom_point(alpha = 0.5, color = "#151931") + # Dark blue points
  # Add a linear regression line to show the trend
  geom_smooth(method = "lm",
              color = "#e55807",  # orange for the regression line
              linewidth = 1.5,
              se = FALSE) + # Do not show standard error band
  # Add reference lines for median salary and average performance
  geom_hline(yintercept = salary_stats_performance$avg_performance,
             linetype = "dashed",
             color = "#666666") +
  geom_vline(xintercept = salary_stats_performance$median_salary,
             linetype = "dashed",
             color = "#666666") +
  # Annotate the correlation coefficient on the plot
  annotate("text",
           x = max(df_employee$Monthly_Salary) * 0.95, # Position in top right corner
           y = max(df_employee$Performance_Score_Numeric) * 0.95,
           label = sprintf("r = %.2f", correlation),
           color = "#151931",
           size = 5,
           fontface = "bold") +
  # Set x-axis as currency and define breaks
  scale_x_continuous(labels = dollar_format(),
                     breaks = seq(4000, 9000, by = 1000)) +
  # Set y-axis limits and breaks for performance scores
  scale_y_continuous(limits = c(1, 5),
                     breaks = 1:5) +
  # Titles and labels with markdown
  labs(
    title = "<span style='font-size:18pt'>Modest positive correlation (r = 0.51) <br>between Salary and Performance</span>",
    subtitle = "Higher salaries show slightly better Performance Scores on average",
    x = "Monthly Salary (USD)", # Clearer x-axis label
    y = "Performance Score (1-5)",
    caption = paste("Total employees analyzed: n =", format(nrow(df_employee), big.mark = ","),
                    "| Median salary:", dollar(salary_stats_performance$median_salary),
                    "| Average performance:", round(salary_stats_performance$avg_performance, 2))
  ) +
  # Apply consistent minimal theme
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_markdown(hjust = 0.5, face = "bold", margin = margin(b = 5)),
    plot.subtitle = element_text(hjust = 0.5, color = "gray40", margin = margin(b = 15)),
    plot.caption = element_text(hjust = 0, color = "#666666", margin = margin(t = 10)),
    axis.title = element_text(color = "#151931"),
    axis.text = element_text(color = "#151931"),
    panel.grid.minor = element_blank(), # Remove minor grid lines
    legend.position = "none", # Hide legend
    plot.margin = margin(20, 20, 20, 20),
    plot.background = element_rect(fill = "white", color = NA)
  )

# Display the plot
performance_salary_plot

This bar chart illustrates the average performance score for each department, highlighting departments with the highest and lowest scores relative to the company average.

Show code
# Calculate average performance by department
performance_data_department <- df_employee %>% # Renamed variable
  mutate(Performance_Score_Numeric = as.numeric(Performance_Score)) %>% # Convert factor to numeric for calculations
  group_by(Department) %>%
  summarise(
    avg_score = mean(Performance_Score_Numeric, na.rm = TRUE),
    n_employees = n(),
    .groups = 'drop'
  ) %>%
  arrange(desc(avg_score)) %>% # Order by average score
  mutate(
    performance_gap = avg_score - mean(avg_score), # Calculate gap from overall average
    highlight = case_when(
      avg_score == max(avg_score) ~ "top", # Highlight top performer
      avg_score == min(avg_score) ~ "bottom", # Highlight bottom performer
      TRUE ~ "middle" # Default for others
    )
  )

# Create plot for average performance score by department
performance_plot_department <- ggplot(performance_data_department, # Use the renamed data variable
                         aes(x = fct_reorder(Department, avg_score), # Reorder departments for better visualization
                             y = avg_score,
                             fill = highlight)) +
  # Bars
  geom_col(width = 0.8) +
  # Reference line for overall company average performance
  geom_hline(
    yintercept = mean(performance_data_department$avg_score),
    color = "#666666",
    linetype = "dashed",
    linewidth = 0.8
  ) +
  # Score labels on bars
  geom_text(
    aes(label = sprintf("%.2f", avg_score)),
    hjust = -0.2, # Position labels outside the bars
    size = 3.8,
    fontface = "bold",
    color = "#151931"
  ) +
  # Performance gap labels for top/bottom departments
  geom_text(
    aes(y = mean(avg_score),
        label = ifelse(highlight %in% c("top", "bottom"),
                      sprintf("%+.2f", performance_gap), "")),
    hjust = 1.2, # Position relative to the average line
    size = 3.2,
    color = ifelse(performance_data_department$highlight == "top", "#1a7d1a", "#c00000"), # Green for positive, red for negative gap
    fontface = "bold"
  ) +
  # Employee count labels at the bottom of the bars
  geom_text(
    aes(y = 0.2, label = paste0("n=", format(n_employees, big.mark = ","))),
    hjust = 0, # Align to the left of the bar
    size = 3,
    color = "white",
    fontface = "bold"
  ) +
  # Custom color scale for highlights
  scale_fill_manual(values = c(
    "top" = "#faca80",
    "middle" = "#2a3152",
    "bottom" = "#151931"
  )) +
  # Titles and labels with markdown
  labs(
    title = "<span style='font-size:18pt'>Engineering leads Department Performance (3.02/5)</span>",
    subtitle = "Marketing trails by -0.01 points from company average",
    x = NULL, # Remove x-axis label
    y = NULL, # Remove y-axis label
    caption = paste("Performance scale: 1-5 | Company average:",
                   round(mean(performance_data_department$avg_score), 2),
                   "| Total employees analyzed:", format(sum(performance_data_department$n_employees), big.mark = ","))
  ) +
  # Flip coordinates for horizontal bars
  coord_flip() +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_markdown(hjust = 0.5, face = "bold", margin = margin(b = 5)),
    plot.subtitle = element_text(hjust = 0.5, color = "gray40", margin = margin(b = 15)),
    plot.caption = element_text(hjust = 0, color = "#666666", margin = margin(t = 10)),
    axis.text.x = element_blank(), # Hide x-axis text as labels are on bars
    axis.text.y = element_text(face = "bold", size = 11, color = "#151931", margin = margin(r = 10)),
    panel.grid.major = element_blank(), # Remove major grid lines
    panel.grid.minor = element_blank(), # Remove minor grid lines
    legend.position = "none", # Hide legend
    plot.margin = margin(20, 80, 20, 20), # Increased right margin for gap labels
    plot.background = element_rect(fill = "white", color = NA)
  ) +
  scale_y_continuous(
    limits = c(0, 5.2), # Set y-axis limits to match performance score scale (1-5)
    breaks = seq(0, 5, by = 0.5), # Define breaks for y-axis
    expand = expansion(mult = c(0, 0.1))
  )

# Display the plot
performance_plot_department

This analysis examines whether educational attainment correlates with average employee performance scores.

Show code
# Calculate performance by education level with gaps from the company average
edu_performance <- df_employee %>%
  mutate(Performance_Score_Numeric = as.numeric(Performance_Score)) %>% # Ensure numeric for calculation
  group_by(Education_Level) %>%
  summarise(
    avg_score = mean(Performance_Score_Numeric, na.rm = TRUE),
    n_employees = n(),
    .groups = 'drop'
  ) %>%
  mutate(
    company_avg = mean(avg_score), # Calculate overall company average for comparison
    performance_gap = avg_score - company_avg, # Calculate gap from company average
    highlight = case_when(
      avg_score == max(avg_score) ~ "top", # Highlight top performer
      avg_score == min(avg_score) ~ "bottom", # Highlight bottom performer
      TRUE ~ "middle" # Default for others
    )
  ) %>%
  arrange(desc(avg_score)) # Order by average score

# Create plot for average performance score by education level
edu_plot_performance <- ggplot(edu_performance, # Use the renamed data variable
                  aes(x = fct_reorder(Education_Level, avg_score), # Reorder education levels
                      y = avg_score,
                      fill = highlight)) +
  geom_col(width = 0.7) +
  # Reference line for overall company average performance
  geom_hline(
    yintercept = mean(edu_performance$avg_score),
    color = "#666666",
    linetype = "dashed",
    linewidth = 0.8
  ) +
  # Score labels on bars
  geom_text(
    aes(label = sprintf("%.2f", avg_score)),
    hjust = -0.2, # Position labels outside the bars
    size = 4,
    fontface = "bold",
    color = "#151931"
  ) +
  # Performance gap labels (NEW) for meaningful gaps
  geom_text(
    aes(y = company_avg,
        label = ifelse(abs(performance_gap) > 0.01,  # Only show meaningful gaps (e.g., > 0.01)
                      sprintf("%+.2f", performance_gap), "")),
    hjust = 1.2, # Position relative to the average line
    size = 3.5,
    color = ifelse(edu_performance$performance_gap > 0, "#1a7d1a", "#c00000"), # Green for positive, red for negative
    fontface = "bold"
  ) +
  # Employee count labels
  geom_text(
    aes(y = 0.2, label = paste0("n=", format(n_employees, big.mark = ","))),
    hjust = 0, # Align to the left
    size = 3.2,
    color = "white",
    fontface = "bold"
  ) +
  # Custom color scale
  scale_fill_manual(values = c(
    "top" = "#faca80",
    "middle" = "#2a3152",
    "bottom" = "#151931"
  )) +
  # Titles and labels with markdown
  labs(
    title = "<span style='font-size:18pt'>Master holders outperform by +0.02 points</span>",
    subtitle = "Bachelor graduates trail company average by -0.01 points",
    x = NULL, # Remove x-axis label
    y = NULL, # Remove y-axis label
    caption = paste("Performance scale: 1-5 | Company average:",
                   round(mean(edu_performance$avg_score), 2),
                   "| Total employees analyzed:", format(sum(edu_performance$n_employees), big.mark = ","))
  ) +
  # Flip coordinates for horizontal bars
  coord_flip() +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_markdown(hjust = 0.5, face = "bold", margin = margin(b = 5)),
    plot.subtitle = element_text(hjust = 0.5, color = "gray40", margin = margin(b = 15)),
    plot.caption = element_text(hjust = 0, color = "#666666", margin = margin(t = 10)),
    axis.text.x = element_blank(), # Hide x-axis text
    axis.text.y = element_text(face = "bold", size = 11, color = "#151931"),
    panel.grid.major = element_blank(), # Remove major grid lines
    panel.grid.minor = element_blank(), # Remove minor grid lines
    legend.position = "none", # Hide legend
    plot.margin = margin(20, 100, 20, 20),  # Increased right margin for gap labels
    plot.background = element_rect(fill = "white", color = NA)
  ) +
  scale_y_continuous(
    limits = c(0, 5.2),
    breaks = seq(0, 5, by = 0.5),
    expand = expansion(mult = c(0, 0.1))
  )

# Display the plot
edu_plot_performance

From these initial analyses on performance drivers:

  • Monthly Salary shows a positive correlation with Performance Score, suggesting that higher earners tend to be higher performers.
  • Department and Education Level do not appear to be strong individual drivers of performance scores, as average scores are very consistent across these categories, or no clear trend is observed.
DOES OVERTIME IMPROVE OR HARM PERFORMANCE?

This analysis investigates the relationship between overtime hours worked and employee performance scores, aiming to understand if there’s an optimal level of overtime.

Show code
# Prepare data with proper overtime bins and calculate average performance
overtime_data_performance <- df_employee %>% # Renamed variable
  mutate(
    Overtime_Bin = factor(
      cut(Overtime_Hours,
          breaks = c(-1, 4, 9, 14, 19, 24, Inf), # Define clear bins for overtime hours
          labels = c("0-4h", "5-9h", "10-14h", "15-19h", "20-24h", "25+h"),
          right = FALSE # Intervals are [lower, upper)
      )
    )
  ) %>%
  group_by(Overtime_Bin) %>%
  summarise(
    avg_score = mean(Performance_Score_Numeric, na.rm = TRUE),
    n_employees = n(),
    .groups = 'drop'
  ) %>%
  mutate(
    performance_gap = avg_score - mean(avg_score), # Calculate gap from overall average
    highlight = ifelse(avg_score == max(avg_score), "top", "normal") # Highlight the top performer bin
  )

# Create plot for overtime hours and average performance score
overtime_plot <- ggplot(overtime_data_performance, # Use the renamed data variable
                       aes(x = Overtime_Bin,
                           y = avg_score,
                           fill = highlight)) +
  geom_col(width = 0.7) +
  # Reference line for overall company average performance
  geom_hline(
    yintercept = mean(overtime_data_performance$avg_score),
    color = "#666666",
    linetype = "dashed",
    linewidth = 0.8
  ) +
  # Score labels on bars
  geom_text(
    aes(label = sprintf("%.3f", avg_score)),
    vjust = -0.5,
    size = 4,
    fontface = "bold",
    color = "#151931"
  ) +
  # Performance gap labels for highlighted bins
  geom_text(
    aes(y = avg_score, 
        label = ifelse(abs(performance_gap) > 0.005, # Only show if gap is significant
                      sprintf("%+.2f", performance_gap), "")),
    vjust = 1.5,
    size = 3.5,
    color = ifelse(overtime_data_performance$performance_gap > 0, "#1a7d1a", "#c00000"), # Green for positive, red for negative
    fontface = "bold"
  ) +
  # Employee count labels at the bottom of the bars
  geom_text(
    aes(y = 0.1, label = paste0("n=", format(n_employees, big.mark = ","))),
    vjust = 0,
    size = 3,
    color = "white",
    fontface = "bold"
  ) +
  # Custom color scale for highlights
  scale_fill_manual(values = c("top" = "#faca80", "normal" = "#151931")) +
  # Titles and labels with markdown
  labs(
    title = "<span style='font-size:18pt'>There is no strong linear relationship or clear pattern <br>indicating that overtime hours significantly <br>improve or harm performance</span>",
    x = "Monthly Overtime Hours", # Clearer x-axis label
    y = "Average Performance Score (1-5)", # Clearer y-axis label
    caption = paste("Performance scale: 1-5 | Company average:",
                   round(mean(overtime_data_performance$avg_score), 2),
                   "| Total employees analyzed:", format(sum(overtime_data_performance$n_employees), big.mark = ","))
  ) +
  # Apply consistent minimal theme
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_markdown(hjust = 0.5, face = "bold", margin = margin(b = 5)),
    plot.subtitle = element_text(hjust = 0.5, color = "gray40", margin = margin(b = 15)),
    plot.caption = element_text(hjust = 0, color = "#666666", margin = margin(t = 10)),
    axis.text.x = element_text(angle = 0, hjust = 0.5, size = 11, color = "#151931"),
    axis.title.y = element_text(angle = 90, vjust = 0.5, hjust = 0.5, margin = margin(r = 10)),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    plot.margin = margin(20, 20, 20, 20),
    plot.background = element_rect(fill = "white", color = NA)
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15)))

# Display the plot
overtime_plot

This analysis suggests that there is no strong linear relationship or clear pattern indicating that overtime hours significantly improve or harm performance in this dataset. This finding challenges the common assumption that more overtime necessarily leads to higher performance or, conversely, that excessive overtime leads to burnout and lower performance. It implies that for this dataset, performance is either driven by other factors not directly related to the amount of overtime, or the performance metric itself might not be sensitive enough to capture the nuanced effects of overtime.

III.3 COMPENSATION & EQUITY: Examining salary disparities

ARE THERE SIGNIFICANT SALARY DISPARITIES ACROSS DEPARTMENTS?

This analysis investigates whether there are notable differences in average monthly salaries across different departments within the company.

Show code
# Prepare data with additional metrics for salary analysis by department
salary_data_department <- df_employee %>% # Renamed variable
  group_by(Department) %>%
  summarise(
    Avg_Salary = mean(Monthly_Salary, na.rm = TRUE),
    Median_Salary = median(Monthly_Salary, na.rm = TRUE),
    Employee_Count = n(),
    .groups = 'drop'
  ) %>%
  arrange(desc(Avg_Salary)) %>% # Order by average salary
  mutate(
    Salary_Ratio = Avg_Salary/min(Avg_Salary), # Calculate ratio relative to lowest average salary
    Highlight = case_when(
      Avg_Salary == max(Avg_Salary) ~ "peak", # Highlight highest average salary
      Avg_Salary == min(Avg_Salary) ~ "low", # Highlight lowest average salary
      TRUE ~ "normal" # Default for others
    )
  )

# Calculate reference metrics for the overall company
company_avg_salary <- mean(salary_data_department$Avg_Salary)
total_employees_salary_analysis <- sum(salary_data_department$Employee_Count)

# Create visualization for average monthly salary by department
salary_plot_department <- ggplot(salary_data_department, # Use the renamed data variable
                     aes(x = reorder(Department, Avg_Salary), # Reorder departments by average salary
                         y = Avg_Salary,
                         fill = Highlight)) +

  # Bars representing average salaries
  geom_col(width = 1, color = "white", linewidth = 0.5) +

  # Reference line for overall company average salary
  geom_hline(yintercept = company_avg_salary,
             color = "#666666",
             linetype = "dashed",
             linewidth = 1) +

  # Salary labels on bars
  geom_text(aes(label = dollar(round(Avg_Salary))),
            vjust = -0.5,
            size = 4,
            fontface = "bold",
            color = "#151931") +

  # Employee count labels at the bottom of the bars
  geom_text(aes(y = 1000,
                label = paste0("n=",format(Employee_Count, big.mark = ","))),
            color = "white",
            size = 3.2,
            vjust = 0,
            fontface = "bold") +

  # Custom color scheme for highlights
  scale_fill_manual(values = c("peak" = "#faca80", "low" = "#e6554a", "normal" = "#151931")) +
  scale_y_continuous(labels = dollar_format(),
                     expand = expansion(mult = c(0, 0.20))) +

  # Professional titles and captions
  labs(
    title = "<span style='font-size:18pt;color:#151931'>There are no significant salary disparities<br> across departments </span>",
    subtitle = "The Median Salary for all departments is exactly the same: $6500",
    x = NULL, # Remove x-axis label
    y = "Average Monthly Salary (USD)",
    caption = paste("Company average:", dollar(round(company_avg_salary)),
                    "| Total employees analyzed:", format(total_employees_salary_analysis, big.mark = ","))
  ) +

  # theme with consistent styling
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_markdown(hjust = 0.5, face = "bold", margin = margin(b = 8)),
    plot.subtitle = element_text(hjust = 0.5, color = "gray40", margin = margin(b = 20)),
    plot.caption = element_text(hjust = 1, color = "#666666", margin = margin(t = 15)),
    axis.text.x = element_text(angle = 45, hjust = 1, face = "bold", color = "#151931"), # Rotate x-axis labels
    axis.title.y = element_text(angle = 90, vjust = 0.5, color = "#151931"),
    axis.text.y = element_text(color = "#151931"),
    panel.grid.major.x = element_blank(), # Remove major x grid lines
    panel.grid.major.y = element_line(color = "#e0e0e0"),
    panel.grid.minor = element_blank(),
    legend.position = "none",
    plot.margin = margin(25, 25, 25, 25),
    plot.background = element_rect(fill = "white", color = NA)
  )

# Display the plot
salary_plot_department

Based on these outputs, there are no significant salary disparities across departments in this dataset.

  • The average salaries show only very minor fluctuations, and the identical median salaries across all departments strongly suggest that the typical employee’s salary experience is consistent regardless of their department.
  • This indicates a high degree of salary equity across the different departments within the organization. This could be due to a standardized pay scale, roles with similar compensation across departments, or a conscious effort by the organization to maintain internal pay equity.
  • While minor statistical differences in the average might exist, they are not practically significant given the identical medians and highly similar distributions.

EXECUTIVE SUMMARY: Key Insights from Employee Performance Analysis

This report provides a comprehensive analysis of employee performance, retention, and compensation across various departments, drawing insights from a dataset of 100,000 employee records. The primary objective was to identify performance gaps, understand drivers of employee turnover, and assess compensation equity within the organization.

Key Findings:

  • Overall Workforce Overview: The dataset is robust and clean, with no missing values, providing a strong foundation for reliable analysis. The company experiences an overall employee resignation rate of approximately 10%.
  • Employee Retention & Turnover:
    • Surprisingly, traditional demographic factors like gender, education level, age, and even overall employee satisfaction scores do not appear to be strong individual predictors of resignation. Turnover rates remain largely consistent across these groups. This suggests that the issues driving attrition might be more nuanced or external to these broad categories.
    • Remote work frequency shows a subtle trend: Employees working fully in-office (0% remote) exhibit a marginally lower resignation rate compared to those engaged in hybrid or fully remote work.
    • Performance score has a counter-intuitive relationship: Lower-performing employees are not disproportionately more likely to leave. In fact, turnover occurs across all performance levels, and the highest performers have the lowest, albeit slightly, resignation rate.
    • Tenure is a critical predictor: The highest resignation risk is observed among employees with 10+ years of tenure, with a significantly elevated resignation rate compared to other tenure groups. This suggests potential issues around career stagnation or new challenges for long-serving employees.
  • Performance & Productivity Drivers:
    • A modest positive correlation (r = 0.51) exists between Monthly Salary and Performance Score, indicating that higher-earning employees tend to achieve slightly better performance ratings.
    • Department and Education Level do not emerge as strong individual drivers of performance scores, as average scores remain highly consistent across these categories.
    • Overtime hours show no clear linear relationship with performance. Moderate overtime (10-14 hours/month) correlates with peak performance, but excessive overtime (25+ hours/month) is associated with slightly lower scores, suggesting a point of diminishing returns.
  • Compensation & Equity:
    • The analysis reveals no significant salary disparities across departments. Median salaries are identical across all departments, indicating a high degree of internal pay equity.

Conclusion: While demographic and general satisfaction metrics do not singularly explain employee turnover, factors like long tenure and the nuanced impact of remote work warrant further investigation for retention strategies. The company demonstrates strong equity in salary distribution across departments and genders. Future efforts should focus on understanding the specific motivations for departure among long-tenured employees and optimizing remote work policies to maximize retention.