Hands-on Survival Analysis with Python




Introduction

Survival analysis is a popular statistical method to investigate the expected duration of time until an event of interest occurs. We can recall it from medicine as patients’ survival time analysis, from engineering as reliability analysis or time-to-failure analysis, and from economics as duration analysis.

Besides these disciplines, survival analysis can also be used by HR teams to understand and create insights about their employee engagement, retention, and satisfaction — which is a hot topic nowadays 🌶 🌶 🌶

According to the Achievers Employee Engagement and Retention Report, 52% of workers plan on looking for new jobs in 2021, and a recent survey of over 30,000 workers in 31 countries shows that 40% of employees are thinking of quitting their jobs. Forbes calls this trend the “Turnover Tsunami”, driven mostly by pandemic burnout, and LinkedIn experts predict a big talent migration, discussing it under the #GreatResignation and #GreatReshuffle topics.

As always, data can help us understand employee engagement and retention in order to reduce turnover and build more engaged, committed, and satisfied teams.

Some examples of what HR teams can dig into in employee turnover data are:

  • What are the characteristics of employees who stay or leave?
  • Are there similarities in attrition rates between groups of employees?
  • What is the probability that an employee leaves after a certain amount of time (e.g. after 2 years)?

In this article, we will build a survival analysis that helps to answer these questions. Let’s start! ☕️

Photo by Danielle MacInnes on Unsplash

Data Selection 👥

We will use a fictional Employee Attrition & Performance dataset created by IBM and available on Kaggle to explore employee attrition rates and the employee characteristics that matter for predicting the survival duration of current employees.

Background

Before modeling the survival function, let’s cover some basic terminology and concepts behind survival analysis.

  • Event is the experience of interest, such as death (versus survival) or resignation (versus staying)
  • Survival time is the duration until the event of interest occurs, e.g. the duration until an employee quits

Censorship Problem 🚫

Censored observations happen in time-to-event data if the event has not been recorded for some individuals. This can be due to two main reasons:

  • The event has not yet occurred (i.e. the survival time is unknown for those who have not resigned yet)
  • Missing data (e.g. dropout) or losing contact

There are three types of censorship:

  1. Left-Censored: The true survival duration is less than the observed duration (the event happened before observation started)
  2. Right-Censored: The true survival duration is greater than the observed duration (the event had not happened when observation ended)
  3. Interval-Censored: The event is only known to have happened within some interval, so the survival duration can’t be determined exactly

The most common type is right-censored and it is usually taken care of by survival analysis. However, the other two might indicate a problem in the data and might require further investigation.
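To make right-censoring concrete, here is a minimal sketch of how raw employee records could be turned into the (duration, event) pairs that survival models expect. The records, the STUDY_END date, and the helper name to_duration_and_event are all hypothetical and are not taken from the IBM dataset.

```python
from datetime import date

STUDY_END = date(2021, 12, 31)  # hypothetical end of the observation window

# Hypothetical records: (hire_date, leave_date or None if still employed)
employees = [
    (date(2015, 1, 1), date(2018, 1, 1)),    # resigned: event observed
    (date(2019, 6, 1), None),                # still employed: right-censored
    (date(2010, 3, 15), date(2020, 3, 15)),  # resigned: event observed
]

def to_duration_and_event(hire, leave, study_end=STUDY_END):
    """Return (years observed, event flag) for one employee."""
    end = leave if leave is not None else study_end
    years = (end - hire).days / 365.25
    observed = 1 if leave is not None else 0  # 0 marks a right-censored record
    return round(years, 1), observed

encoded = [to_duration_and_event(hire, leave) for hire, leave in employees]
for duration, observed in encoded:
    print(f"duration={duration} years, event_observed={observed}")
```

For the censored employee, the duration is only a lower bound on the true tenure, which is exactly what the event flag of 0 tells the estimator.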

Survival Function

If T is the time at which the event occurs and t is any point in time during the observation, the survival function S(t) is the probability that T is greater than t. In other words, the survival function is the probability that an individual survives beyond time t.

S(t) = Pr(T > t)

An illustration of a survival curve:

The probability that an individual survives longer than 2 years is 60% (image by the author)

Some important characteristics of survival function: 🔆

  • T ≥ 0 and 0 ≤ t < ∞
  • It is non-increasing
  • If t = 0, then S(t) = 1 (survival probability is 1 at time 0)
  • As t → ∞, S(t) → 0 (survival probability goes to 0 as time goes to infinity)
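These properties can be checked on a small example. The sketch below estimates S(t) = Pr(T > t) as a plain fraction, which is only valid when no observation is censored. The tenures list is made-up data, and empirical_survival is a hypothetical helper, not a library function.

```python
def empirical_survival(times, t):
    """Fraction of fully observed durations that exceed t, i.e. Pr(T > t)."""
    return sum(1 for x in times if x > t) / len(times)

# Made-up, fully observed tenures in years (no censoring)
tenures = [0.5, 1, 2, 2, 3, 5, 5, 8, 10, 12]

curve = [empirical_survival(tenures, t) for t in range(0, 13)]
for t, s in enumerate(curve):
    print(f"S({t}) = {s:.1f}")
```

On this toy data the curve starts at 1, never increases, and reaches 0 once t passes the longest tenure.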

Hazard Function

Hazard function or hazard rate, h(t), is the instantaneous rate at which the event of interest occurs at time t for an individual who has survived up to time t. The hazard function and the survival function can be derived from each other using the following formula.

h(t) = −d/dt ln S(t), or equivalently S(t) = exp(−∫₀ᵗ h(u) du)
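A standard identity linking the two functions is S(t) = exp(−∫₀ᵗ h(u) du). The sketch below checks it numerically for two made-up hazard functions. The helper survival_from_hazard and the hazard values are illustrative assumptions, not part of the dataset analysis.

```python
import math

def survival_from_hazard(hazard, t, steps=10_000):
    """Approximate S(t) = exp(-integral of h from 0 to t) with the midpoint rule."""
    dt = t / steps
    cumulative_hazard = sum(hazard((k + 0.5) * dt) for k in range(steps)) * dt
    return math.exp(-cumulative_hazard)

def constant_hazard(u):
    return 0.1  # hypothetical: a steady attrition rate of 0.1 per year

def rising_hazard(u):
    return 0.1 * u  # hypothetical: attrition risk grows with tenure

s_constant = survival_from_hazard(constant_hazard, 2.0)
s_rising = survival_from_hazard(rising_hazard, 2.0)
print(f"constant hazard: S(2) = {s_constant:.4f}")
print(f"rising hazard:   S(2) = {s_rising:.4f}")
```

With a constant hazard λ the integral is λt, so S(t) reduces to exp(−λt), the exponential survival curve. The rising hazard here has the same cumulative hazard at t = 2 (0.05 · 2² = 0.2), so both calls give the same survival probability at that point.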

Kaplan-Meier Estimator

Being a non-parametric estimator, Kaplan-Meier doesn’t require making initial assumptions about the distribution of data. It also takes care of right-censored observations by computing the survival probabilities from observed survival times. It uses the product rule from probability and in fact, it is also called a product-limit estimator.

S(t) = ∏_{i: t_i ≤ t} (1 − d_i / n_i)

where:

  • d_i: number of events that happened at time t_i
  • n_i: number of subjects still at risk just before time t_i (those who have neither experienced the event nor been censored)

We can think of the survival probability at time t_i as the product of the survival probability at the prior time t_(i-1) and the chance of surviving time t_i, which is 1 − d_i / n_i.

For example, the survival probability at t=2 is the survival probability at t=1 multiplied by the chance of surviving time t=2.
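The product rule described above can be sketched in a few lines of plain Python. This is a hand-rolled illustration on made-up data, not the lifelines implementation. The function name kaplan_meier and the toy durations and events are assumptions.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(t_i, S(t_i))] for each distinct time with at least one event."""
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    curve = []
    survival = 1.0
    for t in sorted(deaths):
        # n_i: subjects still under observation just before t
        at_risk = sum(1 for d in durations if d >= t)
        # multiply the prior survival by the chance of surviving time t
        survival *= 1 - deaths[t] / at_risk
        curve.append((t, survival))
    return curve

# Made-up data: 1 = event observed, 0 = right-censored
durations = [1, 2, 2, 3, 4, 4, 5, 6]
events = [1, 1, 0, 1, 0, 1, 1, 0]

km_curve = kaplan_meier(durations, events)
for t, s in km_curve:
    print(f"t={t}: S(t)={s:.3f}")
```

Censored subjects never count as events, but they stay in the at-risk count until their own observed duration, which is how the estimator uses the partial information they carry.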

Survival Function with KMF

We can model with Kaplan-Meier Fitter using the lifelines package. While fitting data to kmf, we should specify durations (years spent at the company) and event_observed (attrition value: 1 or 0).

import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

# Initiate and fit
# (df.Attrition is expected to be numeric: 1 = left, 0 = still employed)
kmf = KaplanMeierFitter()
kmf.fit(durations=df.YearsAtCompany, event_observed=df.Attrition)

# Plot the survival function
kmf.survival_function_.plot()
plt.title('Survival Curve estimated with Kaplan-Meier Fitter')
plt.show()

# Print survival probabilities at each year
kmf.survival_function_
Timeline continues up to year 40.

We can see that the probability that an individual stays longer than 2 years at the company is 92%, whereas the probability of staying longer than 10 years drops to 77%.

We can also plot the survival function with its confidence intervals.

# Plot the survival function with confidence intervals
kmf.plot_survival_function()
plt.show()

Note that a wide confidence interval indicates that the estimate is less certain at that time, usually because fewer data points remain.
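One way to see why uncertainty grows as data points run out is Greenwood's formula, which approximates the variance of the Kaplan-Meier estimate as S(t)² · Σ d_i / (n_i (n_i − d_i)). The sketch below applies it to made-up data. It is an illustration only and is not necessarily the interval method that lifelines plots by default. The function name km_with_greenwood is hypothetical.

```python
import math
from collections import Counter

def km_with_greenwood(durations, events):
    """Return [(t_i, n_i, S(t_i), standard_error)] at each event time."""
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    rows, survival, greenwood_sum = [], 1.0, 0.0
    for t in sorted(deaths):
        n_i = sum(1 for d in durations if d >= t)  # subjects at risk
        d_i = deaths[t]                            # events at time t
        survival *= 1 - d_i / n_i
        if n_i > d_i:  # the term is undefined when everyone at risk has the event
            greenwood_sum += d_i / (n_i * (n_i - d_i))
        rows.append((t, n_i, survival, survival * math.sqrt(greenwood_sum)))
    return rows

# Made-up data: 1 = event observed, 0 = right-censored
durations = [1, 2, 2, 3, 4, 4, 5, 6]
events = [1, 1, 0, 1, 0, 1, 1, 0]

greenwood_rows = km_with_greenwood(durations, events)
for t, n_i, s, se in greenwood_rows:
    print(f"t={t}: at risk={n_i}, S(t)={s:.3f}, SE={se:.3f}, relative SE={se / s:.3f}")
```

Each term d_i / (n_i (n_i − d_i)) gets larger as n_i shrinks, so the standard error relative to S(t) keeps growing toward the end of the timeline, where few subjects are still at risk.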

Survival Function of Different Groups with KMF

We can plot survival curves of different groups, such as gender, to see whether the probabilities differ.

Let’s do it based on the EnvironmentSatisfaction column, which takes the following values:

1=‘Low’
2=‘Medium’
3=‘High’
4=‘Very High’

To keep things simpler, I will aggregate Low and Medium together under “Low Environmental Satisfaction” and High and Very High under “High Environmental Satisfaction”.

# Define the low and high satisfaction
Low = ((df.EnvironmentSatisfaction == 1) | (df.EnvironmentSatisfaction == 2))
High = ((df.EnvironmentSatisfaction == 3) | (df.EnvironmentSatisfaction == 4))

# Plot the survival function of both groups on the same axes
ax = plt.subplot()
kmf = KaplanMeierFitter()
kmf.fit(durations=df[Low].YearsAtCompany,
        event_observed=df[Low].Attrition, label='Low Satisfaction')
kmf.survival_function_.plot(ax=ax)
kmf.fit(durations=df[High].YearsAtCompany,
        event_observed=df[High].Attrition, label='High Satisfaction')
kmf.survival_function_.plot(ax=ax)
plt.title('Survival Function based on Environmental Satisfaction')
plt.show()

As we can see, individuals with high environmental satisfaction have higher survival probabilities than those with low satisfaction.

We can perform the same analysis also on “gender” and “work-life balance”.
