# Introduction

Survival analysis is a popular statistical method to investigate the expected duration of time until an event of interest occurs. We can recall it from medicine as patients’ survival time analysis, from engineering as reliability analysis or time-to-failure analysis, and from economics as duration analysis.

Besides these disciplines, survival analysis can also be used by HR teams to understand and create insights about their employee engagement, retention, and satisfaction — which is a hot topic nowadays 🌶 🌶 🌶

According to Achievers’ Employee Engagement and Retention Report, 52% of workers plan on looking for new jobs in 2021, and a recent survey of over 30,000 workers in 31 countries shows that 40% of employees are thinking of quitting their jobs. Forbes calls this trend the “Turnover Tsunami”, driven mostly by pandemic burnout, and LinkedIn experts predict a big talent migration, which they discuss under the #GreatResignation and #GreatReshuffle topics.

As always, **data** can help us understand employee engagement and retention, reduce turnover, and build more engaged, committed, and satisfied teams.

Some examples of what HR teams can dig into in employee turnover data:

- What are the characteristics of employees who *stay/leave*?
- Are there similarities in attrition rates between groups of employees?
- What is the probability that an employee leaves after a certain amount of time (e.g. after 2 years)?

In this article, we will build a survival analysis that helps to answer these questions. Let’s start! ☕️

# Data Selection 👥

We will use a fictional Employee Attrition & Performance dataset, created by IBM and available on Kaggle, to explore employee attrition rates and the employee characteristics that matter for predicting the survival duration of current employees.

# Background

Before modeling the survival function, let’s cover some basic terminology and concepts behind survival analysis.

- **Event** is the experience of interest, such as survive/death or stay/resign
- **Survival time** is the duration until the event of interest occurs, i.e. the duration until an employee quits

## Censorship Problem 🚫

Censored observations happen in time-to-event data if the event has not been recorded for some individuals. This can be due to two main reasons:

- The event has not yet occurred (i.e. survival time is unknown/misleading for those who have not resigned yet)
- Missing data (i.e. dropout) or losing contact

There are three types of censorship:

- **Left-Censored:** Survival duration is less than the observed duration
- **Right-Censored:** Survival duration is greater than the observed duration
- **Interval-Censored:** Survival duration can’t exactly be defined

The most common type is right-censored and it is usually taken care of by survival analysis. However, the other two might indicate a problem in the data and might require further investigation.
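In practice, right-censoring is passed to a survival model as two values per individual: how long they were observed, and whether the event was actually seen. Here is a minimal sketch of that encoding. The records are made up for illustration, and storing attrition as `"Yes"`/`"No"` strings is an assumption about the raw data.

```python
# Toy records (hypothetical values, not taken from the IBM dataset).
employees = [
    {"YearsAtCompany": 2, "Attrition": "Yes"},   # resigned after 2 years: event observed
    {"YearsAtCompany": 5, "Attrition": "No"},    # still employed: right-censored at 5 years
    {"YearsAtCompany": 1, "Attrition": "Yes"},
    {"YearsAtCompany": 10, "Attrition": "No"},
]

# Duration: how long each employee has been observed.
durations = [e["YearsAtCompany"] for e in employees]

# Event flag: 1 if the resignation was observed, 0 if the record is right-censored.
event_observed = [1 if e["Attrition"] == "Yes" else 0 for e in employees]

n_censored = event_observed.count(0)
print(durations)        # [2, 5, 1, 10]
print(event_observed)   # [1, 0, 1, 0]
print(n_censored)       # 2
```

For the two censored employees we only know that their true survival time is at least 5 and 10 years, respectively.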

## Survival Function

*T* is when the event occurs and *t* is any point of time during the observation. The survival function *S(t)* is the probability of *T* being greater than *t*. In other words, the survival function is the probability that an individual will survive after time *t*.

S(t) = Pr(T > t)

An illustration of a survival curve:

Some important characteristics of survival function: 🔆

- T ≥ 0 and 0 < t < ∞
- It is non-increasing
- If `t=0`, then `S(t)=1` (survival probability is 1 at time 0)
- If `t=∞`, then `S(t)=0` (survival probability goes to 0 as time goes to infinity)
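These properties are easy to check on a small example. The sketch below assumes every event time is fully observed (no censoring), so S(t) is simply the fraction of individuals with T > t. The event times and the helper name `empirical_survival` are made up for illustration.

```python
def empirical_survival(event_times, t):
    """Fraction of individuals whose event time T is strictly greater than t."""
    return sum(1 for T in event_times if T > t) / len(event_times)

# Hypothetical, fully observed event times in years (no censoring).
event_times = [1, 2, 2, 4, 7]

grid = [0, 1, 2, 3, 4, 7, 100]
curve = [empirical_survival(event_times, t) for t in grid]
print(curve)  # [1.0, 0.8, 0.4, 0.4, 0.2, 0.0, 0.0]
```

The curve starts at 1, never increases, and reaches 0 once every individual has experienced the event.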

## Hazard Function

Hazard function or hazard rate, *h(t)*, describes the risk that an individual who has survived until time *t* experiences the event of interest at time *t*. For continuous time it is an instantaneous rate rather than a probability. The hazard function and the survival function can be derived from each other.
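For a continuous survival time, the standard textbook relationship between the two functions can be written as below. The density *f(t)* is notation introduced here for the sketch and does not appear elsewhere in this article.

```latex
h(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
     = \frac{f(t)}{S(t)}
     = -\frac{d}{dt} \log S(t)

S(t) = \exp\left( -\int_0^t h(u)\, du \right)
```

So a constant hazard gives an exponentially decaying survival curve, and a higher hazard means the curve drops faster.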

# Kaplan-Meier Estimator

Being a non-parametric estimator, Kaplan-Meier doesn’t require making initial assumptions about the distribution of data. It also takes care of right-censored observations by computing the survival probabilities from observed survival times. It uses the product rule from probability and in fact, it is also called a product-limit estimator.
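In its standard textbook form, the product-limit estimator of the survival function is:

```latex
\hat{S}(t) = \prod_{i:\, t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)
```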

where:

- **d_i**: number of events that happened at time t_i
- **n_i**: number of subjects that have survived up to time t_i

We can think of the survival probability at time *t_i* as the product of the probability of surviving up to the prior time *t_i-1* and the percentage chance of surviving at time *t_i*.
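The recursion described above can be sketched in a few lines of plain Python. This is an illustration on made-up data, not the `lifelines` implementation; the function name `kaplan_meier` and the toy tenures are assumptions.

```python
from collections import Counter

def kaplan_meier(durations, event_observed):
    """Return [(t_i, S(t_i))] at each distinct event time, via the product-limit recursion."""
    events = Counter(t for t, e in zip(durations, event_observed) if e == 1)
    survival = 1.0
    curve = []
    for t_i in sorted(events):
        d_i = events[t_i]                              # events that happened at t_i
        n_i = sum(1 for t in durations if t >= t_i)    # subjects still at risk just before t_i
        survival *= 1 - d_i / n_i                      # S(t_i) = S(t_{i-1}) * (1 - d_i / n_i)
        curve.append((t_i, survival))
    return curve

# Hypothetical tenures in years; 0 marks employees still at the company (right-censored).
durations      = [1, 2, 2, 3, 4, 5]
event_observed = [1, 1, 0, 1, 0, 0]

km_curve = kaplan_meier(durations, event_observed)
for t_i, s in km_curve:
    print(t_i, round(s, 4))
# 1 0.8333
# 2 0.6667
# 3 0.4444
```

Note how the employee censored at 2 years still counts in the risk set at t = 2 but does not count as an event. That is how the estimator uses right-censored observations instead of discarding them.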

## Survival Function with KMF

We can model with Kaplan-Meier Fitter using the `lifelines` package. While fitting data to kmf, we should specify **durations** (years spent at the company) and **event_observed** (attrition value: 1 or 0).

    import matplotlib.pyplot as plt
    from lifelines import KaplanMeierFitter

    # Initiate and fit (df.Attrition is expected to be encoded as 1 or 0)
    kmf = KaplanMeierFitter()
    kmf.fit(durations=df.YearsAtCompany, event_observed=df.Attrition)

    # Plot the survival function
    kmf.survival_function_.plot()
    plt.title('Survival Curve estimated with Kaplan-Meier Fitter')
    plt.show()

    # Print survival probabilities at each year
    kmf.survival_function_

We can see that the probability of an individual staying longer than 2 years at the company is 92%, whereas the probability of staying longer than 10 years drops to 77%.

We can also plot survival function with the confidence intervals.

    # Plot the survival function with confidence intervals
    kmf.plot_survival_function()
    plt.show()

Note that a wide confidence interval indicates that the model is less certain at that time, usually due to fewer data points.
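To see why fewer data points widen the interval, here is a rough sketch using Greenwood's variance formula on made-up data. It is only meant to show the mechanism: `lifelines` computes its bands with its own method, so its numbers will not match this sketch exactly. The function name `km_with_greenwood` and the toy data are assumptions.

```python
import math

def km_with_greenwood(durations, event_observed, z=1.96):
    """Return [(t_i, S, lower, upper)] with a plain Greenwood interval clipped to [0, 1]."""
    event_times = sorted({t for t, e in zip(durations, event_observed) if e == 1})
    survival, var_sum, rows = 1.0, 0.0, []
    for t_i in event_times:
        d_i = sum(1 for t, e in zip(durations, event_observed) if t == t_i and e == 1)
        n_i = sum(1 for t in durations if t >= t_i)
        survival *= 1 - d_i / n_i
        if n_i > d_i:
            # Each term grows as the risk set n_i shrinks.
            var_sum += d_i / (n_i * (n_i - d_i))
        half_width = z * survival * math.sqrt(var_sum)
        rows.append((t_i, survival,
                     max(0.0, survival - half_width),
                     min(1.0, survival + half_width)))
    return rows

# Hypothetical tenures in years; 0 marks right-censored employees.
durations      = [1, 2, 2, 3, 4, 5]
event_observed = [1, 1, 0, 1, 0, 0]

rows = km_with_greenwood(durations, event_observed)
for t_i, s, lower, upper in rows:
    print(t_i, round(s, 3), round(lower, 3), round(upper, 3))
```

In this toy example the risk set shrinks from 6 to 3 employees, and the interval gets wider at every step.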

## Survival Function of Different Groups with KMF

We can plot survival curves of different groups, such as gender, to see whether the probabilities change.

Let’s do it based on the Environmental Satisfaction column, where we have the following inputs:

- 1=‘Low’
- 2=‘Medium’
- 3=‘High’
- 4=‘Very High’

To keep things simpler, I will aggregate Low and Medium together under “Low Environmental Satisfaction” and High and Very High under “High Environmental Satisfaction”.

    # Define the low and high satisfaction
    Low = ((df.EnvironmentSatisfaction == 1) | (df.EnvironmentSatisfaction == 2))
    High = ((df.EnvironmentSatisfaction == 3) | (df.EnvironmentSatisfaction == 4))

    # Plot the survival function
    ax = plt.subplot()
    kmf = KaplanMeierFitter()

    kmf.fit(durations=df[Low].YearsAtCompany,
            event_observed=df[Low].Attrition, label='Low Satisfaction')
    kmf.survival_function_.plot(ax=ax)

    kmf.fit(durations=df[High].YearsAtCompany,
            event_observed=df[High].Attrition, label='High Satisfaction')
    kmf.survival_function_.plot(ax=ax)

    plt.title('Survival Function based on Environmental Satisfaction')
    plt.show()

As we can see, individuals with high environmental satisfaction have higher survival probabilities than those with low satisfaction.

We can also perform the same analysis on “gender” and “work-life balance”.
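A visual gap between two curves does not tell us whether the difference is larger than chance. One common way to check is the log-rank test, which is not part of the analysis above and is added here only as a sketch. The plain-Python version below runs on made-up data, and the helper name `logrank_chi2` is hypothetical. In a real analysis you would probably reach for `lifelines.statistics.logrank_test` instead.

```python
import math

def logrank_chi2(dur_a, ev_a, dur_b, ev_b):
    """Two-group log-rank test: returns (chi-square statistic, p-value), 1 degree of freedom."""
    event_times = sorted({t for t, e in zip(dur_a + dur_b, ev_a + ev_b) if e == 1})
    obs_minus_exp, variance = 0.0, 0.0
    for t_i in event_times:
        n_a = sum(1 for t in dur_a if t >= t_i)        # at risk in group A
        n_b = sum(1 for t in dur_b if t >= t_i)        # at risk in group B
        d_a = sum(1 for t, e in zip(dur_a, ev_a) if t == t_i and e == 1)
        d_b = sum(1 for t, e in zip(dur_b, ev_b) if t == t_i and e == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n             # observed minus expected events in A
        if n > 1:
            variance += n_a * n_b * d * (n - d) / (n ** 2 * (n - 1))
    chi2 = obs_minus_exp ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))           # upper tail of chi-square with 1 df
    return chi2, p_value

# Hypothetical tenures (years) and event flags (1 = resigned, 0 = right-censored).
low_dur,  low_ev  = [1, 2, 2, 3, 4],  [1, 1, 1, 1, 0]
high_dur, high_ev = [3, 5, 6, 8, 10], [0, 1, 0, 1, 0]

chi2, p_value = logrank_chi2(low_dur, low_ev, high_dur, high_ev)
print(round(chi2, 2), round(p_value, 4))
```

A small p-value suggests the two groups really do have different survival curves. With only five employees per group, though, this toy result should not be read as more than an illustration.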
