A Simple Explanation of Causal Inference in Python




Photo by 愚木混株 cdd20 on Unsplash

Background

I first became interested in causality when I finished a commercial machine learning classification project and the first thing the customers asked after the presentation was …

“Why does that happen and what are the underlying causes?”

My first attempt revolved around artificially modifying the input data for the classification model but that did not work very well.

Next I trawled Google, Medium and Towards Data Science for articles on “causal inference” and I did find a few but they were complicated, often incomplete and not generalizable to my own data sets.

That prompted me to explore the wider literature and the documentation for the various causal libraries that are currently available, and this article presents a causal inference model that is the result of that research.

What You Will Learn

By the end of this article you will be able to generate test data to represent any causal inference scenario, build a causal model in Python code and then run “what-if?” queries against the model.

Before we dive into the causal inference model please consider …

Joining Medium with my referral link (I will receive a proportion of the fees if you sign up using this link).

Subscribing to a free e-mail whenever I publish a new story.

Taking a quick look at my previous articles.

Downloading my free strategic data-driven decision making framework.

Visiting my data science website — The Data Blog.

Getting Started

The first thing we need is to import all the libraries needed in the code …

The second thing we need is a hypothetical scenario to model –

  • Out of 1 million children, 990,000 (99%) are vaccinated
  • Of those vaccinated, 9,900 (1% of 990,000) have a reaction
  • Of those that have a reaction, 99 (1% of 9,900) die from the reaction
  • Of those not vaccinated, 200 (2% of 10,000) get smallpox
  • Of those that get smallpox, 40 (20% of 200) die from the disease

The next thing we need is some test data to represent the “observations” for the scenario. To quickly build 1 million rows of test data representing the causal scenario please head over to this article …

… for the source code and explanation for the BinaryDataGenerator class …

(1000000, 4)
Image by Author

This grid is a summary of the 1 million rows of test data. The first row is saying that 980,100 of the 1,000,000 “observations” were vaccinated, did not have a reaction, did not contract Smallpox and did not die etc.
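The BinaryDataGenerator class is covered in a separate article, so as a stand-in here is a minimal sketch in plain Python that builds the same 1 million observations directly from the scenario counts. The tuple layout (Vaccination?, Reaction?, Smallpox?, Death?) and all variable names are assumptions made for illustration; this is not the original generator.

```python
from collections import Counter

# Column order: (Vaccination?, Reaction?, Smallpox?, Death?)
POPULATION = 1_000_000
vaccinated = POPULATION * 99 // 100            # 990,000
reactions = vaccinated // 100                  # 9,900
reaction_deaths = reactions // 100             # 99
unvaccinated = POPULATION - vaccinated         # 10,000
smallpox_cases = unvaccinated * 2 // 100       # 200
smallpox_deaths = smallpox_cases * 20 // 100   # 40

# How many observations fall into each distinct pattern
pattern_counts = {
    (1, 0, 0, 0): vaccinated - reactions,           # vaccinated, no reaction
    (1, 1, 0, 0): reactions - reaction_deaths,      # reaction, survived
    (1, 1, 0, 1): reaction_deaths,                  # reaction, died
    (0, 0, 0, 0): unvaccinated - smallpox_cases,    # unvaccinated, healthy
    (0, 0, 1, 0): smallpox_cases - smallpox_deaths, # smallpox, survived
    (0, 0, 1, 1): smallpox_deaths,                  # smallpox, died
}

# Expand into one tuple per observation
rows = [pattern for pattern, n in pattern_counts.items() for _ in range(n)]
print((len(rows), len(rows[0])))

# Collapse back into a summary grid, largest group first
summary = Counter(rows)
for pattern, n in summary.most_common():
    print(pattern, n)
```

The largest group in the summary is (1, 0, 0, 0) with 980,100 rows, which matches the first row of the grid described above.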

The Dilemma

From the summary below we can see a counter-intuitive outcome from the vaccine programme …

Deaths caused by Vaccination: 99
Deaths caused by Smallpox: 40
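These two figures can be reproduced by counting the rows where a death coincides with a reaction or with smallpox. The sketch below assumes the observations are held as pattern counts keyed by (Vaccination?, Reaction?, Smallpox?, Death?); that layout and the helper name deaths_where are illustrative assumptions.

```python
# Observation counts keyed by (Vaccination?, Reaction?, Smallpox?, Death?)
pattern_counts = {
    (1, 0, 0, 0): 980_100,
    (1, 1, 0, 0): 9_801,
    (1, 1, 0, 1): 99,
    (0, 0, 0, 0): 9_800,
    (0, 0, 1, 0): 160,
    (0, 0, 1, 1): 40,
}

REACTION, SMALLPOX, DEATH = 1, 2, 3  # column positions in each pattern


def deaths_where(column):
    """Count deaths among observations where the given 0/1 column is 1."""
    return sum(
        n for row, n in pattern_counts.items()
        if row[column] == 1 and row[DEATH] == 1
    )


deaths_from_vaccination = deaths_where(REACTION)
deaths_from_smallpox = deaths_where(SMALLPOX)
print(f"Deaths caused by Vaccination: {deaths_from_vaccination}")
print(f"Deaths caused by Smallpox: {deaths_from_smallpox}")
```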

There have been more deaths caused by the vaccine than the disease! So should the vaccine programme be cancelled to save lives?

To solve that we need to ask the question “What would have happened if we had not run the vaccine programme?”. That is a counter-factual question i.e. it is asking us to imagine a different world where we made a key choice differently and to find out what impact that would have had.

I will tackle counter-factuals in detail in a future article but for now it is enough to say that the counter-factual makes this a causal inference problem, one that is not well suited to machine learning techniques because it is about causation and not correlation.

The remainder of this article will explain how to solve this using Python code that has been written to be easy to follow and understand …

Step 1: The Causal Diagram

In “The Book of Why” Pearl argues that one of the key components of a causal inference engine is a “causal model” which can be causal diagrams, structural equations, logical statements etc. but Pearl is “strongly sold” on causal diagrams.

Most of the libraries and articles out there at the moment refer to these as “Directed Acyclic Graphs” (DAGs); they are essentially the same thing.

As a data scientist my natural instinct is that problems are solvable by identifying and resolving patterns in the data, and sure enough there are algorithms out there to build a causal diagram from the data. Now that we have 1 million rows of test data (our “observations”) we can easily try this out …

Image by Author

The result is not very satisfactory!

According to the NOTEARS algorithm results, Smallpox causes the vaccination reaction, death causes both Smallpox and the reaction, and the reaction causes the vaccination! (More information on the unsuitability of NOTEARS can be found here.)

I tried several of the publicly available libraries and got similar results, so how can we create a causal inference diagram?

Well, Pearl makes a strong case that the causal diagram requires domain expertise and cannot simply be inferred from the data, and that makes intuitive sense.

If we were to collect observations of a cockerel crowing and the sun rising there would certainly be a correlation, but the knowledge that the cockerel’s crowing does not cause the sun to rise could not be intuited from the data alone.

With the Smallpox diagram we know that the vaccination comes before the reaction and that temporal knowledge would help us to correct the diagram but as the cockerel crows just before the sun rises even this will not help us out!

What we need then is domain knowledge obtained by interacting with, learning from and questioning the domain experts (and then critically challenged!) to enable a valid causal diagram to be built.

Here is the Python code …
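A minimal sketch of that diagram, assuming plain Python (the standard-library graphlib module, available from Python 3.9) in place of a dedicated graph library. The node names are taken from the conditional probability tables later in this article; the edge list and the parents dictionary are illustrative names, not the original listing.

```python
from graphlib import TopologicalSorter

# Directed edges (cause, effect) supplied from domain knowledge
edges = [
    ("Vaccination?", "Reaction?"),
    ("Vaccination?", "Smallpox?"),
    ("Reaction?", "Death?"),
    ("Smallpox?", "Death?"),
]

# Map every node to the set of its direct causes (its parents)
parents = {}
for cause, effect in edges:
    parents.setdefault(cause, set())
    parents.setdefault(effect, set()).add(cause)

# static_order() raises graphlib.CycleError if the edges contain a cycle,
# so getting a result back confirms the diagram is a valid DAG
causal_order = list(TopologicalSorter(parents).static_order())
print(causal_order)
```

The ordering always starts with Vaccination? and ends with Death?, which is exactly the temporal knowledge the data-driven attempt failed to recover.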

Image by Author

This is looking much better!

Step 2: The Conditional Probability Tables

The next thing that is needed is a set of conditional probability tables (CPTs) that describe the probabilities for transitioning between the nodes on the diagram.

It sounds complicated but those probabilities are all available within our 1 million rows of data (representing the hypothetical observations) and the pgmpy library can do all the necessary calculations in just 1 line of code with an additional line to validate the model …

True

… and the CPTs can easily be printed out …

+-----------------+------+
| Vaccination?(0) | 0.01 |
+-----------------+------+
| Vaccination?(1) | 0.99 |
+-----------------+------+
+--------------+-----------------+-----------------+
| Vaccination? | Vaccination?(0) | Vaccination?(1) |
+--------------+-----------------+-----------------+
| Reaction?(0) | 1.0             | 0.99            |
+--------------+-----------------+-----------------+
| Reaction?(1) | 0.0             | 0.01            |
+--------------+-----------------+-----------------+
+--------------+-----------------+-----------------+
| Vaccination? | Vaccination?(0) | Vaccination?(1) |
+--------------+-----------------+-----------------+
| Smallpox?(0) | 0.98            | 1.0             |
+--------------+-----------------+-----------------+
| Smallpox?(1) | 0.02            | 0.0             |
+--------------+-----------------+-----------------+
+----------+-------------+-------------+-------------+-------------+
| Reaction?| Reaction?(0)| Reaction?(0)| Reaction?(1)| Reaction?(1)|
+----------+-------------+-------------+-------------+-------------+
| Smallpox?| Smallpox?(0)| Smallpox?(1)| Smallpox?(0)| Smallpox?(1)|
+----------+-------------+-------------+-------------+-------------+
| Death?(0)| 1.0         | 0.8         | 0.99        | 0.5         |
+----------+-------------+-------------+-------------+-------------+
| Death?(1)| 0.0         | 0.2         | 0.01        | 0.5         |
+----------+-------------+-------------+-------------+-------------+

The full model i.e. the causal inference diagram PLUS the conditional probability tables can be more easily visualised as follows …

Image by Author

To round out our understanding …

  • The table at the top is saying that the population has a probability of vaccination of 99%
  • The table on the left is saying that the probability of reaction is 0% if vaccination = 0 and 1% if vaccination = 1
  • The table on the right is saying that the probability of smallpox is 2% if vaccination = 0 and 0% if vaccination = 1
  • The table below is saying that the probability of death is 20% if smallpox = 1 and reaction = 0 AND 1% if smallpox = 0 and reaction = 1

Note: The final column is meaningless in this example; it has been added by the .fit() to ensure all the values are present but there are no data where reaction = 1 and smallpox = 1 in the same row.
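To see where these numbers come from, the tables can be reproduced by simple counting: for each combination of parent values, divide the number of rows where the node is 1 by the number of rows with that combination. The sketch below does this in plain Python rather than with pgmpy; the data layout and the function name estimate_cpt are assumptions for illustration.

```python
from collections import defaultdict

COLUMNS = ("Vaccination?", "Reaction?", "Smallpox?", "Death?")

# Observation counts keyed by the column values in the order above
pattern_counts = {
    (1, 0, 0, 0): 980_100,
    (1, 1, 0, 0): 9_801,
    (1, 1, 0, 1): 99,
    (0, 0, 0, 0): 9_800,
    (0, 0, 1, 0): 160,
    (0, 0, 1, 1): 40,
}

# The causal diagram: each node mapped to its direct causes
parents = {
    "Vaccination?": (),
    "Reaction?": ("Vaccination?",),
    "Smallpox?": ("Vaccination?",),
    "Death?": ("Reaction?", "Smallpox?"),
}


def estimate_cpt(node):
    """Return {parent_values: P(node = 1 | parent_values)} by counting rows."""
    node_idx = COLUMNS.index(node)
    parent_idx = [COLUMNS.index(p) for p in parents[node]]
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row, n in pattern_counts.items():
        key = tuple(row[i] for i in parent_idx)
        totals[key] += n
        positives[key] += n * row[node_idx]
    return {key: positives[key] / totals[key] for key in totals}


cpts = {node: estimate_cpt(node) for node in COLUMNS}
for node, table in cpts.items():
    print(node, table)
```

Note that the Death? table produced this way has no entry for (Reaction? = 1, Smallpox? = 1) because no such rows exist, which is why that column in the fitted model above carries a placeholder 0.5 / 0.5 split.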

Step 3: Querying the Model

Basic Queries

First we need a simple helper function to execute queries against the model …

The first query is going to use the model to show the total number of deaths that occurred …

+-----------+---------------+
| Death?    |   phi(Death?) |
+===========+===============+
| Death?(0) |        0.9999 |
+-----------+---------------+
| Death?(1) |        0.0001 |
+-----------+---------------+
Number of deaths: 139
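The 139 figure can be checked without any library by enumerating all 16 combinations of the four binary variables, multiplying the relevant table entries for each one, and summing the combinations where Death? = 1. This is a sketch of inference by brute-force enumeration, not the algorithm pgmpy uses internally; the names P_TRUE, joint and query are hypothetical.

```python
from itertools import product

NODES = ("Vaccination?", "Reaction?", "Smallpox?", "Death?")
PARENTS = {
    "Vaccination?": (),
    "Reaction?": ("Vaccination?",),
    "Smallpox?": ("Vaccination?",),
    "Death?": ("Reaction?", "Smallpox?"),
}
# P(node = 1 | parent values), copied from the tables above
P_TRUE = {
    "Vaccination?": {(): 0.99},
    "Reaction?": {(0,): 0.0, (1,): 0.01},
    "Smallpox?": {(0,): 0.02, (1,): 0.0},
    "Death?": {(0, 0): 0.0, (0, 1): 0.2, (1, 0): 0.01, (1, 1): 0.5},
}


def joint(world):
    """Probability of one complete assignment of 0/1 values to all nodes."""
    p = 1.0
    for node in NODES:
        key = tuple(world[parent] for parent in PARENTS[node])
        p_true = P_TRUE[node][key]
        p *= p_true if world[node] == 1 else 1.0 - p_true
    return p


def query(target, evidence=None):
    """P(target = 1 | evidence), summing the joint over every world."""
    evidence = evidence or {}
    numerator = denominator = 0.0
    for values in product((0, 1), repeat=len(NODES)):
        world = dict(zip(NODES, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue  # world contradicts the evidence
        p = joint(world)
        denominator += p
        if world[target] == 1:
            numerator += p
    return numerator / denominator


p_death = query("Death?")
print(f"P(Death? = 1) = {p_death:.6f}")
print(f"Number of deaths: {round(p_death * 1_000_000)}")
```

The printed table above rounds 0.000139 to 0.0001, which is why the death count is best computed from the unrounded probability.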

The next query is going to ask the counter-factual question: “What would have happened if there had not been a vaccination programme?” i.e. what would the effect have been on the variable Death? if we set the evidence for Vaccination? to 0 …

+-----------+---------------+
| Death?    |   phi(Death?) |
+===========+===============+
| Death?(0) |        0.9960 |
+-----------+---------------+
| Death?(1) |        0.0040 |
+-----------+---------------+
Number of deaths: 4000.0
Lives saved by vaccine program: 3861

Now we have the answer to the question! The vaccination programme has saved 3,861 lives even though the vaccination caused 99 deaths (if you would like to see a validation of the answer check out page 44 of “The Book of Why” where the problem is solved using the probabilities alone).
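The arithmetic behind this answer is short enough to write out by hand. The sketch below, in plain Python with hypothetical names, computes the probability of death for a given vaccination status by summing over the possible values of Reaction? and Smallpox?, then compares the observed world with the world where nobody is vaccinated.

```python
from itertools import product

POPULATION = 1_000_000
P_VACCINATION = 0.99                # P(Vaccination? = 1)
P_REACTION = {0: 0.0, 1: 0.01}      # P(Reaction? = 1 | Vaccination?)
P_SMALLPOX = {0: 0.02, 1: 0.0}      # P(Smallpox? = 1 | Vaccination?)
# P(Death? = 1 | Reaction?, Smallpox?); the (1, 1) case never occurs in the data
P_DEATH = {(0, 0): 0.0, (0, 1): 0.2, (1, 0): 0.01, (1, 1): 0.5}


def p_death_given(vaccination):
    """P(Death? = 1 | Vaccination?), summing out Reaction? and Smallpox?."""
    total = 0.0
    for reaction, smallpox in product((0, 1), repeat=2):
        p_r = P_REACTION[vaccination] if reaction else 1 - P_REACTION[vaccination]
        p_s = P_SMALLPOX[vaccination] if smallpox else 1 - P_SMALLPOX[vaccination]
        total += p_r * p_s * P_DEATH[(reaction, smallpox)]
    return total


# Observed world: 99% of children vaccinated
p_observed = (P_VACCINATION * p_death_given(1)
              + (1 - P_VACCINATION) * p_death_given(0))
deaths_with_programme = round(p_observed * POPULATION)

# "What-if" world: nobody vaccinated
deaths_without_programme = round(p_death_given(0) * POPULATION)

lives_saved = deaths_without_programme - deaths_with_programme
print(f"Deaths with programme: {deaths_with_programme}")
print(f"Deaths without programme: {deaths_without_programme}")
print(f"Lives saved by vaccine program: {lives_saved}")
```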

More Complex Queries

But what if we want to answer a more complicated question like “What would have happened if, instead of 99%, the vaccination rate had been 50%?”.

The basic query() method can only take evidence that fixes a feature to 1 or 0, whereas the 50 / 50 question requires that the conditional probability table for “Vaccination?”, which currently contains 99 / 1, be replaced.

Again it sounds complicated, but it is very straight-forward …

+-----------------+------+
| Vaccination?(0) | 0.01 |
+-----------------+------+
| Vaccination?(1) | 0.99 |
+-----------------+------+
+-----------------+-----+
| Vaccination?(0) | 0.5 |
+-----------------+-----+
| Vaccination?(1) | 0.5 |
+-----------------+-----+
+-----------+---------------+
| Death?    |   phi(Death?) |
+===========+===============+
| Death?(0) |        0.9980 |
+-----------+---------------+
| Death?(1) |        0.0021 |
+-----------+---------------+
Number of deaths: 2050

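A 50% vaccination programme would have resulted in 2,050 deaths. Compared with the 4,000 deaths expected with no programme it would still have saved 1,950 lives, but 1,911 more lives would have been lost than under the actual 99% vaccination programme, which saved 3,861 lives.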
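The effect of changing the vaccination rate can be sketched as a function of that rate, leaving the other tables untouched. This is a plain-Python illustration with hypothetical names (expected_deaths, p_death_given), not the pgmpy table replacement itself.

```python
from itertools import product

POPULATION = 1_000_000
P_REACTION = {0: 0.0, 1: 0.01}      # P(Reaction? = 1 | Vaccination?)
P_SMALLPOX = {0: 0.02, 1: 0.0}      # P(Smallpox? = 1 | Vaccination?)
# P(Death? = 1 | Reaction?, Smallpox?)
P_DEATH = {(0, 0): 0.0, (0, 1): 0.2, (1, 0): 0.01, (1, 1): 0.5}


def p_death_given(vaccination):
    """P(Death? = 1 | Vaccination?), summing out Reaction? and Smallpox?."""
    total = 0.0
    for reaction, smallpox in product((0, 1), repeat=2):
        p_r = P_REACTION[vaccination] if reaction else 1 - P_REACTION[vaccination]
        p_s = P_SMALLPOX[vaccination] if smallpox else 1 - P_SMALLPOX[vaccination]
        total += p_r * p_s * P_DEATH[(reaction, smallpox)]
    return total


def expected_deaths(p_vaccinated):
    """Expected deaths in the population for a given vaccination rate."""
    p = p_vaccinated * p_death_given(1) + (1 - p_vaccinated) * p_death_given(0)
    return round(p * POPULATION)


deaths_none = expected_deaths(0.0)    # no programme
deaths_half = expected_deaths(0.5)    # 50% vaccinated
deaths_actual = expected_deaths(0.99) # the observed 99% programme

print(f"Number of deaths at 50%: {deaths_half}")
print(f"Lives saved versus no programme: {deaths_none - deaths_half}")
print(f"Extra deaths versus 99% programme: {deaths_half - deaths_actual}")
```

Because the function takes the rate as a parameter, the same sketch answers the question for any other coverage level.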

Conclusion

Causal inference is a hot topic in machine learning and artificial intelligence attracting an increasing number of articles and libraries.

However the articles are often complex and difficult to understand and no single library has emerged as the “go-to” for causal inference like scikit-learn for supervised learning.

Certainly human beings are intuitively good at causal inference. We are able to wonder what today might have looked like if we had made different decisions in the past and we can apply this historical experience to imagined future states but machine learning algorithms are not as good at this task.

This article has broken down some of the complexity around causal inference by presenting a simple, straight-forward example of how to build a causal model (causal inference diagram PLUS conditional probability tables) in Python and how to execute basic and more complex queries against that model.

In future articles I will explore counter-factuals, the “do” operator, confounding and other aspects of this fascinating and emerging subject.

