
# Modelling New York City Bicycle Volumes Using Generalised Linear Models

## A simple project walkthrough in fitting Generalised Linear Models in Python

# Introduction

Despite writing several articles on the topic and working in the insurance industry, I have actually never fitted a **Generalised Linear Model (GLM)** from scratch.

Shocking I know.

So, I have decided to spread my wings and carry out a small project where I can put all my theoretical knowledge into practice!

In this article, I want to walk you through a simple project using GLMs to model the bicycle crossing volumes in New York City. We will also briefly cover the main technical details behind GLMs and the motivations for their use.

The data used for this project is from the New York City Department of Transportation and is available here on Kaggle with a CC0 licence. Kaggle sourced this dataset from NYC Open Data, which you can find here.

# Overview of Generalised Linear Models

For the purpose of completeness, I will discuss the main concepts behind GLMs in this post. However, for a more in-depth understanding, I highly recommend you check out my previous articles that really deep-dive into their technical details:

## Motivation

**Generalised Linear Models** literally ‘generalise’ **Linear Regression** to a target variable that is **non-normal**.

For example, here we are going to be modelling the bicycle crossing volumes in New York City. If we were going to model this as a Linear Regression problem, we would be assuming that the bicycle count against our features would follow a **Normal distribution**.

There are two issues with this:

- The Normal distribution is **continuous**, whereas bicycle count is **discrete**.
- The Normal distribution can be **negative**, but bicycle count is **positive**.

Hence, we use GLMs to overcome these issues and limitations of regular Linear Regression.

## Mathematics

The general formula for Linear Regression is:

E[Y | X] = β_0 + β_1·X_1 + … + β_n·X_n

Where **X** are the **features**, ***β*** are the **coefficients** with ***β_0*** being the **intercept**, and ***E[Y | X]*** is the **expected value** (mean) of ***Y*** given our data ***X***.

To transform this Linear Regression formula to incorporate non-normal distributions, we attach something called the **link function**, **g()**:

g(E[Y | X]) = β_0 + β_1·X_1 + … + β_n·X_n

The link function literally ‘**links**’ your linear combination input to your desired target distribution.

The link function can either be found empirically or mathematically for each distribution. In the articles I have linked above, I go through deriving the link functions for some distributions.

Not every distribution is covered under the GLM umbrella. The distributions have to be part of the **exponential family**. However, most of the common distributions, such as the **Gamma**, **Poisson**, **Binomial** and **Bernoulli**, are all part of this family.

The link function for the Normal distribution (Linear Regression) is called the **identity**.

## Poisson Regression

To model our bicycle volumes we will use the **Poisson distribution**. This distribution describes the probability of a certain number of events occurring in a given timeframe, given a mean occurrence rate.

To learn more about the Poisson distribution check out my article about it here.

For Poisson Regression, the link function is the **natural log**:

ln(E[Y | X]) = β_0 + β_1·X_1 + … + β_n·X_n

Equivalently, taking the exponential of both sides:

E[Y | X] = exp(β_0 + β_1·X_1 + … + β_n·X_n)

As you can see, our output will now always be positive as we are using the exponential. This means we avoid any nonsensical results, unlike Linear Regression, where the output could have been negative.
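To see the effect of the log link numerically, here is a quick sketch (with made-up linear predictor values) showing that the exponential maps any real-valued linear predictor to a strictly positive mean:

```python
import numpy as np

# The linear predictor can take any real value...
linear_predictor = np.array([-3.0, 0.0, 2.5])

# ...but the inverse of the log link (the exponential) always yields a positive rate
rates = np.exp(linear_predictor)
print(rates)  # all strictly positive
```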

Again, I haven’t done a full thorough analysis of GLMs as it would be exhaustive and I have previously covered these topics. If you are interested in learning more about GLMs, make sure to check out my articles I linked above or any of the hyperlinks I have provided throughout!

# Modelling Walk Through

## Packages

We will first import the basic Data Science packages and also the statsmodels package for GLM modelling.

```python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import statsmodels.api as sm
from statsmodels.formula.api import glm
```

## Data

Read in and print the data:

```python
data = pd.read_csv('nyc-east-river-bicycle-counts.csv')
data.head()
```

Here we have the bike volumes across the four bridges: Brooklyn, Manhattan, Williamsburg and Queensboro with their sum under the ‘Total’ feature.

There are duplicate columns for ‘Date’ and the index, so let’s clean that up:

```python
data.drop(['Unnamed: 0', 'Day'], axis=1, inplace=True)
data.head()
```

Notice there are two columns for temperature: high and low. Let’s combine them into a single column by taking their mean:

```python
data['Mean_Temp'] = (data['High Temp (°F)'] + data['Low Temp (°F)']) / 2
data.head()
```

The precipitation column contains some strings, so let’s convert those to numeric values:

```python
# Replace the annotated value and the trace indicator 'T' with numbers
data['Precipitation'].replace(to_replace='0.47 (S)', value=0.47, inplace=True)
data['Precipitation'].replace(to_replace='T', value=0, inplace=True)
data['Precipitation'] = data['Precipitation'].astype(np.float16)
data.head()
```
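As an aside, if a column contained many such annotated strings, a more general approach is to extract the numeric part with a regular expression. This is a sketch on hypothetical values mimicking the quirks above (‘T’ for a trace amount, a parenthesised annotation):

```python
import pandas as pd

# Hypothetical precipitation strings mimicking the dataset's quirks
s = pd.Series(['0.0', 'T', '0.47 (S)', '0.15'])

# Treat the trace indicator 'T' as zero, then extract the numeric part
cleaned = (s.replace('T', '0')
             .str.extract(r'([\d.]+)')[0]
             .astype(float))
print(cleaned.tolist())  # [0.0, 0.0, 0.47, 0.15]
```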

## Visualisations

The two main independent variables that affect bicycle volumes are temperature and precipitation. We can plot these two variables against the target variable ‘Total’:

```python
fig = plt.figure(figsize=(22, 7))

ax = fig.add_subplot(121)
ax.scatter(data['Mean_Temp'], data['Total'], linewidth=4, color='blue')
ax.tick_params(axis="x", labelsize=22)
ax.tick_params(axis="y", labelsize=22)
ax.set_xlabel('Mean Temperature', fontsize=22)
ax.set_ylabel('Total Bikes', fontsize=22)

ax2 = fig.add_subplot(122)
ax2.scatter(data['Precipitation'], data['Total'], linewidth=4, color='red')
ax2.tick_params(axis="x", labelsize=22)
ax2.tick_params(axis="y", labelsize=22)
ax2.set_xlabel('Precipitation', fontsize=22)
ax2.set_ylabel('Total Bikes', fontsize=22)
```

## Modelling

We can now build a model to predict ‘Total’ from the mean temperature feature using the statsmodels package. As we are modelling count data with the Poisson family, the natural log link function is used:

```python
model = glm('Total ~ Mean_Temp', data=data[['Total', 'Mean_Temp']], family=sm.families.Poisson())
results = model.fit()
results.summary()
```

We used the R-style formula interface for the GLM, as it is a convenient and readable way to specify models in statsmodels.

## Analysis

From the output above, we see that the coefficient for the mean temperature is **0.0263** and the intercept is **8.1461**.

Using the Poisson Regression formula we presented above, the equation of our line is then:

E[Total | Mean_Temp] = exp(8.1461 + 0.0263 · Mean_Temp)

```python
x = np.linspace(data['Mean_Temp'].min(), data['Mean_Temp'].max(), 100)
y = np.exp(results.params['Intercept'] + results.params['Mean_Temp'] * x)

plt.figure(figsize=(10, 6))
plt.scatter(data['Mean_Temp'], data['Total'], linewidth=3, color='blue')
plt.plot(x, y, label='Poisson Regression', color='red', linewidth=3)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
plt.xlabel('Mean Temperature', fontsize=18)
plt.ylabel('Total Count', fontsize=18)
plt.legend(fontsize=18)
plt.show()
```
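As a quick sanity check on the fitted line, we can plug a plausible mean temperature into the equation by hand, using the coefficients reported above:

```python
import numpy as np

# Coefficients from the model summary above
intercept, coef = 8.1461, 0.0263

# Expected total crossings at a mean temperature of 70°F
expected = np.exp(intercept + coef * 70.0)
print(round(expected))  # roughly 21,700 crossings
```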

Eureka! We have fitted a GLM!

For the interested reader, the algorithm used by statsmodels to fit the GLM is called **iteratively reweighted least squares**.
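For illustration, iteratively reweighted least squares for the Poisson/log-link case can be sketched in a few lines of NumPy. This is a simplified version only; the statsmodels implementation additionally handles offsets, weights and convergence checks:

```python
import numpy as np

def irls_poisson(x, y, n_iter=25):
    """Minimal IRLS for Poisson regression with a log link (illustrative only)."""
    X = np.column_stack([np.ones_like(x), x])  # design matrix with intercept
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta             # linear predictor
        mu = np.exp(eta)           # inverse link: fitted means
        z = eta + (y - mu) / mu    # working response
        W = mu                     # Poisson variance equals the mean
        # Weighted least squares step: solve (X'WX) beta = X'Wz
        XtW = X.T * W
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta

# Recover known coefficients from synthetic Poisson data
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 2.0, 500)
y = rng.poisson(np.exp(1.0 + 0.5 * x))
print(irls_poisson(x, y))  # close to [1.0, 0.5]
```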

The full code/notebook is available on my GitHub here.

# Conclusion

In this article we had a short discussion of the short-comings of Linear Regression and how GLMs solve this issue by providing a wider and generic framework for regression models. We then fit a basic Poisson Regression line to model the number of bicycles in New York City as a function of daily average temperature.
