COVID-19 Prediction using LSTM
Building a Deep Learning Model for Forecasting the cases and performing EDA
The pandemic of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) is spreading all over the world and has become the most pressing issue for mankind.
The COVID-19 disease has changed the global landscape completely. A high reproduction rate and a higher chance of complications have led to border closures, empty streets, rampant stockpiling, mass self-isolation policies, and an economic recession.
The supply chain management systems that industries have relied on until now have not been able to cope with the surges in demand and supply across the globe.
At Clairvoyant, we help multiple clients in the CPG/retail domain whose traditional forecasting systems failed to respond to these drastic changes. With a forecast of demand, they can plan ahead and act accordingly. Predicting the spread of COVID and feeding it into demand forecasting helped with efficient warehouse and capacity planning.
In this article, we perform Exploratory Data Analysis on global COVID-19 data and then forecast the number of cases, comparing the predicted cases against the actual cases. To build the model we use the Long Short-Term Memory (LSTM) architecture, a deep learning technique.
The dataset that we will be using in this project is available on Kaggle.
It contains the following files:
- covid_19.csv — Describes the number of cases day-wise for every country
- time_series_covid_19_confirmed_US.csv — Describes the confirmed US cases
- time_series_covid_19_deaths_US.csv — Describes the deaths in the US
- time_series_covid_19_deaths.csv — Describes the total number of deaths in the world since January 2020
- time_series_covid_19_recovered.csv — Describes the number of recovered patients worldwide
Now that we have downloaded the data, we will start with the EDA.
What is EDA?
EDA stands for Exploratory Data Analysis. Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.
In short, EDA is about getting to know the data well before drawing conclusions from it.
Let’s start with the code part!
Importing the libraries.
The following libraries are used for the exploratory data analysis:
Pandas: It is a software library written for the Python programming language for data manipulation and analysis.
NumPy: It is a library used for working with arrays. It also has functions for working in the domain of linear algebra, Fourier transforms, and matrices. NumPy stands for Numerical Python.
Seaborn: Seaborn is a data visualization library based on matplotlib.
Matplotlib: Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy.
Plotly: Plotly allows us to import, copy and paste, or stream data to be analyzed and visualized. For analysis and styling graphs it offers a Python sandbox (NumPy supported), Datagrid, and GUI. It is a scientific graphing library.
By using the head() function we can get to see the first few rows of the data and by using the tail() function we can see the last few rows.
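As a minimal sketch of these two calls, the snippet below builds a tiny stand-in DataFrame instead of reading covid_19.csv. The column names and the values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Tiny stand-in for covid_19.csv; column names and values are assumptions.
df = pd.DataFrame({
    "ObservationDate": ["2020-01-22", "2020-01-22", "2020-01-23", "2020-01-23"],
    "Country/Region": ["A", "B", "A", "B"],
    "Confirmed": [3, 1, 5, 2],
})

first_rows = df.head(2)  # first 2 rows (head() with no argument returns 5)
last_rows = df.tail(2)   # last 2 rows (tail() with no argument returns 5)

print(first_rows)
print(last_rows)
```

On the real file, the same calls would follow a pd.read_csv("covid_19.csv") step.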
The head() output shows the confirmed cases date-wise at the start of 2020, when the number of confirmed cases per day was low for every country.
The number of confirmed cases in Beijing, for example, was only 14.
The last rows, from the tail() output, also show the confirmed cases date-wise. Here the confirmed cases for a day vary widely from one entry to another. For Canada, the confirmed cases are 81, while at the same time in Russia the confirmed cases are around 42,364.
We can use the describe() function to know the statistical summary of the DataFrame columns.
The value_counts() function returns an object containing counts of unique values. The result is sorted in descending order, so the first element is the most frequently occurring value. NA values are excluded by default.
The isnull() function detects missing values in the given series object. It returns a boolean same-sized object indicating if the values are NA. Missing values get mapped to True and non-missing values get mapped to False.
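The three functions above can be tried on a small example. This is only a sketch on a made-up DataFrame; the column names are assumptions and the values are not from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value; column names are assumptions.
df = pd.DataFrame({
    "Country/Region": ["US", "US", "India", "Brazil", "India", "US"],
    "Confirmed": [10.0, 20.0, 5.0, np.nan, 8.0, 30.0],
})

summary = df["Confirmed"].describe()          # count, mean, std, min, quartiles, max
counts = df["Country/Region"].value_counts()  # most frequent value first, NA excluded
missing = df.isnull().sum()                   # number of missing values per column

print(summary)
print(counts)
print(missing)
```

Chaining isnull() with sum() is a common way to get a per-column count of missing values, because True is counted as 1.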
For visualizing the data we can use different types of graphs such as bar graphs, histograms, boxplots, and others.
Plotting the active, confirmed, deaths, recovered cases in the world and displaying the top 10 countries that are most affected.
The above output shows the deaths, confirmed and active cases for every country. We can observe that the most affected country has been the USA. It has 577,045 deaths, the highest of any country.
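One way such a per-country table could be produced is sketched below. It assumes the file holds cumulative counts per region and day, that the columns are named ObservationDate, Country/Region, Confirmed, Deaths and Recovered, and that active cases are derived as Confirmed minus Deaths minus Recovered. The helper top_countries and the toy values are hypothetical.

```python
import pandas as pd

def top_countries(df, column="Confirmed", n=10):
    """Return the n countries with the largest latest cumulative `column`."""
    # Counts are assumed cumulative, so only the most recent day is needed.
    latest = df[df["ObservationDate"] == df["ObservationDate"].max()]
    totals = latest.groupby("Country/Region")[["Confirmed", "Deaths", "Recovered"]].sum()
    # Assumption: active cases are whatever is neither dead nor recovered.
    totals["Active"] = totals["Confirmed"] - totals["Deaths"] - totals["Recovered"]
    return totals.nlargest(n, column)

# Toy data: country "A" reports two regions on the last day.
df = pd.DataFrame({
    "ObservationDate": pd.to_datetime(
        ["2021-05-01", "2021-05-01", "2021-05-02", "2021-05-02", "2021-05-02"]),
    "Country/Region": ["A", "B", "A", "A", "B"],
    "Confirmed": [90, 40, 60, 50, 45],
    "Deaths": [5, 2, 3, 3, 2],
    "Recovered": [50, 20, 30, 30, 25],
})

top = top_countries(df, column="Confirmed", n=1)
print(top)
```

Passing column="Deaths" or column="Active" would rank the countries by those counts instead.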
Now displaying cases date-wise:
Displaying top 10 dates when the confirmed case counts were highest:
The above table lists the dates on which the highest confirmed case counts were reported.
The maximum was reported on 02-05-2021.
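A table like this can be built by summing the confirmed cases per date and keeping the largest values. The sketch below assumes the columns ObservationDate and Confirmed; the helper top_dates and the toy values are hypothetical.

```python
import pandas as pd

def top_dates(df, n=10):
    """Return the n dates with the highest total confirmed count."""
    daily = df.groupby("ObservationDate")["Confirmed"].sum()
    return daily.nlargest(n)

# Toy data: two regions reporting on each of three days.
df = pd.DataFrame({
    "ObservationDate": pd.to_datetime(
        ["2021-04-30", "2021-04-30", "2021-05-01",
         "2021-05-01", "2021-05-02", "2021-05-02"]),
    "Confirmed": [10, 20, 15, 25, 20, 30],
})

top = top_dates(df, n=2)
print(top)
```

If the counts are cumulative, the top dates will naturally be the most recent ones.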
Plotting Bar Plot for the Total number of Confirmed Vs Recovered Vs Active and Deaths:
This bar plot shows the number of active, confirmed, recovered, and death cases. It shows that the recovery numbers are far behind the confirmed numbers, which is a concern.
Plotting top 10 Countries with the most number of confirmed cases:
The above graph shows the 10 most affected countries by number of confirmed cases. The United States is the most affected country, with 32,421,534 cases reported so far.
Plotting top 10 countries with the most deaths:
This graph plots the total number of deaths for the 10 countries with the highest death counts. The US is at the top with the maximum number of deaths, followed by Brazil with 407,639 deaths.
Top 10 countries with the most number of Active Cases:
This graph plots the total number of active cases for the top 10 countries.
The USA has the most active cases. The dataset is a little old, so the numbers might have changed by now.
Rate of Increase in Infection with respect to Time:
The above graph shows how the rate of infection has increased. In the beginning, between 0 and 100 days, the infection rate was low. After that it increased rapidly and peaked after 400 days, by which time the confirmed cases had reached more than 150M.
Plotting Active case rate with respect to time:
Here, we are plotting the active cases with respect to time. The X-axis shows the day count and the Y-axis shows the total active cases. The peak reached 60M at around 450 days.
Death toll with respect to time:
Similar to the two plots above, here we are plotting the deaths. We can see that the deaths kept on increasing in an exponential manner. After a period of around 90 days, the rate of deaths increased as the days passed. The death toll has peaked at 3 million worldwide.
Plotting Confirmed based on days (months):
This plot shows the confirmed cases by month. It gives a wider view of how the cases increased and when they were at their maximum.
Active cases based on days:
This graph plots the active cases with respect to days. The active cases are at their maximum in May 2021. Through these graphs, we get a wider picture of the cases month-wise.
Death toll based on days:
The above graph plots the deaths with respect to days. Here we can see that the deaths in the month of March 2020 were zero. After that, there is an increase in the number of deaths every month.
Recovered cases with respect to days:
Covid Analysis In India:
Forecasting Using LSTM
To predict the COVID numbers, we will build a model based on the LSTM architecture.
After this, we will plot the actual COVID numbers against the numbers predicted by the model.
We chose an RNN for time series prediction instead of other models because LSTMs, a type of RNN, are good at identifying complex patterns in data by remembering what's useful and discarding what's not.
Steps performed to forecast using the model:
1. Data Preparation
2. Preprocessing Data
Since the number of COVID cases gets rather large over time, training on the raw values may be very slow. We can fix this by using sklearn's MinMaxScaler to rescale our data to a small fixed range such as [0, 1].
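A minimal sketch of this rescaling step is shown below. The case counts are made-up values, and the (0, 1) range is MinMaxScaler's default rather than a setting confirmed by the original code.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up cumulative case counts; MinMaxScaler expects a 2D array (samples, features).
cases = np.array([100.0, 250.0, 400.0, 800.0, 1600.0]).reshape(-1, 1)

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(cases)  # (x - min) / (max - min) per column

print(scaled.ravel())
```

The fitted scaler should be kept around, since it is needed later to convert predictions back to case counts.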
3. We build X and y in such a way that each X sample contains the cases for a fixed number of previous days (time_step) and y contains the reading for the next day.
This way the model will be trained to predict the number of cases on a certain day based on the trend in the number of cases over the previous time_step days.
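The windowing described in this step can be sketched as follows. The helper name make_windows is hypothetical, and a simple counting series stands in for the scaled case counts.

```python
import numpy as np

def make_windows(series, time_step):
    """Turn a 1D series into (X, y): X holds time_step past values, y the next one."""
    X, y = [], []
    for i in range(len(series) - time_step):
        X.append(series[i:i + time_step])  # the previous time_step days
        y.append(series[i + time_step])    # the day to predict
    return np.array(X), np.array(y)

series = np.arange(10, dtype=float)  # stand-in for the scaled case counts
X, y = make_windows(series, time_step=3)

print(X.shape, y.shape)
print(X[0], y[0])
```

A series of length N yields N minus time_step samples, so a larger time_step gives the model more context per sample but fewer samples overall.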
4. Data Partitioning
Since we are looking at a chronological timeline of COVID cases, we take the first 80% of the data as our training set and the remaining 20% as our testing set.
We then reshape the X partitions into the three-dimensional (samples, time steps, features) form that LSTM layers expect, so our model can process them properly.
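A sketch of the chronological split and the reshape is given below. The helper name chronological_split is hypothetical and the arrays are placeholders for the windowed data.

```python
import numpy as np

def chronological_split(X, y, train_fraction=0.8):
    """Split without shuffling so the test set is strictly later in time."""
    split = int(len(X) * train_fraction)
    return X[:split], X[split:], y[:split], y[split:]

# Placeholder windowed data: 10 samples with 3 time steps each.
X = np.arange(30, dtype=float).reshape(10, 3)
y = np.arange(10, dtype=float)

X_train, X_test, y_train, y_test = chronological_split(X, y)

# LSTM layers expect input shaped (samples, time_steps, features); here features = 1.
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], 1)

print(X_train.shape, X_test.shape)
```

Shuffling before the split is avoided on purpose, because it would let the model train on days that come after the ones it is tested on.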
5. Building Model Architecture
- We are going to build the model with the help of LSTM
- The model first has an input layer which is followed by three LSTM layers
- The LSTM layers use a dropout of 0.5 to prevent overfitting in the model
- The output layer consists of a Dense layer with 1 neuron with activation as ReLU
- We are predicting the number of Corona cases, so our output will be a non-negative number in [0, ∞)
- For compiling the model, we take 'mean_squared_error' as the loss and Adam as the optimizer
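The architecture listed above could look roughly like the Keras sketch below. The number of units per LSTM layer (50) and the use of separate Dropout layers after each LSTM layer are assumptions, since the text only gives the dropout rate. The helper name build_model is hypothetical.

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

def build_model(time_step, units=50):
    """Three stacked LSTM layers with dropout and a single ReLU output neuron."""
    model = Sequential([
        Input(shape=(time_step, 1)),         # (time steps, features)
        LSTM(units, return_sequences=True),  # pass the full sequence to the next LSTM
        Dropout(0.5),
        LSTM(units, return_sequences=True),
        Dropout(0.5),
        LSTM(units),                         # last LSTM returns only its final state
        Dropout(0.5),
        Dense(1, activation="relu"),         # non-negative prediction
    ])
    model.compile(loss="mean_squared_error", optimizer="adam")
    return model

model = build_model(time_step=5)
model.summary()
```

The first two LSTM layers set return_sequences=True because a stacked LSTM layer needs the whole sequence as input, not just the final output.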
6. Training the Model
To train the model we'll take our training data (80%) and hold out 20% of it as validation data.
To lower the learning rate when the monitored metric stops improving, we use the ReduceLROnPlateau callback.
We train the model for 100 epochs.
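The training step might be sketched as below. The callback settings (factor, patience, min_lr) are assumed values, not the ones from the original code. The tiny model and random data exist only to make the sketch runnable, and it is run for 2 epochs here instead of 100.

```python
import numpy as np
from tensorflow.keras import Input, Sequential
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.layers import LSTM, Dense

def train(model, X_train, y_train, epochs=100):
    """Fit with the last 20% of the training data held out for validation."""
    reduce_lr = ReduceLROnPlateau(
        monitor="val_loss",  # watch the validation loss
        factor=0.5,          # halve the learning rate on a plateau (assumed value)
        patience=5,          # wait 5 epochs without improvement (assumed value)
        min_lr=1e-5,         # lower bound for the learning rate (assumed value)
    )
    return model.fit(
        X_train, y_train,
        validation_split=0.2,
        epochs=epochs,
        callbacks=[reduce_lr],
        verbose=0,
    )

# Deliberately tiny stand-in model and random data, for illustration only.
tiny_model = Sequential([Input(shape=(5, 1)), LSTM(4), Dense(1, activation="relu")])
tiny_model.compile(loss="mean_squared_error", optimizer="adam")
X_demo = np.random.rand(20, 5, 1)
y_demo = np.random.rand(20)

history = train(tiny_model, X_demo, y_demo, epochs=2)
print(history.history["val_loss"])
```

Keras takes the validation_split fraction from the end of the arrays before any shuffling, so with chronologically ordered data the validation set is the most recent part of the training period.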
7. Plotting the Prediction
To see how well the model performs, we first predict the output for our X_test data. This gives us the model's predictions for the test period.
To accurately plot the values we need to bring our prediction and y_test data back to the original bounds of the data.
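Bringing the values back to the original scale can be done with the scaler's inverse_transform method. In the sketch below the case counts and the scaled predictions are made-up stand-ins; in the real pipeline the latter would come from model.predict(X_test).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# The scaler must be the one fitted on the original case counts (made-up values here).
cases = np.array([100.0, 250.0, 400.0, 800.0, 1600.0]).reshape(-1, 1)
scaler = MinMaxScaler().fit(cases)

# Stand-in for model.predict(X_test): values in the scaled [0, 1] range.
scaled_pred = np.array([[0.0], [0.5], [1.0]])

pred_cases = scaler.inverse_transform(scaled_pred)  # x * (max - min) + min
print(pred_cases.ravel())
```

The same call would be applied to y_test, reshaped to a column with y_test.reshape(-1, 1), so that both series share the same units when plotted.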
In the end, we can plot the actual COVID cases against our predicted COVID cases to see how closely the model follows the real trend.
By performing EDA we get to know the dataset better. At the same time, we can draw meaningful information out of it and check whether it contains any flaws.
Many algorithms are available for forecasting. Statistical models such as Random Effect and Fixed Effect models are linear, so it can be difficult to adapt them to forecasting problems with multiple inputs.
The LSTM model used for forecasting predicts an exponential trend in the number of COVID-19 cases that is quite similar to the real number of cases. The model might give better results if it is trained for more epochs.
I hope you found this post interesting and informative.