Simple Guide on using Supervised Learning Model to forecast for Time-Series Data




Let’s start by preprocessing the dataset and identifying which features we can build for the model. Because we are using a supervised learning approach, we need to frame the data as a supervised learning problem.

Preprocessing/ Formatting the Dataset

First, let’s run the most basic check: whether there are missing values in the dataset.

print(Train_Table.isnull().sum())
Number of null records (Image by Author)

From the result above, we can see that none of the columns has missing values. Next, if you look at the data types, you will notice that the date column (“Week_date”) is stored as a string. Convert this column to a datetime type:

Train_Table['Week_date'] = pd.to_datetime(Train_Table['Week_date'], errors="coerce")
Convert Week_date column from Type “String” to “datetime” (Image by Author)

The date column (“Week_date”) is the time-based column and needs to be converted to a datetime object. With the column in the right format, we can index the data frame by date.
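Because `errors="coerce"` turns any unparseable string into `NaT` instead of raising an error, it can be worth re-checking for missing dates after the conversion. The following is a minimal sketch on made-up sample rows (the values and the `sample` frame are illustrative assumptions, not the article's data):

```python
import pandas as pd

# Hypothetical sample rows; the real Train_Table comes from the BigQuery export.
sample = pd.DataFrame({
    "Week_date": ["2020-01-13", "2020-01-06", "not a date"],
    "Total_Sales_dollars": [120.0, 100.0, 90.0],
})

# Unparseable strings become NaT instead of raising an error.
sample["Week_date"] = pd.to_datetime(sample["Week_date"], errors="coerce")

# Count rows whose date could not be parsed.
n_bad_dates = int(sample["Week_date"].isna().sum())

# Drop those rows, then sort chronologically and index by date.
indexed = (
    sample.dropna(subset=["Week_date"])
          .sort_values("Week_date")
          .set_index("Week_date")
)
print(n_bad_dates)
print(indexed)
```

Sorting before indexing matters here because every lag and rolling feature that follows assumes the rows are in chronological order.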

Feature Engineering for Time Series Data

Before building the model, we will need to re-structure the dataset with a set of features/input variables (x) and the output variable (y-target). Below are the common features generated on a Time-Series dataset:

  • Lag Periods: Lagged values (e.g. yesterday, previous week, previous month, etc.)

In this example, because the dataset is weekly, the sales values are lagged by one week.

#Python shift() function 
Train_Table['Total_Sales_Dollars_lag1'] = Train_Table['Total_Sales_dollars'].shift(1)
“Total_Sales_Dollars_lag1” feature created (Image by Author)
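If more than one lag is needed, the same `shift()` call can be wrapped in a small helper. This is only a sketch: `add_lags` is a hypothetical function name and the numbers are toy values.

```python
import pandas as pd

def add_lags(frame, column, lags):
    """Return a copy of frame with one lagged column per entry in lags."""
    out = frame.copy()
    for lag in lags:
        # shift(lag) moves values down by `lag` rows; the first rows become NaN.
        out[f"{column}_lag{lag}"] = out[column].shift(lag)
    return out

# Toy weekly sales, assumed to be sorted by week already.
weekly = pd.DataFrame({"Total_Sales_dollars": [100.0, 120.0, 90.0, 110.0]})
lagged = add_lags(weekly, "Total_Sales_dollars", lags=[1, 2])
print(lagged)
```

Each extra lag costs one more row of missing values at the start of the table, which matters when only one year of weekly data is available.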
  • Lag Periods by Moving Average: Rolling mean in the past X hour/day/week

Moving-average features smooth out the values and dampen the impact of outliers and fluctuations over a specific time period. In this example, we create a rolling 2-week average and lag it by 1 week to avoid data leakage.

# pandas rolling().mean() function
# shift(1) keeps the original index, so no reset_index() is needed before assigning
Train_Table['Total_Sales_Dollar_MA2_lag1'] = Train_Table['Total_Sales_dollars'].rolling(2).mean().shift(1)
“Total_Sales_Dollar_MA2_lag1” feature created (Image by Author)
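To see why the extra lag avoids leakage, it helps to compare the rolling mean with and without `shift(1)` on a few toy values (the numbers below are made up for illustration):

```python
import pandas as pd

# Toy weekly sales, sorted by week.
sales = pd.Series([100.0, 120.0, 90.0, 110.0])

# Without a lag, the window for week t includes week t itself,
# i.e. the very value the model is supposed to predict.
ma2_leaky = sales.rolling(2).mean()

# With shift(1), the feature for week t is the mean of weeks t-1 and t-2 only.
ma2_lag1 = ma2_leaky.shift(1)

print(pd.DataFrame({"sales": sales, "ma2_leaky": ma2_leaky, "ma2_lag1": ma2_lag1}))
```

In the third row the unshifted average (105) already contains that week's own sales of 90, while the lagged version (110) is built from the two earlier weeks only.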

  • Differences: The current hour/day/week/month value minus the previous hour/day/week/month value (e.g. the difference between yesterday's value and the value on the same day last week)

Features that calculate the difference from previous values are considered trend features. In this example, we calculate the difference between each week and the week before it, then lag the result by 1 week.

# Week-over-week difference (same result as pandas diff())
def difference(data, feature):
    # assume data is already sorted by date
    return data[feature] - data[feature].shift(1)

Train_Table['Total_Sales_Dollar_Difference_lag1'] = difference(Train_Table, 'Total_Sales_dollars').shift(1)
“Total_Sales_Dollar_Difference_lag1” feature created (Image by Author)

We can also create features that capture the percentage change between the current value and the previous value. In this example, we calculate the percentage change from the previous week to the current week, then lag the result by 1 week.

def difference_in_percentage(data, feature):
    # assume data is already sorted by date
    lag_by_1 = data[feature].shift(1)
    return (data[feature] - lag_by_1) / lag_by_1

Train_Table['Total_Sales_Dollar_Difference_in_Percent_Lag1'] = difference_in_percentage(Train_Table, 'Total_Sales_dollars').shift(1)
“Total_Sales_Dollar_Difference_in_Percent_Lag1” feature created (Image by Author)
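pandas also ships built-in `diff()` and `pct_change()` methods that compute the same week-over-week quantities as the two helper functions above. A short sketch on toy numbers (illustrative values only):

```python
import pandas as pd

# Toy weekly sales, sorted by week.
sales = pd.Series([100.0, 120.0, 90.0, 108.0], name="Total_Sales_dollars")

# Week-over-week change, then lagged by one week so that the row for
# week t only uses information available up to week t-1.
diff_lag1 = sales.diff().shift(1)
pct_lag1 = sales.pct_change().shift(1)

print(pd.DataFrame({"sales": sales, "diff_lag1": diff_lag1, "pct_lag1": pct_lag1}))
```

Note that `pct_change()` returns a ratio (0.2 for a 20% increase), matching the helper function, which does not multiply by 100 either.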
  • Timestamp decomposition: day of the week, day of the month, the month of the year, weekday, or weekend.

The year, month, week, and day values from the date can also be used as numerical features. In this example, the week and year were already extracted in the SQL query. To do the same in Python, we can use the following approach:

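# .dt.week was removed in pandas 2.0; isocalendar() returns the ISO week number
# (astype(int) assumes there are no missing dates)
Train_Table['week'] = Train_Table['Week_date'].dt.isocalendar().week.astype(int)
Train_Table['month'] = Train_Table['Week_date'].dt.month
Train_Table['year'] = Train_Table['Week_date'].dt.year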
“year”, “month”, “week” feature created (Image by Author)

Other features that can be created include statistical values such as the minimum, maximum, mean, and standard deviation over yesterday, the previous week, the previous 2 weeks, and so on.
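As a rough sketch of such statistical features, the snippet below builds lagged rolling statistics over a 3-week window. The window length, the column names, and the toy values are all assumptions made for illustration:

```python
import pandas as pd

# Toy weekly sales, sorted by week.
sales = pd.Series([100.0, 120.0, 90.0, 110.0, 130.0], name="Total_Sales_dollars")

# Shift first so the window for week t covers weeks t-1, t-2 and t-3 only.
past = sales.shift(1)
window = past.rolling(3)

stats = pd.DataFrame({
    "roll3_min_lag1": window.min(),
    "roll3_max_lag1": window.max(),
    "roll3_mean_lag1": window.mean(),
    "roll3_std_lag1": window.std(),
})
print(stats)
```

The first three rows stay empty because a full 3-week history is not yet available for them.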

By now, we should have a set of features for the model to use. Let’s move on to the next step: fitting the model.

Fitting the Model

First, decide which features to use and which column to index by. In this example, I leave out the “year” feature because all of the training data falls within the same year, and the “Week_date” column is used as the index.

Train_Table = Train_Table[['Week_date', 'Total_Sales_dollars', 'Total_Sales_Dollars_lag1', 'Total_Sales_Dollar_MA2_lag1', 'Total_Sales_Dollar_Difference_lag1', 'Total_Sales_Dollar_Difference_in_Percent_Lag1', 'month', 'week']]
Train_Table = Train_Table.fillna(0)
Table = Train_Table.set_index('Week_date')
Selected features and Index by “Week_date” (Image by Author)

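Next, we will split the data into training and test sets using an 80:20 ratio. Because the rows are ordered in time, shuffling is turned off so that the test set consists of the most recent 20% of weeks.

from sklearn.model_selection import train_test_split

# shuffle=False keeps chronological order: the last 20% of rows form the test set
training_data, testing_data = train_test_split(Table, test_size=0.2, shuffle=False)
print(f"No. of training examples: {training_data.shape[0]}")
print(f"No. of testing examples: {testing_data.shape[0]}")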
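A single split gives only one estimate of performance. For time-ordered data, one option (not used in this article) is scikit-learn's `TimeSeriesSplit`, which produces several folds in which the training rows always come before the test rows. A minimal sketch on ten placeholder weeks:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten placeholder weeks, identified by position and assumed to be in date order.
n_weeks = 10
week_positions = np.arange(n_weeks)

splitter = TimeSeriesSplit(n_splits=3)
folds = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in splitter.split(week_positions)
]

for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each fold extends the training window forward in time, so the model is never evaluated on weeks that precede its training data.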

Next, separate the data into x_train, y_train, x_test and y_test, i.e. the input (x) and the output (y):

x_train, y_train = training_data.drop("Total_Sales_dollars", axis=1), training_data['Total_Sales_dollars']
x_test, y_test = testing_data.drop("Total_Sales_dollars", axis=1) , testing_data['Total_Sales_dollars']

With the dataset separated into input (x) and output (y), we can import the XGBoost library and fit the model on the training set.

import xgboost as xgb
from sklearn.metrics import mean_absolute_error

# Recent XGBoost releases (2.0+) take early_stopping_rounds in the constructor, not in fit()
model = xgb.XGBRegressor(
    n_estimators=1000,
    early_stopping_rounds=50,  # stop after 50 consecutive rounds without a decrease in error
)
model.fit(x_train, y_train,
          eval_set=[(x_train, y_train), (x_test, y_test)],
          verbose=False)

After the model has been trained, apply it to the test set and evaluate its performance. The performance measure used here is mean absolute error (MAE). MAE is the average of the absolute forecast errors: each error is made positive before averaging. Many other performance measures can be used, such as error ratio or mean squared error. If you are interested, you can refer to this article, which lists different performance measures: various performance measures for time series forecasting.

preds = pd.DataFrame(model.predict(x_test))

Measuring model performance:

from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, preds)
MAE Error rate (Image by Author)
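To make the metric concrete, here is the calculation written out with NumPy on made-up actual and forecast values. The mean absolute percentage error (MAPE) is shown alongside it because the two are easy to confuse: MAE is in sales dollars, while MAPE is a ratio.

```python
import numpy as np

# Illustrative values only; not results from the model in this article.
actual = np.array([100.0, 120.0, 90.0, 110.0])
forecast = np.array([110.0, 115.0, 100.0, 100.0])

errors = actual - forecast                # signed forecast errors
mae = np.mean(np.abs(errors))             # mean absolute error, in sales dollars
mape = np.mean(np.abs(errors / actual))   # mean absolute percentage error, as a ratio

print(f"MAE:  {mae:.2f}")
print(f"MAPE: {mape:.4f}")
```

MAPE is undefined when an actual value is 0, which is one reason to prefer MAE when some weeks may have no sales.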

To get a more reliable estimate of accuracy, we can test the model on a holdout set (another test set that the model has not seen). We can use the data available under the same Google BigQuery public dataset, “2021 sales predict”. Apply the same data transformations and create the same set of features used to train the model before applying the model to the holdout set.
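One way to keep the training and holdout transformations identical is to collect the feature steps in a single function and call it on both tables. The sketch below assumes the column names used earlier in this article; `build_features` is a hypothetical helper and the holdout rows are made up.

```python
import pandas as pd

def build_features(frame, target="Total_Sales_dollars", date_col="Week_date"):
    """Recreate the lag, moving-average, difference and calendar features."""
    out = frame.sort_values(date_col).copy()
    out["Total_Sales_Dollars_lag1"] = out[target].shift(1)
    out["Total_Sales_Dollar_MA2_lag1"] = out[target].rolling(2).mean().shift(1)
    out["Total_Sales_Dollar_Difference_lag1"] = out[target].diff().shift(1)
    out["Total_Sales_Dollar_Difference_in_Percent_Lag1"] = out[target].pct_change().shift(1)
    out["month"] = out[date_col].dt.month
    out["week"] = out[date_col].dt.isocalendar().week.astype(int)
    # Index by date, then fill the leading gaps with 0 as in the training table.
    return out.set_index(date_col).fillna(0)

# Hypothetical holdout rows standing in for the 2021 data.
holdout = pd.DataFrame({
    "Week_date": pd.to_datetime(["2021-01-04", "2021-01-11", "2021-01-18", "2021-01-25"]),
    "Total_Sales_dollars": [100.0, 120.0, 90.0, 108.0],
})
holdout_features = build_features(holdout)
print(holdout_features)
```

The first rows of a holdout set have no earlier rows to lag from, so their lag features are filled with 0 here. Prepending the last few training weeks before building the features is one way to avoid that.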

After applying the model on the holdout data set, we can also plot a chart to compare the actual vs forecast values.

Plot for Actual vs Forecast (Image by Author)
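A chart like this can be drawn with matplotlib. The sketch below uses made-up actual and forecast values, since the plotting code is not part of the original article; the `comparison` frame is an assumed layout with one column per series.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative values only; replace with the holdout actuals and model predictions.
dates = pd.date_range("2021-01-04", periods=4, freq="W-MON")
comparison = pd.DataFrame(
    {"actual": [100.0, 120.0, 90.0, 108.0], "forecast": [110.0, 115.0, 100.0, 100.0]},
    index=dates,
)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(comparison.index, comparison["actual"], marker="o", label="Actual")
ax.plot(comparison.index, comparison["forecast"], marker="x", label="Forecast")
ax.set_xlabel("Week_date")
ax.set_ylabel("Total_Sales_dollars")
ax.set_title("Actual vs Forecast")
ax.legend()
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
```

Calling `fig.savefig("actual_vs_forecast.png")` afterwards would write the chart to disk; the file name is just an example.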

As you can see in the plot above, we have a small holdout set covering several weeks of 2021 for the model to be tested on. The forecast does not match the actual values in the first weeks, follows them closely in the weeks after that, and deviates again towards the end of March. There are several possible reasons for the model not performing well. One is insufficient training data, since only one year of data was used to build the model. Another is that the features are not robust enough for the model to learn from.

Congrats, we have now seen how to build a time-series forecast with a supervised learning model. Do take note that this is a simple guide. Real-life examples will be more complicated, with many different features you will need to consider creating, such as alpha variables to capture holiday/seasonality trends or features derived from other drivers’ data that can help with the forecast.

Conclusion:

This article covered the basic characteristics of time-series data and how to frame time-series data as a forecasting problem to which a supervised machine learning model such as XGBoost can be applied.

Thank you for reading this article. I hope you find it useful and that it helps you at any step of your data science journey.

References & Links:

[1] https://www.investopedia.com/terms/t/timeseries.asp

[2] https://machinelearningmastery.com/xgboost-for-time-series-forecasting/

[3] https://machinelearningmastery.com/time-series-forecasting-supervised-learning/

[4] https://searchenterpriseai.techtarget.com/definition/supervised-learning

[5] https://towardsdatascience.com/everything-you-need-to-know-about-time-series-5fa1834d5b18

[6] https://console.cloud.google.com/bigquery(cameo:product/iowa-department-of-commerce/iowa-liquor-sales)
