Water availability prediction using historical data

Feature Engineering

The primary goal is to end up with 4 models, one for each waterbody type. We also have to account for the delay with which rainfall and temperature affect the other variables. We therefore merge the datasets by waterbody type and average them over different time scales: daily, weekly, monthly, and yearly.

We then track how the median absolute error (MAE) of the model at each time scale changes as the variables are shifted by days, weeks, or months. The final data for each waterbody type therefore holds the target values at the point where rainfall and temperature have taken effect.

For example, for aquifers we first take the mean of all rainfall, temperature, volume, hydrometry, and depth features in each aquifer dataset, then combine the datasets and take the daily, weekly, or monthly mean based on the Date column. This gives us the datasets aquifer_daily, aquifer_weekly, and aquifer_monthly.
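
The aggregation described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the original notebook's code: the two toy aquifer frames, their column names (Rainfall_A, Depth_A, and so on), and the helpers make_aquifer and summarise are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for two aquifer datasets (hypothetical columns and values).
dates = pd.date_range("2020-01-01", periods=120, freq="D")
rng = np.random.default_rng(0)

def make_aquifer(suffixes):
    """Build a toy aquifer frame with one rainfall and one depth column per suffix."""
    data = {}
    for s in suffixes:
        data[f"Rainfall_{s}"] = rng.gamma(2.0, 2.0, len(dates))
        data[f"Depth_{s}"] = -30 + rng.normal(0, 1, len(dates))
    return pd.DataFrame(data, index=dates)

def summarise(df):
    """Average all columns of one kind into a single Mean_* column."""
    return pd.DataFrame({
        "Mean_Rainfall": df.filter(like="Rainfall").mean(axis=1),
        "Mean_Depth": df.filter(like="Depth").mean(axis=1),
    })

aquifers = [summarise(make_aquifer("AB")), summarise(make_aquifer("CD"))]

# Combine the per-aquifer summaries and average the rows that share a date.
combined = pd.concat(aquifers)
aquifer_daily = combined.groupby(level=0).mean()

# Coarser views: mean over calendar weeks and calendar months.
aquifer_weekly = aquifer_daily.resample("W").mean()
aquifer_monthly = aquifer_daily.resample("MS").mean()
```

Resampling from the daily frame keeps the three datasets consistent with each other, since the weekly and monthly values are plain averages of the same daily rows.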

In each of these new datasets, we shift each feature by 1 to 31 days (daily), 1 to 52 weeks (weekly), or 1 to 12 months (monthly). For every shift, we fit a Random Forest Regressor to predict the shifted feature and note the shift at which the MAE is lowest, i.e. the delay after which the effect of rainfall and temperature shows most clearly.
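
One way such a shift search could look is sketched below. It is an illustration under stated assumptions rather than the original procedure: the data is synthetic (depth is made to respond to rainfall after 5 days), the helper error_for_shift is hypothetical, and the 80/20 chronological split, the dropna step, and the use of scikit-learn's median_absolute_error as the metric are choices made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import median_absolute_error

rng = np.random.default_rng(42)
n, true_lag = 400, 5

# Synthetic daily data: depth reacts to rainfall after `true_lag` days.
rain = pd.Series(rng.gamma(2.0, 2.0, n))
temp = pd.Series(15 + 10 * np.sin(np.arange(n) * 2 * np.pi / 365) + rng.normal(0, 1, n))
depth = 0.8 * rain.shift(true_lag) + rng.normal(0, 0.2, n)
df = pd.DataFrame({"Mean_Rainfall": rain, "Mean_Temperature": temp, "Mean_Depth": depth})

def error_for_shift(data, target, shift, features):
    """Fit a forest to predict `target` taken `shift` rows ahead; return the test error."""
    frame = data[features].copy()
    frame["y"] = data[target].shift(-shift)
    frame = frame.dropna()
    split = int(len(frame) * 0.8)  # chronological split, no shuffling
    train, test = frame.iloc[:split], frame.iloc[split:]
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(train[features], train["y"])
    return median_absolute_error(test["y"], model.predict(test[features]))

features = ["Mean_Rainfall", "Mean_Temperature"]
errors = {s: error_for_shift(df, "Mean_Depth", s, features) for s in range(1, 11)}

# The shift with the lowest error is taken as the delay of the effect.
best_shift = min(errors, key=errors.get)
```

On this toy data the error drops sharply at the shift that matches the built-in delay, which is the pattern the search looks for in the real datasets.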

A sample code snippet for shifting:

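import pandas as pd

# dataframe for shifted values of the volume feature
shifted_volume = pd.DataFrame()
for i in range(1, 32):
    # assumed loop body: one column per shift of i days (column names are illustrative)
    shifted_volume[f'Volume_shift_{i}'] = aquifer_daily['Mean_Volume'].shift(-i)
shifted_volume = shifted_volume.ffill()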

Having observed the best results at 26 days (aquifer_daily), 8 weeks (aquifer_weekly), and 2 months (aquifer_monthly), we built the final aquifers dataset by shifting the values backward by 56 days (8 weeks, or roughly 2 months).

# shift the daily aquifer values by the number of days rainfall and temperature take to show their effect
aquifer_daily['Actual_Depth'] = aquifer_daily['Mean_Depth'].shift(-56)
aquifer_daily['Actual_Volume'] = aquifer_daily['Mean_Volume'].shift(-56)
aquifer_daily = aquifer_daily.ffill()
# remove unrequired columns
# store date as a separate column
aquifer_daily['Date'] = aquifer_daily.index
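
To make the effect of a negative shift followed by ffill concrete, here is a small toy example. The six-row frame and the shift of 2 days are assumptions chosen so the result is easy to read; only the column names follow the snippet above.

```python
import pandas as pd

dates = pd.date_range("2021-01-01", periods=6, freq="D")
toy = pd.DataFrame({"Mean_Depth": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]}, index=dates)

# shift(-2) pulls the value observed two days later onto the current row
toy["Actual_Depth"] = toy["Mean_Depth"].shift(-2)

# the last two rows have no later value, so ffill repeats the last known one
toy = toy.ffill()

# keep the date as a regular column as well
toy["Date"] = toy.index
print(toy)
```

Each row now pairs the conditions on a given day with the depth measured two days later. The trailing rows are padded with the last available value, so with a 56-day shift the final 56 rows of the real dataset carry a repeated target.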

Likewise, we repeat the analysis for all waterbody types and end up with 4 feature-engineered datasets that reflect the delayed effect of rainfall and temperature.

