This is a study focused on the applicability of Time Series Modeling with Prophet to the available Covid19 data from Johns Hopkins University.
Let’s take a sneak peek at the data before proceeding:
# Confirmed Cases Data : Topview
There seems to be some relationship between the number of Cases and their Geographical Locations, and the Confirmed and Death Cases appear to have a high degree of Positive Correlation.
The Pairwise Plots below confirm these postulates, along with the fact that the number of Recovered Cases has no definite relationship with Geographical Location. Also, as the Confirmed and Death Cases increase, the number of Recovered Cases increases as well, which makes sense, right! Very interesting! What do you think?
Alright, let’s actually get started with the Time Series Analysis now. In this study, the Confirmed, Recovered, and Death Cases are forecasted separately, with Model Uncertainty and Seasonal Variations considered in each Model. To keep it simple, let’s walk through the Recovered Cases Forecasting Model; if you are interested in the other Models, please check the Reference at the bottom.
Additive & Multiplicative Prophet Models
The Additive Prophet Model, used here for building the Model, is y(t) = g(t) + s(t) + h(t) + e(t), where:
- g(t) represents the trend
- s(t) represents the periodic (seasonal) component
- h(t) represents holiday-related events, or perhaps social distancing in this case
- e(t) is the error term
The Multiplicative Prophet Model is y(t) = g(t) * s(t) * h(t) * e(t).
For the purpose of understanding the data, the Models were run in both Additive and Multiplicative form. Multiplicative Models are usually used when the magnitude of the seasonal pattern depends on the magnitude of the data, whereas Additive Models are usually used when the magnitude of seasonality does not change with the level of the series. In this case, we are still in the initial stage of data collection and have yet to see the actual seasonal behavior of the data. Clearly, we can’t conclude anything with certainty, due to systematic uncertainty and the lack of a whole year’s worth of data, so it is better practice to go with a Model that won’t overfit.
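The difference between the two forms can be sketched with a toy series (the numbers below are illustrative, not the Covid19 data): an additive seasonal swing keeps a constant size as the trend rises, while a multiplicative swing scales with the level of the series.

```python
import math

def trend(t):
    # Simple linear trend standing in for g(t)
    return 100 + 2 * t

def seasonal(t, period=7):
    # Weekly periodic component standing in for s(t)
    return math.sin(2 * math.pi * t / period)

# Additive: the seasonal swing is always about +/-10, whatever the level.
additive = [trend(t) + 10 * seasonal(t) for t in range(28)]

# Multiplicative: the swing is 10% of the level, so it grows with the trend
# (about +/-10 at t=0, about +/-15 once the trend reaches ~150).
multiplicative = [trend(t) * (1 + 0.1 * seasonal(t)) for t in range(28)]
```

If the seasonal peaks in a plot of your data get visibly taller as the series grows, that is the hint to try the multiplicative form.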
To keep this short and to the point, let’s look at the Forecasting Model for the Recovered Cases only. The Forecasting Models for Confirmed and Death Cases are included in my Kaggle Notebook, as mentioned in the reference below.
Seasonal Components in Forecasting of Recovered Cases
Looking at the Seasonal Variation in the Recovered Cases Forecast, let’s break it down to visualize the number of Recovered Cases reported on a Daily Basis: over the Day of the Week, over the Day of the Year, and over the Timestamps within each day across the whole year.
Model Uncertainty in Recovered Cases
The next question that should arise is: alright, we have Seasonal Variation considered, but where is the Model Uncertainty? Since the data is dynamic, while building the Models I set the Uncertainty Interval to 95%, being deliberately conservative with the Model Fitting on the assumption that there are other unknown parameters I did not consider. In other words, I used a 95% Uncertainty Interval while fitting a sample of data from the last six months into the Model. By default, Prophet uses an Uncertainty Interval of 80%.
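Mechanically, widening the interval from Prophet’s default 80% to 95% just means keeping more of the simulated forecast distribution. A minimal, Prophet-free sketch with made-up sample draws:

```python
import random

random.seed(0)

# 1000 simulated draws of one day's forecasted value, standing in for
# Prophet's simulated forecast samples (toy numbers, not the Covid19 data).
samples = [500 + random.gauss(0, 25) for _ in range(1000)]

def interval(samples, width):
    """Central interval covering `width` of the sample distribution."""
    s = sorted(samples)
    lo = s[int(len(s) * (1 - width) / 2)]
    hi = s[int(len(s) * (1 + width) / 2) - 1]
    return lo, hi

lo80, hi80 = interval(samples, 0.80)  # Prophet's default interval_width
lo95, hi95 = interval(samples, 0.95)  # the more conservative choice used here
```

The 95% band is necessarily wider than the 80% band: the cost of being conservative is a less precise forecast.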
# Uncertainty in Trend
forecast_covid19recoveredcases_trenduncertainty = Prophet(interval_width=0.95).fit(covid19recoveredcases).predict(forecast_covid19recoveredcases)
# Uncertainty in Seasonality: Considering Samples from 6 Months of Data
mr = Prophet(mcmc_samples=300)
forecast_covid19recoveredcases_seasonaluncertainty = mr.fit(covid19recoveredcases).predict(forecast_covid19recoveredcases)
Now, if we compare the trend over the Timestamp for the Models with and without Uncertainty, we can see a variation in the reported number of Recovered Cases around week 40: approximately a 5.3% variation between the Trend lines of the two Models for the month of October 2020. In the weekly Trend, Wednesday seems to have the highest Variation in the Recovered Cases Model with Uncertainty, whereas the successive peaks of Variation fall on Tuesday, Thursday, and Saturday for the weekly Trend of the Recovered Cases Model without Uncertainty but with Seasonal Variation.
Fourier Order In Seasonality For Recovered Cases
Considering the rapid and arbitrary changes in the data, the Fourier series that Prophet uses for seasonality can help approximate such variation. Such arbitrary variation might be due to other sources of variance, related to data collection, data reporting, and other unknown factors and errors, apart from seasonality. To account for the higher frequency of variation, the “yearly_seasonality” parameter is set to 40 here; Prophet’s default value is 10, but it can be changed depending on the variation in the data. Increasing “yearly_seasonality” allows the model to fit faster-changing cycles in the data, but can also lead to overfitting.
from fbprophet.plot import plot_yearly
mr = Prophet(yearly_seasonality=40).fit(covid19recoveredcases)
a = plot_yearly(mr)
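What the Fourier Order actually controls can be illustrated without Prophet: the seasonal component is a partial Fourier sum, and a higher order tracks faster variation within the cycle, at the cost of more parameters. A toy sketch (the spiky yearly signal below is made up, not the Covid19 series):

```python
import math

N, period = 365, 365.0
t = [float(i) for i in range(N)]
# A sharp, spiky yearly pattern with fast variation within the cycle
y = [math.exp(3 * math.sin(2 * math.pi * ti / period)) for ti in t]
mean_y = sum(y) / N

def fit_error(order):
    """RMS error of a partial Fourier sum up to `order` harmonics.

    On an evenly spaced grid the sin/cos harmonics are orthogonal,
    so each coefficient is a direct projection onto the data."""
    fitted = [mean_y] * N
    for k in range(1, order + 1):
        s = [math.sin(2 * math.pi * k * ti / period) for ti in t]
        c = [math.cos(2 * math.pi * k * ti / period) for ti in t]
        a = 2.0 / N * sum(yi * si for yi, si in zip(y, s))
        b = 2.0 / N * sum(yi * ci for yi, ci in zip(y, c))
        fitted = [f + a * si + b * ci for f, si, ci in zip(fitted, s, c)]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    return math.sqrt(sum(r * r for r in resid) / N)

# A low order misses the sharp peak; a higher order tracks it closely.
err_low, err_high = fit_error(3), fit_error(10)
```

The same mechanism explains the overfitting risk: with enough harmonics the sum can chase noise just as readily as real seasonality.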
Weekly & Monthly Seasonality with Fourier Series
mr = Prophet(weekly_seasonality=True)
mr.add_seasonality(name='monthly', period=30.5, fourier_order=6)
forecast = mr.fit(covid19recoveredcases).predict(forecast_covid19recoveredcases_trenduncertainty)
Now that we have our Model fit, let’s take a look at the Model with Fourier-based monthly and weekly trends, along with weekly seasonality changes. Using a higher Fourier Order might help the model capture fast variation in the data, but might also cause Overfitting. A better approach is probably to start with Prophet’s default Fourier Orders and then adjust them after testing with Model Cross Validation.
Model Cross Validation
Cross Validation automates the measurement of Model Performance. The first parameter given is the trained model (not the data). The next parameter is the prediction horizon, how far into the future we want to predict (in this case ’30 days’). Then we give an initial (how long to train before starting the tests) and a period (how frequently to stop and make a prediction). If we don’t provide these parameter values, Prophet assigns defaults of initial = 3 * horizon, with cutoffs every half a horizon.
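The schedule this produces can be sketched as follows. This is a simplified, forward-stepping sketch (Prophet itself places cutoffs working backwards from the end of the data), and the dates are illustrative, chosen to mimic six months of data with the parameters used in this study.

```python
from datetime import date, timedelta

def cv_cutoffs(start, end, initial, period, horizon):
    """Rolling-origin cutoffs: train on [start, cutoff], then test the
    forecast on (cutoff, cutoff + horizon]; cutoffs spaced `period` apart."""
    cutoffs = []
    cutoff = start + initial
    while cutoff + horizon <= end:
        cutoffs.append(cutoff)
        cutoff += period
    return cutoffs

# Six months of data with initial='60 days', period='7 days', horizon='30 days'
cutoffs = cv_cutoffs(date(2020, 1, 22), date(2020, 7, 22),
                     timedelta(days=60), timedelta(days=7),
                     timedelta(days=30))
```

Each cutoff yields one train/test split, so the forecast is evaluated many times at every horizon length rather than once.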
# Recovered Cases : Cross Validation
from fbprophet.diagnostics import cross_validation
dfrecovered = cross_validation(mr, initial='60 days', period='7 days', horizon='30 days')
The “performance_metrics” utility can be used to compute some useful statistics of the prediction performance (yhat, yhat_lower, and yhat_upper compared to y) as a function of the distance from the cutoff (how far into the future the prediction was). The statistics computed are mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), median absolute percentage error (MDAPE), mean absolute percentage error (MAPE), and coverage of the yhat_lower and yhat_upper estimates. These are computed on a rolling window of the predictions after sorting by horizon (ds minus cutoff).
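These statistics are straightforward to compute by hand. Here is a toy sketch with made-up y and yhat values (not from the actual Models) showing how each one is derived:

```python
import math
import statistics

# Toy actuals and forecasts standing in for one horizon's rolling window
y          = [100.0, 110.0, 120.0, 130.0]
yhat       = [ 98.0, 113.0, 117.0, 136.0]
yhat_lower = [ 90.0, 105.0, 110.0, 125.0]
yhat_upper = [106.0, 121.0, 124.0, 147.0]

errors = [yh - yt for yh, yt in zip(yhat, y)]
mse    = statistics.mean(e ** 2 for e in errors)        # mean squared error
rmse   = math.sqrt(mse)                                  # its square root
mae    = statistics.mean(abs(e) for e in errors)         # mean absolute error
pct    = [abs(e) / yt for e, yt in zip(errors, y)]       # absolute % errors
mape   = statistics.mean(pct)                            # mean of the % errors
mdape  = statistics.median(pct)                          # median of the % errors
coverage = statistics.mean(                              # share of actuals
    1.0 if lo <= yt <= hi else 0.0                       # inside the interval
    for yt, lo, hi in zip(y, yhat_lower, yhat_upper))
```

MDAPE is more robust to a single badly missed day than MAPE, while RMSE punishes large misses hardest; that is why the article reports more than one of them.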
# Recovered Cases : Performance Evaluation
from fbprophet.diagnostics import performance_metrics
from fbprophet.plot import plot_cross_validation_metric
df_precovered = performance_metrics(dfrecovered, rolling_window=0.1)
Error Visualization : Recovered, Death & Confirmed Cases Forecasting Model
The Cross Validation Performance Metrics can be visualized with “plot_cross_validation_metric”, shown here for RMSE and MDAPE. The blue line shows the MDAPE, where the mean is taken over a rolling window of the dots. Looking at the RMSE, the error is almost zero for the first few days and then seems to increase gradually with the number of days. The MDAPE measures the deviation of the median from the actual data in percentage terms, while the RMSE penalizes big errors more aggressively.
The blue line shows the MAPE, again a rolling-window mean over the dots. In this model, too, the error is almost zero for the first few days and then seems to increase gradually with the number of days, considering the RMSE. The MAPE measures the deviation from the actual data in percentage terms, which for this Model is approximately 4%, with an RMSE of around 50.
Here the MAPE shows an error of approximately 0.04% that increases with the number of days, and the RMSE shows an error of a little over 10%. In other words, as the number of days in the Model increases, the Forecasting Error increases as well.
This study helps in understanding and evaluating Time Series Forecasting Model Performance, along with Cross Validating the Models, using Prophet. The rate of error in a Model changes with the number of days over which the Forecasting is done: clearly, widening the window of days increases the model error, and vice versa. The bigger picture of uncertainty in the Forecasting Models is only nominally addressed, in terms of Forecasting Uncertainty. Other uncertainties, such as those related to Community Spread in each Country/Location, Pandemic Uncertainty related to the Covid19 virus, human-to-human interaction, and so on, were not considered in this study; it focuses mainly on the application of Time Series Data Analysis with Prophet. As future scope, this study can also help build Covid19 Forecasting Models and evaluate their Performance for specific countries of interest. The data comprises the Cases reported across the whole world, from Johns Hopkins University. Some parts of the time series data are archived and are not used in this study. As the data updates, my Kaggle Notebook will also be updated. For more details on the data sources, please refer to the GitHub repository and my Kaggle Notebook.
Feel free to reach out if you have any questions!