Time series forecasting is a critical aspect of predictive analytics, focusing on predicting future values based on previously observed values in time-ordered data. This technique is widely used across various domains such as finance, economics, supply chain management, healthcare, and meteorology. The primary objective is to analyze historical data to identify patterns, trends, and seasonal variations that can inform future projections.
In this tutorial, we will explore how to leverage time series forecasting to analyze air passenger data.
You can watch the video-based tutorial with a step-by-step explanation down below.
Import Modules
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
pandas - provides data structures and data analysis tools for Python.
numpy - used for numerical computing in Python.
statsmodels.api - provides classes and functions for statistical estimation and hypothesis testing.
seasonal_decompose - used for time series analysis to decompose time series data into trend, seasonal, and residual components.
pyplot - used for creating static, animated, and interactive visualizations in Python.
seaborn - built on top of matplotlib and provides a high-level interface for drawing attractive and informative statistical graphics.
warnings - provides a way to handle warnings that may arise during program execution.
warnings.filterwarnings('ignore') - Sets up a filter to ignore all warnings generated by Python's warning subsystem during execution. This is often used to suppress non-critical warnings that might otherwise clutter the output.
Import Data
Next we will read the data from the CSV file.
df = pd.read_csv('AirPassengers.csv')
df.set_index('Month', inplace=True)
df.head()
Loading Data: The CSV file 'AirPassengers.csv' is read into a DataFrame. This file likely contains data about the number of air passengers over time, with one column for the month and another for the number of passengers.
Setting the Index: The 'Month' column is set as the index of the DataFrame, which means that each row will now be labeled by its corresponding month. This is useful for time series analysis, where the index often represents time points.
Viewing the Data: The head() method is called to display the first few rows of the DataFrame, allowing you to quickly inspect the data to ensure it has been loaded correctly and to get a sense of its structure.
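As an optional variant of the loading step, you can ask pandas to parse the 'Month' column as datetimes up front, which gives you a proper DatetimeIndex for later time series operations. This is a sketch with a small inline CSV standing in for AirPassengers.csv (the column names are assumed to match the dataset); the rest of this tutorial keeps the string index as loaded above.

```python
import io
import pandas as pd

# A small inline CSV stands in for AirPassengers.csv so the sketch is self-contained
csv = io.StringIO("Month,#Passengers\n1949-01,112\n1949-02,118\n1949-03,132\n")

# parse_dates converts 'Month' to datetimes while index_col sets it as the index
df = pd.read_csv(csv, parse_dates=['Month'], index_col='Month')
print(df.index.dtype)  # datetime64[ns]
```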
Exploratory Data Analysis
First we will create a plot to visualize the number of passengers over time.
plt.figure(figsize=(20, 5))
plt.plot(df.index, df['#Passengers'], label='#Passengers')
plt.xlabel('Date')
plt.ylabel('#Passengers')
plt.title('#Passengers Over Time')
plt.legend()
plt.grid(True)
plt.xticks(rotation=90)
plt.show()
plt.figure(figsize=(20, 5)): Creates a new figure for plotting with a size of 20 inches in width and 5 inches in height. The figsize parameter allows you to specify the dimensions of the plot.
plt.plot(df.index, df['#Passengers'], label='#Passengers'): Plots the data from the DataFrame df. The x-axis values are taken from the DataFrame's index (which contains the dates), and the y-axis values are taken from the '#Passengers' column. The label parameter sets the label for the plotted line, which will be used in the legend.
plt.xlabel('Date'): Sets the label for the x-axis to 'Date'.
plt.ylabel('#Passengers'): Sets the label for the y-axis to '#Passengers'.
plt.title('#Passengers Over Time'): Sets the title of the plot to '#Passengers Over Time'.
plt.legend(): Displays the legend on the plot. The legend will show the label specified in the plt.plot function.
plt.grid(True): Adds a grid to the plot, making it easier to see the values and trends.
plt.xticks(rotation=90): Rotates the x-axis tick labels by 90 degrees. This is often useful for date labels to make them more readable.
plt.show(): Displays the plot.
The above code snippet creates a clear and informative visualization of how the number of passengers changes over time.
Next we will perform seasonal decomposition.
# perform seasonal decomposition
result = seasonal_decompose(df['#Passengers'], model='multiplicative', period=12)
seasonal_decompose: This function from the statsmodels.tsa.seasonal module decomposes the time series into three components: trend, seasonal, and residual (or noise). It is used to better understand the underlying patterns in the data.
df['#Passengers']: This specifies the time series data to be decomposed. Here, it refers to the '#Passengers' column from the DataFrame df.
model='multiplicative': This parameter specifies the type of seasonal decomposition model to use. The 'multiplicative' model assumes that the seasonal variations and the residuals multiply the trend component. This is appropriate when the seasonal fluctuations increase or decrease proportionally with the trend level. There is also an 'additive' model, which assumes that the components add together linearly.
period=12: This parameter specifies the period of the seasonal component. Since the data is likely monthly and the seasonality is expected to be annual, a period of 12 is used, indicating that the seasonal pattern repeats every 12 months.
By performing this decomposition, you obtain a result object containing the three components of the time series:
Trend: The long-term progression of the series (increasing or decreasing).
Seasonal: The repeating short-term cycle within the series.
Residual: The remaining part of the series after removing the trend and seasonal components, representing irregular fluctuations or noise.
This decomposition helps in understanding and analyzing the structure of the time series data, making it easier to identify patterns, detect anomalies, and make more accurate forecasts.
Next we will plot the components in the graph.
# plot the components in the graph
sns.set(style='whitegrid')
plt.figure(figsize=(18,12))
# trend component
plt.subplot(411)
sns.lineplot(data=result.trend)
plt.title('Trend')
plt.xticks(rotation=90)
# seasonal component
plt.subplot(412)
sns.lineplot(data=result.seasonal)
plt.title('Seasonal')
plt.xticks(rotation=90)
# Residuals component
plt.subplot(413)
sns.lineplot(data=result.resid)
plt.title('Residuals')
plt.xticks(rotation=90)
# Original data
plt.subplot(414)
sns.lineplot(data=df['#Passengers'])
plt.title('Original Data')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
Setting the Seaborn Style: This sets the style of the seaborn plots to 'whitegrid', which adds a white background with gridlines. This style enhances the readability of the plots.
Creating the Figure: This creates a new figure for plotting with a size of 18 inches in width and 12 inches in height.
Trend Component:
plt.subplot(411): Creates the first subplot in a 4x1 grid (4 rows, 1 column).
sns.lineplot(data=result.trend): Plots the trend component of the decomposition.
plt.title('Trend'): Sets the title of the subplot to 'Trend'.
plt.xticks(rotation=90): Rotates the x-axis tick labels by 90 degrees for better readability.
Seasonal Component:
plt.subplot(412): Creates the second subplot in the 4x1 grid.
sns.lineplot(data=result.seasonal): Plots the seasonal component of the decomposition.
plt.title('Seasonal'): Sets the title of the subplot to 'Seasonal'.
plt.xticks(rotation=90): Rotates the x-axis tick labels by 90 degrees for better readability.
Residual Component:
plt.subplot(413): Creates the third subplot in the 4x1 grid.
sns.lineplot(data=result.resid): Plots the residual component of the decomposition.
plt.title('Residuals'): Sets the title of the subplot to 'Residuals'.
plt.xticks(rotation=90): Rotates the x-axis tick labels by 90 degrees for better readability.
Original Data:
plt.subplot(414): Creates the fourth subplot in the 4x1 grid.
sns.lineplot(data=df['#Passengers']): Plots the original '#Passengers' data.
plt.title('Original Data'): Sets the title of the subplot to 'Original Data'.
plt.xticks(rotation=90): Rotates the x-axis tick labels by 90 degrees for better readability.
Adjusting Layout and Displaying the Plot:
plt.tight_layout(): Adjusts the spacing between the subplots to prevent overlap and ensure a clean layout.
plt.show(): Displays the plot.
By executing the above code snippet, you generate a comprehensive visualization that shows the trend, seasonal, and residual components of the time series along with the original data. This allows for a detailed examination of the different elements that make up the time series, providing valuable insights into its underlying patterns and behaviors.
Next we will perform the Augmented Dickey-Fuller (ADF) test on the time series data to check for stationarity.
from statsmodels.tsa.stattools import adfuller # Augmented Dickey-Fuller Test
result = adfuller(df['#Passengers'], autolag='AIC') # Akaike Information Criterion
print('ADF Statistic:', result[0])
print('p-value:', result[1])
First import the adfuller function from the statsmodels.tsa.stattools module. The ADF test is a statistical test used to determine if a time series is stationary.
adfuller(df['#Passengers'], autolag='AIC'): This applies the ADF test to the '#Passengers' column of the DataFrame df.
df['#Passengers']: The time series data to be tested.
autolag='AIC': This parameter specifies that the lag length should be selected based on the Akaike Information Criterion (AIC), which is a measure used for model selection. The AIC helps in choosing the optimal number of lags for the test.
By executing the above code, you perform the ADF test on the '#Passengers' time series data and print the test statistic and p-value. Here's a summary of the output:
ADF Statistic: This value indicates the test statistic for the ADF test. The more negative this value, the stronger the evidence against the null hypothesis (i.e., the series is stationary).
p-value: This value helps determine the significance of the test result. A p-value less than the chosen significance level (commonly 0.05) suggests rejecting the null hypothesis and concluding that the time series is stationary.
Next we will perform first-order differencing on the time series data and then apply the Augmented Dickey-Fuller (ADF) test to check for stationarity on the differenced data.
# first order differencing
result = adfuller(df['#Passengers'].diff().dropna(), autolag='AIC')
print('ADF Statistic:', result[0])
print('p-value:', result[1])
df['#Passengers'].diff(): This calculates the first-order differences of the '#Passengers' time series data. Differencing is a common method to remove trends and make a non-stationary time series stationary.
The diff() function computes the difference between each element and its predecessor, effectively transforming the data to reflect the change from one time point to the next.
dropna(): This removes any NaN values that result from the differencing operation. The first element in the differenced series is NaN because there is no previous element to subtract from.
autolag='AIC': The ADF test is applied to the differenced data with the autolag parameter set to 'AIC' to select the lag length based on the Akaike Information Criterion.
By executing this code, you first transform the original time series to its first-order differences and then perform the ADF test on the transformed data to check for stationarity. Here's what you are doing in each step:
Differencing: This operation removes trends and helps stabilize the mean of a time series, making it stationary.
ADF Test: By applying the ADF test to the differenced data, you check if the transformed series is stationary.
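The trend-removal effect of first differencing is easy to see on a toy example: differencing a series with a linear trend yields a constant series.

```python
import pandas as pd

# A series with a perfect linear trend (slope 3)
s = pd.Series([10, 13, 16, 19, 22])

# First differences remove the trend, leaving the constant slope
print(s.diff().dropna().tolist())  # [3.0, 3.0, 3.0, 3.0]
```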
Next we will perform second-order differencing on the time series data and then apply the Augmented Dickey-Fuller (ADF) test to check for stationarity on the twice-differenced data.
# second order differencing
result = adfuller(df['#Passengers'].diff().diff().dropna(), autolag='AIC')
print('ADF Statistic:', result[0])
print('p-value:', result[1])
df['#Passengers'].diff().diff(): This calculates the second-order differences of the '#Passengers' time series data. The first diff() computes the difference between each element and its predecessor, and the second diff() applies the differencing operation again to the first differenced data.
The first diff() creates a series of first-order differences.
The second diff() creates a series of second-order differences, which can help to remove more complex trends and seasonal effects.
dropna(): This removes any NaN values that result from the differencing operations. The first and second elements in the twice-differenced series will be NaN because there are no previous elements to subtract from for the second differencing.
autolag='AIC': The ADF test is applied to the twice-differenced data with the autolag parameter set to 'AIC' to select the lag length based on the Akaike Information Criterion.
By executing this code, you transform the original time series to its second-order differences and then perform the ADF test on the transformed data to check for stationarity. Here's what you are doing in each step:
Second-Order Differencing: This operation removes more complex trends and seasonal effects, further stabilizing the mean of the time series, making it more likely to be stationary.
ADF Test: By applying the ADF test to the twice-differenced data, you check if the transformed series is stationary.
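Analogously to the first-order case, second differencing removes a quadratic trend, as this toy example shows:

```python
import pandas as pd

# A series with a perfect quadratic trend (t^2)
s = pd.Series([1, 4, 9, 16, 25])

# Second differences turn the quadratic trend into a constant series
print(s.diff().diff().dropna().tolist())  # [2.0, 2.0, 2.0]
```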
Next we will visualize the original time series, its first-order differences, and its second-order differences.
# plot the differencing values
fig, (ax1, ax2, ax3) = plt.subplots(3)
ax1.plot(df)
ax1.set_title('Original Time Series')
ax1.axes.xaxis.set_visible(False)
ax2.plot(df.diff())
ax2.set_title('1st Order Differencing')
ax2.axes.xaxis.set_visible(False)
ax3.plot(df.diff().diff())
ax3.set_title('2nd Order Differencing')
ax3.axes.xaxis.set_visible(False)
plt.show()
plt.subplots(3): Creates a figure with three subplots arranged vertically (3 rows, 1 column). The ax1, ax2, and ax3 variables refer to the axes of the three subplots.
ax1.plot(df): Plots the original time series data on the first subplot.
ax1.set_title('Original Time Series'): Sets the title of the first subplot to 'Original Time Series'.
ax1.axes.xaxis.set_visible(False): Hides the x-axis labels for the first subplot to make the plot cleaner.
ax2.plot(df.diff()): Plots the first-order differences of the time series data on the second subplot.
ax2.set_title('1st Order Differencing'): Sets the title of the second subplot to '1st Order Differencing'.
ax2.axes.xaxis.set_visible(False): Hides the x-axis labels for the second subplot to make the plot cleaner.
ax3.plot(df.diff().diff()): Plots the second-order differences of the time series data on the third subplot.
ax3.set_title('2nd Order Differencing'): Sets the title of the third subplot to '2nd Order Differencing'.
ax3.axes.xaxis.set_visible(False): Hides the x-axis labels for the third subplot to make the plot cleaner.
This visualization helps to see how differencing transforms the original time series, making it more stationary and suitable for further analysis and modeling.
Next we will create two subplots to visualize the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) for the first-order differenced time series data.
fig, ax = plt.subplots(2, 1, figsize=(12, 7))
sm.graphics.tsa.plot_acf(df.diff().dropna(), lags=40, ax=ax[0])
sm.graphics.tsa.plot_pacf(df.diff().dropna(), lags=40, ax=ax[1])
plt.show()
plt.subplots(2, 1): Creates a figure with two subplots arranged vertically (2 rows, 1 column). The fig variable refers to the entire figure, and ax is an array of axes objects, one for each subplot.
figsize=(12, 7): Sets the size of the figure to 12 inches in width and 7 inches in height.
sm.graphics.tsa.plot_acf(...): This function from statsmodels plots the ACF of the time series data.
df.diff().dropna(): The first-order differenced data with NaN values removed.
lags=40: The number of lags to include in the plot, which in this case is 40.
ax=ax[0]: Specifies that the ACF plot should be drawn on the first subplot.
sm.graphics.tsa.plot_pacf(...): This function from statsmodels plots the PACF of the time series data.
df.diff().dropna(): The first-order differenced data with NaN values removed.
lags=40: The number of lags to include in the plot, which in this case is 40.
ax=ax[1]: Specifies that the PACF plot should be drawn on the second subplot.
By executing this code, you generate two plots that help in understanding the autocorrelation and partial autocorrelation structures of the first-order differenced time series data. Here's what each plot represents:
Autocorrelation Function (ACF): The ACF plot shows the correlation of the time series with its own lagged values. It helps to identify the presence of any remaining patterns in the differenced data.
Partial Autocorrelation Function (PACF): The PACF plot shows the correlation of the time series with its own lagged values, controlling for the values of the time series at all shorter lags. It helps to identify the order of an autoregressive model by showing the direct relationship between the series and its lags.
These plots are useful for identifying the appropriate parameters for ARIMA models (AutoRegressive Integrated Moving Average) and for understanding the nature of the time series data after differencing.
Model Training
Next we will define and fit a Seasonal ARIMA (SARIMA) model to the '#Passengers' time series data using the SARIMAX class from the statsmodels library.
p = 2 # autoregressive order, suggested by the PACF plot
d = 1 # 1st order differencing
q = 1 # moving average order, suggested by the ACF plot
P = 1
D = 0
Q = 3
seasonal_period = 12 # monthly data with an annual seasonal cycle
# define the sarima model
from statsmodels.tsa.statespace.sarimax import SARIMAX
model = SARIMAX(df['#Passengers'], order=(p, d, q), seasonal_order=(P, D, Q, seasonal_period))
fitted_model = model.fit()
print(fitted_model.summary())
Import the SARIMAX class from the statsmodels.tsa.statespace.sarimax module. The SARIMAX class is used for fitting SARIMA models, which are ARIMA models that include seasonal components.
SARIMAX(df['#Passengers'], order=(p, d, q), seasonal_order=(P, D, Q, seasonal_period)): This creates an instance of the SARIMAX model for the '#Passengers' time series data.
df['#Passengers']: The time series data to be modeled.
order=(p, d, q): The non-seasonal part of the model, where:
p: The number of lag observations included in the model (autoregressive order).
d: The number of times that the raw observations are differenced (differencing order).
q: The size of the moving average window (moving average order).
seasonal_order=(P, D, Q, seasonal_period): The seasonal part of the model, where:
P: The number of lag observations in the seasonal part of the model (seasonal autoregressive order).
D: The number of times that the raw observations are differenced in the seasonal part of the model (seasonal differencing order).
Q: The size of the moving average window in the seasonal part of the model (seasonal moving average order).
seasonal_period: The number of observations per seasonal cycle.
fitted_model = model.fit(): This fits the SARIMA model to the time series data using maximum likelihood estimation. The fit() method returns the fitted model object.
fitted_model.summary(): This generates a summary of the fitted model, which includes information such as the estimated parameters, their standard errors, and various statistics that assess the model fit.
Forecasting
Next we will forecast the next 2 years (24 months) using the fitted SARIMA model and create a date range for the forecasted values.
# forecast for next 2 years
forecast_steps = 24
forecast = fitted_model.get_forecast(steps=forecast_steps)
# create the date range for the forecasted values
forecast_index = pd.date_range(start=df.index[-1], periods=forecast_steps+1, freq='M')[1:].strftime('%Y-%m') # remove start date
forecast_steps = 24: Specifies the number of steps to forecast. In this case, 24 steps correspond to 24 months, or 2 years.
forecast = fitted_model.get_forecast(steps=forecast_steps): Generates the forecast for the next 24 months using the fitted SARIMA model.
fitted_model: This is the SARIMA model that was previously fitted to the '#Passengers' time series data.
get_forecast(steps=forecast_steps): This method produces the forecasted values for the specified number of steps (24 months).
pd.date_range(start=df.index[-1], periods=forecast_steps+1, freq='M'): Creates a date range starting from the last date in the original time series (df.index[-1]) and spans the next 24 months.
start=df.index[-1]: The starting point for the date range is the last date in the original time series data.
periods=forecast_steps+1: Specifies the number of periods to generate. The +1 is used to ensure that the resulting date range covers the entire forecast period, including the start date, which is then removed.
freq='M': Indicates that the frequency of the periods is monthly.
Remove Start Date:
[1:]: Slices the date range to exclude the start date. This is necessary because the date_range function includes the start date, and we want the forecast to start from the month after the last date in the original data.
Format Dates:
strftime('%Y-%m'): Formats the dates in the resulting date range to the 'YYYY-MM' format, which is a common representation for monthly data.
This process results in a set of forecasted values and corresponding dates, which can be used for further analysis or visualization.
Next we will plot the forecast values along with the historical data and confidence intervals. To do this, we first build a forecast DataFrame that includes the forecasted mean and the confidence intervals.
# build the forecast dataframe with mean and confidence intervals
forecast_mean = forecast.predicted_mean
conf_int = forecast.conf_int()
forecast_df = pd.DataFrame({'Forecast': forecast_mean.values, 'Lower CI': conf_int.iloc[:, 0].values, 'Upper CI': conf_int.iloc[:, 1].values}, index=forecast_index)
# plot the forecast values
plt.figure(figsize=(12, 6))
plt.plot(df['#Passengers'], label='Historical Data')
plt.plot(forecast_df['Forecast'], label='Forecast Data')
plt.fill_between(forecast_df.index, forecast_df['Lower CI'], forecast_df['Upper CI'], color='k', alpha=0.1)
plt.xlabel('Date')
plt.ylabel('Number of Passengers')
plt.title('Air Passengers Forecast')
plt.xticks(rotation=90)
plt.legend()
plt.show()
Extract the forecasted mean, and the lower and upper confidence intervals from the forecast object.
Combine the forecasted mean and confidence intervals into a DataFrame.
Use matplotlib to plot the historical data, forecasted data, and the confidence intervals.
This code generates a plot showing:
The historical data (df['#Passengers']).
The forecasted data (forecast_df['Forecast']).
The confidence intervals (forecast_df['Lower CI'] and forecast_df['Upper CI']), shaded with a semi-transparent fill to indicate uncertainty in the forecast.
Final Thoughts
Time series forecasting is a crucial task in many fields, including finance, economics, and operations management. The Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors (SARIMAX) model is a powerful tool for this purpose, particularly when dealing with data that exhibits both trend and seasonality.
SARIMAX models are highly flexible and can be tuned to fit various patterns in the data. The combination of autoregressive (AR), differencing (I), and moving average (MA) components, along with seasonal terms, allows for a wide range of model configurations.
SARIMAX is an extension of ARIMA that incorporates seasonal components. This makes it particularly effective for time series data with periodic fluctuations, such as monthly sales data or yearly climate measurements.
The ability to include exogenous variables (additional predictors) allows SARIMAX to model the impact of external factors on the time series, potentially improving the forecast accuracy.
That said, it's important to remember that forecasting is inherently uncertain. Combining statistical models like SARIMAX with domain knowledge and continuously monitoring model performance will lead to the most effective and reliable forecasting outcomes.
Get the project notebook from here
Thanks for reading the article!!!
Check out more project videos from the YouTube channel Hackers Realm