Overview
SARIMAX is a powerful and flexible extension of the ARIMA model. It stands for Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors. This model is designed to handle time series data that exhibits both non-seasonal and seasonal patterns, and it can also incorporate the influence of external (exogenous) variables.
Architecture & Components
A SARIMAX model is defined by two sets of parameters: non-seasonal (p, d, q) and seasonal (P, D, Q, s).
- Non-Seasonal Orders (p, d, q): These are the same as in the standard ARIMA model:
  - p: Order of the non-seasonal Autoregressive (AR) part.
  - d: Degree of non-seasonal Differencing (I) to make the series stationary.
  - q: Order of the non-seasonal Moving Average (MA) part.
- Seasonal Orders (P, D, Q, s): These parameters mirror the non-seasonal components but apply to the seasonal part of the series:
  - P: Order of the seasonal Autoregressive (AR) part.
  - D: Degree of seasonal Differencing (I).
  - Q: Order of the seasonal Moving Average (MA) part.
  - s: The number of time steps in a single seasonal period (e.g., 12 for monthly data, 4 for quarterly data, 24 for hourly data with daily seasonality).
- Exogenous Regressors (X): The 'X' in SARIMAX indicates the inclusion of external variables that are not part of the time series itself but are believed to influence it. These can be other time series (e.g., economic indicators, weather data) that are known for both historical and future periods.
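To make the differencing orders concrete, here is a minimal sketch of what d = 1 and D = 1 (with s = 12) do to a series. The noise-free series and the variable names are assumptions chosen for illustration; they are not part of any library API.

```python
import numpy as np
import pandas as pd

# Noise-free monthly series: linear trend plus a 12-month cycle (illustrative data).
n = 120
t = np.arange(n)
y = pd.Series(
    100 + 0.5 * t + 30 * np.sin(2 * np.pi * t / 12),
    index=pd.date_range("2010-01-01", periods=n, freq="MS"),
)

# d = 1: the first difference (1 - L) turns the linear trend into a constant.
first_diff = y.diff(1)

# D = 1 with s = 12: the seasonal difference (1 - L^12) removes the repeating cycle
# and the remaining constant.
both_diff = first_diff.diff(12).dropna()

print(len(both_diff))         # 107: 1 + 12 = 13 observations are lost to differencing
print(both_diff.abs().max())  # numerically zero for this noise-free series
```

With real, noisy data the differenced series will not be exactly zero; the goal is only that it looks stationary.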
Mathematical Formulation (Conceptual)
The full SARIMAX model combines the non-seasonal and seasonal parts multiplicatively. Including exogenous variables, the model conceptually looks like:
$ (1 - \sum_{i=1}^{p} \phi_i L^i) (1 - \sum_{i=1}^{P} \Phi_i L^{is}) (1-L)^d (1-L^s)^D Y_t = c + (1 + \sum_{i=1}^{q} \theta_i L^i) (1 + \sum_{i=1}^{Q} \Theta_i L^{is}) \epsilon_t + \beta X_t $
Where:
- $Y_t$ is the time series value at time t.
- $L$ is the lag operator.
- $\phi_i, \theta_i$ are non-seasonal AR and MA parameters.
- $\Phi_i, \Theta_i$ are seasonal AR and MA parameters.
- $d, D$ are non-seasonal and seasonal differencing orders.
- $s$ is the seasonal period.
- $\epsilon_t$ is the white noise error term.
- $X_t$ represents the exogenous variables, and $\beta$ are their coefficients.
- $c$ is a constant.
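As a worked illustration of the multiplicative structure, consider the special case $p = P = 0$, $d = D = 1$, $q = Q = 1$, $s = 12$, with $c = 0$ and no exogenous term (often called the airline model). Under those assumptions the general expression reduces to:

$$ (1 - L)(1 - L^{12}) Y_t = (1 + \theta_1 L)(1 + \Theta_1 L^{12}) \epsilon_t $$

Expanding both products gives:

$$ Y_t - Y_{t-1} - Y_{t-12} + Y_{t-13} = \epsilon_t + \theta_1 \epsilon_{t-1} + \Theta_1 \epsilon_{t-12} + \theta_1 \Theta_1 \epsilon_{t-13} $$

The cross term $\theta_1 \Theta_1 \epsilon_{t-13}$ is what "multiplicative" means here: the lag-13 coefficient is not a free parameter but the product of the non-seasonal and seasonal MA coefficients.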
When to Use SARIMAX
SARIMAX is particularly useful for:
- Time series data that exhibits both a clear trend and a repeating seasonal pattern (e.g., monthly sales data, daily electricity consumption).
- When the series is non-stationary and requires differencing to achieve stationarity.
- When external factors are known to influence the time series, and their future values can be reliably predicted or are already known.
- When you need to quantify forecast uncertainty with prediction intervals around the point forecasts.
Pros and Cons
Pros
- Handles Seasonality & Trend: Explicitly models both non-seasonal and seasonal components.
- Incorporates Exogenous Variables: Can leverage external information to improve forecast accuracy.
- Statistically Robust: Based on well-established statistical principles, providing interpretable parameters and confidence intervals.
- Flexible: Can be configured to handle a wide variety of time series patterns by adjusting its many parameters.
Cons
- Complex Configuration: Determining the optimal (p,d,q)(P,D,Q,s) orders can be challenging and time-consuming, often requiring ACF/PACF analysis and iterative testing.
- Assumes Linearity: Struggles with highly non-linear relationships in the data.
- Requires Stationarity: The series must become stationary once the d non-seasonal and D seasonal differences have been applied.
- Computational Cost: Can be slow to fit on very long time series due to its iterative optimization process.
- Future Exogenous Data Needed: To forecast into the future, you must have future values for all exogenous variables, which is often a practical limitation.
Example Implementation
Here's an example of implementing a SARIMAX model in Python using the `statsmodels` library. We'll generate sample data with trend, seasonality, and an exogenous variable.
```python
# Import necessary libraries
import pandas as pd
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX
import matplotlib.pyplot as plt

# 1. Generate sample data with trend, seasonality, and an exogenous variable
np.random.seed(42)
n_samples = 120  # 10 years of monthly data
time_index = pd.date_range(start='2010-01-01', periods=n_samples, freq='MS')

# Target series: sales with trend, yearly seasonality, and noise
sales_data = (100 + np.arange(n_samples) * 0.5 +                    # Trend
              30 * np.sin(np.arange(n_samples) * 2 * np.pi / 12) +  # Yearly seasonality
              np.random.normal(0, 5, n_samples))                    # Noise
series = pd.Series(sales_data, index=time_index)

# Exogenous variable: advertising spend (also with some pattern)
# Note: sales_data above is generated independently of advertising, so the
# fitted advertising coefficient is expected to be small in this demo.
advertising_spend = (50 + 10 * np.cos(np.arange(n_samples) * 2 * np.pi / 6) +  # 6-month cycle
                     np.random.normal(0, 2, n_samples))
exog_df = pd.DataFrame({'advertising': advertising_spend}, index=time_index)

# 2. Split data into train and test sets (positional slicing, order preserved)
train_size = 100
train_series, test_series = series.iloc[:train_size], series.iloc[train_size:]
train_exog, test_exog = exog_df.iloc[:train_size], exog_df.iloc[train_size:]

# 3. Fit the SARIMAX model
# Non-seasonal order (p,d,q) = (1,1,1)
# Seasonal order (P,D,Q,s) = (1,1,1,12) for yearly seasonality with monthly data
model = SARIMAX(
    train_series,
    exog=train_exog,
    order=(1, 1, 1),               # Non-seasonal (AR, I, MA)
    seasonal_order=(1, 1, 1, 12),  # Seasonal (AR, I, MA, Period)
    enforce_stationarity=False,    # Relaxes the AR stationarity constraint (statsmodels default is True)
    enforce_invertibility=False    # Relaxes the MA invertibility constraint (statsmodels default is True)
)
model_fit = model.fit(disp=False)  # disp=False suppresses optimizer output

# 4. Make a forecast
forecast_steps = len(test_series)
# To forecast, you MUST provide future values for the exogenous variables
future_exog = test_exog  # In a real scenario: known or separately forecast future exog values
forecast = model_fit.forecast(steps=forecast_steps, exog=future_exog)

# 5. Display the model summary and the forecast
print("SARIMAX Model Summary:")
print(model_fit.summary())
print("\nSARIMAX Forecast:")
print(forecast)

# Example plotting (uncomment to run)
# plt.figure(figsize=(14, 7))
# plt.plot(train_series.index, train_series, label='Training Data')
# plt.plot(test_series.index, test_series, label='Actual Data', color='orange')
# plt.plot(forecast.index, forecast, label='SARIMAX Forecast', color='green', linestyle='--')
# plt.title('SARIMAX Model Forecast with Exogenous Variable')
# plt.xlabel('Date')
# plt.ylabel('Value')
# plt.legend()
# plt.grid(True)
# plt.show()
```
Dependencies & Resources
Dependencies: `pandas`, `numpy`, `statsmodels`, `matplotlib` (for plotting).