Overview
Vector Autoregression (VAR) models are a class of statistical models used for forecasting when two or more time series are interdependent. Unlike univariate models (like ARIMA) that focus on a single series, VAR models capture the linear interdependencies among multiple time series simultaneously. This makes them particularly useful in fields like economics, finance, and climate science where variables often influence each other over time.
Although VAR is a multivariate model, it is also relevant when the goal is to forecast a single target series. If other variables interact dynamically with the target (each affecting the other over time), a VAR models that feedback explicitly instead of treating those variables as simple exogenous inputs.
Architecture & Components
In a VAR(p) model for $K$ variables, the current value of each variable is expressed as a linear function of its own past $p$ values and the past $p$ values of the other $K-1$ variables in the system.
Mathematical Formulation
For a VAR(p) model with $K$ variables ($Y_{1,t}, Y_{2,t}, ..., Y_{K,t}$), the model can be written in matrix form:
$ Y_t = c + A_1 Y_{t-1} + A_2 Y_{t-2} + ... + A_p Y_{t-p} + \epsilon_t $
Where:
- $Y_t = [Y_{1,t}, Y_{2,t}, ..., Y_{K,t}]^T$ is a $K \times 1$ vector of observations at time $t$.
- $c = [c_1, c_2, ..., c_K]^T$ is a $K \times 1$ vector of constants (intercepts).
- $A_i$ are $K \times K$ matrices of autoregressive coefficients for lag $i$. Each element $A_{i,jk}$ represents the influence of variable $Y_k$ at time $t-i$ on variable $Y_j$ at time $t$.
- $Y_{t-i}$ are the lagged vectors of observations.
- $\epsilon_t = [\epsilon_{1,t}, \epsilon_{2,t}, ..., \epsilon_{K,t}]^T$ is a $K \times 1$ vector of white noise error terms with mean zero and a constant covariance matrix. The errors may be correlated across variables at the same time $t$, and they are often additionally assumed to be multivariate normally distributed.
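As a worked instance of the notation above, a bivariate VAR(1) ($K = 2$, $p = 1$) expands into two scalar equations. The subscripts follow the convention already defined: $A_{1,jk}$ is the effect of variable $k$ at lag 1 on variable $j$.
$ Y_{1,t} = c_1 + A_{1,11} Y_{1,t-1} + A_{1,12} Y_{2,t-1} + \epsilon_{1,t} $
$ Y_{2,t} = c_2 + A_{1,21} Y_{1,t-1} + A_{1,22} Y_{2,t-1} + \epsilon_{2,t} $
The off-diagonal coefficients $A_{1,12}$ and $A_{1,21}$ carry the cross-variable dynamics; if both were zero, the system would reduce to two separate AR(1) models.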
The order $p$ determines how many past time steps are included in the model for each variable.
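The matrix form can be evaluated directly. The following is a minimal sketch, not a library routine: the function name `var_one_step_mean` and all coefficient values are hypothetical and chosen only for illustration. It computes the conditional mean of $Y_t$ (the model equation without the error term) for a VAR(2) with $K = 2$.

```python
import numpy as np

def var_one_step_mean(c, coef_matrices, history):
    """Conditional mean of Y_t given the last p observations.

    c             : intercept vector, shape (K,)
    coef_matrices : list [A_1, ..., A_p], each of shape (K, K)
    history       : shape (p, K), oldest row first, so history[-1] is Y_{t-1}
    """
    history = np.asarray(history, dtype=float)
    prediction = np.array(c, dtype=float)  # start from the intercept (copy)
    for i, A_i in enumerate(coef_matrices, start=1):
        # A_i multiplies the observation i steps back, Y_{t-i}
        prediction += np.asarray(A_i, dtype=float) @ history[-i]
    return prediction

# Hypothetical VAR(2) with K = 2 variables (illustrative numbers only)
c = [1.0, 0.5]
A1 = [[0.5, 0.1], [0.2, 0.3]]
A2 = [[0.1, 0.0], [0.0, 0.1]]
history = [[2.0, 1.0],   # Y_{t-2}
           [3.0, 2.0]]   # Y_{t-1}

y_hat = var_one_step_mean(c, [A1, A2], history)
print(y_hat)
```

Each row of a coefficient matrix corresponds to one equation of the system, so the matrix-vector product evaluates all $K$ equations at once.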
When to Use VAR Models
VAR models are most appropriate when:
- You are dealing with multiple time series that are believed to influence each other (e.g., interest rates, inflation, and GDP).
- The primary goal is forecasting, and understanding causal relationships is secondary. A fitted VAR also supports Granger causality tests, which assess whether past values of one series improve predictions of another; this is predictive rather than structural causality.
- The time series are stationary. If they are not, differencing is the usual first step; if the series are cointegrated (share a long-run relationship), a Vector Error Correction Model (VECM) may be more appropriate.
- You need to model dynamic interactions between variables rather than treating some as simple exogenous inputs.
Pros and Cons
Pros
- Captures Interdependencies: Explicitly models the dynamic relationships among multiple time series.
- Flexible: Does not require strong theoretical assumptions about the underlying economic or physical structure, unlike structural models.
- Good for Forecasting: Often provides accurate forecasts for multivariate systems.
- Provides Impulse Response Functions: Allows analysis of how a shock to one variable affects all variables in the system over time.
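To make the impulse response idea concrete: for a VAR(1), the non-orthogonalized response at horizon $h$ to a unit shock is the matrix power $A_1^h$. The sketch below assumes a known coefficient matrix (the values are illustrative) and uses a hypothetical helper name, `var1_impulse_responses`; it is not the `statsmodels` implementation.

```python
import numpy as np

def var1_impulse_responses(A1, horizons):
    """Non-orthogonalized impulse responses Phi_h = A1^h for h = 0..horizons.

    Returns an array of shape (horizons + 1, K, K), where entry [h, j, k]
    is the response of variable j at horizon h to a unit shock in
    variable k at horizon 0.
    """
    A1 = np.asarray(A1, dtype=float)
    K = A1.shape[0]
    responses = np.empty((horizons + 1, K, K))
    responses[0] = np.eye(K)            # at h = 0 the shock hits only its own variable
    for h in range(1, horizons + 1):
        responses[h] = A1 @ responses[h - 1]
    return responses

# Illustrative coefficient matrix for a two-variable VAR(1)
A1 = [[0.6, 0.3],
      [0.4, 0.5]]
irf = var1_impulse_responses(A1, horizons=3)

print("h = 1:\n", irf[1])
print("h = 2:\n", irf[2])
```

For a fitted model, `statsmodels` exposes impulse response analysis through `results.irf(...)`, which also handles higher lag orders and orthogonalized shocks.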
Cons
- Parameter Explosion: The number of coefficients grows quadratically with the number of variables ($K^2 \times p$ autoregressive coefficients plus $K$ intercepts), which makes estimation challenging for many series or high lag orders.
- Assumes Linearity: Cannot capture complex non-linear relationships.
- Requires Stationarity: All series must be stationary, or differenced to achieve stationarity.
- Less Interpretable Coefficients: Individual coefficients can be hard to interpret directly due to the complex interdependencies.
- Data Hungry: Requires a sufficient amount of historical data to reliably estimate all parameters.
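The parameter growth mentioned in the list above is easy to quantify. This small sketch counts the mean-equation parameters ($K^2 \times p$ autoregressive coefficients plus $K$ intercepts, excluding the error covariance matrix); the function name `var_mean_parameter_count` is hypothetical.

```python
def var_mean_parameter_count(K, p, include_intercept=True):
    """Number of mean-equation parameters in a VAR(p) with K variables."""
    # Each of the K equations has K * p lag coefficients, plus one intercept
    per_equation = K * p + (1 if include_intercept else 0)
    return K * per_equation

# Illustrative comparison at a fixed lag order of 4
for K in (2, 5, 10):
    print(f"K = {K:2d}, p = 4 -> {var_mean_parameter_count(K, p=4)} parameters")
```

Comparing the count against the number of available observations gives a rough sense of whether a given $K$ and $p$ are realistic for the data.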
Example Implementation
Here's an example of implementing a VAR model using the `statsmodels` library in Python. We'll generate two interdependent time series and then fit the VAR model.
# Import necessary libraries
import pandas as pd
import numpy as np
from statsmodels.tsa.api import VAR
import matplotlib.pyplot as plt
# 1. Generate sample interdependent data
np.random.seed(42)
n_samples = 150
# Create two series where each influences the other
data1 = np.zeros(n_samples)
data2 = np.zeros(n_samples)
# Initial values
data1[0] = 10
data2[0] = 5
# Simple VAR(1) process:
# data1_t = 0.6 * data1_{t-1} + 0.3 * data2_{t-1} + noise1_t
# data2_t = 0.4 * data1_{t-1} + 0.5 * data2_{t-1} + noise2_t
for i in range(1, n_samples):
    noise1 = np.random.normal(0, 1)
    noise2 = np.random.normal(0, 1)
    data1[i] = 0.6 * data1[i-1] + 0.3 * data2[i-1] + noise1
    data2[i] = 0.4 * data1[i-1] + 0.5 * data2[i-1] + noise2
# Combine into a DataFrame
df = pd.DataFrame({'Series1': data1, 'Series2': data2},
                  index=pd.date_range(start='2020-01-01', periods=n_samples, freq='D'))
# 2. Split data into train and test
train_size = 120
train_df, test_df = df.iloc[:train_size], df.iloc[train_size:]
# 3. Fit the VAR model
# The order (lags) determines how many past time steps are used
model = VAR(train_df)
results = model.fit(maxlags=5, ic='aic') # Let it select optimal lags up to 5 using AIC
# 4. Make a forecast
forecast_steps = len(test_df)
# To forecast, you need the last 'lags' observations from the training data
# results.k_ar gives the optimal number of lags chosen by the model
lag_order = results.k_ar
forecast_input = train_df.values[-lag_order:]
# Forecast future values
forecast = results.forecast(y=forecast_input, steps=forecast_steps)
forecast_df = pd.DataFrame(forecast, index=test_df.index, columns=df.columns)
# 5. Print the model summary and the forecast (plotting is shown separately below)
print("VAR Model Summary:")
print(results.summary())
print("\nVAR Model Forecast:")
print(forecast_df)
# Example plotting (uncomment and run in a Python environment with matplotlib)
# fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(14, 10), sharex=True)
#
# # Plot Series1
# axes[0].plot(train_df.index, train_df['Series1'], label='Training Series1')
# axes[0].plot(test_df.index, test_df['Series1'], label='Actual Series1', color='orange')
# axes[0].plot(forecast_df.index, forecast_df['Series1'], label='VAR Forecast Series1', color='green', linestyle='--')
# axes[0].set_title('Series1 Forecast')
# axes[0].legend()
# axes[0].grid(True)
#
# # Plot Series2
# axes[1].plot(train_df.index, train_df['Series2'], label='Training Series2')
# axes[1].plot(test_df.index, test_df['Series2'], label='Actual Series2', color='red')
# axes[1].plot(forecast_df.index, forecast_df['Series2'], label='VAR Forecast Series2', color='purple', linestyle='--')
# axes[1].set_title('Series2 Forecast')
# axes[1].legend()
# axes[1].grid(True)
#
# plt.tight_layout()
# plt.show()
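Since the example holds out a test set, one natural follow-up is to score the forecast against it. The sketch below is one possible way to do that, assuming per-series RMSE is a suitable metric; the helper `rmse_per_series` and the tiny frames are hypothetical stand-ins for `test_df` and `forecast_df`.

```python
import numpy as np
import pandas as pd

def rmse_per_series(actual, predicted):
    """Root mean squared error for each column of two aligned DataFrames."""
    errors = actual - predicted          # aligned on index and column labels
    return np.sqrt((errors ** 2).mean(axis=0))

# Small illustrative frames standing in for test_df and forecast_df
idx = pd.date_range("2020-01-01", periods=3, freq="D")
actual = pd.DataFrame({"Series1": [1.0, 2.0, 3.0],
                       "Series2": [2.0, 2.0, 2.0]}, index=idx)
predicted = pd.DataFrame({"Series1": [1.0, 2.0, 5.0],
                          "Series2": [1.0, 3.0, 2.0]}, index=idx)

scores = rmse_per_series(actual, predicted)
print(scores)
```

With the variables from the example, the call would be `rmse_per_series(test_df, forecast_df)`, since both frames share the same index and column names.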
Dependencies & Resources
Dependencies: `pandas`, `numpy`, `statsmodels`, `matplotlib` (for plotting).