Overview
The Transformer is a deep learning architecture introduced in the 2017 paper "Attention Is All You Need." Originally designed for neural machine translation, its core innovation, the **self-attention mechanism**, has proven remarkably effective for processing sequential data. Unlike Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs), Transformers process all elements of a sequence simultaneously, allowing them to capture long-range dependencies more effectively and to parallelize training. While its direct application to raw time series data can be computationally intensive for very long sequences, the architecture forms the foundation for many state-of-the-art time series models such as Informer, Autoformer, and PatchTST.
Architecture & Components
The standard Transformer architecture consists of an encoder and a decoder, both composed of multiple identical layers. For time series forecasting, often only the encoder part is used, or a modified encoder-decoder structure is employed.
- Input Embedding: Each time step (or a patch of time steps, as in PatchTST) is first converted into a dense vector representation.
- Positional Encoding: Since the Transformer does not inherently process sequences in order, positional encodings are added to the input embeddings. These encodings provide information about the relative or absolute position of each time step in the sequence.
$ PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}}) $
Where $pos$ is the position, $i$ is the dimension, and $d_{model}$ is the embedding dimension.
$ PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}}) $
- Multi-Head Self-Attention: This is the heart of the Transformer. It allows the model to weigh the importance of different parts of the input sequence when processing each element (a short NumPy sketch of positional encoding and scaled dot-product attention follows this list). It consists of:
- Query (Q), Key (K), Value (V) Vectors: For each input embedding, three vectors are created. Queries are used to score against Keys, and these scores determine how much attention to pay to corresponding Values.
- Scaled Dot-Product Attention: The attention score is calculated as a dot product of Q and K, scaled by the square root of the key dimension ($d_k$), followed by a softmax function to get weights, which are then multiplied by V.
$ \text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V $
- Multiple Heads: The self-attention mechanism is performed multiple times in parallel ("multiple heads"), allowing the model to focus on different aspects of the sequence simultaneously. The outputs from all heads are then concatenated and linearly transformed.
- Feed-Forward Network: After attention, each position in the sequence passes through an identical, fully connected feed-forward network independently.
- Add & Normalize: Residual connections and layer normalization are applied after both the multi-head attention and the feed-forward network to facilitate training of deep networks.
Conceptual diagram of a Transformer encoder block, showing multi-head attention and feed-forward layers.
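To make the formulas above concrete, the following minimal NumPy sketch (an illustration, not part of any library) computes the sinusoidal positional encodings and single-head scaled dot-product self-attention for a toy sequence; the sequence length, model dimension, and random inputs are arbitrary assumptions.
import numpy as np
def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    positions = np.arange(seq_len)[:, np.newaxis]                      # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                           # (1, d_model)
    angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                              # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                              # odd dimensions
    return pe
def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                                    # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)                     # row-wise softmax
    return weights @ V
seq_len, d_model = 6, 8                                                # toy sizes (assumed)
x = np.random.randn(seq_len, d_model) + sinusoidal_positional_encoding(seq_len, d_model)
out = scaled_dot_product_attention(x, x, x)                            # self-attention: Q = K = V = x
print(out.shape)  # (6, 8)
In the full architecture, Q, K, and V are separate learned projections of the inputs and this computation is repeated once per head; the Keras example later on this page follows that multi-head pattern.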
When to Use Transformer
The raw Transformer architecture, or one of its adapted variants, is a strong choice for time series forecasting when:
- Long-range dependencies are critical: Its self-attention mechanism excels at capturing relationships between distant points in a sequence.
- Parallelization during training is desired: Unlike RNNs, the Transformer can process all time steps simultaneously, speeding up training on modern hardware.
- The time series exhibits complex, non-linear patterns: The attention mechanism and feed-forward networks can model intricate relationships.
- You have sufficient training data: Transformers are data-hungry models and perform best with large datasets.
- You are working with multivariate time series: The attention mechanism can learn relationships across different variables.
Pros and Cons
Pros
- Excellent at Long-Range Dependencies: Overcomes vanishing gradient issues of RNNs by directly attending to all positions.
- High Parallelization: Enables faster training compared to sequential models like RNNs.
- State-of-the-Art Performance: Forms the basis for many top-performing time series models.
- Flexible for Sequence Data: Adaptable to various sequence lengths and types.
Cons
- Quadratic Complexity: Standard self-attention has quadratic time and memory complexity with respect to sequence length, limiting its direct application to very long raw time series.
- High Data Requirements: Requires substantial amounts of data for effective training.
- Less Interpretable: A "black box" model, though attention weights can offer some insight.
- No Inherent Inductive Bias for Time Series: Needs positional encoding to understand order, and may require adaptations (such as patching or series decomposition) to better capture time-series-specific patterns; a rough patching sketch follows this list.
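As a rough illustration of the patching idea mentioned above (the window and patch lengths are arbitrary assumptions, and this is a sketch of the reshaping only, not the PatchTST model): grouping consecutive time steps into patches shortens the sequence the attention mechanism sees, which directly reduces its quadratic cost.
import numpy as np
window = np.random.randn(48)               # one univariate input window (look_back = 48, assumed)
patch_len = 12                             # assumed patch length; must divide the window length
patches = window.reshape(-1, patch_len)    # (num_patches, patch_len) = (4, 12)
# Attention now runs over 4 patch tokens instead of 48 time steps (16 vs. 2304 pairwise scores);
# each patch would then be linearly projected to the model's embedding dimension.
Each patch plays the role of a single "token", so the memory and compute for self-attention scale with the number of patches rather than the number of raw time steps.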
Example Implementation
Implementing a full Transformer from scratch is an extensive undertaking. Here we provide conceptual examples in TensorFlow/Keras and PyTorch that focus on the core encoder block, adapted for univariate time series forecasting.
TensorFlow/Keras Example (Conceptual)
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Dropout, LayerNormalization
from tensorflow.keras.models import Model
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
# 1. Generate sample data
np.random.seed(42)
n_samples = 500
time = np.arange(n_samples)
data = np.sin(time / 20) * 10 + time * 0.1 + np.random.randn(n_samples) * 2
data = data.reshape(-1, 1)
# 2. Scale data
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(data)
# 3. Create sequences for Transformer input
def create_sequences(data, look_back):
    X, y = [], []
    for i in range(len(data) - look_back):
        X.append(data[i:(i + look_back), 0])
        y.append(data[i + look_back, 0])
    return np.array(X), np.array(y)
look_back = 50 # Input sequence length
X, y = create_sequences(scaled_data, look_back)
# Reshape X for Transformer: (samples, timesteps, features)
X = X.reshape(X.shape[0], X.shape[1], 1)
# 4. Implement Positional Encoding
class PositionalEmbedding(tf.keras.layers.Layer):
    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        # vocab_size is unused for continuous inputs; a Dense projection replaces the usual token embedding
        self.token_embeddings = Dense(embed_dim)  # Linear projection for input features
        self.position_embeddings = tf.keras.layers.Embedding(sequence_length, embed_dim)
        self.sequence_length = sequence_length
        self.embed_dim = embed_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-2]  # Get sequence length from input tensor
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions
# 5. Implement Multi-Head Self-Attention
class MultiHeadSelfAttention(tf.keras.layers.Layer):
    def __init__(self, embed_dim, num_heads=8, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        if embed_dim % num_heads != 0:
            raise ValueError(
                f"embedding dimension = {embed_dim} should be divisible by number of heads = {num_heads}"
            )
        self.projection_dim = embed_dim // num_heads
        self.query_dense = Dense(embed_dim)
        self.key_dense = Dense(embed_dim)
        self.value_dense = Dense(embed_dim)
        self.combine_heads = Dense(embed_dim)

    def attention(self, query, key, value):
        score = tf.matmul(query, key, transpose_b=True)
        dim_key = tf.cast(tf.shape(key)[-1], tf.float32)
        scaled_score = score / tf.math.sqrt(dim_key)
        weights = tf.nn.softmax(scaled_score, axis=-1)
        output = tf.matmul(weights, value)
        return output, weights

    def separate_heads(self, x, batch_size):
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.projection_dim))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, inputs):
        batch_size = tf.shape(inputs)[0]
        query = self.query_dense(inputs)  # (batch_size, seq_len, embed_dim)
        key = self.key_dense(inputs)  # (batch_size, seq_len, embed_dim)
        value = self.value_dense(inputs)  # (batch_size, seq_len, embed_dim)
        query = self.separate_heads(query, batch_size)  # (batch_size, num_heads, seq_len, projection_dim)
        key = self.separate_heads(key, batch_size)  # (batch_size, num_heads, seq_len, projection_dim)
        value = self.separate_heads(value, batch_size)  # (batch_size, num_heads, seq_len, projection_dim)
        attention, weights = self.attention(query, key, value)
        attention = tf.transpose(attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len, num_heads, projection_dim)
        concat_attention = tf.reshape(attention, (batch_size, -1, self.embed_dim))
        output = self.combine_heads(concat_attention)
        return output
# 6. Implement Transformer Block (Encoder Layer)
class TransformerBlock(tf.keras.layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1, **kwargs):
        super().__init__(**kwargs)
        self.att = MultiHeadSelfAttention(embed_dim, num_heads)
        self.ffn = tf.keras.Sequential(
            [Dense(ff_dim, activation="relu"), Dense(embed_dim)]
        )
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    def call(self, inputs, training=None):
        attn_output = self.att(inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)
# 7. Build the Transformer Model for Time Series Forecasting
embed_dim = 32 # Embedding size for each token
num_heads = 4 # Number of attention heads
ff_dim = 32 # Hidden layer size in feed forward network inside transformer
inputs = Input(shape=(look_back, 1)) # Input is (sequence_length, features)
x = PositionalEmbedding(look_back, 1, embed_dim)(inputs)  # the vocab_size argument (1) is unused for continuous inputs
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(x)
x = TransformerBlock(embed_dim, num_heads, ff_dim)(x)  # Add a second block for a deeper model
# Take the output of the last time step for prediction
outputs = Dense(1)(x[:, -1, :]) # Predict the next single value
model = Model(inputs=inputs, outputs=outputs)
# 8. Compile and train
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(X, y, epochs=20, batch_size=32, verbose=0)
print("TensorFlow/Keras Transformer-like model training complete.")
# 9. Make predictions (conceptual)
train_predict_scaled = model.predict(X)
train_predict = scaler.inverse_transform(train_predict_scaled)
y_original = scaler.inverse_transform(y.reshape(-1, 1))
print(f"First 5 original values: {y_original[:5].flatten()}")
print(f"First 5 predicted values: {train_predict[:5].flatten()}")
# Plotting (conceptual)
# plt.figure(figsize=(14, 7))
# plt.plot(data[look_back:], label='Original Data')
# plt.plot(train_predict, label='Training Prediction', linestyle='--')
# plt.title('TensorFlow/Keras Transformer Time Series Forecast')
# plt.xlabel('Time Step')
# plt.ylabel('Value')
# plt.legend()
# plt.grid(True)
# plt.show()
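Once trained, the single-step model can be rolled forward recursively to produce a multi-step forecast. The sketch below reuses model, scaler, scaled_data, and look_back from the example above; the 10-step horizon and the recursive (feed-predictions-back-in) strategy are illustrative choices rather than part of the original example.
# 10. Recursive multi-step forecast (conceptual sketch)
n_future = 10                                    # arbitrary forecast horizon
window = scaled_data[-look_back:, 0].tolist()    # start from the last observed window
forecast_scaled = []
for _ in range(n_future):
    x_input = np.array(window[-look_back:]).reshape(1, look_back, 1)
    next_scaled = model.predict(x_input, verbose=0)[0, 0]
    forecast_scaled.append(next_scaled)
    window.append(next_scaled)                   # feed the prediction back in as input
forecast = scaler.inverse_transform(np.array(forecast_scaled).reshape(-1, 1))
print(f"Next {n_future} forecasted values: {forecast.flatten()}")
Note that errors compound in a recursive forecast; training the model with a multi-output head that predicts several future steps at once is a common alternative.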