---
license: mit
language:
- en
tags:
- attention
- temporal-reasoning
- time-series
- inductive-bias
- plug-and-play
---

# TemporalSelfAttention - A Time-Biased Attention Module

> Give Transformers a sense of time - not by scaling, but by structure.

---

## Why?

Standard attention treats all tokens equally in time. This works for syntax, but breaks for:

- Temporal event ordering
- Causal reasoning
- Timeline consistency
- Long-range narrative coherence

💡 Insight: These models *simulate* time via token position. We inject it *structurally* with a tiny inductive bias.

---

## Core Equation

The time-aware attention score is computed as:

$$
\text{score}_{ij} = \frac{Q_i \cdot K_j^\top}{\sqrt{d_k}} + \gamma \cdot f(t_j - t_i)
$$

### Notation

| Symbol | Description |
|--------|-------------|
| \\( \text{score}_{ij} \\) | Attention score between query at position \\( i \\) and key at position \\( j \\) |
| \\( Q_i \\) | Query vector for position \\( i \\) |
| \\( K_j \\) | Key vector for position \\( j \\) |
| \\( d_k \\) | Dimension of key vectors |
| \\( \gamma \\) | Learnable time bias strength |
| \\( f(\cdot) \\) | Time difference function |
| \\( t_j - t_i \\) | Relative time difference |

## How To Use

```python
from temporal_attention import TemporalSelfAttention

model = TemporalSelfAttention(
    embed_dim=64,
    num_heads=1,
    bias_type="linear",  # or 'gaussian'
    gamma=1.0,
    causal=False,
)

# x: (B, T, D), timestamps: (B, T)
output, weights = model(x, timestamps)
```
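
## Reference Sketch

For orientation, below is a minimal single-head sketch of how the score in the Core Equation can be turned into an attention layer. This is not the module's actual implementation: the concrete forms of \\( f \\) used here (`-|Δt|` for `linear`, a Gaussian kernel with an assumed width `sigma` for `gaussian`) are illustrative assumptions; only `gamma`, `bias_type`, and `causal` mirror the constructor arguments above.

```python
# Sketch only: the bias functions below are assumptions, not the module's exact f(.).
import math
import torch
import torch.nn.functional as F


def time_biased_attention(q, k, v, timestamps, gamma,
                          bias_type="linear", sigma=1.0, causal=False):
    """q, k, v: (B, T, D); timestamps: (B, T); gamma: scalar."""
    d_k = q.size(-1)

    # Content term: Q K^T / sqrt(d_k)  -> (B, T, T)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)

    # Relative time differences dt[b, i, j] = t_j - t_i  -> (B, T, T)
    dt = timestamps.unsqueeze(1) - timestamps.unsqueeze(2)

    # Structural time bias gamma * f(t_j - t_i)
    if bias_type == "linear":
        bias = -dt.abs()                              # assumed: closer in time => larger score
    elif bias_type == "gaussian":
        bias = torch.exp(-dt.pow(2) / (2 * sigma ** 2))  # assumed Gaussian kernel
    else:
        raise ValueError(f"unknown bias_type: {bias_type}")
    scores = scores + gamma * bias

    # Optional causal mask: query i may only attend to keys j <= i
    if causal:
        T = q.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=q.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))

    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights
```

Projecting `x` into `q`, `k`, `v` with learned linear maps and registering `gamma` as a learnable parameter would recover the plug-and-play interface shown in How To Use.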