Stanford CS25: V4 I Transformers that Transform Well Enough to Support Near-Shallow Architectures
23 May 2024
Self-Attention Mechanism
- Professor Jake Williams proposes a modified self-attention mechanism as an alternative to the standard, dimensionalizing version (the one that maps inputs into separate key and query spaces). The modified mechanism is treated as a feed-forward layer that produces the self-attention weights.
- The modified mechanism is compatible with the traditional self-attention approach and avoids the need for additional model complexity.
- The key point is that the vectors used for self-attention should have consistent and meaningful comparisons.
- Optimizing the keys and queries of standard self-attention resembles learning token or word embeddings: each of the multiple self-attention heads creates its own dimensional space, and which space a given head ends up with is indeterminate.
- This indeterminacy relates to the lottery ticket hypothesis, which suggests that using several different embeddings in parallel adds robustness and reduces the impact of poorly initialized parameters.
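
For reference, the standard keys-and-queries mechanism that the bullets above contrast against can be sketched in NumPy as follows. This is the widely used scaled dot-product form, not the speaker's modified version; the matrix names `W_q`, `W_k`, `W_v` and the toy sizes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Standard single-head scaled dot-product self-attention.

    X has shape (sequence_length, model_dim); each W_* projects the inputs
    into a lower-dimensional space (the "dimensionalizing" step).
    """
    Q = X @ W_q                                # queries
    K = X @ W_k                                # keys
    V = X @ W_v                                # values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # pairwise query-key comparisons
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V, weights

# Toy example: 5 tokens, model dimension 8, head dimension 4 (arbitrary sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
```

Because `W_q` and `W_k` start out random here, the comparisons in `scores` carry no meaning until training shapes them, which is the situation the consistency point above is concerned with.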
Dimensionality Reduction and Vector Comparison
- Dimensionality reduction is necessary for language modeling, but it also poses challenges, both because of computational intractability and because embedding layers sit far from the learning signal.
- The discernability hypothesis proposes that low-dimensional vectors should be able to distinguish features, with more common features assigned more distinguishable vectors.
- The Bit Cipher algorithm generalizes one-hot vectors to low dimensions, allowing for controlled exploration of dimensionality.
- A deterministic low-dimensionalization procedure enables non-random initialization of layers in neural networks, improving performance compared to random initialization.
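
To make the idea of generalizing one-hot vectors concrete, here is a minimal sketch of a deterministic low-dimensional code. It is an illustration written for these notes, not the published Bit Cipher algorithm: the function name `low_dim_codes` and the specific ordering of bit patterns are assumptions. Tokens are assumed to be sorted from most to least frequent, so the most common tokens receive the sparsest, most easily distinguished patterns.

```python
from itertools import combinations, islice

import numpy as np

def low_dim_codes(n_tokens, dim):
    """Assign each frequency-ranked token a distinct unit-norm bit pattern.

    Hypothetical illustration: patterns are enumerated by increasing number
    of active bits, so rank 0..dim-1 get one-hot vectors, the next ranks get
    two-hot vectors, and so on.
    """
    if n_tokens > 2 ** dim - 1:
        raise ValueError("dim is too small to give every token a distinct pattern")

    def patterns():
        for k in range(1, dim + 1):
            yield from combinations(range(dim), k)

    codes = np.zeros((n_tokens, dim))
    for row, active in enumerate(islice(patterns(), n_tokens)):
        codes[row, list(active)] = 1.0
    # Normalize so every token vector has unit length.
    return codes / np.linalg.norm(codes, axis=1, keepdims=True)

full = low_dim_codes(4, 4)      # dim equals vocabulary size: plain one-hot vectors
small = low_dim_codes(6, 3)     # 6 tokens squeezed into 3 dimensions
```

When `dim` equals the vocabulary size the result is exactly the one-hot encoding, and shrinking `dim` gives a controlled way to explore lower dimensions. Since the procedure involves no randomness, such vectors could serve as a non-random initialization for a layer.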
Warm-Starting Language Models
- The softmax activation function is necessary for self-attention features, and a differential criterion is derived to determine the targets for self-attention.
- Warm-starting a network with non-random vectors reduces perplexity and improves learning compared to cold starts.
- The modified version of self-attention compares inputs to themselves, with keys and queries in between, and uses non-random initialization for the parameters.
- The warm-start solution can be applied to feed-forward layers with non-negative inputs.
- For non-unit normed vectors, the optimal value of K (number of features per prediction) is the average norm of the inputs.
Context Models and Training
- Longer context windows provide more information, but without feature weights, models do not automatically improve as the context window grows.
- Self-attention is needed to determine the best weights for context vectors.
- Different context models (block, radial, document) provide different information and can be integrated to improve language modeling.
- Bit Cipher vectors don't capture similarities between similar tokens, so traditional co-occurrence methods can be used to create vectors with meaningful similarities.
- A co-occurrence matrix is used to create vectors that can be used in self-attention feed-forward unit models.
- Caching vector comparisons reduces the self-attention layer cost from quadratic to linear, making training faster.
- Models trained on small data can be effective but may not generalize well to larger datasets.
- Packing long contexts can be used to improve the utilization of the block model of context, but it requires careful engineering.
- Dynamically changing the context length allows for more efficient use of self-attention parameters without the need for packing.
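
The co-occurrence approach mentioned above can be sketched as follows. This is a generic windowed co-occurrence count, not the speaker's exact procedure; the function name `cooccurrence_vectors`, the window size, and the toy sentence are assumptions made for illustration.

```python
import numpy as np

def cooccurrence_vectors(tokens, window=2):
    """Build unit-norm token vectors from symmetric windowed co-occurrence counts."""
    vocab = sorted(set(tokens))
    index = {tok: i for i, tok in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for i, tok in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[index[tok], index[tokens[j]]] += 1
    # Normalize rows so dot products between vectors are cosine similarities.
    norms = np.linalg.norm(counts, axis=1, keepdims=True)
    return vocab, counts / np.maximum(norms, 1e-12)

tokens = "the cat sat on the mat the dog sat on the rug".split()
vocab, vectors = cooccurrence_vectors(tokens, window=2)

def similarity(a, b):
    return float(vectors[vocab.index(a)] @ vectors[vocab.index(b)])

cat_dog = similarity("cat", "dog")   # tokens that appear in similar contexts
cat_on = similarity("cat", "on")     # tokens that appear in different roles
```

Unlike purely index-based codes, these vectors place tokens that share contexts (here "cat" and "dog") closer together than tokens that do not, which is the property the bullets above say Bit Cipher vectors lack.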
Alternative Self-Attention Strategies
- The proposed method uses a warm start to initialize the embedding layer, which saturates quickly and doesn't require a large amount of data.
- Training times are significantly faster compared to standard self-attention models, even for large models with billions of parameters.
- The method is effective for training models on specific tasks without pre-training, as demonstrated by a use case of predicting whether to turn a light on or off based on voice commands.
- The approach involves continuous data collection, transcription, language modeling, anticipation of user intent, and correction of training data.
- The models used for this task are small enough to fit on a microprocessor or a single-chip GPU, enabling real-time predictions and operation without an internet connection.
Future Work
- Future work includes incorporating a speech recognition system into the model and exploring different layer types for warm starts.
- Implementations of SAFU (the self-attention feed-forward unit) will be made available after publication, but releasing them still requires significant work on developing evaluation systems.