Attention Is All You Need and Why It Changed the Way Machines Learn
Before 2017, models like RNNs and LSTMs processed text one word at a time, which meant:
- Slow training
- Poor long-term memory
- No parallel processing

Everything changed with Google's paper “Attention Is All You Need”. Instead of reading sequentially, the model looks at all words at once and focuses on what matters most using attention. This idea led to the Transformer, the foundation of modern models like GPT and BERT. In this article, we’ll quickly walk through the paper and how Transformers work.
Transformer
The Transformer model is mainly used to understand and generate human language. Unlike older models such as RNNs and LSTMs, it does not process sentences word by word. Instead, it looks at the full sentence at once using attention. This helps the model understand context better and also makes training faster.
Transformer Model Architecture
Transformer has two parts:
- Encoder -> understands the input sentence
- Decoder -> generates the output sentence, producing new text one word at a time
Positional Encoding
Transformers do not naturally understand the order of words. To solve this problem, positional encoding is added to the word embeddings. This gives the model information about the position of each word in the sentence, so it knows which word comes first and which comes next.
Why are sine and cosine used?
- Unique for every position
- Generalizes to long sentences
- Easy for model to learn patterns
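The sinusoidal encoding from the paper can be sketched in a few lines. This is a minimal NumPy illustration (the sequence length and model dimension below are arbitrary toy values):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from the paper:
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1) positions
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2) dimension pairs
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions use cosine
    return pe

pe = positional_encoding(50, 16)
# Each row is a unique "fingerprint" for one position; it is simply
# added to that word's embedding before the first encoder layer.
```

Because every position gets a distinct combination of wavelengths, the encoding stays unique for each position and extends naturally to sentences longer than those seen in training.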
Encoder
The encoder is the part of the Transformer that understands the input. It reads the entire sentence at the same time and learns how words are related to each other. Each word can look at all other words in the sentence and decide which ones are important.
Each encoder layer contains:
- Self-Attention
- Feed Forward Neural Network
- Residual Connection + Layer Normalization
Self-Attention
Self-attention allows words within the same sentence to interact with each other: the queries (Q), keys (K), and values (V) all come from the same sentence.
- Words understand context
- Long-range dependencies are easy
- No memory loss like RNNs
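Self-attention boils down to scaled dot-product attention, where Q, K, and V are projections of the same input. A minimal NumPy sketch, using random toy matrices in place of learned weights:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: Q, K, V all come from X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how strongly each word attends to every other
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))              # 4 words, embedding dim 8 (toy values)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, w = self_attention(X, Wq, Wk, Wv)
# out: a new representation of each word, mixed from the whole sentence
```

Every word's output is a weighted mix of all words, which is why long-range dependencies cost nothing extra.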
Multi-Head Attention
Multiple attention heads are used at the same time. Each head focuses on different aspects of the sentence, such as grammar, meaning, or word distance.
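Mechanically, the heads just split the embedding dimension into slices that attend independently. A small sketch of that reshaping step (head count and sizes are illustrative):

```python
import numpy as np

def split_heads(X, num_heads):
    """Reshape (seq_len, d_model) -> (num_heads, seq_len, d_head)
    so each head attends over its own slice of the embedding."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    return X.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

X = np.arange(6 * 8, dtype=float).reshape(6, 8)  # 6 words, d_model = 8
heads = split_heads(X, num_heads=2)
# heads.shape == (2, 6, 4): two heads, each seeing 4 of the 8 dimensions
```

After each head runs its own attention, the heads are concatenated back and projected, so the layer's output size matches its input.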
Decoder
The decoder is responsible for generating the output. It produces the sentence one word at a time. When predicting a word, the decoder is not allowed to see future words; masked attention enforces this.
Each decoder layer has:
- Masked Self-Attention
- Encoder-Decoder Attention
- Feed Forward Network
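The masking in the decoder's first sub-layer is just an upper-triangular matrix of -inf added to the attention scores, so softmax assigns zero weight to future positions. A minimal sketch:

```python
import numpy as np

def causal_mask(seq_len):
    """Positions above the diagonal (future words) get -inf,
    so softmax gives them zero attention weight."""
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(upper == 1, -np.inf, 0.0)

m = causal_mask(4)
# Row 0 can only see word 0; row 3 can see words 0..3.
# This mask is added to the raw scores before softmax.
```

This is what lets the decoder be trained on whole sentences in parallel while still behaving as if it generates left to right.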
Softmax
Softmax is used to convert the model’s output scores into probabilities and choose the most likely next word.
- Converts raw scores into probabilities
- Picks the most likely next word
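The two steps above can be sketched directly. The tiny four-word vocabulary and the score values here are made up for illustration:

```python
import numpy as np

def softmax(scores):
    """Turn raw scores (logits) over the vocabulary into probabilities."""
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

vocab = ["use", "safely", "artificial", "intelligence"]  # toy vocabulary
logits = np.array([2.0, 0.5, 1.0, 0.1])                  # made-up decoder scores
probs = softmax(logits)                                   # sums to 1
next_word = vocab[int(np.argmax(probs))]                  # greedy pick
# next_word == "use"
```

Greedy argmax is the simplest choice; real systems often sample from the distribution or use beam search instead.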
Training
The Transformer model is trained on large datasets using cross-entropy loss and backpropagation. Since the architecture supports parallel processing, training is much faster compared to older models like RNNs and LSTMs.
- Uses Cross-Entropy Loss
- Trained on large datasets
- Parallel processing makes it very fast
- Backpropagation updates attention weights
Example (Translator)
Input: செயற்கை நுண்ணறிவை பாதுகாப்பாக பயன்படுத்தவும்
Steps:
- Encoder understands word relationships
- Decoder attends to relevant encoder outputs
- Output : Use artificial intelligence safely
Why are Transformers Important ?
Transformers handle long sentences well, train quickly, and scale well. They are used in modern AI systems such as ChatGPT, Google Translate, and BERT.
Final Thought
The Transformer model showed that attention alone is enough to understand language. For students learning AI, Machine Learning, or NLP, understanding this model is very important because it forms the foundation of most modern language models today.
References
[1] Vaswani, A., et al. (2017). Attention Is All You Need – https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
[2] Jay Alammar – The Illustrated Transformer – https://jalammar.github.io/illustrated-transformer/
[3] GeeksforGeeks – Transformer Attention Mechanism – https://www.geeksforgeeks.org/transformer-attention-mechanism/