Attention Is All You Need and Why It Changed the Way Machines Learn
Before 2017, models like RNNs and LSTMs processed text one word at a time, which meant:
- Slow training
- Poor long-term memory
- No parallel processing

Everything changed with Google's paper “Attention Is All You Need”. Instead of reading sequentially, the model looks at all words at once and focuses on what matters most using attention. This idea led to the Transformer, the foundation of modern models like GPT and BERT. In this article, we’ll quickly walk through the paper and how Transformers work.
Transformer
The Transformer model is mainly used to understand and generate human language. Unlike older models such as RNNs and LSTMs, it does not process sentences word by word. Instead, it looks at the full sentence at once using attention. This helps the model understand context better and also makes training faster.
Transformer Model Architecture
Transformer has two parts:
- Encoder -> understands the input sentence
- Decoder -> generates the output sentence, producing new text one word at a time
Positional Encoding
Transformers do not naturally understand the order of words. To solve this problem, positional encoding is added to the word embeddings. This gives the model information about the position of each word in the sentence, so it knows which word comes first and which comes next.
Why are sine and cosine used?
- Unique for every position
- Generalizes to long sentences
- Easy for model to learn patterns
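The sinusoidal encoding from the paper can be sketched in a few lines. This is a minimal NumPy illustration (the sequence length and model dimension below are arbitrary toy values):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from the paper:
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1) positions
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2) dimension pairs
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions use cosine
    return pe

pe = positional_encoding(50, 16)
# Each row is a unique "fingerprint" for one position; it is simply
# added to that word's embedding before the first encoder layer.
```

Because every position gets a distinct combination of wavelengths, the encoding stays unique for each position and extends naturally to sentences longer than those seen in training.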
Encoder
The encoder is the part of the Transformer that understands the input. It reads the entire sentence at the same time and learns how words are related to each other. Each word can look at all other words in the sentence and decide which ones are important.
Each encoder layer contains:
- Self-Attention
- Feed Forward Neural Network
- Residual Connection + Layer Normalization
Self-Attention
Self-attention allows words within the same sentence to interact with each other: the queries (Q), keys (K), and values (V) all come from the same sentence.
- Words understand context
- Long-range dependencies are easy
- No memory loss like RNNs
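Self-attention boils down to scaled dot-product attention, where Q, K, and V are projections of the same input. A minimal NumPy sketch, using random toy matrices in place of learned weights:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: Q, K, V all come from X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how strongly each word attends to every other
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))              # 4 words, embedding dim 8 (toy values)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, w = self_attention(X, Wq, Wk, Wv)
# out: a new representation of each word, mixed from the whole sentence
```

Every word's output is a weighted mix of all words, which is why long-range dependencies cost nothing extra.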
Multi-Head Attention
Multiple attention heads are used at the same time. Each head focuses on different aspects of the sentence, such as grammar, meaning, or word distance.
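Mechanically, the heads just split the embedding dimension into slices that attend independently. A small sketch of that reshaping step (head count and sizes are illustrative):

```python
import numpy as np

def split_heads(X, num_heads):
    """Reshape (seq_len, d_model) -> (num_heads, seq_len, d_head)
    so each head attends over its own slice of the embedding."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    return X.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

X = np.arange(6 * 8, dtype=float).reshape(6, 8)  # 6 words, d_model = 8
heads = split_heads(X, num_heads=2)
# heads.shape == (2, 6, 4): two heads, each seeing 4 of the 8 dimensions
```

After each head runs its own attention, the heads are concatenated back and projected, so the layer's output size matches its input.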
Decoder
The decoder is responsible for generating the output. It produces the sentence one word at a time. When predicting a word, the decoder is not allowed to see future words; masked attention enforces this.
Each decoder layer has:
- Masked Self-Attention
- Encoder-Decoder Attention
- Feed Forward Network
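The masking in the decoder's first sub-layer is just an upper-triangular matrix of -inf added to the attention scores, so softmax assigns zero weight to future positions. A minimal sketch:

```python
import numpy as np

def causal_mask(seq_len):
    """Positions above the diagonal (future words) get -inf,
    so softmax gives them zero attention weight."""
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(upper == 1, -np.inf, 0.0)

m = causal_mask(4)
# Row 0 can only see word 0; row 3 can see words 0..3.
# This mask is added to the raw scores before softmax.
```

This is what lets the decoder be trained on whole sentences in parallel while still behaving as if it generates left to right.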
Softmax
Softmax is used to convert the model’s output scores into probabilities and choose the most likely next word.
- Converts raw scores into probabilities
- Picks the most likely next word
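The two steps above can be sketched directly. The tiny four-word vocabulary and the score values here are made up for illustration:

```python
import numpy as np

def softmax(scores):
    """Turn raw scores (logits) over the vocabulary into probabilities."""
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

vocab = ["use", "safely", "artificial", "intelligence"]  # toy vocabulary
logits = np.array([2.0, 0.5, 1.0, 0.1])                  # made-up decoder scores
probs = softmax(logits)                                   # sums to 1
next_word = vocab[int(np.argmax(probs))]                  # greedy pick
# next_word == "use"
```

Greedy argmax is the simplest choice; real systems often sample from the distribution or use beam search instead.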
Training
The Transformer model is trained on large datasets using cross-entropy loss and backpropagation. Since the architecture supports parallel processing, training is much faster compared to older models like RNNs and LSTMs.
- Uses Cross-Entropy Loss
- Trained on large datasets
- Parallel processing makes it very fast
- Backpropagation updates attention weights
Example (Translator)
Input: செயற்கை நுண்ணறிவை பாதுகாப்பாக பயன்படுத்தவும்
Steps:
- Encoder understands word relationships
- Decoder attends to relevant encoder outputs
- Output : Use artificial intelligence safely
Why are Transformers Important ?
Transformers handle long sentences well, train quickly, and scale well. They are used in modern AI systems such as ChatGPT, Google Translate, and BERT.
Final Thought
The Transformer model showed that attention alone is enough to understand language. For students learning AI, Machine Learning, or NLP, understanding this model is very important because it forms the foundation of most modern language models today.
References
[1] Vaswani, A., et al. (2017). Attention Is All You Need – https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
[2] Jay Alammar – The Illustrated Transformer – https://jalammar.github.io/illustrated-transformer/
[3] GeeksforGeeks – Transformer Attention Mechanism – https://www.geeksforgeeks.org/transformer-attention-mechanism/