How does attention change context and embeddings?
1. What is a word embedding?
Let's say a model reads this sentence:
"The cat sat on the mat."
Each word like "cat", "mat", "sat" is turned into a vector. Imagine this as a list of numbers like:
```
"cat" → [0.1, -0.3, 0.9, ...]
```
This is called the embedding of the word: it captures the word's meaning in a mathematical way.
But this initial embedding is dumb: it doesn't know the sentence. "cat" will have the same numbers whether you say:
- "The cat sat on the mat"
- "The cat was chased by a dog"
That's not ideal, right? We want "cat" to understand its role in the sentence.
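Here's a tiny sketch of that static lookup (the table and its numbers are made up for illustration): because the embedding comes from a plain dictionary lookup, "cat" gets exactly the same vector in both sentences.

```python
# Toy, hand-made embedding table (the words and numbers are invented for illustration).
EMBEDDINGS = {
    "the": [0.0, 0.2, -0.1],
    "cat": [0.1, -0.3, 0.9],
    "sat": [0.5, 0.4, 0.0],
    "on": [0.2, 0.0, 0.3],
    "mat": [0.1, -0.2, 0.8],
    "was": [0.0, 0.1, 0.1],
    "chased": [0.7, 0.3, -0.4],
    "by": [0.1, 0.0, 0.2],
    "a": [0.0, 0.1, 0.0],
    "dog": [0.6, -0.1, 0.5],
}

def embed(sentence):
    # A plain lookup: each word maps to the same vector every time, no matter the sentence.
    return [EMBEDDINGS[word] for word in sentence.lower().split()]

s1 = embed("the cat sat on the mat")
s2 = embed("the cat was chased by a dog")

# "cat" is the second word in both sentences, and its vector is identical:
print(s1[1] == s2[1])  # True -- context has no effect yet
```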
That's where attention comes in.
2. How attention changes the embedding
After you get the basic embedding, attention updates it using context.
Think of attention like listening to your friends in a meeting. You have an opinion (your original embedding), but after hearing others, you might change your view slightly, depending on who you trust more.
So the word "cat" listens to:
- "The"
- "sat"
- "on"
- "the"
- "mat"
But it doesn't treat them equally.
It might think:
- "sat" is important, so give it more attention
- "the" is not important, so give it less attention
It then updates its own meaning (its embedding) based on this.
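Here's a minimal sketch of that update as scaled dot-product attention, for the single word "cat". The embeddings and projection matrices are random stand-ins (in a real model they are learned), but the mechanics are the same: score every word, turn the scores into weights with a softmax, and mix everyone's value vectors using those weights.

```python
import numpy as np

np.random.seed(0)
words = ["the", "cat", "sat", "on", "the", "mat"]
d = 4  # tiny embedding size, just for illustration

# Stand-ins: x would be the word embeddings, and the W matrices would be learned.
x = np.random.randn(len(words), d)
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# How much should "cat" (position 1) listen to every word in the sentence?
scores = Q[1] @ K.T / np.sqrt(d)                  # one score per word
weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights

# The new "cat" vector is a weighted mix of everyone's value vectors.
new_cat = weights @ V

print(list(zip(words, weights.round(2))))  # which words "cat" listened to most
print(new_cat)                             # the context-aware update for "cat"
```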
3. How later words impact earlier ones
Let's say we're now reading:
"Alice gave Bob a book, and he smiled."
We see "he" at the end. Who does "he" refer to? Probably Bob.
Even though "Bob" came earlier, the model now updates Bob's embedding slightly after seeing "he".
Yes: in transformers with bidirectional attention (encoder-style models like BERT), earlier words can be updated based on later ones.
So after reading the full sentence, "Bob" knows:
- Oh! Someone referred to me later.
- So my role in the sentence is more important now.
In math: each word gets processed multiple times through layers, and in each layer, attention allows other words (even later ones) to contribute to how this word thinks.
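Here's a sketch of that idea, stacking a couple of simplified self-attention layers over the Alice-and-Bob sentence. The vectors and projections are random stand-ins, and the attention matrix is left unmasked (encoder-style), so the row for "Bob" has a nonzero weight on the later word "he", and that influence keeps propagating layer after layer.

```python
import numpy as np

np.random.seed(1)
words = ["Alice", "gave", "Bob", "a", "book", "and", "he", "smiled"]
d = 8
x = np.random.randn(len(words), d)  # stand-in embeddings

def attention_layer(h):
    # One simplified self-attention layer with random (untrained) projections.
    W_q, W_k, W_v = (np.random.randn(d, d) / np.sqrt(d) for _ in range(3))
    Q, K, V = h @ W_q, h @ W_k, h @ W_v
    scores = Q @ K.T / np.sqrt(d)                      # every word vs. every word
    A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
    return A @ V, A                                    # updated vectors + attention weights

h = x
for _ in range(2):          # stack a couple of layers; real models use many more
    h, A = attention_layer(h)

bob, he = words.index("Bob"), words.index("he")
# Because the matrix is not masked, "Bob" attends to the later word "he":
print(A[bob, he] > 0)       # True
print(h[bob])               # Bob's vector now carries information from "he"
```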
Think of it like this:
Imagine every word is a person in a room. When a new person (word) enters, they can:
- Listen to others: get context
- Speak up: impact others’ thoughts
So attention is like a conversation. The more layers, the more the conversation continues.
By the end of the sentence, everyone (every word) has a much richer understanding of what's going on; their embeddings are now context-aware.
Simple Story Example:
Sentence:
"The bank will not lend money if the customer has bad credit."
Let's focus on the word "bank". At first, the model doesn't know if it means:
- River bank?
- Financial bank?
So:
- Its initial embedding is neutral.
- But as the model reads:
- "lend money"
- "customer"
- "bad credit"
…it says:
"Aha! Now I know 'bank' means a financial institution."
So the final embedding of "bank" changes, because the words that came later clarified its meaning.
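You can see this with a real contextual model. The sketch below assumes the Hugging Face transformers and torch packages and the bert-base-uncased checkpoint; the river-bank and loan sentences are extra examples for comparison. The two financial uses of "bank" should come out more similar to each other than to the river use.

```python
# Sketch: compare contextual embeddings of "bank" across sentences.
# Assumes `transformers`, `torch`, and a downloadable bert-base-uncased checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Return the contextual embedding of the token "bank" in this sentence.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (num_tokens, hidden_size)
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    position = (inputs["input_ids"][0] == bank_id).nonzero()[0].item()
    return hidden[position]

financial = bank_vector("The bank will not lend money if the customer has bad credit.")
river = bank_vector("We had a picnic on the bank of the river.")
loan = bank_vector("The bank approved the customer's loan application.")

cos = torch.nn.functional.cosine_similarity
# The two financial uses should be closer to each other than to the river use.
print(cos(financial, loan, dim=0).item())
print(cos(financial, river, dim=0).item())
```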
In Short:
- Every word starts with a default meaning (embedding).
- As the sentence is read, attention lets each word talk to others.
- Each word adjusts its meaning based on what others are saying.
- This makes word understanding contextual, not just dictionary-based.