
How does attention change context and embeddings?

🟦 1. What is a word embedding?

Let’s say a model reads this sentence:

The cat sat on the mat.

Each word, like “cat”, “mat”, or “sat”, is turned into a vector — imagine this as a list of numbers like:

“cat” → [0.1, -0.3, 0.9, ...]

This is called the embedding of the word — it captures the word’s meaning in a mathematical way.
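In code, this first step is just a table lookup. Here is a minimal sketch with made-up 3-dimensional vectors (real models learn vectors with hundreds or thousands of dimensions):

```python
# Toy embedding table: every word maps to one fixed vector.
# The numbers are invented for illustration; real models learn them.
embeddings = {
    "the": [0.1, 0.0, 0.2],
    "cat": [0.1, -0.3, 0.9],
    "sat": [-0.2, 0.8, 0.1],
    "on":  [0.3, 0.1, -0.1],
    "mat": [0.4, 0.2, -0.5],
}

sentence = ["the", "cat", "sat", "on", "the", "mat"]
vectors = [embeddings[w] for w in sentence]

# "cat" gets exactly the same vector in every sentence --
# this lookup knows nothing about context yet.
```

Notice that both occurrences of “the” get identical vectors: the lookup is purely per-word.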

But this initial embedding is context-blind — it knows nothing about the sentence around it. “cat” will have the same numbers whether you say:

  • “The cat sat on the mat”
  • “The cat was chased by a dog”

That’s not ideal, right? We want “cat” to understand its role in the sentence.

That’s where attention comes in.


🟨 2. How attention changes the embedding

After you get the basic embedding, attention updates it using context.

Think of attention like listening to your friends in a meeting. You have an opinion (your original embedding), but after hearing others, you might change your view slightly — depending on who you trust more.

So the word “cat” listens to:

  • “The”
  • “sat”
  • “on”
  • “the”
  • “mat”

But it doesn’t treat them equally.

It might think:

  • ā€œsatā€ is important → so give it more attention
  • ā€œtheā€ is not important → less attention

It then updates its own meaning (its embedding) based on this.
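This “listen, weigh, update” step can be sketched as dot-product attention. The sketch below uses made-up 2-d vectors and skips the learned query/key/value projections that real transformers apply, so it shows only the core idea:

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up 2-d vectors for the words in "The cat sat on the mat".
words = ["the", "cat", "sat", "on", "the", "mat"]
vecs = [[0.1, 0.0], [0.9, 0.4], [0.8, 0.5],
        [0.2, 0.1], [0.1, 0.0], [0.7, -0.2]]

def attend(i):
    """Update word i's embedding from a weighted mix of all words."""
    q = vecs[i]
    # Score = dot product: how relevant is each word to word i?
    scores = [sum(a * b for a, b in zip(q, k)) for k in vecs]
    weights = softmax(scores)  # "how much attention to give each word"
    # New embedding = attention-weighted average of all the vectors.
    new_vec = [sum(w * v[d] for w, v in zip(weights, vecs))
               for d in range(len(q))]
    return weights, new_vec

weights, new_cat = attend(words.index("cat"))
# With these toy numbers, "cat" attends more to "sat" than to "the",
# so "sat" pulls the new "cat" vector more strongly.
```

The weights always sum to 1, so the update is a blend of everyone’s vectors, tilted toward the words “cat” found most relevant.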


🟩 3. How later words impact earlier ones

Let’s say we’re now reading:

“Alice gave Bob a book, and he smiled.”

We see “he” at the end. Who does “he” refer to? Probably Bob.

Even though “Bob” came earlier, the model now updates Bob’s embedding slightly after seeing “he”.

Yes — in transformers with bidirectional attention (encoder-style models like BERT), earlier words are updated based on later ones. (Decoder-only models use a causal mask, so each word only attends to the words before it.)

So after reading the full sentence, “Bob” knows:

  • Oh! Someone referred to me later.
  • So my role in the sentence is more important now.

In math: each word gets processed multiple times through layers, and in each layer, attention allows other words (even later ones) to contribute to how this word thinks.
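That layering idea can be sketched by applying the same toy attention step (again with invented 2-d vectors and no learned projections) over and over, so every word — earlier or later — keeps absorbing context from every other word:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_layer(vectors):
    """One round of toy self-attention: every word updates itself
    from a weighted mix of ALL words, earlier and later alike."""
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) for k in vectors]
        weights = softmax(scores)
        out.append([sum(w * k[d] for w, k in zip(weights, vectors))
                    for d in range(len(q))])
    return out

# Made-up 2-d vectors for "Alice gave Bob a book and he smiled".
layer0 = [[0.9, 0.1], [0.2, 0.5], [0.8, 0.3], [0.1, 0.1],
          [0.3, 0.6], [0.4, 0.2], [0.7, 0.4], [0.2, 0.7]]
layer1 = attention_layer(layer0)   # "Bob" (index 2) mixes in "he" (index 6)
layer2 = attention_layer(layer1)   # more layers: the conversation continues
```

After one layer, “Bob” no longer has its original vector — it has already been nudged by every other word, including the later “he”.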


🧠 Think of it like this:

Imagine every word is a person in a room. When a new person (word) enters, they can:

  • Listen to others: get context
  • Speak up: impact others’ thoughts

So attention is like a conversation. The more layers, the more the conversation continues.

By the end of the sentence, everyone (every word) has a much richer understanding of what’s going on — their embeddings are now context-aware.


🧾 Simple Story Example:

Sentence:

“The bank will not lend money if the customer has bad credit.”

Let’s focus on the word “bank”. At first, the model doesn’t know if it means:

  • River bank?
  • Financial bank?

So:

  • Its initial embedding is neutral.
  • But as the model reads:
    • “lend money”
    • “customer”
    • “bad credit”

…it says:

“Aha! Now I know ‘bank’ means a financial institution.”

So the final embedding of “bank” changes, because the words that came later clarified its meaning.
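The same toy attention math shows this disambiguation. In the sketch below, the vectors are invented, with dimension 0 loosely standing for “finance-ness” and dimension 1 for “river-ness” — the point is only that one neutral “bank” vector ends up in two different places depending on its neighbours:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def contextualize(target, context):
    """Blend a word vector with its context via dot-product attention."""
    all_vecs = [target] + context
    scores = [sum(a * b for a, b in zip(target, k)) for k in all_vecs]
    weights = softmax(scores)
    return [sum(w * k[d] for w, k in zip(weights, all_vecs))
            for d in range(len(target))]

bank = [0.5, 0.5]                       # neutral: river or finance?
finance_ctx = [[0.9, 0.1], [0.8, 0.2]]  # toy "lend money", "bad credit"
river_ctx   = [[0.1, 0.9], [0.2, 0.8]]  # toy "river", "water"

bank_in_finance = contextualize(bank, finance_ctx)
bank_by_river   = contextualize(bank, river_ctx)
# Same starting vector, two different contextual embeddings.
```

One static “bank” vector goes in; two context-aware vectors come out, pulled toward whichever neighbours it attended to.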


šŸ” In Short:

  • Every word starts with a default meaning (embedding).
  • As the sentence is read, attention lets each word talk to others.
  • Each word adjusts its meaning based on what others are saying.
  • This makes word understanding contextual, not just dictionary-based.
