How does attention change context and embeddings?
1. What is a word embedding?
Let's say a model reads this sentence:
"The cat sat on the mat."
Each word like "cat", "mat", "sat" is turned into a vector. Imagine this as a list of numbers like:
```
"cat" → [0.1, -0.3, 0.9, ...]
```
This is called the embedding of the word: it captures the word's meaning in a mathematical way.
But this initial embedding is dumb: it doesn't know the sentence. "cat" will have the same numbers whether you say:
- "The cat sat on the mat"
- "The cat was chased by a dog"
That's not ideal, right? We want "cat" to understand its role in the sentence.
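Here's a tiny sketch of that static lookup (the table and its numbers are made up for illustration): because the embedding comes from a plain dictionary lookup, "cat" gets exactly the same vector in both sentences.

```python
# Toy, hand-made embedding table (the words and numbers are invented for illustration).
EMBEDDINGS = {
    "the": [0.0, 0.2, -0.1],
    "cat": [0.1, -0.3, 0.9],
    "sat": [0.5, 0.4, 0.0],
    "on": [0.2, 0.0, 0.3],
    "mat": [0.1, -0.2, 0.8],
    "was": [0.0, 0.1, 0.1],
    "chased": [0.7, 0.3, -0.4],
    "by": [0.1, 0.0, 0.2],
    "a": [0.0, 0.1, 0.0],
    "dog": [0.6, -0.1, 0.5],
}

def embed(sentence):
    # A plain lookup: each word maps to the same vector every time, no matter the sentence.
    return [EMBEDDINGS[word] for word in sentence.lower().split()]

s1 = embed("the cat sat on the mat")
s2 = embed("the cat was chased by a dog")

# "cat" is the second word in both sentences, and its vector is identical:
print(s1[1] == s2[1])  # True -- context has no effect yet
```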
That's where attention comes in.
2. How attention changes the embedding
After you get the basic embedding, attention updates it using context.
Think of attention like listening to your friends in a meeting. You have an opinion (your original embedding), but after hearing others, you might change your view slightly, depending on who you trust more.
So the word "cat" listens to:
- "The"
- "sat"
- "on"
- "the"
- "mat"
But it doesn't treat them equally.
It might think:
- "sat" is important, so give it more attention
- "the" is not important, so give it less attention
It then updates its own meaning (its embedding) based on this.
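Here's a minimal sketch of that update as scaled dot-product attention, for the single word "cat". The embeddings and projection matrices are random stand-ins (in a real model they are learned), but the mechanics are the same: score every word, turn the scores into weights with a softmax, and mix everyone's value vectors using those weights.

```python
import numpy as np

np.random.seed(0)
words = ["the", "cat", "sat", "on", "the", "mat"]
d = 4  # tiny embedding size, just for illustration

# Stand-ins: x would be the word embeddings, and the W matrices would be learned.
x = np.random.randn(len(words), d)
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# How much should "cat" (position 1) listen to every word in the sentence?
scores = Q[1] @ K.T / np.sqrt(d)                  # one score per word
weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights

# The new "cat" vector is a weighted mix of everyone's value vectors.
new_cat = weights @ V

print(list(zip(words, weights.round(2))))  # which words "cat" listened to most
print(new_cat)                             # the context-aware update for "cat"
```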
3. How later words impact earlier ones
Let's say we're now reading:
"Alice gave Bob a book, and he smiled."
We see "he" at the end. Who does "he" refer to? Probably Bob.
Even though "Bob" came earlier, the model now updates Bob's embedding slightly after seeing "he".
Yes: in transformers with bidirectional attention (encoder-style models like BERT), earlier words can be updated based on later ones.
So after reading the full sentence, "Bob" knows:
- Oh! Someone referred to me later.
- So my role in the sentence is more important now.
In math: each word gets processed multiple times through layers, and in each layer, attention allows other words (even later ones) to contribute to how this word thinks.
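Here's a sketch of that idea, stacking a couple of simplified self-attention layers over the Alice-and-Bob sentence. The vectors and projections are random stand-ins, and the attention matrix is left unmasked (encoder-style), so the row for "Bob" has a nonzero weight on the later word "he", and that influence keeps propagating layer after layer.

```python
import numpy as np

np.random.seed(1)
words = ["Alice", "gave", "Bob", "a", "book", "and", "he", "smiled"]
d = 8
x = np.random.randn(len(words), d)  # stand-in embeddings

def attention_layer(h):
    # One simplified self-attention layer with random (untrained) projections.
    W_q, W_k, W_v = (np.random.randn(d, d) / np.sqrt(d) for _ in range(3))
    Q, K, V = h @ W_q, h @ W_k, h @ W_v
    scores = Q @ K.T / np.sqrt(d)                      # every word vs. every word
    A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
    return A @ V, A                                    # updated vectors + attention weights

h = x
for _ in range(2):          # stack a couple of layers; real models use many more
    h, A = attention_layer(h)

bob, he = words.index("Bob"), words.index("he")
# Because the matrix is not masked, "Bob" attends to the later word "he":
print(A[bob, he] > 0)       # True
print(h[bob])               # Bob's vector now carries information from "he"
```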
Think of it like this:
Imagine every word is a person in a room. When a new person (word) enters, they can:
- Listen to others: get context
- Speak up: impact others’ thoughts
So attention is like a conversation. The more layers, the more the conversation continues.
By the end of the sentence, everyone (every word) has a much richer understanding of what's going on; their embeddings are now context-aware.
Simple Story Example:
Sentence:
"The bank will not lend money if the customer has bad credit."
Let's focus on the word "bank". At first, the model doesn't know if it means:
- River bank?
- Financial bank?
So:
- Its initial embedding is neutral.
- But as the model reads:
- "lend money"
- "customer"
- "bad credit"
…it says:
"Aha! Now I know 'bank' means a financial institution."
So the final embedding of "bank" changes, because the words that came later clarified its meaning.
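You can see this with a real contextual model. The sketch below assumes the Hugging Face transformers and torch packages and the bert-base-uncased checkpoint; the river-bank and loan sentences are extra examples for comparison. The two financial uses of "bank" should come out more similar to each other than to the river use.

```python
# Sketch: compare contextual embeddings of "bank" across sentences.
# Assumes `transformers`, `torch`, and a downloadable bert-base-uncased checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Return the contextual embedding of the token "bank" in this sentence.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (num_tokens, hidden_size)
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    position = (inputs["input_ids"][0] == bank_id).nonzero()[0].item()
    return hidden[position]

financial = bank_vector("The bank will not lend money if the customer has bad credit.")
river = bank_vector("We had a picnic on the bank of the river.")
loan = bank_vector("The bank approved the customer's loan application.")

cos = torch.nn.functional.cosine_similarity
# The two financial uses should be closer to each other than to the river use.
print(cos(financial, loan, dim=0).item())
print(cos(financial, river, dim=0).item())
```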
In Short:
- Every word starts with a default meaning (embedding).
- As the sentence is read, attention lets each word talk to others.
- Each word adjusts its meaning based on what others are saying.
- This makes word understanding contextual, not just dictionary-based.