Understanding Transformers Part 8: Shared Weights in Self-Attention

In the previous article, we started calculating the self-attention values.

Let’s now calculate the self-attention values for the word “go”.

We do not need to recalculate the keys and values.

Instead, we only need to create the query that represents the word “go”, and then perform the same calculations as before.

After completing the calculations, we get the self-attention values for “go” as:

2.5 and -2.1
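The reuse of keys and values can be sketched in a few lines of NumPy. This is a minimal illustration with made-up embeddings and weight matrices, not the actual numbers from the article's running example, so the output will differ from the 2.5 and -2.1 above. It includes the usual scaling of the scores by the square root of the key dimension.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Toy 2-d embeddings for "Let's" and "go" (hypothetical values,
# not the ones used in this article series).
embeddings = np.array([[1.16, 0.23],
                       [0.57, 1.36]])

# One shared weight matrix each for queries, keys, and values.
W_q = np.array([[0.54, -0.17], [0.93,  0.88]])
W_k = np.array([[0.15,  0.22], [0.63, -0.78]])
W_v = np.array([[0.62,  0.61], [0.26, -0.45]])

# The keys and values were already computed for all words; reuse them.
K = embeddings @ W_k
V = embeddings @ W_v

# Only the query for "go" (the second word) is new.
q_go = embeddings[1] @ W_q

# Same calculation as before: scores -> softmax -> weighted sum of values.
scores = q_go @ K.T / np.sqrt(K.shape[1])
attn_go = softmax(scores) @ V
```

With real trained weights, `attn_go` would hold the two self-attention values for "go".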

Key Observations About Self-Attention

  • The weights used to calculate queries are the same for both “Let’s” and “go”.

    • This means that no matter how many words are given as input, the transformer uses one shared set of query weights.
    • Similarly, one shared set of weights is reused to calculate the keys, and another to calculate the values, for every input word.
  • We do not need to compute queries, keys, and values sequentially.

    • All of them can be computed at the same time.
    • This allows transformers to take advantage of parallel computation, making them very efficient.
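The two observations above can be shown together in one short sketch (hypothetical sizes and random embeddings for illustration): because the query, key, and value weights are each a single shared matrix, one matrix multiplication produces the queries, keys, and values for every input word at once, with no per-word loop.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d = 5, 2                      # any number of words, 2-d embeddings
X = rng.normal(size=(n_words, d))      # embeddings for all input words

# One weight matrix each, regardless of how many words there are.
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))

# Single matrix multiplies compute the queries, keys, and values
# for every word simultaneously -- this is the parallelism.
Q = X @ W_q
K = X @ W_k
V = X @ W_v
```

Adding a sixth word only adds a row to `X`; the three weight matrices stay exactly the same.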

We will continue building our transformer step by step in the next article.

Looking for an easier way to install tools, libraries, or entire repositories?
Try Installerpedia: a community-driven, structured installation platform that lets you install almost anything with minimal hassle and clear, reliable guidance.

Just run:

ipm install repo-name

… and you’re done! 🚀


🔗 Explore Installerpedia here
