26. Attention Mechanism and Self-Attention

Written 2026. 6. 12. · Updated 2026. 6. 12.

Sequence-to-Sequence Models

Focusing on neural machine translation

Machine Translation

  • Machine Translation (MT)
    • The task of translating a sentence $x$ in the source language into a sentence $y$ in the target language
  • The early history of MT: 1950s
    • Began in the early 1950s, before the term "A.I." was coined
    • Mostly simple rule-based systems that performed word substitution
    • Lacked an understanding of natural-language syntax, semantics, and pragmatics

Neural Machine Translation

  • Neural Machine Translation (NMT)
    • An approach that performs machine translation with a single end-to-end neural network
    • The neural network architecture is called a sequence-to-sequence model (a.k.a. seq2seq) and consists of two RNNs

Sequence-to-Sequence is Versatile!

  • The general notion is an encoder-decoder model
    • One neural network takes the input and produces a neural representation
    • Another network produces the output based on that neural representation
    • When both the input and the output are sequences, we call it a seq2seq model
  • Sequence-to-sequence is useful for many NLP (natural language processing) tasks beyond MT (machine translation)
    • Summarization (long text → short text)
    • Dialogue (previous utterances → next utterance)
    • Code generation (natural language → Python code)

Neural Machine Translation (NMT)

  • The sequence-to-sequence model is an example of a conditional language model
    • It is a language model because the decoder predicts the next word of the target sentence $y$
    • It is conditional because its predictions are conditioned on the source sentence $x$
  • NMT directly computes $P(y|x)$

$$P(y|x) = P(y_1|x) \, P(y_2|y_1, x) \, P(y_3|y_1, y_2, x) \dots P(y_T|y_1, \dots, y_{T-1}, x)$$
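The factorization of $P(y|x)$ into per-step conditionals can be illustrated numerically. The following is a minimal sketch, assuming the decoder has already produced the probability of the correct target word at each step; the numbers in `step_probs` and the helper name `sequence_log_prob` are made up for illustration and do not come from the lecture.

```python
import math

def sequence_log_prob(step_probs):
    """Return log P(y|x) as the sum of log P(y_t | y_1..y_{t-1}, x)."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical decoder outputs: the probability assigned to the correct
# target word at each of T = 3 steps (illustrative values only).
step_probs = [0.5, 0.4, 0.25]

log_p = sequence_log_prob(step_probs)
p = math.exp(log_p)  # same as multiplying the three conditionals together
```

Working in log space is a common choice here because a product of many small probabilities underflows quickly, while a sum of logs does not.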

  • How to train an MT system
    • Obtain a large parallel corpus
    • There is also interesting research on unsupervised NMT, data augmentation, and so on

Training a Neural Machine Translation System

Introduction to Attention

The Bottleneck Problem of Seq2Seq

Attention

  • Attention (mechanism) provides a solution to the bottleneck problem
    • Core idea: at each step of the decoder, use a direct connection to the encoder to focus on a particular part of the source sequence

Sequence-to-Sequence with Attention


Attention: In Equations

  • Procedure
    1. We have encoder hidden states $h_1, \dots, h_N \in \mathbb{R}^h$
    2. At timestep $t$, we have the decoder hidden state $s_t \in \mathbb{R}^h$
    3. Compute the attention scores $e^t$ for this step

    $$e^t = [s_t^T h_1, \dots, s_t^T h_N] \in \mathbb{R}^N$$

    4. Take the softmax to get the attention distribution $\alpha^t$ for this step (a probability distribution that sums to 1)

    $$\alpha^t = \text{softmax}(e^t) \in \mathbb{R}^N$$

    5. Use $\alpha^t$ to compute the attention output $a_t$, a weighted sum of the encoder hidden states

    $$a_t = \sum_{i=1}^{N} \alpha_i^t h_i \in \mathbb{R}^h$$

    6. Finally, concatenate the attention output $a_t$ with the decoder hidden state $s_t$ and proceed as in the non-attention seq2seq model

    $$[a_t; s_t] \in \mathbb{R}^{2h}$$
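The procedure above can be sketched for a single decoder timestep in NumPy. This is a minimal sketch under stated assumptions: the encoder states `H` and decoder state `s_t` are random stand-ins rather than the outputs of trained RNNs, and the function names `softmax` and `attention_step` are chosen here for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def attention_step(H, s_t):
    """One decoder step of dot-product attention.

    H   : (N, h) encoder hidden states h_1..h_N, one per row
    s_t : (h,)   decoder hidden state at timestep t
    """
    e_t = H @ s_t                 # attention scores, e^t_i = s_t^T h_i
    alpha_t = softmax(e_t)        # attention distribution, sums to 1
    a_t = alpha_t @ H             # attention output: weighted sum of h_i
    return alpha_t, a_t, np.concatenate([a_t, s_t])  # [a_t; s_t]

# Random stand-ins for real hidden states (assumption for illustration).
rng = np.random.default_rng(0)
N, h = 5, 4
H = rng.normal(size=(N, h))
s_t = rng.normal(size=h)

alpha_t, a_t, combined = attention_step(H, s_t)
```

The concatenated vector `combined` has size $2h$ and would be fed into whatever output layer the non-attention decoder used on $s_t$ alone.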

Attention is Great!

  • Attention significantly improves NMT performance
    • Letting the decoder focus on particular parts of the source is very useful
  • Attention provides a more "human-like" model of the MT process
    • You can look back at the source sentence while translating, rather than having to remember all of it
  • Attention solves the bottleneck problem
    • It lets the decoder look directly at the source, bypassing the bottleneck
  • Attention helps with the vanishing gradient problem
    • It provides a shortcut to faraway states
  • Attention provides some interpretability
    • By inspecting the attention distribution, we can see what the decoder was focusing on
    • We get (soft) alignment for free
    • Interestingly, the network learned alignment by itself even though an alignment system was never explicitly trained

Attention is A General Deep Learning Technique

  • We have seen that attention is a great way to improve the sequence-to-sequence model for machine translation
    • However, attention can be used in many architectures (not just seq2seq) and for many tasks (not just MT)
  • A more general definition of attention
    • Given a set of vector values and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query
  • We sometimes say that the query attends to the values
    • Example: in the seq2seq + attention model, each decoder hidden state (query) attends to all the encoder hidden states (values)
  • Intuition
    • The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on
    • Attention is a way to obtain a fixed-size representation of an arbitrary set of representations (the values), dependent on some other representation (the query)
  • Conclusion
    • Attention has become a powerful, flexible, and general way to manipulate pointers and memory in all deep learning models
    • It is a new idea that came out of NMT after 2010

From RNN to Attention-Based NLP Models Self-Attention

Self-Attention

As of Last Lecture: Recurrent Models for (Most) NLP

  • Circa 2016-2018, the de facto strategy in NLP was to encode sentences with a bidirectional LSTM
    • Example: the source sentence in translation
  • Then define the output (translation, sentence, summary) as a sequence and use an LSTM to generate it
  • Use attention to allow flexible access to memory

Same Goals, Different Building Blocks

  • We learned about sequence-to-sequence problems and encoder-decoder models
    • For now, the goal is not to motivate an entirely new way of looking at these problems
    • Instead, the goal is to find the best building blocks to plug into our models and enable broad progress

Issues with Recurrent Models: Linear Interaction Distance

  • RNNs are unrolled "left-to-right" (they consume and process the input in that order)
    • This encodes linear locality, which is a useful heuristic
    • Nearby words often affect each other's meanings
  • Problem
    • RNNs take $O(\text{sequence length})$ steps for distant word pairs to interact

Issues with Recurrent Models: Linear Interaction Distance

  • Needing $O(\text{sequence length})$ steps for distant word pairs to interact means:
    • Long-distance dependencies are hard to learn (because of gradient problems)
    • The linear order of words is "baked in"
      • But linear order may not be the best way to think about sentences

Issues with Recurrent Models: Lack of Parallelizability

  • Forward and backward passes contain $O(\text{sequence length})$ unparallelizable operations
    • GPUs can perform many independent computations at once
    • But future RNN hidden states cannot be fully computed until past RNN hidden states have been computed
    • This inhibits training on very large datasets
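The sequential dependency described above shows up directly in code as a loop that cannot be vectorized over time. The following is a minimal sketch, assuming a vanilla RNN with update $h_t = \tanh(W_h h_{t-1} + W_x x_t)$; the update rule, the weight names `W_h` and `W_x`, and the random inputs are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

def rnn_forward(X, W_h, W_x, h0):
    """Vanilla RNN forward pass (assumed form, for illustration).

    X  : (T, d) input vectors x_1..x_T
    h0 : (k,)   initial hidden state
    Returns a (T, k) array of hidden states.
    """
    h = h0
    states = []
    for x_t in X:
        # Step t reads h from step t-1, so the T iterations must run
        # one after another; they cannot be computed in parallel.
        h = np.tanh(W_h @ h + W_x @ x_t)
        states.append(h)
    return np.stack(states)

# Random stand-ins for inputs and weights (assumption for illustration).
rng = np.random.default_rng(0)
T, d, k = 6, 3, 4
X = rng.normal(size=(T, d))
W_h = rng.normal(size=(k, k)) * 0.1
W_x = rng.normal(size=(k, d)) * 0.1

states = rnn_forward(X, W_h, W_x, np.zeros(k))
```

Each matrix product inside the loop can still use the GPU, but the number of loop iterations grows with the sequence length.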

If Not Recurrence, Then What? How About Attention?

  • Attention treats each word's representation as a query to access and incorporate information from a set of values
    • We saw attention from the decoder to the encoder; now consider attention within a single sentence → self-attention!
    • With attention, the number of unparallelizable operations does not grow with sequence length
    • Maximum interaction distance: $O(1)$, since all words interact at every layer

Attention as a Soft, Averaging Lookup Table

  • We can think of attention as performing a fuzzy lookup in a key-value store
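The contrast between an exact lookup and a fuzzy one can be sketched as follows. This is a minimal sketch under stated assumptions: the toy `table`, `keys`, `values`, and `query` are invented for illustration, and dot-product similarity followed by softmax is assumed as the matching rule.

```python
import numpy as np

def hard_lookup(table, query):
    """Ordinary lookup: the query matches exactly one key."""
    return table[query]

def soft_lookup(keys, values, query):
    """Fuzzy lookup: the query matches every key to a degree in (0, 1),
    and the result is the weighted average of all values."""
    scores = keys @ query                    # similarity to each key
    weights = np.exp(scores - scores.max())  # stable softmax numerator
    weights = weights / weights.sum()
    return weights @ values, weights

# Toy data (assumption for illustration).
table = {"a": 1.0, "b": 2.0}
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[10.0], [20.0], [40.0]])
query = np.array([5.0, 0.0])

exact = hard_lookup(table, "b")
out, weights = soft_lookup(keys, values, query)
```

Here the query is equally similar to the first and third keys, so those two values dominate the average while the second value contributes very little.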

Self-Attention Hypothetical Example

Self-Attention: Keys, Queries, Values from the Same Sequence

  • Let $w_{1:n}$ be a sequence of words in vocabulary $V$
    • Example: "Zuko made his uncle tea"
  • For each $w_i$, let $x_i = Ew_i$, where $E \in \mathbb{R}^{d \times V}$ is an embedding matrix
  1. Transform each word embedding with weight matrices $Q, K, V$, each in $\mathbb{R}^{d \times d}$

$$q_i = Qx_i \text{ (queries)}, \quad k_i = Kx_i \text{ (keys)}, \quad v_i = Vx_i \text{ (values)}$$

  2. Compute pairwise similarities between keys and queries, and normalize with softmax

$$e_{i,j} = q_i \cdot k_j, \quad \alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k} \exp(e_{i,k})}$$

  3. Compute the output for each word as a weighted sum of the values

$$o_i = \sum_{j} \alpha_{i,j} v_j$$
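The three steps above can be sketched in matrix form with NumPy. This is a minimal sketch, assuming random matrices in place of learned embeddings and learned weights $Q, K, V$; the function names and the dimension `d = 8` are choices made here for illustration.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise numerically stable softmax."""
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def self_attention(X, Q, K, V):
    """X: (n, d), row i is x_i. Q, K, V: (d, d) weight matrices."""
    queries = X @ Q.T            # row i is q_i = Q x_i
    keys = X @ K.T               # row i is k_i = K x_i
    values = X @ V.T             # row i is v_i = V x_i
    scores = queries @ keys.T    # scores[i, j] = e_{i,j} = q_i . k_j
    alpha = softmax_rows(scores) # alpha[i, j], each row sums to 1
    return alpha @ values, alpha # row i of the output is o_i

words = ["Zuko", "made", "his", "uncle", "tea"]
rng = np.random.default_rng(0)
n, d = len(words), 8
# Random stand-ins for the embeddings x_i = E w_i (assumption).
X = rng.normal(size=(n, d))
Q, K, V = (rng.normal(size=(d, d)) for _ in range(3))

O, alpha = self_attention(X, Q, K, V)
```

All $n$ outputs come from a handful of matrix products, with no loop over positions, which is what makes the computation parallelizable across the sequence.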

Barriers & Solutions for Self-Attention as A Building Block

  • Barrier: no inherent notion of order
    • Solution: add position representations to the inputs
  • Barrier: no nonlinearities for deep learning (it is all just weighted averages)
    • Solution: apply the same feedforward network to each self-attention output
  • Barrier: need to ensure we do not look at the future when predicting a sequence
    • As in machine translation
    • Or language modeling
    • Solution: mask out the future by artificially setting attention weights to 0
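The masking solution for the third barrier can be sketched as follows. This is a minimal sketch under one assumption worth stating loudly: instead of zeroing the weights after the softmax, it sets the scores of future positions to $-\infty$ before the softmax, which is a common way to end up with attention weights of exactly 0 on the future while keeping each row normalized. The function name and the random scores are illustrative.

```python
import numpy as np

def causal_attention_weights(scores):
    """scores: (n, n) with scores[i, j] = e_{i,j}.

    Returns attention weights where position i puts zero weight
    on every future position j > i.
    """
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True where j > i
    masked = np.where(future, -np.inf, scores)          # hide the future
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)                            # exp(-inf) = 0
    return weights / weights.sum(axis=-1, keepdims=True)

# Random stand-in scores (assumption for illustration).
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))

alpha = causal_attention_weights(scores)
```

The first position can only attend to itself, so its weight on itself is 1, and each later position distributes its weight over itself and the positions before it.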
Last updated: 26. 6. 12. 3:28 PM
Contributors: kmbzn, Claude Sonnet 4.6

© 2026 kmbzn · MIT License