27. Transformers

Written: 2026. 6. 12. · Modified: 2026. 6. 12.

Self-Attention

Modeling sequence without recurrence

Review: Sequence-to-Sequence with Attention


If Not Recurrence, Then What? How About Attention?

  • Attention treats each word's representation as a query to access and incorporate information from a set of values.
    • Rather than attention from the decoder to the encoder, this is attention within a single sentence $\rightarrow$ self-attention!
    • With attention, the number of unparallelizable operations does not grow with sequence length.
    • Maximum interaction distance: $O(1)$, since all words interact at every layer.

Self-Attention: Keys, Queries, Values from the Same Sequence

  • Let $w_{1:n}$ be a sequence of words in vocabulary $V$.
    • Example: "Zuko made his uncle tea"
  • For each $w_i$, let $x_i = Ew_i$, where $E \in \mathbb{R}^{d \times |V|}$ is the embedding matrix.
  1. Transform each word embedding with the weight matrices $Q, K, V$, each in $\mathbb{R}^{d \times d}$ (this $V$ is the value matrix, not the vocabulary).
    • $q_i = Qx_i$ (queries), $k_i = Kx_i$ (keys), $v_i = Vx_i$ (values).
  2. Compute pairwise similarities between keys and queries, and normalize them with $\text{softmax}$.

    $$e_{ij} = q_i^T k_j$$

    $$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}$$

  3. Compute the output for each word as a weighted sum of the values.

    $$o_i = \sum_j \alpha_{ij} v_j$$
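The three steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the helper names (`matvec`, `dot`, `softmax`, `self_attention`) and the toy inputs are assumptions made for the example, and the weight matrices are set to the identity only to keep the numbers easy to follow.

```python
import math

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs, Q, K, V):
    # Step 1: queries, keys and values all come from the same sequence.
    qs = [matvec(Q, x) for x in xs]
    ks = [matvec(K, x) for x in xs]
    vs = [matvec(V, x) for x in xs]
    outputs, weights = [], []
    for q in qs:
        # Step 2: pairwise scores e_ij = q_i^T k_j, normalized with softmax.
        alpha = softmax([dot(q, k) for k in ks])
        weights.append(alpha)
        # Step 3: output o_i is the alpha-weighted sum of the values.
        outputs.append([sum(a * v[t] for a, v in zip(alpha, vs))
                        for t in range(len(vs[0]))])
    return outputs, weights

# Toy input (hypothetical): three 2-dimensional word embeddings.
I2 = [[1.0, 0.0], [0.0, 1.0]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(xs, I2, I2, I2)
```

With identity projections, the first word has the same dot product with itself and with the third word, so those two positions receive equal attention weight, and each row of `weights` sums to 1.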

Self-Attention as a Building Block

  • ๋‹ค์ด์–ด๊ทธ๋žจ๊ณผ ๊ฐ™์ด LSTM ๋ ˆ์ด์–ด๋ฅผ ์Œ“๋Š” ๊ฒƒ์ฒ˜๋Ÿผ Self-Attention ๋ธ”๋ก์„ ์Œ“์Œ.
  • Self-Attention์ด ์ˆœํ™˜(Recurrence)์˜ ์™„์ „ํ•œ ๋Œ€์ฒด์ œ๊ฐ€ ๋  ์ˆ˜ ์žˆ๋Š”๊ฐ€?
    • No. ๋ช‡ ๊ฐ€์ง€ ๋ฌธ์ œ๊ฐ€ ์กด์žฌํ•˜๋ฉฐ ์ด๋ฅผ ์‚ดํŽด๋ณด๊ณ ์ž ํ•จ.
    • ์ฒซ์งธ, Self-Attention์€ ์ง‘ํ•ฉ(Set)์— ๋Œ€ํ•œ ์—ฐ์‚ฐ์ž„. ์ˆœ์„œ(Order)์— ๋Œ€ํ•œ ๋‚ด์žฌ์  ๊ฐœ๋…์ด ์—†์Œ.

Self-Attention์€ ์ž…๋ ฅ์˜ ์ˆœ์„œ์— ๋Œ€ํ•ด์„œ ์•Œ์ง€ ๋ชปํ•จ. alt text
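This order-blindness can be checked numerically: reversing the input words simply reverses the outputs, and each word's output vector is unchanged. The sketch below assumes identity $Q, K, V$ matrices (so queries, keys and values are the inputs themselves) purely to stay short; `attend` is a hypothetical helper name.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(xs):
    """Self-attention with identity Q, K, V (an assumption made for brevity)."""
    outputs = []
    for q in xs:
        alpha = softmax([sum(a * b for a, b in zip(q, k)) for k in xs])
        outputs.append([sum(w * v[t] for w, v in zip(alpha, xs))
                        for t in range(len(q))])
    return outputs

xs = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
out = attend(xs)
out_rev = attend(xs[::-1])  # same words, reversed order

# Each word gets the same output vector regardless of where it sits.
order_blind = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for row, row_rev in zip(out, out_rev[::-1])
    for a, b in zip(row, row_rev)
)
```

Because nothing in the computation depends on the index of a word, position information has to be injected separately.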

Barriers & Solutions for Self-Attention as A Building Block


Fixing the First Self-Attention Problem: Sequence Order

  • Since self-attention has no order information, the order of the sentence must be encoded in the keys, queries, and values.
  • Suppose each sequence index is represented as a vector.
    • $p_i \in \mathbb{R}^d$, for $i \in \{1, 2, \dots, n\}$, are position vectors.
    • How the $p_i$ are constructed is not considered yet.
    • Incorporating this information into the self-attention block is simple: add $p_i$ to the inputs.
    • Recall that $x_i$ is the embedding of the word at index $i$.
    • The position-aware embedding is:

$$\tilde{x}_i = x_i + p_i$$

Position Representation Vectors Through Sinusoids

  • Sinusoidal position representations
    • Concatenate sinusoidal functions of varying periods.
    • This is the scheme proposed in the original Transformer paper.
    • Pros
      • Periodicity suggests that "absolute position" may not matter much.
      • Since the periods restart, it may be possible to extrapolate to longer sequences.
    • Cons
      • Not learnable.
      • In practice, extrapolation does not work well.
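As a rough sketch of how such vectors can be built and added to the embeddings, the code below assumes the sine/cosine formulation of the original Transformer paper, where dimension pair $2k, 2k+1$ of position $pos$ holds $\sin(pos / 10000^{2k/d})$ and $\cos(pos / 10000^{2k/d})$. The function names, the 0-based positions, and the requirement that $d$ is even are assumptions of this example.

```python
import math

def sinusoidal_position(pos, d, base=10000.0):
    """Position vector p_pos of even dimension d, built from sinusoids of varying periods."""
    p = []
    for k in range(0, d, 2):
        angle = pos / (base ** (k / d))  # the period grows with the dimension index k
        p.extend([math.sin(angle), math.cos(angle)])
    return p

def add_positions(xs):
    """Return the position-aware embeddings x_i + p_i for a list of word embeddings."""
    d = len(xs[0])
    return [
        [x_t + p_t for x_t, p_t in zip(x, sinusoidal_position(i, d))]
        for i, x in enumerate(xs)
    ]

xs = [[0.0] * 4 for _ in range(3)]  # three all-zero toy embeddings, d = 4
x_tilde = add_positions(xs)
```

Since the toy embeddings are all zero, `x_tilde` contains exactly the position vectors, which makes it easy to see that every position receives a different vector.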


Position Representation Vectors Learned from Scratch

  • Learned absolute position representations
    • Make every $p_i$ a learnable parameter.
    • Learn a matrix $P \in \mathbb{R}^{d \times n}$ and use each $p_i$ as a column of that matrix.
    • Most systems use this approach.
    • Pros: flexibility
      • Each position is learned to fit the data.
    • Cons: definitely cannot extrapolate to indices outside $1, \dots, n$.
  • Sometimes more flexible position representations are tried:
    • Relative linear position attention
    • Dependency syntax-based position
    • Rotary position embedding (RoPE)

Adding Non-Linearities in Self-Attention

  • Self-attention has no elementwise nonlinearities.
    • Stacking more self-attention layers merely re-averages the value vectors.
  • Simple fix
    • Add a feed-forward network to post-process each output vector.

$$m_i = \text{MLP}(\text{output}_i) = W_2 \cdot \text{ReLU}(W_1 \cdot \text{output}_i + b_1) + b_2$$
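The feed-forward step is applied to each output vector independently. A minimal sketch, with hypothetical helper names and arbitrary toy weights chosen only for illustration:

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def relu(v):
    return [max(0.0, t) for t in v]

def feed_forward(output_i, W1, b1, W2, b2):
    """m_i = W2 * ReLU(W1 * output_i + b1) + b2 for a single position."""
    hidden = relu([h + b for h, b in zip(matvec(W1, output_i), b1)])
    return [m + b for m, b in zip(matvec(W2, hidden), b2)]

# Toy parameters (hypothetical): hidden size 3, model size 2.
W1 = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]
b1 = [0.0, 0.0, -1.0]
W2 = [[1.0, 1.0, 1.0], [0.0, 2.0, 0.0]]
b2 = [0.1, 0.0]
m = feed_forward([2.0, 1.0], W1, b1, W2, b2)
```

In this example the second hidden unit is negative before the ReLU and is zeroed out, which is the elementwise nonlinearity that plain self-attention lacks.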

Masking the Future in Self-Attention

  • To use self-attention in a decoder, it must not be able to look at the future.
    • One option is to change the set of keys and queries at every timestep to include only past words, but this is inefficient.
    • To enable parallelization, mask out attention to future words by setting their attention scores to $-\infty$.

$$e_{ij} = \begin{cases} q_i^T k_j, & j \le i \\ -\infty, & j > i \end{cases}$$
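The masking rule can be sketched as follows. The function name and toy vectors are assumptions of this example; the point is only that a score of $-\infty$ turns into an attention weight of exactly 0 after the softmax.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(qs, ks):
    """Attention weights where position i may only attend to positions j <= i."""
    n = len(qs)
    weights = []
    for i in range(n):
        scores = [
            sum(a * b for a, b in zip(qs[i], ks[j])) if j <= i else -math.inf
            for j in range(n)
        ]
        weights.append(softmax(scores))
    return weights

vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy queries and keys
weights = causal_attention_weights(vecs, vecs)
```

All scores for the whole sequence are still produced in one pass, so nothing has to be recomputed per timestep; the mask only removes the entries above the diagonal.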

Necessities for a Self-Attention Building Block

  • Self-Attention
    • The basis of the method.
  • Position Representations
    • Self-attention is a function that ignores the order of its inputs, so the sequence order must be specified explicitly.
  • Nonlinearities
    • Placed at the output of the self-attention block.
    • Usually implemented as a simple feed-forward network.
  • Masking
    • Parallelizes the computation without looking at the future.
    • Prevents information about the future from "leaking" into the past.

Transformers

Neural networks based on self-attention

Transformer

  • ์˜ค์ง Attention ๋ฉ”์ปค๋‹ˆ์ฆ˜์œผ๋กœ๋งŒ ์„ค๊ณ„๋œ ์‹ ๊ฒฝ๋ง ์•„ํ‚คํ…์ฒ˜ (CNN์ด๋‚˜ RNN ์—†์ด).

    "Attention is all you need." (Ashish Vaswani et al., NeurIPS 2017)

    • The self-attention mechanism sits at its core.
    • Representations can be computed in parallel $\Rightarrow$ scalability $\uparrow$
    • Originally proposed for machine translation (a Seq2Seq architecture).

The Transformer Decoder

  • The Transformer decoder is how systems such as language models are built.
    • It resembles the minimal self-attention architecture, with a few more components added.
    • The embeddings and position embeddings are identical.
    • Next, self-attention is replaced with multi-head self-attention.

Recall the Self-Attention Hypothetical Example

Hypothetical Example of Multi-Head Attention

  • Attention head 1 attends to entities.
  • Attention head 2 attends to syntactically relevant words.

Sequence-Stacked Form of Attention

  • ํ–‰๋ ฌ์„ ํ†ตํ•œ Key-Query-Value Attention ๊ณ„์‚ฐ ๋ฐฉ์‹
    • X=[x1;โ€ฆโ€‰;xn]โˆˆRnร—dX = [x_1; \dots; x_n] \in \mathbb{R}^{n \times d}X=[x1โ€‹;โ€ฆ;xnโ€‹]โˆˆRnร—d๋ฅผ ์ž…๋ ฅ ๋ฒกํ„ฐ์˜ concatenation์œผ๋กœ ์ •์˜

XKโˆˆRnร—d,XQโˆˆRnร—d,XVโˆˆRnร—dXK \in \mathbb{R}^{n \times d}, XQ \in \mathbb{R}^{n \times d}, XV \in \mathbb{R}^{n \times d} XKโˆˆRnร—d,XQโˆˆRnร—d,XVโˆˆRnร—d

  • ์ถœ๋ ฅ ์ •์˜

output=softmax(XQ(XK)T)XVโˆˆRnร—d\text{output} = \text{softmax}(XQ(XK)^T)XV \in \mathbb{R}^{n \times d} output=softmax(XQ(XK)T)XVโˆˆRnร—d

Multi-Headed Attention

  • ๋ฌธ์žฅ์˜ ์—ฌ๋Ÿฌ ์œ„์น˜๋ฅผ ๋™์‹œ์— ๋ณด๊ณ  ์‹ถ๋‹ค๋ฉด?
    • ๋‹จ์–ด iii์— ๋Œ€ํ•ด Self-Attention์€ xiTQTKxjx_i^T Q^T K x_jxiTโ€‹QTKxjโ€‹๊ฐ€ ๋†’์€ ๊ณณ์„ ๋ณด์ง€๋งŒ, ๋‹ค๋ฅธ ์ด์œ ๋กœ ๋‹ค๋ฅธ jjj์— ์ง‘์ค‘ํ•ด์•ผ ํ•  ์ˆ˜๋„ ์žˆ์Œ.
  • ์—ฌ๋Ÿฌ ๊ฐœ์˜ Q,K,VQ, K, VQ,K,V ํ–‰๋ ฌ์„ ํ†ตํ•ด ๋‹ค์ˆ˜์˜ Attention "Head"๋ฅผ ์ •์˜
    • Ql,Kl,VlโˆˆRdร—dhQ_l, K_l, V_l \in \mathbb{R}^{d \times \frac{d}{h}}Qlโ€‹,Klโ€‹,Vlโ€‹โˆˆRdร—hdโ€‹ (hhh๋Š” Attention Head์˜ ์ˆ˜, lll์€ 1๋ถ€ํ„ฐ hhh๊นŒ์ง€).
    • ๊ฐ Attention Head๋Š” ๋…๋ฆฝ์ ์œผ๋กœ Attention ์ˆ˜ํ–‰

outputl=softmax(XQlKlTXT)โˆ’XVl\text{output}_l = \text{softmax}(XQ_l K_l^T X^T) - XV_l outputlโ€‹=softmax(XQlโ€‹KlTโ€‹XT)โˆ’XVlโ€‹

  • ์—ฌ๊ธฐ์„œ outputlโˆˆRd/h\text{output}_l \in \mathbb{R}^{d/h}outputlโ€‹โˆˆRd/h.
  • ๊ทธ ํ›„ ๋ชจ๋“  head์˜ ์ถœ๋ ฅ์„ ๊ฒฐํ•ฉ!

output=[output1;โ€ฆโ€‰;outputh]Y\text{output} = [\text{output}_1; \dots; \text{output}_h]Y output=[output1โ€‹;โ€ฆ;outputhโ€‹]Y

  • ์—ฌ๊ธฐ์„œ YโˆˆRdร—dY \in \mathbb{R}^{d \times d}YโˆˆRdร—d.
  • ๊ฐ head๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ๊ฒƒ์„ ๋ณด๊ณ , Value ๋ฒกํ„ฐ๋ฅผ ๋‹ค๋ฅด๊ฒŒ ๊ตฌ์„ฑ

Multi-Head Self-Attention is Computationally Efficient

  • Computing $h$ attention heads does not cost much more.
    • Compute $XQ \in \mathbb{R}^{n \times d}$, then reshape it to $\mathbb{R}^{n \times h \times \frac{d}{h}}$ (likewise for $XK$ and $XV$).
    • Then transpose to $\mathbb{R}^{h \times n \times \frac{d}{h}}$.
      • The head axis now behaves like a batch axis.
    • Almost everything else is identical, and the matrices are the same sizes.

Scaled Dot Product [Vaswani et al., 2017]

  • "Scaled dot product" attention aids training.
    • As the dimensionality $d$ grows, dot products between vectors tend to become large.
    • As a result, the inputs to the $\text{softmax}$ function become large, which makes the gradients small.
  • Solution
    • Instead of the self-attention function seen so far:

$$\text{output}_l = \text{softmax}(XQ_l K_l^T X^T) XV_l$$

    • divide the attention scores by $\sqrt{d/h}$ so that they do not grow with the dimensionality $d/h$:

$$\text{output}_l = \text{softmax}\left(\frac{XQ_l K_l^T X^T}{\sqrt{d/h}}\right) XV_l$$
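The effect of the scaling can be seen in a small simulation. It assumes, purely for illustration, that the components of the query and key vectors are independent with zero mean and unit variance; under that assumption the dot product has variance equal to the dimension, and dividing by the square root of the dimension brings it back to about 1. The function name, sample count, and seed are arbitrary choices for this example.

```python
import math
import random

def dot_product_variance(dim, scale, trials=2000, seed=0):
    """Empirical variance of (q . k) / scale for random standard-normal q and k."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        k = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        samples.append(sum(a * b for a, b in zip(q, k)) / scale)
    mean = sum(samples) / trials
    return sum((s - mean) ** 2 for s in samples) / trials

dim = 64  # plays the role of d/h
raw_var = dot_product_variance(dim, scale=1.0)
scaled_var = dot_product_variance(dim, scale=math.sqrt(dim))
```

In this setup `raw_var` comes out near `dim` and `scaled_var` near 1, which keeps the softmax inputs in a range where its gradients are not vanishingly small.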

The Transformer Decoder

  • Two optimization tricks
    • Residual connections
    • Layer normalization
    • In most Transformer diagrams these are written together as "Add & Norm".

Residual Connections [He et al., 2016]

  • Residual connections are a trick that helps models train.
    • Instead of $X^{(i)} = \text{Layer}(X^{(i-1)})$ (where $i$ is the layer),
    • set $X^{(i)} = X^{(i-1)} + \text{Layer}(X^{(i-1)})$, so that only the "residual" from the previous layer has to be learned.
    • Gradients propagate smoothly through the residual connection (its gradient is 1).
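One way to read the remark that the gradient is 1 is to differentiate the residual update with respect to its input (a short worked step, with $I$ denoting the identity):

$$\frac{\partial X^{(i)}}{\partial X^{(i-1)}} = I + \frac{\partial\, \text{Layer}(X^{(i-1)})}{\partial X^{(i-1)}}$$

The identity term passes gradients straight through to earlier layers even when the gradient of $\text{Layer}$ itself is small.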

Layer Normalization [Ba et al., 2016]

  • Layer normalization is a trick that helps models train faster.
    • Idea: cut down on uninformative variation in hidden vector values by normalizing to zero mean and unit standard deviation within each layer.
    • LayerNorm's success may be due to its normalizing of gradients [Xu et al., 2019].
  • Details
    • Let $x \in \mathbb{R}^d$ be an individual (word) vector in the model.
    • $\mu = \frac{1}{d} \sum_{j=1}^d x_j$: the mean ($\mu \in \mathbb{R}$).
    • $\sigma = \sqrt{\frac{1}{d} \sum_{j=1}^d (x_j - \mu)^2}$: the standard deviation ($\sigma \in \mathbb{R}$).
    • $\gamma \in \mathbb{R}^d$ and $\beta \in \mathbb{R}^d$ are learned "gain" and "bias" parameters (they can be omitted).
    • Layer normalization then computes:

$$\text{output} = \frac{x - \mu}{\sigma} \cdot \gamma + \beta$$
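A minimal sketch of layer normalization for a single vector, together with an "Add & Norm" step. Two things here are assumptions of the example rather than part of the notes: the small `eps` added to the denominator to avoid division by zero, and the ordering $\text{LayerNorm}(x + \text{Sublayer}(x))$, which is the arrangement used in the original Transformer paper.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one vector to zero mean and unit standard deviation, then apply gain and bias."""
    d = len(x)
    mu = sum(x) / d
    sigma = math.sqrt(sum((xj - mu) ** 2 for xj in x) / d)
    gamma = gamma if gamma is not None else [1.0] * d  # gain omitted -> 1
    beta = beta if beta is not None else [0.0] * d     # bias omitted -> 0
    # eps is an assumption added here to avoid division by zero.
    return [(xj - mu) / (sigma + eps) * g + b for xj, g, b in zip(x, gamma, beta)]

def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalization: LayerNorm(x + sublayer(x))."""
    return layer_norm([xj + sj for xj, sj in zip(x, sublayer(x))])

y = layer_norm([1.0, 2.0, 3.0, 4.0])
# A stand-in sublayer (hypothetical) that just halves its input.
z = add_and_norm([1.0, 2.0, 3.0, 4.0], lambda v: [0.5 * t for t in v])
```

Since the stand-in sublayer only rescales its input, `z` ends up almost identical to `y`: normalization removes the overall scale of the vector.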

The Transformer Decoder

  • The Transformer decoder is a stack of Transformer decoder blocks.
  • Each block consists of:
    • Self-Attention
    • Add & Norm
    • Feed-Forward
    • Add & Norm

The Transformer Encoder

  • The Transformer decoder is constrained to unidirectional context, as in a language model.
  • What if we want bidirectional context, as in a bidirectional RNN?
  • This is the Transformer encoder.
    • The only difference is that the masking in self-attention is removed.

The Transformer Encoder-Decoder

  • Recall that in machine translation, the source sentence was processed with a bidirectional model and the target was generated with a unidirectional model.
  • For this Seq2Seq format, a Transformer encoder-decoder is typically used.
    • A normal Transformer encoder is used.
    • The Transformer decoder is modified to perform cross-attention over the output of the encoder.

Cross-Attention

  • In self-attention, the keys, queries, and values come from the same source.
  • In the decoder, attention takes a form closer to what was seen previously.
    • Let $h_1, \dots, h_n$ be the output vectors of the Transformer encoder ($h_i \in \mathbb{R}^d$).
    • Let $z_1, \dots, z_n$ be the input vectors to the Transformer decoder ($z_i \in \mathbb{R}^d$).
    • The keys and values are drawn from the encoder (which acts like a memory).

      $$k_i = K h_i, \quad v_i = V h_i$$

    • The queries are drawn from the decoder.

      $$q_i = Q z_i$$
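Cross-attention differs from self-attention only in where the queries, keys, and values come from. A minimal sketch with hypothetical helper names and identity weight matrices; the toy example deliberately uses a different number of encoder and decoder vectors to show that the two lengths need not match in this computation.

```python
import math

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(zs, hs, Q, K, V):
    ks = [matvec(K, h) for h in hs]  # keys from the encoder outputs
    vs = [matvec(V, h) for h in hs]  # values from the encoder outputs
    outputs = []
    for z in zs:
        q = matvec(Q, z)             # query from the decoder input
        alpha = softmax([sum(a * b for a, b in zip(q, k)) for k in ks])
        outputs.append([sum(w * v[t] for w, v in zip(alpha, vs))
                        for t in range(len(vs[0]))])
    return outputs

I2 = [[1.0, 0.0], [0.0, 1.0]]
hs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy encoder outputs
zs = [[2.0, 0.0], [0.0, 2.0]]              # toy decoder inputs
outputs = cross_attention(zs, hs, I2, I2, I2)
```

There is one output per decoder position, and each output is a weighted average of the encoder-side value vectors.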

A Graphical Explanation of Transformers (3Blue1Brown)

  • Transformers (how LLMs work) explained visually
    • https://youtu.be/wjZofJX0v4M?si=2isiONQrpxzdTo9l
  • Attention in transformers, visually explained
    • https://youtu.be/eMlx5fFNoYc?si=Ij1kvExbMOzR8tZg

Last modified: 26. 6. 12. 3:28 PM
Contributors: kmbzn, Claude Sonnet 4.6

© 2026 kmbzn · MIT License