
25. Language Modeling and Recurrent Neural Networks

Written: 2026. 6. 12. · Updated: 2026. 6. 12.

Language Modeling

Classic $n$-gram models

Language Modeling

  • The task of predicting which word comes next
  • Formal definition
    • Given a sequence of words $x^{(1)}, x^{(2)}, \dots, x^{(t)}$, compute the probability distribution of the next word $x^{(t+1)}$

$$P(x^{(t+1)} \mid x^{(t)}, \dots, x^{(1)})$$

  • A system that does this is called a Language Model
  • It can also be viewed as a system that assigns a probability to a piece of text

$$
\begin{aligned}
P(x^{(1)}, \dots, x^{(T)}) &= P(x^{(1)}) \times P(x^{(2)} \mid x^{(1)}) \times \dots \times P(x^{(T)} \mid x^{(T-1)}, \dots, x^{(1)}) \\
&= \prod_{t=1}^{T} P(x^{(t)} \mid x^{(t-1)}, \dots, x^{(1)})
\end{aligned}
$$

  • What can a Language Model be used for?
    • Score sentences: rate how natural a sentence is
      • Jane went to the store. → high
      • Store to Jane went the. → low
    • Generate sentences
while the end-of-sentence symbol has not been chosen
    Calculate the probability distribution over the next word
    Sample a new word from that distribution
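
The scoring and generation ideas above can be sketched with a toy model. This is only an illustration: the `NEXT_WORD` table, its probabilities, the `FLOOR` value, and the choice to condition on the previous word alone are all assumptions made for the example, not part of the lecture.

```python
import math
import random

# Toy next-word distributions keyed by the previous word only.
# Conditioning on a single word is a simplification; all numbers are invented.
NEXT_WORD = {
    "<s>": {"jane": 0.9, "store": 0.1},
    "jane": {"went": 0.9, "</s>": 0.1},
    "went": {"to": 0.9, "</s>": 0.1},
    "to": {"the": 0.9, "jane": 0.1},
    "the": {"store": 0.9, "jane": 0.1},
    "store": {"</s>": 0.8, "to": 0.2},
}
FLOOR = 1e-6  # assumed probability for transitions missing from the table


def log_prob(words):
    """Chain rule: sum the log-probability of each word given its history."""
    total, prev = 0.0, "<s>"
    for word in words + ["</s>"]:
        total += math.log(NEXT_WORD.get(prev, {}).get(word, FLOOR))
        prev = word
    return total


def generate(rng, max_len=20):
    """Sample one word at a time until the end-of-sentence symbol is chosen."""
    words, prev = [], "<s>"
    while len(words) < max_len:
        dist = NEXT_WORD[prev]  # calculate the next-word probabilities
        word = rng.choices(list(dist), weights=list(dist.values()))[0]  # sample
        if word == "</s>":
            break
        words.append(word)
        prev = word
    return words


natural = log_prob("jane went to the store".split())
scrambled = log_prob("store to jane went the".split())
print(natural > scrambled)
print(generate(random.Random(0)))
```

The natural word order receives the higher log-probability because every transition in it is listed in the table, while the scrambled order falls back to `FLOOR` for some transitions.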

You Use Language Models Every Day!

  • Autocomplete
  • Search engines
  • ChatGPT

$N$-Gram Language Models

  • The fundamental, classic way to build a Language Model before deep learning
  • $N$-gram: a chunk of $n$ consecutive words
    • Unigrams: "the", "students"
    • Bigrams: "the students", "students opened"
    • Trigrams: "the students opened", "students opened their"
    • Four-grams: "the students opened their"
  • Idea: collect frequency statistics and use them to predict the next word

  • Markov assumption
    • Assume $x^{(t+1)}$ depends only on the preceding $n-1$ words

$$
\begin{aligned}
P(x^{(t+1)} \mid x^{(t)}, \dots, x^{(1)}) &= P(x^{(t+1)} \mid x^{(t)}, \dots, x^{(t-n+2)}) \\
&= \frac{P(x^{(t+1)}, x^{(t)}, \dots, x^{(t-n+2)})}{P(x^{(t)}, \dots, x^{(t-n+2)})}
\end{aligned}
$$

  • Compute these probabilities by counting occurrences in a large corpus (a statistical approximation)
    • The conditional probability is approximated by a ratio of counts

$$\approx \frac{\text{count}(x^{(t+1)}, x^{(t)}, \dots, x^{(t-n+2)})}{\text{count}(x^{(t)}, \dots, x^{(t-n+2)})}$$
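
A minimal sketch of the count-ratio estimate, assuming a tiny hand-written corpus. The corpus text and the helper names `ngram_counts` and `ngram_prob` are hypothetical and exist only for this example.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_prob(tokens, context, word):
    """Estimate P(word | context) as count(context + word) / count(context)."""
    n = len(context) + 1
    numerator = ngram_counts(tokens, n)[tuple(context) + (word,)]
    denominator = ngram_counts(tokens, n - 1)[tuple(context)]
    return numerator / denominator if denominator else 0.0


# Invented toy corpus; a real model would use millions of words.
corpus = ("the students opened their books . "
          "the students opened their exams . "
          "the students opened their books .").split()

context = ["students", "opened", "their"]
p_books = ngram_prob(corpus, context, "books")
p_exams = ngram_prob(corpus, context, "exams")
print(p_books, p_exams)
```

With a three-word context this is a 4-gram model. A word that never follows the context in the corpus gets probability 0.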

$N$-Gram Language Models: Example

  • Suppose we are learning a 4-gram Language Model
  • as the proctor started the clock, the students opened their ______
    • A 4-gram model conditions only on the last three words, so the earlier part of the sentence is ignored

$$P(w \mid \text{students opened their}) = \frac{\text{count}(\text{students opened their}~w)}{\text{count}(\text{students opened their})}$$

  • Assign probabilities from how often each word follows "students opened their" in the corpus
    • "books": $0.4$
    • "exams": $0.1$

Generating Text with an $N$-Gram Language Model

  • A simple trigram language model
    • A corpus of over 1.7 million words (Reuters: economic and business news)
    • The language model can be used to generate text

today the price of gold per ton, while production of shoe lasts and shoe industry, the bank intervened just after it considered and rejected an imf demand to rebuild depleted european stocks, sept 30 end primary 76 cts a share.

  • Surprisingly grammatical!
  • ...but incoherent: the context does not hold together
  • To model language well, we need to consider more than three words at a time
  • But increasing $n$ worsens the sparsity problem and increases the model size
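
The sparsity problem can be made concrete with a rough sketch: as $n$ grows, a larger share of the observed $n$-grams occur only once, and plausible $n$-grams are missing altogether. The three-sentence corpus below is invented for illustration.

```python
from collections import Counter

# Invented toy corpus, far smaller than anything used in practice.
corpus = ("the students opened their books . "
          "the students opened their exams . "
          "the teacher opened the door .").split()


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def fraction_seen_once(tokens, n):
    """Share of distinct n-grams that occur exactly once in the corpus."""
    counts = ngram_counts(tokens, n)
    return sum(1 for c in counts.values() if c == 1) / len(counts)


orders = (1, 2, 3, 4)
fractions = [fraction_seen_once(corpus, n) for n in orders]
unseen = ngram_counts(corpus, 4)[("students", "opened", "their", "laptops")]

for n, frac in zip(orders, fractions):
    print(n, len(ngram_counts(corpus, n)), round(frac, 2))
print("count of a plausible but unseen 4-gram:", unseen)
```

An unseen 4-gram has count 0, so the count-ratio estimate assigns it probability 0 no matter how reasonable the continuation is.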

Neural Language Models

Based on feed-forward NNs and RNNs

A (Fixed-Window) Neural Language Model

  • Output distribution

$$\hat{\bm{y}} = \text{softmax}(\bm{U}\bm{h} + \bm{b}_2) \in \mathbb{R}^{|V|}$$

  • Hidden layer

$$\bm{h} = f(\bm{W}\bm{e} + \bm{b}_1)$$

  • Concatenated word embeddings

$$\bm{e} = [\bm{e}^{(1)};~\bm{e}^{(2)};~\bm{e}^{(3)};~\bm{e}^{(4)}]$$

  • Words / One-hot vectors

$$\bm{x}^{(1)},~\bm{x}^{(2)},~\bm{x}^{(3)},~\bm{x}^{(4)}$$

  • An early version proposed by Yoshua Bengio et al. (2000)
  • Improvements over the $N$-gram Language Model
    • No sparsity problem
    • No need to store every observed $N$-gram
  • Remaining problems
    • The fixed window is too small
    • Enlarging the window enlarges the weights $W$
    • $x^{(1)}$ and $x^{(2)}$ are multiplied by completely different weights in $W$ (inputs are processed asymmetrically)
  • Hence the need for a neural architecture that can process input of any length
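
The forward pass of the fixed-window model can be sketched in plain Python. The vocabulary, the layer sizes, the random initial weights, and the choice of $f = \tanh$ are assumptions for this example; nothing here is trained.

```python
import math
import random

rng = random.Random(0)
VOCAB = ["the", "students", "opened", "their", "books", "exams"]
EMBED_DIM, HIDDEN_DIM, WINDOW = 3, 5, 4  # sizes chosen arbitrarily


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


E = rand_matrix(len(VOCAB), EMBED_DIM)           # one embedding row per word
W = rand_matrix(HIDDEN_DIM, WINDOW * EMBED_DIM)  # width grows with the window
b1 = [0.0] * HIDDEN_DIM
U = rand_matrix(len(VOCAB), HIDDEN_DIM)
b2 = [0.0] * len(VOCAB)


def predict(window_words):
    """Embeddings -> concatenation e -> hidden layer h -> softmax y_hat."""
    # Looking up a row of E is equivalent to multiplying by a one-hot vector.
    e = [x for word in window_words for x in E[VOCAB.index(word)]]
    h = [math.tanh(a + b) for a, b in zip(matvec(W, e), b1)]  # f = tanh (assumed)
    return softmax([a + b for a, b in zip(matvec(U, h), b2)])


y_hat = predict(["the", "students", "opened", "their"])
print(dict(zip(VOCAB, (round(p, 3) for p in y_hat))))
```

Each window position multiplies its own block of columns in `W`, which is the asymmetry noted above, and `W` must be widened whenever `WINDOW` grows.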

Recurrent Neural Networks (RNN)

  • Core idea: apply the same weights $W$ repeatedly at every timestep

A Simple RNN Language Model

  • Diagram of a language model built on the recurrent structure
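
Since the diagram is not reproduced here, the following sketch shows one common formulation of a simple RNN language model, assumed for illustration: $h^{(t)} = \tanh(W_h h^{(t-1)} + W_e e^{(t)} + b_1)$ and $\hat{y}^{(t)} = \text{softmax}(U h^{(t)} + b_2)$. The weight names, sizes, and random initialization are hypothetical.

```python
import math
import random

rng = random.Random(0)
VOCAB = ["the", "students", "opened", "their", "books", "exams"]
EMBED_DIM, HIDDEN_DIM = 3, 4  # sizes chosen arbitrarily


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


E = rand_matrix(len(VOCAB), EMBED_DIM)
W_e = rand_matrix(HIDDEN_DIM, EMBED_DIM)
W_h = rand_matrix(HIDDEN_DIM, HIDDEN_DIM)
b1 = [0.0] * HIDDEN_DIM
U = rand_matrix(len(VOCAB), HIDDEN_DIM)
b2 = [0.0] * len(VOCAB)


def rnn_lm(words):
    """Return one next-word distribution per input word, reusing the same weights."""
    h = [0.0] * HIDDEN_DIM  # initial hidden state
    outputs = []
    for word in words:
        e = E[VOCAB.index(word)]
        pre = [a + b + c for a, b, c in zip(matvec(W_h, h), matvec(W_e, e), b1)]
        h = [math.tanh(p) for p in pre]
        outputs.append(softmax([a + b for a, b in zip(matvec(U, h), b2)]))
    return outputs


short = rnn_lm(["the", "students"])
long = rnn_lm(["the", "students", "opened", "their"])
print(len(short), len(long))
```

The same `W_h`, `W_e`, and `U` handle inputs of length two and four, so the parameter count does not depend on the sequence length.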

RNN Language Models

  • Advantages of RNNs
    • Can process input of any length
    • In theory, the computation at step $t$ can use information from many steps back
    • The model size does not grow as the input context gets longer
    • The same weights are applied at every timestep, so inputs are processed symmetrically

  • Disadvantages of RNNs
    • Recurrent computation is slow, because each step must wait for the previous one
    • In practice, it is hard to access information from many steps back (vanishing gradients, etc.)
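
A scalar sketch of why distant information is hard to reach: for the one-dimensional recurrence $h^{(t)} = \tanh(w\,h^{(t-1)})$, the derivative of a late state with respect to the initial state is a product of per-step factors, each smaller than $|w|$. The recurrence, the weight $w = 0.9$, and the starting value are assumptions chosen for illustration.

```python
import math


def gradient_through_time(w, steps, h0=0.5):
    """d h(steps) / d h(0) for the scalar RNN h(t) = tanh(w * h(t-1))."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)  # chain rule through one tanh step
    return grad


for steps in (1, 5, 20, 50):
    print(steps, gradient_through_time(w=0.9, steps=steps))
```

The printed values shrink roughly geometrically with the number of steps, so the training signal from a distant word becomes very small.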

Training an RNN Language Model

  • Procedure
    • Prepare a big corpus, i.e. a long sequence of words
    • Feed it into the RNN-LM and compute the output distribution $\hat{\bm{y}}^{(t)}$ at every step $t$

  • Loss function
    • Cross-entropy between the predicted distribution $\hat{\bm{y}}^{(t)}$ and the true next word $\bm{y}^{(t)}$ (one-hot for $x^{(t+1)}$)

$$J^{(t)}(\theta) = CE(\bm{y}^{(t)}, \hat{\bm{y}}^{(t)}) = -\sum_{w \in V} \bm{y}_w^{(t)} \log \hat{\bm{y}}_w^{(t)} = -\log \hat{\bm{y}}_{x^{(t+1)}}^{(t)}$$
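
A small numeric check of the last equality: with a one-hot target, the full cross-entropy sum collapses to the negative log-probability of the true next word. The vocabulary and the predicted probabilities below are made up for the example.

```python
import math


def cross_entropy_full(y_one_hot, y_hat):
    """-sum_w y_w * log(y_hat_w), skipping terms where y_w is zero."""
    return -sum(y * math.log(p) for y, p in zip(y_one_hot, y_hat) if y > 0)


def cross_entropy_one_hot(y_hat, true_index):
    """Shortcut for a one-hot target: -log of the true word's probability."""
    return -math.log(y_hat[true_index])


vocab = ["books", "exams", "laptops", "minds"]
y_hat = [0.4, 0.1, 0.3, 0.2]        # hypothetical model output at step t
true_index = vocab.index("books")   # the true next word x^(t+1)
y = [1.0 if i == true_index else 0.0 for i in range(len(vocab))]

loss_full = cross_entropy_full(y, y_hat)
loss_short = cross_entropy_one_hot(y_hat, true_index)
print(loss_full, loss_short)
```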


  • Average the loss over the entire training set

$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} -\log \hat{\bm{y}}^{(t)}_{x^{(t+1)}}$$

  • Computing the loss and gradients over the entire corpus at once is too expensive

$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta)$$

  • In practice, $x^{(1)}, \dots, x^{(T)}$ is taken to be a sentence (or a document)
  • Use Stochastic Gradient Descent (SGD): compute the loss and gradients on a small chunk (batch) of data, update the weights, and repeat
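
The batch-update loop can be sketched without a full RNN. To keep the example short, the model below is a stand-in: a table of logits with one row per previous word, trained with the same cross-entropy loss. The corpus, batch size, learning rate, and number of epochs are arbitrary assumptions; an RNN-LM would replace the table lookup with its recurrent forward pass and backpropagation.

```python
import math
import random

corpus = ("the students opened their books . "
          "the students opened their exams .").split()
vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}
pairs = [(index[a], index[b]) for a, b in zip(corpus, corpus[1:])]  # (x(t), x(t+1))

# Stand-in model (not an RNN): logits[prev][next], initialised to zero.
logits = [[0.0] * len(vocab) for _ in vocab]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_loss(batch):
    """Mean of -log y_hat[true next word] over the batch."""
    return sum(-math.log(softmax(logits[p])[n]) for p, n in batch) / len(batch)


def sgd_step(batch, lr):
    """One update using the batch-averaged gradient (y_hat - y) of the loss."""
    grads = {}
    for prev, nxt in batch:
        y_hat = softmax(logits[prev])
        row = grads.setdefault(prev, [0.0] * len(vocab))
        for w, p in enumerate(y_hat):
            row[w] += (p - (1.0 if w == nxt else 0.0)) / len(batch)
    for prev, row in grads.items():
        for w, g in enumerate(row):
            logits[prev][w] -= lr * g


rng = random.Random(0)
loss_before = average_loss(pairs)
for epoch in range(200):
    rng.shuffle(pairs)
    for start in range(0, len(pairs), 4):  # batches of up to 4 examples
        sgd_step(pairs[start:start + 4], lr=0.5)
loss_after = average_loss(pairs)
print(loss_before, loss_after)
```

The loss cannot reach zero on this corpus, because "their" is followed by both "books" and "exams".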

Generating Text with an RNN Language Model

  • An RNN-LM trained on a particular style of text can generate text in that style
  • Examples: Obama speeches, the style of the Harry Potter novels, etc.

Recurrent Neural Networks for Other Applications

Tagging, classification, question answering, speech recognition

RNNs Can Be Used for Tagging

  • Examples: part-of-speech tagging, named entity recognition, etc.

RNNs Can Be Used for Sentence Classification

  • Example: sentiment classification, etc.

RNN-LMs Can Be Used to Generate Text

  • Speech recognition, machine translation, summarization, etc.

Variants of RNNs

RNN variants: bidirectional, multi-layer, and LSTM

Bidirectional and Multi-Layer RNNs: Motivation

  • In a sentence such as "the movie was terribly exciting", the RNN hidden state at "terribly" can be regarded as a representation of that word in the context of the sentence
    • This is called a contextual representation
  • This contextual representation only contains information about the left context (e.g. "the movie was")
  • In this example, "exciting" is in the right context and changes the meaning of "terribly" (from negative to positive)

Bidirectional RNNs

  • Combine forward (left-to-right) and backward (right-to-left) information
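
One common way to combine the two directions, assumed here, is to run two RNNs with separate weights and concatenate their hidden states at each position. The embeddings, dimensions, weight names, and the plain tanh recurrence are hypothetical choices for this sketch.

```python
import math
import random

rng = random.Random(0)
DIM = 3  # embedding and hidden size, chosen arbitrarily


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def run_rnn(inputs, W_h, W_x):
    """Return the hidden state at every position of a simple tanh RNN."""
    h, states = [0.0] * DIM, []
    for x in inputs:
        h = [math.tanh(a + b) for a, b in zip(matvec(W_h, h), matvec(W_x, x))]
        states.append(h)
    return states


words = ["the", "movie", "was", "terribly", "exciting"]
embeddings = {w: [rng.uniform(-1, 1) for _ in range(DIM)] for w in words}
inputs = [embeddings[w] for w in words]

# Separate weights for each direction.
Wf_h, Wf_x = rand_matrix(DIM, DIM), rand_matrix(DIM, DIM)
Wb_h, Wb_x = rand_matrix(DIM, DIM), rand_matrix(DIM, DIM)

fwd = run_rnn(inputs, Wf_h, Wf_x)               # reads left to right
bwd = run_rnn(inputs[::-1], Wb_h, Wb_x)[::-1]   # reads right to left, re-aligned

# Contextual representation of each word: [forward state; backward state].
contextual = [f + b for f, b in zip(fwd, bwd)]
terribly = contextual[words.index("terribly")]
print(len(terribly))
```

The backward half of the vector for "terribly" depends on "exciting", while the forward half only sees "the movie was terribly".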

Multi-Layer RNNs

  • Stack several RNN layers, so that the hidden states of one layer are the inputs to the next

Long Short-Term Memory RNNs (LSTMs)

  • A type of RNN proposed by Hochreiter and Schmidhuber in 1997 as a solution to the vanishing gradients problem
  • Given a sequence of inputs $x^{(t)}$, compute a sequence of hidden states $h^{(t)}$ and cell states $c^{(t)}$
  • At timestep $t$:
  • Forget gate: controls what is kept and what is forgotten from the previous cell state

$$f^{(t)} = \sigma(W_f h^{(t-1)} + U_f x^{(t)} + b_f)$$

  • Input gate: controls which parts of the new cell content are written to the cell

$$i^{(t)} = \sigma(W_i h^{(t-1)} + U_i x^{(t)} + b_i)$$

  • Output gate: controls which parts of the cell are output to the hidden state

$$o^{(t)} = \sigma(W_o h^{(t-1)} + U_o x^{(t)} + b_o)$$

(Sigmoid function: all gate values are between 0 and 1)

  • New cell content: the new content to be written to the cell

$$\tilde{c}^{(t)} = \tanh(W_c h^{(t-1)} + U_c x^{(t)} + b_c)$$

  • Cell state: erase ("forget") some content from the previous cell state, and write ("input") some of the new cell content

$$c^{(t)} = f^{(t)} \circ c^{(t-1)} + i^{(t)} \circ \tilde{c}^{(t)}$$

  • Hidden state: read ("output") some content from the cell

$$h^{(t)} = o^{(t)} \circ \tanh c^{(t)}$$

  • Notes
    • These are all vectors of the same length $n$
    • Gates are applied using the element-wise (Hadamard) product $\circ$
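
The equations above translate directly into a single-step function. This is a minimal sketch with randomly initialised, untrained weights; the sizes `N` and `INPUT_DIM`, the `params` dictionary, and the helper names are assumptions for the example.

```python
import math
import random

rng = random.Random(0)
N, INPUT_DIM = 4, 3  # n = length of every gate and state vector; sizes are arbitrary


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# One (W, U, b) triple per gate or candidate, keyed by the subscript used above.
params = {name: (rand_matrix(N, N), rand_matrix(N, INPUT_DIM), [0.0] * N)
          for name in ("f", "i", "o", "c")}


def affine(name, h_prev, x):
    """W h(t-1) + U x(t) + b for the named gate."""
    W, U, b = params[name]
    return [a + c + d for a, c, d in zip(matvec(W, h_prev), matvec(U, x), b)]


def lstm_step(h_prev, c_prev, x):
    f = [sigmoid(z) for z in affine("f", h_prev, x)]          # forget gate
    i = [sigmoid(z) for z in affine("i", h_prev, x)]          # input gate
    o = [sigmoid(z) for z in affine("o", h_prev, x)]          # output gate
    c_tilde = [math.tanh(z) for z in affine("c", h_prev, x)]  # new cell content
    # Element-wise products: keep part of the old cell, write part of the new content.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]          # read from the cell
    return h, c


h, c = [0.0] * N, [0.0] * N
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    h, c = lstm_step(h, c, x)
print(h, c)
```

The cell state is updated by addition rather than by repeated matrix multiplication, which is what makes it easier for the LSTM to carry information across many timesteps.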

Diagram Legend

  • (X): Pointwise multiplication (element-wise)
  • (+): Pointwise addition (the "secret": the additive cell-state update makes it easier to preserve information and gradients over many timesteps)
  • SIGMOID, TANH: Neural network layers (activation functions)
  • ->: Vector transfer
  • Combine: Concatenation
Last modified: 26. 6. 12. 3:28 PM
Contributors: kmbzn, Claude Sonnet 4.6

© 2026 kmbzn · MIT License