25. Language Modeling and Recurrent Neural Networks

Written 2026. 6. 12. · Updated 2026. 6. 12.

Language Modeling

Classic $n$-gram models

Language Modeling

The task of predicting what the next word will be
Formal definition
- Given a sequence of words $x^{(1)}, x^{(2)}, \dots, x^{(t)}$, compute the probability distribution of the next word $x^{(t+1)}$

P(x^{(t+1)}|x^{(t)}, \dots, x^{(1)})

A system that does this is called a Language Model
It can also be viewed as a system that assigns a probability to a piece of text

P(x^{(1)}, \dots, x^{(T)})

= P(x^{(1)}) \times P(x^{(2)}|x^{(1)}) \times \dots \times P(x^{(T)}|x^{(T-1)}, \dots, x^{(1)})

= \prod_{t=1}^{T} P(x^{(t)}|x^{(t-1)}, \dots, x^{(1)})

What can we do with a Language Model?
- Score sentences: rate how natural a sentence is
  - Jane went to the store. → high
  - Store to Jane went the. → low
- Generate sentences

while the end-of-sentence symbol has not been chosen:
    Calculate the probability distribution over the next word
    Sample a new word from that distribution
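
The generation loop above can be sketched in Python. This is a minimal sketch, not a real language model: `TOY_LM`, its made-up probabilities, and the `<s>` / `</s>` boundary symbols are all assumptions introduced for illustration, and the toy model looks only at the last word.

```python
import random

# Hypothetical toy model: maps the last word to a next-word distribution.
# The probabilities are invented for illustration only.
TOY_LM = {
    "<s>": {"jane": 1.0},
    "jane": {"went": 1.0},
    "went": {"to": 1.0},
    "to": {"the": 1.0},
    "the": {"store": 0.7, "park": 0.3},
    "store": {"</s>": 1.0},
    "park": {"</s>": 1.0},
}


def next_word_distribution(context):
    """Return P(next word | context); this toy model uses only the last word."""
    return TOY_LM[context[-1]]


def generate(rng, max_len=20):
    """Sample one word at a time until the end-of-sentence symbol is chosen."""
    context = ["<s>"]
    while context[-1] != "</s>" and len(context) < max_len:
        dist = next_word_distribution(context)      # calculate the distribution
        words = list(dist.keys())
        probs = list(dist.values())
        # Sample a new word from the probability distribution.
        context.append(rng.choices(words, weights=probs, k=1)[0])
    return context[1:]                               # drop the start symbol


sentence = generate(random.Random(0))
print(" ".join(sentence))
```

The `max_len` guard is an extra safety stop that the pseudocode does not have; a real model could otherwise keep sampling for a very long time before it picks the end-of-sentence symbol.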

You Use Language Models Every Day!

Autocomplete
Search engines
ChatGPT

$N$-Gram Language Models

The fundamental, classic way of building a Language Model before deep learning
$N$-gram: a chunk of $n$ consecutive words
- Unigrams: "the", "students"
- Bigrams: "the students", "students opened"
- Trigrams: "the students opened", "students opened their"
- Four-grams: "the students opened their"
Idea: collect frequency statistics and use them to predict the next word

Markov assumption
- Assume that $x^{(t+1)}$ depends only on the preceding $n-1$ words

P(x^{(t+1)}|x^{(t)}, \dots, x^{(1)}) = P(x^{(t+1)}|x^{(t)}, \dots, x^{(t-n+2)})

= \frac{P(x^{(t+1)}, x^{(t)},~\dots,~x^{(t-n+2)})}{P(x^{(t)},~\dots,~x^{(t - n + 2)})}

Compute these probabilities by counting in a large corpus (statistical approximation)
- Approximate the conditional probability with a ratio of counts

$N$-Gram Language Models: Example

Suppose we are training a 4-gram Language Model
~~as the proctor started the clock, the~~ students opened their ______

P(w|\text{students opened their})=\frac{\text{count}(\text{students opened their}~w)}{\text{count}(\text{students opened their})}

Assign probabilities by checking how often each word follows "students opened their" in the corpus
- "books": $(0.4)$
- "exams": $(0.1)$
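
The count ratio above can be computed directly. The following is a minimal sketch, assuming a tiny invented corpus (so the resulting numbers differ from the 0.4 and 0.1 quoted above); `ngram_counts` and `ngram_prob` are hypothetical helper names.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_prob(tokens, context, word):
    """Estimate P(word | context) as count(context + word) / count(context)."""
    n = len(context) + 1
    numer = ngram_counts(tokens, n)[tuple(context) + (word,)]
    denom = ngram_counts(tokens, n - 1)[tuple(context)]
    return numer / denom if denom else 0.0


# Hypothetical toy corpus, invented for illustration.
corpus = (
    "the students opened their books . "
    "the students opened their exams . "
    "the students opened their books ."
).split()

context = ["students", "opened", "their"]
p_books = ngram_prob(corpus, context, "books")   # 2 of 3 occurrences
p_exams = ngram_prob(corpus, context, "exams")   # 1 of 3 occurrences
p_minds = ngram_prob(corpus, context, "minds")   # never observed
print(p_books, p_exams, p_minds)
```

A word that never follows the context in the corpus gets probability 0, however plausible it is.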

Generating Text with an $N$-Gram Language Model

A simple trigram language model
- A collection of more than 1.7 million words (Reuters: economics and business news)
- The language model can be used to generate text.

today the price of gold per ton, while production of shoe lasts and shoe industry, the bank intervened just after it considered and rejected an imf demand to rebuild depleted european stocks, sept 30 end primary 76 cts a share.

Surprisingly grammatical!
...but incoherent (the context does not hold together).
To model language well, we need to consider more than three words at a time.
However, increasing $n$ worsens the sparsity problem and increases the model size.

Neural Language Models

Based on feed-forward NNs and RNNs

A (Fixed-Window) Neural Language Model

Output distribution

\hat{\bm{y}}=\text{softmax}(\bm{Uh} + \bm{b_2} ) \in \mathbb{R}^{|V|}

Hidden layer

\bm{h} = f(\bm{We+b_1})

Concatenated word embeddings

\bm{e}=[\bm{e}^{(1)};~\bm{e}^{(2)};~\bm{e}^{(3)};~\bm{e}^{(4)}]

Words / One-hot vectors

\bm{x}^{(1)},~\bm{x}^{(2)},~\bm{x}^{(3)},~\bm{x}^{(4)}

An early version proposed by Yoshua Bengio et al. (2000)
Improvements over the $N$-gram Language Model
- No sparsity problem
- No need to store all observed $N$-grams
Remaining problems
- The fixed window is too small
- Enlarging the window enlarges the weights $W$
- $x^{(1)}$ and $x^{(2)}$ are multiplied by completely different weights in $W$ (no symmetry in how the inputs are processed)
Hence the need for a neural architecture that can process input of any length
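
The forward pass of the fixed-window model can be sketched with NumPy as follows. The sizes (`V`, `d`, `window`, `hidden`), the random initialization, and the choice of `tanh` for $f$ are assumptions made for illustration; the source specifies only the three equations.

```python
import numpy as np

rng = np.random.default_rng(0)

V, d, window, hidden = 10, 4, 4, 8          # assumed toy sizes
E = rng.normal(size=(V, d))                 # embedding matrix, one row per word
W = rng.normal(size=(hidden, window * d))   # hidden-layer weights
b1 = np.zeros(hidden)
U = rng.normal(size=(V, hidden))            # output-layer weights
b2 = np.zeros(V)


def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    exp = np.exp(z - z.max())
    return exp / exp.sum()


def fixed_window_lm(word_ids):
    """Embedding lookup, concatenation, hidden layer, softmax."""
    e = np.concatenate([E[i] for i in word_ids])   # e = [e1; e2; e3; e4]
    h = np.tanh(W @ e + b1)                        # h = f(We + b1), with f = tanh
    return softmax(U @ h + b2)                     # y_hat, a vector in R^{|V|}


y_hat = fixed_window_lm([1, 5, 2, 7])
print(y_hat.shape, y_hat.sum())
```

`W` has `window * d` columns, so a wider window directly means a larger `W`, and each window position is multiplied by its own block of columns.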

Recurrent Neural Networks (RNN)

Core idea: apply the same weights $W$ repeatedly

A Simple RNN Language Model

Diagram of a language model built on a recurrent structure
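
Since the diagram is not reproduced here, the following is a minimal sketch of one common formulation, assuming the update $h^{(t)} = \tanh(W_h h^{(t-1)} + W_e e^{(t)} + b_1)$ and the output $\hat{y}^{(t)} = \text{softmax}(U h^{(t)} + b_2)$. The names `W_h`, `W_e`, the toy sizes, and the zero initial hidden state are assumptions, not taken from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, hidden = 10, 4, 8                        # assumed toy sizes
E = rng.normal(size=(V, d))                    # word embeddings
W_e = rng.normal(size=(hidden, d)) * 0.1       # input-to-hidden weights
W_h = rng.normal(size=(hidden, hidden)) * 0.1  # hidden-to-hidden weights
b1 = np.zeros(hidden)
U = rng.normal(size=(V, hidden)) * 0.1         # hidden-to-output weights
b2 = np.zeros(V)


def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    exp = np.exp(z - z.max())
    return exp / exp.sum()


def rnn_lm(word_ids):
    """Apply the same weights at every step; return one distribution per step."""
    h = np.zeros(hidden)                       # assumed initial hidden state
    outputs = []
    for i in word_ids:
        h = np.tanh(W_h @ h + W_e @ E[i] + b1)
        outputs.append(softmax(U @ h + b2))
    return np.stack(outputs)


short_out = rnn_lm([1, 5])
long_out = rnn_lm([1, 5, 2, 7, 3, 3, 9])
print(short_out.shape, long_out.shape)
```

The same parameter arrays handle both the 2-word and the 7-word input, so the parameter count does not depend on the sequence length.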

RNN Language Models

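Advantages of RNNs
- Can process input of any length
- In theory, the computation at step $t$ can use information from many steps back
- Model size does not grow as the input context gets longer
- The same weights are applied at every timestep, so inputs are processed symmetrically

Disadvantages of RNNs
- Recurrent computation is slow
- In practice, it is hard to access information from many steps back (vanishing gradients, etc.)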

Training an RNN Language Model

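Procedure
- Prepare a big corpus made up of sequences of words
- Feed it into the RNN-LM and compute the output distribution $\hat{\bm{y}}^{(t)}$ at every step $t$

Loss function
- Cross-entropy between the predicted probability distribution $\hat{\bm{y}}^{(t)}$ and the true next word $\bm{y}^{(t)}$ (one-hot for $x^{(t+1)}$)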

J^{(t)}(\theta) = CE(\bm{y}^{(t)}, \hat{\bm{y}}^{(t)}) = -\sum_{w \in V} \bm{y}_w^{(t)} \log \hat{\bm{y}}_w^{(t)} = -\log \hat{\bm{y}}_{x^{(t+1)}}^{(t)}
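
Because $\bm{y}^{(t)}$ is one-hot, the sum over the vocabulary collapses to a single term. A small numeric check, using a made-up 3-word distribution as an assumption:

```python
import numpy as np


def cross_entropy(y_onehot, y_hat):
    """Full sum over the vocabulary: -sum_w y_w * log(y_hat_w)."""
    return -np.sum(y_onehot * np.log(y_hat))


y_hat = np.array([0.1, 0.6, 0.3])   # hypothetical predicted distribution
target = 1                          # index of the true next word x^(t+1)
y = np.zeros(3)
y[target] = 1.0                     # one-hot vector for the true next word

full = cross_entropy(y, y_hat)      # sum over all words in the vocabulary
shortcut = -np.log(y_hat[target])   # only the true word's term survives
print(full, shortcut)
```

Both values equal $-\log 0.6$, which is why implementations usually index the predicted probability of the true word instead of building the one-hot vector.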

Average the loss over the entire training set

J(\theta)= \frac{1}{T} \sum_{t=1}^{T} -\log \hat{\bm{y}}^{(t)}_{x^{(t+1)}}

Computing the loss and gradients over the entire corpus at once is too expensive

J(\theta) = \frac{1}{T} \sum_{t=1} ^{T} J^{(t)} (\theta)

In practice, treat $x^{(1)}, \dots, x^{(T)}$ as a sentence (or a document)
Use Stochastic Gradient Descent (SGD): compute the loss and gradients on a small chunk of data (a batch), update the weights, and repeat
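
The batch-by-batch update loop can be sketched as below. To keep the gradient short enough to write by hand, this sketch swaps the RNN for a much simpler stand-in model (a table `W` of next-word logits indexed by the current word); only the SGD loop and the averaged cross-entropy loss mirror the text. The toy sentences, batch size, learning rate, and function names are assumptions.

```python
import numpy as np


def softmax_rows(z):
    """Row-wise numerically stable softmax."""
    exp = np.exp(z - z.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


def batch_loss_and_grad(W, batch):
    """Average cross-entropy over all (word, next word) pairs in the batch."""
    xs = np.array([x for s in batch for x in s[:-1]])   # current words
    ys = np.array([y for s in batch for y in s[1:]])    # true next words
    probs = softmax_rows(W[xs])                         # one distribution per pair
    rows = np.arange(len(ys))
    loss = -np.mean(np.log(probs[rows, ys]))
    dlogits = probs.copy()
    dlogits[rows, ys] -= 1.0                            # d(CE)/d(logits) = y_hat - y
    grad = np.zeros_like(W)
    np.add.at(grad, xs, dlogits / len(ys))              # accumulate per input word
    return loss, grad


def train(sentences, vocab_size, batch_size=2, epochs=50, lr=1.0, seed=0):
    """Plain SGD: shuffle, take a small batch, compute gradient, update, repeat."""
    rng = np.random.default_rng(seed)
    W = np.zeros((vocab_size, vocab_size))
    losses = []
    for _ in range(epochs):
        order = rng.permutation(len(sentences))
        for start in range(0, len(sentences), batch_size):
            batch = [sentences[i] for i in order[start:start + batch_size]]
            loss, grad = batch_loss_and_grad(W, batch)
            W -= lr * grad                              # SGD weight update
            losses.append(loss)
    return W, losses


# Hypothetical toy data: each sentence is a list of word ids.
sentences = [[0, 1, 2, 3, 4], [0, 1, 2, 3], [1, 2, 3, 4], [0, 1, 2]]
W, losses = train(sentences, vocab_size=5)
print(losses[0], losses[-1])
```

With an RNN-LM the structure of `train` would stay the same, while `batch_loss_and_grad` would be replaced by a forward pass through the network and backpropagation.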

Generating Text with an RNN Language Model

An RNN-LM trained on text in a particular style can generate text in that style
Examples: Obama speeches, Harry Potter novels, etc.

Recurrent Neural Networks for Other Applications

Tagging, classification, question answering, speech recognition

RNNs Can Be Used for Tagging

Examples: Part-of-speech tagging, Named Entity Recognition, etc.

RNNs Can Be Used for Sentence Classification

Example: sentiment classification, etc.

RNN-LMs Can Be Used to Generate Text

Speech recognition, machine translation, summarization, etc.

Variants of RNNs

Variations on the basic RNN architecture

Bidirectional and Multi-Layer RNNs: Motivation

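The hidden state at this position can be regarded as a representation of the word "terribly" in the context of the sentence
- This is called a contextual representation.
These contextual representations contain information only about the left context (e.g. "the movie was")
In this example, "exciting" is in the right context, and it modifies the meaning of "terribly" (from negative to positive).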

Bidirectional RNNs

Combine forward and backward information
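
One common way to combine the two directions is to run two separate RNNs, one left to right and one right to left, and concatenate their hidden states at each position. The sketch below assumes this concatenation scheme, simple `tanh` RNN cells, and random toy inputs; none of these details come from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden = 4, 3                                # assumed toy sizes


def make_params():
    """Random weights for one simple RNN: (W_h, W_x, b)."""
    return (rng.normal(size=(hidden, hidden)) * 0.1,
            rng.normal(size=(hidden, d)) * 0.1,
            np.zeros(hidden))


def run_rnn(inputs, params):
    """Return the hidden state after each input, in the order given."""
    W_h, W_x, b = params
    h = np.zeros(hidden)
    states = []
    for x in inputs:
        h = np.tanh(W_h @ h + W_x @ x + b)
        states.append(h)
    return states


def bidirectional_rnn(inputs, fwd_params, bwd_params):
    """Concatenate forward and backward hidden states position by position."""
    fwd = run_rnn(inputs, fwd_params)
    bwd = run_rnn(inputs[::-1], bwd_params)[::-1]   # reverse back to align positions
    return np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])


# Five stand-in word vectors, e.g. for "the movie was terribly exciting".
inputs = [rng.normal(size=d) for _ in range(5)]
fwd_params, bwd_params = make_params(), make_params()
states = bidirectional_rnn(inputs, fwd_params, bwd_params)

# Change only the last word and recompute.
changed = inputs[:4] + [inputs[4] + 1.0]
states_changed = bidirectional_rnn(changed, fwd_params, bwd_params)
print(states.shape)
```

At position 3 ("terribly"), the forward half of the state is identical in `states` and `states_changed`, while the backward half differs, so only the backward RNN carries information from the right context.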

Multi-Layer RNNs

Stack RNNs in multiple layers
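
Stacking can be sketched by feeding the hidden states of one layer as the input sequence of the next. This is a minimal sketch assuming simple `tanh` RNN layers, three layers, and toy sizes, all of which are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_layer(in_dim, hidden):
    """Random weights for one simple RNN layer."""
    return {
        "W_h": rng.normal(size=(hidden, hidden)) * 0.1,
        "W_x": rng.normal(size=(hidden, in_dim)) * 0.1,
        "b": np.zeros(hidden),
    }


def run_layer(inputs, layer):
    """Run one RNN layer over a sequence and return its hidden states."""
    h = np.zeros(layer["b"].shape[0])
    states = []
    for x in inputs:
        h = np.tanh(layer["W_h"] @ h + layer["W_x"] @ x + layer["b"])
        states.append(h)
    return states


def multi_layer_rnn(inputs, layers):
    """Hidden states of layer i become the inputs of layer i + 1."""
    for layer in layers:
        inputs = run_layer(inputs, layer)
    return np.stack(inputs)                     # hidden states of the top layer


d, hidden = 4, 6                                # assumed toy sizes
layers = [make_layer(d, hidden), make_layer(hidden, hidden), make_layer(hidden, hidden)]
inputs = [rng.normal(size=d) for _ in range(5)]
top_states = multi_layer_rnn(inputs, layers)
print(top_states.shape)
```

Only the first layer sees the word vectors; every higher layer has input dimension `hidden` because it reads the states of the layer below.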

Long Short-Term Memory RNNs (LSTMs)

A type of RNN proposed by Hochreiter and Schmidhuber in 1997 as a solution to the vanishing gradients problem
Given a sequence of inputs $x^{(t)}$, compute a sequence of hidden states $h^{(t)}$ and cell states $c^{(t)}$
Process at timestep $t$
Forget gate: controls what is kept and what is forgotten from the previous cell state