
13. Linear Models

Linear Functions

  • Use a different hypothesis space: linear functions over continuous-valued inputs
  • Covers linear regression with a univariate linear function, i.e., "fitting a straight line"
  • Also covers the multivariate case, and how to turn linear functions into classifiers by applying hard and soft thresholds


Univariate Linear Regression

  • Form of a univariate linear function (a straight line) with input $x$ and output $y$
    • $y = w_1 x + w_0$
  • $w_1$ and $w_0$ are real-valued coefficients to be learned
  • These coefficients are called weights.
    • Changing the relative weights of the terms changes the value of $y$
  • Define the weight vector $\mathbf{w} = \langle w_1,~w_0 \rangle$
  • Hypothesis function: $h_{\mathbf{w}}(x) = w_1 x + w_0$
  • The task of linear regression: find the $h_{\mathbf{w}}$ that best fits the data
  • Goal: find the weights $w_1,~w_0$ that minimize the empirical loss
  • Traditionally, use the squared-error loss function, called $L_2$, summed over all the training examples:

$$\text{Loss}(h_{\mathbf{w}}) = \sum_{j=1}^{N} L_2(y_j,~h_{\mathbf{w}}(x_j))$$

$$= \sum_{j=1}^{N} (y_j - h_{\mathbf{w}}(x_j))^2$$

$$= \sum_{j=1}^{N} (y_j - (w_1 x_j + w_0))^2$$

  • Goal: find $\mathbf{w}^* = \arg\min_{\mathbf{w}} \text{Loss}(h_{\mathbf{w}})$
  • The sum $\sum_{j=1}^{N} (y_j - (w_1 x_j + w_0))^2$ is minimized when its partial derivatives with respect to $w_0$ and $w_1$ are $0$.
  • The equations set to $0$:

$$\frac{\partial}{\partial w_0} \sum_{j=1}^{N} (y_j - (w_1 x_j + w_0))^2 = 0 \quad \text{and} \quad \frac{\partial}{\partial w_1} \sum_{j=1}^{N} (y_j - (w_1 x_j + w_0))^2 = 0$$

  • These equations have a unique solution:

$$w_1 = \frac{N \sum_j x_j y_j - \left(\sum_j x_j\right)\left(\sum_j y_j\right)}{N \sum_j x_j^2 - \left(\sum_j x_j\right)^2}$$

$$w_0 = \frac{\sum_j y_j - w_1 \left(\sum_j x_j\right)}{N}$$
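
A quick NumPy sketch of these closed-form formulas (the synthetic data below is an assumption for illustration):

```python
import numpy as np

# Assumed toy data generated from y = 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=50)

N = len(x)
# Unique solution of the univariate least-squares problem
w1 = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x ** 2) - np.sum(x) ** 2)
w0 = (np.sum(y) - w1 * np.sum(x)) / N

print(f"w1 β‰ˆ {w1:.3f}, w0 β‰ˆ {w0:.3f}")  # should be close to 2 and 1
```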

Univariate Linear Regression

Proof (derivation of $w_0$)

$$\frac{\partial}{\partial w_0} \sum_{j=1}^{N} (y_j - (w_1 x_j + w_0))^2 = \frac{\partial}{\partial w_0} \sum_{j=1}^{N} \left(w_0^2 - 2(y_j - w_1 x_j)w_0 + \dots\right)$$

$$= \sum_{j=1}^{N} \left(2w_0 - 2(y_j - w_1 x_j)\right) = 0$$

$$\therefore N w_0 = \sum_{j=1}^{N} y_j - w_1 \sum_{j=1}^{N} x_j$$

$$\therefore w_0 = \frac{\sum_j y_j - w_1 \left(\sum_j x_j\right)}{N}$$

Proof (derivation of $w_1$)

$$\frac{\partial}{\partial w_1} \sum_{j=1}^{N} (y_j - (w_1 x_j + w_0))^2 = \frac{\partial}{\partial w_1} \sum_{j=1}^{N} \left(x_j^2 w_1^2 + 2x_j(w_0 - y_j)w_1 + \dots\right)$$

$$= \sum_{j=1}^{N} \left(2x_j^2 w_1 + 2x_j(w_0 - y_j)\right) = 0$$

$$\therefore \sum_{j=1}^{N} x_j^2 w_1 + \sum_{j=1}^{N} (x_j w_0 - x_j y_j) = 0$$

  • Substitute $w_0$:

$$\therefore \sum_{j=1}^{N} x_j^2 w_1 + \sum_{j=1}^{N} x_j \left( \frac{\sum_k y_k - w_1 \sum_k x_k}{N} \right) - \sum_{j=1}^{N} x_j y_j = 0$$

  • (μŠ¬λΌμ΄λ“œ ν‘œκΈ° βˆ‘k\sum_kβˆ‘kβ€‹λŠ” βˆ‘j\sum_jβˆ‘j​와 동일)
  • 양변에 NNN을 κ³±ν•˜κ³  w1w_1w1​에 λŒ€ν•΄ 정리

$$\therefore N \sum_{j=1}^{N} x_j^2 w_1 + \left(\sum_{j=1}^{N} x_j\right) \left(\sum_j y_j - w_1 \sum_j x_j\right) - N \sum_{j=1}^{N} x_j y_j = 0$$

$$\therefore \left( N \sum_{j=1}^{N} x_j^2 - \left(\sum_{j=1}^{N} x_j\right)\left(\sum_j x_j\right) \right) w_1 = N \sum_{j=1}^{N} x_j y_j - \left(\sum_{j=1}^{N} x_j\right)\left(\sum_j y_j\right)$$

$$\therefore w_1 = \frac{N \sum_j x_j y_j - \left(\sum_j x_j\right)\left(\sum_j y_j\right)}{N \sum_j x_j^2 - \left(\sum_j x_j\right)^2}$$

Weight Space

  • λ§Žμ€ ν•™μŠ΅ ν˜•νƒœκ°€ 손싀(loss)을 μ΅œμ†Œν™”ν•˜κΈ° μœ„ν•΄ κ°€μ€‘μΉ˜(weights)λ₯Ό μ‘°μ •ν•˜λŠ” 것을 ν¬ν•¨ν•˜λ©°, κ°€μ€‘μΉ˜ 곡간(weight space)μ—μ„œ μΌμ–΄λ‚˜λŠ” 일에 λŒ€ν•œ μ‹œκ°μ  이해가 도움됨.
  • κ°€μ€‘μΉ˜ 곡간: κ°€λŠ₯ν•œ λͺ¨λ“  κ°€μ€‘μΉ˜ μ„€μ •μœΌλ‘œ μ •μ˜λ˜λŠ” 곡간
  • 예: Univariate μ„ ν˜• νšŒκ·€μ˜ 경우, w0w_0w0​와 w1w_1w1β€‹λ‘œ μ •μ˜λ˜λŠ” κ°€μ€‘μΉ˜ 곡간은 2차원
  • 손싀 ν•¨μˆ˜(loss function)λ₯Ό w0w_0w0​와 w1w_1w1β€‹μ˜ ν•¨μˆ˜λ‘œ 3D plot에 μ‹œκ°ν™” κ°€λŠ₯
  • 손싀 ν•¨μˆ˜λŠ” 볼둝(convex) ν•¨μˆ˜μ΄λ©°, μ΄λŠ” L2L_2L2​ 손싀 ν•¨μˆ˜λ₯Ό μ‚¬μš©ν•˜λŠ” λͺ¨λ“  μ„ ν˜• νšŒκ·€ λ¬Έμ œμ—μ„œ μ‚¬μ‹€μž„
  • 볼둝 ν•¨μˆ˜λŠ” μ§€μ—­ μ΅œμ ν•΄(local optima)κ°€ μ•„λ‹Œ μ „μ—­ μ΅œμ ν•΄(global optimum)λ₯Ό 보μž₯ alt text

Gradient Descent

  • The univariate linear model has the nice property that the optimal solution, where the partial derivatives are $0$, is easy to find
  • However, an analytic solution is not always easy to obtain, so we introduce a way of minimizing loss that does not depend on solving for the zeros of the derivatives and that can be applied to any loss function, no matter how complex
  • Search the continuous weight space by incrementally modifying the parameters: gradient descent


  1. Pick a random starting point in weight space
     • e.g., a point in the $(w_0,~w_1)$ plane for linear regression
  2. Compute an estimate of the gradient
  3. Move a small amount in the steepest downhill direction
  4. Repeat until converging to a point in weight space with (locally) minimal loss
  • Algorithm

    $\mathbf{w} \leftarrow$ any point in the parameter space
    while not converged do
        for each $w_j$ in $\mathbf{w}$ do
            $w_j \leftarrow w_j - \alpha \frac{\partial}{\partial w_j} \text{Loss}(\mathbf{w})$

  • The parameter $\alpha$ is called the learning rate or the step size
  • $\alpha$ can be a fixed constant, or it can decay over time as the learning process proceeds.

Gradient Descent for Univariate Linear Regression

  • For univariate regression, the loss is quadratic, so the partial derivatives are linear.
  • Simplified case with only one training example $(x,~y)$:

    $$\frac{\partial}{\partial w_i} Loss(\mathbf{w}) = \frac{\partial}{\partial w_i} (y - h_{\mathbf{w}}(x))^2$$

    $$= 2(y - h_{\mathbf{w}}(x)) \cdot \frac{\partial}{\partial w_i} (y - h_{\mathbf{w}}(x))$$

    $$= 2(y - h_{\mathbf{w}}(x)) \cdot \frac{\partial}{\partial w_i} \left(y - (w_1 x + w_0)\right)$$

  • Applying this to both $w_0$ and $w_1$:

    $$\frac{\partial}{\partial w_0} Loss(\mathbf{w}) = -2(y - h_{\mathbf{w}}(x))$$

    $$\frac{\partial}{\partial w_1} Loss(\mathbf{w}) = -2(y - h_{\mathbf{w}}(x)) \cdot x$$

  • Plugging these into the original gradient descent equation, and folding the constant 2 into the unspecified learning rate $\alpha$, gives the following learning rule:

    $$w_0 \leftarrow w_0 + \alpha (y - h_{\mathbf{w}}(x))$$

    $$w_1 \leftarrow w_1 + \alpha (y - h_{\mathbf{w}}(x)) \cdot x$$

  • 이 μ—…λ°μ΄νŠΈλŠ” μ§κ΄€μ μœΌλ‘œ 이해 κ°€λŠ₯: λ§Œμ•½ hw(x)>yh_{\mathbf{w}}(x) > yhw​(x)>y (즉, 좜λ ₯이 λ„ˆλ¬΄ 큼)이면, w0w_0w0​λ₯Ό μ•½κ°„ 쀄이고, xxxκ°€ μ–‘μ˜ μž…λ ₯이면 w1w_1w1​을 쀄이고 xxxκ°€ 음의 μž…λ ₯이면 w1w_1w1​을 늘림

Batch and Stochastic Gradient Descent

  • With $N$ training examples, we want to minimize the sum of the individual losses for each example.
  • The derivative of a sum is the sum of the derivatives, so:

    $$w_0 \leftarrow w_0 + \alpha \sum_{j=1}^N (y_j - h_{\mathbf{w}}(x_j))$$

    $$w_1 \leftarrow w_1 + \alpha \sum_{j=1}^N (y_j - h_{\mathbf{w}}(x_j)) \cdot x_j$$

  • 이 μ—…λ°μ΄νŠΈλŠ” λ‹¨λ³€λŸ‰ μ„ ν˜• νšŒκ·€λ₯Ό μœ„ν•œ 배치 경사 ν•˜κ°•λ²•(batch gradient descent) ν•™μŠ΅ κ·œμΉ™ (결정둠적 경사 ν•˜κ°•λ²•(deterministic gradient descent)이라고도 함)
  • λͺ¨λ“  ν›ˆλ ¨ 예제λ₯Ό λ‹€λ£¨λŠ” ν•œ 단계λ₯Ό 에포크(epoch)라고 함.
  • 더 λΉ λ₯Έ λ³€ν˜•: ν™•λ₯ μ  경사 ν•˜κ°•λ²•(stochastic gradient descent) λ˜λŠ” SGD
  • 각 λ‹¨κ³„μ—μ„œ λ¬΄μž‘μœ„λ‘œ 적은 수의 ν›ˆλ ¨ 예제λ₯Ό μ„ νƒν•˜κ³ , 경사 ν•˜κ°•λ²• 방정식에 따라 μ—…λ°μ΄νŠΈ
  • μ›λž˜ SGD 버전은 각 λ‹¨κ³„λ§ˆλ‹€ 단 ν•˜λ‚˜μ˜ ν›ˆλ ¨ 예제만 μ„ νƒν–ˆμ§€λ§Œ, ν˜„μž¬λŠ” NNN개 예제 쀑 mmm개의 λ―Έλ‹ˆλ°°μΉ˜(minibatch)λ₯Ό μ„ νƒν•˜λŠ” 것이 더 일반적

Stochastic Gradient Descent

  • On some CPU and GPU architectures, $m$ can be chosen to exploit parallel vector operations, making a step with $m$ examples about as fast as a step with a single example.
  • Within that constraint, $m$ is treated as a hyperparameter that must be tuned for each learning problem
  • Convergence of minibatch SGD is not strictly guaranteed; it can oscillate around the minimum without settling down.
  • To mitigate this, a schedule that decreases the learning rate $\alpha$ over time can be used.
  • SGD is widely applied to models other than linear regression, in particular neural networks.
  • Even when the loss surface is not convex, the approach has proven effective at finding good local minima that are close to the global minimum.

Multivariable Linear Regression

Multivariable (Multivariate) Linear Regression

  • Easily extended to multivariable linear regression problems, in which each example $\mathbf{x}_j$ is an $n$-element vector
  • The hypothesis space is the set of functions of the form:

$$h_{\mathbf{w}}(\mathbf{x}_j) = w_0 + w_1 x_{j,1} + w_2 x_{j,2} + \dots + w_n x_{j,n} = w_0 + \sum_{i=1}^n w_i x_{j,i}$$

  • For simpler notation, invent a dummy input attribute $x_{j,0}$ that is always equal to 1.
  • Then $h$ is the dot product of the weights and the input vector:

$$h_{\mathbf{w}}(\mathbf{x}_j) = \mathbf{w} \cdot \mathbf{x}_j = \mathbf{w}^T\mathbf{x}_j = \sum_{i=0}^n w_i x_{j,i}$$

  • The best vector of weights $\mathbf{w}^*$ minimizes the squared-error loss over the examples: $\mathbf{w}^* = \arg\min_{\mathbf{w}} \sum_{j} L_2(y_j,~\mathbf{w} \cdot \mathbf{x}_j)$ (a small sketch of the dummy-attribute trick follows below)
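
A small sketch of the dummy-attribute trick (the example matrix and weights below are assumptions): prepending a column of ones to the data matrix lets the hypothesis be computed as a single dot product per row.

```python
import numpy as np

X_raw = np.array([[2.0, 3.0],
                  [1.0, 5.0],
                  [4.0, 0.5]])   # assumed examples, n = 2 features each
w = np.array([0.5, 1.0, -2.0])   # [w0, w1, w2]

# Prepend the dummy attribute x_{j,0} = 1 to every example
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

y_hat = X @ w                    # h_w(x_j) = w . x_j for every row at once
print(y_hat)                     # [-3.5, -8.5, 3.5]
```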

How to Compute $\mathbf{w}^*$ in Multivariable Linear Regression

  • As in the univariate case, gradient descent will reach the (unique) minimum of the loss function
  • The update equation for each weight $w_i$:

$$w_i \leftarrow w_i + \alpha \sum_{j} (y_j - h_{\mathbf{w}}(\mathbf{x}_j)) \cdot x_{j,i}$$

  • Using the tools of linear algebra and vector calculus, it is also possible to solve analytically for the $\mathbf{w}$ that minimizes the loss.
  • Let $\mathbf{y}$ be the vector of outputs for the training examples, and $\mathbf{X}$ be the data matrix (i.e., the matrix of inputs with one $n$-dimensional example per row).
  • The vector of predicted outputs is $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}$
  • The squared-error loss over all the training data:

$$L(\mathbf{w}) = \|\hat{\mathbf{y}} - \mathbf{y}\|_2^2 = \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2$$

  • Set the gradient to $0$:

$$\nabla_{\mathbf{w}} L(\mathbf{w}) = \nabla_{\mathbf{w}} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2$$

$$= \nabla_{\mathbf{w}} (\mathbf{X}\mathbf{w} - \mathbf{y})^T (\mathbf{X}\mathbf{w} - \mathbf{y})$$

$$= \nabla_{\mathbf{w}} \left[\mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w} - 2\mathbf{y}^T\mathbf{X}\mathbf{w} + \mathbf{y}^T\mathbf{y}\right]$$

$$= 2\mathbf{X}^T\mathbf{X}\mathbf{w} - 2\mathbf{X}^T\mathbf{y}$$

$$= 2\mathbf{X}^T(\mathbf{X}\mathbf{w} - \mathbf{y}) = 0$$

  • Rearranging, the minimum-loss weight vector is given by the normal equation: $\mathbf{w}^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ (a small sketch follows below)
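
A minimal NumPy sketch of the normal equation on assumed random data (solving the linear system rather than forming the inverse explicitly, which is the usual numerically safer choice):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 3))])  # dummy column + 3 features
true_w = np.array([1.0, -2.0, 0.5, 3.0])                       # assumed ground-truth weights
y = X @ true_w + rng.normal(scale=0.1, size=100)

# Normal equation: w* = (X^T X)^{-1} X^T y, computed as a linear solve
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)   # close to true_w
```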

Regularization for Multivariable Linear Regression

  • With multivariable linear regression in high-dimensional spaces, dimensions that are actually irrelevant can appear useful by chance, leading to overfitting.
  • It is therefore common to use regularization on multivariable linear functions to avoid overfitting
  • With regularization we minimize the total cost of a hypothesis, counting both the empirical loss and the complexity of the hypothesis:

    $$Cost(h) = EmpLoss(h) + \lambda~Complexity(h)$$

  • λ³΅μž‘λ„λŠ” κ°€μ€‘μΉ˜μ˜ ν•¨μˆ˜λ‘œ μ§€μ • κ°€λŠ₯

    $$Complexity(h_{\mathbf{w}}) = L_q(\mathbf{w}) = \sum_{i} |w_i|^q$$

  • With $q=1$, we have $L_1$ regularization, which minimizes the sum of the absolute values
  • With $q=2$, we have $L_2$ regularization, which minimizes the sum of the squares (a small sketch of the regularized cost follows below)
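
A minimal sketch of the regularized cost for both choices of $q$ (the data, weights, and $\lambda$ below are arbitrary assumptions):

```python
import numpy as np

def cost(w, X, y, lam, q):
    """Empirical squared-error loss plus lambda times the L_q complexity of the weights."""
    emp_loss = np.sum((y - X @ w) ** 2)   # EmpLoss(h_w)
    complexity = np.sum(np.abs(w) ** q)   # L_q(w) = sum_i |w_i|^q
    return emp_loss + lam * complexity

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, 0.0, 0.0, -2.0]) + rng.normal(scale=0.1, size=50)
w = rng.normal(size=4)

print(cost(w, X, y, lam=0.1, q=1))   # L1-regularized cost
print(cost(w, X, y, lam=0.1, q=2))   # L2-regularized cost
```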

Property of $L_1$ Regularization

  • $L_1$ regularization has an important advantage.
    • It tends to produce a sparse model
  • That is, it often sets many weights to $0$, effectively declaring the corresponding attributes to be completely irrelevant
  • Minimizing $Loss(\mathbf{w}) + \lambda\,Complexity(\mathbf{w})$ is equivalent to minimizing $Loss(\mathbf{w})$ subject to the constraint $Complexity(\mathbf{w}) \leq c$


Linear Classification & Logistic Regression

Linear Classification

  • Linear functions can be used to do classification as well as regression
    • Example: classifying earthquakes vs. nuclear explosions
  • A decision boundary is a line (or a surface, in higher dimensions) that separates the two classes
  • A linear decision boundary is called a linear separator, and data that admit such a separator are called linearly separable.

$$-4.9 + 1.7x_1 - x_2 = 0$$

$$\mathbf{w} = [-4.9,~1.7,~-1]$$

$$h_{\mathbf{w}}(\mathbf{x}) = 1 \text{ if } \mathbf{w} \cdot \mathbf{x} \geq 0 \text{ and } 0 \text{ otherwise}$$
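
A tiny sketch of this hard-threshold classifier using the weights above (the query points are made up):

```python
import numpy as np

w = np.array([-4.9, 1.7, -1.0])   # [w0, w1, w2] for the separator above

def h(x1, x2):
    """Return 1 if w . x >= 0 (with the dummy input x0 = 1), else 0."""
    x = np.array([1.0, x1, x2])
    return 1 if w @ x >= 0 else 0

print(h(5.0, 2.0))   # 1: on the positive side of the boundary
print(h(2.0, 6.0))   # 0: on the other side
```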


Linear Classifiers with a Threshold Function

  • $h$ can be thought of as the result of passing the linear function $\mathbf{w} \cdot \mathbf{x}$ through a threshold function:

$$h_{\mathbf{w}}(\mathbf{x}) = Threshold(\mathbf{w} \cdot \mathbf{x}) \text{~where~} Threshold(z) = 1 \text{ if } z \geq 0 \text{ and } 0 \text{ otherwise}$$

  • Problem: neither (1) gradient descent nor (2) a closed-form computation of the optimal weights $\mathbf{w}^*$ can be used
  • The gradient is $0$ almost everywhere in weight space, except at the points where $\mathbf{w} \cdot \mathbf{x} = 0$, and at those points the gradient is undefined

Problems of Linear Classification with a Hard Threshold

  • μ„ ν˜• ν•¨μˆ˜μ˜ 좜λ ₯을 μž„κ³„ ν•¨μˆ˜μ— ν†΅κ³Όμ‹œν‚€λŠ” 것이 μ„ ν˜• λΆ„λ₯˜κΈ°(linear classifier)λ₯Ό 생성함을 확인
  • ν•˜μ§€λ§Œ μž„κ³„κ°’μ˜ κ²½μ„±(hard nature)은 λͺ‡ κ°€μ§€ 문제λ₯Ό μ•ΌκΈ°
  • κ°€μ„€ hw(x)h_{\mathbf{w}}(\mathbf{x})hw​(x)λŠ” λ―ΈλΆ„ λΆˆκ°€λŠ₯ν•˜λ©° μž…λ ₯κ³Ό κ°€μ€‘μΉ˜μ— λŒ€ν•΄ λΆˆμ—°μ† ν•¨μˆ˜μž„. μ΄λŠ” perceptron rule을 μ‚¬μš©ν•œ ν•™μŠ΅μ„ 맀우 예츑 λΆˆκ°€λŠ₯ν•˜κ²Œ λ§Œλ“¦.
  • λ˜ν•œ, μ„ ν˜• λΆ„λ₯˜κΈ°λŠ” 경계에 맀우 κ°€κΉŒμš΄ μ˜ˆμ œμ— λŒ€ν•΄μ„œλ„ 항상 111 λ˜λŠ” 000의 μ™„μ „ν•œ 확신에 μ°¬ μ˜ˆμΈ‘μ„ μ•Œλ¦Ό. 일뢀 μ˜ˆμ œλŠ” λͺ…ν™•ν•œ 000 λ˜λŠ” 111둜, λ‹€λ₯Έ μ˜ˆμ œλŠ” λΆˆλΆ„λͺ…ν•œ 경계선 μΌ€μ΄μŠ€λ‘œ λΆ„λ₯˜ν•  수 μžˆλ‹€λ©΄ 더 쒋을 것
  • 이 λͺ¨λ“  λ¬Έμ œλŠ” μž„κ³„ ν•¨μˆ˜λ₯Ό λΆ€λ“œλŸ½κ²Œ(softening) ν•¨μœΌλ‘œμ¨ (κ²½μ„± μž„κ³„κ°’μ„ 연속적이고 λ―ΈλΆ„ κ°€λŠ₯ν•œ ν•¨μˆ˜λ‘œ 근사) 크게 ν•΄κ²° κ°€λŠ₯

Logistic Function

  • The logistic function (also called the sigmoid function):

    $$Logistic(z) = \frac{1}{1 + e^{-z}}$$

  • Replace the threshold function with the logistic function:

$$h_{\mathbf{w}}(\mathbf{x}) = Logistic(\mathbf{w} \cdot \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w} \cdot \mathbf{x}}}$$

  • The process of fitting the weights of this model to minimize the loss on a data set is called logistic regression

How to Compute $\mathbf{w}^*$ in Logistic Regression

  • Since the hypothesis no longer outputs just $0$ or $1$, we use the $L_2$ loss function
  • Let $g$ denote the logistic function and $g'$ its derivative
  • For a single example $(\mathbf{x},~y)$, the derivation of the gradient is the same as for linear regression, up to the point where the actual form of $h$ is inserted:

    $$\frac{\partial}{\partial w_i} Loss(\mathbf{w}) = \frac{\partial}{\partial w_i} (y - h_{\mathbf{w}}(\mathbf{x}))^2$$

    $$= 2(y - h_{\mathbf{w}}(\mathbf{x})) \cdot \frac{\partial}{\partial w_i} (y - h_{\mathbf{w}}(\mathbf{x}))$$

    $$= -2(y - h_{\mathbf{w}}(\mathbf{x})) \cdot g'(\mathbf{w} \cdot \mathbf{x}) \cdot \frac{\partial}{\partial w_i} (\mathbf{w} \cdot \mathbf{x})$$

    $$= -2(y - h_{\mathbf{w}}(\mathbf{x})) \cdot g'(\mathbf{w} \cdot \mathbf{x}) \cdot x_i$$

  • The derivative $g'$ of the logistic function satisfies $g'(z) = g(z)(1 - g(z))$:

    $$\frac{\partial}{\partial z} \frac{1}{1 + e^{-z}} = \frac{\partial}{\partial z} (1 + e^{-z})^{-1}$$

    $$= -1 \cdot (1 + e^{-z})^{-2} \cdot (-e^{-z})$$

    $$= \frac{e^{-z}}{(1 + e^{-z})^2}$$

    $$= \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = g(z)(1 - g(z))$$

λ”°λΌμ„œ,

$$g'(\mathbf{w} \cdot \mathbf{x}) = g(\mathbf{w} \cdot \mathbf{x})\left(1 - g(\mathbf{w} \cdot \mathbf{x})\right) = h_{\mathbf{w}}(\mathbf{x})\left(1 - h_{\mathbf{w}}(\mathbf{x})\right)$$

  • The weight update for minimizing the loss takes a step in the direction of the difference between the target and the prediction, $(y - h_{\mathbf{w}}(\mathbf{x}))$, with the length of the step depending on the constant $\alpha$ and on $g'$:

    $$w_i \leftarrow w_i + \alpha (y - h_{\mathbf{w}}(\mathbf{x})) \cdot h_{\mathbf{w}}(\mathbf{x})\left(1 - h_{\mathbf{w}}(\mathbf{x})\right) \cdot x_i$$
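
A minimal sketch of this logistic-regression update rule applied one example at a time (the data, labels, and hyperparameters below are assumptions, and the constant 2 is again folded into $\alpha$):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])  # dummy input x0 = 1
true_w = np.array([-0.5, 2.0, -1.0])                           # assumed separator
y = (X @ true_w >= 0).astype(float)                            # 0/1 labels

w = np.zeros(3)
alpha, epochs = 0.5, 100

for _ in range(epochs):
    for j in rng.permutation(len(X)):      # one example at a time (SGD)
        h = logistic(w @ X[j])
        # w_i <- w_i + alpha * (y - h_w(x)) * h_w(x) * (1 - h_w(x)) * x_i
        w += alpha * (y[j] - h) * h * (1 - h) * X[j]

print(w)   # points roughly in the same direction as true_w (up to scale)
```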
