14. Indexing

Indexing: a mechanism used to speed up access to desired data
Search key: an attribute or set of attributes used to look up records in a file
An index file consists of records (called index entries) of the form (search-key, pointer(s)), where a pointer is typically a (block pointer, offset) pair
Index files are typically much smaller than the original data file
Two basic kinds of indices
- Ordered indices: search keys are stored in sorted order
- Hash indices: search keys are distributed uniformly across "buckets" using a hash function
Index evaluation metrics
- Access types supported efficiently
  - Point queries: records with a specified value of the search key
  - Range queries: records whose search-key value falls in a specified range
- Access, insertion, and deletion times for data records
- Space overhead

Ordered Indices

Ordered index: index entries are stored sorted on the search-key value
Clustering index
- In a sequentially ordered data file, the index whose search key also defines the sequential order of the file
- Also called a primary index
- The search key of a primary index is usually, but not necessarily, the primary key
Secondary index
- An index whose search key specifies an order different from the sequential order of the file
- Also called a nonclustering index
Index-sequential file
- A sequential data file ordered on a search key, with a clustering index on that search key

Dense Index

Dense index: an index entry appears for every search-key value in the data file

Sparse Index

Sparse index: contains index entries for only some search-key values
Applicable when records are sequentially ordered on the search key
To locate a record with search-key value $K$
- Find the index entry with the largest search-key value less than or equal to $K$
- Search the file sequentially, starting at the record that index entry points to
Compared with dense indices
- Less space and less maintenance overhead for insertions and deletions of data records
- Generally slower than a dense index for locating records
Trade-off between access time and space overhead
Good compromise
- For a clustering index: a sparse index with one index entry per block of the file, corresponding to the least search-key value in that block
- Note: the dominant cost of query processing is block I/O time; the time to scan a block in memory is negligible
- This scheme needs the same number of block I/Os as a dense index, with far less space overhead
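The sparse-index lookup described above can be sketched as follows. This is a minimal illustration, assuming unique search keys, one index entry per block, and an in-memory list standing in for disk blocks; `blocks`, `sparse_keys`, and `sparse_lookup` are hypothetical names.

```python
from bisect import bisect_right

# Hypothetical data file: blocks of (key, payload) records sorted on the search key.
blocks = [
    [(10, "a"), (20, "b"), (30, "c")],
    [(40, "d"), (50, "e"), (60, "f")],
    [(70, "g"), (80, "h")],
]
# Sparse index: the least search-key value of each block, in block order.
sparse_keys = [block[0][0] for block in blocks]


def sparse_lookup(key):
    """Return the payload stored under `key`, or None if it is absent."""
    # Index entry with the largest search-key value <= key.
    pos = bisect_right(sparse_keys, key) - 1
    if pos < 0:
        return None  # key is smaller than every key in the file
    # One block read, followed by a sequential scan of that block.
    for k, payload in blocks[pos]:
        if k == key:
            return payload
        if k > key:
            break  # records are sorted, so the key cannot appear later
    return None
```

Only one block is scanned per lookup, which is why this layout costs the same number of block reads as a dense index.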

Multilevel Index

If the index does not fit in memory, access becomes expensive
Solution: multilevel index
- Treat the index kept on disk as a sequential file and build a sparse index on it
- Outer index: a sparse index on the basic index
- Inner index: the basic index file
- If even the outer index is too large to fit in main memory, yet another level of index can be created
Indices at all levels must be updated on insertion into or deletion from the file

Index Update: Insertion

Indices impose overhead on database modification
- When a record is inserted or deleted, every index on the relation must be updated
- When a record is updated, every index on an updated attribute must be updated
Index update upon insertion: 1) Dense indices
- Perform a lookup using the search-key value of the record being inserted
- If the search-key value does not appear in the index, insert it
- If the search-key value does appear in the index
  - If the index entry stores pointers to all records with the same search-key value, add a pointer to the new record
  - Otherwise, the index entry stores a pointer only to the first record with that search-key value; place the new record after the other records with the same search-key value
- Indices are maintained as sequential files $\to$ space must be created for the new entry, and overflow blocks may be required
Index update upon insertion: 2) Sparse indices
- Perform a lookup using the search-key value of the record being inserted
- If the index stores an entry for each block of the file, no change to the index is needed unless
  - A new block is created (because the existing block is already full): the first search-key value in the new block is inserted into the index
  - The new record has the least search-key value in its block: the index entry pointing to the block is updated

Index Update: Deletion

Index update upon deletion: 1) Dense indices
- (Case 1) If the deleted record was the only record in the file with its search-key value, the search key is deleted from the index as well
- Otherwise
  - If the index entry stores pointers to all records with the same search-key value, delete the pointer to the deleted record
  - Otherwise, the index entry stores a pointer only to the first record with that search-key value
    - (Case 2) If the deleted record was the first record with that search-key value, update the index entry to point to the next record
    - (Case 3) Otherwise, no index update is required
Index update upon deletion: 2) Sparse indices
- (Case 1) If the index does not contain an entry matching the search-key value of the deleted record, do nothing
- Otherwise
  - (Case 2) If the deleted record was the only record with its search key, replace the index entry with the next search-key value in the file
    - If the next search-key value already has an index entry, the entry is deleted instead of being replaced
  - (Primary index on a non-key attribute) Otherwise, if the index entry for the search-key value points to the deleted record, update the index entry to point to the next record with the same search-key value

Secondary Indices: An example

Secondary index on the salary field of the instructor file (nonunique search key)
An index record points to a bucket that contains pointers to all the actual records with that particular search-key value
Secondary indices have to be dense
The index update procedure is the same as for a dense clustering index
A sequential scan using a secondary (nonclustering) index is expensive on HDDs
- Each record access may fetch a new block from disk

Indices on Multiple Keys

Composite search key: a search key containing more than one attribute
- E.g., an index on the attributes (name, ID) of the instructor relation
Values are sorted lexicographically
- E.g., $(\text{John}, 12121) < (\text{John}, 13514)$ and $(\text{John}, 13514) < (\text{Peter}, 11223)$
Can be queried on name alone, or on (name, ID)
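The lexicographic ordering, and the reason a query on name alone can still use the (name, ID) index, can be illustrated with Python tuples, which compare lexicographically. This is a sketch only; the sample entries and the helper `ids_for_name` are hypothetical.

```python
from bisect import bisect_left

# Hypothetical index entries on the composite key (name, ID), kept sorted.
entries = sorted([("Peter", 11223), ("John", 13514), ("John", 12121), ("Mary", 10101)])

# Tuples compare lexicographically, matching the ordering in the example above.
assert ("John", 12121) < ("John", 13514) < ("Peter", 11223)


def ids_for_name(name):
    """Return all IDs stored under `name`, using only the leading attribute."""
    # (name,) sorts before every (name, ID), so this finds the first match.
    start = bisect_left(entries, (name,))
    ids = []
    for entry_name, entry_id in entries[start:]:
        if entry_name != name:
            break  # entries for one name are contiguous in sorted order
        ids.append(entry_id)
    return ids
```

A query on ID alone gets no such benefit, because entries with the same ID are scattered across the sorted order.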

B$^+$-Tree (and B-Tree) Index Files

B$^+$-Tree Index Files

Disadvantage of the index-sequential file organization
- Performance degrades as the file grows, since many overflow blocks get created (for both index lookups and sequential scans)
- Periodic reorganization of the entire file is required, which is expensive
Advantage of the B$^+$-tree index structure
- Automatically reorganizes itself with small, local changes on insertions and deletions
- Reorganization of the entire file is not required to maintain performance
(Minor) disadvantage of B$^+$-trees
- Extra insertion and deletion overhead, and space overhead
The advantages of B$^+$-trees outweigh the disadvantages, so they are used extensively
A B$^+$-tree is a rooted tree satisfying the following properties
- All paths from the root to a leaf are of the same length: balanced tree
- Each node that is neither the root nor a leaf (i.e., an internal node) has between $\lceil n/2 \rceil$ and $n$ children ($n$ is fixed for a particular tree)
  - At least $\lceil n/2 \rceil$ and at most $n$ children (pointers)
- A leaf node has between $\lceil (n-1)/2 \rceil$ and $n-1$ values
  - At least $\lceil (n-1)/2 \rceil$ and at most $n-1$ values (not pointers)
- Special cases
  - If the root is not a leaf, it has at least 2 children
  - If the root is a leaf (i.e., there are no other nodes in the tree), it can have between 0 and $n-1$ values
A B$^+$-tree is a multilevel index, but it has a different structure from a multilevel index-sequential file
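As a quick check of the occupancy rules above, the bounds can be computed for a concrete $n$. This is a small sketch; the function name `occupancy_bounds` and the dictionary keys are made up for illustration.

```python
from math import ceil


def occupancy_bounds(n):
    """Return (min, max) occupancy for each node kind in a B+-tree with parameter n."""
    return {
        "internal_children": (ceil(n / 2), n),
        "leaf_values": (ceil((n - 1) / 2), n - 1),
        "root_children_if_not_leaf": (2, n),
        "root_values_if_leaf": (0, n - 1),
    }
```

For example, with $n = 4$ an internal node has 2 to 4 children and a leaf holds 2 to 3 values.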

B$^+$-Tree Node Structure

Typical node
$P_1, K_1, P_2, \dots, P_{n-1}, K_{n-1}, P_n$
- $K_i$ are the search-key values
- $P_i$ are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes)
- Note: a node holds at most $n$ pointers and $n-1$ key values
- The search keys in a node are ordered: $K_1 < K_2 < K_3 < \dots < K_{n-1}$ (initially assume no duplicate keys)

Leaf Nodes in B$^+$-Trees

Properties of a leaf node
- For $i = 1, 2, \dots, n-1$, pointer $P_i$ points to a file record with search-key value $K_i$
- If $L_i$ and $L_j$ are leaf nodes and $i < j$ ($L_i$ is to the left of $L_j$ in the tree), the search-key values in $L_i$ are less than those in $L_j$
- $P_n$ points to the next leaf node in search-key order
- All leaf nodes are chained together in search-key order to speed up sequential processing
- If the B$^+$-tree index is used as a dense index (the usual case), every search-key value must appear in some leaf node. However, a search key that appears in a non-leaf node may no longer appear in any leaf node because of record deletions (discussed later)

Non-Leaf Nodes in B$^+$-Trees

Non-leaf nodes form a multilevel sparse index on the leaf nodes
For a non-leaf node with $n$ pointers
- All search keys in the subtree to which $P_1$ points are less than $K_1$
- For $2 \leq i \leq n-1$, all search keys in the subtree to which $P_i$ points are greater than or equal to $K_{i-1}$ and less than $K_i$
- All search keys in the subtree to which $P_n$ points are greater than or equal to $K_{n-1}$
Note: no search-key value is duplicated among the non-leaf nodes
General structure

Observations about B$^+$-trees

Since inter-node connections are made by pointers, "logically" close blocks need not be "physically" close
The non-leaf levels of the B$^+$-tree form a hierarchy of sparse indices
A B$^+$-tree contains a relatively small number of levels
- The level below the root has at least $2 \cdot \lceil n/2 \rceil$ values
- The next level has at least $2 \cdot \lceil n/2 \rceil \cdot \lceil n/2 \rceil$ values
If there are $K$ search-key values in the file, the tree height is no more than $\lceil \log_{\lceil n/2 \rceil}(K) \rceil$
- Thus searches can be conducted efficiently
Insertions and deletions to the main file can also be handled efficiently, since the index can be restructured in logarithmic time

Queries on B$^+$-Trees: Point Query

function find(v)
1. Set C = root node
2. while (C is not a leaf node) begin
      Let i = smallest number such that v ≤ C.K_i
      if there is no such number i then
         /* v is larger than every key in C */
         Set C = the node pointed to by the last non-null pointer in C
      else if (v = C.K_i) Set C = C.P_{i+1}
      else Set C = C.P_i   /* v < C.K_i */
   end
   /* Now C is a leaf node */
3. if for some i, C.K_i = v then return C.P_i
4. else return null   /* no record with search-key value v exists */
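The find(v) procedure can be rendered in Python roughly as follows. This is a sketch under simplifying assumptions: nodes are in-memory objects, keys are unique, a non-leaf node stores exactly one more pointer than keys (no null pointers), and a leaf stores one record pointer per key. The `Node` class is hypothetical.

```python
from bisect import bisect_left


class Node:
    """Hypothetical in-memory B+-tree node."""

    def __init__(self, keys, pointers, is_leaf):
        self.keys = keys          # sorted search-key values
        self.pointers = pointers  # children (non-leaf) or record pointers (leaf)
        self.is_leaf = is_leaf


def find(root, v):
    """Return the record pointer for search-key value v, or None."""
    c = root
    while not c.is_leaf:
        i = bisect_left(c.keys, v)      # smallest i such that v <= keys[i]
        if i == len(c.keys):            # v is larger than every key in c
            c = c.pointers[-1]
        elif v == c.keys[i]:            # equal keys live in the right subtree
            c = c.pointers[i + 1]
        else:                           # v < keys[i]
            c = c.pointers[i]
    i = bisect_left(c.keys, v)
    if i < len(c.keys) and c.keys[i] == v:
        return c.pointers[i]
    return None                         # no record with search-key value v


# Usage on a small two-level tree.
leaf1 = Node([10, 20], ["r10", "r20"], True)
leaf2 = Node([30, 40], ["r30", "r40"], True)
leaf3 = Node([50, 60], ["r50", "r60"], True)
root = Node([30, 50], [leaf1, leaf2, leaf3], False)
```

Using binary search inside a node is a choice made here; the pseudocode only requires finding the smallest qualifying $i$.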

Queries on B$^+$-Trees: Range Query

Range queries: find all records with search-key values in a given range
function findRange(lb, ub) returns the set of all records with search-key value $V$ such that $\text{lb} \le V \le \text{ub}$
1. Set $C$ = the leaf node in which $\text{lb}$ would appear (found the same way find(v) locates $C$)
2. Let $i$ = the smallest index in $C$ such that $K_i \ge \text{lb}$
3. while ($K_i \le \text{ub}$)
   - Add $C.P_i$ to the results
   - Set $i = i + 1$ if there are more records in $C$; otherwise move to the next leaf node and set $i = 1$
Real implementations usually provide an iterator interface that fetches matching records one at a time using a next() function
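A range scan over the leaf chain, exposed as an iterator, might look like the sketch below. It assumes the starting leaf has already been located by the same descent as find(v); the `Leaf` class and `find_range` are hypothetical, and a Python generator plays the role of the next() interface.

```python
from bisect import bisect_left


class Leaf:
    """Hypothetical leaf node: sorted keys, one record pointer per key, next-leaf link."""

    def __init__(self, keys, records):
        self.keys = keys
        self.records = records
        self.next = None  # plays the role of P_n


def find_range(start_leaf, lb, ub):
    """Yield records with lb <= key <= ub, one at a time."""
    c = start_leaf
    i = bisect_left(c.keys, lb)       # smallest i with keys[i] >= lb
    while c is not None:
        if i == len(c.keys):          # leaf exhausted: follow the chain
            c, i = c.next, 0
            continue
        if c.keys[i] > ub:            # past the upper bound: stop
            return
        yield c.records[i]
        i += 1


# Usage: three chained leaves.
l1 = Leaf([10, 20], ["r10", "r20"])
l2 = Leaf([30, 40], ["r30", "r40"])
l3 = Leaf([50, 60], ["r50", "r60"])
l1.next, l2.next = l2, l3
```

The sketch also covers the case where no key in the starting leaf is at least lb, by moving on to the next leaf.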

Queries on B$^+$-Trees: Cost Analysis

If there are $K$ search-key values in the file, the height of the tree is no more than $\lceil \log_{\lceil n/2 \rceil}(K) \rceil$
A node is generally the same size as a disk block, typically 4 KB
$n$ is typically around 100 (40 bytes per index entry = 32 bytes per search key + 8 bytes per disk-block pointer)
If the search-key size is 12 bytes (20 bytes per entry), $n$ is around 200
With 1 million search-key values and $n = 100$
- At most $\lceil \log_{50}(1{,}000{,}000) \rceil = 4$ nodes are accessed in an index lookup from the root to a leaf
- Compare with a balanced binary tree with 1 million search-key values: around 20 nodes are accessed in a lookup
- The difference is significant, since every node access may need a disk I/O costing around 20 milliseconds
After traversing the index, one more (random) I/O is needed to fetch the matching record
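The arithmetic in this analysis can be reproduced with a few lines. The helper names `fanout` and `max_height` are made up for this sketch, and the block and entry sizes are the ones quoted above.

```python
from math import ceil, log, log2


def fanout(block_bytes, entry_bytes):
    """Approximate n: how many index entries fit in one node (block)."""
    return block_bytes // entry_bytes


def max_height(num_keys, n):
    """Upper bound ceil(log_{ceil(n/2)}(K)) on the number of nodes on a root-to-leaf path."""
    return ceil(log(num_keys) / log(ceil(n / 2)))


n_large_keys = fanout(4096, 40)          # 32-byte key + 8-byte pointer
n_small_keys = fanout(4096, 20)          # 12-byte key + 8-byte pointer
btree_nodes = max_height(1_000_000, 100)
binary_nodes = ceil(log2(1_000_000))     # balanced binary tree, for comparison
```

The bound uses floating-point logarithms, so it should be treated as an estimate when $K$ is an exact power of $\lceil n/2 \rceil$.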

Non-Unique Keys

If a search key $a_i$ is not unique, create an index on the composite key $(a_i, A_{\text{pp}})$ instead, which is unique
- $A_{\text{pp}}$ can be the primary key, a record ID, or any other attribute that guarantees uniqueness
A search for $a_i = v$ can be implemented as a range search on the composite key
- Range: $(v, -\infty)$ to $(v, +\infty)$
But more I/O operations are needed to fetch the actual records
- If the index is clustering, all accesses are sequential
- If the index is non-clustering, each record access may need an I/O operation
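The range $(v, -\infty)$ to $(v, +\infty)$ translates directly into two binary searches over the sorted composite keys. This sketch uses a sorted Python list in place of the B$^+$-tree leaf level and a record ID as the uniquifying attribute; the data and the `lookup` helper are hypothetical.

```python
from bisect import bisect_left, bisect_right
from math import inf

# Hypothetical index on (salary, record_id); record_id makes each key unique.
index = sorted([(80000, 3), (60000, 1), (80000, 7), (95000, 2), (80000, 5)])


def lookup(v):
    """Return the record IDs of all entries whose first attribute equals v."""
    lo = bisect_left(index, (v, -inf))   # first composite key >= (v, -infinity)
    hi = bisect_right(index, (v, inf))   # first composite key >  (v, +infinity)
    return [rid for _, rid in index[lo:hi]]
```

Each returned record ID still has to be followed to the data file, which is where the extra I/O mentioned above comes from.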

Updates on B$^+$-Trees: Insertion

Assume the record has already been added to the data file
- Let $P_r$ be the pointer to the record and $v$ its search-key value
Find the leaf node $L$ in which the search-key value would appear
1. If there is room in $L$, insert the pair $(v, P_r)$ into $L$ (note: a leaf node holds at most $n-1$ pairs)
2. Otherwise, split the node (along with the new $(v, P_r)$ entry)
  - Splitting a leaf node $L$:
    - Take the $n$ (search-key, pointer) pairs (including the one being inserted) in sorted order. Place the first $\lceil n/2 \rceil$ in the original node $L$ and the rest in a new node $L'$
    - Let $k$ be the least key value in $L'$. Insert $(k, L')$ into the parent of the node being split
  - If the parent is full, split it and propagate the split further up
  - Splitting of nodes proceeds upwards until a node that is not full is found
  - In the worst case the root node is split, increasing the height of the tree by 1
Splitting a non-leaf node: when inserting $(k, L')$ into an already full internal node $N$
- Copy $N$ to an in-memory area $M$ with space for $n+1$ pointers and $n$ keys
- Insert $(k, L')$ into $M$
- Copy $P_1, K_1, \dots, K_{\lceil (n+1)/2 \rceil - 1}, P_{\lceil (n+1)/2 \rceil}$ from $M$ back into node $N$
- Copy $P_{\lceil (n+1)/2 \rceil + 1}, K_{\lceil (n+1)/2 \rceil + 1}, \dots, K_n, P_{n+1}$ from $M$ into the newly allocated node $N'$
- Insert $(K_{\lceil (n+1)/2 \rceil}, N')$ into the parent of $N$
- Note: unlike splitting a leaf node, the search key is 'moved' to the parent node rather than 'copied' (i.e., no duplication!)
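The two split rules can be sketched on plain Python lists. This is an illustration only: nodes are represented as lists of keys and pointers rather than disk blocks, keys are assumed unique, and the function names are hypothetical.

```python
from bisect import bisect_right
from math import ceil


def split_leaf(pairs, new_pair, n):
    """Split a full leaf (n-1 pairs) together with the new pair.

    Returns (L, L_prime, k), where k is the least key of L_prime and is
    copied into the parent.
    """
    all_pairs = sorted(pairs + [new_pair], key=lambda p: p[0])  # n pairs
    half = ceil(n / 2)
    left, right = all_pairs[:half], all_pairs[half:]
    return left, right, right[0][0]


def split_internal(keys, pointers, k, child, n):
    """Split a full internal node (n-1 keys, n pointers) after adding (k, child).

    Returns ((N_keys, N_ptrs), up_key, (N2_keys, N2_ptrs)); up_key is moved
    to the parent and appears in neither half.
    """
    pos = bisect_right(keys, k)
    # M: temporary area with n keys and n+1 pointers.
    m_keys = keys[:pos] + [k] + keys[pos:]
    m_ptrs = pointers[:pos + 1] + [child] + pointers[pos + 1:]
    h = ceil((n + 1) / 2)
    left = (m_keys[:h - 1], m_ptrs[:h])   # P_1, K_1, ..., K_{h-1}, P_h
    up_key = m_keys[h - 1]                # K_h
    right = (m_keys[h:], m_ptrs[h:])      # P_{h+1}, K_{h+1}, ..., K_n, P_{n+1}
    return left, up_key, right
```

With $n = 4$, splitting the internal node with keys 10, 20, 30 after adding key 25 moves 25 up and leaves keys 10, 20 on the left and 30 on the right.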

Updates on B$^+$-Trees: Deletion

Assume the record has already been deleted from the file. Let $v$ be the search-key value of the record and $P_r$ the pointer to the record
Remove $(P_r, v)$ from the leaf node
If the leaf node has too few entries after the removal ($< \lceil (n-1)/2 \rceil$) and the entries in the node and a sibling fit into a single node ($\le n-1$), merge the siblings
- Insert all entries of the two nodes into the left node and delete the other node
- If the pointer to the deleted node is $P_i$, delete the pair $(K_{i-1}, P_i)$ from its parent, recursively applying the same procedure
Otherwise, if the node has too few entries after the removal but the entries in the node and a sibling do not fit into a single node, redistribute pointers
- Redistribute the pointers between the node and a sibling so that both have at least the minimum number of entries ($\ge \lceil (n-1)/2 \rceil$)
- Update the corresponding search-key value in the parent of the node
Node deletions may cascade upwards until a node with $\lceil n/2 \rceil$ or more pointers after deletion is found (note: an internal node must have at least $\lceil n/2 \rceil$ pointers)
If the root node has only one pointer after deletion, it is deleted and its sole child becomes the root
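The choice between merging and redistributing at a leaf can be sketched as below. The sketch makes two assumptions that are not in the notes: the sibling is the right neighbour, and redistribution splits the combined entries evenly (borrowing a single entry would also satisfy the rule). The function name is hypothetical.

```python
from math import ceil


def fix_underfull_leaf(node, sibling, n):
    """Repair a leaf after a deletion.

    node, sibling: sorted lists of (key, pointer); sibling is the right neighbour.
    Returns (action, left, right, separator); separator is the key the parent
    should now hold for the right node, or None.
    """
    min_entries = ceil((n - 1) / 2)
    if len(node) >= min_entries:
        return "ok", node, sibling, None
    if len(node) + len(sibling) <= n - 1:
        # Everything fits in one node: merge into the left node.
        return "merge", node + sibling, None, None
    # Too many entries to merge: redistribute (even split assumed here).
    combined = node + sibling
    half = len(combined) // 2
    left, right = combined[:half], combined[half:]
    return "redistribute", left, right, right[0][0]
```

After a merge, the parent loses one (key, pointer) pair, so the same check has to be repeated one level up.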

Complexity of B$^+$-Tree Updates

The cost (in number of I/O operations) of inserting or deleting a single entry is proportional to the height of the tree
With $K$ entries and maximum fanout $n$, the worst-case complexity of an insert or delete is $O(\log_{\lceil n/2 \rceil}(K))$
In practice, the number of I/O operations is lower
- Internal nodes tend to be in the buffer
- Splits and merges are rare; most insert and delete operations affect only a leaf node
Average node occupancy depends on the insertion order
- $\ge 2/3$ with random insertion
- $1/2$ with insertion in sorted order

B$^+$-Tree File Organization

B$^+$-tree file organization
- Leaf nodes in a B$^+$-tree file organization store records instead of pointers
- Helps keep data records clustered even when there are insertions, deletions, and updates $\to$ solves the data-file degradation problem (overflow blocks)
- Leaf nodes are still required to be at least half full
- Since records are larger than pointers, the maximum number of records that fit in a leaf node is less than the number of pointers in a non-leaf node
- Good space utilization is important
  - Involve more sibling nodes in redistribution during splits and merges
  - Involving 2 siblings in redistribution (to avoid a split or merge where possible) results in each node having at least $\lfloor 2n/3 \rfloor$ entries
- Insertion and deletion are handled in the same way as insertion and deletion of entries in a B$^+$-tree index

B-Tree Index Files

Similar to a B$^+$-tree, but a B-tree allows search-key values to appear only once, eliminating redundant storage of search keys
Search keys in non-leaf nodes appear nowhere else in the B-tree, so an additional pointer field must be included for each search key in a non-leaf node
Generalized B-tree leaf node
Non-leaf node: pointers $B_i$ are the bucket or file-record pointers
Advantages of B-tree indices
- May use fewer tree nodes than the corresponding B$^+$-tree
- Sometimes possible to find a search-key value before reaching a leaf node
Disadvantages of B-tree indices
- Only a small fraction of all search-key values are found early
- Non-leaf nodes are larger, so fan-out is reduced $\to$ B-trees typically have greater depth than the corresponding B$^+$-tree
- Insertion and deletion are more complicated than in B$^+$-trees
- Implementation is harder than for B$^+$-trees
- Typically, the advantages of B-trees do not outweigh the disadvantages

Other Issues in Indexing

Record relocation and secondary indices
- If a record moves, all secondary indices that store record pointers have to be updated
- $\to$ Leaf-node splits in B$^+$-tree file organizations become very expensive
- Solution: use the search key of the B$^+$-tree file organization instead of a record pointer in the secondary index
  - Extra traversal of the file organization to locate the record
  - Higher cost for queries, but node splits are cheap
Indexing strings
- Variable-length strings as keys
  - Variable fanout
  - Use space utilization, not the number of pointers, as the criterion for splitting
  - Prefix compression
    - Key values at internal nodes can be prefixes of the full key $\to$ fanout increases
    - Keep enough characters to distinguish entries in the subtrees separated by the key value
    - E.g., "Silas" and "Silberschatz" can be separated by "Silb"
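One simple way to pick such a separator is to take the shortest prefix of the right-hand key that is still greater than the left-hand key. This is a sketch of that idea, not necessarily how any particular system does it; `shortest_separator` is a hypothetical name.

```python
def shortest_separator(left, right):
    """Return the shortest prefix p of `right` such that left < p <= right.

    Assumes left < right in ordinary string comparison.
    """
    for length in range(1, len(right) + 1):
        prefix = right[:length]
        if left < prefix:
            return prefix
    return right


separator = shortest_separator("Silas", "Silberschatz")  # "Silb"
```

Every key in the left subtree compares less than the separator and every key in the right subtree compares greater than or equal to it, which is all an internal node needs.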

Other Issues in Indexing: Bulk Loading & Bottom-Up Build

Inserting entries one at a time into a B$^+$-tree may require $\ge 1$ I/O per entry
- In the worst case, proportional to the height of the tree
- Very inefficient when loading a large number of entries at once (bulk loading)
- Bulk loading is needed when a B$^+$-tree index is built on a large relation
Efficient alternative 1
- Sort the index entries using an efficient external sorting algorithm
- Insert in sorted order
- All entries that go to a particular leaf node appear consecutively $\to$ the leaf node needs to be written out only once
- Much improved I/O performance, but most leaf nodes are only half full
Efficient alternative 2: bottom-up B$^+$-tree construction
- Sort the entries as before
- Then create the tree layer by layer, starting from the leaf level
- Break up the sorted entries into blocks, keeping as many entries in each block as will fit $\to$ the resulting blocks form the leaf level
- The minimum value in each block, together with a pointer to the block, is used to create the entries of the next level
- Implemented as part of the bulk-load utility in most database systems
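The bottom-up construction can be sketched as repeated chunking of a sorted list. The sketch simplifies the node format: every block, at every level, holds up to `capacity` (minimum key, pointer) entries, whereas a real non-leaf node stores $n$ pointers and $n-1$ keys. It also uses an in-memory sort in place of external sorting and does not rebalance an underfull last block. `bulk_load` is a hypothetical name.

```python
def bulk_load(entries, capacity):
    """Build index levels bottom-up from (key, pointer) entries.

    Returns a list of levels, leaf level first; each level is a list of
    blocks, and each block is a list of (key, pointer) entries.
    """
    entries = sorted(entries, key=lambda e: e[0])
    # Leaf level: pack as many entries per block as will fit.
    level = [entries[i:i + capacity] for i in range(0, len(entries), capacity)]
    levels = [level]
    while len(level) > 1:
        # One entry per block below: (minimum key in block, pointer to block).
        parent_entries = [(block[0][0], block) for block in level]
        level = [parent_entries[i:i + capacity]
                 for i in range(0, len(parent_entries), capacity)]
        levels.append(level)
    return levels


levels = bulk_load([(k, f"r{k}") for k in range(10, 0, -1)], capacity=3)
```

Each block is produced exactly once, so every node is written a single time and the leaves come out full except possibly the last one.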

Hash Indices

Bucket: a unit of storage containing $\ge 1$ index entries (typically a disk block)
The bucket of an entry is obtained from its search-key value using a hash function
Hash function $h$: a function from the set of all search-key values $K$ to the set of all bucket addresses $B$
The hash function is used to locate entries for access, insertion, and deletion
Entries with different search-key values may be mapped to the same bucket, so the entire bucket has to be searched sequentially to locate an entry

Static Hashing

In a hash index, buckets store entries with pointers to records (i.e., buckets store index entries)
In a hash file organization, buckets store records
Bucket overflow can occur because of
- Insufficient buckets
- Skew in the distribution of records. This can occur for two reasons
  - Multiple records have the same search-key value
  - The chosen hash function produces a non-uniform distribution of key values
The probability of bucket overflow can be reduced but not eliminated; it is handled using overflow buckets

Hash Functions

Hash functions are required to be uniform and random
Uniform: each bucket is assigned the same number of search-key values from the set of all theoretically possible values
Random: each bucket is assigned the same number of entries, irrespective of the actual distribution of search-key values in the file
Typical hash functions perform a computation on the internal binary representation of the search key
- E.g., for a string search key, the binary representations of all the characters in the string are added, and the sum modulo the number of buckets is returned
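The string example can be written in a couple of lines. This sketch assumes the "binary representation" of a character is its UTF-8 byte values; `string_hash` is a hypothetical name.

```python
def string_hash(key, num_buckets):
    """Add the byte values of the string and reduce modulo the bucket count."""
    return sum(key.encode("utf-8")) % num_buckets


bucket = string_hash("ab", 8)  # (97 + 98) % 8 == 3
```

Because addition ignores character order, anagrams such as "ab" and "ba" always collide, which is one reason this simple scheme is only a textbook illustration.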

Handling of Bucket Overflows

Overflow chaining: the overflow buckets of a given bucket are chained together in a linked list
Hash indexing with overflow chaining is called closed addressing (or closed hashing)
An alternative that does not use overflow buckets, called open addressing (or open hashing), is not suitable for database applications
Note: the terms open hashing and closed hashing are used inconsistently across textbooks
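A static hash index with overflow chaining can be sketched as follows. The sketch assumes integer search keys, the hash function `key % num_buckets`, and in-memory objects in place of disk blocks; `Bucket` and `StaticHashIndex` are hypothetical names.

```python
class Bucket:
    """Fixed-capacity bucket with a link to its overflow bucket."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []      # (search_key, record_pointer) pairs
        self.overflow = None   # next bucket in the overflow chain


class StaticHashIndex:
    def __init__(self, num_buckets, capacity):
        self.buckets = [Bucket(capacity) for _ in range(num_buckets)]

    def _primary(self, key):
        # Assumed hash function: integer key modulo the number of buckets.
        return self.buckets[key % len(self.buckets)]

    def insert(self, key, pointer):
        b = self._primary(key)
        while len(b.entries) >= b.capacity:
            if b.overflow is None:
                b.overflow = Bucket(b.capacity)  # extend the chain
            b = b.overflow
        b.entries.append((key, pointer))

    def lookup(self, key):
        """Scan the primary bucket and its whole overflow chain."""
        found = []
        b = self._primary(key)
        while b is not None:
            found.extend(p for k, p in b.entries if k == key)
            b = b.overflow
        return found


idx = StaticHashIndex(num_buckets=4, capacity=2)
for k in (0, 4, 8, 12, 5):
    idx.insert(k, f"r{k}")
```

Keys 0, 4, 8, and 12 all hash to bucket 0 here, so the third and fourth land in an overflow bucket and a lookup has to walk the chain.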

Deficiencies of Static Hashing

In static hashing, the function $h$ maps search-key values to a fixed set $B$ of bucket addresses, but databases grow or shrink over time
- If the initial number of buckets is too small and the file grows, performance degrades because of too many overflows
- If space is allocated for anticipated growth, a significant amount of space is wasted initially (and buckets will be underfull)
- If the database shrinks, space is again wasted
One solution
- Periodic reorganization of the file with a new hash function
- Expensive, and disrupts normal operations
Better solution: dynamic hashing
- Techniques that allow the number of buckets to be modified dynamically
- Good for databases that grow and shrink in size
- Allows the hash function to be modified dynamically
- Extendable hashing: one form of dynamic hashing (see Section 24.5.2 of the textbook)

Comparison of Ordered Indexing and Hashing

Cost of periodic reorganization
Relative frequency of insertions and deletions
Is it desirable to optimize average access time at the expense of worst-case access time?
Expected type of queries
- Hashing is generally better at retrieving records with a specified value of the key
- If range queries are common, ordered indices are preferred
In practice
- PostgreSQL supports hash indices, but discourages their use because of poor performance
- Oracle supports static hash organization, but not hash indices
- SQL Server supports only B$^+$-trees