CS 4432: Database Systems II

  • Introduction

    • IO devices and latency --- PlanetScale

    • RDBMS

      • A collection of files that store the data

        • But: Files that we do not directly access or read
      • A big C program written by someone else that accesses and updates those files for you

        • But: Huge program containing 100s of 1000s of lines

  • Structure

    • Frontend: SQL

      • Data Definition Language - DDL

      • Data Manipulation Language - DML

    • Backend

      • Query optimizer

      • Query engine

      • Storage management

      • Transaction management (concurrency, recovery)

  • Abstraction

    • Tables (logical) → Creating file (data), may create indexes, updating catalog tables, read and update files on disk, check constraints (physical)

    • Queries (logical, declarative) → Imperative query execution plan

      • Complex cost model and many alternatives

      • The optimizer chooses the best execution plan for a query (may involve indexes)

    • Indexes (logical, declarative) → Create an index file, create a specialized data structure; whenever the data changes, the index has to be updated

    • Transactions (logical) → Concurrency control and logging and recovery control to ensure ACID properties

  • File & System Structure: Records in blocks, dictionary, buffer management

    • Disks and files

      • DBMS stores info on disks

      • Main memory is only for processing

      • Major implications on their design

        • READ: transfer data from disk → main memory

        • WRITE: transfer data from RAM → disk

        • High-cost operations (I/O) relative to in-memory operations

      • OS may take away memory in use by DBMS mid-operation (joining tables)

        • Why DBMS has storage manager and buffer manager

  • Disks

    • Data is stored and retrieved in units called disk blocks (sometimes also referred to as “pages”)

    • Disk block typically between 4KB to 8KB

    • Movement to main memory

      • Must read or write one block at a time

      • All R/W heads move together across tracks, but only one head is active at a time during a read/write operation

  • Movements

    • Arm moves in-out

      • Called seek time

      • Mechanical

    • Platter rotates

      • Called latency (or rotation) time

      • Mechanical

  • Sequential I/O typically costs less than random I/O because it avoids extra seek and rotational delays (the blocks sit next to each other on a track)

    • Read two random blocks

      • 2 seek times

      • 2 latency times

      • 2 transfer times

    • Read two blocks in a sequence

      • 1 seek time

      • 1 latency time

      • 2 transfer times

  • Data storage

    • Size = 2 surfaces per platter * num of platters * tracks per surface * sectors per track * bytes per sector

      • 1KB = 1024 bytes
    • Accessing disk block latency = seek time (warm-up + travel) + latency (rotational) time + data transfer time

      • Average seek time = time to travel half of the number of tracks on a platter surface

        • ~9ms, ~4ms for high-performance drives
      • Average rotational latency = half of the time to make one full rotation

        • 10,000 RPM → one rotation takes 6ms, so average rotational latency ≈ 3ms on newer drives
      • Transfer time

        • 100-200MB/sec

        • Reading 4K data block takes ~0.02-0.04ms

        • Very small relative to seek and rotation time

      • Key to lower I/O cost: reduce seek/rotation latency → Sequential I/O
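
To make the arithmetic concrete, here is a minimal sketch using the illustrative figures above (~9 ms seek, ~3 ms rotational latency, ~0.03 ms to transfer one 4 KB block; real drives vary) comparing two random reads against two sequential reads:

```python
# Assumed figures from the notes above; real drives vary.
SEEK_MS = 9.0        # average seek time
ROTATION_MS = 3.0    # average rotational latency (10,000 RPM)
TRANSFER_MS = 0.03   # transfer time for one 4 KB block

def random_read_ms(num_blocks: int) -> float:
    # Every random block pays a full seek + rotation + transfer.
    return num_blocks * (SEEK_MS + ROTATION_MS + TRANSFER_MS)

def sequential_read_ms(num_blocks: int) -> float:
    # One seek and one rotational delay, then back-to-back transfers.
    return SEEK_MS + ROTATION_MS + num_blocks * TRANSFER_MS

print(random_read_ms(2))      # 24.06 ms: 2 seeks + 2 rotations + 2 transfers
print(sequential_read_ms(2))  # 12.06 ms: 1 seek + 1 rotation + 2 transfers
```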

    • Accelerating access to blocks: performed by Disk Controller

      • Hardware component on the HDD that is responsible for managing data transfer

      • Main techniques

        • RAID: Redundant Array of Independent Disks: techniques used to improve storage performance, reliability, or both

          • Combines multiple physical disks into a single logical unit for redundancy and performance

          • Techniques: striping, mirroring, and parity

          • Several RAID levels


    | RAID Level | Description | Redundancy | Performance | Storage Efficiency | Use Case |
    |---|---|---|---|---|---|
    | RAID 0 | Striping | None | High | 100% | Non-critical, high-speed tasks |
    | RAID 1 | Mirroring | High | Moderate | 50% | Critical data, system drives |
    | RAID 5 | Striping with parity | Medium | Good | (N-1)/N | General-purpose storage |
    | RAID 6 | Striping with double parity | High | Moderate | (N-2)/N | High-reliability storage |
    | RAID 10 | Mirroring + striping | High | High | 50% | High-performance, critical data |


  • Placing related blocks on same cylinder

    • Put on the same cylinder to read them at once

    • Make use of the 2 surfaces in the same cylinder

    • Add more blocks on the next sectors on the same track

    • “At once” meaning “one at a time but really fast” because transfer time <<< rotation time

  • Using multiple disks: Striping (RAID 0)

    • Use multiple smaller disks instead of one larger disk

    • Each disk can access data independently, in parallel

      • N disks → up to N times faster access
  • Mirroring (RAID 1)

    • Use pair of disks that mirror each other

    • Good for durability but higher overhead for writing

  • Disk scheduling

    • Disk Controller holds a sequence of block requests (queue of requests)

    • Not necessarily serve requests in their arrival order (FIFO) ➔ Use better scheduling policy

    • Elevator (SCAN) policy

      • Purpose: Optimize disk read/write head movement.

      • Mechanism:

        • Moves in one direction (e.g., inward to outward or vice versa).

        • Services all requests along the path in the current direction.

        • Reverses direction at the end of the disk.

      • Advantages:

        • Fairness: Services requests at both ends of the disk.

        • Reduces starvation compared to FCFS.

      • Disadvantages:

        • Overhead: May travel to the end unnecessarily.

        • Latency: Requests at the far end may wait longer.

      • Example:

        • Cylinders: 0 (inner) to 199 (outer).

        • Head at 50, moving outward.

        • Requests: [10, 60, 80, 120, 30].

        • Services: 60 → 80 → 120 → (reverses) → 30 → 10.

      • Efficient and fair, but may incur overhead due to unnecessary disk arm movement
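
A minimal sketch of the SCAN ordering on the example above; `scan_order` is a hypothetical helper (cylinder numbers in, service order out), and new requests arriving mid-sweep are ignored for simplicity:

```python
def scan_order(head: int, requests: list[int], moving_out: bool) -> list[int]:
    """Service order under the elevator (SCAN) policy. The arm keeps
    moving in its current direction, serving requests along the way,
    then reverses. (Strict SCAN travels to the last cylinder before
    reversing; the order of serviced requests is the same.)"""
    ahead = sorted(r for r in requests if r >= head)
    behind = sorted((r for r in requests if r < head), reverse=True)
    return ahead + behind if moving_out else behind + ahead

# Example from the notes: head at cylinder 50, moving outward.
print(scan_order(50, [10, 60, 80, 120, 30], moving_out=True))
# [60, 80, 120, 30, 10]
```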

  • Prefetching and buffering

    • If DBMS can predict the sequence of access, it can pre-fetch and buffer more blocks even before requesting them.

    • Naive single buffer algorithm

      • Say a file has a sequence of blocks B1, B2, … and a program that processes B1, B2, B3, …

      • Execution of algorithm

        • (1) Read B1 → Buffer

        • (2) Process Data in Buffer

        • (3) Read B2 → Buffer

        • (4) Process Data in Buffer

      • Cost: n(P+R)

        • P: time to process an in-memory block

        • R: time to read disk block (seek + latency + transfer)

        • n: # of blocks

    • Double buffering

      • Purpose: Overlap I/O operations with computation to reduce idle time.

      • How it works:

        • Two in-memory buffers are used: Buffer A and Buffer B.

        • While the CPU processes data in Buffer A, the I/O system loads data into Buffer B.

        • Once the CPU finishes processing Buffer A, it switches to Buffer B, and the I/O system loads new data into Buffer A.

      • Advantages:

        • Reduces CPU idle time by overlapping I/O and computation.

        • Smooths out performance for real-time systems (e.g., video playback).

    • Prefetching

      • Purpose: Reduce latency by loading data into memory before it is needed.

      • How it works:

        • The system predicts which data will be needed next and loads it into memory in advance.

        • Prefetched data is stored in a buffer or cache.

      • Advantages:

        • Hides memory or disk latency by overlapping data transfer with computation.

        • Improves performance for sequential or predictable access patterns.

    • Combining double buffering and prefetching

      • While the CPU processes data in Buffer A, prefetching loads data into Buffer B.

      • When the CPU switches to Buffer B, prefetching loads the next set of data into Buffer A.

    • Cost: R + nP

      • Have to pay R for the first block

      • Assuming sequential I/O, time to read other blocks is very small compared to P

      • Time to retrieve them overlaps with nP
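
The two cost formulas can be checked with a small sketch; the values of n, P, and R below are illustrative assumptions:

```python
def single_buffer_cost(n: int, P: float, R: float) -> float:
    # Strictly alternate: read a block, then process it.
    return n * (P + R)

def double_buffer_cost(n: int, P: float, R: float) -> float:
    # Pay R once for the first block; subsequent sequential reads
    # overlap with processing, hiding behind the n*P of CPU work.
    return R + n * P

# Hypothetical figures: 1000 blocks, 1 ms to process, 12 ms per read.
print(single_buffer_cost(1000, 1.0, 12.0))  # 13000.0 ms
print(double_buffer_cost(1000, 1.0, 12.0))  # 1012.0 ms
```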

  • Record representation

    • Database records (tuples)

      • Each record, and each field is just a sequence of bytes on a disk block

      • Field has type that can be fixed- or variable- length

      • All fields in a record are aligned to start at 4- or 8-byte boundaries & concatenated

  • Each record has a header holding some info

  • Two types of records: fixed-length and variable-length

    • All fixed-length records have the same size

    • Different variable-length records may have different sizes

      • VARCHAR (variable characters), VARBINARY, TEXT, BLOB, JSON/XML
  • Record header

    • Often it is convenient to keep some “header” information in each record:

      • A pointer to schema information (attributes/fields, types, their order in the tuple, constraints)

      • Length of the record/tuple

      • Timestamp of last modification

  • Packing records into blocks

    • Start with a block header

    • Followed by sequence of records

    • May end with some unused space

    • No record is split between blocks

  • Accessing fixed-length records

    • Information about field types is the same for all records in a file; stored in system catalogs

    • Finding i’th record in a block does not require scan over previous records

    • Finding i’th field does not require scan over previous fields

  • Records with variable-length fields

    • A field whose length can be different across records

    • Complicated design and access, but sometimes efficient to have

    • Effective representation

      • Fixed length fields are kept ahead of the variable length fields

      • Record header contains

        • Record length

        • Pointers to beginning of each variable length field except the first one

  • Efficient access → we can still locate the i’th record and its fields without scanning over previous fields (see the sketch below)
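
A minimal sketch of that offset arithmetic for fixed-length records; the header size, record size, and field offsets are hypothetical values standing in for what the system catalog would supply. For variable-length records the same idea applies per record: the record header's length and field pointers turn field access into one lookup instead of a scan.

```python
# Hypothetical layout: 16-byte block header, 32-byte records,
# four 8-byte fields per record.
BLOCK_HEADER = 16
RECORD_SIZE = 32
FIELD_OFFSETS = [0, 8, 16, 24]   # same for every record in the file
FIELD_SIZE = 8

def fixed_field(block: bytes, i: int, j: int) -> bytes:
    """Field j of the i'th record: pure arithmetic, no scanning."""
    start = BLOCK_HEADER + i * RECORD_SIZE + FIELD_OFFSETS[j]
    return block[start:start + FIELD_SIZE]
```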

  • Packing records into blocks

    • Packed approach (fixed-length records)

      • Packed means no holes between records

      • Insertion

        • If there is enough free space at the end, then insert in this block

        • Increment number of tuples in block header

      • Deletion

        • Move the last record to fill in the empty space

        • Decrement N

  • Bitmap/unpacked approach (fixed-length records)

  • Insertion

    • Find free slot anywhere in the block

    • Insert the record

    • Increment N

    • Set its free bit to 1

  • Deletion

    • No copying/movement

    • Decrement N

    • Set its bit to 0

  • Better approach, but does not work for variable-length records

  • Variable-length records

  • The slot directory starts from one end and grows as needed

  • The data records start from the other end (No need for fixed allocation for SLOT directory)

  • Rid does not change even if the record moves within the block

  • Deleting a record will leave some empty space in the middle
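
A minimal sketch of such a slotted page, assuming a 4 KB page; the class and its methods are hypothetical, and free-space management and compaction are simplified:

```python
class SlottedPage:
    """Sketch of a slot-directory page: the directory grows from the front,
    record bytes grow from the back, and rid = (page_id, slot#) stays valid
    even if a record's bytes are moved during compaction."""
    SLOT_SIZE = 4  # assumed bytes of directory space per slot

    def __init__(self, size: int = 4096):
        self.data = bytearray(size)
        self.slots = []          # slot# -> (offset, length); None = deleted
        self.free_ptr = size     # record area grows downward from here

    def insert(self, record: bytes) -> int:
        # The directory and record area must not collide.
        dir_bytes = (len(self.slots) + 1) * self.SLOT_SIZE
        if self.free_ptr - len(record) < dir_bytes:
            raise ValueError("page full")
        self.free_ptr -= len(record)
        self.data[self.free_ptr:self.free_ptr + len(record)] = record
        self.slots.append((self.free_ptr, len(record)))
        return len(self.slots) - 1          # slot number: the stable rid part

    def read(self, slot: int) -> bytes:
        offset, length = self.slots[slot]
        return bytes(self.data[offset:offset + length])

    def delete(self, slot: int) -> None:
        self.slots[slot] = None  # leaves a hole; later compaction can slide
                                 # records without invalidating any rid
```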

  • Buffer manager

  • Index/file/record manager issues page commands to buffer manager

  • Buffer manager manages what blocks should be in memory and for how long

    • Computations/modifications on/to data pages can only be made when they are in main memory, NOT on disk

  • Insight: https://chatgpt.com/share/67e6ed51-0ea0-800b-8680-f5e888d3780b

  • Buffer pool

    • Array called buffer pool, each entry is called frame

    • Buffer pool information table contains: <frame number, disk page ID, pin count, dirty>

      • Dirty: whether the content has changed

      • Pin count: the page can be evicted only when the count = 0; otherwise it has to stay in memory until unpinned

      • LTTimestamp: last touched (R+W) timestamp

    • Each entry can hold 1 database disk block

      • An in-memory disk block is a.k.a. memory page
    • Buffer manager keeps track of

      • Which pages are empty

        • Naive: scan array of frames looking for blockId -1 (O(n))

        • Better: array of empty frames

        • Even better: bitmap of which frames are used (O(1))

      • Which disk page exists in which frame

        • Naive: scan array of frames looking for blockId i (O(n))

        • Better: hashmap blockId → frame number (O(1))

          • No need to keep blockId in frame metadata anymore
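
A minimal sketch of this bookkeeping, combining the free list and the blockId → frame hash map; `read_block` is a hypothetical callback standing in for the disk manager, and eviction is left out:

```python
class BufferPool:
    """O(1) free-frame lookup via a free list and O(1) page lookup
    via a block_id -> frame hash map."""
    def __init__(self, num_frames: int):
        self.frames = [None] * num_frames    # frame -> page contents
        self.meta = [None] * num_frames      # frame -> [block_id, pin_count, dirty]
        self.free = list(range(num_frames))  # which frames are empty
        self.table = {}                      # block_id -> frame number

    def pin(self, block_id, read_block):
        if block_id in self.table:           # hit: no disk I/O
            f = self.table[block_id]
        else:                                # miss: take a free frame
            f = self.free.pop()              # (if none, a victim would be evicted)
            self.frames[f] = read_block(block_id)
            self.meta[f] = [block_id, 0, False]
            self.table[block_id] = f
        self.meta[f][1] += 1                 # page must stay until unpinned
        return self.frames[f]

    def unpin(self, block_id, dirty: bool = False):
        f = self.table[block_id]
        self.meta[f][1] -= 1                 # at 0, the frame becomes evictable
        self.meta[f][2] = self.meta[f][2] or dirty
```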

  • How do we choose which frame to free among the unpinned ones? Buffer replacement policies

    • LRU, FIFO, clock replacement algorithm

    • Have a big impact on number of I/O’s → depending on the access pattern

    • Also may need additional metadata to be stored for each frame

    • LRU

      • For each frame in buffer pool, keep track of time when last accessed ➔ (LTtimestamp)

      • Replace the frame which has the oldest (earliest) time

      • Very common policy: intuitive and simple

        • Works well for repeated accesses to popular pages
    • Clock

      • An approximation of LRU

      • Each frame has

        • Pin count ➔ If larger than 0, do not touch it

        • Ref bit (second chance bit) ➔ 0 or 1

      • Imagine frames organized into a cycle

      • A pointer rotates to find a candidate frame to free

        • IF pin-count > 0 Then ➔ Skip

        • IF (pin-count = 0) & (Ref = 1) ➔ Set (Ref = 0) and skip (second chance)

        • IF (pin-count = 0) & (Ref = 0) ➔ free and re-use

      • Need additional metadata

        • Second chance (reference bit)
      • Small overhead
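
A minimal sketch of the clock sweep; each frame is modeled as a dict with 'pin' and 'ref' fields, an illustrative stand-in for the real frame metadata:

```python
def clock_victim(frames: list, hand: int):
    """Pick a frame to free. frames[i] = {'pin': int, 'ref': 0 or 1}.
    Returns (victim_frame, new_hand_position). Two full sweeps suffice:
    the first clears second-chance bits, the second must find a victim
    unless every frame is pinned."""
    n = len(frames)
    for _ in range(2 * n):
        f = frames[hand]
        if f['pin'] == 0 and f['ref'] == 1:
            f['ref'] = 0                    # second chance: clear and move on
        elif f['pin'] == 0:
            return hand, (hand + 1) % n     # unpinned, no second chance left
        hand = (hand + 1) % n
    raise RuntimeError("all frames are pinned")
```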

    • Problem: sequential flooding

      • Queries can involve repeated sequential scans

      • # free buffer frames < # pages in file means some page requests cause an I/O.

      • High cost ➔ Each access modifies the metadata

      • Affects both LRU and Clock Replacement

  • Pinning memory pages

    • Pinning a page means it cannot be evicted from memory until it is unpinned

    • Why pin a page?

      • Keep it until the transaction completes

      • Page is referenced a lot

      • Recovery control

      • Concurrency control (enforcing certain order)

    • Releasing unmodified pages

      • When done with a page

        • Unpin the page (if you can)

        • Since the page is not modified → return this frame to the free list

        • No need to write back to disk

    • Releasing modified pages: have to notify the disk manager to modify the corresponding on-disk block afterward

  • Sorting in main memory

    • In DBMS, the data can be much larger than main memory

    • Why sort?

      • Data requested in sorted order

      • Many DB operations require sorting

    • Using secondary storage effectively

      • General wisdom

        • I/O costs dominate

        • Design algorithms to reduce I/O

        • CPU overhead becomes less important

    • Merge sort

      • In-memory: EZ

      • Externally (using disk)

        • How many memory buffers are needed?

        • What is the I/O cost?

        • 2-way external sort (Naive method)

        • Phase 1: Prepare

          • Read a page, sort it, write it

          • Only one buffer page is used

        • Phase 2, 3…(recursively): Merge two sub-lists in each phase

          • Three buffer pages used

  • M-way external sort

    • Sorting a file with N pages using M buffer pages

    • Pass 0: use M buffer pages.

      • Produce ceil(N/M) sorted runs of M pages each.
    • Pass 1, 2, …, etc.: merge M-1 runs at a time.

  • Number of passes = 1 + ceil( log_(M-1)( ceil(N/M) ) )

  • Total I/O cost = 2N * (# of passes)
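
A small calculator for these formulas; it follows the convention above that each pass reads and writes all N pages, so the cost is 2N per pass:

```python
import math

def external_sort_cost(N: int, M: int):
    """N pages in the file, M buffer pages. Pass 0 makes ceil(N/M)
    sorted runs; each later pass merges M-1 runs at a time."""
    runs = math.ceil(N / M)
    passes = 1 + math.ceil(math.log(runs, M - 1)) if runs > 1 else 1
    return passes, 2 * N * passes

print(external_sort_cost(10_000, 100))
# (3, 60000): pass 0 leaves 100 runs, then two merge passes
```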

  • Summary

    • Use as many memory buffers as you can

    • Cost metric

      • I/O only (till now)

      • CPU cost is nontrivial, worth reducing

        • Use a quick main-memory algorithm
    • For the in-memory sorting part

      • Use Quicksort or heapsort
  • Indexing

    • Heap file of a relation: storage structure used to store the tuples (rows) of that relation without any specific order

    • Table scan: Naive way

      • Open the heap file of relation R

      • Access each data page

      • Access each record in the data page

      • Check the condition

      • I/O cost: O(N)

      • Memory needed: only 1 memory block

      • Index scans are more efficient, but require the DBMS to index the table beforehand

    • Concepts

      • Indexing is used to speed up access to desired data

      • Search key — attribute or set of attributes used to look up records in a relation (WHERE clause in the SQL query)

        • Example:

SELECT id, name, address

FROM R

WHERE id=1000; → Search key

  • Index file

    • Consists of records (called index entries) of the format

⟨search-key value, pointer to the record with that value⟩

  • Typically much smaller than the data file it’s indexing

  • Types of indexes

    • Dense vs sparse

    • Primary vs secondary

    • One-level vs multi-level

  • Index evaluation metrics

    • Saves query times, but adds overhead to insertion, update/deletion, and space

    • Supported access patterns

      • Equality search (x = 100): records with a specified value in the attribute

      • Range search (10 < x < 100): records with an attribute value falling in a specified range of values.

  • Primary index

    • An index on the ordering column (the column on which the data file is sorted)

    • Dense index in ordered file

      • Ordered (sequential) files

        • Records are stored sorted based on the indexed attribute
      • Dense index

        • Has one entry for each data tuple

  • Number of entries in index = records in file

  • But index size is much smaller (since only 2 columns)

  • Index scan

    • Read each page from the index

    • Search for key = 100

    • Follow the pointer ➔ (Record Id)

  • Index binary search

    • Since all keys are sorted

    • Read middle page in index

    • Either you find the key, or

      • Move up or down
  • Sparse index in ordered file

  • Sparse index

    • An entry for only the 1st record in each data block

    • Much smaller than a dense index

    • Can only be built on ordered (sequential) files

  • Index binary search still works

    • Either locate the search key in the index, OR

    • Locate the largest key smaller than your search key

    • Follow the pointer and check the data block
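
A minimal sketch of that lookup, assuming the sparse index is small enough to model as two parallel in-memory lists (keys and block pointers); the names are hypothetical:

```python
import bisect

def sparse_index_lookup(index_keys, index_ptrs, search_key):
    """index_keys[i] is the first key in data block index_ptrs[i],
    both sorted. Find the search key itself or the largest indexed
    key smaller than it, then follow that pointer."""
    pos = bisect.bisect_right(index_keys, search_key) - 1
    if pos < 0:
        return None          # smaller than every indexed key
    return index_ptrs[pos]   # the one data block that may hold the key

# Hypothetical index: data blocks starting at keys 10, 30, 50, 70.
print(sparse_index_lookup([10, 30, 50, 70], ['B0', 'B1', 'B2', 'B3'], 42))
# 'B1': 42 falls in the block that starts at key 30
```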

  • Multi-level index

  • 1st level Index file is just a file with sorted keys

    • We can build a 2nd level index on top of it
  • Beyond the 1st level (which can be dense or sparse), all following levels should be sparse to save on space

  • Sparse vs dense index

    • Sparse

      • Less space

      • Only for sorted files or higher-level indexes

      • Better for insertion

      • Cannot tell if record does not exist without checking data file

    • Dense

      • More space

      • Must use for unsorted files (secondary indexes)

      • Can tell if record does not exist without checking the data file

  • Insertion/deletion in ordered file

    • Sparse index: deletion

      • If the deleted record was not an indexed record, the index requires no re-organization

        • Data blocks will have some empty space
      • If the deleted record was an indexed record (the 1st entry in the data block), the search key value in the index entry changes to the new 1st entry in the data block

        • The 1st entry may or may not move

  • Dense index: deletion

  • Sparse index: insertion

  • Use of overflow blocks

    • Overflow blocks are invisible to the index (do not affect the index entries)

  • Dense index: insertion — similar, often more expensive

  • Primary vs secondary index

    • Primary index

      • An index on the ordering column (the column on which the data file is sorted)

      • Can be dense or sparse

      • Can be one-level or multi-level

      • Big advantage

        • Records having the same key (or adjacent keys) are in the same (or adjacent) data blocks → Leads to sequential I/Os

  • Unordered files

    • File where records are not ordered on the indexed column
  • Secondary index

    • An index on a relation’s unordered (physically) column

    • The file may be ordered on another column, say Name.

      • An index on the Name column is primary index (Can be sparse or dense)

      • An index on any other column, say ID, is called secondary index (can only be dense)

    • An index entry for each data record

    • Pointers will cause random I/Os (even for same or adjacent values)

  • Multi-level secondary index

  • Duplicate values

    • Option 1: typical dense index

      • Repeated keys can be many

      • Problem: excess overhead

        • Disk space

        • Search time

  • Option 2: indirection

    • Build a 2nd level index to store distinct values of the key

    • Each distinct value stored once (Saves space)

    • Each value points to a bucket of pointers to the duplicate values

  • B-trees

    • A disk-based multi-level balanced tree index structure

      • Disk-based

        • Each node in the tree is a disk page → has to fit in one disk page

        • The pointers are locations on disk: either block IDs or record IDs

      • Multi-level

        • Has one root (can be the only level)

        • The height of the tree grows and shrinks dynamically as needed

        • We always (in search, insert, delete) start from the root and move down one level at a time

      • Balanced

        • All leaf nodes are at the same height
    • B+ tree

      • Self-balancing, ordered tree data structure that allows searches, sequential access, insertions, and deletions in O(log n)

        • Generalization of a binary search tree, since a node can have more than two children

        • Optimized for systems that read and write large blocks of data

    • Properties

      • Each node (which is a disk block) holds up to N keys and N + 1 pointers

      • Keys in any node are sorted from left to right

      • Keys in any level are also sorted from left to right

      • Each node is at least 50% full (Minimum 50% occupancy)

        • Except the Root can have as minimum (1 key & 2 pointers)
      • No empty spaces in the middle of a node

      • Number of pointers in any node = number of keys + 1

        • The rest are Null
    • Leaf node values

      • Approach #1: Record IDs

        • A pointer to the location of the tuple to which the index entry corresponds
      • Approach #2: Tuple Data

        • The leaf nodes store the actual contents of the tuple

        • Secondary indexes must store the Record ID as their values

    • B+ tree visualization

  • Good utilization

    • B-tree nodes should not be too empty

    • Each node is at least 50% full (Minimum 50% occupancy)

      • Except the Root can have as minimum (1 key & 2 pointers)
    • Use at least

      • Non-leaf: ceil((n+1)/2) pointers

      • Leaf: floor((n+1)/2) pointers to data

  • B-tree vs. B+tree

    • The original B-Tree from 1972 stored keys and values in all nodes in the tree.

    • A B+ Tree only stores values in leaf nodes.

    • B+tree more space-efficient, since each key only appears once in the tree.

    • Inner nodes only guide the search process.

  • Dense vs. sparse

    • In Theory, leaf level can be either dense or sparse

      • Sparse only if the data file is sorted
    • In Practice, in most DBMS the leaf level is always dense

      • One index entry for each data record
    • What about values in non-leaf levels?

      • Not necessarily a subset of leaf values
  • B-tree in practice

    • Usually 3- or 4-level B-tree index is enough to index even large files

    • The height of the tree is usually very small

      • Good for searching, insertion, and deletion

      • Few I/Os (especially since the root can always be cached in memory)

  • Querying

    • Equality Search: Find all records with a search-key value of k.

      • 1. Start with the root node

        • Examine the keys (use binary search) until you find the pointer to follow
      • 2. Repeat until reach a leaf node

      • 3. Search keys in the leaf node for the first key = k

      • 4. Follow the pointers to the data records

    • Range Search: Find all records between [k1, k2].

      • 1. Start with the root node and search for k1

        • Examine the keys (use binary search) until you find the pointer to follow
      • 2. Repeat until reach a leaf node

      • 3. Search the keys in the leaf node for the first key = k1 or the first larger

      • 4. Move right while the key is not larger than k2 (may move from one node to another node)

      • 5. Follow the pointers to the data records
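
A minimal in-memory sketch of both searches; each `Node` object stands in for one disk page, separators send equal keys to the right child (matching the “copy up the 1st value of Lnew” convention), and all names are hypothetical:

```python
import bisect

class Node:
    """Hypothetical in-memory stand-in for one disk-resident B+ tree page."""
    def __init__(self, keys, children=None, values=None, next_leaf=None):
        self.keys = keys              # sorted within the node
        self.children = children      # inner: len(children) == len(keys) + 1
        self.values = values          # leaf: record ids, parallel to keys
        self.next_leaf = next_leaf    # leaf sibling pointer for range scans

def find_leaf(root, k):
    node = root
    while node.children is not None:              # one page read per level
        node = node.children[bisect.bisect_right(node.keys, k)]
    return node

def equality_search(root, k):
    leaf = find_leaf(root, k)
    i = bisect.bisect_left(leaf.keys, k)
    if i < len(leaf.keys) and leaf.keys[i] == k:
        return leaf.values[i]                     # pointer to the data record
    return None

def range_search(root, k1, k2):
    leaf, out = find_leaf(root, k1), []
    i = bisect.bisect_left(leaf.keys, k1)         # first key >= k1
    while leaf is not None:
        if i == len(leaf.keys):                   # exhausted this leaf
            leaf, i = leaf.next_leaf, 0
            continue
        if leaf.keys[i] > k2:                     # moved past the range
            break
        out.append(leaf.values[i])
        i += 1
    return out
```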

  • Insertion

    • Find correct leaf L (searching)

    • Put data entry onto L

      • If L has enough space, done

      • Else, must split leaf node L (into L and a new node Lnew)

        • Redistribute entries evenly

        • Copy up the middle key (1st value in Lnew) to the parent.

        • Insert an index entry pointing to Lnew into the parent of L.

    • This can happen recursively

      • To split internal node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.)
    • Splits “grow” tree; root split increases height.

      • Tree growth: gets wider or one level taller at top.
  • Deletion

    • Start at root, find leaf L where entry belongs. (Searching)

    • Remove the entry.

      • If L is at least half-full, done!

        • Do not leave empty spaces in the middle
      • If L has less than the minimum entries,

        • Try to re-distribute, borrowing from sibling (adjacent node with same parent as L).

        • If re-distribution fails, merge L and sibling.

    • If merge occurred, must delete entry (pointing to L or sibling) from parent of L.

    • Merge could propagate to root, decreasing height.

  • More practical deletion

    • When deleting 24*, allow its node to be under-utilized

    • After some time with more insertions

      • Most probably more entries will be added to this node
    • If many nodes become under-utilized ➔ Re-build the index

  • Hashing

    • Static hashing

      • Disk-based hash-based indexes

        • Adaptation of main memory hash tables

        • Support equality searches

        • No range searches

      • Hash table has N buckets (static means “N” does not change)

        • Each bucket will be one disk page

        • Hash function h(k) maps key k to one of the buckets

  • Within a bucket

  • Should we keep entries sorted?

    • Yes, if we care about CPU time

    • Makes the insertion and deletion a bit more expensive

  • Lookup (Search for key = d)

    • Apply the hash function over d ➔ h(d) mod N = 0

    • Read the disk page of bucket 0

    • Search for key d

      • If keys are sorted, then search using Binary search
  • Insertion with overflow

    • Insert key e ➔ h(e) mod N = 1

    • Create an overflow bucket and insert e

    • Overflow bucket is another disk block

  • Deletion

    • Search for the key to be deleted

    • In case of overflow buckets

      • The overflow bucket may no longer be needed
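
A minimal sketch of static hashing with overflow chaining; buckets are Python lists standing in for disk pages, and `hash(key) % N` plays the role of h(k) mod N:

```python
N = 4                 # static: the number of primary buckets never changes
BUCKET_CAPACITY = 4   # entries that fit in one disk page (assumed)

buckets = [[] for _ in range(N)]   # primary buckets, one page each
overflow = {}                      # bucket# -> overflow entries

def insert(key):
    b = hash(key) % N
    if len(buckets[b]) < BUCKET_CAPACITY:
        buckets[b].append(key)
    else:                          # primary page full: chain an overflow page
        overflow.setdefault(b, []).append(key)

def lookup(key):
    b = hash(key) % N              # one page read, plus overflow pages if any
    return key in buckets[b] or key in overflow.get(b, [])
```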
  • Extensible (dynamic) hashing

    • No overflow buckets

    • The number of primary buckets is not fixed and it can grow

    • Start with some # of buckets (e.g., N=4 and h(k) = k)

    • Each bucket can fit fixed # of entries (e.g., M=4)

    • Problem: storing index entries directly on primary bucket eventually requires overflow buckets

    • Solution: Rather than storing entries directly, use a directory

    • Binary representation (N=4)

      • Here we can use 2 bits to represent buckets (global depth=2)

      • How can we find bucket for key 12?

        • Look at 2 rightmost bits of h(12)
      • Every key in bucket A has the same 2 last bits (local depth=2)

        • Same for others

  • Given a key k ➔ we need to know which directory entry k maps to

  • Hash Function

    • Convert k into binary representation (0s and 1s)

    • Consider only the number of bits equal to the Global Depth value

      • Take bits from right side (Least Significant bits)
  • Insertion

    • Case 1: Split when local depth = global depth

      • Increment global depth

      • This means ➔ double the directory size

      • Divide the overflowing bucket into two

      • Increment their local depth

      • Re-distribute the keys based on the new local depth bits

      • For all other buckets, leave them as is (local depth < global depth)

      • The number of incoming directory pointers to each of these buckets doubles

    • Case 2: Split while local depth < global depth

      • No need to double Directory

      • Only split the bucket

      • Increment local depth

      • Re-distribute its keys
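
A minimal sketch covering both cases, assuming h(k) = k as in the notes and starting from global depth 1 (the notes' example starts at N=4 buckets); splitting retries the insert, which may cascade, and all names are hypothetical:

```python
M = 4  # entries per bucket (one disk page), as in the notes

class Bucket:
    def __init__(self, local_depth: int):
        self.local_depth = local_depth
        self.keys = []

class ExtendibleHash:
    """Directory indexed by the global-depth rightmost bits of h(k) = k."""
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _bucket(self, key):
        return self.directory[key & ((1 << self.global_depth) - 1)]

    def lookup(self, key):
        return key in self._bucket(key).keys

    def insert(self, key):
        bucket = self._bucket(key)
        if len(bucket.keys) < M:
            bucket.keys.append(key)
            return
        # Case 1: local depth == global depth -> double the directory.
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory * 2   # both halves share buckets
            self.global_depth += 1
        # Case 2 (or right after doubling): split only this bucket.
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        bit = 1 << (bucket.local_depth - 1)       # new distinguishing bit
        for i, b in enumerate(self.directory):    # re-aim half the pointers
            if b is bucket and (i & bit):
                self.directory[i] = sibling
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:                        # redistribute on the new bit
            (sibling if k & bit else bucket).keys.append(k)
        self.insert(key)                          # retry; may split again
```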

  • SQL & Query Processing

    • Execution engine

      • Receives query plan from the query compiler and executes
    • Relational algebra expresses query plans

      • Intuition

        • Avoid cross product

        • Push down unary operators

        • Perform joins on smaller datasets

  • Set of operators that operate on relations

  • Operator semantics based on Set or Bag theory

  • Relational algebra forms the underlying basis (and optimization rules) for SQL

  • Query planning

    • Goal: equivalent query plans but cheaper operationally

    • Intuition

      • Minimize the size (number of tuples, number of attributes) of intermediate results

      • Push selections down in the expression tree as far as possible

      • Push down projections, or add new projections where applicable

  • Overview of query execution

  • Heuristics

    • Pushing selections

  • Rewrite rules

    • Replace the left side of one of these (and similar) rules by the right side

    • Can greatly reduce number of tuples in intermediate results

    • Operator evaluation

      • Two categorizations

        • Technique

        • Number of passes

      • Algorithms for evaluating relational operators use some simple ideas extensively:

        • Indexing: Can use WHERE conditions to retrieve small set of tuples (selections, joins)

        • Iteration: Sometimes it is faster to scan all tuples even if there is an index. (Sometimes we can scan the data entries in an index instead of the table itself; when?)

        • Partitioning: By using sorting or hashing, we can partition the input tuples and replace an expensive operation by similar operations on smaller inputs.

      • Number of passes: 1 pass, 2 pass, multi-pass

    • Common statistics over relation R

      • M: # of memory buffers available

      • B(R): # of blocks to hold all R tuples

      • T(R): # tuples in R

      • S(R): # of bytes in each of R’s tuples

      • V(R, A): # distinct values in attribute R.A

    • One pass operator evaluation algorithms

      • Join (R, S)

  • Duplicate elimination

  • Group by

  • Set union(R,S)
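
As a concrete instance of a one-pass algorithm, here is a sketch of a one-pass hash join: it assumes the smaller relation R fits in memory (roughly B(R) ≤ M - 2, leaving one buffer to stream S and one for output), with relations modeled as lists of tuples:

```python
def one_pass_hash_join(R, S, key_r, key_s):
    """Build an in-memory hash table on the smaller relation R,
    then stream the larger relation S past it, block by block."""
    table = {}
    for r in R:                               # build side: held in memory
        table.setdefault(key_r(r), []).append(r)
    for s in S:                               # probe side: never stored
        for r in table.get(key_s(s), []):
            yield r + s                       # emit immediately: non-blocking

emps  = [(1, 'Ann'), (2, 'Bob')]
depts = [(1, 'Sales'), (1, 'HR')]
print(list(one_pass_hash_join(emps, depts, lambda r: r[0], lambda s: s[0])))
# [(1, 'Ann', 1, 'Sales'), (1, 'Ann', 1, 'HR')]
```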

  • Blocking vs. non-blocking operators

    • Blocking operator cannot produce any tuples to the output until it processes all its inputs

    • Non-blocking operator (Pipelined) can produce tuples to output without waiting until all input is consumed

    • Examples

      • Join, duplicate elimination, union ➔ Non-blocking

      • Grouping ➔ Blocking

      • Selection, Projection ➔ Non-blocking

      • Sorting ➔ Blocking

  • Two pass algorithms

    • First Pass: Do a preparation pass and write the intermediate result back to disk

      • We count Reading + Writing
    • Second Pass: Read from disk and compute the final results

      • We count Reading only (if it is the final pass)
    • Sort-based two-pass algorithms

      • 1st pass does a sort on some parameter(s) of each operand

      • 2nd pass algorithm relies on sort results; can be pipelined

    • Hash-based two-pass algorithms

      • 1st pass partitions input relation

      • 2nd pass algorithm will work on each partition independently

    • 2-Pass external sort

  • Phase 1 → no constraint beyond the M available buffers; produces ceil(B(R)/M) sorted runs

  • Phase 2 → each run must have a memory buffer, plus one for the output

  • ceil(B(R)/M) ≤ M-1

  • Approx. B(R)/M ≤ M

  • Hence B(R) ≤ M^2
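
A quick feasibility check for this bound, with illustrative numbers (4 KB blocks assumed):

```python
import math

def two_pass_sort_feasible(B: int, M: int) -> bool:
    """Phase 1 makes ceil(B/M) runs; phase 2 needs one buffer per run
    plus one for output, so ceil(B/M) <= M - 1, i.e. roughly B <= M^2."""
    return math.ceil(B / M) <= M - 1

# With M = 1,000 buffers of 4 KB (~4 MB of memory), two passes can sort
# roughly a million blocks (~4 GB of data).
print(two_pass_sort_feasible(999_000, 1_000))    # True
print(two_pass_sort_feasible(1_000_001, 1_000))  # False
```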