# CUDA

## Introduction

### GPUs vs. CPUs

- CPU: Central Processing Unit
  - General purpose
  - High clock speed
  - Few cores
  - High cache
  - Low latency
  - Low throughput
- GPU: Graphics Processing Unit
  - Specialized
  - Low clock speed
  - Many cores
  - Low cache
  - High latency
  - High throughput
- TPU: Tensor Processing Unit
  - Specialized ASICs designed for deep learning workloads (matrix multiplication, etc.)
- FPGA: Field Programmable Gate Array
  - Specialized hardware that can be reconfigured to perform specific tasks
  - Very low latency
  - Very high throughput
  - Very high power consumption
  - Very high cost
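To see where your own GPU lands on this spectrum, you can query its properties through the CUDA runtime API. A minimal sketch (which fields you print is up to you, and exact fields can vary slightly between CUDA versions; error checking is omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of GPU 0

    printf("Name:                  %s\n", prop.name);
    printf("Multiprocessors (SMs): %d\n", prop.multiProcessorCount);
    printf("Clock rate:            %.2f GHz\n", prop.clockRate / 1e6);  // reported in kHz
    printf("L2 cache:              %d KB\n", prop.l2CacheSize / 1024);
    printf("Shared mem per block:  %zu KB\n", prop.sharedMemPerBlock / 1024);
    return 0;
}
```

A typical consumer GPU reports tens of SMs clocked around 1-2 GHz, next to a CPU with a handful of cores at 4-5 GHz: many slow cores versus few fast cores.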
### What makes GPUs so fast for deep learning?

- Less control logic (fewer instructions)
- More cores
- L1 cache shared between cores
- CPU (host)
  - minimize the time of one task
  - metric: latency, in seconds
- GPU (device)
  - maximize throughput
  - metric: throughput, in tasks per second (e.g., pixels per ms)
### Typical CUDA program structure

1. CPU allocates GPU memory
2. CPU copies data to the GPU
3. CPU launches a kernel on the GPU (the processing is done here)
4. CPU copies results from the GPU back to the CPU to do something useful with them
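A minimal sketch of that four-step flow, using a toy element-wise add kernel (the kernel name, sizes, and launch configuration here are illustrative, and error checking is omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// GPU kernel: each thread adds one element (threads/blocks are covered in the next chapter)
__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host (CPU) data
    float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // 1. CPU allocates GPU memory
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    // 2. CPU copies data to the GPU
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // 3. CPU launches the kernel on the GPU (processing happens here)
    int threads = 256, blocks = (n + threads - 1) / threads;
    add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // 4. CPU copies results back to the CPU to do something useful with them
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);  // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```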
### Lingo

- kernels (not popcorn kernels, not convolution kernels, not Linux kernels, but GPU kernels)
- threads, blocks, and grids (next chapter)
- GEMM = GEneral Matrix Multiplication
- SGEMM = Single-precision (fp32) GEneral Matrix Multiplication
- CPU/host/functions vs. GPU/device/kernels
  - The CPU is referred to as the host. It executes functions.
  - The GPU is referred to as the device. It executes GPU functions called kernels.
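That distinction is visible directly in code: a host function is ordinary C++ called normally, while a kernel is marked `__global__` and launched from the host with the `<<<...>>>` syntax. A tiny illustrative sketch (the function names are made up):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Host function: ordinary C++, runs on the CPU, called like any other function
void host_hello() { printf("hello from the host (CPU)\n"); }

// Kernel: marked __global__, runs on the GPU (device), launched with <<<grid, block>>>
__global__ void device_hello() { printf("hello from the device (GPU)\n"); }

int main() {
    host_hello();               // normal function call
    device_hello<<<1, 1>>>();   // kernel launch: 1 block of 1 thread
    cudaDeviceSynchronize();    // wait for the kernel so its printf is flushed
    return 0;
}
```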