Distributed Systems

Proxy

Reverse proxy

A server or software that mediates internet traffic flow (e.g., making HTTP requests). There are two types
- Forward proxy
  - Provides single or multiple clients with access to a proxy service on a common internal network
  - Acts as an intermediary that performs tasks on behalf of clients over the Internet
  - Each request before it reaches the server first passes through the proxy server, which then executes it
  - Benefits
    - Protects the client’s online identity → The IP address of the client is hidden, only the IP address of the forward proxy is visible
    - Bypass browsing restrictions (firewalls)
    - Block clients’ access to certain content
- Reverse proxy
  - Intercepts the requests from clients and talks to the web server on behalf of the clients
  - Benefits
    - Protect a website against a DDOS attack → The website’s IP addresses are hidden from the clients
    - Load balancing by distributing the traffic to a large pool of web servers and preventing any one of them from becoming overloaded
      - Assumption: the proxy server can handle a large amount of incoming requests
    - Cache static content
    - Handle SSL encryption
Best practices for HTTP requests in production
- Retries: Requests may occasionally fail due to transient issues (e.g., slow network, node failure, power outage, spike in traffic, etc.). Retry failed requests a handful of times to account for these issues.
- Exponential backoff: To avoid bombarding the application with retries during a transient error, apply an exponential backoff on failure. Each retry should wait exponentially longer than the previous one before running. For example, the first retry may wait 0.1s after a failure, and subsequent retries wait 0.4s (4 x 0.1), 1.6s, 6.4s, 25.6s, etc. after the failure.
- Timeouts: Add a timeout to each retry to prevent requests from hanging. The timeout should be longer than the application’s latency to give your application enough time to process requests.
Best practices for HTTP clients
- Example
- The I.D.E.A.L. HTTP client
  - I --- Investing in stability (mandatory)
    - Network is unreliable, services are unpredictable → HTTP communications are unstable → Needs stability
      - Timeouts
        
        Set timeouts for all external requests
        
        GitHub - ankane/go-timeouts: Timeouts for popular Go packages
        
        Timeouts for popular Python packages
        
        GitHub - ankane/node-timeouts: Timeouts for popular Node packages
        
        Don’t let services hang indefinitely
        
        Reasonably low duration — several seconds is sufficient
        
        Long-enough hanging requests should be failed to allow the application to degrade gracefully while missing some functionality
        
        Fail fast and loud (Erlang/Elixir philosophy)
        
        Should be set wherever possible (database, cache, etc.)
      - Error handling
        
        Always expect and handle HTTP errors gracefully
        
        Use custom error classes
        
        Can easily group them in error reporting system
        
        Craft a consistent error hierarchy

HTTP client error

Retries
- Attempt to retry requests to simplify things when recovering from sneaky flapped errors
- Idempotent requests are safe for retries (GET, HEAD, PUT, DELETE, OPTIONS, TRACE), never try non-idempotent requests (POST, PATCH)
- Limit the number of retries and the time between them to make way for service recovery
  - Retry with exponential backoff
- Multiple processes waiting for the same event simultaneously triggered once the event happens
  - Thundering herd problem
- More advanced: circuit breaker pattern
  - Prevent overloading services with retries
D --- Debugging uncertainties (mandatory)
- Logging, monitoring, error reporting
  - Keep logs of external requests
- User-agents
  - Identify the HTTP client on our side to other services

User-agents

Correlation ID and tracing
- Add a correlation ID to your HTTP client to track requests to external services in the logs and to correlate them within request and response pairs

Correlation ID

E --- Exploring the client (mandatory)
- Configuration
  - Keep configuration clean and directed to a single service
  - The best configuration is no configuration
    - Keep the defaults sane for end users
- Performance
  - Beware of memory exhaustion due to larger file downloads or uploads
    - Stream large files instead of loading them into memory
  - Fine-tune HTTP client based on application-specific performance metrics
    - Leverage HTTP/2 for connection reuse
    - Employ binary protocols to reduce payload sizes
    - Some methods require collaboration with upstream services
    - Parallelize requests
    - HTTP caching
    - Persistent connections
  - Benchmark your application to find the best approach
- Standardization
  - Have a standardized way to handle external requests in an application
A --- Advancing with abstractions (opinionated)
- HTTP client as a library
  - An HTTP client can be separated into several reused layers
    - Configuration
    - Unified error handling and error hierarchy
    - Retry
    - Logging and monitoring facilities
    - API client
    - Domain models
L --- Laying out a sound structure (opinionated)
- Typed domain models
Testing
- Use a contract testing approach to ensure that the HTTP client works as expected
MapReduce
- Programming model for distributing and parallelizing a large computation over a cluster of machines → Faster execution time
- Developed at Google
- High-level workflow
  - The user
    - Define the map function
    - Define reduce function
  - MapReduce
    - Partition input data
    - Schedule tasks
    - Handle machine failures
    - Facilitate inter-machine communication
- Deep dive
  - The MapReduce library splits the input into M chunks based on what the user specified
  - Make copies of the program for the machines in the cluster
    - One of them is the master, the rest are workers
    - The workers will be given map task or reduce task (M map tasks and R reduce tasks in total)
  - The worker with a map task executes the user-defined map function on an input chunk
    - The value returned (an intermediate key-value pair) will first be saved in RAM (buffered in memory)
    - Eventually persisted on the worker’s local disk, which is partitioned into R regions/files
      - One for each unique key of the R reduce tasks
    - The file paths of intermediate key-value pairs are passed back to the master
    - The master then forwards these locations to the reduce workers
  - The reduce worker gets all key-value pairs associated with the reduce task’s key
    - Once it has all the key-value pairs, it will pass them to the reduce function and save the output to a globally available storage (S3, GFS)
  - The MapReduce returns the output to the user
CRDT (Conflict-free Replicated Data Type)
- A data structure that can be stored on different computers (peers)
- Each peer can update its own state instantly, without a network request to check with other peers
- Peers may have different states at different points in time, but are guaranteed to eventually converge on a single agreed-upon state
- Great for building rich collaborative apps without requiring a central server to sync changes
  - E.g., Google Docs, Figma
- Two kinds of CRDTs
  - State-based
    - Transmit their full state between peers
    - A new state is obtained by merging all the states together
  - Operation-based
    - Transmit only the actions that users take, which can be used to calculate a new state
- Operation-based CRDT
  - + more efficient → only send differential state
  - - constraint on communication channel: exact-once delivery, in causal order, to each peer
- State-based CRDTs
  - Any data structure that implements this interface
interface CRDT<T, S> {
value: T;
state: S;
merge(state: S): void;
}
value: the part that the rest of our program cares about
state: the metadata needed for peers to agree on the same value
- To update other peers, the whole state is serialized and sent to them
merge function: takes some state (possibly received from another peer) and merges it with the local state