
Distributed Systems

  • Proxy


  • A server or piece of software that mediates traffic between clients and servers (e.g., forwarding HTTP requests). There are two types

    • Forward proxy

      • Serves one or more clients on a common internal network, giving them access to the wider Internet through the proxy

      • Acts as an intermediary that performs tasks on behalf of clients over the Internet

      • Each request first passes through the proxy server before reaching the destination server; the proxy then executes it on the client’s behalf

      • Benefits

        • Protects the client’s online identity → The IP address of the client is hidden, only the IP address of the forward proxy is visible

        • Bypass browsing restrictions (firewalls)

        • Block clients’ access to certain content

    • Reverse proxy

      • Intercepts the requests from clients and talks to the web server on behalf of the clients

      • Benefits

        • Protect a website against a DDoS attack → The origin servers’ IP addresses are hidden from the clients

        • Load balancing by distributing the traffic to a large pool of web servers and preventing any one of them from becoming overloaded

          • Assumption: the proxy server can handle a large amount of incoming requests
        • Cache static content

        • Handle SSL encryption
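The load-balancing benefit above can be sketched as a simple round-robin picker; a reverse proxy would forward each incoming request to whichever backend it returns. The class name and server addresses are illustrative, not taken from any particular proxy.

```typescript
// Sketch: round-robin backend selection, one strategy a reverse
// proxy can use to spread traffic across a pool of web servers.
class RoundRobin {
  private next = 0;

  constructor(private readonly backends: string[]) {}

  // Return the next backend in rotation, wrapping around the pool
  pick(): string {
    const backend = this.backends[this.next];
    this.next = (this.next + 1) % this.backends.length;
    return backend;
  }
}

// Hypothetical usage: the proxy would forward each request here
const pool = new RoundRobin(["10.0.0.1:8080", "10.0.0.2:8080"]);
```

Round-robin assumes roughly uniform request cost; real proxies also offer least-connections or weighted strategies.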

  • Best practices for HTTP requests in production

    • Retries: Requests may occasionally fail due to transient issues (e.g., slow network, node failure, power outage, spike in traffic, etc.). Retry failed requests a handful of times to account for these issues.

    • Exponential backoff: To avoid bombarding the application with retries during a transient error, apply an exponential backoff on failure. Each retry should wait exponentially longer than the previous one before running. For example, the first retry may wait 0.1s after a failure, and subsequent retries wait 0.4s (4 x 0.1), 1.6s, 6.4s, 25.6s, etc. after the failure.

    • Timeouts: Add a timeout to each retry to prevent requests from hanging. The timeout should be longer than the application’s latency to give your application enough time to process requests.
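The three practices above (retries, exponential backoff, timeouts) can be combined in one helper. A minimal sketch; `fetchWithRetry`, the 4x multiplier, and the constants are illustrative choices matching the example delays (0.1s, 0.4s, 1.6s, …):

```typescript
// Sketch: retries + exponential backoff + per-attempt timeout.
const BASE_DELAY_MS = 100; // first retry waits 0.1s
const MULTIPLIER = 4;      // each retry waits 4x longer
const MAX_RETRIES = 5;

// Delay before retry attempt n (0-based): 100ms, 400ms, 1600ms, ...
function backoffDelay(attempt: number): number {
  return BASE_DELAY_MS * MULTIPLIER ** attempt;
}

async function fetchWithRetry(url: string, timeoutMs = 5000): Promise<Response> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    if (attempt > 0) {
      await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt - 1)));
    }
    try {
      // AbortSignal.timeout aborts the request if it hangs too long
      return await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    } catch (err) {
      lastError = err; // transient failure: fall through and retry
    }
  }
  throw lastError;
}
```

This sketch retries only network-level failures and timeouts; retrying on specific HTTP status codes (e.g., 503) is a separate policy decision.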

  • Best practices for HTTP clients

    • Example

    • The I.D.E.A.L. HTTP client

      • I --- Investing in stability (mandatory)

        • Network is unreliable, services are unpredictable → HTTP communications are unstable → Needs stability


  • Retries

    • Retry failed requests to recover transparently from transient (flapping) errors

    • Idempotent requests are safe to retry (GET, HEAD, PUT, DELETE, OPTIONS, TRACE); never retry non-idempotent requests (POST, PATCH)

    • Limit the number of retries and the time between them to make way for service recovery

      • Retry with exponential backoff
    • Thundering herd problem

      • Multiple processes waiting for the same event are all triggered simultaneously once it happens (e.g., many clients retrying at the same moment)
    • More advanced: circuit breaker pattern

      • Prevent overloading services with retries
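A minimal sketch of the circuit breaker pattern mentioned above: after too many consecutive failures the breaker "opens" and fails fast, giving the downstream service time to recover. The class name, threshold, and reset timeout are illustrative.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 3,
    private readonly resetTimeoutMs = 30_000,
    private readonly now: () => number = Date.now, // injectable clock
  ) {}

  state(): BreakerState {
    if (this.failures < this.failureThreshold) return "closed";
    // After the reset timeout, let a single trial request through
    return this.now() - this.openedAt >= this.resetTimeoutMs ? "half-open" : "open";
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state() === "open") throw new Error("circuit open: failing fast");
    try {
      const result = await fn();
      this.failures = 0; // a success closes the breaker again
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures === this.failureThreshold) this.openedAt = this.now();
      throw err; // caller still sees the underlying error
    }
  }
}
```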
  • D --- Debugging uncertainties (mandatory)

    • Logging, monitoring, error reporting

      • Keep logs of external requests
    • User-agents

      • Identify the HTTP client on our side to other services


  • Correlation ID and tracing

    • Add a correlation ID to your HTTP client to track requests to external services in the logs and to correlate request and response pairs

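One way to wire a correlation ID into an HTTP client, as a sketch: generate an ID per request, send it as a header, and log it with both the request and the response. The `X-Correlation-ID` header name is a common convention rather than a standard, and the function names here are hypothetical.

```typescript
import { randomUUID } from "node:crypto";

// Attach a correlation ID to an outgoing request's headers.
// Kept pure (plain header record in, record out) so it is easy to test.
function withCorrelationId(
  headers: Record<string, string> = {},
  id: string = randomUUID(),
): { id: string; headers: Record<string, string> } {
  return { id, headers: { ...headers, "X-Correlation-ID": id } };
}

// Log request and response under the same ID so log lines pair up.
async function tracedFetch(url: string): Promise<Response> {
  const { id, headers } = withCorrelationId();
  console.log(`[${id}] → GET ${url}`);
  const res = await fetch(url, { headers });
  console.log(`[${id}] ← ${res.status} for ${url}`);
  return res;
}
```

If the receiving service echoes the ID back (or forwards it downstream), a single request can be traced across several services' logs.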

  • E --- Exploring the client (mandatory)

    • Configuration

      • Keep configuration clean and directed to a single service

      • The best configuration is no configuration

        • Keep the defaults sane for end users
    • Performance

      • Beware of memory exhaustion due to large file downloads or uploads

        • Stream large files instead of loading them into memory
      • Fine-tune HTTP client based on application-specific performance metrics

        • Leverage HTTP/2 for connection reuse

        • Employ binary protocols to reduce payload sizes

        • Some methods require collaboration with upstream services

        • Parallelize requests

        • HTTP caching

        • Persistent connections

      • Benchmark your application to find the best approach
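The "stream large files instead of loading them into memory" advice above, sketched with Node's stream utilities. `downloadToFile` and the URL/path arguments are hypothetical; the key point is that `pipeline` moves chunks straight from the response body to disk, keeping memory usage flat regardless of file size.

```typescript
import { createWriteStream } from "node:fs";
import { Readable } from "node:stream";
import { pipeline } from "node:stream/promises";

// Stream an HTTP response body to disk chunk by chunk, never
// holding the whole file in memory.
async function downloadToFile(url: string, path: string): Promise<void> {
  const res = await fetch(url);
  if (!res.ok || !res.body) throw new Error(`download failed: ${res.status}`);
  // Convert the web ReadableStream to a Node stream and pipe to disk
  await pipeline(Readable.fromWeb(res.body as any), createWriteStream(path));
}
```

The naive alternative (`Buffer.from(await res.arrayBuffer())`) buffers the entire file, which is exactly the exhaustion risk the note warns about.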

    • Standardization

      • Have a standardized way to handle external requests in an application
  • A --- Advancing with abstractions (opinionated)

    • HTTP client as a library

      • An HTTP client can be separated into several reusable layers

        • Configuration

        • Unified error handling and error hierarchy

        • Retry

        • Logging and monitoring facilities

        • API client

        • Domain models

  • L --- Laying out a sound structure (opinionated)

    • Typed domain models
  • Testing

    • Use a contract testing approach to ensure that the HTTP client works as expected
  • MapReduce

    • Programming model for distributing and parallelizing a large computation over a cluster of machines → Faster execution time

    • Developed at Google

    • High-level workflow

      • The user

        • Define the map function

        • Define reduce function

      • MapReduce

        • Partition input data

        • Schedule tasks

        • Handle machine failures

        • Facilitate inter-machine communication

    • Deep dive

      • The MapReduce library splits the input into M chunks based on what the user specified

      • Copies of the program are started on the machines in the cluster

        • One of them is the master, the rest are workers

        • Each worker is assigned either a map task or a reduce task (M map tasks and R reduce tasks in total)

      • The worker with a map task executes the user-defined map function on an input chunk

        • The value returned (an intermediate key-value pair) will first be saved in RAM (buffered in memory)

        • Eventually persisted to the worker’s local disk, which is partitioned into R regions/files

          • One per reduce task, chosen by a partitioning function on the intermediate key (e.g., hash(key) mod R)
        • The file paths of intermediate key-value pairs are passed back to the master

        • The master then forwards these locations to the reduce workers

      • The reduce worker reads all intermediate key-value pairs assigned to its partition and groups them by key

        • Once it has all the key-value pairs, it will pass them to the reduce function and save the output to a globally available storage (S3, GFS)
      • The MapReduce library returns the output to the user
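The workflow above, shrunk to a single process: a word-count sketch of the MapReduce programming model (word count is the canonical example). The user supplies `map` and `reduce`; the in-memory grouping step stands in for the shuffle/partition that the real library performs across machines.

```typescript
type Pair = [key: string, value: number];

// User-defined map: emit (word, 1) for each word in an input chunk
function wordCountMap(chunk: string): Pair[] {
  return chunk.split(/\s+/).filter(Boolean).map((w) => [w, 1]);
}

// User-defined reduce: sum all counts emitted for one key
function wordCountReduce(key: string, values: number[]): Pair {
  return [key, values.reduce((a, b) => a + b, 0)];
}

function mapReduce(chunks: string[]): Map<string, number> {
  // "Shuffle": group intermediate pairs by key, as the library would
  // do by partitioning intermediate files across reduce workers
  const groups = new Map<string, number[]>();
  for (const chunk of chunks) {
    for (const [k, v] of wordCountMap(chunk)) {
      const list = groups.get(k);
      if (list) list.push(v);
      else groups.set(k, [v]);
    }
  }
  // Reduce phase: one reduce call per unique key
  const out = new Map<string, number>();
  for (const [k, vs] of groups) {
    const [key, total] = wordCountReduce(k, vs);
    out.set(key, total);
  }
  return out;
}
```

In a real deployment, the chunks, intermediate groups, and reduce calls would be spread across machines; the fault tolerance and scheduling the notes describe are exactly what this toy version omits.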

  • CRDT (Conflict-free Replicated Data Type)

    • A data structure that can be stored on different computers (peers)

    • Each peer can update its own state instantly, without a network request to check with other peers

    • Peers may have different states at different points in time, but are guaranteed to eventually converge on a single agreed-upon state

    • Great for building rich collaborative apps without requiring a central server to sync changes

      • E.g., Google Docs, Figma
    • Two kinds of CRDTs

      • State-based

        • Transmit their full state between peers

        • A new state is obtained by merging all the states together

      • Operation-based

        • Transmit only the actions that users take, which can be used to calculate a new state
    • Operation-based CRDT

      • + More efficient → only differential updates (operations) are sent

      • - Constraints on the communication channel: exactly-once delivery, in causal order, to each peer

    • State-based CRDTs

      • Any data structure that implements this interface

    interface CRDT<T, S> {
      value: T;
      state: S;
      merge(state: S): void;
    }



  • value: the part that the rest of our program cares about

  • state: the metadata needed for peers to agree on the same value

    • To update other peers, the whole state is serialized and sent to them
  • merge function: takes some state (possibly received from another peer) and merges it with the local state
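A concrete instance of this interface, as a sketch: a grow-only counter (G-Counter), one of the simplest state-based CRDTs. Each peer increments only its own slot, and `merge` takes the element-wise maximum, which is commutative, associative, and idempotent, so all peers converge on the same value regardless of merge order. The interface is repeated so the example is self-contained.

```typescript
interface CRDT<T, S> {
  value: T;
  state: S;
  merge(state: S): void;
}

// state: one counter slot per peer ID
type GCounterState = Record<string, number>;

class GCounter implements CRDT<number, GCounterState> {
  state: GCounterState = {};

  constructor(private readonly peerId: string) {}

  // value: the sum of every peer's slot
  get value(): number {
    return Object.values(this.state).reduce((a, b) => a + b, 0);
  }

  // Local update: instant, no network round-trip needed
  increment(): void {
    this.state[this.peerId] = (this.state[this.peerId] ?? 0) + 1;
  }

  // Merge a (possibly remote) state by taking per-peer maximums
  merge(state: GCounterState): void {
    for (const [peer, count] of Object.entries(state)) {
      this.state[peer] = Math.max(this.state[peer] ?? 0, count);
    }
  }
}
```

Because merging full states is safe to repeat and reorder, peers can exchange their serialized `state` over any unreliable channel, the key relaxation compared with operation-based CRDTs.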