Encoding a Long URL to a Short URL

In this part, we'll explore how to encode a long URL into a short URL. There are different techniques we can use to achieve this.

Option 1: Hash Functions

According to Wikipedia, a hash function is any function that maps data of arbitrary size to fixed-size values.

Let’s take MD5 as an example. The output length of the MD5 hash function is 128 bits, or 16 bytes. When represented as a hexadecimal string, it’s 32 characters long.

Here’s a simple example of using MD5 in Ruby:

require 'digest/md5'
Digest::SHA256.hexdigest("www.test.com")
# => "84cc0e5c525dc728e1769ad6663341c8"

As you can see, the output is 32 characters, which is too long for our use case. To address this, we can use a simple trick: taking only the first 7 characters.

Digest::MD5.hexdigest("www.test.com")[0..6]
# => "84cc0e5"

Cool; it's pretty easy, right?

Unfortunately, this is not a perfect solution as the MD5 algorithm might lead to collisions. Here are the scenarios where collisions could happen:

The MD5 algorithm can possibly generate the same hash code for different strings (this is very rare).
Even if the entire hash code is not the same, you could encounter a collision where two different hash codes share the same first 7 characters that we plan to use.

We could introduce a unique index to solve this problem. This would allow us to catch any duplicate hash codes when writing them to the database and retry generating a new hash code. However, using a unique index has its downsides. It would place a lock on our database, which wouldn't scale well if we receive a lot of writes. Additionally, if we plan to shard our database to scale writes across different regions, the unique index approach would no longer work effectively.

Option 2: Counter (Optimal Solution)

Instead of using a hash function, we can use a counter-based approach to generate short URLs. This method is much simpler and avoids the collision issues that can occur with hash functions.

How It Works

The idea is straightforward: we maintain a global counter that increments with every new URL request. Each time a new URL is submitted, we increment the counter and convert the value of the counter into a short string using Base62 encoding.

Base62 encoding uses a character set of 62 characters (0-9, a-z, A-Z), which is perfect for generating short, readable URLs. By encoding the incremented counter value into Base62, we generate a unique and compact string for each URL.

Why It’s Optimal

Uniqueness: Since the counter increases sequentially, every value is guaranteed to be unique, which eliminates the risk of collisions.
Short Length: With Base62 encoding, we can generate short strings that are much smaller than the original counter value, making the URLs compact and easy to share.
Scalability: The counter-based approach is scalable and performs well, even with large numbers of URLs, since it's a simple increment and encoding operation.
No Need for a Unique Index: Unlike the hash-based approach, we don’t need to rely on a unique index in the database, as the counter ensures uniqueness on its own.

The Problem with Counters in Scalable Applications

When scaling the application with multiple instances (e.g., 3 or more), managing the counter across instances can become problematic. If the counter logic is handled by one instance, that instance becomes a single point of failure. If it goes down, the entire counter mechanism fails, which disrupts the generation of short URLs.

Even if each instance manages its own counter, there’s still a challenge. Once an instance exhausts its local counter range, it would need a mechanism to obtain the next range of counters. This leads to more complexity in coordination across instances.

To avoid these issues, we need a global counter service responsible for managing the counter in a distributed and scalable manner. This ensures that all instances can safely and consistently generate unique short URLs, without risking collisions or relying on a single instance to manage the counter.

Ensuring Scalability with Distributed Systems

In a distributed system where multiple instances of the URL shortener are running, it's critical to ensure that each instance generates unique short URLs without collisions. To achieve this, we rely on etcd as a distributed coordination service.

What is etcd?

etcd is a distributed, reliable key-value store used for coordinating configuration data across multiple servers or machines. It ensures strong consistency and provides a way to manage shared data between multiple instances of our application. In our case, etcd will manage the global counter used to generate unique short URLs.

etcd plays a key role in ensuring that all instances of the URL shortener service are synchronized, so that each instance retrieves the correct counter range without collisions, even in a distributed, multi-instance environment.

Why Use a 3-Node etcd Cluster?

To avoid creating a single point of failure, etcd is not run in standalone mode. Instead, we deploy a 3-node etcd cluster to ensure high availability and fault tolerance. With this setup, even if one node goes down, the other nodes will continue to manage the distributed counter, ensuring that the service remains functional.

Running etcd in a 3-node cluster guarantees that our coordination service is highly available and resilient to failures. Each etcd node shares the responsibility of managing the counters, and they work together using the Raft consensus algorithm to ensure consistency across all nodes.

How etcd Coordinates Between Machines

In our URL shortener service, etcd acts as a coordination service between the different machines or servers. Here’s how it works:

Tracking Active Machines: etcd maintains a list of machines (or instances) that are currently active. Each instance communicates with etcd to register itself and retrieve the range of counters it is responsible for.
Assigning Counter Ranges: etcd keeps track of the last counter that was used across all instances. When a new instance is added to the system (e.g., for scalability), it talks to etcd and receives a new, unallocated range of counters that it can use to generate short URLs.
Handling Counter Exhaustion: If an instance exhausts its current counter range, it communicates with etcd again to request the next available counter range. This ensures that every instance is always generating unique short URLs, even when it runs out of its initial counter range.

By coordinating with etcd, all instances of the service can generate unique short URLs without needing to worry about collisions or stale counters. etcd ensures that only one instance uses a specific range of counters at a time, making the system scalable and resilient.

Here is a link of Docker Compose configuration, we have set up a 3-node etcd cluster to ensure high availability.

Example: Using a Counter with Base62

Let’s say the counter starts at 1 and increments by 1 with every new URL. The counter values will be encoded into Base62 like this:

Counter: 1 → Base62: 1
Counter: 62 → Base62: 10
Counter: 3844 → Base62: 100

Each counter value generates a unique, encoded string that we can use as the short URL.

Here’s a simplified version of how the encoding would work in Ruby:

class Base62
  CHARSET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ".chars
  BASE = CHARSET.size

  def self.encode(num)
    return CHARSET[num] if num.zero?
    raise ArgumentError, "Number must be non-negative" if num.negative?

    str = ""
    while num > 0
      str = CHARSET[num % BASE] + str
      num /= BASE
    end
    str
  end
end

Getting the Next Counter via `CounterService`

The CounterService is responsible for managing the global counter, ensuring that each request for a new counter is handled in a thread-safe manner, even in a multi-threaded environment. When a new URL is shortened, CounterService.get_next_counter is invoked to retrieve the next available counter.

Here’s how CounterService.get_next_counter works:

Counter Initialization at Boot-Up: The counter range is initialized once during the server boot-up in config.ru. This ensures that the counter range is prepared and ready to handle requests as soon as the server is up and running.
Thread Safety with Mutex: To handle concurrent requests, get_next_counter uses a mutex to ensure that only one thread can modify the counter at a time. This prevents race conditions and ensures that the counter is incremented consistently, even in a multi-threaded environment.

def get_next_counter
  counter_mutex.synchronize do
    current_counter = counter
    if current_counter >= counter_range.last
      self.counter_range = get_counter_range
      self.counter = counter_range.first
      current_counter = counter
    end
    self.counter += 1
    current_counter
  end
end

Counter Range Exhaustion: Once the current counter reaches the end of the allocated range, CounterService will request a new counter range by calling get_counter_range. This ensures that a fresh range of counters is always available for the next requests.

Here’s the implementation of the get_counter_range method:

def get_counter_range
  loop do
    current_value = ETCD_CLIENT.get(COUNTER_KEY).kvs.first&.value.to_i
    new_value = current_value + RANGE_SIZE
    txn = ETCD_CLIENT.transaction do |txn|
      txn.compare = [
        txn.value(COUNTER_KEY, :equal, current_value.to_s),
      ]
      txn.success = [
        txn.put(COUNTER_KEY, new_value.to_s)
      ]
    end

    if txn.succeeded
      puts "Instance #{ENV['SERVICE_NAME']} obtained counter range #{current_value} to #{new_value - 1}"
      return (current_value...new_value)
    end
  end
end

Here’s how the `get_counter_range` method works:

Fetching the Current Counter Value: The method begins by retrieving the current counter value from etcd, using ETCD_CLIENT.get(COUNTER_KEY).kvs.first&.value.to_i. This fetches the most up-to-date value of the global counter from the etcd cluster.
Calculating the New Counter Range: The new range is calculated by adding a fixed RANGE_SIZE to the current_value. This ensures that the instance requesting the range will handle a specific block of counters.
Distributed Counter Allocation: The method employs etcd’s transactional operations to ensure safe and consistent counter updates across multiple instances:
1. Atomic Update: The txn.compare block checks if the current value in etcd is still the same as the current_value retrieved earlier. This is to make sure that no other instance has updated the counter in the meantime.
2. Success Block: If the comparison succeeds (i.e., no other instance has modified the counter), the transaction proceeds with updating the counter to new_value using txn.put.
Consistency in a Distributed Environment: This method ensures consistency in distributed environments where multiple server instances are running. By atomically checking and updating the counter using etcd's transaction mechanism, it guarantees that each instance gets its own unique range of counters without overlap or collisions.
Retry Mechanism: If the transaction fails (which means another instance updated the counter), the loop retries by fetching the new current value from etcd. This ensures that the service will always get a valid counter range.
Logging: A log message (puts) is used to track which instance obtained a counter range, which can be helpful for debugging and monitoring.

By using this method, CounterService can handle distributed counter allocation in a consistent, safe manner, ensuring that each instance of the URL shortener service always has a unique counter range to generate short URLs, even in a multi-instance, distributed environment.

Use Cases Supported by `CounterService`

Unique Counter Generation: CounterService guarantees the generation of a unique counter value for every request. By using a distributed counter system (e.g., with etcd or another source), it ensures that no two instances of the service generate the same counter value.
Thread-Safe Operations: The use of a mutex ensures that multiple threads in a multi-threaded web server (e.g., Puma) can safely access and modify the counter without causing race conditions. This is critical for ensuring the integrity and uniqueness of counter values in a concurrent environment.
Range-Based Counter Allocation: To improve performance, CounterService works with ranges of counters. It keeps track of a counter range, incrementing the current counter within that range. When the range is exhausted, it requests a new range, minimizing the need to repeatedly fetch individual counter values from an external source.
Distributed Counter Coordination: In distributed environments, where multiple instances of the URL shortener are running, CounterService can work with a coordination service like etcd. This allows each instance to request a unique range of counters, ensuring there are no overlaps or collisions across instances.

Conclusion

The CounterService is a crucial component of our URL shortener system, handling the management and allocation of counters in a way that is both thread-safe and scalable. By utilizing a distributed counter system and supporting range-based allocation, it ensures the generation of unique, incremented counters for every URL shortening request, even in a multi-threaded or distributed environment.

Explore the Source Code

If you want to explore the code or run two instances of the URL shortener service locally, you can find the full source code in my GitHub repository:

scalable_url_shortener

The repository includes everything you need, including Docker Compose configurations, to set up and run multiple instances of the service, a 3-Node etcd Cluster, and MongoDB.

Feel free to clone the repository and experiment with running the service locally!

Scalable Url Shortener Part2