Run Length Encoding

Run length encoding is the most well-known compression, probably due to how intuitive it is. It’s simplicity doesn’t mean it’s not being used though, RLE has applications in:

  • Image and Video compression
  • Audio Compression
  • Bzip2

Intuition

A run refers to a repeated sequence of characters. The idea is very simple: instead of writing aaaa, write a4 since a repeats 4 times in total.

Implementation

When implementing the algorithm, we need to think about bytes. Refer to the section on bytes if you are confused with what bytes are.

The simplest way to implement this algorithm would be to have the length immediately after each character, and have the length represented by a single byte.

This means we can refer to lengths of 255 at most. We will talk about potential improvements to this later, but for most real-world cases this limitation is fine.

Pseudo-code

  1. Iterate through the input array, one character at a time.
  2. For each character, initialise a new length counter.
  3. While the next character is equal to the current one, move your pointer forward and increment your counter.
  4. When the run ends, add the character to the output array and then add the length counter after it.
  5. Repeat steps 2-4 until reaching the end of the array.

Potential Improvements

You might notice a lot of the time, run-length encoding can make the input bigger.

  • A simple example of this: abcd -> a1b1c1d1.
  • The input doubled in size, because there was no runs of repeated characters.

We can add a byte to the start of the output that acts as a flag. If the compressed result is bigger in size than the input, this flag will be set to 0x00 indicating that the rest of the data is stored uncompressed. With this addition, any output will grow at most one byte.

Demo

Encoder Demo

Code

std::vector<unsigned char> RunLengthEncoding::encode(const std::vector<unsigned char>& data) {
  std::vector<unsigned char> compressed;
  int n = data.size();

  for (int i = 0; i < n; i++) {
    unsigned char byte = data[i];
    int count = 1;

    while (i < n - 1 && data[i] == data[i + 1] && count < 255) {
      count++;
      i++;
    }

    compressed.push_back(byte);
    compressed.push_back(static_cast<unsigned char>(count));
  }

  return compressed;
}

std::vector<unsigned char> RunLengthEncoding::decode(const std::vector<unsigned char>& data) {
  std::vector<unsigned char> decompressed;
  unsigned char curByte;
  for (int i = 0; i < data.size(); i++) {
    if (i % 2 == 0) {
      curByte = data[i];
    } else {
      int count = static_cast<int>(data[i]);
      for (int j = 0; j < count; j++) {
        decompressed.push_back(curByte);
      }
    }
  }
  return decompressed;
}

Further Reading