A base64 decoder in pure shell script

I can't go into the details of why, but at work we found ourselves needing to update one of the JAR files in our agent software using nothing more than a shell script. The agents run on customer endpoints, so we have zero control over the tools available. We know the systems have neither base64 nor curl handy, so we were in kind of a pickle. How could we get our updated JAR out to the system and installed where we needed it?

Well, how hard can base64 decoding be, anyway? Google to the rescue!

This page provides a really easy-to-follow tutorial on decoding base64 by hand, which I turned into a shell script.
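To make the by-hand procedure concrete, here is a minimal sketch that decodes a single unpadded quartet. It uses integer arithmetic rather than the bit-string table the script below relies on, and the function names b64_value and decode_quartet are made up for this illustration. Each character maps to a 6-bit value, four of those are packed into a 24-bit number, and the number is split into three bytes.

```shell
# Hypothetical helper: map one base64 character to its 6-bit value (0-63).
b64_value() {
  alphabet='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
  # Strip everything from the first occurrence of the character onward;
  # the length of what remains is the character's index in the alphabet.
  prefix=${alphabet%%"$1"*}
  echo "${#prefix}"
}

# Hypothetical helper: decode exactly four base64 characters (no '=' padding).
decode_quartet() {
  q=$1
  c1=${q%???}
  q=${q#?}; c2=${q%??}
  q=${q#?}; c3=${q%?}
  c4=${q#?}
  v1=$(b64_value "$c1"); v2=$(b64_value "$c2")
  v3=$(b64_value "$c3"); v4=$(b64_value "$c4")
  # Pack four 6-bit values into one 24-bit number.
  n=$(( (v1 << 18) | (v2 << 12) | (v3 << 6) | v4 ))
  # Emit the three bytes, most significant first, via octal escapes.
  for s in 16 8 0; do
    printf "\\$(printf '%03o' $(( (n >> s) & 255 )))"
  done
}

decode_quartet TWFu   # prints: Man
echo
```

For "TWFu" the four values are 19, 22, 5 and 46, which pack into the bytes 77, 97 and 110, the ASCII codes for "Man".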

The good news is that I was able to implement it in pure shell script, with no external dependencies. The bad news is that bash runs it extremely slowly, on the order of 20-30 minutes to decode ~153 KB of base64 data. The Korn shell, by contrast, is roughly 10x as fast, taking only 2m20s to decode the same dataset. That's still really slow compared to the base64 utility, but it's better than a kick in the teeth, and fast enough for what we need.

The first version of my script had things factored out nicely into functions, but I had to eliminate most of those to optimize for speed. So, apologies for the ugliness.

There are a few known issues with this script:

  1. It does not handle the case where a base64 quartet is split across lines. If your base64 content is split across lines, ensure the length of each line is divisible by 4.
  2. The file must contain at least one newline. If the base64 string is all one line with no newline at the end, the script will just exit and create a 0-byte output file.
  3. The decoded file may have an extra byte at the end. I am not quite sure where it's coming from. For our use case, the extra byte isn't a problem (our payload is a tgz file that decompresses without issue).
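Known issue 2 comes down to how read behaves: at end of input it returns a non-zero status when the last line has no terminating newline, even though it has still stored that text in the variable. A common idiom, shown here as a sketch rather than as part of the script, adds a second test so the loop body still runs for that final line. The helper name bracket_lines is hypothetical.

```shell
# Hypothetical helper: echo every line of stdin in brackets, including a
# final line that has no terminating newline.
bracket_lines() {
  # read fails at end of input but still fills $line with any unterminated
  # text; the extra -n test lets the loop body run for it.
  while read -r line || [ -n "$line" ]; do
    printf '[%s]\n' "$line"
  done
}

printf 'TWFu' | bracket_lines   # input with no newline at all
```

With a plain `while read -r line` the loop body would never run for this input, which matches the 0-byte output described above.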
#!/usr/bin/env ksh
# also works with bash, but bash is an order of magnitude slower.

binary_table=('000000' '000001' '000010' '000011' '000100' '000101' '000110' '000111' \
              '001000' '001001' '001010' '001011' '001100' '001101' '001110' '001111' \
              '010000' '010001' '010010' '010011' '010100' '010101' '010110' '010111' \
              '011000' '011001' '011010' '011011' '011100' '011101' '011110' '011111' \
              '100000' '100001' '100010' '100011' '100100' '100101' '100110' '100111' \
              '101000' '101001' '101010' '101011' '101100' '101101' '101110' '101111' \
              '110000' '110001' '110010' '110011' '110100' '110101' '110110' '110111' \
              '111000' '111001' '111010' '111011' '111100' '111101' '111110' '111111')

# A neat feature of shell arrays is that they are sparse, and we leverage that here.
# This builds a mapping between the ASCII code of the base64 string characters and
# the corresponding offset in the binary_table above.
ascii_map=()
# Use an arithmetic for-loop rather than seq so no external command is needed.
for ((x = 0; x < 26; ++x)); do
  # A-Z
  ascii_map[65+$x]=$x
  # a-z
  ascii_map[97+$x]=$(($x+26))
  if [ $x -lt 10 ]; then
    # 0-9
    ascii_map[48+$x]=$(($x+52))
  fi
done
# +
ascii_map[43]=62
# /
ascii_map[47]=63

input_file=$1
# The string passed will be a binary representation of at most 3 bytes.
print_binary() {
  bytes=$1
  # pad out to the nearest byte
  while [ $((${#bytes} % 8)) -gt 0 ]; do
    bytes="${bytes}0"
  done
  num_bytes=$((${#bytes} / 8))
  # split string into 8-bit substrings
  for ((byte_counter = 0; byte_counter < $num_bytes; ++byte_counter)); do
    to_print=${bytes:$((byte_counter*8)):8}
    byte=0
    # parse binary string into an integer representation
    for ((bit_counter = 0; bit_counter < 8 ; ++bit_counter)); do
      to_shift=$((7 - $bit_counter))
      bit=$((${to_print:$bit_counter:1} << $to_shift))
      byte=$(($byte + $bit))
    done
    # once we have the byte decoded, print it to stdout
    printf "\x$(printf '%02x' $byte)"
  done
}

if [ -z "$input_file" ]; then
  # ${0##*/} strips the directory part without calling basename
  echo "usage: ${0##*/} <base64_file>" >&2
  exit 1
fi

if [ ! -r "$input_file" ]; then
  echo "error: $input_file is not readable." >&2
  exit 2
fi

count=0
binary_string=''
# BUG: The input file must contain at least one newline, or else this is a no-op
# BUG: This will produce faulty output if the lines in the input file are not a multiple of 4 in length
# read each line
while read -r word; do
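  # Walk every character of the line. read has already stripped the newline, and
  # a stray carriage return is skipped by the ascii_map lookup below.
  for ((i = 0; i < ${#word}; ++i)); do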
    # get the base64-encoded letter
    letter=${word:$i:1}
    # turn it into an ASCII code
    ascii_code=$(printf '%d' "'${letter}")
    # look up in our table to see the binary snippet that corresponds with it
    binary_table_idx=${ascii_map[$ascii_code]}

    # if the file has an invalid base64 character we just skip it
    if [ -n "$binary_table_idx" ]; then
      binary_string="${binary_string}${binary_table[$binary_table_idx]}"
      count=$(($count + 1))
      if [ $((count % 4)) -eq 0 ]; then
        print_binary $binary_string
        binary_string=''
      fi
    fi
  done
done < "$input_file"
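# '=' is not in ascii_map, so padding is skipped by the loop above. If the input
# ended in a partial group (2 or 3 characters plus padding), its bits are still
# sitting in binary_string. Flush them here; print_binary zero-pads the string
# up to a whole number of bytes.
if [ -n "$binary_string" ]; then
  print_binary "$binary_string"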
fi
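A likely source of the extra trailing byte is that final flush. A partial group holds 12 or 18 bits, and print_binary pads that up to 16 or 24 bits, so the leftover padding bits come out as one more byte. The sketch below shows the alternative of emitting only the whole bytes and dropping the leftover bits. It is a hypothetical variant, not the script's own function, and it is written with plain POSIX parameter expansion, so the name print_whole_bytes and its structure are assumptions for illustration.

```shell
# Hypothetical variant of print_binary: emit only the whole bytes contained
# in a string of '0'/'1' characters and discard any leftover bits.
print_whole_bytes() {
  bits=$1
  while [ "${#bits}" -ge 8 ]; do
    rest=${bits#????????}    # everything after the first 8 bits
    chunk=${bits%"$rest"}    # the first 8 bits
    bits=$rest
    value=0
    # Fold the 8 bits into an integer, most significant bit first.
    while [ -n "$chunk" ]; do
      tail=${chunk#?}
      bit=${chunk%"$tail"}
      value=$(( value * 2 + bit ))
      chunk=$tail
    done
    printf "\\$(printf '%03o' "$value")"
  done
}

# The 18 bits for "TWE" (from the input "TWE=") hold two whole bytes.
print_whole_bytes 010011010110000100   # prints: Ma
echo
```

The two leftover bits are always zero in well-formed base64, so dropping them loses nothing.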