A base64 decoder in pure shell script
I can't go into the details of why, but at work we found ourselves in a position where we needed to update one of the JAR files in our agent software using nothing more than a shell script. The agents run on customer endpoints, so we have zero control over the tools available. We know the systems don't have either base64 or curl handy, so we were in kind of a pickle. How could we get our updated JAR out to the system and installed where we needed it?
Well, how hard can base64 decoding be, anyway? Google to the rescue!
This page provides a really easy to follow tutorial for how to decode base64 by hand, which I turned into a shell script.
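The by-hand procedure boils down to three steps: look up each character's 6-bit value in the base64 alphabet, string four of those values together into 24 bits, and cut the result into three bytes. Here is a minimal sketch of that idea for a single unpadded quartet. It uses integer arithmetic rather than the bit strings my script uses, it assumes bash, and the names alphabet and decode_quartet are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decode one 4-character base64 quartet (no '=' padding).
alphabet='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'

decode_quartet() {
    quartet=$1
    value=0
    for ((i = 0; i < 4; ++i)); do
        char=${quartet:$i:1}
        # The position of the character in the alphabet is its 6-bit value:
        # strip everything from the character onward and measure what is left.
        prefix=${alphabet%%"$char"*}
        value=$(( (value << 6) | ${#prefix} ))
    done
    # Emit the 24-bit value as three bytes, most significant byte first.
    for offset in 16 8 0; do
        printf "\\$(printf '%03o' $(( (value >> offset) & 255 )))"
    done
}

decode_quartet TWFu; echo   # prints "Man"
```

T, W, F and u sit at positions 19, 22, 5 and 46 in the alphabet, which pack into the bytes 77, 97 and 110, i.e. "Man".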
The good news is that I was able to implement it in pure shell script, with no external dependencies. The bad news is that bash runs it extremely slowly, on the order of 20-30 minutes to decode ~153k of base64 data. The Korn shell, by contrast, is roughly 10x as fast, taking only 2m20s to decode the same dataset. That's still really slow compared to the base64 utility, but it's better than a kick in the teeth, and fast enough for what we need.
The first version of my script had things factored out nicely into functions, but I had to eliminate those to optimize for speed. So, apologies for ugliness.
There are a few known issues with this script:
- It does not handle the case where a base64 quartet is split across lines. If your base64 content is split across lines, ensure the length of each line is divisible by 4.
- The file must contain at least one newline. If the base64 string is all one line with no newline at the end, the script will just exit and create a 0-byte output file.
- The decoded file may have an extra byte at the end. I am not quite sure where it's coming from. For our use case, the extra byte isn't a problem (our payload is a tgz file that decompresses without issue).
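The missing-newline issue comes from read itself: it returns a non-zero status when it hits end-of-file before a newline, even though it has already stored the partial line in the variable. A common idiom is to also test whether the variable is non-empty. The following is a minimal sketch of that idiom on its own, assuming bash; print_lines is a hypothetical helper, not part of the decoder.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each line of a file in brackets, including a
# final line that has no trailing newline.
print_lines() {
    # read fails on an unterminated last line but still fills $word,
    # so the extra test lets the loop body run one more time.
    while read -r word || [ -n "$word" ]; do
        printf '[%s]\n' "$word"
    done < "$1"
}
```

Using the same loop condition in the decoder should let it handle a base64 file that is one long line with no newline at the end.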
#!/usr/bin/env ksh
# also works with bash, but bash is an order of magnitude slower.
binary_table=('000000' '000001' '000010' '000011' '000100' '000101' '000110' '000111' \
'001000' '001001' '001010' '001011' '001100' '001101' '001110' '001111' \
'010000' '010001' '010010' '010011' '010100' '010101' '010110' '010111' \
'011000' '011001' '011010' '011011' '011100' '011101' '011110' '011111' \
'100000' '100001' '100010' '100011' '100100' '100101' '100110' '100111' \
'101000' '101001' '101010' '101011' '101100' '101101' '101110' '101111' \
'110000' '110001' '110010' '110011' '110100' '110101' '110110' '110111' \
'111000' '111001' '111010' '111011' '111100' '111101' '111110' '111111')
# A neat feature of shell arrays is that they are sparse, and we leverage that here.
# This builds a mapping between the ASCII code of the base64 string characters and
# the corresponding offset in the binary_table above.
ascii_map=()
# arithmetic loop instead of seq, so no external command is needed
for ((x = 0; x < 26; ++x)); do
    # A-Z
    ascii_map[65+$x]=$x
    # a-z
    ascii_map[97+$x]=$(($x+26))
    if [ $x -lt 10 ]; then
        # 0-9
        ascii_map[48+$x]=$(($x+52))
    fi
done
# +
ascii_map[43]=62
# /
ascii_map[47]=63
input_file=$1
# The string passed will be a binary representation of at most 3 bytes.
print_binary() {
    bytes=$1
    # pad out to the nearest byte
    while [ $((${#bytes} % 8)) -gt 0 ]; do
        bytes="${bytes}0"
    done
    num_bytes=$((${#bytes} / 8))
    # split string into 8-bit substrings
    for ((byte_counter = 0; byte_counter < $num_bytes; ++byte_counter)); do
        to_print=${bytes:$((byte_counter*8)):8}
        byte=0
        # parse binary string into an integer representation
        for ((bit_counter = 0; bit_counter < 8 ; ++bit_counter)); do
            to_shift=$((7 - $bit_counter))
            bit=$((${to_print:$bit_counter:1} << $to_shift))
            byte=$(($byte + $bit))
        done
        # once we have the byte decoded, print it to stdout
        printf "\x$(printf '%02x' $byte)"
    done
}
if [ -z "$input_file" ]; then
    # ${0##*/} strips the directory part without calling basename
    echo "usage: ${0##*/} <base64_file>"
    exit 1
fi
if [ ! -r "$input_file" ]; then
    echo "error: $input_file is not readable."
    exit 2
fi
count=0
binary_string=''
# BUG: The input file must contain at least one newline or else this no-ops
# BUG: This will produce faulty output if the lines on the inputfile are not a multiple of 4 in length
# read each line
while read -r word; do
    # read already strips the trailing newline, so walk every character of the line
    for ((i = 0; i < ${#word}; ++i)); do
        # get the base64-encoded letter
        letter=${word:$i:1}
        # turn it into an ASCII code
        ascii_code=$(printf '%d' "'${letter}")
        # look up in our table to see the binary snippet that corresponds with it
        binary_table_idx=${ascii_map[$ascii_code]}
        # if the file has an invalid base64 character we just skip it
        if [ -n "$binary_table_idx" ]; then
            binary_string="${binary_string}${binary_table[$binary_table_idx]}"
            count=$(($count + 1))
            if [ $((count % 4)) -eq 0 ]; then
                print_binary "$binary_string"
                binary_string=''
            fi
        fi
    done
done < "$input_file"
# Since '=' is ignored in the above loop, we need to handle the case where there's
# trailing padding.
if [ -n "$binary_string" ]; then
    print_binary "$binary_string"
fi
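One plausible source of the extra trailing byte is this final print_binary call. When the input ends in '=' padding, the leftover bit string is 12 bits long (after "xx==") or 18 bits long (after "xxx="), but only the first 8 or 16 of those bits are real data. Padding up to the next byte boundary turns the spare bits into an additional byte, whereas dropping them would not. The sketch below shows the truncation on its own, assuming bash; trim_to_whole_bytes is a hypothetical helper, not something the script above defines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only the complete bytes of a string of 0s and 1s.
trim_to_whole_bytes() {
    bits=$1
    # round the length down to a multiple of 8
    whole=$(( ${#bits} / 8 * 8 ))
    printf '%s\n' "${bits:0:$whole}"
}

# "QQ==" leaves the 12 bits 010000 010000; only the first 8 encode data ('A').
trim_to_whole_bytes 010000010000   # prints 01000001
```

Passing the leftover bits through something like this before the last print_binary call would, under that assumption, avoid emitting the spare byte.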