Tuesday, 16 April 2013

Passphrase Generators Part II

A while back I wrote a post about using bash as a passphrase generator (here). The idea was that it is better to pick "randomly" from a list of common words than to try to mentally choose four words at random: humans are not very good at being random [1]. Derren Brown, if you're reading this, I'd love to see an episode where you plant a password in someone's head for something they believe is of the utmost importance. This post addresses some security concerns about how the words are chosen at random.

Speaking of passphrases and passwords, I should probably clarify what I mean by the two terms. By a password I mean a private string which more or less resembles a "word" in the traditional sense. Typical, poor examples are "password", "Password1" and "aksdfjslfjdf"; something like "8yDcBPAllEFOLNNI89ZN4nq5" is a good example. By a passphrase I mean a private string composed of multiple "words". For example, "That's me in the corner, that's me in the spotlight" and "slgl sdf mcr asrj" are poor examples, while "Cupertino teenage fell Paradox" is a good one. The analogy is to words and phrases in spoken language. Generally, a good passphrase will be better than a good password because of its length, assuming you haven't done something silly like use the lyrics of your favorite song. The passphrase still needs to be chosen randomly, but it is easier to remember.
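To put rough numbers on the length argument, here is a sketch assuming a 62-character alphanumeric alphabet and a 10000-word list (the figures are illustrative, not a claim about any particular scheme):

```shell
# entropy of a uniformly random string is length * log2(alphabet size);
# for a passphrase it is word-count * log2(dictionary size)
awk 'BEGIN {
    printf "8-char alphanumeric password: %.0f bits\n", 8 * log(62) / log(2)
    printf "4-word passphrase (10000-word list): %.0f bits\n", 4 * log(10000) / log(2)
}'
```

Four common words edge out a typical 8-character password (roughly 53 bits versus 48), while remaining far easier to memorise; a 24-character random password like the example above is stronger still, but good luck remembering it.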

Anyway, having spent some time chatting on IRC about the flaws of using % 10000, using Bash's $RANDOM variable and a few other things, I thought I would update the script to be more secure. The entropy lost from using % 10000 isn't particularly significant, but the way $RANDOM works means you could probably reduce the choice of possible words significantly if you knew approximately when the passphrase was being generated. Basically, $RANDOM is seeded from gettimeofday()'s seconds and microseconds plus the bash PID. I haven't done the maths on exactly what the impact is, but I thought it was worth this blog post.
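To see why the % 10000 loss is small: $RANDOM produces 15-bit values (0 to 32767), so under % 10000 the indices 0 to 2767 each correspond to four source values while 2768 to 9999 correspond to only three. A quick sketch of that mapping:

```shell
# count how many 15-bit $RANDOM values map to a given index under % 10000
count_hits() {
    local idx=$1 n=0 r
    for (( r = idx; r <= 32767; r += 10000 )); do
        n=$(( n + 1 ))
    done
    echo "$n"
}

count_hits 0      # 0, 10000, 20000, 30000 -> prints 4
count_hits 5000   # 5000, 15000, 25000    -> prints 3
```

That is a 4:3 skew between the likeliest and least likely words, which costs very little entropy; the predictable seed is the real problem.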

I am actually going to provide two new scripts. The previous method used a dictionary file of 10000 common words downloaded from the Internet and embedded them in the Bash script. I still think this is a good approach: 10000^4 gives enough possible combinations, and because there are only 10000 words, many of them will be familiar to most people and should therefore be easier to remember. I will first present an alternative approach.


#!/bin/bash
# Generate candidate passphrases from the system dictionary until one
# is accepted. Usage: ./ppgen.sh [number-of-words]
words=${1:-4}

while true; do
    clear
    for i in $(seq "$words"); do
        # pick one word uniformly, telling shuf to draw its randomness
        # from the kernel's CSPRNG rather than its internal generator
        word=$(/usr/bin/shuf -n1 --random-source=/dev/urandom /usr/share/dict/words)
        echo -n "$word "
    done
    echo ""
    read -p "Accept (y/n) [n]: " answer

    if [ "$answer" == "y" ]; then
        exit 0
    fi

done

You will probably notice two things. Firstly, it only takes one argument: the number of words to show, or 4 by default. The previous version of the script printed X words on each of Y lines so that you could choose a passphrase that suited you. However, there was always the temptation to create a new passphrase from a combination of what was shown on the screen. This is not ideal. You are better off rejecting a number of whole passphrases and then accepting one as-is, rather than combining several passphrases into one. The reason is that the process of combining passphrases is no longer random, and there will probably be a bias towards a structure such as "adjective noun verb adjective". Or something like that anyway.

So, instead, we just keep looping until you accept the passphrase, and clear the screen if you reject it (the default). By clearing the screen I hope to lessen the chance of you combining several passphrases - out of sight, out of mind, as they say.

The second major change here is that instead of using 10000 common words downloaded from the Internet, we're using the OS's provided dictionary file. If your distribution doesn't have that file, see http://en.wikipedia.org/wiki/Words_(Unix).

We run the shuf command X times to select the X words for the passphrase. On one of my machines there are 98326 words in /usr/share/dict/words, and I think I saw many more on my Ubuntu system, so you can expect some pretty obscure passphrases. You'll also note that we tell shuf to use /dev/urandom as its source of randomness, as the built-in source isn't suitable for security purposes.
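As a rough sketch of what the bigger dictionary buys you (using the word count from my machine as the fallback; yours will differ):

```shell
# a k-word passphrase from an n-word list carries k * log2(n) bits
n=$( [ -r /usr/share/dict/words ] && wc -l < /usr/share/dict/words || echo 98326 )
awk -v n="$n" 'BEGIN { printf "4 words from %d: %.0f bits\n", n, 4 * log(n) / log(2) }'
```

With 98326 words that works out to about 66 bits, versus about 53 bits for the 10000-word list; the price is the obscurer words.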

Right, now onto the other option. In this case we're just going to modify the previous script to fix the source of randomness. We'll also introduce the while true loop and hide previously rejected passphrases.

#!/bin/bash
# Build a self-contained passphrase generator from a word list.
# Usage: ./make.sh <dictionary> <output-script>

dict=$1
script=$2

chars="a-zA-Z0-9 "
size=$(wc -l < "$dict")
echo "#!/bin/bash" > "$script"
# embed the word list as a bash array, ten words per line, stripping
# any characters outside $chars
(
    echo -n "dict=( "
    i=0
    while read -r word; do
        echo -n "$word " | tr -c -d "$chars"
        if [ "$i" -eq "10" ]; then
            i=1
            echo ""
        fi
        ((i++))
    done < "$dict"
    echo ")"
) >> "$script"

echo "size=$size" >> "$script"
cat << 'EOF' >> "$script"
words=${1:-4}

while true; do
    clear
    for i in $(seq $words); do
        # take 4 random digits (0000-9999) from /dev/urandom, then strip
        # leading zeros so bash doesn't reject them as malformed octal;
        # fall back to 0 when all four digits were zero
        index=$(tr -d -c '0-9' < /dev/urandom | fold -w 4 | head -1 | sed 's/^0*//')
        word=${dict[${index:-0}]}
        echo -n "$word "
    done
    echo ""
    read -p "Accept (y/n) [n]: " answer
    if [ "$answer" == "y" ]; then
        exit 0
    fi
done
EOF
chmod +x "$script"

Other than the loop change, the only difference is our source of randomness. Instead of using $RANDOM we read /dev/urandom directly. We strip out anything other than digits using the tr command and fold the stream into 4-character groups (0000 to 9999). We then take the first group and strip off any leading zeros, because bash interprets a number with a leading zero as octal and so isn't happy using 0080 as an index. There is probably a better way of stripping off the zeros using bash's string manipulation, but this will do for now.
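For the record, one better way is bash's explicit base prefix, which makes the sed step (and the edge case where all four digits are zero) unnecessary; a quick sketch:

```shell
# bash reads a leading zero as octal, so $((0080)) is an error (8 is
# not an octal digit); the 10# prefix forces base-10 interpretation
digits=0080
echo $(( 10#$digits ))   # prints 80
echo $(( 10#0000 ))      # prints 0 - no empty-string problem
```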

Running this script is the same as for the previous version: you give it the dictionary and an output file name. E.g.

$ wget http://wortschatz.uni-leipzig.de/Papers/top10000en.txt
$ ./make.sh top10000en.txt ppgen.sh

Unfortunately the dictionary isn't provided over HTTPS, so it is possible that someone is going to MITM your download and change all the words to suit themselves.

Perhaps something like this might help:

$ wc -l top10000en.txt
10000 top10000en.txt

$ sort < top10000en.txt | uniq | wc -l
10000


That way at least you know there are 10000 unique words. Either that or find a new dictionary available over HTTPS (and let me know where you found it if possible).
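If the list's maintainer ever publishes a digest, or you can fetch one over a separate channel, comparing checksums is a stronger check than counting lines. A sketch, with the expected digest left as a placeholder since none is actually published:

```shell
# compare the downloaded file's SHA-256 digest against one obtained
# out-of-band; any mismatch means the file changed in transit
expected="<digest obtained out-of-band>"
actual=$(sha256sum top10000en.txt | cut -d' ' -f1)
if [ "$actual" = "$expected" ]; then
    echo "checksum OK"
else
    echo "MISMATCH - do not use this wordlist"
fi
```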

Hopefully you'll find either of these approaches useful.


[1] http://www.ncbi.nlm.nih.gov/pubmed/17888582