A Challengers Handbook



Substitution Ciphers

A substitution cipher is one where letters have been substituted for something else, in a simple predefined way. For example we may decide on a mapping like:

plaintext: abcdefghijklmnopqrstuvwxyz

And so 'the' in our original message becomes 'cmk' in our cipher.

A good understanding of substitution ciphers and how to solve them will get you a long way in challenges. Although I am going to talk about single alphabet substitutions you will often find that challenge levels are based on substitution ciphers and by a suitable mapping you can turn a problem into a substitution cipher of the form above. The way to do this is simply to identify where the letters are. For example if you have a set of graphics or pictures that represent a cipher, and in counting them you see that there are less than 26, and you have some repeated sequences, and some are much more common than others. The way to approach this is to substitute letters for each symbol and treat it as a substitution cipher. In other problems there may be a very common symbol - is it simply a separator of some kind ? Does it represent the start of a new letter, or word ?

Bear in mind that the ciphertext need not be another alphabet - it could be a set of pictures, or 3 digit numbers, or 2 letters, or anything else that you can think of which would easily map back. Sometimes the ciphertext may not even have 26 distinct combinations, fewer can suffice. For example, consider the matrix:


Here we map a letter to the character at the top of the row and left column, so 'a' becomes 'rq' and 'b' becomes 'eq'. Since there is no j inside the matrix we map it the same as 'i' and use 'du'. Which is the answer is apparent when we decode a message. The way to solve such a cipher is to realise that the ciphertext has only 10 different characters and that these follow a pattern of every other character being one of 'ready' and the others are one of 'quits'. Now you can construct a mapping from the two character substitution alphabet into a one character substitution alphabet, and treat it as for any other substitution cipher.

So suppose there is a cipher of some kind and that it consists of the letters a-z, and that there may or may not be suitable word divisions (ie it may have spaces in it or not). I would use several methods to attack this:

  • My first approach is to run my substitution cipher solver on it. If the problem has enough text (say 120 or more characters) and the letter frequencies are fairly normal then my solver should solve it in a couple of seconds. I'll give you the source code to the solver when I discuss it further below.
  • If there are word divisions then I would now look to use a combination of pattern matching and solving it by hand. For example words like 'successful' have very distinct mappings. our ciphertext 'dxllkdduxt' still has the same pattern and a pattern search against a good dictionary will give us very few possibilities. These possibilities can then be tested against other words. You can solve very short ciphers in this way.
  • I would now try to solve the problem by hand, perhaps with some help from pattern matching, but mostly through intuition. Whenever I do this I always have some cribs in mind - common words which might appear like 'problem', 'solution', 'password', 'congratulations', 'well done', 'challenge', 'answer', 'codeword', 'cipher', 'encryption', 'substitution', as well as common sequences in these type of problems, like 'the answer is'.
  • If the problem is proving difficult to attack then a better approach is needed and I would now start a more systematic approach considering frequency analysis, repeats, digraphs and trigraphs, vowel identification, etc.

    So lets talk about my automatic solver. How does this work ? Well, the first thing I did was to analyse some English text with this program. All it does is read in a big chunk of text (I fed it a fairly large novel). It outputs some text files containing various frequency counts, from this you can see the expected occurences of letters in typical English text. It also outputs two letter, three letter and four letter combination statistics. It is the four letter combinations - tetragrams - that my solver uses. The other files are merely for analysis. It is important that you dont analyse rubbish - html for example - because your common letter combinations will become things like 'html' and 'brbr' which will not work in the solver and will get you nowhere. So you need to analyse a long, and representative piece of text. Of course it need not be English, you may be interested in Dutch for example :) So we have a file now with a set of four letter frequencies (make your own). This is where the Crypt Analyser comes in. It reads your crypt and starts with a random solution. It tries to improve the solution by scoring it against the frequencies expected in the decrypted message, using the tetragram file we generated. When the current guess cannot be improved further it starts from another random position. Whenever the best guess is beaten it throws it up on the screen. A long crypt is normally solved pretty quick - a second or two. If you have to wait longer than a few seconds then it hasnt worked, go try something else instead. Sometimes it will be 90% correct and you only have to change a few low frequency letters around.

    If the program fails and you have word divisions then the next thing to look at is pattern matching. Pattern matching essentially means that if you have a word in ciphertext 'dxllkdduxt' then you are looking for possible fits to this word, ie a word with six different letters, of length 10, with one letter that appears in positions 'd....dd...', one in positions '..ll......', one in positions '.x......x.'. Perl is ideal for this, although I tend to use short C programs sometimes. The things to look out for are unusual patterns of letters in words and long words with several repeated letters, or long words with a large number of different letters. Inevitably pattern matching against a wordlist will produce only a small number of possible words. Often which word is the answer is apparent from the context of the problem but if you can get one large word then you have broken the cipher.

    At this point I would get out a pen and paper, or more probably notepad, and start doing some manual substitutions. So lets look at an example:

    juww tdru mdh zpiu fdwiut gzkf fhqfgkghgkdr skazuc prt 
    gzu jdct gzpg mdh ruut vdc gzu prfjuc kf pwazpqug

    The first thing that I notice about this is the presence of 'mdh' on two occasions, and 'gzu' on two occasions. The commonest word in the written English language is 'the' and so we have a possible entrance with one of these words. The commonest letter is 'e' and so potentially since there are a few 'u's in the cipher we could have 'gzu'='the'. This would also fit with the 'uu' in 'ruut' and the 'u' ending words like 'zpiu'. We also note that the second most frequenct letter in English is 't' (frequencies are 'etaonirsh...' and that 'g' is a frequent ciphertext letter. Also note that 'gzpg' and 'gzkf' would correspond to 'th..'. What all of this amounts to is good evidence that 'gzu'='the'. So I make the substitution:

    juww tdru mdh zpiu fdwiut gzkf fhqfgkghgkdr skazuc prt 
     e      e     h  e     e  th       t t t       he
    gzu jdct gzpg mdh ruut vdc gzu prfjuc kf pwazpqug
    the      th t      ee      the     e        h  et

    You should now begin to see other words forming. We have 'gzpg'='th.t' and so we must have 'p'='a' the only possibility. This gives 'pawzpqug'='a..ha.et' and should you need to perform a dictionary search you should soon find 'alphabet'. We also have 'zpiu'='ha.e' and is probably 'have', 'prt'='a..' and is probably 'and' - confirmed by 'ruut'='need', etc. Making these substitutions gives:

    juww tdru mdh zpiu fdwiut gzkf fhqfgkghgkdr skazuc prt 
     ell d ne     have   lve  th     b t t t  n   phe  and
    gzu jdct gzpg mdh ruut vdc gzu prfjuc kf pwazpqug
    the      that     need     the an  e     alphabet

    This is practically done, it remains to decide that it begins 'well done' and that 'k'='i' and 'f'='s'. At the end of this if you lay out the alphabet and the cipher alphabet you will see:

    pqstuv zk  w rda cfghij m 

    You will notice that a lot of the letters in the cipher alphabet are in order, this is because a keyword has been used. We can add 'q'='b', 'x'='l', and then 'z'='n' or 'o' and 'g'='x' or 'y'. A careful check of the possibilities gives the reconstructed alphabet:


    And you can see the keyword in the ciphertext alphabet, it is 'keyword'. Keywords could also appear the other way around, ie write out the cipher alphabet and the plaintext correspondence beneath it. So when you are told that the answer is the keyword this is where to look :)

    OK, we have done the pen and paper bit, and made some shrewd guesses but what if we do this and we still can't solve it ? This is the time to get heavy :) First of all you should be thinking about frequency analysis of the cipher, and you should know the frequencies of your language (remember the other files... we have the frequencies in one of those...). I am not going to go into a great depth here but if you want to follow this line of thinking, as well as more complex approaches - consonant and vowel identification and the consonant-line method are detailed in Cryptanalysis by Helen Fouche Gaines. In essence you should note that:

  • The vowels A E I O are normally in the section of letters with high frequency counts.
  • Vowels overall account for around 40% of text.
  • Letters contacting (next to) low frequency letters are normally vowels.
  • Letters showing a wide variety in their contact letters are normally vowels.
  • In repeated digrams one letter is normally a vowel.
  • In reversed digrams one letter is normally a vowel.
  • Doubled consonants are normally flanked by vowels and vice versa.
  • It is unusual to find more than five consonants in succession.
  • Vowels do not often contact one another.

    Here is problem 50 from Gaines:

    pouyh ibquav puko m eguhac mk koh poukh
    obljh, kohbcbgh gbbjhnhyk ghshunhc ma uawlgr kb
    bah hrh pouso smljhc iyuacahjj.

    Frequency count:

    a 7  b 8  c 5  d    e 1  f
    g 5  h 17 i 2  j 5  k 7  l 3
    m 4  n 2  o 8  p 4  q 1  r 2
    s 3  t    u 9  v 1  w 1  x
    y 3  z
    98 letters

    So 'h' is by far most frequent and immediately becomes a candidate for the letter 'e'. After that we have 'abkou' with similar frequencies and 'cgjmp' just below. The evidence for 'h'='e' continues to increase with it ending several words, being next to last letter in one word and being in 'hrh'='eve' ? Lets look at its digrams:

    hy 1 yh 1
    ha 1 ah 2
    hn 1 nh 2
    hr 1 rh 1
    hj 1 jh 3
    hs 1 sh 1
    uh 1
    oh 2
    kh 1
    hb 1
    gh 2
    hc 2

    So it is frequently in a reversed digram and contacts one or two low frequency letters. It has contacts with many letters, and so is almost certainly a vowel. Due to its common word endings it is almost certainly 'e'. Lets make the substitution:

    pouyh ibquav puko m eguhac mk koh poukh
        e                  e        e     e
    obljh, kohbcbgh gbbjhnhyk ghshunhc ma uawlgr kb
        e    e    e     e e    e e  e      
    bah hrh pouso smljhc iyuacahjj.
      e e e           e        e

    Now we note that the first word and eighth words are very similar, 'k' and 'y' must be consonants and at least one of 'pou' is a vowel. This is unlikely to be 'p' however, and 'o' ends at least two words, so lets look instead at 'u'. 'u' also occurs in the fifth word next to our 'e', and so both 'a' and 'o' possibilites are out, leaving us with either 'i' or 'u'. However we also have a 'hu' in the ciphertext later on, and a word beginning with 'u'. This, and the fact that 'u' is the second commonest letter in the cipher tends to make me think that 'u'='i'. I'm sure that by now you have thought about the single letter 'm' in the cipher, and that it must be 'i' or 'a' if it is a word. If we have used 'i' then only 'a' is left, and it starts two 2 letter words which is good news (am,at,as,etc). So we now have, making these substitutions:

    pouyh ibquav puko m eguhac mk koh poukh
      i e    i    i   a   ie   a    e   i e
    obljh, kohbcbgh gbbjhnhyk ghshunhc ma uawlgr kb
        e    e    e     e e    e ei e  a  i 
    bah hrh pouso smljhc iyuacahjj.
      e e e   i    a  e    i   e

    Now turn your attention to 'kb', one of these must be a vowel (and i include 'y' when i say this). We have 'poukh'='..i.e' on the top row which rules out the 'k'. Since the 'b' is at the end of this two letter word, and also since it is doubled earlier in the crypt it must really be an 'o'. This does give us an unusual 'eo' combination in one word though. Before the 'kb' we have a 6 letter word beginning with i and we must have some other vowels in here. It seems that one of 'wlgr' is a vowel. This could not be 'g' as a word starts 'gbb...'='.oo..' and cannot be 'r' due to 'hrh'='e.e'. 'w' only occurs in the cipher once so is a possible however 'l' has some nice other contacts for being a 'u'. It occurs in 'obljh'='.o..e' and in 'smljhc'='.a..e.'. Putting these possibilities into our crypt gives:

    pouyh ibquav puko m eguhac mk koh poukh
      i e  o i    i   a   ie   a    e   i e
    obljh, kohbcbgh gbbjhnhyk ghshunhc ma uawlgr kb
     ou e    eo o e  oo e e    e ei e  a  i  u    o
    bah hrh pouso smljhc iyuacahjj.
    o e e e   i    au e    i   e

    We now have an interesting array of words, and many possible letters jump out. The 'jj' at the end of the crypt and the 'obljh'='.ou.e' combine to lead you into 'j'='s' or 'l' although 'l' is easily ruled out. From the two letter words 'mk'='a.' and 'kb'='.o' we have 'k'='t', 's' or 'n' but we have used 's', so we are left with 'n' or 't'. It seems more likely that 'mk koh poukh'='at the .hite' than 'an n.e ..ine', and so we make this substitution too:

    pouyh ibquav puko m eguhac mk koh poukh
     hi e  o i    ith a   ie   at the  hite
    obljh, kohbcbgh gbbjhnhyk ghshunhc ma uawlgr kb
    house  theo o e  oose e t  e ei e  a  i  u   to
    bah hrh pouso smljhc iyuacahjj.
    o e e e  hi h  ause    i   ess
    You should now be able to see 'p'='w' giving the words 'with' and 'white' in the plaintext, and you may be able to spot 'theodore roosevelt' on the second row. In fact we already have 'white house' in the text as well. We note that 'ma'='a.' and 'bah'='o.e' lead you to 'a'='n', 'pouso'='whi.h' gives 's'='c'. So we have:
    pouyh ibquav puko m eguhac mk koh poukh
    while  o in  with a  riend at the white
    obljh, kohbcbgh gbbjhnhyk ghshunhc ma uawlgr kb
    house  theodore roosevelt received an in ur  to
    bah hrh pouso smljhc iyuacahjj.
    one e e which caused  lindness

    Finally filling in the last few letters, and noting which ones are still left to be used leads to:

    pouyh ibquav puko m eguhac mk koh poukh
    while boxing with a friend at the white
    obljh, kohbcbgh gbbjhnhyk ghshunhc ma uawlgr kb
    house  theodore roosevelt received an injury to
    bah hrh pouso smljhc iyuacahjj.
    one eye which caused blindness

    Notice that 'hrh'='eye', something that we did not consider amongst the 'ere', 'eve' possibilities that I quickly came up with ;) Anyway, i've skipped over a lot in that last one, where you could consider so many other things in the crypt, and you really need to study the solution of these problems more to appreciate the fine points. After reading Gaines you really do think thats it not as much of a guessing game as it first appears.

    Gaines goes on to show how you can attack some difficult problems where authors have tried to twist normal letter frequencies. In harder challenge problems you can expect this kind of thing - perhaps the letter 'e' is not in the problem at all, often a clue will be given by the challenge name, just be aware :)

    Back to Encryption