In the aftermath of the PlaidCTF event, I realized that there were some pretty easy points that could have been achieved if I had a more solid understanding of crypto. So lately I've been focusing some of my free time getting more acquainted with it - learning it's history (which is quite fascinating actually) and taking on a few challenges to break some of the easier encryption methods. I've since started the Matasano Crypto Challenges and tackled a few others that I've come across on the web. On Twitter, I came across the The Pi-Cubed Programming Challenge website which has a handful of "mini-challenges," the last few of which have been related to crypto. I dedicated a few afternoons looking at Mini-Challenge #6 and Mini-Challenge #7 and managed to solve each without much trouble.
Mini-Challenge #6 was a code-breaking exercise where the user was tasked with decrypting three messages encoded different ways. Cracking each code gave you a hink on how to tackle the next one. The first of the three messages was a simple Caesar Cipher, so I won't go into much detail about breaking that one. The second message encoded using a mono-alphabetic substitution cipher but maintained the original word spacing found in the plaintext. I originally cracked this message manually, using some logical assumptions and some guess and check but after chatting with @misanto on Twitter, I decided to take a crack at creating a program that will decrypt the message for me.
I haven't had much experience doing something like this before but I decided to hack something together quick and see what I could come up with. There are definitely some improvements that can still be made to make this more efficient and it really only works in this encryption scenario (mono-alphabetic substitution with original spacing intact) but it was a fun exercise nonetheless. I plan on speeding it up a bit using di-gram and tri-gram frequency analysis to make some better substitution guesses but as it is, my solution uses single letter frequency to try and crack the message.
My methodology went something like this:
1) Analyze the encrypted message and get the count of each character used in the ciphertext.
2) Start the decryption by mapping the most common character found in the ciphertext with the most common English letter frequency.
3) Once complete, compare the words in the (possibly) decrypted message against a dictionary of 10,000 common English words and determine what percentage of them match.
4) After the baseline decryption was complete I created a subset of possible character mappings and ran through steps 2 and 3 above for each, keeping track of whichever yielded a higher word match. I tweaked this a bit to keep the total decryptions as low as I could, while still generating decent results.
5) Once the subset of mappings was done I then took the best decryption so far and ran each still-unmatched word through a larger dictionary of words using the Python method get_close_matches() to see if there were any dictionary words that closely matched the unmatched one. If it found any possible matches that were the same word length but only differed by 2 characters, it would swap that mapping and see if the result was better. (ex. acknowlesges -> acknowledged, where they differ only by the letters 'd' and 's').
In the end, it actually worked out pretty well. The second message in the challenge ended up with 94.12% of the words matching my larger dictionary and a message that was easily readable. The downside is that it takes a while to run (about 10-15 minutes on my Macbook Pro) and runs through the decryption and matching process ~100,000 times. As I said before, I plan on making a number of changes to improve this but I found it to be acceptable for my needs in this challenge.
The current code and dictionary files can be found here.