Practical Steganalysis - 1 -
vecna@s0ftpj.org - http://www.s0ftpj.org http://www.delirandom.net
07/04/2007

START

http://www.spammimic.com is a free service that provides steganography over spam. Spammimic generates a spam mail from a message; the user can then copy and paste the stegomessage from the browser and send it to the recipient.

This is an example, from http://www.spammimic.com/encode.shtml:

The user's message:

    hello test

Becomes:

    Dear Decision maker , We know you are interested in receiving amazing intelligence . This is a one time mailing there is no need to request removal if you won't want any more . This mail is being sent in compliance with Senate bill 1625 ; Title 4 ; Section 302 . THIS IS NOT MULTI-LEVEL MARKETING ! Why work for somebody else when you can become rich as few as 33 days . Have you ever noticed people love convenience plus most everyone has a cellphone ! Well, now is your chance to capitalize on this ! We will help you decrease perceived waiting time by 130% and turn your business into an E-BUSINESS . You are guaranteed to succeed because we take all the risk . But don't believe us ! Prof Ames of Massachusetts tried us and says "My only problem now is where to park all my cars" ! We are licensed to operate in all states ! We beseech you - act now . Sign up a friend and your friend will be rich too ! Thank-you for your serious consideration of our offer !

OK? Different words in the input message produce different spam as output. The spammimic site provides a section for decoding spam too. This is an example, from http://www.spammimic.com/decode.shtml :

    Your spam message "Dear Decision maker , We know you are in..." decodes to "hello test".

What is steganalysis ?

It is the countermeasure to steganography: steganalysis tries to infer whether "innocent looking" data has been used as a container for other hidden data. In this application the spam is the container, and steganalysis must pick the steganographic spam out of tons of real spam messages.

I looked at spammimic's output and tried some chosen-message attacks. A chosen-message attack is effective when the attacker can feed arbitrary messages to the encoder and analyze the resulting stegomessages.

Spammimic is a closed source tool. As explained in the FAQ (http://www.spammimic.com/feedback.shtml), the authors have some good reasons for keeping it closed, but security software is better off open source, as stated by Kerckhoffs's principle (http://en.wikipedia.org/wiki/Kerckhoffs%27_principle).

The analysis is performed after encoding a large set of non-random messages. The derived stegomessages are then analyzed to highlight the security properties of the system: the analysis should show redundancies, collisions, implementation problems, etc.

The vulnerability in spammimic is the redundancy of some patterns. Those patterns can be searched for inside a spam archive: an email with zero or few pattern matches is real spam, while an email showing spammimic's characteristic patterns is a stegomessage. This is the weakness of spammimic: the use of a small and predictable dictionary. Spam could be one of the best steganographic containers, because it always includes some pseudo-random content.

How has steganalysis been done ?

I took 400 random real words (from a wordlist) and encoded each word as a single-word message with:

    # take 400 entries from the end of the word list
    tail -n 400 new_york_times_most_used_words.txt > 400_words
    mkdir -p dumps
    for i in $(seq 1 400); do
        # pick the i-th word counting from the end of the list
        word=$(tail -n "$i" 400_words | head -n 1) && \
        curl -d "plaintext=$word" http://www.spammimic.com/encode.cgi > "dumps/$i.output" \
        && sleep 2
    done

I wrote a script that converts the encode.cgi dumps into plain spam sections:

    #!/bin/sh -x
    # first argument: number of files in dumps/
    for i in $(seq 1 "$1"); do
        # line count of the dump ("wc -l < file" prints only the number)
        lines=$(wc -l < "dumps/$i.output")
        # skip the first 41 lines of the page, then drop lines containing HTML tags
        tail -n "$((lines - 41))" "dumps/$i.output" | grep -v "<" >> "$i.spam"
    done

Now I had 400 spam blocks ready to be analyzed.

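The redundancy described above can be made visible by splitting each spam block into segments and keeping the segments that recur across blocks. The following is only a minimal Python sketch of that idea, not the tooling used in this analysis (which relied on sort | uniq). It assumes that segments are separated by a standalone ".", "!" or "," token and that comparison is case-insensitive; the function names `split_segments` and `common_segments` are hypothetical.

```python
from collections import Counter

# Assumption: a segment ends at a standalone ".", "!" or "," token.
DELIMITERS = {".", "!", ","}


def split_segments(text):
    """Split a spam block into lower-cased segments at standalone delimiters."""
    segments, current = [], []
    for token in text.split():
        if token in DELIMITERS:
            if current:
                segments.append(" ".join(current).lower())
            current = []
        else:
            current.append(token)
    if current:  # text that follows the last delimiter
        segments.append(" ".join(current).lower())
    return segments


def common_segments(blocks, min_count=2):
    """Return, sorted, the segments that appear in at least min_count blocks."""
    counts = Counter()
    for block in blocks:
        # set() so that a segment is counted once per block
        counts.update(set(split_segments(block)))
    return sorted(seg for seg, n in counts.items() if n >= min_count)


if __name__ == "__main__":
    blocks = [
        "Dear Decision maker , We beseech you - act now . Thank-you !",
        "Dear Colleague , We beseech you - act now .",
    ]
    for segment in common_segments(blocks):
        print(segment)
```

With real data, `blocks` would hold the contents of the 400 .spam files; a small and stable list of recurring segments is exactly the symptom of a small dictionary.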
I parsed the 400 stegospam messages with sort | uniq, generating a list of common segments:

    debian:~/steganalysis# more spammimic_dict
    !
    .
    ,
    because the internet operates on "internet time"
    but don't believe us
    ...but it won't be sent until you click on send)
    dear business person
    dear colleague
    dear cybercitizen
    dear decision maker
    dear e-commerce professional
    [...]

Email messages were checked for these patterns, and each message was given a score equal to the number of pattern matches. For spammimic messages (generated from a one-word input) this score was about 20-30. Checking real spam messages for the same patterns returned these values:

    0 matches: 16113 messages
    1 match:     428 messages
    2 matches:   133 messages
    3 matches:    18 messages

Among 16,736 real spam messages from http://untroubled.org/spam (I used the spam archive of March 2007) I found two strange mails: one message (2007/03/1174971001.14272_162.txt) has 36 pattern occurrences and another (2007/03/1173129552.10325_753.txt) has 10.

To minimize false positives, I checked two other values:

1) The length of the message: rather than the absolute number of patterns, the proportion of patterns relative to the message length is a better value for analysis.

2) The number of pattern delimiters: " . ", " ! " and " , " are the three pattern delimiters used by spammimic, and a lot of real spam messages could contain them too.

I coded a simple script in Python, reversed_spammimic.py. The script simply outputs these values:

    [number of patterns found][size of message][number of first Pattern Separator][number of second PS][number of third PS][file name]

This information is enough to tell a steganographic message from real spam.

WHAT IS SPAMMIMIC'S FAULT ?

Steganography can be divided into two broad categories: software that generates a cover, and software that embeds data in an existing cover. The second is the commonly used meaning (for example, a message embedded into an image).

Steganography applications that generate their own cover have some advantages and are almost immune to known-cover attacks, but the generated cover must stay consistent with what an analysis of real Internet content shows. I don't know how spam patterns have changed through the years, but at the moment spammimic's internal dictionary is almost obsolete. It is not easy for the spammimic coder to avoid this degradation of security, because changing the dictionary requires defining an internal "steganographic format" that lets the decoder detect which version of the dictionary was used at encoding time, in order to decode the message correctly.

Some years ago I ran into the same problem while coding "blastersteg" (http://bfi.s0ftpj.org/dev/BFi12-dev-10), a steganographic communication system that hides within the random traffic generated by boxes infected with the Blaster worm. Today Blaster is no longer easy to find, so using blastersteg is unsafe because it creates an anomaly in Internet traffic. An anomaly can be detected by the steganalyst, and in the end this makes the software vulnerable to steganalysis.

The package contains these files:

    reversed_spammimic.py:  script to parse spam mail and compare with spammimic_dict
    spammimic_dict:         dictionary reversed from spammimic output
    *.log:                  output for analyzed spam and analyzed spammimic output
    spammimicanalysis.txt:  this file
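
For readers without the package, the measurements described in this text (pattern matches, message size, and one count per pattern separator) can be approximated with the short Python sketch below. It is not the original reversed_spammimic.py, whose source is not reproduced here. The functions `analyze` and `format_row` are hypothetical, matching is assumed to be a case-insensitive substring count, and the caller is assumed to pass only the textual dictionary entries (not the bare "!", "." and "," lines).

```python
# The three pattern separators named in the text.
SEPARATORS = (" . ", " ! ", " , ")


def analyze(text, patterns):
    """Return (pattern matches, message size, [count per separator])."""
    # Collapse runs of whitespace so line wrapping does not hide a pattern.
    normalized = " ".join(text.lower().split())
    pattern_hits = sum(normalized.count(p.lower()) for p in patterns)
    # Pad with spaces so a separator at the very start or end is counted.
    padded = " " + normalized + " "
    separator_hits = [padded.count(sep) for sep in SEPARATORS]
    return pattern_hits, len(normalized), separator_hits


def format_row(text, patterns, name):
    """Format one row in the bracketed layout described in the text."""
    hits, size, seps = analyze(text, patterns)
    fields = [hits, size, *seps, name]
    return "".join("[{}]".format(field) for field in fields)


if __name__ == "__main__":
    # A tiny illustrative dictionary taken from the listing shown earlier.
    patterns = ["dear colleague", "but don't believe us"]
    stego = "Dear Colleague , We beseech you - act now . But don't believe us !"
    real = "Buy cheap pills now. Click here!"
    print(format_row(stego, patterns, "stego.txt"))
    print(format_row(real, patterns, "real.txt"))
```

Dividing the first field by the second gives the pattern density suggested above as a better indicator than the raw count. Substring counting can count overlapping dictionary entries twice, so the numbers are only indicative.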