Practical Steganalysis - 1 - vecna@s0ftpj.org - http://www.s0ftpj.org
                                                http://www.delirandom.net

07/04/2007 

START

http://www.spammimic.com is a free service providing steganography over spam.
Spammimic generates a spam mail from a message; the user can then copy and
paste the stegomessage from the browser and send it to the recipient.

This is an example, from http://www.spammimic.com/encode.shtml:

The user's message:

hello test

Becomes:

Dear Decision maker , We know you are interested in 
receiving amazing intelligence . This is a one time 
mailing there is no need to request removal if you 
won't want any more . This mail is being sent in compliance 
with Senate bill 1625 ; Title 4 ; Section 302 . THIS 
IS NOT MULTI-LEVEL MARKETING ! Why work for somebody 
else when you can become rich as few as 33 days . Have 
you ever noticed people love convenience plus most 
everyone has a cellphone ! Well, now is your chance 
to capitalize on this ! We will help you decrease perceived 
waiting time by 130% and turn your business into an 
E-BUSINESS . You are guaranteed to succeed because 
we take all the risk . But don't believe us ! Prof 
Ames of Massachusetts tried us and says "My only problem 
now is where to park all my cars" ! We are licensed 
to operate in all states ! We beseech you - act now 
. Sign up a friend and your friend will be rich too 
! Thank-you for your serious consideration of our offer 
! 

OK?

Different words in the input message produce different spam as output. The
spammimic site provides a section for spam decoding too. This is an example,
from http://www.spammimic.com/decode.shtml :

Your spam message "Dear Decision maker , We know you are in..." decodes to
"hello test".

What is steganalysis?
It is the countermeasure to steganography: steganalysis tries to infer whether
"innocent looking" data has been used as a container for other hidden data.
In this application the spam is the container, and steganalysis must
pick out the steganographic spam among tons of real spam messages.

After looking at spammimic's output, I tried some chosen-message attacks.
A chosen-message attack is effective when the attacker can submit arbitrary
messages and analyze the resulting stegomessages.

Spammimic is a closed source tool, as explained in the FAQ:
http://www.spammimic.com/feedback.shtml

They have some good reasons for being closed source, but security software
is better off open source, as stated by Kerckhoffs's principle
(http://en.wikipedia.org/wiki/Kerckhoffs%27_principle).

The analysis is performed on the stegomessages obtained by encoding a large
set of non-random messages.
The resulting stegomessages are analyzed to highlight the security
properties of the system.
The analysis should reveal redundancies, collisions, implementation problems,
etc.

The vulnerability in spammimic is the redundancy of some patterns.
Those patterns can be searched for inside a spam archive.
An email with zero or few pattern matches is a real spam email.
An email showing spammimic's characteristic patterns is a stegomessage.

This is the vulnerability of spammimic: the use of a small and predictable
dictionary.
Spam could be one of the best steganographic containers, because it always
includes some pseudo-random content.

How was the steganalysis done?

I took 400 random real words (from a wordlist).
I encoded each word as a single-word message with:

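tail -n 400 new_york_times_most_used_words.txt > 400_words
mkdir -p dumps
for i in `seq 1 400`; do
	# take the i-th word counting from the end of the list
	word=`tail -n $i 400_words | head -n 1`
	# one request per word, then a two second pause before the next one
	curl -d "plaintext=$word" http://www.spammimic.com/encode.cgi > dumps/$i.output \
	&& sleep 2
done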

I wrote a script to reduce the encode.cgi dumps to plain spam text:

#!/bin/sh -x
# first argument: number of files in dumps/
for i in `seq 1 $1`; do
	# skip the first 41 lines of the page, then drop every line that
	# contains a "<" (HTML markup)
	tail -n +42 dumps/$i.output | grep -v "<" > $i.spam
done

Now I had 400 spam blocks ready to be analyzed.
I ran the 400 stegospam messages through sort | uniq, generating a list of
common segments:

debian:~/steganalysis# more spammimic_dict 
 ! 
 . 
 , 
because the internet operates on "internet time"
but don't believe us
...but it won't be sent until you click on send)
dear business person
dear colleague
dear cybercitizen
dear decision maker
dear e-commerce professional
[...]
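
The exact sort | uniq pipeline is not shown here, so the following is only a
minimal Python sketch of the same idea. It assumes that segments are separated
by the space-padded delimiters " . ", " ! " and " , ", and that case is folded
as in the listing above. The names segments() and common_segments() and the
min_count parameter are made up for this illustration.

```python
import re
from collections import Counter

# Assumption: a delimiter is ".", "!" or "," preceded by whitespace and
# followed by whitespace or the end of the text.
DELIMITER = re.compile(r"\s[.!,](?=\s|$)")

def segments(text):
    """Split one message into lowercase segments, delimiters removed."""
    flat = " ".join(text.split()).lower()  # undo line wrapping
    return [s.strip() for s in DELIMITER.split(flat) if s.strip()]

def common_segments(messages, min_count=2):
    """Return the sorted segments that appear in at least min_count messages."""
    counts = Counter()
    for message in messages:
        counts.update(set(segments(message)))  # count each message once
    return sorted(s for s, n in counts.items() if n >= min_count)
```

Segments that show up in many generated messages are candidates for the
dictionary; segments seen only once are more likely to carry variable parts.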

Email messages were checked for these patterns, and each message was given a
score equal to its number of pattern matches.
For spammimic messages (generated from a one-word input) this score was
about 20-30.
Checking for these patterns in real spam messages returned these scores:

0 (16113 messages), 1 (428 msg), 2 (133 msg), 3 (18 msg).

Among 16,736 real spam messages from http://untroubled.org/spam (I used the
spam archive of March 2007) I found two strange mails:

one message (2007/03/1174971001.14272_162.txt) has 36 pattern occurrences and
another (2007/03/1173129552.10325_753.txt) has 10.
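
The scoring step can be sketched as below. This is not the code of
reversed_spammimic.py: it assumes a match is a case-insensitive substring
occurrence of a dictionary entry, and pattern_score() and score_histogram()
are names chosen only for this example.

```python
from collections import Counter

def pattern_score(text, patterns):
    """Count how many times the dictionary patterns occur in one message."""
    flat = " ".join(text.split()).lower()  # undo line wrapping, fold case
    return sum(flat.count(p) for p in patterns)

def score_histogram(messages, patterns):
    """Map each score to the number of messages that obtained it."""
    return Counter(pattern_score(m, patterns) for m in messages)
```

Run over a spam archive, score_histogram() gives a table of the same shape as
the one above (score, number of messages), which makes outliers easy to spot.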

To minimize false positives, I checked two other values:

1) The length of the message: the ratio of pattern matches to message length
is a better value for analysis than the absolute number of patterns.
2) The number of pattern delimiters: " . ", " ! " and " , " are the three
pattern delimiters used in spammimic, and a lot of real spam messages could
contain them.

I coded a simple script in Python, reversed_spammimic.py.
The script simply outputs these values:
[number of patterns found][size of message][number of first Pattern Separator]
[number of second PS][number of third PS][file name].

This information is enough to discriminate steganographic messages from
real spam.
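
A hedged sketch of how such a record could be computed follows. It mirrors
the fields listed above but is not taken from reversed_spammimic.py: the
Features tuple, extract_features() and pattern_density() are hypothetical
names, and matches are assumed to be case-insensitive substring occurrences.

```python
from typing import NamedTuple

# The three pattern separators named in the text, in a fixed order.
SEPARATORS = (" . ", " ! ", " , ")

class Features(NamedTuple):
    patterns: int   # number of dictionary patterns found
    size: int       # size of the message in characters
    sep_dot: int    # occurrences of " . "
    sep_bang: int   # occurrences of " ! "
    sep_comma: int  # occurrences of " , "
    name: str       # file name

def extract_features(name, text, patterns):
    flat = " ".join(text.split()).lower()
    found = sum(flat.count(p) for p in patterns)
    # Pad with spaces so a separator at the very start or end still counts.
    padded = " " + flat + " "
    seps = [padded.count(s) for s in SEPARATORS]
    return Features(found, len(text), *seps, name)

def pattern_density(features):
    """Pattern matches per character, the length-normalized value."""
    return features.patterns / features.size if features.size else 0.0
```

A long real spam mail with a handful of accidental matches gets a low density,
while a spammimic message built almost entirely from dictionary entries gets a
high one.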

WHAT IS SPAMMIMIC'S FAULT?

Steganography software can be divided into two large categories: software that
generates a cover and software that embeds data in an existing cover. The
second is the more commonly used definition (for example, a message embedded
into an image).
Steganography applications that generate their own cover have some advantages
and are almost immune to known-cover attacks, but the generated cover must be
coherent with what an analysis of real Internet traffic shows.

I don't know how spam patterns are changing through the years, but at the
moment the dictionary inside spammimic is almost obsolete.
It is not easy for the spammimic coder to avoid this degradation of security,
because changing the dictionary requires defining an internal
"steganographic format" that lets the decoder detect which version of the
dictionary was used at encoding time, in order to decode the message correctly.
Some years ago I ran into the same problem while coding "blastersteg"
(http://bfi.s0ftpj.org/dev/BFi12-dev-10), a steganographic communication
system that hides in the random traffic generated by boxes infected with the
Blaster worm. Today Blaster is no longer easy to find, so using blastersteg
is unsafe because it creates an anomaly in Internet traffic. An anomaly can be
detected by the steganalyst, and in the end this makes the software
vulnerable to steganalysis.
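
To make the dictionary versioning problem concrete, here is a toy sketch of a
"steganographic format" in which the first phrase tells the decoder which
dictionary to use. Nothing here reflects how spammimic works internally: the
phrases, the two dictionary versions and the 2-bits-per-phrase encoding are
all invented for illustration.

```python
# Hypothetical marker phrases; they must stay the same across all versions.
VERSION_MARKERS = ["Dear Friend", "Dear Colleague"]

# Hypothetical dictionaries: each phrase encodes 2 bits (its index).
DICTIONARIES = {
    0: ["Act now", "This is not spam", "Results may vary", "Sign up today"],
    1: ["Limited offer", "No fees apply", "Reply to opt out", "Order in minutes"],
}

def encode(data, version):
    """Turn bytes into a list of phrases, led by the version marker."""
    table = DICTIONARIES[version]
    phrases = [VERSION_MARKERS[version]]
    for byte in data:
        for shift in (6, 4, 2, 0):  # four 2-bit symbols per byte
            phrases.append(table[(byte >> shift) & 0b11])
    return phrases

def decode(phrases):
    """Read the marker, pick the matching dictionary, rebuild the bytes."""
    table = DICTIONARIES[VERSION_MARKERS.index(phrases[0])]
    symbols = [table.index(p) for p in phrases[1:]]
    out = bytearray()
    for i in range(0, len(symbols), 4):
        a, b, c, d = symbols[i:i + 4]
        out.append(a << 6 | b << 4 | c << 2 | d)
    return bytes(out)
```

The weak point is easy to see in the sketch: the marker table itself can never
change, so it becomes one more fixed pattern for the steganalyst to look for.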

The package contains these files:

reversed_spammimic.py:
        script to parse spam mail and compare with spammimic_dict
spammimic_dict:
        dictionary reversed from spammimic output
*.log:
        output for analyzed spam and analyzed spammimic output
spammimicanalysis.txt:
        this file