Spam in images

Pavel Kankovsky peak at argo.troja.mff.cuni.cz
Sun Aug 13 20:30:21 CEST 2006


On Wed, 9 Aug 2006, .rp wrote:

> hmm, I wonder if it would be worth hooking in an OCR program to read the 
> image and what the min hardware requirements would be to scan them 
> without bringing a system to a crawl.

I think OCR is overkill. After all, the classifier does not really need
to be able to read the text--it needs to be able to find out whether the
picture is similar to known examples of spam (and ham) pictures.

As far as I can tell, nowadays it is possible to recognize a majority of
image files attached to spam just by looking at their first few
BASE64-encoded lines. It might be worth trying to modify Bogofilter's
parser to extract some kind of quasi-tokens from image files: e.g. break
the file into N-byte pieces for some small N and translate every piece
(or only the first M pieces) into a token, perhaps BASE64-encoded in
order to keep tokens (mostly) printable.
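
To make the quasi-token idea concrete, here is a minimal sketch in Python.
It is not Bogofilter's parser code, only an illustration of the chunking
step. The function name image_quasi_tokens, the "img:" prefix, and the
defaults N=6 and M=16 are all assumptions picked for the example.

```python
import base64


def image_quasi_tokens(data: bytes, n: int = 6, m: int = 16,
                       prefix: str = "img:") -> list:
    """Turn the first m n-byte pieces of a decoded image into tokens.

    n, m, and prefix are illustrative assumptions, not Bogofilter settings.
    Choosing n as a multiple of 3 avoids '=' padding in the BASE64 output.
    """
    tokens = []
    # Look only at the first n * m bytes (or fewer if the file is short).
    for offset in range(0, min(len(data), n * m), n):
        piece = data[offset:offset + n]
        # BASE64 keeps each token printable; the prefix keeps image
        # tokens from colliding with ordinary word tokens.
        tokens.append(prefix + base64.b64encode(piece).decode("ascii"))
    return tokens


if __name__ == "__main__":
    # A GIF file starts with the 6-byte signature "GIF89a".
    sample = b"GIF89a" + bytes(6)
    print(image_quasi_tokens(sample, n=6, m=2))
```

The resulting strings could then be fed to the classifier alongside the
ordinary word tokens, so that images sharing the same leading bytes
contribute the same tokens.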

Things might get trickier if spammers start randomizing their
"picto-spams" in the future.

--Pavel Kankovsky aka Peak  [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."



