base64 spam / forcing bogofilter -p judgement

Matthias Andree matthias.andree at gmx.de
Thu Nov 7 17:18:34 CET 2002


Parker Morse <morse at sinauer.com> writes:

>> Looks like we might just usurp Debian's mimedecode until we have our own
>> character set canonicalization.
>
> Wouldn't it be enough to have bogofilter understand word boundaries in
> base64?

Well, you need to decode the base64 before you can tell. It looks like
Debian's mimedecode has much of the functionality we need; I'll need to
review the code for security, though, before I pull it into bogofilter.

> I'm fairly new to this, but if I understand the theory correctly, the
> base64 encoding of spam words should be just as strong evidence of spam
> as the decoded word - perhaps more. In fact, some words might turn out
> to be "ham" words in plain text, but "spam" words when
> base64-encoded. Does that make sense?

Indeed, that's something I had not thought about before you asked. OTOH,
the scheme would have to go like this: parse the message, and if a part
was base64-encoded, mark each decoded token in a way that no regular
token could look like. Alternatively, we'd have to store the tokens and
their meta-information separately.
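One way the marking could work, as a minimal sketch: prefix every token that came out of a base64 part with a marker that the plain-text tokenizer can never produce, so the two vocabularies stay disjoint in the word list. The "b64*" prefix and the tokenizer regex below are illustrative assumptions, not bogofilter's actual lexer.

```python
import base64
import re

# Illustrative tokenizer: '*' is deliberately excluded from the character
# class, so a "b64*"-prefixed token can never collide with a plain token.
TOKEN_RE = re.compile(r"[A-Za-z$][A-Za-z0-9$'.-]*")

def tokens_from_base64(body: str):
    """Decode a base64 body, then tag each token as base64-derived."""
    text = base64.b64decode(body).decode("utf-8", errors="replace")
    return ["b64*" + t for t in TOKEN_RE.findall(text)]

encoded = base64.b64encode(b"Cheap meds, buy now").decode("ascii")
print(tokens_from_base64(encoded))
# → ['b64*Cheap', 'b64*meds', 'b64*buy', 'b64*now']
```

With this layout, "cheap" seen in plain text and "cheap" seen inside a base64 part are counted as distinct tokens, which is the behavior the scheme above calls for.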

> That would also allow one to put a large corpus of raw spam which is
> mixed (encoded and plain-text) into bogofilter without needing to
> decode all the base64 first.
>
> Or does it turn out that just understanding word boundaries is as hard
> as decoding the whole thing?

Yes, that's the case: you cannot find the word boundaries without
decoding, because the encoded form of a word depends on where it sits in
the stream. Your idea of storing the original encoding along with the
word is not bad, but I cannot think of an efficient implementation right
now.
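Why decoding is unavoidable can be made concrete: base64 packs three input bytes into four output characters, so a word's encoded form depends on its byte offset modulo 3, and scanning the encoded text for a fixed pattern fails whenever the word is misaligned. A small sketch (the word and prefixes are arbitrary):

```python
import base64

word = b"viagra"
target = base64.b64encode(word)  # the word's encoding at offset 0

# Only prefixes whose length is a multiple of 3 keep the word aligned
# with the 3-byte encoding groups; at other offsets the same word
# produces entirely different base64 characters.
for prefix in (b"", b"x", b"xx", b"xxx"):
    enc = base64.b64encode(prefix + word)
    print(len(prefix) % 3, target in enc, enc)
```

Only the offset-0 cases contain the searched-for pattern, so a boundary-aware scanner would still have to track alignment byte by byte, which amounts to decoding anyway.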

-- 
Matthias Andree




More information about the Bogofilter mailing list