try to use bogolexer 0.15.7 on messages with attaches
David Relson
relson at osagesoftware.com
Wed Nov 5 15:02:55 CET 2003
On Wed, 5 Nov 2003 17:47:00 +0400
Mike Lykov <combr at vesna.ru> wrote:
> You wrote:
>
> I am writing to you directly; if you want, you may answer on the
> mailing list ;)
>
> > > Can you comment on the results in my first letter (where the
> > > base64 part is marked as bo, bt, bl and others simultaneously)?
> >
> > Don't understand your question. The first letter is 'h' for head
> > and 'b' for body. base64 text is always in the body.
>
> *** 235 bh 77 PCFET0NUWVBFIEhUTUwgUFVCTElDICItLy9XM0MvL0RURCBIVE1MIDQuMV
> *** 236 bl 77 L0VOIj4NCjwhLS0gc2F2ZWQgZnJvbSB1cmw9KDAxMjQpZmlsZTovL0M6XERvY
> *** 237 bs 77 ZCUyMFNldHRpbmdzXG9zb2tpbmFcTG9jYWwlMjBTZXR0aW5nc1xUZW1wb3J
> *** 238 bs 77 dCUyMEZpbGVzXE9MSzg0XNHl7Ojt4PDx6uD/JTIw8ODx8fvr6uAzLmh0bSAtLR
>
> Yes, it's the body. But why is the base64 part recognized as bh (html)
> on line 1, bl (lcomment) on line 2, and bs (scomment) on lines 3 and 4?
The different states "html, lcomment, scomment" indicate what the parser
is doing when the line is read in. Look at lexer_v3.l to see how the
different states are used. Since flex may read ahead in the message
while trying to match a pattern, the states may not be what you
expected. This is a debug display to help developers understand what's
going on. If you use an additional "-v" in your command line,
bogofilter will display the line before _and_ after base64 decoding.
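
To see what the parser is looking at once those lines are decoded, here
is a minimal Python sketch (not part of bogofilter; decode_prefix is a
hypothetical helper name). The lines in the quoted debug output appear to
be cut short, so the sketch drops any trailing partial 4-character group
before decoding. The first line decodes to the start of an HTML DOCTYPE
declaration and the second contains "<!--", which seems consistent with
the parser moving from an html state into a comment state.

```python
import base64

def decode_prefix(b64_text):
    """Decode the part of a possibly truncated base64 line that forms whole 4-char groups."""
    usable = len(b64_text) - len(b64_text) % 4  # drop a trailing partial group
    return base64.b64decode(b64_text[:usable])

# Lines 235 and 236 exactly as they appear (truncated) in the debug output above.
line_235 = "PCFET0NUWVBFIEhUTUwgUFVCTElDICItLy9XM0MvL0RURCBIVE1MIDQuMV"
line_236 = "L0VOIj4NCjwhLS0gc2F2ZWQgZnJvbSB1cmw9KDAxMjQpZmlsZTovL0M6XERvY"

print(decode_prefix(line_235))
print(decode_prefix(line_236))
```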
> I have met this case
> (copied from my earlier letter):
>
> *** 226 bt 77 IFNvZnRMaW5lDQp1bnN1YnNjcmliZSBTZW1pbmFycw0KPG1haWx0bzpzdWJ
> *** 227 bt 77 bmUtZGlyZWN0LnJ1P3N1YmplY3Q9dW5zdWJzY3JpYmUgU2VtaW5hcnM+IC
> *** 228 bt 53 IO7yIO3u4u7x8uXpIPHl7Ojt4PDu4iBTb2Z0TGluZQ0KDQoNCg0K
> *** 229 bt 1
>
>
> (everything above is OK: the end of a base64-encoded file (the first
> attachment in the letter) and the beginning of another (the second
> attachment))
>
> *** 230 hi 44 ------=_NextPart_000_EF17_01C3507A.A8F2C320
>
> *** 231 hi 25 Content-Type: text/html;
>
> *** 232 hi 18 charset="koi8-r"
>
> *** 233 hi 34 Content-Transfer-Encoding: base64
>
> *** 234 hi 1
>
>
> All OK (the second attachment in the same letter begins)
>
> *** 235 bh 77 PCFET0NUWVBFIEhUTUwgUFVCTElDICItLy9XM0MvL0RURCBIVE1MIDQuMV
> *** 236 bl 77 L0VOIj4NCjwhLS0gc2F2ZWQgZnJvbSB1cmw9KDAxMjQpZmlsZTovL0M6XERvY
> *** 237 bs 77 ZCUyMFNldHRpbmdzXG9zb2tpbmFcTG9jYWwlMjBTZXR0aW5nc1xUZW1wb3J
> *** 238 bs 77 dCUyMEZpbGVzXE9MSzg0XNHl7Ojt4PDx6uD/JTIw8ODx8fvr6uAzLmh0bSAtLR
>
> Not ok - see above
Looks OK to me.
> > > Can you explain this part of newly generated wordlist goodlist.db
> > > ?
> > No. It works fine for my test case. Can you gzip the original
> > message and send it to me?
>
> I will try, but I cannot extract a file of suitable size right now
> (my test corpus of letters is 45 Mbytes ;)
>
> > > but such parts must be ignored, according to you!
> > images and applications are ignored. text/plain, text/html, etc are
> > processed. Look at files src/lexer.c and src/token.c for details.
>
> And what about this :
> -----------
> Content-Type: application/msword; name="Contract.doc"
>
> Content-Transfer-Encoding: base64
>
> Content-Disposition: attachment; filename="Contract.doc"
> -----------
> Content-Type: application/x-msexcel; name="=?koi8-r?B?5dbFzS4gxMn
> Content-Transfer-Encoding: base64
>
> Content-Disposition: attachment; filename="=?koi8-r?B?5dbFzS4gxMn
>
> Is it ignored? Or is this base64-encoded content tokenized?
>
> I have many, many letters with attachments of these types passing
> through my server. Is their encoded content treated as important, or is
> it ignored like an image or application?
You have sufficient code in the bogofilter installation to answer these
questions for yourself. Use "bogolexer -p < message" to see what
happens.
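
For a quick look at which parts a message contains before running
bogolexer on it, a rough Python sketch follows. It only applies the rule
of thumb stated earlier in this thread (image and application parts are
ignored, text parts are processed); it is not bogofilter's code and may
not match what the lexer really does. classify_parts and SAMPLE are
hypothetical names, and the sample message is made up for illustration.

```python
from email import message_from_string

# Rule of thumb taken from this thread, NOT from bogofilter's source:
# image/* and application/* parts are skipped, other parts are tokenized.
SKIPPED_MAIN_TYPES = {"image", "application"}

def classify_parts(raw_message):
    """Return (content_type, 'ignored' or 'processed') for each leaf MIME part."""
    msg = message_from_string(raw_message)
    result = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        skipped = part.get_content_maintype() in SKIPPED_MAIN_TYPES
        result.append((part.get_content_type(), "ignored" if skipped else "processed"))
    return result

# Made-up two-part message modelled on the headers quoted above.
SAMPLE = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XX"

--XX
Content-Type: text/html; charset="koi8-r"
Content-Transfer-Encoding: base64

PCFET0NUWVBF

--XX
Content-Type: application/msword; name="Contract.doc"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="Contract.doc"

AAAA

--XX--
"""

print(classify_parts(SAMPLE))
```

The output of "bogolexer -p" on the real message remains the
authoritative answer.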
David