how many tokens?

Nick Simicich njs at scifi.squawk.com
Thu Feb 27 19:04:19 CET 2003


At 09:59 AM 2003-02-26 -0800, Chris Wilkes wrote:

>On Wed, Feb 26, 2003 at 12:43:45PM -0500, David Relson wrote:
> >
> > No Due Dates<FONT COLOR="fef0d0">zzzzzz</FONT>No Hidden Charges<FONT
> > COLOR="#fef0d0">zzzzzz</FONT>No
> >
> > line 3 - Should "Dates<FONT COLOR="fef0d0">zzzzzz</FONT>No" produce two
> > tokens ("dates", "zzzzzz") or just one, i.e. "dateszzzzzzno"?
>
>Do you want to keep the FONT tags around?  A lot of spam HTML email has
>crazy fonts all over the place and I think a count of them would help
>identify spam.

The tags are kept and parsed even if they are moved out of the middle of words.
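
To make that concrete, here is a rough sketch in Python of the idea: lift the tags out of the text, let the surrounding characters close up into one word, and still emit a token for each tag so they can be counted. This is only an illustration under my own assumptions, not bogofilter's actual lexer; the names TAG_RE and tokenize, the "tag name only" token form, and the alphanumeric word rule are all made up for the example.

```python
import re

# Hypothetical pattern: matches an HTML tag and captures its name
# (with a leading "/" for closing tags). Not bogofilter's real grammar.
TAG_RE = re.compile(r"<(/?[A-Za-z][A-Za-z0-9]*)[^>]*>")


def tokenize(text):
    """Return (word_tokens, tag_tokens) for a line of HTML-ish text.

    Tags are removed from the text so that a word split by a tag is
    rejoined, but each tag is still reported as its own token.
    """
    # Collect the tag names before removing the tags from the text.
    tag_tokens = [name.lower() for name in TAG_RE.findall(text)]
    # Delete the tags in place so the text on either side joins up.
    joined = TAG_RE.sub("", text)
    # Assumed word rule: runs of ASCII letters and digits, lowercased.
    word_tokens = [w.lower() for w in re.findall(r"[A-Za-z0-9]+", joined)]
    return word_tokens, tag_tokens


if __name__ == "__main__":
    line = 'No Due Dates<FONT COLOR="fef0d0">zzzzzz</FONT>No Hidden Charges'
    words, tags = tokenize(line)
    print(words)
    print(tags)
```

With David's sample line this gives the single word "dateszzzzzzno" plus one token each for the opening and closing FONT tags, so a count of FONT tags is still available to the classifier.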

>Course I'm of the mind that any HTML email I get is highly suspect from
>the get-go.  Maybe I should make a pre-filter for my script to run BF so
>I can have separate text and HTML email file databases and cutoff rules.
>Anyone doing that?

--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally 
to mean electronic messages designed to be read by an individual, and it 
can include Usenet, SMS, AIM, etc.  But if it is not all three of 
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term 
plays into the hands of the spammers, since it causes confusion, and 
spammers thrive on confusion. Spam is not speech, it is an action, like 
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!



More information about the bogofilter-dev mailing list