<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <title></title>

</head>

<body>

<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">


<blockquote type="cite" cite="mid3EDF6B78.3888.10C0D13@localhost">   


  <blockquote type="cite">     

    <blockquote type="cite">       

      <pre wrap="">The idea here is that these weird forms are fairly rare so they can be

stored separately with no great storage cost - degeneration is to one of

the standard forms only - not to any of the other weird formats.

      </pre>

     </blockquote>

    <pre wrap="">Time will tell if the weird formats are rare, or not.  Certainly one lookup

for the three standard forms is a winning speed strategy.

    </pre>

   </blockquote>

  <pre wrap=""><!---->

I am not sure how valid this is, but we could have a fourth count for *all* 

weird formats. So tHe, tHE, ThE etc. all count as weird.

This might result in higher pspam values as all weird forms are counted as 

one form, but this only happens if the token has several different weird 

formats (rare event?).

  </pre>

 </blockquote>

 In the spam I've seen, it isn't so rare.<br>

<br>

Some words (ones that are already spammish) are mis-capitalized far more often

than others.  Furthermore, deliberately mis-capitalized words show up in every

possible variation, each with what seems to be roughly equal probability.<br>

<br>

IMHO, a single count for 'mis-capitalized' might even give better results

than maintaining a separate count for each possible variation.
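<br>
<br>
To make the idea concrete, here is a minimal sketch in Python (not bogofilter's actual code) of keeping three counts for the standard forms plus one shared count for everything else. The names LOWER, CAPITALIZED, UPPER, MISCAP, case_form and count_forms are made up for illustration, and the classification rule is only my assumption about what "standard form" means.<br>

```python
from collections import defaultdict

# Hypothetical bucket indices: three standard forms plus one shared
# bucket for every other ("weird" / mis-capitalized) variant.
LOWER, CAPITALIZED, UPPER, MISCAP = range(4)

def case_form(token: str) -> int:
    """Classify a token's capitalization into one of the four buckets."""
    if token == token.lower():
        return LOWER          # "the" (also tokens with no letters at all)
    if token == token.upper():
        return UPPER          # "THE"
    if token == token[0].upper() + token[1:].lower():
        return CAPITALIZED    # "The"
    return MISCAP             # "tHe", "tHE", "ThE", ...

def count_forms(tokens):
    """Map each case-folded token to [lower, Capitalized, UPPER, miscap] counts."""
    counts = defaultdict(lambda: [0, 0, 0, 0])
    for tok in tokens:
        counts[tok.lower()][case_form(tok)] += 1
    return counts

counts = count_forms(["the", "The", "THE", "tHe", "tHE", "ThE"])
print(counts["the"])  # [1, 1, 1, 3]
```

Note that under this simple rule a name like McDonald also lands in the shared bucket; whether that hurts in practice would have to be measured.<br>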

<blockquote type="cite" cite="mid3EDF6B78.3888.10C0D13@localhost">

  <pre wrap="">If there are no weird formats, the fourth count need not be stored (if you

can do right truncation of count fields), so it might be quite space 

efficient.

Information is lost if we do this, as we cannot regenerate the actual 

weird tokens and their counts (e.g. via bogoutil); we would have to output the 

token in a specific weird format, like "tHe", rather than the actual one

- but maybe it does not matter too much - after all we lost information 

about formats in the original casefolded database,

  </pre>

 </blockquote>

 I don't think it matters.  Proper names (such as McDonald or BogoFilter)

will be capitalized the same way almost every time in non-spam, and genuine

typos in non-spam will be relatively rare.  Deliberately mis-capitalized

words, on the other hand, are mis-capitalized in every possible way, and

they're almost always a strong indication of spam.  It should be sufficient

to simply report 'mis-capitalized' as a single form; it doesn't really matter

which capitalization variants were seen.<br>
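<br>
As for the right truncation of count fields mentioned in the quoted message, a rough Python sketch of what I understand by it follows. The function names pack_counts and unpack_counts are hypothetical, the list layout [lower, Capitalized, UPPER, miscap] is my assumption, and dropping every trailing zero (not just the fourth field) is a generalization of what was proposed.<br>

```python
def pack_counts(counts):
    """Drop trailing zero counts before storing (right truncation)."""
    end = len(counts)
    while end > 0 and counts[end - 1] == 0:
        end -= 1
    return counts[:end]

def unpack_counts(stored, width=4):
    """Pad a stored record back to `width` fields with zeros."""
    return list(stored) + [0] * (width - len(stored))

# A token never seen mis-capitalized needs only three stored fields.
print(pack_counts([5, 2, 1, 0]))   # [5, 2, 1]
print(unpack_counts([5, 2, 1]))    # [5, 2, 1, 0]
```

The round trip loses nothing, so the fourth count only costs space for tokens that have actually been seen mis-capitalized.<br>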

</body>

</html>