spam conference report

Gyepi SAM gyepi at praxis-sw.com
Sun Jan 19 17:28:30 CET 2003


On Sun, Jan 19, 2003 at 01:09:14PM +0100, Matthias Andree wrote:
> Gyepi SAM <gyepi at praxis-sw.com> writes:
> > There are webcasts online at www.spamconference.org so this will be
> > limited to my impressions and how it relates to bogofilter.
> 
> Ho-hum, while they cover the complete conference, it's long. Is there
> any summary of the respective presentations besides the abstracts?

I don't think so.

> > 4. Don't exclude [^[:alnum:]] characters. This has the interesting result
> > that "spam!!" and "spam!!!" are different words. A suggestion for
> > unknown words is to look for the closest match so "spam!!!!" will return
> > the count for "spam!!!".
> 
> Hum. I wonder how one'd do a proximity match FAST, say, no worse than
> O(n log n) what we have now.

One way I can think of is to cluster all proximate words into a single
database value whose key is some root of all the words, perhaps the stem.
Presuming that each list is not too long, walking it adds little cost and we
still maintain order n log n. So to look up a word:

1. compute the word's stem
2. use the stem as the lookup key into the database
3. get back a list of words and their counts
4. walk the list, looking for an exact match
5. if no exact match is found, perhaps use the first word's count
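
A minimal sketch of the five steps above, in Python for brevity. The
stem() function here is a hypothetical stand-in (it only lowercases and strips
trailing non-alphanumeric characters), the database is an ordinary dict, and
returning 0 for an unknown stem is an assumption; none of this is bogofilter
code.

```python
import re

def stem(word):
    # Hypothetical stand-in for a real stemmer: lowercase the word and
    # strip trailing non-alphanumeric characters ("free!!!" -> "free").
    return re.sub(r"[^0-9A-Za-z]+$", "", word).lower()

def lookup(db, word):
    # Steps 1-3: compute the stem and fetch the cluster stored under it.
    cluster = db.get(stem(word))
    if not cluster:
        return 0  # assumption: an unknown stem has a count of zero
    # Step 4: walk the list, looking for an exact match.
    for candidate, count in cluster:
        if candidate == word:
            return count
    # Step 5: no exact match, so fall back to the first word's count.
    return cluster[0][1]

# A dict stands in for the database; each value is a list of (word, count).
db = {"free": [("free!", 20), ("free!!", 8), ("free!!!", 3)]}

print(lookup(db, "free!!"))    # exact match
print(lookup(db, "free!!!!"))  # no exact match, first word's count is used
```

Since the fallback in step 5 takes the first entry, the order in which words
are stored within a cluster decides which count an unseen variant receives.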

Step three would require some way to represent the words and counts within
each value. Borrowing from DJ Bernstein's CDB format, each list would be
self-defining:

The first 4 bytes of the value contain the count of words in the list.
Each word is prefixed with 4 bytes containing its length.
The 4 bytes after the word contain its count.
The n bytes after the count contain the timestamp, n being however many bytes are used for timestamp data.

An example: say we have the following list of proximate words and counts:

	free!   20
	free!!  8
	free!!! 3

They would all be stored under the key 'free' and would have the following value (without the newlines):

3
5free!
20
20020102
6free!!
8
20020302
7free!!!
3
20030102
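
The layout above could be packed and unpacked roughly as follows. This is a
sketch under stated assumptions, not a definitive format: the 4-byte fields
are taken to be little-endian unsigned integers, and the timestamp is assumed
to be a 4-byte integer of the form YYYYMMDD (n = 4). The names pack_cluster
and unpack_cluster are made up for illustration.

```python
import struct

def pack_cluster(entries):
    # entries is a list of (word, count, timestamp) tuples.
    # Assumption: every numeric field is a 4-byte little-endian unsigned int.
    out = [struct.pack("<I", len(entries))]         # number of words
    for word, count, stamp in entries:
        raw = word.encode("utf-8")
        out.append(struct.pack("<I", len(raw)))     # length prefix
        out.append(raw)                             # the word itself
        out.append(struct.pack("<II", count, stamp))  # count, then timestamp
    return b"".join(out)

def unpack_cluster(blob):
    # Inverse of pack_cluster: returns the list of (word, count, timestamp).
    (n,) = struct.unpack_from("<I", blob, 0)
    pos = 4
    entries = []
    for _ in range(n):
        (length,) = struct.unpack_from("<I", blob, pos)
        pos += 4
        word = blob[pos:pos + length].decode("utf-8")
        pos += length
        count, stamp = struct.unpack_from("<II", blob, pos)
        pos += 8
        entries.append((word, count, stamp))
    return entries

# The example list from above, stored under the key 'free'.
entries = [("free!", 20, 20020102),
           ("free!!", 8, 20020302),
           ("free!!!", 3, 20030102)]
blob = pack_cluster(entries)
assert unpack_cluster(blob) == entries
```

Because each word carries its own length and the value starts with the number
of words, a reader can walk the value without any separator characters.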

-Gyepi



