wordlist maintenance [was: several bugs/glitches/typos/questions]

David Relson relson at osagesoftware.com
Sat Mar 8 05:43:10 CET 2003


At 10:43 PM 3/7/03, W M Brelsford wrote:

>On Fri Mar 07 2003 at 10:11 PM -0500, David Relson wrote:
> > At 09:20 PM 3/7/03, W M Brelsford wrote:
> > >And, "bogoutil -n -d file.db" still does not combine tokens.  I
> > >assume this would take more involved code, and may not be worth
> > >worrying about as long as it's documented.  Presumably one would run
> > >bogoutil -n -m once per database, set "replace_nonascii_characters=Y"
> > >in bogofilter.cf and be done with it.
> >
> > The "-d" (dump) option writes each token.  When used with the "-n"
> > (replace-nonascii-characters) option, the token written has translated
> > characters (if appropriate).  Given the one-pass nature of "-d", there
> > will _not_ be any combining.
> >
> > However if you use "-l" (load) from the dump output file, combining _will_
> > occur.
>
>Makes sense.  Perhaps -n in the man page should then read:
>
>         The "bad" characters will be converted to question marks
>         and, except with -d, matching tokens will be combined.

Bill,

When tokens are combined, the most recent date is now kept, and the man
page has been updated.

David

P.S.  There was an editing error in the previous patch: the line of code
that deletes a token (one with too low a count, too old, etc.) was
apparently dropped.  It's fixed in CVS and in the attached patch.
-------------- next part --------------
Index: src/datastore.h
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/src/datastore.h,v
retrieving revision 1.3
diff -u -r1.3 datastore.h
--- src/datastore.h	8 Feb 2003 20:34:24 -0000	1.3
+++ src/datastore.h	8 Mar 2003 04:30:46 -0000
@@ -61,6 +61,9 @@
 /** Set the value associated with a given word in a list */
 void db_setvalue(void *, const word_t *, uint32_t);
 
+/** Update the value associated with a given word in a list */
+void db_updvalue(void *vhandle, const word_t *word, uint32_t count);
+
 /** Get the database message count */
 uint32_t db_get_msgcount(void*);
 
Index: src/datastore_db.c
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/src/datastore_db.c,v
retrieving revision 1.11
diff -u -r1.11 datastore_db.c
--- src/datastore_db.c	7 Mar 2003 17:57:32 -0000	1.11
+++ src/datastore_db.c	8 Mar 2003 04:30:47 -0000
@@ -333,6 +333,26 @@
 }
 
 
+/*
+Update the VALUE in database, using WORD as database key.
+Adds COUNT to existing count.
+Sets date to newer of TODAY and date in database.
+*/
+void db_updvalue(void *vhandle, const word_t *word, uint32_t count){
+  dbv_t val;
+  int ret = db_get_dbvalue(vhandle, word, &val);
+  if (ret != 0) {
+      val.count = count;
+      val.date  = today;		/* date in form YYYYMMDD */
+  }
+  else {
+      val.count += count;
+      val.date  = max(val.date, today);	/* date in form YYYYMMDD */
+  }
+  db_set_dbvalue(vhandle, word, &val);
+}
+
+
 static void db_set_dbvalue(void *vhandle, const word_t *word, dbv_t *val){
   int ret;
   DBT db_key;
Index: src/maint.c
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/src/maint.c,v
retrieving revision 1.10
diff -u -r1.10 maint.c
--- src/maint.c	8 Mar 2003 00:10:34 -0000	1.10
+++ src/maint.c	8 Mar 2003 04:30:47 -0000
@@ -158,7 +158,7 @@
     memcpy(&val, data->text, data->leng);
 
     if (!keep_count(val.count) || !keep_date(val.date) || !keep_size(key->leng)) {
-
+	db_delete(userdata, key);
 	if (DEBUG_DATABASE(0)) {
 	    fputs("deleting ", dbgout);
 	    word_puts(&w, 0, dbgout);
@@ -169,15 +169,13 @@
 	if (replace_nonascii_characters)
 	{
 	    byte *tmp = xstrdup(key->text);
-	    unsigned long count = val.count;
 	    if (do_replace_nonascii_characters(tmp, key->leng))
 	    {
 		db_delete(userdata, key);
 		w.text = tmp;
 		w.leng = key->leng;
-		count += db_getvalue(userdata, &w);
 		set_date(val.date);
-		db_setvalue(userdata, &w, count);
+		db_updvalue(userdata, &w, val.count);
 	    }
 	    xfree(tmp);
 	}
Index: src/tests/bogoutil/t.nonascii.replace
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/src/tests/bogoutil/t.nonascii.replace,v
retrieving revision 1.1
diff -u -r1.1 t.nonascii.replace
--- src/tests/bogoutil/t.nonascii.replace	8 Mar 2003 00:11:13 -0000	1.1
+++ src/tests/bogoutil/t.nonascii.replace	8 Mar 2003 04:30:47 -0000
@@ -12,13 +12,13 @@
 #
 # test below
 # remember to use ${srcdir}
-echo  	41 A4 BA B5 B5 20 31 0A \
-	41 C1 BA B8 B5 20 32 0A \
-	41 BA C1 B8 B5 20 33 0A \
-  	42 A4 BA B8 B5 B5 20 31 0A \
-	42 C1 BA B8 B5 B5 20 32 0A \
-	42 BA C1 B8 B5 B5 20 33 0A \
-	42 C1 BA B5 B8 B5 20 34 0A \
+echo  	41 A4 BA B5 B5 20 31     20  32 30 30 33 30 33 30 33 0A \
+	41 C1 BA B8 B5 20 32     20  32 30 30 32 31 32 30 32 0A \
+	41 BA C1 B8 B5 20 33     20  32 30 30 33 30 33 30 31 0A \
+  	42 A4 BA B8 B5 B5 20 31  20  32 30 30 33 30 33 30 33 0A \
+	42 C1 BA B8 B5 B5 20 32  20  32 30 30 32 31 32 30 32 0A \
+	42 BA C1 B8 B5 B5 20 33  20  32 30 30 33 30 33 30 31 0A \
+	42 C1 BA B5 B8 B5 20 34  20  32 30 30 33 30 33 30 34 0A \
 | ../dehex >${TMPDIR}/input
 
 WORDLIST="${TMPDIR}/spamlist.db"
@@ -33,12 +33,12 @@
 LEN1=`wc -l ${TMPDIR}/output.1 | awk '{print $1}'`
 LEN2=`wc -l ${TMPDIR}/output.2 | awk '{print $1}'`
 
-TOK1=`head -1 ${TMPDIR}/output.2 | awk '{print $2 }'`
-TOK2=`tail -1 ${TMPDIR}/output.2 | awk '{print $2 }'`
+TOKDAT1=`head -1 ${TMPDIR}/output.2 | awk '{print $2 "." $3 }'`
+TOKDAT2=`tail -1 ${TMPDIR}/output.2 | awk '{print $2 "." $3 }'`
 
-RESULT=`printf "%d.%d.%d.%d" $LEN1 $LEN2 $TOK1 $TOK2`
+RESULT=`printf "%d.%d.%s.%s" $LEN1 $LEN2 $TOKDAT1 $TOKDAT2`
 
-WANT="7.2.6.10"
+WANT="7.2.6.20030303.10.20030304"
 
 if [ "$RESULT" != "$WANT" ] ; then
     echo want: $WANT, have: $RESULT
Index: doc/bogoutil.xml
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/doc/bogoutil.xml,v
retrieving revision 1.6
diff -u -r1.6 bogoutil.xml
--- bogoutil.xml	6 Mar 2003 23:42:58 -0000	1.6
+++ bogoutil.xml	8 Mar 2003 04:37:45 -0000
@@ -109,7 +109,8 @@
 	    Option <option>-n</option> stands for "replace non-ascii characters".  
 	    It will replace characters with the high bit (0x80) by question marks.  
 	    This can be useful if a word list has lots of unreadable tokens, for example from asian spam.
-	    The "bad" characters will be converted to question marks and matching tokens will be combined.
+	    The "bad" characters will be converted to question marks and matching tokens will be combined
+	    when used with '-m' or '-l', but not with '-d'.
 	</para>
 	<para>
 	    Option <option>-a age</option> indicates an acceptable token age, with older ones being discarded.  


