Word pairs

Cedric Foll cedric.foll at ac-rouen.fr
Wed May 14 11:00:38 CEST 2003


I've written a preprocessing script that returns word pairs.
It's written in Ruby. I'm using it successfully to classify HTML pages.
It can't deal with MIME messages.

You can use it like this:

mybogo.rb < file | bogofilter
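
To make "word pairs" concrete, here is a minimal sketch of the idea, separate from the attached script. The helper name word_pairs and the simplified splitting rule (anything that is not an ASCII letter or digit is a separator) are assumptions for illustration only; the attached script uses its own delimiter list.

```ruby
# Illustrative sketch only: word_pairs is a hypothetical helper,
# not part of bogofilter or of the attached script.
def word_pairs(text)
  # Lowercase, then split on runs of non-alphanumeric characters
  # (a simplification of the script's delimiter list).
  words = text.downcase.split(/[^a-z0-9]+/).reject(&:empty?)
  # Join each adjacent pair of words with "_".
  words.each_cons(2).map { |a, b| "#{a}_#{b}" }
end

puts word_pairs("Buy cheap pills now").join(" ")
# prints: buy_cheap cheap_pills pills_now
```

Each pair then reaches bogofilter as a single token, since "_" is not treated as a separator here.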


On Wed 14/05/2003 at 09:58, michael at optusnet.com.au wrote:
> Does anyone have a patch lying around that changes bogofilter
> to use word pairs as tokens instead of single words?
> 
> Many thanks,
> michael, experimenting but lazy.
> 
-------------- next part --------------
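#!/usr/bin/ruby

# Index of the last input character to keep; -1 keeps the whole input.
#LIMIT_SIZE = 100000
LIMIT_SIZE = -1

# Read all of standard input; exit quietly if there is nothing to process.
page = $stdin.read
exit if page.nil? || page.empty?
page = page[0..LIMIT_SIZE]

# Split on spaces, newlines and punctuation, drop empty strings and
# "&nbsp" entities (their trailing ";" is a delimiter), then lowercase
# ASCII letters.
words = page.split(/[ ,\+;!:\n\.\(\)\/"'\?\*\[\]\-=><{}]+/)
words.reject! { |word| word.empty? || word == "&nbsp" }
words.map! { |w| w.tr('A-Z', 'a-z') }

# Join each pair of adjacent words with "_".
more = words.each_cons(2).map { |a, b| a + "_" + b }

# Add "meta_"-prefixed tokens taken from the content of selected <meta> tags.
page.scan(/meta[^>]*name="(publisher|topic|theme|keywords|abstract|page-topic|description)"[^>]*content="([^"]*)"/mi) do |_name, content|
  more.concat(content.split(/[ ,]+/).map { |word| "meta_" + word })
end

# First line: the single words. Second line: word pairs and meta tokens.
puts words.join(' ')
puts more.join(' ')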


