How do we determine if a message is spam or not?
take a bunch of good email, make a table: token, # of times it appears.
take a bunch of bad email, make a table: token, # of times it appears.
(I'll refer to these two tables as "training data", stored in file training.dat)
third table: every token, with the probability that an email containing it is spam.
(the formula used to calculate the probability is biased to avoid false positives: a false positive, "good email is junk", is worse than a false negative, "bad email is not junk")
when new mail arrives, find the 15 "most interesting" tokens. ("most interesting" are those whose probability is furthest from 50%)
use the probabilities from the third table to determine the probability that the message is junk.
if it is > 90%, let the user know the message is junk
This algorithm comes from Paul Graham's paper "A Plan For Spam".
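A rough sketch of the steps above, following the formula in Graham's essay as best I recall it (good counts doubled, tokens seen fewer than 5 times skipped, probabilities clamped to 0.01..0.99, unknown tokens scored 0.4). The function names, the dict-based tables, and the de-duplication of tokens within a message are assumptions for illustration, not the Mozilla code:

```python
def token_probability(token, good, bad, ngood, nbad):
    """Probability that a message containing `token` is spam, or None if
    the token is too rare to say anything about."""
    g = 2 * good.get(token, 0)   # doubling good counts biases against false positives
    b = bad.get(token, 0)
    if g + b < 5:
        return None
    good_ratio = min(1.0, g / max(ngood, 1))
    bad_ratio = min(1.0, b / max(nbad, 1))
    return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))


def build_probability_table(good, bad, ngood, nbad):
    """The "third table": token -> spam probability.
    ngood / nbad are the number of good / bad messages trained on."""
    table = {}
    for token in set(good) | set(bad):
        p = token_probability(token, good, bad, ngood, nbad)
        if p is not None:
            table[token] = p
    return table


def spam_probability(tokens, table, unknown=0.4, interesting=15):
    """Combine the most interesting token probabilities into one score."""
    # Assumption: each distinct token counts once per message.
    probs = [table.get(t, unknown) for t in sorted(set(tokens))]
    # "Most interesting" = furthest from 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    prod = 1.0
    inverse_prod = 1.0
    for p in probs[:interesting]:
        prod *= p
        inverse_prod *= 1.0 - p
    return prod / (prod + inverse_prod)


def is_junk(tokens, table, threshold=0.9):
    return spam_probability(tokens, table) > threshold
```

Because every probability is clamped away from 0 and 1, the combined score never divides by zero; a message with no tokens comes out at exactly 0.5.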
What's a token?
We consider alphanumeric characters, dashes, apostrophes, and dollar signs to be part of tokens, and everything else to be a token separator.
We ignore tokens that are all digits.
We store tokens as UTF-8. This is how we handle non-US-ASCII email.
(for US-ASCII tokens, we lowercase before storing).
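The tokenizing rules above can be sketched like this. It is an illustration only: `tokenize` is a made-up name, and treating non-ASCII letters as "alphanumeric" (via Python's `str.isalnum`) is an assumption about how non-US-ASCII text is classified:

```python
def is_token_char(ch):
    # Alphanumerics, dashes, apostrophes and dollar signs belong to tokens;
    # everything else separates tokens.
    return ch.isalnum() or ch in "-'$"


def tokenize(text):
    """Split text into tokens, returned as UTF-8 encoded bytes."""
    raw_tokens = []
    current = []
    for ch in text:
        if is_token_char(ch):
            current.append(ch)
        elif current:
            raw_tokens.append("".join(current))
            current = []
    if current:
        raw_tokens.append("".join(current))

    tokens = []
    for token in raw_tokens:
        if token.isdigit():      # ignore tokens that are all digits
            continue
        if token.isascii():      # only US-ASCII tokens are lowercased
            token = token.lower()
        tokens.append(token.encode("utf-8"))
    return tokens
```

For example, `tokenize("Win $1000 NOW, don't wait 12345!")` keeps `$1000` (the dollar sign means it is not all digits) and drops `12345`.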
What part of the email?
All text parts, including headers, the subject, and attachments that we can decode as text.
When do we do it?
- new mail arrives (but after filters)
- or when the user manually analyzes mail ("Tools | Run Junk Mail Controls On Selected Messages")
- unseen mail when opening a folder
- reparsing a local folder
- old mail
- if new, but we're whitelisting (is the sender in my address book? note the Klez virus)
- opening the junk or trash folders (if it's already junk or trash, who cares)
- when the user manually marks a message as junk (or not junk).
- envelope, thread pane, toolbar button, Tools | ... (soon to be Message | Mark | ...)
- note, we don't update the tables when analyzing mail
- network i/o: need to download all text parts of messages (not a big deal for pop3, but big deal for imap)
- file i/o: need to load tables at startup (training.dat)
- cpu: parsing tables, tokenizing, determining probabilities
- moving to the junk folder
- purging on startup
- this is why we do filters and whitelisting first; we want spam analysis to be last
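A minimal sketch of that ordering, with the expensive junk analysis only reached when the cheaper checks don't settle the message. `handle_new_message`, the dict-shaped message, and the idea that a filter returns true when it has dealt with a message are all assumptions for illustration:

```python
JUNK_THRESHOLD = 0.9  # "if > 90%, let the user know it is junk"


def handle_new_message(message, filters, address_book, junk_score):
    """Return what happened to the message: "filtered", "whitelisted",
    "junk" or "not junk"."""
    # 1. User filters run first; a filter that claims the message ends processing.
    for apply_filter in filters:
        if apply_filter(message):
            return "filtered"

    # 2. Whitelisting: a sender in the address book is never analyzed.
    if message["sender"].lower() in address_book:
        return "whitelisted"

    # 3. Only now pay for the download, tokenizing and probability lookup.
    if junk_score(message) > JUNK_THRESHOLD:
        return "junk"
    return "not junk"
```

Here `junk_score` stands in for whatever computes the message's junk probability; it is never called for filtered or whitelisted mail.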
- training.dat is local and per-profile, but shared across accounts.
- for imap, we store junk keyword on the server
- acts like a special folder
- for imap, put it on imap server (saves network i/o)
- user can pick it, and change it (bugs here)
- used to keep junk out of the user's face, and to allow for purging
- could be the trash folder
- on startup (?), remove old stuff marked as junk in junk folder
- off by default (false positives + purge == dataloss)
- new user has no training data
- enabled, whitelisting by PAB, no move (false positives)
- improve the initial experience
- junk mail envelope (screenshot)
- on manual analyze, move (if the user has that set up)
- on manual mark, move to junk (or delete)
- purge should follow IMAP delete model
- purge should only purge messages marked as junk (dataloss if the junk folder is the trash folder)
- bugs (imap: state goes away, pop: fail to move)
- removing the junk from the filter UI (should stay in search)
- hooks for QA (for purge testing: purge every minute, not daily; force spam analysis)
- tools to dump training.dat
- improved UI (see specs)
- could be used for news
- generalize (have multiple tables) for things other than "junk or not junk".
- Paul Graham
- Initial Mozilla implementation work by beard, bienvenu, dmose, ducarroz, jglick, naving, nhotta, peterv, and sspitzer.