You are currently viewing a snapshot taken on April 21, 2008. Most of this content is highly out of date (some pages haven't been updated since the project began in 1998) and exists for historical purposes only. If there are any pages on this archive site that you think should be added back, please file a bug.

Junk Mail in Mozilla
Seth Spitzer

How do we determine if a message is spam or not?

Take a bunch of good email and make a table of each token and the number of times it occurs.
Take a bunch of bad email and make a second table of each token and the number of times it occurs.
(I'll refer to these two tables as "training data", stored in the file training.dat.)
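
A minimal sketch of building the two tables, in Python for illustration only. The function name count_tokens and the whitespace tokenizer are assumptions made for this example; the real tokenizer rules are described under "What's a token?", and the on-disk layout of training.dat is not shown here.

```python
from collections import Counter

def count_tokens(messages, tokenize=str.split):
    """Build a token -> occurrence-count table from an iterable of message texts.

    `tokenize` is a placeholder (plain whitespace splitting) standing in
    for the real tokenizer.
    """
    table = Counter()
    for text in messages:
        table.update(tokenize(text))
    return table

# Two tiny hand-made corpora, purely illustrative.
good_table = count_tokens(["lunch at noon", "meeting notes at noon"])
bad_table = count_tokens(["free money now", "free offer"])
```

A Counter returns 0 for tokens it has never seen, which is convenient when a token appears in only one of the two tables.
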

A third table maps every token to the probability that an email containing it is spam.
(The formula used to calculate the probability is biased to avoid false positives. A false positive, "good email is junk", is worse than a false negative, "bad email is not junk".)
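
As a sketch of what such a biased formula can look like, here is the per-token calculation from Paul Graham's "A Plan for Spam": good counts are doubled, rare tokens are skipped, and the result is clamped to [0.01, 0.99]. Whether Mozilla uses exactly these constants is an assumption; the function and parameter names are made up for this example.

```python
def token_spam_probability(good_count, bad_count, n_good, n_bad):
    """Probability that a message containing this token is spam.

    good_count / bad_count: occurrences of the token in each table.
    n_good / n_bad: number of messages in each corpus (must be > 0).
    Constants follow Graham's paper, not necessarily Mozilla's code.
    """
    g = 2 * good_count          # doubling good counts biases against false positives
    b = bad_count
    if g + b < 5:
        return None             # too rare to say anything useful
    good_freq = min(1.0, g / n_good)
    bad_freq = min(1.0, b / n_bad)
    p = bad_freq / (good_freq + bad_freq)
    return max(0.01, min(0.99, p))   # never fully certain either way
```

For example, a token seen 10 times in 100 bad messages and never in good mail gets 0.99, while a token seen 5 times in good mail and 10 times in bad mail (equal corpus sizes) only gets 0.5 because of the doubling.
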

When new mail arrives, find the 15 "most interesting" tokens. (The "most interesting" tokens are those whose probability differs most from 50%.)
Use the probabilities from the third table to determine the probability that the message is junk.
If it is > 90%, let the user know it is junk.
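
The selection and combination steps can be sketched as follows. The combining rule is the one given in Graham's paper (product of the probabilities divided by that product plus the product of their complements); the function name, the constant names, and the assumption that every probability is already clamped away from 0 and 1 are choices made for this example.

```python
import math

JUNK_THRESHOLD = 0.9   # "if > 90%" from the description above
N_INTERESTING = 15

def junk_probability(token_probs, n_interesting=N_INTERESTING):
    """Combine per-token spam probabilities into one score for a message.

    token_probs: probabilities (each strictly between 0 and 1) for the
    tokens found in the message.
    """
    # "Most interesting" = farthest from the neutral value 0.5.
    interesting = sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)
    interesting = interesting[:n_interesting]
    spam = math.prod(interesting)
    ham = math.prod(1.0 - p for p in interesting)
    return spam / (spam + ham)

def is_junk(token_probs):
    return junk_probability(token_probs) > JUNK_THRESHOLD
```

A message with no known tokens scores exactly 0.5 under this sketch, so it is not flagged.
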

This algorithm comes from Paul Graham's paper "A Plan For Spam".

What's a token?

We consider alphanumeric characters, dashes, apostrophes, and dollar signs to be part of tokens, and everything else to be a token separator.
We ignore tokens that are all digits.
We store tokens as UTF-8. This is how we handle non-US-ASCII email.
(For US-ASCII, we lowercase before storing.)
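
The tokenizing rules above can be sketched in Python as below. This is an illustration, not the Mozilla implementation: the regular expression, the use of Unicode-aware `\w` for "alphanumeric", and the decision to lowercase only tokens that are entirely ASCII are assumptions.

```python
import re

# Anything that is not a letter, digit, dash, apostrophe, or dollar sign
# separates tokens. \w also matches "_", so underscore is listed separately.
_SEPARATORS = re.compile(r"[^\w\-'$]|_")

def tokenize(text):
    """Split text into tokens, returned as UTF-8 encoded bytes."""
    tokens = []
    for raw in _SEPARATORS.split(text):
        if not raw or raw.isdigit():
            continue                      # skip empty pieces and all-digit tokens
        if raw.isascii():
            raw = raw.lower()             # lowercase US-ASCII tokens only
        tokens.append(raw.encode("utf-8"))
    return tokens
```

For example, tokenize("FREE $$$ money, 12345 don't-miss!") yields b"free", b"$$$", b"money", and b"don't-miss"; the all-digit token is dropped.
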

What part of the email?

All text parts, including headers, the subject, and attachments that we can decode as text.

When do we do it?
  1. new mail arrives (but after filters)
  2. or when the user manually analyzes mail "Tools | Run Junk Mail Controls On Selected Messages"
  3. unseen mail when opening a folder
  4. reparsing a local folder
When don't we do it?
  1. old mail
  2. new mail, if we're whitelisting and the sender is in my address book (but note the Klez virus, which forges sender addresses)
  3. opening the junk or trash folders (if it's already junk or trash, who cares)
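
The rules in the two lists above can be condensed into one decision function. This is a simplified sketch: the Message fields, the folder names, and the idea that a manual request overrides every other rule are assumptions made for the example, and "new" here stands in for new, unseen, or reparsed mail.

```python
from dataclasses import dataclass

@dataclass
class Message:
    sender: str
    is_new: bool        # new, unseen, or being reparsed (simplification)
    folder: str         # e.g. "Inbox", "Junk", "Trash" (hypothetical names)

def should_analyze(msg, address_book, whitelisting=True, manual=False):
    """Decide whether to run junk analysis on a message."""
    if manual:
        return True                       # user asked explicitly (assumed to override)
    if msg.folder in ("Junk", "Trash"):
        return False                      # already junk or trash, who cares
    if not msg.is_new:
        return False                      # old mail is never analyzed automatically
    if whitelisting and msg.sender.lower() in address_book:
        return False                      # whitelisted sender
    return True
```

Here address_book is assumed to be a set of lowercase addresses.
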
When do we update the tables?
  1. when the user manually marks a message as junk (or not junk)
  2. ways to mark: envelope, thread pane, toolbar button, Tools | ... (soon to be Message | Mark | ...)
  3. note: we don't update the tables when analyzing mail
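
A sketch of table updates driven only by manual marking. The class and method names are hypothetical, and the handling of a message that was previously marked the other way (its tokens are removed from the old table) is an assumption about reasonable behavior, not something stated above.

```python
from collections import Counter

class TrainingData:
    """In-memory stand-in for the two token tables in training.dat."""

    def __init__(self):
        self.good = Counter()
        self.bad = Counter()
        self.n_good = 0
        self.n_bad = 0

    def mark(self, tokens, is_junk, previously=None):
        """Record a manual junk / not-junk decision for one message.

        previously: the earlier manual classification (True, False, or
        None if the message was never marked). Analysis results never
        reach this method, so automatic scoring does not train the tables.
        """
        counts = Counter(tokens)
        if previously is True:
            self.bad -= counts            # in-place subtract drops non-positive counts
            self.n_bad -= 1
        elif previously is False:
            self.good -= counts
            self.n_good -= 1
        if is_junk:
            self.bad += counts
            self.n_bad += 1
        else:
            self.good += counts
            self.n_good += 1
```
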
What's the cost?
  1. network i/o: need to download all text parts of messages (not a big deal for pop3, but big deal for imap)
  2. file i/o: need to load tables at startup (training.dat)
  3. cpu: parsing tables, tokenizing, determining probabilities
  4. moving to the junk folder
  5. purging on startup
  6. this is why we do filters and whitelisting first; we want spam analysis to be last
multiple accounts / multiple profiles
  1. training.dat is local and per profile, but shared across accounts
  2. for imap, we store junk keyword on the server
the junk folder
  1. acts like a special folder
  2. for imap, put it on imap server (saves network i/o)
  3. user can pick it, and change it (bugs here)
  4. used to keep junk out of the user's face and to allow for purging
  5. could be the trash folder
purging
  1. on startup (?), remove old messages marked as junk in the junk folder
  2. off by default (false positives + purge == dataloss)
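
The purge rule can be sketched as a filter over the junk folder's messages. The dictionary keys, the 30-day default, and the function name are hypothetical; the two points taken from the text are that purging is off by default and that only old messages marked as junk are candidates.

```python
from datetime import datetime, timedelta

def messages_to_purge(messages, now, max_age_days=30, enabled=False):
    """Return the messages a purge pass would remove.

    messages: dicts with "is_junk" (bool) and "received" (datetime).
    enabled defaults to False: false positives + purge == dataloss.
    max_age_days=30 is an illustrative value, not a documented default.
    """
    if not enabled:
        return []
    cutoff = now - timedelta(days=max_age_days)
    return [m for m in messages if m["is_junk"] and m["received"] < cutoff]
```

Checking the junk mark on each message, rather than purging everything in the folder, matters when the junk folder is also the trash folder.
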
What's the initial experience?
  1. new user has no training data
  2. junk mail controls are enabled, with whitelisting by the personal address book (PAB) and no move to the junk folder (because of false positives)
what's not done?
  1. improve the initial experience
  2. junk mail envelope (screenshot)
  3. on manual analyze, move (if the user has that set up)
  4. on manual mark, move to junk (or delete)
  5. purge should follow the IMAP delete model
  6. purge should only purge messages marked as junk (dataloss if the junk folder is the trash folder)
  7. bugs (imap: state goes away, pop: fail to move)
  8. removing junk from the filter UI (it should stay in search)
future improvements
  1. hooks for QA (for purge testing: purge every minute instead of daily; force spam analysis)
  2. tools to dump training.dat
  3. improved UI (see specs)
  4. could be used for news
  5. generalize (have multiple tables) for things other than "junk or not junk".
credits
  1. Paul Graham
  2. Initial Mozilla implementation work by beard, bienvenu, dmose, ducarroz, jglick, naving, nhotta, peterv, and sspitzer.