Junk Notes

Junk Mail in Mozilla
2/17/2003
Seth Spitzer

How do we determine if a message is spam or not?

take a bunch of good email, make a table: token, # of times it appears.
take a bunch of bad email, make a table: token, # of times it appears.
(I'll refer to these two tables as "training data", stored in the file training.dat)
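
A minimal sketch of building the two count tables, in Python for illustration only. The whitespace tokenizer, the sample messages, and the names count_tokens, good_table and bad_table are assumptions, not the Mozilla code or the training.dat format.

```python
from collections import Counter

def count_tokens(messages):
    """Build one table: token -> number of times it appears across messages."""
    table = Counter()
    for text in messages:
        # Placeholder tokenizer (assumption): lower-case and split on whitespace.
        table.update(text.lower().split())
    return table

# Hypothetical training mail, hand-sorted by the user.
good_table = count_tokens(["lunch at noon tomorrow", "notes from the meeting at noon"])
bad_table = count_tokens(["free money now", "free offer click now"])
```

A token missing from a table simply counts as zero, so good_table["free"] is 0 here.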

third table: for every token, the probability that an email containing it is spam.
(the formula for this probability is biased to avoid false positives: a false positive, "good email is junk", is worse than a false negative, "bad email is not junk")
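
For reference, a sketch of the per-token formula as published in "A Plan For Spam", written in Python. Doubling the good counts is where the bias against false positives comes from. The function name and arguments are assumptions for illustration, and the constants (the factor of 2, the minimum of 5 occurrences, the 0.01 and 0.99 clamps) are the paper's; the values actually used in Mozilla may differ.

```python
def spam_probability(token, good, bad, ngood, nbad):
    """Probability that a message containing `token` is spam.

    good / bad: token -> count tables; ngood / nbad: number of good and
    bad messages in the training data (assumed to be greater than zero).
    Returns None when the token is too rare to say anything about.
    """
    g = 2 * good.get(token, 0)  # good counts doubled: bias against false positives
    b = bad.get(token, 0)
    if g + b < 5:
        return None
    good_ratio = min(1.0, g / ngood)
    bad_ratio = min(1.0, b / nbad)
    # Clamp so that no single token is ever treated as absolute proof.
    return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))
```

For example, a token seen once in 10 good messages and 8 times in 10 bad messages gets 0.8 / (0.2 + 0.8) = 0.8.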

when new mail arrives, find 15 "most interesting" tokens. ("most interesting" are those with biggest delta from 50%)
use the probabilities from the third table to determine the probability the message is junk.
if > 90%, let the user know it is junk
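
A sketch of the scoring step, following the combining rule from "A Plan For Spam": take the most interesting token probabilities and combine them as prod(p) / (prod(p) + prod(1 - p)). The function name, the sample table, and the 0.4 default for tokens never seen before (the paper's choice) are assumptions here, not confirmed details of the Mozilla implementation.

```python
from math import prod

def junk_probability(tokens, prob_table, n=15, unknown=0.4):
    """Combine the n most interesting token probabilities into one score."""
    # One probability per distinct token; unseen tokens get the default.
    probs = [prob_table.get(t, unknown) for t in set(tokens)]
    # "Most interesting" means furthest from a neutral 50%.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:n]
    spam = prod(top)
    ham = prod(1.0 - p for p in top)
    return spam / (spam + ham)

# Hypothetical third table.
prob_table = {"free": 0.99, "money": 0.9, "lunch": 0.1}
score = junk_probability(["free", "money", "lunch"], prob_table)
is_junk = score > 0.9  # the 90% threshold from the notes
```

With these three tokens the score works out to 0.0891 / (0.0891 + 0.0009) = 0.99, so the message would be flagged.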

This algorithm comes from Paul Graham's paper "A Plan For Spam".

What's a token?

We consider alphanumeric characters, dashes, apostrophes, and dollar signs to be part of tokens, and everything else to be a token separator.
We ignore tokens that are all digits.
We store tokens as UTF-8. This is how we handle non US-ASCII email.
(for US-ASCII, we lower case before storing).
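
A rough Python sketch of these tokenizing rules. The function name is hypothetical, and Python's Unicode-aware str.isalnum() is used as a stand-in for "alphanumeric", which is an assumption about how non-ASCII text is classified. Tokens are returned as Python strings; encoding to UTF-8 would happen when they are stored.

```python
def tokenize(text):
    """Split text into tokens following the rules in the notes."""
    raw, current = [], []
    for ch in text:
        if ch.isalnum() or ch in "-'$":
            current.append(ch)       # part of a token
        elif current:
            raw.append("".join(current))  # anything else separates tokens
            current = []
    if current:
        raw.append("".join(current))

    tokens = []
    for tok in raw:
        if tok.isdigit():
            continue                 # ignore tokens that are all digits
        if tok.isascii():
            tok = tok.lower()        # lower-case US-ASCII tokens only
        tokens.append(tok)
    return tokens
```

For example, tokenize("FREE $100 offer!!! Call 5551234 now, don't wait") yields free, $100, offer, call, now, don't, wait.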

What part of the email?

All text parts, including headers, subject, attachments that we can decode as text.

When do we do it?

new mail arrives (but after filters)
or when the user manually analyzes mail "Tools | Run Junk Mail Controls On Selected Messages"
unseen mail when opening a folder
reparsing a local folder

When don't we do it?

old mail
new mail, if we're whitelisting and the sender is in my address book (note the Klez virus)
opening the junk or trash folders (if it's already junk or trash, who cares)
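
The two lists above can be read as one decision, sketched here in Python. The Message fields, the folder names, and the choice to let a manual request override the other checks are assumptions for illustration, not the actual Mozilla logic.

```python
from dataclasses import dataclass

@dataclass
class Message:
    sender: str
    is_new: bool   # newly arrived or still unseen
    folder: str

def should_analyze(msg, address_book, manual=False, whitelisting=True):
    """Decide whether to run junk analysis on a message."""
    if manual:
        return True    # the user asked for it explicitly (assumed to override)
    if msg.folder in ("Junk", "Trash"):
        return False   # already junk or trash, who cares
    if not msg.is_new:
        return False   # old mail is left alone
    if whitelisting and msg.sender.lower() in address_book:
        return False   # sender is in the address book
    return True
```

The checks that skip analysis are all cheap compared to tokenizing and scoring the message.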

When do we update the tables?

when the user manually marks a message as junk (or not junk).
envelope, thread pane, toolbar button, Tools | ... (soon to be Message | Mark | ...)
note, we don't update the tables when analyzing mail
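
A small Python sketch of the update rule: only a manual mark changes the tables, and the automatic analysis path never calls it. The TrainingData class is a hypothetical in-memory stand-in for training.dat, and how a message that is re-marked should be handled is left out.

```python
from collections import Counter

class TrainingData:
    """In-memory stand-in for the two tables kept in training.dat."""

    def __init__(self):
        self.good = Counter()
        self.bad = Counter()

    def mark(self, tokens, is_junk):
        """Called only when the user manually marks a message."""
        table = self.bad if is_junk else self.good
        table.update(tokens)

training = TrainingData()
training.mark(["free", "money", "free"], is_junk=True)
training.mark(["lunch", "noon"], is_junk=False)
```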

What's the cost?

network i/o: need to download all text parts of messages (not a big deal for pop3, but big deal for imap)
file i/o: need to load tables at startup (training.dat)
cpu: parsing tables, tokenizing, determining probabilities
moving to the junk folder
purging on startup
this is why we do filters and whitelisting first, want spam to be last

multiple accounts / multiple profiles

training.dat is local, and per profile, but cross account.
for imap, we store junk keyword on the server

the junk folder

acts like a special folder
for imap, put it on imap server (saves network i/o)
user can pick it, and change it (bugs here)
used to keep junk out of the user's face, allow for purging
could be the trash folder

purging

on startup (?), remove old stuff marked as junk in junk folder
off by default (false positives + purge == dataloss)
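
A sketch of the purge rule in Python, assuming messages are plain dicts with "junk" and "received" fields and that "old" means older than a configurable number of days. Both the representation and the 30-day default are hypothetical.

```python
from datetime import datetime, timedelta

def purge_junk(folder, now, max_age_days=30, enabled=False):
    """Return the messages that remain in the junk folder after a purge."""
    if not enabled:
        return list(folder)  # off by default: false positives + purge == dataloss
    cutoff = now - timedelta(days=max_age_days)
    # Drop only messages that are both marked as junk and older than the cutoff.
    return [m for m in folder if not (m["junk"] and m["received"] < cutoff)]

folder = [
    {"id": 1, "junk": True, "received": datetime(2003, 1, 1)},
    {"id": 2, "junk": True, "received": datetime(2003, 2, 10)},
    {"id": 3, "junk": False, "received": datetime(2003, 1, 1)},
]
kept = purge_junk(folder, now=datetime(2003, 2, 17), enabled=True)
```

Here only message 1 is removed: message 2 is too recent, and message 3 is not marked as junk.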

What's the initial experience?

new user has no training data
junk mail controls enabled, whitelisting by personal address book (PAB), no move to the junk folder (because of false positives)

what's not done?

improve the initial experience
junk mail envelope (screenshot)
on manual analyze, move (if the user has that set up)
on manual mark, move to junk (or delete)
purge should follow IMAP delete model
purge should only purge messages marked as junk (dataloss if trash folder)
bugs (imap: state goes away, pop: fail to move)
removing the junk from the filter UI (should stay in search)

future improvements

hooks for QA (for purge testing: purge every minute, not daily; force spam analysis)
tools to dump training.dat
improved UI (see specs)
could be used for news
generalize (have multiple tables) for things other than "junk or not junk".

thanks

Paul Graham
Initial Mozilla implementation work by beard, bienvenu, dmose, ducarroz, jglick, naving, nhotta, peterv, and sspitzer.