Support for Real Time Black Lists
by Scott MacGregor
Mozilla Mail has made some excellent strides in the fight against unwanted mail (SPAM), particularly with regard to the Bayesian junk mail controls. This technique is extremely effective when the user trains the junk algorithms with positive and negative data. We would like to push the SPAM work even further by adding support for real time black lists (RBLs) to Mozilla Thunderbird Mail. RBLs work differently than the Bayesian junk controls. An RBL is a list of 'known' SMTP servers notorious for sending SPAM. When the user receives a message, we can take the IP address of the originating SMTP server and look it up in a black list. One of the advantages of RBLs is that they don't require any work from the user: there is no training involved. In addition, the lists are 'live' and evolve over time as more sites are identified as senders of SPAM.
There are many sites available on the web that offer RBLs. Many of them are even free. Here are a couple that a brief internet search turned up:
A key component for adding RBL support to Thunderbird Mail is to allow the user to determine which RBLs should be used.
The RBL experience should be as integrated with the current junk mail controls as possible. We will not be introducing a new way to present junk mail to the user, only a new mechanism for determining whether to mark a message as junk. I currently envision a tabbed panel being added to the existing junk mail control dialog, like this. This second panel would be titled "Real Time Black Lists" and would contain a one- or two-sentence description of real time black lists. RBL properties would be configured on a per-account basis. There would then be space for the user to configure the following information:
- One or more DNS servers to use as RBLs, i.e. the user can point us to the server(s) that should be used to look up the sending SMTP servers.
- A DNS RBL server can return various pieces of information, such as that the server is an open relay or that the server is a confirmed spam source. We should allow the user to determine, for each available criterion, whether a message should be marked as junk.
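As a rough illustration of the per-account settings listed above, the following Python sketch models a set of RBL servers, each carrying the user's junk decision for every return value. All class and field names, the zone rbl.example.org, and the meanings attached to the 127.0.0.x answers are hypothetical; each real list defines its own return codes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RblCriterion:
    label: str       # human-readable meaning shown to the user
    mark_junk: bool  # the user's choice for this criterion


@dataclass
class RblServer:
    zone: str  # DNS zone of the black list (hypothetical example below)
    criteria: Dict[str, RblCriterion] = field(default_factory=dict)

    def is_junk(self, answer: Optional[str]) -> bool:
        """answer is the address the list returned, or None if not listed."""
        if answer is None:
            return False
        criterion = self.criteria.get(answer)
        # Assumption: return values the user has not configured are not junk.
        return criterion is not None and criterion.mark_junk


@dataclass
class AccountRblSettings:
    enabled: bool = False
    servers: List[RblServer] = field(default_factory=list)


# Hypothetical configuration: flag confirmed spam sources, ignore open relays.
example = AccountRblSettings(
    enabled=True,
    servers=[
        RblServer(
            zone="rbl.example.org",
            criteria={
                "127.0.0.2": RblCriterion("open relay", mark_junk=False),
                "127.0.0.4": RblCriterion("confirmed spam source", mark_junk=True),
            },
        )
    ],
)
```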
As incoming mail is processed, the user's mail filters are invoked. After that, we will check to see if the sender is in the user's white list. If it is not, then we'll look to see if we need to check with an RBL. If the message comes back as spam, then we will mark the message as junk just like the Bayesian code does today. After checking with the RBLs, if the message is still clean, then we'll fall back to the current Bayesian junk mail logic. The Bayesian junk controls require the body to be downloaded, so we'll do the RBL test first in the hope that we can avoid downloading the message body if we determine it is spam.
The Mozilla Mail team has done a huge amount of work to get Bayesian SPAM detection up and running. They had to solve a lot of hard issues: adding logic to mark messages as SPAM, building a junk mail folder, adding UI for junk mail, adding code for aging messages in the junk folder, etc. Fortunately, we are going to be able to leverage all of this work. All we are doing is introducing a new mail filter plugin which can mark a message as junk just like the Bayesian algorithms. This should make this feature much easier to add than the existing junk mail controls.
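The order of checks described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not Thunderbird code: the function name and the callables passed in are hypothetical stand-ins for the filter, RBL and Bayesian steps.

```python
def process_incoming(sender, whitelist, run_filters, rbl_says_junk, bayes_says_junk):
    """Return (is_junk, origin), applying the checks in the order described."""
    run_filters()                       # 1. the user's mail filters run first
    if sender in whitelist:             # 2. whitelisted senders are never junk
        return (False, "whitelist")
    if rbl_says_junk is not None and rbl_says_junk():
        return (True, "rbl")            # 3. decided from headers alone
    if bayes_says_junk():               # 4. only now is the body needed
        return (True, "bayes")
    return (False, None)


# Usage: record which steps ran for a message the RBL flags as spam.
calls = []
result = process_incoming(
    "spammer@example.net",
    {"friend@example.org"},
    run_filters=lambda: calls.append("filters"),
    rbl_says_junk=lambda: calls.append("rbl") or True,
    bayes_says_junk=lambda: calls.append("bayes") or False,
)
```

Because the Bayesian callable is never invoked once the RBL step reports junk, the body download it would trigger is skipped.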
Here is a brief overview of the work we need to do to make this happen:
Support for multiple junk mail plugins
We currently have an API for a junk mail plugin which allows a plugin to asynchronously report junk mail back to our IMAP store. This needs to be extended to support multiple junk mail plugins, and we need a way to synchronize the plugins. For instance, we don't want to fire the Bayesian junk mail plugin for a message that was marked as junk by the RBL plugin. This may be solved by using the category manager, which I think allows us to enforce an order on the components for a category.
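One way to picture the synchronization problem is a chain in which each plugin reports its verdict through a callback and the next plugin runs only if the previous one found the message clean. The Python sketch below is a stand-in for that idea; it does not use the category manager or the real plugin API, and every name in it is hypothetical.

```python
class FixedVerdictPlugin:
    """Toy plugin that always reports the same verdict (for illustration)."""

    def __init__(self, name, verdict):
        self.name = name
        self.verdict = verdict
        self.seen = []  # messages this plugin was asked to classify

    def classify(self, message, on_result):
        self.seen.append(message)
        on_result(self.verdict)  # a real plugin may call this later


def classify_in_order(plugins, message, on_done):
    """Run plugins in list order; stop at the first one that reports junk."""

    def run(index):
        if index == len(plugins):
            on_done(message, False, None)  # every plugin found it clean
            return
        plugin = plugins[index]

        def on_result(is_junk):
            if is_junk:
                on_done(message, True, plugin.name)
            else:
                run(index + 1)  # hand the message to the next plugin

        plugin.classify(message, on_result)

    run(0)
```

With an RBL plugin placed before a Bayesian plugin in the list, a junk verdict from the first means the second is never fired for that message.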
Fetching the Received Header
If the user turns on the RBL junk mail plugin, IMAP needs to fetch the Received header so the plugin can parse out the IP address of the SMTP server and send it to the RBL DNS server for testing. The IMAP code has a way to specify the list of headers to be fetched from the server when downloading header information for new mail. We need to tap into this process and make sure this header gets downloaded. The advantage of doing it this way is that if the Bayesian controls are turned off and RBLs are turned on, we won't have to download the entire message body in order to do the RBL lookup. The same applies if the RBL lookup determines the message is junk: we won't have to run the Bayesian plugin, so the body won't have to be downloaded. If the IMAP code is now downloading the Received header, then we need a temporary in-memory place to store it so the plugin can fetch it later. We may be able to tap into nsIMsgHdr for this.
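In IMAP terms, fetching selected headers without the body is done with a FETCH item of the form BODY.PEEK[HEADER.FIELDS (...)]. The Python sketch below only builds that item, adding Received when the RBL plugin is enabled. The function name and the base header list are invented for illustration and are not the list Thunderbird's IMAP code actually uses.

```python
# Hypothetical default header set; the real client's list may differ.
BASE_HEADERS = ["From", "To", "Cc", "Subject", "Date", "Message-ID"]


def header_fetch_item(rbl_enabled, base_headers=BASE_HEADERS):
    """Build the IMAP FETCH item naming the headers to download."""
    headers = list(base_headers)
    if rbl_enabled and "Received" not in headers:
        headers.append("Received")  # needed only by the RBL plugin
    return "BODY.PEEK[HEADER.FIELDS (%s)]" % " ".join(headers)
```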
Add an X-Header to the message
The user may wish to build views or do searches based on the results of the RBL lookup. For instance, I should be able to search on all messages that were sent through an open relay. We'll add an X-RBL-Lookup header to the message when we store it locally which contains the status of the RBL lookup.
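A minimal sketch of stamping the stored copy, using Python's standard email package. The header value format ("zone; status") is an assumption, since the proposal does not define one, and the function name is hypothetical.

```python
from email import message_from_string


def add_rbl_header(raw_message, status):
    """Return the message text with an X-RBL-Lookup header set to status."""
    msg = message_from_string(raw_message)
    # Drop any X-RBL-Lookup header the sender may have supplied themselves.
    del msg["X-RBL-Lookup"]
    msg["X-RBL-Lookup"] = status
    return msg.as_string()


raw = "From: someone@example.net\nSubject: hello\n\nbody text\n"
stored = add_rbl_header(raw, "rbl.example.org; open-relay")
```

A search for messages sent through an open relay then reduces to matching on the value of this one header.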
Adding the actual RBL Plugin
Write the junk mail plugin for RBL lookups. The plugin needs to parse the IP address of the SMTP server out of the Received header, generate the query name for DNS resolution from that address, and perform the DNS lookup. Based on the IP address returned and the user's preferences, it marks the message as junk and fills in the X-RBL-Lookup header. If the plugin marks a message as junk, it needs to set the junkscoreorigin to plugin to make sure it does the right thing with regard to training.
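The lookup steps can be sketched as follows. DNS-based black lists are conventionally queried by reversing the octets of the IPv4 address and appending the list's zone; an answer (typically in 127.0.0.0/8) means the address is listed, and a failed resolution means it is not. The function names, the zone rbl.example.org and the simple bracketed-address regex are assumptions for illustration; deciding which Received header to trust is left out.

```python
import ipaddress
import re
import socket

# Matches a bracketed IPv4 literal such as "[192.0.2.25]".
_BRACKETED_IPV4 = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def extract_relay_ip(received_header):
    """Return the first bracketed IPv4 address in the header, or None."""
    match = _BRACKETED_IPV4.search(received_header)
    if match is None:
        return None
    try:
        return str(ipaddress.IPv4Address(match.group(1)))
    except ipaddress.AddressValueError:
        return None  # looked like an address but is not a valid one


def rbl_query_name(ip, zone):
    """192.0.2.25 + rbl.example.org -> 25.2.0.192.rbl.example.org"""
    return ".".join(reversed(ip.split("."))) + "." + zone


def rbl_lookup(ip, zone, resolve=socket.gethostbyname):
    """Return the address the list answers with, or None if not listed.

    Note: socket.gethostbyname blocks, and this sketch treats every
    resolution failure as "not listed".
    """
    try:
        return resolve(rbl_query_name(ip, zone))
    except socket.gaierror:
        return None
```

The resolve parameter exists so that the lookup can be exercised without network access, for example by passing a function that returns a fixed answer.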
Eventually we'll want to tap into the existing junk mail logging with the RBL lookup results.
- Can we pre-populate with several free RBL lists?
- What thread do the DNS lookups need to happen on? Our DNS service runs on its own thread already. Is it asynchronous? If we are running on the UI thread, will it call back into us asynchronously with the result, or are we going to be blocking the UI thread? It would be a bad user experience if the RBL is down and we block the mail UI waiting for DNS resolution to time out.
- Would we add the new X-RBL-Lookup header to the list of custom headers automatically, so it would show in advanced search? If so, we'll want to hide it in the filter dialog (like we do with junk status), since we don't know this header until after filters run.
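On the threading question, the behaviour we want is a lookup that runs off the calling thread and gives up after a bounded wait. The Python asyncio sketch below shows that shape only; it says nothing about how Mozilla's DNS service behaves, and the function name and timeout handling are assumptions.

```python
import asyncio
import socket


async def lookup_with_timeout(query_name, timeout_seconds, resolve=None):
    """Resolve query_name on a worker thread; return None on timeout or failure."""
    if resolve is None:
        resolve = socket.gethostbyname  # blocking call, so keep it off this thread
    loop = asyncio.get_running_loop()
    try:
        return await asyncio.wait_for(
            loop.run_in_executor(None, resolve, query_name),
            timeout_seconds,
        )
    except asyncio.TimeoutError:
        # The list is slow or down: give no verdict rather than wait.
        return None
    except socket.gaierror:
        # Resolution failed, which for an RBL means "not listed".
        return None
```

In this sketch a timeout and a negative answer both come back as None, so the message simply falls through to the next junk check; a fuller version might report the two cases separately for logging.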