Green Mail Filter, flexible rule-driven mail processor

Introduction
Who is it for
Portability
Downloads
Sample uses
   Fan-in mail collector
   Antivirus filter
   Bayesian^* or any other technique en vogue
   Filtering against external spammers blacklist^*
   Challenge/response whitelist^*
   Industrial mail processor
   Generic POP3 server
Ruleset
   How it works
   Samples
   List of special header fields
Performance
Copyright, license and feedback

^*) Default set of rules has these.

Green Mail Filter is an all-purpose mail processor with many possible uses. Green is a multi-user server application that processes users' mail. Although the default installation comes with just one user, it is by no means limited to that. Green can be thought of as a platform for developing mail filters.

In short, Green collects users' mail messages from somewhere, runs each received message through a set of rules (a ruleset) and finally puts the message into one of the user's mailboxes. The primary use for such behaviour was separating legitimate mail (default "mail" mailbox) from spam (default "spam" mailbox).

Green collects each user's mail from the external POP3 accounts configured for that user. There is also a separate spooling directory (.spool) for each user, and all the files that appear there get collected as well. Spooling messages as plain files can be useful in an environment where mail servers are already in place and filtering needs to be added.

The processing of the collected messages is fully determined by the rules making up the user's ruleset. Therefore the Green filter is only as good as its ruleset. See "How it works" under Ruleset for a detailed explanation.

When message processing is complete, the message is put into one of the user's mailboxes (each mailbox is a subdirectory in the user's directory). Green also contains a POP3 server, so the user can connect to it and read her mail as usual.

To start using Green, please follow these minimum steps:
1. Download and install. Among other things, the installer prompts for a username for the user account being created. Let's say you enter jsmith here.
2. The Green core is installed as a system service and starts immediately.
3. Start the Green Mail Filter manager application.
4. Under Users/jsmith configure the POP3 password you'd like to use with Green's internal mail server. Let's say you enter mypass here.
5. Under Users/jsmith/Accounts, create and configure one or more records for external POP3 accounts. These are the accounts mail is collected from.
6. Start your mail client (Outlook, TheBat, whatever) and configure it to use the POP3 server at 127.0.0.1:110 (this is Green's builtin mail server). Set the login information to jsmith (username) and mypass (password).
7. Check your mail.

One thing to note is how several mailboxes ("mail", "spam", etc.) can be read through the single POP3 account created for a single user at 127.0.0.1:110. Here is how: if the user logs in to Green's POP3 server using her "username" and "password", she gets a view of the default "mail" mailbox. If the user wants to read another mailbox (e.g. "spam"), she logs in using a special username, "username/spam", and the same "password". In other words, Green's POP3 server accepts usernames of the form "username/mailbox", where the mailbox name defaults to "mail".
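As a rough illustration, the sketch below uses Python's standard poplib module to count the messages in a given mailbox of jsmith from the steps above. The helper names green_login and count_messages are made up for this example, and the host, port and credentials are simply the ones assumed in the steps.

```python
import poplib

def green_login(username, mailbox="mail"):
    """Build the POP3 login name; the plain username selects the default "mail" mailbox."""
    if mailbox == "mail":
        return username
    return "%s/%s" % (username, mailbox)

def count_messages(username, password, mailbox="mail", host="127.0.0.1", port=110):
    """Log in to Green's POP3 server and return the number of messages in the mailbox."""
    conn = poplib.POP3(host, port)
    try:
        conn.user(green_login(username, mailbox))
        conn.pass_(password)
        count, _total_size = conn.stat()
        return count
    finally:
        conn.quit()

# Example (needs a running Green server):
# count_messages("jsmith", "mypass", "spam")
```

Any ordinary mail client does the same thing when you type jsmith/spam into its username box.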

Green is likely to be used by an enthusiastic e-mail user who has significant knowledge of how e-mail works. The more e-mail and networking knowledge, the better. Users who can program will benefit from fully exploiting Green's potential.

Developers can use Green as a platform for building arbitrary mail filters. As Green rulesets can use Python scripts, knowledge of this wonderful programming language is certainly a plus.

E-mail providers can install Green on the server side and give their customers additional value by configuring mail filtering in any way they like.

Companies that need to process lots of e-mail can use Green as a generic mail processor.

Green runs under Windows 2000 or better.

Green contains three main modules: the server core (green.exe), the Win32 service wrapper (greensvc.exe) and the GUI control panel (greenmgr.exe). The server core is written in portable C++, therefore it can be more or less easily ported to Unix. The service wrapper has no meaning under Unix, and so it need not be ported. The GUI control panel is written in Delphi and so it'd be problematic to port. On the other hand (1) the control panel is not a required component of the server, which is fully configurable with standalone XML files, and (2) it's theoretically possible to run the GUI control panel on one machine and have it control the server core running on another.

Server core should run under Windows 98 or better, but because it's installed as a service, it's only reasonable to run it under Windows 2000 or higher. Besides, extensive tests were not performed under Windows 98 at all.

Binary installation package (Windows 2000 or higher, requires administrative rights to install):

Version 1.4, released Sept 13, 2005 (changes)
Installer (~3.8 MB): green-1.4.exe (same thing, zipped: green-1.4.zip)
Installer signature: green-1.4.exe.sig

Listed below are a few sample applications for Green filter. As each is nothing but a few rules in a ruleset, they can obviously be mixed and matched in any imaginable way. Some of these sample rulesets come bundled with the installation in the "samples" subdirectory and you can try them out using "Import" popup menu item in the GUI control panel.

1. Fan-in mail collector:

This is the simplest possible application. Green collects the user's mail from one or more external POP3 accounts as well as from the spooling directory and puts it all into the "mail" mailbox, so that the user can fetch it all in one place from Green's POP3 server. Not much filtering takes place, though.

2. Antivirus filter:

If you install a 3rd party antivirus, possibly a free one that comes with a command line scanner, it's easy to set up a rule which would execute this scanner against each received message and filter appropriately.

3. Bayesian (included with default ruleset) or any other technique en vogue:

For a developer or an enthusiastic user with programming knowledge, Green could be a solid prototyping/production ground. As Green's rules can contain arbitrary Python code, there is no limit on what can be done. Want it Bayesian? Want path analysis? Want a database of patterns? Want to try any other brand new idea? Green serves as a generic filtering platform.

4. Filtering against external spammers blacklist (included with default ruleset):

Piece of cake. As soon as there is a public API to such a system, several lines of Python should be enough to connect to it and make a filtering judgement.

5. Challenge/response whitelist (included with default ruleset):

The idea behind whitelisting is having a list of known e-mail senders, the only ones you expect e-mail from. An automated whitelist is similar, but it attempts to distinguish people sending mail to you from spam machines. The separation is done by challenging each unknown sender with a request to prove she's not a machine. If she is not a machine, she reads the challenge message the filter sends back to her and follows the instructions included in it. On the other hand, no one ever reads responses to spam, and so spam is filtered out.

Details vary (change the ruleset and you change them), but by default, in its reply message Green asks your peers to send you a single mail message, once, with a subject line that contains a short random number, e.g. 333. Receiving such a message proves that the author has read the challenge and acted upon it, hence she's a human. Once such a message is received, the sender is permanently whitelisted.

Cool as they may sound, automated whitelists have so many downsides to them, right to the point at which you wouldn't want to use them. Although I'm not very much against the challenge/response filtering, you may check the following links to see why people strongly believe it's bad:

http://kmself.home.netcom.com/Rants/challenge-response.html
http://tardigrade.net/challengeresponse.html
http://richi.co.uk/blog/2005/05/why-challengeresponse-is-bad.html
http://www.businessweek.com/magazine/content/03_27/b3840044.htm

Just to make things worse, the idea has fallen victim to patent madness. The challenge/response technology is covered by several US patents, and although a handful of products implement it (see http://spamlinks.net/filter-cr.htm) and a few suits have been filed, the outcome is unclear.

Keep in mind that with Green you are not limited to any given set of rules, it can do anything. Doing even such a complex thing as challenge/response is a piece of cake, in fact it's a single rule. You may think of the challenge/response whitelist as a mere example of the expressive power of Green rulesets. Whitelisting is just one of the spam fighting ideas, and if you come up with a better one, go right ahead, drop in a rule or two and away you go!

6. Industrial mail processor:

Automatically processing large volumes of incoming mail is a task that frequently appears in different industries. Using Python scripts you can make Green connect to anything you have and use it for any purpose - maintaining a mailing list, verifying digital signatures, sorting support mail, and so on and so forth. As Green collects messages that are spooled as regular files, it's easy to send it mail for processing too (although you might want to decrease the spool reading interval in the server configuration).
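Feeding Green through the spool directory could look roughly like the sketch below. It is only a sketch under assumptions: the function name, the staging directory and the file naming are all made up here, and the write-then-move step is a general precaution against half-written files being picked up, not something Green is known to require. Both directories should live on the same volume so that the move is a single step.

```python
import os
import tempfile

def spool_message(spool_dir, staging_dir, raw_message):
    """Write one raw message (bytes) into the spool directory, return the file path."""
    # write the whole message under a temporary name outside the spool first
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(raw_message)
    # then move it into the spool in one step
    final_path = os.path.join(spool_dir, os.path.basename(tmp_path))
    os.replace(tmp_path, final_path)
    return final_path

# Example with throwaway directories standing in for a user's .spool:
spool = tempfile.mkdtemp()
staging = tempfile.mkdtemp()
path = spool_message(spool, staging, b"Subject: test\r\n\r\nhello\r\n")
```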

7. Generic POP3 server:

Finally, Green can be used as just a POP3 server, as it in fact contains one. Anything that gets into the users' mailboxes is served out via POP3. No filtering at all here.

How it works:

The ruleset is the heart of Green Mail Filter message processing.

A ruleset is an ordered collection of rules. Rules are applied to each message one by one, starting with the first rule and proceeding to the next, until some rule dispatches the message to a particular mailbox or the end of the ruleset is reached.

Each rule has a textual (and otherwise meaningless) description, which is only useful for a user reading the ruleset. Each rule can be enabled or disabled. Disabled rules are obviously skipped. Each rule can also have an expiration date, so that a ruleset does not get clogged with temporarily created rules. Expired rules are silently dropped from the ruleset.

Each rule contains one or more matches (something to match messages against) and one or more actions (applied to a matched message). A rule processes a message as follows. First, the message is shown to each match one by one, starting with the first. Given a message, a match examines it and returns either match or no match. If any of the matches doesn't match the message, the rule processing terminates and the server proceeds to the next rule in the ruleset. If all of the matches did match the message, the rule is said to be activated and all of its actions are applied to the message one by one, starting with the first. Each action does whatever it feels necessary and returns the name of the mailbox to put the message into, or an empty name if it makes no such decision. As soon as some action returns a non-empty mailbox name, the rule activation is considered complete and its further actions are not executed at all. If all actions return empty mailbox names, the server proceeds to the next rule, hoping it can make a decision. If on the other hand the rule returns a mailbox name, the message is put into that mailbox and processing stops.
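The control flow just described can be modelled in a few lines of Python. This is only a toy model for illustration, not Green's actual code: rules are plain dictionaries, matches and actions are plain functions, and what the server does when the end of the ruleset is reached is not modelled (an empty name is returned).

```python
def apply_rule(rule, headers):
    """Return a mailbox name, or "" if the rule makes no decision."""
    # every match must accept the message, checked in order
    for match in rule["matches"]:
        if not match(headers):
            return ""
    # the rule is activated: run actions until one names a mailbox
    for action in rule["actions"]:
        mailbox = action(headers)
        if mailbox:
            return mailbox
    return ""

def apply_ruleset(ruleset, headers):
    """Walk the rules in order and return the first mailbox decided upon."""
    for rule in ruleset:
        if not rule.get("enabled", True):
            continue  # disabled rules are skipped
        mailbox = apply_rule(rule, headers)
        if mailbox:
            return mailbox
    return ""  # end of ruleset reached without a decision

ruleset = [
    {"matches": [lambda h: "stuff" in h.get("subject", "")],
     "actions": [lambda h: "", lambda h: "spam"]},
    {"matches": [lambda h: True],            # wildcard match
     "actions": [lambda h: "mail"]},
]

print(apply_ruleset(ruleset, {"subject": "Get great stuff for a great price."}))
```

Note how the first action of the first rule makes no decision, so the second one gets its turn.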

There are three kinds of matches that can be used in rules. The simplest kind matches any message, and as such is useful for wildcard rules that are applied to all messages. The second kind is a regular expression match. It compares the values of the specified message header fields against the specified regular expressions and matches if they do. The third and most flexible kind is a Python script match. It's really an arbitrary snippet of Python code which is executed in the context of the message being processed. What it does is up to its designer, it can do anything, and it matches as soon as it sets result = "match".

Similarly, there are three kinds of actions. The simplest kind does exactly what actions are for and nothing else: it simply returns the specified mailbox name. The second kind is a Python script action. Again, it's a piece of Python code which can do anything and then possibly set result = "mailbox_name". Finally, the third kind is the thing that allows Green to do all sorts of magic: it's an action that modifies the ruleset at runtime, i.e. adds or removes rules on the fly.

The simpler matches and actions are best understood from the samples. The most advanced action, the ruleset modifying action, is described here. Such an action has a child rule in it. The child rule remains in a latent state and never has a chance to execute. Whenever the action is activated, it inserts a copy of the child rule somewhere in the ruleset: at the top of the ruleset, before the current rule, after the current rule, or at the end of the ruleset. At this moment the child rule is hatched and becomes a real rule, a part of the ruleset. Note the most important thing: the current rule (containing the action being activated) is already executing, and before it started executing, all of its textual parts were adapted to the current message, i.e. {{header-field-name}} entries were replaced with the appropriate values. What this really means is that all the subordinate rules of all the actions, on and on down to the very bottom of the subordinate rules tree, have been adapted too. Therefore the copy of the subordinate rule being inserted is a copy adapted to the message being processed, not a copy of the unexpanded original.
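To make the hatching of a child rule more tangible, here is a toy model in Python. It is a sketch under assumptions, not Green's implementation: a rule is a dictionary with just a description, one regular expression and an optional child, unknown fields expand to an empty string, and no escaping of the inserted values is attempted.

```python
import copy
import re

MACRO = re.compile(r"\{\{([^{}]+)\}\}")

def expand(text, headers):
    """Replace {{field-name}} entries with the values from the message header."""
    return MACRO.sub(lambda m: headers.get(m.group(1).lower(), ""), text)

def adapt(rule, headers):
    """Return a copy of the rule with all textual parts, child rules included, expanded."""
    adapted = copy.deepcopy(rule)
    adapted["description"] = expand(rule["description"], headers)
    adapted["regex"] = expand(rule["regex"], headers)
    if rule.get("child") is not None:
        adapted["child"] = adapt(rule["child"], headers)
    return adapted

def hatch_child(ruleset, index, headers, where="after"):
    """Insert the adapted child of ruleset[index] at the requested position."""
    child = adapt(ruleset[index], headers)["child"]
    position = {"top": 0, "before": index, "after": index + 1, "end": len(ruleset)}[where]
    ruleset.insert(position, child)
    return child

headers = {"x-green-parsed-from-mailbox": "believe.me@great.stuff"}
parent = {
    "description": "whitelist whoever answers the challenge",
    "regex": ".*",
    "child": {
        "description": "allow {{x-green-parsed-from-mailbox}}",
        "regex": "{{x-green-parsed-from-mailbox}}",
        "child": None,
    },
}
ruleset = [parent]
hatched = hatch_child(ruleset, 0, headers, where="top")
```

The latent child inside the parent keeps its {{...}} entries, while the hatched copy at the top of the ruleset carries the values of this particular message.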

Here is one more advanced issue. Whenever Green reads mail from external POP3 servers, it does so in two steps. First, only the header of the message is fetched with the TOP command. This header is parsed and passed to the ruleset exactly as described above. Many times the filtering decision can be made based on the header only, and many times you wouldn't even want to fetch the rest of the message once you see its header (e.g. worms and such). This makes the first filtering pass. If the first pass decides to dispatch the mail to a mailbox, the message is fetched with the RETR command and just put into the already known target mailbox. If the name of the mailbox returned from the first filtering pass is "<delete>", the message is physically deleted on the server with the DELE command without fetching.

But there are also many times when filtering can only be carried out upon the message body. If this is the case, some rule should return "<contents>" as the mailbox name. This is another special "mailbox name", and it means: fetch the message body and make a second pass through the ruleset. If "<contents>" is the result of the first filtering pass, the message body is fetched with the RETR command, and filtering starts again from the top of the ruleset. This makes the second filtering pass. By that time the entire message with header and body physically exists in a local file, and its name is added to the header as a special header field "x-green-message-filename" so that matches and actions can act upon it. Note that only Python matches and actions receive that field, because only they are capable of doing anything with the file.
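The two passes can be sketched as follows. Everything here is hypothetical scaffolding: server is any object with top, retr and dele methods (FakeServer below just records the commands), run_ruleset stands for the whole ruleset, and save stands for whatever writes the fetched message to a local file. The case of a first pass that makes no decision at all is not modelled.

```python
def filter_pop3_message(server, number, run_ruleset, save):
    """Two-pass filtering of message `number`; returns the final mailbox name."""
    headers = server.top(number)                  # pass 1: header only (TOP)
    mailbox = run_ruleset(headers)
    if mailbox == "<delete>":
        server.dele(number)                       # dropped without fetching the body
        return mailbox
    filename = save(server.retr(number))          # RETR fetches the full message
    if mailbox == "<contents>":
        # pass 2: start again from the top, now with the local file name
        headers = dict(headers)
        headers["x-green-message-filename"] = filename
        mailbox = run_ruleset(headers)
    return mailbox

class FakeServer:
    """Stand-in for a POP3 connection, records the commands issued."""
    def __init__(self, headers, body):
        self.headers, self.body, self.log = headers, body, []
    def top(self, number):
        self.log.append("TOP")
        return dict(self.headers)
    def retr(self, number):
        self.log.append("RETR")
        return self.body
    def dele(self, number):
        self.log.append("DELE")

def sample_ruleset(headers):
    """Asks for the body on the first pass, decides on the second."""
    if "x-green-message-filename" not in headers:
        return "<contents>"
    return "spam"

server = FakeServer({"subject": "Get great stuff"}, "full message text")
print(filter_pop3_message(server, 1, sample_ruleset, lambda text: "message.tmp"))
```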

Here are a few samples of both matches and actions:

Consider this mail message as an example:

Received: from some.fake.name(fakedsl-123-45-67-89.fake.name [123.45.67.89])
  by hut.user.com (8.12.10/8.12.10) with SMTP id i8FGu6Wc033206
  for <innocent@user.com> Wed, 15 Sep 2004 10:56:10 -0600 (MDT)
Received: from 98.76.54.32 by smtp.fake.name;
  Wed, 15 Sep 2004 16:56:27 +0000
From: "Believe Me" <believe.me@great.stuff>
To: innocent@user.com
Subject: Get great stuff for a great price.

We have a new offer for you. Buy cheap stuff through our online store.

- Private online ordering
- World wide shipping

Order your stuff offshore and save over 70%!

Best regards,
Bad Businessman

This message is matched with the following header field value matches (see the GUI control panel application):

Regular expression match example #1:

Header field value(s) to match: from
Regular expression to match field value against:
.*Bel.*

Notes: the header field name can be given in lower case, upper case or mixed case, it makes no difference. The regular expression syntax to conform to is described here.

Regular expression match example #2:

Header field value(s) to match: x-green-parsed-from-mailbox
Regular expression to match field value against:
believe.me@great\.stuff

Notes: before the message is passed to the ruleset, its header is decorated with a number of extra header fields that simplify processing. All such extra fields begin with "x-green-". See the list of special header fields for all of them. In this example x-green-parsed-from-mailbox is used, and it's preset to the mailbox-only part of the From address. If bare "From" were used instead, a message with

From: "Gotcha: believe.me@great.stuff" <fooled@you.com>

would have matched, and that would be a bad thing.
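The difference is easy to reproduce outside Green. The sketch below uses Python's standard email.utils.parseaddr as a stand-in for Green's own address parsing, and a plain substring search as the matching rule; both are assumptions made for the sake of the example.

```python
import re
from email.utils import parseaddr

PATTERN = re.compile(r"believe.me@great\.stuff")

def matches_raw(from_value):
    """Search the whole From value, display name included."""
    return PATTERN.search(from_value) is not None

def matches_mailbox(from_value):
    """Search only the address part, similar in spirit to x-green-parsed-from-mailbox."""
    _name, address = parseaddr(from_value)
    return PATTERN.search(address) is not None

honest = '"Believe Me" <believe.me@great.stuff>'
forged = '"Gotcha: believe.me@great.stuff" <fooled@you.com>'

for value in (honest, forged):
    print(value, matches_raw(value), matches_mailbox(value))
```

The forged display name fools the raw match, but not the one that looks at the parsed address only.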

Regular expression match example #3:

Header field value(s) to match: x-green-parsed-from-mailbox
Regular expression to match field value against:
{{x-green-parsed-from-mailbox}}

Notes: before a rule is applied to a message, all the macro entries {{field-name}} anywhere in the textual parts of the rule are replaced with the values of the corresponding fields of this particular message. In this case, before the rule is invoked, its regular expression becomes "believe.me@great\.stuff". Also note that such a match matches any message, as in A={{A}}.

Regular expression match example #4:

Header field value(s) to match: subject;from
Regular expression to match field value against:
.*stuff.*
.*great.*
If there is more than one field occurrence, each must match: true

Notes: more than one header field can be specified, as well as more than one regular expression. The behaviour of such a match then depends on the "each must match" flag: if it's set, each header field value must match each of the regular expressions...

Regular expression match example #5:

Header field value(s) to match: subject;from
Regular expression to match field value against:
.*Believe.*
If there is more than one field occurrence, each must match: false

... and if it's not set - any value may match any regular expression.
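The effect of the flag in examples #4 and #5 can be mimicked in plain Python. This is a simplified model resting on assumptions: each header field has a single value, patterns must match the whole value (hence the leading and trailing .* in the examples), and matching is case-sensitive.

```python
import re

def regex_match(headers, fields, patterns, each_must_match):
    """Evaluate a multi-field, multi-pattern match over the listed header fields."""
    values = [headers[name] for name in fields if name in headers]
    results = [re.fullmatch(p, v) is not None for v in values for p in patterns]
    if not results:
        return False
    # flag set: every value must match every pattern, otherwise any pair will do
    return all(results) if each_must_match else any(results)

headers = {
    "from": '"Believe Me" <believe.me@great.stuff>',
    "subject": "Get great stuff for a great price.",
}

example_4 = regex_match(headers, ["subject", "from"], [".*stuff.*", ".*great.*"], True)
example_5 = regex_match(headers, ["subject", "from"], [".*Believe.*"], False)
print(example_4, example_5)
```

With the flag set, example #5 would stop matching, because the subject does not contain "Believe".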

Python match example #1:

from random import randint
if randint(0, 1) == 0:
  result = "match"
else:
  result = ""

Notes: this matches messages at random. Hardly useful, except for load balancing or something.

Python match example #2:

if "received" in headers:
  for i in range(headers["received:count"]):
    if headers["received:%d" % i].find("trusted.com") >= 0:
      result = "match"
      break
  else:
    result = ""
else:
  result = ""

Notes: this is the "right" way to enumerate multiple values in the header.

Python match example #3:

assert headers.get("from", "") == "{{from}}"
result = ""

Notes: both ways of accessing the header can be used, although headers["name"] is highly preferred. Always return "match" or an empty string. If the result remains None, you get a "None" mailbox.

Python action example #1:

if headers["subject"].startswith("Spam"):
  result = "spam"
else:
  result = "" # who knows

Notes: Actions return mailbox name. An empty mailbox name means "no decision".

Python action example #2:

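if "x-green-message-filename" in headers:
  # second pass: the full message is available in a local file
  if open(headers["x-green-message-filename"]).read().find("VIRUS") >= 0:
    result = "quarantine"
  else:
    result = "" # looks clean, no decision
else:
  # first pass: only the header is known, ask for the body
  result = "<contents>"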

Notes: "x-green-message-filename" is yet another special header field, which is set to the full file name of the message being processed. You can create your own mailboxes at will. <contents> is a special "mailbox name": a message put into that mailbox is fetched from the originating server in full for filtering, not just its headers. Use this when you need to filter based on the message body.
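The same pattern extends to a real command line antivirus scanner. The sketch below is hypothetical throughout: scan_action is a made-up name, the scanner command is whatever your antivirus provides, and the convention that exit code 0 means "clean" is an assumption you should check against the scanner's documentation. In a Green action you would assign the value to result rather than return it from a function.

```python
import subprocess
import sys

def scan_action(headers, scanner_cmd):
    """Return "quarantine" if the scanner flags the message file, "" if it is clean."""
    filename = headers.get("x-green-message-filename")
    if filename is None:
        return "<contents>"       # first pass: ask for the message body
    # assumption: the scanner exits with 0 for a clean file, non-zero otherwise
    exit_code = subprocess.call(list(scanner_cmd) + [filename])
    return "quarantine" if exit_code != 0 else ""

# Stand-ins for a scanner, built from the Python interpreter itself:
always_infected = [sys.executable, "-c", "import sys; sys.exit(1)"]
always_clean = [sys.executable, "-c", "import sys; sys.exit(0)"]

print(scan_action({"x-green-message-filename": "message.tmp"}, always_infected))
```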

Python action example #3:

from shutil import copy
if "x-green-message-filename" in headers:
  copy(headers["x-green-message-filename"], headers["x-green-user-copy-to-this-directory"])
  result = "<delete>"
else:
  result = ""

Notes: <delete> is a special "mailbox name": a message put into that mailbox is silently deleted. Header fields whose names begin with "x-green-user-" are user-specific and are configured on the user configuration page in the GUI manager.

Python action example #4:

db_connection = shared_state["db_connection"]
db_connection.execute("insert into messages (id) values(\"%s\")" % headers["x-green-message-id"])
result = "" # no decision, the message is only recorded

Notes: x-green-message-id is a globally unique value that can be used for identifying messages. By a strange coincidence it's also the name of the file the filtered message is put into at the end of the game. shared_state is the global dictionary available to all the scripts, and its contents are preserved across script invocations. shared_state keeps its state until the server is stopped and is useful for storing stuff for later use. Other than shared_state, all the execution context is cleared between script invocations.
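The example above takes the connection from shared_state for granted. One way it could get there is lazy initialization on first use, sketched below. The assumptions are many: sqlite3 is used only because it ships with current Python, the in-memory database and the table are made up, and the code is wrapped in a function with a plain dictionary standing in for shared_state so that it can be run outside Green.

```python
import sqlite3

def record_message(headers, shared_state):
    """Body of a Python action: log the message id, reusing one connection."""
    db = shared_state.get("db_connection")
    if db is None:
        # first invocation since the server started: open and remember it
        db = sqlite3.connect(":memory:")
        db.execute("create table if not exists messages (id text)")
        shared_state["db_connection"] = db
    # a parameterized query avoids quoting problems in the id
    db.execute("insert into messages (id) values (?)", (headers["x-green-message-id"],))
    db.commit()
    return ""  # no decision about the mailbox

shared_state = {}  # stands in for the dictionary Green provides
record_message({"x-green-message-id": "a1"}, shared_state)
record_message({"x-green-message-id": "b2"}, shared_state)
```

The second call finds the connection already in place, which is exactly what surviving across script invocations buys you.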

Here is the list of all the special fields added to the message header before passing to the ruleset:

x-green-username:
The name of the current message recipient user.

x-green-quoted-header:
A list of "> "-prefixed lines of the message header. Useful for replies and quoting.

x-green-expires-...:
A string with ISO date which is "..." in the future. See the GUI control panel for the possible values of "...".

x-green-message-size:
An estimated size of the message in bytes. It is available on the first filtering pass too, even though message body hasn't been fetched yet. The output from the "LIST" POP3 command is used for this estimate.

x-green-random-short, x-green-random-long:
Strings with random numbers, short is in range 0-999 and long is in range 0-4294967295.

x-green-message-id:
Globally unique message identifier. Contains high precision time, message header hash and random component extracted from a GUID. Useful for tracking messages between rule invocations, state saving and such. It is also the name of the file where the message is finally put after filtering.

x-green-message-filename:
This is only passed to the Python matches/actions on the second pass filtering and it contains a full name of the temporary file the current message is stored in. The file contains the copy of the message including headers and everything. You should not delete or rename the file.

x-green-training-mailbox:
This is only set on a manual training pass and contains the name of the mailbox the user has selected as the "true" message destination mailbox. Trainable rules are the only ones to care.

x-green-parsed-...:
Parsed mailbox list. "..." is any other header field name, but only a few can actually be parsed - "from", "to", "cc", "sender" and "reply-to". For these, this parsed value contains strings like "mailbox-list(mailbox(local-part(xljxuygdkhu)domain(yahoo.com)))", for the rest it contains just a copy of the original field and can be ignored.

x-green-parsed-...-mailbox:
Similar to x-green-parsed-... but takes parsing one step further i.e. for the parseable fields "from", "to", "cc", "sender" and "reply-to" it contains valid mail addresses, ex. xljxuygdkhu@yahoo.com.

x-green-user-...:
Such strings contain arbitrary user-specific configuration parameters that can be edited on the user configuration page and reside in user.xml. Ex: x-green-user-canonical-mailbox contains the name of the user known to the SMTP server which can be used for sending mail.

x-green-...:
All the server configuration options that can be edited on the options configuration page and reside in config.xml get passed as x-green-... strings. Ex: x-green-smtp-address is the global setting for the outgoing SMTP server.

Performance of such a complex system is always difficult to measure, and with Green it certainly depends on the particular ruleset. The results of load testing should give at least some idea.

The load tests were run on Pentium 4, 3.2 GHz, 1G RAM, 2 striped IDE HDD, under Windows 2003, with certain NTFS performance tweaks applied.

The load test config had 500 users with 50000 different real life mail messages in their mailboxes in total, 350M in total size. All users had the same short static ruleset containing a few regular expressions and a few Python scripts. All activity for all users repeated in 5 minute cycles: each user logs in at a random moment once in 5 minutes, reads mail, filters it and puts it back into some other user's mailbox, so that it keeps looping forever. If my math is right, such a test matches the load of a real life mail server servicing at least 10 times as many users.

The server quickly became saturated. It predictably ate all the CPU it could and about 400M of memory. Then disk I/O became the bottleneck, and the server was spending about 30% of its CPU time in the kernel. This test was even more likely to be I/O bound because of the particular ruleset. As noted before, it's essentially empty, with a regular expression match here and there and a few Python scripts several lines long, i.e. no significant CPU processing per message.

Anyhow, the server was filtering messages at a sustained rate of slightly over 100 messages per second, and it kept running like that for days, filtering millions of messages.

Version 1.4, Sept 13, 2005:
Has GUI fixed. Fixed an issue with outgoing client POP3 connections again, this time it's about having an external POP3 server down, the POP3 client code kept retrying again and again. Now it's one attempt per configured timeout.

Version 1.3, Sept 12, 2005:
Fixed an issue with multiple outgoing client POP3 connections initiated at the same time to the same server/account (minor threading issue). This build has management console GUI completely broken :(

Version 1.2, Sept 10, 2005:
Minor GUI fixes concerning management console with large fonts.

Version 1.1, Aug 10, 2005:
Default ruleset is no longer empty/passthrough, now it contains rules that implement blacklisting + bayesian + challenge-response filtering.

Version 1.0, Jul 31, 2005:
Initial release (the 2003 early prototype release doesn't count).

Green is developed and written by Dmitry Dvoinikov <dmitry@targeted.org>
(c) 2003-2005 Dmitry Dvoinikov
http://www.targeted.org/

Portions of this software:
Python (c) 1991-1995 Stichting Mathematisch Centrum, CNRI
LibXML library (c) 1998-2003 Daniel Veillard
Boost C++ libraries (c) Contributors
Boost Regex library (c) 1998-2003 Dr. John Maddock
SSLeay suite (c) 1995-1998 Eric Young

See LICENSE for more information

Use this forum to talk about Green, or you can mail the author at dmitry@targeted.org.