During my time as a sysop and later as CTO I had quite a few e-mail servers under me: more than 50, in fact. These servers were not standalone; they passed e-mails on to each other. We designed the system to avoid bottlenecks and to be easily extendable. It fully met expectations in this respect: it was very easy to plug additional nodes into the system.
There was, however, another aspect where the system was not so great. As a matter of fact, I haven’t seen any e-mail system that was great at this. What I mean is tracking e-mails and debugging problems. Having to SSH into even a fraction of that many servers, or reading logs from that many servers, is a really, really painful way to do it.
Most of the time, tracing mails was easy: find out which outer mail server received the e-mail in question and determine the cause of rejection. In other cases, however, we had to literally hunt the e-mail all across the system just to find out that it was delivered fine into the user’s spam folder.
Another issue is monitoring such a system. Just checking that port 25 is open and responds to requests isn’t really enough. Checking that a test e-mail gets delivered to the right place doesn’t quite cut it either. You need to be sure it takes the proper route through the system, taking into account that it may randomly pick one of several paths.
What we need is a place where the message we send for testing is reliably recorded. There are exactly two places where this information can be found: in the mail’s Received headers and in the mail logs. The former isn’t a very good choice, because if the mail is lost due to an error, we have no headers to look at and no clue what went wrong. The latter, however, is promising.
First of all, aggregate all logs on a central server. Since syslog does a marvelous job at this, it’s the perfect candidate. Next, write an application that skims these logs for e-mails to trace. For this to work, we need to tag the e-mail with something unique, like a return path or a subject. This unique identifier of course has to go into the logs, so configure your mail servers accordingly.
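The log-skimming step could be sketched roughly as follows. This is only an illustration under stated assumptions: the log line layout, the `trace-id=` and `status=` fields, and the host names are all made up for the example and do not correspond to any particular mail server's log format. You would adapt the pattern to whatever your own servers actually write.

```python
import re
from collections import defaultdict

# Hypothetical log line layout (an assumption, not any real MTA's format):
#   2024-03-01T10:15:02 mx1 mailer: trace-id=probe-42 status=relayed
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<host>\S+)\s+\S+:\s+.*?"
    r"trace-id=(?P<trace>\S+)\s+status=(?P<status>\S+)"
)

def collect_hops(lines):
    """Group log events by trace id, sorted by timestamp."""
    hops = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # line does not belong to a tagged message
        hops[m["trace"]].append((m["ts"], m["host"], m["status"]))
    for events in hops.values():
        events.sort()  # ISO 8601 timestamps sort lexicographically
    return dict(hops)

# Made-up sample of aggregated log lines, deliberately out of order.
sample = [
    "2024-03-01T10:15:03 filter1 mailer: trace-id=probe-42 status=relayed",
    "2024-03-01T10:15:02 mx1 mailer: trace-id=probe-42 status=relayed",
    "2024-03-01T10:15:02 mx1 mailer: connection from unknown",
    "2024-03-01T10:15:05 store3 mailer: trace-id=probe-42 status=delivered",
]
path = [host for _, host, _ in collect_hops(sample)["probe-42"]]
print(path)
```

The result is the ordered list of hosts that logged the tagged message, which is the raw material for everything that follows.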
Now comes the hard part. We need to set up a database of correct paths for certain types of mail in our system. Depending on the purpose, this can be locally delivered mail, forwarded mail, etc. If we need to inspect a certain failure, we need a GUI to do so. I did a quick mock-up of what such an interface could look like.

The blue color shows the actual path the e-mail has taken. The blue-gray shows which way it should have gone, and the blue-red shows where it went wrong. The red node signals that the mail was incorrectly dropped. The individual parts should of course be clickable and show what happened and when, perhaps along with the part of the log where the event was recorded. Since this is only a sketch, there is a lot left to do; production use needs a more refined approach.
As we now have a full map of our system, we can safely write a check on the paths that a mail takes. If a check mail deviates from its designed path or gets stuck, the monitoring system could alert the sysop on duty.
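Such a check might look something like the sketch below. The route table, the host names, and the three result labels are assumptions made up for illustration; a real path database would be richer and would likely distinguish mail types.

```python
def check_path(actual, allowed_paths):
    """Classify an observed hop sequence against the allowed routes.

    Returns "ok" if the path equals an allowed route, "in_transit" if it
    is a proper prefix of one (the mail may still be moving, or may be
    stuck), and "deviated" otherwise.
    """
    actual = list(actual)
    if any(actual == route for route in allowed_paths):
        return "ok"
    if any(route[:len(actual)] == actual for route in allowed_paths):
        return "in_transit"
    return "deviated"

# Hypothetical routes for locally delivered mail; host names are made up.
LOCAL_DELIVERY = [
    ["mx1", "filter1", "store3"],
    ["mx1", "filter2", "store3"],
]

print(check_path(["mx1", "filter2", "store3"], LOCAL_DELIVERY))  # ok
print(check_path(["mx1", "filter1"], LOCAL_DELIVERY))            # in_transit
print(check_path(["mx1", "store3"], LOCAL_DELIVERY))             # deviated
```

A monitoring job built on this would presumably alert on "deviated" right away, and on "in_transit" only once the check mail has been sitting at the same hop longer than some timeout, which is how "stuck" would be detected.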
The solution would be similar when tracing a user mail. Search for incoming mails on the MX machines, then ask the system to trace the e-mail using the same GUI.
To sum it all up: sexy monitoring interfaces are important to make work a pleasure. Easy-to-use ones are even more important. If a sysadmin has to spend long minutes locating a possibly critical outage, that’s just plain unacceptable with business-critical servers. So do take the time to implement thorough e-mail monitoring; don’t just watch port 25.