A Bayesian Approach to Filtering Junk E-mail

Mehran Sahami, Susan Dumais, David Heckerman, and Eric Horvitz



In addressing the growing problem of junk E-mail on the Internet, we examine methods for the automated construction of filters to eliminate such unwanted messages from a user's mail stream. By casting this problem in a decision-theoretic framework, we are able to use probabilistic learning methods in conjunction with a notion of differential misclassification cost to produce filters that are especially appropriate for the nuances of this task. While this may appear, at first, to be a straightforward text classification problem, we show that by considering domain-specific features of this problem, in addition to the raw text of E-mail messages, we can produce much more accurate filters. Finally, we show the efficacy of such filters in a real-world usage scenario, arguing that this technology is mature enough for deployment.
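
To make the decision-theoretic idea concrete, here is a minimal sketch and not the authors' implementation. It assumes a naive Bayes classifier over binary word-presence features with Laplace smoothing, and it assumes that wrongly discarding a legitimate message costs `cost_ratio` times as much as letting a junk message through. Under that assumption, expected cost is minimized by labelling a message junk only when P(junk | message) exceeds cost_ratio / (1 + cost_ratio). All function names, the toy training data, and the cost ratios used are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter

def train(messages):
    """Count, per class, how many messages contain each word.

    messages: list of (tokens, label) pairs, label in {"junk", "legit"}.
    Assumes at least one training message per class.
    """
    doc_counts = Counter()
    word_counts = {"junk": Counter(), "legit": Counter()}
    vocab = set()
    for tokens, label in messages:
        present = set(tokens)          # binary presence, not frequency
        doc_counts[label] += 1
        word_counts[label].update(present)
        vocab |= present
    return doc_counts, word_counts, vocab

def posterior_junk(tokens, model):
    """Return P(junk | message) under the naive Bayes independence assumption."""
    doc_counts, word_counts, vocab = model
    total = sum(doc_counts.values())
    present = set(tokens)
    log_score = {}
    for label in ("junk", "legit"):
        n = doc_counts[label]
        score = math.log(n / total)    # class prior
        for word in vocab:
            # Laplace-smoothed probability that the word appears in this class
            p = (word_counts[label][word] + 1) / (n + 2)
            score += math.log(p if word in present else 1 - p)
        log_score[label] = score
    # Normalize in log space to avoid underflow
    top = max(log_score.values())
    junk = math.exp(log_score["junk"] - top)
    legit = math.exp(log_score["legit"] - top)
    return junk / (junk + legit)

def junk_threshold(cost_ratio):
    """Posterior above which labelling a message junk has lower expected cost."""
    return cost_ratio / (1 + cost_ratio)

def classify(tokens, model, cost_ratio):
    """Label a message, taking the asymmetric misclassification cost into account."""
    if posterior_junk(tokens, model) > junk_threshold(cost_ratio):
        return "junk"
    return "legit"

# Hypothetical toy corpus, for illustration only
training = [
    (["free", "money", "now"], "junk"),
    (["free", "offer"], "junk"),
    (["meeting", "tomorrow"], "legit"),
    (["project", "meeting", "notes"], "legit"),
]
model = train(training)
message = ["free", "money"]

equal_cost_label = classify(message, model, cost_ratio=1)    # threshold 0.5
cautious_label = classify(message, model, cost_ratio=999)    # threshold 0.999
```

On this toy data the same message is labelled junk when both errors cost the same, but is kept as legitimate when losing a real message is treated as 999 times worse, because its posterior is high but does not clear the stricter threshold. Domain-specific features could be handled in this sketch by appending extra tokens to a message's token list (for example a hypothetical `"HAS_ATTACHMENT"` marker) alongside the words of the raw text.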

Reference: M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. A Bayesian approach to filtering junk email. AAAI Workshop on Learning for Text Categorization, July 1998, Madison, Wisconsin. AAAI Technical Report WS-98-05.

Keywords: Bayesian spam filter, Bayesian text classification, Spam email, unsolicited email, filtering junk email, probabilistic methods.

Read article on early spam filter efforts at Microsoft Research (William Baldwin, Forbes Magazine, September 1998).
