Abstract-The Support Vector Machine (SVM) is a powerful classification and regression tool. Various approaches, including SVM-based techniques, have been proposed for email classification. Automated email classification according to messages or user-specific folders, and information extraction from chronologically ordered email streams, have become interesting areas in text machine learning research. This paper presents a parallel SVM based on MapReduce (PSMR) algorithm for email classification. We discuss the challenges that arise from differences between email foldering and traditional document classification. We show experimental foldering results on the Enron datasets, based on the timeline, for an array of automated classification methods and evaluation methodologies, including Naive Bayes, SVM and PSMR. By distributing, processing and optimizing subsets of the training data across multiple participating nodes, the parallel SVM based on MapReduce algorithm reduces the training time significantly.
Index Terms-Email Classification; Parallel SVM; MapReduce
(ProQuest: ... denotes formulae omitted.)
I. INTRODUCTION
With the rapid development of the Internet and computer technology, the quantity of electronic data is growing exponentially. The data deluge has become a salient problem that must be solved. Text categorization has been a highly popular machine learning application over the past decade. Gretarsson et al. present a web-based system for visual and interactive analysis of large sets of documents using statistical topic models [1]. A variety of other problem domains have been explored, including categorization by events, by communication and even by author's gender [2, 3, 4].
Data mining is the process of discovering new patterns in large data sets, using methods at the intersection of artificial intelligence, machine learning, statistics and database systems. This problem has been studied by many scholars across many application areas for years, and many data mining methods have been developed and applied in practice. However, most classical data mining methods become impractical in the face of big data. Computation- and data-intensive scientific data analyses have become increasingly prevalent in recent years. Support Vector Machines (SVMs) [5] are powerful classification and regression tools, but their compute and storage requirements increase rapidly with the number of training vectors, putting many problems of practical interest out of their reach. Efficient parallel algorithms and implementation techniques are the key to meeting the scalability and performance requirements entailed in...