Email Spam Detection Using Content Semantic And Entropy-based Features



Spam detection helps businesses keep unwanted emails out of inboxes and prevents consequential harm to organizations and users. To train machine learning algorithms, several researchers use features extracted from the email. These features, however, capture only the content and are computed per email message. Information aggregated per sender, i.e., content, entropy, and spam email similarity per sender, has not been studied. In this study, we propose additional features that capture the content, similarity, and entropy of emails sent by the same sender. To this end, we extracted six new features that could help improve spam email detection: the number of emails, duration, character length of the sender address, number of recipients, similarity between emails, and entropy value within the sender subject. To build the prediction model, we used four machine learning algorithms: K Nearest Neighbor, Support Vector Machine, Logistic Regression, and Random Forest. The proposed approach is evaluated using a dataset collected from ethio-telecom. The results show that augmenting the dataset with the new features improves email spam detection performance. The F1-score improves by 6.6%, 9.9%, 20.7%, and 11.3% using K Nearest Neighbor, Support Vector Machine, Logistic Regression, and Random Forest, respectively, an average improvement of 12.1%. Among the four algorithms used to build the predictive models, Random Forest performs best in detecting spam emails. We computed feature importance using the Information Gain and Gain Ratio algorithms to see which features helped improve email spam detection. The results show that five of the new features (length of address, number of receivers, duration, sent emails, and entropy) rank in the top six. This indicates that the newly introduced features contributed to the improvement seen in email spam detection.
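To make the six per-sender features more concrete, the following is a minimal Python sketch of how they might be computed. The abstract does not give the exact definitions, so everything here is an assumption made for illustration: the email record layout (`sender`, `subject`, `body`, `recipients`, `sent_at`), character-level Shannon entropy over the concatenated subjects, Jaccard word overlap as the similarity measure, and duration as the time between a sender's first and last email. The function names are hypothetical and are not taken from the study.

```python
import math
from collections import Counter, defaultdict
from datetime import datetime


def shannon_entropy(text):
    """Shannon entropy (bits per character) of the character distribution."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def jaccard(a, b):
    """Jaccard similarity between the lowercase word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def sender_features(emails):
    """Aggregate the six features per sender (assumed definitions)."""
    by_sender = defaultdict(list)
    for email in emails:
        by_sender[email["sender"]].append(email)

    features = {}
    for sender, msgs in by_sender.items():
        times = [m["sent_at"] for m in msgs]
        bodies = [m["body"] for m in msgs]
        # All unordered pairs of this sender's emails.
        pairs = [(i, j) for i in range(len(bodies))
                 for j in range(i + 1, len(bodies))]
        if pairs:
            mean_sim = sum(jaccard(bodies[i], bodies[j])
                           for i, j in pairs) / len(pairs)
        else:
            mean_sim = 0.0  # a single email has nothing to compare against
        features[sender] = {
            "num_emails": len(msgs),
            "duration_seconds": (max(times) - min(times)).total_seconds(),
            "address_length": len(sender),
            "num_recipients": len({r for m in msgs for r in m["recipients"]}),
            "mean_similarity": mean_sim,
            "subject_entropy": shannon_entropy(
                " ".join(m["subject"] for m in msgs)),
        }
    return features


# Illustrative, made-up records; not data from the study.
sample = [
    {"sender": "promo@example.com", "subject": "aaaa",
     "body": "win money now", "recipients": ["a@x.org", "b@x.org"],
     "sent_at": datetime(2024, 1, 1, 10, 0, 0)},
    {"sender": "promo@example.com", "subject": "aaaa",
     "body": "win money today", "recipients": ["b@x.org", "c@x.org"],
     "sent_at": datetime(2024, 1, 1, 10, 5, 0)},
]
result = sender_features(sample)
```

In a pipeline like the one the abstract describes, each sender's values would presumably be joined back onto that sender's individual emails as extra columns before training the classifiers. The pairwise similarity loop is quadratic in the number of emails per sender, so a larger dataset would likely need sampling or a vectorized similarity measure.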
