CS329 2025 T10 Proposal Report
Abstract
EmaiLLM focuses on developing an advanced email management system using computational linguistics
and natural language processing (NLP) to tackle email clutter. The system will leverage large language
models (LLMs) to automatically categorize emails based on user-defined keywords, allowing users to retain
only the most important messages while filtering out irrelevant ones. By streamlining email organization and
reducing cognitive load, the system will enhance productivity and decision-making for a wide range of users,
from professionals to individuals seeking better email management.
The intellectual merit of the project lies in its novel approach to email prioritization, combining state-of-the-
art NLP techniques with intelligent filtering mechanisms. The integration of LLMs, such as LLaMa-8b-instruct,
for contextual understanding and automated classification, will provide a more accurate and personalized
email management experience than existing solutions. Key innovations include a user-centric classification
system that allows custom keyword definitions and a sophisticated email categorization process that minimizes
inbox clutter. This represents a significant advancement over traditional rule-based systems and offers a more
adaptable, automated solution.
The broader societal impact of this project lies in improving digital communication management by
streamlining email organization. By automatically categorizing and filtering emails based on user-defined
keywords, the system will save time, reduce stress, and enhance productivity.
1 Introduction
1.1 Objectives
This project aims to leverage computational linguistics and natural language processing (NLP) to develop
a practical solution for managing email clutter by extracting and presenting only essential information. The
goal is to create a functional system capable of scanning an email inbox, identifying and extracting the most
useful messages, organizing them into labeled categories, and efficiently managing redundant or irrelevant
emails through backup and deletion.
1.2 Motivation
Extracting valuable information from an inbox can often feel like mining for diamonds: the finds are rare, yet
highly valuable. Many individuals struggle with disorganized inboxes that hinder productivity and obscure critical
communications. By effectively categorizing and filtering emails, users can optimize their workflow, ensuring
that no important opportunity is overlooked while eliminating distractions. This project seeks to streamline
email management, ultimately enhancing efficiency and decision-making.
This system has the potential to benefit a wide range of users, from professionals managing high volumes
of emails to individuals seeking greater organization in their digital communications. By automating the
identification and prioritization of essential emails, this solution can significantly reduce cognitive load, save
time, and improve overall productivity.
1.3 Problem Statement
The challenge of identifying important emails can be distilled into two key tasks: summarizing email
content and determining its relevance based on predefined keywords. A core technical challenge lies in
developing an algorithm that can accurately assess the significance of an email according to user-defined
criteria. This requires advancements in NLP techniques, particularly in content analysis, summarization, and
contextual relevance detection.
2 Background
2.1 Academic Research on Email Classification
Email classification has been a significant area of research in NLP and machine learning (ML). Various studies have explored
different techniques for categorizing emails based on content, context, and user behavior. Most of these
approaches utilize supervised learning methods to classify emails into predefined categories.
Researchers have consistently compared the performance of different classification algorithms for email
categorization. Iqbal et al. found that Support Vector Machines (SVM) and Artificial Neural Networks
(ANN) achieved the highest accuracy rates of approximately 98% for spam email classification using the UCI
spambase dataset [6]. Their work highlighted the importance of feature selection in improving classification
accuracy while reducing computational complexity.
Bahgat et al. conducted comparative studies showing that Logistic Regression outperforms other algorithms
such as Naive Bayes, SVM, J48, Random Forest, and radial basis function networks for email classification
with proper feature selection [1]. By applying Correlation-based Feature Selection (CFS), they achieved an
accuracy rate above 90% with a 90% reduction in feature space, demonstrating the value of semantic filtering
approaches.
While many research papers focus on binary classification (primarily spam versus non-spam), studies
have expanded to multi-class categorization. Chaipornkaew et al. developed a generalized email classification
system for workflow analysis that organizes emails based on business operations (sales, transportation, billing,
shipping, etc.) [3]. Their model demonstrated the practical application of classification techniques to business
environments, although it achieved only moderate accuracy (approximately 65%).
The methodology for text preprocessing in most academic research follows similar patterns: tokenization,
lemmatization/stemming, stopword removal, and feature extraction. The feature representation methods
typically employ techniques such as Term Frequency-Inverse Document Frequency (TF-IDF), n-grams, or
word embeddings before applying machine learning models.
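The preprocessing and feature-extraction steps described above can be illustrated with a minimal sketch. It is written in Python using only the standard library and is not taken from any of the cited studies. The stopword list, the function names tokenize and tfidf, and the sample emails are assumptions made for illustration, and lemmatization/stemming is omitted for brevity.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "for", "your", "on"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty emails
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical sample emails.
emails = [
    "Your invoice for March is attached",
    "Team meeting moved to Friday",
    "Invoice reminder: payment due Friday",
]
vectors = tfidf(emails)
```

In this toy corpus, a term that appears in only one email (such as "march") receives a higher weight than one shared by two emails (such as "invoice"), which is the behavior that makes TF-IDF useful as a feature representation before a classifier is applied.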
2.2 Existing Industry Solutions
In response to the growing problem of email overload, numerous commercial solutions have emerged.
These solutions can be broadly categorized into two groups: (1) AI-powered email management tools and (2)
email cleaning applications.
3. Tagging and Cleaning System: EmaiLLM combines classification with automatic archiving into
local folders, providing a comprehensive solution that both organizes relevant emails and reduces inbox
clutter while adding conspicuous tags to related keywords.
3 Proposed Approach
3.1 Methodology
The email classifier system will preprocess email data to detect important keywords specified by the user.
The system will categorize emails based on user input, retaining relevant emails in the inbox
while moving bulk, unimportant emails to a local folder for storage optimization.
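A minimal sketch of the retain-or-archive decision described above follows. It assumes plain case-insensitive substring matching as a stand-in for the LLM-based classification the project proposes. The Email class, the triage function, and the sample messages are hypothetical and serve only to illustrate the intended data flow.

```python
from dataclasses import dataclass, field


@dataclass
class Email:
    subject: str
    body: str
    tags: list = field(default_factory=list)  # keywords matched in this email


def triage(emails, keywords):
    """Split emails into (inbox, archive) by case-insensitive keyword match.

    Substring matching is a placeholder assumption; the proposed system
    would delegate this relevance decision to an LLM.
    """
    inbox, archive = [], []
    for email in emails:
        text = f"{email.subject} {email.body}".lower()
        email.tags = [kw for kw in keywords if kw.lower() in text]
        # Emails with at least one matched keyword stay in the inbox.
        (inbox if email.tags else archive).append(email)
    return inbox, archive


# Hypothetical sample messages and user-defined keywords.
emails = [
    Email("Internship offer", "We are pleased to offer you a position."),
    Email("Weekly deals", "Save 20% on shoes this weekend."),
]
inbox, archive = triage(emails, ["internship", "offer"])
```

Keeping the matching step behind a single function means the substring check could later be swapped for a model call without changing how the inbox and archive lists are produced.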
3.1.5 Dataset Description and Preparation
The dataset consists of real emails manually downloaded from user inboxes and synthetic emails generated
using LLMs. The combination will provide a variety of scenarios to test the system’s classification capabilities.
The training and test datasets are human-labeled. Every document in each dataset
consists of the original email text together with human-assigned keyword and category labels.
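One possible shape for a labeled document is sketched below. The proposal does not specify a storage format, so the LabeledEmail class, its field names (including the source field that distinguishes real from synthetic emails), and the JSON Lines serialization are all assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class LabeledEmail:
    text: str          # original email text
    keywords: list     # human-assigned keyword labels
    categories: list   # human-assigned category labels
    source: str        # "real" or "synthetic" (hypothetical field)


def to_jsonl(records):
    """Serialize records as JSON Lines, one labeled email per line."""
    return "\n".join(json.dumps(asdict(r)) for r in records)


def from_jsonl(payload):
    """Parse JSON Lines back into LabeledEmail records, skipping blank lines."""
    return [LabeledEmail(**json.loads(line))
            for line in payload.splitlines() if line.strip()]


# Hypothetical records for illustration only.
records = [
    LabeledEmail("Your interview is on Monday.\nPlease confirm.",
                 ["interview"], ["career"], "real"),
    LabeledEmail("Flash sale ends tonight!", [], ["promotion"], "synthetic"),
]
restored = from_jsonl(to_jsonl(records))
```

Storing the human labels next to the original text in one record keeps the training and test splits self-describing, so the same loader can serve both.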
4 Timeline
4.1 Weekly Schedule
Week 1: Project Planning and Feedback Gathering (March 18-22) Milestone: Finalize project scope and
requirements, and gather feedback. Key Deliverables: Revised proposal pitch, project structure, and team
roles. Risk Mitigation: Incomplete project scope may delay execution. To mitigate this, team members will
focus on defining requirements early, and team meetings will address any unclear areas.
Week 2: Data Collection and Email Corpus Generation (March 23-29) Milestone: Manual email download
and LLM-based email corpus generation. Key Deliverables: Dataset of synthetic and real emails, cleaned and
preprocessed for use in development. Risk Mitigation: Potential difficulties in acquiring sufficient email data.
If this occurs, the team will use the LLM or additional data sources to expedite synthetic data generation.
Week 3: Algorithm Development and Workflow Design (March 30-April 5) Milestone: Develop the
core email classification, keyword detection, and workflow processes. Key Deliverables: Initial working
algorithm for email filtering and keyword-based tagging. Risk Mitigation: Unexpected complexities in NLP
implementation may arise. The team will use pre-trained models and modular development to reduce technical
debt.
Week 4: Testing and Performance Evaluation (April 6-12) Milestone: Conduct thorough testing, optimize
storage usage, and evaluate performance. Key Deliverables: Detailed performance report (classification
accuracy, inbox storage improvements). Risk Mitigation: Testing delays or low accuracy can lead to prolonged
development cycles. Regular test checkpoints will be created, and continuous integration will help track
progress.
Week 5: UI Implementation and Final Integration (April 13-19) Milestone: Develop the user interface,
integrate the front and back end, and refine for usability. Key Deliverables: Fully integrated system with a
user-friendly interface, ready for live demo. Risk Mitigation: Integration issues or delayed UI development may
occur. Collaborative development with Caleb (UI/UX) and Michael (Backend) will ensure synchronization.
Last Week: Live Demonstration and Final Adjustments Milestone: Prepare for a live demo showing the
system in action. Key Deliverables: Demo-ready system highlighting classification, keyword tagging, and
inbox storage optimization.
4.2.3 Meetings
Weekly meetings will be held on Fridays at 6:30 PM to track progress, assign new tasks, and address any
blockers.
References
[1] Eman M. Bahgat, Sherine Rady, Walaa Gad, and Ibrahim F. Moawad. Efficient email classification
approach based on semantic methods. Ain Shams Engineering Journal, 9(4):3259–3269, 2018. ISSN
2090-4479. doi: https://doi.org/10.1016/j.asej.2018.06.001.
URL https://www.sciencedirect.com/science/article/pii/S2090447918300455.
[2] Bardeen. We tried 5 ai email management tools for inbox cleanups, 2024.
URL https://www.bardeen.ai/posts/email-inbox-management-ai.
[3] Piyanuch Chaipornkaew, Takorn Prexawanprasut, Chia-Lin Chang, and Michael McAleer. A generalized
email classification system for workflow analysis. Technical report, Facultad de CC Económicas y
Empresariales. Instituto Complutense de Análisis . . . , 2017.
[4] Clean Email. Clean email inbox – organize and remove emails you don’t need, 2024.
URL https://clean.email/.
[5] EmailAnalytics. 8 best email cleaner apps to take back your inbox (2025), 2024.
URL https://emailanalytics.com/email-cleaners/.
[6] Khalid Iqbal and Muhammad Shehrayar Khan. Email classification analysis using machine learning
techniques. Applied Computing and Informatics, 2022.
[7] Right Inbox. The 6 best email cleaner tools for 2024, 2024.
URL https://www.rightinbox.com/blog/email-cleaner-tools.
A Contributions
Name: Contribution of Work (Contribution Percentage)