CS329 2025 T10 Proposal Report

EmaiLLM is a project aimed at creating an advanced email management system that utilizes computational linguistics and natural language processing to categorize and filter emails based on user-defined keywords. The system seeks to reduce email clutter, enhance productivity, and improve decision-making by automatically identifying and organizing important messages while minimizing irrelevant ones. Key innovations include a user-centric classification system and the integration of large language models for more accurate email management compared to existing solutions.

Title EmaiLLM: Bulk Cleanser

Course CS/QTM/LING-329: Computational Linguistics


Authors Michael Jung, BS in Computer Science, michael.jung@emory.edu
Caleb Jennings, BS in Math/Computer Science, caleb.jennings@emory.edu
Nate Hu, BS in Computer Science and QTM-Data Science, nate.hu@emory.edu
Kairos Wu, BS in Computer Science, kaeiro.wu@emory.edu

Abstract
EmaiLLM focuses on developing an advanced email management system using computational linguistics
and natural language processing (NLP) to tackle email clutter. The system will leverage large language
models (LLMs) to automatically categorize emails based on user-defined keywords, allowing users to retain
only the most important messages while filtering out irrelevant ones. By streamlining email organization and
reducing cognitive load, the system will enhance productivity and decision-making for a wide range of users,
from professionals to individuals seeking better email management.
The intellectual merit of the project lies in its novel approach to email prioritization, combining state-of-the-
art NLP techniques with intelligent filtering mechanisms. The integration of LLMs, such as LLaMa-8b-instruct,
for contextual understanding and automated classification, will provide a more accurate and personalized
email management experience than existing solutions. Key innovations include a user-centric classification
system that allows custom keyword definitions and a sophisticated email categorization process that minimizes
inbox clutter. This represents a significant advancement over traditional rule-based systems and offers a more
adaptable, automated solution.
The broader societal impact of this project lies in improving digital communication management by
streamlining email organization. By automatically categorizing and filtering emails based on user-defined
keywords, the system will save time, reduce stress, and enhance productivity.

1 Introduction
1.1 Objectives
This project aims to leverage computational linguistics and natural language processing (NLP) to develop
a practical solution for managing email clutter by extracting and presenting only essential information. The
goal is to create a functional system capable of scanning an email inbox, identifying and extracting the most
useful messages, organizing them into labeled categories, and efficiently managing redundant or irrelevant
emails through backup and deletion.

1.2 Motivation
Extracting valuable information from an email can often feel like mining for diamonds—rare, yet highly
valuable. Many individuals struggle with disorganized inboxes that hinder productivity and obscure critical
communications. By effectively categorizing and filtering emails, users can optimize their workflow, ensuring
that no important opportunity is overlooked while eliminating distractions. This project seeks to streamline
email management, ultimately enhancing efficiency and decision-making.
This system has the potential to benefit a wide range of users, from professionals managing high volumes
of emails to individuals seeking greater organization in their digital communications. By automating the
identification and prioritization of essential emails, this solution can significantly reduce cognitive load, save
time, and improve overall productivity.
1.3 Problem Statement
The challenge of identifying important emails can be distilled into two key tasks: summarizing email
content and determining its relevance based on predefined keywords. A core technical challenge lies in
developing an algorithm that can accurately assess the significance of an email according to user-defined
criteria. This requires advancements in NLP techniques, particularly in content analysis, summarization, and
contextual relevance detection.

1.4 Innovation Component


While numerous email summarization tools exist, none focus on summarizing only the most important
emails while filtering out the rest. Although Gmail provides spam filtering, many non-spam messages still
contribute to inbox clutter, necessitating a more refined approach to automatic email organization. Existing
email efficiency tools are underutilized, largely due to their lack of effectiveness. This project aims to bridge
this gap by introducing a novel approach to email prioritization, combining advanced NLP techniques with
intelligent filtering mechanisms.
Developing this system requires addressing several technical barriers, including improving email content
summarization, designing an effective keyword-based relevance assessment algorithm, and ensuring adaptability
across different email formats and structures. By innovating beyond current solutions, this project seeks to
establish a new standard for email management tools, offering a user-centric approach that enhances both
efficiency and usability.

2 Background
2.1 Academic Research on Email Classification
Email classification has been a significant area of research in NLP and ML. Various studies have explored
different techniques for categorizing emails based on content, context, and user behavior. Most of these
approaches utilize supervised learning methods to classify emails into predefined categories.
Researchers have consistently compared the performance of different classification algorithms for email
categorization. Iqbal et al. found that Support Vector Machines (SVM) and Artificial Neural Networks
(ANN) achieved the highest accuracy rates of approximately 98% for spam email classification using the UCI
spambase dataset [6]. Their work highlighted the importance of feature selection in improving classification
accuracy while reducing computational complexity.
Bahgat et al. conducted comparative studies showing that Logistic Regression outperforms other algorithms
such as Naive Bayes, SVM, J48, Random Forest, and radial basis function networks for email classification
with proper feature selection [1]. By applying Correlation-based Feature Selection (CFS), they achieved an
accuracy rate above 90% with a 90% reduction in feature space, demonstrating the value of semantic filtering
approaches.
While many research papers focus on binary classification (primarily spam versus non-spam), studies
have expanded to multi-class categorization. Chaipornkaew et al. developed a generalized email classification
system for workflow analysis that organizes emails based on business operations (sales, transportation, billing,
shipping, etc.) [3]. Their model demonstrated the practical application of classification techniques to business
environments, although it achieved only moderate accuracy (approximately 65%).
The methodology for text preprocessing in most academic research follows similar patterns: tokenization,
lemmatization/stemming, stopword removal, and feature extraction. The feature representation methods
typically employ techniques such as Term Frequency-Inverse Document Frequency (TF-IDF), n-grams, or
word embeddings before applying machine learning models.
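
As an illustration of the feature extraction step described above, the following is a minimal sketch of TF-IDF weighting using only the Python standard library. The function names (`tokenize`, `tf_idf`) and the sample emails are hypothetical, and the weighting shown (length-normalised term frequency times log(N / df)) is only one of several common variants, not necessarily the one used in the cited studies.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / document length
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


# Hypothetical sample emails, for illustration only.
emails = [
    "Summer internship application deadline",
    "Weekly newsletter: campus events",
    "Internship interview schedule",
]
vectors = tf_idf(emails)
```

Terms that occur in every document receive a weight of zero under this variant, which is why rare, topic-specific words dominate the resulting vectors.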

2.2 Existing Industry Solutions
In response to the growing problem of email overload, numerous commercial solutions have emerged.
These solutions can be broadly categorized into two groups: (1) AI-powered email management tools and (2)
email cleaning applications.

2.2.1 AI-Powered Email Management Tools:


Solutions like Clean Email, Bardeen.ai, and Harvey.ai employ machine learning algorithms to categorize
and clean up incoming emails automatically [2]. These systems organize emails into smart folders or bundles
based on content similarity, allowing users to handle groups of related emails simultaneously. For example,
Clean Email creates smart views that bundle similar emails together, such as social network notifications, or
subscription newsletters [4].

2.2.2 Email Cleaning Applications:


Tools like Mailstrom, SaneBox, and Mail Sweeper focus on decluttering inboxes by identifying and removing
unwanted emails [5]. Mailstrom, for instance, allows users to visualize their inbox using interactive charts
and provides one-click actions to delete, archive, or move emails in bulk. Similarly, Mail Sweeper functions as
an "automatic janitor" that collects unimportant emails into a "Dustpan" folder before periodically moving
them to trash [7].
The majority of these commercial solutions offer features such as:
• Automated categorization of emails
• Bulk actions on groups of emails
• One-click unsubscribe functionality
• Automated rules for email processing
• Email backup and archiving

2.3 Current Limitations


Despite significant advances in email classification and management, several limitations still persist:
1. Static Classification Rules: Many solutions use fixed rules or require manual configuration of filters,
making them less adaptable to evolving email patterns and user preferences.
2. Limited Feature Integration: While existing email management solutions offer various individual
features like classification, cleaning, or organization, most lack comprehensive integration of these
capabilities into a unified system. Users often need to employ multiple tools to achieve complete email
management, which creates a fragmented experience that increases complexity.

2.4 Differentiation of Proposed Approach


Our system, EmaiLLM: Bulk Cleanser, differentiates itself from existing solutions in several key ways:
1. Leveraging LLM Technology: Unlike traditional machine learning approaches, our solution employs
large language models (such as LLaMa-8b-instruct via Groq or GPT-4o-mini from OpenAI), which we
expect to provide better contextual understanding of email content. This should enable more accurate
and nuanced classification that captures the semantic meaning of the user-defined keywords.
2. User-Defined Categories: While most existing systems use predefined categories, our approach
allows users to specify custom keywords and categories that matter to their own “Point-Of-Interest”,
creating a personalized classification system that better aligns with individual needs.

3. Tagging and Cleaning System: EmaiLLM combines classification with automatic archiving into
local folders, providing a comprehensive solution that both organizes relevant emails and reduces inbox
clutter while adding conspicuous tags to related keywords.

3 Proposed Approach
3.1 Methodology
The email classifier system will preprocess email data to detect important keywords specified by the user.
The system will categorize emails based on user input and classify relevant emails for retention in the inbox
while removing bulk, unimportant emails to a local folder for storage optimization.

3.1.1 Detailed Technical Approach


Data Preprocessing:
• Tokenization: Splitting email content into tokens for easier keyword identification.
• Lemmatization/Stemming: Reducing words to their root form for efficient keyword matching.
• Stopword Removal: Removing commonly used words that do not aid in classification.
Keyword Detection: Users specify important keywords (e.g., "internship," "networking"). Emails
containing these keywords will be flagged as important and retained in the inbox.
Email Classification: The system will classify emails based on user-defined keywords and tag emails
accordingly (e.g., Email 1: #Internship, Email 2: #Networking).
LLM Integration: The LLaMa-8b-instruct model will flag important emails based on their content and
enhance classification accuracy.
Post-Processing: To free up space, emails without specified keywords will be downloaded to a local
folder and removed from the inbox.
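
The preprocessing, keyword detection, and tagging steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the stopword list is a tiny placeholder, `stem` is a crude suffix stripper standing in for a real stemmer or lemmatizer, and the names `preprocess`, `tag_email`, and `partition` are hypothetical. Only single-word keywords are handled, and the LLM step is omitted.

```python
import re

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "and", "or", "to", "of", "for",
             "in", "on", "is", "are", "we", "you", "your"}


def stem(token):
    """Crude suffix stripping, standing in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stopwords, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


def tag_email(text, keywords):
    """Return sorted tags (e.g. '#Internship') for each keyword found in the email."""
    stems = set(preprocess(text))
    return sorted("#" + kw.capitalize() for kw in keywords if stem(kw.lower()) in stems)


def partition(emails, keywords):
    """Split emails into (keep, archive) lists of (text, tags) pairs."""
    keep, archive = [], []
    for text in emails:
        tags = tag_email(text, keywords)
        (keep if tags else archive).append((text, tags))
    return keep, archive


# Hypothetical inbox contents, for illustration only.
inbox = [
    "Apply now for summer internships",
    "50% off all shoes this weekend",
    "Networking night on Friday",
]
keep, archive = partition(inbox, ["internship", "networking"])
```

Matching on stems rather than raw tokens lets the keyword "internship" match "internships", which is the purpose of the lemmatization/stemming step; emails that end up in `archive` are the candidates for the post-processing step.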

3.1.2 Development Framework


The project will be developed using the following technologies:
• Python: For the core email processing logic
• Flask/Django: For the web framework (final decision pending based on project needs)
• HTML/CSS/JavaScript: For building a user-friendly front-end interface

3.1.3 Implementation Strategy


User Input: Users will specify keywords to filter important emails (e.g., "internship," "meeting").
Email Processing: Emails will be analyzed, with relevant emails tagged and retained in the inbox while
irrelevant emails are downloaded and deleted.
Tagging and Folder Management: Tagged emails will be stored in appropriate inbox folders based on
the keywords detected.
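
The "downloaded" half of the email processing step could look roughly like the sketch below, which writes irrelevant messages to a local folder as .eml files using Python's standard `email` package. The helper names (`make_message`, `archive_emails`) and the file naming scheme are assumptions for illustration. Deleting the originals from the inbox is deliberately left out, since it depends on the mail provider's API, which the proposal does not specify.

```python
from email.message import EmailMessage
from pathlib import Path


def make_message(sender, subject, body):
    """Build a simple plain-text message (used here only to create sample data)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def archive_emails(messages, folder):
    """Write each message to `folder` as an .eml file and return the paths written."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, msg in enumerate(messages):
        # Hypothetical naming scheme: zero-padded running index.
        path = out_dir / f"email_{i:05d}.eml"
        path.write_bytes(msg.as_bytes())
        paths.append(path)
    return paths
```

A call such as `archive_emails([make_message("shop@example.com", "Sale", "Buy now")], "archive")` would create `archive/email_00000.eml`, which standard mail clients can reopen later.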

3.1.4 Quality Assurance Methods


Manual Testing: Emails from both real and synthetic datasets will be used for testing to validate
classification.
User Feedback: Users will evaluate the classification’s effectiveness and ease of use.
Model Output Verification: LLM classification output will be verified to ensure the accurate detection
of important emails.
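
One possible way to make model output verification mechanical is to ask the LLM for a structured reply and reject anything outside the user-defined keyword set. The sketch below assumes a JSON reply of the form {"tags": [...]}; this prompt format, and the function names `build_prompt` and `parse_tags`, are illustrative assumptions rather than the project's actual design. The model call itself is omitted because it depends on the chosen provider.

```python
import json


def build_prompt(email_text, keywords):
    """Ask the model to reply with JSON listing which user keywords apply."""
    return (
        "You label emails. Allowed tags: " + ", ".join(keywords) + ".\n"
        'Reply with JSON only, in the form {"tags": [...]}, using only allowed tags.\n'
        "Email:\n" + email_text
    )


def parse_tags(raw_output, keywords):
    """Return the tags from the model's reply that are user-defined keywords.

    Unparseable or malformed replies yield an empty list, so a bad reply
    never produces a tag the user did not define.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    tags = data.get("tags", [])
    if not isinstance(tags, list):
        return []

    allowed = {k.lower(): k for k in keywords}
    result = []
    for tag in tags:
        if isinstance(tag, str) and tag.lower() in allowed:
            canonical = allowed[tag.lower()]
            if canonical not in result:  # drop duplicates, keep first-seen order
                result.append(canonical)
    return result
```

For example, a reply of `{"tags": ["Internship", "sports"]}` checked against the keywords `["internship", "networking"]` keeps only `internship`, because `sports` is not a user-defined keyword.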

3.1.5 Dataset Description and Preparation
The dataset consists of real emails manually downloaded from user inboxes and synthetic emails generated
using LLMs. The combination will provide a variety of scenarios to test the system’s classification capabilities.
Both the training dataset and the test dataset are labeled by humans. Every document in each dataset
consists of the original email text together with human-assigned keyword and category labels.

3.1.6 Experimental Design


Experiment Goals: Test different sets of keywords, categories of emails, and LLM configurations.
Measure the accuracy of detecting important emails and the impact on storage optimization.
Test Cases: Various email scenarios will be run to evaluate the model’s classification accuracy.

3.1.7 Evaluation Metrics


Dataset balance: The overall dataset is skewed, so precision, recall, and F1-score are reported in addition
to accuracy.
Precision: The proportion of correctly tagged emails out of all emails tagged by the tool.
Recall: The proportion of emails correctly classified into a category out of all emails that belong to that
category.
F1-score: 2 × Precision × Recall / (Precision + Recall).
Accuracy: The proportion of correctly tagged emails among the emails classified into the user-defined
categories.
Storage Space Optimization: Evaluate the amount of inbox space saved by moving irrelevant emails
to local storage.
User Satisfaction: User feedback on the system’s overall effectiveness in organizing emails.
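
The precision, recall, and F1-score definitions above can be computed for a single category as in the following sketch. The function name `precision_recall_f1` and the sample labels are hypothetical; `predicted` and `actual` are assumed to be parallel lists of booleans saying whether each email was tagged with, or truly belongs to, the category.

```python
def precision_recall_f1(predicted, actual):
    """Compute precision, recall, and F1 for one category from boolean labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # tagged and truly in category
    fp = sum(1 for p, a in pairs if p and not a)    # tagged but not in category
    fn = sum(1 for p, a in pairs if not p and a)    # in category but not tagged

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical labels for five emails, for illustration only.
predicted = [True, True, False, False, True]
actual = [True, False, True, False, True]
precision, recall, f1 = precision_recall_f1(predicted, actual)
```

In this example there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3. On a skewed dataset these values are more informative than accuracy, which can be high even when most important emails are missed.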

3.1.8 Validation Approach


Real-world Validation: The system will be validated using real emails from Emory inboxes and
synthetic data to simulate bulk email scenarios.
Comparison: Results from email classification will be compared to initial criteria for performance
improvement.

4 Timeline
4.1 Weekly Schedule
Week 1: Project Planning and Feedback Gathering (March 18-22) Milestone: Finalize project scope and
requirements, and gather feedback. Key Deliverables: Revised proposal pitch, project structure, and team
roles. Risk Mitigation: Incomplete project scope may delay execution. To mitigate this, team members will
focus on defining requirements early, and team meetings will address any unclear areas.
Week 2: Data Collection and Email Corpus Generation (March 23-29) Milestone: Manual email download
and LLM-based email corpus generation. Key Deliverables: Dataset of synthetic and real emails, cleaned and
preprocessed for use in development. Risk Mitigation: Potential difficulties in acquiring sufficient email data.
If this occurs, the team will use the LLM or additional data sources to expedite synthetic data generation.
Week 3: Algorithm Development and Workflow Design (March 30-April 5) Milestone: Develop the
core email classification, keyword detection, and workflow processes. Key Deliverables: Initial working
algorithm for email filtering and keyword-based tagging. Risk Mitigation: Unexpected complexities in NLP
implementation may arise. The team will use pre-trained models and modular development to reduce technical
debt.
Week 4: Testing and Performance Evaluation (April 6-12) Milestone: Conduct thorough testing, optimize
storage usage, and evaluate performance. Key Deliverables: Detailed performance report (classification
accuracy, inbox storage improvements). Risk Mitigation: Testing delays or low accuracy can lead to prolonged
development cycles. Regular test checkpoints will be created, and continuous integration will help track
progress.
Week 5: UI Implementation and Final Integration (April 13-19) Milestone: Develop the user interface,
integrate the front and back end, and refine for usability. Key Deliverables: Fully integrated system with a
user-friendly interface, ready for live demo. Risk Mitigation: Integration issues or delayed UI development may
occur. Collaborative development with Caleb (UI/UX) and Michael (Backend) will ensure synchronization.
Last Week: Live Demonstration and Final Adjustments Milestone: Prepare for a live demo showing the
system in action. Key Deliverables: Demo-ready system highlighting classification, keyword tagging, and
inbox storage optimization.

4.2 Team Responsibilities


4.2.1 Role distribution
Michael (Project Lead, Backend Development)
- Oversees the entire project
- Manages the backend system for email processing
- Coordinates between all components for seamless integration

Nate (Data Processing, LLM Integration)


- Focuses on handling data preprocessing
- Ensures integration of the LLM for keyword detection and classification

Kairos (Classification Models, Evaluation Metrics)


- Develops classification algorithms for email categorization
- Measures and evaluates accuracy and performance metrics

Caleb (UI/UX Design, Frontend Development)


- Designs the UI for end-user interaction
- Develops and integrates the frontend with backend systems

4.2.2 Collaboration methods


• Google Docs and Overleaf for collaborative document editing
• Zoom for team communication

• GitHub for version control and code collaboration


• Google Slides for presentation and design

4.2.3 Meetings
Weekly meetings will be held on Fridays at 6:30 PM to track progress, assign new tasks, and address any
blockers.

4.2.4 Progress Tracking Plan


GitHub: Regular commits and pull requests will be reviewed for technical progress and peer feedback.
Weekly Updates: Each team member will report weekly progress during meetings.

References
[1] Eman M. Bahgat, Sherine Rady, Walaa Gad, and Ibrahim F. Moawad. Efficient email classification
approach based on semantic methods. Ain Shams Engineering Journal, 9(4):3259–3269, 2018. ISSN
2090-4479. doi: https://doi.org/10.1016/j.asej.2018.06.001. URL
https://www.sciencedirect.com/science/article/pii/S2090447918300455.
[2] Bardeen. We tried 5 ai email management tools for inbox cleanups, 2024. URL
https://www.bardeen.ai/posts/email-inbox-management-ai.
[3] Piyanuch Chaipornkaew, Takorn Prexawanprasut, Chia-Lin Chang, and Michael McAleer. A generalized
email classification system for workflow analysis. Technical report, Facultad de CC Económicas y
Empresariales. Instituto Complutense de Análisis . . . , 2017.
[4] Clean Email. Clean email inbox – organize and remove emails you don’t need, 2024. URL
https://clean.email/.
[5] EmailAnalytics. 8 best email cleaner apps to take back your inbox (2025), 2024. URL
https://emailanalytics.com/email-cleaners/.
[6] Khalid Iqbal and Muhammad Shehrayar Khan. Email classification analysis using machine learning
techniques. Applied Computing and Informatics, 2022.
[7] Right Inbox. The 6 best email cleaner tools for 2024, 2024. URL
https://www.rightinbox.com/blog/email-cleaner-tools.

A Contributions
Name: Contribution of Work (Contribution Percentage)

Michael: Proposed Methodology + Timeline + Review entire proposal (40%)


Nate: Background + Review entire proposal (30%)
Kairos: Dataset and Evaluation + Review entire proposal (15%)
Caleb: Introduction + Review entire proposal (15%)
