Phase 2 File
1. Problem Statement
In the digital age, customers expect instant, accurate, and 24/7 support across
various platforms. Traditional customer service models rely heavily on human
agents, resulting in increased operational costs, inconsistent responses, and delays
during high-demand periods. To overcome these limitations, businesses are
increasingly turning to intelligent automation solutions like chatbots.
4. Data Description
- Dataset: Custom and open-source chatbot datasets (e.g., the Cornell Movie-Dialogs
Corpus and FAQ datasets from Kaggle).
- Type: Text (Unstructured)
- Number of Records: ~10,000 conversation pairs (questions and responses)
- Number of Features: 2 main columns – user_input and intent
- Dataset Type: Static
- Target Variable: intent (used for classification)
5. Data Preprocessing
- Removed missing and duplicate entries.
- Normalized text (lowercasing, punctuation removal).
- Tokenized sentences and applied lemmatization.
- Label-encoded the target variable (intent).
- Vectorized inputs using TF-IDF and BERT embeddings.
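The cleaning steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual pipeline: the function names and sample rows are hypothetical, duplicates are detected after normalization (an assumption), and lemmatization and vectorization are left out because they need external libraries such as NLTK, spaCy, or scikit-learn.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse repeated whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def preprocess(rows):
    """Drop missing and duplicate rows, normalize text, label-encode intents.

    rows: iterable of (user_input, intent) tuples; either field may be None.
    """
    seen = set()
    texts, intents = [], []
    for user_input, intent in rows:
        if not user_input or not intent:
            continue  # missing entry
        cleaned = normalize(user_input)
        if not cleaned or (cleaned, intent) in seen:
            continue  # empty after cleaning, or a duplicate
        seen.add((cleaned, intent))
        texts.append(cleaned)
        intents.append(intent)
    # Label encoding: map each intent name to an integer id (sorted order).
    label_to_id = {label: i for i, label in enumerate(sorted(set(intents)))}
    labels = [label_to_id[name] for name in intents]
    return texts, labels, label_to_id


# Hypothetical sample rows in the (user_input, intent) layout described earlier.
sample_rows = [
    ("Where is my order?", "order_status"),
    ("where is my order", "order_status"),  # duplicate once normalized
    (None, "greeting"),                     # missing text
    ("Hi there!", "greeting"),
]
texts, labels, label_to_id = preprocess(sample_rows)
```

Normalizing before checking for duplicates catches rows that differ only in case or punctuation; whether that matches the original cleaning order is an assumption.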
Insights Summary:
- A few common intents dominate the dataset, so the classes are imbalanced.
- Keyword-based patterns help separate the intent classes.
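One way to check how strongly a few intents dominate is to count label frequencies. The sketch below uses made-up intent names and counts purely for illustration; it is not the project's data.

```python
from collections import Counter


def intent_distribution(intents):
    """Return (intent, count, share) tuples, most frequent intent first."""
    counts = Counter(intents)
    total = sum(counts.values())
    return [(label, n, n / total) for label, n in counts.most_common()]


# Hypothetical labels; the real intent names are not given in this report.
sample_intents = ["order_status"] * 6 + ["refund"] * 3 + ["greeting"] * 1

for label, n, share in intent_distribution(sample_intents):
    print(f"{label:<14}{n:>4}{share:>8.1%}")
```

A heavily skewed distribution is usually a reason to report per-class metrics rather than accuracy alone.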
7. Feature Engineering
- Created features: message length, keyword flags.
- Used TF-IDF vectors and BERT embeddings as text representations.
- Optionally applied PCA to the TF-IDF vectors for dimensionality reduction.
- These features improved classification accuracy.
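The hand-built features (message length and keyword flags) could look roughly like the sketch below. The keyword list and feature names are hypothetical placeholders, since the report does not list the actual keywords.

```python
# Hypothetical keyword list; the report does not say which keywords were flagged.
KEYWORDS = ("refund", "order", "cancel")


def handcrafted_features(text: str) -> dict:
    """Compute message-length and keyword-flag features for one message."""
    tokens = text.lower().split()
    features = {
        "char_len": len(text),     # message length in characters
        "token_len": len(tokens),  # message length in whitespace tokens
    }
    for keyword in KEYWORDS:
        # 1 if the keyword appears as a whole token, else 0
        features[f"has_{keyword}"] = int(keyword in tokens)
    return features


example = handcrafted_features("I want to cancel my order")
```

Features like these can be appended to the TF-IDF or embedding vector before training, which is one plausible reading of how they were combined here.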
8. Model Building
Models Used:
1. Logistic Regression – baseline with TF-IDF.
2. Random Forest – handles sparse TF-IDF input; interpretable via feature importances.
3. BERT – transformer model with high accuracy.
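As an illustration of the baseline, the sketch below chains TF-IDF and Logistic Regression with scikit-learn. The training messages, intent names, and hyperparameters are assumptions for demonstration only; the report does not state the settings actually used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data in the (user_input, intent) layout.
train_texts = [
    "where is my order",
    "track my order please",
    "i want a refund",
    "refund my payment",
    "hello there",
    "hi good morning",
]
train_intents = [
    "order_status",
    "order_status",
    "refund",
    "refund",
    "greeting",
    "greeting",
]

# TF-IDF turns each message into a sparse vector; Logistic Regression
# then learns one linear decision function per intent.
baseline = make_pipeline(
    TfidfVectorizer(lowercase=True),
    LogisticRegression(max_iter=1000),
)
baseline.fit(train_texts, train_intents)

predicted = baseline.predict(["can i get a refund"])
print(predicted[0])
```

Swapping `LogisticRegression` for `RandomForestClassifier` in the same pipeline would give a comparable sketch of the second model; the BERT model needs a separate fine-tuning setup that is not shown here.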