Disaster
DESCRIPTION :-
VARIANCE
Definition: Variance measures the spread of data points from the mean. It gives an
indication of how much the values in the data set deviate from the average value.
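Formula: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$, where $\bar{x}$ is the mean and $n$ is the number of data points. This is the population variance; the sample variance divides by $n - 1$ instead.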
STANDARD DEVIATION:
• Description:
Standard deviation is the square root of variance, measuring how much the values deviate
from the mean, in the same unit as the data.
• Formula: $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
• A lower standard deviation means the data points tend to be closer to the mean, whereas a
higher standard deviation indicates more spread.
SKEWNESS
Definition: Skewness measures the asymmetry of the distribution of data around its mean.
It indicates whether the data is skewed to the left (negatively skewed) or to the right
(positively skewed).
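Formula: $\text{Skewness} = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{\sigma^3}$
where:
$\bar{x}$ is the mean.
$\sigma$ is the standard deviation.
$n$ is the number of data points.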
• A negative skewness value indicates left-skewed data (a longer tail on the left, with most values concentrated on the right), while a
positive skewness value indicates right-skewed data (a longer tail on the right, with most values concentrated on the left). A skewness value
close to 0 indicates a roughly symmetric distribution.
KURTOSIS
Definition: Kurtosis measures the "tailedness" of the distribution, indicating how much of
the data is in the tails compared to a normal distribution. It helps describe the shape of the
distribution.
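Formula: $\text{Kurtosis} = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^4}{\sigma^4} - 3$ (excess kurtosis, so a normal distribution scores 0)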
• A Positive kurtosis (leptokurtic) means that the distribution has heavier tails than a normal
distribution, while a Negative kurtosis (platykurtic) means it has lighter tails.
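To make the four definitions concrete, the sketch below evaluates them directly from their population (divide-by-n) formulas in plain Python, using the same ten values as the program that follows. The helper names `central_moment` and `describe` are illustrative choices, not part of the original program, and the kurtosis is the excess form (raw value minus 3) so that its sign matches the leptokurtic/platykurtic convention above.

```python
import math

def central_moment(values, k):
    """k-th central moment: the mean of (x - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

def describe(values):
    """Return variance, standard deviation, skewness and excess kurtosis."""
    variance = central_moment(values, 2)
    std_dev = math.sqrt(variance)
    skewness = central_moment(values, 3) / std_dev ** 3
    excess_kurtosis = central_moment(values, 4) / std_dev ** 4 - 3
    return variance, std_dev, skewness, excess_kurtosis

data = [12, 15, 14, 10, 8, 13, 14, 16, 18, 21]
variance, std_dev, skewness, excess_kurtosis = describe(data)

print(f"Variance: {variance:.4f}")
print(f"Standard Deviation: {std_dev:.4f}")
print(f"Skewness: {skewness:.4f}")
print(f"Excess Kurtosis: {excess_kurtosis:.4f}")
```

For this dataset the skewness comes out slightly positive and the excess kurtosis slightly negative, i.e. a mild right skew with lighter tails than a normal distribution.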
PROGRAM:-
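import numpy as np
from scipy.stats import skew, kurtosis

# calculate_statistics is not shown in the source; minimal version assumed from the imports
def calculate_statistics(data):
    variance = np.var(data)    # population variance (divides by n)
    std_dev = np.std(data)     # square root of the population variance
    skewness = skew(data)      # third standardized moment
    kurt = kurtosis(data)      # excess kurtosis (normal distribution = 0)
    return variance, std_dev, skewness, kurt

# Example dataset
data = [12, 15, 14, 10, 8, 13, 14, 16, 18, 21]
# Calculating statistics
variance, std_dev, skewness, kurt = calculate_statistics(data)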
print(f"Variance: {variance}")
print(f"Standard Deviation: {std_dev}")
print(f"Skewness: {skewness}")
print(f"Kurtosis: {kurt}")
OUTPUT:
MACRO PROJECT
AIM :- Develop a machine learning model to detect fraudulent credit card transactions. Explore
anomaly detection techniques and evaluate model performance.
DESCRIPTION :-
Fraud detection in credit card transactions is a critical application of machine learning, aiming to
minimize financial losses and ensure secure transactions. The primary challenge lies in the
imbalanced nature of the data—fraudulent transactions are rare compared to legitimate ones.
This necessitates techniques that can effectively detect anomalies or patterns indicative of fraud.
Data Preprocessing
1. Imbalanced Dataset: Fraudulent transactions typically make up less than 1% of the total
data. Handling this imbalance is crucial.
o Techniques such as oversampling (e.g. SMOTE, which generates synthetic minority-class
samples) or undersampling of the majority class can be employed.
2. Feature Scaling and Encoding: Most datasets require normalization/scaling (e.g.,
MinMaxScaler or StandardScaler) and encoding of categorical features for effective model
training.
3. Feature Selection/Engineering: Identifying important features such as transaction amount,
time, or customer behavior patterns.
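As a rough illustration of the resampling idea in item 1, the sketch below performs plain random oversampling with NumPy: minority rows are duplicated at random until both classes are the same size. This is a simpler stand-in for SMOTE, which interpolates new synthetic points rather than duplicating existing ones. The function name `random_oversample` and the toy data are assumptions made for the example.

```python
import numpy as np

def random_oversample(X, y, minority_label=1, seed=0):
    """Duplicate random minority rows until both classes have equal counts."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    n_extra = len(majority_idx) - len(minority_idx)
    if n_extra <= 0:
        return X, y  # already balanced (or minority is not smaller)
    # Sample minority rows with replacement to fill the gap
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([majority_idx, minority_idx, extra_idx])
    rng.shuffle(keep)
    return X[keep], y[keep]

# Toy data: 95 legitimate rows (0) and 5 fraudulent rows (1)
X = np.arange(200, dtype=float).reshape(100, 2)
y = np.array([0] * 95 + [1] * 5)

X_bal, y_bal = random_oversample(X, y)
print("Class counts before:", np.bincount(y))
print("Class counts after: ", np.bincount(y_bal))
```

Resampling of this kind is normally applied to the training split only, so that the test set keeps the real class ratio.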
Given the class imbalance, conventional accuracy is not a reliable metric on its own. Instead, precision,
recall, and F1-score are used, together with the confusion matrix.
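In anomaly detection:
Normalization puts all features on a comparable scale. Isolation Forest splits on one feature at a time, so it is
less sensitive to scaling than distance-based methods, but scaling keeps the preprocessing consistent across models.
StandardScaler standardizes each feature to a mean of 0 and a standard deviation of 1.
fit_transform computes the scaling parameters and applies them to every column except the target variable (Class).
This step can improve stability for scale-sensitive models.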
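The scaling and training steps described here can be sketched as follows. The synthetic table, the column names `V1` and `Amount`, and the `contamination` and `n_estimators` values are assumptions for the example; only the `Class` target and the names `data_scaled`, `labels` and `model` follow the text and code of this section.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for the transaction table: 990 normal rows, 10 extreme rows
normal = rng.normal(loc=0.0, scale=1.0, size=(990, 2))
fraud = rng.normal(loc=8.0, scale=1.0, size=(10, 2))
df = pd.DataFrame(np.vstack([normal, fraud]), columns=["V1", "Amount"])
df["Class"] = [0] * 990 + [1] * 10

labels = df["Class"]
features = df.drop(columns=["Class"])         # the target is excluded from scaling

scaler = StandardScaler()
data_scaled = scaler.fit_transform(features)  # each column now has mean 0, std 1

# contamination is the expected share of outliers (assumed value for this toy data)
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
model.fit(data_scaled)

print("Column means after scaling:", data_scaled.mean(axis=0).round(6))
print("Column stds after scaling: ", data_scaled.std(axis=0).round(6))
```

The model is fitted without the labels, which is what makes this an unsupervised anomaly detector; `labels` is kept only for evaluating the predictions afterwards.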
Step 4: Predictions
The predict method of IsolationForest assigns a label of -1 for outliers (fraudulent
transactions) and 1 for normal points (non-fraudulent transactions).
Since the target variable in the dataset has 1 for fraud and 0 for non-fraud, the predictions
need to be mapped:
o -1 (outliers) -> 1 (fraud)
o 1 (normal) -> 0 (non-fraud)
predictions = model.predict(data_scaled)
predictions = [1 if pred == -1 else 0 for pred in predictions]
# Evaluation metrics (labels holds the true Class values)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Accuracy
accuracy = accuracy_score(labels, predictions)
print("\nAccuracy:", accuracy)
# Precision
precision = precision_score(labels, predictions)
print("Precision:", precision)
# Recall
recall = recall_score(labels, predictions)
print("Recall:", recall)
# F1-Score
f1 = f1_score(labels, predictions)
print("F1-Score:", f1)
Classification Report
The classification report provides additional metrics like precision, recall, and F1-score for each
class (fraud and non-fraud). This gives more insight into how well the model performs on each
class.
print("\nClassification Report:")
print(classification_report(labels, predictions))
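# Confusion matrix heatmap (rows: actual class, columns: predicted class)
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
conf_matrix = confusion_matrix(labels, predictions)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Non-Fraud', 'Fraud'], yticklabels=['Non-Fraud', 'Fraud'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.show()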
True Positives (TP): 305 transactions were correctly identified as fraudulent.
True Negatives (TN): 281771 transactions were correctly identified as non-fraudulent.
False Positives (FP): 2544 transactions were incorrectly identified as fraudulent (Type I error).
False Negatives (FN): 187 transactions were incorrectly identified as non-fraudulent (Type II
error).
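The four counts above are enough to recompute the headline metrics by hand, which shows why accuracy alone is misleading on imbalanced data. The sketch below uses only the numbers reported in this section; the variable names are illustrative.

```python
# Confusion-matrix counts reported above
tp, tn, fp, fn = 305, 281771, 2544, 187

total = tp + tn + fp + fn
accuracy = (tp + tn) / total          # share of all transactions classified correctly
precision = tp / (tp + fp)            # share of flagged transactions that are truly fraud
recall = tp / (tp + fn)               # share of actual frauds that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Total transactions: {total}")
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
```

Accuracy is about 99% because the true negatives dominate the total, yet only about 11% of the flagged transactions are actually fraudulent and about 62% of the frauds are caught.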
MINI PROJECT
AIM :- Fake News Detection. Fake news is sometimes spread over the internet by
unauthorised sources, which creates problems for the people targeted, causes panic,
and can even lead to violence. Dataset: fake-news (Kaggle).
DESCRIPTION :-
Project Overview:
The project aims to develop a machine learning model for fake news detection using textual data.
The increasing prevalence of misinformation on the internet has made it critical to identify and flag
fake news articles. This model leverages Natural Language Processing (NLP) techniques, machine
learning algorithms, and feature extraction methods to classify news articles as either real or fake.
Description:
With the rise of social media and online news platforms, the spread of fake news has become a
major issue. Fake news can cause harm by spreading misinformation, creating panic, or inciting
violence. The goal of this project is to build an automatic system to classify news articles as real or
fake based on their content. The workflow involves cleaning and preprocessing the data,
visualizing the distribution and common words in fake news, extracting relevant features from the
text, training a machine learning model, evaluating its performance, and saving the model for
future predictions.
Steps Involved:
1. Data Collection and Understanding:
o A dataset containing news articles, typically labeled as either "real" or "fake", is
loaded into the system.
o The dataset is then analyzed to check the distribution of labels, missing values, and
any anomalies. Basic statistics are explored, and the data is cleaned for the next
steps.
2. Data Preprocessing:
o Text preprocessing is a critical part of any NLP task. Raw text data can contain
noise, such as URLs, special characters, and unnecessary whitespaces. Cleaning the
text helps the model focus on important features, like the core vocabulary.
o Steps include:
• Removing URLs using regular expressions.
• Removing non-alphabetic characters.
• Converting text to lowercase for uniformity.
• Stripping unnecessary spaces from the text.
3. Exploratory Data Analysis (EDA):
o Visualization: Before training the model, it is helpful to understand the dataset
better.
o Class distributions (real vs. fake news) are visualized.
o A word cloud of the most frequent words in fake news helps to identify common
patterns and potentially misleading vocabulary that could be indicators of fake
news.
4. Feature Extraction using TF-IDF:
o TF-IDF (Term Frequency-Inverse Document Frequency) is used to convert raw
text data into numerical vectors that can be processed by machine learning
algorithms.
o This method captures the importance of words in a document while penalizing
words that appear frequently across many documents, making it suitable for text
classification tasks.
5. Model Training:
o A Logistic Regression classifier is trained on the preprocessed and vectorized text
data. Logistic Regression is a simple and efficient algorithm for binary classification
tasks.
o The dataset is split into training and test sets to evaluate the model’s performance on
unseen data.
6. Model Evaluation:
o The trained model is evaluated using common metrics like:
• Accuracy: The proportion of correct predictions.
• Precision, Recall, and F1-score: These metrics are especially useful in
imbalanced datasets (such as news articles where fake news may be less
common than real news).
• Confusion Matrix: This matrix helps visualize how well the model
distinguishes between real and fake news by showing the counts of true
positives, true negatives, false positives, and false negatives.
7. Saving the Model:
o Once the model is trained and evaluated, it is saved using the joblib library so it can
be reloaded and used to make predictions on new, unseen data without retraining.
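Steps 4 to 7 can also be chained into a single scikit-learn Pipeline, so that one saved file holds both the TF-IDF vocabulary and the classifier. The sketch below is one possible arrangement, not the program used in this project: the six-sentence corpus, its labels (1 = fake, 0 = real) and the file name are made up for the example.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, already cleaned; 1 = fake, 0 = real (illustrative labels)
texts = [
    "shocking miracle cure doctors hate",
    "aliens secretly control the government",
    "you will not believe this shocking secret",
    "city council approves new budget",
    "central bank holds interest rates steady",
    "local school opens new library",
]
labels = [1, 1, 1, 0, 0, 0]

# The pipeline vectorizes raw text and then classifies it
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(texts, labels)

# Save to a temporary file, then reload it as a later session would
model_path = os.path.join(tempfile.mkdtemp(), "fake_news_pipeline.joblib")
joblib.dump(pipeline, model_path)
reloaded = joblib.load(model_path)

new_articles = ["shocking secret cure", "council approves library budget"]
print("Predictions from reloaded model:", reloaded.predict(new_articles))
```

Because the vectorizer travels with the classifier, new text can be passed to `predict` as raw strings without refitting or separately loading the TF-IDF step.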
Technology Stack and Libraries Used
Tech Stack:
1. Python: The programming language used for the entire project. Python is well-suited for
machine learning tasks due to its simplicity and the wide availability of libraries and
frameworks.
2. Jupyter Notebook or Google Colab: Used for writing, testing, and visualizing the code
and results interactively.
Libraries Used:
1. pandas:
o Purpose: Data manipulation and analysis.
o Usage: It is used to load and preprocess the dataset, clean missing data, and analyze
the structure of the dataset.
2. numpy:
o Purpose: Numerical computing.
o Usage: It is used for working with arrays and matrices, particularly in conjunction
with machine learning algorithms.
3. re (Regular Expression):
o Purpose: Text processing.
o Usage: Regular expressions are used to clean the text data, such as removing URLs
or special characters.
4. spaCy (Optional but useful):
o Purpose: Natural Language Processing (NLP).
o Usage: For advanced NLP tasks like tokenization, lemmatization, and part-of-
speech tagging. While not explicitly used in the provided code, it can be helpful for
more sophisticated preprocessing.
5. matplotlib and seaborn:
o Purpose: Data visualization.
o Usage: Used for visualizing data distributions and model performance.
6. wordcloud:
o Purpose: Text visualization.
o Usage: It generates a visual representation (word cloud) of the most frequent words
in a text corpus.
7. scikit-learn (sklearn):
o Purpose: Machine learning.
o Usage: It provides tools for feature extraction, model training, evaluation, and
tuning.
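As a small illustration of scikit-learn's feature-extraction role, the sketch below fits `TfidfVectorizer` on three made-up headlines and compares the learned IDF weight of a word that appears in every document with one that appears in only one. The corpus is an assumption for the example; `idf_` and `vocabulary_` are attributes of the fitted vectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up headlines: "breaking" and "news" appear in all of them
docs = [
    "breaking news election results announced",
    "breaking news storm hits coast",
    "breaking news scientists discover planet",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# vocabulary_ maps each word to its column; idf_ holds the learned IDF weights
idf = {word: vectorizer.idf_[col] for word, col in vectorizer.vocabulary_.items()}

print("IDF of 'news' (in every document):", round(idf["news"], 3))
print("IDF of 'planet' (in one document):", round(idf["planet"], 3))
print("TF-IDF matrix shape:", tfidf.shape)
```

The word shared by every headline receives the lowest possible weight, while the rare word is weighted more heavily, which is the penalizing behaviour described for TF-IDF in the steps above.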
PROGRAM:
# Step 1: Load Libraries and Dataset
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
# Load the dataset (file name assumed; point this at the local copy of the Kaggle fake-news CSV)
data = pd.read_csv("train.csv")
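# Step 2: Data Preprocessing
# preprocess_text is not shown in the source; reconstructed from the cleaning steps in the description
def preprocess_text(text):
    text = re.sub(r'http\S+|www\.\S+', ' ', text)   # remove URLs
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)        # remove non-alphabetic characters
    text = text.lower()                             # lowercase for uniformity
    return re.sub(r'\s+', ' ', text).strip()        # collapse and strip extra spaces
# Apply preprocessing
data['text'] = data['text'].fillna('').apply(preprocess_text)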
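# Step 3: Exploratory Data Analysis
# Word cloud of fake-news text (assumes label 1 marks fake articles; size settings are illustrative)
fake_text = ' '.join(data.loc[data['label'] == 1, 'text'])
wc_fake = WordCloud(width=800, height=400, background_color='white').generate(fake_text)
plt.figure(figsize=(10, 5))
plt.imshow(wc_fake, interpolation='bilinear')
plt.axis("off")
plt.title("Most Common Words in Fake News")
plt.show()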
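# Step 4: Feature Extraction (TF-IDF Vectorization)
# Extract features and labels
X = data['text']
y = data['label']
# Train/test split (call not shown in the source; split ratio and seed are assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# TF-IDF Vectorization (fitted on the training split only, then applied to the test split)
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
print("TF-IDF Matrix Shape:", X_train_tfidf.shape)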
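# Step 5: Model Training
# Training call not shown in the source; max_iter is an assumed setting
model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)
# Make predictions
y_pred = model.predict(X_test_tfidf)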
# Step 6: Model Evaluation
# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print(classification_report(y_test, y_pred))
# Confusion Matrix (tick labels assume label 0 = real and 1 = fake)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Real', 'Fake'], yticklabels=['Real', 'Fake'])
plt.title("Confusion Matrix")
plt.ylabel("True Label")
plt.xlabel("Predicted Label")
plt.show()
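# Step 7: Save the Model
import joblib
# File names are assumed; the vectorizer is saved too, since new text must be transformed the same way
joblib.dump(model, 'fake_news_model.pkl')
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')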