Micro Project

S.No.   Program                                                               Page no   Signature
1       Write a python program to calculate the Variance, Standard
        Deviation, Skewness and Kurtosis.

Macro Project

S.No.   Program                                                               Page no   Signature
1       Develop a machine learning model to detect fraudulent credit card
        transactions. Explore anomaly detection techniques and evaluate
        model performance.

Mini Skill Project

S.No.   Program                                                               Page no   Signature
1       Fake News Detection: fake news is sometimes spread on the internet
        by unauthorised sources, which creates problems for the targeted
        person, causes panic, and can even lead to violence.
        Dataset: fake-news (Kaggle).
MICRO PROJECT

AIM :- Write a python program to calculate the Variance, Standard Deviation, Skewness and Kurtosis.

DESCRIPTION :-

VARIANCE

 Definition: Variance measures the spread of data points from the mean. It gives an
indication of how much the values in the data set deviate from the average value.
 Formula:

   Population variance: σ² = Σ(xᵢ − μ)² / N
   Sample variance:     s² = Σ(xᵢ − x̄)² / (n − 1)

• μ is the population mean, x̄ is the sample mean.
• N is the size of the population and n is the number of data points in the sample.
STANDARD DEVIATION:

• Description:
Standard deviation is the square root of variance, measuring how much the values deviate
from the mean, in the same unit as the data.
• Formula:

   Population: σ = √( Σ(xᵢ − μ)² / N )
   Sample:     s = √( Σ(xᵢ − x̄)² / (n − 1) )

• A lower standard deviation means the data points tend to be closer to the mean, whereas a
higher standard deviation indicates more spread.

SKEWNESS

 Definition: Skewness measures the asymmetry of the distribution of data around its mean.
It indicates whether the data is skewed to the left (negatively skewed) or to the right
(positively skewed).
 Formula:

   Skewness (g₁) = m₃ / m₂^(3/2),  where m₂ = Σ(xᵢ − x̄)² / n and m₃ = Σ(xᵢ − x̄)³ / n

where:

 x̄ is the mean.
 n is the number of data points.

• A negative skewness value indicates left-skewed data (a longer tail on the left, with most
values concentrated on the right), while a positive skewness indicates right-skewed data
(a longer tail on the right, with most values concentrated on the left). A skewness value
close to 0 indicates a roughly symmetric distribution.

KURTOSIS

 Definition: Kurtosis measures the "tailedness" of the distribution, indicating how much of
the data is in the tails compared to a normal distribution. It helps describe the shape of the
distribution.
 Formula:

   Excess kurtosis (g₂) = m₄ / m₂² − 3,  where m₄ = Σ(xᵢ − x̄)⁴ / n

• A positive excess kurtosis (leptokurtic) means that the distribution has heavier tails than a
normal distribution, while a negative excess kurtosis (platykurtic) means it has lighter tails.
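
To connect these formulas with the program that follows, here is a minimal hand-computed
sketch (not part of the original program) that evaluates the same quantities without SciPy;
the moment-based skewness and excess kurtosis shown here match the defaults of
scipy.stats.skew and scipy.stats.kurtosis used below.

import math

def manual_statistics(data):
    n = len(data)
    mean = sum(data) / n
    # Sample variance and standard deviation (denominator n - 1)
    variance = sum((x - mean) ** 2 for x in data) / (n - 1)
    std_dev = math.sqrt(variance)
    # Central moments with denominator n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    skewness = m3 / m2 ** 1.5   # g1, as in scipy.stats.skew (bias=True)
    kurt = m4 / m2 ** 2 - 3     # excess kurtosis, as in scipy.stats.kurtosis (fisher=True)
    return variance, std_dev, skewness, kurt

print(manual_statistics([12, 15, 14, 10, 8, 13, 14, 16, 18, 21]))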
PROGRAM:-

import numpy as np
from scipy.stats import skew, kurtosis

# Function to calculate Variance, Standard Deviation, Skewness, and Kurtosis
def calculate_statistics(data):
    variance = np.var(data, ddof=1)
    std_dev = np.std(data, ddof=1)  # ddof=1 for sample standard deviation
    skewness = skew(data)
    kurt = kurtosis(data)
    return variance, std_dev, skewness, kurt

# Example dataset
data = [12, 15, 14, 10, 8, 13, 14, 16, 18, 21]

# Calculating statistics
variance, std_dev, skewness, kurt = calculate_statistics(data)

print(f"Variance: {variance}")
print(f"Standard Deviation: {std_dev}")
print(f"Skewness: {skewness}")
print(f"Kurtosis: {kurt}")

OUTPUT:
MACRO PROJECT

AIM :- Develop a machine learning model to detect fraudulent credit card transactions. Explore
anomaly detection techniques and evaluate model performance.

DESCRIPTION :-
Fraud detection in credit card transactions is a critical application of machine learning, aiming to
minimize financial losses and ensure secure transactions. The primary challenge lies in the
imbalanced nature of the data—fraudulent transactions are rare compared to legitimate ones.
This necessitates techniques that can effectively detect anomalies or patterns indicative of fraud.

Data Preprocessing

1. Imbalanced Dataset: Fraudulent transactions typically make up less than 1% of the total
data. Handling this imbalance is crucial.
o Techniques like oversampling (SMOTE), undersampling, or generating synthetic
samples can be employed (a short sketch follows this list).
2. Feature Scaling and Encoding: Most datasets require normalization/scaling (e.g.,
MinMaxScaler or StandardScaler) and encoding of categorical features for effective model
training.
3. Feature Selection/Engineering: Identifying important features such as transaction amount,
time, or customer behavior patterns.
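
The following is a minimal sketch of the SMOTE oversampling mentioned in step 1. It is not
part of the project's program; it assumes the imbalanced-learn package is installed and that
X_train and y_train (hypothetical names) hold the preprocessed training features and labels.

# Hypothetical sketch: rebalance the training split with SMOTE
# Requires the imbalanced-learn package (pip install imbalanced-learn)
import pandas as pd
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)                        # synthesize new minority-class samples
X_res, y_res = smote.fit_resample(X_train, y_train)   # X_train / y_train are assumed to exist
print(pd.Series(y_res).value_counts())                # both classes are now equally represented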

Machine Learning Techniques

1. Supervised Learning Models:
o Logistic Regression: Interpretable model for binary classification.
o Random Forest and Gradient Boosting (e.g., XGBoost): Capable of handling
imbalanced data with class weights (a short sketch follows this list).
o Neural Networks: For capturing complex patterns but require significant data and
resources.
o Support Vector Machines (SVM): Effective for smaller, high-dimensional
datasets.

2. Anomaly Detection Techniques:
o Unsupervised Models (useful when labels are scarce or unreliable):
• Autoencoders: Neural networks that learn to reconstruct data; higher reconstruction
error indicates potential fraud.
• Isolation Forest: Detects anomalies by isolating observations.
• Clustering Algorithms (e.g., DBSCAN, K-Means): Identifies clusters of
legitimate transactions; outliers are flagged as anomalies.
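
As a minimal illustration of the class-weight idea noted under Random Forest above (an
assumption-based sketch, not the project's final model), the snippet below weights each class
inversely to its frequency; X_train, X_test and y_train are hypothetical names for an existing
supervised split.

# Hypothetical sketch: a supervised baseline that compensates for class imbalance
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100,
                             class_weight='balanced',  # reweight classes by inverse frequency
                             random_state=42)
clf.fit(X_train, y_train)      # X_train / y_train assumed to exist
y_pred = clf.predict(X_test)   # X_test assumed to exist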
Evaluation Metrics

Given the class imbalance, conventional accuracy is not a reliable metric. Instead, the following are
used:

 Precision: High precision ensures fewer false positives.
 Recall (Sensitivity): Indicates the model's ability to identify actual frauds.
 F1 Score: Balances precision and recall.
 AUC-ROC Curve: Evaluates the trade-off between true positive and false positive rates
(a short sketch follows this list).
 Confusion Matrix: Provides insights into false positives, false negatives, and true
classification rates.
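
The program below reports accuracy, precision, recall, the F1 score, and a confusion matrix, but
not AUC-ROC. A minimal hedged sketch is given here, assuming labels holds the ground-truth 0/1
classes and scores holds a continuous anomaly or probability score per transaction (both names
are placeholders).

# Hypothetical sketch: AUC-ROC from continuous scores
from sklearn.metrics import roc_auc_score

# For IsolationForest, an anomaly score can be obtained as -model.score_samples(X);
# for a probabilistic classifier, use model.predict_proba(X)[:, 1] instead.
auc = roc_auc_score(labels, scores)   # labels / scores assumed to exist
print("AUC-ROC:", auc)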

Steps to Develop the Model

1. Exploratory Data Analysis (EDA):
o Understand the data distribution, identify missing values, and visualize class
imbalance.
2. Data Preprocessing:
o Handle missing data, normalize/scale features, and balance classes.
3. Model Training:
o Train multiple models using supervised and unsupervised techniques.
4. Hyperparameter Tuning:
o Optimize model parameters using grid search, random search, or Bayesian
optimization (a grid-search sketch follows this list).
5. Evaluation:
o Evaluate models using the metrics listed above, focusing on minimizing false
negatives (missed frauds).
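
A minimal sketch of the grid search mentioned in step 4, under the assumption that a supervised
split X_train / y_train exists (hypothetical names); the parameter grid values are illustrative only.

# Hypothetical sketch: grid search over a Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 200], 'max_depth': [None, 10]}
grid = GridSearchCV(RandomForestClassifier(class_weight='balanced', random_state=42),
                    param_grid, scoring='f1', cv=3)   # optimize F1 because of the class imbalance
grid.fit(X_train, y_train)                            # X_train / y_train assumed to exist
print("Best parameters:", grid.best_params_)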

Anomaly Detection Approach

In anomaly detection:

 Models are trained on legitimate transaction data (normal data).
 Fraudulent transactions are flagged as anomalies based on their deviation from learned
patterns.
 Suitable algorithms:
o One-Class SVM: Learns a boundary for normal data (a short sketch follows this list).
o Autoencoders: Reconstruct transactions; high reconstruction errors signify
anomalies.
o Isolation Forest: Efficiently isolates anomalies by random partitioning.
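
The program below implements the Isolation Forest variant. For comparison, here is a minimal
hedged sketch of the One-Class SVM approach listed above; it reuses data, data_scaled and the
0/1 Class labels defined in the program, and the nu value is an assumption that mirrors the 1%
contamination used there.

# Hypothetical sketch: One-Class SVM fitted on legitimate transactions only
from sklearn.svm import OneClassSVM

normal_only = data_scaled[data['Class'].values == 0]        # train on normal data, as described above
oc_svm = OneClassSVM(kernel='rbf', nu=0.01, gamma='scale')  # nu roughly matches the 1% fraud share
oc_svm.fit(normal_only)
svm_preds = oc_svm.predict(data_scaled)                     # -1 = anomaly, 1 = normal
svm_preds = [1 if p == -1 else 0 for p in svm_preds]        # map to 1 = fraud, 0 = non-fraud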
PROGRAM:-

# Importing all the necessary libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest
from sklearn.metrics import (classification_report, confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)
import seaborn as sns
import matplotlib.pyplot as plt

Step 1: Load the Dataset


data = pd.read_csv('creditcard.csv')
data

Step 2: Preprocess the Data


In many fraud detection tasks, the Time column (which represents the time of day the transaction
occurred) may not provide useful information for detecting fraud. Removing it helps reduce noise
and prevents the model from overfitting.

data = data.drop(['Time'], axis=1)

 Normalization scales all features to the same range; Isolation Forest itself is tree-based and
not very sensitive to scaling, but scaling matters for alternatives such as One-Class SVM.
 StandardScaler standardizes features to have a mean of 0 and a standard deviation of 1.
 fit_transform computes and applies the scaling, excluding the target variable (Class). This step
improves model performance and stability.

Normalize the Features


scaler = StandardScaler()
data_scaled = scaler.fit_transform(data.drop('Class', axis=1))  # 'Class' is the target label
labels = data['Class']  # keep the ground-truth labels for the evaluation step
Step 3: Anomaly Detection Using Isolation Forest
Isolation Forest is used for detecting anomalies in high-dimensional data, like credit card
transactions. It identifies outliers (fraudulent cases). The contamination=0.01 parameter assumes
1% of transactions are fraud. The model is trained using model.fit(data_scaled).

model = IsolationForest(contamination=0.01) # 1% expected fraudulence


model.fit(data_scaled)

Step 4: Predictions
 The predict method of IsolationForest assigns a label of -1 for outliers (fraudulent
transactions) and 1 for normal points (non-fraudulent transactions).
 Since the target variable in the dataset has 1 for fraud and 0 for non-fraud, the predictions
need to be mapped:
o -1 (outliers) -> 1 (fraud)
o 1 (normal) -> 0 (non-fraud)

predictions = model.predict(data_scaled)
predictions = [1 if pred == -1 else 0 for pred in predictions]

Step 5: Evaluate Model Performance


Accuracy
 Accuracy is the proportion of correct predictions (both fraud and non-fraud) out of all
predictions. It's useful for general performance but may be misleading in imbalanced
datasets.
Precision
 Precision measures how many of the predicted fraud cases are actually fraud. It’s important
when false positives (non-fraud labeled as fraud) are costly.
Recall
 Recall indicates how many of the actual fraud cases were correctly identified. It’s crucial
when missing fraud cases (false negatives) is more costly.
F1-Score
 F1-Score is the balance between precision and recall. It’s useful when you need to balance
both false positives and false negatives, especially in imbalanced datasets.

# Accuracy
accuracy = accuracy_score(labels, predictions)
print("\nAccuracy:", accuracy)

# Precision
precision = precision_score(labels, predictions)
print("Precision:", precision)
# Recall
recall = recall_score(labels, predictions)
print("Recall:", recall)

# F1-Score
f1 = f1_score(labels, predictions)
print("F1-Score:", f1)

Classification Report
The classification report provides additional metrics like precision, recall, and F1-score for each
class (fraud and non-fraud). This gives more insight into how well the model performs on each
class.

print("\nClassification Report:")
print(classification_report(labels, predictions))

Step 6: Plotting the Confusion Matrix


 Confusion matrix visualization is helpful to understand how the model is performing.
 A heatmap is used to visually display the confusion matrix, with the actual labels on the y-
axis and predicted labels on the x-axis. The color intensity indicates the number of instances
in each category.

conf_matrix = confusion_matrix(labels, predictions)  # matrix used in the heatmap below

plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Non-Fraud', 'Fraud'],
            yticklabels=['Non-Fraud', 'Fraud'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.show()
True Positives (TP): 305 transactions were correctly identified as fraudulent.
True Negatives (TN): 281771 transactions were correctly identified as non-fraudulent.
False Positives (FP): 2544 transactions were incorrectly identified as fraudulent (Type I error).
False Negatives (FN): 187 transactions were incorrectly identified as non-fraudulent (Type II
error).
MINI PROJECT

AIM :- Fake News Detection: fake news is sometimes spread on the internet by
unauthorised sources, which creates problems for the targeted person, causes panic,
and can even lead to violence. Dataset: fake-news (Kaggle).

DESCRIPTION :-

Project Overview:
The project aims to develop a machine learning model for fake news detection using textual data.
The increasing prevalence of misinformation on the internet has made it critical to identify and flag
fake news articles. This model leverages Natural Language Processing (NLP) techniques, machine
learning algorithms, and feature extraction methods to classify news articles as either real or fake.

Description:
With the rise of social media and online news platforms, the spread of fake news has become a
major issue. Fake news can cause harm by spreading misinformation, creating panic, or inciting
violence. The goal of this project is to build an automatic system to classify news articles as real or
fake based on their content. The workflow involves cleaning and preprocessing the data,
visualizing the distribution and common words in fake news, extracting relevant features from the
text, training a machine learning model, evaluating its performance, and saving the model for
future predictions.

Steps Involved:
1. Data Collection and Understanding:
o A dataset containing news articles, typically labeled as either "real" or "fake", is
loaded into the system.
o The dataset is then analyzed to check the distribution of labels, missing values, and
any anomalies. Basic statistics are explored, and the data is cleaned for the next
steps.
2. Data Preprocessing:
o Text preprocessing is a critical part of any NLP task. Raw text data can contain
noise, such as URLs, special characters, and unnecessary whitespaces. Cleaning the
text helps the model focus on important features, like the core vocabulary.
o Steps include:
 Removing URLs using regular expressions.
 Removing non-alphabetic characters.
 Converting text to lowercase for uniformity.
 Stripping unnecessary spaces from the text.
3. Exploratory Data Analysis (EDA):
o Visualization: Before training the model, it is helpful to understand the dataset
better.
o Class distributions (real vs. fake news) are visualized.
o A word cloud of the most frequent words in fake news helps to identify common
patterns and potentially misleading vocabulary that could be indicators of fake
news.
4. Feature Extraction using TF-IDF:
o TF-IDF (Term Frequency-Inverse Document Frequency) is used to convert raw
text data into numerical vectors that can be processed by machine learning
algorithms.
o This method captures the importance of words in a document while penalizing
words that appear frequently across many documents, making it suitable for text
classification tasks.
5. Model Training:
o A Logistic Regression classifier is trained on the preprocessed and vectorized text
data. Logistic Regression is a simple and efficient algorithm for binary classification
tasks.
o The dataset is split into training and test sets to evaluate the model’s performance on
unseen data.
6. Model Evaluation:
o The trained model is evaluated using common metrics like:
 Accuracy: The proportion of correct predictions.
 Precision, Recall, and F1-score: These metrics are especially useful in
imbalanced datasets (such as news articles where fake news may be less
common than real news).
 Confusion Matrix: This matrix helps visualize how well the model
distinguishes between real and fake news by showing the counts of true
positives, true negatives, false positives, and false negatives.
7. Saving the Model:
o Once the model is trained and evaluated, it is saved using the joblib library so it can
be reloaded and used to make predictions on new, unseen data without retraining.
Technology Stack and Libraries Used
Tech Stack:
1. Python: The programming language used for the entire project. Python is well-suited for
machine learning tasks due to its simplicity and the wide availability of libraries and
frameworks.
2. Jupyter Notebook or Google Colab: Used for writing, testing, and visualizing the code
and results interactively.
Libraries Used:
1. pandas:
o Purpose: Data manipulation and analysis.
o Usage: It is used to load and preprocess the dataset, clean missing data, and analyze
the structure of the dataset.
2. numpy:
o Purpose: Numerical computing.
o Usage: It is used for working with arrays and matrices, particularly in conjunction
with machine learning algorithms.
3. re (Regular Expression):
o Purpose: Text processing.
o Usage: Regular expressions are used to clean the text data, such as removing URLs
or special characters.
4. spaCy (Optional but useful):
o Purpose: Natural Language Processing (NLP).
o Usage: For advanced NLP tasks like tokenization, lemmatization, and part-of-
speech tagging. While not explicitly used in the provided code, it can be helpful for
more sophisticated preprocessing (a short sketch follows this list).
5. matplotlib and seaborn:
o Purpose: Data visualization.
o Usage: Used for visualizing data distributions and model performance.
6. wordcloud:
o Purpose: Text visualization.
o Usage: It generates a visual representation (word cloud) of the most frequent words
in a text corpus.
7. scikit-learn (sklearn):
o Purpose: Machine learning.
o Usage: It provides tools for feature extraction, model training, evaluation, and
tuning.
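
A minimal hedged sketch of the optional spaCy preprocessing mentioned in item 4; it is not used
in the program below and assumes the en_core_web_sm model has been downloaded
(python -m spacy download en_core_web_sm).

# Hypothetical sketch: lemmatize and remove stop words with spaCy
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed

def spacy_preprocess(text):
    doc = nlp(text)
    # keep alphabetic, non-stop-word tokens and replace them with their lemmas
    return " ".join(tok.lemma_ for tok in doc if tok.is_alpha and not tok.is_stop)

print(spacy_preprocess("Breaking: officials were reportedly hiding the documents"))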
PROGRAM:
#Step 1: Load Libraries and Dataset
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline

# Load the dataset


data = pd.read_csv("train.csv", encoding="utf-8")

# Display basic information


print(data.head())
print(data.info())
print(data['label'].value_counts()) # Check class distribution
# Step 2: Data Preprocessing
# Text cleaning function
def preprocess_text(text):
    text = re.sub(r"http\S+", "", text)      # Remove URLs
    text = re.sub(r"[^a-zA-Z\s]", "", text)  # Remove non-alphabetic characters
    text = text.lower()                      # Convert to lowercase
    text = text.strip()                      # Remove leading/trailing spaces
    return text

# Apply preprocessing
data['text'] = data['text'].fillna('').apply(preprocess_text)

# Check the cleaned text


print(data['text'].head())

# Step 3: Exploratory Data Analysis (EDA)


# Visualize class distribution
sns.countplot(x=data['label'])
plt.title("Class Distribution: Real vs Fake")
plt.show()

# Generate WordCloud for Fake News


fake_news = " ".join(data[data['label'] == 1]['text'])
wc_fake = WordCloud(width=800, height=400, background_color='black').generate(fake_news)

plt.figure(figsize=(10, 5))
plt.imshow(wc_fake, interpolation='bilinear')
plt.axis("off")
plt.title("Most Common Words in Fake News")
plt.show()
# Step 4: Feature Extraction (TF-IDF Vectorization)
# Extract features and labels
X = data['text']
y = data['label']

# Split data into training and test sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TF-IDF Vectorization
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
print("TF-IDF Matrix Shape:", X_train_tfidf.shape)

# Step 5: Model Training


# Train Logistic Regression Model
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

# Make predictions
y_pred = model.predict(X_test_tfidf)
# Step 6: Model Evaluation
# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

print("Classification Report:\n", classification_report(y_test, y_pred))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Real', 'Fake'],
yticklabels=['Real', 'Fake'])
plt.title("Confusion Matrix")
plt.ylabel("True Label")
plt.xlabel("Predicted Label")
plt.show()
# Step 7: Save the Model
import joblib

# Save the trained model and vectorizer


joblib.dump(model, 'fake_news_model.pkl')
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')

print("Model and vectorizer saved!")
