DATA STRUCTURES

ACSD08

A REPORT ON COMPLEX ENGINEERING PROBLEM SOLVING (AAT - 2)

Vemula Rohan
23951A04E3
ECE-C
DSCSP87
Complex Problem Solving Self-Assessment Form

1 Name of the Student: Vemula Rohan
2 Roll Number: 23951A04E3
3 Branch and Section: ECE-C
4 Program: B.Tech
5 Course Name: Data Structures
6 Course Code: ACSD08

7 Please tick (✓) relevant Engineering Competency (ECs) Profiles

EC Profiles (✓)
EC1 Ensures that all aspects of an engineering activity are soundly based on fundamental principles - by diagnosing, and taking appropriate action with data, calculations, results, proposals, processes, practices, and documented information that may be ill-founded, illogical, erroneous, unreliable or unrealistic requirements applicable to the engineering discipline. ✓
EC2 Have no obvious solution and requires abstract thinking and originality in analysis to formulate suitable models. ✓
EC3 Support sustainable development solutions by ensuring functional requirements, minimize environmental impact and optimize resource utilization throughout the life cycle, while balancing performance and cost effectiveness. ✓
EC4 Competently addresses complex engineering problems which involve uncertainty, ambiguity, imprecise information and wide-ranging or conflicting technical, engineering and other issues. ✓
EC5 Conceptualises alternative engineering approaches and evaluates potential outcomes against appropriate criteria to justify an optimal solution choice. ✓
EC6 Identifies, quantifies, mitigates and manages technical, health, environmental, safety, economic and other contextual risks associated to seek achievable sustainable outcomes with engineering application in the designated engineering discipline. ✓
EC7 Involve the coordination of diverse resources (and for this purpose, resources include people, money, equipment, materials, information and technologies) in the timely delivery of outcomes. ✓
EC8 Design and develop solution to complex engineering problem considering every perspective and taking account of stakeholder views with widely varying needs. ✓
EC9 Meet all level, legal, regulatory, relevant standards and codes of practice, protect public health and safety in the course of all engineering activities. ✓
EC10 High level problems including many component parts or sub-problems, partitions problems, processes or systems into manageable elements for the purposes of analysis, modelling or design and then re-combines to form a whole, with the integrity and performance of the overall system as the top consideration. ✓
EC11 Undertake CPD activities to maintain and extend competences and enhance the ability to adapt to emerging technologies and the ever-changing nature of work. ✓
EC12 Recognize complexity and assess alternatives in light of competing requirements and incomplete knowledge. Require judgement in decision making in the course of all complex engineering activities. ✓
8 Please tick (✓) relevant Course Outcomes (COs) Covered

CO Course Outcomes (✓)
CO1 Interpret the complexity of the algorithm using the asymptotic notations. ✓
CO2 Select the appropriate searching and sorting technique for a given problem. ✓
CO3 Construct programs on performing operations on linear and nonlinear data structures for organization of a data. ✓
CO4 Make use of linear data structures and nonlinear data structures solving real time applications. ✓
CO5 Describe hashing techniques and collision resolution methods for accessing data with respect to performance. ✓
CO6 Compare various types of data structures; in terms of implementation, operations and performance. ✓

9 Course ELRV Video Lectures Viewed: Number of Videos: 68; Viewing Time in Hours: 35

10 Justify your understanding of WK1: Foundations for analyzing and optimizing operations.

11 Justify your understanding of WK2–WK9: Core to advanced concepts, tools, design, and ethics.

12 How many WKs from WK2 to WK9 were implemented? All 8 WKs from WK2 to WK9 are implemented in this design and analysis.

Mention them: WK2 to WK9

Date: 22-12-2024

Signature of the Student

PROBLEM STATEMENT
Preprocessing Text Data for Natural Language Processing (NLP) Models (DSCSP87)

I. Project Overview
Preprocessing text data is a fundamental step in Natural Language Processing (NLP).
Raw text data from various sources, such as social media posts, articles, or speech
transcriptions, is often noisy and unstructured. Effective preprocessing transforms
this raw data into a clean, structured format that machine learning models can
process. This ensures better performance, generalization, and interpretability of NLP
models.

II. Objectives
1. Text Normalization: Convert text into a consistent format, including cleaning,
tokenization, and case normalization.
2. Noise Removal: Eliminate unwanted elements such as punctuation, special
characters, and irrelevant data.
3. Stopword Removal: Remove common words (e.g., "the," "is") that do not
contribute significant meaning.
4. Stemming and Lemmatization: Reduce words to their root forms to minimize
redundant variations.
5. Feature Representation: Prepare text for numerical representation, such as
Bag-of-Words, TF-IDF, or embeddings.
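
As an illustration of the feature representation objective, the following is a minimal Bag-of-Words sketch in plain Java. It assumes the documents have already been tokenized; the class name BagOfWordsSketch and its methods are hypothetical and are not part of the report's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper: turns tokenized documents into term-count vectors.
public class BagOfWordsSketch {

    // Collect every distinct token across all documents, in sorted order.
    static List<String> buildVocabulary(List<List<String>> docs) {
        TreeSet<String> vocabulary = new TreeSet<>();
        for (List<String> doc : docs) {
            vocabulary.addAll(doc);
        }
        return new ArrayList<>(vocabulary);
    }

    // Count how often each vocabulary term occurs in one document.
    // indexOf is a linear scan; a Map would be faster for large vocabularies.
    static int[] toCountVector(List<String> doc, List<String> vocabulary) {
        int[] counts = new int[vocabulary.size()];
        for (String token : doc) {
            int index = vocabulary.indexOf(token);
            if (index >= 0) {
                counts[index]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("text", "model", "text"),
            Arrays.asList("model", "accuracy"));
        List<String> vocabulary = buildVocabulary(docs);
        System.out.println(vocabulary); // [accuracy, model, text]
        for (List<String> doc : docs) {
            // prints [0, 1, 2] and then [1, 1, 0]
            System.out.println(Arrays.toString(toCountVector(doc, vocabulary)));
        }
    }
}
```

Each document becomes a fixed-length vector of counts over a shared vocabulary, which is the form most classical models expect as input.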

III. Key Steps in Preprocessing


1. Cleaning the Text:
   o Remove punctuation, numbers, and special characters.
   o Convert text to lowercase for consistency.
2. Tokenization:
   o Split text into individual words or tokens.
3. Stopword Removal:
   o Eliminate common, less meaningful words (e.g., "and," "to," "at").
4. Stemming and Lemmatization:
   o Stemming: Strip words down to their roots (e.g., "running" → "run").
   o Lemmatization: Convert words to their base dictionary form (e.g., "better" → "good").
5. Handling Noise:
   o Remove duplicate spaces, URLs, or irrelevant content such as HTML tags.
6. Final Output:
   o Produce a clean, structured list of words or sentences ready for feature extraction.
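
The noise-handling step (step 5) could be sketched with regular expressions as follows. This is an assumption-laden illustration: the class name NoiseRemovalSketch and the three patterns are hypothetical, and the patterns are deliberately loose rather than a full HTML or URL parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper: strips HTML tags, URLs, and repeated whitespace.
public class NoiseRemovalSketch {

    // Anything between '<' and '>' is treated as a tag (a rough approximation).
    private static final Pattern HTML_TAG = Pattern.compile("<[^>]+>");
    // A URL is assumed to start with http://, https://, or www. and run to the next space.
    private static final Pattern URL = Pattern.compile("(https?://|www\\.)\\S+");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String removeNoise(String text) {
        // Tags are removed first so a URL directly followed by a closing tag is still isolated.
        String result = HTML_TAG.matcher(text).replaceAll(" ");
        result = URL.matcher(result).replaceAll(" ");
        // Collapse runs of whitespace and trim the ends.
        return WHITESPACE.matcher(result).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String raw = "<p>Read  more at https://example.com/nlp now!</p>";
        System.out.println(removeNoise(raw)); // Read more at now!
    }
}
```

Running this step before punctuation removal matters, because stripping punctuation first would break a URL into word-like fragments that can no longer be recognized.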

IV. Challenges
1. Language Dependency: Preprocessing rules vary for different languages (e.g.,
compound words in German).
2. Ambiguity: Words with multiple meanings (e.g., "bat") require context for
accurate preprocessing.
3. Data Size: Handling large text corpora efficiently.
4. Maintaining Semantic Meaning: Excessive cleaning might remove
meaningful words, affecting downstream tasks.

V. Technologies Used
1. Programming Language: Java
2. Libraries/Frameworks: None (built using native Java capabilities)
3. Concepts: Object-Oriented Programming, Regular Expressions, Stream API

VI. Java Implementation


Code: Text Preprocessing for NLP
import java.util.*;
import java.util.stream.Collectors;

public class TextPreprocessing {

    // Stopwords list
    private static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
        "the", "is", "in", "and", "to", "a", "of", "it", "on", "for", "this", "that", "with",
        "as", "at", "by", "an"
    ));

    // Main preprocessing pipeline
    public static void main(String[] args) {
        // Sample text data
        List<String> textData = Arrays.asList(
            "The quick brown fox jumps over the lazy dog!",
            "NLP is fun, and it's exciting to preprocess text!",
            "Cleaning & normalizing text improves model accuracy."
        );

        System.out.println("Original Text Data:");
        textData.forEach(System.out::println);

        // Step 1: Clean Text
        List<String> cleanedTexts = textData.stream()
            .map(TextPreprocessing::cleanText)
            .collect(Collectors.toList());

        // Step 2: Tokenize Text
        List<List<String>> tokenizedTexts = cleanedTexts.stream()
            .map(TextPreprocessing::tokenizeText)
            .collect(Collectors.toList());

        // Step 3: Remove Stopwords
        List<List<String>> filteredTokens = tokenizedTexts.stream()
            .map(TextPreprocessing::removeStopwords)
            .collect(Collectors.toList());

        // Step 4: Stemming (Simplistic Implementation)
        List<List<String>> stemmedTokens = filteredTokens.stream()
            .map(TextPreprocessing::stemTokens)
            .collect(Collectors.toList());

        // Step 5: Combine Tokens Back to Text
        List<String> processedTexts = stemmedTokens.stream()
            .map(tokens -> String.join(" ", tokens))
            .collect(Collectors.toList());

        // Display Processed Texts
        System.out.println("\nProcessed Text Data:");
        processedTexts.forEach(System.out::println);
    }

    // Step 1: Clean text (convert to lowercase, then remove every character that is
    // not a letter or whitespace). Apostrophes are removed too, so "it's" becomes
    // "its", which is not in the stopword list.
    private static String cleanText(String text) {
        return text.toLowerCase().replaceAll("[^a-zA-Z\\s]", "");
    }

    // Step 2: Tokenize text on runs of whitespace, skipping empty tokens that
    // leading whitespace or an empty input would otherwise produce
    private static List<String> tokenizeText(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
            .filter(token -> !token.isEmpty())
            .collect(Collectors.toList());
    }

    // Step 3: Remove stopwords
    private static List<String> removeStopwords(List<String> tokens) {
        return tokens.stream()
            .filter(token -> !STOPWORDS.contains(token))
            .collect(Collectors.toList());
    }

    // Step 4: Stemming (simplistic implementation)
    private static List<String> stemTokens(List<String> tokens) {
        // Mimic stemming by truncating to the first 4 characters (for demonstration purposes)
        return tokens.stream()
            .map(token -> token.length() > 4 ? token.substring(0, 4) : token)
            .collect(Collectors.toList());
    }
}

VII. Explanation of Code
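1. Clean Text: Converts the text to lowercase and removes every character that is
not a letter or whitespace (punctuation, digits, and apostrophes included).
2. Tokenization: Splits sentences into individual words using whitespace as a
delimiter.
3. Stopword Removal: Filters out predefined stopwords stored in a HashSet.
4. Stemming: A simplistic stand-in that truncates every word longer than 4
characters to its first 4 characters (e.g., "jumps" becomes "jump", but "quick"
becomes "quic"). Replace this logic with a real stemming algorithm, such as the
Porter stemmer, for practical use.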
5. Pipeline Design: Each preprocessing step is modular, enabling flexibility and
easier debugging.
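
As one possible step beyond truncation, the sketch below strips a few common English suffixes. It is a hypothetical, heavily reduced rule set (the class SuffixStemmerSketch and its rules are assumptions for illustration) and not the Porter algorithm; it will still mis-stem many words.

```java
// Hypothetical rule-based stemmer: a small illustration, not a complete algorithm.
public class SuffixStemmerSketch {

    // Apply the first matching suffix rule; length checks avoid stripping short words.
    static String stem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return undouble(word.substring(0, word.length() - 3));
        }
        if (word.length() > 4 && word.endsWith("ed")) {
            return undouble(word.substring(0, word.length() - 2));
        }
        if (word.length() > 4 && word.endsWith("es")) {
            return word.substring(0, word.length() - 2);
        }
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Collapse a trailing doubled consonant left behind by suffix removal ("runn" -> "run").
    static String undouble(String stem) {
        int n = stem.length();
        if (n >= 2 && stem.charAt(n - 1) == stem.charAt(n - 2)
                && "aeiou".indexOf(stem.charAt(n - 1)) < 0) {
            return stem.substring(0, n - 1);
        }
        return stem;
    }

    public static void main(String[] args) {
        System.out.println(stem("running"));  // run
        System.out.println(stem("jumps"));    // jump
        System.out.println(stem("cleaning")); // clean
        System.out.println(stem("quick"));    // quick
    }
}
```

Unlike truncation, these rules leave words such as "quick" intact, though they over-stem in other cases (for example, "falling" becomes "fal").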

VIII. Sample Execution


Input:
Original Text Data:
The quick brown fox jumps over the lazy dog!
NLP is fun, and it's exciting to preprocess text!
Cleaning & normalizing text improves model accuracy.
Output:
Processed Text Data:
quic brow fox jump over lazy dog
nlp fun its exci prep text
clea norm text impr mode accu

IX. Applications
1. Sentiment Analysis: Preprocessed text serves as input for models that predict
sentiment (positive, negative, or neutral).
2. Text Classification: Useful for categorizing news articles or emails (e.g., spam
detection).
3. Topic Modeling: Extract latent topics from large corpora using algorithms like
LDA.
4. Machine Translation: Cleaned text is essential for training translation models.

X. Conclusion
This Java implementation demonstrates a modular pipeline for basic text
preprocessing in NLP: cleaning, tokenization, stopword removal, and a simplified
stemming step. The approach can be expanded to include advanced techniques like
lemmatization, named entity recognition, or feature extraction. By understanding
and implementing preprocessing carefully, developers can help their models cope
with noisy, real-world data.
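
As a pointer toward the lemmatization extension, a dictionary lookup is the simplest possible form. The sketch below is hypothetical: the class LemmaLookupSketch and its four-entry table are illustrative assumptions, whereas a practical lemmatizer relies on a full lexicon and part-of-speech information.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical dictionary-based lemmatizer with a tiny hand-written table.
public class LemmaLookupSketch {

    private static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("better", "good");
        LEMMAS.put("best", "good");
        LEMMAS.put("ran", "run");
        LEMMAS.put("mice", "mouse");
    }

    // Return the dictionary form if the token is known, otherwise the token itself.
    static String lemmatize(String token) {
        return LEMMAS.getOrDefault(token, token);
    }

    static List<String> lemmatizeAll(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            result.add(lemmatize(token));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [good, text, mouse]
        System.out.println(lemmatizeAll(Arrays.asList("better", "text", "mice")));
    }
}
```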
