23951a04e3 Acsd08
ACSD08
Vemula Rohan
23951A04E3
ECE-C
DSCSP87
Complex Problem Solving Self-Assessment Form
2 Roll Number 23951A04E3
3 Branch and Section ECE-C
4 Program B.Tech
5 Course Name Data Structures
6 Course Code ACSD08
7 Please tick (✓) relevant Engineering Competency (EC) Profiles
EC Profiles (✓)
EC 1 Ensures that all aspects of an engineering activity are soundly based on fundamental principles - by diagnosing, and taking appropriate action with data, calculations, results, proposals, processes, practices, and documented information that may be ill-founded, illogical, erroneous, unreliable or unrealistic requirements applicable to the engineering discipline ✓
EC 10 High-level problems including many component parts or sub-problems; partitions problems, processes or systems into manageable elements for the purposes of analysis, modelling or design and then re-combines to form a whole, with the integrity and performance of the overall system as the top consideration. ✓
EC 11 Undertake CPD activities to maintain and extend competences and enhance the ability to adapt to emerging technologies and the ever-changing nature of work. ✓
EC 12 Recognize complexity and assess alternatives in light of competing requirements and incomplete knowledge. Require judgement in decision making in the course of all complex engineering activities. ✓
8 Please tick (✓) relevant Course Outcomes (COs) Covered
CO Course Outcomes (✓)
CO1 Interpret the complexity of the algorithm using the asymptotic notations ✓
CO2 Select the appropriate searching and sorting technique for a given problem ✓
CO3 Construct programs on performing operations on linear and nonlinear data structures for organization of data ✓
CO4 Make use of linear data structures and nonlinear data structures for solving real-time applications. ✓
CO5 Describe hashing techniques and collision resolution methods for accessing data with respect to performance ✓
CO6 Compare various types of data structures in terms of implementation, operations and performance. ✓
10 Justify your understanding of WK1: Foundations for analyzing and optimizing operations.
11 Justify your understanding of WK2–WK9: Core to advanced concepts, tools, design, and ethics.
12 How many WKs from WK2 to WK9 were implemented? All 8 WKs from WK2 to WK9 are implemented in this design and analysis.
Date: 22-12-2024
PROBLEM STATEMENT
Preprocessing Text Data for Natural Language Processing
(NLP) Models DS CSP87
I. Project Overview
Preprocessing text data is a fundamental step in Natural Language Processing (NLP).
Raw text data from various sources, such as social media posts, articles, or speech
transcriptions, is often noisy and unstructured. Effective preprocessing transforms
this raw data into a clean, structured format that machine learning models can
process. This ensures better performance, generalization, and interpretability of NLP
models.
II. Objectives
1. Text Normalization: Convert text into a consistent format, including cleaning,
tokenization, and case normalization.
2. Noise Removal: Eliminate unwanted elements such as punctuation, special
characters, and irrelevant data.
3. Stopword Removal: Remove common words (e.g., "the," "is") that do not
contribute significant meaning.
4. Stemming and Lemmatization: Reduce words to their root forms to minimize
redundant variations.
5. Feature Representation: Prepare text for numerical representation, such as
Bag-of-Words, TF-IDF, or embeddings.
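As a rough illustration of objective 5, the simplest feature representation is a Bag-of-Words: counting how often each token occurs. The `BagOfWords` class below is a hypothetical sketch (not part of the project's code) that builds such counts from already-preprocessed tokens:

```java
import java.util.Map;
import java.util.TreeMap;

public class BagOfWords {
    // Count term frequencies over already-preprocessed tokens.
    // A TreeMap keeps the vocabulary in sorted order for readable output.
    public static Map<String, Integer> counts(String[] tokens) {
        Map<String, Integer> bag = new TreeMap<>();
        for (String t : tokens) {
            bag.merge(t, 1, Integer::sum);  // increment existing count or start at 1
        }
        return bag;
    }

    public static void main(String[] args) {
        String[] tokens = {"data", "clean", "data", "model"};
        System.out.println(counts(tokens));  // {clean=1, data=2, model=1}
    }
}
```

TF-IDF and embeddings build on the same token counts, weighting or replacing them with learned vectors.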
III. Preprocessing Workflow
1. Text Cleaning:
o Remove non-alphabetic characters and convert text to lowercase.
2. Tokenization:
o Split text into individual words or tokens.
3. Stopword Removal:
o Eliminate common, less meaningful words (e.g., "and," "to," "at").
→ "good").
5. Handling Noise:
o Remove duplicate spaces, URLs, or irrelevant content such as HTML tags.
6. Final Output:
o Produce a clean, structured list of words or sentences ready for feature
extraction.
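The noise-handling step above can be sketched with Java regular expressions. The patterns and the `NoiseRemover` class name here are illustrative assumptions, not the project's actual code:

```java
public class NoiseRemover {
    // Strip HTML tags and URLs, then collapse duplicate whitespace.
    public static String removeNoise(String text) {
        String noHtml = text.replaceAll("<[^>]+>", " ");          // remove HTML tags
        String noUrls = noHtml.replaceAll("https?://\\S+", " ");  // remove URLs
        return noUrls.replaceAll("\\s+", " ").trim();             // collapse duplicate spaces
    }

    public static void main(String[] args) {
        String raw = "<p>Visit   https://example.com   for  more</p>";
        System.out.println(removeNoise(raw));  // Visit for more
    }
}
```

Real-world noise (malformed HTML, shortened URLs) usually needs more robust handling than these simple patterns.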
IV. Challenges
1. Language Dependency: Preprocessing rules vary for different languages (e.g.,
compound words in German).
2. Ambiguity: Words with multiple meanings (e.g., "bat") require context for
accurate preprocessing.
3. Data Size: Handling large text corpora efficiently.
4. Maintaining Semantic Meaning: Excessive cleaning might remove
meaningful words, affecting downstream tasks.
V. Technologies Used
1. Programming Language: Java
2. Libraries/Frameworks: None (built using native Java capabilities)
3. Concepts: Object-Oriented Programming, Regular Expressions, Stream API
VII. Explanation of Code
1. Clean Text: Removes non-alphabetic characters and converts the text to
lowercase.
2. Tokenization: Splits sentences into individual words using whitespace as a
delimiter.
3. Stopword Removal: Filters out predefined stopwords stored in a HashSet.
4. Stemming: Simplistic implementation truncates words to their first 4 characters.
Replace this logic with a proper stemming library if needed.
5. Pipeline Design: Each preprocessing step is modular, enabling flexibility and
easier debugging.
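The five points above can be sketched as a single class. Since the full code listing is not reproduced in this section, the class and method names below are illustrative assumptions based on the explanation (a HashSet of stopwords, whitespace tokenization, 4-character truncation as the simplistic stemmer):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class TextPreprocessor {
    // Predefined stopwords kept in a HashSet for O(1) lookup (illustrative subset).
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "is", "and", "to", "at", "a", "of"));

    // Step 1: remove non-alphabetic characters and lowercase the text.
    public static String cleanText(String text) {
        return text.replaceAll("[^a-zA-Z\\s]", "").toLowerCase();
    }

    // Step 2: split the sentence into tokens on whitespace.
    public static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Step 3: filter out stopwords using the Stream API.
    public static List<String> removeStopwords(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !STOP_WORDS.contains(t))
                .collect(Collectors.toList());
    }

    // Step 4: simplistic "stemming" that truncates words to their first 4 characters.
    public static List<String> stem(List<String> tokens) {
        return tokens.stream()
                .map(t -> t.length() > 4 ? t.substring(0, 4) : t)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String raw = "The cats are running to the garden!";
        List<String> result = stem(removeStopwords(tokenize(cleanText(raw))));
        System.out.println(result);  // [cats, are, runn, gard]
    }
}
```

Because each step is a pure method, the stages can be tested in isolation and the truncation stemmer swapped for a proper algorithm (e.g., Porter stemming) without touching the rest of the pipeline.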
IX. Applications
1. Sentiment Analysis: Preprocessed text serves as input for models that predict
sentiment (positive, negative, or neutral).
2. Text Classification: Useful for categorizing news articles or emails (e.g., spam
detection).
3. Topic Modeling: Extract latent topics from large corpora using algorithms like
LDA.
4. Machine Translation: Cleaned text is essential for training translation models.
X. Conclusion
This Java implementation demonstrates a complete, modular pipeline for text
preprocessing in NLP. The approach can be expanded to include advanced
techniques like lemmatization, named entity recognition, or feature extraction. By
understanding and implementing preprocessing effectively, developers can ensure
their models work optimally with real-world data.