What Is Structured Data?: Information Retrieval

Uploaded by

Sundar Shahi Thakuri

Information Retrieval: Information retrieval (IR) is the activity of obtaining information
system resources relevant to an information need from a collection of information resources.
Searches can be based on full-text or other content-based indexing. Information retrieval is the
science of searching for information in a document, searching for documents themselves, and
also searching for metadata that describe data, and for databases of texts, images or sounds.

Automated information retrieval systems are used to reduce what has been called information
overload. An IR system is software that provides access to books, journals and other documents,
and that stores and manages those documents. Web search engines are the most visible IR
applications.

Structured data vs. unstructured data: structured data consists of clearly defined data
types whose pattern makes them easily searchable, while unstructured data (“everything else”)
consists of data that is usually not as easily searchable, including formats like audio,
video, and social media postings.

The contrast between unstructured and structured data does not denote any real conflict between
the two. Customers select one or the other based not on the data structure but on the
applications that use the data: relational databases for structured data, and almost any other
type of application for unstructured data.

However, there is a growing tension between the ease of analysis on structured data and the more
challenging analysis of unstructured data. Structured data analytics is a mature process and
technology. Unstructured data analytics is a nascent industry with a lot of new investment in
R&D, but it is not yet a mature technology. For corporations, the structured vs. unstructured
data question is whether to invest in analytics for unstructured data, and whether it is
possible to aggregate the two into better business intelligence.

What is Structured Data?

Structured data usually resides in relational databases (RDBMS). Fields store length-delineated
data such as phone numbers, Social Security numbers, or ZIP codes. Even text strings of variable
length, like names, are contained in records, making them simple to search. Data may be human- or
machine-generated as long as the data is created within an RDBMS structure. This format is
eminently searchable, both with human-generated queries and via algorithms that use field names
and the type of data, such as alphabetic or numeric, currency or date.

Common relational database applications with structured data include airline reservation
systems, inventory control, sales transactions, and ATM activity. Structured Query Language
(SQL) enables queries on this type of structured data within relational databases.
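
As a minimal sketch of what a query over typed fields looks like, the following uses Python's built-in sqlite3 module with an in-memory database. The customers table, its columns, and the sample rows are hypothetical and exist only for illustration; they do not come from any system named above.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a small, hypothetical customers table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (name TEXT, zip TEXT, balance REAL)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [("Ada", "10001", 120.50), ("Grace", "10001", 80.00), ("Linus", "94105", 15.25)],
    )
    return conn


def customers_in_zip(conn, zip_code):
    """Return customer names for one ZIP code, using a parameterized SQL query."""
    rows = conn.execute(
        "SELECT name FROM customers WHERE zip = ? ORDER BY name", (zip_code,)
    )
    return [name for (name,) in rows]


def total_balance(conn):
    """Aggregate a numeric field across all records."""
    (total,) = conn.execute("SELECT SUM(balance) FROM customers").fetchone()
    return total


if __name__ == "__main__":
    db = make_demo_db()
    print(customers_in_zip(db, "10001"))
    print(total_balance(db))
```

Because every field has a known name and type, both the lookup by ZIP code and the sum over balances are one-line queries; this is the searchability that the structured format provides.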

Some relational databases, such as those behind customer relationship management (CRM)
applications, do store or point to unstructured data. The integration can be awkward at best,
since memo fields do not lend themselves to traditional database queries. Still, most CRM data
is structured.

What is Unstructured Data?

Unstructured data is essentially everything else. Unstructured data has internal structure but is
not structured via pre-defined data models or schema. It may be textual or non-textual, and
human- or machine-generated. It may also be stored within a non-relational (NoSQL) database.

Typical human-generated unstructured data includes:

• Text files: Word processing, spreadsheets, presentations, email, logs.
• Email: Email has some internal structure thanks to its metadata, and we sometimes refer to it as semi-structured. However, its message field is unstructured and traditional analytics tools cannot parse it.
• Social Media: Data from Facebook, Twitter, LinkedIn.
• Website: YouTube, Instagram, photo sharing sites.
• Mobile data: Text messages, locations.
• Communications: Chat, IM, phone recordings, collaboration software.
• Media: MP3, digital photos, audio and video files.
• Business applications: MS Office documents, productivity applications.

Typical machine-generated unstructured data includes:

• Satellite imagery: Weather data, land forms, military movements.
• Scientific data: Oil and gas exploration, space exploration, seismic imagery, atmospheric data.
• Digital surveillance: Surveillance photos and video.
• Sensor data: Traffic, weather, oceanographic sensors.

Inverted Indices

In computer science, an inverted index (also referred to as a postings file or inverted file) is
an index data structure storing a mapping from content, such as words or numbers, to its
locations in a database file, or in a document or a set of documents (named in contrast to a
forward index, which maps from documents to content). The purpose of an inverted index is to
allow fast full-text searches, at the cost of increased processing when a document is added to
the database. The inverted file may be the database file itself, rather than its index. It is
the most popular data structure used in document retrieval systems,[1] and is used on a large
scale in, for example, search engines. Additionally, several significant general-purpose
mainframe-based database management systems have used inverted list architectures, including
ADABAS, DATACOM/DB, and Model 204.

There are two main variants of inverted indexes:

• A record-level inverted index (or inverted file index or just inverted file) contains a list of references to documents for each word.
• A word-level inverted index (or full inverted index or inverted list) additionally contains the positions of each word within a document.[2]

The latter form offers more functionality (like phrase searches), but needs more processing
power and space to be created.
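
The two variants can be sketched in a few lines of Python. This is an illustrative toy, not how any particular search engine stores its index: the function names, the whitespace tokenization, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict


def build_word_level_index(docs):
    """Map each term to {doc_id: [positions]} (a word-level inverted index)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            index[token][doc_id].append(position)
    return index


def to_record_level(index):
    """Drop the positions, keeping only the sorted document ids per term."""
    return {term: sorted(postings) for term, postings in index.items()}


def phrase_search(index, phrase):
    """Return ids of documents containing the words of `phrase` consecutively."""
    words = phrase.lower().split()
    hits = []
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            # Word k of the phrase must sit exactly k positions after the start.
            if all(start + k in index.get(w, {}).get(doc_id, [])
                   for k, w in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)


if __name__ == "__main__":
    docs = {1: "it is what it is", 2: "what is it", 3: "it is a banana"}
    index = build_word_level_index(docs)
    print(to_record_level(index)["what"])
    print(phrase_search(index, "what is it"))
```

The record-level form can answer "which documents contain this word", while the phrase search needs the stored positions, which is the extra functionality (and extra space) of the word-level form.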

Tokenization

Tokenization is the act of breaking up a sequence of characters (a string) into pieces such as
words, keywords, phrases, symbols and other elements called tokens. Tokens can be individual
words, phrases or even whole sentences. In the process of tokenization, some characters, like
punctuation marks, are discarded. The tokens become the input for another process, such as
parsing or text mining.

Tokenization is used in computer science, where it plays a large part in the process of lexical
analysis.

Tokenization relies mostly on simple heuristics to separate tokens:

• Tokens or words are separated by whitespace, punctuation marks or line breaks.
• Whitespace or punctuation marks may or may not be included, depending on the need.
• All characters within contiguous strings are part of the token. Tokens can be made up of all alpha characters, alphanumeric characters or numeric characters only.
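
The heuristics above can be sketched with a regular expression. This is one simple way to do it, assuming that a token is a run of letters, digits or underscores; the function name tokenize and its keep_punctuation flag are hypothetical, chosen only for this illustration.

```python
import re


def tokenize(text, keep_punctuation=False):
    """Split text into tokens on whitespace, punctuation and line breaks.

    Contiguous runs of word characters (letters, digits, underscore) form
    one token. Punctuation marks are dropped unless keep_punctuation is set,
    in which case each mark becomes its own token.
    """
    if keep_punctuation:
        return re.findall(r"\w+|[^\w\s]", text)
    return re.findall(r"\w+", text)


if __name__ == "__main__":
    print(tokenize("Hello, world! 42 times"))
    print(tokenize("Hello, world! 42 times", keep_punctuation=True))
```

Whether punctuation should be kept depends on the consumer: a search index usually discards it, while a parser usually needs it.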

Tokens themselves can also be separators. For example, in most programming languages,
identifiers can be placed together with arithmetic operators without white spaces. Although it
seems that this would appear as a single word or token, the grammar of the language actually
considers the mathematical operator (a token) as a separator, so even when multiple tokens are
bunched up together, they can still be separated via the mathematical operator.

A programming token is the basic component of source code. Characters are grouped into tokens,
and each token belongs to one of five classes that describe its function (constants,
identifiers, operators, keywords, and separators) in accordance with the rules of the
programming language.

For example:

#include <stdio.h>  /* declares printf */

int main()
{
    int a, b, total;
    a = 10, b = 20;
    total = a + b;
    printf("Total = %d \n", total);
}

where:

• main: identifier
• {, }, (, ): delimiters
• int: keyword
• a, b, total: identifiers

main, {, }, (, ), int, a, b and total are among the tokens.
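
A rough sketch of how a lexer might assign these classes is shown below. It is written in Python for brevity and covers only a small subset of C: the keyword set, the regular expression, and the function name lex are assumptions made for this illustration, and string literals are counted as constants.

```python
import re

# Hypothetical, deliberately tiny keyword set; real C has many more keywords.
KEYWORDS = {"int", "return"}

TOKEN_RE = re.compile(r"""
    (?P<string>"[^"\n]*")
  | (?P<constant>\d+)
  | (?P<word>[A-Za-z_]\w*)
  | (?P<operator>[+\-*/=])
  | (?P<separator>[(){},;])
""", re.VERBOSE)


def lex(source):
    """Return (class, text) pairs for a small subset of C source code."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "word":
            # A word is a keyword if the language reserves it, else an identifier.
            kind = "keyword" if text in KEYWORDS else "identifier"
        elif kind == "string":
            kind = "constant"
        tokens.append((kind, text))
    return tokens


if __name__ == "__main__":
    for token in lex("total=a+b;"):
        print(token)
```

Note that "total=a+b;" and "total = a + b;" produce the same token list: the operators themselves separate the identifiers, so no whitespace is required between them.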

Introduction to Stemming

Stemming is the process of reducing the morphological variants of a word to a common root/base
form. Stemming programs are commonly referred to as stemming algorithms or stemmers. A stemming
algorithm reduces the words “chocolates”, “chocolatey”, “choco” to the root word “chocolate”,
and reduces “retrieval”, “retrieved”, “retrieves” to the stem “retrieve”.

Some more examples of words that stem to the root word "like" include:

->"likes"
->"liked"
->"likely"
->"liking"
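
A deliberately naive suffix-stripping stemmer is sketched below to show the idea. It is not the Porter algorithm or any other published stemmer: the suffix list, the minimum stem length, and the function name naive_stem are assumptions made for this illustration. As with many stemmers, the resulting stem ("lik") need not be a dictionary word; what matters is that all variants map to the same string.

```python
# Hypothetical suffix list, checked in order; real stemmers use richer rules.
SUFFIXES = ("ing", "ly", "ed", "es", "s")


def naive_stem(word, min_stem=3):
    """Strip at most one common suffix, then a trailing 'e', from a word."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            word = word[: -len(suffix)]
            break
    # Drop a final "e" so that "like" and "lik" collapse to one stem.
    if word.endswith("e") and len(word) > min_stem:
        word = word[:-1]
    return word


if __name__ == "__main__":
    for w in ["like", "likes", "liked", "likely", "liking"]:
        print(w, "->", naive_stem(w))
```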

Errors in Stemming

There are mainly two errors in stemming: over-stemming and under-stemming. Over-stemming occurs
when two words that have different stems are reduced to the same root, for example when
"universe" and "university" end up with the same stem. Under-stemming occurs when two words that
share a stem are not reduced to the same root, for example when "alumnus" and "alumni" are left
with different stems.

Applications of stemming are:

1. Stemming is used in information retrieval systems like search engines.

2. It is used to determine domain vocabularies in domain analysis.

Lemmatisation

Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a
single word without knowledge of the context, and therefore cannot discriminate between words
which have different meanings depending on part of speech. However, stemmers are typically
easier to implement and run faster, and the reduced accuracy may not matter for some
applications.

For instance:

1. The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.
2. The word "walk" is the base form for the word "walking", and hence this is matched in both stemming and lemmatisation.
3. The word "meeting" can be either the base form of a noun or a form of a verb ("to meet") depending on the context, e.g., "in our last meeting" or "We are meeting again tomorrow". Unlike stemming, lemmatisation can in principle select the appropriate lemma depending on the context.
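
The three cases above can be sketched as a dictionary look-up keyed on the word and its part of speech. The table below is a hypothetical four-entry stand-in for a real lexicon, and the part-of-speech tag is supplied by the caller rather than inferred from context, which a real lemmatiser would have to do.

```python
# Hypothetical miniature lexicon: (word, part of speech) -> lemma.
LEMMA_TABLE = {
    ("better", "ADJ"): "good",
    ("walking", "VERB"): "walk",
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
}


def lemmatize(word, pos):
    """Look up the lemma for a word given its part of speech.

    Falls back to the lowercased word itself when the pair is not in the table.
    """
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


if __name__ == "__main__":
    print(lemmatize("better", "ADJ"))    # irregular form needs the dictionary
    print(lemmatize("meeting", "NOUN"))  # "in our last meeting"
    print(lemmatize("meeting", "VERB"))  # "We are meeting again tomorrow"
```

The same surface form "meeting" yields two different lemmas depending on the tag, which is exactly the distinction a context-free stemmer cannot make.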

Stop Words

In computer search engines, a stop word is a commonly used word (such as "the") that a search
engine has been programmed to ignore, both when indexing entries for searching and when
retrieving them as the result of a search query. When building the index, most engines are
programmed to remove certain words from any index entry. The list of words that are not to be
added is called a stop list. Stop words are deemed irrelevant for searching purposes because they
occur frequently in the language for which the indexing engine has been tuned. In order to save
both space and time, these words are dropped at indexing time and then ignored at search time.
Some search engines allow you to include a stop word in your search by putting an inclusion
operator (a plus sign) before each stop word in your query.
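
A minimal sketch of stop-word filtering follows. The stop list is a small hypothetical sample, not the list used by any real engine, and the function names are assumptions. The query function only shows how a plus sign could override the filter on the query side; an engine that drops stop words at indexing time would also have to index them for such a query to match anything.

```python
# Hypothetical, very short stop list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}


def index_terms(text):
    """Return the terms of a document that would be added to the index."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def query_terms(query):
    """Return the terms of a query, honoring a leading '+' as forced inclusion."""
    terms = []
    for raw in query.lower().split():
        if raw.startswith("+") and len(raw) > 1:
            terms.append(raw[1:])  # explicitly requested, even if a stop word
        elif raw not in STOP_WORDS:
            terms.append(raw)
    return terms


if __name__ == "__main__":
    print(index_terms("The history of the Internet"))
    print(query_terms("the who"))
    print(query_terms("+the who"))
```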

Proximity and Phrase

In text processing, a proximity search looks for documents where two or more separately
matching term occurrences are within a specified distance, where distance is the number of
intermediate words or characters. In addition to proximity, some implementations may also
impose a constraint on the word order, in that the order in the searched text must be identical to
the order of the search query. Proximity searching goes beyond the simple matching of words by
adding the constraint of proximity and is generally regarded as a form of advanced search.

For example, a search could be used to find "red brick house", and match phrases such as "red
house of brick" or "house made of red brick". By limiting the proximity, these phrases can be
matched while avoiding documents where the words are scattered or spread across a page or in
unrelated articles in an anthology.
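
The example above can be sketched as follows, taking distance to mean the number of intermediate words inside the smallest window that contains every query term, and ignoring word order. The function names, the whitespace tokenization, and the assumption that the query terms are distinct are all choices made for this illustration.

```python
from itertools import product


def min_span(tokens, terms):
    """Length of the shortest token window containing every term, or None."""
    positions = [[i for i, tok in enumerate(tokens) if tok == term] for term in terms]
    if any(not p for p in positions):
        return None  # at least one term does not occur at all
    # Try every combination of one occurrence per term (fine for short texts).
    return min(max(combo) - min(combo) + 1 for combo in product(*positions))


def proximity_match(text, terms, max_intermediate):
    """True if all terms occur with at most max_intermediate other words between them."""
    tokens = text.lower().split()
    span = min_span(tokens, terms)
    if span is None:
        return False
    return span - len(terms) <= max_intermediate


if __name__ == "__main__":
    query = ["red", "brick", "house"]
    print(proximity_match("red house of brick", query, 2))
    print(proximity_match("house made of red brick", query, 2))
    print(proximity_match("the red car passed a house far from the old brick wall", query, 2))
```

With a limit of two intermediate words, the first two texts match and the third, where the words are scattered, does not. A word-order constraint would be an additional check on the chosen positions.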
