What Is Structured Data?: Information Retrieval

Information retrieval involves obtaining information resources relevant to an information need from a collection. Searches can be based on full text or other content. Automated IR systems are used to reduce information overload. Web search engines are a major IR application. Structured data has clearly defined types that are easily searchable, while unstructured data like audio/video is less easily searchable. There is tension between analyzing structured vs unstructured data, as structured analytics is mature but unstructured is an emerging field. Customers select tools based on application needs rather than data type. Inverted indices store mappings from terms to locations, allowing fast full text search. They are commonly used in document retrieval systems like search engines. Tokenization breaks text

Information Retrieval: Information retrieval (IR) is the activity of obtaining information
system resources relevant to an information need from a collection of information resources.
Searches can be based on full-text or other content-based indexing. Information retrieval is the
science of searching for information in a document, searching for documents themselves, and
also searching for metadata that describe data, and for databases of texts, images or sounds.

Automated information retrieval systems are used to reduce what has been called information
overload. An IR system is software that provides access to books, journals and other documents,
and that stores and manages those documents. Web search engines are the most visible IR
applications.

Structured data vs. unstructured data: structured data is comprised of clearly defined data
types whose pattern makes them easily searchable; while unstructured data – “everything else”
– is comprised of data that is usually not as easily searchable, including formats like audio,
video, and social media postings.

Unstructured data vs. structured data does not denote any real conflict between the two.
Customers select one or the other not based on their data structure, but on the applications that
use them: relational databases for structured, and most any other type of application for
unstructured data.

However, there is a growing tension between the ease of analysis on structured data versus more
challenging analysis on unstructured data. Structured data analytics is a mature process and
technology. Unstructured data analytics is a nascent industry with a lot of new investment into
R&D, but is not a mature technology. The structured data vs. unstructured data issue within
corporations is deciding if they should invest in analytics for unstructured data, and if it is
possible to aggregate the two into better business intelligence.

What is Structured Data?


Structured data usually resides in relational databases (RDBMS). Fields store length-delineated
data such as phone numbers, Social Security numbers, or ZIP codes. Even text strings of variable
length like names are contained in records, making it a simple matter to search. Data may be
human- or machine-generated as long as the data is created within an RDBMS structure. This
format is eminently searchable both with human-generated queries and via algorithms using the
type of data and field names, such as alphabetical or numeric, currency or date.

Common relational database applications with structured data include airline reservation
systems, inventory control, sales transactions, and ATM activity. Structured Query Language
(SQL) enables queries on this type of structured data within relational databases.
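
As a minimal sketch of what querying structured data looks like, the following uses Python's built-in sqlite3 module with an in-memory database. The table name, column names and sample rows are hypothetical and chosen only for illustration; they do not come from any particular application.

```python
import sqlite3

# Hypothetical table, columns and rows, used only to illustrate a typed, fielded record.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (name TEXT, phone TEXT, zip_code TEXT, balance REAL)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        ("Ada Lopez", "555-0100", "94105", 120.50),
        ("Ben Ito", "555-0101", "10001", 0.00),
        ("Cy Marsh", "555-0102", "94105", 75.25),
    ],
)


def customers_in_zip(zip_code):
    """Return the names of customers in a ZIP code, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM customers WHERE zip_code = ? ORDER BY name",
        (zip_code,),
    )
    return [name for (name,) in rows]


print(customers_in_zip("94105"))  # ['Ada Lopez', 'Cy Marsh']
```

Because every value sits in a named, typed field, the query can filter and sort without interpreting free text.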

Some relational database applications, such as customer relationship management (CRM) systems,
do store or point to unstructured data. The integration can be awkward at best since memo fields
do not lend themselves to traditional database queries. Still, most CRM data is structured.

What is Unstructured Data?

Unstructured data is essentially everything else. Unstructured data has internal structure but is
not structured via pre-defined data models or schema. It may be textual or non-textual, and
human- or machine-generated. It may also be stored within a non-relational (NoSQL) database.

Typical human-generated unstructured data includes:

• Text files: Word processing, spreadsheets, presentations, email, logs.
• Email: Email has some internal structure thanks to its metadata, and we sometimes refer
  to it as semi-structured. However, its message field is unstructured and traditional
  analytics tools cannot parse it.
• Social Media: Data from Facebook, Twitter, LinkedIn.
• Website: YouTube, Instagram, photo sharing sites.
• Mobile data: Text messages, locations.
• Communications: Chat, IM, phone recordings, collaboration software.
• Media: MP3, digital photos, audio and video files.
• Business applications: MS Office documents, productivity applications.

Typical machine-generated unstructured data includes:

• Satellite imagery: Weather data, land forms, military movements.
• Scientific data: Oil and gas exploration, space exploration, seismic imagery, atmospheric
  data.
• Digital surveillance: Surveillance photos and video.
• Sensor data: Traffic, weather, oceanographic sensors.

Inverted Indices

In computer science, an inverted index (also referred to as a postings file or inverted file) is
an index data structure storing a mapping from content, such as words or numbers, to its
locations in a database file, or in a document or a set of documents (named in contrast to a
forward index, which maps from documents to content). The purpose of an inverted index is to
allow fast full-text searches, at the cost of increased processing when a document is added to
the database. The inverted file may be the database file itself, rather than its index. It is the
most popular data structure used in document retrieval systems,[1] used on a large scale for
example in search engines. Additionally, several significant general-purpose mainframe-based
database management systems have used inverted list architectures, including ADABAS,
DATACOM/DB, and Model 204.

There are two main variants of inverted indexes: A record-level inverted index (or inverted file
index or just inverted file) contains a list of references to documents for each word. A word-
level inverted index (or full inverted index or inverted list) additionally contains the positions
of each word within a document.[2] The latter form offers more functionality (like phrase
searches), but needs more processing power and space to be created.
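
The two variants can be sketched in a few lines of Python. This is an illustrative toy, assuming lowercase whitespace-split words and a small hypothetical document collection; the function names are invented for this example, and a production index would add proper tokenization, compression and on-disk storage.

```python
from collections import defaultdict


def build_record_level_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def build_word_level_index(docs):
    """Map each word to {doc_id: [positions]} so that phrase queries are possible."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index


def phrase_search(index, phrase):
    """Return ids of documents that contain the words of `phrase` consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return set()
    hits = set()
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # Every following word must appear at the next position in the same document.
            if all(
                start + offset in index[w].get(doc_id, [])
                for offset, w in enumerate(words)
            ):
                hits.add(doc_id)
                break
    return hits


# Hypothetical three-document collection.
docs = {1: "it is what it is", 2: "what is it", 3: "it is a banana"}
record_index = build_record_level_index(docs)
word_index = build_word_level_index(docs)

print(sorted(record_index["what"]))                      # [1, 2]
print(sorted(phrase_search(word_index, "what is it")))   # [2]
```

The record-level index can only say that documents 1 and 2 both contain "what", "is" and "it"; the stored positions in the word-level index are what rule out document 1 for the exact phrase.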

Tokenization

Tokenization is the act of breaking up a sequence of characters into pieces such as words,
keywords, phrases, symbols and other elements called tokens. Tokens can be individual words,
phrases or even whole sentences. In the process of tokenization, some characters like
punctuation marks are discarded. The tokens become the input for another process like parsing
and text mining.

Tokenization is used in computer science, where it plays a large part in the process of lexical
analysis.

Tokenization relies mostly on simple heuristics in order to separate tokens by following a few
steps:

• Tokens or words are separated by whitespace, punctuation marks or line breaks.
• White space or punctuation marks may or may not be included depending on the need.
• All characters within contiguous strings are part of the token. Tokens can be made up of
  all alpha characters, alphanumeric characters or numeric characters only.
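
The heuristics above can be sketched with a regular expression. This is a minimal illustration using Python's re module; the function name and the `keep_punctuation` flag are assumptions made for this example, and real tokenizers handle contractions, hyphens and language-specific rules more carefully.

```python
import re


def tokenize(text, keep_punctuation=False):
    """Split text into tokens on whitespace and punctuation."""
    if keep_punctuation:
        # A run of word characters is one token; each punctuation mark is its own token.
        pattern = r"\w+|[^\w\s]"
    else:
        # Only runs of letters, digits and underscores survive; punctuation is discarded.
        pattern = r"\w+"
    return re.findall(pattern, text)


print(tokenize("Hello, world! It's 2024."))
# ['Hello', 'world', 'It', 's', '2024']
```

Note how the apostrophe splits "It's" into two tokens, which shows why such simple heuristics are only a starting point.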

Tokens themselves can also be separators. For example, in most programming languages,
identifiers can be placed together with arithmetic operators without white spaces. Although it
seems that this would appear as a single word or token, the grammar of the language actually
considers the mathematical operator (a token) as a separator, so even when multiple tokens are
bunched up together, they can still be separated via the mathematical operator.

A programming token is the basic component of source code. Characters are grouped into tokens,
and each token belongs to one of five classes that describe its function (constants,
identifiers, operators, keywords, and separators) in accordance with the rules of the
programming language.

For example:

#include <stdio.h>

int main()
{
    int a, b, total;                   /* declare three integers */
    a = 10, b = 20;
    total = a + b;
    printf("Total = %d \n", total);    /* prints: Total = 30 */
    return 0;
}

where,

• main – identifier
• {, }, (, ) – delimiters
• int – keyword
• a, b, total – identifiers

Thus main, {, }, (, ), int, a, b and total are among the tokens of this program.
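
To make the five token classes concrete, here is a hedged sketch of a tiny lexer, written in Python, that classifies the tokens of C-like source text. The names `KEYWORDS`, `TOKEN_PATTERN` and `lex` are hypothetical, the keyword set is a small subset, and characters the pattern does not recognize are silently skipped; a real compiler's lexer is considerably more thorough.

```python
import re

KEYWORDS = {"int", "return", "if", "else", "while", "for"}  # small illustrative subset

TOKEN_PATTERN = re.compile(r"""
    (?P<constant>\d+|"[^"\n]*")        # integer or string literal
  | (?P<word>[A-Za-z_]\w*)             # keyword or identifier, decided below
  | (?P<operator>[+\-*/=<>!]=?)        # one- or two-character operators
  | (?P<separator>[(){};,])            # delimiters
""", re.VERBOSE)


def lex(source):
    """Return a list of (token_class, text) pairs for C-like source text."""
    tokens = []
    for match in TOKEN_PATTERN.finditer(source):
        kind = match.lastgroup
        text = match.group()
        if kind == "word":
            kind = "keyword" if text in KEYWORDS else "identifier"
        tokens.append((kind, text))
    return tokens


print(lex("total=a+b;"))
# [('identifier', 'total'), ('operator', '='), ('identifier', 'a'),
#  ('operator', '+'), ('identifier', 'b'), ('separator', ';')]
```

The input `total=a+b;` contains no white space, yet it yields the same tokens as `total = a + b;`, because the operators themselves act as separators.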


Introduction to Stemming

Stemming is the process of reducing the morphological variants of a word to a common root/base
form. Stemming programs are commonly referred to as stemming algorithms or stemmers. Ideally, a
stemming algorithm reduces the words “chocolates”, “chocolatey”, “choco” to the root word
“chocolate”, and “retrieval”, “retrieved”, “retrieves” to the stem “retrieve”.

More examples of words that stem to the root word "like" include:


->"likes"
->"liked"
->"likely"
->"liking"

Errors in Stemming:
There are mainly two errors in stemming: over-stemming and under-stemming. Over-stemming
occurs when two words that have different stems are reduced to the same root (a false
positive). Under-stemming occurs when two words that should share a root are reduced to
different roots (a false negative).
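
Both kinds of error are easy to reproduce with a deliberately naive suffix stripper. The sketch below is an illustration only: the suffix list and the minimum stem length are assumptions made for this example and do not correspond to any published stemming algorithm.

```python
SUFFIXES = ("ing", "ed", "ly", "s")  # illustrative only, not a real stemmer's rule set


def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix, provided enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_length:
            return stem
    return word


# Under-stemming: related words end up with different stems.
print(naive_stem("likes"), naive_stem("liked"))   # like lik

# Over-stemming: unrelated words end up with the same stem.
print(naive_stem("news"), naive_stem("new"))      # new new
```

Here "likes" and "liked" should share a root but do not, while "news" and "new" are conflated even though they mean different things.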

Applications of stemming are:

1. Stemming is used in information retrieval systems like search engines.


2. It is used to determine domain vocabularies in domain analysis.

Lemmatisation
Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a
single word without knowledge of the context, and therefore cannot discriminate between words
which have different meanings depending on part of speech. However, stemmers are typically
easier to implement and run faster, and the reduced accuracy may not matter for some
applications.

For instance:

1. The word "better" has "good" as its lemma. This link is missed by stemming, as it
requires a dictionary look-up.
2. The word "walk" is the base form for word "walking", and hence this is matched in both
stemming and lemmatisation.
3. The word "meeting" can be either the base form of a noun or a form of a verb ("to meet")
depending on the context, e.g., "in our last meeting" or "We are meeting again
tomorrow". Unlike stemming, lemmatisation can in principle select the appropriate
lemma depending on the context.
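
The three cases above can be mimicked with a dictionary look-up keyed by word and part of speech. This is a toy sketch: the `LEMMAS` table and the `lemmatize` function are hypothetical, the part of speech is assumed to be supplied by the caller, and a real lemmatiser would rely on a full lexicon and a part-of-speech tagger.

```python
# Toy lemma dictionary keyed by (word, part of speech); entries are illustrative.
LEMMAS = {
    ("better", "adjective"): "good",
    ("walking", "verb"): "walk",
    ("meeting", "noun"): "meeting",
    ("meeting", "verb"): "meet",
}


def lemmatize(word, part_of_speech):
    """Look up the lemma for a word given its part of speech; fall back to the word."""
    word = word.lower()
    return LEMMAS.get((word, part_of_speech), word)


print(lemmatize("meeting", "noun"))      # meeting  ("in our last meeting")
print(lemmatize("meeting", "verb"))      # meet     ("We are meeting again tomorrow")
print(lemmatize("better", "adjective"))  # good
```

The same surface form "meeting" yields two different lemmas, which a context-free stemmer cannot do.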

Stop Word

In computer search engines, a stop word is a commonly used word (such as "the") that a search
engine has been programmed to ignore, both when indexing entries for searching and when
retrieving them as the result of a search query. When building the index, most engines are
programmed to remove certain words from any index entry. The list of words that are not to be
added is called a stop list. Stop words are deemed irrelevant for searching purposes because they
occur frequently in the language for which the indexing engine has been tuned. In order to save
both space and time, these words are dropped at indexing time and then ignored at search time.
Some search engines allow you to include a stop word in your search by putting an inclusion
operator (a plus sign) before each stop word in your query.
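
A minimal sketch of this behaviour follows. The stop list is a short hypothetical sample rather than any engine's actual list, and the plus-sign handling is an assumption about how an inclusion operator might be parsed; forcing a stop word into a query only helps if the engine also kept that word in its index.

```python
STOP_LIST = {"the", "a", "an", "of", "is", "to", "and", "in"}  # illustrative, not a standard list


def index_terms(text):
    """Terms kept at indexing time: lowercase words that are not on the stop list."""
    return [w for w in text.lower().split() if w not in STOP_LIST]


def query_terms(query):
    """Terms used at search time; a leading '+' forces a stop word to be kept."""
    terms = []
    for word in query.lower().split():
        if word.startswith("+") and len(word) > 1:
            terms.append(word[1:])
        elif word not in STOP_LIST:
            terms.append(word)
    return terms


print(index_terms("The history of the internet"))  # ['history', 'internet']
print(query_terms("the who"))                      # ['who']
print(query_terms("+the who"))                     # ['the', 'who']
```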

Proximity and Phrase


In text processing, a proximity search looks for documents where two or more separately
matching term occurrences are within a specified distance, where distance is the number of
intermediate words or characters. In addition to proximity, some implementations may also
impose a constraint on the word order, in that the order in the searched text must be identical to
the order of the search query. Proximity searching goes beyond the simple matching of words by
adding the constraint of proximity and is generally regarded as a form of advanced search.

For example, a search could be used to find "red brick house", and match phrases such as "red
house of brick" or "house made of red brick". By limiting the proximity, these phrases can be
matched while avoiding documents where the words are scattered or spread across a page or in
unrelated articles in an anthology.
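
The "red brick house" example can be sketched as follows. This is one possible reading of proximity, assumed for illustration: distance is counted as the number of non-query words inside the smallest span that covers one occurrence of every term. The function name and parameters are hypothetical, and words are split on white space only.

```python
from itertools import product


def proximity_match(text, terms, max_gap, ordered=False):
    """True if every term occurs with at most `max_gap` other words inside the span."""
    words = text.lower().split()
    positions = [
        [i for i, w in enumerate(words) if w == term.lower()] for term in terms
    ]
    if any(not p for p in positions):
        return False  # some term does not occur at all
    for combo in product(*positions):
        if len(set(combo)) < len(combo):
            continue  # one occurrence cannot satisfy two terms
        if ordered and list(combo) != sorted(combo):
            continue  # terms must appear in query order
        extra_words = (max(combo) - min(combo) + 1) - len(combo)
        if extra_words <= max_gap:
            return True
    return False


query = ["red", "brick", "house"]
print(proximity_match("red house of brick", query, max_gap=2))        # True
print(proximity_match("house made of red brick", query, max_gap=2))   # True
print(proximity_match("the red car passed a brick wall near my house", query, max_gap=2))  # False
```

Passing `ordered=True` adds the word-order constraint mentioned above, so that "red house of brick" no longer matches the query "red brick house".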
