What Is Structured Data?: Information Retrieval
Automated information retrieval (IR) systems are used to reduce what has been called information
overload. An IR system is software that provides access to books, journals, and other documents,
and that stores and manages those documents. Web search engines are the most visible IR applications.
Structured data vs. unstructured data: structured data consists of clearly defined data
types whose pattern makes them easily searchable, while unstructured data ("everything else")
consists of data that is usually not as easily searchable, including formats like audio,
video, and social media postings.
Unstructured data vs. structured data does not denote any real conflict between the two.
Customers select one or the other based not on the data structure itself, but on the applications
that use the data: relational databases for structured data, and almost any other type of
application for unstructured data.
However, there is a growing tension between the ease of analysis on structured data and the more
challenging analysis on unstructured data. Structured data analytics is a mature process and
technology. Unstructured data analytics is a nascent industry that is attracting a lot of new
R&D investment, but the technology is not yet mature. For corporations, the structured data vs.
unstructured data issue is deciding whether to invest in analytics for unstructured data, and
whether it is possible to aggregate the two into better business intelligence.
Common relational database applications with structured data include airline reservation
systems, inventory control, sales transactions, and ATM activity. Structured Query Language
(SQL) enables queries on this type of structured data within relational databases.
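As a minimal sketch of a query over structured data, the following uses Python's built-in sqlite3 module with an in-memory database. The sales table, its columns, and the sample rows are assumptions made up for illustration; they do not come from any real system.

```python
import sqlite3

# In-memory database; table name, columns, and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, item TEXT, quantity INTEGER, unit_price REAL)"
)
conn.executemany(
    "INSERT INTO sales (item, quantity, unit_price) VALUES (?, ?, ?)",
    [("widget", 3, 2.50), ("gadget", 1, 10.00), ("widget", 2, 2.50)],
)


def revenue_by_item(connection):
    """Return (item, total revenue) rows, sorted by item name."""
    cursor = connection.execute(
        "SELECT item, SUM(quantity * unit_price) "
        "FROM sales GROUP BY item ORDER BY item"
    )
    return cursor.fetchall()


print(revenue_by_item(conn))
```

Because every row has the same typed columns, the aggregation can be expressed declaratively; this is the kind of query that free-text memo fields do not support well.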
Some relational database applications, such as customer relationship management (CRM)
applications, do store or point to unstructured data. The integration can be awkward at best since
memo fields do not lend themselves to traditional database queries. Still, most CRM data is
structured.
What is Unstructured Data?
Unstructured data is essentially everything else. Unstructured data has internal structure but is
not structured via pre-defined data models or schemas. It may be textual or non-textual, and
human- or machine-generated. It may also be stored within a non-relational (NoSQL) database.
In computer science, an inverted index (also referred to as a postings file or inverted file) is
an index data structure that stores a mapping from content, such as words or numbers, to its
locations in a database file, or in a document or a set of documents. It is named in contrast to
a forward index, which maps from documents to content. The purpose of an inverted index is to
allow fast full-text searches, at the cost of increased processing when a document is added to
the database. The inverted file may be the database file itself, rather than its index. It is the
most popular data structure used in document retrieval systems,[1] used on a large scale, for
example, in search engines. Additionally, several significant general-purpose mainframe-based
database management systems have used inverted list architectures, including ADABAS,
DATACOM/DB, and Model 204.
There are two main variants of inverted indexes. A record-level inverted index (or inverted file
index or just inverted file) contains a list of references to documents for each word. A
word-level inverted index (or full inverted index or inverted list) additionally contains the
positions of each word within a document.[2] The latter form offers more functionality (like
phrase searches), but needs more processing power and space to be created.
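The two variants can be sketched in a few lines of Python. This is an illustrative toy, not how a production search engine stores its index: the function names, the whitespace-only tokenization, and the three sample documents are all assumptions.

```python
from collections import defaultdict


def build_record_level_index(docs):
    """Map each word to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def build_word_level_index(docs):
    """Map each word to {doc_id: [positions]} so phrase queries are possible."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return dict(index)


def phrase_search(index, phrase):
    """Return ids of documents where the words of `phrase` occur consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    hits = []
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # Word i of the phrase must sit exactly i positions after the start.
            if all(start + i in index[w].get(doc_id, []) for i, w in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)


# Hypothetical sample collection.
docs = {
    1: "it is what it is",
    2: "what is it",
    3: "it is a banana",
}
```

The record-level index can only say which documents contain "what", "is", and "it"; the word-level index stores positions as well, which is what lets phrase_search tell document 2 apart from document 1 for the phrase "what is it".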
Tokenization:
Tokenization is the act of breaking up a sequence of characters into pieces such as words,
keywords, phrases, symbols and other elements called tokens. Tokens can be individual words,
phrases or even whole sentences. In the process of tokenization, some characters like punctuation
marks are discarded. The tokens become the input for another process like parsing or text mining.
Tokenization is used in computer science, where it plays a large part in the process of lexical
analysis.
Tokenization relies mostly on simple heuristics to separate tokens, such as splitting on white
space, punctuation marks, or line breaks.
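As a minimal sketch of such a heuristic, the function below treats runs of word characters as tokens and discards punctuation. The function name tokenize and the choice to lowercase the text and keep contractions together are assumptions for illustration, not rules stated in these notes.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens, discarding punctuation."""
    # \w+ matches a run of letters, digits, or underscores; the optional
    # apostrophe part keeps contractions such as "don't" as a single token.
    return re.findall(r"\w+(?:'\w+)?", text.lower())


print(tokenize("Tokens become input; don't discard them!"))
```

Real tokenizers need more rules than this (hyphenated words, URLs, numbers with decimal points), which is why the heuristics are usually tuned to the language and application.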
Tokens themselves can also be separators. For example, in most programming languages,
identifiers can be placed together with arithmetic operators without white space. Although this
might appear to be a single word or token, the grammar of the language treats the arithmetic
operator (itself a token) as a separator, so even when multiple tokens are bunched up together,
they can still be separated at the operator.
A programming token is the basic component of source code. Characters are categorized as one
of five classes of tokens that describe their functions (constants, identifiers, operators,
keywords, and separators) in accordance with the rules of the programming language.
For example:
#include <stdio.h>

int main()
{
    int a, b, total;
    a = 10, b = 20;        /* assign the two operands */
    total = a + b;         /* add them */
    printf("Total = %d \n", total);
    return 0;
}
where,
main – identifier
{, }, (, ) – delimiter (separator)
int – keyword
a, b, total – identifier
Tokens classified above: main, {, }, (, ), int, a, b, total
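The classification above can be reproduced with a small regular-expression lexer. This is a rough sketch in Python, not a real C lexer: the token patterns cover only the characters used in the fragment, KEYWORDS is a tiny assumed subset of C keywords, and the class name "separator" is used for what the list above calls a delimiter.

```python
import re

KEYWORDS = {"int", "return"}  # tiny illustrative subset of C keywords

TOKEN_SPEC = [
    ("constant", r'\d+|"[^"\n]*"'),      # integer or string literal
    ("identifier", r"[A-Za-z_]\w*"),     # names; keywords are split out later
    ("operator", r"[+\-*/=]"),
    ("separator", r"[{}(),;]"),
    ("skip", r"\s+"),                    # white space is not a token
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def lex(source):
    """Return (token_class, text) pairs for a small C-like fragment."""
    tokens = []
    position = 0
    while position < len(source):
        match = TOKEN_RE.match(source, position)
        if match is None:
            raise ValueError(f"unexpected character {source[position]!r}")
        kind, text = match.lastgroup, match.group()
        position = match.end()
        if kind == "skip":
            continue
        if kind == "identifier" and text in KEYWORDS:
            kind = "keyword"
        tokens.append((kind, text))
    return tokens


# No white space is needed: the operators themselves separate the identifiers.
print(lex("total=a+b;"))
```

Note that "total=a+b;" is split into six tokens even though it contains no spaces, which illustrates the point that operators act as separators.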
Errors in Stemming:
Stemming reduces a word to its root form, usually by stripping suffixes. There are two main kinds
of stemming error: over-stemming and under-stemming. Over-stemming occurs when two words that
have different stems are reduced to the same root. Under-stemming occurs when two words that
share a stem are reduced to different roots.
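Both kinds of error are easy to provoke with a deliberately crude suffix stripper. The sketch below is not the Porter stemmer or any other published algorithm; the suffix list and the three-character minimum are assumptions chosen only to make the errors visible.

```python
SUFFIXES = ("ity", "ing", "es", "ed", "e", "s")  # illustrative, checked in order


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Over-stemming: unrelated words collapse to the same root.
print(naive_stem("universe"), naive_stem("university"))

# Under-stemming: related words end up with different roots.
print(naive_stem("alumnus"), naive_stem("alumni"))
```

Here "universe" and "university" both become "univers" although they mean different things, while "alumnus" and "alumni" stay apart although they are forms of the same word.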
Lemmatisation
Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a
single word without knowledge of the context, and therefore cannot discriminate between words
which have different meanings depending on part of speech. However, stemmers are typically
easier to implement and run faster, and the reduced accuracy may not matter for some
applications.
For instance:
1. The word "better" has "good" as its lemma. This link is missed by stemming, as it
requires a dictionary look-up.
2. The word "walk" is the base form of the word "walking", and hence this is matched in both
stemming and lemmatisation.
3. The word "meeting" can be either the base form of a noun or a form of a verb ("to meet")
depending on the context, e.g., "in our last meeting" or "We are meeting again
tomorrow". Unlike stemming, lemmatisation can in principle select the appropriate
lemma depending on the context.
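The three cases above can be sketched as a dictionary look-up keyed by word and part of speech. The LEMMAS table and the tag names ADJ, NOUN, and VERB are hypothetical; a real lemmatiser would use a full lexicon and a part-of-speech tagger to supply the tag from context.

```python
# Toy lemma dictionary keyed by (word, part of speech); entries are illustrative.
LEMMAS = {
    ("better", "ADJ"): "good",
    ("walking", "VERB"): "walk",
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
}


def lemmatize(word, pos):
    """Look up the lemma for a word given its part-of-speech tag.

    Words missing from the toy dictionary are returned unchanged.
    """
    word = word.lower()
    return LEMMAS.get((word, pos), word)


print(lemmatize("meeting", "NOUN"))  # as in "in our last meeting"
print(lemmatize("meeting", "VERB"))  # as in "We are meeting again tomorrow"
```

The part-of-speech argument is what a stemmer lacks: it sees only the string "meeting" and must give the same answer in both sentences.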
Stop Word
In computer search engines, a stop word is a commonly used word (such as "the") that a search
engine has been programmed to ignore, both when indexing entries for searching and when
retrieving them as the result of a search query. When building the index, most engines are
programmed to remove certain words from any index entry. The list of words that are not to be
added is called a stop list. Stop words are deemed irrelevant for searching purposes because they
occur frequently in the language for which the indexing engine has been tuned. In order to save
both space and time, these words are dropped at indexing time and then ignored at search time.
Some search engines allow you to include a stop word in your search by putting an inclusion
operator (a plus sign) before each stop word in your query.
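A minimal sketch of stop-word handling at indexing time and at search time follows. The STOP_WORDS set is a short made-up stop list, and the plus-sign handling is a simplified assumption about how an inclusion operator might be parsed, not the behaviour of any particular search engine.

```python
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}  # illustrative stop list


def index_terms(text):
    """Terms kept at indexing time: every word that is not on the stop list."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def query_terms(query):
    """Terms kept at search time; a leading '+' forces a stop word to be kept."""
    terms = []
    for word in query.lower().split():
        if word.startswith("+") and len(word) > 1:
            terms.append(word[1:])  # inclusion operator: keep the word as-is
        elif word not in STOP_WORDS:
            terms.append(word)
    return terms


print(index_terms("The house of the rising sun"))
print(query_terms("+the who"))
```

Forcing a stop word into the query only helps if the engine kept that word somewhere in its index, so this sketch covers the query side of the feature only.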
A proximity search looks for documents in which the query terms occur within a specified distance
of each other. For example, a proximity search could be used to find "red brick house", and match
phrases such as "red house of brick" or "house made of red brick". By limiting the proximity,
these phrases can be matched while avoiding documents where the words are scattered or spread
across a page or in unrelated articles in an anthology.
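Limiting the proximity can be sketched with word positions: a text matches if every query term occurs and some choice of occurrences fits inside a window of a given size. The function names and the max_span parameter (window size in words) are assumptions for illustration, and the brute-force search over position combinations is only suitable for short texts.

```python
from itertools import product


def positions(text):
    """Map each lowercase word to the list of positions where it occurs."""
    index = {}
    for position, word in enumerate(text.lower().split()):
        index.setdefault(word, []).append(position)
    return index


def within_proximity(text, terms, max_span):
    """True if all terms occur and some set of occurrences fits in max_span words."""
    index = positions(text)
    if any(term not in index for term in terms):
        return False
    # Try every combination of one occurrence per term (fine for short texts).
    return any(
        max(combo) - min(combo) < max_span
        for combo in product(*(index[term] for term in terms))
    )


query = ["red", "brick", "house"]
print(within_proximity("house made of red brick", query, 5))
print(within_proximity(
    "the red car passed a brick wall before reaching the old house", query, 5))
```

The first text contains all three terms inside a five-word window and matches; the second contains the same terms spread over eleven words and does not.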