National Institute of Electronics & Information Technology: Gorakhpur Center


National Institute of Electronics & Information Technology

Gorakhpur Center
Ministry of Electronics & Information Technology (MeitY), Government of India

Shubhra Dubey
Project Engineer

http://www.nielit.gov.in/gorakhpur /GKP.NIELIT @GKP_NIELIT /NIELITIndia /school/NIELITIndia


Contents to be covered
1. Document Structures
2. Document Understanding Framework Steps
3. Data extraction from Documents


Document Structures

• We can identify 3 main categories:
1. structured,
2. semi-structured and
3. unstructured.
• Each type of document comes with its own set of challenges.
• Documents come in many shapes and forms and are usually
combinations of the three classes above.


Structured Document
Structured documents collect information in a precise format, guiding the person filling them in to the exact area
where each piece of data must be entered. They come in a fixed layout and are generally called forms.
Examples of structured documents: surveys, questionnaires, tax forms. These contain exclusively key-value pairs
and tables.


Semi-structured Document
• Semi-structured documents do not follow a strict
format the way structured forms do and are not
bound to specified data fields. They don't have a
fixed form but follow a common enough format.
They may contain paragraphs as well, but the data
is mainly found in key-value pairs.
• Examples of semi-structured
documents: invoices, receipts, purchase orders,
healthcare lab reports.


Unstructured Document

• Unstructured documents are
documents in which the
information isn't organized
according to a clear, structured
model. These files are easily
understood by human beings,
yet much harder for a robot
to process.
• Examples of unstructured
documents: contracts, annual
reports. Some may contain
key-value pairs and tables, but
much of the data is in unstructured
form inside the text.


Document Understanding Framework Steps


Taxonomy

In this pre-processing step, you define the document types you will process
and the fields you want to extract from each. For example, you might extract
the vendor and the total amount from invoices, and the insured ID number and
patient name from medical forms.
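
A taxonomy of this kind can be pictured as a small data structure. The sketch below is only an illustration in Python, not the framework's own format; the class name `DocumentType`, the constant `TAXONOMY`, the helper `fields_for` and the field identifiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentType:
    """One taxonomy entry: a document type and the fields to extract from it."""
    name: str
    fields: list[str] = field(default_factory=list)


# Hypothetical taxonomy mirroring the two examples in the text.
TAXONOMY = [
    DocumentType("Invoice", ["vendor", "total_amount"]),
    DocumentType("Medical Form", ["insured_id_number", "patient_name"]),
]


def fields_for(doc_type_name: str) -> list[str]:
    """Return the fields of interest for a document type, or [] if unknown."""
    for doc_type in TAXONOMY:
        if doc_type.name == doc_type_name:
            return doc_type.fields
    return []
```

Calling `fields_for("Invoice")` returns the two invoice fields, while a type missing from the taxonomy yields an empty list.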


Digitization

• As the documents are processed one by one, they go through the
digitization process. For non-digital (scanned) documents, you
also need to apply the OCR engine of your choice. The outputs
of this step, the Document Object Model and a string variable
containing all the document text, are passed down to the next
steps.
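
To make the two outputs concrete, here is a minimal Python sketch. It assumes a PDF reader or OCR engine has already produced, for each page, a list of `(text, x, y)` word tuples; the `Word` class and the `digitize` function are hypothetical and far simpler than a real Document Object Model.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A word and its position on a page (simplified, hypothetical model)."""
    text: str
    page: int
    x: float
    y: float


def digitize(pages: list[list[tuple[str, float, float]]]) -> tuple[list[Word], str]:
    """Build a simple object model and the full text from per-page word lists.

    `pages` stands in for the output of a PDF reader or an OCR engine:
    for each page, a list of (text, x, y) tuples in reading order.
    """
    words = []
    page_texts = []
    for page_number, page in enumerate(pages, start=1):
        for text, x, y in page:
            words.append(Word(text, page_number, x, y))
        page_texts.append(" ".join(text for text, _, _ in page))
    # One line of text per page, joined into a single string.
    return words, "\n".join(page_texts)


sample_pages = [
    [("Invoice", 10.0, 10.0), ("No.", 60.0, 10.0), ("1001", 85.0, 10.0)],
    [("Total:", 10.0, 10.0), ("250.00", 50.0, 10.0)],
]
dom, full_text = digitize(sample_pages)
```

The object model keeps positional information for later steps, while the plain string is convenient for keyword or pattern matching.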


Classification

• After digitization, the document is classified. If you are working
with multiple document types in the same project, you need to
know what type of document you're working with in order to
extract data properly. You can use multiple classifiers in the same
scope, configure them and, later in the framework, train them.
The classification results help in applying the right extraction
strategy.
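
One simple way to classify a document is by counting keywords. The Python sketch below illustrates that idea only; it is not the framework's classifier, and the `KEYWORDS` table and `classify` function are hypothetical.

```python
from typing import Optional

# Hypothetical keyword lists; a real project would tune or learn these.
KEYWORDS = {
    "Invoice": ["invoice", "total amount", "vendor"],
    "Medical Form": ["patient", "insured", "diagnosis"],
}


def classify(text: str) -> tuple[Optional[str], float]:
    """Return (document type, confidence) based on keyword hits.

    Confidence is the share of the winning type's keywords found in the text.
    Returns (None, 0.0) when no keyword matches at all.
    """
    lowered = text.lower()
    best_type, best_score = None, 0.0
    for doc_type, keywords in KEYWORDS.items():
        hits = sum(1 for keyword in keywords if keyword in lowered)
        score = hits / len(keywords)
        if score > best_score:
            best_type, best_score = doc_type, score
    return best_type, best_score
```

A document that matches no keyword comes back as `(None, 0.0)`, which is the "unknown to the classifier" case that later training is meant to reduce.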


Extraction

• Extraction means getting just the data you are interested in. For
example, extracting specific data from a 5-page document is quite
troublesome if you do it with plain string manipulation. In this
framework, you can use different extractors for the different
document structures in the same scope. The extraction results
are passed on for validation.
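
As a rough illustration of "different extractors for different document structures", the Python sketch below keeps one set of regular expressions per document type. The `EXTRACTORS` table, the field labels in the patterns and the `extract` function are assumptions made for this example; they suit documents with labelled key-value pairs, not free-form text.

```python
import re
from typing import Optional

# Hypothetical patterns: one set of extractors per document type.
EXTRACTORS = {
    "Invoice": {
        "vendor": re.compile(r"Vendor:\s*(.+)"),
        "total_amount": re.compile(r"Total Amount:\s*([\d.,]+)"),
    },
    "Medical Form": {
        "insured_id_number": re.compile(r"Insured ID:\s*(\w+)"),
        "patient_name": re.compile(r"Patient Name:\s*(.+)"),
    },
}


def extract(doc_type: str, text: str) -> dict[str, Optional[str]]:
    """Extract the fields defined for `doc_type`; missing fields map to None."""
    results = {}
    for field_name, pattern in EXTRACTORS.get(doc_type, {}).items():
        match = pattern.search(text)
        results[field_name] = match.group(1).strip() if match else None
    return results


invoice_text = "Vendor: Acme Supplies\nDate: 2024-01-15\nTotal Amount: 1,250.00\n"
invoice_fields = extract("Invoice", invoice_text)
```

The classification result selects which extractor set runs, so an unknown document type simply yields an empty result.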


Validation

• The extracted data can be validated by a human user through the
Validation Station. A best practice is to build logic that decides
whether a human validation step is needed, with rules that
depend on the specific use case. Validation results can then be
exported and used in further automation activities.
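
The decision logic mentioned above could look like the following Python sketch. It assumes each extracted field comes with a confidence score between 0 and 1; the function name, the 0.8 threshold and the required-field rule are hypothetical choices for illustration, and a real project would set them per use case.

```python
from typing import Optional


def needs_human_validation(
    fields: dict[str, tuple[Optional[str], float]],
    threshold: float = 0.8,
    required: tuple[str, ...] = ("total_amount",),
) -> bool:
    """Decide whether extracted data should be sent to a human reviewer.

    `fields` maps a field name to (value, confidence). A document is routed
    to a human if a required field is missing or empty, or if any field's
    confidence falls below `threshold`.
    """
    for name in required:
        value, _ = fields.get(name, (None, 0.0))
        if value is None:
            return True
    return any(confidence < threshold for _, confidence in fields.values())


confident = {"vendor": ("Acme", 0.95), "total_amount": ("250.00", 0.91)}
doubtful = {"vendor": ("Acme", 0.55), "total_amount": ("250.00", 0.91)}
incomplete = {"vendor": ("Acme", 0.95)}
```

With these inputs, only `confident` skips the human step; the other two are flagged for review.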


Export

• Once you have your validated information, you can use it as it is, or
save it in a DataTable format that can easily be converted into
an Excel file.
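
As a stand-in for the DataTable-to-Excel route, the Python sketch below writes validated records as CSV text, which Excel can open. This is an assumption made for illustration, not the framework's export activity; `export_to_csv` and the sample record are hypothetical.

```python
import csv
import io


def export_to_csv(records: list[dict[str, str]], fieldnames: list[str]) -> str:
    """Serialize validated records as CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


validated_records = [
    {"vendor": "Acme Supplies", "total_amount": "1,250.00"},
]
csv_text = export_to_csv(validated_records, ["vendor", "total_amount"])
```

The `csv` module quotes values that contain commas, so an amount such as `1,250.00` stays in a single column when the file is opened in a spreadsheet.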


Training Classifiers and Extractors

• Classification and extraction are only as good as the classifiers
and extractors used. If a document wasn't classified properly, it
was unknown to the active classifiers. The same goes for incorrect
data extraction. The framework provides the opportunity to train
the classifiers and the extractors, to improve recognition of the
documents and fields.
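
The idea of training from validated documents can be shown with a toy Python example: a keyword table learns new words from documents whose type a human has confirmed. This is not how the framework trains its classifiers; `train_keywords`, the word-length rule and the sample data are all assumptions for illustration.

```python
import re
from collections import Counter


def train_keywords(
    keywords: dict[str, list[str]],
    validated_docs: list[tuple[str, str]],
    top_n: int = 2,
) -> dict[str, list[str]]:
    """Return a copy of `keywords` extended with frequent words per type.

    `validated_docs` holds (text, confirmed document type) pairs. For each
    type, the `top_n` most frequent words of four or more letters that are
    not yet known are added as new keywords.
    """
    counts: dict[str, Counter] = {}
    for text, doc_type in validated_docs:
        words = re.findall(r"[a-z]{4,}", text.lower())
        counts.setdefault(doc_type, Counter()).update(words)

    # Copy the lists so the caller's table is left unchanged.
    updated = {doc_type: list(words) for doc_type, words in keywords.items()}
    for doc_type, counter in counts.items():
        known = updated.setdefault(doc_type, [])
        new_words = [word for word, _ in counter.most_common() if word not in known]
        known.extend(new_words[:top_n])
    return updated


initial = {"Invoice": ["invoice"]}
validated = [
    ("Receipt total due: 40. Receipt number 7", "Receipt"),
    ("Receipt total paid", "Receipt"),
]
trained = train_keywords(initial, validated)
```

After this step the previously unknown "Receipt" type has its own keywords, so similar documents are no longer unknown to the classifier.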


Thank You
Any queries?
