White Paper - Integration of ECM With LLMs Using LlamaIndex

This white paper proposes integrating enterprise content management (ECM) systems with large language models (LLMs) using LlamaIndex to enable capabilities like document classification, summarization, translation, and semantic search via natural language queries. The integration would involve using LlamaIndex to query LLM systems like GPT on ECM document content to extract metadata like document type and summaries, which would be stored back in the ECM. Considerations for enterprise adoption include data privacy, accuracy, and oversight of the LLM systems.

White Paper

June 12, 2023

Integration of Enterprise Content


Management with Large Language Models
using LlamaIndex

Submitted by: Anand Kushwaha (anankus1@in.ibm.com)


1. Large Language Model and OpenAI GPT Overview
Large language models (LLMs) are a type of AI system that works with language. An LLM consists of a
neural network with many parameters (typically billions of weights or more), trained on large
quantities of unlabelled text using self-supervised or semi-supervised learning.
LLMs are general-purpose models that excel at a wide range of tasks, as opposed to being trained
for one specific task (such as sentiment analysis, named entity recognition, or mathematical
reasoning).
OpenAI's GPT models are among the most popular LLMs. They use the Transformer architecture and are
pre-trained on large amounts of text data. They can generate human-like text and perform various
NLP tasks without much supervised training.

2. LlamaIndex Overview
LlamaIndex (earlier known as GPT Index) is a data framework for LLM applications. At a high level,
LlamaIndex gives us the ability to query our own data for any downstream LLM use case, whether
it's question answering, summarization, or a component in a chatbot.
LlamaIndex uses the LLM modules of LangChain (another popular framework for building generative AI
applications); the default is OpenAI's text-davinci-003 model. The chosen LLM is always used by
LlamaIndex to construct the final answer and is sometimes used during index creation as well.
Below are the high-level steps involved in using LlamaIndex -

1. Load the Documents. A Document represents a lightweight container around the data
source.
2. Parse the Document objects into Node objects. Nodes represent "chunks" of the source
Documents (e.g. a text chunk). These Node objects can be persisted to a MongoDB collection
or kept in memory.
3. Construct an Index from the Nodes. There are various kinds of indexes in LlamaIndex,
such as the "List Index" and the "Vector Store Index".
4. Query the index. This is where the query is parsed, relevant Nodes are retrieved through
the use of the index, and those Nodes are provided as input to the Large Language Model (LLM).
5. The final response from the LLM is returned to the calling application.
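The five steps above can be sketched in plain Python. The snippet below is a schematic mock, not the real LlamaIndex API: the Document and Node classes, the toy keyword "index", and the fake_llm function are simplified stand-ins intended only to show how data flows through the pipeline.

```python
# Schematic mock of the five LlamaIndex steps; the real library
# replaces each piece with far richer implementations.
from dataclasses import dataclass

@dataclass
class Document:          # step 1: lightweight container around a data source
    text: str

@dataclass
class Node:              # step 2: a "chunk" of a source Document
    text: str

def parse_to_nodes(doc: Document, chunk_size: int = 40) -> list:
    # step 2: split the Document into fixed-size word chunks
    words = doc.text.split()
    return [Node(" ".join(words[i:i + chunk_size]))
            for i in range(0, len(words), chunk_size)]

def build_index(nodes: list) -> dict:
    # step 3: toy keyword index mapping each word to the nodes containing it
    index = {}
    for node in nodes:
        for word in set(node.text.lower().split()):
            index.setdefault(word, []).append(node)
    return index

def fake_llm(prompt: str) -> str:
    # stand-in for the real LLM call; just reports how much context it saw
    return f"answer synthesized from {prompt.count('CONTEXT')} context chunk(s)"

def query(index: dict, question: str) -> str:
    # step 4: retrieve relevant nodes, then hand them to the "LLM"
    relevant = {id(n): n for w in question.lower().split()
                for n in index.get(w, [])}
    prompt = "".join(f"CONTEXT: {n.text}\n" for n in relevant.values())
    prompt += f"QUESTION: {question}"
    return fake_llm(prompt)  # step 5: response returned to the caller

doc = Document("Invoice number INV-001 issued to Acme Corp for services rendered")
nodes = parse_to_nodes(doc, chunk_size=5)
print(query(build_index(nodes), "what is the invoice number"))
```

In the real library, step 3 would typically embed each Node into a vector store and step 4 would retrieve by similarity rather than keyword overlap; the control flow, however, is the same.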

Below are a few use cases demonstrating the LlamaIndex-OpenAI GPT integration capabilities,
using the sample Invoice PDF mentioned later -
1. Document Classification
2. Document Summarization
3. Document Translation
4. Semantic Search
1. Document Classification - Below is a sample Python program which uses LlamaIndex for
document classification of our sample Invoice PDF file (stored at C:/temp/Doc1.pdf). We
ask "Classify this document in one word" and get the response "Invoice" -

2. Document Summarization - For document summarization, we ask "Summarise this
document" and get the document summary -
3. Document Translation - Below is a sample showing translation from English to
French -
4. Semantic Search - For semantic search, we can likewise retrieve the Invoice Number and Dates
when asked for -
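All four demonstrations follow the same pattern: the document content stays fixed and only the natural-language query changes. The sketch below makes that explicit. The classification and summarization prompt strings are taken from the examples above; the translation and semantic-search prompts are plausible wordings, and `ask_index` is a hypothetical stand-in for a real LlamaIndex query call, with a canned response standing in for the model.

```python
# The four capabilities differ only in the query sent to the same index.
USE_CASE_PROMPTS = {
    "classification": "Classify this document in one word",
    "summarization": "Summarise this document",
    "translation": "Translate this document to French",       # assumed wording
    "semantic_search": "What is the invoice number and date?",  # assumed wording
}

def ask_index(prompt: str) -> str:
    # Hypothetical stand-in: a real implementation would call
    # index.query(prompt) against an index built from C:/temp/Doc1.pdf.
    canned = {"Classify this document in one word": "Invoice"}
    return canned.get(prompt, "(response from LLM)")

results = {name: ask_index(p) for name, p in USE_CASE_PROMPTS.items()}
print(results["classification"])
```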
3. ECM-LLM integration use cases

The capabilities of LlamaIndex demonstrated above can be utilized in Enterprise Content
Management for the use cases below -

1. Document classification - In ECM, document classification is often used to help
find the document we need more quickly. It involves assigning a document to one or
more categories depending on its content. This process can be manual or automatic
and can be done using a variety of techniques. The goal is to improve efficiency
and accuracy in document management. Manual classification is done by people,
usually experts who have knowledge of the subject and know how to classify
documents correctly. Automated classification, on the other hand, is performed by
machines and can be done in many ways - through optical character recognition or
natural language processing, for example.

The benefits of document classification for organizations are as follows -

 Protecting sensitive or confidential data
 Managing large volumes of data in a structured way under a single document
repository
 Ensuring that documents are properly classified according to the
organization's policies and procedures
 Improving efficiency by reducing the time spent searching for documents,
sorting them, and filing them away

2. Document/Text summarization/preview - Summarization is the task of producing a
shorter version of a document while preserving its important information.
It can be performed manually or automated using the Natural Language Processing (NLP)
capabilities included in LLMs. Text summarization is an important task for both human
readers and NLP systems, as it allows for faster consumption of large amounts of
information.

The benefits of document/text summarization for organizations are as follows -

 Saves time: A summary can save us a lot of time, especially if we have to
analyse or review a lot of text. Instead of spending hours reading through an
entire document, we can just read the summary and get the gist of it in a
fraction of the time.
 Allows us to focus on the main ideas: Unnecessary details can distract us
from understanding or reviewing an article, but a summary helps us focus on
the main ideas so that we can better understand the document as a whole.
 Helps us understand the content better: When we see a document's summary,
it is easier to understand what the document is about. We get a clear picture
of the main points the author is trying to make, and we can quickly see
whether there are any concepts we need to review or research further.

3. Document Translation - Document translation is also a key requirement in ECM
where LLMs can be used. Below are a few key benefits -

 Facilitates international communication
 Increases access to information
 Promotes cultural understanding
 Improves educational opportunities

4. Cognitive search using natural language query processing - Cognitive search
represents a new generation of enterprise search that uses artificial intelligence (AI)
technologies to improve users' search queries and extract relevant information from
multiple diverse data sets.

Keyword-based search and traditional enterprise search have become inadequate
due to the increasing variety and amount of data used within organizations. These two
methodologies impair search processes and employee productivity by returning
irrelevant or incomplete results that users must sort through to find the information
they need.

Some overall benefits of cognitive search include -

 Maximized productivity. A single search functionality removes the necessity
of switching between apps and eliminates time wasted on tasks like
re-entering credentials multiple times. Furthermore, the unification of data
tools allows organizations to streamline their business processes.
 Improved employee experience and engagement. Employee loyalty is
promoted through the elimination of wasted time and the increase in
productivity. Machine learning (ML) algorithms that provide personalized
suggestions help users find relevant data more quickly, and the flexibility of
cognitive search creates an improved user experience through
personalization. Since an employee's search experience is improved, they are
more likely to use the tools consistently.
 Lower operational costs. Maximized productivity decreases an organization's
operational costs, since less time and fewer resources are needed for
gathering information and knowledge discovery. This is especially beneficial
to industries such as healthcare and legal services that work with massive
amounts of data.
4. ECM-LLM integration option

Below is a sample integration approach which we tested as part of a PoC. In this approach, a
batch job sends the "document content" and "query text" (such as "Classify this document in
one word" or "Summarise this document") to LlamaIndex and uses the response received to
set the document classification/summary as metadata on the document.
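The batch approach can be outlined as follows. Everything here is a hypothetical sketch of the control flow, not the PoC code itself: `fetch_documents`, `get_content` and `set_properties` stand in for real ECM (e.g. FileNet) API calls, and `query_llamaindex` stands in for the LlamaIndex call; the in-memory stubs below only simulate their behaviour.

```python
# Hypothetical outline of the ECM-LLM batch job; the ECM and
# LlamaIndex calls are simulated with in-memory stubs.

# --- stubs standing in for the real ECM and LlamaIndex APIs ---
ECM_STORE = {
    "doc-1": {"content": "Invoice INV-001 from Acme Corp",
              "Document Type": "", "Document Summary": ""},
}

def fetch_documents() -> list:
    # real version: query the ECM repository for documents to process
    return list(ECM_STORE)

def get_content(doc_id: str) -> str:
    # real version: download the document content from the ECM
    return ECM_STORE[doc_id]["content"]

def query_llamaindex(content: str, query_text: str) -> str:
    # real version: build/query a LlamaIndex index over the content
    if query_text == "Classify this document in one word":
        return "Invoice"
    return "Summary of: " + content[:20]

def set_properties(doc_id: str, doc_type: str, summary: str) -> None:
    # real version: write the custom properties back to the ECM document
    ECM_STORE[doc_id]["Document Type"] = doc_type
    ECM_STORE[doc_id]["Document Summary"] = summary

# --- the batch job itself ---
def run_batch_job() -> None:
    for doc_id in fetch_documents():
        content = get_content(doc_id)
        doc_type = query_llamaindex(content, "Classify this document in one word")
        summary = query_llamaindex(content, "Summarise this document")
        set_properties(doc_id, doc_type, summary)

run_batch_job()
print(ECM_STORE["doc-1"]["Document Type"])
```

The same loop body, minus the loop, is what an event handler would execute when triggered by a "Document Check-in" event.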

For the demo, we uploaded three sample PDFs into an IBM FileNet object store, which we can
view from IBM Content Navigator. These documents have two custom properties, "Document
Type" and "Document Summary", which are initially blank -
These sample PDF documents are an Invoice, a Newsletter and an Annual Report -

Sample Invoice PDF

Sample Newsletter PDF

Sample Annual Report PDF

After execution of the ECM-LLM integration batch job, we see the values of "Document
Type" and "Document Summary" populated -

Alternatively, the process can be triggered by an event such as "Document Check-in".
In this case the event handler sends the "document content" and "query text" to
LlamaIndex and uses the response received according to the use case.
5. Considerations before using this integration at the enterprise level -

The use of LLMs like OpenAI GPT with enterprise data is still evolving, so we should
consider the points below before using this integration at the enterprise level -

 We need to be mindful of complying with our enterprise confidentiality
policies, as the data is sent to OpenAI.

 The accuracy of responses to queries is still dependent on the LLM.

 There is no consistent, scalable way to determine whether the answers provided by
LLMs are correct or not.

 Every call to the OpenAI GPT models for indexing/querying is charged based on the
number of tokens involved.
