White Paper - Integration of ECM With LLMs Using LlamaIndex
2. LlamaIndex Overview
LlamaIndex (earlier known as GPT Index) is a data framework for LLM applications. At a high
level, LlamaIndex gives us the ability to query our own data for any downstream LLM use case,
whether it is question answering, summarization, or a component in a chatbot.
LlamaIndex uses the LLM modules of LangChain (another popular framework for building Generative
AI applications), with OpenAI’s text-davinci-003 model as the default. The chosen LLM is always
used by LlamaIndex to construct the final answer and is sometimes also used during index creation.
Below are the high-level steps involved in using LlamaIndex -
1. Load the Documents. A Document represents a lightweight container around the data
source.
2. Parse the Document objects into Node objects. Nodes represent “chunks” of source
Documents (e.g. a text chunk). These Node objects can be persisted to a MongoDB collection
or kept in memory.
3. Construct an Index from the Nodes. There are various kinds of indexes in LlamaIndex, such
as the “List Index” and the “Vector Store Index”.
4. Finally, query the index. This is where the query is parsed, relevant Nodes are retrieved
through the use of indexes, and the result is provided as input to a “Large Language Model” (LLM).
5. The final response from the LLM is returned to the calling application.
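The five steps above can be sketched in code roughly as follows. This is a minimal sketch, not the paper's own implementation: it assumes a current LlamaIndex release (class names such as `SimpleDirectoryReader`, `SentenceSplitter` and `VectorStoreIndex` have changed across versions), `pip install llama-index`, and an `OPENAI_API_KEY` environment variable.

```python
def build_and_query(pdf_path: str, question: str) -> str:
    """Walk through the five LlamaIndex steps for a single PDF."""
    # Imports are deferred so the sketch can be read without llama-index installed.
    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
    from llama_index.core.node_parser import SentenceSplitter

    # 1. Load the Documents.
    documents = SimpleDirectoryReader(input_files=[pdf_path]).load_data()
    # 2. Parse the Documents into Node objects ("chunks").
    nodes = SentenceSplitter(chunk_size=512).get_nodes_from_documents(documents)
    # 3. Construct an index (here a Vector Store Index) from the Nodes.
    index = VectorStoreIndex(nodes)
    # 4./5. Query the index: relevant Nodes are retrieved and passed to the LLM,
    # and the LLM's response is returned to the caller.
    return str(index.as_query_engine().query(question))

# Example call (requires the sample PDF and OpenAI API access):
# build_and_query("C:/temp/Doc1.pdf", "Summarise this document")
```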
Below are a few use cases demonstrating the LlamaIndex-OpenAI GPT model integration
capabilities, using the mentioned sample Invoice PDF –
1. Document Classification
2. Document Summarization
3. Document Translation
4. Semantic Search
1. Document Classification - Below is a sample Python code snippet which uses LlamaIndex
for Document Classification of our sample Invoice PDF file (stored at C:/temp/Doc1.pdf).
We ask “Classify this document in one word” and get the response “Invoice” -
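A minimal sketch of such a classification call is shown below. It assumes current LlamaIndex API names (older GPT Index releases used `GPTSimpleVectorIndex` instead) and an `OPENAI_API_KEY` environment variable; the actual code used in the demo may differ.

```python
def classify_document(pdf_path: str) -> str:
    """Ask the LLM to classify a document in one word, e.g. "Invoice"."""
    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

    documents = SimpleDirectoryReader(input_files=[pdf_path]).load_data()
    index = VectorStoreIndex.from_documents(documents)
    response = index.as_query_engine().query("Classify this document in one word")
    return str(response).strip()

# Example (requires the sample PDF and OpenAI API access);
# for the sample invoice the paper reports the answer "Invoice":
# classify_document("C:/temp/Doc1.pdf")
```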
4. Semantic Search - Semantic Search works as well: we can retrieve the Invoice Number
and Dates when we ask for them -
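Field-extraction queries like these can be issued against the same index. The sketch below makes the same assumptions as the classification example (current LlamaIndex API names, `OPENAI_API_KEY` set); the question wording is illustrative.

```python
def extract_invoice_fields(pdf_path: str) -> dict:
    """Run semantic-search style questions against the indexed PDF."""
    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

    documents = SimpleDirectoryReader(input_files=[pdf_path]).load_data()
    engine = VectorStoreIndex.from_documents(documents).as_query_engine()
    questions = {
        "invoice_number": "What is the Invoice Number?",
        "invoice_date": "What is the Invoice Date?",
    }
    # One LLM-backed query per field; each call is billed separately.
    return {field: str(engine.query(q)).strip() for field, q in questions.items()}
```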
3. ECM-LLM integration use cases
Below is a sample integration approach which we tested as part of a PoC. In this approach, a
batch job sends the “document content” and a “query text” (such as “Classify this document in
one word” or “Summarise this document”) to LlamaIndex and uses the response received to
set the Document Classification/Summary as metadata on the document.
For the demo, we uploaded three sample PDFs into an IBM FileNet object store, which we can
view from IBM Content Navigator. These documents have two custom properties, “Document
Type” and “Document Summary”, which are blank at this point –
These sample PDF documents are an Invoice, a Newsletter and an Annual Report -
After the ECM-LLM integration batch job executes, we see the values of “Document
Type” and “Document Summary” populated -
Alternatively, the process can also be triggered by an event such as “Document Check-in”.
In this case, the event handler sends the “document content” and “query text” to
LlamaIndex and uses the response received according to the use case.
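Whether driven by the batch job or a check-in event, the enrichment logic is the same. The library-agnostic sketch below illustrates it; the `ask_llm` callable is a hypothetical stand-in for the LlamaIndex query shown earlier, and writing the values back to FileNet (via the Content Engine API) is not shown.

```python
# Query texts mapped to the custom properties used in the demo.
QUERIES = {
    "Document Type": "Classify this document in one word",
    "Document Summary": "Summarise this document",
}

def enrich_document(document_content: str, ask_llm) -> dict:
    """Return the metadata values to set on the ECM document.

    `ask_llm(content, query)` stands in for the LlamaIndex/LLM call;
    the caller (batch job or event handler) is responsible for writing
    the returned values back to the document's properties.
    """
    return {prop: ask_llm(document_content, query)
            for prop, query in QUERIES.items()}
```

The same function can be called from either trigger path, keeping the LLM interaction in one place.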
5. Considerations before using this integration at Enterprise Level –
The use of LLMs like OpenAI GPT on enterprise data is still evolving, so we should
consider the points below before using this integration at the enterprise level -
Every call to the OpenAI GPT models for indexing or querying is charged based on the
number of tokens involved.
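This means cost scales linearly with document volume. A back-of-the-envelope estimate can be made as below; the per-1,000-token price and the token count are purely illustrative assumptions, so the current OpenAI pricing page should be checked for real figures.

```python
def estimate_cost_usd(num_tokens: int, price_per_1k_tokens: float) -> float:
    """Rough cost of a batch of API calls, given a per-1,000-token price."""
    return num_tokens / 1000 * price_per_1k_tokens

# e.g. a document batch producing ~40,000 tokens at an assumed $0.02 per 1K
# tokens costs about $0.80:
# estimate_cost_usd(40_000, 0.02)
```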