Guide: Retrieval Augmented Generation with Montag

Montag is an AI Governance Platform that makes it easy and secure for Enterprises to adopt Generative AI.

Generative AI is highly capable and a great resource for general knowledge questions, but it doesn’t necessarily perform as well on deep domain topics or on information it has never seen before.

This is where Retrieval Augmented Generation comes in. We’re not going to cover the whole topic here, as there are many other resources on the internet that have done definitive work on how and why RAG works, but it’s worth at least covering the basics.

Retrieval Augmented Generation essentially means giving your LLM everything it might need to know to complete your task as part of the input prompt. This is usually accomplished with a pair of very clever technologies, embeddings and vector databases, though you could do it with any kind of search-based retrieval.

To get started in RAG, you need:

  1. Some content you want to ask questions about
  2. A vector database to store your embeddings (and potentially the actual metadata, such as the content itself)
  3. An embedding model (a model that converts the text chunk into vectors)

Once we have the basics, we also need to put a process together for ingesting and processing the content into the vector database, not an easy task…

The actual process is:

  1. Take each piece of content you have
  2. Break it into chunks
  3. Feed those chunks into an embedding model to get their vector representations
  4. Feed each of those vector representations into your vector database.
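
To make this concrete, here is a rough sketch of what that ingest loop looks like outside of Montag, using the OpenAI embeddings API and a local ChromaDB instance (both of which Montag supports). The model name, chunk size and collection name are illustrative choices, not Montag defaults, and Montag’s own pipeline will differ in the details:

# Illustrative ingest pipeline: chunk each document, embed the chunks, store them.
import chromadb
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
chroma = chromadb.PersistentClient(path="./vector-store")
collection = chroma.get_or_create_collection(name="bertrand-001")

def chunk(text: str, size: int = 1000) -> list[str]:
    # Naive fixed-size character chunking; real pipelines usually chunk by tokens.
    return [text[i:i + size] for i in range(0, len(text), size)]

def ingest(doc_name: str, text: str) -> None:
    chunks = chunk(text)
    embeddings = openai_client.embeddings.create(
        model="text-embedding-3-small", input=chunks
    )
    collection.add(
        ids=[f"{doc_name}?part={i}" for i in range(len(chunks))],
        embeddings=[item.embedding for item in embeddings.data],
        documents=chunks,
        metadatas=[{"title": doc_name} for _ in chunks],
    )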

Later, you run your LLM input query through the same embedding model and use that output as a query to your vector DB. The most relevant chunks of content are then included in the prompt to the LLM as front-matter, along with the original user question. We’ll cover that later on in this guide.
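
The query side of the same sketch (again illustrative, continuing the code above) embeds the question with the same model, pulls back the nearest chunks, and prepends them to the prompt:

# Illustrative retrieval step: embed the question, fetch the closest chunks,
# and build a prompt that includes them as front-matter.
def retrieve(question: str, top_k: int = 3) -> list[str]:
    query_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small", input=[question]
    ).data[0].embedding
    results = collection.query(query_embeddings=[query_embedding], n_results=top_k)
    return results["documents"][0]  # the matching chunk texts

def build_prompt(question: str) -> str:
    context = "\n\n".join(retrieve(question))
    return (
        "Use the following information to answer my question:\n"
        f"{context}\n\n"
        f"Answer this question based on the information provided above: {question}"
    )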

Part 1: Getting content into your database

Video Guide (Part 1)

First off, we need to put our content into a zip file. It can be nested in folders; that doesn’t matter, as the Montag file processor will iterate through the whole tree. It works best with pure text or markdown.

Once you’ve got your zip file ready, we’re going to create a Text Collection. Text Collections are instructions for Montag about how to search, encode, enhance, store and retrieve embeddings.

For our example we have used the collected essays of Bertrand Russell, which are in the public domain, and are available via Project Gutenberg.

Step 1: Provide a name and a description:

When you create a text collection, it is not just a way for you to ingest and reference content that you wish to query; it is also a way for you to share that content store with other developers in your organisation through the AI Portal. So it’s important to give it a good name and description:

Name: Bertrand Russell Corpus
Description: Collected essays of Bertrand Russell

Step 2: Storage Settings

Now we choose how to store this in the database itself. These settings can be daunting initially, so let’s go through them one by one:

  1. Storage Client - if you are using our quickstart, this will be ChromaDB, though we also support Pinecone
  2. Embedding Client - we’ll pick OpenAI’s Embedding endpoint here as it’s fast and cheap, though you could use the local embedding model runner we ship with if you want to go that route and not pay for the embeddings
  3. Embedding Settings - these define the model we are going to use and the token limits for the model. We can also set the chunking size for the content (see the sketch after this list)
  4. Namespace - think of this as a tag for all the content in the DB for this collection, this is how we separate out text collections from one another in vector DBs
  5. Privacy Tier - in this case as these are public domain documents, we’ll make these the lowest privacy setting.
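
As a concrete illustration of the chunking mentioned in the Embedding Settings, one common approach is to split by token count with the tiktoken library. The encoding and chunk size below are illustrative; Montag performs its own chunking when it processes your upload:

# Illustrative token-based chunking (Montag does this for you during ingestion).
import tiktoken

def chunk_by_tokens(text: str, chunk_tokens: int = 512) -> list[str]:
    encoding = tiktoken.get_encoding("cl100k_base")  # encoding used by recent OpenAI models
    tokens = encoding.encode(text)
    return [
        encoding.decode(tokens[i:i + chunk_tokens])
        for i in range(0, len(tokens), chunk_tokens)
    ]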

Why Privacy Tiers?

A Privacy Tier is a way for you to make sure that the LLMs you will be sending your text collection data to are contracted, or otherwise guaranteed, to be capable of handling the privacy level of that data.

So, for example, an internal LLM run on your own hardware could probably process confidential information because the data never leaves your network, but for this data, I think we can safely send it to OpenAI.

For the sake of this guide, we’ll skip over Content Enhancement for now. Briefly explained: this section enables you to create a pre-processing stage where you use an LLM (or multiple LLMs) to enhance your content chunks with metadata such as keywords or a summary of the page.
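
As a rough idea of what such an enhancement step could produce (the model and prompt here are purely illustrative, not what Montag runs internally; this reuses the OpenAI client from the earlier sketch):

# Illustrative enhancement step: ask an LLM for keywords and a short summary
# to store alongside a chunk as extra metadata.
def enhance(chunk_text: str) -> dict:
    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Reply with five comma-separated keywords and a one-sentence "
                       f"summary of the following text:\n\n{chunk_text}",
        }],
    )
    return {"enhancement": completion.choices[0].message.content}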

Step 3: Save it and upload the content

We now save that and come back to the list view of all your Text Collections. Here you will see an “Upload” button beside the collection we just created.

When you select this button, you can click the upload area, and upload that zip of text documents. Montag will process the documents in the background and let you know when it’s done.

Part 2: Querying the Text Collection

Video Guide (Part 2)

The fastest way to test your RAG data is in the Prompt Tester, so navigate there in your Montag UI.

In the Prompt Tester you can leave the defaults for most settings except the ones we discuss below. First, we need to update the RAG settings.

Step 1: Update the RAG Settings

When we created this Text Collection, we created a namespace called bertrand-001 and used OpenAI embeddings, and we put that data into the local database:

  1. Select OpenAI for Embed Client
  2. Select OpenAI for Embed Settings
  3. Select ChromaDB for Vector DB
  4. Set the Namespace to bertrand-001

For this example, even though GPT-3.5 now supports up to a 16k context, we’ll just include three references under Embeddings to Include in Prompt, and make sure that the context score is at least 0.5 in terms of relevance. This score is between 0 and 1, with values closer to 1 indicating higher relevance.
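
In code terms, those two settings roughly correspond to a filter like this over the retrieval results (continuing the earlier sketch; ChromaDB reports a distance rather than a score, so mapping it to a 0-1 relevance value as below assumes a cosine distance metric and is only illustrative):

# Keep at most three chunks whose relevance score is at least 0.5.
def retrieve_filtered(question: str) -> list[str]:
    query_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small", input=[question]
    ).data[0].embedding
    results = collection.query(query_embeddings=[query_embedding], n_results=3)
    return [
        doc
        for doc, distance in zip(results["documents"][0], results["distances"][0])
        if (1.0 - distance) >= 0.5  # rough cosine-distance-to-relevance conversion
    ]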

Step 2: Update the Prompt Settings

Next we need to make a few changes to the prompt. Montag gives you full control over how and what is sent to the LLM. We’re going to use the prompt below; don’t worry if it looks complicated, we’ll explain it in a second:

Prompt Template:

{{.Instructions}}
{{.ModelRoles.User}} {{ if .ContextToRender }}Use the following information to answer my question:{{ range $ctx := .ContextToRender }}
{{$ctx}}
{{ end }}{{ end }}

{{.ModelRoles.User}} Answer this question based on the information provided above: {{.Body}}
{{.ModelRoles.AI}}

Template Variables

The prompt template is a Go template, and the variables are as follows:

  • .Instructions - this is the instructions for the LLM, in this case it’s the default instructions for the AI
  • .ModelRoles.User - this is the role of the user in the prompt, in this case it’s the default role for the User, because different LLMs use different prompting systems and styles, using the Model Roles instead of hard-coding the role makes it easier to swap out LLMs in the future for the same functionality
  • .ContextToRender - this is the context that we are going to send to the LLM, in this case it’s the top 3 most relevant chunks of text from the Text Collection
  • .Body - this is the question we want to ask the LLM
  • .ModelRoles.AI - this is the role of the AI in the prompt, in this case it’s the default role for the AI, because different LLMs use different prompting systems and styles, using the Model Roles instead of hard-coding the role makes it easier to swap out LLMs in the future for the same functionality
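
To make this concrete, if we assume the model roles render as “USER:” and “ASSISTANT:”, the final prompt sent to the LLM would look roughly like this (placeholders in angle brackets):

<instructions for the LLM>
USER: Use the following information to answer my question:
<chunk 1>
<chunk 2>
<chunk 3>

USER: Answer this question based on the information provided above: <your question>
ASSISTANT: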

In order to check that it actually works, we want to know what references were used in the answer the LLM gave us in its response, so we can amend the response template to look like this:

Response Template:

<@{{.User}}> {{.Response}}{{ if .Titles }}

*References:*{{ range .Titles }}
> {{.}}{{end}}
{{end}}

The Response Template has a few new variables:

  • .User - this is the username or user-id of the user, the prompt tester will assign a random ID.
  • .Response - this is the response from the LLM
  • .Titles - this is the list of titles of the chunks of text that were used in the response, the references.

Step 3: Run the Prompt

With that all done, let’s ask the LLM a question related to the body of text:

Question: What was Bertrand Russell's opinion on how to end the duel as a practice?

The LLM should now output an answer and several references, like this:

<@d7a0a506-63aa-424b-9e6d-876fc21cd9e3> Based on the information provided, Bertrand Russell's opinion on how to end the duel as a practice is to establish political institutions that make men averse to war. He believes that through the influence of institutions and habits, men can learn to look back upon war as a barbaric practice, similar to the burning of heretics or human sacrifice to heathen deities. He suggests that if political contest within a World-State were substituted for war, imagination would soon accustom itself to the new situation, and war would be seen as a thing of the past.

*References:*
> Why Men Fight.txt?part=0
> Why Men Fight.txt?part=18
> Why Men Fight.txt?part=43

And there you have it: an answer that is based on references from the text collection we submitted. The references show which text was used, in this case a very specific one about fighting, and which chunks were referenced.

The prompt tester uses all of the base objects of Montag to make the testing possible, each of these sections can be individually configured as a component, for example:

  • The Completion Settings section is encapsulated in LLM Configurations
  • The Embedding dependencies of API Client, Embed Settings and Vector DB are captured in Embed Configurations. Namespaces (and their associated privacy tiers) are individual objects that are part of a Text Collection; if you already have a corpus that you want to query, but not reference directly in Montag as a Text Collection, it can still be tagged with a Privacy Tier and queried in advanced use cases like scripts and AI functions
  • The Prompt Configurations are stored collectively in the “Prompts” section of the app

For things like Bots and AI Functions you can quickly select each of these components and put them together like building blocks, and therefore also just as quickly reconfigure them with different back-ends without having to redeclare them.