
What on Earth Is RAG When Talking About LLMs and AI?

By Jumber Mdivnishvili

AI is a hot topic nowadays. There are many new AI-related terms you may encounter, and understanding what they mean and how they are used can be valuable.

One of the most important solution patterns for LLM-based projects is Retrieval-Augmented Generation (RAG). In this article, we will discuss what RAG is, why it is so useful, and what it actually means. Let’s take a look!

Search Engines vs. LLMs

Search engines operate by taking user queries and, using a variety of retrieval methods and algorithms, matching them to relevant web pages. The search engine then returns a list of these pages, which the user can manually review to find the best match for their query.

Large language models (LLMs) are trained on massive datasets. When given a prompt, an LLM generates original content based on the patterns and information it learned from its training data, producing a response that matches the prompt.

LLMs can do this manual work for us and give us the information directly, removing the need to browse from one webpage to another.

What Is RAG?

Many businesses want to leverage LLMs in their business processes – in chatbots, or to analyze information and generate useful responses – but this can cause issues.

As mentioned before, LLMs are trained on huge numbers of documents, many of which come from generally available online content. So what about the documents – the knowledge base – specific to the company that wants to leverage an LLM?

The company-specific knowledge base is not exposed on the internet, nor is it in the documents on which LLMs are trained. So how will an LLM respond to customer queries without company- and product-specific knowledge? This is where RAG comes in.

RAG boosts the capability of an LLM by giving it the context it will use to generate its response – in this case, the LLM will have access to private company knowledge when generating responses.

Four main steps need to be followed to implement a RAG solution.

  1. Preparation: Gather the knowledge base documents, e.g. text documents, PDFs, Excel files, and so on.
  2. Preprocessing: Developers preprocess the documents by removing information that is unnecessary for the specific use case. For example, if we want an LLM to answer Salesforce questions using the knowledge from SF Ben articles, author information would be removed during preprocessing, as it is irrelevant to the LLM’s ability to answer those questions.
  3. Chunking: After preprocessing, the data should be chunked. To answer a question, the LLM might not need the whole document – the answer might sit in a single section, so passing only the relevant chunks improves overall performance.
  4. Embedding Chunks: Embedding is the process of translating text into vectors. Embeddings live in a high-dimensional space where semantically similar texts end up close to each other. To utilize RAG effectively, chunks of text should be represented as embeddings and then stored in a VDB (vector database) – see the sketch after this list.
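
To make the four steps concrete, here is a minimal Python sketch. It assumes the document is already preprocessed, uses a toy hash-based `embed` function as a stand-in for a real embedding model, and uses a plain Python list as a stand-in for a vector database – all simplifications for illustration.

```python
import hashlib

def embed(text: str, dims: int = 3) -> list[float]:
    """Toy stand-in for a real embedding model (e.g. a sentence-transformer or an
    embedding API). Real embeddings have hundreds or thousands of dimensions and
    capture meaning, which this hash trick does not."""
    digest = hashlib.sha256(text.lower().encode()).digest()
    return [byte / 255 for byte in digest[:dims]]

def chunk(document: str, max_words: int = 50) -> list[str]:
    """Naive fixed-size chunking by word count; production systems often chunk by
    section or paragraph, or by token count with some overlap."""
    words = document.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# Steps 1-2: assume `article_text` is an already-preprocessed knowledge base document.
article_text = "Queueable Apex is an improvement of the Future method ..."

# Steps 3-4: chunk the document, embed every chunk, and store both in a simple
# in-memory "vector database" (a real project would use a dedicated VDB).
vector_db = [{"chunk": c, "vector": embed(c)} for c in chunk(article_text)]
print(vector_db)
```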

Let’s use the idea of a chatbot that has access to the VDB of SF Ben articles. The VDB would look something like the table below. 

Note: Keep in mind that this table is a simplified version.

Chunk | Embedded Form of Chunk (Vector)
--- | ---
“In my opinion, Queueable Apex is an improvement of the Future method – I would only use the Future method over Queueable in test classes to overcome mixed DML Error.” | (73, 89, 65)
“We cannot monitor the process executed in the Future method as we don’t have a direct way of seeing whether it is still running or is completed.” | (92, 94, 8)
“However, we can monitor the Queueable job and control the flow of our application based on that.” | (39, 75, 93)
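
To illustrate how the chatbot would use this table, here is a hedged retrieval sketch: it scores each stored vector against a query vector with cosine similarity and returns the closest chunk(s). The `retrieve` and `cosine_similarity` helpers and the query vector are illustrative, not a real VDB API.

```python
import math

# The simplified VDB from the table above (the three-dimensional vectors are illustrative).
vector_db = [
    {"chunk": "In my opinion, Queueable Apex is an improvement of the Future method ...",
     "vector": [73, 89, 65]},
    {"chunk": "We cannot monitor the process executed in the Future method ...",
     "vector": [92, 94, 8]},
    {"chunk": "However, we can monitor the Queueable job ...",
     "vector": [39, 75, 93]},
]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity of two vectors: 1.0 means they point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_vector: list[float], top_k: int = 1) -> list[str]:
    """Return the chunks whose vectors are most similar to the query vector."""
    ranked = sorted(vector_db,
                    key=lambda row: cosine_similarity(query_vector, row["vector"]),
                    reverse=True)
    return [row["chunk"] for row in ranked[:top_k]]

# In a real system the user's question would be embedded with the same model used
# for the chunks; here we pass a made-up query vector about the Future method.
print(retrieve([90, 90, 10]))
```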

So when exactly will RAG be used in the flow of communication between the user and the chatbot? To answer this question, let’s take a look at the flow of direct communication between a user and an LLM:

Now, what is the communication process between users and an SF Ben chatbot that uses an LLM?

Keep in mind that a user query to the chat is not the prompt in this case – the business would process the user query on a backend server, prepare a prompt for the LLM, and attach the user query to it.

In the case of an SF Ben chatbot, the backend server would contain code that prepares a prompt, which would look something like this: “You are a knowledgeable Salesforce expert who responds to users’ questions about various Salesforce topics. Please answer the following question in a helpful and polite manner. Here is the question: [USER’S QUESTION GOES HERE].”

Note: This prompt is not very detailed – it could be much better – but I kept it short for simplicity.
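
As a rough sketch of that backend step – the function name and the example question are assumptions for illustration – the prompt preparation could look like this:

```python
def build_prompt(user_question: str) -> str:
    """Wrap the raw chat message in the backend's prompt template; the user's
    message alone is not what gets sent to the LLM."""
    return (
        "You are a knowledgeable Salesforce expert who responds to users' questions "
        "about various Salesforce topics. Please answer the following question in a "
        "helpful and polite manner. Here is the question: " + user_question
    )

print(build_prompt("Can I monitor a job started with the Future method?"))
```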

However, RAG is missing from the above picture. In this case, the LLM would answer the user’s question based only on the knowledge from its training documents. We would like to inject the chunk(s) from the VDB that answer the user’s question into the prompt above.

So with the RAG module, the flow would look like this:

Now the prompt would look like this: “You are a knowledgeable Salesforce expert who responds to users’ questions about various Salesforce topics. Please answer the following question in a helpful and polite manner. Use the following information: [CHUNK(S) RETRIEVED FROM THE VDB GO HERE], and answer this question: [USER’S QUESTION GOES HERE].”

So now the prompt is more dynamic, and we are attaching concrete information – the retrieved chunk(s). We give the LLM only the portion of the knowledge base it needs to answer a question, instead of passing it the entire huge knowledge base; that’s why chunking is important.
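
Putting the pieces together, here is a hedged sketch of the RAG-enabled backend flow. It reuses the illustrative `retrieve` helper from the earlier sketch and a made-up query vector; in practice the question would be embedded with the same model as the chunks and the assembled prompt would be sent to a real LLM API.

```python
def build_rag_prompt(user_question: str, retrieved_chunks: list[str]) -> str:
    """Same backend prompt as before, but with the chunk(s) retrieved from the
    VDB injected as context for the LLM."""
    context = " ".join(retrieved_chunks)
    return (
        "You are a knowledgeable Salesforce expert who responds to users' questions "
        "about various Salesforce topics. Please answer the following question in a "
        "helpful and polite manner. Use the following information: " + context +
        ", and answer this question: " + user_question
    )

# `retrieve` is the earlier sketch; the query vector is made up for illustration,
# since a real system would embed the question with the same model as the chunks.
question = "Can I monitor a job started with the Future method?"
chunks = retrieve([90, 90, 10], top_k=1)
prompt = build_rag_prompt(question, chunks)
print(prompt)  # this prompt would then be sent to the LLM
```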

Final Thoughts

RAG is a way to augment an LLM with a company’s private knowledge base. It helps us make the LLM more specific to our business needs, thereby providing customers with a much better experience.

The Author

Jumber Mdivnishvili

Jumber is a 7x Certified Salesforce Developer with years of experience working on Salesforce Sales Cloud, Service Cloud, and Experience Cloud.
