Overview
While Large Language Models (LLMs) are increasingly capable of performing general reasoning and solving everyday tasks, they typically lack the context to address specialized use cases, like answering questions about your proprietary data. This is because they lack the necessary context to understand your data — so instead, they often default to saying they don’t know, or worse, hallucinating wrong information. Instead, a better solution is to ground your LLM on the data it needs to generate highly effective responses, rather than forcing it to guess on topics it doesn’t have context on.
While this process sounds simple, when it comes to chat applications, there are many pitfalls in ensuring that you retrieve the right information from your knowledge base. At Arcus, we’ve built and run information retrieval systems at planet scale to discover and incorporate the most relevant context from your data into your LLMs, grounding them with real data to prevent hallucinations. Our information retrieval capabilities are customizable to your data and domain, enabling users to power personalized LLM applications, such as domain-specific, chat-based copilots.
The Challenge of Building “Chat over Your Data” Applications
A simple solution to building an LLM chat application that’s grounded on your data works as follows:
- Index and structure your data in a way that can be used for semantic retrieval. This requires large-scale data processing and intelligent methods for extracting metadata while preserving the overall structure and content of your data.
- Build or configure an information retrieval system that is capable of retrieving relevant context for a given set of chat messages. This context will represent snippets of data from your index that, when synthesized, can provide answers to users’ prompts.
- Configure a “post-processing” step that combines retrieved context with a user’s chat history to send to an LLM. This will ensure that your LLM can synthesize and understand your data in order to generate accurate responses to users’ chat queries.
While steps 1 and 3 above are required for building any LLM application grounded on additional data and challenges in their own right, step 2 can be especially tricky when dealing with chat applications.
Typically, information retrieval systems built for LLM applications take in a snippet of text as a query and retrieve indexed data that’s highly similar to the text provided. These information retrieval algorithms usually rely on vector semantic search, keyword-based searches, or more intelligent approaches. However, in the context of chat applications, deciding what text to use for your query is surprisingly difficult.
Challenge: What data from the chat history should you use for retrieval?
Consider the scenario where we build a chatbot to answer questions about various tech companies, dynamically retrieving context from a database of company summaries to enhance the knowledge of our LLM. Below is an example interaction that a user might have with our chatbot:
When the user asks their final prompt, our goal is to retrieve the company summary relating to Arcus and use that to answer the user’s question. However, the user’s final prompt doesn’t mention Arcus directly and to the model the “it” in the prompt is ambiguous — as the prompt refers to history in the chat. This means simply attempting to retrieve data based on the user’s final prompt won’t give us the data we need to answer their question. Here are some simple strategies for how we can generally use chat history for retrieval and why they might not work as expected:
- Use only the final user prompt as the query. This approach works poorly in chat applications when there is a lot of back-and-forth discussion, like the one above. In these scenarios individual prompts might not have all the context necessary to inform accurate retrieval. In the example above, the final user prompt “when was it founded?” doesn’t make sense on its own as it isn’t clear that the “it” refers to “Arcus” in this case. This ambiguity in the chat prevents an information retrieval system from retrieving the right context.
- Use the entire chat history as the query. This can be problematic in cases where the topic of conversation in the chat application changes over time, because irrelevant context will likely be retrieved. For example, in the example above, the chat history includes both information about Apple and Arcus. While the final user prompt requires information to be retrieved about Arcus’ founding date, the unrelated discussion about Apple in the chat history means that we will likely retrieve information that’s also related to Apple which isn’t topical to the subject at hand. This will potentially confuse your LLM with irrelevant context and lead to irrelevant or incorrect responses.
Since simple solutions to forming the data retrieval query from chat applications often fall short for a wide variety of user prompts, we need to find a more robust solution to this problem to build performant, production-worthy chat-based copilots over our data.
Arcus’ Solution to Building Chat-Based Copilots over Your Data
At Arcus, we’ve architected a solution that decouples the chat history of the application from the query used for the retrieval system to solve the challenges above. Our solution relies on intelligent and automatic query transformations, which transform the user’s chat history into queries that get to the heart of the user’s prompt and can be used as single units of retrieval against our data. Using these transformed queries results in better retrieval performance and more accurate LLM responses.
The Basics of Query Transformations For Chat Applications
By using query transformations to decouple the user’s chat history from the specific queries used to retrieve information, we can ensure that the retrieval process focuses on the most pertinent data, minimizing the risk of irrelevant or conflicting information being retrieved. This approach is similar in spirit to our approach for indexing data, which decouples the raw data we intend to retrieve from the information we use to index the data.
In the context of chat applications, determining the right query for information retrieval is a nuanced task. A simple and straightforward approach to using query transformations for chat applications is to ask an LLM to re-write the chat history into a simple query. This query can then be used to retrieve relevant information over our index. For example, we can ask ChatGPT the following question and use ChatGPT’s response as the query we use for our retrieval system:
“Incorporate the above context to rephrase the user’s final prompt into a single self-contained question.”
For the example we gave previously, the generated query is a re-stated question that incorporates all the necessary information to retrieve the right context from our index:
“When was Arcus Inc., the seed-stage startup based in New York City that focuses on building a data platform for LLMs, founded?”
This is now a self-contained prompt that can be used for the retrieval system and ensures that we retrieve the right context for our LLM to provide the correct response to the user’s prompt.
Query Transformations for Chat: Key Considerations and Tradeoffs
Simply using an out-of-the-box LLM for query transformations is often not a production-ready solution due to poor reliability and performance. Here are some factors to consider when deciding how to use query transformations for chat applications:
- Cost and latency. Adding an LLM call to the critical path of the retrieval workflow can potentially double both the latency and the cost of generating responses for your chat application. While the naive approach to using query transformations requires an additional LLM call, this strategy can become prohibitive for an application. In applications that are latency or cost sensitive, it’s important to consider alternatives that are cheaper and faster than this approach.
- Preserving important information from the chat history. While LLMs are highly capable of performing open-ended, unstructured tasks, they can sometimes be unreliable in preserving information, especially along long context windows. Ensuring that your query transformation algorithms incorporate the entire chat history and distill it into the right query is vital to ensuring reliable retrieval performance.
- More complex queries. Some prompts may require information from multiple disparate sources at once, which necessitates more complex query patterns against your data. Query transformations for these more advanced use cases can’t consist of just re-stating user prompts into single queries, but rather into multiple queries that can be stated in succession.
Arcus’ Solution
At Arcus, we’ve built a Query Transformation Engine (QTE) specifically designed to solve the key requirements and tradeoffs of using query transformations for chat applications. Our engine is built on the following core steps to solve the main challenges above:
- Use faster and cheaper models for distilling chat history wherever possible. The most capable models available today like OpenAI’s GPT-4 are great for exhibiting intelligence across a broad variety of tasks, but are relatively expensive and slow. Instead, QTE can achieve better performance at a fraction of the cost with models that are small and specialized only to distilling chat histories. QTE uses a combination of multiple small, specialized LLMs specifically meant for resolving ambiguities in text and transforming chat histories into single, succinctly stated questions.
- Recursively break down long chat histories. Because longer chat histories can be hard for LLMs to process, QTE only processes small portions of the chat history at a time as part of step 1 above. QTE then aggregates the intermediate results of these smaller tasks recursively into a single query. The aggregation processes enforce that key entities or details are not omitted or “washed out” in the recursive steps required to aggregate intermediate results.
- Compute and use sub-queries to answer more complex questions. After distilling all of a given chat history into a single question, the resulting question may not be something that can be answered by a single query against the data. This may need to be broken down into smaller queries that need to be aggregated back together. In this case, QTE can use a set of smaller, more specialized LLMs to break down the complex query into sub-queries that can individually be used to query the data. After these sub-queries are run, QTE runs a recursive aggregation to summarize and process what data is most relevant to the user’s prompt.
Summary
Building a chat application that provides valuable insights using your data presents many unique challenges. At Arcus, we’ve designed an approach that gets to the heart of users’ questions and retrieves the most relevant data to answer them. As we continue to improve and iterate on the core challenges of chat-based copilots, we’re pushing the frontier of what’s possible for using your data intelligently to build LLM applications. Request early access to see how Arcus can help you ground LLMs on your data to build domain-specific copilots and AI applications!
Arcus is also hiring! We’re actively working on building LLM applications grounded on complex data, using advanced indexing and retrieval algorithms to answer complex user queries, large scale systems for processing heterogeneous data at scale, and understanding the performance of LLMs in the context of your data. Check out our careers page or reach out to us at recruiting@arcus.co!