Discover more from Retrieve and Generate
Beware Tunnel Vision in AI Retrieval
Vector search is not always the right tool. So why do VCs pretend that it is?
Edited 6/21/23 to add Andreessen Horowitz and analysis of their consulted sources
Generative large language models (LLMs) need to retrieve information to provide most of their value - their memorized knowledge is unreliable and infeasible to update, but they are effective at reasoning over context provided at inference time. This fact is common knowledge in the LLM community. However, over the first half of 2023 a strong bias has sprung up: that information retrieval for LLMs must be achieved through vector search. This is untrue, and substituting a single solution type (vector databases) for a capability (information retrieval) in authoritative content like tech stacks from VCs is hindering the community’s ability to choose the right tool for the job.
To be sure: vector search is a great solution to many LLM-related retrieval problems and I’ve been using it in production for nearly two years! But it struggles in certain conditions and excels in others, and suggesting that it is “the” solution to retrieval for LLMs is simply misinformation. In your application, the optimal solution may be full-text keyword search, vector search, a relational database, a graph database, or a mixture. We’ll explore the real information retrieval landscape later on.
Thanks for reading Colin’s Substack! Subscribe for free to receive new posts and support my work.
Audience notes: This is written for those who understand what vector search is.
Vector search, semantic search, and similarity search can all be considered synonyms but we will use vector search here
Vector database and vector search will be used interchangeably
Information retrieval, retrieval, and search will be used interchangeably
Embeddings and vectors will be used interchangeably
Who says you have to use vector search?
Imagine you’re architecting a new product or feature using LLMs. Time to pick a retrieval system! If you’re unfamiliar with the field, some safe bets to guide your decision may be:
Infrastructure stacks from super-smart VCs
Tutorials from popular LLM application frameworks
If you’re reading this, there’s a good chance you already have several of these resources open in other tabs. Let’s review some notable examples.
The latest and flashiest of the examples shared here, a16z comes out swinging with a highly detailed and prescriptive piece of content, one of the few that contains an actual architecture diagram:
Try to find the information retrieval section and you’ll come up empty. In its place are the embedding model and vector database blocks. This is helpful because they actually tease apart the two components of vector search where others don’t, but grossly unhelpful because they commit the cardinal sin of substituting vector search for information retrieval as a whole. Do they know that half of their listed vector database providers (Pinecone, Weaviate) also offer keyword/BM25 based search, and therefore are not just vector databases? Perhaps they chose to offer that functionality for a reason.
Respected VC, authoritative tone, full speed ahead! They surveyed 33 companies in their network on their current or expected usage of AI:
88% believe a retrieval mechanism, such as a vector database, would remain a key part of their stack. Retrieving relevant context for a model to reason about helps increase the quality of results, reduce “hallucinations” (inaccuracies), and solve data freshness issues. Some use purpose-built vector databases (Pinecone, Weaviate, Chroma, Qdrant, Milvus, and many more), while others use pgvector or AWS offerings.
It’s a very interesting survey and I recommend the read! But they begin to narrate by inserting vector databases as the only named option for retrieval capabilities. It gets worse in their “stack” diagram.
Now the category of “retrieval” is totally gone, replaced by Vector DBs. You’d be forgiven for picking a provider from each box and calling it a day, but you’d be missing out. And, sure, it’s “not exhaustive,” but that’s no excuse for showing a single database category.
Again, this stack contains a category describing a technology (vector database) rather than a capability (information retrieval).
Vector Databases: When building LLM applications — especially semantic search systems and conversational interfaces — developers often want to vectorize a variety of unstructured data using LLM embeddings and store those vectors so they can be queried effectively. Enter vector databases. Vector stores vary in terms of closed vs. open source, supported modalities, performance, and developer experience. There are several stand-alone vector database offerings and others built by existing database systems.
Cowboy at least named the category after the capability (retrieval), but failed to provide any examples that aren’t vector-first search engines (Pinecone, Baseplate, Zilliz, Qdrant, Weaviate, LanceDB, Activeloop, Chroma).
This map is focused on open source and has an entire section for vector search, but no mention of using non-vector search for retrieval.
Nearly every widely adopted modern database started as open source, and we expect that trend to continue with vector databases. Vector databases will serve a critical role in the AI-native stack — the pattern of ensuring fast retrieval of embeddings generated from LLMs to serve different application use cases is one that will continue to grow. The last year has seen a proliferation of open-source vector databases to address this need.
This message - that in retrieval, vector search is not only the default but mandatory - seems to be the rule rather than the exception. Sadly it’s repeated frequently, with shockingly little nuance:
But what if you use application building frameworks to help you make your decision, for a more practical perspective? Many have played with LangChain as an interface toolbox, in which case you may have found yourself on the docs’ Data Connection page:
Many LLM applications require user-specific data that is not part of the model's training set. LangChain gives you the building blocks to load, transform, and query your data via:
Document loaders: Load documents from many different sources
Document transformers: Split documents, drop redundant documents, and more
Text embedding models: Take unstructured text and turn it into a list of floating point numbers
Vector stores: Store and search over embedded data
Retrievers: Query your data
According to this, if you want to connect to data you have exactly one option: embeddings (vectors). However, the LangChain project is community-led and quite chaotic, with big chunks of the project changing every week. Even while writing this article I had to change this section because the page I had linked vanished. So what if you prefer something a little more stable? For this I’ve recommended Microsoft’s Semantic Kernel as an enterprise-tested alternative. But surprisingly, we see something similar in Semantic Kernel’s “What are Memories?” documentation:
We access memories to be fed into Semantic Kernel in one of three ways — with the third way being the most interesting:
Conventional key-value pairs: Just like you would set an environment variable in your shell, the same can be done when using Semantic Kernel. The lookup is "conventional" because it's a one-to-one match between a key and your query.
Conventional local-storage: When you save information to a file, it can be retrieved with its filename. When you have a lot of information to store in a key-value pair, you're best off keeping it on disk.
Semantic memory search: You can also represent text information as a long vector of numbers, known as "embeddings." This lets you execute a "semantic" search that compares meaning-to-meaning with your query.
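Mechanically, that third option boils down to nearest-neighbor comparison between vectors. Here is a minimal sketch of the idea, using toy 4-dimensional vectors in place of real embedding model output (which would have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Compare two vectors by the angle between them (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy embeddings: in reality these come from a vectorizer model.
query_vec = [0.9, 0.1, 0.0, 0.2]
doc_vecs = {
    "refund policy": [0.8, 0.2, 0.1, 0.1],
    "holiday schedule": [0.0, 0.9, 0.1, 0.8],
}

# "Semantic search": return the document whose vector is closest in meaning.
best = max(doc_vecs, key=lambda d: cosine_similarity(query_vec, doc_vecs[d]))
```

In practice a vector database performs this comparison approximately (ANN) over millions of vectors rather than exhaustively, but the core operation is the same.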
At this point, with the official partner of OpenAI selling vector search as the only solution for text-based retrieval, you may be thinking, “Well, maybe it’s YOU who is wrong.” But please keep reading and decide for yourself.
Vector isn’t always best
The unfortunate truth is that, for enterprise-level applications, plain vector search is frequently the wrong solution. Many practitioners understand this and often use vector search as a part of their information retrieval system. Below are a couple resources showing where vector search relevance is outperformed by traditional, keyword-based alternatives.
These examples detail user-facing search applications. At a glance, you might be inclined to distinguish between these applications and retrieve-and-generate applications, but upon closer inspection, the distinction blurs. The reality is that a huge proportion of enterprise LLM applications are essentially search systems with a generative user interface and a target retrieval unit of a “chunk” or short record rather than a long record. Unless there's a substantial deviation in your usage pattern, the relevance criteria for search and LLM retrieval are virtually interchangeable.
My intention here is not to denounce vector search but to provide a more balanced view of its place within the information retrieval landscape. These counterexamples should dispel the notion that vector search is the definitive solution for all scenarios, but not undermine its importance and potential utility.
An excellent case study presented at Haystack US 2023 (where I also presented on generative search) exploring the entire implementation and evaluation process of a search engine in the medical domain. When replacing the incumbent, keyword-based system with vector search, “Not Satisfied” survey responses nearly doubled, even when using a domain-specific vectorizer model.
The team is still working toward improving the vector search system and I trust it will perform great when it’s finished, but a huge thank you to Erica and Max for having the courage to share a timely real-world example from their journey that contradicts some of the techno-optimism around vector search.
This is a simple yet informative piece backed up by benchmark performance numbers. In this example both approaches perform comparably, and taken with the Clinical Decisions case study above, it’s a good illustration of how the target data and domain can shift the balance.
How did this happen?
Many subtle currents in the LLM community contributed to a feedback loop of “vector search is what you use with AI”, which led to an echo chamber, which led to the one-sided content from earlier. Let’s review all the factors and consider their contribution.
Assuming that retrieval for AI is different than text search
What exactly are the differences between retrieval for generative use cases and retrieval as it existed previously, for search?
Limited retrieved object size
Different query forms
Do these necessitate a new kind of retrieval? Let’s explore each and see.
Limited retrieved object size
In order for an LLM to generate over a retrieved object, it must fit easily within the LLM’s context window, while leaving some room for generation. For example, at time of writing the basic gpt-3.5-turbo (standard ChatGPT model) has a context window of 4k tokens, with a separate version offering 16k. A rough conversion heuristic is 3 words per 4 tokens, meaning those models support combined input/output of 3k and 12k words respectively. For some data, entire records will comfortably fit within this limit. Other data with longer records must be broken into “chunks” so that only the relevant portion of a long record is inserted into the LLM’s context. This requirement happens to be shared with vector search (which typically uses chunks as its retrieval target, since vectorizer models have input-length constraints similar to the LLMs’ described above), but chunked text is still just text and can be indexed and retrieved as such. Conclusion: Any retrieval method can work.
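To make the chunking requirement concrete, here is a toy chunker built on the 3-words-per-4-tokens heuristic above. The 500-token budget and 50-word overlap are illustrative choices, not values from any model’s spec, and real chunkers usually split on sentence or section boundaries rather than raw word windows:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 tokens per 3 words heuristic."""
    return len(text.split()) * 4 // 3

def chunk_words(text: str, max_tokens: int = 500, overlap_words: int = 50):
    """Split text into overlapping word windows that fit a token budget."""
    max_words = max_tokens * 3 // 4  # invert the heuristic
    words = text.split()
    step = max(max_words - overlap_words, 1)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the final window reached the end of the text
    return chunks
```

The same chunks can then be fed to a keyword index or a vectorizer - the chunking step itself is retrieval-method-agnostic.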
Different query forms
Another possible objection could be that vector search is needed to handle the different types of queries that LLM systems may encounter, whereas traditional search methods require keywords only. Some of these possible query types:
Entire records (e.g. search-by-file)
In every case, keywords can be extracted using stopword removal or explicit keyword extraction. And in some cases, an entire dialogue or record may be too large for vector search anyway, making a version of this step necessary in either case. Certainly, vector search can provide some advantages for question-answering and paraphrasing systems, but it is hardly the only option. What’s more, some systems will run a reranking or LLM-based filtering step before final generation, minimizing some of those advantages. Conclusion: Any retrieval method can work.
Combining the above points, we see that information retrieval for LLM generation can look slightly different depending on whether you use vectors or not, but all methods are on the table.
Clearly, this assumption (Retrieval for LLM is a separate category, and must be addressed by vector search) is wrong, but it has been so infrequently challenged that it has persisted.
Community familiarity between LLMs and vector search
Both generative LLMs and vector search are based on the same technology: transformer-based language models. From 2020 to 2022, LLMs and vector search were simply two sides of the same coin. Then, upon the explosion of frameworks and content following the release of ChatGPT, developers, framework builders, and content producers naturally prioritized vector search in their demos and frameworks because they were already familiar with it. I count myself within this group. Even back in 2021, my answer to text search was vector search simply because that was a technique I was familiar with and keyword search was not. This familiarity bias likely nudged a lot of early practitioners toward choosing vector search in suboptimal situations.
Confusing FAIR’s Retrieval-Augmented Generation with whatever we’re doing now
In 2020, researchers at FAIR released a blog post and paper called Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (RAG). This paper describes a system where they fine-tune a query vectorizer and answer generator simultaneously on a closed domain question answering-like task. This was an innovative approach to improving performance on similar tasks at the time, but it’s since been made obsolete as the generative models became much larger and more powerful, rendering simultaneous fine-tuning alongside the vectorizer unimportant and infeasible.
Very few practice FAIR RAG now, but the “Retrieval-Augmented Generation” term was unfortunately resurrected in late 2022 and applied to the simple retrieve-and-generate grounding pattern that is now ubiquitous in LLM usage. I strongly suspect that some people assume that vectors must be used for this basic retrieve-and-generate pattern because they were required for the FAIR RAG system, but the constraint requiring vectors to be used (ability to backprop through both the generator and the vectorizer) no longer exists. If it were up to me, we’d stop saying Retrieval-Augmented Generation and say retrieve-and-generate, generative search, or retrieval for LLM grounding. But that probably won’t happen so just be sure that YOU don’t make the mistake of assuming vector search is required to retrieve context for LLM generation.
Startup community engagement and incentives
Naturally, the LLM community of early adopters and tech enthusiasts is more likely to be interested in and follow vector search content than traditional search content. After all, vector database providers are some of the most active content contributors in the community, driven by VC funding and in pursuit of adoption through capture of the rapidly multiplying LLM builders. Partnerships with frameworks, blog posts, demos, video tutorials - all this content has doubtless benefitted the community. But along with all the help comes a perspective, sometimes loud and sometimes quiet, that vector search is the silver bullet.
Some has been egregious - the branding from Pinecone is the worst offender:
But this grandstanding is likely the same reason they were so ridiculously successful with fundraising from VCs. And funding truly is likely the motivator here, because if vector search is all you need, why would they integrate keyword search into their product?
Others have done somewhat better - Weaviate and Qdrant have avoided some of the grossest branding excesses and still shared loads of helpful content. But with any advice, be aware of where it is coming from. Content from a technology provider is likely to promote said technology, implicitly or explicitly.
Shameful neglect by incumbents
On the other side of the search provider chasm, the incumbents made their own mistakes. Their primary mistake was ignoring vector search as an important feature of a holistic text search engine until the startups had already solidified their hold on the community of LLM users. The LLM world could look very different if Elasticsearch or even MongoDB had hired 10 engineers to integrate vector search into their offerings in 2020 or 2021 and had ridden the wave of LLMs with active community engagement. Instead, they dragged their feet, treated vector search as a second-class feature, and largely ignored the growing community.
Elastic released their full ANN (approximate nearest neighbors) vector search functionality in February 2022. This was laughably late and the community had already been using open-source plugins they had built to accomplish the same feat for over a year. As if that wasn’t bad enough, they put the interface with Hugging Face models inside the Eland client, which is a “machine learning” feature restricted to Platinum+ clients. You can implement your own interface if you have the time, but they are basically charging you for access to open source Hugging Face models. Their docs are as unintelligible as any I’ve found, and it took me hours to realize I couldn’t deploy a vectorizer model because I hadn’t bought a license. I can’t blame anyone for ignoring their offering.
OpenSearch was much more serious about it, apparently offering ANN vector search since 2019. But they don’t seem to have made much headway into the community outside of recommendations from longtime search practitioners, which I’ve found in some LLM frameworks’ Discord channels or similar. And MongoDB has just joined the full-text search party with their Atlas search, but doesn’t seem to support ANN vectors yet.
Lack of input from real-world users
To date, most content about retrieval for LLMs has been produced by:
Vector DB providers
None of these groups are solving problems in production; they are all upstream at varying distance. Technology providers’ work usually manifests in demos on public domain data like Wikipedia or video/podcast transcripts. Amusingly, those two use cases happen to be two of the best possible applications for vector search, since they’re open domain and, in the case of transcripts, conversational! They also don’t face the sort of scrutiny that generative search outputs in an enterprise do. So be aware that you’re getting a biased view of both the technology’s capabilities and the applications where it’s being used.
This is also a problem in the larger LLM community - project demos like AutoGPT consume a disproportionate amount of community attention relative to the value they deliver in enterprise usage (likely close to zero, due to reliability, observability, privacy, and other obstacles). In some ways this is good - it keeps the technology progressing - but it also gives an inaccurate impression of what’s actually being used in production and generating business value.
Vector search is new and cool 😎
Do you remember when the buzz around cryptocurrencies, distributed ledger technologies, and “decentralized everything” led to an influx of systems built using blockchain, even when their specific use cases didn't benefit from blockchain’s advantages? These initiatives were motivated more by the appeal of the blockchain label and the allure of riding the hype and funding wave, rather than a calculated choice based on technology needs. Similar to how blockchain-based systems can run into massive operational challenges at scale, if you naively select vector search and then find yourself unable to afford retraining and reindexing a year down the road to meet relevance requirements, you are going to have a bad time.
VC bias and echo chamber
Similar to how database providers have an incentive to promote their technology, VCs have an incentive to promote new technology (since they are invested in its success). In addition to this, their attention dwells on new companies and buzzy technologies, leading them to lack familiarity with traditional solutions to the same problems that new solutions also solve. Perhaps most importantly, the “real-world practitioners” on whose experience they draw from tend to fall into the groups described above, that is: language model enthusiasts who have not rigorously weighed the advantages of vector search against conventional keyword search for the same use cases. Let’s take a look at the parties consulted by a16z:
The 15 printed names roughly fall into these buckets:
3 are deep learning model providers or experts (Ben Firshman, Andrej Karpathy, Moin Nadeem)
8 are framework/tool/platform providers (Ted Benson, Harrison Chase, Jerry Liu, Ali Ghodsil, Shreya Rajpal, Ion Stoica, Matei Zaharia, Jared Zoneraich)
1 is a vector database provider (Greg Kogan)
1 is an investor (Diego Oppenheimer)
Just 1 is a user-facing application provider (Dennis Xu)
VCs are talking to the wrong people. If you ask a bunch of AI tool providers and experts how retrieval should work, they will tell you: AI (vectors). They probably haven’t evaluated any alternatives. The risk here is creating a whole generation of suboptimal applications simply because builders only ever learned about one type of search. VCs can fix this by consulting longtime search practitioners and those who have evaluated multiple techniques for user-facing applications.
Most likely, technical familiarity and echo chamber dynamics make up the bulk of why VCs heavily favor vector databases as the solution to retrieval. As you consume their content, be aware of these biases and the incentives behind them.
So what’s the real retrieval stack?
Unfortunately, data-driven advice around how to navigate the options is still rare. For now I will provide some simplified categories, intuition, and generic recommendations.
Interrogate your data
To navigate the techniques available, acknowledge that how you should retrieve depends on what you want to retrieve. Try to answer the following questions:
For structured data
Is text involved?
Is it free text or limited to entities/categories?
How much text per record? Across one or multiple fields?
For unstructured data
What is the nature of the text? Conversational (call/podcast/tv) transcripts? Formal (reports/policies)?
How much text per record?
In either case
What is the unit you are trying to retrieve? An entire record or a granular answer/entity contained therein?
What is your search query? A question, keywords, a chat dialogue, or a record?
Is the domain (topics, terminology, acronyms) primarily open/public or closed/private?
Do you have reliable thesauri or glossaries in the domain?
How frequently does the domain substantially change (address new concepts and terminology)?
Are you enriching records upon indexing, or trying to perform tasks beyond retrieval?
These considerations will guide your choice of retrieval technique. Note that this isn't an exhaustive guide, but it should provide a solid foundation and help build intuition. I hope to write or share more on the topic at a later date.
Note: I’m focusing on text data (or data that contains text) and excluding modalities like image/video/audio. For those modalities vector search has some significant advantages but the use cases are currently more limited. Another minor caveat is that I’m considering only retrieval and not reranking techniques here, which sometimes include using vector similarity to rerank keyword search results (see Azure Cognitive Search, Cohere Rerank).
Consider your options
With your answers in mind, consider the various retrieval methods available. Here's a simplified breakdown of the pros, cons, and examples, keeping in mind that this overview is a simplified representation of a much more complex landscape:
1. Vector search
Good for “about-ness”, conversational data, paraphrasing, question-as-a-query
Multilingual options available (depending on the vectorizer model)
Intuitively understands related terms/concepts in vectorizer model’s training context
Struggles with exact matches, dealing with many possible similarity dimensions
Cost/latency/complexity from vectorizer running on every query (at query time) and indexed record (at index time)
Blind to terms/concepts outside vectorizer model’s training context unless fine-tuned
Fragile to shifting terminology and relationships
Difficult to update through fine-tuning and expensive to reindex at scale
Examples: Weaviate, Pinecone, Qdrant
2. Keyword search
Good for exact matches, retrieving long records
Can be easily updated with synonyms, taxonomies, etc
Struggles with paraphrasing, conversational data
Does not natively understand relationships between terms/concepts
Often requires query preprocessing to reduce noise
Massive quantity of customization levers can cause tuning paralysis
Examples: Solr, Elasticsearch, OpenSearch, MongoDB
3. Your existing database
If it’s what you’re already using, it keeps things simple
May have vector/keyword search capabilities/plugins you’re not aware of
Easy and effective for retrieving by IDs
Keyword and vector search features may be limited
4. Hybrid search (Combination of 1 & 2)
Gives access to both vector and keyword search
Fusion methods/usage patterns are not very mature yet
Providers are weak in their secondary index (vector-first providers lack keyword features, keyword-first providers lack vector features)
Examples: Weaviate, Pinecone, Elasticsearch, OpenSearch
Note 1: Graph databases and text2sql are becoming more popular for retrieval as well, and will be included in future related content.
Note 2: Hybrid Search often refers to running vector and keyword search simultaneously and fusing the results. I am using it to describe search engines with both vector and keyword indices so that those searches can be run together or independently.
Advice to new practitioners
There’s a huge upside for you to understand both vector- and keyword-based retrieval: You may be able to choose a far simpler solution than competitors, get to market faster, and offer higher quality results in the right situations.
To get to the best answer, having an evaluation process can prove invaluable, and for this I recommend checking out Quepid or similar. However, I know most of you are going to implement a retrieval solution without a rigorous evaluation process, so here are some guidelines based on my own experience and conversations with other practitioners.
First, if your data is relational and already in a database with full-text search, explore the potential of your current setup first - run keyword or vector searches (see pgvector, Redis vector similarity), gauge their performance, and only consider switching to an alternative if it clearly outperforms your existing solution. If your retrieval is based primarily on IDs (like user accounts), you might not need to venture into text search at all.
If your data is not in a relational database:
1. Choose a database that will support both vector and keyword search
2. In most cases, implement keyword search first
 a. Use stopword removal or keyword extraction to preprocess your query
 b. Implement synonym sets if available/necessary
3. In other cases, implement vector search first. Some rules of thumb:
 a. Conversational data (e.g. call/interview transcripts)
 b. Synonyms are frequently used but domain is too broad for them to be collected
4. If many results are being returned, consider a reranking or filtering step to increase precision
5. When your first search works well, implement the other and evaluate
 a. Start with using a secondary search pass, usually keyword-then-vector, with the second pass triggered by no results or poor answer quality from the first
 b. For an easier but hackier solution, consider using parallel hybrid search with a technique like reciprocal ranked fusion (often included in hybrid offerings)
6. For mature search systems only: Tune your keyword search system and fine-tune your vectorizer model
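The reciprocal ranked fusion technique mentioned above is simple enough to sketch in a few lines. This is a generic illustration; k=60 is the constant commonly used in RRF implementations, not something this article prescribes:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse several ranked lists of doc ids into one ranking.

    Each document scores sum(1 / (k + rank)) across every list it appears
    in, so items ranked highly by multiple searches float to the top.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical parallel results from a keyword index and a vector index:
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Note how doc_b, ranked well by both searches, outranks doc_a even though doc_a topped the keyword list - that mutual-agreement behavior is the point of RRF.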
The reason to start with keyword search for most applications is because it can offer high precision and it’s easier to understand the mapping from input to output. This is very useful in production use cases where users may be confused by results that don’t seem to match query terms.
The reason to use keyword-then-vector search in 5.a. is because this sequence allows you to seek high-precision results (with keyword search), and if that fails, to increase your recall using vector search.
Using vector search as a fuzzy fallback when keyword results are bad plays to the strengths of both search types.
Interestingly, this is also what Qdrant recommends for hybrid search.
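A minimal sketch of that keyword-then-vector fallback, where keyword_search and vector_search are hypothetical stand-ins for your real search clients and min_hits is an illustrative quality threshold:

```python
def retrieve(query, keyword_search, vector_search, min_hits=3):
    """Try high-precision keyword search first; fall back to vector search."""
    hits = keyword_search(query)
    if len(hits) >= min_hits:
        return hits
    # No/poor keyword results: widen recall with the fuzzier vector search.
    return vector_search(query)
```

In a real system the fallback trigger might be a relevance-score threshold or an LLM-judged answer-quality check rather than a simple hit count.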
Advice to VCs and framework creators
The substitution of vector search for information retrieval is misleading and reductive.
The anthropomorphizing of information retrieval to “memories” is unhelpful and distracting.
Your stack or documentation should contain a category of information retrieval. Within this, specify the options and help builders decide.
For everybody: Focus on capability, not technology. Think of vector search as a feature of text search, rather than its own category. This is how it will be remembered.