Welcome to the age of Enterprise LLM Pilot Projects! Right now, enterprises are cautiously but enthusiastically embarking on their first Large Language Model projects, with the goal of demonstrating value and lighting the way for mass LLM adoption, use case proliferation across the organization, and business impact.
In a well-planned and well-executed project, that’s exactly what will happen: the use case will provide new capabilities and efficiencies, function reliably, delight users, and make a measurable impact on the business. Understanding of the technology and its value will spread beyond the pilot project, where new use cases will be surfaced and championed.
However, if the wrong use case is chosen the project could fall flat, even when executed to perfection. Imagine that a use case is chosen that is a poor fit for the capabilities that LLMs provide:
You implement an e-commerce analytics assistant, but the LLM-based conversational interface is a clumsier experience than the previous column-based lookup interface, so you remove it. Then you realize all the analytics requirements are best satisfied by traditional data analytics patterns like statistical analysis, topic modeling, and anomaly detection, so instead you shoehorn a conversational interface into some side-feature that nobody will use. You’ve reinvented the wheel and the LLM provides no benefit to users.
🗑
Or imagine that a use case is chosen that fails to provide an advantage over existing workflows, even if it is a fit for LLM capabilities:
You implement a chat-with-data system on a set of marketing publications. Users try it out, but prefer to use Google because they can’t ask your system about topics not contained in the dataset, while Google covers everything and still returns answers and sources. Feedback is negative and your fancy AI system is perceived as less useful than Google search.
🗑
Or, imagine that the data for your selected use case is not organized or reliable enough to support it:
You implement a question-answering system over a massive dump of unstructured data. Enterprise Search initiatives over this dataset had previously failed spectacularly, due to nonexistent data management, chaotic content, and erratic formatting. Now, the question-answering system, which is also search based, faces the same problems and also fails spectacularly.
🗑
In any of these cases, your future LLM projects will struggle to gain support from leadership or adoption from users. However, under different circumstances, an LLM-based assistant, chat-with-data system, or question-answering system can be an impressive success!
In the rest of this post we’ll review the core capabilities of LLMs and map those to use case categories, with examples and the kind of impact each can generate. By the end of the article you should have an intuition for how to spot candidate use cases in your organization, and in the next post of this series we’ll look at what to specifically avoid.
But First: AI and Generative AI Do Totally Different Things
If you’re already very familiar with LLMs and their capabilities, skip to the “Major Use Case Categories” section below.
What did “Enterprise AI” mean before 2023?
Usually it meant deriving insights or making judgments based on data. It went hand in hand with Big Data, and was almost never generative. Predict, classify, detect, analyze, recommend. These functions typically manifested in optimizations grounded in the dataset.
You’re recommending products to buy? Recommend better ones, based on the data.
You’re predicting account lifetime value? Predict more accurately, based on the data.
You’re detecting suspicious transactions? Detect more accurately, based on the data.
The idea is, the secret to doing better is right there in the data! After all, enterprises have tons of data, so it makes sense and works really well. Grab a handful of business processes in your organization and if there’s relevant data, there’s likely an optimization to be had.
💡 Non-generative AI provides optimizations
Generative AI is totally different. By generating information, it provides new capabilities that are almost entirely separate from the optimization power of non-generative AI. And the capabilities of a generative language model don’t come from the data surrounding a business task - they come primarily from the text of the entire internet and the instruction data it was fine-tuned on, which sometimes extends into your enterprise. This coin has two faces:
LLMs don't need any of your data in order to be useful. 😃
LLMs will never perfectly align with the distribution of your data. 🙁
But perfectly matching your internal data distribution is generally only needed for optimizations, so that’s okay! What’s more, the LLM was not trained on your specific task either. To reinforce the difference between them, let’s spell out how the different models are trained:
Non-generative AI: Usually 100% trained on the relevant task and data
Generative LLMs: Often 0% trained on the relevant task and data
Clearly they couldn’t be more different. As you approach LLMs, be careful not to think of them in the same category as non-generative AI.
💡 Generative LLMs can provide new capabilities.
They work more like a computer than an AI model.
You usually program them, instead of training them.
We’ll use LLM as shorthand for Generative Large Language Model throughout this post, and expand these capabilities below.
LLM Core Capabilities
To lay the groundwork for use case selection, let’s divide LLMs’ broad capabilities into three more specific ones.
Instruction following and reasoning
Natural language fluency
Memorized knowledge
1. Instruction following and reasoning
LLMs with some measure of fluency, memorized knowledge, and creativity have existed since as far back as 2018 (GPT-1). GPT-3, a version of which underpins ChatGPT, was released in 2020 and yet failed to permeate the technology landscape. What changed, and enabled an entire class of models to be used off-the-shelf for countless enterprise purposes, was training them to follow human instructions, which picked up speed with the release of InstructGPT in 2022 and reached escape velocity with ChatGPT later that year. Previously, generative models (BART, T5, etc) had to be extensively fine-tuned for every type of task you wanted them to perform, e.g. text summarization or query generation. Even GPT-3 had to be tricked into following instructions with clever yet unreliable prompts. But with instruction-tuned models (we’ll use that term to encompass both task-tuning and RLHF), you could simply describe the task, and the model would somewhat reliably interpret the instructions to generate a plausible answer. All of a sudden, projects could skip the costly data collection and training processes that were required before, with even better results!
This ability is the cornerstone of the enterprise value that can be derived from LLMs: follow text instructions, perform basic reasoning, and produce a text output. Without it, LLMs regress to 2021, when they could be used for plenty of fun applications, but not enterprise-valuable ones. It’s so important that I recommend using the following mindset while ideating and selecting LLM use cases.
💡 Think of LLMs as reasoning engines: Interpret an input, generate an output.
This framing correctly focuses thinking on the transformation of inputs into outputs by following instructions. It avoids treating LLMs as a knowledge source. It avoids focusing solely on their conversational power, because that is just the tip of the iceberg and the majority of enterprise value lies under the water, beneath the chit-chat. It focuses on processing an input at a time, rather than generating a single output over an entire dataset.
The disclaimer to this impressive skill is that instruction following and reasoning performance depend on the complexity and nature of the task. Rarely can it be relied on to make business-critical decisions, and some types of tasks (like numerical reasoning) can cause even the strongest LLM to err spectacularly.
2. Natural language fluency
Since the early generative models of 2018, LLMs have steadily improved their fluency in natural language, making fewer mistakes and generating text in more tones, formats, styles, and languages (even programming languages!). To see how far we’ve come, explore the Write with Transformers demo from Hugging Face. (Note: since I wrote this, the webpage containing models other than GPT-2 seems to have broken, so the link only points to the GPT-2 version.) The older models hosted there are still quite fluent, albeit without the level of intelligence behind them to which we’re now accustomed in 2024. They’re also poorly suited to task-oriented fluency due to the lack of instruction-tuning, which nowadays often includes chat-tuning to produce seamless conversational experiences. However, it’s reductive and dangerous to treat chat as the only way to interact with or benefit from LLMs; just know that these models can produce fluent language under nearly any condition.
As a sidenote to fluency, let’s propose another capability: creativity. This won’t get its own section, as it can be considered a combination of fluency, reasoning, and some X-factor. Suffice it to say that LLMs can produce surprisingly novel and interesting outputs, even as far back as the demo linked above. Debate continues to rage about whether the models are “truly creative,” but when you ditch the philosophy and focus on the capability, the answer is resoundingly: yes! However, creativity is unwanted in many enterprise use cases; fortunately, it is easily discouraged through narrow, grounded, task-oriented prompting.
3. Memorized knowledge
LLMs accumulate a vast amount of “knowledge” during their pre-training task of attempting to autocomplete text across billions of words. This feature strongly contributed to the hype and utility of ChatGPT upon its release - it’s like a search engine that only gives me what I need! However, the community quickly discovered the limitations of this knowledge:
The memorized knowledge is unreliable and unverifiable
The memorized knowledge cannot be edited without building a new model
We are all now familiar with “hallucinations,” or factually inaccurate generations, as a harmful byproduct of trying to extract memorized knowledge from a large language model. Information also changes over time; both issues could clearly harm an enterprise LLM project if unmitigated. The accepted (and only feasible) practice to address them is called Retrieval-Augmented Generation, or RAG: factual information is stored in an external knowledge base, information relevant to a particular task is retrieved, and then the LLM is fed that information and instructed to consider only what was provided to it when generating. Another victory for instruction-tuning! But that does mean that most enterprise use cases now require a search engine or database, and the requisite data readiness to support it.
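To make the pattern concrete, here’s a minimal sketch of a RAG call. It assumes the OpenAI Python SDK; `search_knowledge_base` is a hypothetical stand-in for whatever search engine or vector database you use, and the model name is a placeholder:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def search_knowledge_base(query: str, top_k: int = 3) -> list[str]:
    """Hypothetical retriever; in practice this calls your search engine."""
    raise NotImplementedError

def answer_with_rag(question: str) -> str:
    # Retrieve relevant passages, then ground the model in them
    context = "\n\n".join(search_knowledge_base(question))
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any instruction-tuned model
        messages=[
            {"role": "system", "content": (
                "Answer using ONLY the context provided. "
                "If the answer is not in the context, say you don't know.")},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```

The instruction to answer only from the provided context is what keeps the model’s unreliable memorized knowledge out of the loop.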
There are still some use cases where the memorized knowledge in a large language model can be useful, I’ve even found some small applications of it in my designs. So don’t totally count it out, but in general assume that your system’s knowledge will come from outside the model and treat the memorized knowledge with caution.
Major Use Case Categories
The capabilities of instruction following and reasoning, natural language fluency, and memorized knowledge manifest in use cases that can be grouped into the following categories, from least to most complex:
Data transformations
Natural language interfaces
Workflow automations
Copilots and assistants
Autonomous agents
Nearly every use case can either be classified into one of these categories or be composed from them. If you’re not sure how a certain project idea of yours might map into these, let me know!
1. Data transformations
The simplest thing you can do with an LLM is to submit a text input and receive a text output. That can be just a single call to a model, although it can be more complex, too, including advanced prompting systems (chain-of-thought, tree-of-thought, etc), guided generation (guidance, lmql, outlines), or context from a conversation. But what these have in common is that the input to the system is text, the output is text, and no external tools or data are used.
Within this category, there are two very distinct sub-groupings.
Data Transformations Type 1: Augmentation or Reformatting
These are typically data solutions, not user-facing ones, that transform data in some pipeline. The outputs may eventually be presented to a user, but usually with other application layers in between. Guided generation is extremely important here for enforcing output formats that integrate with the rest of the pipeline (a minimal sketch follows the examples below). These systems are often very quick to spin up compared to their non-generative AI-based alternatives, but they may be prone to bias and require data collection to evaluate (which is a normal part of the training and evaluation process for most non-generative alternatives). Here are some examples:
Summarization
E.g. An email thread summarization feature in a customer support tool.
Highlights reasoning and fluency capabilities.
Classification
E.g. Tag podcasts with topics by classifying their transcript text, “entertainment”, “culture”, “technology”, etc.
Highlights reasoning capabilities.
Feature extraction
E.g. Extract key words and phrases from customer complaint tickets.
Highlights reasoning capabilities.
Format conversion
E.g. Convert free-text feedback messages into json objects that can be submitted as bug tickets.
Highlights reasoning capabilities.
Information expansion
E.g. Get definitions or synonyms for open-domain terms.
Highlights memory capabilities
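Here’s a hedged sketch of what a Type 1 transformation can look like in code, using the podcast-tagging example. It assumes the OpenAI SDK’s JSON mode as the guided-generation mechanism; the model name and truncation limit are placeholders:

```python
import json
from openai import OpenAI

client = OpenAI()
TOPICS = ["entertainment", "culture", "technology"]

def tag_podcast(transcript: str) -> list[str]:
    """Classify a transcript into topics -- text in, JSON out, no external tools."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        response_format={"type": "json_object"},  # constrain output to valid JSON
        messages=[
            {"role": "system", "content": (
                "You are one step in a data pipeline. Return JSON of the form "
                f'{{"topics": [...]}} using only these labels: {TOPICS}')},
            {"role": "user", "content": transcript[:8000]},  # naive truncation
        ],
    )
    return json.loads(response.choices[0].message.content)["topics"]
```

Because the output is machine-parseable JSON rather than free text, the result can flow directly into the next pipeline stage.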
Data Transformations Type 2: Model-Only Chatbots
In contrast to the “data product” examples above, these examples are very much user-facing and familiar. They are model-as-a-product, with a user interface. Their limitations have to do with knowledge: they rely on model memory or a short corpus. It may seem strange to think of chatbots as data transformations, but that is all they are - text in, text out. No contact with external systems. Examples:
Model-only chatbot
E.g. ChatGPT without plugins.
Highlights reasoning, fluency, and memory capabilities.
Small-corpus chatbot (entire knowledge base included in prompt)
E.g. Chat-a-document, InstaBase AI Hub.
Highlights reasoning and fluency capabilities.
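As a hedged sketch of the small-corpus pattern, to contrast with the RAG example earlier: the whole document travels in the prompt, so no retrieval system is needed. The file name and model are placeholders:

```python
from openai import OpenAI

client = OpenAI()
corpus = open("product_faq.md").read()  # hypothetical doc; must fit in the context window

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[
            {"role": "system",
             "content": f"Answer questions using only this document:\n\n{corpus}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```

The moment your corpus outgrows the context window, you’ve crossed into chat-your-data territory (see the next category) and need retrieval.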
Impact
The value of these use cases can vary widely. On one hand, operations like classifying records can typically be performed equally well or better with non-generative methods, and for lower operational costs. But the startup costs of those methods are much higher, and if your team doesn’t have the skills or time to develop such a system, quick-starting with an LLM could unlock a capability you didn’t have time for before, or allow you to prototype much faster. And single-document chatbots and question answering systems can improve review efficiency, but that only matters if you have a document-review process that happens frequently, and if it can’t be solved by simply highlighting keywords or extracting consistently formatted fields.
So, as with any use case, the impact depends on your business. Here are some clues to help you spot good ones:
You want to perform AI data transformations without having machine learning/data science skillsets on the team.
You want to prototype data transformations to gauge impact before investing in a machine learning/data science project.
You have a frequent document review process that deals with variable language or formatting.
2. Natural Language Interfaces (NLI)
This is the most common of the enterprise LLM pilot project types. Connect an LLM to a knowledge base or (less commonly) a tool, like an internal API, to automate question answering or basic tool usage.
Chat-your-data
E.g. Customer support chatbot that answers questions based on product documentation indexed in a search engine. Demos like Weaviate Verba.
Note: May seem similar to the small-corpus chatbot mentioned above. The difference is that the entire knowledge base cannot be included in a single model prompt. This requires a system to retrieve the relevant records, which introduces a lot of complexity.
Possible forms of knowledge bases: Search engines, graph databases, data warehouse, CMS, etc.
Highlights reasoning and fluency capabilities.
Tool-user chatbot
E.g. A chatbot that can book flights by interacting with an API. ChatGPT with Plugins (can also be a chat-your-data example, depending on the plugin type).
Possible forms of tools: run a query against a database, submit a ticket via API (a sketch of this pattern follows below).
Highlights reasoning and fluency capabilities.
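For a sense of the mechanics, here’s a hedged sketch of one turn of the tool-use loop using the OpenAI tools API. `submit_ticket` is a hypothetical wrapper around an internal ticketing API, and the model name is a placeholder:

```python
import json
from openai import OpenAI

client = OpenAI()

def submit_ticket(title: str, body: str) -> str:
    """Hypothetical wrapper around your ticketing system's API."""
    return "TICKET-1234"

# Describe the tool to the model so it can decide when (and how) to call it
tools = [{
    "type": "function",
    "function": {
        "name": "submit_ticket",
        "description": "File a support ticket on behalf of the user.",
        "parameters": {
            "type": "object",
            "properties": {"title": {"type": "string"}, "body": {"type": "string"}},
            "required": ["title", "body"],
        },
    },
}]

messages = [{"role": "user", "content": "My export keeps failing. Please file a bug."}]
response = client.chat.completions.create(
    model="gpt-4o-mini", messages=messages, tools=tools)  # placeholder model

for call in response.choices[0].message.tool_calls or []:
    if call.function.name == "submit_ticket":
        args = json.loads(call.function.arguments)
        print(submit_ticket(**args))  # the model requests the call; your code executes it
```

Note the design: the LLM never touches the API itself. It only emits a structured request, which your code validates and executes.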
Impact
It can be easy to overestimate the value of these applications. Essentially, they improve the usability of existing search engines and business tools - they rarely introduce new capabilities of their own. If you don’t have a search engine, you’ll need to build one as part of the project, and THAT will provide a step-change in capabilities. The natural language interface provides efficiencies on top of that. However, if your data is chaotic you may have trouble getting a search engine working, and then your NLI on top of it is useless. Do not underestimate the difficulty of enterprise search.
With tools it’s similar: most tools’ programmatic interface gets translated into a UI with buttons and fields… so make sure that using natural language would be more valuable than that type of interface.
On the other hand, the visibility and familiarity of these applications can help give members of your organization a taste of LLMs, which can be valuable on its own. This is why so many leaders are choosing NLIs as the subject of their Pilot Projects.
Here are my suggestions for the best opportunities to look for:
Enterprise search systems already in use and generating value, but without an NLI.
Well-organized and heavily trafficked data sources, with bounded use cases, that would benefit from search.
A business tool with a busy interface that frustrates non-technical users.
3. Workflow Automations
Pssssst: Everybody is focused on the other categories but this one has the potential to generate the most value at your business, IF you can find the right place for it.
Put yourself in the shoes of a random business function at your organization. Over the course of a week, that function performs dozens of distinct workflows (a collection of tasks towards an outcome).
Some repeat often, some rarely. (Frequency)
Some are quick and easy, others are slow and tedious. (Burden)
Some vary wildly, others are nearly the same each time. (Variability)
Some involve using tools and outside knowledge sources, some require interactions with colleagues or clients, some are based solely on the input artifact. (Resource complexity)
Some require complex reasoning processes, others simple operations. (Process complexity)
If you choose workflows in the best slice of these axes, you can completely automate them and generate massive business value. With the right execution, you can save employee hours and deliver outcomes near-instantly. Of course, if you choose the wrong workflow, you’ll find it an infeasible automation target, or it won’t generate any value and nobody will realize it exists. Keep reading to see how to choose the right one.
An implementation generally looks like this:
Define the inputs, outputs, process steps, and external resources necessary to complete the workflow.
Create logic to represent the process steps, and interfaces with any of the external resources necessary.
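In code, such an automation is often just a plain pipeline under those two steps. Here’s a hedged skeleton; every name is hypothetical, and each function wraps either an LLM call or an interface to an external resource:

```python
from dataclasses import dataclass

@dataclass
class WorkflowInput:
    """The artifact that kicks off the workflow (e.g. a change request)."""
    text: str

def retrieve_resources(item: WorkflowInput) -> list[str]:
    """Interface to an external resource, e.g. a policy search index."""
    raise NotImplementedError

def process_step(item: WorkflowInput, resource: str) -> str:
    """One circumscribed, LLM-backed operation (compare, extract, decide)."""
    raise NotImplementedError

def run_workflow(item: WorkflowInput) -> list[str]:
    # No conversation, no UI: input artifact in, reviewed outputs out
    outputs = [process_step(item, r) for r in retrieve_resources(item)]
    return outputs  # handed to a human reviewer as the final step
```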
Examples
It sounds simple! But of course the complexity depends on the application. Here are some examples of what effective workflow automations might look like in your business:
Compare company policies against updated regulations, surfacing risks. Inputs/resources are a collection of policy documents and a collection of regulation documents. Outputs are suspected conflicts, with the respective sections from both policy and regulation documents.
Frequency: Medium (whenever policies or regulations change)
Burden: High (reading through policies and regulations and comparing them is time intensive, error-prone, and tedious)
Variability: Low (the review process is nearly identical every time)
Resource Complexity: Low (two narrowly scoped data sources - policies and regulations - are very simple, as far as these projects go. And keeping the human review for afterward means the workflow can run end-to-end)
Process Complexity: Low (though it is time-consuming to evaluate every statement in a policy against a corpus of regulations, it is a relatively simple task with few steps. Some domain-specific reasoning may be involved, though)
Filling out e-commerce forms using product documentation, for submitting products to new marketplaces. Inputs are forms to complete. Resources are product documentation. Outputs are completed forms.
Frequency: Medium (any time you have a new or updated product, or need to submit your products to a new marketplace)
Burden: Medium (filling out forms is fairly easy, but can take a while)
Variability: Low (the forms and product documentation may be fairly consistent)
Resource Complexity: Low (the forms are simple, but it all depends on the product documentation. If it’s well organized, this is very simple)
Process Complexity: Low (usually, little reasoning needs to happen when filling out a form from documentation - it’s a mechanical process)
Reviewing engineering change requests for compliance, quality, traceability. Inputs are change requests. Resources are policies and quality guidelines. Outputs are audits of requests, containing suspected risks or gaps in documentation.
Frequency: High (change requests arrive continuously whenever engineering work is active)
Burden: Medium (reviewing a single request is fairly routine, but can take a while)
Variability: Low (reviewing a request should involve roughly the same steps, each time)
Resource Complexity: Low (the change requests should have a consistent style and format, and the policies and guidelines should be easily circumscribed)
Process Complexity: Medium (using the resources to audit the change request could be straightforward, but there may be more advanced reasoning involved in assessing quality, for example)
What makes this impact possible is circumscribing and defining the workflows. We’re not expecting LLMs to figure out how to solve new problems - that’s the job of autonomous agent systems (see the section below). Instead, we’re turning a business process into a pipeline of transformations, interactions, and decisions.
To keep your mind open to the best opportunities, do not fixate on solutions with a conversational/chat interface. If you limit yourself to chatbots, you lose the document comparison, form filler, and automated review use cases above, among others. Many of these automations are better implemented as a behind-the-scenes pipeline than as a user-facing app.
💡 Focus on the workflow, not the interface
Here’s a basic guide of how to select candidates for automation:
Select for high impact, as measured by frequency * burden. This translates to hours saved, minus some time for manual review of the automation output. For a work function, create a Pareto chart of the workflows and their frequency, then multiply the frequency values by the time burden. Take the 20% of workflows on the left and move to the next stage. (A toy scoring sketch follows this guide.)
From the impactful candidate workflows, select the most feasible to automate. This is determined by the other three attributes, all of which we want to be low:
Variability
Can you create a flowchart of the workflow that’s followed each time?
If the process to complete the task changes every time, it may be a poor fit for automation. Instead, what we’re looking for is a repeatable process for which the inputs or data change every time. This likely has made it resistant to automation in the past, but LLMs’ reasoning abilities may now render automation feasible.
External system complexity
Can you draw a clean box around the resources (information, tools, and interactions) involved in a workflow?
If a workflow normally involves looking things up in a narrowly-scoped set of documents, great - that’ll be easy to automate. If it involves performing Google searches and using many information sources that change as you go, it’ll be tougher.
If collaboration with colleagues or clients is part of the workflow, it may still be feasible to automate, but it is clearly more complex than otherwise. We’re moving toward an enterprise where automated processes send messages to employees to collect the information necessary to perform their automatic workflows. Since this is still a new idea, treat these workflows carefully: the employees the system depends on create another risk for your project.
Process complexity
How complex are the operations in the workflow?
LLMs are impressive reasoning engines that newly enable automation of countless operations. However, they’re not a perfect replacement for expert decision-making and have some critical weak spots, such as numeric reasoning (although that can be outsourced to symbolic systems). A good rule of thumb here is: for each operation in the workflow, could you teach it to a new employee with a one-pager or less?
Control or avoid risk. Avoid any use cases where bias is a consideration. Include some form of human review as part of the process, usually as a final step before delivery. Build and test iteratively, with stakeholders involved throughout the process.
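As a toy illustration of the scoring in step 1 (all workflow names and numbers are invented):

```python
# impact = frequency * burden, i.e. hours the automation could save per month
workflows = {
    "review change requests":          {"per_month": 40, "hours_each": 1.5},
    "fill marketplace forms":          {"per_month": 6,  "hours_each": 4.0},
    "compare policies to regulations": {"per_month": 2,  "hours_each": 20.0},
}

# Sort descending by impact, mirroring the Pareto chart
scored = sorted(workflows.items(),
                key=lambda kv: kv[1]["per_month"] * kv[1]["hours_each"],
                reverse=True)
shortlist = scored[:max(1, len(scored) // 5)]  # roughly the top 20%
print(shortlist)
```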
Impact
The potential value of these solutions is straightforward: employee hours saved and outcomes accelerated. Reaching this potential depends on choosing appropriate automation candidates and executing the project well. The challenges are significant, but if your organization is looking for actual business impact rather than flashy demos, this is the answer.
That very strength is also the drawback, though: these implementations will not result in a fun chatbot everyone can play with to spread the LLM gospel. And integrating this automation tightly into your business creates upside, but also risk. Workflow automations are cold, hard business tools, not toys.
4. Copilots
Take a Natural Language Interface and connect it to additional data sources and systems. Maybe integrate it with an existing text editor. If you like, even connect it to some workflow automations. Now you’ve got a Copilot! The idea: the more knowledge sources and tools you integrate, and the deeper that integration goes, the greater the utility. The goal of copilots is to dramatically increase the efficiency of the user across a broad set of activities. Here are some examples of what that might look like:
Advanced customer support chatbot
Answer questions from documentation
Answer questions from user account data
Perform actions on behalf of the user
Coding assistant. E.g. Github Copilot X
Autocomplete code
Create and rewrite code based on knowledge of the codebase
Answer questions about the code and codebase
These systems are very complex and require an immense amount of engineering to perfect. Unsurprisingly, some of the best current examples come from Microsoft (Github Copilot, Microsoft 365 Copilot), which has been pouring resources into LLM-related solutions for years.
Impact
As these systems seep into our everyday lives, it will become evident that they can make us multiple times more productive. Even the old, autocomplete-only Github Copilot automated 25-60% of users’ code. The same is happening with word processing, and I’m sure you’ve already used it in emails. If these simple augmentations can approach 2x efficiency (as in the Github example), then imagine how effective they will be with deep integrations into other systems. The value is very real, but very difficult to reach. For now, expect these solutions to come from extremely sophisticated tech organizations with deep pockets. However, building your own NLIs and Workflow Automations is a good way to progress toward that goal, generating value along the way.
5. Autonomous Agents
Now imagine a Copilot that pilots itself. Remember those workflows we talked about automating, and the ones we ignored?
The undefined, long tail of workflows is what autonomous agents seek to address, by integrating as many knowledge sources and tools as possible, and implementing an executive reasoning system that can plan and execute, doing the job of the user in a Copilot-style partnership. Imagine a system that can completely replace human employees, using the same tools and knowledge sources that they do.
Currently there is a massive amount of capital being invested into developing these systems ($1.3B to Inflection AI, $415M to Adept AI, etc), but they aren’t changing the world just yet. As it turns out, the complexity of navigating the entire digital world is rather high. In domains of limited scope, we know that automation works. But once you add all the tools, all the knowledge sources, all the possible environments… the long tail of complications explodes and reliability tanks.
Sound familiar? Autonomous cars present similar benefits but encounter the same problem: in limited environments they are reliable, but once you throw them into the infinite complexity of the real world, the issues never seem to end. And, sure - in 2024 they’re pretty advanced. But it took decades of work and $75B to get here. Don’t expect LLM agents to be too much easier.
Impact
Autonomous agents represent escape velocity for the productivity of AI systems. But we are not there yet and making progress towards them remains the domain of tech giants and startups with hundreds of millions of dollars to spend toward uncertain returns.
Coming in Part 2
Depending on your circumstances, even the most appealing use case could be a terrible choice. In the next post, we’ll focus on what NOT to do: examples of poor use-case choices and risks. If this post has helped your understanding so far, be sure to subscribe so you don’t miss the next part. And let me know in the comments what else you’d like to hear about!