FrugalGPT is a framework proposed by Lingjiao Chen, Matei Zaharia, and James Zou from Stanford University in their 2023 paper “FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance”. The paper outlines strategies for more cost-effective and performant usage of large language model (LLM) APIs.
A year after its initial publication, FrugalGPT remains highly relevant and widely discussed in the AI community. Its enduring popularity stems from the pressing need to make LLM API usage more affordable and efficient as these models grow larger and more expensive.
The core of FrugalGPT revolves around three key techniques for reducing LLM inference costs:
- Prompt Adaptation – Using concise, optimized prompts to minimize prompt processing costs
- LLM Approximation – Utilizing caches and model fine-tuning to avoid repeated queries to expensive models
- LLM Cascade – Dynamically selecting the optimal set of LLMs to query based on the input
The authors demonstrate the potential of these techniques, showing that FrugalGPT can match the performance of the best individual LLM (e.g. GPT-4) with up to 98% cost reduction, or improve accuracy over GPT-4 by 4% at the same cost.
In this post, we’ll delve into the practical implementation of these FrugalGPT strategies. We’ll provide concrete code examples of how you can employ prompt adaptation, LLM approximation, and LLM cascade in your own AI applications to get the most out of LLMs while managing costs effectively. By adopting FrugalGPT techniques, you can significantly reduce your LLM operating expenses without sacrificing performance.
Let’s put the theory of FrugalGPT into practice:
1. Prompt Adaptation
FrugalGPT suggests either reducing the size of the prompt or combining similar prompts together. The core idea is to minimize tokens and thus reduce LLM costs.
1.1 Decrease the prompt size
FrugalGPT proposes that instead of sending a lot of examples in a few-shot prompt, we could pick and choose the best ones, and thus reduce the prompt size.
While some larger models today don’t necessarily need few-shot prompts, the technique does significantly increase accuracy across multiple tasks.
Let’s take a classification example where we want to classify our incoming email based on the body into folders – Personal, Work, Events, Updates, Todos.
A few-shot prompt would look something like:
[{"role": "user", "content": "Identify the intent of incoming email by it's summary and classify it as Personal, Work, Events, Updates or Todos"},
{{examples}},
{"role": "user", "content": "Email: {{email_body}}"}]
Where the examples are formatted as user/assistant message pairs like this:
[{"role": "user", "content": "Email: Hey, this is a reminder from your future self to take flowers home today"},
{"role": "assistant", "content": "Personal, Todos"}]
The prompt with 20 examples is approximately 623 tokens and would cost about 0.1 cents per request on gpt-3.5-turbo. We might also want to keep adding examples whenever emails are mislabeled to improve accuracy, which would further increase the prompt-token cost.
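To sanity-check numbers like these for your own prompts, you can count tokens locally before sending a request. Below is a minimal sketch using tiktoken; the per-token price is an assumption for illustration, so substitute your provider's current input rate.

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
PRICE_PER_INPUT_TOKEN = 1.50 / 1_000_000  # assumed ~$1.50 per 1M input tokens; adjust to current pricing

def prompt_cost(messages):
    # Rough estimate: counts tokens in message contents only (ignores per-message formatting overhead)
    num_tokens = sum(len(encoding.encode(m["content"])) for m in messages)
    return num_tokens, num_tokens * PRICE_PER_INPUT_TOKEN

messages = [
    {"role": "user", "content": "Identify the intent of incoming email by its summary and classify it as Personal, Work, Events, Updates or Todos"},
    {"role": "user", "content": "Email: Hey, this is a reminder from your future self to take flowers home today"},
]
tokens, cost = prompt_cost(messages)
print(f"{tokens} prompt tokens, about ${cost:.5f} per request")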
FrugalGPT suggests identifying the best examples to be used instead of all of them.
In this case, we could run a semantic similarity test between the incoming email and the example email bodies, then pick only the top-k most similar examples.
import numpy as np
import torch
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

def get_embeddings(texts):
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input)
    # Use the [CLS] token embedding as the sentence representation
    embeddings = model_output.last_hidden_state[:, 0, :]
    return embeddings.numpy()

# Assuming examples is a list of dictionaries with 'email_body' and 'labels' keys
example_embeddings = get_embeddings([ex['email_body'] for ex in examples])

def select_best_examples(query, examples, k=5):
    query_embedding = get_embeddings([query])[0]
    similarities = cosine_similarity([query_embedding], example_embeddings)[0]
    top_k_indices = np.argsort(similarities)[-k:]
    return [examples[i] for i in top_k_indices]

# Example usage
query_email = "Don't forget our lunch meeting at the new Italian place today at noon!"
best_examples = select_best_examples(query_email, examples)
few_shot_prompt = generate_prompt(best_examples, query_email)  # Omitted for brevity

# The few-shot prompt now contains only the most relevant examples, reducing token count and cost
If we pick only the top 5 examples, we reduce our prompt-token cost to 0.03 cents, which is already a 70% reduction.
This is a great technique whenever you rely on few-shot prompts for high accuracy, and it holds up well in production scenarios.
1.2 Combine similar requests together
LLMs can handle multiple tasks within a single context, and FrugalGPT proposes exploiting this by grouping several requests together, which removes the redundant prompt examples from each individual request.
Using the same example as above, we could now try to classify multiple emails in a single request by tweaking the prompt like this:
[
{"role": "user", "content": "Identify the intent of the incoming emails by it's summary and classify it as 'Personal', 'Work', 'Events', 'Updates' or 'Todos'"},
{{examples}},
{"role": "user", "content": "Emails:n - {{email_body_1}}n - {{email_body_2}}n - {{email_body_3}}"}
]
And similarly, modify the examples to be batched in groups of two or three.
This reduces the cost of three requests from 0.06 cents to 0.03 cents, a 50% decrease.
This approach is particularly useful when processing data in batches using a few-shot prompt.
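Here is a minimal sketch of what that batching could look like, assuming the OpenAI Python SDK (v1+) and a few_shot_examples list of batched user/assistant example messages like the ones above; the exact instruction wording and output format are assumptions you would tune for your data.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

INSTRUCTION = ("Identify the intent of the incoming emails and classify each as 'Personal', "
               "'Work', 'Events', 'Updates' or 'Todos'. Answer with one line per email, "
               "formatted as '<number>: <labels>'.")

def classify_emails(email_bodies, few_shot_examples, batch_size=3):
    labels = []
    for i in range(0, len(email_bodies), batch_size):
        batch = email_bodies[i:i + batch_size]
        # One numbered list of emails per request, so the few-shot examples are paid for once per batch
        batch_text = "\n".join(f"{n + 1}. {body}" for n, body in enumerate(batch))
        messages = (
            [{"role": "user", "content": INSTRUCTION}]
            + few_shot_examples
            + [{"role": "user", "content": f"Emails:\n{batch_text}"}]
        )
        response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
        labels.extend(response.choices[0].message.content.strip().splitlines())
    return labels

Note that the output-format instruction matters: numbering both the inputs and the expected answers keeps each batched response easy to map back to the right email.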
Bonus 1.3: Better utilize a smaller model with a more optimized prompt
Some tasks appear to require bigger models, often simply because prompting a bigger model is easier: you can write a more general-purpose prompt, skip examples entirely with zero-shot prompting, and still get reliable results.
If we can convert some zero-shot prompts for bigger models into few-shot prompts for smaller models, we can get the same level of accuracy at a faster, cheaper rate.
Matt Shumer proposed an interesting way to use Claude Opus to convert a zero-shot prompt into a few-shot prompt which could then be run on a much smaller model without a significant decrease in accuracy.
This can lead to significant savings in both latency and cost. For the example Matt used, the original zero-shot prompt contained 281 tokens and was optimized for Claude-3-Opus, costing 3.2 cents per request.
When converted to a few-shot prompt with enough instructions, the prompt size increased to 1600 tokens. But since we could now run it on Claude Haiku, the overall cost per request dropped to 0.05 cents, a 98.5% cost reduction with a 78% speed-up!
The approach works well across XXL and S-sized models. Check out a more general-purpose notebook here.
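As a rough illustration of the idea (not Matt's exact notebook), the flow looks something like this with the Anthropic SDK; the model names, the conversion instruction, and the classification task are assumptions you would adapt.

import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

zero_shot_prompt = ("Identify the intent of the incoming email and classify it as "
                    "Personal, Work, Events, Updates or Todos.")

# Step 1 (one-time): ask the large model to expand the zero-shot prompt into a detailed
# few-shot prompt with explicit instructions and worked examples.
conversion_request = (
    "Rewrite the following zero-shot prompt as a detailed few-shot prompt with 5 "
    "representative input/output examples, so that a much smaller model can perform "
    f"the task reliably:\n\n{zero_shot_prompt}"
)
few_shot_prompt = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=2048,
    messages=[{"role": "user", "content": conversion_request}],
).content[0].text

# Step 2 (every request): run the cheap, fast model with the generated few-shot prompt.
def classify(email_body):
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=64,
        messages=[{"role": "user", "content": f"{few_shot_prompt}\n\nEmail: {email_body}"}],
    )
    return response.content[0].text

The Opus call is a one-time cost; every subsequent request pays only Haiku prices, which is where the savings come from.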
Bonus 1.4: Compress the prompt
The LLMLingua paper, first published in late 2023 (with a follow-up, LLMLingua-2, in early 2024), introduces an interesting concept for reducing LLM costs called prompt compression. Prompt compression aims to shorten the input prompts fed into LLMs while preserving the key information, making LLM inference more efficient and less costly.
LLMLingua is a task-agnostic prompt compression method proposed in the paper. It works by estimating the information entropy of tokens in the prompt using a smaller language model like LLaMa-7B and then removing low-entropy tokens that contribute less to the overall meaning. The authors demonstrated that LLMLingua can achieve compression ratios of 2.5x-5x on datasets like MeetingBank, LongBench, and GSM8K. Importantly, the compressed prompts still allow the target LLM (e.g. GPT-3.5-Turbo) to produce outputs comparable in quality to using the original full-length prompts.
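There is an open-source llmlingua package you can try this with. A minimal sketch, assuming the default small compressor model (downloaded on first use) and an illustrative target size; the instruction text and target_token value are placeholders for your own task.

from llmlingua import PromptCompressor

compressor = PromptCompressor()  # defaults to a small LLaMA-family model for token-level entropy estimates

long_prompt = "...your original, verbose few-shot prompt goes here..."  # placeholder

result = compressor.compress_prompt(
    long_prompt,
    instruction="Identify the intent of the incoming email and classify it.",
    question="",
    target_token=200,  # illustrative target size for the compressed prompt
)
compressed_prompt = result["compressed_prompt"]  # send this to the target LLM instead of long_prompt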
By reducing prompt length through compression techniques like LLMLingua, we can substantially cut down on the computational cost and latency of LLM inference, without sacrificing too much on the quality of the model’s outputs. This is a promising approach to make LLMs more practical and accessible for various AI applications. As research on prompt compression advances, we can expect LLMs to become more cost-efficient to deploy and use at scale.