Reducing the Cost of Generative AI in Practice
Generative AI is a powerhouse of modern innovation, enabling the creation of realistic content, scenario simulations, and task automation. However, running and maintaining these solutions can be costly. Large, resource-heavy models require significant computational power, while commercial APIs carry ongoing per-token charges; either way, expenses grow quickly with usage.
Imagine each LLM (large language model) as a specialist on your team, paid by the second for their expertise. Each application using an LLM corresponds to a department with similar knowledge requirements. While all the specialists have general knowledge, the top experts charge the highest rates and generally perform better. Whenever a department sends you a task, you always hire the best specialist. But as requests flood in, your expenses skyrocket. It’s clear you need a smarter strategy to manage your team and control costs without sacrificing quality.
You notice some patterns that might help. Many tasks, especially from the same department, are similar, and some are almost identical. The first idea is to record every task you have handled before, along with its answer (caching). Then, when a new request comes in, you can simply check the record and send back the answer directly, without calling any experts. This way, you save time and money by reusing past solutions.
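As a minimal sketch of the idea, assume a hypothetical `call_expert` function standing in for whichever expensive model call you use; everything else is plain Python:

```python
import hashlib

def call_expert(prompt: str) -> str:
    """Hypothetical stand-in for the expensive LLM call."""
    return f"expert answer for: {prompt}"

class ResponseCache:
    """Exact-match cache: reuse a recorded answer when the same request returns."""
    def __init__(self):
        self._store = {}

    def _key(self, prompt: str) -> str:
        # Normalize lightly so trivial whitespace/case differences still hit the cache.
        return hashlib.sha256(" ".join(prompt.split()).lower().encode()).hexdigest()

    def answer(self, prompt: str) -> str:
        key = self._key(prompt)
        if key not in self._store:          # cache miss: pay for the expert once
            self._store[key] = call_expert(prompt)
        return self._store[key]             # cache hit: free reuse of the recorded answer

cache = ResponseCache()
print(cache.answer("Summarize our refund policy."))   # first call hits the expert
print(cache.answer("summarize our refund policy."))   # served from the cache
```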
However, this approach has flaws. Some requests carry a lot of context from earlier interactions, making them hard to match against the record accurately, so it is difficult to decide whether to reuse a recorded answer or consult a specialist. It is also hard to tell when changes in the outside world invalidate the answers in your records, risking outdated or incorrect responses. You need to design the solution carefully to address these problems.
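One common way to soften both problems is to match on similarity rather than exact text and to expire old entries. The sketch below is illustrative only: the word-overlap similarity is a toy stand-in for embedding-based matching, and the fixed time-to-live is the simplest possible staleness rule.

```python
import time

def similarity(a: str, b: str) -> float:
    """Toy lexical similarity (Jaccard over words); a real system would use embeddings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

class SemanticCache:
    """Reuse an answer only if the new request is similar enough and the entry is fresh."""
    def __init__(self, threshold=0.8, ttl_seconds=3600):
        self.threshold = threshold       # below this, consult a specialist instead
        self.ttl = ttl_seconds           # expire entries so stale answers are not reused
        self.entries = []                # list of (prompt, answer, timestamp)

    def lookup(self, prompt: str):
        now = time.time()
        self.entries = [e for e in self.entries if now - e[2] < self.ttl]
        best = max(self.entries, key=lambda e: similarity(prompt, e[0]), default=None)
        if best and similarity(prompt, best[0]) >= self.threshold:
            return best[1]
        return None                      # no confident match: fall back to a specialist

    def store(self, prompt: str, answer: str):
        self.entries.append((prompt, answer, time.time()))
```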
Another thing you notice is that some requests come with long instructions (prompts), which take specialists time to read and increase costs. You consider using junior specialists to shorten these instructions, but you are unsure whether the savings will cover the cost. If many requests share the same instructions, it might be worth it. However, you are concerned that rewriting might compromise your top experts’ performance. To address this, you need to weigh the costs against the benefits carefully and verify the quality of the rewritten instructions.
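A back-of-the-envelope check makes the trade-off concrete. The token counts and per-token prices below are purely illustrative; the point is that a one-time compression cost must be recovered by the per-request savings:

```python
def compression_pays_off(original_tokens: int,
                         compressed_tokens: int,
                         reuse_count: int,
                         expert_price_per_token: float,
                         compressor_price_per_token: float) -> bool:
    """Rough break-even check for rewriting a long shared instruction once
    and reusing the shorter version across many requests (illustrative numbers only)."""
    # One-time cost: the junior model reads the long prompt and writes the short one.
    compression_cost = (original_tokens + compressed_tokens) * compressor_price_per_token
    # Recurring saving: every reuse sends fewer tokens to the expensive expert.
    saving_per_request = (original_tokens - compressed_tokens) * expert_price_per_token
    return saving_per_request * reuse_count > compression_cost

# Example: a 2,000-token instruction compressed to 600 tokens and reused 500 times.
print(compression_pays_off(2000, 600, 500,
                           expert_price_per_token=1e-5,
                           compressor_price_per_token=1e-6))
```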
You suddenly get another idea: mix the experts instead of sending every request to the best one. The logic is simple: not all requests need the top expert, and the top expert is not always the best at everything.
Here is the new idea: for certain types of tasks, start by sending them to a junior expert who is likely to give a good response at the lowest cost. If the answer makes sense, send it back to the customer. If not, escalate to a more skilled expert, and continue until you get a satisfactory answer. If you run out of experts, send the request to the top expert as a last resort. This way, you can optimize costs while maintaining quality.
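This escalation loop (often called a model cascade) is easy to express in a few lines. The sketch below uses hypothetical model callables and a placeholder `looks_good` check; in practice the check would be a cheap judge model or a rule-based guardrail:

```python
def looks_good(answer: str) -> bool:
    """Hypothetical quality check; in practice a cheap judge model or rule-based guardrail."""
    return bool(answer.strip()) and "i don't know" not in answer.lower()

def cascade(prompt, experts, top_expert):
    """Try experts from cheapest to most expensive; fall back to the top expert."""
    for expert in experts:                      # ordered cheapest first
        answer = expert(prompt)
        if looks_good(answer):
            return answer                       # good enough: stop escalating
    return top_expert(prompt)                   # last resort: the best (and priciest) expert

# Hypothetical stand-ins for real model calls
junior = lambda p: "" if "tricky" in p else f"[junior] {p}"
senior = lambda p: f"[senior] {p}"
top    = lambda p: f"[top] {p}"
print(cascade("Summarize this ticket.", [junior, senior], top))   # answered by the junior
print(cascade("Solve this tricky proof.", [junior], top))         # junior fails, top answers
```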
This excites you, and you decide to establish a fixed search path for each task type, using historical data to find it. Since there are far too many requests for you to judge the quality of every answer yourself, you will need to train a lower-cost specialist to act as the judge. However, this judge must itself be cheap; otherwise, it simply becomes another cost.
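One way to derive such a path, assuming you already have judge-scored historical records, is to keep only the experts that usually succeed on a task type and order them from cheapest to most expensive. The record schema and threshold below are made up for illustration:

```python
from collections import defaultdict

def build_paths(history, min_success_rate=0.6):
    """Derive a per-task-type escalation order from historical, judge-scored answers.
    `history` holds (task_type, expert_name, cost, judged_ok) records (illustrative schema)."""
    stats = defaultdict(lambda: defaultdict(lambda: [0, 0, 0.0]))  # type -> expert -> [ok, total, cost]
    for task_type, expert, cost, ok in history:
        s = stats[task_type][expert]
        s[0] += int(ok)
        s[1] += 1
        s[2] += cost
    paths = {}
    for task_type, experts in stats.items():
        viable = [(name, s[2] / s[1]) for name, s in experts.items()
                  if s[0] / s[1] >= min_success_rate]        # keep experts that usually succeed
        paths[task_type] = [name for name, _ in sorted(viable, key=lambda x: x[1])]  # cheapest first
    return paths

history = [
    ("translation", "junior", 0.002, True),
    ("translation", "junior", 0.002, True),
    ("translation", "top",    0.02,  True),
    ("math",        "junior", 0.002, False),
    ("math",        "top",    0.02,  True),
]
print(build_paths(history))   # {'translation': ['junior', 'top'], 'math': ['top']}
```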
Thinking deeper, this strategy has its own problems. Within each task type, the difficulty of an individual request matters more than the task type itself when deciding which expert fits best. Additionally, many junior experts cannot be paid by the second; instead, you need to hire them full-time (a dedicated GPU for an open-source model). So if the request volume is insufficient, this can cost more than it saves. Moreover, if you choose the wrong path or have too few requests, you can end up paying more because the costs of several experts stack up on the same question. Of course, with the right strategy you will eventually lower costs and even improve performance, since junior experts can outperform the top expert on some types of tasks.
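The volume concern is again a simple break-even calculation. The figures below are illustrative; the takeaway is that a full-time (dedicated GPU) junior only pays off above a certain monthly request volume:

```python
def break_even_requests(gpu_cost_per_month: float,
                        api_cost_per_request: float,
                        gpu_cost_per_request: float = 0.0) -> float:
    """How many requests per month before a dedicated GPU beats paying per request
    (illustrative numbers; real pricing varies)."""
    saving_per_request = api_cost_per_request - gpu_cost_per_request
    return gpu_cost_per_month / saving_per_request

# Example: a $1,200/month GPU vs. $0.02 per API request -> 60,000 requests/month to break even.
print(break_even_requests(1200, 0.02))
```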
You really like the last idea, but it is clearly not the right time for it: development is still at an early stage, request volume is not large enough, and you would need data finely classified by task difficulty, which is hard to create. To find a practical solution for the current situation, you decide to route tasks to experts based on the difficulty of the task rather than its type. Knowing you cannot hire many people, you opt for just one small model. You still need to train a junior specialist, but instead of judging answers, its job is to learn from historical experience which expert is best suited for each task.
At this stage, you have two junior experts: one handles the tasks routed to it, and the other specializes in routing and must be cheap to run. The bottom line is that the total cost of the two should be significantly less than the cost of the best expert. The data to train your routing specialist comes from the full range of historical tasks performed by the best experts. Of course, if you have data with ground-truth answers, it makes even better training material. The key is the accuracy of your routing specialist, and this approach leaves room for continuous improvement or for adding more experts in the future.
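A sketch of how such training data could be assembled, assuming you have historical prompts with the large model’s answers and a callable small model: compare the two answers and label each prompt with whether the small model would have been good enough. The word-overlap score is a toy stand-in for an embedding or LLM-judge comparison:

```python
def similarity_score(a: str, b: str) -> float:
    """Toy agreement measure between two answers; a real pipeline would use
    embedding similarity or an LLM judge."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def build_router_dataset(history, small_model, match_threshold=0.8):
    """Turn historical (prompt, large_model_answer) pairs into router training labels:
    1 if the small model's answer matches the large model's closely enough, else 0."""
    dataset = []
    for prompt, large_answer in history:
        small_answer = small_model(prompt)
        label = int(similarity_score(small_answer, large_answer) >= match_threshold)
        dataset.append((prompt, label))
    return dataset
```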
In a nutshell, the solution consists of three stages: first, collecting request and response data from requests sent to the large language model; second, using a smaller model to generate responses to the same requests and comparing the two sets of responses to build training data for a routing model; and third, placing the trained routing model in front of both models, where it predicts how closely the smaller model’s response would match the large model’s. If the predicted match exceeds a certain threshold, the task is routed to the smaller model. The routing model can be improved continuously, and additional models can be incorporated into its training. While some errors are expected, it has been shown that smaller models can often perform better on simpler tasks. Implementing guardrails, a common practice in LLM applications, helps mitigate the remaining risk.
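At serving time, the third stage amounts to a threshold check. In the sketch below, `router_score`, `small_model`, and `large_model` are hypothetical stand-ins for the trained router and the two models:

```python
def route(prompt, router_score, small_model, large_model, threshold=0.7):
    """Send the request to the small model only when the trained router predicts
    its answer will match the large model's closely enough."""
    predicted_match = router_score(prompt)        # router's predicted agreement in [0, 1]
    if predicted_match >= threshold:
        return small_model(prompt)                # cheap path
    return large_model(prompt)                    # expensive path for harder requests

# Hypothetical stand-ins for the three models
router_score = lambda p: 0.9 if len(p.split()) < 20 else 0.3
small_model  = lambda p: f"[small] {p}"
large_model  = lambda p: f"[large] {p}"
print(route("Translate 'hello' to French.", router_score, small_model, large_model))
```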
By adopting this method along with the other strategies, you can streamline the process, reduce costs, and ensure quality responses, paving the way for more efficient use of generative AI. If ground-truth data is available, the same method can be used to enhance performance rather than just reduce cost. You can also choose a training loss that strikes the right balance between cost and performance.
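One possible shape for such a loss, purely as an illustration: standard cross-entropy on whether the small model will match, plus a penalty term that discourages sending traffic to the expensive model. The weighting is an assumption, not a prescribed recipe:

```python
import math

def cost_aware_loss(predicted_match, label, cost_weight=0.3):
    """One possible training loss: cross-entropy on 'will the small model match?',
    plus a penalty that nudges the router toward the cheap path (weights are illustrative)."""
    p = min(max(predicted_match, 1e-7), 1 - 1e-7)
    cross_entropy = -(label * math.log(p) + (1 - label) * math.log(1 - p))
    expected_large_model_usage = 1 - p            # low p means routing to the expensive model
    return cross_entropy + cost_weight * expected_large_model_usage
```

Raising `cost_weight` pushes more traffic to the smaller model at some risk to quality; lowering it does the opposite.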