Anthropic has announced prompt caching, a new feature for its Claude family of generative AI models that promises to significantly reduce costs and improve performance for developers. The feature lets developers reuse frequently repeated prompt content across API calls, so the same long prompt does not have to be processed from scratch on every request. Because the cached prompt is stored on the inference server, Claude can refer back to it in subsequent requests, cutting both cost and latency.
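In practice, enabling the cache comes down to marking the reusable portion of a request. The snippet below is a minimal sketch using the Anthropic Python SDK as it stood during the public beta, where a cached segment is tagged with a cache_control block and the beta is switched on via a request header; the model string, document, and question are placeholders rather than details from Anthropic's announcement.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder: a large reference document that will be reused across many requests.
LONG_REFERENCE_TEXT = open("reference_document.txt").read()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    # The system prompt carries the large, reusable context; the cache_control
    # marker asks the API to cache everything up to this point.
    system=[
        {"type": "text", "text": "Answer questions using only the document below."},
        {
            "type": "text",
            "text": LONG_REFERENCE_TEXT,
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "What are the key findings?"}],
    # Beta header required while prompt caching is in public beta.
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
)

# Later calls that reuse the same cached prefix within the cache window are
# served from the cache; the usage object reports cache reads and writes.
print(response.usage)
```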
With prompt caching, customers can now give Claude more detailed background knowledge and example outputs, which is especially useful for tasks such as document-based question answering or recommendation systems. According to Anthropic, prompt caching can cut costs by up to 90% and latency by as much as 85%, with the biggest gains on long prompts. The feature is currently in public beta for Claude 3.5 Sonnet and Claude 3 Haiku, with support for Claude 3 Opus planned for the near future.
A recent study by researchers from Yale University and Google highlighted the advantages of prompt caching for reducing inference latency, particularly with longer prompts. By caching prompts on the inference server, latency can be cut by roughly 8x on GPU-based inference and by as much as 60x on CPU-based inference. The study also emphasized that this reduction comes without compromising the accuracy of the model's outputs or requiring any changes to the model's parameters.
Prompt caching is expected to be useful in several practical scenarios. For instance, it can be applied to conversational agents, coding assistants, or tasks that involve processing large documents. Users could also query cached content such as books, papers, or transcripts, speeding up access to relevant information. Developers can likewise use the feature to share instructions with Claude or refine its responses through iterative changes, improving the overall performance of an AI system. With up to four cache breakpoints available for developers to define and a cache lifetime of five minutes, the update is poised to make AI-powered applications noticeably more efficient.
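As a rough illustration of how multiple breakpoints might be used, the sketch below (again assuming the beta-era Anthropic Python SDK; the file names and prompts are hypothetical) marks stable instructions and a large transcript as two separate cached segments, so swapping the transcript does not invalidate the cached instructions that precede it.

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_INSTRUCTIONS = "You are a meeting assistant. Cite timestamps in your answers."
TRANSCRIPT = open("meeting_transcript.txt").read()  # placeholder content

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=512,
    system=[
        # Breakpoint 1: stable instructions, cached independently of the transcript.
        {
            "type": "text",
            "text": SYSTEM_INSTRUCTIONS,
            "cache_control": {"type": "ephemeral"},
        },
        # Breakpoint 2: the large transcript; replacing it only invalidates
        # the cache from this segment onward.
        {
            "type": "text",
            "text": TRANSCRIPT,
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "Who agreed to send the follow-up email?"}],
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
)
print(response.content[0].text)
```

Because the cache matches on prefixes, placing the most stable content first in the prompt maximizes how much of it can be reused across requests before the five-minute lifetime expires.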