Tokenization is a crucial aspect of natural language processing, and different model families utilize different tokenizers. However, there is limited research on how tokenization processes vary across these models. Do all tokenizers produce the same number of tokens for a given input text? If not, how do the generated tokens differ, and what are the implications of these differences?
In this article, we delve into these questions and explore the practical implications of tokenization variability. We compare two cutting-edge model families: OpenAI’s ChatGPT and Anthropic’s Claude. While both offer competitive “cost-per-token” pricing, our experiments reveal that Anthropic models can be 20–30% more expensive to run than GPT models.
API Pricing Comparison — Claude 3.5 Sonnet vs GPT-4o
As of June 2024, the pricing for Anthropic’s Claude 3.5 Sonnet and OpenAI’s GPT-4o is highly competitive: both charge the same rate for output tokens ($15 per million), while Claude 3.5 Sonnet’s input tokens cost 40% less ($3 versus $5 per million).
The Hidden “Tokenizer Inefficiency”
Despite Anthropic’s lower input-token rate, our experiments show that the same workloads cost significantly less to run on GPT-4o than on Claude 3.5 Sonnet. The reason is that Anthropic’s tokenizer tends to produce more tokens than OpenAI’s for the same input. And because both models charge the same rate for output tokens, the inflated token counts more than offset the lower input price in practical scenarios.
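To make the arithmetic concrete, here is a minimal sketch of the effect in Python. It assumes the June 2024 list prices above and an illustrative 30% token overhead for Claude on a code-heavy workload; the overhead ratio and workload mix are assumptions, not measured constants.

```python
# Hypothetical cost comparison: the same workload billed through each
# model's own tokenizer. Prices are June 2024 list prices per 1M tokens;
# the 1.3x overhead is an illustrative assumption for code-heavy text.
GPT4O_IN, GPT4O_OUT = 5.00, 15.00      # $ per 1M tokens
CLAUDE_IN, CLAUDE_OUT = 3.00, 15.00    # $ per 1M tokens
OVERHEAD = 1.3                         # Claude tokens per GPT-4o token

# Workload measured in GPT-4o tokens: 1M tokens in, 1M tokens out.
in_tok = out_tok = 1_000_000

gpt_cost = (in_tok * GPT4O_IN + out_tok * GPT4O_OUT) / 1_000_000
claude_cost = (in_tok * OVERHEAD * CLAUDE_IN
               + out_tok * OVERHEAD * CLAUDE_OUT) / 1_000_000

print(f"GPT-4o: ${gpt_cost:.2f}, Claude: ${claude_cost:.2f}")
# GPT-4o: $20.00, Claude: $23.40 -- ~17% more, despite the cheaper input rate
```

Because output tokens are priced identically, any tokenizer overhead on the output side passes straight through to the bill; the more output-heavy the workload, the wider the gap.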
Domain-Dependent Tokenization Inefficiency
Anthropic’s tokenizer splits different kinds of content differently, so its overhead relative to OpenAI’s tokenizer varies by domain. Our experiments across English articles, Python code, and mathematical equations showed that Claude’s tokenizer generates roughly 16% more tokens than GPT-4o’s for English articles, 30% more for Python code, and 21% more for math.
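You can reproduce this kind of comparison by counting tokens directly. Below is a minimal sketch assuming the tiktoken package on OpenAI’s side and the token-counting endpoint available in recent versions of Anthropic’s Python SDK; the sample strings and model ID are illustrative.

```python
# Count tokens for the same text with both tokenizers.
import tiktoken                  # pip install tiktoken
from anthropic import Anthropic  # pip install anthropic

samples = {
    "english": "Tokenization efficiency varies across model families.",
    "python":  "def add(a, b):\n    return a + b",
    "math":    "E = mc^2, where m is mass and c is the speed of light.",
}

enc = tiktoken.encoding_for_model("gpt-4o")  # the o200k_base encoding
client = Anthropic()                         # reads ANTHROPIC_API_KEY

for domain, text in samples.items():
    gpt_count = len(enc.encode(text))
    claude_count = client.messages.count_tokens(
        model="claude-3-5-sonnet-20241022",
        messages=[{"role": "user", "content": text}],
    ).input_tokens
    print(f"{domain:8}  GPT-4o: {gpt_count:3}  Claude: {claude_count:3}")
```

Note that the Anthropic endpoint counts the full message, so it includes a few tokens of message-formatting overhead on top of the raw text; for short strings, subtract that constant or compare longer passages.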
Other Practical Implications of Tokenizer Inefficiency
Beyond cost, tokenizer inefficiency also affects context-window utilization. Anthropic models advertise a larger context window of 200K tokens, but because the tokenizer is more verbose, less actual text fits into that window, creating a gap between the advertised and effective context sizes.
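As a back-of-the-envelope illustration, assuming the 30% overhead we measured for code, a 200K Claude window holds roughly as much code as about 154K GPT-4o tokens would:

```python
# Effective context capacity expressed in GPT-4o-equivalent tokens,
# assuming a 30% token inflation for code-heavy input (illustrative).
ADVERTISED_WINDOW = 200_000  # Claude 3.5 Sonnet context window
OVERHEAD = 1.30

print(f"{ADVERTISED_WINDOW / OVERHEAD:,.0f} GPT-4o-equivalent tokens")
# 153,846 GPT-4o-equivalent tokens
```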
Implementation of Tokenizers
GPT models use Byte Pair Encoding (BPE) to form tokens; GPT-4o’s encoding (o200k_base) is publicly available through OpenAI’s tiktoken library. Anthropic has not published comparable details about its tokenizer, but third-party tools and resources for analyzing tokenization differences between GPT and Claude models are emerging.
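To see BPE in action on OpenAI’s side, tiktoken can decode each token ID back into its text fragment. A minimal sketch; the sample string is arbitrary:

```python
# Inspect how GPT-4o's BPE tokenizer splits a string into subword pieces.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # o200k_base
text = "def factorial(n): return 1 if n <= 1 else n * factorial(n - 1)"

token_ids = enc.encode(text)
pieces = [enc.decode([tid]) for tid in token_ids]
print(len(token_ids), "tokens:", pieces)
```

Running the same string through the token-counting sketch above shows where the two tokenizers diverge most: whitespace- and symbol-heavy content such as code.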
Key Takeaways
– Anthropic’s competitive pricing may come with hidden costs due to tokenizer inefficiencies.
– Understanding the verbosity of Anthropic’s tokenizer is essential for businesses evaluating deployment costs.
– Consider the nature of your input text when choosing between OpenAI and Anthropic models to assess potential cost differences.
– The effective context window size may differ from the advertised size, impacting the usability of the models.
Despite requests for comment, Anthropic had not responded by press time. This article will be updated if a response is received.