I have recently released the initial version of the
chat-splitter
crate.
This blog post offers an overview of what
chat-splitter
is,
the problem it addresses,
and how it works.
The context length problem
Large language models face challenges when processing lengthy sequences of text. State-of-the-art models, like OpenAI’s GPT-4, have a maximum token limit that caps how much context they can handle. The maximum context window typically ranges from 2k tokens for legacy GPT-3 models to as high as 32k tokens for the most recent GPT-4 model, and potentially more [1]. This limitation is a significant obstacle to generating coherent and meaningful responses. For example, when building a chat application, what should be done once the conversation becomes too long to fit within the model’s context window?
What is chat-splitter?
chat-splitter
is a Rust library designed specifically to split
chat messages for OpenAI’s chat
models.
Its
main purpose is to ensure that these messages always fit
within the context length limits imposed by large language models,
as discussed above.
By
handling the splitting process,
chat-splitter
aims to simplify
the development of applications that utilize these models.
The code for chat-splitter
is open-source and can be accessed on
GitHub.
Additionally,
it is available on
crates.io for easy integration into
projects.
For comprehensive documentation,
please visit
docs.rs.
Integrating chat-splitter
into your project
To incorporate chat-splitter
into your Rust project,
you can
easily add it as a dependency using
Cargo.
$ cd your/rust/project
$ cargo add chat-splitter
Next,
create a
ChatSplitter
instance:
use chat_splitter::ChatSplitter;
let cs = ChatSplitter::default();
There are two methods available to control the splitting process: by setting a maximum message count and by specifying a maximum token count [2]:
let cs = ChatSplitter::new("gpt-3.5-turbo") // same as ::default()
.max_messages(20)
.max_tokens(1000);
ℹ️ The
max_tokens
parameter in this context refers to the number of tokens that will remain in the context window for the large language model to complete. It is the same max_tokens
value that you would use when making requests to the OpenAI API.
The main method is
split
,
which divides the chat messages
into two groups: those that fit within the context and the
remaining messages [3]:
let chat_messages: Vec<_> = /* way too many messages */;
let (rest, fit) = cs.split(&chat_messages);
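For example,
the fit messages are what you would attach to the next API request,
reusing the same token budget that was configured on the splitter,
while the rest messages can be discarded or kept around.
The sketch below assumes that your chat messages are async-openai
request messages and uses async-openai's builder API as it looked at
the time of writing;
treat it as an outline rather than copy-paste code:
use async_openai::{types::CreateChatCompletionRequestArgs, Client};

let client = Client::new();
let request = CreateChatCompletionRequestArgs::default()
    .model("gpt-3.5-turbo")
    // Reserve the same completion budget that was passed to
    // `max_tokens` on the `ChatSplitter` above.
    .max_tokens(1000u16)
    // Attach only the messages that fit within the context window.
    .messages(fit.to_vec())
    .build()?;
let response = client.chat().create(request).await?;
// `rest` is not sent; it could be dropped or stored,
// e.g. as long-term chat memory.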
For more detailed information and practical examples,
please refer
to the chat-splitter
documentation and the GitHub
repository.
In particular,
you may find the
examples/chat.rs
file to be helpful.
How does it work?
chat-splitter
leverages
tiktoken-rs
for token
counting and is compatible with any message type that implements
the
IntoChatCompletionRequestMessage
trait.
The crate provides built-in implementations for chat message
types defined by
async-openai
and
tiktoken-rs
.
However,
it is also effortless to implement the trait for your own custom
message types [4].
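As a rough, conceptual sketch only
(the real trait and its methods are defined by chat-splitter and
documented on docs.rs;
the stand-in trait and the StoredMessage type below are hypothetical),
such an implementation boils down to telling the splitter how to view
your type as a role plus some content:
// A hypothetical application-specific message type.
struct StoredMessage {
    role: String,
    content: String,
    timestamp: u64,
}

// Simplified stand-in for chat-splitter's
// `IntoChatCompletionRequestMessage` trait;
// consult docs.rs for the actual trait methods.
trait IntoRequestMessage {
    fn role(&self) -> &str;
    fn content(&self) -> &str;
}

impl IntoRequestMessage for StoredMessage {
    fn role(&self) -> &str {
        &self.role
    }

    fn content(&self) -> &str {
        // Extra fields such as `timestamp` are simply ignored when the
        // message is handed to the splitter.
        &self.content
    }
}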
The algorithm used by chat-splitter
first splits the chat
messages by message count,
which is a simple process.
It
then performs token-based splitting,
which involves counting tokens
and finding a split position that maximizes the number of
messages within a given token budget.
Token counting relies on
tokenizers from tiktoken-rs
,
and a naive implementation with a
time complexity of $\mathcal{O}(n)$ [5] would not be suitable for certain
real-world scenarios [6].
To ensure optimal performance,
chat-splitter
utilizes a binary
search algorithm implemented in
indxvec,
which has a worst-case time
complexity of $\mathcal{O}(\log_2{n})$.
This approach significantly improves
efficiency compared to a naive implementation,
which would require up
to 2048 token counts,
the maximum message count allowed by
OpenAI’s API.
With chat-splitter
,
the
token counting process never exceeds eleven iterations in this
scenario,
since $\lceil\log_2{2048}\rceil = 11$.
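To make the idea concrete,
here is a minimal, self-contained sketch of such a binary search
(it is not chat-splitter's actual code,
which counts real tokens with tiktoken-rs and searches with indxvec):
given the token count of each message,
it finds the longest suffix of recent messages whose total stays
within the budget,
using only a logarithmic number of counting calls:
// Stand-in for a real tiktoken-rs token count over the last `k` messages.
fn count_suffix_tokens(token_counts: &[usize], k: usize) -> usize {
    token_counts[token_counts.len() - k..].iter().sum()
}

/// Returns the index `i` such that the messages in `[i..]` form the
/// longest suffix whose total token count stays within `max_tokens`.
/// The total grows with the suffix length, so "fits" is monotone and a
/// binary search over the suffix length needs only O(log n) counting
/// calls instead of up to n.
fn split_point(token_counts: &[usize], max_tokens: usize) -> usize {
    let n = token_counts.len();
    // Invariant: the suffix of length `lo` fits; no length above `hi` can fit.
    let (mut lo, mut hi) = (0, n);
    while lo < hi {
        let mid = lo + (hi - lo + 1) / 2; // round up so the loop always progresses
        if count_suffix_tokens(token_counts, mid) <= max_tokens {
            lo = mid; // this longer suffix still fits
        } else {
            hi = mid - 1; // too many tokens, try shorter suffixes
        }
    }
    n - lo
}
For instance,
split_point(&[12, 40, 7, 30], 50)
returns 2:
only the last two messages (7 + 30 = 37 tokens) fit within the budget of 50.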
I hope you find chat-splitter
beneficial for your projects.
I
welcome any suggestions or contributions through pull requests on
the GitHub
repository
.
If you have any questions or need further assistance,
please feel
free to reach out to me on
Twitter.
[1] Researchers are actively working on addressing this limitation. For example, Anthropic has recently announced a Claude model with a 100K context window.
[2] Additional methods of controlling the splitting process may be introduced in the future.
[3] While the rest messages may be disregarded in many applications, there are interesting use cases for them, particularly in the context of long-term chat memory.
[4] Perhaps there is a demand for a dedicated crate for this purpose? 📦
[5] In this context, $n$ represents the number of messages considered in the chat after the initial splitting by messages. Therefore, it is important to note that $n$ can be significantly smaller than the actual input slice.
[6] For additional information related to this topic, please consult this issue in the tiktoken-rs repository.