At the heart of LLM training lies the concept of tokenization. Text data is broken down into smaller units called tokens, which the model can then process and learn from. The efficiency of this tokenization process directly impacts the training cost.
We'll explore how the tiktoken library can be leveraged to estimate the cost of fine-tuning an OpenAI model. You can check OpenAI's pricing page; as of this writing, they charge 8 USD per 1 million training tokens. So all we need to do is count the tokens in our training and validation datasets. Since the validation dataset is negligibly small, we'll calculate the total number of tokens in the training set alone. Let us jump into the implementation.
%%capture
%pip install --upgrade tiktoken
%pip install --upgrade openai
import tiktoken
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
encoding.encode("Please leave a rating!")
def num_tokens_from_string(string: str, model_name: str) -> int:
    encoding = tiktoken.encoding_for_model(model_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens
num_tokens_from_string("I want to see the Northern light", "gpt-3.5-turbo")
# df is the training DataFrame; each row's 'comment' column holds one training text
num_tokens = 0
for index, row in df.iterrows():
    tokens_count = num_tokens_from_string(row['comment'], "gpt-3.5-turbo")
    num_tokens += tokens_count
print(num_tokens)
# Similarly, count the tokens in the validation dataset; it is too small here, so we ignore it
price_per_million = 8  # USD per 1 million training tokens
cost_of_fine_tuning_per_epoch = num_tokens / 1_000_000 * price_per_million
print(cost_of_fine_tuning_per_epoch)
#Output:
0.004632
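The output above (0.004632 USD at 8 USD per million tokens) corresponds to 579 training tokens for a single epoch. Since fine-tuning bills the full token count once per epoch, the calculation generalizes naturally. A small hypothetical helper (not part of the original notebook) that wraps this arithmetic:

```python
def estimate_fine_tuning_cost(num_tokens: int, epochs: int = 1,
                              price_per_million: float = 8.0) -> float:
    """Estimated fine-tuning cost in USD: the full token count is billed each epoch."""
    return num_tokens / 1_000_000 * price_per_million * epochs

# With the 579 tokens counted above:
print(estimate_fine_tuning_cost(579))     # cost for one epoch
print(estimate_fine_tuning_cost(579, 3))  # cost for three epochs
```

Check the epoch count you configure in the fine-tuning job, since it multiplies the cost directly.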