Google Colab: https://colab.research.google.com/drive/1-0yZZmDe6RDmeyrWNXOg-ruT4ZPjK6li#scrollTo=LpYZ5LgJ22NX&line=13&uniqifier=1
OpenAI expects the training data to be formatted as shown below:
[
{"messages": [{"role": "system", "content": "Classify the given text."}, {"role": "user", "content": "Congrats for winning"}, {"role": "assistant", "content": "spam"}]},
{"messages": [{"role": "system", "content": "Classify the given text."}, {"role": "user", "content": "Your YC is pending"}, {"role": "assistant", "content": "ham"}]}
]
One more thing: the content for the assistant role has to be a string. An LLM's response may represent a JSON object, an array, or some other data structure, but it is always encoded as a string. So we need to stringify the assistant's expected response.
Let us stringify the probabilities of toxic, indecent, ..., spam in this format: '{"toxic": 0.017215505, "indecent": 0.31928617,...'
import json
combined_scores = df[['toxic', 'indecent', 'threat', 'offensive', 'erotic', 'spam']]
scores_training = []
for index, row in combined_scores.iterrows():
    scores_dict = row.to_dict()
    # Stringify the dictionary
    scores_str = json.dumps(scores_dict)
    scores_training.append(scores_str)
df['scores'] = scores_training
df.head()
#Output:
id comment toxic indecent threat offensive erotic spam scores
0 1 go back 0.017216 0.319286 0.007431 0.010667 0.014272 0.154840 {"toxic": 0.017215505, "indecent": 0.31928617,...
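The `scores` column now holds plain strings. As a quick sketch (with illustrative values, not the full score dictionary) showing that such a string still parses back into the original probabilities:

```python
import json

# Hypothetical stringified scores for one comment (values are illustrative)
scores_str = '{"toxic": 0.017215505, "indecent": 0.31928617}'

assert isinstance(scores_str, str)   # stored as a plain string, as OpenAI requires
scores = json.loads(scores_str)      # ...but it parses back into a dict
assert scores["toxic"] == 0.017215505
```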
Now that we have a good format for the training data, we can bring it into the format required by OpenAI. I am going to start by creating an example prompt and then fit all the rows to that pattern.
system_prompt = "Classify the given input text, return a JSON object containing the probability scores for: 'toxic', 'indecent', 'threat', 'offensive', 'erotic', and 'spam'. Please respond with only the JSON object, without any additional text or explanation"
example_prompt = {"messages": [{"role": "system", "content": system_prompt},
                               {"role": "user", "content": df["comment"].iloc[1]},
                               {"role": "assistant", "content": df["scores"].iloc[1]}]}
example_prompt
formatted_training_data = []
for index, row in df.iterrows():
    item = {"messages": [{"role": "system", "content": system_prompt},
                         {"role": "user", "content": row["comment"]},
                         {"role": "assistant", "content": row["scores"]}]}
    formatted_training_data.append(item)
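As a hedged sanity check on the structure (the comment and scores below are illustrative stand-ins, not rows from the real dataframe), each item should carry the three messages in order, with every content field a plain string:

```python
# Hypothetical single item mirroring the loop above; values are stand-ins
system_prompt = "Classify the given input text."
item = {"messages": [{"role": "system", "content": system_prompt},
                     {"role": "user", "content": "go back"},
                     {"role": "assistant", "content": '{"toxic": 0.017215505}'}]}

# The fine-tuning format needs system/user/assistant in order,
# and every content field must be a string
assert [m["role"] for m in item["messages"]] == ["system", "user", "assistant"]
assert all(isinstance(m["content"], str) for m in item["messages"])
```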
Similarly, we need to bring the validation data into the format required by OpenAI.
import json
combined_scores_validation = df_validation[['toxic', 'indecent', 'threat', 'offensive', 'erotic', 'spam']]
scores_validation = []
for index, row in combined_scores_validation.iterrows():
    scores_dict = row.to_dict()
    # Stringify the dictionary
    scores_str = json.dumps(scores_dict)
    scores_validation.append(scores_str)
df_validation['scores'] = scores_validation
df_validation.head()
formatted_validation_data = []
for index, row in df_validation.iterrows():
    item = {"messages": [{"role": "system", "content": system_prompt},
                         {"role": "user", "content": row["comment"]},
                         {"role": "assistant", "content": row["scores"]}]}
    formatted_validation_data.append(item)
Almost there. One last thing: OpenAI expects these datasets to be uploaded as JSON Lines (.jsonl) files. JSON Lines is a convenient format for storing structured data that may be processed one record at a time.
with open("./training_dataset.jsonl", "w") as f:
    for item in formatted_training_data:
        f.write(json.dumps(item))
        f.write("\n")
with open("./validation_dataset.jsonl", "w") as f:
    for item in formatted_validation_data:
        f.write(json.dumps(item))
        f.write("\n")
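A quick way to confirm the files are valid JSON Lines is to read them back line by line: every line must parse as an independent JSON object. A self-contained sketch of that check, using a tiny illustrative dataset (the filename and sample items are hypothetical):

```python
import json

# Write a tiny illustrative dataset, then read it back line by line
sample = [{"messages": [{"role": "user", "content": "hi"}]},
          {"messages": [{"role": "user", "content": "bye"}]}]

with open("./sample_dataset.jsonl", "w") as f:
    for item in sample:
        f.write(json.dumps(item))
        f.write("\n")

# Each line of a JSONL file is one complete JSON document
with open("./sample_dataset.jsonl") as f:
    parsed = [json.loads(line) for line in f]

assert parsed == sample  # every record round-trips independently
```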