Google Colab: https://colab.research.google.com/drive/1-0yZZmDe6RDmeyrWNXOg-ruT4ZPjK6li#scrollTo=LpYZ5LgJ22NX&line=13&uniqifier=1
OpenAI expects the training data to be formatted as shown below:
[
{"messages": [{"role": "system", "content": "Classify the given text."}, {"role": "user", "content": "Congrats for winning"}, {"role": "assistant", "content": "spam"}]},
{"messages": [{"role": "system", "content": "Classify the given text."}, {"role": "user", "content": "Your YC is pending"}, {"role": "assistant", "content": "ham"}]}
]
One more thing: the content for the assistant role has to be a string. An LLM's response may represent a JSON object, an array, or some other data structure, but it is always encoded as a string. So we need to stringify the assistant's expected response.
Let us stringify the probabilities of toxic, indecent, ..., spam in this format: '{"toxic": 0.017215505, "indecent": 0.31928617,...'
import json
combined_scores = df[['toxic', 'indecent', 'threat', 'offensive', 'erotic', 'spam']]
scores_training = []
for index, row in combined_scores.iterrows():
    scores_dict = row.to_dict()
    # Stringify the dictionary
    scores_str = json.dumps(scores_dict)
    scores_training.append(scores_str)
df['scores'] = scores_training
df.head()
#Output:
id comment toxic indecent threat offensive erotic spam scores
0 1 go back 0.017216 0.319286 0.007431 0.010667 0.014272 0.154840 {"toxic": 0.017215505, "indecent": 0.31928617,...
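The `scores` column now holds plain strings. As a quick sketch (with illustrative values, not the full score dictionary) showing that such a string still parses back into the original probabilities:

```python
import json

# Hypothetical stringified scores for one comment (values are illustrative)
scores_str = '{"toxic": 0.017215505, "indecent": 0.31928617}'

assert isinstance(scores_str, str)   # stored as a plain string, as OpenAI requires
scores = json.loads(scores_str)      # ...but it parses back into a dict
assert scores["toxic"] == 0.017215505
```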
Now that we have a good format for the training data, we can bring it into the format required by OpenAI. I am going to start by creating an example prompt and then fit all the rows to that pattern.
system_prompt = "Classify the given input text, return a JSON object containing the probability scores for: 'toxic', 'indecent', 'threat', 'offensive', 'erotic', and 'spam'. Please respond with only the JSON object, without any additional text or explanation"
example_prompt = {"messages": [{"role": "system", "content": system_prompt},
                               {"role": "user", "content": df["comment"].iloc[1]},
                               {"role": "assistant", "content": df["scores"].iloc[1]}]}
example_prompt
formatted_training_data = []
for index, row in df.iterrows():
    item = {"messages": [{"role": "system", "content": system_prompt},
                         {"role": "user", "content": row["comment"]},
                         {"role": "assistant", "content": row["scores"]}]}
    formatted_training_data.append(item)
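As a hedged sanity check on the structure (the comment and scores below are illustrative stand-ins, not rows from the real dataframe), each item should carry the three messages in order, with every content field a plain string:

```python
# Hypothetical single item mirroring the loop above; values are stand-ins
system_prompt = "Classify the given input text."
item = {"messages": [{"role": "system", "content": system_prompt},
                     {"role": "user", "content": "go back"},
                     {"role": "assistant", "content": '{"toxic": 0.017215505}'}]}

# The fine-tuning format needs system/user/assistant in order,
# and every content field must be a string
assert [m["role"] for m in item["messages"]] == ["system", "user", "assistant"]
assert all(isinstance(m["content"], str) for m in item["messages"])
```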
Similarly, we need to bring the validation data into the format required by OpenAI.
import json
combined_scores_validation = df_validation[['toxic', 'indecent', 'threat', 'offensive', 'erotic', 'spam']]
scores_validation = []
for index, row in combined_scores_validation.iterrows():
    scores_dict = row.to_dict()
    # Stringify the dictionary
    scores_str = json.dumps(scores_dict)
    scores_validation.append(scores_str)
df_validation['scores'] = scores_validation
df_validation.head()
formatted_validation_data = []
for index, row in df_validation.iterrows():
    item = {"messages": [{"role": "system", "content": system_prompt},
                         {"role": "user", "content": row["comment"]},
                         {"role": "assistant", "content": row["scores"]}]}
    formatted_validation_data.append(item)
Almost there. One last thing: OpenAI expects these datasets to be uploaded as JSON Lines (.jsonl) files. JSON Lines is a convenient format for storing structured data that may be processed one record at a time.
with open("./training_dataset.jsonl", "w") as f:
    for item in formatted_training_data:
        f.write(json.dumps(item))
        f.write("\n")
with open("./validation_dataset.jsonl", "w") as f:
    for item in formatted_validation_data:
        f.write(json.dumps(item))
        f.write("\n")
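A quick way to confirm the files are valid JSON Lines is to read them back line by line: every line must parse as an independent JSON object. A self-contained sketch of that check, using a tiny illustrative dataset (the filename and sample items are hypothetical):

```python
import json

# Write a tiny illustrative dataset, then read it back line by line
sample = [{"messages": [{"role": "user", "content": "hi"}]},
          {"messages": [{"role": "user", "content": "bye"}]}]

with open("./sample_dataset.jsonl", "w") as f:
    for item in sample:
        f.write(json.dumps(item))
        f.write("\n")

# Each line of a JSONL file is one complete JSON document
with open("./sample_dataset.jsonl") as f:
    parsed = [json.loads(line) for line in f]

assert parsed == sample  # every record round-trips independently
```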