Objective: We have two CSV files, comments.csv and validation_data.csv, and both contain comments. We want to turn them into a labeled dataset. Manually labeling every comment into the 6 categories is very tedious, I have already tried it :D
id | comment |
1 | get lost |
2 | you are amazing |
So, in this chapter, we are going to use an external API named DeSpam to generate these labels and prepare proper training and validation datasets. BTW, this tutorial is completely optional.
Results:
id | comment | toxic | indecent | threat | offensive | erotic | spam |
1 | get lost | 0.40590 | 0.6594 | ... | ... | ... | 0.036 |
2 | you are amazing | 0.083 | 0.33948 | ... | ... | ... | 0.0094 |
%%capture
!pip install pandas requests
We will need the pandas library to read the CSV files and prepare the dataset, and requests to call the API.
import pandas as pd
df = pd.read_csv("./comments.csv")
df.head()
#Output
id comment
.. ...
2 3 Congratulations you have won, 1 billion dollar...
3 4 I will kick your ***
Similarly, we should read the validation dataset. The LLM will later use it to validate what it has learned on comments it has not seen during training.
df_validation = pd.read_csv("./validation_data.csv")
df_validation.head()
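Before spending API calls on the full files, a quick sanity check (optional, a minimal sketch) confirms both files have the expected comment column and shows how many rows we are about to label:
# Optional sanity check: confirm the columns and row counts
print(df.columns.tolist(), df_validation.columns.tolist())
print(len(df), "training comments |", len(df_validation), "validation comments")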
As discussed at the beginning of this tutorial, we are going to use an external API to get the classification probabilities.
import time

import pandas as pd
import requests

# Replace with your actual API key
api_key = "genai"
api_url = "https://despam.io/api/v1/moderate"

def predict_toxicity(comment):
    headers = {
        "x-api-key": api_key,
        "Content-Type": "application/json",
    }
    data = {"input": comment}
    response = requests.post(api_url, headers=headers, json=data)
    # Fail fast on HTTP errors instead of silently parsing an error body
    response.raise_for_status()
    # Log the first 30 characters of the comment next to the raw response
    print(data.get("input")[:30], response.text)
    return response.json()
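Before looping over the whole dataframe, it helps to try the function on a single comment. The exact response fields are an assumption here; they should match the six categories we store below:
# Single-call smoke test (assumes the API returns one probability per
# category, e.g. {"toxic": ..., "indecent": ..., "threat": ...,
# "offensive": ..., "erotic": ..., "spam": ...})
sample_scores = predict_toxicity("you are amazing")
print(sample_scores)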
categories = ["toxic", "indecent", "threat", "offensive", "erotic", "spam"]

for index, row in df.iterrows():
    comment = row["comment"]
    # Pause between calls to avoid being rate limited by the API
    time.sleep(1)
    print(f"Progress: {index + 1}/{len(df)}")
    scores = predict_toxicity(comment)
    # Store each category score in its own column
    for category in categories:
        df.loc[index, category] = scores[category]
The code iterates over the pandas dataframe: for each row, it retrieves the comment text, calls the predict_toxicity function to get the scores, and stores each score in its own column. The 1-second delay between API calls helps avoid being rate limited by the API. One more thing: you can reuse the api_key="genai"; it should work fine.
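If the API still returns an occasional rate-limit error despite the delay, a small retry wrapper can help. This is only a sketch, not part of the DeSpam documentation; the 429 status code and the backoff values are assumptions:
def predict_toxicity_with_retry(comment, max_retries=3):
    # Retry with exponential backoff; assumes the API signals rate
    # limiting with HTTP 429 (a common but not guaranteed convention)
    for attempt in range(max_retries):
        response = requests.post(
            api_url,
            headers={"x-api-key": api_key, "Content-Type": "application/json"},
            json={"input": comment},
        )
        if response.status_code == 429:
            # Wait 2, 4, 8... seconds before trying again
            time.sleep(2 ** (attempt + 1))
            continue
        response.raise_for_status()
        return response.json()
    raise RuntimeError(f"Rate limited after {max_retries} retries")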
df.head()
#Output
id comment toxic indecent threat offensive erotic spam
0 1 I will kick you 0.017216 0.319286 0.007431 0.010667 0.014272 0.154840
1 2 go back 0.052800 0.415785 0.009541 0.030375 0.026540 0.052821
Similarly, let's also prepare the validation dataset.
for index, row in df_validation.iterrows():
    comment = row["comment"]
    # Same 1-second pause as above, to avoid rate limiting
    time.sleep(1)
    scores = predict_toxicity(comment)
    print(f"Progress: {index + 1}/{len(df_validation)}")
    for category in categories:
        df_validation.loc[index, category] = scores[category]
If you want, you can save the results to files, so that you can skip the steps above next time and start directly from the next tutorial.
# index=False keeps the pandas row index out of the saved files
df.to_csv("./training_preprocessed_data.csv", index=False)
df_validation.to_csv("./validation_preprocessed_data.csv", index=False)
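In the next tutorial you can then pick up exactly where we left off by reloading the two files:
# Reload the preprocessed datasets instead of re-running the API calls
df = pd.read_csv("./training_preprocessed_data.csv")
df_validation = pd.read_csv("./validation_preprocessed_data.csv")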