Preparing the dataset for fine-tuning

Objective: Suppose we have two CSV files, comments.csv and validation_data.csv, both containing comments. We want to turn them into a labeled dataset. Manually labeling every comment into the 6 categories is very irritating; I have already tried it :D

id comment
1 get lost
2 you are amazing

So, in this chapter, we are going to use an external API named DeSpam to generate these labels and prepare proper training and validation datasets. BTW, this tutorial is completely optional.

Results:

id  comment          toxic    indecent  threat  offensive  erotic  spam
1   get lost         0.40590  0.6594    ...     ...        ...     0.036
2   you are amazing  0.083    0.33948   ...     ...        ...     0.0094
%%capture
!pip install pandas requests

We will need the pandas library to read the CSV files and prepare the dataset, and the requests library to call the API.

import pandas as pd
df = pd.read_csv("./comments.csv")
df.head()


#Output
 	id 	comment
... 	... 	...
2 	3 	Congratulations you have won, 1 billion dollar...
3 	4 	I will kick your ***

Similarly, we read the validation dataset. The model does not train on this data; it is used during fine-tuning to check how well the model generalizes to comments it has not seen.

df_validation = pd.read_csv("./validation_data.csv")
df_validation.head()
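Optionally, before spending API calls, it can be worth a quick check that neither frame contains missing or empty comments. This is a small sketch; it assumes the column is named comment, as in the files above.

# Count missing or empty comments in both frames before calling the API
for name, frame in [("training", df), ("validation", df_validation)]:
    empty = frame["comment"].isna() | (frame["comment"].str.strip() == "")
    print(f"{name}: {empty.sum()} empty comment(s) out of {len(frame)}")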

As discussed at the beginning of this tutorial, we are going to use an external API to get the classification probabilities.
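Based on how the scores are consumed in the loop below (scores["toxic"], scores["indecent"], and so on), each API call is assumed to return a flat JSON object with one probability per category, roughly like this (the values here are made up for illustration only):

# Assumed shape of a single DeSpam response, inferred from the code below
example_scores = {
    "toxic": 0.40,
    "indecent": 0.65,
    "threat": 0.01,
    "offensive": 0.02,
    "erotic": 0.01,
    "spam": 0.03,
}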

import time
import pandas as pd
import requests

# Replace with your actual API key
api_key = "genai"
api_url = "https://despam.io/api/v1/moderate"

def predict_toxicity(comment):
    headers = {
        "x-api-key": api_key,
        "Content-Type": "application/json"
    }
    data = {"input": comment}
    response = requests.post(api_url, headers=headers, json=data)
    # Log the first 30 characters of the comment and the raw response for debugging
    print(comment[:30], response.text)
    return response.json()


for index, row in df.iterrows():
    comment = row["comment"]
    time.sleep(1)  # wait 1 second between requests to avoid being rate-limited
    print(f"Progress: {index + 1}/{len(df)}")
    scores = predict_toxicity(comment)
    df.loc[index, "toxic"] = scores["toxic"]
    df.loc[index, "indecent"] = scores["indecent"]
    df.loc[index, "threat"] = scores["threat"]
    df.loc[index, "offensive"] = scores["offensive"]
    df.loc[index, "erotic"] = scores["erotic"]
    df.loc[index, "spam"] = scores["spam"]

The loop iterates over the pandas dataframe row by row: it reads the comment text from each row, calls the predict_toxicity function to get the scores, and stores each score in its own column of the dataframe. A 1-second delay between API calls keeps us from being rate-limited by the API. One more thing: you can reuse the API key "genai"; it should work fine.
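The loop above assumes every request succeeds. If the endpoint occasionally fails or rate-limits you anyway, one option is to wrap the call in a small retry helper. This is only a sketch; the retry count and wait time are arbitrary, and predict_toxicity is the function defined above.

def predict_toxicity_with_retry(comment, retries=3, wait_seconds=5):
    # Try the API call a few times before giving up on a comment
    for attempt in range(retries):
        try:
            return predict_toxicity(comment)
        except Exception as error:
            print(f"Attempt {attempt + 1} failed: {error}")
            time.sleep(wait_seconds)
    raise RuntimeError(f"Could not score comment: {comment[:30]}")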

df.head()

#Output
 	id 	comment 	toxic 	indecent 	threat 	offensive 	erotic 	spam
0 	1 	I will kick you 	0.017216 	0.319286 	0.007431 	0.010667 	0.014272 	0.154840
1 	2 	go back 	0.052800 	0.415785 	0.009541 	0.030375 	0.026540 	0.052821

Similarly, let's also prepare the validation dataset.


for index, row in df_validation.iterrows():
    comment = row["comment"]
    time.sleep(1)  # same 1-second pause as above to avoid rate limiting
    print(f"Progress: {index + 1}/{len(df_validation)}")
    scores = predict_toxicity(comment)
    df_validation.loc[index, "toxic"] = scores["toxic"]
    df_validation.loc[index, "indecent"] = scores["indecent"]
    df_validation.loc[index, "threat"] = scores["threat"]
    df_validation.loc[index, "offensive"] = scores["offensive"]
    df_validation.loc[index, "erotic"] = scores["erotic"]
    df_validation.loc[index, "spam"] = scores["spam"]

If you want, you can save the results to files so you don't have to repeat the steps above and can start directly from the next tutorial.

df.to_csv("./training_preprocessed_data.csv")
df_validation.to_csv("validation_preprocessed_data.csv")
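In the next tutorial you can then load these files directly instead of repeating the API calls. Note that to_csv as written above also saves the dataframe index as an extra unnamed column; pass index=False if you would rather not keep it.

# Pick up from here in the next tutorial without re-calling the API
df = pd.read_csv("./training_preprocessed_data.csv")
df_validation = pd.read_csv("validation_preprocessed_data.csv")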

 
