Objective: We have two CSV files, comments.csv and validation_data.csv, and both contain comments. We want to turn them into a labeled dataset. Manually labeling every comment into the 6 categories is very tedious, I have already tried it :D
id | comment |
1 | get lost |
2 | you are amazing |
So, in this chapter, we are going to use an external API named DeSpam to generate these labels and prepare proper training and validation datasets. BTW, this tutorial is completely optional.
Results:
id | comment | toxic | indecent | threat | offensive | erotic | spam |
1 | get lost | 0.40590 | 0.6594 | ... | ... | ... | 0.036 |
2 | you are amazing | 0.083 | 0.33948 | ... | ... | ... | 0.0094 |
%%capture
!pip install pandas requests
We will need the pandas library to read the CSV files and prepare the dataset, and requests to call the API.
import pandas as pd
df = pd.read_csv("./comments.csv")
df.head()
#Output
id comment
.. ...
2 3 Congratulations you have won, 1 billion dollar...
3 4 I will kick your ***
Similarly, we should read the validation dataset. The LLM will later use it to validate what it has learned on comments it has not seen during training.
df_validation = pd.read_csv("./validation_data.csv")
df_validation.head()
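Before spending API calls on the full files, a quick sanity check (optional, a minimal sketch) confirms both files have the expected comment column and shows how many rows we are about to label:
# Optional sanity check: confirm the columns and row counts
print(df.columns.tolist(), df_validation.columns.tolist())
print(len(df), "training comments |", len(df_validation), "validation comments")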
As discussed at the beginning of this tutorial, we are going to use an external API to get the classification probabilities.
import time

import pandas as pd
import requests

# Replace with your actual API key
api_key = "genai"
api_url = "https://despam.io/api/v1/moderate"

def predict_toxicity(comment):
    headers = {
        "x-api-key": api_key,
        "Content-Type": "application/json",
    }
    data = {"input": comment}
    response = requests.post(api_url, headers=headers, json=data)
    # Fail fast on HTTP errors instead of silently parsing an error body
    response.raise_for_status()
    # Log the first 30 characters of the comment next to the raw response
    print(data.get("input")[:30], response.text)
    return response.json()
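Before looping over the whole dataframe, it helps to try the function on a single comment. The exact response fields are an assumption here; they should match the six categories we store below:
# Single-call smoke test (assumes the API returns one probability per
# category, e.g. {"toxic": ..., "indecent": ..., "threat": ...,
# "offensive": ..., "erotic": ..., "spam": ...})
sample_scores = predict_toxicity("you are amazing")
print(sample_scores)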
categories = ["toxic", "indecent", "threat", "offensive", "erotic", "spam"]

for index, row in df.iterrows():
    comment = row["comment"]
    # Pause between calls to avoid being rate limited by the API
    time.sleep(1)
    print(f"Progress: {index + 1}/{len(df)}")
    scores = predict_toxicity(comment)
    # Store each category score in its own column
    for category in categories:
        df.loc[index, category] = scores[category]
The code iterates over the pandas dataframe: for each row, it retrieves the comment text, calls the predict_toxicity function to get the scores, and stores each score in its own column. The 1-second delay between API calls helps avoid being rate limited by the API. One more thing: you can reuse the api_key="genai"; it should work fine.
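If the API still returns an occasional rate-limit error despite the delay, a small retry wrapper can help. This is only a sketch, not part of the DeSpam documentation; the 429 status code and the backoff values are assumptions:
def predict_toxicity_with_retry(comment, max_retries=3):
    # Retry with exponential backoff; assumes the API signals rate
    # limiting with HTTP 429 (a common but not guaranteed convention)
    for attempt in range(max_retries):
        response = requests.post(
            api_url,
            headers={"x-api-key": api_key, "Content-Type": "application/json"},
            json={"input": comment},
        )
        if response.status_code == 429:
            # Wait 2, 4, 8... seconds before trying again
            time.sleep(2 ** (attempt + 1))
            continue
        response.raise_for_status()
        return response.json()
    raise RuntimeError(f"Rate limited after {max_retries} retries")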
df.head()
#Output
id comment toxic indecent threat offensive erotic spam
0 1 I will kick you 0.017216 0.319286 0.007431 0.010667 0.014272 0.154840
1 2 go back 0.052800 0.415785 0.009541 0.030375 0.026540 0.052821
Similarly, let's also prepare the validation dataset.
for index, row in df_validation.iterrows():
    comment = row["comment"]
    # Same 1-second pause as above, to avoid rate limiting
    time.sleep(1)
    scores = predict_toxicity(comment)
    print(f"Progress: {index + 1}/{len(df_validation)}")
    for category in categories:
        df_validation.loc[index, category] = scores[category]
If you want, you can save the results to files, so that you can skip the steps above next time and start directly from the next tutorial.
# index=False keeps the pandas row index out of the saved files
df.to_csv("./training_preprocessed_data.csv", index=False)
df_validation.to_csv("./validation_preprocessed_data.csv", index=False)
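In the next tutorial you can then pick up exactly where we left off by reloading the two files:
# Reload the preprocessed datasets instead of re-running the API calls
df = pd.read_csv("./training_preprocessed_data.csv")
df_validation = pd.read_csv("./validation_preprocessed_data.csv")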