So I decided to take another route and fine-tune my Hugging Face model, following this site:
It all worked well until I hit this error:
Traceback (most recent call last):
  File "c:\Users\Philip Chen\Documents\AICrowd\llm-practive\qna-bot5.py", line 53, in <module>
    model.fit(
  File "C:\Users\Philip Chen\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\modeling_tf_utils.py", line 1229, in fit
    return super().fit(*args, **kwargs)
  File "C:\Users\Philip Chen\AppData\Local\Programs\Python\Python310\lib\site-packages\tf_keras\src\utils\traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Users\Philip Chen\AppData\Local\Programs\Python\Python310\lib\site-packages\tf_keras\src\engine\data_adapter.py", line 1314, in __init__
    raise ValueError("Expected input data to be non-empty.")
ValueError: Expected input data to be non-empty.
It comes from the offending line marked in the code below:
# Load a data set
from transformers import TFDistilBertForSequenceClassification, DistilBertTokenizerFast
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import tensorflow as tf
import keras as tk
from transformers import TFDistilBertForSequenceClassification
# Preparation
model_name="google-bert/bert-base-cased"
fileName="data.json"
# dataset = load_dataset("json", data_files=fileName, split="train")
# model_save_path = './model'
# questions=dataset["input"]
df = pd.read_json(fileName, lines=True)
df=df[['input']]
#Work
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
is_train = np.random.uniform(size = len(df))<0.8
train_raw = (tf.data.Dataset.
from_tensor_slices((dict(tokenizer(list(df['input'][is_train]), padding = True, truncation = True)),np.array(df['input'])[is_train ])).
shuffle(len(df)).
batch(64,drop_remainder = True)
)
train_raw = train_raw.prefetch(1)  # prefetch() returns a new dataset, so keep the result
test_raw = (tf.data.Dataset.
from_tensor_slices((dict(tokenizer(list(df['input'][~is_train]), padding = True, truncation = True)),np.array(df['input'])[~is_train ])).
shuffle(len(df)).
batch(64, drop_remainder = True)
)
print(f"Data: {len(test_raw)}") # Got 0
test_raw = test_raw.prefetch(1)  # prefetch() returns a new dataset, so keep the result
num_epochs = 3
adam = tf.keras.optimizers.Adam()
model.compile(
optimizer=tf.keras.optimizers.Adam(5e-5),
metrics=["accuracy"],
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True)
)
model.fit(
train_raw,
validation_data=test_raw, #<<-- OFFENDING LINE
epochs = num_epochs
)
model.evaluate(test_raw)
JSON file (data.json):
{"input":"What is the tallest mountain in the world?"}
{"input":"What is the oldest civilization in history?"}
{"input":"What was Marie Antoinette's most famouse quote?"}
{"input":"Extract the keyphrase from this review.\n Review: The food in Lau Pa Sat is delicious, albeit expensive."}
{"input":"Where is Paul McCartney now?"}
{"input":"What are the benefits of orange?"}
{"input":"I have cockroach problems at home. How do I solve this?"}
{"input":"Who assassinated John F. Kennedy?"}
{"input":"Summarize Matthew 4:3-12."}
{"input":"Extract the keyphrase from this review.\n Review: The Singapore Chilli Crab in this seafood restaurant is amazing! Definetely recommend!"}
{"input":"List some good places to jog in Singapore."}
{"input":"Compare the A350 to the 777."}
{"input":"Tell me the stocks of Intel and AMD."}
{"input":"Any good places to hike in Japan?"}
{"input":"When was the Tower of London built?"}
{"input":"Extract the keyphrase from this review.\n Review: This Z790 motherboard is a total beauty, but extremely overpriced!"}
{"input":"Where is the largest archipelago located"}
{"input":"When was the Statue of Liberty built?"}
{"input":"Who wrote Les Miserables?"}
{"input":"Who sang 'American Pie'?"}
{"input":"What is the current population in Singapore?"}
{"input":"When is Sakura season in Japan?"}
{"input":"How old is the Westminster Abbey?"}
{"input":"Where is the best place to celebrate Oktoberfest in Germany?"}
{"input":"How to go from Changi Airport to the city?"}
{"input":"How long does it take to go from Tokyo to Fukuoka by train?"}
I printed the length of test_raw and it's 0 for some odd reason.
Did I do something wrong here?
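One thing worth checking is the batch arithmetic. Assuming data.json holds only the 26 rows shown above, batch(64, drop_remainder=True) can never fill a complete batch, and with drop_remainder=True an incomplete batch is discarded, which would leave the dataset with length 0. The sketch below reproduces that count in plain Python; num_batches is a hypothetical helper written for illustration, not a TensorFlow function.

```python
import math


def num_batches(num_examples: int, batch_size: int, drop_remainder: bool) -> int:
    """Count the batches a finite dataset yields after batching.

    Hypothetical helper: it mirrors the documented behaviour of
    tf.data.Dataset.batch, where drop_remainder=True discards a final
    batch that has fewer than batch_size elements.
    """
    if drop_remainder:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)


total_rows = 26  # assumption: the JSON file contains only the rows shown above
batch_size = 64  # the value passed to batch() in the code above

# Even if every row landed in a single split, no full batch of 64 exists.
print(num_batches(total_rows, batch_size, drop_remainder=True))   # 0
print(num_batches(total_rows, batch_size, drop_remainder=False))  # 1
```

If this is the cause, the same applies to train_raw, since roughly 80% of 26 rows is still far below 64. A smaller batch size, or drop_remainder=False, would be the first thing to try.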