Date: 2024-05-10

So I decided to take another route to fine-tune my Hugging Face model, and I went to this site:

Fine-tuning Hugging Face Transformer using Keras API (TF) (Medium, 22 Jul 23): "Fine-tuning is a way to train your models faster, we take a pre-trained model and like tuning guitar we make minor tweaks in the weights…"

and it all worked well until I hit this error:

Traceback (most recent call last):
  File "c:\Users\Philip Chen\Documents\AICrowd\llm-practive\qna-bot5.py", line 53, in <module>
    model.fit(
  File "C:\Users\Philip Chen\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\modeling_tf_utils.py", line 1229, in fit
    return super().fit(*args, **kwargs)
  File "C:\Users\Philip Chen\AppData\Local\Programs\Python\Python310\lib\site-packages\tf_keras\src\utils\traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Users\Philip Chen\AppData\Local\Programs\Python\Python310\lib\site-packages\tf_keras\src\engine\data_adapter.py", line 1314, in __init__
    raise ValueError("Expected input data to be non-empty.")
ValueError: Expected input data to be non-empty.

It comes from the offending line marked in the code below:

# Load a data set
from transformers import TFDistilBertForSequenceClassification, DistilBertTokenizerFast
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import tensorflow as tf
import keras as tk
from transformers import TFDistilBertForSequenceClassification

# Preparation
model_name = "google-bert/bert-base-cased"
fileName = "data.json"
# dataset = load_dataset("json", data_files=fileName, split="train")
# model_save_path = './model'
# questions=dataset["input"]
df = pd.read_json(fileName, lines=True)
df = df[['input']]

# Work
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

is_train = np.random.uniform(size=len(df)) < 0.8

train_raw = (
    tf.data.Dataset
    .from_tensor_slices((
        dict(tokenizer(list(df['input'][is_train]), padding=True, truncation=True)),
        np.array(df['input'])[is_train],
    ))
    .shuffle(len(df))
    .batch(64, drop_remainder=True)
)
train_raw.prefetch(1)

test_raw = (
    tf.data.Dataset
    .from_tensor_slices((
        dict(tokenizer(list(df['input'][~is_train]), padding=True, truncation=True)),
        np.array(df['input'])[~is_train],
    ))
    .shuffle(len(df))
    .batch(64, drop_remainder=True)
)
print(f"Data: {len(test_raw)}")  # Got 0
test_raw.prefetch(1)

num_epochs = 3
adam = tf.keras.optimizers.Adam()

model.compile(
    optimizer=tf.keras.optimizers.Adam(5e-5),
    metrics=["accuracy"],
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

model.fit(
    train_raw,
    validation_data=test_raw,  # <<-- OFFENDING LINE
    epochs=num_epochs,
)

model.evaluate(test_raw)

JSON file:

{"input":"What is the tallest mountain in the world?"}
{"input":"What is the oldest civilization in history?"}
{"input":"What was Marie Antoinette's most famouse quote?"}
{"input":"Extract the keyphrase from this review.\n\nReview: The food in Lau Pa Sat is delicious, albeit expensive."}
{"input":"Where is Paul McCartney now?"}
{"input":"What are the benefits of orange?"}
{"input":"I have cockroach problems at home. How do I solve this?"}
{"input":"Who assassinated John F. Kennedy?"}
{"input":"Summarize Matthew 4:3-12."}
{"input":"Extract the keyphrase from this review.\n\nReview: The Singapore Chilli Crab in this seafood restaurant is amazing! Definetely recommend!"}
{"input":"List some good places to jog in Singapore."}
{"input":"Compare the A350 to the 777."}
{"input":"Tell me the stocks of Intel and AMD."}
{"input":"Any good places to hike in Japan?"}
{"input":"When was the Tower of London built?"}
{"input":"Extract the keyphrase from this review.\n\nReview: This Z790 motherboard is a total beauty, but extremely overpriced!"}
{"input":"Where is the largest archipelago located"}
{"input":"When was the Statue of Liberty built?"}
{"input":"Who wrote Les Miserables?"}
{"input":"Who sang 'American Pie'?"}
{"input":"What is the current population in Singapore?"}
{"input":"When is Sakura season in Japan?"}
{"input":"How old is the Westminster Abbey?"}
{"input":"Where is the best place to celebrate Oktoberfest in Germany?"}
{"input":"How to go from Changi Airport to the city?"}
{"input":"How long does it take to go from Tokyo to Fukuoka by train?"}

I printed the length of test_raw (the `print` line in the code above) and it's 0 for some odd reason.
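One thing that may be worth checking is the batch arithmetic. The sketch below is only an illustration, not something verified against the actual run: it assumes the file holds the 26 records listed above and that `batch(64, drop_remainder=True)` keeps only full batches of 64. The helper `batch_count` is a hypothetical name introduced here, and the split sizes 21 and 5 are made-up examples of what a random 80/20 split could produce.

```python
import math


def batch_count(num_rows: int, batch_size: int, drop_remainder: bool) -> int:
    """Hypothetical helper: how many batches a dataset of num_rows yields."""
    if drop_remainder:
        # Only full batches are kept; a partial final batch is discarded.
        return num_rows // batch_size
    # Otherwise the partial final batch is kept as well.
    return math.ceil(num_rows / batch_size)


total_rows = 26  # assumption: the number of records in the JSON file above
batch_size = 64  # the value passed to .batch() in the script

# 26 is the whole file; 21 and 5 are illustrative train/test split sizes.
for split_rows in (total_rows, 21, 5):
    dropped = batch_count(split_rows, batch_size, drop_remainder=True)
    kept = batch_count(split_rows, batch_size, drop_remainder=False)
    print(f"rows={split_rows} drop_remainder=True -> {dropped}, False -> {kept}")
```

Under these assumptions, any split with fewer than 64 rows gives zero batches when `drop_remainder=True`, which would apply to `train_raw` as well as `test_raw`.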

Did I do something wrong here?