Getting ValueError: Expected input data to be non-empty

chenphilip14 · May 10, 2024, 3:25am

I decided to try another route to fine-tune my Hugging Face model, so I followed the tutorial on this site:

It all worked well until I hit this error:

Traceback (most recent call last):
  File "c:\Users\Philip Chen\Documents\AICrowd\llm-practive\qna-bot5.py", line 53, in <module>
    model.fit(
  File "C:\Users\Philip Chen\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\modeling_tf_utils.py", line 1229, in fit
    return super().fit(*args, **kwargs)
  File "C:\Users\Philip Chen\AppData\Local\Programs\Python\Python310\lib\site-packages\tf_keras\src\utils\traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Users\Philip Chen\AppData\Local\Programs\Python\Python310\lib\site-packages\tf_keras\src\engine\data_adapter.py", line 1314, in __init__
    raise ValueError("Expected input data to be non-empty.")
ValueError: Expected input data to be non-empty.

It comes from the offending line marked in the code below:

# Load a data set
from transformers import TFDistilBertForSequenceClassification, DistilBertTokenizerFast
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import tensorflow as tf
import keras as tk
# Preparation
model_name="google-bert/bert-base-cased"
fileName="data.json"
# dataset = load_dataset("json", data_files=fileName, split="train")
# model_save_path = './model'
# questions=dataset["input"]
df = pd.read_json(fileName, lines=True)
df=df[['input']]

#Work
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
is_train = np.random.uniform(size = len(df))<0.8

train_raw = (tf.data.Dataset.
             from_tensor_slices((dict(tokenizer(list(df['input'][is_train]), padding = True, truncation = True)),np.array(df['input'])[is_train ])).
             shuffle(len(df)).
             batch(64,drop_remainder = True)
            )

train_raw = train_raw.prefetch(1)  # prefetch returns a new dataset, so keep the result

test_raw = (tf.data.Dataset.
            from_tensor_slices((dict(tokenizer(list(df['input'][~is_train]), padding = True, truncation = True)),np.array(df['input'])[~is_train ])).
            shuffle(len(df)).
            batch(64, drop_remainder = True)
            )

print(f"Data: {len(test_raw)}")    # Got 0

test_raw = test_raw.prefetch(1)  # prefetch returns a new dataset, so keep the result

num_epochs = 3

adam = tf.keras.optimizers.Adam()

model.compile(
    optimizer=tf.keras.optimizers.Adam(5e-5),
    metrics=["accuracy"],
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True)
)
model.fit(
    train_raw,
    validation_data=test_raw,   #<<-- OFFENDING LINE
    epochs = num_epochs
)

model.evaluate(test_raw)

JSON file:

{"input":"What is the tallest mountain in the world?"}
{"input":"What is the oldest civilization in history?"}
{"input":"What was Marie Antoinette's most famouse quote?"}
{"input":"Extract the keyphrase from this review.\n Review: The food in Lau Pa Sat is delicious, albeit expensive."}
{"input":"Where is Paul McCartney now?"}
{"input":"What are the benefits of orange?"}
{"input":"I have cockroach problems at home. How do I solve this?"}
{"input":"Who assassinated John F. Kennedy?"}
{"input":"Summarize Matthew 4:3-12."}
{"input":"Extract the keyphrase from this review.\n Review: The Singapore Chilli Crab in this seafood restaurant is amazing! Definetely recommend!"}
{"input":"List some good places to jog in Singapore."}
{"input":"Compare the A350 to the 777."}
{"input":"Tell me the stocks of Intel and AMD."}
{"input":"Any good places to hike in Japan?"}
{"input":"When was the Tower of London built?"}
{"input":"Extract the keyphrase from this review.\n Review: This Z790 motherboard is a total beauty, but extremely overpriced!"}
{"input":"Where is the largest archipelago located"}
{"input":"When was the Statue of Liberty built?"}
{"input":"Who wrote Les Miserables?"}
{"input":"Who sang 'American Pie'?"}
{"input":"What is the current population in Singapore?"}
{"input":"When is Sakura season in Japan?"}
{"input":"How old is the Westminster Abbey?"}
{"input":"Where is the best place to celebrate Oktoberfest in Germany?"}
{"input":"How to go from Changi Airport to the city?"}
{"input":"How long does it take to go from Tokyo to Fukuoka by train?"}

I printed the length of test_raw and it’s 0 for some odd reason.
Did I do something wrong here?

Zensei · May 16, 2024, 5:08pm

You are not doing anything wrong. The author of the tutorial is wrong.
He claimed that the above returns an 80/20 split of the dataset, which is incorrect. The result is, wait for it … random. Each row is assigned to the training set independently with probability 0.8, so the split is 80/20 only on average.

Do a small test and you’ll see that this is the case. Every time you run it, the counts come out different:

# df is the DataFrame loaded in the question (needs numpy imported as np)
is_train = np.random.uniform(size=len(df)) < 0.8
n_true = int(is_train.sum())  # number of rows that landed in the training set
print(f"Dataset total: {len(is_train)}")
print(f"Dataset true: {n_true}")
print(f"Dataset false: {len(is_train) - n_true}")
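For comparison, here is a minimal sketch in plain Python (not taken from the tutorial) of a split whose sizes are fixed instead of random, together with the batch-count arithmetic implied by batch(64, drop_remainder=True). The helper names split_indices and num_batches are hypothetical. That this arithmetic explains the 0 printed for len(test_raw) is an inference from the 26 rows shown in the question, not something confirmed in the thread.

```python
import random

def split_indices(n_rows, train_fraction=0.8, seed=0):
    """Shuffle row indices once and cut at a fixed point, so the sizes are exact."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(n_rows * train_fraction)
    return indices[:cut], indices[cut:]

def num_batches(n_rows, batch_size, drop_remainder):
    """Number of batches produced when n_rows are grouped into batch_size chunks."""
    full, leftover = divmod(n_rows, batch_size)
    if drop_remainder or leftover == 0:
        return full          # an incomplete final batch is discarded
    return full + 1          # an incomplete final batch is kept

train_idx, test_idx = split_indices(26)   # 26 rows, as in the JSON file above
print(len(train_idx), len(test_idx))      # 20 6
print(num_batches(len(test_idx), 64, drop_remainder=True))   # 0
print(num_batches(len(test_idx), 64, drop_remainder=False))  # 1
```

With a pandas DataFrame, the two index lists could be passed to df.iloc[...]. train_test_split from sklearn, which the question imports but never uses, is another way to get a fixed-size split. Note that under this arithmetic the training set would also produce zero batches, since 20 rows is still fewer than 64.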

One more thing. The tutorial is doing text classification. Your data is missing a label column.
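As a rough illustration of what a label column could look like, the sketch below attaches a hypothetical "label" field to a few questions from the file and maps the label names to integer ids, which is the form SparseCategoricalCrossentropy expects. The category names are invented for this example. The real labels depend on what the classifier is supposed to predict.

```python
# Hypothetical labeled rows: the "label" values are invented for illustration.
rows = [
    {"input": "What is the tallest mountain in the world?", "label": "question"},
    {"input": "Summarize Matthew 4:3-12.", "label": "summarize"},
    {"input": "Compare the A350 to the 777.", "label": "compare"},
    {"input": "Who wrote Les Miserables?", "label": "question"},
]

# Give each distinct label name an integer id (sorted so the mapping is stable).
label_names = sorted({row["label"] for row in rows})
label_to_id = {name: i for i, name in enumerate(label_names)}

texts = [row["input"] for row in rows]
label_ids = [label_to_id[row["label"]] for row in rows]

print(label_to_id)  # {'compare': 0, 'question': 1, 'summarize': 2}
print(label_ids)    # [1, 2, 0, 1]
```

In the question's code, integer ids like these would take the place of np.array(df['input'])[is_train], which currently passes the question text itself as the label. The model would presumably also need to be loaded with num_labels set to the number of categories.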