Getting ValueError: Expected input data to be non-empty

So I decided to go another route and fine-tune my Hugging Face model, so I went to this site:

It all worked well until I hit this error:

Traceback (most recent call last):
  File "c:\Users\Philip Chen\Documents\AICrowd\llm-practive\qna-bot5.py", line 53, in <module>
    model.fit(
  File "C:\Users\Philip Chen\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\modeling_tf_utils.py", line 1229, in fit
    return super().fit(*args, **kwargs)
  File "C:\Users\Philip Chen\AppData\Local\Programs\Python\Python310\lib\site-packages\tf_keras\src\utils\traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Users\Philip Chen\AppData\Local\Programs\Python\Python310\lib\site-packages\tf_keras\src\engine\data_adapter.py", line 1314, in __init__
    raise ValueError("Expected input data to be non-empty.")
ValueError: Expected input data to be non-empty.

It comes from the offending line marked in the code below:

# Load a data set
from transformers import TFDistilBertForSequenceClassification, DistilBertTokenizerFast
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import tensorflow as tf
# Preparation
model_name="google-bert/bert-base-cased"
fileName="data.json"
# dataset = load_dataset("json", data_files=fileName, split="train")
# model_save_path = './model'
# questions=dataset["input"]
df = pd.read_json(fileName, lines=True)
df=df[['input']]

#Work
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
is_train = np.random.uniform(size = len(df))<0.8

train_raw = (tf.data.Dataset.
             from_tensor_slices((dict(tokenizer(list(df['input'][is_train]), padding = True, truncation = True)),np.array(df['input'])[is_train ])).
             shuffle(len(df)).
             batch(64,drop_remainder = True)
            )

train_raw = train_raw.prefetch(1)  # prefetch returns a new dataset, so reassign it

test_raw = (tf.data.Dataset.
            from_tensor_slices((dict(tokenizer(list(df['input'][~is_train]), padding = True, truncation = True)),np.array(df['input'])[~is_train ])).
            shuffle(len(df)).
            batch(64, drop_remainder = True)
            )

print(f"Data: {len(test_raw)}")    # Got 0

test_raw = test_raw.prefetch(1)  # prefetch returns a new dataset, so reassign it

num_epochs = 3

adam = tf.keras.optimizers.Adam()

model.compile(
    optimizer=tf.keras.optimizers.Adam(5e-5),
    metrics=["accuracy"],
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True)
)
model.fit(
    train_raw,
    validation_data=test_raw,   #<<-- OFFENDING LINE
    epochs = num_epochs
)

model.evaluate(test_raw)

JSON file:

{"input":"What is the tallest mountain in the world?"}
{"input":"What is the oldest civilization in history?"}
{"input":"What was Marie Antoinette's most famouse quote?"}
{"input":"Extract the keyphrase from this review.\n Review: The food in Lau Pa Sat is delicious, albeit expensive."}
{"input":"Where is Paul McCartney now?"}
{"input":"What are the benefits of orange?"}
{"input":"I have cockroach problems at home. How do I solve this?"}
{"input":"Who assassinated John F. Kennedy?"}
{"input":"Summarize Matthew 4:3-12."}
{"input":"Extract the keyphrase from this review.\n Review: The Singapore Chilli Crab in this seafood restaurant is amazing! Definetely recommend!"}
{"input":"List some good places to jog in Singapore."}
{"input":"Compare the A350 to the 777."}
{"input":"Tell me the stocks of Intel and AMD."}
{"input":"Any good places to hike in Japan?"}
{"input":"When was the Tower of London built?"}
{"input":"Extract the keyphrase from this review.\n Review: This Z790 motherboard is a total beauty, but extremely overpriced!"}
{"input":"Where is the largest archipelago located"}
{"input":"When was the Statue of Liberty built?"}
{"input":"Who wrote Les Miserables?"}
{"input":"Who sang 'American Pie'?"}
{"input":"What is the current population in Singapore?"}
{"input":"When is Sakura season in Japan?"}
{"input":"How old is the Westminster Abbey?"}
{"input":"Where is the best place to celebrate Oktoberfest in Germany?"}
{"input":"How to go from Changi Airport to the city?"}
{"input":"How long does it take to go from Tokyo to Fukuoka by train?"}

I printed the length of test_raw and it's 0 for some odd reason.
Did I do something wrong here?


You are not doing anything wrong; the author of the tutorial is.
He claimed that the line above returns an 80/20 split of the dataset, which is incorrect. What it returns is, wait for it… random.

Do a small test and you'll see that this is the case. Every time you run it, you'll get a different split:

is_train = np.random.uniform(size=len(df)) < 0.8
filtered_true = is_train[is_train]
print(f"Dataset total: {len(is_train)}")
print(f"Dataset true: {len(filtered_true)}")
print(f"Dataset false: {len(is_train) - len(filtered_true)}")

One more thing: the tutorial is doing text classification, but your data is missing a label column, so you are currently passing the input strings themselves as labels.
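
For reference, here is a rough sketch of what labelled data and the dataset construction could look like; the "question"/"instruction" labels and the variable names are made up for illustration:

# Each JSON line needs a label, e.g.:
# {"input": "What is the tallest mountain in the world?", "label": "question"}
# {"input": "Summarize Matthew 4:3-12.", "label": "instruction"}
df = pd.read_json(fileName, lines=True)

# Map string labels to integer ids for SparseCategoricalCrossentropy
label2id = {name: i for i, name in enumerate(sorted(df['label'].unique()))}
labels = df['label'].map(label2id).to_numpy()

encodings = tokenizer(list(df['input']), padding=True, truncation=True)
dataset = (tf.data.Dataset
           .from_tensor_slices((dict(encodings), labels))
           .shuffle(len(df))
           .batch(8, drop_remainder=False))

# The classification head must match the number of classes
model = TFDistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=len(label2id))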