This issue has been posted a handful of times on SO, but I still can't figure out what the problem with my code is, especially because it comes from a tutorial on Medium and the author makes the code available on Google Colab.

I have seen other users having problems with wrong variable types #56304986 (which is not my case, as my model input is the output of the tokenizer), and I have even seen the function I am trying to use (tf.data.Dataset.from_tensor_slices) being suggested as a solution #56304986.

The line yielding the error is:

# train dataset
ds_train_encoded = encode_examples(ds_train).shuffle(10000).batch(batch_size)

where the method encode_examples is defined as follows (I have inserted an assert line into the encode_examples method to be sure my problem was not mismatched lengths):

def encode_examples(ds, limit=-1):
    # prepare list, so that we can build up final TensorFlow dataset from slices.
    input_ids_list = []
    token_type_ids_list = []
    attention_mask_list = []
    label_list = []
    if (limit > 0):
        ds = ds.take(limit)

    for review, label in tfds.as_numpy(ds):
        bert_input = convert_example_to_feature(review.decode())

        ii = bert_input['input_ids']
        tti = bert_input['token_type_ids']
        am = bert_input['attention_mask']

        assert len(ii) == len(tti) == len(am), "mismatched lengths!"

        input_ids_list.append(ii)
        token_type_ids_list.append(tti)
        attention_mask_list.append(am)
        label_list.append([label])

    return tf.data.Dataset.from_tensor_slices((input_ids_list, attention_mask_list, token_type_ids_list, label_list)).map(map_example_to_dict)

The data is loaded like this (here I changed the dataset to get only 10% of the training data so I could speed up debugging):

(ds_train, ds_test), ds_info = tfds.load('imdb_reviews', split = ['train[:10%]','test[10%:15%]'], as_supervised=True, with_info=True)

And the other two calls (convert_example_to_feature and map_example_to_dict) and the tokenizer are as follows:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
def convert_example_to_feature(text):
    # combine step for tokenization, WordPiece vector mapping, adding special tokens as well as truncating reviews longer than the max length
    return tokenizer.encode_plus(text,
                                 add_special_tokens = True, # add [CLS], [SEP]
                                 #max_length = max_length, # max length of the text that can go to BERT
                                 pad_to_max_length = True, # add [PAD] tokens
                                 return_attention_mask = True,)# add attention mask to not focus on pad tokens

def map_example_to_dict(input_ids, attention_masks, token_type_ids, label):
    return ({"input_ids": input_ids,
            "token_type_ids": token_type_ids,
            "attention_mask": attention_masks,
            }, label)

I suspect the error might have something to do with different versions of TensorFlow (I am using 2.3), but unfortunately I couldn't run the snippets in the Google Colab notebook for memory reasons.

Does anyone know what the problem with my code is? Thanks for your time and attention.

2 Answers

It turns out that I had caused the trouble by commenting out the line

#max_length = max_length, # max length of the text that can go to BERT

I assumed it would truncate at the model's max size, or that it would take the longest input as the max size. It does neither, so even though I have the same number of entries, those entries vary in length, generating a non-rectangular tensor.

I've removed the # and am using 512 as max_length, which is the maximum that BERT accepts anyway (see the Transformers tokenizer class for reference).
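
For reference, this is what the corrected call looks like with the line uncommented (a minimal sketch of my snippet above; with max_length set, every review is padded or truncated to the same length, so the lists passed to tf.data.Dataset.from_tensor_slices are rectangular):

def convert_example_to_feature(text):
    return tokenizer.encode_plus(text,
                                 add_special_tokens = True,    # add [CLS], [SEP]
                                 max_length = 512,             # the maximum BERT accepts
                                 pad_to_max_length = True,     # add [PAD] tokens up to max_length
                                 return_attention_mask = True) # mask so attention skips [PAD] tokens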

One other possible cause is that truncation needs to be explicitly enabled in the tokenizer. The parameter is truncation = True.
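
For example, a minimal sketch of such a call (this assumes a recent Transformers version, where padding = 'max_length' replaces the deprecated pad_to_max_length):

bert_input = tokenizer.encode_plus(text,
                                   add_special_tokens = True,
                                   max_length = 512,
                                   truncation = True,            # explicitly truncate reviews longer than max_length
                                   padding = 'max_length',       # pad shorter reviews up to max_length
                                   return_attention_mask = True)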
