Numpy Array of tensorflow.keras.preprocessing.text.Tokenizer.texts_to_sequences is giving weird output, list([2]) instead of [[2]]

Question

Wrapping the output of tensorflow.keras.preprocessing.text.Tokenizer.texts_to_sequences in np.array gives an unexpected result for the training labels, as shown below:

(training_label_list[0:10]) = [list([1]) list([1]) list([1]) list([1]) list([1]) list([1]) list([1]) list([1]) list([1]) list([1])]

but it prints a normal array for the validation labels:

(validation_label_list[0:10]) = [[16]
 [16]
 [16]
 [16]
 [16]
 [16]
 [16]
 [16]
 [16]
 [16]]

In other words, type(training_label_list[0]) = <class 'list'> but

type(validation_label_list[0]) =  <class 'numpy.ndarray'>

Consequently, training the model with Keras Model.fit fails with the error below:

ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).

The error can be reproduced in this Google Colab.

Complete Code to reproduce the Error is given below:

!pip install tensorflow==2.1

# For Preprocessing the Text => To Tokenize the Text
from tensorflow.keras.preprocessing.text import Tokenizer
# If the Two Articles are of different length, pad_sequences will make the length equal
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Package for performing Numerical Operations
import numpy as np

Unique_Labels_List = ['India', 'USA', 'Australia', 'Germany', 'Bhutan', 'Nepal', 'New Zealand', 'Israel', 'Canada', 'France', 'Ireland', 'Poland', 'Egypt', 'Greece', 'China', 'Spain', 'Mexico']


Train_Labels = Unique_Labels_List[0:14]
#print('Train Labels = {}'.format(Train_Labels))

Val_Labels =  Unique_Labels_List[14:]
#print('Val_Labels = {}'.format(Val_Labels))

No_Of_Train_Items = [248, 200, 200, 218, 248, 248, 249, 247, 220, 200, 200, 211, 224, 209]
No_Val_Items = [212, 200, 219]

T_L = []
for Each_Label, Item in zip(Train_Labels, No_Of_Train_Items):
    T_L.append([Each_Label] * Item)

T_L = [item for sublist in T_L for item in sublist]

V_L = []
for Each_Label, Item in zip(Val_Labels, No_Val_Items):
    V_L.append([Each_Label] * Item)

V_L = [item for sublist in V_L for item in sublist]


len(T_L)

len(V_L)

label_tokenizer = Tokenizer()

label_tokenizer.fit_on_texts(Unique_Labels_List)

# Since it should be a Numpy Array, we should Convert the Sequences to Numpy Array, for both Training and 
# Test Labels

training_label_list = np.array(label_tokenizer.texts_to_sequences(T_L))

validation_label_list = np.array(label_tokenizer.texts_to_sequences(V_L))

print('(training_label_list[0:10]) = {}'.format((training_label_list[0:10])))
print('(validation_label_list[0:10]) = {}'.format((validation_label_list[0:10])))

print('type(training_label_list[0]) = ', type(training_label_list[0]))
print('type(validation_label_list[0]) = ', type(validation_label_list[0]))

I would be grateful if someone could suggest how to get both the training labels and the validation labels in the same format, as I have spent a lot of time on this.

RakTheGeek · Accepted Answer · 2020-02-16 07:02:21Z

Replacing np.array with np.hstack as mentioned in this Stack Overflow Answer has fixed that problem for me.

Now, the Correct Output is

(training_label_list[0:10]) = [1 1 1 1 1 1 1 1 1 1]
(validation_label_list[0:10]) = [16 16 16 16 16 16 16 16 16 16]
type(training_label_list[0]) =  <class 'numpy.int64'>
type(validation_label_list[0]) =  <class 'numpy.int64'>

The working code is in this Google Colab.

The same code is reproduced below in case the link stops working:

!pip install tensorflow==2.1

# For Preprocessing the Text => To Tokenize the Text
from tensorflow.keras.preprocessing.text import Tokenizer
# If the Two Articles are of different length, pad_sequences will make the length equal
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Package for performing Numerical Operations
import numpy as np

Unique_Labels_List = ['India', 'USA', 'Australia', 'Germany', 'Bhutan', 'Nepal', 'New Zealand', 'Israel', 'Canada', 'France', 'Ireland', 'Poland', 'Egypt', 'Greece', 'China', 'Spain', 'Mexico']


Train_Labels = Unique_Labels_List[0:14]
#print('Train Labels = {}'.format(Train_Labels))

Val_Labels =  Unique_Labels_List[14:]
#print('Val_Labels = {}'.format(Val_Labels))

No_Of_Train_Items = [248, 200, 200, 218, 248, 248, 249, 247, 220, 200, 200, 211, 224, 209]
No_Val_Items = [212, 200, 219]

T_L = []
for Each_Label, Item in zip(Train_Labels, No_Of_Train_Items):
    T_L.append([Each_Label] * Item)

T_L = [item for sublist in T_L for item in sublist]

V_L = []
for Each_Label, Item in zip(Val_Labels, No_Val_Items):
    V_L.append([Each_Label] * Item)

V_L = [item for sublist in V_L for item in sublist]


len(T_L)

len(V_L)

label_tokenizer = Tokenizer()

label_tokenizer.fit_on_texts(Unique_Labels_List)

# Since it should be a Numpy Array, we should Convert the Sequences to Numpy Array, for both Training and 
# Test Labels

training_label_list = np.hstack(label_tokenizer.texts_to_sequences(T_L))

validation_label_list = np.hstack(label_tokenizer.texts_to_sequences(V_L))

print('(training_label_list[0:10]) = {}'.format((training_label_list[0:10])))
print('(validation_label_list[0:10]) = {}'.format((validation_label_list[0:10])))

print('type(training_label_list[0]) = ', type(training_label_list[0]))
print('type(validation_label_list[0]) = ', type(validation_label_list[0]))
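
One caveat with np.hstack that may be worth checking: it flattens the token sequences, so a label that tokenizes into more than one word contributes more than one entry, and the label array can end up longer than the number of samples. The sketch below illustrates this with three hand-written sequences (the ids are assumed from the order in which the tokenizer saw the labels, not taken from a real run) and then shows a hypothetical alternative that maps each whole label string to a single integer. The names label_to_id and ids are made up for illustration.

```python
import numpy as np

# Token sequences as texts_to_sequences might return them for three samples
# labelled 'Nepal', 'New Zealand', 'Israel' (ids are assumed, not measured).
sequences = [[6], [7, 8], [9]]

# hstack concatenates every token, so the two-word label yields two entries.
flat = np.hstack(sequences)
print(flat)
print(len(sequences), len(flat))  # number of samples vs. number of label entries

# Hypothetical alternative: one integer id per whole label string.
labels = ['Nepal', 'New Zealand', 'Israel']
label_to_id = {name: i for i, name in enumerate(sorted(set(labels)))}
ids = np.array([label_to_id[name] for name in labels])
print(ids.shape)  # one entry per sample
```

If the lengths printed on the second line differ, the flattened labels no longer line up one-to-one with the training samples, which Model.fit would then reject or mis-pair.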

Timbus Calin · Accepted Answer · 2020-02-15 10:40:52Z


Your problem is that, while you are converting your training data to a numpy array, that specific numpy array consists of list elements, hence the error

ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).

The error is subtler than it appears; some have reported that they had to switch back from 2.1.0 to 2.0.0. See also "What is the difference between Numpy's array() and asarray() functions?"

I would personally try this:

Use training_label_list = np.asarray(label_tokenizer.texts_to_sequences(T_L)) instead of np.array (see "Tensorflow - ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float)").

According to "List of lists into numpy array", you will have to force the casting (weird, but this should work):

import numpy

x = [[1, 2], [1, 2, 3], [1]]
# On recent NumPy releases a ragged input like this also needs dtype=object.
y = numpy.array([numpy.array(xi) for xi in x])
type(y)
>>><type 'numpy.ndarray'>
type(y[0])
>>><type 'numpy.ndarray'>

While trying to help you on this issue, I discovered an interesting fact about numpy casting:

CASE 1:

   my_list = [[1,2],[2],[3]]
   my_numpy_array = np.array(my_list)
   print(type(my_numpy_array))
   print(type(my_numpy_array[0]))
   <class 'numpy.ndarray'>
   <class 'list'>

CASE 2:

    my_list = [[1],[2],[3]]
    my_numpy_array = np.array(my_list)
    print(type(my_numpy_array))
    print(type(my_numpy_array[0]))
    <class 'numpy.ndarray'>
    <class 'numpy.ndarray'>

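Short conclusion: if the sublists differ in length, NumPy cannot build a rectangular array, so it creates a 1-D object array whose elements remain Python lists instead of being converted. In your data this is most likely triggered by 'New Zealand', which the tokenizer splits into two tokens, whereas every validation label is a single word.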
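
The two cases can be compared directly by looking at shape and dtype. This is a minimal sketch with the same toy lists as above; dtype=object is passed explicitly for the ragged case because recent NumPy releases raise a ValueError rather than building the object array silently, as older versions did.

```python
import numpy as np

ragged = [[1, 2], [2], [3]]   # sublists of different lengths
uniform = [[1], [2], [3]]     # every sublist has length 1

# Explicit dtype=object: required on recent NumPy for ragged input.
ragged_arr = np.array(ragged, dtype=object)
uniform_arr = np.array(uniform)

# Ragged input: a 1-D array of Python lists, which Keras cannot convert.
print(ragged_arr.shape, ragged_arr.dtype, type(ragged_arr[0]))

# Uniform input: a regular 2-D integer array.
print(uniform_arr.shape, uniform_arr.dtype, type(uniform_arr[0]))
```

Checking arr.dtype == object before calling Model.fit is a quick way to spot this situation.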

I tested on your code, now it works:

training_label_seq = np.asarray(label_tokenizer.texts_to_sequences(T_L))

training_label_seq = np.array([np.array(training_element) for training_element in training_label_seq])

validation_label_seq = np.asarray(label_tokenizer.texts_to_sequences(V_L))



print('(training_label_seq[0:10]) = {}'.format((training_label_seq[0:10])))
print('(validation_label_seq[0:10]) = {}'.format((validation_label_seq[0:10])))

print('type(training_label_seq[0]) = ', type(training_label_seq[0]))
print('type(validation_label_seq[0]) = ', type(validation_label_seq[0]))



(training_label_seq[0:10]) = [array([1]) array([1]) array([1]) array([1]) array([1]) array([1])
 array([1]) array([1]) array([1]) array([1])]
(validation_label_seq[0:10]) = [[16]
 [16]
 [16]
 [16]
 [16]
 [16]
 [16]
 [16]
 [16]
 [16]]
type(training_label_seq[0]) =  <class 'numpy.ndarray'>
type(validation_label_seq[0]) =  <class 'numpy.ndarray'>

edited Feb 15, 2020 at 10:40

answered Feb 15, 2020 at 10:12


3 Comments

RakTheGeek Over a year ago

Thank you for the quick response. I've tried both, np.asarray and Downgrading it to TF 2.0. No luck. Surprisingly, both Training and Testing Data are the Array of Lists but only Training Data is behaving weirdly.

Timbus Calin Over a year ago

Yes, I have also tried these two on the colab you provided. I am updating my answer with another possible response(check the number 3)

RakTheGeek Over a year ago

It still doesn't work: instead of list, we now get array for the training labels, but normal data for the testing labels. The error is now ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray).
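
A possible explanation for why only the training labels misbehave is that one training label contains a space. The sketch below uses a deliberately simplified stand-in for the tokenizer (lowercase, then split on whitespace; it is not the Keras implementation, and simple_tokenize is a made-up helper) to find labels that produce more than one token.

```python
def simple_tokenize(text):
    # Rough stand-in for the tokenizer defaults: lowercase, split on whitespace.
    return text.lower().split()

labels = ['India', 'USA', 'New Zealand', 'China']

# Build a word index in order of first appearance, starting at 1.
word_index = {}
for label in labels:
    for word in simple_tokenize(label):
        word_index.setdefault(word, len(word_index) + 1)

sequences = [[word_index[w] for w in simple_tokenize(label)] for label in labels]

# Any label with more than one token makes the list of sequences ragged.
multi_token = [label for label, seq in zip(labels, sequences) if len(seq) > 1]
print(sequences)
print(multi_token)
```

Running the same length check on the real output of texts_to_sequences should show whether a multi-word label is what makes the training sequences ragged.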
