
I am presently trying to feed a dataset with a few multivalent feature columns through a TensorFlow Extended (TFX) pipeline. Here is a row from my sample data:

user_id                     29601
product_id                     28
touched_product_id     [2435, 28]
liked_product_id       [2435, 28]
disliked_product_id            []
target                          1

As you can see, the columns (features) touched_product_id, liked_product_id, disliked_product_id are multivalent.

Now, in order to feed this data through TFX's validation layer, I'm following the guide below:

https://www.tensorflow.org/tfx/tutorials/tfx/components_keras

Following the guide, I produce some TFRecord files using an instance of CsvExampleGen, and then generate statistics and a schema as shown below:

# create train and eval records
c = CsvExampleGen(input_base='sample_train')
context.run(c)

# generate statistics
statistics_gen = StatisticsGen(
    examples=c.outputs['examples']
)
context.run(statistics_gen)

# generate schema
schema_gen = SchemaGen(
    statistics=statistics_gen.outputs['statistics'],
    infer_feature_shape=False)
context.run(schema_gen)
context.show(schema_gen.outputs['schema'])

The final schema displayed by the above code is:

                        Type  Presence Valency Domain
Feature name                                         
'disliked_product_id'  BYTES  required  single      -
'liked_product_id'     BYTES  required  single      -
'product_id'             INT  required  single      -
'target'                 INT  required  single      -
'touched_product_id'   BYTES  required  single      -
'user_id'                INT  required  single      -
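A note on the BYTES type in this table: it suggests that each list cell was most likely read from the CSV as one string (for example the literal text "[2435, 28]") rather than as several integers, which would also explain the single valency. The sketch below uses only the standard library and shows one way such a cell could be turned into a real list before the data is written out. The helper name parse_id_list and the row dictionary are hypothetical and only mirror the sample row above; this is not part of TFX.

```python
import ast
from typing import List


def parse_id_list(cell: str) -> List[int]:
    """Parse a CSV cell such as "[2435, 28]" into a list of ints."""
    value = ast.literal_eval(cell)  # safely evaluates a Python literal
    if not isinstance(value, list):
        raise ValueError(f"expected a list literal, got: {cell!r}")
    return [int(v) for v in value]


# Hypothetical raw cells, copied from the sample row in the question.
row = {
    "touched_product_id": "[2435, 28]",
    "liked_product_id": "[2435, 28]",
    "disliked_product_id": "[]",
}

parsed = {name: parse_id_list(cell) for name, cell in row.items()}
print(parsed)
```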

Clearly, the multivalent features are incorrectly inferred to be univalent. In an attempt to fix this, I loaded the Schema proto manually and tried to set a valence property on a feature.

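import os
import tensorflow_data_validation as tfdv
from google.protobuf import text_format
from tensorflow.python.lib.io import file_io
from tensorflow_metadata.proto.v0 import schema_pb2

schema_path = os.path.join(schema_gen.outputs['schema'].get()[0].uri, 'schema.pbtxt')
schema = schema_pb2.Schema()
contents = file_io.read_file_to_string(schema_path)
schema = text_format.Parse(contents, schema)

# THIS LINE DOES NOT WORK
tfdv.get_feature(schema, 'user_id').valence = 'multiple'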

To my surprise, that final line fails because the feature has no valence property. I looked through the spec for the Schema proto but did not find a valence property there either. Does anyone know how I can solve this?

1 Answer

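Try setting feature.value_count.max (or feature.value_count.min) to a value greater than 1. The Schema proto has no valence field; the number of values a feature may hold is described by its value_count, which has a min and a max.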
