TensorFlow Extended: Specifying Valency Of Features In Schema

Question

I am presently trying to feed a dataset with a few multivalent feature columns through a TensorFlow Extended (TFX) pipeline. Here is a row from my sample data:

user_id                     29601
product_id                     28
touched_product_id     [2435, 28]
liked_product_id       [2435, 28]
disliked_product_id            []
target                          1

As you can see, the columns (features) touched_product_id, liked_product_id, disliked_product_id are multivalent.

Now, in order to feed this data through TFX's validation layer, I'm following the guide below:

https://www.tensorflow.org/tfx/tutorials/tfx/components_keras

In accordance with the guide, I produce some TFRecord files using an instance of CsvExampleGen, and then generate statistics and a schema as shown below:

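from tfx.components import CsvExampleGen, SchemaGen, StatisticsGen
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

context = InteractiveContext()

# create train and eval records
c = CsvExampleGen(input_base='sample_train')
context.run(c)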

# generate statistics
statistics_gen = StatisticsGen(
    examples=c.outputs['examples']
)
context.run(statistics_gen)

# generate schema
schema_gen = SchemaGen(
    statistics=statistics_gen.outputs['statistics'],
    infer_feature_shape=False)
context.run(schema_gen)
context.show(schema_gen.outputs['schema'])

The final schema displayed by the above code is:

                        Type  Presence Valency Domain
Feature name                                         
'disliked_product_id'  BYTES  required  single      -
'liked_product_id'     BYTES  required  single      -
'product_id'             INT  required  single      -
'target'                 INT  required  single      -
'touched_product_id'   BYTES  required  single      -
'user_id'                INT  required  single      -

Clearly, the multivalent features are incorrectly inferred to be univalent. In an attempt to fix this, I loaded the Schema proto manually and tried to set a valence property on a feature.

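import os

import tensorflow_data_validation as tfdv
from google.protobuf import text_format
from tensorflow.python.lib.io import file_io
from tensorflow_metadata.proto.v0 import schema_pb2

# load the inferred schema from the SchemaGen output directory
schema_path = os.path.join(schema_gen.outputs['schema'].get()[0].uri, 'schema.pbtxt')
schema = schema_pb2.Schema()
contents = file_io.read_file_to_string(schema_path)
schema = text_format.Parse(contents, schema)

# THIS LINE DOES NOT WORK
tfdv.get_feature(schema, 'user_id').valence = 'multiple'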

That final line fails because, to my surprise, there is no valence property. I looked through the spec for the Schema proto but did not find one. Does anyone know how I can solve this?

Amine_h · Accepted Answer · 2020-11-02 18:20:05Z

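Try setting feature.value_count.min or feature.value_count.max to a value greater than 1. The Schema proto has no valence field; the number of values a feature may hold is expressed through value_count.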
