Unable to view output for beam.combiners.Count.PerElement() in Dataflow

Question

I have a Pub/Sub script that publishes male first names, as follows:

from google.cloud import pubsub_v1
import names

project_id = "Your-Project-Name"
topic_name = "Your-Topic-Name"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_name)

while True:
    # Pick a random male first name and publish it as UTF-8 bytes.
    data = names.get_first_name(gender='male')
    data = data.encode("utf-8")
    publisher.publish(topic_path, data=data)

Then I have a Dataflow pipeline that reads from the subscription attached to the topic and counts the occurrences of each element, as follows:

import logging, re, os
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

root = logging.getLogger()
root.setLevel(logging.INFO)

p = beam.Pipeline(options=PipelineOptions())
x = (
 p
 | beam.io.ReadFromPubSub(topic=None, subscription="projects/YOUR-PROJECT-NAME/subscriptions/YOUR-SUBSCRIPTION-NAME").with_output_types(bytes)
 | 'Decode_UTF-8' >> beam.Map(lambda x: x.decode('utf-8'))
 | 'ExtractWords' >> beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x))
 | 'CountingElem' >> beam.combiners.Count.PerElement()
 | 'FormatOutput' >> beam.MapTuple(lambda word, count: '%s: %s' % (word, count))
 | 'Printing2Log' >> beam.Map(lambda k: logging.info(k)))

result = p.run()
result.wait_until_finish()

The issue: I get no output from the last three steps of the pipeline, so nothing is logged, even though I can see data flowing through the first three steps.

I expected output like this:

Peter: 2
Glen: 1
Alex: 1
Ryan: 2

Thanks in advance for your help.

Which runner are you using to run the Dataflow job?
– Jayadeep Jayaraman, Commented Mar 27, 2020 at 15:00

I'm using the DataflowRunner.
– Steeve, Commented Mar 27, 2020 at 17:25

chamikara · Accepted Answer · 2020-03-28 00:16:18Z

Since this is a streaming pipeline, you need to set up windowing and/or triggering appropriately for the pipeline to produce output. See https://beam.apache.org/documentation/programming-guide/#windowing

More specifically:

Caution: Beam’s default windowing behavior is to assign all elements of a PCollection to a single, global window and discard late data, even for unbounded PCollections. Before you use a grouping transform such as GroupByKey on an unbounded PCollection, you must do at least one of the following:

- Set a non-global windowing function.
- Set a non-default trigger, so that the global window can emit results under other conditions.

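beam.combiners.Count.PerElement() contains a GroupByKey, so this caution applies to your pipeline. With an unbounded Pub/Sub source and the default global window, the window never closes, so the count is never emitted and the steps after it see no data.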
