You are on the right track using the BigQuery client outside your sink. It should look something like this:

    [..]
    from google.cloud import bigquery

    client = bigquery.Client(project='PROJECT_ID')
    # Create the dataset before the pipeline runs. With older versions of the
    # client library this was client.dataset(DATASET_NAME).create() instead.
    client.create_dataset(DATASET_NAME)
    [..]
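For context, here is a minimal sketch of how that snippet can sit in front of a Beam pipeline that writes to BigQuery, assuming current apache-beam and google-cloud-bigquery APIs; PROJECT_ID, DATASET_NAME and TABLE_NAME are placeholders you would replace with your own values:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from google.cloud import bigquery

    def ensure_dataset(project_id, dataset_name):
        # WriteToBigQuery can create the table, but not the dataset,
        # so create the dataset up front.
        bigquery.Client(project=project_id).create_dataset(dataset_name, exists_ok=True)

    ensure_dataset('PROJECT_ID', 'DATASET_NAME')

    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | 'CreateRows' >> beam.Create([{'name': 'example'}])
         | 'WriteToBQ' >> beam.io.WriteToBigQuery(
               table='PROJECT_ID:DATASET_NAME.TABLE_NAME',
               schema='name:STRING',
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))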
Keep in mind that, although this may work when you run your pipeline locally, the worker VMs that are spun up when you run it remotely on GCP will not have the same dependencies as your local machine. You therefore need to install those dependencies on the workers by following these steps:
- Find out which packages are installed on your machine by running pip freeze > requirements.txt. This creates a requirements.txt file listing every package installed in your environment, regardless of where it came from.
- Edit requirements.txt so that it contains only the packages that were installed from PyPI and are actually used in your workflow source, and delete the rest; an illustrative trimmed file is shown after this list.
- Run your pipeline with the command-line option --requirements_file requirements.txt. This stages the requirements.txt file to the staging location you defined, and each worker installs the listed packages before running your code (a programmatic equivalent is sketched below).
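For the trimming step, a cut-down requirements.txt might look like the following; the package name is only illustrative (it matches the import used above), and the pinned versions that pip freeze produces can be kept as-is:

    # requirements.txt -- keep only PyPI packages that the workflow source imports
    google-cloud-bigquery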
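If you prefer to set the option programmatically rather than on the command line, here is a sketch using Beam's pipeline options; the runner, project ID and gs:// paths are placeholders you would replace with your own values:

    from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

    # Equivalent to passing --requirements_file on the command line: the file is
    # staged to the staging location and pip-installed on each Dataflow worker.
    options = PipelineOptions(
        runner='DataflowRunner',
        project='PROJECT_ID',
        staging_location='gs://YOUR_BUCKET/staging',
        temp_location='gs://YOUR_BUCKET/temp',
    )
    options.view_as(SetupOptions).requirements_file = 'requirements.txt'

You would then pass these options into beam.Pipeline(options=options), as in the sketch earlier in this answer.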