
I have gzip-compressed CSV files in Google Cloud Storage, and using Python I auto-detect the schema and create a new table in Google BigQuery based on the naming convention. How do I partition the table being created? The data already has a Date column that I would like to use.

# importing libraries
from google.cloud import bigquery

# defining first load list
first_load_list = []

#defining tracker file
tracker_file = open("tracker_file", "a")

#reading values from config file
config_file = open("ingestion.config", "r")
for line in config_file:
    if "project_id" in line:
        project_id = line.split("=")[1].strip()
    elif "dataset" in line:
        dataset = line.split("=")[1].strip()
    elif "gcs_location" in line:
        gcs_location = line.split("=")[1].strip()
    elif "bq1_target_table" in line:
        bq1_target_table = line.split("=")[1].strip()
    elif "bq2_target_table" in line:
        bq2_target_table = line.split("=")[1].strip()
    elif "bq1_first_load_filename" in line:
        bq1_first_load_filename = line.split("=")[1].strip()
        first_load_list.append(bq1_first_load_filename)
    elif "bq2_first_load_filename" in line:
        bq2_first_load_filename = line.split("=")[1].strip()
        first_load_list.append(bq2_first_load_filename)
    elif "gcs_bucket" in line:
        gcs_bucket = line.split("=")[1].strip()

# reading bucket list temp file
bucket_list_file = open("bucket_list.temp", "r")
bucket_list = []
for entry in bucket_list_file:
    bucket_list.append(entry.strip())


# defining client and specifying project
client = bigquery.Client(project=project_id)
dataset_id = dataset
dataset_ref = client.dataset(dataset_id)
job_config = bigquery.LoadJobConfig()
job_config.autodetect = True
job_config.skip_leading_rows = 1
job_config.source_format = bigquery.SourceFormat.CSV


# loading files into tables based on naming convention
for filename in first_load_list:
    if "BQ2_2" in filename:
        uri = gcs_location + filename
        print "Processing file = " + uri
        load_job = client.load_table_from_uri(
            uri.strip(),
            dataset_ref.table(bq2_target_table),
            job_config=job_config)  # API request
        assert load_job.job_type == 'load'
        load_job.result()  # Waits for table load to complete.
        assert load_job.state == 'DONE'
        assert client.get_table(dataset_ref.table(bq2_target_table))
        tracker_file.write(filename + "\n")
        print filename.strip() + " processing complete\n"
    elif "BQ1_2" in filename:
        uri = gcs_location + filename
        print "Processing file = " + uri
        load_job = client.load_table_from_uri(
            uri.strip(),
            dataset_ref.table(bq1_target_table),
            job_config=job_config)  # API request
        assert load_job.job_type == 'load'
        load_job.result()  # Waits for table load to complete.
        assert load_job.state == 'DONE'
        assert client.get_table(dataset_ref.table(bq1_target_table))
        tracker_file.write(filename + "\n")
        print filename.strip() + " processing complete\n"

tracker_file.close()

This is the code I run for the first load. Once the first-load tables are created, I only want to append data to these tables going forward. I looked at https://cloud.google.com/bigquery/docs/creating-partitioned-tables but I can't figure out how to implement it in Python.

Can anyone point me in the right direction, please?

  • Hi Utsav, have you found a solution? I need to append only the increment to BigQuery, and I can't find a way of doing it. The 'load_table_from_uri' method overwrites the table every time... Commented Jul 21, 2019 at 10:03

1 Answer


You can use job_config._properties['load']['timePartitioning'] = {"type": "DAY", "field": "your_field"} to create a partitioned table on load. I just tested it on my end with test data and it worked as expected.
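
For reference, recent versions of the google-cloud-bigquery client expose the same setting through a public attribute instead of the private _properties dict. A minimal sketch, assuming your data has a DATE or TIMESTAMP column named your_field:

# Sketch using the public LoadJobConfig.time_partitioning attribute;
# "your_field" is a placeholder for the existing Date column in your data.
from google.cloud import bigquery

job_config = bigquery.LoadJobConfig()
job_config.autodetect = True
job_config.skip_leading_rows = 1
job_config.source_format = bigquery.SourceFormat.CSV
job_config.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,  # daily partitioning
    field="your_field",  # column to partition on
)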

Please note that partitioning through the API only supports 'DAY' for now.

See the related GitHub issue.
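
Regarding the comment about append-only loads (not covered by the original answer): the client library also lets you set the load job's write disposition, so follow-up loads add rows to the existing table rather than replacing it. A minimal sketch:

# Sketch: explicit WRITE_APPEND so repeat loads append to the existing table.
from google.cloud import bigquery

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1
# Append rows; use WRITE_TRUNCATE instead to replace the table contents.
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND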
