
I have gzip-compressed CSV files in Google Cloud Storage, and using Python I auto-detect the schema and create a new table in Google BigQuery based on the naming convention. How do I partition the table being created? The data already has a Date column that I would like to use.

# importing libraries
from google.cloud import bigquery

# defining first load list
first_load_list = []

#defining tracker file
tracker_file = open("tracker_file", "a")

#reading values from config file
config_file = open("ingestion.config", "r")
for line in config_file:
    if "project_id" in line:
        project_id = line.split("=")[1].strip()
    elif "dataset" in line:
        dataset = line.split("=")[1].strip()
    elif "gcs_location" in line:
        gcs_location = line.split("=")[1].strip()
    elif "bq1_target_table" in line:
        bq1_target_table = line.split("=")[1].strip()
    elif "bq2_target_table" in line:
        bq2_target_table = line.split("=")[1].strip()
    elif "bq1_first_load_filename" in line:
        bq1_first_load_filename = line.split("=")[1].strip()
        first_load_list.append(bq1_first_load_filename)
    elif "bq2_first_load_filename" in line:
        bq2_first_load_filename = line.split("=")[1].strip()
        first_load_list.append(bq2_first_load_filename)
    elif "gcs_bucket" in line:
        gcs_bucket = line.split("=")[1].strip()

# reading bucket list temp file
bucket_list_file = open("bucket_list.temp", "r")
bucket_list = []
for entry in bucket_list_file:
    bucket_list.append(entry.strip())


# defining client and specifying project
client = bigquery.Client(project=project_id)
dataset_id = dataset
dataset_ref = client.dataset(dataset_id)
job_config = bigquery.LoadJobConfig()
job_config.autodetect = True
job_config.skip_leading_rows = 1
job_config.source_format = bigquery.SourceFormat.CSV


# loading files into tables based on naming convention
for filename in first_load_list:
    if "BQ2_2" in filename:
        uri = gcs_location + filename
        print "Processing file = " + uri
        load_job = client.load_table_from_uri(
            uri.strip(),
            dataset_ref.table(bq2_target_table),
            job_config=job_config)  # API request
        assert load_job.job_type == 'load'
        load_job.result()  # Waits for table load to complete.
        assert load_job.state == 'DONE'
        assert client.get_table(dataset_ref.table(bq2_target_table))
        tracker_file.write(filename + "\n")
        print filename.strip() + " processing complete\n"
    elif "BQ1_2" in filename:
        uri = gcs_location + filename
        print "Processing file = " + uri
        load_job = client.load_table_from_uri(
            uri.strip(),
            dataset_ref.table(bq1_target_table),
            job_config=job_config)  # API request
        assert load_job.job_type == 'load'
        load_job.result()  # Waits for table load to complete.
        assert load_job.state == 'DONE'
        assert client.get_table(dataset_ref.table(bq1_target_table))
        tracker_file.write(filename + "\n")
        print filename.strip() + " processing complete\n"

tracker_file.close()

This is the code I run for the first load. Once the first-load tables are created, I only want to append data to these tables going forward. I looked at https://cloud.google.com/bigquery/docs/creating-partitioned-tables but I can't figure out how to implement it in Python.

Can anyone point me in the right direction, please?

  • Hi Utsav, have you found a solution? I need to append only the increment to BigQuery, and I can't find a way of doing it. The 'load_table_from_uri' method overwrites the table every time... Commented Jul 21, 2019 at 10:03

1 Answer


You can use job_config._properties['load']['timePartitioning'] = {"type": "DAY", "field": "your_field"} to create a partitioned table on load. I just tested it on my end with test data and it worked as expected.
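
For reference, recent versions of the google-cloud-bigquery client expose the same setting through a public attribute instead of the private _properties dict. A minimal sketch, assuming your data has a DATE or TIMESTAMP column named your_field:

# Sketch using the public LoadJobConfig.time_partitioning attribute;
# "your_field" is a placeholder for the existing Date column in your data.
from google.cloud import bigquery

job_config = bigquery.LoadJobConfig()
job_config.autodetect = True
job_config.skip_leading_rows = 1
job_config.source_format = bigquery.SourceFormat.CSV
job_config.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,  # daily partitioning
    field="your_field",  # column to partition on
)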

Please note that partitioning through the API only supports 'DAY' for now.

See the related GitHub issue.
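
Regarding the comment about append-only loads (not covered by the original answer): the client library also lets you set the load job's write disposition, so follow-up loads add rows to the existing table rather than replacing it. A minimal sketch:

# Sketch: explicit WRITE_APPEND so repeat loads append to the existing table.
from google.cloud import bigquery

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1
# Append rows; use WRITE_TRUNCATE instead to replace the table contents.
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND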
