
Summary

How can I use Python to specify the target partition when loading data into an ingestion-time partitioned BigQuery table?

What we tried

I found that specifying the partition is possible when inserting with SQL: https://cloud.google.com/bigquery/docs/using-dml-with-partitioned-tables

but I don't know how to do the equivalent in Python. I am planning to use "client.load_table_from_dataframe" from the google-cloud-bigquery module. https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.load_table_from_dataframe

I found the following sample, but when I use the column name _PARTITIONTIME I get the error below. https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-partitioned#bigquery_load_table_partitioned-python

google.api_core.exceptions.BadRequest: 400 POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/aaa/jobs?uploadType=multipart: Invalid field name "_PARTITIONTIME". Field names are not allowed to start with the (case-insensitive) prefixes _PARTITION, _TABLE_, _FILE_, _ROW_TIMESTAMP, __ROOT__ and _COLIDENTIFIER
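The rejection happens server-side, but the rule quoted in the error message is simple to restate. This is a minimal sketch of that naming rule, with the prefix list copied verbatim from the error text (the `is_reserved` helper is my own illustration, not part of the BigQuery client):

```python
# Prefix list taken verbatim from the BadRequest error message above.
RESERVED_PREFIXES = (
    "_PARTITION",
    "_TABLE_",
    "_FILE_",
    "_ROW_TIMESTAMP",
    "__ROOT__",
    "_COLIDENTIFIER",
)

def is_reserved(field_name: str) -> bool:
    """Return True if a field name would be rejected by the load job
    (the check is case-insensitive, per the error message)."""
    return field_name.upper().startswith(RESERVED_PREFIXES)

print(is_reserved("_PARTITIONTIME"))  # True  -> rejected
print(is_reserved("_partitiontime"))  # True  -> still rejected (case-insensitive)
print(is_reserved("_P1"))             # False -> accepted
```

This is why any schema or DataFrame column named `_PARTITIONTIME` fails before the data is even loaded.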

Execution environment

  • python: 3.8.10
  • google-cloud-bigquery: 3.2.0
  • pandas: 1.4.3
  • Authentication: credentials are not the problem, since the load succeeds when no partition column is specified.

Table

CREATE TABLE IF NOT EXISTS `aaa.bbb.ccc`(
  c1 INTEGER,
  c2 STRING
)
PARTITION BY _PARTITIONDATE;

What I want to do

SQL

INSERT INTO `aaa.bbb.ccc` (c1, c2, _PARTITIONTIME) VALUES (99, "zz", TIMESTAMP("2000-01-02"));

Python (the code I tried)

import pandas as pd
from google.cloud import bigquery
from google.cloud.bigquery.enums import SqlTypeNames
from google.cloud.bigquery.job import WriteDisposition
from datetime import datetime

client = bigquery.Client(project="aaa")
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("c1", SqlTypeNames.INTEGER),
        bigquery.SchemaField("c2", SqlTypeNames.STRING),
        bigquery.SchemaField("_PARTITIONTIME", SqlTypeNames.TIMESTAMP),
    ],
    write_disposition=WriteDisposition.WRITE_APPEND,
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="_PARTITIONTIME",  # Name of the column to use for partitioning.
        expiration_ms=7776000000,  # 90 days.
    ),
)
df = pd.DataFrame(
    [
        [1, "a", datetime.strptime("2100-11-12", "%Y-%m-%d")],
        [2, "b", datetime.strptime("2101-12-13", "%Y-%m-%d")],
    ],
    columns=["c1", "c2", "_PARTITIONTIME"],
)
job = client.load_table_from_dataframe(df, "aaa.bbb.ccc", job_config=job_config)  # raises the BadRequest above
result = job.result()

Cross-post

The same question was also asked (in Japanese) at: https://ja.stackoverflow.com/questions/90760

1 Answer


You can just rename _PARTITIONTIME to something else, since it starts with one of the reserved (case-insensitive) prefixes. The code below worked:

import pandas as pd
from google.cloud import bigquery
from google.cloud.bigquery.enums import SqlTypeNames
from google.cloud.bigquery.job import WriteDisposition
from datetime import datetime

client = bigquery.Client(project="<your-project>")
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("c1", SqlTypeNames.INTEGER),
        bigquery.SchemaField("c2", SqlTypeNames.STRING),
        bigquery.SchemaField("_P1", SqlTypeNames.TIMESTAMP),
    ],
    write_disposition=WriteDisposition.WRITE_APPEND,
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="_P1",  # Name of the column to use for partitioning.
        expiration_ms=7776000000,  # 90 days.
    ),
)
df = pd.DataFrame(
    [
        [1, "a", datetime.strptime("2100-11-12", "%Y-%m-%d")],
        [2, "b", datetime.strptime("2101-12-13", "%Y-%m-%d")],
    ],
    columns=["c1", "c2", "_P1"],
)
job = client.load_table_from_dataframe(df, "<your-project>.<your-dataset>.ccc", job_config=job_config)
result = job.result()

Output: both rows are loaded into the table (screenshot omitted).

As for the query you want to insert:

INSERT INTO `<your-project>.<your-dataset>.ccc` (c1, c2, _P1) VALUES (99, "zz", TIMESTAMP("2000-01-02"));

This is not possible, as explained in this SO post answered by a Googler. Because expiration_ms sets a 90-day partition expiration, only partition dates within the last 90 days (counted from the day the Python script is executed) are valid; anything older than that expires immediately, so a row dated 2000-01-02 cannot survive. This query will work:

INSERT INTO `<your-project>.<your-dataset>.ccc` (c1, c2, _P1) VALUES (99, "zz", TIMESTAMP("2022-06-01"));
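The 90-day figure follows directly from the expiration_ms value in the job config above; a quick sketch of the arithmetic (the variable names are mine, for illustration only):

```python
from datetime import date, timedelta

# expiration_ms from the LoadJobConfig above, converted to days.
EXPIRATION_MS = 7776000000
expiration_days = EXPIRATION_MS // (1000 * 60 * 60 * 24)
print(expiration_days)  # 90

# Oldest partition date that survives the expiration policy, relative
# to the day the script runs; older partitions are deleted.
oldest_valid = date.today() - timedelta(days=expiration_days)
print(oldest_valid)
```

So whether a given INSERT "works" depends on when you run it, not just on the date literal in the query.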

Output: the row is inserted successfully (screenshot omitted).
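As an aside: if you want to keep the original ingestion-time table from the question (PARTITION BY _PARTITIONDATE) rather than switch to a partitioning column, BigQuery load jobs can also target a single partition via a partition decorator (`table$YYYYMMDD`). This is an untested sketch under that assumption; I have not verified it against google-cloud-bigquery 3.2.0, the `partition_decorator` helper is my own, and the commented-out client calls assume credentials for the question's project `aaa`:

```python
from datetime import date

def partition_decorator(table_id: str, day: date) -> str:
    """Build a partition-decorator destination like `aaa.bbb.ccc$20000102`."""
    return f"{table_id}${day.strftime('%Y%m%d')}"

# Hypothetical usage (requires google-cloud-bigquery and valid credentials):
# from google.cloud import bigquery
# client = bigquery.Client(project="aaa")
# job_config = bigquery.LoadJobConfig(
#     write_disposition="WRITE_APPEND",
#     time_partitioning=bigquery.TimePartitioning(type_="DAY"),  # no field: ingestion-time
# )
# # df carries only c1 and c2; the partition comes from the decorator, not a column.
# destination = partition_decorator("aaa.bbb.ccc", date(2000, 1, 2))
# client.load_table_from_dataframe(df, destination, job_config=job_config).result()

print(partition_decorator("aaa.bbb.ccc", date(2000, 1, 2)))  # aaa.bbb.ccc$20000102
```

Note that if the table had a partition expiration set, the same 90-day constraint discussed above would apply to the decorator date as well.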
