I want to load a csv.gz file from Cloud Storage into BigQuery. Right now I am using the code below, but I am not sure it is an efficient way to load the data.

# -*- coding: utf-8 -*-
from io import BytesIO

import pandas as pd
import pandas_gbq as gbq
from google.cloud import storage

# Authenticate with a service account key file
client = storage.Client.from_service_account_json(service_account)
bucket = client.get_bucket("bucketname")
blob = bucket.blob("somefile.csv.gz")

# Download the whole object into memory; pandas cannot infer compression
# from a buffer, so compression='gzip' is stated explicitly (this assumes
# the object is stored gzip-compressed, i.e. no decompressive transcoding)
content = blob.download_as_string()
df = pd.read_csv(BytesIO(content), delimiter=',', quotechar='"',
                 compression='gzip', low_memory=False)
df = df.astype(str)

# Strip '|' from the column names so BigQuery accepts them
# (regex=False replaces the literal character, not an empty regex pattern)
df.columns = df.columns.str.replace("|", "", regex=False)

# Record the load time (pd.datetime was removed in pandas 1.0)
df["dateinsert"] = pd.Timestamp.now()

# destination_table must be given as "dataset.table"
gbq.to_gbq(df, 'dataset.desttable',
           'projectid',
           chunksize=None,
           if_exists='append'
           )

Please assist me in writing this code in a more efficient way.

  • Yes, there is an easier way, but for the best answer I need more details. What do you want to achieve? What are your constraints? For example, why are you using dateInsert? Is day granularity enough, or do you need more precision? How big are your files? Why do you replace | ? (...) Provide as much detail as you can. Commented Sep 18, 2019 at 15:12
  • My file contains more than 150 columns, and the column names contain characters like {, }, [, ], |, etc., so I want to replace the special characters in the column names so BigQuery will accept them. The dateinsert column is just to record the loading time. It is a very big file, containing over 1 million rows. Commented Sep 19, 2019 at 6:24
  • Why are the special characters problematic in BigQuery? Are they present in a "numeric field", forcing the field to be a string? Commented Sep 19, 2019 at 8:58
  • Hi Guillaume, the special characters are in the CSV headers, so I am just replacing them with '_' so BigQuery can accept them as column names. Commented Sep 24, 2019 at 12:46
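For reference, a minimal sketch of that header clean-up; sanitize_column is a hypothetical helper name, and the character class assumes BigQuery's classic rule that column names may contain only letters, digits, and underscores:

import re

# Replace every character BigQuery rejects in a column name
# (anything other than letters, digits, and underscores) with '_'
def sanitize_column(name: str) -> str:
    return re.sub(r"[^0-9a-zA-Z_]", "_", name)

df.columns = [sanitize_column(c) for c in df.columns]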

1 Answer


I propose this process (a minimal Python sketch follows the list):

  • Perform a load job into BigQuery
    • Add the schema; yes, 150 columns is tedious...
    • Add the skip-leading-rows option to skip the header: job_config.skip_leading_rows = 1
    • Name your table like this: <dataset>.<tableBaseName>_<Datetime>. The datetime must be a string format compliant with BigQuery table names, for example YYYYMMDDHHMM.
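A minimal sketch of that load job, assuming the google-cloud-bigquery client library; the project, dataset, bucket, and file names are the placeholders from the question, and the schema is abbreviated:

from datetime import datetime
from google.cloud import bigquery

client = bigquery.Client(project="projectid")

# Suffix the table name with a BigQuery-compliant datetime string (YYYYMMDDHHMM)
suffix = datetime.utcnow().strftime("%Y%m%d%H%M")
table_id = "projectid.dataset.tableBaseName_" + suffix

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row
    schema=[
        bigquery.SchemaField("col_1", "STRING"),
        # ...one SchemaField per column (150 of them in the question)
    ],
)

# BigQuery decompresses gzip-compressed CSVs automatically when loading from GCS
load_job = client.load_table_from_uri(
    "gs://bucketname/somefile.csv.gz",
    table_id,
    job_config=job_config,
)
load_job.result()  # block until the load job completes

This way the data moves from Cloud Storage straight into BigQuery, without being pulled through the client machine at all.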

When you query your data, you can query a subset of the tables and inject the table name into the query result, like this:

SELECT *,
       (SELECT table_id
        FROM `<project>.<dataset>.__TABLES_SUMMARY__`
        WHERE table_id LIKE '<tableBaseName>%')
FROM `<project>.<dataset>.<tableBaseName>*`

Of course, you can refine the * with the year, month, day, and so on (for example <tableBaseName>201909* to match only tables from September 2019).

I think this meets all your requirements. Comment if something goes wrong.
