I want to load a csv.gz file from Cloud Storage into BigQuery. Right now I am using the code below, but I am not sure it is an efficient way to load the data.

# -*- coding: utf-8 -*-
from io import BytesIO

import pandas as pd
import pandas_gbq as gbq
from google.cloud import storage

# Authenticate with a service account key file
client = storage.Client.from_service_account_json(service_account)
bucket = client.get_bucket("bucketname")
blob = bucket.blob("somefile.csv.gz")

# Download the whole object into memory; pandas cannot infer compression
# from a buffer, so compression='gzip' is stated explicitly (this assumes
# the object is stored gzip-compressed, i.e. no decompressive transcoding)
content = blob.download_as_string()
df = pd.read_csv(BytesIO(content), delimiter=',', quotechar='"',
                 compression='gzip', low_memory=False)
df = df.astype(str)

# Strip '|' from the column names so BigQuery accepts them
# (regex=False replaces the literal character, not an empty regex pattern)
df.columns = df.columns.str.replace("|", "", regex=False)

# Record the load time (pd.datetime was removed in pandas 1.0)
df["dateinsert"] = pd.Timestamp.now()

# destination_table must be given as "dataset.table"
gbq.to_gbq(df, 'dataset.desttable',
           'projectid',
           chunksize=None,
           if_exists='append'
           )

Please assist me in writing this code in a more efficient way.

  • Yes, there is an easier way, but for the best answer I need more details. What do you want to achieve? What are your constraints? For example, why are you using dateInsert? Is day granularity enough, or do you need more precision? How big are your files? Why do you replace | ? (...) Provide as much detail as you can. Commented Sep 18, 2019 at 15:12
  • My file contains more than 150 columns, and the column names contain characters like {, }, [, ], |, etc., so I want to replace the special characters in the column names so BigQuery will accept them. The dateinsert column is just to record the loading time. It is a very big file, containing over 1 million rows. Commented Sep 19, 2019 at 6:24
  • Why are the special characters problematic in BigQuery? Are they present in a "numeric field", forcing the field to be a string? Commented Sep 19, 2019 at 8:58
  • Hi Guillaume, the special characters are in the CSV headers, so I am just replacing them with '_' so BigQuery can accept them as column names. Commented Sep 24, 2019 at 12:46
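For reference, a minimal sketch of that header clean-up; sanitize_column is a hypothetical helper name, and the character class assumes BigQuery's classic rule that column names may contain only letters, digits, and underscores:

import re

# Replace every character BigQuery rejects in a column name
# (anything other than letters, digits, and underscores) with '_'
def sanitize_column(name: str) -> str:
    return re.sub(r"[^0-9a-zA-Z_]", "_", name)

df.columns = [sanitize_column(c) for c in df.columns]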

1 Answer


I propose this process (a minimal Python sketch follows the list):

  • Perform a load job into BigQuery
    • Add the schema; yes, 150 columns is tedious...
    • Add the skip-leading-rows option to skip the header: job_config.skip_leading_rows = 1
    • Name your table like this: <dataset>.<tableBaseName>_<Datetime>. The datetime must be a string format compliant with BigQuery table names, for example YYYYMMDDHHMM.
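A minimal sketch of that load job, assuming the google-cloud-bigquery client library; the project, dataset, bucket, and file names are the placeholders from the question, and the schema is abbreviated:

from datetime import datetime
from google.cloud import bigquery

client = bigquery.Client(project="projectid")

# Suffix the table name with a BigQuery-compliant datetime string (YYYYMMDDHHMM)
suffix = datetime.utcnow().strftime("%Y%m%d%H%M")
table_id = "projectid.dataset.tableBaseName_" + suffix

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row
    schema=[
        bigquery.SchemaField("col_1", "STRING"),
        # ...one SchemaField per column (150 of them in the question)
    ],
)

# BigQuery decompresses gzip-compressed CSVs automatically when loading from GCS
load_job = client.load_table_from_uri(
    "gs://bucketname/somefile.csv.gz",
    table_id,
    job_config=job_config,
)
load_job.result()  # block until the load job completes

This way the data moves from Cloud Storage straight into BigQuery, without being pulled through the client machine at all.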

When you query your data, you can query a subset of the tables and inject the table name into the query result, like this:

SELECT *,
       (SELECT table_id
        FROM `<project>.<dataset>.__TABLES_SUMMARY__`
        WHERE table_id LIKE '<tableBaseName>%')
FROM `<project>.<dataset>.<tableBaseName>*`

Of course, you can refine the * with the year, month, day, and so on (for example <tableBaseName>201909* to match only tables from September 2019).

I think this meets all your requirements. Comment if something goes wrong.
