
I've been trying to compress my CSV files to .gz before uploading to GCS using a Cloud Function (Python 3.7), but my code only adds the .gz extension without actually compressing the file, so the resulting file ends up corrupted. Can you please show me how to fix this? Thanks

Here is the relevant part of my code:

import time
import gzip

from google.cloud import bigquery
from google.cloud import storage


def to_gcs(request):    
    job_config = bigquery.QueryJobConfig()
    gcs_filename = 'filename_{}.csv'
    bucket_name = 'bucket_gcs_name'
    subfolder = 'subfolder_name'
    client = bigquery.Client()


    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

    QUERY = "SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*` session, UNNEST(hits) AS hits"
    query_job = client.query(
        QUERY,
        location='US',
        job_config=job_config)

    while not query_job.done():
        time.sleep(1)

    rows_df = query_job.result().to_dataframe()
    storage_client = storage.Client()

    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(subfolder + '/' + gcs_filename + '.gz')
    blob.upload_from_string(
        rows_df.to_csv(sep='|', index=False, encoding='utf-8', compression='gzip'),
        content_type='application/octet-stream')
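To illustrate the symptom with a minimal stand-in DataFrame (not my actual query result): to_csv() with no path returns a plain str of CSV text, so the bytes I upload under a .gz name are never valid gzip:

```python
import gzip

import pandas as pd

# Small stand-in for the real query result.
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# to_csv() with no path returns a plain str of CSV text.
csv_text = df.to_csv(sep='|', index=False)

# Real gzip data starts with the magic bytes 1f 8b; this text does not,
# so saving it under a .gz name yields a "corrupted" archive.
assert not csv_text.encode('utf-8').startswith(b'\x1f\x8b')

# Compressing explicitly does produce valid gzip bytes.
csv_gz = gzip.compress(csv_text.encode('utf-8'))
assert csv_gz.startswith(b'\x1f\x8b')
```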


2 Answers


As suggested in the thread referenced by @Sam Mason in a comment, once you have obtained the Pandas dataframe, you should use a TextIOWrapper() and BytesIO() as described in the following sample:

The following sample was inspired by @ramhiser's answer in this SO thread:

import gzip
from io import BytesIO, TextIOWrapper

df = query_job.result().to_dataframe()
blob = bucket.blob(f'{subfolder}/{gcs_filename}.gz')

with BytesIO() as gz_buffer:
    with gzip.GzipFile(mode='w', fileobj=gz_buffer) as gz_file:
        df.to_csv(TextIOWrapper(gz_file, 'utf8'), index=False)

    gz_buffer.seek(0)  # rewind so the upload reads the stream from the start
    blob.upload_from_file(gz_buffer,
        content_type='application/octet-stream')

Also note that if you expect this file to ever get larger than a couple of MB, you are probably better off using something from the tempfile module in place of BytesIO. SpooledTemporaryFile is designed for exactly this use case: it uses an in-memory buffer up to a given size and only spills to disk if the file gets really big.
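A minimal sketch of that variant (the 1 MiB spool threshold and the payload bytes here are arbitrary stand-ins for the real CSV data):

```python
import gzip
from tempfile import SpooledTemporaryFile

payload = b'col_a|col_b\n1|2\n'  # stand-in for the CSV bytes

# Buffers in memory up to max_size bytes, then transparently rolls
# over to a real temporary file on disk.
with SpooledTemporaryFile(max_size=1024 * 1024) as buf:
    with gzip.GzipFile(mode='wb', fileobj=buf) as gz:
        gz.write(payload)

    buf.seek(0)  # rewind before handing the stream to an uploader
    gz_bytes = buf.read()

# The spooled buffer received a complete, valid gzip stream.
assert gzip.decompress(gz_bytes) == payload
```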


1 Comment

If you hit ValueError: Stream must be at beginning, make sure gz_buffer.seek(0) is called before the blob.upload_from_file(...) line.

Hi, I tried to reproduce your use case:

  1. I created a cloud function using this quickstart link:

    def hello_world(request):
        from google.cloud import bigquery
        from google.cloud import storage
        import pandas as pd

        client = bigquery.Client()
        storage_client = storage.Client()

        path = '/tmp/file.gz'

        query_job = client.query("""
            SELECT
              CONCAT(
                'https://stackoverflow.com/questions/',
                CAST(id as STRING)) as url,
              view_count
            FROM `bigquery-public-data.stackoverflow.posts_questions`
            WHERE tags like '%google-bigquery%'
            ORDER BY view_count DESC
            LIMIT 10""")

        results = query_job.result().to_dataframe()
        results.to_csv(path, sep='|', index=False, encoding='utf-8', compression='gzip')

        bucket = storage_client.get_bucket('mybucket')
        blob = bucket.blob('file.gz')
        blob.upload_from_filename(path)
    
  2. This is the requirements.txt:

      # Function dependencies, for example:
      
      google-cloud-bigquery
      google-cloud-storage
      pandas
      
  3. I deployed the function.

  4. I checked the output.

      gsutil cp gs://mybucket/file.gz file.gz
      gzip -d file.gz
      cat file
      
      
      url|view_count
      https://stackoverflow.com/questions/22879669|52306
      https://stackoverflow.com/questions/13530967|46073
      https://stackoverflow.com/questions/35159967|45991
      https://stackoverflow.com/questions/10604135|45238
      https://stackoverflow.com/questions/16609219|37758
      https://stackoverflow.com/questions/11647201|32963
      https://stackoverflow.com/questions/13221978|32507
      https://stackoverflow.com/questions/27060396|31630
      https://stackoverflow.com/questions/6607552|31487
      https://stackoverflow.com/questions/11057219|29069
      
