
I feel like I'm overlooking something really simple, but I can't make it work. I'm using SQLite now, but a solution in SQLAlchemy would also be very helpful.

Let's create our original dataset:

### This is just the setup part
import pandas as pd
import sqlite3
conn = sqlite3.connect('test.sqlite')

orig = pd.DataFrame({'COLUPC': [100001, 100002, 100003, 100004],
'L5': ['ABC ALE', 'ABC MALT LIQUOR', 'ABITA AMBER', 'ABITA AMBER'],
'attr1': [0.25, 0.25, 0.041, 0.041]})

orig.to_sql("UPCs", conn, if_exists='replace', index=False)

#Create an index just in case it's needed
conn.execute("""CREATE INDEX upc_index
ON UPCs (COLUPC);""")

Now suppose I take that orig dataframe and add a column called 'L5_lower'. Then I create the column in the SQLite database:

# Create new variable
orig['L5_lower'] = orig.L5.str.lower()
conn.execute("alter table UPCs add column L5_lower TEXT;")

Now suppose I want to fill in only this single column, L5_lower, in the SQLite table, without having to pass the other columns (I explain below why I need this).

I tried passing the index and the new column as tuples:

query='''insert or replace into UPCs (COLUPC, L5_lower) values (?,?) '''
conn.executemany(query, orig[['COLUPC', 'L5_lower']].to_records(index=False))
conn.commit() 

# But then:
df = pd.read_sql("SELECT * FROM UPCs;", conn)
conn.close()

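gives this messed-up result: four extra rows are appended, with COLUPC stored as raw bytes, instead of the existing rows being updated.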

    COLUPC                               L5                 attr1   L5_lower
0   100001                               ABC ALE            0.250   None
1   100002                               ABC MALT LIQUOR    0.250   None
2   100003                               ABITA AMBER        0.041   None
3   100004                               ABITA AMBER        0.041   None
4   b'\xa1\x86\x01\x00\x00\x00\x00\x00'     None            NaN     abc ale
5   b'\xa2\x86\x01\x00\x00\x00\x00\x00'     None            NaN     abc malt liquor
6   b'\xa3\x86\x01\x00\x00\x00\x00\x00'     None            NaN     abita amber
7   b'\xa4\x86\x01\x00\x00\x00\x00\x00'     None            NaN     abita amber

Instead, the expected output is:

    COLUPC                               L5                 attr1   L5_lower
0   100001                               ABC ALE            0.250   abc ale
1   100002                               ABC MALT LIQUOR    0.250   abc malt liquor
2   100003                               ABITA AMBER        0.041   abita amber
3   100004                               ABITA AMBER        0.041   abita amber

So, why am I trying to pass a single column? I have a very big dataset and I won't be able to have the whole dataframe in memory. My intended workflow is to construct one column at a time and then update or insert into the SQLite database.

  • AFAIK you can't add COLUMNS using Pandas to_sql - you can add ROWS. One solution would be to insert a new column into a temporary table (with the same index as the original table has) and then update the source table on the SQLite side... Commented Jan 5, 2017 at 21:36
  • @MaxU Could you please provide some example code as an answer? I guess it makes sense to delete the auxiliary table after that. Commented Jan 5, 2017 at 21:45
  • I've added a working example - please check Commented Jan 5, 2017 at 22:54

1 Answer

AFAIK you can't add columns using pandas to_sql, only rows. One solution is to write the new column to a temporary table (with the same index as the original table) and then update the source table from it on the SQLite side.

Here is a working example:

SETUP:

Assuming we have the following original DataFrame:

In [79]: orig
Out[79]:
   COLUPC               L5  attr1
0  100001          ABC ALE  0.250
1  100002  ABC MALT LIQUOR  0.250
2  100003      ABITA AMBER  0.041
3  100004      ABITA AMBER  0.041

In [80]: orig.set_index('COLUPC', inplace=True)

In [81]: conn = sqlite3.connect('d:/temp/test.sqlite')

In [82]: orig.to_sql('upcs', conn, if_exists='replace', index=True)

In [83]: conn.close()

SOLUTION:

In [84]: conn = sqlite3.connect('d:/temp/test.sqlite')

In [85]: df = pd.read_sql('select * from upcs', conn, index_col='COLUPC')

In [86]: df
Out[86]:
                     L5  attr1
COLUPC
100001          ABC ALE  0.250
100002  ABC MALT LIQUOR  0.250
100003      ABITA AMBER  0.041
100004      ABITA AMBER  0.041

Create a temporary table holding only the key and the new column:

In [87]: tmp = orig.L5.str.lower().to_frame('L5_lower')

In [88]: tmp
Out[88]:
               L5_lower
COLUPC
100001          abc ale
100002  abc malt liquor
100003      abita amber
100004      abita amber

In [89]: tmp.to_sql('tmp', conn, if_exists='replace', index=True)

Add the new column to the SQLite table and fill it from the temporary table:

In [90]: conn.execute('alter table UPCs add column L5_lower varchar(50)')
Out[90]: <sqlite3.Cursor at 0xa558c00>

In [91]: qry = 'update upcs set L5_lower = (select L5_lower from tmp where tmp.COLUPC = upcs.COLUPC) where L5_lower is NULL'

In [92]: conn.execute(qry)
Out[92]: <sqlite3.Cursor at 0xa593570>

In [93]: conn.commit()

In [94]: conn.execute('drop table tmp')
Out[94]: <sqlite3.Cursor at 0xa5930a0>

Check:

In [95]: pd.read_sql('select * from upcs', conn, index_col='COLUPC')
Out[95]:
                     L5  attr1         L5_lower
COLUPC
100001          ABC ALE  0.250          abc ale
100002  ABC MALT LIQUOR  0.250  abc malt liquor
100003      ABITA AMBER  0.041      abita amber
100004      ABITA AMBER  0.041      abita amber

In [96]: conn.close()
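
As an aside, if one column's worth of values fits in memory, a plain UPDATE through executemany may also work without the temporary table. This is a minimal sketch and not part of the example above: it uses an in-memory database and plain Python lists as stand-ins for the DataFrame, and the name new_values is just illustrative. Two points it relies on: the keys have to be native Python ints (NumPy int64 values, which to_records produces, get bound by sqlite3 as BLOBs, which looks like the source of the b'\xa1\x86...' keys in the question), and INSERT OR REPLACE only replaces a row when the key column has a UNIQUE or PRIMARY KEY constraint, and even then it rewrites the whole row rather than a single column.

```python
import sqlite3

# In-memory stand-in for test.sqlite, with the same data as the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE UPCs (COLUPC INTEGER, L5 TEXT, attr1 REAL)")
conn.executemany(
    "INSERT INTO UPCs (COLUPC, L5, attr1) VALUES (?, ?, ?)",
    [
        (100001, "ABC ALE", 0.25),
        (100002, "ABC MALT LIQUOR", 0.25),
        (100003, "ABITA AMBER", 0.041),
        (100004, "ABITA AMBER", 0.041),
    ],
)
conn.execute("CREATE INDEX upc_index ON UPCs (COLUPC)")
conn.execute("ALTER TABLE UPCs ADD COLUMN L5_lower TEXT")

# (new value, key) pairs built from native Python types.
# Starting from a DataFrame, this could be something like
#   [(s, int(k)) for k, s in zip(orig['COLUPC'], orig['L5_lower'])]
# where int(k) converts numpy.int64 to a plain int.
existing = conn.execute("SELECT COLUPC, L5 FROM UPCs").fetchall()
new_values = [(name.lower(), upc) for upc, name in existing]

# UPDATE touches only the named column and leaves the others alone.
conn.executemany("UPDATE UPCs SET L5_lower = ? WHERE COLUPC = ?", new_values)
conn.commit()

rows = conn.execute(
    "SELECT COLUPC, L5, attr1, L5_lower FROM UPCs ORDER BY COLUPC"
).fetchall()
for row in rows:
    print(row)
```

For very large tables the temporary-table route above may still be preferable, since the update then runs entirely inside SQLite.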

4 Comments

Great work, thanks! I keep getting surprised how easy it is to add rows, but not columns. In my line of work, I seldom do the first, but continuously do the second
@cd98, in this case you may want to work with transposed data sets: your columns will become rows or at least store it this way
If updating several columns, is there a more efficient code than ' set col1 = (select col1 from temp where temp.id = table.id), col2 = (select col2 from temp where temp.id = table.id)' ?
What is the purpose of 'where L5_lower is NULL' in qry? If my new column also contains NULL value, will that affect the code?
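
On the last two comments, a possible sketch (not from the answer itself): SQLite 3.15 and later accept a row-value assignment, so several columns can be filled from one subquery. The WHERE EXISTS clause limits the update to rows that have a match in the temporary table, which seems to be roughly the job that 'where L5_lower is NULL' does in the answer's query, but it does not depend on the new values being non-NULL. The table layout and the names col1 and col2 here are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE upcs (COLUPC INTEGER PRIMARY KEY, L5 TEXT, col1 TEXT, col2 INTEGER)"
)
conn.executemany(
    "INSERT INTO upcs (COLUPC, L5, col1, col2) VALUES (?, ?, ?, ?)",
    [
        (100001, "ABC ALE", None, None),
        (100002, "ABC MALT LIQUOR", None, None),
        (100003, "ABITA AMBER", "keep me", 0),  # has no counterpart in tmp
    ],
)

# Temporary table with the key and the new values; col1 is NULL for 100002.
conn.execute("CREATE TABLE tmp (COLUPC INTEGER PRIMARY KEY, col1 TEXT, col2 INTEGER)")
conn.executemany(
    "INSERT INTO tmp (COLUPC, col1, col2) VALUES (?, ?, ?)",
    [(100001, "abc ale", 7), (100002, None, 15)],
)

# One subquery sets both columns (row values need SQLite >= 3.15).
# WHERE EXISTS skips rows without a match instead of overwriting them with NULL.
conn.execute(
    """
    UPDATE upcs
    SET (col1, col2) = (SELECT tmp.col1, tmp.col2
                        FROM tmp
                        WHERE tmp.COLUPC = upcs.COLUPC)
    WHERE EXISTS (SELECT 1 FROM tmp WHERE tmp.COLUPC = upcs.COLUPC)
    """
)
conn.execute("DROP TABLE tmp")
conn.commit()

result = conn.execute(
    "SELECT COLUPC, col1, col2 FROM upcs ORDER BY COLUPC"
).fetchall()
print(result)
```

Row 100002 ends up with a NULL col1 because that is the value in tmp, while row 100003 keeps its existing values because it has no match.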
