Splitting a CSV file by range of datetimes

Question

I have a pretty large CSV file that contains data from 2009-2015. I am wondering if there is an easy way to split this file into smaller files on a per month basis. I could split the data into evenly sized chunks, but I would prefer to group the data by month.

DateTime             Price    Bid    Ask    Size                                  
2009-09-28 09:30:17  35.5250  35.49  35.56  100      
2009-09-28 09:30:18  35.5600  35.49  35.56  100      
2009-09-28 09:30:18  35.5600  35.50  35.57  100      
2009-09-28 09:30:20  35.5000  35.42  35.56  100      
2009-09-28 09:30:20  35.5000  35.42  35.56  100      
2009-09-28 09:30:30  35.4600  35.46  35.52  100      
2009-09-28 09:30:30  35.4600  35.46  35.52  100      
2009-09-28 09:30:30  35.5000  35.46  35.52  100      
2009-09-28 09:30:33  35.5100  35.47  35.51  100      
2009-09-28 09:30:40  35.5100  35.48  35.51  200      
2009-09-28 09:30:41  35.5100  35.48  35.51  100      
2009-09-28 09:30:42  35.4803  35.48  35.51  100      
2009-09-28 09:30:42  35.4800  35.48  35.51  1044      
...                      ...    ...    ...  ...      
2015-04-07 15:59:59  94.1200  94.10  94.12  100      
2015-04-07 16:00:00  94.2000  94.09  94.60  300      
2015-04-07 16:00:00  94.2100  94.09  94.60  100      
2015-04-07 16:00:00  94.1800  94.09  94.60  217      
2015-04-07 16:00:05  94.1100  94.09  94.59  600      
2015-04-07 16:00:09  94.1100  94.09  94.59  350      
2015-04-07 16:00:32  94.1100  94.09  94.59  2804      
2015-04-07 16:00:32  94.1100  94.09  94.59  1582      
2015-04-07 16:00:32  94.1100  94.09  94.59  100      
2015-04-07 16:00:33  94.1100  94.09  94.59  600      
2015-04-07 16:00:35  94.1100  94.09  94.59  16702      

[29195283 rows x 5 columns]

Search for how to extract the year-month from a DateTime field. There are like a million duplicates already.
– smci, Commented Apr 8, 2015 at 16:15
Nearly exact duplicate: stackoverflow.com/questions/17937049/…
– smci, Commented Apr 8, 2015 at 16:19
Thanks. Sorry for the duplicate question. Wasn't sure exactly how to phrase the question.
– Tom Cadden, Commented Apr 8, 2015 at 16:47

Community · Accepted Answer · 2020-06-20 09:12:55Z

In [1599]: y.head()
Out[1599]: 
                       Price    Bid    Ask  Size
DateTime                                        
2009-09-28 09:30:17  35.5250  35.49  35.56   100
2009-09-28 09:30:18  35.5600  35.49  35.56   100
2009-09-28 09:30:18  35.5600  35.50  35.57   100
2009-09-28 09:30:20  35.5000  35.42  35.56   100
2009-09-28 09:30:20  35.5000  35.42  35.56   100

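If you want to group by month or year, you can group on the corresponding attribute of the DatetimeIndex. The examples here use the top-level `pd.groupby` function from older pandas; in current releases the equivalent is the method form, `y.groupby(y.index.year)`.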

`pd.groupby(y, by=[y.index.year])`

By month:

In [1597]: pd.groupby(y, by=[y.index.month]).count()
Out[1597]: 
   Price  Bid  Ask  Size
4     11   11   11    11
5      1    1    1     0
9     13   13   13    13

By year:

In [1598]: pd.groupby(y, by=[y.index.year]).count()
Out[1598]: 
      Price  Bid  Ask  Size
2009     13   13   13    13
2015     12   12   12    11

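You can also use `pd.TimeGrouper`, which groups by a calendar frequency and includes periods that contain no rows (later pandas versions replace it with `pd.Grouper(freq=...)`):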

In [1604]: y.groupby(pd.TimeGrouper(freq='M')).count().head()
Out[1604]: 
            Price  Bid  Ask  Size
DateTime                         
2009-09-30     13   13   13    13
2009-10-31      0    0    0     0
2009-11-30      0    0    0     0
2009-12-31      0    0    0     0
2010-01-31      0    0    0     0

In [1605]: y.groupby(pd.TimeGrouper(freq='D')).count().head()
Out[1605]: 
            Price  Bid  Ask  Size
DateTime                         
2009-09-28     13   13   13    13
2009-09-29      0    0    0     0
2009-09-30      0    0    0     0
2009-10-01      0    0    0     0
2009-10-02      0    0    0     0
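
To go from these groupings to the per-month files the question asks for, one option is to iterate over the groups and write each one out. The following is a minimal sketch, assuming a DataFrame indexed by `DateTime` as above. The helper name `split_by_month`, the `out_dir` and `prefix` parameters, and the output file naming are all illustrative choices, not part of pandas. It groups on a monthly period rather than on `index.month`, so only months that actually contain rows produce a file and September 2009 stays separate from any later September.

```python
import os

import pandas as pd


def split_by_month(frame, out_dir, prefix="prices"):
    """Write one CSV per calendar month that actually contains rows.

    `split_by_month`, `out_dir`, and `prefix` are illustrative names,
    not part of pandas.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    # A monthly period such as 2009-09 carries the year, so the same
    # month in different years ends up in different groups.
    for period, chunk in frame.groupby(frame.index.to_period("M")):
        path = os.path.join(out_dir, "{0}_{1}.csv".format(prefix, str(period)))
        chunk.to_csv(path)
        written.append(path)
    return written
```

Calling `split_by_month(y, "by_month")` would then produce files such as `by_month/prices_2009-09.csv`. This still requires the whole DataFrame to fit in memory.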

jwilner · Accepted Answer · 2015-04-08 16:15:28Z

Try `df.groupby([df.datetime.dt.year, df.datetime.dt.month])`. This assumes `datetime` is a datetime column (hence the `.dt` accessor) and that you want to group by year-month pairs rather than lumping every September together.
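
As a small illustration of the difference between grouping by month alone and by year-month pairs, here is a sketch on a made-up frame. The column name `datetime` follows the answer; the sample values are invented for the example. Because `datetime` is a column rather than the index, the sketch goes through the `.dt` accessor, and it renames the two keys so the resulting index levels have distinct names.

```python
import pandas as pd

# Invented sample rows: two Septembers in different years.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2009-09-28 09:30:17",
        "2009-09-28 09:30:18",
        "2010-09-01 10:00:00",
    ]),
    "Price": [35.525, 35.56, 40.0],
})

# Month alone lumps both Septembers into a single group.
by_month = df.groupby(df["datetime"].dt.month).size()

# (year, month) pairs keep the two Septembers separate.
year = df["datetime"].dt.year.rename("year")
month = df["datetime"].dt.month.rename("month")
by_year_month = df.groupby([year, month]).size()
```

Here `by_month` has one group while `by_year_month` has two, which is the behaviour you want when splitting a multi-year file into monthly pieces.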


Julien Spronck · Accepted Answer · 2015-04-08 16:25:21Z

In case of very large files, you might not want to put the entire file into a database or a list. You can do this instead.

In this example, I used a very simple regular expression to parse the date. There are more suitable regular expressions for this purpose, but this should work for you.

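import re

fileroot = 'blah'
yourfile = 'data.csv'  # placeholder: path to the large input file

# Match only the year and month (e.g. '2009-09') so lines are grouped per month
month_re = re.compile(r'\d{4}-\d\d')

with open(yourfile, 'r') as infile:
    for line in infile:
        datestr = month_re.match(line)
        if datestr:  # skip the header and any line without a leading date
            # Append the line to that month's file, e.g. blah_2009-09.txt
            with open('{0}_{1}.txt'.format(fileroot, datestr.group(0)), 'a') as fil:
                fil.write(line)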
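
Reopening the output file for every line can be slow on a file with tens of millions of rows. One possible variation, sketched here under the assumption that the number of distinct months is small enough to keep one open handle per month (2009 to 2015 is under a hundred), caches the handles in a dict. The function name `split_lines_by_month` and its parameters are hypothetical, and the file naming simply mirrors the code above.

```python
import os
import re

# Matches the leading 'YYYY-MM' of a line such as '2009-09-28 09:30:17 ...'.
MONTH_RE = re.compile(r"\d{4}-\d\d")


def split_lines_by_month(in_path, out_dir, fileroot="blah"):
    """Stream in_path once, appending each dated line to its month's file."""
    os.makedirs(out_dir, exist_ok=True)
    handles = {}  # 'YYYY-MM' -> open file object
    try:
        with open(in_path, "r") as infile:
            for line in infile:
                match = MONTH_RE.match(line)
                if not match:
                    continue  # skips the header and any undated lines
                month = match.group(0)
                if month not in handles:
                    name = "{0}_{1}.txt".format(fileroot, month)
                    handles[month] = open(os.path.join(out_dir, name), "a")
                handles[month].write(line)
    finally:
        # Close every output file, even if reading fails part-way.
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

The function returns the sorted list of months it saw, which is handy for a quick sanity check. Note that, like the code above, it drops the header row instead of copying it into each output file.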
