0

I have a pretty large CSV file that contains data from 2009-2015. I am wondering if there is an easy way to split this file into smaller files on a per month basis. I could split the data into evenly sized chunks, but I would prefer to group the data by month.

DateTime             Price    Bid    Ask    Size                                  
2009-09-28 09:30:17  35.5250  35.49  35.56  100      
2009-09-28 09:30:18  35.5600  35.49  35.56  100      
2009-09-28 09:30:18  35.5600  35.50  35.57  100      
2009-09-28 09:30:20  35.5000  35.42  35.56  100      
2009-09-28 09:30:20  35.5000  35.42  35.56  100      
2009-09-28 09:30:30  35.4600  35.46  35.52  100      
2009-09-28 09:30:30  35.4600  35.46  35.52  100      
2009-09-28 09:30:30  35.5000  35.46  35.52  100      
2009-09-28 09:30:33  35.5100  35.47  35.51  100      
2009-09-28 09:30:40  35.5100  35.48  35.51  200      
2009-09-28 09:30:41  35.5100  35.48  35.51  100      
2009-09-28 09:30:42  35.4803  35.48  35.51  100      
2009-09-28 09:30:42  35.4800  35.48  35.51  1044      
...                      ...    ...    ...  ...      
2015-04-07 15:59:59  94.1200  94.10  94.12  100      
2015-04-07 16:00:00  94.2000  94.09  94.60  300      
2015-04-07 16:00:00  94.2100  94.09  94.60  100      
2015-04-07 16:00:00  94.1800  94.09  94.60  217      
2015-04-07 16:00:05  94.1100  94.09  94.59  600      
2015-04-07 16:00:09  94.1100  94.09  94.59  350      
2015-04-07 16:00:32  94.1100  94.09  94.59  2804      
2015-04-07 16:00:32  94.1100  94.09  94.59  1582      
2015-04-07 16:00:32  94.1100  94.09  94.59  100      
2015-04-07 16:00:33  94.1100  94.09  94.59  600      
2015-04-07 16:00:35  94.1100  94.09  94.59  16702      

[29195283 rows x 5 columns]
3
  • Search for how to extract the year-month from a DateTime field. There are like a million duplicates already. Commented Apr 8, 2015 at 16:15
  • Nearly exact duplicate: stackoverflow.com/questions/17937049/… Commented Apr 8, 2015 at 16:19
  • Thanks. Sorry for the duplicate question. Wasn't sure exactly how to phrase the question. Commented Apr 8, 2015 at 16:47

3 Answers 3

1
In [1599]: y.head()
Out[1599]: 
                       Price    Bid    Ask  Size
DateTime                                        
2009-09-28 09:30:17  35.5250  35.49  35.56   100
2009-09-28 09:30:18  35.5600  35.49  35.56   100
2009-09-28 09:30:18  35.5600  35.50  35.57   100
2009-09-28 09:30:20  35.5000  35.42  35.56   100
2009-09-28 09:30:20  35.5000  35.42  35.56   100

If you want to group by month or year, you could do it with:

pd.groupby(y, by=[y.index.year])

By month:

In [1597]: pd.groupby(y, by=[y.index.month]).count()
Out[1597]: 
   Price  Bid  Ask  Size
4     11   11   11    11
5      1    1    1     0
9     13   13   13    13

By year:

In [1598]: pd.groupby(y, by=[y.index.year]).count()
Out[1598]: 
      Price  Bid  Ask  Size
2009     13   13   13    13
2015     12   12   12    11

pd.TimeGrouper

In [1604]: y.groupby(pd.TimeGrouper(freq='M')).count().head()
Out[1604]: 
            Price  Bid  Ask  Size
DateTime                         
2009-09-30     13   13   13    13
2009-10-31      0    0    0     0
2009-11-30      0    0    0     0
2009-12-31      0    0    0     0
2010-01-31      0    0    0     0

In [1605]: y.groupby(pd.TimeGrouper(freq='D')).count().head()
Out[1605]: 
            Price  Bid  Ask  Size
DateTime                         
2009-09-28     13   13   13    13
2009-09-29      0    0    0     0
2009-09-30      0    0    0     0
2009-10-01      0    0    0     0
2009-10-02      0    0    0     0
Sign up to request clarification or add additional context in comments.

Comments

0

Try df.groupby((df.datetime.year, df.datetime.month)). This assumes you want to group by year-month pairs, not just lumping every September together, e.g.

Comments

0

In case of very large files, you might not want to put the entire file into a database or a list. You can do this instead.

In this example, I used a very simple regular expression to parse the date. There are more suitable regular expressions for this purpose, but this should work for you.

import re
fileroot = 'blah'

with open(yourfile, 'r') as infile:
    for line in infile:
        datestr = re.match('\d{4}-\d\d-\d\d', line)
        if datestr:
            with open('{0}_{1}.txt'.format(fileroot, datestr.group(0)), 'a') as fil:
               fil.write(line)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.