I have a dataframe with multiple columns along with a date column. The date format is 12/31/15 and I have set it as a datetime object.

I set the datetime column as the index and want to perform a regression calculation for each month of the dataframe.

I believe the methodology to do this would be to split the dataframe into multiple dataframes based on month, store into a list of dataframes, then perform regression on each dataframe in the list.

I have used groupby which successfully split the dataframe by month, but am unsure how to correctly convert each group in the groupby object into a dataframe to be able to run my regression function on it.

Does anyone know how to split a dataframe into multiple dataframes based on date, or a better approach to my problem?

Here is the code I've written so far:

import pandas as pd
import numpy as np
import statsmodels.api as sm
from patsy import dmatrices

df = pd.read_csv('data.csv')
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')  # matches dates like 12/31/15
df = df.set_index('date')

# Group dataframe on index by month and year 
# Groupby works, but dmatrices does not 
for df_group in df.groupby(pd.TimeGrouper("M")):
    y,X = dmatrices('value1 ~ value2 + value3', data=df_group,      
    return_type='dataframe')
You can just use df.groupby(...).apply. No need to loop. I don't have time to type out a full answer. Here's a notebook I made that demonstrates something similar: gist.github.com/phobson/… Commented Mar 10, 2016 at 5:12

2 Answers

If you must loop, you need to unpack the key and the dataframe when you iterate over a groupby object:

import pandas as pd
import numpy as np
import statsmodels.api as sm
from patsy import dmatrices

df = pd.read_csv('data.csv')
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')  # matches dates like 12/31/15
df = df.set_index('date')

Note the use of group_name here:

for group_name, df_group in df.groupby(pd.Grouper(freq='M')):
    y, X = dmatrices('value1 ~ value2 + value3', data=df_group,
                     return_type='dataframe')
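
If you really do want the separate per-month dataframes the question asks about, the same unpacking lets you collect them into a dict. This is only a minimal sketch: the four-row frame is made-up data standing in for data.csv, and grouping on df.index.to_period("M") is a swapped-in alternative to pd.Grouper(freq='M') that yields the same monthly split.

```python
import pandas as pd

# Made-up sample data (hypothetical), indexed by dates written like 12/31/15.
df = pd.DataFrame(
    {"value1": [1.0, 2.0, 3.0, 4.0]},
    index=pd.to_datetime(
        ["12/30/15", "12/31/15", "01/04/16", "01/05/16"], format="%m/%d/%y"
    ),
)
df.index.name = "date"

# Each iteration yields (key, sub-dataframe); keep the sub-dataframes in a dict
# keyed by a "YYYY-MM" string.
monthly = {
    str(month): group
    for month, group in df.groupby(df.index.to_period("M"))
}
```

Each value in monthly is an ordinary DataFrame, so monthly["2015-12"] can be passed straight to dmatrices or any other function.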

If you want to avoid iteration, do have a look at the notebook in Paul H's gist (see his comment), but a simple example of using apply would be:

def do_regression(df_group, ret='outcome'):
    """Apply the function to each group in the data and return one result."""
    y,X = dmatrices('value1 ~ value2 + value3',
                    data=df_group,      
                    return_type='dataframe')
    if ret == 'outcome':
        return y
    else:
        return X

outcome = df.groupby(pd.Grouper(freq='M')).apply(do_regression, ret='outcome')
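
The function above only returns the design matrices. To go one step further and get fitted coefficients per month, something along these lines might work. It is a sketch under assumptions, not the asker's actual setup: the data is synthetic, fit_month is a hypothetical helper, and the fit uses numpy.linalg.lstsq (plain ordinary least squares) in place of statsmodels so that the example has no extra dependencies.

```python
import numpy as np
import pandas as pd

# Synthetic daily data standing in for data.csv (hypothetical values).
rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-01", "2015-03-31", freq="D")
df = pd.DataFrame(
    {"value2": rng.normal(size=len(idx)), "value3": rng.normal(size=len(idx))},
    index=idx,
)
# Noise-free target, so the true coefficients are 1.0, 2.0 and -0.5.
df["value1"] = 1.0 + 2.0 * df["value2"] - 0.5 * df["value3"]


def fit_month(group):
    """Least-squares fit of value1 ~ value2 + value3 for one group."""
    # Design matrix: intercept column plus the two regressors.
    X = np.column_stack([np.ones(len(group)), group["value2"], group["value3"]])
    coef, *_ = np.linalg.lstsq(X, group["value1"].to_numpy(), rcond=None)
    return pd.Series(coef, index=["Intercept", "value2", "value3"])


# One row of coefficients per calendar month, indexed by "YYYY-MM".
results = pd.DataFrame(
    {str(month): fit_month(g) for month, g in df.groupby(df.index.to_period("M"))}
).T
```

With real data you would swap the body of fit_month for a statsmodels fit and return whatever summary values you need.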

2 Comments

This is exactly what I did yesterday by using the "group_name". Thanks for your comment.
pd.TimeGrouper() was formally deprecated in pandas v0.21.0 in favor of pd.Grouper() (see this question).

This splits the data per year and writes each year to its own CSV file.

import pandas as pd
import dateutil.parser
dfile = 'rg_unificado.csv'
df = pd.read_csv(dfile, sep='|', quotechar='"', encoding='latin-1')
# parse the date strings in the FECHA column into datetimes
df['FECHA'] = df['FECHA'].apply(dateutil.parser.parse)
# offset aliases: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
# convert each date to a yearly period
per = df['FECHA'].dt.to_period("Y")
# group by that period; passing the Series itself (not a one-element list)
# keeps each group key a single Period in recent pandas versions
agg = df.groupby(per)
for year, group in agg:
    # save each year's rows to a separate file
    datep = str(year).replace('-', '')
    filename = '%s_%s.csv' % (dfile.replace('.csv', ''), datep)
    group.to_csv(filename, sep='|', quotechar='"', encoding='latin-1', index=False, header=True)
