This question builds on an earlier one: Split dataframe into grouped chunks

I am trying to split a large dataset into chunks using the solution proposed in that question. This is the code I'm referring to:

import pandas as pd

df = pd.DataFrame(data=['a', 'a', 'b', 'c', 'a', 'a', 'b', 'v', 'v', 'f'], columns=['A'])

def iter_by_group(df, column, num_groups):
    groups = []
    for i, group in df.groupby(column):
        groups.append(group)
        if len(groups) == num_groups:
            yield pd.concat(groups)
            groups = []
    if groups:
        yield pd.concat(groups)

for group in iter_by_group(df, 'A', 2):
    print(group)

The result of the print is:

    A
 0  a
 1  a
 4  a
 5  a
 2  b
 6  b
    A
 3  c
 9  f
    A
 7  v
 8  v

The issue is that I can't then access each of these chunks individually. If I reference group after the loop, I only get the last chunk, and if I replace print with return in the last for loop, I only get the first chunk. How can I alter the code so that I can access each chunk individually?

1 Answer

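Use pd.factorize to form groups, then store the grouped object in a dict. pd.factorize assigns each distinct value an integer code, and (code + N) // N maps every N consecutive codes to the same chunk number, starting at 1. Here the grouping is based on the order of occurrence. Add sort=True to pd.factorize to form groups based on the sorted key order instead.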
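
To make the grouping key concrete, here is a minimal sketch of the intermediate values, assuming the example df from the question and N = 2. The names codes, uniques and chunk_ids are only illustrative and are not part of the original solution.

```python
import pandas as pd

# Example frame from the question.
df = pd.DataFrame(data=['a', 'a', 'b', 'c', 'a', 'a', 'b', 'v', 'v', 'f'], columns=['A'])
N = 2

# factorize returns integer codes plus the distinct values,
# numbered in order of first appearance by default.
codes, uniques = pd.factorize(df['A'])
print(list(uniques))   # ['a', 'b', 'c', 'v', 'f']
print(codes.tolist())  # [0, 0, 1, 2, 0, 0, 1, 3, 3, 4]

# Integer division folds every N consecutive codes into one chunk number (1-based).
chunk_ids = (codes + N) // N
print(chunk_ids.tolist())  # [1, 1, 1, 2, 1, 1, 1, 2, 2, 3]

# With sort=True the codes follow the sorted order of the values instead.
sorted_codes, sorted_uniques = pd.factorize(df['A'], sort=True)
print(list(sorted_uniques))  # ['a', 'b', 'c', 'f', 'v']
```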

N = 2
col = 'A'

d = dict(tuple(df.groupby((pd.factorize(df[col])[0]+N)//N)))

Output:

d[1]
#   A
#0  a
#1  a
#2  b
#4  a
#5  a
#6  b

d[2]
#   A
#3  c
#7  v
#8  v

d[3]
#   A
#9  f
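
If you would rather keep the generator from the question, another option is to collect everything it yields into a list or dict and index into that. This is only a sketch, assuming all chunks fit in memory at once; chunks and chunk_by_number are names chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame(data=['a', 'a', 'b', 'c', 'a', 'a', 'b', 'v', 'v', 'f'], columns=['A'])

def iter_by_group(df, column, num_groups):
    """Yield DataFrames that each hold up to num_groups groups of `column`."""
    groups = []
    for _, group in df.groupby(column):
        groups.append(group)
        if len(groups) == num_groups:
            yield pd.concat(groups)
            groups = []
    if groups:
        yield pd.concat(groups)

# A generator can only be consumed once, so materialize it into a list.
chunks = list(iter_by_group(df, 'A', 2))
print(len(chunks))  # 3
print(chunks[0])    # rows for 'a' and 'b'

# Or key the chunks by a 1-based chunk number.
chunk_by_number = dict(enumerate(iter_by_group(df, 'A', 2), start=1))
print(chunk_by_number[3])  # rows for 'v'
```

Note that groupby sorts the keys by default, so these chunks follow the sorted key order.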