
I am facing an issue where I have to load a huge CSV file, split it into multiple files based on the unique values in some of its columns, and output those files as multiple CSVs with a predefined naming pattern.

The example of the original CSV is as below.

date     place  type    product value   zone
09/10/16 NY     Zo      shirt   19       1
09/10/16 NY     Mo      jeans   18       2
09/10/16 CA     Zo      trouser 13       3
09/10/16 CA     Co      tie     17       4
09/10/16 WA     Wo      bat     11       1
09/10/16 FL     Zo      ball    12       2
09/10/16 NC     Mo      belt    13       3
09/10/16 WA     Zo      buckle  15       4
09/10/16 WA     Co      glass   16       1
09/10/16 FL     Zo      cup     19       2

I have to filter this massive pandas DataFrame into multiple pandas DataFrames based on place, type and zone, and each resulting DataFrame should be written to its own CSV file with the naming convention place_type_product_zone.csv.
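
For reference, here is a minimal sketch that rebuilds the sample data above as a DataFrame (in reality the data is loaded with pd.read_csv; the file name in the comment is just a placeholder):

import pandas as pd

# Minimal reconstruction of the sample shown above; in practice the data
# would come from something like pd.read_csv('huge_file.csv') instead.
df = pd.DataFrame({
    'date': ['09/10/16'] * 10,
    'place': ['NY', 'NY', 'CA', 'CA', 'WA', 'FL', 'NC', 'WA', 'WA', 'FL'],
    'type': ['Zo', 'Mo', 'Zo', 'Co', 'Wo', 'Zo', 'Mo', 'Zo', 'Co', 'Zo'],
    'product': ['shirt', 'jeans', 'trouser', 'tie', 'bat', 'ball', 'belt', 'buckle', 'glass', 'cup'],
    'value': [19, 18, 13, 17, 11, 12, 13, 15, 16, 19],
    'zone': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2],
})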

The code I have so far is below.

def list_of_dataframes(df, col_list):
    # Split the DataFrame on each column in turn, keeping the pieces in a list
    df_list = [df]
    names = []
    for col in col_list:
        df_list, names = _split_dataframes(df_list, col)

    file_names = dict(zip(names, df_list))
    for name, frame in file_names.items():
        frame.to_csv("{0}.csv".format(name))

    print("CSV files created")
    return df_list, file_names


def _split_dataframes(df_list, col):
    # Split every DataFrame in df_list by the unique values of col
    names = []
    dfs = []
    for df in df_list:
        for c in df[col].unique():
            dfs.append(df.loc[df[col] == c])
            names.append(c)
    return dfs, names

list_of_dataframes(df, ['place', 'type', 'zone'])

It outputs CSV files named 1.csv, 2.csv, etc. How do I change the loop in the function to get names following the convention, such as NY_Zo_shirt_1.csv, CA_Zo_trouser_3.csv? Should I be creating a dictionary that stores all the keys?

Thanks in advance.

  • Do you have to create a CSV for each unique combination of product, type and place? Commented Nov 6, 2018 at 5:46
  • Yes. I will have to create a separate CSV for every combination, using the above naming convention. Commented Nov 6, 2018 at 5:48

1 Answer


Here it is -

# Part 1 - collect the unique values of each splitting column
places = df['place'].unique()
types = df['type'].unique()
products = df['product'].unique()
zones = df['zone'].unique()

# Part 2 - build every possible (place, type, product, zone) combination
import itertools
combs = list(itertools.product(*[places, types, products, zones]))

# Part 3 - filter the DataFrame for each combination and write the non-empty subsets
for comb in combs:
    place, type_, prod, zone = comb
    df_subset = df[(df['place']==place) & (df['type']==type_) & (df['product']==prod) & (df['zone']==zone)]
    if df_subset.shape[0] > 0:
        df_subset.to_csv('temp1/{}_{}_{}_{}.csv'.format(place, type_, prod, zone), index=False)

Output: one CSV per combination present in the data, written to temp1/ (screenshot omitted).


3 Comments

If you run this, a lot of additional empty files get created. Imagine a tree: the original df is divided into DataFrames for the 4 unique zones, those 4 DataFrames are split on product, and the resulting product DataFrames are split into types. I hope I am being clear here.
@Matt Resolved that with the if statement checking the shape.
@Matt You can also take care of it by applying the filters in a nested way rather than over every combination (see the sketch below). If this helped, you can upvote/accept the answer by clicking the greyed-out tick mark; it helps others searching for something similar.
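
To make that "nested" suggestion concrete, here is one way it could look using groupby instead of building every combination up front (a sketch, not the code from the answer; it keeps the temp1/ output directory used above). groupby only yields combinations that actually occur in the data, so no empty files are written and the shape check becomes unnecessary:

import os

os.makedirs('temp1', exist_ok=True)

# Each group key is a (place, type, product, zone) tuple that exists in the data,
# so every file written contains at least one row.
for (place, type_, prod, zone), group in df.groupby(['place', 'type', 'product', 'zone']):
    group.to_csv('temp1/{}_{}_{}_{}.csv'.format(place, type_, prod, zone), index=False)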
