Subsetting a dataset on two conditions, saving each dataframe to a .csv file, then iterating through the files and plotting figures

Question

I am new to data science and need help with the following:

(I) Splitting a dataset by the unique combinations of two columns, in my case region and country.

(II) Saving each resulting dataframe as a .csv file named regionname_country.csv, for example west_GER.csv or east_POL.csv.

(III) If possible, looping over each .csv file to draw a scatterplot of education vs. age for each dataframe.

(IV) Lastly, saving the plots/figures in a PDF file (4 figures per page).

'df'
   Region, country, Age, Education, Income, FICO, Target
1   west, GER, 43, 1, 47510, 710, 1
2   east, POL, 32, 2, 73640, 723, 1
3   east, POL, 22, 2, 88525, 610, 0
4   west, GER, 55, 0, 31008, 592, 0
5   north, USA, 19, 0, 18007, 599, 1
6   south, PER, 27, 2, 68850, 690, 0
7   south, BRZ, 56, 3, 71065, 592, 0
8   north, USA, 39, 1, 98004, 729, 1
9   east, JPN, 36, 2, 51361, 692, 0
10  west, ESP, 59, 1, 98643, 729, 1

Desired outcome:

 # df_to_csv : 'west_GER.csv'
west, GER, 43, 1, 47510, 710, 1 
west, GER, 55, 0, 31008, 592, 0

# west_ESP.csv
west, ESP, 59, 1, 98643, 729, 1

# east_POL.csv
east, POL, 32, 2, 73640, 723, 1

.
.
.

# north_USA.csv
north, USA, 39, 1, 98004, 729, 1  
north, USA, 19, 0, 18007, 599, 1

See below for my code

# using pandas 

# code for (I) and (II) not sure of my code but I think I need to nest through the for loop

for i, split_df in df.groupby('Region'):
     for j in df.groupby('country'): # not sure of the nested for loop
      split_df.to_csv(f'{i,j}.csv', index = False) # not sure of the {i,j} part

# code for (III) and (IV)

import glob
import numpy
import matplotlib.pyplot 
from matplotlib import pyplot as plot
from matplotlib.backends.backend_pdf import PdfPages


filenames = sorted(glob.glob('_*.csv')) # retrieving all files containing '_' since we have region_country.csv
filenames = filenames[0:len(filenames)]
for filename in filenames:
    print(filename)

    data = numpy.loadtxt(fname=filename, delimiter=',')
    # The PDF document
    pdf_pages = PdfPages('plots.pdf')
    fig, ax = plt.subplots()    # create a figure

    # Generate the pages
    nb_plots = data.shape[0]
    nb_plots_per_page = 4
    nb_pages = int(numpy.ceil(nb_plots / float(nb_plots_per_page)))
    grid_size = (nb_plots_per_page, 1)
    for i, samples in enumerate(data):
    # Create a figure instance (ie. a new page) if needed
      if i % nb_plots_per_page == 0:
      fig = plot.figure(figsize=(8.5, 12), dpi=125)

    # plot stuff 
      x = data[:,2]  # age column
      y = data[:,3] # education column

     ax.plot(x, y,color = colorlist[i])

     ax.set_xscale("log")
     ax.set_xlabel("x")
     ax.set_ylabel("y")

     plt.show()
    # Close the page if needed
    if (i + 1) % nb_plots_per_page == 0 or (i + 1) == nb_plots:
    plot.tight_layout()
    pdf_pages.savefig(fig)
 
    # Write the PDF document to the disk
    pdf_pages.close()

Any assistance would be much appreciated. I am open to both Python and R. Thank you in advance.


#Attempt for PCA 


import glob
import matplotlib.pyplot as plt

fig, axs = plt.subplots(nrows=2, ncols=2)
for ax, file in zip(axs.flatten(), glob.glob("./*csv")):
    df_temp = pd.read_csv(file) # read each csv file
    df_temp.drop('Unnamed: 0', axis=1, inplace=True) # drop the index number columns
    df_temp = df_temp.dropna() # drop NaNs

    X = df_temp.iloc[:,4:len(df_temp.columns)]#.astype(float) # select the 5th columns to the end 
    y = df_temp.iloc[:,0] # the first column is the label column
    # PCA starts from here
    scaler = StandardScaler()
    scaler.fit(X)
    X = scaler.transform(X)
    pca = PCA(n_components=2)
    pca.fit(X)
    x_pca = pca.transform(X)
     # I want to convert the x_pca array in dataframe for easier plotting
    data = pd.DataFrame({'PC1': x_pca[:, 0], 'PC2': x_pca[:, 1]})
    PC1_temp = data['PC1'][0]
    PC2_temp = data['PC2'][0]
    categories = y # label column to be used for distinguish the two classes
    colormap = np.array(['r', 'g']) # desired color red and green for the two distinct classes in the label column
    ax.scatter(x_pca[:,0], x_pca[,:1],c=colormap[categories])
    ax.set_title(f"PC1:{PC1_temp}, P2:{PC2_temp}")
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    plt.tight_layout()
    plt.legend()# Also, I want to include a legend to show the 'r', 'g' values of the two distinct classes of label column
fig.savefig("scatter.pdf")


Andre S. · Accepted Answer · 2020-10-29 06:16:43Z

1

For Python:
(I) & (II):

# groupby on both columns yields one (region, country) key and one sub-frame per pair
for (region, country), group in df.groupby(["Region", "country"]):
    group.to_csv(f"{region}_{country}.csv")

(III) & (IV):

import glob
import matplotlib.pyplot as plt
import pandas as pd

# One 2x2 grid; zip() stops after four axes, so only the first four files are drawn
fig, axs = plt.subplots(nrows=2, ncols=2)
for ax, file in zip(axs.flatten(), sorted(glob.glob("./*.csv"))):
    df_temp = pd.read_csv(file)
    region_temp = df_temp["Region"][0]
    country_temp = df_temp["country"][0]
    ax.scatter(df_temp["Age"], df_temp["Education"])
    ax.set_title(f"Region:{region_temp}, Country:{country_temp}")
    ax.set_xlabel("Age")
    ax.set_ylabel("Education")
fig.tight_layout()  # lay out once, after all panels are drawn
fig.savefig("scatter.pdf")
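
If there are more than four CSV files, a single 2x2 figure cannot hold them all. A minimal sketch of one way to get four panels per page for every file, using `PdfPages` (which the question already imports), is below. The function name `scatter_pages` and the page size are assumptions made for illustration, not part of the original answer.

```python
import math

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, figures only go to the PDF
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages

PER_PAGE = 4  # panels per page, laid out as a 2x2 grid


def scatter_pages(files, pdf_path):
    """Draw one Age-vs-Education scatter per CSV, four per PDF page.

    `scatter_pages` is a hypothetical helper; it returns the number of pages written.
    """
    files = list(files)
    n_pages = math.ceil(len(files) / PER_PAGE)
    with PdfPages(pdf_path) as pdf:
        for page in range(n_pages):
            chunk = files[page * PER_PAGE:(page + 1) * PER_PAGE]
            fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(8.5, 11))
            for ax, file in zip(axs.flatten(), chunk):
                df_temp = pd.read_csv(file)
                ax.scatter(df_temp["Age"], df_temp["Education"])
                ax.set_title(f"{df_temp['Region'][0]}, {df_temp['country'][0]}")
                ax.set_xlabel("Age")
                ax.set_ylabel("Education")
            # Hide the unused panels on a partially filled last page
            for ax in axs.flatten()[len(chunk):]:
                ax.set_visible(False)
            fig.tight_layout()
            pdf.savefig(fig)  # each savefig call appends one page
            plt.close(fig)
    return n_pages
```

It could be called as `scatter_pages(sorted(glob.glob("./*.csv")), "scatter.pdf")` after `import glob`; with seven files that would give two pages, the second holding three panels.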

edited Oct 29, 2020 at 6:16

answered Oct 28, 2020 at 7:51


9 Comments

nasa313 Over a year ago

@Andre.S. Thanks for this. Code (III) and (IV) works perfectly well; however, your code for (I) and (II) returns one unique region per country for each .csv file (I think I didn't explain this part correctly). What I want is something like this: west_UK.csv, west_ESP.csv, north_USA.csv, east_POL.csv and east_JPN.csv. Note that UK and ESP both belong to the west region. In simpler terms, each .csv should hold a unique country within a unique region. Thanks

Andre S. Over a year ago

I just updated the solution for (I) & (II). Upvotes are welcome if it helps you :)

nasa313 Over a year ago

@Andre.S. Unfortunately, it didn't work, it returned only 1 .csv file out of 40 files.

Andre S. Over a year ago

Mhh, the code for (I) and (II) applied to the above DataFrame generates 7 separate CSVs. Did you maybe forget the "f" in the string of the .to_csv argument? Then the CSV would be overwritten with every iteration.
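
The count of seven can be checked with a dry run that only builds the file names and writes nothing. This is a sketch that re-types the question's sample rows as CSV text; the names `SAMPLE` and `groups` are introduced here for illustration.

```python
import io

import pandas as pd

# The ten sample rows from the question, re-typed as CSV text
SAMPLE = """Region,country,Age,Education,Income,FICO,Target
west,GER,43,1,47510,710,1
east,POL,32,2,73640,723,1
east,POL,22,2,88525,610,0
west,GER,55,0,31008,592,0
north,USA,19,0,18007,599,1
south,PER,27,2,68850,690,0
south,BRZ,56,3,71065,592,0
north,USA,39,1,98004,729,1
east,JPN,36,2,51361,692,0
west,ESP,59,1,98643,729,1
"""

df = pd.read_csv(io.StringIO(SAMPLE))

# Map each would-be file name to the rows it would contain
groups = {
    f"{region}_{country}.csv": group
    for (region, country), group in df.groupby(["Region", "country"])
}

print(len(groups))     # 7
print(sorted(groups))  # east_JPN.csv, east_POL.csv, north_USA.csv, ...
```

If the `f` prefix is dropped from the file-name string, every group maps to the same literal name, which matches the single-file symptom described above.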

nasa313 Over a year ago

@Andre.S. Thanks, it works fine now! One last thing: assuming that in code (III) and (IV) I want to do PCA on the Age, Income, Education and FICO (numeric features) columns, plotting PC1 vs. PC2 for each csv file, at what point do I include the normalize and pca.fit steps? (If you could include it and comment it in your answer code, I would really appreciate it!) Thanks, you are a lifesaver!
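
The thread does not include a reply to this, so the following is only a sketch of where the two steps could go: right after `pd.read_csv(file)` and before `ax.scatter(...)`. To stay dependency-light it uses plain NumPy (standardise, then SVD) rather than scikit-learn's `StandardScaler` and `PCA`, which follow the same steps. The helper name `pca_2d` and the `FEATURES` list are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Numeric columns named in the comment above (assumed to exist in every CSV)
FEATURES = ["Age", "Education", "Income", "FICO"]


def pca_2d(frame, columns=FEATURES):
    """Standardise `columns`, then project the rows onto the first two principal axes.

    Hypothetical helper; returns an array of shape (n_rows, 2) holding PC1 and PC2.
    """
    X = frame[columns].to_numpy(dtype=float)
    if X.shape[0] < 2:
        raise ValueError("need at least two rows for a 2-component PCA")
    # Normalise: zero mean and unit (population) standard deviation per column
    X = X - X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns stay at zero instead of dividing by 0
    X = X / std
    # Rows of Vt are the principal axes, ordered by decreasing singular value
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T
```

Inside the plotting loop this would read `scores = pca_2d(df_temp)` followed by `ax.scatter(scores[:, 0], scores[:, 1])`. Groups with a single row (such as west_ESP in the sample data) cannot be projected this way and would need to be skipped. The sign of each component is arbitrary, so a plot may appear mirrored compared with another PCA implementation.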


Ronak Shah · Accepted Answer · 2020-10-28 08:08:22Z

0

In R, you can do this as follows:

library(tidyverse)

#get data in list of dataframes
df %>%
  select(Region, country, Education, Age) %>%
  group_split(Region, country) -> split_data

#From list of data create list of plots. 
list_plots <- map(split_data, ~ggplot(.) + aes(Education, Age) + 
                geom_point() + 
                 ggtitle(sprintf('Plot for region %s and country %s', 
                 first(.$Region), first(.$country))))

#Write the plots in pdf as well as write the csvs.
pdf("plots.pdf", onefile = TRUE)
for (i in seq_along(list_plots)) {
  # write only the i-th group, not the whole list of groups
  write.csv(split_data[[i]], sprintf('%s_%s.csv', 
      split_data[[i]]$Region[1], split_data[[i]]$country[1]), row.names = FALSE)
  print(list_plots[[i]]) 
}
dev.off()

answered Oct 28, 2020 at 8:08


1 Comment

nasa313 Over a year ago

@Ronak, the plots worked (except that I want 4 figures per page instead of 1 figure per page). Also, each of the .csv files contains 40 rows of <tbl_df[,140]>. I think this is because you plotted the figures before exporting the subset data frames (I am not sure though). Could you please subset the dataset based on the unique region and unique country combination you used, and also make it 4 figures per page? Thanks in advance
