
I have a tree with the following structure:

my_hash_pop = {
    "Europe" : {
        "France" : {
            "Paris" : 2220445,
            "Lille" : 225789,
            "Lyon" : 506615 },
        "Germany" : {
            "Berlin" : 3520031,
            "Munchen" : 1544041,
            "Dresden" : 540000 },
        },
    "South America" : {
        "Brasil" : {
            "Sao Paulo" : 11895893,
            "Rio de Janeiro" : 6093472 },
        "Argentina" : {
            "Salta" : 535303,
            "Buenos Aires" : 3090900 },
        },
    }

I would like to convert this structure to CSV using Python:

Europe;Germany;Berlin;3520031
Europe;Germany;Munchen;1544041
Europe;Germany;Dresden;540000
Europe;France;Paris;2220445
Europe;France;Lyon;506615
Europe;France;Lille;225789
South America;Argentina;Buenos Aires;3090900
South America;Argentina;Salta;535303
South America;Brasil;Sao Paulo;11895893
South America;Brasil;Rio de Janeiro;6093472

As my tree contains a large number of leaves in real life (unlike this example, obviously), the conversion script I'm using takes ages. I'm trying to find a more efficient way to do the conversion. Here is what I tried:

First method: concatenate a string at every leaf:

### METHOD 1 ###

import time

start_1 = time.time()

data_to_write = ""

for region in my_hash_pop:
    for country in my_hash_pop[region]:
        for city in my_hash_pop[region][country]:
            data_to_write += region+";"+country+";"+city+";"+str(my_hash_pop[region][country][city])+"\n"

filename = "my_test_1.csv"
with open(filename, 'w') as outfile:
    outfile.write(data_to_write)

end_1 = time.time()
print("---> METHOD 1 : Write all took " + str(end_1 - start_1) + "s")

Second method: concatenate strings with "checkpoints":

### METHOD 2 ###

start_2 = time.time()

data_to_write = ""

for region in my_hash_pop:
    region_to_write = ""

    for country in my_hash_pop[region]:
        country_to_write = ""

        for city in my_hash_pop[region][country]:
            city_to_write = region+";"+country+";"+city+";"+str(my_hash_pop[region][country][city])+"\n"
            country_to_write += city_to_write

        region_to_write += country_to_write

    data_to_write += region_to_write

filename = "my_test_2.csv"
with open(filename, 'w') as outfile:
    outfile.write(data_to_write)

end_2 = time.time()
print("---> METHOD 2 : Write all took " + str(end_2 - start_2) + "s")

Third method: use a csv.writer object:

### METHOD 3 ###

import csv

start_3 = time.time()

with open("my_test_3.csv", 'w+') as outfile:
    del_char = b";"
    w = csv.writer(outfile, delimiter=del_char)

    for region in my_hash_pop:
        for country in my_hash_pop[region]:
            for city in my_hash_pop[region][country]:
                w.writerow([region, country, city, str(my_hash_pop[region][country][city])])

end_3 = time.time()
print("---> METHOD 3 : Write all took " + str(end_3 - start_3) + "s")

Comparing the time each method takes as I grow my example tree, I notice that method 1 is rather inefficient. Between methods 2 and 3, though, the results vary and are not clearly distinct (method 3 usually seems to be more efficient).

I have therefore two questions :

  • Do you see another method I may want to try?
  • Is there a better way to check and compare the efficiency of the different methods? (See the timeit sketch below for the kind of harness I have in mind.)

And a bonus one :

  • I noticed that the output files of methods 1 and 2 are exactly the same size, while the output file of method 3 is larger than the other two. Why?
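
On the second question: so far I only take single time.time() measurements, as shown above. Here is a sketch of the timeit harness I have in mind (method_1, method_2 and method_3 are hypothetical zero-argument functions wrapping the snippets above):

import timeit

# Hypothetical wrappers: method_1, method_2 and method_3 would each
# package one of the snippets above as a zero-argument function.
for name, method in [("method 1", method_1),
                     ("method 2", method_2),
                     ("method 3", method_3)]:
    # Best of 3 repeats of 5 calls each, to reduce timing noise.
    best = min(timeit.repeat(method, number=5, repeat=3))
    print("%s: best of 3 x 5 calls = %.3fs" % (name, best))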

Thanks for any contribution!

4 Answers


The 3rd method is the most promising.

You could avoid many dict lookups by using items() at each level:

with open("my_test_3.csv", 'w+') as outfile:
    del_char = ";"
    w = csv.writer(outfile, delimiter=del_char)

    for region,countries in my_hash_pop.items():
        for country,cities in countries.items():
            for city,value in cities.items():
                w.writerow([region, country, city, value])
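
A further tweak (my suggestion, not benchmarked here) is to hand the writer a single generator through writerows(), which moves the row loop inside the csv module:

with open("my_test_3.csv", 'w') as outfile:
    w = csv.writer(outfile, delimiter=";")
    # writerows() accepts any iterable of rows, so the triple loop
    # collapses into one generator expression.
    w.writerows((region, country, city, value)
                for region, countries in my_hash_pop.items()
                for country, cities in countries.items()
                for city, value in cities.items())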

The difference in size between examples 2 and 3 comes from the newlines: "\n" for 'my_test_2.csv' and "\r\n" for 'my_test_3.csv' (the csv module's default line terminator). So every line in 'my_test_3.csv' is 1 byte larger than in 'my_test_2.csv'.
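
If you want method 3's output to match method 2's byte for byte, you can override the terminator via the csv module's lineterminator parameter (a one-line sketch; this matches on platforms where the text layer does not translate "\n"):

w = csv.writer(outfile, delimiter=";", lineterminator="\n")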


3 Comments

Good catch on the size difference between the files! I'll try .items() and let you know how much it improves the time.
This method is indeed more efficient than the one I used, so thank you for that. It takes exactly the same time as @QuantumEnergy's answer.
@Mago sure. Quantum uses the exact same loops, but packed in a nested list comprehension.
import time

start_4 = time.time()
filename = "my_test_4.csv"
with open(filename, 'w') as outfile:
    a = [outfile.write("%s;%s;%s;%s\n" % (k, kk, kkk, vvv))
         for (k, v) in my_hash_pop.items()
         for (kk, vv) in v.items()
         for (kkk, vvv) in vv.items()]
end_4 = time.time()
print("---> METHOD 4 : Write all took " + str(end_4 - start_4) + "s")

3 Comments

Thanks, it's more efficient, and it also takes exactly the same time as @eric-duminil's answer.
You don't need a, do you?
Yes, don't need a!

A suggestion would be to use pandas, as follows:

import pandas as pd
df = pd.DataFrame([(i,j,k,my_hash_pop[i][j][k])
                           for i in my_hash_pop.keys() 
                           for j in my_hash_pop[i].keys()
                           for k in my_hash_pop[i][j].keys()])

with open("my_test_4.csv", 'w') as outfile:
    outfile.write(df.to_csv(sep=';', header=False, index=False)))

I have not compared execution times, and maybe using pandas is not an option for you, so this is just a suggestion.
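
Note that to_csv can also write straight to a path, which skips the manual file handling and the intermediate string (an untimed variant of the snippet above):

df.to_csv("my_test_4.csv", sep=';', header=False, index=False)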

1 Comment

I can't seem to install the pandas module in my environment, but I'll try on a more open one and let you know.

pandas is very efficient when it comes to handling large data sets. Below is a method to load the dict of dicts into pandas, flatten it using json_normalize, and then manipulate it, e.g. write it to CSV.

Let me know how it fares with your options.

Source-Code

import csv
from pandas.io.json import json_normalize

# json_normalize flattens the nested dict into a one-row DataFrame
# whose column names encode the path, e.g. 'Europe.France.Paris'.
df = json_normalize(my_hash_pop)

filename = "temp.csv"
del_char = ";"

with open(filename, 'w') as outfile:
    w = csv.writer(outfile, delimiter=del_char, quoting=csv.QUOTE_MINIMAL)
    for i in df.keys():
        # Split 'Europe.France.Paris' back into its three levels and
        # append the population value from the single row.
        s = "{};{}".format(i.replace('.', del_char), df[i][0]).split(del_char)
        w.writerow(s)
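
For the example tree, json_normalize returns a single-row DataFrame with one column per leaf (names like 'Europe.France.Paris'), which the loop above splits back apart on the '.' separator. In recent pandas versions the same function is also available directly as pd.json_normalize.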

1 Comment

I can't seem to install the pandas module in my environment, but I'll try on a more open one and let you know. Thanks
