
I have a tree with the following structure:

my_hash_pop = {
    "Europe" : {
        "France" : {
            "Paris" : 2220445,
            "Lille" : 225789,
            "Lyon" : 506615 },
        "Germany" : {
            "Berlin" : 3520031,
            "Munchen" : 1544041,
            "Dresden" : 540000 },
        },
    "South America" : {
        "Brasil" : {
            "Sao Paulo" : 11895893,
            "Rio de Janeiro" : 6093472 },
        "Argentina" : {
            "Salta" : 535303,
            "Buenos Aires" : 3090900 },
        },
    }

I would like to convert this structure to CSV using Python:

Europe;Germany;Berlin;3520031
Europe;Germany;Munchen;1544041
Europe;Germany;Dresden;540000
Europe;France;Paris;2220445
Europe;France;Lyon;506615
Europe;France;Lille;225789
South America;Argentina;Buenos Aires;3090900
South America;Argentina;Salta;535303
South America;Brasil;Sao Paulo;11895893
South America;Brasil;Rio de Janeiro;6093472

As my tree contains a large number of leaves in real life (unlike this example, obviously), the conversion script I'm using takes ages. I'm trying to find a more efficient way to do the conversion. Here is what I tried:

First method: concatenate a string at every leaf:

### METHOD 1 ###

import time

start_1 = time.time()

data_to_write = ""

for region in my_hash_pop:
    for country in my_hash_pop[region]:
        for city in my_hash_pop[region][country]:
            data_to_write += region+";"+country+";"+city+";"+str(my_hash_pop[region][country][city])+"\n"

filename = "my_test_1.csv"
with open(filename, 'w') as outfile:
    outfile.write(data_to_write)

end_1 = time.time()
print("---> METHOD 1 : Write all took " + str(end_1 - start_1) + "s")

Second method: concatenate strings with "checkpoints":

### METHOD 2 ###

start_2 = time.time()

data_to_write = ""

for region in my_hash_pop:
    region_to_write = ""

    for country in my_hash_pop[region]:
        country_to_write = ""

        for city in my_hash_pop[region][country]:
            city_to_write = region+";"+country+";"+city+";"+str(my_hash_pop[region][country][city])+"\n"
            country_to_write += city_to_write

        region_to_write += country_to_write

    data_to_write += region_to_write

filename = "my_test_2.csv"
with open(filename, 'w') as outfile:
    outfile.write(data_to_write)

end_2 = time.time()
print("---> METHOD 2 : Write all took " + str(end_2 - start_2) + "s")

Third method: use a csv.writer object:

### METHOD 3 ###

import csv

start_3 = time.time()

with open("my_test_3.csv", 'w+') as outfile:
    del_char = b";"
    w = csv.writer(outfile, delimiter=del_char)

    for region in my_hash_pop:
        for country in my_hash_pop[region]:
            for city in my_hash_pop[region][country]:
                w.writerow([region, country, city, str(my_hash_pop[region][country][city])])

end_3 = time.time()
print("---> METHOD 3 : Write all took " + str(end_3 - start_3) + "s")

Comparing the time each method takes as I grow my example tree, I notice that method 1 is rather inefficient. Between methods 2 and 3, though, the results vary and are not clearly distinct (method 3 usually seems to be more efficient).

I have therefore two questions :

  • Do you see another method I may want to try?
  • Is there a better way to check and compare the efficiency of the different methods? (See the timeit sketch below for the kind of harness I have in mind.)

And a bonus one :

  • I noticed that the output files of methods 1 and 2 are exactly the same size, while the output file of method 3 is larger than the other two. Why?
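
On the second question: so far I only take single time.time() measurements, as shown above. Here is a sketch of the timeit harness I have in mind (method_1, method_2 and method_3 are hypothetical zero-argument functions wrapping the snippets above):

import timeit

# Hypothetical wrappers: method_1, method_2 and method_3 would each
# package one of the snippets above as a zero-argument function.
for name, method in [("method 1", method_1),
                     ("method 2", method_2),
                     ("method 3", method_3)]:
    # Best of 3 repeats of 5 calls each, to reduce timing noise.
    best = min(timeit.repeat(method, number=5, repeat=3))
    print("%s: best of 3 x 5 calls = %.3fs" % (name, best))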

Thanks for any contribution!

4 Answers


The 3rd method is the most promising.

You could avoid many dict lookups by using items() at each level:

with open("my_test_3.csv", 'w+') as outfile:
    del_char = ";"
    w = csv.writer(outfile, delimiter=del_char)

    for region,countries in my_hash_pop.items():
        for country,cities in countries.items():
            for city,value in cities.items():
                w.writerow([region, country, city, value])
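
A further tweak (my suggestion, not benchmarked here) is to hand the writer a single generator through writerows(), which moves the row loop inside the csv module:

with open("my_test_3.csv", 'w') as outfile:
    w = csv.writer(outfile, delimiter=";")
    # writerows() accepts any iterable of rows, so the triple loop
    # collapses into one generator expression.
    w.writerows((region, country, city, value)
                for region, countries in my_hash_pop.items()
                for country, cities in countries.items()
                for city, value in cities.items())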

The difference in size between examples 2 and 3 comes from the newlines: "\n" for 'my_test_2.csv' and "\r\n" for 'my_test_3.csv' (the csv module's default line terminator). So every line in 'my_test_3.csv' is 1 byte larger than in 'my_test_2.csv'.
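
If you want method 3's output to match method 2's byte for byte, you can override the terminator via the csv module's lineterminator parameter (a one-line sketch; this matches on platforms where the text layer does not translate "\n"):

w = csv.writer(outfile, delimiter=";", lineterminator="\n")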


3 Comments

Good catch on the size difference between the files! I'll try .items() and let you know how much it improves the time.
This method is indeed more efficient than the one I used, so thank you for that. It takes exactly the same time as @QuantumEnergy's answer.
@Mago sure. Quantum uses the exact same loops, but packed in a nested list comprehension.
import time

start_4 = time.time()
filename = "my_test_4.csv"
with open(filename, 'w') as outfile:
    a = [outfile.write("%s;%s;%s;%s\n" % (k, kk, kkk, vvv))
         for (k, v) in my_hash_pop.items()
         for (kk, vv) in v.items()
         for (kkk, vvv) in vv.items()]
end_4 = time.time()
print("---> METHOD 4 : Write all took " + str(end_4 - start_4) + "s")

3 Comments

Thanks, it's more efficient, and it also takes exactly the same time as @eric-duminil's answer.
You don't need a, do you?
Yes, don't need a!

A suggestion would be to use pandas, as follows:

import pandas as pd
df = pd.DataFrame([(i,j,k,my_hash_pop[i][j][k])
                           for i in my_hash_pop.keys() 
                           for j in my_hash_pop[i].keys()
                           for k in my_hash_pop[i][j].keys()])

with open("my_test_4.csv", 'w') as outfile:
    outfile.write(df.to_csv(sep=';', header=False, index=False)))

I have not compared execution times, and maybe using pandas is not an option for you, so this is just a suggestion.
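
Note that to_csv can also write straight to a path, which skips the manual file handling and the intermediate string (an untimed variant of the snippet above):

df.to_csv("my_test_4.csv", sep=';', header=False, index=False)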

1 Comment

I can't seem to install the pandas module in my environment, but I'll try on a more open one and let you know.

pandas is very efficient when it comes to handling large data sets. Below is a method to load the dict of dicts into pandas, flatten it using json_normalize, and then manipulate it, e.g. write it to CSV.

Let me know how it fares with your options.

Source-Code

import csv
from pandas.io.json import json_normalize

# json_normalize flattens the nested dict into a one-row DataFrame
# whose column names encode the path, e.g. 'Europe.France.Paris'.
df = json_normalize(my_hash_pop)

filename = "temp.csv"
del_char = ";"

with open(filename, 'w') as outfile:
    w = csv.writer(outfile, delimiter=del_char, quoting=csv.QUOTE_MINIMAL)
    for i in df.keys():
        # Split 'Europe.France.Paris' back into its three levels and
        # append the population value from the single row.
        s = "{};{}".format(i.replace('.', del_char), df[i][0]).split(del_char)
        w.writerow(s)
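
For the example tree, json_normalize returns a single-row DataFrame with one column per leaf (names like 'Europe.France.Paris'), which the loop above splits back apart on the '.' separator. In recent pandas versions the same function is also available directly as pd.json_normalize.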

1 Comment

I can't seem to install the pandas module in my environment, but I'll try on a more open one and let you know. Thanks
