
To look for correlations between products and categories and then visualize them (as heatmaps), I need to reshape a table in Python, with or without pandas or other libraries, from this:

Book Name, Category 1, Category 2, Category 3
Django 101, Python, Web-Dev, Beginner
ROR Guide, Rails, Web-Dev, Intermediate
Laravel, PHP, Web-Dev, Intermediate

into that:

Book Name, Python, Web-Dev, Beginner, Rails, PHP, Intermediate
Django 101, True, True, True, False, False, False
ROR Guide, False, True, False, True, False, True
Laravel, False, True, False, False, True, True

Is there a way to do that? The data is stored in a .csv file and read with pandas.read_csv().

  • related: stackoverflow.com/questions/11587782/… Commented Jun 19, 2015 at 13:19
  • Maybe add some information on what kind of objects are in the array? Is this an array of arrays? Commented Jun 19, 2015 at 13:23

1 Answer


This can be done using the get_dummies function in Pandas.

import pandas as pd

df = pd.DataFrame({
    'Book Name': ['Django 101', 'ROR Guide', 'Laravel'],
    'Category 1': ['Python', 'Rails', 'PHP'],
    'Category 2': ['Web-Dev'] * 3,
    'Category 3': ['Beginner', 'Intermediate', 'Intermediate'],
})

# One-hot encode each category column, then join the results side by side.
dummies = pd.concat([pd.get_dummies(df[c]) for c in df.columns[1:]], axis=1)
df_new = pd.concat([df['Book Name'], dummies], axis=1)

>>> df_new
    Book Name  PHP  Python  Rails  Web-Dev  Beginner  Intermediate
0  Django 101    0       1      0        1         1             0
1   ROR Guide    0       0      1        1         0             1
2     Laravel    1       0      0        1         0             1

Alternatively, you can set the DataFrame's index to the book name:

df.set_index('Book Name', inplace=True)
df_new = pd.concat([pd.get_dummies(df[c]) for c in df], axis=1)
>>> df_new
            PHP  Python  Rails  Web-Dev  Beginner  Intermediate
Book Name                                                      
Django 101    0       1      0        1         1             0
ROR Guide     0       0      1        1         0             1
Laravel       1       0      0        1         0             1
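
If the same category can show up in more than one Category column, encoding each column separately gives repeated column names. Below is a minimal sketch of one way around that, not part of the original answer: melt the category columns into a single column and cross-tabulate it against the book name. The sample data and the names long and df_flags are illustrative assumptions. The result holds True/False values, as requested in the question.

```python
import pandas as pd

# Sample data where categories are mixed across the three columns (assumed for illustration).
df = pd.DataFrame({
    'Book Name': ['Django 101', 'ROR Guide', 'Laravel'],
    'Category 1': ['Python', 'Intermediate', 'PHP'],
    'Category 2': ['Web-Dev', 'Web-Dev', 'Intermediate'],
    'Category 3': ['Beginner', 'Rails', 'Web-Dev'],
})

# Long format: one row per (book, category) pair, whichever column it came from.
long = df.melt(id_vars='Book Name', value_name='Category')

# Count each category per book, then convert the counts to booleans.
df_flags = pd.crosstab(long['Book Name'], long['Category']).astype(bool)

# crosstab sorts its index, so restore the original book order.
df_flags = df_flags.reindex(df['Book Name'].tolist())
```

Each category ends up in exactly one column regardless of its position in the input. For a heatmap of 0/1 values, df_flags.astype(int) converts the flags back to integers.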

3 Comments

Unfortunately, my data looks like this:
Book Name, Category 1, Category 2, Category 3
Django 101, Python, Web-Dev, Beginner
ROR Guide, Rails, Intermediate, Web-Dev
Laravel, Beginner, Web-Dev, PHP
so it produces duplicate columns.
This does not work quite right, because the categories can be mixed across columns like this, which produces even more duplicates:
df = pd.DataFrame({'Book Name': ['Django 101', 'ROR Guide', 'Laravel'], 'Category 1': ['Python', 'Intermediate', 'PHP'], 'Category 2': ['Web-Dev', 'Web-Dev', 'Intermediate'], 'Category 3': ['Beginner', 'Rails', 'Web-Dev']})
Is there any way to avoid the duplicate columns?
@sergei It is up to you to define the categorization. To ensure uniqueness across categories, you can prepend each name in the column with an identifier, e.g. cat1_beginner will be different than cat2_beginner.
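
The prefixing idea in the last comment can be sketched with the prefix argument of get_dummies, which prepends an identifier to every generated column name. The prefixes cat1, cat2, cat3 and the name df_prefixed are assumptions chosen for illustration, and the data is the mixed example from the comment above.

```python
import pandas as pd

df = pd.DataFrame({
    'Book Name': ['Django 101', 'ROR Guide', 'Laravel'],
    'Category 1': ['Python', 'Intermediate', 'PHP'],
    'Category 2': ['Web-Dev', 'Web-Dev', 'Intermediate'],
    'Category 3': ['Beginner', 'Rails', 'Web-Dev'],
})

cat_cols = ['Category 1', 'Category 2', 'Category 3']

# Encode only the category columns; each new column is named <prefix>_<value>,
# so 'Intermediate' in Category 1 and in Category 2 stay distinct.
df_prefixed = pd.get_dummies(df, columns=cat_cols, prefix=['cat1', 'cat2', 'cat3'])
```

This keeps every column name unique, but it treats cat1_Intermediate and cat2_Intermediate as different features, so it fits only when the position of a category is meaningful.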
