
I'm working with numpy and trying to find which platform sold the most copies in the NA region.

I have a CSV file holding a lot of data looking like this:

Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
1,Wii Sports,Wii,2006,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
2,Super Mario Bros.,NES,1985,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
3,Mario Kart Wii,Wii,2008,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
4,Wii Sports Resort,Wii,2009,Sports,Nintendo,15.75,11.01,3.28,2.96,33
5,Pokemon Red/Pokemon Blue,GB,1996,Role-Playing,Nintendo,11.27,8.89,10.22,1,31.37
6,Tetris,GB,1989,Puzzle,Nintendo,23.2,2.26,4.22,0.58,30.26
7,New Super Mario Bros.,DS,2006,Platform,Nintendo,11.38,9.23,6.5,2.9,30.01
8,Wii Play,Wii,2006,Misc,Nintendo,14.03,9.2,2.93,2.85,29.02
9,New Super Mario Bros. Wii,Wii,2009,Platform,Nintendo,14.59,7.06,4.7,2.26,28.62
10,Duck Hunt,NES,1984,Shooter,Nintendo,26.93,0.63,0.28,0.47,28.31
11,Nintendogs,DS,2005,Simulation,Nintendo,9.07,11,1.93,2.75,24.76

I would like to print the platform with the most sales and the amount sold in the NA region. How can I do this?

  • What did you try so far? Commented Mar 4, 2017 at 23:04
  • I hard-coded each platform as a mask, like maskNES = (data[:,2] == 'NES'), then assigned the sum to a variable, like pfNES = data[maskNES][:,6].sum(). Lastly, I compared all the platforms to find the one with the highest value. It just seems like an idiotic way of doing it, especially if I had thousands of different platforms. Oh, and I read the CSV data into a matrix called 'data'. Commented Mar 4, 2017 at 23:44

2 Answers

With pandas this is fairly straightforward.

Code:

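from io import StringIO

import pandas as pd

# read csv data into a dataframe (`data` is the StringIO object from the test data below)
df = pd.read_csv(data, skipinitialspace=True)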

# roll up by NA Sales
platform_roll_up = df.groupby('Platform')['NA_Sales'].sum()

# find row with max sales
idx_max = platform_roll_up.idxmax()

# show platform and sales for max
print(idx_max, platform_roll_up[idx_max])

Results:

Wii 101.71

Test Data:

data = StringIO(u"""
    Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
    1,Wii Sports,Wii,2006,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
    2,Super Mario Bros.,NES,1985,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
    3,Mario Kart Wii,Wii,2008,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
    4,Wii Sports Resort,Wii,2009,Sports,Nintendo,15.75,11.01,3.28,2.96,33
    5,Pokemon Red/Pokemon Blue,GB,1996,Role-Playing,Nintendo,11.27,8.89,10.22,1,31.37
    6,Tetris,GB,1989,Puzzle,Nintendo,23.2,2.26,4.22,0.58,30.26
    7,New Super Mario Bros.,DS,2006,Platform,Nintendo,11.38,9.23,6.5,2.9,30.01
    8,Wii Play,Wii,2006,Misc,Nintendo,14.03,9.2,2.93,2.85,29.02
    9,New Super Mario Bros. Wii,Wii,2009,Platform,Nintendo,14.59,7.06,4.7,2.26,28.62
    10,Duck Hunt,NES,1984,Shooter,Nintendo,26.93,0.63,0.28,0.47,28.31
    11,Nintendogs,DS,2005,Simulation,Nintendo,9.07,11,1.93,2.75,24.76
""")

2 Comments

Thanks for the quick answer! I'm trying to find a solution that works with a numpy.ndarray, which doesn't have an iloc attribute. Should I stay away from ndarray in this case? Also, I'm trying to find the total NA_Sales value of all products on platform X, rather than the highest single value. By the way, I am new to Python.
Thanks a ton! Greatly appreciate the answer, your edited version was exactly what I was looking for.

Loading this with genfromtxt is straightforward:

In [280]: data=np.genfromtxt('stack42602390.csv',delimiter=',',names=True, dtype=None)

In [281]: data
Out[281]: 
array([ ( 1, b'Wii Sports', b'Wii', 2006, b'Sports', b'Nintendo',  41.49,  29.02,   3.77,  8.46,  82.74),
       ( 2, b'Super Mario Bros.', b'NES', 1985, b'Platform', b'Nintendo',  29.08,   3.58,   6.81,  0.77,  40.24),
       ( 3, b'Mario Kart Wii', b'Wii', 2008, b'Racing', b'Nintendo',  15.85,  12.88,   3.79,  3.31,  35.82),
....
       (11, b'Nintendogs', b'DS', 2005, b'Simulation', b'Nintendo',   9.07,  11.  ,   1.93,  2.75,  24.76)], 
      dtype=[('Rank', '<i4'), ('Name', 'S25'), ('Platform', 'S3'), ('Year', '<i4'), ('Genre', 'S12'), ('Publisher', 'S8'), ('NA_Sales', '<f8'), ('EU_Sales', '<f8'), ('JP_Sales', '<f8'), ('Other_Sales', '<f8'), ('Global_Sales', '<f8')])

The b'...' prefix is just how Python 3 displays bytestrings, which is the default string format from genfromtxt here. The prefix does not appear in Python 2.

The result is a structured array, with different field names and types. It is not a 2d array with rows and columns.
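To illustrate the difference, here is a minimal sketch using a hand-built structured array. The two-field dtype and the name `records` are made up for the example (a cut-down version of the dtype shown above), so treat it as an illustration rather than the exact array from the file.

```python
import numpy as np

# Hypothetical two-field dtype, a cut-down version of the one genfromtxt produced.
dt = np.dtype([('Platform', 'U3'), ('NA_Sales', 'f8')])
records = np.array([('Wii', 41.49), ('NES', 29.08), ('Wii', 15.85)], dtype=dt)

print(records.shape)        # one dimension: one entry per record
print(records.dtype.names)  # the field names

# Fields are selected by name, not by column number.
wii_rows = records[records['Platform'] == 'Wii']
print(wii_rows['NA_Sales'])

# 2d-style column indexing does not apply to a 1d structured array.
try:
    records[:, 1]
    column_indexing_works = True
except IndexError:
    column_indexing_works = False
print(column_indexing_works)
```

This is why masks such as data[:,2] == 'NES' from a plain 2d array need to be rewritten as data['Platform'] == ... when the array is structured.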

The NA_Sales data:

In [282]: data['NA_Sales']
Out[282]: 
array([ 41.49,  29.08,  15.85,  15.75,  11.27,  23.2 ,  11.38,  14.03,
        14.59,  26.93,   9.07])

And the index of the maximum of these:

In [283]: np.argmax(data['NA_Sales'])
Out[283]: 0

and the corresponding record:

In [284]: data[0]
Out[284]: (1, b'Wii Sports', b'Wii', 2006, b'Sports', b'Nintendo',  41.49,  29.02,  3.77,  8.46,  82.74)

To make the most use of this array you'll have to read up on structured arrays.
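The steps above find the single best-selling record. If the goal is the total NA sales per platform, one possible numpy-only approach is to group with np.unique and sum with np.bincount. This is a sketch under a few assumptions: the sample rows from the question are loaded from an in-memory string instead of the file, and encoding='utf-8' is passed so the string fields come back as str rather than bytes (that parameter needs a reasonably recent NumPy). The names csv_text, platforms, inverse, totals and best are just illustrative.

```python
import io
import numpy as np

# Sample rows copied from the question, held in memory instead of a file.
csv_text = """Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
1,Wii Sports,Wii,2006,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
2,Super Mario Bros.,NES,1985,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
3,Mario Kart Wii,Wii,2008,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
4,Wii Sports Resort,Wii,2009,Sports,Nintendo,15.75,11.01,3.28,2.96,33
5,Pokemon Red/Pokemon Blue,GB,1996,Role-Playing,Nintendo,11.27,8.89,10.22,1,31.37
6,Tetris,GB,1989,Puzzle,Nintendo,23.2,2.26,4.22,0.58,30.26
7,New Super Mario Bros.,DS,2006,Platform,Nintendo,11.38,9.23,6.5,2.9,30.01
8,Wii Play,Wii,2006,Misc,Nintendo,14.03,9.2,2.93,2.85,29.02
9,New Super Mario Bros. Wii,Wii,2009,Platform,Nintendo,14.59,7.06,4.7,2.26,28.62
10,Duck Hunt,NES,1984,Shooter,Nintendo,26.93,0.63,0.28,0.47,28.31
11,Nintendogs,DS,2005,Simulation,Nintendo,9.07,11,1.93,2.75,24.76
"""

data = np.genfromtxt(io.StringIO(csv_text), delimiter=',', names=True,
                     dtype=None, encoding='utf-8')

# Map every record to an integer group id, one id per distinct platform.
platforms, inverse = np.unique(data['Platform'], return_inverse=True)

# Sum NA_Sales within each group; totals[i] belongs to platforms[i].
totals = np.bincount(inverse, weights=data['NA_Sales'])

# Pick the platform with the largest total.
best = np.argmax(totals)
print(platforms[best], totals[best])
```

For the sample rows this should pick Wii, with a total of about 101.71, and it scales to any number of platforms without a hand-written mask for each one.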

3 Comments

Tried this solution, but ran into the problem that further down my CSV file there are commas inside the titles, and I couldn't add quotechar='"' to np.genfromtxt.
The csv package handles quotes, but the numpy readers don't. genfromtxt accepts input from anything that feeds it lines, so you can preprocess the lines, cleaning them up so they can be parsed with simple delimiters. That's been discussed in many previous SO questions.
A recent example of genfromtxt with a filter input: stackoverflow.com/a/42593389/901925
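As a rough sketch of the preprocessing idea described in these comments (not necessarily what the linked answer does): let the csv module parse the quoted fields, re-join each row with a delimiter that does not occur in the data, and hand those lines to genfromtxt. The sample text and the title "Game, The Sequel" are made up, and the choice of a tab as the replacement delimiter assumes no field contains a tab.

```python
import csv
import io
import numpy as np

# Hypothetical sample: the second title contains a comma inside quotes.
raw = '''Rank,Name,Platform,NA_Sales
1,Wii Sports,Wii,41.49
2,"Game, The Sequel",NES,29.08
'''

# csv.reader understands the quotes; re-join each parsed row with tabs,
# assuming no field contains a tab character.
rows = csv.reader(io.StringIO(raw))
lines = ['\t'.join(row) for row in rows]

# genfromtxt accepts any iterable of lines, so the cleaned list can be passed directly.
data = np.genfromtxt(lines, delimiter='\t', names=True,
                     dtype=None, encoding='utf-8')

print(data['Name'])
print(data['NA_Sales'])
```

For a real file, the same list comprehension can be run over csv.reader(open(path, newline='')) instead of the in-memory string.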
