
I am trying to load a .csv file that contains 2 columns. The first column has floats and the second column has strings that correspond to each number in the 1st column.

I tried to load it with file = np.genfromtxt('tester.csv', delimiter=',', skip_header=1), but only the floats loaded; the text entries all appeared as nan in the array. What is the best way to load a .csv file into a 2D array with a column of floats and a column of strings?

The first few lines of the .csv file will look something like this

m/z,     Lipid ID
885.5,   PI 18:0_20:4 
857.5,   PI 16:0_20:4
834.5,   PS 18:0_22:6
810.5,   PS 18:0_20:4
790.5,   PE 18:0_22:6
  • Will you please show a few lines of your CSV file? Commented Dec 22, 2021 at 22:53
  • Sorry for that. Just added them! Commented Dec 22, 2021 at 22:57
  • Thank you. Are the large gaps between columns several space characters in a row or tabs (\t)? Commented Dec 22, 2021 at 22:57
  • Oh, I just did that to make it easier to read. Each number and lipid name will be in its own cell. Commented Dec 22, 2021 at 23:01
  • What will the separator be? ,? Commented Dec 22, 2021 at 23:02

4 Answers


Use pandas to load your csv file, then convert it to a NumPy array:

import numpy as np
import pandas as pd

df = pd.read_csv('tester.csv')
df_to_array = np.array(df)

Your csv will be stored in df_to_array as a numpy array.
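A self-contained sketch of what this produces, using an inline string as a stand-in for tester.csv (the file itself isn't available here): because the two columns have different types, the resulting array gets an object dtype.

```python
import io

import numpy as np
import pandas as pd

# Inline stand-in for tester.csv, matching the question's sample rows
csv_text = """m/z,Lipid ID
885.5,PI 18:0_20:4
857.5,PI 16:0_20:4
834.5,PS 18:0_22:6
"""

df = pd.read_csv(io.StringIO(csv_text))
arr = df.to_numpy()   # equivalent to np.array(df)

print(arr.dtype)      # object, because the columns mix float and str
print(arr[0])
```

The floats stay floats and the lipid names stay strings; only the container dtype is object.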


2 Comments

Or instead of np.array(df): df.to_numpy()
Both are possible :)

In order to avoid the nans, you need to tell genfromtxt the dtypes of the columns, because, by default, it tries to make everything a float.

import numpy as np

# One dtype per column; genfromtxt then returns a structured array,
# and .tolist() flattens it so np.array can build a plain 2D object array.
dtypes = ['float', 'object']
csv = np.array(np.genfromtxt('tester.csv', delimiter=',', skip_header=1,
                             dtype=dtypes).tolist())

Output:

>>> csv
array([[885.5, b'PI 18:0_20:4'],
       [857.5, b'PI 16:0_20:4'],
       [834.5, b'PS 18:0_22:6'],
       [810.5, b'PS 18:0_20:4'],
       [790.5, b'PE 18:0_22:6']], dtype=object)
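Not part of the original answer, but if you later need the columns back with their proper types, you can slice the object array and cast each column. A sketch, assuming the bytes strings shown in the output above (genfromtxt reads the file in binary mode by default):

```python
import numpy as np

# 2D object array as produced by the answer above
csv = np.array([[885.5, b'PI 18:0_20:4'],
                [857.5, b'PI 16:0_20:4']], dtype=object)

mz = csv[:, 0].astype(float)                         # float column
lipids = np.array([s.decode() for s in csv[:, 1]])   # bytes -> str column
```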

4 Comments

That "odd" result is called a structured array. It is a 1d array with 2 fields, one for each column, each with its own dtype. That's rather like a pandas dataframe with a different dtype for each column (Series).
Okay, thank you @hpaulj. Do you know of a better way to deal with that than using unpack=True and transposing?
Depends on the desired result. Is a 2d object dtype array better? np.array(data.tolist(), dtype=object) is another option.
Oh yeah, that actually worked! Interesting. I'll update the answer.

Since you already use numpy, you can additionally install pandas to load your csv file:

# Python env: pip install pandas
# Anaconda env: conda install pandas
import pandas as pd

df = pd.read_csv('tester.csv', sep=r'\s\s+', engine='python')
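Note that the whitespace separator above matches the pretty-printed sample, not a real comma-separated file. If the actual file is comma-separated with padding spaces after each comma (as the question's comments suggest), skipinitialspace=True may be the safer choice. A sketch with inline data standing in for tester.csv:

```python
import io

import pandas as pd

# Inline stand-in for tester.csv, with the question's padded columns
csv_text = """m/z,     Lipid ID
885.5,   PI 18:0_20:4
857.5,   PI 16:0_20:4
"""

# skipinitialspace=True strips the spaces that follow each comma,
# in both the header and the data rows
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
print(df.columns.tolist())   # ['m/z', 'Lipid ID']
```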

Comments

In [228]: txt="""m/z,     Lipid ID
     ...: 885.5,   PI 18:0_20:4 
     ...: 857.5,   PI 16:0_20:4
     ...: 834.5,   PS 18:0_22:6
     ...: 810.5,   PS 18:0_20:4
     ...: 790.5,   PE 18:0_22:6
     ...: """

genfromtxt has a lot of possible parameters. It's not as fast as the pandas equivalent, but still quite flexible.

In [229]: data = np.genfromtxt(txt.splitlines(),delimiter=',',dtype=None, encoding=None, 
     names=True, autostrip=True)
In [230]: data
Out[230]: 
array([(885.5, 'PI 18:0_20:4'), (857.5, 'PI 16:0_20:4'),
       (834.5, 'PS 18:0_22:6'), (810.5, 'PS 18:0_20:4'),
       (790.5, 'PE 18:0_22:6')],
      dtype=[('mz', '<f8'), ('Lipid_ID', '<U12')])

This is a structured array, with 2 fields. Because of the names parameter, field names are taken from the file header line. With dtype=None, it deduces a dtype for each column, in this case float and string. Fields are accessed by name:

In [231]: data['Lipid_ID']
Out[231]: 
array(['PI 18:0_20:4', 'PI 16:0_20:4', 'PS 18:0_22:6', 'PS 18:0_20:4',
       'PE 18:0_22:6'], dtype='<U12')
In [232]: data['mz']
Out[232]: array([885.5, 857.5, 834.5, 810.5, 790.5])

To make a 2d array we have to cast it to object dtype, allowing a mix of numbers and strings.

In [233]: np.array(data.tolist(), object)
Out[233]: 
array([[885.5, 'PI 18:0_20:4'],
       [857.5, 'PI 16:0_20:4'],
       [834.5, 'PS 18:0_22:6'],
       [810.5, 'PS 18:0_20:4'],
       [790.5, 'PE 18:0_22:6']], dtype=object)

The structured array can be loaded into a dataframe, with a result similar to what a pandas read would produce:

In [235]: pd.DataFrame(data)
Out[235]: 
      mz      Lipid_ID
0  885.5  PI 18:0_20:4
1  857.5  PI 16:0_20:4
2  834.5  PS 18:0_22:6
3  810.5  PS 18:0_20:4
4  790.5  PE 18:0_22:6

A dataframe's to_records method produces a structured array, much like the one we started with.

In [238]: _235.to_records(index=False)
Out[238]: 
rec.array([(885.5, 'PI 18:0_20:4'), (857.5, 'PI 16:0_20:4'),
           (834.5, 'PS 18:0_22:6'), (810.5, 'PS 18:0_20:4'),
           (790.5, 'PE 18:0_22:6')],
          dtype=[('mz', '<f8'), ('Lipid_ID', 'O')])

Comments
