
I am trying to load a .csv file that contains 2 columns. The first column has floats and the second column has strings that correspond to each number in the 1st column.

I tried to load it with file = np.genfromtxt('tester.csv', delimiter=',', skip_header=1), but only the floats loaded; the text entries all appeared as nan in the array. What is the best way to load a .csv file into a 2D array with a column of floats and a column of strings?

The first few lines of the .csv file will look something like this

m/z,     Lipid ID
885.5,   PI 18:0_20:4 
857.5,   PI 16:0_20:4
834.5,   PS 18:0_22:6
810.5,   PS 18:0_20:4
790.5,   PE 18:0_22:6
  • Will you please show a few lines of your CSV file? Commented Dec 22, 2021 at 22:53
  • Sorry for that. Just added them! Commented Dec 22, 2021 at 22:57
  • Thank you. Are the large gaps between columns several space characters in a row or tabs (\t)? Commented Dec 22, 2021 at 22:57
  • Oh, I just did that to make it easier to read. Each number and lipid name will be in its own cell. Commented Dec 22, 2021 at 23:01
  • What will the separator be? ,? Commented Dec 22, 2021 at 23:02

4 Answers


Use pandas to load your csv file, then convert it to a NumPy array:

import numpy as np
import pandas as pd

df = pd.read_csv('tester.csv')
df_to_array = np.array(df)

Your csv will be stored in df_to_array as a numpy array.
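A self-contained sketch of what this produces, using an inline string as a stand-in for tester.csv (the file itself isn't available here): because the two columns have different types, the resulting array gets an object dtype.

```python
import io

import numpy as np
import pandas as pd

# Inline stand-in for tester.csv, matching the question's sample rows
csv_text = """m/z,Lipid ID
885.5,PI 18:0_20:4
857.5,PI 16:0_20:4
834.5,PS 18:0_22:6
"""

df = pd.read_csv(io.StringIO(csv_text))
arr = df.to_numpy()   # equivalent to np.array(df)

print(arr.dtype)      # object, because the columns mix float and str
print(arr[0])
```

The floats stay floats and the lipid names stay strings; only the container dtype is object.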


2 Comments

Or instead of np.array(df): df.to_numpy()
Both are possible :)

In order to avoid the nans, you need to tell genfromtxt the dtypes of the columns, because, by default, it tries to make everything a float.

import numpy as np

# One dtype per column; genfromtxt then returns a structured array,
# and .tolist() flattens it so np.array can build a plain 2D object array.
dtypes = ['float', 'object']
csv = np.array(np.genfromtxt('tester.csv', delimiter=',', skip_header=1,
                             dtype=dtypes).tolist())

Output:

>>> csv
array([[885.5, b'PI 18:0_20:4'],
       [857.5, b'PI 16:0_20:4'],
       [834.5, b'PS 18:0_22:6'],
       [810.5, b'PS 18:0_20:4'],
       [790.5, b'PE 18:0_22:6']], dtype=object)
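Not part of the original answer, but if you later need the columns back with their proper types, you can slice the object array and cast each column. A sketch, assuming the bytes strings shown in the output above (genfromtxt reads the file in binary mode by default):

```python
import numpy as np

# 2D object array as produced by the answer above
csv = np.array([[885.5, b'PI 18:0_20:4'],
                [857.5, b'PI 16:0_20:4']], dtype=object)

mz = csv[:, 0].astype(float)                         # float column
lipids = np.array([s.decode() for s in csv[:, 1]])   # bytes -> str column
```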

4 Comments

That "odd" result is called a structured array. It is a 1d array with 2 fields, one for each column, each with its own dtype. That's rather like a pandas dataframe with a different dtype for each column (Series).
Okay, thank you @hpaulj. Do you know of a better way to deal with that than using unpack=True and transposing?
Depends on the desired result. Is a 2d object dtype array better? np.array(data.tolist(), dtype=object) is another option.
Oh yeah, that actually worked! Interesting. I'll update the answer.

Since you already use numpy, you can additionally install pandas to load your csv file:

# Python env: pip install pandas
# Anaconda env: conda install pandas
import pandas as pd

df = pd.read_csv('tester.csv', sep=r'\s\s+', engine='python')
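Note that the whitespace separator above matches the pretty-printed sample, not a real comma-separated file. If the actual file is comma-separated with padding spaces after each comma (as the question's comments suggest), skipinitialspace=True may be the safer choice. A sketch with inline data standing in for tester.csv:

```python
import io

import pandas as pd

# Inline stand-in for tester.csv, with the question's padded columns
csv_text = """m/z,     Lipid ID
885.5,   PI 18:0_20:4
857.5,   PI 16:0_20:4
"""

# skipinitialspace=True strips the spaces that follow each comma,
# in both the header and the data rows
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
print(df.columns.tolist())   # ['m/z', 'Lipid ID']
```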

Comments

In [228]: txt="""m/z,     Lipid ID
     ...: 885.5,   PI 18:0_20:4 
     ...: 857.5,   PI 16:0_20:4
     ...: 834.5,   PS 18:0_22:6
     ...: 810.5,   PS 18:0_20:4
     ...: 790.5,   PE 18:0_22:6
     ...: """

genfromtxt has a lot of possible parameters. It's not as fast as the pandas equivalent, but still quite flexible.

In [229]: data = np.genfromtxt(txt.splitlines(),delimiter=',',dtype=None, encoding=None, 
     names=True, autostrip=True)
In [230]: data
Out[230]: 
array([(885.5, 'PI 18:0_20:4'), (857.5, 'PI 16:0_20:4'),
       (834.5, 'PS 18:0_22:6'), (810.5, 'PS 18:0_20:4'),
       (790.5, 'PE 18:0_22:6')],
      dtype=[('mz', '<f8'), ('Lipid_ID', '<U12')])

This is a structured array, with 2 fields. Because of the names parameter, field names are taken from the file header line. With dtype=None, it deduces a dtype for each column, in this case float and string. Fields are accessed by name:

In [231]: data['Lipid_ID']
Out[231]: 
array(['PI 18:0_20:4', 'PI 16:0_20:4', 'PS 18:0_22:6', 'PS 18:0_20:4',
       'PE 18:0_22:6'], dtype='<U12')
In [232]: data['mz']
Out[232]: array([885.5, 857.5, 834.5, 810.5, 790.5])

To make a 2d array we have to cast it to object dtype, allowing a mix of numbers and strings.

In [233]: np.array(data.tolist(), object)
Out[233]: 
array([[885.5, 'PI 18:0_20:4'],
       [857.5, 'PI 16:0_20:4'],
       [834.5, 'PS 18:0_22:6'],
       [810.5, 'PS 18:0_20:4'],
       [790.5, 'PE 18:0_22:6']], dtype=object)

The structured array can be loaded into a dataframe, with a result similar to what a pandas read would produce:

In [235]: pd.DataFrame(data)
Out[235]: 
      mz      Lipid_ID
0  885.5  PI 18:0_20:4
1  857.5  PI 16:0_20:4
2  834.5  PS 18:0_22:6
3  810.5  PS 18:0_20:4
4  790.5  PE 18:0_22:6

A dataframe's to_records method produces a structured array, much like the one we started with.

In [238]: _235.to_records(index=False)
Out[238]: 
rec.array([(885.5, 'PI 18:0_20:4'), (857.5, 'PI 16:0_20:4'),
           (834.5, 'PS 18:0_22:6'), (810.5, 'PS 18:0_20:4'),
           (790.5, 'PE 18:0_22:6')],
          dtype=[('mz', '<f8'), ('Lipid_ID', 'O')])

Comments
