How can I get the first word from each string in my Dataframe using Python?

Question

I have a Pandas DataFrame called "data" with 2 columns and 50 rows filled with one or two lines of text each, imported from a .tsv file. Some of the questions may contain integers and floats, besides strings. I am trying to extract the first word of every sentence (in both columns), but consistently get this error: AttributeError: 'DataFrame' object has no attribute 'str'.

At first, I thought the error was due to my wrong use of "data.str.split", but every change I could find on Google failed. Then I thought the file might not consist entirely of strings, so I tried "data.astype(str)" on it, but the same error remained. Any suggestions? Thanks a lot!

Here is my code:

import pandas as pd
questions = "questions.tsv"
data = pd.read_csv(questions, usecols = [3], nrows = 50, header=1, sep="\t")
data = data.astype(str)
first_words = data.str.split(None, 1)[0]

Yes, both work! Thanks so much! Just to learn, any idea why my approach failed?
– twhale, Commented Sep 15, 2017 at 4:46
It doesn't work because you can't call the .str accessor on a DataFrame directly.
– cs95, Commented Sep 15, 2017 at 4:53

jezrael · Accepted Answer · 2017-09-15 04:46:02Z

5

Use:

first_words = data.apply(lambda x: x.str.split().str[0])

Or:

first_words = data.applymap(lambda x: x.split()[0])

Sample:

data = pd.DataFrame({'a':['aa ss ss','ee rre', 1, 'r'],
                   'b':[4,'rrt ee', 'ee www ee', 6]})
print (data)
          a          b
0  aa ss ss          4
1    ee rre     rrt ee
2         1  ee www ee
3         r          6

data = data.astype(str)
first_words = data.apply(lambda x: x.str.split().str[0])
print (first_words)
    a    b
0  aa    4
1  ee  rrt
2   1   ee
3   r    6

first_words = data.applymap(lambda x: x.split()[0])
print (first_words)
    a    b
0  aa    4
1  ee  rrt
2   1   ee
3   r    6
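
A minimal sketch of two edge cases the sample above does not cover, offered as an assumption about what the asker's real data might contain rather than as part of the original answer. An empty cell makes `x.split()[0]` raise an IndexError, while the column-wise `.str` version returns a missing value instead. Also, recent pandas releases (2.1 and later) deprecate `applymap` in favour of `DataFrame.map`. The helper name `first_word` and the sample frame are hypothetical.

```python
import pandas as pd

# Hypothetical sample with an empty string in column "a"
data = pd.DataFrame({"a": ["aa ss ss", "", 1],
                     "b": [4, "rrt ee", "ee www ee"]}).astype(str)

# Column-wise: .str[0] on an empty list gives a missing value, not an error
by_column = data.apply(lambda col: col.str.split().str[0])
print(by_column)

# Element-wise: guard against cells that contain no words at all
def first_word(text):
    parts = text.split()
    return parts[0] if parts else None

# DataFrame.map replaces applymap from pandas 2.1 on; fall back on older versions
elementwise = data.map if hasattr(pd.DataFrame, "map") else data.applymap
by_element = elementwise(first_word)
print(by_element)
```

Which variant to prefer is a judgment call: the column-wise form tolerates empty cells without extra code, while the element-wise form makes the fallback value explicit.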

edited Sep 15, 2017 at 4:46

answered Sep 15, 2017 at 4:40


3 Comments

cs95 Over a year ago

I could not understand what you said well, but from what I understood, it seemed like you were upset

cs95 Over a year ago

Sorry, I didn't see x.str.split().str[0] in your answer.

cs95 Over a year ago

Fantastic. So happy to see you thinking positively.

piRSquared · Accepted Answer · 2017-09-15 05:14:18Z

1

The problem is that you attempted to use the pd.Series.str string accessor on a pd.DataFrame. Unfortunately, it is a pd.Series-only attribute, which means you need to use it in a pd.Series context. You can accomplish this in several ways.
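
To make the distinction concrete, here is a small sketch. The frame and the column name `question` are hypothetical stand-ins for the asker's data, which is not shown in the question. It checks for the accessor on the DataFrame and on one of its columns, then applies the accessor to the column.

```python
import pandas as pd

# Hypothetical single-column frame standing in for the asker's data
df = pd.DataFrame({"question": ["What is pandas?", "How do I split text?"]})

print(hasattr(df, "str"))              # False: the DataFrame has no .str accessor
print(hasattr(df["question"], "str"))  # True: a single column is a Series

# Select the column first, then use the accessor on the resulting Series
first_words = df["question"].str.split(n=1).str[0]
print(first_words.tolist())            # ['What', 'How']
```

If the frame has only one column, as `usecols=[3]` in the question suggests, selecting that column this way may be all that is needed.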

Setup
Assume your dataframe looked like this

              Col1               Col2
0   this is a test        hello world
1  this is another          pandas123
2            test3       tommy trojan
3         etcetera  one more sentence

Option 1
Use stack to convert a 2-dimensional dataframe into a series... then use the string accessor

#  Make a
#  Series
#  /----\    
df.stack().str.split(n=1).str[0].unstack()
#                                 \_____/
#                                 Turn it
#                                   Back

       Col1       Col2
0      this      hello
1      this  pandas123
2     test3      tommy
3  etcetera        one

Option 2
Or you can use pd.DataFrame.apply to use the pd.Series.str accessor on each column separately.
This is covered in @jezrael's answer.

df.apply(lambda x: x.str.split(n=1).str[0])

       Col1       Col2
0      this      hello
1      this  pandas123
2     test3      tommy
3  etcetera        one

Option 3
Use a comprehension

pd.DataFrame({c: df[c].str.split(n=1).str[0] for c in df})

       Col1       Col2
0      this      hello
1      this  pandas123
2     test3      tommy
3  etcetera        one

You'll notice that in all three options, we used the .str accessor on a pd.Series object and not on a pd.DataFrame object.
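
As a quick check, the three options can be run side by side on the setup frame shown earlier. This is only a sketch: the variable names `stacked`, `applied` and `comprehended` are made up for illustration, and the comparison assumes every cell is a string, as in the setup.

```python
import pandas as pd

# The setup frame from this answer
df = pd.DataFrame({
    "Col1": ["this is a test", "this is another", "test3", "etcetera"],
    "Col2": ["hello world", "pandas123", "tommy trojan", "one more sentence"],
})

# Option 1: stack into one Series, split, then unstack back to two columns
stacked = df.stack().str.split(n=1).str[0].unstack()

# Option 2: apply the accessor to each column in turn
applied = df.apply(lambda x: x.str.split(n=1).str[0])

# Option 3: build a new frame from a dict comprehension over the columns
comprehended = pd.DataFrame({c: df[c].str.split(n=1).str[0] for c in df})

# Compare the cell values of all three results
same = stacked.values.tolist() == applied.values.tolist() == comprehended.values.tolist()
print(same)  # True
```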

edited Sep 15, 2017 at 5:14

answered Sep 15, 2017 at 5:00


3 Comments

cs95 Over a year ago

Awesome! I think split(n=1) might improve efficiency a bit, because splitting stops after the first word (everything after is unnecessary). This was covered in my (now deleted) answer.

piRSquared Over a year ago

Added. Thanks for tip.

twhale Over a year ago

This is great, thanks. I am a starter, so I am grateful for this steep learning curve!
