pandas dataframe how to convert object to array and extract the array value

Question

I have a DataFrame, shown below, that I received from a Hive database.

How can I extract the values 'cat', 'animal', and 'dog' from column col2?

In[]:
import pandas as pd
sample = {'col1': ['cat', 'dog'], 'col2': ['WrappedArray([animal], [cat])', 'WrappedArray([animal], [dog])']}
df = pd.DataFrame(data=sample)
df

out[] :
    col1                            col2
-----------------------------------------
0   cat     WrappedArray([animal], [cat])
1   dog     WrappedArray([animal], [dog])

I tried to convert the object column to an array and extract the data with code like this.

In[]: df['col2'][0][1]
Out[]: cat

If this is the wrong approach, what should I do instead? I'm new to pandas, so the question might be unclear.

Thanks in advance.

Where is the "WrappedArray()" coming from? Is that how you're getting the data? I'm guessing you're not actually creating a dataset like that, you'd just be making more work for yourself.
– elPastor, Commented Feb 23, 2020 at 18:49
@elPastor yes, I didn't create the dataframe, it's what I received from the database.
– Chachatonel Hashimotto, Commented Feb 23, 2020 at 18:57
Not sure that this helps. stackoverflow.com/questions/44468311/…. Looks like WrappedArray is a Spark type.
– Poojan, Commented Feb 23, 2020 at 19:07
@TomRon it's selected data. The column consists of an array. This is the column type: col1 array<struct<tag:string,score:float>>
– Chachatonel Hashimotto, Commented Feb 23, 2020 at 19:08

rkedge · Accepted Answer · 2020-02-23 20:59:58Z

The data in the second column col2 appear to be simply strings.

The output of df['col2'][0][1] would be "r", which is the second character (index 1) of the first string. To get "cat" you would need to strip the 'WrappedArray([animal], ...' wrapper from the strings, leaving only the actual data: "cat", "dog", etc.

You could try df['col2'].iloc[0][24:27], but that's not a general solution. It would also be brittle and unmaintainable.
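To make the point concrete, here is a small sketch that rebuilds the DataFrame from the question and inspects the first cell. The variable name first is only for illustration.

```python
import pandas as pd

sample = {
    'col1': ['cat', 'dog'],
    'col2': ['WrappedArray([animal], [cat])', 'WrappedArray([animal], [dog])'],
}
df = pd.DataFrame(data=sample)

first = df['col2'].iloc[0]

print(type(first))   # <class 'str'>: the cell is plain text, not an array
print(first[1])      # r, the second character of 'WrappedArray(...'
print(first[24:27])  # cat, but only because of where it sits in this string
```

The slice works here only because both sample strings happen to have the same layout. A longer first element would shift the positions and the slice would silently return the wrong characters.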

If you have any control over how the data is exported from the database, try to get the data out in a cleaner format, i.e. without the WrappedArray(... stuff.

Regular expressions might be helpful here.

You could try something like this:

import re

wrapped = re.compile(r'\[(.*?)\].+\[(.*?)\]')
element = wrapped.search(df['col2'].iloc[0]).group(2)
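The snippet above handles a single cell. As a sketch of how the same pattern might be applied to the whole column at once, pandas offers Series.str.extract, which returns one column per capture group. The column names elem1 and elem2 are made up for this example.

```python
import pandas as pd

sample = {
    'col1': ['cat', 'dog'],
    'col2': ['WrappedArray([animal], [cat])', 'WrappedArray([animal], [dog])'],
}
df = pd.DataFrame(data=sample)

# Same pattern as above: two bracketed groups with anything in between.
pattern = r'\[(.*?)\].+\[(.*?)\]'

# One output column per capture group, one output row per input row.
parts = df['col2'].str.extract(pattern)
parts.columns = ['elem1', 'elem2']  # hypothetical names

print(parts)
```

This assumes every row holds exactly two bracketed elements. Rows that do not match the pattern come back as missing values.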

* Danger Danger Danger *

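If you really need that functionality, you could define a WrappedArray function that returns its contents as a list of strings or the like, and then evaluate the string with eval(df['col2'][0])[1]. The bare names inside the brackets (animal, cat) would also have to be defined for the string to evaluate at all.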

Don't do this.
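If the goal is a real Python list per row, the strings can be parsed without eval. The following is a minimal sketch, assuming the elements never contain square brackets themselves. The helper parse_wrapped_array and the column col2_list are hypothetical names, not part of pandas.

```python
import re

import pandas as pd


def parse_wrapped_array(text):
    """Return the bracketed elements of a 'WrappedArray(...)' string as a list."""
    # Each match is the text between one '[' and the next ']'.
    return re.findall(r'\[([^\[\]]*)\]', text)


sample = {
    'col1': ['cat', 'dog'],
    'col2': ['WrappedArray([animal], [cat])', 'WrappedArray([animal], [dog])'],
}
df = pd.DataFrame(data=sample)

# Store the parsed lists in a new column so the raw text stays available.
df['col2_list'] = df['col2'].apply(parse_wrapped_array)

print(df['col2_list'].iloc[0])     # ['animal', 'cat']
print(df['col2_list'].iloc[0][1])  # cat
```

Unlike eval, this only reads text and never executes anything taken from the database.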

FYI:

Your dtypes likely defaulted to object because you didn't specify them when you created your DataFrame. You can specify them like this (the 'string' dtype requires pandas 1.0 or later):

df = pd.DataFrame(data=sample, dtype='string')
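A quick way to see the effect is to compare the dtypes of both variants. This is only a sketch, and the names default_df and typed_df are illustrative. Note that choosing a string dtype does not parse the cell contents.

```python
import pandas as pd

sample = {
    'col1': ['cat', 'dog'],
    'col2': ['WrappedArray([animal], [cat])', 'WrappedArray([animal], [dog])'],
}

default_df = pd.DataFrame(data=sample)
typed_df = pd.DataFrame(data=sample, dtype='string')

print(default_df.dtypes)  # whatever pandas picks when no dtype is given
print(typed_df.dtypes)    # the dedicated string dtype for both columns

# The cell is still one piece of text per row, whichever dtype is used.
print(typed_df['col2'].iloc[0])
```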

Also, it's recommended to use iloc to index DataFrames by integer position.
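The difference shows up as soon as the index labels are not 0, 1, 2, and so on. A short sketch, using made-up labels 10 and 20:

```python
import pandas as pd

s = pd.Series(
    ['WrappedArray([animal], [cat])', 'WrappedArray([animal], [dog])'],
    index=[10, 20],  # hypothetical labels, e.g. left over after filtering
)

print(s.iloc[0])  # first row by position, whatever its label is
print(s.loc[10])  # the same row, looked up by label

# s[0] would raise a KeyError here, because 0 is treated as a label
# and no row carries that label.
```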

Chachatonel Hashimotto · Accepted Answer · 2020-02-24 17:08:08Z

I solved it as @rkedge advised.

The data is written in a foreign language.

As I said, the DataFrame holds object data such as 'WrappedArray([우주ごぎゅ],[ぎゃ],[한국어])'.

df_ = df['col2'].str.extractall(r'([REGEX expression]+)')
df_

0   0   우주ごぎゅ
0   1   ぎゃ
0   2   한국어
1   0   cat
2   0   animal
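The actual pattern is not shown in this answer. Purely as an illustration, the sketch below assumes a pattern that captures whatever sits between each pair of square brackets. The sample rows and the name wide are also assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    'col2': [
        'WrappedArray([우주ごぎゅ],[ぎゃ],[한국어])',
        'WrappedArray([animal], [cat])',
    ]
})

# Assumed pattern: one capture group per bracketed element.
matches = df['col2'].str.extractall(r'\[([^\[\]]+)\]')

# One row per match, indexed by (original row, match number).
print(matches)

# Optional: pivot to one row per original row, one column per match.
wide = matches[0].unstack()
print(wide)
```

Because extractall returns every match, it copes with rows that hold a different number of elements. In the pivoted form, shorter rows are padded with missing values.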
