0

I have a DataFrame, shown below, that I received from a Hive database.

How can I extract the values 'cat', 'animal', and 'dog' from column col2?

In[]:
import pandas as pd

sample = {'col1': ['cat', 'dog'], 'col2': ['WrappedArray([animal], [cat])', 'WrappedArray([animal], [dog])']}
df = pd.DataFrame(data=sample)
df

Out[]:
    col1                            col2
-----------------------------------------
0   cat     WrappedArray([animal], [cat])
1   dog     WrappedArray([animal], [dog])

I tried to treat the object as an array and extract the data like this:

In[]: df['col2'][0][1]
Out[]: cat

If this is the wrong approach, which way should I go? I'm new to pandas, so the question might be unclear.

Thanks in advance.

9
  • Where is the "WrappedArray()" coming from? Is that how you're getting the data? I'm guessing you're not actually creating a dataset like that, you'd just be making more work for yourself. Commented Feb 23, 2020 at 18:49
  • @elPastor yes, I didn't create the dataframe, it's what I received from database. Commented Feb 23, 2020 at 18:57
  • How did you read the data from the database? Commented Feb 23, 2020 at 19:02
  • Not sure that this helps: stackoverflow.com/questions/44468311/…. Looks like WrappedArray is a Spark type. Commented Feb 23, 2020 at 19:07
  • @TomRon It's selected data. The column consists of an array. This is the column type: col1 array<struct<tag:string,score:float>> Commented Feb 23, 2020 at 19:08

2 Answers

1

The data in the second column, col2, appear to be plain strings.

The output from df['col2'][0][1] would be "r", which is the second character (index 1) of the first string. To get "cat" you would need to alter the strings and remove the 'WrappedArray([animal]...' stuff, leaving only the actual data: "cat", "dog", etc.

You could try df['col2'].iloc[0][24:27], but that's not a general solution. It would also be brittle and unmaintainable.
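As a quick check of the points above, the following sketch rebuilds the sample frame from the question and inspects the first value of col2. The variable names are only illustrative.

```python
import pandas as pd

# Sample data copied from the question
sample = {'col1': ['cat', 'dog'],
          'col2': ['WrappedArray([animal], [cat])', 'WrappedArray([animal], [dog])']}
df = pd.DataFrame(data=sample)

first = df['col2'].iloc[0]

value_type = type(first)    # a plain Python str, not an array
second_char = first[1]      # indexing a str returns one character
fixed_slice = first[24:27]  # only works because 'cat' happens to sit at 24..26

print(value_type, second_char, fixed_slice)
```

The slice breaks as soon as the first element changes length, which is why it is not a general solution.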

If you have any control over how the data is exported from the database, try to get the data out in a cleaner format, i.e. without the WrappedArray(... stuff.

Regular expressions might be helpful here.

You could try something like this:

import re

wrapped = re.compile(r'\[(.*?)\].+\[(.*?)\]')
element = wrapped.search(df['col2'].iloc[0]).group(2)
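If the number of bracketed items varies, one possible generalization is to collect every bracketed item instead of exactly two. This is only a sketch: the pattern below is an assumption about the data (items never contain square brackets), and the `items` column name is made up for illustration.

```python
import re
import pandas as pd

sample = {'col1': ['cat', 'dog'],
          'col2': ['WrappedArray([animal], [cat])', 'WrappedArray([animal], [dog])']}
df = pd.DataFrame(data=sample)

# Assumed pattern: capture the text between one pair of square brackets
bracketed = re.compile(r'\[([^\[\]]*)\]')

# On a single string, findall returns every captured item
items_first = bracketed.findall(df['col2'].iloc[0])

# Series.str.findall applies the same pattern to every row
df['items'] = df['col2'].str.findall(bracketed.pattern)

# .str[-1] picks the last element of each list
last_items = df['items'].str[-1]
```

Here `items_first` holds both items of the first row, and `last_items` holds the last item of each row.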

* Danger Danger Danger *

If you really need that functionality, you could create a WrappedArray function that returns its contents as a list of strings or the like. Then you could call it by using eval(df['col2'][0]) and index the result with [1].

Don't do this.

FYI:

Your dtypes likely defaulted to object, because you didn't specify them when you created your data frame. You can do that like this:

df = pd.DataFrame(data=sample, dtype='string')

Also, it's recommended to use iloc when indexing a DataFrame by integer position.
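A small sketch of both points, assuming a pandas version that supports the 'string' dtype (1.0 or later). The custom index labels 10 and 20 are made up here to show the difference between positions and labels.

```python
import pandas as pd

sample = {'col1': ['cat', 'dog'],
          'col2': ['WrappedArray([animal], [cat])', 'WrappedArray([animal], [dog])']}

# Hypothetical non-default index so that labels and positions differ
df = pd.DataFrame(data=sample, index=[10, 20], dtype='string')

is_string = pd.api.types.is_string_dtype(df['col2'])

by_position = df['col2'].iloc[0]  # first row, regardless of its label
by_label = df['col2'].loc[10]     # row whose index label is 10

# Setting the dtype does not parse the content: the value is still text
still_text = isinstance(by_position, str)
```

Note that specifying the dtype only changes how the column is stored. The 'WrappedArray(...)' text still has to be parsed separately.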


0

I solved it as @rkedge advised.

The data is written in a foreign language.

As I said, the DataFrame holds object data written like 'WrappedArray([우주ごぎゅ],[ぎゃ],[한국어])'.

df_ = df['col2'].str.extractall(r'([REGEX expression]+)')
df_

0   0   우주ごぎゅ
0   1   ぎゃ
0   2   한국어
1   0   cat
2   0   animal
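The regular expression is not shown above, so the following is only a guess at how such a call could look. The pattern is an assumption (one capture group per bracketed item, items contain no square brackets), and the three sample rows are invented to mirror the output shown.

```python
import pandas as pd

# Invented rows that mirror the output above
df = pd.DataFrame({'col2': ['WrappedArray([우주ごぎゅ],[ぎゃ],[한국어])',
                            'WrappedArray([cat])',
                            'WrappedArray([animal])']})

# Assumed pattern, not the one from the answer
df_ = df['col2'].str.extractall(r'\[([^\[\]]+)\]')

# extractall returns one row per match, indexed by (original row, match number).
# Grouping on the first index level collects the matches back into one list per row.
per_row = df_[0].groupby(level=0).agg(list)

print(df_)
```

Because the pattern matches on brackets rather than on specific characters, it does not depend on the language of the text.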
