My RDD contains tab-delimited strings. I'm trying to filter it, keeping a row only if column 5 contains one of a few strings:

filt_data = raw_data.filter(lambda x: '' if len(x.split('\t')) < 5 else "apple" in x.split('\t')[4] or "pear" in x.split('\t')[4] or "berry" in x.split('\t')[4] or "cherry" in x.split('\t')[4])

I don't think this is a very efficient solution, since I'm splitting the same row up to five times. Can someone show a better way of doing it?

And what if I have an array of "fruits"? How can I filter my RDD for rows that contain elements from this array? I could do something like x.split('\t')[4] in array, but that matches only when an array element is equal to the column 5 item; I need to check whether column 5 contains any of the strings in the array.

1 Answer

You can replace the lambda with a regular named function that splits the line once and then runs whatever checks you need. Here is a prototype of the suggested solution:

def efficient_func(line):
    # Split the row once and reuse the result
    columns = line.split('\t')
    if len(columns) < 5:
        return False  # drop rows with fewer than 5 columns
    word = columns[4]
    ...

    return ...

filt_data = raw_data.filter(efficient_func)

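Regarding the second question: a single check over the whole list should be better than chaining several conditions, e.g.

fruits_array = ['apple', 'pear', 'berry', 'cherry']
# `word in fruits_array` matches only when word equals an element exactly;
# any() checks whether any fruit occurs as a substring of word
matches = any(fruit in word for fruit in fruits_array)
# inside the filter function, return this value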
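
Putting both parts together, a minimal sketch of the whole filter might look like the following. The names has_fruit, FRUITS, and sample_rows are hypothetical and not from the question, and a plain Python list stands in for the RDD so the snippet runs without Spark.

```python
FRUITS = ['apple', 'pear', 'berry', 'cherry']

def has_fruit(line, fruits=FRUITS):
    """Return True if column 5 of a tab-delimited line contains any fruit."""
    columns = line.split('\t')  # split once, reuse the result
    if len(columns) < 5:
        return False  # rows with fewer than 5 columns are dropped
    word = columns[4]
    # Substring test: True if any fruit occurs anywhere inside column 5
    return any(fruit in word for fruit in fruits)

# Plain-Python stand-in for the RDD (made-up sample rows)
sample_rows = [
    'a\tb\tc\td\tgreen apple pie\tf',
    'a\tb\tc\td\tbanana',
    'a\tb\tc',
    'a\tb\tc\td\tblueberry jam',
]
kept = [row for row in sample_rows if has_fruit(row)]
```

With an RDD, the same function should be usable as raw_data.filter(has_fruit), since filter only needs a function that returns a truthy or falsy value per row.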

2 Comments

Thanks! It did speed up the process. Any idea about the second question, how to filter by array contents?
If I answered your question, please accept it. Regarding your second question, I think it is a good idea and should work. I've updated my answer to cover that part as well. Please check whether it works for you.
