Filtering in pyspark

Question

My RDD contains tab-delimited strings. I'm trying to filter it, keeping only the rows whose column 5 contains any of a few strings:

filt_data = raw_data.filter(lambda x: '' if len(x.split('\t')) < 5 else "apple" in x.split('\t')[4] or "pear" in x.split('\t')[4] or "berry" in x.split('\t')[4] or "cherry" in x.split('\t')[4])

I don't think this is a very efficient solution, since I'm doing four splits of the same row. Can someone show a more optimal way of doing it?

And what if I have an array of "fruits"? How can I filter my RDD for rows that contain elements from this array? I could do something like x.split('\t')[4] in array, but that only matches when an array element is equal to the column 5 item; I need to check whether column 5 contains any of the strings in the array.

Yaron · Accepted Answer · 2016-08-10 12:56:02Z

1

You can replace the lambda with a "real" function that does whatever you like in an efficient way. Below is a prototype of the suggested solution:

def efficient_func(line):
    fields = line.split('\t')  # split once and reuse the result
    if len(fields) < 5:
        return False  # too few columns: drop the row
    word = fields[4]
    ...

    return ...

filt_data = raw_data.filter(efficient_func)
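One way the prototype above might be completed is sketched here. The name `has_fruit_in_col5`, the `FRUITS` tuple, and the sample rows are made up for illustration; the predicate is plain Python, so it is exercised with the built-in `filter` on a list, on the assumption that `raw_data.filter(...)` would accept the same function.

```python
# Hypothetical list of substrings to look for (taken from the question).
FRUITS = ("apple", "pear", "berry", "cherry")


def has_fruit_in_col5(line, fruits=FRUITS):
    """Return True when the fifth tab-separated field contains any fruit."""
    fields = line.split("\t")  # split once and reuse the result
    if len(fields) < 5:
        return False  # too few columns: drop the row
    word = fields[4]
    # Substring test against each fruit; stops at the first match.
    return any(fruit in word for fruit in fruits)


# Made-up sample rows standing in for the RDD contents.
sample = [
    "1\t2\t3\t4\tfresh strawberry",
    "1\t2\t3\t4\tcarrot",
    "too\tshort",
]
kept = list(filter(has_fruit_in_col5, sample))
```

With an RDD, the equivalent call would presumably be `raw_data.filter(has_fruit_in_col5)`.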

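Regarding the second question: I think that using one "if" statement should be better than using several "if" statements, e.g.

fruits_array = ['apple', 'pear', 'berry', 'cherry']
# `word in fruits_array` would only match when column 5 equals a fruit exactly;
# any() with a substring test matches when column 5 contains one of them.
if any(fruit in word for fruit in fruits_array):
  do_something (or return some_value)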
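For the requirement stated in the question (column 5 contains any string from the array, rather than equals one), another possible sketch builds the predicate from the array up front. This uses a compiled regular expression, which is a swapped-in technique and not something the answer proposes; `make_contains_filter`, its parameters, and the sample rows are hypothetical names chosen for illustration.

```python
import re


def make_contains_filter(substrings, column=4, sep="\t"):
    """Build a predicate that is True when field `column` contains any substring."""
    substrings = list(substrings)
    if not substrings:
        # An empty alternation would match every string, so reject all rows instead.
        return lambda line: False
    # re.escape keeps characters such as '.' or '+' literal.
    pattern = re.compile("|".join(re.escape(s) for s in substrings))

    def predicate(line):
        fields = line.split(sep)  # split once per row
        return len(fields) > column and pattern.search(fields[column]) is not None

    return predicate


fruits_array = ['apple', 'pear', 'berry', 'cherry']
keep = make_contains_filter(fruits_array)

# Made-up rows standing in for the RDD contents.
rows = ["a\tb\tc\td\tpineapple juice", "a\tb\tc\td\tbanana", "a\tb"]
kept_rows = [r for r in rows if keep(r)]
```

The returned function could then be passed to the RDD in the same way as before, e.g. `raw_data.filter(keep)`. Whether the regex beats a plain `any(...)` loop depends on the number of substrings and is worth measuring rather than assuming.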

edited Aug 10, 2016 at 12:56

answered Aug 10, 2016 at 10:59

Yaron

2 Comments

lacerated Over a year ago

Thanks! It did speed up the process. Any idea about the second question, how to filter by array contents?

Yaron Over a year ago

If I answered your question, please accept it. Regarding your second question, I think that it is a good idea and should work. I've updated my answer to reflect that part as well. Please check if it works for you.
