My RDD contains tab-delimited strings. I'm trying to filter it so that only rows whose column 5 contains any of a few strings are kept:
filt_data = raw_data.filter(lambda x: '' if len(x.split('\t')) < 5 else "apple" in x.split('\t')[4] or "pear" in x.split('\t')[4] or "berry" in x.split('\t')[4] or "cherry" in x.split('\t')[4])
I don't think this is a very efficient solution, since I'm splitting the same row up to five times. Can someone show a better way of doing it?
And what if I have an array of "fruits"? How can I filter my RDD to keep the rows whose column 5 contains any element of this array?
I could do something like x.split('\t')[4] in array, but that matches only when column 5 is exactly equal to an array element. I need to check whether column 5 contains any of the strings in the array as a substring.
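
To make the goal concrete, here is a minimal sketch of the behaviour I'm after, not a tested Spark solution. The helper name make_column_filter and its parameters are hypothetical, and the sample rows are made up for illustration. It splits each row once and uses any() for the substring check. It is shown on a plain Python list so that it runs without Spark; I assume the same predicate could be passed to raw_data.filter.

```python
def make_column_filter(substrings, column=4, sep="\t"):
    """Build a predicate: True when the given column contains any of the substrings.

    Hypothetical helper for illustration; column is 0-based, so 4 means column 5.
    """
    substrings = tuple(substrings)

    def predicate(line):
        fields = line.split(sep)       # split the row only once
        if len(fields) <= column:      # row too short: drop it
            return False
        value = fields[column]
        # Substring test against every entry, stopping at the first match
        return any(s in value for s in substrings)

    return predicate


fruits = ["apple", "pear", "berry", "cherry"]
has_fruit = make_column_filter(fruits)

# Assumed RDD usage: filt_data = raw_data.filter(has_fruit)
# Demonstrated here on a plain list (made-up rows) so the sketch runs without Spark.
sample = [
    "a\tb\tc\td\tpineapple pie",
    "a\tb\tc\td\tbanana",
    "too\tshort",
]
kept = list(filter(has_fruit, sample))
```

With this, only the first sample row should be kept, because "pineapple pie" contains "apple", while "banana" matches nothing and the last row has fewer than 5 columns.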