I have a fairly large dask dataframe mydataframe and a numpy array mycodes. I want to filter the rows of mydataframe, keeping only those where the column CODE is not in mycodes. After getting an error, I reset the index of the dataframe so that the divisions (partition boundaries) are known, as I read that this was important. I tried the following code:
    is_new = ~mydataframe["CODE"].isin(mycodes).compute().values.flatten()
    new_codes = mydataframe.loc[is_new, "CODE"].drop_duplicates().compute()
and variations of it. I get errors about the number of partitions or the length of the index I pass as a filter, and other approaches have produced other errors, sometimes assertion errors. I can't seem to do something as simple as filtering the rows of a dataframe.
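For what it's worth, here is a minimal, self-contained sketch of the fully lazy variant I would expect to work, where the boolean mask stays a dask Series instead of being computed into a numpy array first (the toy data below is made up just so the snippet runs on its own):

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    # Toy stand-ins for the real data, just to make the sketch runnable.
    pdf = pd.DataFrame({"CODE": ["01", "02", "03", "04"], "value": range(4)})
    mydataframe = dd.from_pandas(pdf, npartitions=2)
    mycodes = np.array(["02", "04"])

    # Keep the mask lazy: a dask Series aligned with mydataframe, so there is
    # no partition-count or index-length mismatch when filtering.
    is_new = ~mydataframe["CODE"].isin(mycodes)
    new_codes = mydataframe[is_new]["CODE"].drop_duplicates().compute()

Is this the right pattern, or are there cases where the mask still has to be computed first?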
Forgive the lack of a fully reproducible example, but I don't think one is really necessary; the question is general: can anyone give me some pointers on how to filter the rows of a large dask dataframe? What do I need to take into account, and what are the limitations?
You can find the data I am working with for mydataframe here; I am testing with the data in the first zip file. It's a fixed-width file (fwf), and you have the column design in this gist. The only relevant variable is CODE, which I read as a string. For mycodes you can try any subset of the codes.
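In case it helps to reproduce, this is roughly how I load the file; the colspecs, column names, and path below are placeholders, since the real fixed-width design is in the gist:

    import dask.dataframe as dd

    # Placeholder layout: the real column design is in the linked gist.
    colspecs = [(0, 8), (8, 16)]
    names = ["CODE", "OTHER"]

    mydataframe = dd.read_fwf(
        "data/*.txt",            # hypothetical path to the unzipped files
        colspecs=colspecs,
        names=names,
        dtype={"CODE": str},     # CODE is read as a string, as noted above
    )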