1390

Let’s say I have the following Pandas dataframe:

df = DataFrame({'A': [5,6,3,4], 'B': [1,2,3,5]})
df

     A   B
0    5   1
1    6   2
2    3   3
3    4   5

I can subset based on a specific value:

x = df[df['A'] == 3]
x

     A   B
2    3   3

But how can I subset based on a list of values? - something like this:

list_of_values = [3, 6]

y = df[df['A'] in list_of_values]

To get:

     A    B
1    6    2
2    3    3
1

9 Answers 9

2285

You can use the isin method:

In [1]: df = pd.DataFrame({'A': [5,6,3,4], 'B': [1,2,3,5]})

In [2]: df
Out[2]:
   A  B
0  5  1
1  6  2
2  3  3
3  4  5

In [3]: df[df['A'].isin([3, 6])]
Out[3]:
   A  B
1  6  2
2  3  3

And to get the opposite use ~:

In [4]: df[~df['A'].isin([3, 6])]
Out[4]:
   A  B
0  5  1
3  4  5
Sign up to request clarification or add additional context in comments.

Comments

104

You can use the method query:

df.query('A in [6, 3]')
# df.query('A == [6, 3]')

or

lst = [6, 3]
df.query('A in @lst')
# df.query('A == @lst')

5 Comments

i wonder if query() is computationally better than isin() function
@Hammad According to Pandas docs: "DataFrame.query() using numexpr is slightly faster than Python for large frames."
@Hammad I did a little test and query is faster than boolean indexing for dataframes with >10k rows (check this answer for more info).
how to control for case sensitivity with query? In case you are searching strings.
@sheth7 You can use df.query('A.str.lower() in @lst')
31

list_of_values doesn't have to be a list; it can be set, tuple, dictionary, numpy array, pandas Series, generator, range etc. and isin() and query() will still work.

A note on query():

  • You can also call isin() inside query():
    list_of_values = [3, 6]
    df.query("A.isin(@list_of_values)")
    
  • You can pass a values to search over as a local_dict argument, which is useful if you don't want to create the filtering list beforehand in a chain of function calls:
    df.query("A == @lst", local_dict={'lst': [3, 6]})
    

Some common problems with selecting rows

1. list_of_values is a range

If you need to filter within a range, you can use between() method or query().

list_of_values = [3, 4, 5, 6] # a range of values

df[df['A'].between(3, 6)]  # or
df.query('3<=A<=6')

2. Return df in the order of list_of_values

In the OP, the values in list_of_values don't appear in that order in df. If you want df to return in the order they appear in list_of_values, i.e. "sort" by list_of_values, use loc.

list_of_values = [3, 6]
df.set_index('A').loc[list_of_values].reset_index()

If you want to retain the old index, you can use the following.

list_of_values = [3, 6, 3]
df.reset_index().set_index('A').loc[list_of_values].reset_index().set_index('index').rename_axis(None)

3. Don't use apply

In general, isin() and query() are the best methods for this task; there's no need for apply(). For example, for function f(A) = 2*A - 5 on column A, both isin() and query() work much more efficiently:

df[(2*df['A']-5).isin(list_of_values)]         # or
df[df['A'].mul(2).sub(5).isin(list_of_values)] # or
df.query("A.mul(2).sub(5) in @list_of_values")

4. Select rows not in list_of_values

To select rows not in list_of_values, negate isin()/in:

df[~df['A'].isin(list_of_values)]
df.query("A not in @list_of_values")  # df.query("A != @list_of_values")

5. Select rows where multiple columns are in list_of_values

If you want to filter using both (or multiple) columns, there's any() and all() to reduce columns (axis=1) depending on the need.

  1. Select rows where at least one of A or B is in list_of_values:
    df[df[['A','B']].isin(list_of_values).any(1)]
    df.query("A in @list_of_values or B in @list_of_values")
    
  2. Select rows where both of A and B are in list_of_values:
    df[df[['A','B']].isin(list_of_values).all(1)] 
    df.query("A in @list_of_values and B in @list_of_values")
    

Comments

17

You can store your values in a list as:

lis = [3,6]

then

df1 = df[df['A'].isin(lis)]

3 Comments

What's the difference between top answer?
if you have more values to filter, it is better to store them as list and filter that list from the main dataframe.
what if I want to preserve the order based on the lis list?
10

Another method;

df.loc[df.apply(lambda x: x.A in [3,6], axis=1)]

Unlike the isin method, this is particularly useful in determining if the list contains a function of the column A. For example, f(A) = 2*A - 5 as the function;

df.loc[df.apply(lambda x: 2*x.A-5 in [3,6], axis=1)]

It should be noted that this approach is slower than the isin method.

Comments

4

Its trickier with f-Strings

list_of_values = [3,6]


df.query(f'A in {list_of_values}')

Comments

3

The above answers are correct, but if you still are not able to filter out rows as expected, make sure both DataFrames' columns have the same dtype.

source = source.astype({1: 'int64'})
to_rem = to_rem.astype({'some col': 'int64'})

works = source[~source[1].isin(to_rem['some col'])]

Took me long enough.

Comments

1

A non pandas solution that compares in terms of speed may be:

filtered_column = set(df.A) - set(list_list_of_values)

Comments

1
import pandas as pd

# Create the DataFrame
df = pd.DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 5]})

# Define the list of values to subset
list_of_values = [3, 6]

# Subset the DataFrame based on the list of values in column 'A'
subset_df = df[df['A'].isin(list_of_values)]

# Display the subset DataFrame
print(subset_df)

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.