Python Logic in searching String

Question

filtered=[]
text="any.pdf"
if "doc" and "pdf" and "xls" and "jpg" not in text:
    filtered.append(text)
print(filtered)

This is my first post on Stack Overflow, so please excuse anything annoying in the question. The code is supposed to append text to filtered only if text doesn't contain any of these strings: doc, pdf, xls, jpg. It works fine if I write it like this:

if "doc" in text:
    pass
elif "jpg" in text:
    pass
elif "pdf" in text:
    pass
elif "xls" in text:
    pass
else:
    filtered.append(text)

senderle · Accepted Answer · 2011-02-28 15:57:28Z

6

If you open up a Python interpreter, you'll find that "doc" and "pdf" and "xls" and "jpg" evaluates to 'jpg':

>>> "doc" and "pdf" and "xls" and "jpg"
'jpg'

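Non-empty strings are truthy, and the and operator returns its last operand when every earlier one is truthy. So rather than testing against all the strings, your first attempt only tests against 'jpg'.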
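To see why, it may help to spell out how Python parses the original condition. The sketch below assumes the same text value as the question; the variable names original, only_jpg_checked, and intended are illustrative only.

```python
text = "any.pdf"

# `not in` binds more tightly than `and`, so the original condition parses as:
original = "doc" and "pdf" and "xls" and ("jpg" not in text)

# Non-empty strings are truthy, so `and` falls through to its last operand.
only_jpg_checked = "jpg" not in text

# What was intended: none of the substrings may occur in text.
intended = ("doc" not in text and "pdf" not in text
            and "xls" not in text and "jpg" not in text)

print(original, only_jpg_checked, intended)  # True True False
```

The first two values agree, which is why "any.pdf" was appended even though it contains "pdf".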

There are a number of ways to do what you want. The below isn't the most obvious, but it's useful:

if not any(test_string in text for test_string in ["doc", "pdf", "xls", "jpg"]):
    filtered.append(text)

Another approach would be to use a for loop with an else clause, which runs only if the loop finishes without hitting break:

for test_string in ["doc", "pdf", "xls", "jpg"]:
    if test_string in text:
        break
else:
    filtered.append(text)
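As a rough illustration of how the for/else form behaves, here is the same loop wrapped in a small function. The name first_match and the sample inputs are hypothetical, added only for the demonstration.

```python
def first_match(text, test_strings):
    """Return the first test string found in text, or None if none match."""
    for test_string in test_strings:
        if test_string in text:
            found = test_string
            break          # leaving via break skips the else clause
    else:
        found = None       # runs only when the loop finished without break
    return found

print(first_match("any.pdf", ["doc", "pdf", "xls", "jpg"]))    # pdf
print(first_match("notes.txt", ["doc", "pdf", "xls", "jpg"]))  # None
```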

Finally, you could use a pure list comprehension:

tofilter = ["one.pdf", "two.txt", "three.jpg", "four.png"]
test_strings = ["doc", "pdf", "xls", "jpg"]
filtered = [s for s in tofilter if not any(t in s for t in test_strings)]

EDIT:

If you want to filter both words and extensions, I would recommend the following:

text_list = generate_text_list() # or whatever you do to get a text sequence
extensions = ['.doc', '.pdf', '.xls', '.jpg']
words = ['some', 'words', 'to', 'filter']
text_list = [text for text in text_list if not text.endswith(tuple(extensions))]
text_list = [text for text in text_list if not any(word in text for word in words)]

This could still lead to some mismatches; the above also filters "Do something", "He's a wordsmith", etc. If that's a problem then you may need a more complex solution.
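One possible "more complex solution" is to match whole words with a regular expression. This is only a sketch under the assumption that a "word" means what the re module's \b boundary considers one (letters, digits, and underscores); contains_word and the sample strings are hypothetical.

```python
import re

words = ['some', 'words', 'to', 'filter']

# \b matches at word boundaries, so 'some' no longer matches 'something'.
word_pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b',
    re.IGNORECASE,
)

def contains_word(text):
    """Return True if text contains any of the words as a whole word."""
    return word_pattern.search(text) is not None

samples = ["Do something", "He's a wordsmith", "some words to filter", "Go to town"]
kept = [s for s in samples if not contains_word(s)]
print(kept)  # ['Do something', "He's a wordsmith"]
```

Note that \b treats an underscore as part of a word, so "to_do" would not match "to" here.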

edited Feb 28, 2011 at 15:57

answered Feb 27, 2011 at 7:31

senderle


1 Comment

senderle Over a year ago

Rather than editing I'll simply add that if you want to ignore case, you should use the str.lower() method -- i.e. "pdf" in text.lower(). Also, using .endswith() (S.Mark's answer) is nice because it doesn't reject strings like "mypdfprocessor.py".

YOU · Answer · 2011-02-27 08:17:01Z

4

If those extensions are always at the end, you can use str.endswith, which also accepts a tuple of suffixes.

if not text.endswith(("doc", "pdf", "xls", "jpg")):
    filtered.append(text)
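A slightly stricter variant, offered as a sketch rather than a correction: it assumes you want to match real extensions (with the leading dot) and ignore case. The name has_blocked_extension and the candidate strings are illustrative.

```python
EXTENSIONS = (".doc", ".pdf", ".xls", ".jpg")  # leading dots are an assumption

def has_blocked_extension(text):
    # str.endswith accepts a tuple and returns True if any suffix matches.
    return text.lower().endswith(EXTENSIONS)

candidates = ["any.pdf", "FUBAR.DOC", "whatsupdoc", "mypdfprocessor.py"]
filtered = [c for c in candidates if not has_blocked_extension(c)]
print(filtered)  # ['whatsupdoc', 'mypdfprocessor.py']
```

Without the dot, "whatsupdoc" would be rejected because it happens to end in "doc".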

edited Feb 27, 2011 at 8:17

answered Feb 27, 2011 at 7:28

YOU


2 Comments

Mahmoud A. Raouf Over a year ago

Please change it to "if not", since the code should exclude links that end with these strings. Sorry, I can't make the edit myself because it tells me the change is less than 6 characters. Thanks.

senderle Over a year ago

+1, endswith is definitely the way to go for filtering specifically based on extension.

user2665694 · Answer · 2011-02-27 07:33:32Z

3

import os

basename, ext = os.path.splitext(some_filename)
if ext not in ('.pdf', '.png'):
    filtered.append(some_filename)
# ....

answered Feb 27, 2011 at 7:33

user2665694


Adeel Zafar Soomro · Answer · 2011-02-27 07:28:32Z

1

Try the following:

if all(substring not in text for substring in ['doc', 'pdf', 'xls', 'jpg']):
    filtered.append(text)

answered Feb 27, 2011 at 7:28

Adeel Zafar Soomro


John Machin · Answer · 2011-02-27 10:17:52Z

1

The currently selected answer does a very good job of explaining the syntactically correct ways to do what you want. However, it's obvious that you are dealing with file extensions, which appear at the end [fail: doctor_no.py, whatsupdoc], and probable that you are using Windows, where file paths are case-insensitive [fail: FUBAR.DOC].

To cover those bases:

# setup
import os.path
interesting_extensions = set("." + x for x in "doc pdf xls jpg".split())

# each time around
basename, ext = os.path.splitext(text)
if ext.lower() not in interesting_extensions:
    filtered.append(text)
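To check the failure cases mentioned above, the same logic can be wrapped in a function and run over a few names. This is a minimal sketch; the function name keep and the sample list are hypothetical.

```python
import os.path

INTERESTING_EXTENSIONS = {".doc", ".pdf", ".xls", ".jpg"}

def keep(text):
    """Return True if text does not end in one of the listed extensions."""
    ext = os.path.splitext(text)[1]
    return ext.lower() not in INTERESTING_EXTENSIONS

names = ["doctor_no.py", "whatsupdoc", "FUBAR.DOC", "any.pdf"]
filtered = [name for name in names if keep(name)]
print(filtered)  # ['doctor_no.py', 'whatsupdoc']
```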

answered Feb 27, 2011 at 10:17

John Machin


4 Comments

Mahmoud A. Raouf Over a year ago

Sorry, I don't get what you are saying. I'm using Ubuntu, and the main goal was spidering a web site: after extracting the links from the source code, I was excluding links that contain javascript or these words. Thanks anyway.

John Machin Over a year ago

You are excluding links containing those strings, not links containing those words. You will (for example) exclude a link that contains the word "doctor" or "dock" or "docket" or "doctored", and fail to exclude a link that contains a filename in upper case (example: FUBAR.DOC).

Mahmoud A. Raouf Over a year ago

I'm using .lower(), so FUBAR.DOC won't be included, but you are right that all words containing those strings will be excluded, which I don't want. The problem is that not all of the words are extensions, like javascript at the start, so what should I do?

John Machin Over a year ago

@Mahmoud A. Raouf: "what to do??": (1) edit your question to say what you actually want to do (it points heavily towards file extensions and doesn't mention "javascript in start" (which you should explain)). (2) unselect the selected answer (3) await an answer that solves your problem
