3
filtered=[]
text="any.pdf"
if "doc" and "pdf" and "xls" and "jpg" not in text:
    filtered.append(text)
print(filtered)

This is my first post on Stack Overflow, so excuse me if there's something annoying in the question. The code is supposed to append text if text doesn't include any of these words: doc, pdf, xls, jpg. It works fine if it's written like this:

if "doc" in text:
    pass
elif "jpg" in text:
    pass
elif "pdf" in text:
    pass
elif "xls" in text:
    pass
else:
    filtered.append(text)

5 Answers

6

If you open up a Python interpreter, you'll find that "doc" and "pdf" and "xls" and "jpg" is the same thing as 'jpg', because and returns its last operand when all the earlier ones are truthy:

>>> "doc" and "pdf" and "xls" and "jpg"
'jpg'

So rather than testing against all the strings, your first attempt only tests against 'jpg'.
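
The precedence at work can be checked directly: not in binds more tightly than and, so the original condition reduces to "jpg" not in text. A minimal sketch, with helper names and sample strings made up for illustration:

```python
# Non-empty string literals are always truthy, so the chain collapses
# to its final operand, which is the comparison `"jpg" not in text`.
def original_condition(text):
    return "doc" and "pdf" and "xls" and "jpg" not in text


# What the question actually meant: none of the strings occurs in text.
def intended_condition(text):
    return not any(s in text for s in ("doc", "pdf", "xls", "jpg"))


for sample in ("any.pdf", "photo.jpg", "notes.txt"):
    print(sample, original_condition(sample), intended_condition(sample))
```

The two functions disagree on "any.pdf", which is exactly the case from the question.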

There are a number of ways to do what you want. The below isn't the most obvious, but it's useful:

if not any(test_string in text for test_string in ["doc", "pdf", "xls", "jpg"]):
    filtered.append(text)

Another approach would be to use a for loop in conjunction with an else clause, which runs only if the loop finishes without hitting break:

for test_string in ["doc", "pdf", "xls", "jpg"]:
    if test_string in text:
        break
else: 
    filtered.append(text)

Finally, you could use a pure list comprehension:

tofilter = ["one.pdf", "two.txt", "three.jpg", "four.png"]
test_strings = ["doc", "pdf", "xls", "jpg"]
filtered = [s for s in tofilter if not any(t in s for t in test_strings)]
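
As a sanity check, the forms above should agree with each other. The sketch below wraps the first two in helper functions (keep_any and keep_for_else are hypothetical names, not part of the original answer) and compares them on the same sample list:

```python
TEST_STRINGS = ["doc", "pdf", "xls", "jpg"]


def keep_any(text):
    """any()-based check: True when no test string occurs in text."""
    return not any(t in text for t in TEST_STRINGS)


def keep_for_else(text):
    """for/else check: the else branch runs only if the loop never breaks."""
    for t in TEST_STRINGS:
        if t in text:
            break
    else:
        return True
    return False


tofilter = ["one.pdf", "two.txt", "three.jpg", "four.png"]
filtered = [s for s in tofilter if keep_any(s)]
print(filtered)
print([s for s in tofilter if keep_for_else(s)] == filtered)
```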

EDIT:

If you want to filter both words and extensions, I would recommend the following:

text_list = generate_text_list() # or whatever you do to get a text sequence
extensions = ['.doc', '.pdf', '.xls', '.jpg']
words = ['some', 'words', 'to', 'filter']
text_list = [text for text in text_list if not text.endswith(tuple(extensions))]
text_list = [text for text in text_list if not any(word in text for word in words)]

This could still lead to some mismatches; the above also filters "Do something", "He's a wordsmith", etc. If that's a problem then you may need a more complex solution.
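
One possible "more complex solution" is to match whole words with a regular expression. This is only a sketch under the assumption that "word" means a run of letters, digits, or underscores, which is what \b uses; word_pattern and contains_word are names invented here:

```python
import re

words = ["some", "words", "to", "filter"]

# \b matches at a word boundary, so "some" no longer matches inside
# "something". re.escape guards against regex metacharacters in the words.
word_pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")


def contains_word(text):
    return word_pattern.search(text) is not None


samples = ["Do something", "He's a wordsmith", "some text", "words to filter"]
kept = [s for s in samples if not contains_word(s)]
print(kept)
```

Note that this match is case-sensitive; passing re.IGNORECASE to re.compile would relax that.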


1 Comment

Rather than editing I'll simply add that if you want to ignore case, you should use the str.lower() method -- i.e. "pdf" in text.lower(). Also, using .endswith() (S.Mark's answer) is nice because it doesn't reject strings like "mypdfprocessor.py".
4

If those extensions are always at the end, you can use .endswith(), which accepts a tuple of suffixes.

if not text.endswith(("doc", "pdf", "xls", "jpg")):
    filtered.append(text)
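
A small variation, assuming you want the match anchored on the dot and case-insensitive, so that a name like "whatsupdoc" is kept and "REPORT.XLS" is rejected. The leading dots, the lower() call, and the helper name are assumptions layered on top of the answer above:

```python
EXTENSIONS = (".doc", ".pdf", ".xls", ".jpg")  # leading dots are an assumption


def has_blocked_extension(text):
    # str.endswith accepts a tuple and returns True if any suffix matches.
    return text.lower().endswith(EXTENSIONS)


filtered = []
for text in ["any.pdf", "REPORT.XLS", "whatsupdoc", "notes.txt"]:
    if not has_blocked_extension(text):
        filtered.append(text)
print(filtered)
```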

2 Comments

Just change it to if not, as the code should exclude links which end with these strings. Sorry, I can't edit it myself, as it tells me the edit is less than 6 characters. Thanks.
+1, endswith is definitely the way to go for filtering specifically based on extension.
3
import os

basename, ext = os.path.splitext(some_filename)
if ext not in ('.pdf', '.png'):
    filtered.append(some_filename)
....
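
For reference, here is how os.path.splitext splits a few names (the file names are made up for illustration). It returns the extension with its leading dot, splits only on the last dot, and treats a leading dot as part of the base name:

```python
import os.path

results = {}
for name in ("report.pdf", "archive.tar.gz", "README", ".bashrc"):
    results[name] = os.path.splitext(name)
    print(name, results[name])
```

Because the returned extension keeps its dot and its original case, the tuple being tested against needs dotted entries, and a call to .lower() is needed if case should be ignored.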


1

Try the following:

if all(substring not in text for substring in ['doc', 'pdf', 'xls', 'jpg']):
    filtered.append(text)


1

The currently-selected answer is very good as far as explaining the syntactically correct ways to do what you want to do. However it's obvious that you are dealing with file extensions, which appear at the end [fail: doctor_no.py, whatsupdoc], and probable that you are using Windows, where case distinctions in file paths don't exist [fail: FUBAR.DOC].

To cover those bases:

# setup
import os.path
interesting_extensions = set("." + x for x in "doc pdf xls jpg".split())

# each time around
basename, ext = os.path.splitext(text)
if ext.lower() not in interesting_extensions:
    filtered.append(text)

4 Comments

Sorry, I don't get what you are saying, but I'm using Ubuntu. The main goal was spidering a website: after extracting the links from the source code, I was excluding links that contain javascript or these words. Thanks anyway.
You are excluding links containing those strings, not links containing those words. You will (for example) exclude a link that contains the word "doctor" or "dock" or "docket" or "doctored", and fail to exclude a link that contains a filename in upper case (example: FUBAR.DOC).
I'm using .lower(), so FUBAR.DOC won't be included, but you are right that all such words will be excluded, which I don't want. The problem is that not all of the words are extensions, like javascript at the start, so what to do?
@Mahmoud A. Raouf: "what to do??": (1) edit your question to say what you actually want to do (it points heavily towards file extensions and doesn't mention "javascript in start" (which you should explain)). (2) unselect the selected answer (3) await an answer that solves your problem
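
For the spidering case described in these comments, one possible direction is to parse each link first and test its parts separately. This is a rough sketch, not the asker's code: it assumes "javascript in start" means links using the javascript: scheme, and keep_link, BLOCKED_SCHEMES, and the example.com URLs are all hypothetical:

```python
import posixpath
from urllib.parse import urlsplit

BLOCKED_EXTENSIONS = {".doc", ".pdf", ".xls", ".jpg"}
BLOCKED_SCHEMES = {"javascript"}  # assumption about "javascript in start"


def keep_link(url):
    parts = urlsplit(url)
    if parts.scheme.lower() in BLOCKED_SCHEMES:
        return False
    # Look only at the path, so query strings and fragments are ignored.
    ext = posixpath.splitext(parts.path)[1].lower()
    return ext not in BLOCKED_EXTENSIONS


links = [
    "http://example.com/files/FUBAR.DOC",
    "http://example.com/doctor/index.html",
    "javascript:void(0)",
    "http://example.com/report.pdf?download=1",
]
kept_links = [u for u in links if keep_link(u)]
print(kept_links)
```

Testing the scheme and the extension separately avoids rejecting a link merely because "doc" or "javascript" appears somewhere in the middle of it.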
