1

I'm fairly new to coding in Python. For a personal project, I'm looking for different ways to retrieve birthdays and days of death from a list of Wikipedia pages. I am using the wikipedia package.

One approach I am trying is to iterate over the Wikipedia summary and return the index at which I have counted four digits in a row.

import wikipedia as wp

names = ('Zaha Hadid', 'Rem Koolhaas')
wiki_summary = wp.summary(names)
b_counter = 0
i_b_year = []
d_counter = 0
i_d_year = []

for i,x in enumerate(wiki_summary):
    if x.isdigit() == True:
        b_counter += 1
        if b_counter == 4:
           i_b_year = i
           break
        else:
            continue        
    else:
        b_counter = 0

So far, that works for the first person in my list, but I would like to iterate over all the names in my names list. Is there a way to keep the for loop that finds the index and wrap it in another for loop that iterates over the names?

I know there are other ways like parsing to find the bday tags, but I would like to try a couple of different solutions.

2
  • Are you using the wikipedia (pypi.org/project/wikipedia) package? Commented May 15, 2020 at 15:22
  • @arsho Yes, I am. Commented May 15, 2020 at 15:24

3 Answers

1

You are trying to:

  1. Declare two empty lists to store birth year and death year of each person.
  2. Get Wikipedia summary of each person from a tuple.
  3. Parse first two numbers with 4 digits from the summary and append them to birth year and death year list.

The problem is that a person's summary may not contain the birth year and death year as its first two 4-digit numbers. For example, Rem Koolhaas's Wikipedia summary has his birth year as the first 4-digit number, but the second 4-digit number comes from this line: In 2005, he co-founded Volume Magazine together with Mark Wigley and Ole Bouman.

So the i_b_year and i_d_year lists may not contain accurate information.

Here is the code that does what you are trying to achieve:

import wikipedia as wp

names = ('Zaha Hadid', 'Rem Koolhaas')
i_b_year = []
i_d_year = []

for person_name in names:
    wiki_summary = wp.summary(person_name)
    birth_year_found = False
    death_year_found = False
    digits = ""    

    for c in wiki_summary:
        if c.isdigit() == True:
            if birth_year_found == False:                
                digits += c
                if len(digits) == 4:
                    birth_year_found = True
                    i_b_year.append(int(digits))
                    digits = ""
            elif death_year_found == False:
                digits += c
                if len(digits) == 4:
                    death_year_found = True
                    i_d_year.append(int(digits))
                    break
        else:
            digits = ""
    if birth_year_found == False:
        i_b_year.append(0)
    if death_year_found == False:
        i_d_year.append(0)

for i in range(len(names)):
    print(names[i], i_b_year[i], i_d_year[i])

Output:

Zaha Hadid 1950 2016
Rem Koolhaas 1944 2005

Disclaimer: in the above code, I append 0 whenever two 4-digit numbers are not found in a person's summary. As already mentioned, there is no guarantee that a Wikipedia summary lists a person's birth year and death year as its first two 4-digit numbers, so the lists may contain wrong information.
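The same idea can be sketched more compactly with a regular expression. This is only an illustration, not part of the answer above: the helper name first_two_years and the sample sentence are made up, and it returns None instead of 0 for a missing year. Unlike the character loop, it ignores digit runs longer than four. It has the same limitation: the second year found is not necessarily a death year.

```python
import re

# Four digits that are not part of a longer run of digits (so 12345 is ignored).
YEAR_PATTERN = re.compile(r"(?<!\d)\d{4}(?!\d)")


def first_two_years(summary):
    """Return (first_year, second_year) found in the text; None where missing."""
    years = [int(y) for y in YEAR_PATTERN.findall(summary)[:2]]
    years += [None] * (2 - len(years))  # pad so there are always two entries
    return years[0], years[1]


# Hypothetical summary text, standing in for wp.summary(person_name).
sample = "Jane Doe (1 May 1901 – 2 June 1980) was a hypothetical architect."
print(first_two_years(sample))
```

Because the function works on a plain string, it can be tried without any network access before wiring it to wp.summary.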


1 Comment

Cheers. I knew about the 'not dead yet'-problem and was already thinking about how to solve it. You saved me some time. Thank you.
1

I am not familiar with the Wikipedia package, but it seems like you could just iterate over the names tuple:

import Wikipedia as wp

names = ('Zaha Hadid', 'Rem Koolhaas')

i_b_year = []
for name in names: #This line is new
    wiki_summary = wp.summary(name) #Just changed names for name
    b_counter = 0
    d_counter = 0
    i_d_year = []

    for i,x in enumerate(wiki_summary):
        if x.isdigit() == True:
            b_counter += 1
            if b_counter == 4:
               i_b_year.append(i) #I am guessing you want this list to increase with each name in names. Thus, 'append'.
               break
            else:
                continue        
        else:
            b_counter = 0
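One way to keep the two loops apart is to move the digit-counting loop into a function and call it once per name. The sketch below assumes nothing about the wikipedia package: the function name index_of_first_year is made up, and the summaries dictionary is a hypothetical stand-in for the text wp.summary(name) would return.

```python
def index_of_first_year(text):
    """Return the index of the last digit in the first run of four digits, or None."""
    run_length = 0
    for i, ch in enumerate(text):
        if ch.isdigit():
            run_length += 1
            if run_length == 4:
                return i
        else:
            run_length = 0  # run interrupted, start counting again
    return None


# Hypothetical summaries; in the real script these would come from wp.summary(name).
summaries = {
    "Person A": "Person A (born 17 November 1944) is ...",
    "Person B": "No dates here.",
}

# The outer loop over names is now a simple comprehension.
indices = {name: index_of_first_year(text) for name, text in summaries.items()}
print(indices)
```

As in the original loop, the returned index points at the fourth digit, so the year itself is text[i - 3:i + 1].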

1 Comment

Thank you. I had to change it up a little but it got me there.
1

First of all, your code won't work, for a couple of reasons:

  1. The import only works with a lowercase first letter: import wikipedia
  2. The summary method accepts a single string (in your case, one name), so you have to call it once for every name in the tuple

All of this aside, let's try to achieve what you're trying to do:

import wikipedia as wp
import re

# First thing we see (at least for pages provided) is that dates all share the same format:
# For those who are no longer with us 31 October 1950 – 31 March 2016
# For those who are still alive 17 November 1944
# So we have to build regex patterns to find those
# First is the months pattern, since it's quite a big one
MONTHS_PATTERN = r"January|February|March|April|May|June|July|August|September|October|November|December"
# Next we build our date pattern; doubled curly braces give literal braces inside an f-string
DATE_PATTERN = re.compile(fr"\d{{1,2}}\s({MONTHS_PATTERN})\s\d{{4}}")
# Declare our set of names, great choice of architects BTW :)
names = ('Zaha Hadid', 'Rem Koolhaas')
# Since we're trying to get birthdays and dates of death, we will create a dictionary for storing values
lifespans = {}
# Iterate over them in a loop
for name in names:
    lifespan = {'birthday': None, 'deathday': None}
    try:
        summary = wp.summary(name)
        # First we find the first date in summary, since it's most likely to be the birthday
        first_date = DATE_PATTERN.search(summary)
        if first_date:
            # If we've found a date – suppose it's birthday
            bday = first_date.group()
            lifespan['birthday'] = bday
            # Let's check whether the person is no longer with us
            LIFESPAN_PATTERN = re.compile(fr"{bday}\s–\s{DATE_PATTERN.pattern}")
            lifespan_found = LIFESPAN_PATTERN.search(summary)
            if lifespan_found:
                lifespan['deathday'] = lifespan_found.group().replace(f"{bday} – ", '')
            lifespans[name] = lifespan
        else:
            print(f'No dates were found for {name}')
    except wp.exceptions.PageError:
        # Handle a missing page so that the loop keeps going
        print(f'{name} was not found on Wikipedia')

# Print result
print(lifespans)

Output for provided names:

{'Zaha Hadid': {'birthday': '31 October 1950', 'deathday': '31 March 2016'}, 'Rem Koolhaas': {'birthday': '17 November 1944', 'deathday': None}}

This approach is inefficient and has many flaws: for example, a page may contain dates that fit our regular expression without being the birthday or the date of death. It's quite ugly (even though I've tried my best :) ), and you'd be better off parsing tags.
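For comparison, here is a rough sketch of what parsing tags could look like using only the standard library. It assumes the page HTML marks the birth date with a span whose class is bday, as the question suggests. The sample HTML is invented for illustration, and obtaining the real HTML of a page is left out.

```python
from html.parser import HTMLParser


class BdayParser(HTMLParser):
    """Collect the text of every <span class="bday"> element."""

    def __init__(self):
        super().__init__()
        self._in_bday = False
        self.birthdays = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "bday" in classes:
            self._in_bday = True

    def handle_data(self, data):
        if self._in_bday:
            self.birthdays.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_bday = False


# Made-up fragment; real infobox markup may differ (assumption).
sample_html = '<td>(<span class="bday">1950-10-31</span>) 31 October 1950</td>'
parser = BdayParser()
parser.feed(sample_html)
print(parser.birthdays)
```

A dedicated HTML library would be more robust; the point here is only that a tagged date does not depend on where it appears in the prose.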

If you're not happy with the date format from Wikipedia, I suggest you look into datetime. Also keep in mind that these regular expressions fit those two specific pages; I did not research how else dates might be represented on Wikipedia. So, if there are any inconsistencies, I suggest you stick with parsing tags.
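As a small illustration of the datetime suggestion, the strings found above can be turned into date objects with strptime. The helper name parse_wikipedia_date is made up, and the %B directive matches English month names only under the default (C or English) locale, which is assumed here.

```python
from datetime import datetime


def parse_wikipedia_date(text):
    """Parse a 'day month year' string such as '31 October 1950' into a date."""
    return datetime.strptime(text, "%d %B %Y").date()


born = parse_wikipedia_date("31 October 1950")
died = parse_wikipedia_date("31 March 2016")

# ISO format is easier to sort and compare than the prose format.
print(born.isoformat())

# Age at death: subtract one if the last birthday had not been reached yet.
age = died.year - born.year - ((died.month, died.day) < (born.month, born.day))
print(age)
```

Once the values are date objects, they can be compared, sorted, or reformatted with strftime as needed.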
