How to use Selenium Python to get a field information of each linked page

Question

The context is SpringerLink. Take, for example, this series of books: GTM (Graduate Texts in Mathematics).

I want to get the information located at the bottom of each book's webpage:

book info

All I want is the E-ISBN information on each page.

Is there a way (not limited to Selenium) to enumerate each book page and get this information?

@Prophet I will edit this question to narrow its scope to getting only the E-ISBN information; the download requires authentication, which is much more difficult.
– Kushinada, Commented Jan 17, 2023 at 16:08
You should provide your code trials and explain what exactly did not work, what errors you faced, etc.
– Prophet, Commented Jan 17, 2023 at 16:11

sound wave · Accepted Answer · 2023-01-18 07:49:53Z

For this simple task you can use either Selenium or BeautifulSoup, but the latter is easier and faster, so let's use it to get the titles and E-ISBN codes.

First install BeautifulSoup with the command pip install beautifulsoup4. The code below also uses requests, which can be installed with pip install requests.

Method 1 (faster): get E-ISBN directly from books list

Notice that in the books list each book has an eBook link of the form https://www.springer.com/book/9783031256325, where 9783031256325 is the E-ISBN code without the - characters.

So we can get the E-ISBN codes directly from those URLs, without the need to load a new page for each book:

import requests
from bs4 import BeautifulSoup

url = 'https://www.springer.com/series/136/books'
soup = BeautifulSoup(requests.get(url).text, "html.parser")
titles = [title.text.strip() for title in soup.select('.c-card__title')]
EISBN = []
for a in soup.select('ul:last-child .c-meta__item:last-child a'):
    c = a['href'].split('/')[-1] # a['href'] is something like https://www.springer.com/book/9783031256325
    EISBN.append( f'{c[:3]}-{c[3]}-{c[4:7]}-{c[7:12]}-{c[-1]}' ) # insert four '-' in the number 9783031256325 to create the E-ISBN code

for isbn, title in zip(EISBN, titles):
    print(isbn, title)

Output

978-3-031-25632-5 Random Walks on Infinite Groups
978-3-031-19707-9 Drinfeld Modules
978-3-031-13379-4 Partial Differential Equations
978-3-031-00943-3 Stationary Processes and Discrete Parameter Markov Processes
978-3-031-14205-5 Measure Theory, Probability, and Stochastic Processes
978-3-030-56694-4 Quaternion Algebras
978-3-030-73839-6 Mathematical Logic
978-3-030-71250-1 Lessons in Enumerative Combinatorics
978-3-030-35118-2 Basic Representation Theory of Algebras
978-3-030-59242-4 Ergodic Dynamics
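
The slicing in the code above assumes that every code uses the same 3-1-3-5-1 digit grouping as 978-3-031-25632-5. As a minimal sketch, the same step can be wrapped in a helper that also verifies the ISBN-13 check digit, so that a link ending in something other than an ISBN raises an error instead of being silently formatted. The names isbn13_is_valid and eisbn_from_url are made up for this illustration. Keep in mind that real ISBN hyphenation depends on the publisher prefix, so the fixed positions are an assumption that holds for the codes shown in the output above and may not hold for every book.

```python
def isbn13_is_valid(digits: str) -> bool:
    """Return True if `digits` is a 13-digit string with a correct ISBN-13 check digit."""
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    # ISBN-13 rule: weights alternate 1, 3, 1, 3, ... and the weighted sum
    # of all 13 digits must be a multiple of 10
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


def eisbn_from_url(href: str) -> str:
    """Turn an eBook link such as https://www.springer.com/book/9783031256325
    into a hyphenated code such as 978-3-031-25632-5."""
    digits = href.rstrip('/').split('/')[-1]
    if not isbn13_is_valid(digits):
        raise ValueError(f'not a valid ISBN-13: {digits!r}')
    # assumption: fixed 3-1-3-5-1 grouping, as in the codes listed in the output above
    return f'{digits[:3]}-{digits[3]}-{digits[4:7]}-{digits[7:12]}-{digits[12]}'


print(eisbn_from_url('https://www.springer.com/book/9783031256325'))
```

Inside the loop of Method 1 you could then write EISBN.append(eisbn_from_url(a['href'])).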

Method 2 (slower): get E-ISBN by loading a page for each book

This method loads the details page of each book and extracts the E-ISBN code from it:

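import requests
from bs4 import BeautifulSoup

url = 'https://www.springer.com/series/136/books'
soup = BeautifulSoup(requests.get(url).text, "html.parser")
# each book title in the list is a link to that book's details page
books = soup.select('a[data-track-label^="article"]')
titles, EISBN = [], []

for book in books:
    titles.append(book.text.strip())
    # load the details page of this book
    soup_book = BeautifulSoup(requests.get(book['href']).text, "html.parser")
    # the E-ISBN value is in the same <p> as the span marked electronic_isbn_publication_date
    EISBN.append( soup_book.select('p:has(span[data-test=electronic_isbn_publication_date]) .c-bibliographic-information__value')[0].text )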

In case you are wondering, p:has(span[data-test=electronic_isbn_publication_date]) selects the p element that contains a span with the attribute data-test=electronic_isbn_publication_date.
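
To see what the selector does without hitting the website, here is a small sketch that runs it on a hand-written HTML fragment. The fragment is a simplified assumption about the structure of the details page (in particular, the hardcover_isbn_publication_date value and the label texts are invented), and extract_eisbn is a hypothetical helper name. It returns None instead of raising when the field is missing.

```python
from typing import Optional

from bs4 import BeautifulSoup

# Simplified, hand-written stand-in for the bibliographic section of a book page
SAMPLE_HTML = """
<div>
  <p>
    <span data-test="hardcover_isbn_publication_date">Hardcover ISBN</span>
    <span class="c-bibliographic-information__value">978-3-031-25631-8</span>
  </p>
  <p>
    <span data-test="electronic_isbn_publication_date">eBook ISBN</span>
    <span class="c-bibliographic-information__value">978-3-031-25632-5</span>
  </p>
</div>
"""

SELECTOR = ('p:has(span[data-test="electronic_isbn_publication_date"]) '
            '.c-bibliographic-information__value')


def extract_eisbn(html: str) -> Optional[str]:
    """Return the E-ISBN found in `html`, or None if the field is absent."""
    soup = BeautifulSoup(html, "html.parser")
    # p:has(...) keeps only the <p> that contains the marker span;
    # the second part then picks the value element inside that <p>
    node = soup.select_one(SELECTOR)
    return node.get_text(strip=True) if node else None


print(extract_eisbn(SAMPLE_HTML))
```

Only the second p matches, because it is the only one that contains the span with data-test="electronic_isbn_publication_date", so the Hardcover ISBN in the first p is ignored.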

undetected Selenium · Accepted Answer · 2023-01-17 19:17:48Z

You can open each book through its link in a separate tab. After switching to the new tab, use WebDriverWait with visibility_of_element_located() to wait for the element, and then you can extract any of the desired information. As an example, to extract the Hardcover ISBN you can use the following locator strategies:

Code Block:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumption: Chrome is used; another WebDriver should work the same way
driver.get('https://www.springer.com/series/136/books')
# accept the cookie banner
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[data-cc-action='accept']"))).click()
# collect the link of every book in the list
hrefs = [my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a[data-track='click'][data-track-label^='article'][href]")))]
for href in hrefs:
    main_window = driver.current_window_handle
    # open the book page in a new tab and switch to it
    driver.execute_script("window.open('" + href +"');")
    WebDriverWait(driver, 5).until(EC.number_of_windows_to_be(2))
    windows_after = driver.window_handles
    new_window = [handle for handle in windows_after if handle != main_window][0]
    driver.switch_to.window(new_window)
    print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, "//span[text()='Hardcover ISBN']//following::span[@class='c-bibliographic-information__value']"))).text)
    # close the tab and go back to the books list
    driver.close()
    driver.switch_to.window(main_window)
driver.quit()

Console Output:

978-3-031-25631-8
978-3-031-19706-2
978-3-031-13378-7
978-3-031-00941-9
978-3-031-14204-8
978-3-030-56692-0
978-3-030-73838-9
978-3-030-71249-5
978-3-030-35117-5
978-3-030-59241-7
