I'm new to web scraping, and I'm having trouble getting a link to the data on a USGS earthquake's "Did You Feel It?" (DYFI) page. The URL of the page I'm trying to get the data from is: https://earthquake.usgs.gov/earthquakes/eventpage/us7000biji/dyfi/intensity

I'm trying to automate the retrieval of this data so I don't have to download it manually after each earthquake. The URL for the data I'm trying to pull is consistent except for two parts: the earthquake's ID, which I have, and a number that doesn't seem to be tied to anything. Because I can't predict that number, I thought I could get the full URL by scraping the page.
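To make the pattern concrete, here is a rough sketch that splits the example download URL given at the end of this post into its parts. The regular expression, the group names, and the `split_dyfi_url` helper are illustrative only. Reading the number as a Unix timestamp in milliseconds is an assumption, not something the page states.

```python
import re
from datetime import datetime, timezone

# Illustrative pattern; the group names are made up for this sketch.
DYFI_URL = re.compile(
    r"/archive/product/dyfi/"
    r"(?P<event_id>[^/]+)/(?P<source>[^/]+)/(?P<number>\d+)/(?P<filename>[^/]+)$"
)


def split_dyfi_url(url):
    """Split a DYFI archive URL into event ID, source, number and file name."""
    match = DYFI_URL.search(url)
    if match is None:
        raise ValueError("not a DYFI archive URL: " + url)
    return match.groupdict()


parts = split_dyfi_url(
    "https://earthquake.usgs.gov/archive/product/dyfi/"
    "us7000biji/us/1601214674370/dyfi_geo_10km.geojson"
)
print(parts)

# Assumption: if the number is a millisecond Unix timestamp, this is the
# moment it would encode. Nothing on the page confirms that reading.
print(datetime.fromtimestamp(int(parts["number"]) / 1000, tz=timezone.utc))
```

Only the event ID and the number change between earthquakes, which is why the number is the piece that has to be looked up.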

If you look at the page, there is a drop-down menu called Downloads with different data products. I am trying to get the URL for the "DYFI Geospatial Data, UTM aggregated (10 km spacing)" product so I can pull the GeoJSON file using curl.

I don't know much about web scraping or HTML, and most of what I've tried is based on what I've found here and on YouTube.

What I've tried:

I tried using requests to get the HTML and parse it with Beautiful Soup, but the page is dynamically generated, so the HTML that came back didn't include what I was looking for.

import requests
import bs4 #beautiful soup

res = requests.get('https://earthquake.usgs.gov/earthquakes/eventpage/us7000bi0e/dyfi/intensity')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for link in soup.find_all('a'):
    print(link)

This outputs three links, but not the one I need:

<a href="/earthquakes/feed/">Real-time Notifications, Feeds, and Web Services</a>
<a href="https://angular.io/guide/browser-support">view supported
            browsers</a>
<a href="/earthquakes/feed/">Real-time Notifications, Feeds, and
            Web Services</a>

I think the USGS site uses JavaScript to populate the Downloads drop-down menu, which is why plain requests didn't work, so I tried Selenium instead. I hoped it would give me the HTML that I can see with the browser's inspect element tool, but I didn't have any luck.

import bs4 #beautiful soup
from selenium import webdriver
path = "/Users/jon/Desktop/selenium_webdriver/chromedriver" #path to chromedriver on my machine
driver = webdriver.Chrome(executable_path=path)
driver.get('https://earthquake.usgs.gov/earthquakes/eventpage/us7000bi0e/dyfi/intensity')
html_eq = driver.page_source
soup = bs4.BeautifulSoup(html_eq, 'html.parser')
for link in soup.find_all('a'):
    print(link)

This outputs more links than my original attempt, but still not the link I'm looking for. Here is the output of my Selenium attempt:

<a _ngcontent-fgi-c8="" class="hazdev-site-logo" href="/" title="U.S. Geological Survey"><img _ngcontent-fgi-c8="" alt="U.S. Geological Survey logo" src="assets/usgs-logo.svg"/></a>
<a _ngcontent-fgi-c8="" class="hazdev-jumplink-navigation" href="#site-sectionnav">Jump to Navigation</a>
<a _ngcontent-fgi-c5="" class="up-one-level ng-star-inserted" href="/earthquakes/map/" templatesidenavigation=""> Latest Earthquakes </a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/executive" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Overview </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/map" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Interactive Map </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/region-info" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Regional Information </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/impact" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Impact </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/tellus" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Felt Report - Tell Us! </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted active-link" href="/earthquakes/eventpage/us7000bi0e/dyfi" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Did You Feel It? </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/technical" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Technical </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/origin" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Origin </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/waveforms" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Waveforms </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/feed/v1.0/detail/us7000bi0e.kml" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Download Event KML </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/map/#%7B%22autoUpdate%22%3Afalse%2C%22basemap%22%3A%22terrain%22%2C%22event%22%3A%22us7000bi0e%22%2C%22feed%22%3A%22us7000bi0e%22%2C%22mapposition%22%3A%5B%5B6.104279985601153%2C-85.06432001439885%5D%2C%5B10.603920014398849%2C-80.56467998560115%5D%5D%2C%22search%22%3A%7B%22id%22%3A%22us7000bi0e%22%2C%22isSearch%22%3Atrue%2C%22name%22%3A%22Search%20Results%22%2C%22params%22%3A%7B%22endtime%22%3A%222020-09-25T17%3A46%3A43.975Z%22%2C%22latitude%22%3A8.3541%2C%22longitude%22%3A-82.8145%2C%22maxradiuskm%22%3A250%2C%22minmagnitude%22%3A2%2C%22starttime%22%3A%222020-08-14T17%3A46%3A43.975Z%22%7D%7D%7D" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> View Nearby Seismicity </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Earthquakes </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/hazards/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Hazards </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/data/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Data &amp; Products </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/learn/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Learn </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/monitoring/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Monitoring </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/research/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Research </div></a>
<a _ngcontent-fgi-c18="" class="tell-us-link" href="/earthquakes/eventpage/us7000bi0e/tellus" queryparamshandling="preserve"> Felt Report - Tell Us! </a>
<a _ngcontent-fgi-c22=""> View all dyfi products (1 total) </a>
<a _ngcontent-fgi-c20="" href="/earthquakes/eventpage/us7000bi0e/dyfi/intensity"> US </a>
<a _ngcontent-fgi-c18="" aria-current="true" aria-disabled="false" class="mat-tab-link ng-star-inserted mat-tab-label-active" href="/earthquakes/eventpage/us7000bi0e/dyfi/intensity" mat-tab-link="" queryparamshandling="preserve" routerlinkactive="" tabindex="0"> Intensity </a>
<a _ngcontent-fgi-c18="" aria-current="false" aria-disabled="false" class="mat-tab-link ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/dyfi/zip" mat-tab-link="" queryparamshandling="preserve" routerlinkactive="" tabindex="0"> ZIP Map </a>
<a _ngcontent-fgi-c18="" aria-current="false" aria-disabled="false" class="mat-tab-link ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/dyfi/intensity-vs-distance" mat-tab-link="" queryparamshandling="preserve" routerlinkactive="" tabindex="0"> Intensity Vs. Distance </a>
<a _ngcontent-fgi-c18="" aria-current="false" aria-disabled="false" class="mat-tab-link ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/dyfi/responses-vs-time" mat-tab-link="" queryparamshandling="preserve" routerlinkactive="" tabindex="0"> Responses Vs. Time </a>
<a _ngcontent-fgi-c18="" aria-current="false" aria-disabled="false" class="mat-tab-link ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/dyfi/responses" mat-tab-link="" queryparamshandling="preserve" routerlinkactive="" tabindex="0"> DYFI Responses </a>
<a _ngcontent-fgi-c28="" class="ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/map?dyfi-responses-10km=true&amp;shakemap-intensity=false"><img _ngcontent-fgi-c28="" alt="DYFI intensity map" src="https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/us7000bi0e_ciim_geo.jpg"/></a>
<a _ngcontent-fgi-c23="" href="/earthquakes/eventpage/us7000bi0e">Overview</a>
<a _ngcontent-fgi-c32="" class="ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/impact"> Impact Summary </a>
<a _ngcontent-fgi-c18="" href="https://earthquake.usgs.gov/data/dyfi/">Scientific Background for Did You Feel It?</a>
<a href="https://earthquake.usgs.gov/data/comcat/contributor/us/">USGS National Earthquake Information Center, PDE</a>
<a _ngcontent-fgi-c7="" href="/data/comcat/"> ANSS Comprehensive Earthquake Catalog (ComCat) Documentation </a>
<a _ngcontent-fgi-c7="" href="/data/comcat/data-eventterms.php"> Technical terms used on event pages </a>
<a _ngcontent-fgi-c11="" href="mailto:lisa%[email protected]">Questions or comments?</a>
<a _ngcontent-fgi-c11="" class="facebook ng-star-inserted" href="https://www.facebook.com/sharer.php?u=https%3A%2F%2Fearthquake.usgs.gov%2Fearthquakes%2Feventpage%2Fus7000bi0e%2Fdyfi%2Fintensity" title="Share using Facebook">Facebook</a>
<a _ngcontent-fgi-c11="" class="twitter ng-star-inserted" href="https://twitter.com/intent/tweet?url=https%3A%2F%2Fearthquake.usgs.gov%2Fearthquakes%2Feventpage%2Fus7000bi0e%2Fdyfi%2Fintensity&amp;text=USGS%20%7C%20M 5.3 - 1 km NNW of Manaca Norte, Panama" title="Share using Twitter">Twitter</a>
<a _ngcontent-fgi-c11="" class="email ng-star-inserted" href="mailto:lisa%[email protected]?to=&amp;subject=M 5.3 - 1 km NNW of Manaca Norte, Panama&amp;body=https%3A%2F%2Fearthquake.usgs.gov%2Fearthquakes%2Feventpage%2Fus7000bi0e%2Fdyfi%2Fintensity" title="Share using Email">Email</a>
<a _ngcontent-fgi-c13="" class="ng-star-inserted" href="/"> Home </a>
<a _ngcontent-fgi-c13="" class="ng-star-inserted" href="/aboutus/"> About Us </a>
<a _ngcontent-fgi-c13="" class="ng-star-inserted" href="/contactus/"> Contacts </a>
<a _ngcontent-fgi-c13="" class="ng-star-inserted" href="/legal.php"> Legal </a>

I found a YouTube tutorial (https://www.youtube.com/watch?v=MeBU-4Xs2RU) about web scraping with requests_html that I thought might work. I can get the example from the video to work with the beer website, but I haven't been able to apply it to my situation.

Here is the code I've tried:

from requests_html import HTMLSession

url_usgs = 'https://earthquake.usgs.gov/earthquakes/eventpage/us7000biji/dyfi/intensity'

s = HTMLSession()  # create the session before using it
r_usgs = s.get(url_usgs)

r_usgs.html.render(sleep=1)

downloads = r_usgs.html.xpath('//*[@id="mat-expansion-panel-header-0"]', first=True)
print(downloads.absolute_links)

This isn't returning anything, though. I don't know HTML, so it's possible that I'm using the XPath of the wrong element.

If anyone has any ideas on how I can get the URL for the 10 km DYFI data from the Downloads menu (https://earthquake.usgs.gov/archive/product/dyfi/us7000biji/us/1601214674370/dyfi_geo_10km.geojson), or could point me toward some more in-depth material on web scraping, I would appreciate it.

1 Answer

You need to click on the "Downloads" menu in order to expand the content.

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
import time


driver = webdriver.Chrome()
driver.get('https://earthquake.usgs.gov/earthquakes/eventpage/us7000bi0e/dyfi/intensity')

# get a reference to the download menu. This will run before the page has
# finished loading, so we stick it in a while loop and just keep looping
# until we're successful.
while True:
    try:
        download_menu = driver.find_element(By.ID, 'mat-expansion-panel-header-0')
    except NoSuchElementException:
        time.sleep(0.2)
        continue
    else:
        break

# click on the download menu to expand the content
download_menu.click()

# wait the same way for the panel that holds the download links
while True:
    try:
        downloads = driver.find_element(By.ID, 'cdk-accordion-child-0')
    except NoSuchElementException:
        time.sleep(0.2)
        continue
    else:
        break

# keep only the links whose visible text mentions GeoJSON
links = downloads.find_elements(By.CSS_SELECTOR, 'a')
geojson = [link for link in links if 'geojson' in link.text.lower()]

for link in geojson:
    print(link.text, ':', link.get_attribute('href'))


# quit() closes the browser and ends the driver session
driver.quit()

This produces:

GEOJSON 645.0 B : https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/dyfi_zip.geojson
GEOJSON 844.0 B : https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/dyfi_geo_1km.geojson
GEOJSON 1.0 KB : https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/dyfi_geo_10km.geojson

From there, you can inspect the value of each href attribute to find the 10 km data (look for the link that contains 10km).
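That last filtering step can be sketched in plain Python, without a browser. The helper name `pick_10km` is made up for this example, and the list simply repeats the three hrefs printed above. In the Selenium script you would build it from `link.get_attribute('href')` instead.

```python
def pick_10km(hrefs):
    """Return the first href whose file name contains '10km', or None."""
    for href in hrefs:
        # Only look at the last path segment, so an ID or directory name
        # elsewhere in the URL cannot cause a false match.
        filename = href.rsplit("/", 1)[-1]
        if "10km" in filename:
            return href
    return None


# The three GeoJSON links from the output above.
base = "https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/"
hrefs = [
    base + "dyfi_zip.geojson",
    base + "dyfi_geo_1km.geojson",
    base + "dyfi_geo_10km.geojson",
]

url_10km = pick_10km(hrefs)
print(url_10km)
```

The resulting URL can then be handed to curl, as in the question. Returning None when nothing matches lets the calling script decide how to handle an event that has no 10 km product.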

1 Comment

What a legend! Thanks for getting me past that level.
