I would like to scrape a website for its "raw" JavaScript code. For example, if I were to scrape this website, I would get a string containing:

[Screenshot: a short excerpt of the page's JavaScript]

This is just a small portion of the existing JS in the given link, but I would like to obtain the entire JS in a string or array of strings.

I have tried two approaches to obtain this data: requests and Selenium. Simply loading the HTML of the website doesn't seem to work, as the contents of the script tags don't seem to load.

Using selenium, I hoped this would work:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://www.udemy.com"

driver = webdriver.Chrome()
driver.get(url)

wait = WebDriverWait(driver, 10)
results = wait.until(EC.visibility_of_all_elements_located((By.TAG_NAME, "script")))

print(results)

Then using results I could get a string, but it doesn't work.

Another example of the JS chunks I'd like to get:

[Screenshot: the page source in Chrome's inspector, with large JS blocks marked by a red rectangle]

The red rectangle marks the JS. As you can see, there is a lot of it, and I would like to get it in its "raw" form (without executing it).

My question is: how would I get the "raw" JS as a string, and what is the most efficient way (time-wise) to do this?

1 Answer

You are looking for .get_attribute('innerHTML'). You also do not want to use visibility_of_all_elements_located, since script elements are never rendered and so will never be visible; wait for their presence instead.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://www.udemy.com"
driver = webdriver.Chrome()
driver.get(url)

# Script elements are not rendered, so a visibility wait times out.
# Wait for their presence in the DOM instead.
wait = WebDriverWait(driver, 10)
script_tags = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//script")))

# Collect the inline source of every script element.
innerHTML_of_script_tag = []
for script in script_tags:
    source = script.get_attribute('innerHTML')
    innerHTML_of_script_tag.append(source)
    print(source)
    print("################################################################")

print("---------------------------------------------------------------------")
print("---------------------------------------------------------------------")
print(innerHTML_of_script_tag)
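
Some entries in the list may come back empty. One possible reason, offered as an assumption rather than a diagnosis of this particular page, is that a script element which loads its code through a src attribute carries no inline text, so its innerHTML is empty. The sketch below separates the two cases. The helper name split_scripts is hypothetical, and it only assumes each element exposes a Selenium-style get_attribute method.

```python
def split_scripts(script_elements):
    """Separate inline script source from external script URLs.

    `script_elements` is any iterable of objects with a Selenium-style
    `get_attribute(name)` method, such as the list returned by
    `presence_of_all_elements_located`. (Hypothetical helper.)
    """
    inline_sources = []
    external_urls = []
    for element in script_elements:
        src = element.get_attribute("src")
        if src:
            # The code lives in a separate file; innerHTML is empty here.
            external_urls.append(src)
            continue
        body = element.get_attribute("innerHTML") or ""
        if body.strip():
            inline_sources.append(body)
    return inline_sources, external_urls
```

Called as split_scripts(script_tags), this would return the inline code plus a list of URLs that still need to be downloaded separately, for example with requests.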

8 Comments

Thanks for answering! This code doesn't seem to get the entire JS script found in the website. Though it is able to obtain some script, some larger sections of it (as shown in the image in the question itself) are missing and are empty in innerHTML_of_script_tag as well as in the print statement. It's those parts that I am having trouble retrieving.
@Omer Hen When I go to udemy.com I am not seeing that large block of JavaScript from your screenshot. Are you interacting with the page in any way before you see this block of JavaScript code?
I added another image showing the large chunks of JS script. The given code doesn't really seem to capture all that. Apart from that, I simply load the website and go to the "inspect" option given by Chrome to view the source code. I don't perform any special operations or interact with the page.
@Omer Hen Small update: interestingly, I can see the block you are looking for when I write driver.page_source to a text file, but not in the innerHTML attribute.
That's what I'm trying to find a solution to. Also, when I printed driver.page_source I didn't see the large chunks of JS script, but only a few smaller chunks.
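
Since the comments mention driver.page_source, here is a minimal sketch of pulling script bodies out of an HTML string with the standard library's html.parser. It assumes the HTML text has already been obtained (from driver.page_source or elsewhere); the names ScriptBodyCollector and extract_script_bodies are hypothetical, and whether this recovers the missing chunks on this page is untested.

```python
from html.parser import HTMLParser


class ScriptBodyCollector(HTMLParser):
    """Collect the text found between <script> and </script> tags."""

    def __init__(self):
        super().__init__()
        self.bodies = []
        self._in_script = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self._chunks = []

    def handle_data(self, data):
        # Script text may arrive in several pieces, so buffer it.
        if self._in_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            body = "".join(self._chunks).strip()
            if body:  # skip empty elements such as <script src="...">
                self.bodies.append(body)
            self._in_script = False


def extract_script_bodies(html):
    """Return the non-empty inline script bodies found in an HTML string."""
    parser = ScriptBodyCollector()
    parser.feed(html)
    parser.close()
    return parser.bodies
```

With Selenium this could be used as extract_script_bodies(driver.page_source). The scripts are only read as text and never executed.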