
I have a URL to a public Google Doc which is published (it says "Published using Google Docs" at the top). It has a URL of the form https://docs.google.com/document/d/e/<Some long random string, I think the ID of the document>/pub

Please note that this is not a spreadsheet (Google sheet), but a doc. This doc contains some explanatory text at the beginning and then a table I need to read. How do I accomplish this using Python and only the URL? I don't have much knowledge of Google APIs, etc. I don't want the text at the beginning, but only the table data in some popular format like a Pandas dataframe, etc. The table data could also contain Unicode characters.

I tried following the steps in the Docs API quickstart guide (https://developers.google.com/docs/api/quickstart/python). After I followed the instructions, the given code (copy-pasted as is) worked, although it involved creating a new Google Cloud project, enabling the API, configuring the OAuth consent screen, and authorizing credentials for a desktop application. However, when I replaced the example document ID (the string inside the quotes in the line below)

DOCUMENT_ID = "195j9eDD3ccgjQRttHhJPymLJUCOUjs-jmwTrekvdjFE")

with the ID of the document I need to access, I got this error:

<HttpError 404 when requesting https://docs.googleapis.com/v1/documents/<MY_GIVEN_DOCUMENT_ID>?alt=json returned "Requested entity was not found.". Details: "Requested entity was not found.">

I just want a simple solution which uses only the published doc's URL, since the doc is already public, and which doesn't require any authentication steps. I also need that if I send the code to someone else, they can run the same code and get the same results without any authentication issues. Please help me with this.

  • Please edit your question and include your code. (Commented Aug 24, 2024 at 16:09)

1 Answer


I was faced with this same exact problem. I'm going to guess you and I were probably doing the same application challenge!

Using requests, I was able to pull down the raw HTML of the published page, and then, using BeautifulSoup, I turned it into a workable, parseable object:

import requests
from bs4 import BeautifulSoup

# URL of the published doc (placeholder; substitute your own /pub link)
url = "https://docs.google.com/document/d/e/<YOUR_DOCUMENT_ID>/pub"

# Make the request
html_response = requests.get(url=url)

# Parse the HTML into a BeautifulSoup object
soup = BeautifulSoup(html_response.text, 'html.parser')

# Collect the first table (assuming the first table is the one you want)
table = soup.find('table')

From there, you can parse the table more precisely to pull out the data you want, for example by iterating over the table's rows and cells and loading the result into a pandas DataFrame (see the sketch below).
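As a rough illustration (not the answerer's exact solution), and assuming the first row of the table is a header row, one way to turn that <table> into a pandas DataFrame might look like this:

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Published doc URL (placeholder; substitute your own /pub link)
url = "https://docs.google.com/document/d/e/<YOUR_DOCUMENT_ID>/pub"

soup = BeautifulSoup(requests.get(url).text, 'html.parser')
table = soup.find('table')

# Extract the text of every cell, row by row; get_text() returns str, so Unicode characters are preserved
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(['td', 'th'])]
    for tr in table.find_all('tr')
]

# Treat the first row as the header and the rest as data
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df.head())

If you have lxml or html5lib installed, pandas.read_html can also pull HTML tables from a page straight into DataFrames, though it gives you less control over which elements you keep.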

I'm refraining from copy-pasting my exact solution because I know others will use this to fill out the same job application challenge, but this gets you everything you need as long as you have a Python foundation.


5 Comments

The same question was also asked here: stackoverflow.com/questions/78832288/… and someone else had a solution you could explore too.
When you were working on solving this problem, did you find that the input data / table was wrong and/or missing some characters? Because unless I've had a stroke, I'm pretty darn sure mine is. : (
@ScottFraley no, actually, when I ran the script with the problem data I got a working answer. But that doesn't mean yours isn't messed up!
The "test" data was definitely borked, but when I ran my script against the "final/actual Url," it worked great! :D
Nice! Glad to hear it.
