0

I am trying to use multithreading to insert raw_html into postgres but I can't get the data to insert and keep getting

LINE 1: INSERT INTO raw_html VALUES b'<!DOCTYPE html><html lang="en"...
                                    ^

Error: syntax error at or near "b'<!DOCTYPE html

I can't figure out what the b is doing in front of each of these strings (nor can I find anything online explaining what the b is) and I am assuming that's why the insert is failing. I also am trying to get the raw_html string to convert to a tuple because I think it needs to be a tuple for sql to accept it. I can't figure out what to try next to fix these two issues. Any direction appreciated:

UPDATE:

def execute_values(LST):
    DB_HOST = "localhost"
    DB_NAME = "yellow_pages_scraper"
    DB_USER = "justinbenfit"
    DB_PASS = "postgres"
    DB_URL = "mydburl"

    conn = psycopg2.connect(dbname=DB_NAME, user=DB_USER, password=DB_PASS, host=DB_HOST)
    cur = conn.cursor()
    try:
        LST = tuple(LST)
        for L in LST:
            cur.execute("INSERT INTO raw_html VALUES (?)", (L.decode('utf-8'),))
        conn.commit()
    except (Exception, psycopg2.DatabaseError) as error:
        print("Error: %s" % error)
        conn.rollback()
        return 1
    print(f"the dataframe is inserted")

def main(LINKS):
    t0 = time.time()
    download_raw(LINKS)
    execute_values(LST)
    t1 = time.time()
    print(f"{t1-t0} seconds to download {len(LST)} raw_html.")
0

1 Answer 1

2

There are several problems. First you are sending a bytes string instead of a Unicode string. Second, you are not doing any quoting. Third, your INSERT syntax is wrong; the values need to be in parens. Fourth, you should be allowing the database to do the quoting for you, to prevent SQL arracks.

This should work:

    cur.execute("INSERT INTO raw_html VALUES (?)", (L.decode('utf-8'),)

However, your whole concept is flawed. You have each thread opening a new connection into the database. That's horrible. The database insertion is not your bottleneck here. You should be able to do this serially. Just delay the commit until the whole set is done.

Sign up to request clarification or add additional context in comments.

2 Comments

Thank you @Tim Roberts! that makes sense to just do away with the concurrency for the sql insertion and just save the conn.commit until the end of the set. I have updated the code and am a little closer I think. Am I on the right track? Still getting Error: syntax error at or near ")" LINE 1: INSERT INTO raw_html VALUES (?) I have tried adding a semi-colon after the last ) but that doesn't seem to work.
never mind! got it to work with this: cur.execute("INSERT INTO raw_html (raw_html) VALUES (%s);", (L.decode('utf-8'),))

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.