
Good morning,

I'm trying to scrape data from this site. I want to get the Date, Creators, Relevance, Description, Subject, Audience and Access of every search result and put them in my Postgres database. The problem is that the Description is sometimes missing, so a result sometimes has 6 fields and sometimes 7.

So my question is: how can I insert an empty value for Description when it is not there? Any tips on how to do this are welcome!

My script so far is below. It fills the database if there are always 7 fields per result (I tested with three, keep that in mind).

import urllib.parse
import urllib.request
import re
import sys
import psycopg2 as dbapi

url = 'https://easy.dans.knaw.nl/ui/'
values = {'wicket:bookmarkablePage':':nl.knaw.dans.easy.web.search.pages.PublicSearchResultPage',
          'q' : 'opgraving'}
data = urllib.parse.urlencode(values)
data = data.encode('utf-8')
headers = {}
headers['User-Agent'] =  'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'
req = urllib.request.Request(url,data, headers =headers)
resp = urllib.request.urlopen(req)
respData = resp.read()


saveRecord= open('C:/Users/berend/Desktop/record.txt','w')
record =  re.findall(r'<dd>(.*?)</dd>',str(respData))
for item in record:
    saveRecord.write("%s\n" % item)
saveRecord.close()

fin = open("C:/Users/berend/Desktop/record.txt",'r')
fit = open("C:/Users/berend/Desktop/record_schoon.txt",'w')
delete_list = ['</em>', '[',']','<em>','</span>', '<span>', '\\n']
for line in fin:
    for word in delete_list:
        line = line.replace(word, "")
    fit.write(line)
fin.close()
fit.close()

open_record= open('C:/Users/berend/Desktop/record_schoon.txt','r')
content = list(open_record)
print(len(content))
open_record.close()

n = 3
for i in range(0, len(content), 3):
   q= content[i:i+n]
   con = dbapi.connect(database='import', user='postgres', password='xxx')
   cur = con.cursor()
   cur.execute("INSERT into import VALUES (%s,%s,%s)",q)
   con.commit()

The first 3 results:

2000
Groenewoudt, B.J.; Deeben, J.H.C.; Velde, H.M. van der
100% relevant
Na verkennend onderzoek in 1996 en een grootschalige opgraving met uitgebreid bodemkundig
opgraving
Archaeology
Open (registered users)
2001-09
Peters, F.J.C.; Peeters, J.H.M.
100% relevant
opgraving
Archaeology
Open (registered users)
2008
Jacobs, E.; Burnier, C.Y.
100% relevant
OPGRAVING
Archaeology
Open (registered users)

1 Answer

I would use the pandas and sqlalchemy libraries for their efficiency in this situation. I'm suggesting a solution with additional packages because you didn't rule them out.

Instead of this:

n = 3
for i in range(0, len(content), 3):
   q= content[i:i+n]
   con = dbapi.connect(database='import', user='postgres', password='xxx')
   cur = con.cursor()
   cur.execute("INSERT into import VALUES (%s,%s,%s)",q)
   con.commit()

Use something like this:

import pandas as pd
import sqlalchemy as sqla  # alias used by sqla.create_engine below

# create a connection engine using sqlalchemy
engine = sqla.create_engine('postgresql+psycopg2://postgres:xxx@localhost/import', echo=False)

# read the results file into a pandas DataFrame
df = pd.read_csv('C:/Users/berend/Desktop/record_schoon.txt', delimiter='\t') # or whatever your delimiter is
dfFill = df.fillna("") # "" will be blank space when any record is missing data or 'nan'
dfFill.to_sql("tablename", engine, if_exists="append") #change tablename to the name of your table in import
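
Note that fillna only helps once each result sits on one row. If the cleaned file has one field per line, as in the sample output in the question, the rows have to be rebuilt first. Below is a minimal sketch of one way to do that. It assumes every result has a relevance line such as "100% relevant" as its third field, which holds for the three sample results but is not verified for the whole site. The helper name group_records is hypothetical.

```python
import re

# Assumption: the third field of every result looks like "100% relevant".
RELEVANCE = re.compile(r"^\d+% relevant$")

def group_records(lines):
    """Rebuild 7-field records from a flat list, padding a missing description with ""."""
    lines = [line.strip() for line in lines if line.strip()]
    # A record starts two lines before its relevance line (date, creators, relevance).
    starts = [i - 2 for i, line in enumerate(lines) if i >= 2 and RELEVANCE.match(line)]
    records = []
    for start, end in zip(starts, starts[1:] + [len(lines)]):
        fields = lines[start:end]
        if len(fields) == 6:
            # Description (index 3) is missing: insert an empty value in its place.
            fields.insert(3, "")
        if len(fields) == 7:
            records.append(tuple(fields))
        # Records of any other length are skipped in this sketch.
    return records
```

Each tuple then has the same seven fields in the same order, so it could be passed to cur.execute with seven %s placeholders, or collected into a DataFrame.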

HTH


4 Comments

Thank you for your response. I can't test it right now because the site is down for maintenance, but I have 2 questions: 1. With engine = sqla.create_engine, what is sqla? I didn't name anything sqla. 2. How does the program know that a description is sometimes missing? That's not clear to me. Can you explain that?
I tried it today and I still have some questions: the table import already exists, but it is trying to create a new one. How can I fix this? And like you said, this is better, but can you give me a hand with fixing my problem? Thanks!
Thanks for your quick response! Now I get a strange error. It says that my table Import does not exist, so it cannot import my records into my table.
What is the name of your table? Here is how you can list the table names in your database. It's also easier for people to help if you post the error messages.
