Alternative solution to the OP's question:
Synopsis of the solution:
- Send the user's input via an HTML form using 'GET'.
- The URL-encoded 'GET' values are delivered to a shell script in cgi-bin, which processes them.
- The shell script parses the values and passes them as arguments to the Python script when it calls it.
JavaScript and PHP work nicely with this setup, and from there you can bring in MySQL, etc. Using 'GET', we send the user's input from the client side to the server side, where a shell script processes the data.
Example Index.php:
<!DOCTYPE html>
<html>
<head>
<title>Google Email Search</title>
</head>
<body>
<h1>Script Options</h1>
<form action="/cgi-bin/call.sh" method="get">
<table border="1">
<tr>
<td>Keyword:</td>
<td><input type="text" name="query" value="Query"></td>
</tr>
<tr>
<td># of Pages:</td>
<td><input type="text" name="pages" value="1"></td>
</tr>
<tr>
<td>Output File Name:</td>
<td><input type="text" name="output_name" value="results"></td>
</tr>
<tr>
<td>E-mail Address:</td>
<td><input type="text" name="email_address" value="[email protected]"></td>
</tr>
<tr>
<td><input type="submit" value="Submit"></td>
</tr>
</table>
</form>
</body>
</html>
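Submitting the form with its default values produces a request along these lines; the web server hands everything after the '?' to the CGI script in the QUERY_STRING environment variable, which the shell script below reads. Spaces in the input arrive URL-encoded as '%20':
GET /cgi-bin/call.sh?query=Query&pages=1&output_name=results&email_address=...
QUERY_STRING='query=Query&pages=1&output_name=results&email_address=...'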
Example shell script (call.sh) that calls the Python script; it must live in your cgi-bin or another directory the web server is allowed to execute CGI scripts from:
#!/bin/bash
# CGI script: parses the 'GET' values sent by the index HTML form and passes them as options to the Python script.
echo "Content-type: text/html"
echo ""
echo '<html>'
echo '<head>'
echo '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
echo '<title></title>'
echo '</head>'
echo '<body>'
query=$(echo "$QUERY_STRING" | sed -n 's/^.*query=\([^&]*\).*$/\1/p' | sed 's/%20/ /g')
pages=$(echo "$QUERY_STRING" | sed -n 's/^.*pages=\([^&]*\).*$/\1/p' | sed 's/%20/ /g')
output_name=$(echo "$QUERY_STRING" | sed -n 's/^.*output_name=\([^&]*\).*$/\1/p' | sed 's/%20/ /g')
email_address=$(echo "$QUERY_STRING" | sed -n 's/^.*email_address=\([^&]*\).*$/\1/p' | sed 's/%20/ /g')
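# Note: the sed pipelines above only decode '%20' back to spaces; other
# percent-escapes (e.g. '%40' for '@' in the e-mail address) pass through
# unchanged and would need extra substitutions for a fully decoded value.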
echo '<h1>'
echo 'Running...'
echo '</h1>'
DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
cd "$DIR"
# Quote the arguments so multi-word queries survive word splitting;
# email_address is parsed above but main.py takes no e-mail option, so it is not passed along.
python main.py -query "$query" -pages "$pages" -o "$output_name"
echo ''
echo '</body>'
echo '</html>'
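For the web server to execute call.sh, it must be executable and live in a CGI-enabled directory, with main.py alongside it (the script cd's into its own directory before invoking Python). A minimal sketch, assuming a typical Apache cgi-bin path (adjust to your server's layout):
chmod 755 /usr/lib/cgi-bin/call.sh
# main.py only needs to be readable in the same directory, since it is invoked as 'python main.py'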
Example Python script (main.py), called from the shell script:
#!/usr/bin/env python
# Python 2 script: scrapes Google results for e-mail addresses and writes them to a CSV file.
from xgoogle.search import GoogleSearch
import urllib2, re, csv, os
import argparse

class ScrapeProcess(object):
    emails = []  # for duplication prevention

    def __init__(self, filename):
        self.filename = filename
        self.csvfile = open(filename, 'wb+')
        self.csvwriter = csv.writer(self.csvfile)

    def go(self, query, pages):
        search = GoogleSearch(query)
        search.results_per_page = 10
        for i in range(pages):
            search.page = i
            results = search.get_results()
            for page in results:
                self.scrape(page)

    def scrape(self, page):
        try:
            request = urllib2.Request(page.url.encode("utf8"))
            html = urllib2.urlopen(request).read()
        except Exception:
            return
        emails = re.findall(r'([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*)', html)
        for email in emails:
            if email not in self.emails:  # skip duplicates
                self.csvwriter.writerow([page.title.encode('utf8'), page.url.encode("utf8"), email])
                self.emails.append(email)

parser = argparse.ArgumentParser(description='Scrape Google results for emails')
parser.add_argument('-query', type=str, default='test', help='a query to use for the Google search')
parser.add_argument('-pages', type=int, default=10, help='number of Google results pages to scrape')
parser.add_argument('-o', type=str, default='emails.csv', help='output filename')

args = parser.parse_args()
args.o = args.o + '.csv' if '.csv' not in args.o else args.o  # make sure the filename has a .csv extension

s = ScrapeProcess(args.o)
s.go(args.query, args.pages)
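To test the scraper outside the web server, the script can be run directly with the same arguments the shell script passes; the values below are illustrative:
python main.py -query "test query" -pages 2 -o results
# writes the scraped addresses to results.csv in the current directory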
A full working example is located here:
https://github.com/mhenes/Google-EmailScraper
Disclaimer: this is my GitHub account, using a forked project to show this functionality.