I have a Python program that produces a CSV file with data in two columns. The problem is that each row has a starting webpage in column A and a list of connected websites in column B. I need the data in a different format: one worksheet listing each unique website with a unique ID (i.e. 1, 2, 3, 4, etc.), and a second sheet containing the pairs of connections.
I'm very new to Python and don't really know where to start. Ideally, since I have several of these programs, I would like a separate process to transform the data. If it's easier to do it in the initial program, that's fine too, but I'm not sure how to do that. The program I'm running has the following code:
def parse_item(self, response):
    hxs = HtmlXPathSelector(response)
    i = SitegraphItem()
    i['url'] = response.url
    # i['http_status'] = response.status
    llinks = []
    for anchor in hxs.select('//a[@href]'):
        href = anchor.select('@href').extract()[0]
        if not href.lower().startswith("javascript"):
            llinks.append(urljoin_rfc(response.url, href))
    i['linkedurls'] = llinks
    return i
from scrapy.item import Item, Field

class SitegraphItem(Item):
    url = Field()
    linkedurls = Field()
The output is as follows:
Column A | Column B
[websiteA] | [b'website1, b'website2, b'website3]
The output I need is like this:
Column A | Column B
[WebsiteA] | [website1]
[WebsiteA] | [website2]
[WebsiteA] | [website3]
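
To make the target format concrete, here is a minimal sketch of the separate transformation step described above. It assumes the existing CSV has already been read into pairs of (starting URL, list of linked URLs); how column B has to be parsed depends on how the file is written, which isn't shown here. The names `build_graph_tables` and `write_tables`, the column headers, and the output paths are hypothetical, chosen only for illustration.

```python
import csv


def build_graph_tables(rows):
    """Turn (source_url, [linked_url, ...]) rows into a node table and an edge list.

    Returns a dict mapping each unique URL to an integer ID (1, 2, 3, ...)
    and a list of (source_id, target_id) pairs.
    """
    ids = {}    # url -> ID, assigned in order of first appearance
    edges = []  # (source_id, target_id) pairs

    def get_id(url):
        # Assign the next free ID the first time a URL is seen.
        if url not in ids:
            ids[url] = len(ids) + 1
        return ids[url]

    for source, targets in rows:
        source_id = get_id(source)
        for target in targets:
            edges.append((source_id, get_id(target)))
    return ids, edges


def write_tables(ids, edges, nodes_path, edges_path):
    """Write the two tables as separate CSV files (hypothetical layout)."""
    with open(nodes_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "url"])
        for url, node_id in ids.items():
            writer.writerow([node_id, url])
    with open(edges_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["source_id", "target_id"])
        writer.writerows(edges)
```

Calling `build_graph_tables([("WebsiteA", ["website1", "website2"])])` gives `WebsiteA` the ID 1 and the two linked sites IDs 2 and 3, with the pairs (1, 2) and (1, 3). If the second sheet should hold URLs rather than IDs, the pairs can be appended as `(source, target)` instead.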
i['linkedurls'] as the second column, resulting in the brackets and b'...' wrappers you see. It's easiest to fix it in this program, but you need to show us the code that calls parse_item(), and writes the output to a file.
parse_item() nor writes anything to a file. (Also, I should have clarified that you should edit your question to include the relevant code; don't squeeze it in comments.)
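
Since the code that writes the file isn't shown, the in-program fix can only be sketched under assumptions. The idea is to emit one (url, link) row per linked URL instead of writing the whole list into one cell, decoding any byte strings on the way so the b'...' wrappers disappear. The helper name `expand_links` and the UTF-8 default are hypothetical; the rows it yields would still need to be passed to whatever code writes the CSV.

```python
def expand_links(url, linked_urls, encoding="utf-8"):
    """Yield one (url, link) pair per linked URL.

    Byte strings are decoded to text (UTF-8 is an assumption), so the
    written rows contain plain URLs rather than b'...' representations.
    """
    for link in linked_urls:
        if isinstance(link, bytes):
            link = link.decode(encoding)
        yield (url, link)
```

For example, `list(expand_links(item['url'], item['linkedurls']))` would produce the one-pair-per-row layout from the question for a single scraped item.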