I have a Python program that produces a CSV file with data in two columns. The problem is that each row has a starting webpage in column A and a list of connected websites in column B. I need the data in a different format: one worksheet listing each unique website with a unique ID (1, 2, 3, 4, etc.), and a second sheet containing the pairs of connections.

I'm very new to Python and don't really know where to start. Ideally, since I have several of these programs, I would like a separate process to transform the data, but if it's easier to do it in the initial program, I'm not sure how to do that. The program I'm running has the following code:

def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        i = SitegraphItem()
        i['url'] = response.url
        # i['http_status'] = response.status
        llinks=[]
        for anchor in hxs.select('//a[@href]'):
            href=anchor.select('@href').extract()[0]
            if not href.lower().startswith("javascript"):
                llinks.append(urljoin_rfc(response.url,href))
        i['linkedurls'] = llinks
        return i

from scrapy.item import Item, Field

class SitegraphItem(Item):
     url=Field()
     linkedurls=Field()

The output is as follows:
Column A   | Column B
[websiteA] | [b'website1', b'website2', b'website3']

The output I need is like this:
Column A   | Column B
[WebsiteA] | [website1]
[WebsiteA] | [website2]
[WebsiteA] | [website3]

3
  • It looks like the part of your program that generates output is just dumping the list i['linkedurls'] as the second column, resulting in the brackets and b'...' wrappers you see. It's easiest to fix it in this program, but you need to show us the code that calls parse_item(), and writes the output to a file. Commented Aug 6, 2019 at 22:41
  • Hi! Thanks for your response. I think this is the segment you mean? from scrapy.item import Item, Field class SitegraphItem(Item): url=Field() linkedurls=Field() Commented Aug 6, 2019 at 22:48
  • This neither calls parse_item() nor writes anything to a file. (Also, I should have clarified that you should edit your question to include the relevant code; don't squeeze it in comments.) Commented Aug 6, 2019 at 23:04

2 Answers

1

Instead of returning one item with one URL and a list of three URLs, you can return three items, each holding the page URL and a single link. That matches the data pattern described in your question:

def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        for anchor in hxs.select('//a[@href]'):
            href = anchor.select('@href').extract()[0]
            # Skip javascript: pseudo-links
            if not href.lower().startswith("javascript"):
                # Yield a fresh dict per link so that each output row is
                # independent (re-yielding one mutated dict would leave every
                # collected item pointing at the last link)
                yield {
                    'url': response.url,
                    'linkedurl': urljoin_rfc(response.url, href),
                }
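
If you would rather leave the spider alone and post-process the CSV in a separate script, something along these lines might work. It is only a sketch: it assumes column B arrives as one comma-separated string per row (which may not match your file exactly, and would break for URLs that themselves contain commas), and `split_links`, `assign_ids`, and the sample data are made-up names for illustration. It also numbers each website in order of first appearance, which covers the "unique ID" part of the question.

```python
import csv
import io


def split_links(rows):
    """Turn (url, "link1,link2,...") rows into one (url, link) pair per link."""
    pairs = []
    for url, linked in rows:
        for link in linked.split(","):
            link = link.strip()
            if link:  # ignore empty fragments such as a trailing comma
                pairs.append((url, link))
    return pairs


def assign_ids(pairs):
    """Give every distinct website an ID (1, 2, 3, ...) in order of first appearance."""
    ids = {}
    for source, target in pairs:
        for site in (source, target):
            if site not in ids:
                ids[site] = len(ids) + 1
    return ids


# Hypothetical sample standing in for the spider's CSV output.
sample = io.StringIO('websiteA,"website1,website2,website3"\nwebsite1,"websiteA"\n')

pairs = split_links(csv.reader(sample))
ids = assign_ids(pairs)
# The same connections expressed as (source ID, target ID).
id_pairs = [(ids[source], ids[target]) for source, target in pairs]

print(ids)
print(id_pairs)
```

In a real script you would replace `sample` with `open("yourfile.csv", newline="")` and write `ids.items()` and `id_pairs` to two separate files with `csv.writer`, one per worksheet.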

1 Comment

With this method, is it also possible to assign unique IDs to each website as a separate column for both A and B? Regardless of whether it's in column A or B though it would have to have the appropriate ID. Or is it better to run the output file through another program to assign unique IDs?
0

If you don't go the route suggested in the comments, here's an example of how you can reshape the data afterwards with pandas, turning the list in column B into one row per entry.

Code:

import pandas as pd

df = pd.DataFrame(data=[['WebsiteA', ['Website1', 'Website2', 'Website3']]], columns=['A', 'B'])
print(df)
tmp = df.apply(lambda x: pd.Series(x['B']),axis=1).stack().reset_index(level=1, drop=True)
tmp.name = 'B'
df = df.drop('B', axis=1).join(tmp).reset_index(drop=True)
print(df)

Output:

          A                               B
0  WebsiteA  [Website1, Website2, Website3]
          A         B
0  WebsiteA  Website1
1  WebsiteA  Website2
2  WebsiteA  Website3
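
On pandas 0.25 or newer, `DataFrame.explode` should give the same reshaping in one step. This is a minimal sketch using the same toy frame as above; the variable name `pairs` is just for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    data=[['WebsiteA', ['Website1', 'Website2', 'Website3']]],
    columns=['A', 'B'],
)

# explode() repeats the row once per element of the list in column B;
# reset_index() then renumbers the rows 0, 1, 2, ...
pairs = df.explode('B').reset_index(drop=True)
print(pairs)
```

Note that this only works when column B holds real Python lists. If the CSV was read back from disk, B will be a string and needs to be split into a list first.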
