How to transform Excel data using Python

Question

I have a Python program that produces a CSV file with data in two columns. Each row has a starting webpage in column A and a list of connected websites in column B. I need the data in a different format: one worksheet listing each unique website with a unique ID (1, 2, 3, 4, etc.), and a second worksheet containing the pairs of connections.

I'm very new to Python and don't know where to start. Since I have several of these programs, I would ideally like a separate process to transform the data, but if it's easier to do it in the initial program, I'm not sure how to do that. The program I'm running has the following code.

def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        i = SitegraphItem()
        i['url'] = response.url
        # i['http_status'] = response.status
        llinks=[]
        for anchor in hxs.select('//a[@href]'):
            href=anchor.select('@href').extract()[0]
            if not href.lower().startswith("javascript"):
                llinks.append(urljoin_rfc(response.url,href))
        i['linkedurls'] = llinks
        return i

from scrapy.item import Item, Field

class SitegraphItem(Item):
     url=Field()
     linkedurls=Field()

The output is as follows:
Column A   | Column B
[websiteA] | [b'website1, b'website2, b'website3]

The output I need is like this:
Column A   | Column B
[WebsiteA] | [website1]
[WebsiteA] | [website2]
[WebsiteA] | [website3]

It looks like the part of your program that generates output is just dumping the list i['linkedurls'] as the second column, resulting in the brackets and b'...' wrappers you see. It's easiest to fix it in this program, but you need to show us the code that calls parse_item() and writes the output to a file.
– alexis, Commented Aug 6, 2019 at 22:41
Hi! Thanks for your response. I think this is the segment you mean? from scrapy.item import Item, Field class SitegraphItem(Item): url=Field() linkedurls=Field()
– MAb2021, Commented Aug 6, 2019 at 22:48
This neither calls parse_item() nor writes anything to a file. (Also, I should have clarified that you should edit your question to include the relevant code; don't squeeze it into comments.)
– alexis, Commented Aug 6, 2019 at 23:04

Georgiy · Accepted Answer · 2019-08-06 23:09:25Z

1

Instead of returning one item with one URL and a list of three URLs, you can return three items, each with the page URL and a single link, which matches the layout described in your question:

def parse_item(self, response):
    hxs = HtmlXPathSelector(response)
    for anchor in hxs.select('//a[@href]'):
        href = anchor.select('@href').extract()[0]
        if not href.lower().startswith("javascript"):
            # Yield a fresh dict per link so earlier items are not overwritten.
            yield {
                'url': response.url,
                # 'http_status': response.status,
                'linkedurl': urljoin_rfc(response.url, href),
            }
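
The one-item-per-link pattern can be tried without Scrapy. The following is a minimal sketch in plain Python; `expand_links` is a hypothetical helper name, not part of Scrapy, and the sample URLs are placeholders. It builds a new dict for every link, so collecting the results in a list keeps each link instead of repeating the last one.

```python
def expand_links(url, links):
    """Yield one dict per (page, link) pair instead of one dict per page."""
    for link in links:
        # Skip javascript: pseudo-links, as in the spider above.
        if link.lower().startswith("javascript"):
            continue
        # A new dict each time: reusing one dict would make every
        # collected item point at the same (last) link.
        yield {"url": url, "linkedurl": link}


rows = list(expand_links("WebsiteA", ["website1", "javascript:void(0)", "website2"]))
for row in rows:
    print(row["url"], row["linkedurl"])
```

Scrapy accepts plain dicts as items, so the same loop body can sit inside `parse_item`.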



1 Comment

MAb2021 Over a year ago

With this method, is it also possible to assign unique IDs to each website as a separate column for both A and B? Regardless of whether it's in column A or B though it would have to have the appropriate ID. Or is it better to run the output file through another program to assign unique IDs?
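
One possible way to assign the IDs in a separate step is sketched below, assuming the output has already been reduced to (source, target) pairs. `assign_ids` is a hypothetical helper and the sample pairs are made up for illustration. Each website gets the next integer the first time it is seen, whether it appears in column A or column B, so the same site always maps to the same ID.

```python
def assign_ids(pairs):
    """Return (ids, edges) for an iterable of (source, target) pairs.

    ids maps each website to an integer starting at 1;
    edges holds the same pairs expressed as (source_id, target_id).
    """
    ids = {}
    edges = []
    for source, target in pairs:
        for site in (source, target):
            if site not in ids:
                ids[site] = len(ids) + 1  # next unused ID
        edges.append((ids[source], ids[target]))
    return ids, edges


pairs = [
    ("WebsiteA", "website1"),
    ("WebsiteA", "website2"),
    ("WebsiteA", "website3"),
    ("website1", "WebsiteA"),
]
ids, edges = assign_ids(pairs)
print(ids)    # the first sheet: website -> ID
print(edges)  # the second sheet: pairs of IDs
```

The `ids` dict corresponds to the first worksheet and `edges` to the second; either can be written out with `csv.writer` or loaded into a DataFrame.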

brentertainer · Accepted Answer · 2019-08-06 22:58:54Z

0

If you don't go the route suggested in the comments, here's an example of how you can reshape the output afterwards with pandas.

Code:

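import pandas as pd

# One row per starting site, with the linked sites held as a list in column B.
df = pd.DataFrame(data=[['WebsiteA', ['Website1', 'Website2', 'Website3']]], columns=['A', 'B'])
print(df)
# Expand each list into its own columns, then stack them into one value per row;
# dropping the inner index level leaves each value labelled with its source row.
tmp = df.apply(lambda x: pd.Series(x['B']),axis=1).stack().reset_index(level=1, drop=True)
tmp.name = 'B'
# Join the stacked values back onto column A and renumber the rows.
df = df.drop('B', axis=1).join(tmp).reset_index(drop=True)
print(df)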

Output:

          A                               B
0  WebsiteA  [Website1, Website2, Website3]
          A         B
0  WebsiteA  Website1
1  WebsiteA  Website2
2  WebsiteA  Website3
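
If the data is already sitting in a CSV file, the same reshaping can be sketched with only the standard library. This assumes column B holds the text form of a Python list, such as `[b'website1', b'website2']`, and that the file has no header row; both are guesses about the real file. `explode_rows` is a hypothetical helper, and an in-memory buffer stands in for the actual input and output files.

```python
import ast
import csv
import io


def explode_rows(reader):
    """Yield (source, target) pairs from rows of (source, stringified list)."""
    for source, raw_links in reader:
        for link in ast.literal_eval(raw_links):
            # Links scraped as bytes (b'...') are decoded to plain strings.
            if isinstance(link, bytes):
                link = link.decode("utf-8")
            yield source, link


# Sample standing in for the real CSV file (assumed layout).
sample = io.StringIO()
csv.writer(sample).writerow(["WebsiteA", str([b"website1", b"website2", b"website3"])])
sample.seek(0)

# Write one (source, target) pair per output row.
out = io.StringIO()
csv.writer(out).writerows(explode_rows(csv.reader(sample)))
result = out.getvalue()
print(result)
```

To run it on real files, replace the two `io.StringIO` buffers with files opened using `open(path, newline="")`.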

