I have an HTML page with multiple script tags. The problem is that I want to extract data from one specific tag:

<head>
...
</head>
<body>
...
<script type="text/javascript">

    $j(document).ready(function() {

        if (!($j.cookie("ios"))) {
            new $c.free.widgets.FreeAdvDialog().open();
            $j.cookie("ios", "seen", { path: '/', expires: 10000});
        };

        ajax_keys = ["d24349f205e3deb7f1015f42d3a14da7205b62e4", "0ae78c4797d47745ebd44e2754367da10c6f56a4", "567b2bfb6fd1aee784115da54e5e116a280ee225", "fc5cd251be46ff101c471553d52c07bf08c9aa65"];
        var is_dm = false;

        /* async chart loader */
        var chart = new $c.free.widgets.Chart({
            target: $j('#graph'),
            width: 990,
            height: 275,
            site: "911.com",
            source_panel: 'us'
        });

        var chart_view = new $c.free.widgets.ChartView({
            chart: chart,
            csv_button: 'csv-export',
            save_button: 'graph-image',
            embed_button: 'embed-graph',
            key: ajax_keys[1]
        });
        chart_view.render();

        /* zoom info initialization */
        var zoom_info = new $c.free.widgets.ZoomInfo({
            site: "911.com",
            el: '#zoominfo',
            key: ajax_keys[3]
        });
        zoom_info.load();


        /* compete numbers initialization */
        var compete_numbers = new $c.free.widgets.CompeteNumbers({
            site: "911.com",
            key: ajax_keys[0],
            el: '#compete_numbers'
        });
        compete_numbers.load();

        /* DM Marketing widget init */
        new $c.free.widgets.DMSignupMessage({
            is_dm: is_dm,
            compete_numbers: compete_numbers
        });

        /* personalization initialization */


            var logged_in_as = null;


        var d = {
          site_name: "911.com",
          logged_in_as: logged_in_as,
          current_source_panel: {"display_abbreviation": "us", "panel_name": "us", "image_url": "http://media.compete.com/site_media/images/icons/flag_us.gif", "id": 1, "display_name": "United States"}
        };

        var auth_model = new $c.free.widgets.FreeLoginModel(d);
        var links_opts = { model: auth_model };
        var links_view = new $c.free.widgets.FreeAccountLinksView(links_opts);

        var sites_view = new $c.free.widgets.FollowSiteButtonView(links_opts);
        var manage_view = new $c.free.widgets.ManageSitesListButtonView(links_opts);

        var sites = new $c.free.widgets.SimilarSitesCollection([], {
            site: "911.com",
            source_panel: 'us',
            key: ajax_keys[2],
            auth: auth_model
        });
        var graph = new $c.free.widgets.BarGraph({
            el: $j('#similar-sites'),
            collection: sites
        });

        // tell KISSMetrics where we are
        // also identify user so KM console can refer to them by email
        if(logged_in_as != null) {
            _kmq.push(['identify', logged_in_as]);
        }
        _kmq.push(['record', 'Viewed Free Site Analytics Report (M)']);
    });

...

How can I get the ajax_keys values (e.g. "d24349f205e3deb7f1015f42d3a14da7205b62e4") from that specific tag of the page?

P.S. I tried to use regular expressions in a Python script, but I couldn't retrieve the necessary element from the tag.

Thanks for the help.

1 Answer

If you use a library like BeautifulSoup you can fetch the specific script tag, and then use a regex on the contents of the tag instead of the entire document.

That said, it looks like a regex alone will work, assuming there is only one ajax_keys assignment in the page:

import json
import re

ajaxre = re.compile(r"^\s+ajax_keys = ([^;]+)", re.MULTILINE)
# use search(), not match(): match() only tries the start of the string, even with re.MULTILINE
ajax_string = ajaxre.search(source).group(1)

# to get it as a python list
ajax_keys = json.loads(ajax_string)

Edit: thanks @Karl Knechtel for json.loads
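
Since the regex approach relies on there being exactly one ajax_keys assignment, it may be worth checking that assumption rather than trusting it. The following is a minimal sketch, not part of the original answer: the function name extract_ajax_keys and the trimmed-down sample string are made up for illustration, and the pattern is the same one used above.

```python
import json
import re

# Same pattern as in the answer: an indented line of the form "ajax_keys = [...]".
AJAX_KEYS_RE = re.compile(r"^\s+ajax_keys = ([^;]+)", re.MULTILINE)


def extract_ajax_keys(source):
    """Return the ajax_keys list, or raise if the assignment is not unique."""
    matches = AJAX_KEYS_RE.findall(source)
    if len(matches) != 1:
        raise ValueError("expected exactly one ajax_keys line, found %d" % len(matches))
    # The captured text is a JS array of double-quoted strings, which is also valid JSON.
    return json.loads(matches[0])


# Hypothetical, trimmed-down stand-in for the real page source.
sample = '''<script type="text/javascript">
    $j(document).ready(function() {
        ajax_keys = ["d24349f205e3deb7f1015f42d3a14da7205b62e4", "0ae78c4797d47745ebd44e2754367da10c6f56a4"];
        var is_dm = false;
    });
</script>'''

keys = extract_ajax_keys(sample)
print(keys[0])
```

Raising on zero or multiple matches makes a change in the page layout show up as an error instead of silently returning the wrong keys.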


10 Comments

"careful doing this in general" - no excuses; you want ast.literal_eval. Or maybe even json.loads.
eval is evil. Good call on json.loads, updating answer
Thanks! Another problem: I have multiple <script type="text/javascript"> tags without an id. I tried something like this: from bs4 import BeautifulSoup import re import urllib2 data = urllib2.urlopen('url').read() soup = BeautifulSoup(data) to_extract = soup.findAll('script') for item in to_extract: item.extract() - that prints out the data from all <script> tags of the page. How can I find a specific tag if it doesn't have an id?
If you don't have an id but you know its position, you can use indexing. E.g., the 3rd script tag is soup.findAll("script")[2], the last script tag is soup.findAll("script")[-1], and so on.
source is just the HTML as one big string (not split into lines) or the contents of the script tag. The regex matches any line that starts with whitespace followed by ajax_keys = . As long as only one line of the HTML looks like that, you can just run it on the full page. Otherwise, as long as only one line in the script tag looks like that, it'll work on the script text.
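
When the position of the script tag is not known or may change, one option is to pick the tag by what it contains instead of by index. The sketch below is an illustration only and swaps in Python 3's standard-library html.parser in place of BeautifulSoup so that it runs without extra dependencies; ScriptCollector, find_script_containing, and the sample page are hypothetical names and data. The same "keep the tag whose text mentions ajax_keys" filter could be applied to the results of soup.findAll("script").

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element, in document order."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append to the current entry.
        if self._in_script:
            self.scripts[-1] += data


def find_script_containing(html, needle):
    """Return the body of the first <script> whose text contains needle, or None."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    for body in collector.scripts:
        if needle in body:
            return body
    return None


# Hypothetical page with several id-less script tags.
page = '''<html><body>
<script type="text/javascript">var unrelated = 1;</script>
<script type="text/javascript">
    ajax_keys = ["d24349f205e3deb7f1015f42d3a14da7205b62e4"];
</script>
<script type="text/javascript">var also_unrelated = 2;</script>
</body></html>'''

script_text = find_script_containing(page, "ajax_keys")
print(script_text)
```

The returned script text can then be passed to the regex as source, which narrows the search to the one tag that matters.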