Extracting URLs using regex in .NET

Question

I've taken inspiration from the example shown at csharp-online and intend to retrieve all the URLs from this Alexa page:

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Net;
using System.Text.RegularExpressions;
namespace ExtractingUrls
{
    class Program
    {
        static void Main(string[] args)
        {
            WebClient client = new WebClient();
            const string url = "http://www.alexa.com/topsites/category/Top/Society/History/By_Topic/Science/Engineering_and_Technology";
            string source = client.DownloadString(url);
            //Console.WriteLine(Getvals(source));
            string matchPattern =
                    @"<a.rel=""nofollow"".style=""font-size:0.8em;"".href=[""'](?<url>[^""^']+[.]*)[""'].class=""offsite"".*>(?<name>[^<]+[.]*)</a>";
            foreach (Hashtable grouping in ExtractGroupings(source, matchPattern, true))
            {
                foreach (DictionaryEntry DE in grouping)
                {
                    Console.WriteLine("Value = " + DE.Value);
                    Console.WriteLine("");
                }
            }
            // End.
            Console.ReadLine();
        }
        public static ArrayList ExtractGroupings(string source, string matchPattern, bool wantInitialMatch)
        {
            ArrayList keyedMatches = new ArrayList();
            int startingElement = 1;
            if (wantInitialMatch)
            {
                startingElement = 0;
            }
            Regex RE = new Regex(matchPattern, RegexOptions.Multiline);
            MatchCollection theMatches = RE.Matches(source);
            foreach (Match m in theMatches)
            {
                Hashtable groupings = new Hashtable();
                for (int counter = startingElement; counter < m.Groups.Count; counter++)
                {
                    // If we had just returned the MatchCollection directly, the
                    // GroupNameFromNumber method would not be available to use
                    groupings.Add(RE.GroupNameFromNumber(counter),
                    m.Groups[counter]);
                }
                keyedMatches.Add(groupings);
            }
            return (keyedMatches);
        }
    }
}

But here I face a problem: when I run it, each URL is displayed three times. First the whole anchor tag is displayed, and then the URL is displayed twice. Can anyone suggest what I should correct so that each URL is displayed exactly once?

DO NOT PARSE HTML USING Regular Expressions! stackoverflow.com/questions/1732348/…
– SLaks, Commented Jan 31, 2010 at 23:49
@SLaks: "it's sometimes appropriate to parse a limited, known set of HTML"
– Ben Shelock, Commented Feb 6, 2010 at 1:12

Mark Byers · Accepted Answer · 2010-02-06 01:09:02Z

3

Use HTML Agility Pack to parse HTML. I think it will make your problem much easier to solve.

Here's one way to do it:

WebClient client = new WebClient();
string url = "http://www.alexa.com/topsites/category/Top/Society/History/By_Topic/Science/Engineering_and_Technology";
string source = client.DownloadString(url);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(source);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href and @rel='nofollow']"))
{
    Console.WriteLine(link.Attributes["href"].Value);
}
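One caveat worth sketching: in HTML Agility Pack, `SelectNodes` returns null rather than an empty collection when nothing matches, so the `foreach` above can throw on a page without such links. The helper below is a minimal sketch with a null guard, assuming the HtmlAgilityPack NuGet package is referenced; the class and method names (`LinkLister`, `NofollowLinks`) are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkLister
{
    // Returns the href of every <a> that has rel="nofollow".
    public static List<string> NofollowLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        HtmlNodeCollection links =
            doc.DocumentNode.SelectNodes("//a[@href and @rel='nofollow']");

        // SelectNodes yields null when no node matches the XPath.
        if (links == null)
        {
            return result;
        }

        foreach (HtmlNode link in links)
        {
            result.Add(link.GetAttributeValue("href", ""));
        }
        return result;
    }
}
```

Calling `LinkLister.NofollowLinks(source)` with the downloaded page then gives one entry per matching link, and an empty list if the page layout changes.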

edited Feb 6, 2010 at 1:09

answered Jan 31, 2010 at 23:43

Mark Byers

Mike Sherov · Answer · 2010-01-31 23:48:50Z

1

In your regex, you have two groupings plus the entire match. If I'm reading it correctly, you only want the URL portion of each match, which is the second of the three groups (index 1).

Instead of this:

for (int counter = startingElement; counter < m.Groups.Count; counter++)
{
    // If we had just returned the MatchCollection directly, the
    // GroupNameFromNumber method would not be available to use
    groupings.Add(RE.GroupNameFromNumber(counter),
        m.Groups[counter]);
}

don't you want this?

groupings.Add(RE.GroupNameFromNumber(1),m.Groups[1]);
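A minimal sketch of the same idea, assuming the group name `url` from the question's pattern: reading the named group directly avoids both the hashtable and the loop over every group. The simplified pattern, the `UrlExtractor` class, and the sample HTML are illustrative assumptions, not the actual Alexa markup.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlExtractor
{
    // Simplified, hypothetical pattern: captures the href value of an anchor tag.
    private static readonly Regex AnchorHref = new Regex(
        @"<a\b[^>]*\bhref=[""'](?<url>[^""']+)[""'][^>]*>",
        RegexOptions.IgnoreCase);

    public static List<string> ExtractUrls(string source)
    {
        var urls = new List<string>();
        foreach (Match m in AnchorHref.Matches(source))
        {
            // Only the named group is read, so each URL is added once per match.
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }

    public static void Main()
    {
        const string html =
            "<a rel=\"nofollow\" href=\"http://example.com/\" class=\"offsite\">example.com</a>";
        foreach (string url in ExtractUrls(html))
        {
            Console.WriteLine(url);
        }
    }
}
```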

answered Jan 31, 2010 at 23:48

Mike Sherov

Paul Creasey · Answer · 2010-01-31 23:50:55Z

1

int startingElement = 1;
if (wantInitialMatch)
{
    startingElement = 0;
}

...

for (int counter = startingElement; counter < m.Groups.Count; counter++)
{
    // If we had just returned the MatchCollection directly, the
    // GroupNameFromNumber method would not be available to use
    groupings.Add(RE.GroupNameFromNumber(counter),
        m.Groups[counter]);
}

You're passing wantInitialMatch = true, so your for loop is returning:

m.Groups[0] // entire match
m.Groups[1] // (?<url>[^""^']+[.]*) href part
m.Groups[2] // (?<name>[^<]+[.]*) link text
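The numbering can be checked with a small sketch like the one below, which lists every group of a single match together with the name that `GroupNameFromNumber` reports for it. The `GroupDemo` class, the shortened pattern, and the sample input are assumptions made for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Builds one line per group: "<index> (<group name>): <captured text>".
    public static string[] DescribeGroups(string input, string pattern)
    {
        Regex re = new Regex(pattern);
        Match m = re.Match(input);
        string[] lines = new string[m.Groups.Count];
        for (int i = 0; i < m.Groups.Count; i++)
        {
            // Group 0 is always the entire match; named groups follow.
            lines[i] = i + " (" + re.GroupNameFromNumber(i) + "): " + m.Groups[i].Value;
        }
        return lines;
    }

    public static void Main()
    {
        string[] lines = DescribeGroups(
            "<a href='http://example.com/'>Example</a>",
            @"href='(?<url>[^']+)'>(?<name>[^<]+)<");
        foreach (string line in lines)
        {
            Console.WriteLine(line);
        }
    }
}
```

Passing wantInitialMatch = false in the question's code starts the loop at index 1 and skips the entire-match entry, leaving only the `url` and `name` groups.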

answered Jan 31, 2010 at 23:50

Paul Creasey
