extract href from html

Question

I am building a Perl module for which I am trying to use as few non-core dependencies as possible. If the following code should fail, I would add a dependency on HTML::LinkExtor and be done with it, but I want to try. All I want is to extract the href= attributes from <a> tags. I do it using Text::Balanced, which is core in modern Perls and installable on older ones. So yes, I know I should use an HTML library. That said, is this passably OK?

#!/usr/bin/env perl

use strict;
use warnings;

use Data::Dumper;

use Text::Balanced qw/extract_bracketed extract_delimited extract_multiple/;

my $html = q#Some <a href=link>link text</a> stuff. And a little <A HREF="link2">different link text</a>.#;

my @tags = find_anchor_targets($html);

print Dumper \@tags;

sub find_anchor_targets {
  my $html = shift;

  my @tags = extract_multiple( 
    $html, 
    [ sub { extract_bracketed($_[0], '<>') } ],
    undef, 1
  );

  @tags =
    map { extract_href($_) }  # find related href=
    grep { /^<a\b/i }          # only anchor start tags (\b rejects <abbr>, <area>, ...)
    @tags;

  return @tags;
}

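sub extract_href {
  my $tag = shift;
  if ($tag =~ /href=(?=['"])/gci) {
    # Quoted value: /gc leaves pos($tag) at the opening quote,
    # which is where extract_delimited starts scanning.
    my $text = extract_delimited( $tag, q{'"} );
    return () unless defined $text;   # unterminated quote
    return substr $text, 1, -1;       # strip the surrounding quotes
  } elsif ($tag =~ /href=([^\s>]+)/i) {
    # Unquoted value: runs up to whitespace or the closing '>'.
    # /i is needed here too, otherwise HREF=link is missed.
    return $1;
  } else {
    return ();
  }
}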

This dumps

$VAR1 = [
          'link',
          'link2'
        ];

which is what one would expect.

minopret · Accepted Answer · 2012-03-19 06:12:05Z

<script language="javascript">
var a='<a href="1" title="Passably ok, yes, why not. Perfect, no.">'
document.write('<a href="2" title="Real-world HTML is just really complicated.">')
</script>
<style type="text/css">
p { font-family: "<a href='3' title='...in so many ways'>" }
</style>
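
To make the point concrete, here is a minimal sketch that runs a purely textual scan over the snippet above. It uses a simplified regex as a stand-in for the question's Text::Balanced code (an assumption made for brevity), and naive_hrefs is a hypothetical name. Any extractor that does not understand <script> and <style> content would likely behave the same way.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# The sample from the answer: none of these <a> "tags" is a real link,
# because they sit inside JavaScript string literals and a CSS value.
my $html = <<'HTML';
<script language="javascript">
var a='<a href="1" title="Passably ok, yes, why not. Perfect, no.">'
document.write('<a href="2" title="Real-world HTML is just really complicated.">')
</script>
<style type="text/css">
p { font-family: "<a href='3' title='...in so many ways'>" }
</style>
HTML

# Hypothetical helper: a purely textual scan for quoted href values in <a> tags.
sub naive_hrefs {
  my $text = shift;
  my @found;
  while ($text =~ /<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1/gis) {
    push @found, $2;
  }
  return @found;
}

my @hrefs = naive_hrefs($html);
print "false positive: $_\n" for @hrefs;
```

The scan reports 1, 2 and 3 as link targets, although a browser would not treat any of them as an anchor.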

answered Mar 19, 2012 at 5:53 · edited Mar 19, 2012 at 6:12

Thanks, point taken. In the days since I posted this, I reworked my code to use HTML::LinkExtor if it is available and my own code otherwise. Further, once the links are extracted, they are filtered by a certain file name pattern (source files of C projects), so there isn't too much chance of false positives. Thanks for the good examples though!

– Joel Berger, Mar 19, 2012 at 14:33
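
The optional-dependency approach described in the comment could look roughly like this. It is a sketch under assumptions, not the commenter's actual code: find_links is a hypothetical name, and the fallback is a simplified regex rather than the Text::Balanced code from the question.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Try to load HTML::LinkExtor at runtime; a failed load is not fatal.
my $have_linkextor = eval { require HTML::LinkExtor; 1 };

# Hypothetical helper: returns the href values of all <a> tags.
sub find_links {
  my $html = shift;

  if ($have_linkextor) {
    my $parser = HTML::LinkExtor->new;   # no callback: links are accumulated
    $parser->parse($html);
    $parser->eof;
    my @found;
    for my $link ($parser->links) {
      # Each entry is [ $tag, $attr_name => $url, ... ]
      my ($tag, %attr) = @$link;
      push @found, $attr{href} if $tag eq 'a' && defined $attr{href};
    }
    return @found;
  }

  # Fallback (assumption): simplified textual scan, quoted or unquoted values.
  my @found;
  while ($html =~ /<a\b[^>]*?\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/gi) {
    push @found, grep { defined } $1, $2, $3;
  }
  return @found;
}

my $html = q#Some <a href=link>link text</a> stuff. And a little <A HREF="link2">different link text</a>.#;
print "$_\n" for find_links($html);
```

Loading the module inside eval keeps it a soft dependency: the code still works on a bare Perl install and gets the more robust parser when it happens to be present.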
