I am building a Perl module and am trying to use as few non-core dependencies as possible. Should the following code fail, I would add a dependency on HTML::LinkExtor and be done with it, but I want to try first. All I want is to extract the href= attributes from <a> tags. I do it using Text::Balanced, which is core in modern Perls and can be installed on others. So yes, I know I should use an HTML library. That said, is this passably OK?
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use Text::Balanced qw/extract_bracketed extract_delimited extract_multiple/;
my $html = q#Some <a href=link>link text</a> stuff. And a little <A HREF="link2">different link text</a>.#;
my @tags = find_anchor_targets($html);
print Dumper \@tags;
sub find_anchor_targets {
    my $html = shift;
    my @tags = extract_multiple(
        $html,
        [ sub { extract_bracketed($_[0], '<>') } ],
        undef, 1
    );
    @tags =
        map  { extract_href($_) }   # find related href=
        grep { /^<a\b/i }           # only anchor begin tags (not <abbr>, <address>, ...)
        @tags;
    return @tags;
}
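sub extract_href {
    my $tag = shift;
    if ($tag =~ /href=(?='|")/gci) {
        # quoted value: pos($tag) is just past "href=", where extract_delimited starts
        my $text = scalar extract_delimited( $tag, q{'"} );
        my $delim = substr $text, 0, 1;
        $text =~ s/^$delim//;
        $text =~ s/$delim$//;
        return $text;
    } elsif ($tag =~ /href=(.*?)(?:\s|>)/i) {
        # unquoted value, matched case-insensitively like the quoted branch
        return $1;
    } else {
        return ();
    }
}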
This dumps
$VAR1 = [
          'link',
          'link2'
        ];
which is what one would expect.
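A few more inputs seem worth trying before trusting this. The sketch below repeats the two subs in compact form, anchoring the tag test with `\b` and matching `href` case-insensitively in the unquoted branch, then runs them over some hypothetical samples (single quotes, an upper-case unquoted attribute, an `<abbr>` tag, an anchor without `href`). The sample strings and the early return on an unterminated quote are my own assumptions, not part of the original script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Text::Balanced qw/extract_bracketed extract_delimited extract_multiple/;

# Pull every <...> chunk out of the input, keep only <a ...> start tags,
# and map each one to its href value (tags without href contribute nothing).
sub find_anchor_targets {
    my $html = shift;
    my @tags = extract_multiple(
        $html,
        [ sub { extract_bracketed($_[0], '<>') } ],
        undef, 1
    );
    return map { extract_href($_) } grep { /^<a\b/i } @tags;
}

sub extract_href {
    my $tag = shift;
    if ($tag =~ /href=(?=['"])/gci) {
        # Quoted value: pos($tag) now sits just after "href=".
        my $text = extract_delimited($tag, q{'"});
        return () unless defined $text;    # assumption: skip unterminated quotes
        return substr $text, 1, -1;        # drop the surrounding quotes
    }
    elsif ($tag =~ /href=(.*?)[\s>]/i) {
        return $1;                         # unquoted value
    }
    return ();
}

# Hypothetical sample inputs, not taken from the script above.
my @samples = (
    q{<a href='single quoted'>x</a>},
    q{<A HREF=upper>x</A>},
    q{<abbr title="t">x</abbr> <a name="top">y</a>},
    q{<a class="c" href="after-attr">z</a>},
);
for my $sample (@samples) {
    my @found = find_anchor_targets($sample);
    print "$sample => [", join(', ', @found), "]\n";
}
```

Inputs this still does not try to handle include a `>` inside a quoted attribute value and attributes whose names merely end in `href` (such as `data-href`), which is the usual argument for a real HTML parser.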