I want to receive an array that contains the text of every h1 tag in a given string.

For example, if this were the given input string:

<h1>hello</h1>
<p>random text</p>
<h1>title number two!</h1>

I need to receive an array containing this:

titles[0] = 'hello',
titles[1] = 'title number two!'

I already figured out how to get the first h1 value of the string but I need all the values of all the h1 tags in the given string.

I'm currently using this to receive the first tag:

function getTextBetweenTags($string, $tagname) 
 {
  $pattern = "/<$tagname ?.*>(.*)<\/$tagname>/";
  preg_match($pattern, $string, $matches);
  return $matches[1];
 }

I pass it the string I want parsed and "h1" as $tagname. I didn't write it myself, though; I've been trying to edit the code to do what I want, but nothing really works.

I was hoping someone could help me out.

Thanks in advance.

4 Answers

36

You could use simplehtmldom:

function getTextBetweenTags($string, $tagname) {
    // Create DOM from string (str_get_html() is provided by the simplehtmldom library)
    $html = str_get_html($string);

    $titles = array();
    // Find all matching tags and collect their text
    foreach($html->find($tagname) as $element) {
        $titles[] = $element->plaintext;
    }
    return $titles;
}
10
  • Oooh I didn't know you could do that!
    – Rimian
    Commented Jul 21, 2010 at 12:26
  • 2
    Is simplehtmldom any faster than DOMDocument or just for those occasions where DOMDocument doesn't exist (although it's enabled by default)?
    – Wrikken
    Commented Jul 21, 2010 at 12:39
  • 1
    @Wrikken it is userland code, so I doubt it is faster. Dunno why people are so fascinated with it (must be the simple in the name), especially because there are also Zend_Dom, phpquery or FluentDom as alternatives.
    – Gordon
    Commented Jul 21, 2010 at 12:49
  • 2
    @kgb DOM can load invalid HTML fine if you load it with loadHTML. The only thing not working then is getElementById, and that is solely due to the fallback to the HTML 4.0 DTD. You can still very much query nodes by ID via XPath then. Also, you do not have to suppress the errors with @ at all. You can use libxml_use_internal_errors and handle any errors by custom error handlers. SimpleHTMLDom isn't more suitable for HTML. It doesn't even use libxml but parses the HTML with string functions.
    – Gordon
    Commented Jul 21, 2010 at 13:22
  • 2
    -1 for not using the built in c extension to do the exact same thing (Seriously, why do things in PHP if the exact same thing is built into the PHP core?)... Use DomDocument instead...
    – ircmaxell
    Commented Jul 21, 2010 at 15:56
25
function getTextBetweenTags($string, $tagname){
    $d = new DOMDocument();
    $d->loadHTML($string);
    $return = array();
    foreach($d->getElementsByTagName($tagname) as $item){
        $return[] = $item->textContent;
    }
    return $return;
}
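
The comments above mention libxml_use_internal_errors as a way to avoid suppressing warnings with @ when the input is not clean HTML. A minimal sketch of that idea follows; the function name getTextBetweenTagsQuiet is made up for illustration, and the sketch simply discards the collected parse errors rather than handling them:

```php
<?php
// Hypothetical helper: same DOMDocument approach as above, but libxml
// parse errors are buffered internally instead of raised as PHP warnings.
function getTextBetweenTagsQuiet(string $html, string $tagname): array
{
    // Buffer libxml errors and remember the previous setting
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Assumption: we do not care about the errors, so drop them
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Collect the text content of every matching element
    $texts = [];
    foreach ($doc->getElementsByTagName($tagname) as $node) {
        $texts[] = $node->textContent;
    }
    return $texts;
}

$titles = getTextBetweenTagsQuiet(
    "<h1>hello</h1>\n<p>random text</p>\n<h1>title number two!</h1>",
    'h1'
);
print_r($titles);
```

With the input from the question, $titles should hold 'hello' and 'title number two!'.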
0
10

Alternative to DOM. Use when memory is an issue.

$html = <<< HTML
<html>
<h1>hello<span>world</span></h1>
<p>random text</p>
<h1>title number two!</h1>
</html>
HTML;

$reader = new XMLReader;
$reader->xml($html);
while($reader->read() !== FALSE) {
    if($reader->name === 'h1' && $reader->nodeType === XMLReader::ELEMENT) {
        echo $reader->readString();
    }
}
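
The snippet above echoes each heading. Since the question asks for an array, here is a rough sketch that collects the values instead. The function name getTextWithXmlReader is hypothetical, and the sketch assumes the input is well-formed XML with a single root element, which XMLReader needs:

```php
<?php
// Hypothetical helper: stream through the markup with XMLReader and
// collect the text of every element whose name matches $tagname.
function getTextWithXmlReader(string $xml, string $tagname): array
{
    $reader = new XMLReader();
    $reader->XML($xml);

    $texts = [];
    while ($reader->read()) {
        // Only react to opening element nodes with the requested name
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $tagname) {
            $texts[] = $reader->readString();
        }
    }
    $reader->close();

    return $texts;
}

$titles = getTextWithXmlReader(
    '<html><h1>hello</h1><p>random text</p><h1>title number two!</h1></html>',
    'h1'
);
print_r($titles);
```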
2
  • thanks, I'm still using the DOM method though. Still thank you for taking your time answering :)
    – Pieter888
    Commented Jul 21, 2010 at 12:43
  • 1
    @Pieter yup, I would have supplied the DOM solution myself if Wrikken hadn't already done so.
    – Gordon
    Commented Jul 21, 2010 at 12:46
6
 function getTextBetweenH1($string)
 {
    $pattern = "/<h1>(.*?)<\/h1>/";
    preg_match_all($pattern, $string, $matches);
    return ($matches[1]);
 }
6
  • 12
    Using regex is quite fine here. He isn't parsing HTML. He is matching stuff between <h1> and </h1>, which is inherently regular. Matching a regular language with regular expressions is quite fine. Drop the mindless "OMG regex cannot be used for anything if there is HTML involved" crap that everybody seems to be hyping. It's not like he is trying to match all of HTML, only a very small subset of the language which happens to be regular.
    Commented Jul 21, 2010 at 13:17
  • 2
    @Daniel what if there are attributes on the <h1>? What if the headings contain element children?
    – Gordon
    Commented Jul 21, 2010 at 13:28
  • 2
    @Gordon: The attribute problem can be solved using this regex: #<h1(?:"(?:[^\\\"]|\\\.)*"|\'(?:[^\\\\\']|\\\.)*\'|[^\'">])*>(.*?)</h1>#i (which I believe still describes a regular language and thus can be represented using a finite state machine). The problem with child elements is non-existent because there cannot be an <h1> within another <h1> anyways. Edit: The regex is written for a single-quoted PHP string.
    Commented Jul 21, 2010 at 13:51
  • 2
    @Daniel you have to admit that this is completely unreadable :) Also, there can be inline elements in an h1. What about spans? strongs? ems? The h1 of this very page has a link inside. Regex has no concept of TextNodes. It just knows Strings.
    – Gordon
    Commented Jul 21, 2010 at 13:56
  • 2
    This regex will still work, even if there are inline elements inside the H1 element... IMHO, it doesn't matter if it is unreadable either, as it is very much a set and forget function.
    – evilunix
    Commented Jan 6, 2015 at 14:18
