Getting all values from h1 tags using PHP

Question

I want to get an array that contains the text of every h1 tag in a string.

For example, if this were the given input string:

<h1>hello</h1>
<p>random text</p>
<h1>title number two!</h1>

I need to receive an array containing this:

titles[0] = 'hello',
titles[1] = 'title number two!'

I already figured out how to get the first h1 value of the string but I need all the values of all the h1 tags in the given string.

I'm currently using this to receive the first tag:

function getTextBetweenTags($string, $tagname) 
 {
  $pattern = "/<$tagname ?.*>(.*)<\/$tagname>/";
  preg_match($pattern, $string, $matches);
  return $matches[1];
 }

I pass it the string I want parsed and "h1" as $tagname. I didn't write it myself, though. I've been trying to edit the code to do what I want, but nothing really works.

I was hoping someone could help me out.

Thanks in advance.

At minimum you should use preg_match_all. Take a look at simplehtmldom.sourceforge.net — fabrik, Commented Jul 21, 2010 at 12:15

Luke Manson · Accepted Answer · 2019-04-23 14:51:10Z

36

You could use simplehtmldom:

function getTextBetweenTags($string, $tagname) {
    // Create DOM from string (str_get_html() comes from the simplehtmldom library)
    $html = str_get_html($string);

    $titles = array();
    // Find all matching tags and collect their text
    foreach($html->find($tagname) as $element) {
        $titles[] = $element->plaintext;
    }
    return $titles;
}

edited Apr 23, 2019 at 14:51

Luke Manson


answered Jul 21, 2010 at 12:18

Sergey Eremin


Oooh I didn't know you could do that!
– Rimian
Commented Jul 21, 2010 at 12:26
2

Is simplehtmldom any faster than DOMDocument, or is it just for those occasions where DOMDocument doesn't exist (although it's enabled by default)?
– Wrikken
Commented Jul 21, 2010 at 12:39
1

@Wrikken it is userland code, so I doubt it is faster. Dunno why people are so fascinated with it (must be the "simple" in the name), especially because there are also Zend_Dom, phpquery, and FluentDom as alternatives.
– Gordon
Commented Jul 21, 2010 at 12:49
2

@kgb DOM can load invalid HTML fine if you load it with loadHTML. The only thing not working then is getElementById, and that is solely due to the fallback to the HTML 4.0 DTD. You can still very much query nodes by ID via XPath then. Also, you do not have to suppress the errors with @ at all. You can use libxml_use_internal_errors and handle any errors with custom error handlers. SimpleHTMLDom isn't more suitable for HTML. It doesn't even use libxml but parses the HTML with string functions.
– Gordon
Commented Jul 21, 2010 at 13:22
2

-1 for not using the built-in C extension to do the exact same thing (Seriously, why do things in PHP if the exact same thing is built into the PHP core?)... Use DOMDocument instead...
– ircmaxell
Commented Jul 21, 2010 at 15:56


Wrikken · Accepted Answer · 2010-07-21 12:33:39Z

25

function getTextBetweenTags($string, $tagname){
    // Parse the string with the built-in DOM extension
    $d = new DOMDocument();
    $d->loadHTML($string);
    $return = array();
    // Collect the text content of every matching element
    foreach($d->getElementsByTagName($tagname) as $item){
        $return[] = $item->textContent;
    }
    return $return;
}
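As a follow-up to the comment about `libxml_use_internal_errors`, here is a minimal sketch of the same DOM approach that keeps parse warnings for sloppy markup out of the output. The function name `getTagTexts` is hypothetical and not part of the original answer, and trimming each value is an added assumption about what the caller wants.

```php
<?php
// Hypothetical helper name; a sketch of the DOMDocument approach above.
function getTagTexts(string $html, string $tagName): array
{
    // Collect libxml parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected errors and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ($doc->getElementsByTagName($tagName) as $node) {
        // textContent also includes the text of inline children such as <em>.
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

$titles = getTagTexts("<h1>hello</h1>\n<p>random text</p>\n<h1>title number two!</h1>", 'h1');
print_r($titles);
```

Because the text comes from the DOM node rather than from the raw string, attributes on the tag and inline elements inside it need no special handling.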

edited Jul 21, 2010 at 12:33

answered Jul 21, 2010 at 12:20

Wrikken


Gordon · Accepted Answer · 2010-07-21 12:40:12Z

10

An alternative to DOM is XMLReader, a streaming pull parser. Use it when memory is an issue.

$html = <<< HTML
<html>
<h1>hello<span>world</span></h1>
<p>random text</p>
<h1>title number two!</h1>
</html>
HTML;

$reader = new XMLReader;
$reader->xml($html);
while($reader->read() !== FALSE) {
    if($reader->name === 'h1' && $reader->nodeType === XMLReader::ELEMENT) {
        echo $reader->readString();
    }
}
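Since the question asks for an array rather than echoed output, the loop above could be wrapped roughly as follows. This is only a sketch: `readTagTexts` is a hypothetical name, and it assumes the input is well-formed XML, which XMLReader requires.

```php
<?php
// Hypothetical helper; assumes $xml is well-formed XML.
function readTagTexts(string $xml, string $tagName): array
{
    $reader = new XMLReader();
    $reader->XML($xml);

    $texts = [];
    // Walk the document node by node without building a full tree.
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $tagName) {
            // readString() returns the text content of the current element.
            $texts[] = $reader->readString();
        }
    }
    $reader->close();
    return $texts;
}

$xml = '<html><h1>hello</h1><p>random text</p><h1>title number two!</h1></html>';
print_r(readTagTexts($xml, 'h1'));
```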

edited Jul 21, 2010 at 12:40

answered Jul 21, 2010 at 12:26

Gordon


Thanks, though I'm still using the DOM method. Thank you for taking the time to answer :)
– Pieter888
Commented Jul 21, 2010 at 12:43
1

@Pieter yup, I would have supplied the DOM solution myself if Wrikken hadn't already done so.
– Gordon
Commented Jul 21, 2010 at 12:46


Ahmed Aman · Accepted Answer · 2010-07-21 12:24:29Z

6

 function getTextBetweenH1($string)
 {
    $pattern = "/<h1>(.*?)<\/h1>/";
    preg_match_all($pattern, $string, $matches);
    return ($matches[1]);
 }
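The pattern above only matches a bare `<h1>` on a single line. A hedged sketch that also tolerates attributes, mixed-case tags, and inline child elements might look like the following. The name `getH1Texts` is hypothetical, and the pattern assumes no attribute value contains a literal `>`; it is not a general HTML parser.

```php
<?php
// Hypothetical helper; a sketch, not a general HTML parser.
function getH1Texts(string $html): array
{
    // \b[^>]* tolerates attributes; "s" lets "." span newlines; "i" ignores tag case.
    preg_match_all('~<h1\b[^>]*>(.*?)</h1>~is', $html, $matches);

    // Remove inline child tags such as <span> or <em>, then trim whitespace.
    return array_map(
        static function (string $inner): string {
            return trim(strip_tags($inner));
        },
        $matches[1]
    );
}

print_r(getH1Texts('<h1 class="a">hello <span>world</span></h1><p>x</p><H1>title number two!</H1>'));
```

Running `strip_tags` on each capture is the design choice here: the regex finds the heading boundaries, and the inline markup inside them is removed in a separate step.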

answered Jul 21, 2010 at 12:24

Ahmed Aman


12

Using regex is quite fine here. He isn't parsing HTML. He is matching stuff between <h1> and </h1>, which is inherently regular. Matching a regular language with regular expressions is quite fine. Drop the mindless "OMG regex cannot be used for anything if there is HTML involved" crap that everybody seems to be hyping. It's not like he is trying to match all of HTML, only a very small subset of the language which happens to be regular.
– Daniel Egeberg
Commented Jul 21, 2010 at 13:17
2

@Daniel what if there are attributes on the <h1>? What if the headings contain child elements?
– Gordon
Commented Jul 21, 2010 at 13:28
2

@Gordon: The attribute problem can be solved using this regex: #<h1(?:"(?:[^\\\"]|\\\.)*"|\'(?:[^\\\\\']|\\\.)*\'|[^\'">])*>(.*?)</h1>#i (which I believe still describes a regular language and thus can be represented using a finite state machine). The problem with child elements is non-existent because there cannot be an <h1> within another <h1> anyways. Edit: The regex is written for a single-quoted PHP string.
– Daniel Egeberg
Commented Jul 21, 2010 at 13:51
2

@Daniel you have to admit that this is completely unreadable :) Also, there can be inline elements in an h1. What about spans? strongs? ems? The h1 of this very page has a link inside. Regex has no concept of TextNodes. It just knows Strings.
– Gordon
Commented Jul 21, 2010 at 13:56
2

This regex will still work, even if there are inline elements inside the H1 element... IMHO, it doesn't matter if it is unreadable either, as it is very much a set and forget function.
– evilunix
Commented Jan 6, 2015 at 14:18
