Removing tags from HTML

Question

I'm working on a small project (something like a blog) where users can write articles. To prevent XSS, I wrote this function based on a few Stack Overflow answers about removing HTML tags, plus some other sources.

<?php

 class html 
 {
    public static function parse(String $html) : String
    {
      $tidy = new \tidy();
      $html = $tidy->repairString($html);
      $dom = new \DOMDocument();
      $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
      // dirty fix
      foreach ($dom->childNodes as $item)
          if ($item->nodeType == XML_PI_NODE)
              $dom->removeChild($item); // remove hack
      $dom->encoding = 'UTF-8'; // insert proper
      $script     = $dom->getElementsByTagName('script');
      $style      = $dom->getElementsByTagName('style');
      $iframe     = $dom->getElementsByTagName('iframe');
      $applet     = $dom->getElementsByTagName('applet');
      $video      = $dom->getElementsByTagName('video');
      $audio      = $dom->getElementsByTagName('audio');
      $link       = $dom->getElementsByTagName('link');
      $meta       = $dom->getElementsByTagName('meta');
      $head       = $dom->getElementsByTagName('head');
      $form       = $dom->getElementsByTagName('form');
      $input      = $dom->getElementsByTagName('input');
      $textarea   = $dom->getElementsByTagName('textarea');
      $list   = [$form,$input,$textarea,$head,$script,$style,$iframe,$applet,$video,$audio,$link,$meta];
      $remove_img = []; $remove = [];
      foreach ($list as $s) foreach ($s as $v) $remove[] = $v;
      foreach ($remove as $item) $item->parentNode->removeChild($item);
      $imgs = $dom->getElementsByTagName('img');
      foreach ($imgs as $img) foreach ($img->attributes as $attr)
      if($attr->nodeName == 'src' && strpos($attr->nodeValue,'base64:') !== 0 && strpos($attr->nodeValue,'https://') !== 0 && strpos($attr->nodeValue,'http://') !== 0)
      $remove_img[] = $img;
      foreach ($remove_img as $item) $item->parentNode->removeChild($item);
      $xpath = new \DOMXPath($dom);
      $nodes = $xpath->query('//@*');
      foreach ($nodes as $node) {
          if(
            !in_array(
              // i use tinyMCE and sometimes it places data-mce-* attributes
              str_replace('data-mce-','',$node->nodeName),
              ['href','src','class','style','width','height','data-href','title','target','rel']
              )
          ) $node->parentNode->removeAttribute($node->nodeName);
      }
      $html = $dom->saveHTML();
      $buffer = strip_tags($html, '<figure><section><p><strong><em><u><h1><h2><h3><h4><h5><h6><img><li><ol><ul><span><div><br><ins><del><a>');
      $clean = $tidy->repairString($buffer);
      return $clean;
    }
}

I tried not to use regex for 2 reasons:

1. I'm not good at regex (at all)
2. People said not to use it

So far this works fine: it removes bad HTML tags and attributes and cleans up the HTML code. I even tested it against an "XSS test" gist I found on GitHub and it worked fine. But can this be better? Can I add something that makes it more secure or faster?

I have rolled back your edits. Please see what you may and may not do after receiving an answer.
– Emily L., Commented Dec 31, 2017 at 16:39
Why not use existing code for that? I think there are well-made libraries for the task at hand, e.g. HTML Purifier.
– hakre, Commented Jan 8, 2018 at 21:37
@hakre I searched for existing libraries but couldn't find one. Anyway, I have changed a lot of this class by now and it's much better. I also don't like using libraries because I want to learn how to do stuff by myself first; using code that I don't understand is not a good idea.
– user155894, Commented Jan 13, 2018 at 19:11
Reading other people's code is often the best way to learn. Just saying.
– hakre, Commented Jan 14, 2018 at 22:29

Emily L. · Accepted Answer · 2017-12-29 01:04:45Z

2

I'm not that great with PHP, but to me this looks like you're blacklisting tags. I would not recommend that approach, as it is easy to miss some tags or combinations of attributes.

Instead, I would recommend using a whitelist of tags that you allow and blocking everything else. This is more secure: if you forget to whitelist something, your users will complain and you can fix it. XSS attackers typically won't inform you if you forget to blacklist a tag and have a vulnerability ;) You are also automatically protected from new tags and attribute combinations.
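
As a rough illustration of the whitelist idea, here is a minimal sketch that walks the parsed DOM and drops every element and attribute that is not explicitly allowed. The names `clean_children` and `sanitize_whitelist` and both allow-lists are hypothetical, chosen only for this example, and the `href` check deliberately accepts nothing but absolute `http://` and `https://` URLs. It is not a substitute for a maintained sanitizer.

```php
<?php
// Illustrative allow-lists (assumptions); adjust to what your editor should produce.
const ALLOWED_TAGS  = ['p', 'strong', 'em', 'u', 'ul', 'ol', 'li', 'br', 'a'];
const ALLOWED_ATTRS = ['href', 'title'];

// Hypothetical helper: recursively cleans the children of $parent.
function clean_children(DOMNode $parent): void
{
    // Snapshot the live child list before removing anything from it.
    foreach (iterator_to_array($parent->childNodes) as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            continue; // plain text is kept; saveHTML() escapes it on output
        }
        if (!($child instanceof DOMElement) || !in_array($child->nodeName, ALLOWED_TAGS, true)) {
            $parent->removeChild($child); // unknown tag, comment, etc.: drop it with its subtree
            continue;
        }
        foreach (iterator_to_array($child->attributes, false) as $attr) {
            $name  = $attr->nodeName;
            $value = $attr->nodeValue;
            // Only absolute http(s) URLs pass; javascript:, data:, relative URLs do not.
            $isSafeUrl = strpos($value, 'https://') === 0 || strpos($value, 'http://') === 0;
            if (!in_array($name, ALLOWED_ATTRS, true) || ($name === 'href' && !$isSafeUrl)) {
                $child->removeAttribute($name);
            }
        }
        clean_children($child);
    }
}

// Hypothetical entry point: returns the cleaned body content as an HTML fragment.
function sanitize_whitelist(string $html): string
{
    libxml_use_internal_errors(true); // user HTML is often malformed; keep warnings quiet
    $dom = new DOMDocument();
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    clean_children($body);

    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```

With this shape, a tag or attribute that is missing from the lists simply disappears from the output, so a forgotten entry shows up as a user complaint rather than as a vulnerability.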

Or you could simply strip ALL tags or HTML-escape the entire body.
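
Both of those simpler options are one-liners with PHP built-ins. The sketch below wraps them in two functions whose names (`escape_body`, `strip_all_tags`) are made up for this example; it assumes the article body is a UTF-8 string.

```php
<?php
// Hypothetical helper: show the markup literally instead of rendering it.
function escape_body(string $body): string
{
    // ENT_QUOTES escapes both quote styles; ENT_SUBSTITUTE replaces invalid UTF-8 sequences.
    return htmlspecialchars($body, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Hypothetical helper: drop every tag and keep only the text between tags.
function strip_all_tags(string $body): string
{
    return strip_tags($body);
}

echo escape_body('<b onclick="x">hi</b>'), "\n";
// &lt;b onclick=&quot;x&quot;&gt;hi&lt;/b&gt;
echo strip_all_tags('<b>hi</b> there'), "\n";
// hi there
```

Note that stripped text is not escaped text: a stray `<` or `&` can survive `strip_tags`, so the result should probably still go through `htmlspecialchars` when it is printed.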

The last function does just that! tidy cleans the HTML code (closes any unclosed tags, etc.), DOM removes the blacklisted tags and all attributes except the whitelisted ones, and the last function removes all tags except the whitelisted ones; then tidy cleans the code again.
– user155894, Commented Dec 29, 2017 at 1:09
So why remove blacklisted tags when you later only keep whitelisted tags? Either I'm missing something or it doesn't sound right to me.
– Emily L., Commented Dec 30, 2017 at 13:42
PHP's strip_tags messes up sometimes and doesn't remove specific attributes, so using DOM is a much better way to remove blacklisted tags, with strip_tags removing the rest. I don't rely on strip_tags alone because it may miss something, so I make sure that harmful tags and attributes have already been removed from the DOM.
– user155894, Commented Dec 30, 2017 at 18:47
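
The point about attributes in the last comment can be seen in a short sketch: `strip_tags` filters by tag name only, so attributes on an allowed tag pass through untouched, and the text inside a removed tag stays behind. The wrapper name `keep_links_only` is hypothetical.

```php
<?php
// Hypothetical wrapper: allow only <a> tags through strip_tags.
function keep_links_only(string $html): string
{
    return strip_tags($html, '<a>');
}

$input = '<a href="#" onclick="alert(1)">click</a><script>alert(2)</script>';
echo keep_links_only($input), "\n";
// <a href="#" onclick="alert(1)">click</a>alert(2)
```

The `onclick` handler survives because `<a>` is allowed, which is why attribute filtering has to happen in a separate step such as the DOM pass.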


mickmackusa · Accepted Answer · 2018-01-08 21:03:44Z

I may have flaws in my suggested code (because I didn't test it), but it seems to me that you could afford to implement the DRY principle. Writing a loop, to generate an array of items, then writing another loop to traverse the array of items just doesn't make sense to me -- just perform all necessary processes in the first loop (no temporary arrays).

I recommend curly brackets for your conditionals and loops; not because they are essential, but because they improve readability for most people (never assume that you are the only person to see your code).

Untested Suggested Code:

class html 
{
    public static function parse(String $html) : String
    {
        $tidy = new \tidy();
        $html = $tidy->repairString($html);
        $dom = new \DOMDocument();
        $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
        // dirty fix
        foreach ($dom->childNodes as $item) {
            if ($item->nodeType == XML_PI_NODE) {
                $dom->removeChild($item); // remove hack
            }
        }
        $dom->encoding = 'UTF-8'; // insert proper
        $tagnames = ['script','style','iframe','applet','video','audio','link','meta','head','form','input','textarea'];  // use array to make DRY
        foreach ($tagnames as $tagname) {
            // getElementsByTagName() returns a live list, so copy it before removing nodes
            foreach (iterator_to_array($dom->getElementsByTagName($tagname)) as $item) {
                $item->parentNode->removeChild($item);
            }
        }

        // same live-list precaution for images
        foreach (iterator_to_array($dom->getElementsByTagName('img')) as $img) {
            // remove any image whose src does not start with an accepted prefix
            if ($img->hasAttribute('src') && !in_array(substr($img->getAttribute('src'),0,7),['base64:','https:/','http://'])) {  // condensed this line
                $img->parentNode->removeChild($img);
            }
        }

        $xpath = new \DOMXPath($dom);
        $nodes = $xpath->query('//@*');
        foreach ($nodes as $node) {
            if (!in_array(str_replace('data-mce-','',$node->nodeName),['href','src','class','style','width','height','data-href','title','target','rel'])) { // i use tinyMCE and sometimes it places data-mce-* attributes
                $node->parentNode->removeAttribute($node->nodeName);
            }
        }
        $html = $dom->saveHTML();
        $buffer = strip_tags($html, '<figure><section><p><strong><em><u><h1><h2><h3><h4><h5><h6><img><li><ol><ul><span><div><br><ins><del><a>');
        $clean = $tidy->repairString($buffer);
        return $clean;
    }
}

These changes not only shorten your code block, they improve readability with curly brackets and tabbing, and improve efficiency by reducing the number of loops.
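
One caveat when removing nodes inside the same loop that finds them: `getElementsByTagName()` returns a live `DOMNodeList`, so removing elements while iterating over it directly can skip some of them. A minimal sketch of a safe pattern, using a hypothetical helper named `remove_elements_by_tag`, copies the list into a plain array first:

```php
<?php
// Hypothetical helper: removes every element with the given tag name.
function remove_elements_by_tag(DOMDocument $dom, string $tagName): void
{
    // Copy the live list into a plain array before mutating the document.
    foreach (iterator_to_array($dom->getElementsByTagName($tagName)) as $element) {
        $element->parentNode->removeChild($element);
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<p>a</p><script>1</script><script>2</script><script>3</script><p>b</p>');
remove_elements_by_tag($dom, 'script');
echo $dom->getElementsByTagName('script')->length, "\n"; // prints 0
```

The copy is a small temporary array per tag name, but it is what makes a single find-and-remove loop reliable.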

Thanks, I have already changed the whole code by now, but I appreciate your advice :)
– user155894, Commented Jan 13, 2018 at 19:09
