I'm working on a small project (something like a blog) where users can write articles. To prevent XSS, I wrote the function below, based on a few Stack Overflow answers about removing HTML tags and on some other sources.
<?php
class html
{
    /**
     * Sanitize user-submitted HTML: remove blocked elements, drop every
     * attribute that is not on the allowlist, then tidy the result.
     */
    public static function parse(string $html): string
    {
        $tidy = new \tidy();
        $html = $tidy->repairString($html);

        $dom = new \DOMDocument();
        // Prepend an XML declaration so loadHTML() reads the input as UTF-8.
        $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
        // Dirty fix: remove the processing instruction added above.
        foreach ($dom->childNodes as $item) {
            if ($item->nodeType == XML_PI_NODE) {
                $dom->removeChild($item); // remove hack
            }
        }
        $dom->encoding = 'UTF-8'; // insert proper encoding

        // Elements that are removed together with everything inside them.
        $blocked = ['form', 'input', 'textarea', 'head', 'script', 'style',
                    'iframe', 'applet', 'video', 'audio', 'link', 'meta'];
        $remove = [];
        foreach ($blocked as $tag) {
            foreach ($dom->getElementsByTagName($tag) as $node) {
                $remove[] = $node; // collect first, remove afterwards
            }
        }
        foreach ($remove as $item) {
            $item->parentNode->removeChild($item);
        }

        // Remove images whose src does not start with an accepted prefix.
        // Note: real data URIs start with "data:", so the 'base64:' prefix
        // never matches them.
        $remove_img = [];
        foreach ($dom->getElementsByTagName('img') as $img) {
            foreach ($img->attributes as $attr) {
                if ($attr->nodeName == 'src'
                    && strpos($attr->nodeValue, 'base64:') !== 0
                    && strpos($attr->nodeValue, 'https://') !== 0
                    && strpos($attr->nodeValue, 'http://') !== 0
                ) {
                    $remove_img[] = $img;
                }
            }
        }
        foreach ($remove_img as $item) {
            $item->parentNode->removeChild($item);
        }

        // Strip every attribute that is not on the allowlist.
        $allowed = ['href', 'src', 'class', 'style', 'width', 'height',
                    'data-href', 'title', 'target', 'rel'];
        $xpath = new \DOMXPath($dom);
        foreach ($xpath->query('//@*') as $node) {
            // I use TinyMCE and sometimes it adds data-mce-* attributes.
            $name = str_replace('data-mce-', '', $node->nodeName);
            if (!in_array($name, $allowed)) {
                $node->parentNode->removeAttribute($node->nodeName);
            }
        }

        // Serialize, keep only the allowed tags, and tidy once more.
        $html = $dom->saveHTML();
        $buffer = strip_tags($html, '<figure><section><p><strong><em><u><h1><h2><h3><h4><h5><h6><img><li><ol><ul><span><div><br><ins><del><a>');
        return $tidy->repairString($buffer);
    }
}
I tried to avoid regex for two reasons:
- I'm not good at regex (at all)
- People said not to use it
So far this works fine: it removes bad HTML tags and attributes and cleans up the HTML code. I even ran it against an "XSS test" gist I found on GitHub, and it held up. But can this be better? Can I add something that makes it more secure or faster?
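
One gap that stands out in the code above: `href` is on the attribute allowlist, but its value is never inspected, so a link such as `<a href="javascript:alert(1)">` would pass through. The following is a minimal sketch of a scheme check, not a complete sanitizer. The names `is_safe_url()` and `strip_unsafe_urls()` and the default scheme list are hypothetical choices made for illustration; they are not part of the class above.

```php
<?php
// Hypothetical helper (not part of the original class): returns true when
// the URL is relative or uses one of the explicitly allowed schemes.
function is_safe_url(string $url, array $allowedSchemes = ['http', 'https', 'mailto']): bool
{
    // Browsers strip tabs and newlines from URLs and ignore leading control
    // characters and spaces, so "java\tscript:" still runs. Removing every
    // byte from 0 to 32 before looking at the scheme is deliberately strict.
    $url = str_replace(array_map('chr', range(0, 32)), '', $url);

    // Length of the prefix that contains none of ':', '/', '?' or '#'.
    $schemeLen = strcspn($url, ':/?#');

    // No colon before the path, query or fragment: a relative URL.
    if ($schemeLen === strlen($url) || $url[$schemeLen] !== ':') {
        return true;
    }

    $scheme = strtolower(substr($url, 0, $schemeLen));
    return in_array($scheme, $allowedSchemes, true);
}

// Hypothetical helper: removes every href attribute whose value fails the
// check above. Assumes the DOM extension, which the original code uses too.
function strip_unsafe_urls(DOMDocument $dom): void
{
    $xpath = new DOMXPath($dom);
    // Copy the result into an array before changing the tree.
    foreach (iterator_to_array($xpath->query('//@href')) as $attr) {
        if (!is_safe_url($attr->nodeValue)) {
            $attr->ownerElement->removeAttribute($attr->nodeName);
        }
    }
}
```

Under these assumptions, `strip_unsafe_urls($dom)` could be called just before `$dom->saveHTML()`. The check uses only string functions, so it stays regex-free; whether `mailto` or other schemes belong on the list depends on what the articles need.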