As with anything that shouldn't be done, I decided to see whether it is possible to match <script> tags robustly using regexes in PHP. Since script elements cannot be arbitrarily nested, I figured it should at least be possible.
This is what I came up with. It is designed to handle every edge case I could think of, including:
- arbitrary attributes in the opening script tag
- single and multiline comments and single and double-quoted strings (which might include arbitrary escape sequences) in the JavaScript, any of which may contain the characters </script>
- Captures the smallest script tag it finds.
Did I miss anything? Ideally, I want it to match exactly what a browser would consider a script element (which might not be possible), but at the very least, I would like it to match only well-formed script tags containing well-formed JavaScript.
Here is the string for the regex that I am passing to preg_match:
'#<script(?:[^>"]*(?:"[^"]*")?)*>((?:"(?:[^\\\\\\n"]*(?:\\\\.)*)*"|\'(?:[^\\\\\\n\']*(?:\\\\.)*)*\'|<[^/]?|/\\*(?:[^*]|\\*[^/]?)*\\*/|//.*|/[^/*]|[^\'"</])*)</script>#';
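For anyone who wants to poke at it, here is a minimal sketch of how the pattern could be exercised against a few of the edge cases listed above. The helper name `match_script` and the sample inputs are hypothetical, made up for illustration. The pattern in the sketch writes the single-quoted-string branch as `(?:\\\\.)*` so that it mirrors the double-quoted branch, which is an assumption about the intended pattern.

```php
<?php
// Hypothetical helper: returns the captured script body (group 1),
// or null when the pattern does not match.
function match_script(string $html): ?string
{
    // Assumption: the single-quote branch mirrors the double-quote branch.
    $pattern = '#<script(?:[^>"]*(?:"[^"]*")?)*>((?:"(?:[^\\\\\\n"]*(?:\\\\.)*)*"|\'(?:[^\\\\\\n\']*(?:\\\\.)*)*\'|<[^/]?|/\\*(?:[^*]|\\*[^/]?)*\\*/|//.*|/[^/*]|[^\'"</])*)</script>#';

    if (preg_match($pattern, $html, $m) === 1) {
        return $m[1];
    }

    return null; // no match, or preg_match reported an error
}

// Illustrative inputs only; each one targets a case from the list above.
$samples = [
    'attributes'        => '<script type="text/javascript">var a = 1;</script>',
    'closer in string'  => '<script>var s = "</script>";</script>',
    'closer in comment' => '<script>/* </script> */ x();</script>',
    'no script tag'     => '<p>no script here</p>',
];

foreach ($samples as $label => $html) {
    echo $label, ': ';
    var_dump(match_script($html));
}
```

Inputs with several script elements in a row, or with a `<` directly before a `/` outside of strings and comments, are probably worth adding to the list, since those are the places where the "smallest script tag" goal seems most likely to break.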
Note: I am not using this in production.