
Using regex to find tags without a trailing slash.

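Note: The XML specification (http://www.w3.org/TR/REC-xml) defines the = sign in an attribute/value pair as Eq ::= S? '=' S?, where S is a run of one or more whitespace characters, so whitespace on either side of the = sign is optional and may be more than a single space. Most browsers accept this as well, and these rules were therefore constructed to allow multiple spaces on either side of the = sign.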

Also, these rules allow attributes without values (such as "selected") only at the end of the tag, after all the attribute/value pairs. For XHTML Strict this is not a real limitation, since XHTML doesn't allow valueless (minimized) attributes at all.
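The limitation on valueless attributes can be checked with a short Python sketch. It assumes rules 6 and 7 from the list further down, with the POSIX classes translated for Python's re module ([[:alpha:]] becomes [A-Za-z], [[:space:]] becomes \s), since re does not support POSIX bracket classes. The helper name accepts and the sample tags are made up for illustration.

```python
import re

# Assumption: POSIX classes from the rules translated for Python's re module.
# Rule 4: zero or more attribute/value pairs, each preceded by whitespace.
PAIRS = r"""(\s+[A-Za-z]+\s*=\s*("([^"]*)"|'([^']*)'))*"""

# Rule 6: strict, no valueless attributes anywhere.
RULE_6 = r"<[A-Za-z]+" + PAIRS + r"\s*/?>"
# Rule 7: less strict, valueless attributes allowed after the last pair.
RULE_7 = r"<[A-Za-z]+" + PAIRS + r"""[^"'/>]*/?>"""


def accepts(rule, tag):
    """Return True when the whole string is a tag according to the rule."""
    return re.fullmatch(rule, tag) is not None


at_end = '<option value="1" selected>'     # valueless attribute last
in_middle = '<option selected value="1">'  # valueless attribute first

print(accepts(RULE_7, at_end))     # True
print(accepts(RULE_7, in_middle))  # False
print(accepts(RULE_6, at_end))     # False
```

Rule 7 rejects the second tag because the text after the (empty) run of pairs may not contain a quote character, so a quoted value can never follow a valueless attribute.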

Purpose

This demo started after someone asked the simple question "How do I find image tags within a block of text that are not XHTML compliant (don't have the proper trailing slash)?" The problem with most solutions is that they don't handle the case where an attribute value of the image tag contains a ">" character. To detect this scenario correctly, it was necessary to build a regex that finds tags in a very robust fashion. Using the XML specification cited above as a guideline, the following rules were created to match various portions of tags, and then ultimately the whole tag once they were combined. Rule 13 is the final regex that solves the original problem.
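To make the ">" problem concrete, here is a minimal Python sketch comparing a naive pattern with a quote-aware one. Both patterns and the sample tag are illustrative assumptions rather than the rules of this page verbatim. The quote-aware pattern follows the same idea as the rules below, written with [A-Za-z] and \s because Python's re module has no POSIX bracket classes.

```python
import re

# Naive pattern: stops at the first ">" it sees, even inside a quoted value.
NAIVE_IMG = re.compile(r"<img[^>]*>")

# Quote-aware pattern: a quoted value may contain anything except its own quote.
VALUE = r"""("[^"]*"|'[^']*')"""
ATTR = r"\s+[A-Za-z]+\s*=\s*" + VALUE
ROBUST_IMG = re.compile(r"<img(" + ATTR + r")*\s*/?>")

tag = '<img src="a.png" alt="a > b" />'

naive_match = NAIVE_IMG.search(tag).group(0)
robust_match = ROBUST_IMG.search(tag).group(0)

print(naive_match)   # <img src="a.png" alt="a >
print(robust_match)  # <img src="a.png" alt="a > b" />
```

The naive pattern cuts the tag off inside the alt value, so it would report a properly closed tag as one that lacks its trailing slash.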

Testing The Rules

When you enter an HTML tag below (only one tag) and hit the button, the string will be tested against a set of rules. The results of the rules will be presented below. Try entering invalid tags to see the results.

Test String:
Start At:

Results of the test against the above string:
Rule 7: At which position does it contain a valid tag? 1
Rule 8: At which position does it contain a valid tag (using relaxed rules)? 1
Rule 9: At which position does it contain a valid start tag? 0
Rule 11: At which position does it contain a valid empty tag? 1
Rule 12: At which position does it contain a valid empty tag without any whitespace before the slash? 0
Rule 13: At which position does it contain a valid img start tag? 0
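The results read as 1-based positions, with 0 meaning that no match was found. The following Python sketch reproduces that kind of report under several assumptions: the helper name find_position is made up, the test string <br /> is only a guess at an input that gives the same numbers, the start_at parameter mimics the "Start At" field, and the rules listed under "The Rules" are translated for Python's re module ([A-Za-z] for [[:alpha:]], \s for [[:space:]]).

```python
import re

# Assumption: rule 4 (attribute/value pairs) with POSIX classes translated.
ATTRS = r"""(\s+[A-Za-z]+\s*=\s*("([^"]*)"|'([^']*)'))*"""

RULES = {
    7: r"<[A-Za-z]+" + ATTRS + r"""[^"'/>]*/?>""",
    8: r"<[A-Za-z]+" + ATTRS + r"[^>]*>",
    9: r"<[A-Za-z]+" + ATTRS + r"\s*>",
    11: r"<[A-Za-z]+" + ATTRS + r"\s*/>",
    12: r"<[A-Za-z]+" + ATTRS + r"/>",
    13: r"<img" + ATTRS + r"\s*>",
}


def find_position(pattern, text, start_at=1):
    """Return the 1-based position of the first match, or 0 if there is none."""
    match = re.compile(pattern).search(text, start_at - 1)
    return match.start() + 1 if match else 0


results = {number: find_position(rule, "<br />") for number, rule in RULES.items()}
for number, position in results.items():
    print(f"Rule {number}: {position}")
```

For the assumed input <br />, rules 7, 8 and 11 report position 1 and rules 9, 12 and 13 report 0.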


The Rules

1. Defines an attribute + equal (eg, title=)
[[:alpha:]]+[[:space:]]*=[[:space:]]*

2. Defines a value
("([^"]*)"|'([^']*)')

3. Defines an attribute/value pair (eg, title="whatever")
[[:alpha:]]+[[:space:]]*=[[:space:]]*("([^"]*)"|'([^']*)')

4. Defines multiple attribute/value pairs (eg, title="whatever" href="somethingelse")
([[:space:]]+[[:alpha:]]+[[:space:]]*=[[:space:]]*("([^"]*)"|'([^']*)'))*

5. Defines a tag without a trailing slash and with no additional characters after the last attribute value
<[[:alpha:]]+([[:space:]]+[[:alpha:]]+[[:space:]]*=[[:space:]]*("([^"]*)"|'([^']*)'))*>

6. Defines a valid tag using strict rules (no valueless attributes)
<[[:alpha:]]+([[:space:]]+[[:alpha:]]+[[:space:]]*=[[:space:]]*("([^"]*)"|'([^']*)'))*[[:space:]]*/?>

7. Defines a valid tag using less strict rules (allows for valueless attributes at end)
<[[:alpha:]]+([[:space:]]+[[:alpha:]]+[[:space:]]*=[[:space:]]*("([^"]*)"|'([^']*)'))*[^"'/>]*/?>

8. Defines a valid tag with slightly relaxed rules (allows for anything after the last attribute)
<[[:alpha:]]+([[:space:]]+[[:alpha:]]+[[:space:]]*=[[:space:]]*("([^"]*)"|'([^']*)'))*[^>]*>

9. Defines a valid "start" tag (no trailing slash)
<[[:alpha:]]+([[:space:]]+[[:alpha:]]+[[:space:]]*=[[:space:]]*("([^"]*)"|'([^']*)'))*[[:space:]]*>

10. Defines a valid "end" tag (has leading slash)
</[[:alpha:]]+([[:space:]]+[[:alpha:]]+[[:space:]]*=[[:space:]]*("([^"]*)"|'([^']*)'))*[[:space:]]*>

11. Defines a valid "empty" tag (has trailing slash)
<[[:alpha:]]+([[:space:]]+[[:alpha:]]+[[:space:]]*=[[:space:]]*("([^"]*)"|'([^']*)'))*[[:space:]]*/>

12. Defines a valid "empty" tag (has trailing slash) without whitespace before the slash
<[[:alpha:]]+([[:space:]]+[[:alpha:]]+[[:space:]]*=[[:space:]]*("([^"]*)"|'([^']*)'))*/>
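Each rule above is the previous pieces glued together, so the patterns can be built by plain string concatenation instead of being typed out in full. The sketch below does this in Python for rules 1 to 4 and 9 to 11. The variable names and the classify helper are illustrative, and the POSIX classes are replaced by [A-Za-z] and \s because Python's re module does not support them.

```python
import re

# Assumption: POSIX classes translated for Python's re module.
ALPHA = r"[A-Za-z]"
SPACE = r"\s"

ATTR_EQ = ALPHA + "+" + SPACE + "*=" + SPACE + "*"     # rule 1: title=
VALUE = r"""("([^"]*)"|'([^']*)')"""                    # rule 2: "whatever"
PAIR = ATTR_EQ + VALUE                                  # rule 3: title="whatever"
PAIRS = "(" + SPACE + "+" + PAIR + ")*"                 # rule 4: several pairs

START_TAG = "<" + ALPHA + "+" + PAIRS + SPACE + "*>"    # rule 9
END_TAG = "</" + ALPHA + "+" + PAIRS + SPACE + "*>"     # rule 10
EMPTY_TAG = "<" + ALPHA + "+" + PAIRS + SPACE + "*/>"   # rule 11


def classify(tag):
    """Name the first of rules 9, 10, 11 that matches the whole string."""
    candidates = (("start", START_TAG), ("end", END_TAG), ("empty", EMPTY_TAG))
    for kind, pattern in candidates:
        if re.fullmatch(pattern, tag):
            return kind
    return "no match"


print(classify("<a href=\"x\" title = 'y'>"))  # start
print(classify("</a>"))                        # end
print(classify('<img src="a>b.png"/>'))        # empty
print(classify("<a href=x>"))                  # no match
```

The last call returns "no match" because rule 2 only accepts quoted values.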

If we, for example, want to find an image tag that is not closed, we'd make a small change to rule 9 to find img start tags:

13. Defines an unclosed img tag
<img([[:space:]]+[[:alpha:]]+[[:space:]]*=[[:space:]]*("([^"]*)"|'([^']*)'))*[[:space:]]*>
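As a sketch of how rule 13 might be applied to the original question, the following Python snippet scans a block of text and lists every img tag that lacks the trailing slash. The function name find_unclosed_imgs and the sample markup are made up for illustration, and the pattern is rule 13 with [[:alpha:]] and [[:space:]] translated to [A-Za-z] and \s for Python's re module.

```python
import re

# Rule 13, with POSIX classes translated for Python's re module (assumption).
UNCLOSED_IMG = re.compile(
    r"""<img(\s+[A-Za-z]+\s*=\s*("([^"]*)"|'([^']*)'))*\s*>"""
)


def find_unclosed_imgs(text):
    """Return (1-based position, tag text) for each img tag without a slash."""
    return [(m.start() + 1, m.group(0)) for m in UNCLOSED_IMG.finditer(text)]


html = (
    '<p><img src="a.png" alt="ok" /></p>\n'
    '<p><img src="b.png" alt="1 > 0"></p>\n'
    "<p><img src='c.png'/></p>"
)

found = find_unclosed_imgs(html)
for position, tag in found:
    print(position, tag)
```

Only the second image is reported. The ">" inside its alt value does not end the match early, and the two properly closed images are skipped.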

