This demo started after someone asked the simple question "How do I find image tags within a block of text that are not XHTML compliant (that is, they lack the proper trailing slash)?" The problem with most solutions was that they did not handle the case where an attribute of the image tag contained a ">" character. To detect this case correctly, it was necessary to build a regex that finds tags in a very robust fashion. Using the RFC as a guideline, the following rules were created to match various portions of tags, and then ultimately the whole tag once they were combined. Rule 13 is the final regex that solves the original problem.
When you enter an HTML tag below (only one tag) and hit the button, the string will be tested against a set of rules. The results of the rules will be presented below. Try entering invalid tags to see the results.
Results of the test against the above string:
Rule 7: At which position does it contain a valid tag?
1
Rule 8: At which position does it contain a valid tag (using relaxed rules)?
1
Rule 9: At which position does it contain a valid start tag?
0
Rule 11: At which position does it contain a valid empty tag?
1
Rule 12: At which position does it contain a valid empty tag without any whitespace before the slash?
0
Rule 13: At which position does it contain a valid img start tag?
0
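The positions above appear to be the 1-based index of the first match, with 0 meaning no match. The following is a minimal Python sketch of that kind of check, not the demo's own code. It rests on several assumptions: Python's `re` module has no POSIX bracket classes, so `[[:alpha:]]` and `[[:space:]]` are approximated by `[A-Za-z]` and `\s`; the helper name `position` is hypothetical; and the sample input `<img src="a.gif" />` is an assumed string (the string actually tested is not shown here), chosen because it yields the same six results.

```python
import re

# Assumption: approximate POSIX [[:alpha:]] and [[:space:]] for Python's re.
ALPHA = r"[A-Za-z]"
SPACE = r"\s"

# Zero or more attribute/value pairs (rule 4 in this demo's numbering).
ATTRS = rf"""({SPACE}+{ALPHA}+{SPACE}*={SPACE}*("([^"]*)"|'([^']*)'))*"""

# Only the rules reported in the results list are reproduced here.
RULES = {
    7: rf"""<{ALPHA}+{ATTRS}[^"'/>]*/?>""",
    8: rf"""<{ALPHA}+{ATTRS}[^>]*>""",
    9: rf"""<{ALPHA}+{ATTRS}{SPACE}*>""",
    11: rf"""<{ALPHA}+{ATTRS}{SPACE}*/>""",
    12: rf"""<{ALPHA}+{ATTRS}/>""",
    13: rf"""<img{ATTRS}{SPACE}*>""",
}


def position(rule, text):
    """Return the 1-based position of the first match, or 0 if none (hypothetical helper)."""
    match = re.search(RULES[rule], text)
    return match.start() + 1 if match else 0


sample = '<img src="a.gif" />'  # assumed input, not taken from the demo
results = {rule: position(rule, sample) for rule in RULES}
print(results)  # {7: 1, 8: 1, 9: 0, 11: 1, 12: 0, 13: 0}
```

Because the sample is an empty tag with a space before the slash, the start-tag rules (9 and 13) and the no-whitespace rule (12) report 0, while the general and empty-tag rules report 1.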
1. Defines an attribute + equal (e.g., title=)
[[:alpha:]]+[[:space:]]*=[[:space:]]*
2. Defines a value
("([^"]*)"|'([^']*)')
3. Defines an attribute/value pair (e.g., title="whatever")
[[:alpha:]]+[[:space:]]*=[[:space:]]*("([^"]*)"|'([^']*)')
4. Defines multiple attribute/value pairs (e.g., title="whatever" href="somethingelse")
([[:space:]]+[[:alpha:]]+[[:space:]]*=[[:space:]]*("([^"]*)"|'([^']*)'))*
5. Defines a tag without a trailing slash and no additional characters after the last attribute value
<[[:alpha:]]+([[:space:]]+[[:alpha:]]+[[:space:]]*=[[:space:]]*("([^"]*)"|'([^']*)'))*>
6. Defines a valid tag using strict rules (no valueless attributes)
<[[:alpha:]]+([[:space:]]+[[:alpha:]]+[[:space:]]*=[[:space:]]*("([^"]*)"|'([^']*)'))*[[:space:]]*/?>
7. Defines a valid tag using less strict rules (allows for valueless attributes at end)
<[[:alpha:]]+([[:space:]]+[[:alpha:]]+[[:space:]]*=[[:space:]]*("([^"]*)"|'([^']*)'))*[^"'/>]*/?>
8. Defines a valid tag with slightly relaxed rules (allows for anything after the last attribute)
<[[:alpha:]]+([[:space:]]+[[:alpha:]]+[[:space:]]*=[[:space:]]*("([^"]*)"|'([^']*)'))*[^>]*>
9. Defines a valid "start" tag (no trailing slash)
<[[:alpha:]]+([[:space:]]+[[:alpha:]]+[[:space:]]*=[[:space:]]*("([^"]*)"|'([^']*)'))*[[:space:]]*>
10. Defines a valid "end" tag (has leading slash)
</[[:alpha:]]+([[:space:]]+[[:alpha:]]+[[:space:]]*=[[:space:]]*("([^"]*)"|'([^']*)'))*[[:space:]]*>
11. Defines a valid "empty" tag (has trailing slash)
<[[:alpha:]]+([[:space:]]+[[:alpha:]]+[[:space:]]*=[[:space:]]*("([^"]*)"|'([^']*)'))*[[:space:]]*/>
12. Defines a valid "empty" tag (has trailing slash) without whitespace before the slash
<[[:alpha:]]+([[:space:]]+[[:alpha:]]+[[:space:]]*=[[:space:]]*("([^"]*)"|'([^']*)'))*/>

If we, for example, want to find an image tag that is not closed, we'd make a small change to rule 9 to find img start tags:

13. Defines an unclosed img tag
<img([[:space:]]+[[:alpha:]]+[[:space:]]*=[[:space:]]*("([^"]*)"|'([^']*)'))*[[:space:]]*>
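To illustrate why the quoted-value handling matters for the original question, here is a rough Python sketch that scans a block of text with rule 13 and, for contrast, with a simpler pattern. It is an illustration under assumptions rather than the demo's implementation: `[[:alpha:]]` and `[[:space:]]` are approximated by `[A-Za-z]` and `\s` because Python's `re` lacks POSIX classes, the `NAIVE` pattern is a hypothetical example of a less careful attempt, and the sample text is made up.

```python
import re

# Assumption: [A-Za-z] and \s stand in for [[:alpha:]] and [[:space:]].
ATTRS = r"""(\s+[A-Za-z]+\s*=\s*("([^"]*)"|'([^']*)'))*"""

# Rule 13: an img start tag with quoted attribute values and no trailing slash.
RULE_13 = re.compile(r"<img" + ATTRS + r"\s*>")

# Hypothetical naive attempt: stop at the first ">" not preceded by "/".
NAIVE = re.compile(r"<img[^>]*[^/]>")

# Made-up sample: the first img is XHTML compliant, the second is not.
# Both carry a ">" inside a quoted attribute value.
text = '<p><img src="a.gif" alt="1 > 0"/> and <img src="b.gif" alt="x > y"></p>'

rule_13_hits = [m.group(0) for m in RULE_13.finditer(text)]
naive_hits = [m.group(0) for m in NAIVE.finditer(text)]

print(rule_13_hits)  # ['<img src="b.gif" alt="x > y">']
print(naive_hits)    # ['<img src="a.gif" alt="1 >', '<img src="b.gif" alt="x >']
```

The naive pattern stops at the ">" inside the alt value, so it flags the compliant tag and truncates both matches. Rule 13 consumes each quoted value as a unit and reports only the tag that lacks the trailing slash.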