WooCommerce Code Reference

HtmlNormalizer extends AbstractHtmlProcessor

Normalizes HTML:
- add a document type (HTML5) if missing
- disentangle incorrectly nested tags
- add HEAD and BODY elements (if they are missing)
- reformat the HTML

Table of Contents

CONTENT_TYPE_META_TAG  = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
DEFAULT_DOCUMENT_TYPE  = '<!DOCTYPE html>'
HTML_COMMENT_PATTERN  = '/<!--[^-]*+(?:-(?!->)[^-]*+)*+(?:-->|$)/'
regular expression pattern to match an HTML comment, including delimiters and modifiers
HTML_TEMPLATE_ELEMENT_PATTERN  = '%<template[\s>][^<]*+(?:<(?!/template>)[^<]*+)*+(?:</template>|$)%i'
regular expression pattern to match an HTML `<template>` element, including delimiters and modifiers
PHP_UNRECOGNIZED_VOID_TAGNAME_MATCHER  = '(?:command|embed|keygen|source|track|wbr)'
TAGNAME_ALLOWED_BEFORE_BODY_MATCHER  = '(?:html|head|base|command|link|meta|noscript|script|style|template|title)'
Regular expression part to match tag names that may appear before the start of the `<body>` element. A start tag for any other element would implicitly start the `<body>` element due to tag omission rules.
$domDocument  : DOMDocument|null
$xPath  : DOMXPath|null
fromDomDocument()  : static
Builds a new instance from the given DOM document.
fromHtml()  : static
Builds a new instance from the given HTML.
getDomDocument()  : DOMDocument
Provides access to the internal DOMDocument representation of the HTML in its current state.
render()  : string
Renders the normalized and processed HTML.
renderBodyContent()  : string
Renders the content of the BODY element of the normalized and processed HTML.
getHtmlElement()  : DOMElement
Returns the HTML element.
getXPath()  : DOMXPath
__construct()  : mixed
The constructor.
addContentTypeMetaTag()  : string
Adds a Content-Type meta tag for the charset.
createRawDomDocument()  : void
Creates a DOMDocument instance from the given HTML and stores it in $this->domDocument.
createUnifiedDomDocument()  : void
Creates a DOM document from the given HTML and stores it in $this->domDocument.
ensureDocumentType()  : string
Makes sure that the passed HTML has a document type, with lowercase "html".
ensureExistenceOfBodyElement()  : void
Checks that $this->domDocument has a BODY element and adds it if it is missing.
ensurePhpUnrecognizedSelfClosingTagsAreXml()  : string
Makes sure that any self-closing tags not recognized as such by PHP's DOMDocument implementation have a self-closing slash.
getBodyElement()  : DOMElement
Returns the BODY element.
hasContentTypeMetaTagInHead()  : bool
Tests whether the given HTML has a valid `Content-Type` metadata element within the `<head>` element. Due to tag omission rules, HTML parsers are expected to end the `<head>` element and start the `<body>` element upon encountering a start tag for any element which is permitted only within the `<body>`.
hasEndOfHeadElement()  : bool
Tests whether the `<head>` element ends within the given HTML. Due to tag omission rules, HTML parsers are expected to end the `<head>` element and start the `<body>` element upon encountering a start tag for any element which is permitted only within the `<body>`.
normalizeDocumentType()  : string
Makes sure the document type in the passed HTML has lowercase "html".
prepareHtmlForDomConversion()  : string
Returns the HTML with added document type, Content-Type meta tag, and self-closing slashes, if needed, ensuring that the HTML will be good for creating a DOM document from it.
removeHtmlComments()  : string
Removes comments from the given HTML, including any which are unterminated, for which the remainder of the string is removed.
removeHtmlTemplateElements()  : string
Removes `<template>` elements from the given HTML, including any without an end tag, for which the remainder of the string is removed.
removeSelfClosingTagsClosingTags()  : string
Eliminates any invalid closing tags for void elements from the given HTML.
setDomDocument()  : void
setHtml()  : void
Sets the HTML to process.

Constants

TAGNAME_ALLOWED_BEFORE_BODY_MATCHER

Regular expression part to match tag names that may appear before the start of the `<body>` element. A start tag for any other element would implicitly start the `<body>` element due to tag omission rules.

protected string TAGNAME_ALLOWED_BEFORE_BODY_MATCHER = '(?:html|head|base|command|link|meta|noscript|script|style|template|title)'
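
As a rough illustration of how this expression part can be used, it can be wrapped in anchors and delimiters to classify a single tag name. The helper name `mayAppearBeforeBody()` is hypothetical and not part of the class; only the matcher value is taken from the constant above.

```php
<?php

/**
 * Hypothetical helper, not part of HtmlNormalizer: tells whether a start tag
 * with the given name may appear before the start of the <body> element.
 */
function mayAppearBeforeBody(string $tagName): bool
{
    // Value copied from TAGNAME_ALLOWED_BEFORE_BODY_MATCHER above.
    $matcher = '(?:html|head|base|command|link|meta|noscript|script|style|template|title)';

    // Anchor the part and match case-insensitively, since HTML tag names
    // are case-insensitive.
    return preg_match('/^' . $matcher . '$/i', $tagName) === 1;
}

var_dump(mayAppearBeforeBody('meta')); // bool(true)
var_dump(mayAppearBeforeBody('p'));    // bool(false): <p> would implicitly start <body>
```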

Properties

Methods

fromHtml()

Builds a new instance from the given HTML.

public static fromHtml(string $unprocessedHtml) : static
Parameters
$unprocessedHtml : string

raw HTML, must be UTF-8 encoded, must not be empty

Tags
throws
InvalidArgumentException

if $unprocessedHtml is anything other than a non-empty string

Return values
static

createUnifiedDomDocument()

Creates a DOM document from the given HTML and stores it in $this->domDocument.

private createUnifiedDomDocument(string $html) : void

The DOM document will always have a BODY element and a document type.

Parameters
$html : string
Return values
void

ensurePhpUnrecognizedSelfClosingTagsAreXml()

Makes sure that any self-closing tags not recognized as such by PHP's DOMDocument implementation have a self-closing slash.

private ensurePhpUnrecognizedSelfClosingTagsAreXml(string $html) : string
Parameters
$html : string
Return values
string

HTML with problematic tags converted.
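
The conversion described above can be sketched with a single regular expression replacement. This is a minimal sketch, not the library's code: the function name and the exact expression are assumptions, and only the list of tag names comes from `PHP_UNRECOGNIZED_VOID_TAGNAME_MATCHER`.

```php
<?php

/**
 * Sketch only: adds a self-closing slash to the void tags listed in
 * PHP_UNRECOGNIZED_VOID_TAGNAME_MATCHER. Function name and pattern are
 * assumptions made for illustration.
 */
function addSelfClosingSlashesSketch(string $html): string
{
    $voidTagNames = '(?:command|embed|keygen|source|track|wbr)';

    // Match the tag up to, but not including, its closing ">", and only if
    // the tag does not already end with a slash.
    $pattern = '%<' . $voidTagNames . '\b[^>]*+(?<!/)(?=>)%i';

    // Fall back to the unchanged input if the replacement fails.
    return preg_replace($pattern, '$0/', $html) ?? $html;
}

echo addSelfClosingSlashesSketch('<p>long<wbr>word <wbr/></p>'), "\n";
// <p>long<wbr/>word <wbr/></p>
```

The possessive quantifier combined with the lookbehind leaves tags that already carry a slash untouched, so running the function twice gives the same result as running it once.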

hasContentTypeMetaTagInHead()

Tests whether the given HTML has a valid `Content-Type` metadata element within the `<head>` element. Due to tag omission rules, HTML parsers are expected to end the `<head>` element and start the `<body>` element upon encountering a start tag for any element which is permitted only within the `<body>`.

private hasContentTypeMetaTagInHead(string $html) : bool
Parameters
$html : string
Return values
bool

hasEndOfHeadElement()

Tests whether the `<head>` element ends within the given HTML. Due to tag omission rules, HTML parsers are expected to end the `<head>` element and start the `<body>` element upon encountering a start tag for any element which is permitted only within the `<body>`.

private hasEndOfHeadElement(string $html) : bool
Parameters
$html : string
Tags
throws
RuntimeException
Return values
bool
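
The tag omission rule described above can be approximated as follows. This is a simplified sketch under stated assumptions: the function name and pattern are illustrative, and this version makes no attempt to skip comments, `<template>` content, or script text, so it can report false positives on such input.

```php
<?php

/**
 * Sketch only: reports whether the <head> element ends within the given HTML,
 * either explicitly (</head>) or implicitly (a start tag for an element that
 * is not allowed before <body>).
 */
function hasEndOfHeadElementSketch(string $html): bool
{
    // Value copied from TAGNAME_ALLOWED_BEFORE_BODY_MATCHER.
    $allowedBeforeBody = '(?:html|head|base|command|link|meta|noscript|script|style|template|title)';

    // Either "<" followed by a tag name that is not in the allowed list,
    // or a literal </head> end tag.
    $pattern = '%<(?!' . $allowedBeforeBody . '[\s/>])\w|</head>%i';

    return preg_match($pattern, $html) === 1;
}

var_dump(hasEndOfHeadElementSketch('<html><head><title>t</title>')); // bool(false)
var_dump(hasEndOfHeadElementSketch('<html><title>t</title><p>x'));   // bool(true)
```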

prepareHtmlForDomConversion()

Returns the HTML with added document type, Content-Type meta tag, and self-closing slashes, if needed, ensuring that the HTML will be good for creating a DOM document from it.

private prepareHtmlForDomConversion(string $html) : string
Parameters
$html : string
Return values
string

the unified HTML
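
Of the three preparation steps named above, the document type step might look roughly like this. It is a sketch under assumptions: the function name is hypothetical, the expression is illustrative, and only the `<!DOCTYPE html>` value comes from `DEFAULT_DOCUMENT_TYPE`.

```php
<?php

/**
 * Sketch only: prepends the default document type if none is present,
 * otherwise rewrites the first document type so that "html" is lowercase.
 */
function ensureDocumentTypeSketch(string $html): string
{
    // Value copied from DEFAULT_DOCUMENT_TYPE.
    $defaultDocumentType = '<!DOCTYPE html>';

    if (stripos($html, '<!DOCTYPE') === false) {
        return $defaultDocumentType . $html;
    }

    // Only the first occurrence is replaced; later ones could be example text.
    return preg_replace('/<!DOCTYPE\s++html(?=[\s>])/i', '<!DOCTYPE html', $html, 1) ?? $html;
}

echo ensureDocumentTypeSketch('<!doctype HTML><p>x</p>'), "\n";
// <!DOCTYPE html><p>x</p>
```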

removeHtmlComments()

Removes comments from the given HTML, including any which are unterminated, for which the remainder of the string is removed.

private removeHtmlComments(string $html) : string
Parameters
$html : string
Tags
throws
RuntimeException
Return values
string
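
To show how `HTML_COMMENT_PATTERN` behaves on terminated and unterminated comments, here is a minimal sketch. The function name is hypothetical, and throwing `RuntimeException` when `preg_replace()` fails is an assumption about when the documented exception occurs.

```php
<?php

/**
 * Sketch only: removes HTML comments using the value of HTML_COMMENT_PATTERN.
 */
function removeHtmlCommentsSketch(string $html): string
{
    // Value copied from HTML_COMMENT_PATTERN. The trailing "(?:-->|$)" is what
    // lets an unterminated comment swallow the remainder of the string.
    $pattern = '/<!--[^-]*+(?:-(?!->)[^-]*+)*+(?:-->|$)/';

    $result = preg_replace($pattern, '', $html);
    if (!is_string($result)) {
        // Assumption: a failed replacement is reported as a RuntimeException.
        throw new RuntimeException('Removing HTML comments failed.');
    }

    return $result;
}

echo removeHtmlCommentsSketch('<p>a<!-- note -->b</p><!-- never closed'), "\n";
// <p>ab</p>
```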

removeHtmlTemplateElements()

Removes `<template>` elements from the given HTML, including any without an end tag, for which the remainder of the string is removed.

private removeHtmlTemplateElements(string $html) : string
Parameters
$html : string
Tags
throws
RuntimeException
Return values
string
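
`HTML_TEMPLATE_ELEMENT_PATTERN` can be exercised the same way. The sketch below uses a hypothetical function name and, as an assumption, falls back to the unchanged input if the replacement fails instead of throwing.

```php
<?php

/**
 * Sketch only: removes <template> elements using the value of
 * HTML_TEMPLATE_ELEMENT_PATTERN. Matching is case-insensitive ("i" modifier).
 */
function removeHtmlTemplateElementsSketch(string $html): string
{
    // Value copied from HTML_TEMPLATE_ELEMENT_PATTERN.
    $pattern = '%<template[\s>][^<]*+(?:<(?!/template>)[^<]*+)*+(?:</template>|$)%i';

    return preg_replace($pattern, '', $html) ?? $html;
}

echo removeHtmlTemplateElementsSketch('a<template><p>x</p></template>b'), "\n";
// ab
echo removeHtmlTemplateElementsSketch('a<TEMPLATE id="t">never closed'), "\n";
// a
```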