AbstractHtmlProcessor
Base class for HTML processors that can, e.g., remove, add or modify nodes or attributes.
The "vanilla" subclass is the HtmlNormalizer.
Table of Contents
- CONTENT_TYPE_META_TAG = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
- DEFAULT_DOCUMENT_TYPE = '<!DOCTYPE html>'
- HTML_COMMENT_PATTERN = '/<!--[^-]*+(?:-(?!->)[^-]*+)*+(?:-->|$)/'
- Regular expression pattern (including delimiters and modifiers) that matches an HTML comment.
- HTML_TEMPLATE_ELEMENT_PATTERN = '%<template[\s>][^<]*+(?:<(?!/template>)[^<]*+)*+(?:</template>|$)%i'
- Regular expression pattern (including delimiters and modifiers) that matches an HTML `<template>` element.
- PHP_UNRECOGNIZED_VOID_TAGNAME_MATCHER = '(?:command|embed|keygen|source|track|wbr)'
- TAGNAME_ALLOWED_BEFORE_BODY_MATCHER = '(?:html|head|base|command|link|meta|noscript|script|style|template|title)'
- Regular expression part to match tag names that may appear before the start of the `<body>` element. A start tag for any other element would implicitly start the `<body>` element due to tag omission rules.
- $domDocument : DOMDocument|null
- $xPath : DOMXPath|null
- fromDomDocument() : static
- Builds a new instance from the given DOM document.
- fromHtml() : static
- Builds a new instance from the given HTML.
- getDomDocument() : DOMDocument
- Provides access to the internal DOMDocument representation of the HTML in its current state.
- render() : string
- Renders the normalized and processed HTML.
- renderBodyContent() : string
- Renders the content of the BODY element of the normalized and processed HTML.
- getHtmlElement() : DOMElement
- Returns the HTML element.
- getXPath() : DOMXPath
- __construct() : mixed
- The constructor.
- addContentTypeMetaTag() : string
- Adds a Content-Type meta tag for the charset.
- createRawDomDocument() : void
- Creates a DOMDocument instance from the given HTML and stores it in $this->domDocument.
- createUnifiedDomDocument() : void
- Creates a DOM document from the given HTML and stores it in $this->domDocument.
- ensureDocumentType() : string
- Makes sure that the passed HTML has a document type, with lowercase "html".
- ensureExistenceOfBodyElement() : void
- Checks that $this->domDocument has a BODY element and adds it if it is missing.
- ensurePhpUnrecognizedSelfClosingTagsAreXml() : string
- Makes sure that any self-closing tags not recognized as such by PHP's DOMDocument implementation have a self-closing slash.
- getBodyElement() : DOMElement
- Returns the BODY element.
- hasContentTypeMetaTagInHead() : bool
- Tests whether the given HTML has a valid `Content-Type` metadata element within the `<head>` element. Due to tag omission rules, HTML parsers are expected to end the `<head>` element and start the `<body>` element upon encountering a start tag for any element which is permitted only within the `<body>`.
- hasEndOfHeadElement() : bool
- Tests whether the `<head>` element ends within the given HTML. Due to tag omission rules, HTML parsers are expected to end the `<head>` element and start the `<body>` element upon encountering a start tag for any element which is permitted only within the `<body>`.
- normalizeDocumentType() : string
- Makes sure the document type in the passed HTML has lowercase "html".
- prepareHtmlForDomConversion() : string
- Returns the HTML with a document type, a Content-Type meta tag, and self-closing slashes added where needed, so that the HTML is suitable for creating a DOM document.
- removeHtmlComments() : string
- Removes comments from the given HTML. If a comment is unterminated, the remainder of the string is removed.
- removeHtmlTemplateElements() : string
- Removes `<template>` elements from the given HTML. If an element has no end tag, the remainder of the string is removed.
- removeSelfClosingTagsClosingTags() : string
- Eliminates any invalid closing tags for void elements from the given HTML.
- setDomDocument() : void
- setHtml() : void
- Sets the HTML to process.
Constants
CONTENT_TYPE_META_TAG
protected string CONTENT_TYPE_META_TAG = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
DEFAULT_DOCUMENT_TYPE
protected string DEFAULT_DOCUMENT_TYPE = '<!DOCTYPE html>'
HTML_COMMENT_PATTERN
Regular expression pattern (including delimiters and modifiers) that matches an HTML comment.
protected string HTML_COMMENT_PATTERN = '/<!--[^-]*+(?:-(?!->)[^-]*+)*+(?:-->|$)/'
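The behaviour of this pattern can be tried out in isolation. The following is a minimal sketch, not the library's code: the constant is protected, so the pattern is copied into a local constant, and `stripHtmlComments()` is a hypothetical helper name.

```php
<?php
declare(strict_types=1);

// Copy of the documented pattern; the class constant itself is protected.
const COMMENT_PATTERN = '/<!--[^-]*+(?:-(?!->)[^-]*+)*+(?:-->|$)/';

/**
 * Hypothetical helper: removes every match of the comment pattern.
 */
function stripHtmlComments(string $html): string
{
    // preg_replace() returns null on a PCRE error; fall back to the input then.
    return preg_replace(COMMENT_PATTERN, '', $html) ?? $html;
}

echo stripHtmlComments('<p>a<!-- note -->b</p>'), "\n";    // <p>ab</p>
echo stripHtmlComments('<p>a</p><!-- never closed'), "\n"; // <p>a</p>
```

The final alternative `(?:-->|$)` is what lets an unterminated comment match up to the end of the string.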
HTML_TEMPLATE_ELEMENT_PATTERN
Regular expression pattern (including delimiters and modifiers) that matches an HTML `<template>` element.
protected string HTML_TEMPLATE_ELEMENT_PATTERN = '%<template[\s>][^<]*+(?:<(?!/template>)[^<]*+)*+(?:</template>|$)%i'
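As an illustration of what this pattern matches, here is a small sketch that applies a copy of it with `preg_replace()`. The helper name `stripTemplateElements()` is hypothetical and the code is not taken from the class.

```php
<?php
declare(strict_types=1);

// Copy of the documented pattern; the class constant itself is protected.
const TEMPLATE_PATTERN = '%<template[\s>][^<]*+(?:<(?!/template>)[^<]*+)*+(?:</template>|$)%i';

/**
 * Hypothetical helper: removes every match of the template pattern.
 */
function stripTemplateElements(string $html): string
{
    // preg_replace() returns null on a PCRE error; fall back to the input then.
    return preg_replace(TEMPLATE_PATTERN, '', $html) ?? $html;
}

echo stripTemplateElements('a<template><p>x</p></template>b'), "\n"; // ab
echo stripTemplateElements('a<template>never closed'), "\n";         // a
```

The `[\s>]` after the tag name keeps longer names that merely start with "template" from matching, and the `i` modifier makes the match case-insensitive.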
PHP_UNRECOGNIZED_VOID_TAGNAME_MATCHER
protected string PHP_UNRECOGNIZED_VOID_TAGNAME_MATCHER = '(?:command|embed|keygen|source|track|wbr)'
TAGNAME_ALLOWED_BEFORE_BODY_MATCHER
Regular expression part to match tag names that may appear before the start of the `<body>` element. A start tag for any other element would implicitly start the `<body>` element due to tag omission rules.
protected string TAGNAME_ALLOWED_BEFORE_BODY_MATCHER = '(?:html|head|base|command|link|meta|noscript|script|style|template|title)'
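One way such a regular expression part might be used is to detect where the `<head>` element ends, either explicitly or through tag omission. The sketch below is an assumption about how the part could be embedded in a full pattern; `headEndsWithin()` is a hypothetical name, and the sketch does not strip comments or `<template>` elements first.

```php
<?php
declare(strict_types=1);

// Copy of the documented regular expression part.
const ALLOWED_BEFORE_BODY = '(?:html|head|base|command|link|meta|noscript|script|style|template|title)';

/**
 * Hypothetical helper: true if the HTML contains either an explicit </head>
 * or a start tag whose name is not allowed before the <body> element.
 */
function headEndsWithin(string $html): bool
{
    // "<" followed by a tag name that is NOT one of the allowed names,
    // or a literal </head> end tag.
    $pattern = '%<(?!' . ALLOWED_BEFORE_BODY . '[\s/>])\w|</head>%i';

    return preg_match($pattern, $html) === 1;
}

var_dump(headEndsWithin('<html><head><title>t</title>'));       // bool(false)
var_dump(headEndsWithin('<head><meta charset="utf-8"><p>x'));   // bool(true)
```

The `[\s/>]` inside the lookahead makes sure that `<header>` is not mistaken for `<head>`.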
Properties
$domDocument
protected DOMDocument|null $domDocument = null
$xPath
private DOMXPath|null $xPath = null
Methods
fromDomDocument()
Builds a new instance from the given DOM document.
public static fromDomDocument(DOMDocument $document) : static
Parameters
- $document : DOMDocument
  a DOM document returned by getDomDocument() of another instance
Return values
static
fromHtml()
Builds a new instance from the given HTML.
public static fromHtml(string $unprocessedHtml) : static
Parameters
- $unprocessedHtml : string
  raw HTML, must be UTF-8-encoded, must not be empty
Return values
static
getDomDocument()
Provides access to the internal DOMDocument representation of the HTML in its current state.
public getDomDocument() : DOMDocument
Return values
DOMDocument
render()
Renders the normalized and processed HTML.
public render() : string
Return values
string
renderBodyContent()
Renders the content of the BODY element of the normalized and processed HTML.
public renderBodyContent() : string
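To illustrate what "the content of the BODY element" means, the following sketch renders it with PHP's built-in DOM extension only. It is an assumption about the general approach, not the class's implementation, and `renderBodyContentSketch()` is a hypothetical name.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for renderBodyContent(); not the library's code.
 */
function renderBodyContentSketch(DOMDocument $document): string
{
    $body = $document->getElementsByTagName('body')->item(0);
    if (!$body instanceof DOMElement) {
        throw new UnexpectedValueException('There is no BODY element.');
    }

    // Serialize only the BODY node, then strip its own start and end tags.
    $bodyHtml = (string) $document->saveHTML($body);

    return preg_replace('%</?+body(?:\s[^>]*+)?+>%i', '', $bodyHtml) ?? '';
}

// Suppress libxml warnings while parsing, then restore the previous setting.
$previousState = libxml_use_internal_errors(true);
$document = new DOMDocument();
$document->loadHTML(
    '<!DOCTYPE html><html><head><title>Demo</title></head>'
    . '<body class="mail"><p>Hello</p></body></html>'
);
libxml_clear_errors();
libxml_use_internal_errors($previousState);

$bodyContent = renderBodyContentSketch($document);
echo $bodyContent, "\n";
```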
Return values
string
getHtmlElement()
Returns the HTML element.
protected getHtmlElement() : DOMElement
This method assumes that an HTML element always exists and throws an exception otherwise.
Return values
DOMElement
getXPath()
protected getXPath() : DOMXPath
Return values
DOMXPath
__construct()
The constructor.
private __construct() : mixed
Please use ::fromHtml or ::fromDomDocument instead.
Return values
mixed
addContentTypeMetaTag()
Adds a Content-Type meta tag for the charset.
private addContentTypeMetaTag(string $html) : string
This method also ensures that there is a HEAD element.
Parameters
- $html : string
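A rough sketch of how such an insertion could work on the HTML string is shown below. It is an assumption, not the class's code: `addMetaTagSketch()` is a hypothetical name, the check for an already existing meta tag is left out, and in the last case the sketch relies on the HTML parser to create the HEAD element implicitly.

```php
<?php
declare(strict_types=1);

// Copy of the documented CONTENT_TYPE_META_TAG value.
const META_TAG = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

/**
 * Hypothetical helper: inserts the meta tag at the earliest sensible position.
 */
function addMetaTagSketch(string $html): string
{
    if (preg_match('/<head[\s>]/i', $html) === 1) {
        // Insert right after the existing <head ...> start tag.
        return preg_replace('/<head(?=[\s>])([^>]*+)>/i', '<head$1>' . META_TAG, $html, 1) ?? $html;
    }
    if (stripos($html, '<html') !== false) {
        // No HEAD yet: create one right after the <html ...> start tag,
        // which keeps the attributes of the HTML element intact.
        return preg_replace('/<html(.*?)>/is', '<html$1><head>' . META_TAG . '</head>', $html, 1) ?? $html;
    }

    // Neither HTML nor HEAD: prepend the tag to the markup.
    return META_TAG . $html;
}

echo addMetaTagSketch('<html lang="en"><body><p>x</p></body></html>'), "\n";
```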
Return values
string: the HTML with the meta tag added
createRawDomDocument()
Creates a DOMDocument instance from the given HTML and stores it in $this->domDocument.
private createRawDomDocument(string $html) : void
Parameters
- $html : string
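Creating a DOMDocument from an HTML string typically looks like the sketch below. The settings shown are assumptions for illustration, not necessarily those used by the class, and `createDomDocumentSketch()` is a hypothetical name that returns the document instead of storing it in a property.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: parses HTML into a DOMDocument without emitting warnings.
 */
function createDomDocumentSketch(string $html): DOMDocument
{
    $document = new DOMDocument();
    $document->strictErrorChecking = false;
    $document->formatOutput = false;

    // Collect libxml warnings internally instead of raising PHP warnings,
    // then restore whatever setting was active before.
    $previousState = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previousState);

    return $document;
}

$document = createDomDocumentSketch(
    '<!DOCTYPE html><html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    . '</head><body><p>Hello</p></body></html>'
);
echo $document->getElementsByTagName('p')->length, "\n"; // 1
```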
Return values
void
createUnifiedDomDocument()
Creates a DOM document from the given HTML and stores it in $this->domDocument.
private createUnifiedDomDocument(string $html) : void
The DOM document will always have a BODY element and a document type.
Parameters
- $html : string
Return values
void
ensureDocumentType()
Makes sure that the passed HTML has a document type, with lowercase "html".
private ensureDocumentType(string $html) : string
Parameters
- $html : string
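The described behaviour can be sketched with plain string functions. This is an illustration under assumptions, not the class's code: `ensureDocumentTypeSketch()` is a hypothetical name, and the exact pattern used for normalizing is a guess.

```php
<?php
declare(strict_types=1);

// Copy of the documented DEFAULT_DOCUMENT_TYPE value.
const DOCUMENT_TYPE = '<!DOCTYPE html>';

/**
 * Hypothetical helper: prepends a document type if none exists,
 * otherwise normalizes the existing one to '<!DOCTYPE html'.
 */
function ensureDocumentTypeSketch(string $html): string
{
    if (stripos($html, '<!DOCTYPE') === false) {
        return DOCUMENT_TYPE . $html;
    }

    // Only the first occurrence is rewritten (limit 1).
    return preg_replace('/<!DOCTYPE\s++html(?=[\s>])/i', '<!DOCTYPE html', $html, 1) ?? $html;
}

echo ensureDocumentTypeSketch('<p>x</p>'), "\n";                // <!DOCTYPE html><p>x</p>
echo ensureDocumentTypeSketch('<!doctype HTML><p>x</p>'), "\n"; // <!DOCTYPE html><p>x</p>
```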
Return values
string: HTML with document type
ensureExistenceOfBodyElement()
Checks that $this->domDocument has a BODY element and adds it if it is missing.
private ensureExistenceOfBodyElement() : void
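With PHP's DOM extension, a check of this kind could look like the following sketch. It is not the class's code: `ensureBodyElementSketch()` is a hypothetical name, and it takes the document as a parameter instead of reading `$this->domDocument`.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: appends an empty BODY element if the document has none.
 */
function ensureBodyElementSketch(DOMDocument $document): void
{
    if ($document->getElementsByTagName('body')->item(0) instanceof DOMElement) {
        return;
    }

    $htmlElement = $document->getElementsByTagName('html')->item(0);
    if (!$htmlElement instanceof DOMElement) {
        throw new UnexpectedValueException('There is no HTML element.');
    }

    $htmlElement->appendChild($document->createElement('body'));
}

// Build a document that has an HTML element but no BODY element.
$document = new DOMDocument();
$document->appendChild($document->createElement('html'));

ensureBodyElementSketch($document);
echo $document->getElementsByTagName('body')->length, "\n"; // 1
```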
Return values
void
ensurePhpUnrecognizedSelfClosingTagsAreXml()
Makes sure that any self-closing tags not recognized as such by PHP's DOMDocument implementation have a self-closing slash.
private ensurePhpUnrecognizedSelfClosingTagsAreXml(string $html) : string
Parameters
- $html : string
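The conversion can be illustrated with a regular expression built from the documented tag-name matcher. The sketch below is an assumption about how such a replacement could be written, and `addSelfClosingSlashes()` is a hypothetical name.

```php
<?php
declare(strict_types=1);

// Copy of the documented PHP_UNRECOGNIZED_VOID_TAGNAME_MATCHER value.
const VOID_TAGNAME_MATCHER = '(?:command|embed|keygen|source|track|wbr)';

/**
 * Hypothetical helper: turns e.g. <wbr> into <wbr/> unless the slash is already there.
 */
function addSelfClosingSlashes(string $html): string
{
    // Match the tag up to, but not including, the closing ">",
    // and only if the last matched character is not already a slash.
    $pattern = '%<' . VOID_TAGNAME_MATCHER . '\b[^>]*+(?<!/)(?=>)%i';

    return preg_replace($pattern, '$0/', $html) ?? $html;
}

echo addSelfClosingSlashes('<p>a<wbr>b</p>'), "\n";  // <p>a<wbr/>b</p>
echo addSelfClosingSlashes('<embed src="x">'), "\n"; // <embed src="x"/>
echo addSelfClosingSlashes('<wbr/>'), "\n";          // <wbr/>
```

Without the slash, a parser that does not know these elements are void may treat the following content as their children.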
Return values
string: HTML with problematic tags converted
getBodyElement()
Returns the BODY element.
private getBodyElement() : DOMElement
This method assumes that there always is a BODY element.
Return values
DOMElement
hasContentTypeMetaTagInHead()
Tests whether the given HTML has a valid `Content-Type` metadata element within the `<head>` element. Due to tag omission rules, HTML parsers are expected to end the `<head>` element and start the `<body>` element upon encountering a start tag for any element which is permitted only within the `<body>`.
private hasContentTypeMetaTagInHead(string $html) : bool
Parameters
- $html : string
Return values
bool
hasEndOfHeadElement()
Tests whether the `<head>` element ends within the given HTML. Due to tag omission rules, HTML parsers are expected to end the `<head>` element and start the `<body>` element upon encountering a start tag for any element which is permitted only within the `<body>`.
private hasEndOfHeadElement(string $html) : bool
Parameters
- $html : string
Return values
bool
normalizeDocumentType()
Makes sure the document type in the passed HTML has lowercase "html".
private normalizeDocumentType(string $html) : string
Parameters
- $html : string
Return values
string: HTML with normalized document type
prepareHtmlForDomConversion()
Returns the HTML with a document type, a Content-Type meta tag, and self-closing slashes added where needed, so that the HTML is suitable for creating a DOM document.
private prepareHtmlForDomConversion(string $html) : string
Parameters
- $html : string
Return values
string: the unified HTML
removeHtmlComments()
Removes comments from the given HTML. If a comment is unterminated, the remainder of the string is removed.
private removeHtmlComments(string $html) : string
Parameters
- $html : string
Return values
string
removeHtmlTemplateElements()
Removes `<template>` elements from the given HTML. If an element has no end tag, the remainder of the string is removed.
private removeHtmlTemplateElements(string $html) : string
Parameters
- $html : string
Return values
string
removeSelfClosingTagsClosingTags()
Eliminates any invalid closing tags for void elements from the given HTML.
private removeSelfClosingTagsClosingTags(string $html) : string
Parameters
- $html : string
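A removal of this kind could be sketched as follows. The sketch assumes that the void elements in question are the ones listed in PHP_UNRECOGNIZED_VOID_TAGNAME_MATCHER, which the source does not state explicitly; `removeVoidClosingTags()` is a hypothetical name.

```php
<?php
declare(strict_types=1);

// Copy of the documented PHP_UNRECOGNIZED_VOID_TAGNAME_MATCHER value
// (assumed here to be the relevant list of void elements).
const VOID_TAGNAME_MATCHER = '(?:command|embed|keygen|source|track|wbr)';

/**
 * Hypothetical helper: drops end tags such as </wbr>, which are invalid for void elements.
 */
function removeVoidClosingTags(string $html): string
{
    return preg_replace('%</' . VOID_TAGNAME_MATCHER . '>%i', '', $html) ?? $html;
}

echo removeVoidClosingTags('<p>a<wbr></wbr>b</p>'), "\n"; // <p>a<wbr>b</p>
```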
Return values
string
setDomDocument()
private setDomDocument(DOMDocument $domDocument) : void
Parameters
- $domDocument : DOMDocument
Return values
void
setHtml()
Sets the HTML to process.
private setHtml(string $html) : void
Parameters
- $html : string
  the HTML to process, must be UTF-8-encoded
