HtmlPruner
extends AbstractHtmlProcessor
in package
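This class removes unwanted content from HTML: elements styled with "display: none;" and classes in `class` attributes that are no longer required.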
Table of Contents
- CONTENT_TYPE_META_TAG = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
- DEFAULT_DOCUMENT_TYPE = '<!DOCTYPE html>'
- HTML_COMMENT_PATTERN = '/<!--[^-]*+(?:-(?!->)[^-]*+)*+(?:-->|$)/'
- regular expression pattern to match an HTML comment, including delimiters and modifiers
- HTML_TEMPLATE_ELEMENT_PATTERN = '%<template[\s>][^<]*+(?:<(?!/template>)[^<]*+)*+(?:</template>|$)%i'
- regular expression pattern to match an HTML `<template>` element, including delimiters and modifiers
- PHP_UNRECOGNIZED_VOID_TAGNAME_MATCHER = '(?:command|embed|keygen|source|track|wbr)'
- TAGNAME_ALLOWED_BEFORE_BODY_MATCHER = '(?:html|head|base|command|link|meta|noscript|script|style|template|title)'
- Regular expression part to match tag names that may appear before the start of the `<body>` element. A start tag for any other element would implicitly start the `<body>` element due to tag omission rules.
- DISPLAY_NONE_MATCHER = '//*[@style and contains(translate(translate(@style," ",""),"NOE","noe"),"display:none")' . ' and not(@class and contains(concat(" ", normalize-space(@class), " "), " -emogrifier-keep "))]'
- We need to look for display:none, but we need to do a case-insensitive search. Since DOMDocument only supports XPath 1.0, lower-case() isn't available to us. We've thus far only set attributes to lowercase, not attribute values. Consequently, we need to translate() the letters that would be in 'NONE' ("NOE") to lowercase.
- $domDocument : DOMDocument|null
- $xPath : DOMXPath|null
- fromDomDocument() : static
- Builds a new instance from the given DOM document.
- fromHtml() : static
- Builds a new instance from the given HTML.
- getDomDocument() : DOMDocument
- Provides access to the internal DOMDocument representation of the HTML in its current state.
- removeElementsWithDisplayNone() : $this
- Removes elements that have a "display: none;" style.
- removeRedundantClasses() : $this
- Removes classes that are no longer required (e.g. because there are no longer any CSS rules that reference them) from `class` attributes.
- removeRedundantClassesAfterCssInlined() : $this
- After CSS has been inlined, there will likely be some classes in `class` attributes that are no longer referenced by any remaining (uninlinable) CSS. This method removes such classes.
- render() : string
- Renders the normalized and processed HTML.
- renderBodyContent() : string
- Renders the content of the BODY element of the normalized and processed HTML.
- getHtmlElement() : DOMElement
- Returns the HTML element.
- getXPath() : DOMXPath
- __construct() : mixed
- The constructor.
- addContentTypeMetaTag() : string
- Adds a Content-Type meta tag for the charset.
- createRawDomDocument() : void
- Creates a DOMDocument instance from the given HTML and stores it in $this->domDocument.
- createUnifiedDomDocument() : void
- Creates a DOM document from the given HTML and stores it in $this->domDocument.
- ensureDocumentType() : string
- Makes sure that the passed HTML has a document type, with lowercase "html".
- ensureExistenceOfBodyElement() : void
- Checks that $this->domDocument has a BODY element and adds it if it is missing.
- ensurePhpUnrecognizedSelfClosingTagsAreXml() : string
- Makes sure that any self-closing tags not recognized as such by PHP's DOMDocument implementation have a self-closing slash.
- getBodyElement() : DOMElement
- Returns the BODY element.
- hasContentTypeMetaTagInHead() : bool
- Tests whether the given HTML has a valid `Content-Type` metadata element within the `<head>` element. Due to tag omission rules, HTML parsers are expected to end the `<head>` element and start the `<body>` element upon encountering a start tag for any element which is permitted only within the `<body>`.
- hasEndOfHeadElement() : bool
- Tests whether the `<head>` element ends within the given HTML. Due to tag omission rules, HTML parsers are expected to end the `<head>` element and start the `<body>` element upon encountering a start tag for any element which is permitted only within the `<body>`.
- normalizeDocumentType() : string
- Makes sure the document type in the passed HTML has lowercase "html".
- prepareHtmlForDomConversion() : string
- Returns the HTML with added document type, Content-Type meta tag, and self-closing slashes, if needed, ensuring that the HTML will be good for creating a DOM document from it.
- removeClassAttributeFromElements() : void
- Removes the `class` attribute from each element in `$elements`.
- removeClassesFromElements() : void
- Removes classes from the `class` attribute of each element in `$elements`, except any in `$classesToKeep`, removing the `class` attribute itself if the resultant list is empty.
- removeHtmlComments() : string
- Removes comments from the given HTML, including any which are unterminated, for which the remainder of the string is removed.
- removeHtmlTemplateElements() : string
- Removes `<template>` elements from the given HTML, including any without an end tag, for which the remainder of the string is removed.
- removeSelfClosingTagsClosingTags() : string
- Eliminates any invalid closing tags for void elements from the given HTML.
- setDomDocument() : void
- setHtml() : void
- Sets the HTML to process.
Constants
CONTENT_TYPE_META_TAG
protected
string
CONTENT_TYPE_META_TAG
= '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
DEFAULT_DOCUMENT_TYPE
protected
string
DEFAULT_DOCUMENT_TYPE
= '<!DOCTYPE html>'
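As an illustration of how ensureDocumentType() and normalizeDocumentType() might use this constant, here is a minimal sketch. The function name `ensureDocumentTypeSketch` and the regular expression are assumptions made for this example, not the library's actual code.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for the protected class constant documented above.
const SKETCH_DEFAULT_DOCUMENT_TYPE = '<!DOCTYPE html>';

/**
 * Prepends the default document type if none is present; otherwise rewrites
 * the first HTML document type so that it reads "<!DOCTYPE html".
 */
function ensureDocumentTypeSketch(string $html): string
{
    if (stripos($html, '<!DOCTYPE') !== false) {
        // Case-insensitive match, replaced at most once.
        $normalized = preg_replace('/<!DOCTYPE\s++html(?=[\s>])/i', '<!DOCTYPE html', $html, 1);

        return $normalized ?? $html;
    }

    return SKETCH_DEFAULT_DOCUMENT_TYPE . $html;
}
```

With this sketch, both `<p>x</p>` and `<!doctype HTML><p>x</p>` come out as `<!DOCTYPE html><p>x</p>`.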
HTML_COMMENT_PATTERN
regular expression pattern to match an HTML comment, including delimiters and modifiers
protected
string
HTML_COMMENT_PATTERN
= '/<!--[^-]*+(?:-(?!->)[^-]*+)*+(?:-->|$)/'
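The pattern can be tried out on its own. The sketch below copies it into a hypothetical function `removeHtmlCommentsSketch` (the name is an assumption, and the real removeHtmlComments() method may differ) to show that both terminated and unterminated comments are stripped.

```php
<?php

declare(strict_types=1);

// Copy of the pattern documented above; the constant name is hypothetical.
const SKETCH_HTML_COMMENT_PATTERN = '/<!--[^-]*+(?:-(?!->)[^-]*+)*+(?:-->|$)/';

/**
 * Removes HTML comments. An unterminated comment swallows the rest of the
 * string, because the pattern accepts the end of the input in place of "-->".
 */
function removeHtmlCommentsSketch(string $html): string
{
    return preg_replace(SKETCH_HTML_COMMENT_PATTERN, '', $html) ?? $html;
}
```

The possessive quantifiers (`*+`) prevent backtracking, which keeps matching time predictable on long inputs.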
HTML_TEMPLATE_ELEMENT_PATTERN
regular expression pattern to match an HTML `<template>` element, including delimiters and modifiers
protected
string
HTML_TEMPLATE_ELEMENT_PATTERN
= '%<template[\s>][^<]*+(?:<(?!/template>)[^<]*+)*+(?:</template>|$)%i'
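A minimal sketch of applying this pattern follows. The function name `removeHtmlTemplateElementsSketch` is an assumption for illustration; the real removeHtmlTemplateElements() method may be implemented differently.

```php
<?php

declare(strict_types=1);

// Copy of the pattern documented above; the constant name is hypothetical.
const SKETCH_HTML_TEMPLATE_ELEMENT_PATTERN
    = '%<template[\s>][^<]*+(?:<(?!/template>)[^<]*+)*+(?:</template>|$)%i';

/**
 * Removes <template> elements. A <template> without an end tag swallows the
 * rest of the string, because the pattern accepts the end of the input in
 * place of "</template>".
 */
function removeHtmlTemplateElementsSketch(string $html): string
{
    return preg_replace(SKETCH_HTML_TEMPLATE_ELEMENT_PATTERN, '', $html) ?? $html;
}
```

The `i` modifier makes the match case-insensitive, so `<TEMPLATE>` is treated the same as `<template>`.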
PHP_UNRECOGNIZED_VOID_TAGNAME_MATCHER
protected
string
PHP_UNRECOGNIZED_VOID_TAGNAME_MATCHER
= '(?:command|embed|keygen|source|track|wbr)'
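This constant is only a fragment of a regular expression. The sketch below shows one way it could be embedded in a full pattern to add the self-closing slash described for ensurePhpUnrecognizedSelfClosingTagsAreXml(). The function name and the surrounding pattern are assumptions, not necessarily what the library does.

```php
<?php

declare(strict_types=1);

// Copy of the fragment documented above; the constant name is hypothetical.
const SKETCH_VOID_TAGNAME_MATCHER = '(?:command|embed|keygen|source|track|wbr)';

/**
 * Appends "/" to start tags of the listed void elements unless the tag
 * already ends with a slash.
 */
function addSelfClosingSlashSketch(string $html): string
{
    // Match "<tagname ..." up to, but not including, the closing ">",
    // and only if the last matched character is not already "/".
    $pattern = '%<' . SKETCH_VOID_TAGNAME_MATCHER . '\b[^>]*+(?<!/)(?=>)%i';

    return preg_replace($pattern, '$0/', $html) ?? $html;
}
```

For example, `a<wbr>b` becomes `a<wbr/>b`, while `<wbr/>` and `<wbr />` are left unchanged.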
TAGNAME_ALLOWED_BEFORE_BODY_MATCHER
Regular expression part to match tag names that may appear before the start of the `<body>` element. A start tag for any other element would implicitly start the `<body>` element due to tag omission rules.
protected
string
TAGNAME_ALLOWED_BEFORE_BODY_MATCHER
= '(?:html|head|base|command|link|meta|noscript|script|style|template|title)'
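To make the tag omission rule concrete, the following sketch embeds the fragment in a negative lookahead to detect a start tag that would implicitly begin the `<body>` element. The function name and the full pattern are assumptions for illustration, and the sketch ignores comments, `<template>` content, and script content.

```php
<?php

declare(strict_types=1);

// Copy of the fragment documented above; the constant name is hypothetical.
const SKETCH_TAGNAME_ALLOWED_BEFORE_BODY_MATCHER
    = '(?:html|head|base|command|link|meta|noscript|script|style|template|title)';

/**
 * Tests whether the HTML contains a start tag for an element that is not
 * allowed before <body>. End tags and "<!..." constructs are skipped because
 * the character after "<" must be a letter.
 */
function hasTagThatStartsBodySketch(string $html): bool
{
    $pattern = '%<(?!' . SKETCH_TAGNAME_ALLOWED_BEFORE_BODY_MATCHER . '[\s/>])[a-z][^>]*+>%i';

    return preg_match($pattern, $html) === 1;
}
```

The `[\s/>]` after the fragment ensures that whole tag names are compared, so `<header>` is not mistaken for `<head>`.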
DISPLAY_NONE_MATCHER
We need to look for display:none, but we need to do a case-insensitive search. Since DOMDocument only supports XPath 1.0, lower-case() isn't available to us. We've thus far only set attributes to lowercase, not attribute values. Consequently, we need to translate() the letters that would be in 'NONE' ("NOE") to lowercase.
private
string
DISPLAY_NONE_MATCHER
= '//*[@style and contains(translate(translate(@style," ",""),"NOE","noe"),"display:none")' . ' and not(@class and contains(concat(" ", normalize-space(@class), " "), " -emogrifier-keep "))]'
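The expression can be exercised directly with PHP's DOM extension. In the sketch below, the function name `findDisplayNoneIdsSketch`, the constant name, and the sample markup are assumptions. It returns the `id` attributes of the matching elements so that the effect of the `-emogrifier-keep` class is visible.

```php
<?php

declare(strict_types=1);

// Copy of the XPath expression documented above; the constant name is hypothetical.
const SKETCH_DISPLAY_NONE_MATCHER
    = '//*[@style and contains(translate(translate(@style," ",""),"NOE","noe"),"display:none")'
    . ' and not(@class and contains(concat(" ", normalize-space(@class), " "), " -emogrifier-keep "))]';

/**
 * Returns the id attributes of all elements matched by the expression.
 *
 * @return array<int, string>
 */
function findDisplayNoneIdsSketch(string $html): array
{
    $document = new DOMDocument();
    $document->loadHTML($html);
    $xPath = new DOMXPath($document);

    $ids = [];
    foreach ($xPath->query(SKETCH_DISPLAY_NONE_MATCHER) as $element) {
        if ($element instanceof DOMElement) {
            $ids[] = $element->getAttribute('id');
        }
    }

    return $ids;
}
```

The inner `translate()` strips spaces and the outer one lowercases only the letters N, O, and E, so `display: NONE;` matches. An element carrying the `-emogrifier-keep` class is excluded even if it is hidden.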
Properties
$domDocument
protected
DOMDocument|null
$domDocument
= null
$xPath
private
DOMXPath|null
$xPath
= null
Methods
fromDomDocument()
Builds a new instance from the given DOM document.
public
static fromDomDocument(DOMDocument $document) : static
Parameters
- $document : DOMDocument
  a DOM document returned by getDomDocument() of another instance
Return values
static
fromHtml()
Builds a new instance from the given HTML.
public
static fromHtml(string $unprocessedHtml) : static
Parameters
- $unprocessedHtml : string
  raw HTML, must be UTF-8-encoded, must not be empty
Return values
static
getDomDocument()
Provides access to the internal DOMDocument representation of the HTML in its current state.
public
getDomDocument() : DOMDocument
Return values
DOMDocument
removeElementsWithDisplayNone()
Removes elements that have a "display: none;" style.
public
removeElementsWithDisplayNone() : $this
Return values
$this
removeRedundantClasses()
Removes classes that are no longer required (e.g. because there are no longer any CSS rules that reference them) from `class` attributes.
public
removeRedundantClasses([array<array-key, string> $classesToKeep = [] ]) : $this
Note that this does not inspect the CSS, but expects to be provided with a list of classes that are still in use.
This method also has the (presumably beneficial) side-effect of minifying (removing superfluous whitespace from)
class attributes.
Parameters
- $classesToKeep : array<array-key, string> = []
  names of classes that should not be removed
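The per-attribute effect described above (dropping classes that are not in `$classesToKeep`, and collapsing whitespace as a side effect) can be sketched as follows. The function `filterClassAttributeSketch` is hypothetical and works on a single attribute value rather than on DOM elements. It returns `null` where the real method would remove the `class` attribute altogether.

```php
<?php

declare(strict_types=1);

/**
 * Keeps only the classes listed in $classesToKeep, joined by single spaces.
 *
 * @param array<array-key, string> $classesToKeep
 *
 * @return string|null the filtered attribute value, or null if no class remains
 */
function filterClassAttributeSketch(string $classAttribute, array $classesToKeep): ?string
{
    // Split on any run of whitespace, ignoring leading and trailing whitespace.
    $classes = preg_split('/\s++/', trim($classAttribute), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $keptClasses = array_values(array_intersect($classes, $classesToKeep));

    return $keptClasses === [] ? null : implode(' ', $keptClasses);
}
```

Because the kept classes are re-joined with single spaces, an attribute value such as `"  foo   bar\tbaz "` filtered with `['bar', 'baz']` becomes `bar baz`.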
Return values
$this
removeRedundantClassesAfterCssInlined()
After CSS has been inlined, there will likely be some classes in `class` attributes that are no longer referenced by any remaining (uninlinable) CSS. This method removes such classes.
public
removeRedundantClassesAfterCssInlined(CssInliner $cssInliner) : $this
Note that it does not inspect the remaining CSS, but uses information readily available from the CssInliner
instance about the CSS rules that could not be inlined.
Parameters
- $cssInliner : CssInliner
  object instance that performed the CSS inlining
Return values
$this
render()
Renders the normalized and processed HTML.
public
render() : string
Return values
string
renderBodyContent()
Renders the content of the BODY element of the normalized and processed HTML.
public
renderBodyContent() : string
Return values
string
getHtmlElement()
Returns the HTML element.
protected
getHtmlElement() : DOMElement
This method assumes that there always is an HTML element, throwing an exception otherwise.
Return values
DOMElement
getXPath()
protected
getXPath() : DOMXPath
Return values
DOMXPath
__construct()
The constructor.
private
__construct() : mixed
Please use ::fromHtml or ::fromDomDocument instead.
Return values
mixed
addContentTypeMetaTag()
Adds a Content-Type meta tag for the charset.
private
addContentTypeMetaTag(string $html) : string
This method also ensures that there is a HEAD element.
Parameters
- $html : string
Return values
string: the HTML with the meta tag added
createRawDomDocument()
Creates a DOMDocument instance from the given HTML and stores it in $this->domDocument.
private
createRawDomDocument(string $html) : void
Parameters
- $html : string
Return values
void
createUnifiedDomDocument()
Creates a DOM document from the given HTML and stores it in $this->domDocument.
private
createUnifiedDomDocument(string $html) : void
The DOM document will always have a BODY element and a document type.
Parameters
- $html : string
Return values
void
ensureDocumentType()
Makes sure that the passed HTML has a document type, with lowercase "html".
private
ensureDocumentType(string $html) : string
Parameters
- $html : string
Return values
string: HTML with document type
ensureExistenceOfBodyElement()
Checks that $this->domDocument has a BODY element and adds it if it is missing.
private
ensureExistenceOfBodyElement() : void
Return values
void
ensurePhpUnrecognizedSelfClosingTagsAreXml()
Makes sure that any self-closing tags not recognized as such by PHP's DOMDocument implementation have a self-closing slash.
private
ensurePhpUnrecognizedSelfClosingTagsAreXml(string $html) : string
Parameters
- $html : string
Return values
string: HTML with problematic tags converted
getBodyElement()
Returns the BODY element.
private
getBodyElement() : DOMElement
This method assumes that there always is a BODY element.
Return values
DOMElement
hasContentTypeMetaTagInHead()
Tests whether the given HTML has a valid `Content-Type` metadata element within the `<head>` element. Due to tag omission rules, HTML parsers are expected to end the `<head>` element and start the `<body>` element upon encountering a start tag for any element which is permitted only within the `<body>`.
private
hasContentTypeMetaTagInHead(string $html) : bool
Parameters
- $html : string
Return values
bool
hasEndOfHeadElement()
Tests whether the `<head>` element ends within the given HTML. Due to tag omission rules, HTML parsers are expected to end the `<head>` element and start the `<body>` element upon encountering a start tag for any element which is permitted only within the `<body>`.
private
hasEndOfHeadElement(string $html) : bool
Parameters
- $html : string
Return values
bool
normalizeDocumentType()
Makes sure the document type in the passed HTML has lowercase "html".
private
normalizeDocumentType(string $html) : string
Parameters
- $html : string
Return values
string: HTML with normalized document type
prepareHtmlForDomConversion()
Returns the HTML with added document type, Content-Type meta tag, and self-closing slashes, if needed, ensuring that the HTML will be good for creating a DOM document from it.
private
prepareHtmlForDomConversion(string $html) : string
Parameters
- $html : string
Return values
string: the unified HTML
removeClassAttributeFromElements()
Removes the `class` attribute from each element in `$elements`.
private
removeClassAttributeFromElements(DOMNodeList $elements) : void
Parameters
- $elements : DOMNodeList
Return values
void
removeClassesFromElements()
Removes classes from the `class` attribute of each element in `$elements`, except any in `$classesToKeep`, removing the `class` attribute itself if the resultant list is empty.
private
removeClassesFromElements(DOMNodeList $elements, array<array-key, string> $classesToKeep) : void
Parameters
- $elements : DOMNodeList
- $classesToKeep : array<array-key, string>
Return values
void
removeHtmlComments()
Removes comments from the given HTML, including any which are unterminated, for which the remainder of the string is removed.
private
removeHtmlComments(string $html) : string
Parameters
- $html : string
Return values
string
removeHtmlTemplateElements()
Removes `<template>` elements from the given HTML, including any without an end tag, for which the remainder of the string is removed.
private
removeHtmlTemplateElements(string $html) : string
Parameters
- $html : string
Return values
string
removeSelfClosingTagsClosingTags()
Eliminates any invalid closing tags for void elements from the given HTML.
private
removeSelfClosingTagsClosingTags(string $html) : string
Parameters
- $html : string
Return values
string
setDomDocument()
private
setDomDocument(DOMDocument $domDocument) : void
Parameters
- $domDocument : DOMDocument
Return values
void
setHtml()
Sets the HTML to process.
private
setHtml(string $html) : void
Parameters
- $html : string
  the HTML to process, must be UTF-8-encoded
