PHP – How to Parse Html DOM with DOMDocument

By Silver Moon | August 5, 2020

DOMDocument

The DOMDocument class of PHP is a handy one that can be used for a number of tasks such as parsing xml and html, and creating xml. It is documented in the DOM section of the PHP manual.

In this tutorial we are going to see how to use this class to parse html content. The need to parse html arises when you are, for example, writing scrapers or similar data extraction scripts.

Sample html

The following is the sample html file that we are going to use with DOMDocument.

<html>
	<body>
		<div id="mango">
			This is the mango div. It has some text and a form too.
			<form>
				<input type="text" name="first_name" value="Yahoo" />
				<input type="text" name="last_name" value="Bingo" />
			</form>
			
			<table class="inner">
				<tr><td>Happy</td><td>Sky</td></tr>
			</table>
		</div>
		
		<table id="data" class="outer">
			<tr><td>Happy</td><td>Sky</td></tr>
			<tr><td>Happy</td><td>Sky</td></tr>
			<tr><td>Happy</td><td>Sky</td></tr>
			<tr><td>Happy</td><td>Sky</td></tr>
			<tr><td>Happy</td><td>Sky</td></tr>
		</table>
	</body>
</html>

1. Loading the html

The first thing to do is to construct a DOMDocument object and load the html content into it. Let's see how to do that.

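// a new dom object
$dom = new DOMDocument();

// ask the parser to discard white space; this must be set before loading
// and mostly matters for xml, so whitespace text nodes may remain in html
$dom->preserveWhiteSpace = false;

// load the html into the object ($html is a string holding the sample html above)
$dom->loadHTML($html);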

Done. The $dom object has loaded the html content and can be used to extract contents from the whole html structure, just like it's done in javascript. The most commonly used functions are getElementsByTagName and getElementById.
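The snippets in this tutorial assume that $html already holds the page as a string. Here is a minimal sketch of one way to set that up, assuming a shortened copy of the sample markup kept in a heredoc (the variable names $previous, $loaded and $errors are only illustrative). It also uses libxml_use_internal_errors so that parser warnings about imperfect markup are collected instead of being printed.

```php
<?php
// Assumption: a shortened copy of the sample html, kept in a string.
$html = <<<HTML
<html>
	<body>
		<div id="mango">This is the mango div.</div>
		<table id="data" class="outer">
			<tr><td>Happy</td><td>Sky</td></tr>
		</table>
	</body>
</html>
HTML;

// collect parser warnings instead of printing them
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// look at what the parser complained about, then restore the old setting
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

echo $loaded ? "loaded" : "failed";
echo ", warnings: " . count($errors) . PHP_EOL;
```

When the html comes from a file or a remote page, file_get_contents can fill $html in the same way.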

Now that the html is loaded, it's time to see how nodes and child elements can be accessed.

2. Get an element by its html id

This will get hold of a node/element by using its ID.

//get element by id
$mango_div = $dom->getElementById('mango');

if(!$mango_div)
{
	die("Element not found");
}

echo "element found";

Getting the value/html of a node

The "nodeValue" property of a node gives its text content, with all html tags inside it stripped. For example

echo $mango_div->nodeValue;

The second method is to use the saveHTML function, which returns the html of that particular node.

echo $dom->saveHTML($mango_div);

Note that the function saveHTML is called on the dom object and the node object is passed as a parameter. The saveHTML function will provide the whole html (outer html) of the node including the node's own html tags as well.
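To make the difference between the two concrete, here is a small sketch that runs both on a one-line fragment. The fragment is made up for illustration and is not the sample file above.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<div id="mango">Hello <b>world</b></div>');

$div = $dom->getElementById('mango');

// text only, tags stripped
echo $div->nodeValue . PHP_EOL;        // Hello world

// outer html, including the div tag itself
echo $dom->saveHTML($div) . PHP_EOL;   // <div id="mango">Hello <b>world</b></div>
```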

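Another function called C14N returns the canonical xml form of the node. The result is close to the saveHTML output, but it follows xml rules, so for example empty elements such as input get an explicit closing tag.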

//echo the contents of mango_div element
echo $mango_div->C14N();
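The two outputs are close but not identical. The following sketch, on a made-up form fragment, prints both so they can be compared. The canonical form writes the input element with a separate closing tag, while saveHTML leaves it as a void html tag.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<form id="f"><input type="text" name="first_name"></form>');

$form = $dom->getElementById('f');

// html serialization of the node
$as_html = $dom->saveHTML($form);

// canonical xml serialization of the same node
$as_c14n = $form->C14N();

echo $as_html . PHP_EOL;
echo $as_c14n . PHP_EOL;
```

If the output is going to be shown in a browser or stored as html, saveHTML is probably the safer choice of the two.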

Inner html

To get just the inner html take the following approach. It adds up the html of all of the child nodes.

$tables = $dom->getElementsByTagName('table');

echo get_inner_html($tables->item(0));

function get_inner_html( $node ) 
{
	$innerHTML= '';
	$children = $node->childNodes;
	
	foreach ($children as $child)
	{
		$innerHTML .= $child->ownerDocument->saveXML( $child );
	}
	
	return $innerHTML;
}

The function get_inner_html returns the inner html of the element passed to it. Note that we used the saveXML function instead of the saveHTML function; PHP versions before 5.3.6 did not accept a node argument in saveHTML. The property "childNodes" provides the child nodes of an element. These are the direct children only.
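The choice of serializer shows up in the output. Below is a sketch of a variant, with the hypothetical name inner_markup, that lets the caller pick between saveHTML and saveXML. With a fragment containing a br tag, the html form should print it as a plain void tag and the xml form as a self-closed tag.

```php
<?php
// Hypothetical variant of get_inner_html that lets the caller pick the serializer.
function inner_markup(DOMNode $node, bool $as_xml = false): string
{
	$out = '';
	foreach ($node->childNodes as $child) {
		$doc = $child->ownerDocument;
		$out .= $as_xml ? $doc->saveXML($child) : $doc->saveHTML($child);
	}
	return $out;
}

$dom = new DOMDocument();
$dom->loadHTML('<p id="p">one<br>two</p>');
$p = $dom->getElementById('p');

$html_form = inner_markup($p);
$xml_form  = inner_markup($p, true);

echo $html_form . PHP_EOL;
echo $xml_form . PHP_EOL;
```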

3. Getting elements by tagname

This will get elements by tag name.

$tables = $dom->getElementsByTagName('table');

foreach($tables as $table)
{
	echo $dom->saveHTML($table);
}

The function getElementsByTagName returns an object of type DOMNodeList that can be iterated like an array of DOMNode objects. Another way to fetch the nodes of the list is by using the item function.

$tables = $dom->getElementsByTagName('table');

echo "Found : ".$tables->length. " items";

$i = 0;
while($table = $tables->item($i++))
{
	echo $dom->saveHTML($table);
}

The item function takes the index of the item to be fetched and returns null when the index is out of range, which is what ends the while loop above. The length property of the DOMNodeList gives the number of objects found.
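A common next step is to pull the text out of each matched node into a plain php array. Here is a short sketch of that, using a one-row table invented for the example:

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<table><tr><td>Happy</td><td>Sky</td></tr></table>');

$cells = $dom->getElementsByTagName('td');

// gather the text of every cell into a plain php array
$texts = array();
foreach ($cells as $cell) {
	$texts[] = trim($cell->nodeValue);
}

echo $cells->length . " cells: " . implode(", ", $texts) . PHP_EOL;   // 2 cells: Happy, Sky

// asking for an index past the end gives null
var_dump($cells->item(5));   // NULL
```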

4. Get the attributes of an element

Every DOMNode has a property called "attributes"; for an element it is a collection (a DOMNamedNodeMap) of all the html attributes of that node.
Here is a quick example

$tables = $dom->getElementsByTagName('table');

$i = 0;

while($table = $tables->item($i++))
{
	foreach($table->attributes as $attr)
	{
		echo $attr->name . " " . $attr->value . "<br />";
	}
}

To get a particular attribute using its name, use the "getNamedItem" function on the attributes object.

$tables = $dom->getElementsByTagName('table');

$i = 0;

while($table = $tables->item($i++))
{
	$class_node = $table->attributes->getNamedItem('class');
	
	if($class_node)
	{
		echo "Class is : " . $class_node->value . PHP_EOL;
	}
}
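When the node is known to be an element, the DOMElement methods hasAttribute and getAttribute are a shorter route to the same value. A minimal sketch, again on a made-up fragment:

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<table id="data" class="outer"><tr><td>Happy</td></tr></table>');

$table = $dom->getElementsByTagName('table')->item(0);

// hasAttribute and getAttribute are methods of DOMElement
if ($table->hasAttribute('class')) {
	echo "Class is : " . $table->getAttribute('class') . PHP_EOL;   // Class is : outer
}

// a missing attribute gives an empty string rather than an error
var_dump($table->getAttribute('summary'));   // string(0) ""
```

Because a missing attribute and an empty one both come back as an empty string, hasAttribute is the way to tell them apart.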

5. Children of a node

A DOMNode has the following properties that provide access to its children:

1. childNodes
2. firstChild
3. lastChild

$tables = $dom->getElementsByTagName('table');

$table = $tables->item(1);

//get the number of child nodes of the 2nd table (rows plus any whitespace text nodes)
echo $table->childNodes->length; 

//content of each child
foreach($table->childNodes as $child)
{
	echo $child->ownerDocument->saveHTML($child);
}
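Since childNodes can include whitespace text nodes between the rows, the length may be larger than the number of rows. One way around this, sketched below on a small two-row table, is to keep only the children whose nodeType is XML_ELEMENT_NODE.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML("<table id=\"data\">\n<tr><td>Happy</td></tr>\n<tr><td>Sky</td></tr>\n</table>");

$table = $dom->getElementById('data');

// keep only element children, skipping whitespace text nodes
$rows = array();
foreach ($table->childNodes as $child) {
	if ($child->nodeType === XML_ELEMENT_NODE) {
		$rows[] = $child;
	}
}

echo "child nodes: " . $table->childNodes->length . PHP_EOL;
echo "element children: " . count($rows) . PHP_EOL;
```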

Checking if child nodes exist

The hasChildNodes function can be used to check if a node has any children at all.
Quick example

if( $table->hasChildNodes() )
{
	//print content of children
	foreach($table->childNodes as $child)
	{
		echo $child->ownerDocument->saveHTML($child);
	}
}
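The firstChild and lastChild properties listed earlier can be used in the same spirit. The sketch below, on a made-up list without any whitespace between the items, reads both ends and then walks the children through nextSibling instead of childNodes.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<ul id="list"><li>one</li><li>two</li><li>three</li></ul>');

$list = $dom->getElementById('list');

if ($list->hasChildNodes()) {
	echo "first: " . $list->firstChild->nodeValue . PHP_EOL;   // first: one
	echo "last: " . $list->lastChild->nodeValue . PHP_EOL;     // last: three
}

// walk the children through nextSibling instead of childNodes
$values = array();
for ($node = $list->firstChild; $node !== null; $node = $node->nextSibling) {
	$values[] = $node->nodeValue;
}
echo implode(",", $values) . PHP_EOL;   // one,two,three

// a freshly created element has no child nodes
$empty = $dom->createElement('div');
var_dump($empty->hasChildNodes());   // bool(false)
```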

6. Comparing 2 elements for equality

It might be necessary to check whether the element in one variable is the same as the element in another variable. The function "isSameNode" is used for this. It is called on one node, and the other node is passed as the parameter. If both refer to the same node, boolean true is returned.

$tables = $dom->getElementsByTagName('table');

$table = $tables->item(1);

$table2 = $dom->getElementById('data');

var_dump($table->isSameNode($table2));

The var_dump would show true, indicating that $table and $table2 refer to the same table.
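Note that isSameNode checks identity, not similar content. As a quick sketch on a made-up fragment, a deep copy made with cloneNode has the same content but is a different node:

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<table id="data"><tr><td>Happy</td></tr></table>');

$by_tag = $dom->getElementsByTagName('table')->item(0);
$by_id  = $dom->getElementById('data');

// both variables point at the same node in the tree
var_dump($by_tag->isSameNode($by_id));   // bool(true)

// a deep copy has the same content but is a different node
$copy = $by_id->cloneNode(true);
var_dump($copy->isSameNode($by_id));     // bool(false)
```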

Conclusion

The above examples showed how DOMDocument can be used to access elements in an html document in an object oriented manner. DOMDocument can not only parse html but also create and modify html and xml. In later articles we shall see how to do that.


17 Comments

Matt K.
December 24, 2023 at 3:03 am

Thank you for the post

Bert Hooyman
April 8, 2022 at 2:02 pm

Here is a working php script that adds some enhancements to this useful article:

<?php
$html = file_get_contents("sample html document.html");
echo "0 this is the input HTML:";
echo "";
echo htmlentities($html);
echo "";

// a new dom object
$dom = new DOMDocument;

// load the html into the object
$dom->loadHTML($html);

// discard white space
removeEmptyTextNodes($dom);
// preserveWhiteSpace does not help us much
$dom->preserveWhiteSpace = false;

//get element by id
$mango_div = $dom->getElementById('mango');

if(!$mango_div)
{
	die("Element not found");
}

echo "1 'mango' element found";
echo "2 the node value of the 'mango' element:";
echo $mango_div->nodeValue;
echo "3 saveHTML on the mango div:";
echo $dom->saveHTML($mango_div);
echo "4 Another way to do this using canonicalization (C14N) of the node itself:";
//echo the contents of mango_div element
echo $mango_div->C14N();

echo "5 retrieving the inner HTML of the '.inner' table:";
$tables = $dom->getElementsByTagName('table');
echo get_inner_html($tables->item(0));

echo "6 get elements by tag name 'table':";
$tables = $dom->getElementsByTagName('table');
foreach($tables as $table)
{
	echo $dom->saveHTML($table);
}
echo "7 alternative way to use getElementsByTagName, using the item() method:";
$tables = $dom->getElementsByTagName('table');
echo "Found : ".$tables->length. " items";
$i = 0;
while($table = $tables->item($i++))
{
	echo $dom->saveHTML($table);
}

echo "8 Finding the attributes of an element:";
$tables = $dom->getElementsByTagName('table');
$i = 0;
while($table = $tables->item($i++))
{
	foreach($table->attributes as $attr)
	{
		echo $attr->name . " " . $attr->value . "";
	}
}

echo "9 Finding named attributes (using class as example):";
$tables = $dom->getElementsByTagName('table');
$i = 0;
while($table = $tables->item($i++))
{
	$class_node = $table->attributes->getNamedItem('class');

	if($class_node)
	{
		echo "Class is : " . $table->attributes->getNamedItem('class')->value . PHP_EOL;
	}
}

echo "10 Looping over child nodes:";
$tables = $dom->getElementsByTagName('table');
$table = $tables->item(1);

//get the number of rows in the 2nd table
// when empty text nodes are not removed, there will be 11 child nodes.
// when empty text nodes are removed there are 5 nodes
echo $table->childNodes->length;

//content of each child
foreach($table->childNodes as $child)
{
	echo $child->ownerDocument->saveHTML($child);
}

echo "11 Check for existence of child nodes:";
if( $table->hasChildNodes() )
{
	//print content of children
	foreach($table->childNodes as $child)
	{
		echo $child->ownerDocument->saveHTML($child);
	}
}

echo "12 Compare two nodes for equality:";
$tables = $dom->getElementsByTagName('table');
$table = $tables->item(1);
$table2 = $dom->getElementById('data');

var_dump($table->isSameNode($table2));

exit(0);

function get_inner_html( $node )
{
	$innerHTML= '';
	$children = $node->childNodes;

	foreach ($children as $child)
	{
		$innerHTML .= $child->ownerDocument->saveXML( $child );
	}

	return $innerHTML;
}

// recursive white space remover: DOMText nodes containing only white space
// this is taken from the loadHTML documentation page on php.net (April 2022)
function removeEmptyTextNodes(DOMNode $node) {
	if ($node->hasChildNodes()) {
		// depth-first, right-to-left
		for ($i = $node->childNodes->length - 1; $i >= 0; --$i) {
			removeEmptyTextNodes($node->childNodes->item($i));
		}
	}
	if ($node->nodeType === XML_TEXT_NODE && // this is a text node
		!$node->hasChildNodes() && // this is a leaf node
		!$node->hasAttributes() && // it has no attributes
		trim($node->textContent) === '') { // and there is only white space in there
		$node->parentNode->removeChild($node);
	}
}
?>

Amit Shee
September 25, 2018 at 3:03 pm

Great article! I would be very interested to read about create/modify html nodes.

Mathieu
August 24, 2018 at 6:06 pm

Hi, I have a question. As I’m not sure the HTML standards are always respected when I parse a page, I’d like to retrieve a tag with a specific id and classes.

Something like:
<span id="myId" class="class1 class2">Hello there</span>
in other words : span#myId.class1.class2

is there a way to perform this with DOMElement ?

thanks

dexidle
January 23, 2018 at 9:09 am

Thanks, great tutorial. It helped me.

Adolf
February 4, 2017 at 1:01 am

The variable $html in the examples above was not defined! Or was it? Where or how? Please

Srikalyan
March 5, 2016 at 4:04 pm

Thanks. Great Tutorial. But can you please add the outputs of the code as well.

Razzi.eu
September 16, 2014 at 2:02 am

Many thanks, got it finally working because of your example

Jeffrey
January 17, 2014 at 10:10 pm

This is an old article, but I’m just getting started… In any case I think you are missing a “$” in front of the variable in your code above if(!mango_div).. Atleast I did. Thanks

johnwright79
October 21, 2013 at 8:08 pm

Thanks Silver, your get_inner_html function has saved me a lot of time. Really great article.

bhavesh
September 21, 2013 at 5:05 pm

$mango_div = $dom->getElementById('mango');

above text not working in windows 7. it occurring this error

Catchable fatal error: Object of class DOMElement could not be converted to string in /var/www/websites/shabbonacreekrv/shabbonacreekrv.com/www/html/vehicle.php on line 65

to test it…

Jonas Zumkehr
August 28, 2013 at 1:01 pm

Great article! I would be very interested to read about create/modify html nodes.

Vasim Padhiyar
August 3, 2013 at 4:04 pm

Hello I am getting Following error :

Warning: domdocument::domdocument() expects parameter 2….

Fatal error: Call to undefined method domdocument::loadHTML()

I am using php 5.2.x + WAMP

php_domxml extension is enabled.

What is wrong in my code ?? $html has html content.. i did not paste it here. its the same as your example.

$dom = new DOMDocument('1.0','utf-8');
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('table');
print_r($tables);

jlgarhdez
June 6, 2013 at 4:04 pm

Thanks, great tutorial!

Kevin Gong
January 7, 2013 at 12:12 pm

Thanks for the tut. Would be helpful if you included the code (as you have) + the return values of your echo statements!

  Silver Moon
  January 7, 2013 at 2:02 pm

  thanks for the feedback, shall update the post soon.

Delaiah
December 22, 2012 at 8:08 pm

Thanks, this helped me a lot (was just looking for a nice DOMDocument example)!
