Tidy is a new extension for PHP 5 which allows you to parse, validate, manipulate and repair markup documents from within your PHP 5 scripts. It is based on the tidy command line utility released by the W3C, and the extension comes bundled standard with PHP 5 beginning with PHP 5.0 Beta 3. This article will explore the Tidy extension, and its use within your PHP 5 applications.
Although this article relies on Tidy 2.0 (which is available only in PHP 5), a very stable 1.0 version of Tidy is also available for PHP 4.3.x and above. It can be found in the PHP PECL Repository at http://pecl.php.net/.
Although the Tidy extension comes bundled by default in PHP 5.0, it must be enabled in order to be used and requires that the libTidy library be installed on your system. The latest version of libTidy can be found on the official Tidy utility web site at http://tidy.sourceforge.net/.
[user@localhost]$ tar -zxvf tidy_src.tgz
[user@localhost]$ cd tidy
[user@localhost]$ /bin/sh build/gnuauto/setup.sh
[user@localhost]$ ./configure
[user@localhost]$ make
[user@localhost]$ make install
Once libTidy has been downloaded and installed, use the --with-tidy configuration option to configure Tidy support into PHP.
[user@localhost]$ cd php
[user@localhost]$ ./configure --with-tidy=/path/to/libtidy
As is the case with most extensions in PHP, Windows users will be provided with everything they need to use Tidy in their applications by default. To test that you have Tidy installed, check the output of the phpinfo() function or execute the CLI version of PHP with the -m parameter to check for the tidy module.
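The same check can also be made from inside a script. The following is a minimal sketch using the standard extension_loaded() function; the variable name and the messages are illustrative only.

```php
<?php
// Ask the running PHP binary whether the tidy module is loaded.
$hasTidy = extension_loaded('tidy');

if ($hasTidy) {
    echo "The tidy extension is enabled\n";
} else {
    echo "The tidy extension is not enabled; configure PHP using --with-tidy\n";
}
```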
An Introduction to Syntax
Tidy, as is the case with many of the new PHP 5 specific extensions, supports a dual procedural / object oriented syntax. This syntax is designed for maximum API flexibility for the developer and works in the following fashion. For now, don't concern yourself with the functions/methods being called themselves. They will all be discussed later.
Consider the following procedural use of the tidy extension:
<?php
$tidy = tidy_parse_file("http://www.coggeshall.org/");
tidy_clean_repair($tidy);
echo tidy_get_output($tidy);
?>
As one might expect, the $tidy value returned from the call to the tidy_parse_file() function is a handle representing the parsed URL http://www.coggeshall.org/. However, in PHP 5, this handle is more than a simple resource. Rather, it is a complete object which may either be passed to other procedures in the tidy extension or used to call the procedures directly as methods. Thus, the following code is also acceptable:
<?php
$tidy = tidy_parse_file("http://www.coggeshall.org/");
$tidy->cleanRepair();
echo $tidy->value;
?>
Note that, unlike the first example, the second uses the cleanRepair() method of the object returned from the tidy_parse_file() function. Because the method is called on the object itself, there is no need to specify which handle should be used, and thus the first parameter (the handle to manipulate) is omitted. Because "resources" returned from the tidy extension are really PHP 5 objects, the extension can take advantage of many of the powerful object-oriented features available in PHP 5. One of these advantages is the way objects may be cast to other types transparently. For Tidy, this means that the $tidy object returned from the tidy_parse_file() function may also be treated as a string, with an output equivalent to the contents of the value property, as shown below:
<?php
$tidy = tidy_parse_file("http://www.coggeshall.org/");
$tidy->cleanRepair();
echo $tidy;
?>
Finally, if you would like to extend the object returned from Tidy, the tidy class is available to be instantiated:
<?php
$tidy = new tidy();
$tidy->parseFile("http://www.coggeshall.org/");
$tidy->cleanRepair();
echo $tidy;
?>
The procedural and object oriented syntaxes of Tidy in PHP 5 are completely interchangeable, although mixing them is not recommended. In all examples in this article I will stick to a single type of syntax to avoid confusion. In general, procedural and object oriented calls may be converted between each other in the following fashion:
Remove / Add tidy_ to the method / procedure
Remove Underscores / Add Underscores between words for methods/procedures (i.e. tidy_clean_repair() becomes $tidy->cleanRepair())
When calling from an object syntax, the first parameter of every function (the handle to a valid tidy document) is omitted.
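The conversion rules above can be illustrated with a small sketch that performs the same repair in both styles. The helper names repair_procedural() and repair_object() and the sample markup are hypothetical; the Tidy calls themselves are the ones introduced in this article.

```php
<?php
// Hypothetical input used only for this illustration.
$html = '<p>Hello <b>world</p>';

// Procedural style: the handle is passed as the first parameter.
function repair_procedural($html) {
    $tidy = tidy_parse_string($html);
    tidy_clean_repair($tidy);
    return tidy_get_output($tidy);
}

// Object style: drop the tidy_ prefix, remove the underscores
// and omit the handle parameter.
function repair_object($html) {
    $tidy = new tidy();
    $tidy->parseString($html);
    $tidy->cleanRepair();
    return (string) $tidy;
}

if (extension_loaded('tidy')) {
    // Both styles operate on the same kind of document object.
    var_dump(repair_procedural($html) === repair_object($html));
}
```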
With syntax out of the way, let's take a look at the basic usage of the Tidy extension. All real functionality within the tidy extension begins with the tidy_parse_file() or equivalent function or method. The syntax for the tidy_parse_file() function is as follows:
tidy_parse_file($file [, $options [, $encoding [, $use_inc_path]]]);
$file is a valid PHP filename, either in the local file system or a remote URL / stream resource. The second parameter, $options, is a very important parameter representing the configuration options which will be applied to this document. For now, we'll ignore this parameter, as a large portion of this article is devoted to it. The third optional parameter, $encoding, is a string representing the encoding of the file being parsed. The fourth and final parameter, $use_inc_path, is a boolean value indicating whether PHP should attempt to find the requested file in the include path (if not found otherwise).
If you would like to parse a document which already exists within a PHP variable, Tidy also provides the tidy_parse_string() function:
tidy_parse_string($data [, $options [, $encoding]]);
For this function, both $options and $encoding are identical to their equivalent parameters in the tidy_parse_file() function. The only difference is found in the first parameter, $data, which accepts a string to parse rather than a filename or stream resource.
Regardless of where the data is taken from (a string or read from a file), when a document is parsed by tidy any syntax errors (missing quotes, end tags, etc.) will automatically be corrected using Tidy's intelligent parser. Once the document has been parsed both the tidy_parse_file() and tidy_parse_string() functions will return a tidy document handle representing the document to other Tidy functions.
Although Tidy corrects any syntax errors it finds when it parses a document, other errors that are not syntax related (such as omitting a <HEAD> tag in an HTML document) are not corrected. To correct these errors the document must be cleaned and repaired using the tidy_clean_repair() function. The syntax for this function is as follows:
tidy_clean_repair($tidy);
Where $tidy is the document handle returned from a call to either tidy_parse_file() or tidy_parse_string(). The exact nature of how tidy will clean and repair the document depends very heavily on the configuration options assigned to this document.
As shown in earlier examples, documents manipulated by Tidy can be output in a number of ways. Procedurally, the tidy_get_output() function is used to retrieve the current state of the document in memory:
tidy_get_output($tidy);
From an object oriented perspective, the document handle can be treated as a string directly (which is equal to the output of the tidy_get_output() function), or the value property may be accessed as well:
<?php
/* These are equivalent */
echo $tidy;
echo $tidy->value;
?>
Not only is Tidy very good at intelligently parsing and repairing markup documents, but it also provides tools to identify the specific problems it found in the original document. These errors are logged as soon as the document is parsed, and can be retrieved by calling the tidy_get_error_buffer() function:
tidy_get_error_buffer($tidy);
Where $tidy is the document handle to retrieve the error buffer for. When executed, this function will return a string representing all of the errors encountered thus far, complete with the line number and column of the offending markup in the original document, as shown:
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 11 - Warning: replacing unexpected i by </i>
line 1 column 27 - Warning: replacing unexpected u by </u>
line 1 column 49 - Warning: discarding unexpected </b>
line 1 column 1 - Warning: inserting missing 'title' element
In an object syntax this information can also be retrieved through the errorBuffer property.
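Putting parsing, repair and error reporting together, a minimal sketch might look like the following. The helper name check_markup() and the sample markup are hypothetical, and the sketch assumes the error buffer is a newline-separated string, as in the listing above.

```php
<?php
/**
 * Parse and repair a markup string, returning the repaired document
 * together with the messages Tidy recorded, one message per entry.
 * check_markup() is a hypothetical helper, not part of the extension.
 */
function check_markup($html) {
    $tidy = tidy_parse_string($html);
    tidy_clean_repair($tidy);

    // Split the error buffer into individual, non-empty lines.
    $buffer = (string) tidy_get_error_buffer($tidy);
    $lines = array_map('trim', explode("\n", $buffer));
    $messages = array_values(array_filter($lines, 'strlen'));

    return array(
        'output'   => tidy_get_output($tidy),
        'messages' => $messages,
    );
}

if (extension_loaded('tidy')) {
    $result = check_markup('<p>Some <i>unclosed <u>markup</p>');
    foreach ($result['messages'] as $message) {
        echo $message, "\n";
    }
}
```

Returning the messages as an array makes it easy to count them or filter them by severity before deciding whether to accept the repaired output.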
Since many common operations in Tidy are based on the above three functions (tidy_parse_file(), tidy_clean_repair(), and tidy_get_output()), the Tidy extension provides two shorthand functions which combine them into a single call. These functions are tidy_repair_file() and tidy_repair_string() for files and strings respectively. The syntax for the tidy_repair_file() function is as follows:
tidy_repair_file($filename [, $options [, $encoding [, $use_inc_path]]]);
Where each parameter is identical to that found in the tidy_parse_file() function. Likewise, the syntax for the tidy_repair_string() function is as follows:
tidy_repair_string($data [, $options [, $encoding]]);
Which coincides with the prototype for the tidy_parse_string() function. When these functions are used, the given document will be parsed and repaired based on the encoding and configuration specified and a string is returned containing the final result.
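As a minimal sketch of the shorthand form, the helper below repairs an HTML fragment in one call. The name repair_fragment() and the sample markup are hypothetical; show-body-only is a Tidy configuration option that limits the output to the contents of the body, and 'utf8' is assumed as the input encoding.

```php
<?php
/**
 * Repair an HTML fragment in a single call and return only the body contents.
 * repair_fragment() is a hypothetical helper, not part of the extension.
 */
function repair_fragment($html) {
    $options = array('show-body-only' => true);
    return tidy_repair_string($html, $options, 'utf8');
}

if (extension_loaded('tidy')) {
    // The unclosed <b> tag is closed in the returned string.
    echo repair_fragment('<p>Hello <b>world</p>');
}
```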
Tidy Configuration Options
The majority of the power found in the Tidy extension can be found in the over 80 individual configuration options which may be set. These options control everything from the way Tidy treats the input as it is parsed, to how it treats things such as PHP code, and even the format in which the final document is rendered. In many cases working with Tidy is a matter of setting up the appropriate configuration options with very little change in code.
Tidy applies a default configuration to every document. This default configuration can be altered (as you will see) through a few different means. To begin, options may be altered at run time through the $options parameter of the tidy_parse_file() or tidy_parse_string() functions introduced earlier. This parameter is either a string (the path to a Tidy configuration file) or an associative array of configuration option / value pairs. For example, the following configures Tidy to output a given HTML document in XHTML 1.0 format:
<?php
$options = array('output-xhtml' => true);
$tidy = tidy_parse_file("somefile.html", $options);
tidy_clean_repair($tidy);
echo tidy_get_output($tidy);
?>
As an alternative to specifying configuration options at runtime, they may also be set in a configuration file which is then loaded by specifying the full path and filename for the $options parameter as shown below:
<?php
$options = "/path/to/my/tidy.tcfg";
$tidy = tidy_parse_file("somefile.html", $options);
tidy_clean_repair($tidy);
echo tidy_get_output($tidy);
?>
The format of the tidy configuration file is fairly straight-forward. An example of a valid tidy configuration file is found below:
indent-spaces: 4
indent: auto
tidy-mark: no
show-body-only: yes
new-blocklevel-tags: mytag, anothertag
Unlike parsing of markup documents, configuration files may only reside on the local file system. Configuration files have many uses, one of which is to define a number of Tidy "profiles" for different types of operations. For instance, you could create a profile which completely strips HTML documents of whitespace (to save on bandwidth) and another which beautifies HTML for editing.
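The profile idea can also be sketched with option arrays in place of configuration files. This is a swapped-in variant for illustration rather than the file-based approach described above: the profile names, the helper repair_with_profile() and the particular option values are assumptions, and the "compact" profile only reduces whitespace rather than stripping all of it.

```php
<?php
// Two hypothetical profiles expressed as option arrays instead of files.
$profiles = array(
    // Avoid indentation and line wrapping to keep the output small.
    'compact' => array('indent' => false, 'wrap' => 0, 'hide-comments' => true),
    // Indent nested tags to make the output easier to edit by hand.
    'pretty'  => array('indent' => true, 'indent-spaces' => 4, 'wrap' => 80),
);

/**
 * Repair $html using the named profile.
 * repair_with_profile() is a hypothetical helper, not part of the extension.
 */
function repair_with_profile($html, array $profiles, $name) {
    if (!isset($profiles[$name])) {
        throw new InvalidArgumentException("Unknown Tidy profile: $name");
    }
    return tidy_repair_string($html, $profiles[$name]);
}

if (extension_loaded('tidy')) {
    echo repair_with_profile('<ul><li>One<li>Two</ul>', $profiles, 'pretty');
}
```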
Another use for configuration files is to override the default configuration for any Tidy document handle created. To accomplish this, create a configuration file with your desired defaults and then set the tidy.default_config configuration directive to point to this configuration file. When set, the configuration named by this directive will be used any time a new Tidy document handle is created.
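To confirm which default configuration file, if any, is in effect, the directive can be read back at run time. The sketch below assumes the directive was set in php.ini; the path shown in the comment is the placeholder used earlier in this article, not a real file.

```php
<?php
// In php.ini the directive would look like this (placeholder path):
//   tidy.default_config = /path/to/my/tidy.tcfg

// ini_get() returns false when the directive does not exist at all,
// for example when the tidy extension is not loaded.
$defaultConfig = ini_get('tidy.default_config');

if ($defaultConfig === false || $defaultConfig === '') {
    echo "No default Tidy configuration file is set\n";
} else {
    echo "Default Tidy configuration file: $defaultConfig\n";
}
```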
Beyond just validation and repair of documents, the Tidy extension also provides a robust mapping of the internal document tree to objects within PHP. Using this object oriented interface, you are able to access specific blocks of a given markup document quickly and easily.
To understand how these features work, first you must understand how Tidy represents documents internally. Consider the following simple HTML document:
<HTML>
<HEAD>
<TITLE>Example Basic HTML Document</TITLE>
</HEAD>
<BODY>
<B>Hello, World!</B> <I>This is italic text.</I></B>
</BODY>
</HTML>
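Internally, Tidy represents this document as a tree of nodes: the <HTML> element holds the <HEAD> and <BODY> elements as children, and each of those holds its own child elements and text.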
From within PHP, Tidy allows access to this document tree through four methods available from the tidy document handle. These four methods (root(), head(), html() and body()) correspond to key points within any valid HTML document and return an instance of a class not yet introduced: the tidyNode class. This class represents a single node within the document tree and provides the following properties and methods:
<?php
/* Outline of the built-in tidyNode class, for reference only */
class tidyNode {
    /* The string value of this node and all of its child nodes */
    public $value;
    /* The tag name i.e 'HTML' or 'BODY' */
    public $name;
    /* A numeric value representing the node type */
    public $type;
    /* A numeric value representing type of tag (if any) */
    public $id;
    /* An associative array of tag attributes */
    public $attribute;
    /* An indexed array of child nodes */
    public $child;

    public function hasChildren();
    public function hasSiblings();
    public function isComment();
    public function isHtml();
    public function isText();
    public function isJste();
    public function isAsp();
    public function isPhp();
}
?>
Through the use of the tidyNode class, it is possible to navigate to any part within the document tree. For instance, to retrieve the background color of a given HTML document (the BGCOLOR attribute of the <BODY> tag) the following could be used:
<?php
$tidy = tidy_parse_file("somedoc.html");
echo "The background is: {$tidy->body()->attribute['bgcolor']}";
?>
Note that an instance of the tidyNode class can also be treated as a string, the contents of which are the same as the value property (the contents of this node and all of its child nodes).
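To see the shape of the document tree, the nodes can be walked recursively and printed as an indented outline. This is a minimal sketch: the helper name outline_node(), the '#text' label and the sample document are hypothetical, while the properties and methods used are those of the tidyNode class listed above.

```php
<?php
/**
 * Return an indented outline of a node and its descendants, one line per node.
 * outline_node() is a hypothetical helper, not part of the extension.
 */
function outline_node(tidyNode $node, $depth = 0) {
    // Text nodes are not tags, so give them an explicit label.
    $label = $node->isText() ? '#text' : $node->name;
    $lines = array(str_repeat('  ', $depth) . $label);

    if ($node->hasChildren()) {
        foreach ($node->child as $child) {
            $lines = array_merge($lines, outline_node($child, $depth + 1));
        }
    }
    return $lines;
}

if (extension_loaded('tidy')) {
    $tidy = tidy_parse_string(
        '<html><head><title>Example</title></head><body><b>Hello</b></body></html>'
    );
    echo implode("\n", outline_node($tidy->html())), "\n";
}
```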
All nodes which represent HTML tags are assigned an id property. This property represents a numeric identifier for the HTML tag, and corresponds to a predefined constant within PHP for that tag. This property makes searching for a particular type of tag fast and easy. The format of the constants used to represent each tag is as follows:
TIDY_TAG_<TAGNAME>
Where <TAGNAME> is the HTML tag name. For an <A> tag, the constant would therefore be TIDY_TAG_A.
To demonstrate the use of the Tidy Parser, consider the following function dump_urls(). This function searches through the given node and returns all of the URLs found within anchor (<A>) tags:
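<?php
/* Recursively collect the href of every anchor (<A>) tag below $node */
function dump_urls(tidyNode $node, &$urls = NULL) {
    $urls = (is_array($urls)) ? $urls : array();
    /* Only nodes representing HTML tags carry an id */
    if(isset($node->id)) {
        /* Skip anchors that have no href attribute (e.g. named anchors) */
        if($node->id == TIDY_TAG_A && isset($node->attribute['href'])) {
            $urls[] = $node->attribute['href'];
        }
    }
    if($node->hasChildren()) {
        foreach($node->child as $c) {
            /* Recurse into each child, sharing the same $urls array */
            dump_urls($c, $urls);
        }
    }
    return $urls;
}
$tidy = tidy_parse_file("http://www.coggeshall.org/");
tidy_clean_repair($tidy);
$urls = dump_urls($tidy->html());
print_r($urls);
?>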
This function, which uses recursion to traverse the entire document tree, is much easier to write than older regex-style methods of markup data mining, with a minimal impact on performance. Similar methods can be used to extract other data from an HTML document, such as all of its tables or the URLs of its images.
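As a sketch of how the same traversal generalizes, the helper below collects one attribute from every tag of a given type, shown here gathering image URLs. The name collect_attribute() and the sample markup are hypothetical; TIDY_TAG_IMG follows the TIDY_TAG_<TAGNAME> constant format described earlier.

```php
<?php
/**
 * Collect the named attribute from every tag of the given type below $node.
 * collect_attribute() is a hypothetical helper, not part of the extension.
 */
function collect_attribute(tidyNode $node, $tagId, $attribute) {
    $values = array();

    // Only tag nodes carry an id; skip tags that lack the attribute.
    if (isset($node->id) && $node->id == $tagId
        && isset($node->attribute[$attribute])) {
        $values[] = $node->attribute[$attribute];
    }

    if ($node->hasChildren()) {
        foreach ($node->child as $child) {
            $found = collect_attribute($child, $tagId, $attribute);
            $values = array_merge($values, $found);
        }
    }
    return $values;
}

if (extension_loaded('tidy')) {
    $tidy = tidy_parse_string('<body><img src="a.png"><p><img src="b.png" alt=""></p></body>');
    $tidy->cleanRepair();
    print_r(collect_attribute($tidy->body(), TIDY_TAG_IMG, 'src'));
}
```

Returning a fresh array from each call, rather than passing one by reference as dump_urls() does, is simply a design choice that keeps the function signature free of output parameters.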