Enriching user-entered HTML markup with PHP parsing

I recently found myself faced with an interesting little web dev challenge. Here's the scenario. You've got a site that's powered by a PHP CMS (in this case, Drupal). One of the pages on this site contains a number of HTML text blocks, each of which must be user-editable with a rich-text editor (in this case, TinyMCE). However, some of the HTML within these text blocks (in this case, the unordered lists) needs some fairly advanced styling – the kind that's only possible either with CSS3 (using, for example, nth-child pseudo-selectors), with JS / jQuery manipulation, or with the addition of some extra markup (for example, some first, last, and first-in-row classes on the list item elements).

Naturally, IE7+ compatibility is required – so, CSS3 selectors are out. Injecting element attributes via jQuery is a viable option, but it's an ugly approach, and it may not kick in immediately on page load. Since the users will be editing this content via WYSIWYG, we can't expect them to manually add CSS classes to the markup, or to maintain any markup that the developer provides in such a form. That leaves only one option: injecting extra attributes on the server-side.

When it comes to HTML manipulation, there are two general approaches. The first is Parsing HTML The Cthulhu Way (i.e. using Regular Expressions). However, you already have one problem to solve – do you really want two? The second is to use an HTML parser. Sadly, this problem must be solved in PHP – which, unlike some other languages, lacks an obvious tool of choice in the realm of parsers. I chose to use PHP5's built-in DOMDocument library, which (from what I can tell) is one of the most mature and widely-used PHP HTML parsers available today. Here's my code snippet.

Markup parsing function

 * Parses the specified markup content for unordered lists, and enriches
 * the list markup with unique identifier classes, 'first' and 'last'
 * classes, 'first-in-row' classes, and a prepended inside element for
 * each list item.
 * @param $content
 *   The markup content to enrich.
 * @param $id_prefix
 *   Each list item is given a class with name 'PREFIX-item-XX'.
 *   Optional.
 * @param $items_per_row
 *   For each Nth element, add a 'first-in-row' class. Optional.
 *   If not set, no 'first-in-row' classes are added.
 * @param $prepend_to_li
 *   The name of an HTML element (e.g. 'span') to prepend inside
 *   each liist item. Optional.
 * @return
 *   Enriched markup content.
function enrich_list_markup($content, $id_prefix = NULL,
$items_per_row = NULL, $prepend_to_li = NULL) {
  // Trim leading and trailing whitespace, DOMDocument doesn't like it.
  $content = preg_replace('/^ */', '', $content);
  $content = preg_replace('/ *$/', '', $content);
  $content = preg_replace('/ *\n */', "\n", $content);
  // Remove newlines from the content, DOMDocument doesn't like them.
  $content = preg_replace('/[\r\n]/', '', $content);
  $doc = new DOMDocument();
  foreach ($doc->getElementsByTagName('ul') as $ul_node) {
    $i = 0;
    foreach ($ul_node->childNodes as $li_node) {
      $li_class_list = array();
      if ($id_prefix) {
        $li_class_list[] = $id_prefix . '-item-' . sprintf('%02d', $i+1);
      if (!$i) {
        $li_class_list[] = 'first';
      if ($i == $ul_node->childNodes->length-1) {
        $li_class_list[] = 'last';
      if (!empty($items_per_row) && !($i % $items_per_row)) {
        $li_class_list[] = 'first-in-row';
      $li_node->setAttribute('class', implode(' ', $li_class_list));
      if (!empty($prepend_to_li)) {
        $prepend_el = $doc->createElement($prepend_to_li);
        $li_node->insertBefore($prepend_el, $li_node->firstChild);
  $content = $doc->saveHTML();
  // Manually fix up HTML entity encoding - if there's a better
  // solution for this, let me know.
  $content = str_replace('&acirc;&#128;&#147;', '&ndash;', $content);
  // Manually remove the doctype, html, and body tags that DOMDocument
  // wraps around the text. Apparently, this is the only easy way
  // to fix the problem:
  // http://stackoverflow.com/a/794548
  $content = mb_substr($content, 119, -15);
  return $content;

This is a fairly simple parsing routine, that loops through the li elements of the unordered lists in the text, and that adds some CSS classes, and also prepends a child node. There's some manual cleanup needed after the parsing is done, due to some quirks associated with DOMDocument.

Markup parsing example

For example, say your users have entered the following markup:


And your designer has given you the following rules:

  • List items to be laid out in rows, with three items per row
  • The first and last items to be coloured purple
  • The third and fifth items to be coloured green
  • All other items to be coloured blue
  • Each list item to be given a coloured square 'bullet', which should be the same colour as the list item's background colour, but a darker shade

You can ready the markup for the implementation of these rules, by passing it through the parsing function as follows:

$content = enrich_list_markup($content, 'fruit', 3, 'span');

After parsing, your markup will be:

  <li class="fruit-item-01 first first-in-row"><span></span>Apples</li>
  <li class="fruit-item-02"><span></span>Bananas</li>
  <li class="fruit-item-03"><span></span>Boysenberries</li>
  <li class="fruit-item-04 first-in-row"><span></span>Peaches</li>
  <li class="fruit-item-05"><span></span>Lemons</li>
  <li class="fruit-item-06 last"><span></span>Grapes</li>

You can then whip up some CSS to make your designer happy:

#fruit ul {
  list-style-type: none;

#fruit ul li {
  display: block;
  width: 150px;
  padding: 20px 20px 20px 45px;
  float: left;
  margin: 0 0 20px 20px;
  background-color: #bbddfb;
  position: relative;

#fruit ul li.first-in-row {
  clear: both;
  margin-left: 0;

#fruit ul li span {
  display: block;
  position: absolute;
  left: 20px;
  top: 23px;
  width: 15px;
  height: 15px;
  background-color: #191970;

#fruit ul li.first, #fruit ul li.last {
  background-color: #968adc;

#fruit ul li.fruit-item-03, #fruit ul li.fruit-item-05 {
  background-color: #7bdca6;

#fruit ul li.first span, #fruit ul li.last span {
  background-color: #4b0082;

#fruit ul li.fruit-item-03 span, #fruit ul li.fruit-item-05 span {
  background-color: #00611c;

Your finished product is bound to win you smiles on every front:

  • Apples
  • Bananas
  • Boysenberries
  • Peaches
  • Lemons
  • Grapes

Obviously, this is just one example of how a markup parsing function might look, and of the exact end result that you might want to achieve with such parsing. Take everything presented here, and fiddle liberally to suit your needs.

In the approach I've presented here, I believe I've managed to achieve a reasonable balance between stakeholder needs (i.e. easily editable content, good implementation of visual design), hackery, and technical elegance. Also note that this article is not at all CMS-specific (the code snippets work stand-alone), nor is it particularly parser-specific, or even language-specific (although code snippets are in PHP). Feedback welcome.

Comments are closed



Your code would be much less if you went the regular expressions route.

Regular expressions take a little bit to get your head around but they are so powerful and a very useful tool to know how to use.

Also, it seems like you are saying to call the parsing function from your theme template.

That is not a good idea.

You should be at least doing it in the template preprocess function for that template. See http://drupal.org/node/223430

But even better would be to do it as an input filter.
Then for example you could assign your parser to your filtered HTML input filter and then whenever the Filtered HTML input format is used the text runs trough your parser.
For more info see hook_filter_info() http://api.drupal.org/api/d...

There are some examples around the place, including in the examples module - http://drupal.org/project/e...

Jeremy Epstein


I decided not to use regular expressions in this case, as the nature of the task would have required quite a complicated regex pattern. Perhaps my code would be less quantity, but it would be much harder to read and to maintain. I have plenty of experience in using regexes, enough to know when not to use them.

In my case, I'm actually calling the function from the callback code of a custom-defined block, in a custom module. I agree with your suggestions: calling it from a template preprocess function would also be a good idea, and implementing it as an input filter would be appropriate too. Of course, calling the function directly from a theme template would be bad coding practice. However, I purposely chose not to discuss in this article where I'm calling the code from in Drupal (or from what possible locations one could call it in Drupal), as the example in this article is not CMS-dependent, and as the article is purely about HTML parsing (not about how to integrate HTML parsing with Drupal).


Try http://querypath.org/ for parsing the HTML.

Nate Wildermuth

I believe that using CKeditor's templates would be another (limited) solution.


Another vote for querypath! It's like jquery for PHP.