Highlighting XHTML
As you’re well aware by now, you can search my blog posts by clicking on the little magnifier on the bottom right. This feature was already pretty decent, if I do say so myself, but today, I added some extra niceness: search queries are now highlighted.
This may seem trivial to implement, but it in fact requires quite a bit of black magic. After all, we’re dealing with XHTML, which means tags have to nest properly, so a highlight element can’t simply straddle another tag’s boundary. In addition, it’s a whole-word search, and tags may sit in the middle of a word, e.g. to emphasize a single character.
For future reference, the algorithm I used is the following:
Split the HTML into tags and character data. For example, the string
<p>Hello, <em>world</em>!</p>
would result in an array containing
'<p>' 'Hello, ' '<em>' 'world' '</em>' '!' '</p>'
Next, hold on to the tags from the array above, and create a new string where you replace all the tags with an arbitrary placeholder, e.g. a null byte. In the example, you’d get
\0Hello, \0world\0!\0
Update: In addition, do the same for HTML entities; otherwise, the entity codes might get split up.
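The first two steps could be sketched roughly like this in Python. This is only an illustration, not the actual Pwnt_Highlighter code: the names TOKEN, split_pieces and split_markup are made up for the example, and the tag pattern is deliberately naive (it assumes no ">" inside attribute values).

```python
import re

# Assumption: a tag is "<...>" without a ">" inside, an entity is "&name;" or "&#123;".
# The capturing group makes re.split() keep the tags and entities in its result.
TOKEN = re.compile(r'(<[^>]*>|&#?\w+;)')
PLACEHOLDER = '\0'  # the null byte used as placeholder

def split_pieces(html):
    """Step 1: split the markup into tags/entities and character data."""
    return [piece for piece in TOKEN.split(html) if piece]

def split_markup(html):
    """Step 2: swap every tag or entity for the placeholder and keep the originals."""
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return PLACEHOLDER

    return TOKEN.sub(stash, html), saved

text, saved = split_markup('<p>Hello, <em>world</em>!</p>')
# text  is '\x00Hello, \x00world\x00!\x00'
# saved is ['<p>', '<em>', '</em>', '</p>']
```

Handling entities with the same placeholder means they get restored later by the same mechanism as the tags.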
Now for the tricky part. Create a regular expression that will match all the words you need, but work the placeholder into it so that it may occur between each pair of characters. For example, if you wanted to match both “hello” and “world”, the regular expression would be
/(h\0?e\0?l\0?l\0?o|w\0?o\0?r\0?l\0?d)/is
The i switch makes the match case-insensitive; the s switch makes the dot match newlines as well, which isn’t necessary in this case.
That won’t cut it, though: it will also match parts of words, e.g. the “world” in “worlds”. To prevent this, you use lookbehind and lookahead. In my case, I put
(?<![\p{L}\p{N}])
in front and
(?![\p{L}\p{N}])
at the back, which, for the example, results in
/(?<![\p{L}\p{N}])(h\0?e\0?l\0?l\0?o|w\0?o\0?r\0?l\0?d)(?![\p{L}\p{N}])/is
Obviously, you’ll never actually get to see the expression, since it should be generated dynamically.
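Generating such an expression dynamically might look something like the Python sketch below. The function name build_pattern is hypothetical. Note one deliberate swap: Python's built-in re module has no \p{L} or \p{N}, so the sketch uses [^\W_] (any Unicode word character except the underscore) as a close stand-in for [\p{L}\p{N}].

```python
import re

# Assumption: [^\W_] approximates [\p{L}\p{N}], which Python's re module lacks.
WORD_CHAR = r'[^\W_]'
GAP = r'\x00?'  # an optional null-byte placeholder between two characters

def build_pattern(words):
    """Build one case-insensitive, whole-word pattern for all search words."""
    alternatives = [
        GAP.join(re.escape(ch) for ch in word)  # h\x00?e\x00?l\x00?l\x00?o
        for word in words
    ]
    body = '|'.join(alternatives)
    return re.compile(
        '(?<!' + WORD_CHAR + ')(' + body + ')(?!' + WORD_CHAR + ')',
        re.IGNORECASE,
    )

pattern = build_pattern(['hello', 'world'])
print(pattern.findall('\0Hello, \0world\0!\0'))  # ['Hello', 'world']
print(pattern.search('worlds'))                  # None
```

Escaping each character separately keeps punctuation in a query from being read as regex syntax.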
The worst part’s over. Now, the basic idea is to apply the generated expression to the string with null bytes from the second step, replacing each match with its own value, surrounded by whichever bit of HTML you wish to use for highlighting; in my case, <em class="highlight"> and </em>.
Another caveat, though. To avoid ending up with overlapping tags, you need to split up the highlighting whenever a new tag starts. Fortunately, tags are all represented by null bytes, so surrounding each null byte in the matched string with a closing and an opening highlighting element gets the job done.
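As a rough Python sketch of this step (again an illustration, not the real implementation): the replacement callback wraps each match in the highlighting element and closes and reopens it around every placeholder inside the match. The names wrap and highlight are invented here, and the pattern is hard-coded for the word “world”, with [^\W_] standing in for [\p{L}\p{N}].

```python
import re

OPEN, CLOSE = '<em class="highlight">', '</em>'

# Hard-coded example pattern for "world"; normally this would be generated.
pattern = re.compile(
    r'(?<![^\W_])(w\x00?o\x00?r\x00?l\x00?d)(?![^\W_])',
    re.IGNORECASE,
)

def wrap(match):
    """Highlight one match, splitting the highlight at every placeholder."""
    inner = match.group(0).replace('\0', CLOSE + '\0' + OPEN)
    return OPEN + inner + CLOSE

def highlight(text, pattern):
    return pattern.sub(wrap, text)

# "wo<b>rld</b>" after the placeholder step is '\0wo\0rld\0' ... with a tag mid-word:
result = highlight('\0wo\0rld\0', pattern)
# result == '\0<em class="highlight">wo</em>\0<em class="highlight">rld</em>\0'
```

Because the highlight is closed before each placeholder and reopened after it, no highlighting element ever spans a tag boundary once the tags are put back.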
Almost there now. The highlighting itself has already happened, but the original tags still need to be restored. This is probably the easiest step: just replace all null bytes with consecutive elements from the array of tags from step 2 and you’re done.
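Restoring the tags can be sketched in a few lines of Python. The function name restore is hypothetical, and the sketch assumes the saved list holds exactly one entry per null byte, in document order.

```python
import re

def restore(text, saved):
    """Replace each null byte with the next saved tag or entity, in order."""
    tokens = iter(saved)
    return re.sub(r'\x00', lambda match: next(tokens), text)

highlighted = '\0<em class="highlight">Hello</em>, \0world\0!\0'
print(restore(highlighted, ['<p>', '<em>', '</em>', '</p>']))
# <p><em class="highlight">Hello</em>, <em>world</em>!</p>
```

This only works because the highlighting step never adds or removes null bytes, so the placeholders still line up with the saved tags.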
My eventual implementation is in Pwnt_Highlighter. If you wish to use it, go right ahead, but like everything else on this site, it’s licensed BY-NC-ND. Neat improvements are always welcome.
One final note: since my site uses MySQL’s Unicode features, the matching still isn’t 100% accurate when you use international characters. That is to say, MySQL thinks é and e are the same character, while PHP disagrees. The result is that, for example, the database layer will match “résumé” and “resume” for the same query, but since PHP does the highlighting, it will only match what you entered literally.
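To make the mismatch concrete, here is a small Python sketch. It strips combining accents via Unicode decomposition, which only roughly imitates what an accent-insensitive collation does; it is not MySQL’s actual comparison logic, and fold_accents is a made-up helper name.

```python
import unicodedata

def fold_accents(s):
    """Drop combining marks after NFD decomposition, e.g. 'é' becomes 'e'."""
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

query, word = 'resume', 'r\u00e9sum\u00e9'   # 'résumé'

print(query == word)                              # False: the literal comparison
print(fold_accents(query) == fold_accents(word))  # True: the accent-insensitive view
```

A highlighter that folded both the query and the text this way before matching would line up more closely with the database results, at the cost of extra bookkeeping to map matches back to the original characters.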
