Handling content from strangers
One thing that makes web development both fascinating and exhausting is how the same subjects keep popping up, over and over, without resulting in any clear answers. On the one hand it’s remarkably easy to put up your own website. But building a site that can handle a lot of traffic and is easy to change and modify is not so easy.
What’s got me started is a recent blog post by Sam Ruby, owner of the job to die for at IBM, whom PHP can thank for the Java extension, and who has been a member of the Apache group and had a part in countless other web innovations and groups.
The issue? How to publish content submitted to your site by its visitors. This one is as old as that most dated of web apps – the Guestbook – and if you trawl through the comments on Sam’s site, you’ll quickly get the idea that still, no one’s too sure of the answer.
I’ve been developing a fetish recently for knocking up lists to describe common development problems, as a means to really nail them down. Here’s a guess at what a good solution to this problem needs to do (subject to my opinion / limited vision):
1. Prevents the structure of your site from being broken
2. Poses no threat to your site’s security or that of its visitors
3. Provides visitors with enough power, in terms of how they are able to format their submissions, to be happy.
4. Is easy to parse (extracting submitted formatting and handling it should not require a PhD)
5. Is easy to use. Believe it or not, there are people “out there” who have no idea of HTML.
6. Preserves the intent of the formatting. Not quite sure how to explain what I mean here but the thought test might be: “Is it possible to transform submitted content to other output types?” – i.e. is generating a PDF document, as opposed to HTML, at least feasible.
Any more / less?
Meeting all those requirements is probably an impossibility – it’s going to be a compromise at some level.
Some of the common solutions to this, off the top of my head, are:
b. Wiki style. Using “markup” like *this for bold* and _this for italic_. Wiki style often starts out well, being easy to use and secure, but is perhaps weak on point 3. Things go downhill the more formatting options you provide to users, the parsing getting progressively harder to manage and the syntax weirder, like !!!this for some large text. Users are required to learn this alien markup and may find it difficult to express their precise formatting intent (intent thereby being lost or becoming arbitrary, as text like McDougals gets automatically assigned as a link to a new wiki page). In the end, I don’t think wikis do much to address those beyond their original target audience – software developers (flame on…)
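To make the wiki idea concrete, here is a minimal sketch in Python of the two rules mentioned above, *bold* and _italic_. The function name render_wiki and the exact regular expressions are my own assumptions for illustration, not how any particular wiki engine does it.

```python
import html
import re


def render_wiki(text: str) -> str:
    """Render *bold* and _italic_ wiki markup as HTML (illustrative only)."""
    # Escape first, so any HTML the visitor typed is displayed as plain text.
    out = html.escape(text)
    # *some words* becomes <strong>; the marked text must not start or end
    # with whitespace, so "2 * 3 * 4" is left alone.
    out = re.sub(r"\*(\S(?:[^*]*\S)?)\*", r"<strong>\1</strong>", out)
    # _some words_ becomes <em>; underscores inside a word (snake_case)
    # are ignored thanks to the look-behind and look-ahead.
    out = re.sub(r"(?<!\w)_(\S(?:[^_]*\S)?)_(?!\w)", r"<em>\1</em>", out)
    return out
```

Even with only two rules, the patterns already need special cases for arithmetic and identifiers, which hints at why the parsing gets harder to manage as the syntax grows.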
c. Implied formatting. This is less frequently used as a standalone mechanism but turns up often as part of other styles. Essentially whitespace takes on a meaning it doesn’t normally have with HTML. PHP offers nl2br() for example. It’s definitely easy to use and fairly safe (depending on what you do with URLs, for example). It’s also easy to parse. Where it fails is that it typically offers little power to the user and it’s very easy to lose the intent of the formatting, hence it’s often augmented with one or more of the other styles.
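As a rough illustration of implied formatting, the Python sketch below treats a blank line as a paragraph break and a single newline as a line break, in the spirit of nl2br(). The name render_plain is hypothetical and the paragraph rule is my own assumption; nl2br() itself only inserts line breaks.

```python
import html
import re


def render_plain(text: str) -> str:
    """Turn plain text into HTML using whitespace alone (illustrative only)."""
    # Normalise Windows and old Mac line endings to "\n".
    normalized = text.replace("\r\n", "\n").replace("\r", "\n").strip()
    # One or more blank lines separate paragraphs.
    paragraphs = [p for p in re.split(r"\n\s*\n", normalized) if p.strip()]
    rendered = []
    for paragraph in paragraphs:
        # Escape each line, then join the lines with explicit line breaks.
        lines = [html.escape(line) for line in paragraph.split("\n")]
        rendered.append("<p>" + "<br />\n".join(lines) + "</p>")
    return "\n".join(rendered)
```

The visitor gets paragraphs and line breaks for free, and nothing else, which is the lack of power described above.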
d. BBCode style. Essentially you use your own custom markup, one which will be ignored by web browsers completely should any unparsed fragments turn up in the finished page. Although this can be a little tricky for users who have never run into it before, it’s tried, tested and successful, as forum apps like vBulletin and phpBB have proved, to the point where BBCode is almost an (unwritten) standard. Surprisingly, on Sam’s blog, no one mentioned it, but perhaps that reflects the common divide between PHP developers and the rest of the web: doing it vs. talking about it. For end users, it generally means that basic HTML tags have been translated more or less one to one to BBCode – simply replace the angle brackets with square brackets.
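A minimal BBCode sketch in Python might look like the following. The tag whitelist SIMPLE_TAGS and the function render_bbcode are assumptions of mine for illustration; real forum software handles many more tags (URLs, quotes, lists) and is stricter about nesting.

```python
import html
import re

# Hypothetical whitelist: BBCode tag name -> HTML element it maps to.
SIMPLE_TAGS = {"b": "strong", "i": "em", "u": "u"}


def render_bbcode(text: str) -> str:
    """Translate a few paired BBCode tags to HTML (illustrative only)."""
    # Escape first: real HTML from the visitor never reaches the page.
    out = html.escape(text)
    for bb_tag, element in SIMPLE_TAGS.items():
        # Only a complete [tag]...[/tag] pair is translated.
        pattern = re.compile(
            r"\[%s\](.*?)\[/%s\]" % (bb_tag, bb_tag),
            re.IGNORECASE | re.DOTALL,
        )
        out = pattern.sub(r"<%s>\1</%s>" % (element, element), out)
    return out
```

Anything that does not match a complete pair is left as literal text, so an unclosed [b] shows up as harmless square brackets rather than breaking the page. This sketch does not check for overlapping tags such as [b][i]x[/b][/i], which a proper parser would need to do.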
A couple of quick points, outside of security issues:
– When storing visitor submitted content in a database, for later display, apply the parsing operations after the content has been stored, not before. In other words, don’t parse, INSERT then SELECT but INSERT, SELECT then parse (if performance is an issue, cache the HTML resulting from the parse). The basic reason for this is it makes editing the content later (either by you as a site admin or by the visitor themselves) easy – you display their content (more or less) as is in a textarea rather than having to reverse the parsing operation to give them back what they started with (a recipe for headaches). You also stand a better chance of preserving the intent of the formatting, which is easy to lose if you’re required to reverse the parsing. You might consider filtering the content before storing it – certainly for SQL injections and possibly for stuff like “bad word filters” but don’t transform or add to the content.
– Document your markup. The number of blogs I see that expect visitors to guess (nudge nudge Sitepoint ;))…
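The first point above, storing what the visitor typed and parsing only on the way out, can be sketched in Python with an in-memory SQLite database. The CommentStore class, its method names and the placeholder render() function are all hypothetical; swap in whatever parser and storage you actually use.

```python
import html
import sqlite3


def render(raw: str) -> str:
    """Placeholder parser: escape HTML and turn newlines into <br />."""
    return html.escape(raw).replace("\n", "<br />\n")


class CommentStore:
    """Stores submissions untouched and parses them on display."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE comments (id INTEGER PRIMARY KEY, raw TEXT NOT NULL)"
        )
        self.cache = {}  # comment id -> rendered HTML

    def add(self, raw):
        # INSERT the content as submitted; the bound parameter guards
        # against SQL injection without altering the text itself.
        cur = self.db.execute("INSERT INTO comments (raw) VALUES (?)", (raw,))
        self.db.commit()
        return cur.lastrowid

    def raw(self, comment_id):
        # What goes back into the textarea when someone edits.
        row = self.db.execute(
            "SELECT raw FROM comments WHERE id = ?", (comment_id,)
        ).fetchone()
        return row[0]

    def as_html(self, comment_id):
        # SELECT then parse, caching the result for later requests.
        if comment_id not in self.cache:
            self.cache[comment_id] = render(self.raw(comment_id))
        return self.cache[comment_id]

    def edit(self, comment_id, raw):
        self.db.execute(
            "UPDATE comments SET raw = ? WHERE id = ?", (raw, comment_id)
        )
        self.db.commit()
        # Drop the stale HTML so it is re-parsed on next display.
        self.cache.pop(comment_id, None)
```

Because the stored text is never transformed, changing the parser later only means clearing the cache, not migrating the database.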
While I’m here, some PEAR projects that can help in this area:
– PEAR::Text_Wiki – in effect, an abstraction layer for wiki markup. Text_Wiki “captures” all the common document structuring requirements end users may have as “rules” and can translate whatever markup you like to those rules, the rules rendering (X)HTML. Very clever project. Would also work as a BBCode parser (and pretty much anything else in fact).
– XML_HTMLSax – a SAX parser which won’t choke on HTML (badly formed XML). In fact the name HTMLSax is a little misleading, as it has no specific knowledge of HTML vocab. It’s much like Python’s HTMLParser, although tags which are closed implicitly, like <br />, result in a fourth argument to the open tag handler with XML_HTMLSax, as well as a call to the close handler, while Python’s HTMLParser has a “startendtag” callback for this situation. A couple of projects I’ve seen but never tried are HTML Parser for PHP-4, which provides a state based API, and PHP HTML Parser, which does have some knowledge of HTML and seems to be designed to transform HTML in a single pass (from the user point of view). Note also Simple Test has a (you guessed it) simple SAX based parser for HTML – it uses regular expressions, based on the Lexer in lamplib. I still need to benchmark it against HTMLSax, which uses a string position based approach to parsing, just for interest.
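For the Python side of that comparison, here is a small sketch using the standard library’s html.parser.HTMLParser that records which callback fires for each piece of markup. The EventLogger class and parse_events() helper are names I made up for the example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records the parser callbacks as (event, value) tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style empty tags such as <br />.
        self.events.append(("startend", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def parse_events(markup):
    """Feed markup to the parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events
```

Feeding it "<p>one<br />two</p>" records a single "startend" event for the br tag, in between the data events for "one" and "two". A bare <br> without the slash only triggers the start tag callback, since the parser has no knowledge of which HTML elements are empty.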
Anyway – long rant. Enough already.