This article discusses the state of trees as a content organization structure in modern CMS as opposed to other approaches.
Update 18th Feb, 2015: This post got a reply from Contentful, which you can read here.
For several years I have been interested in content repositories as a key aspect of modern CMS. With “modern”, I mean CMS that are not just “page management systems” but CMS that actually manage content, thereby enabling authors to reuse their content on different devices and even different applications. This interest culminated in the creation of PHPCR and its reference implementation Jackalope. In this spirit, I was very intrigued by services like prismic.io and contentful.com that essentially provide a content repository as a service. I was especially impressed by Prismic’s UI. But when evaluating these systems, I noticed a surprising trend: they do not leverage trees, neither as a native storage concept nor as a visualization concept. Instead, they for the most part rely on flat structures with tagging. My gut feeling was telling me that this was a mistake, especially when managing larger content repositories. At the same time I wondered: “Am I just a dinosaur that is missing the ark?”.
I discussed the topic with Ekke at a conference last fall and after a short Twitter exchange we decided to write down our thoughts. I found additional inspiration in an article by David Weinberger who helped put my feelings in a historical context as well as explaining the advantages of different approaches to content organization, namely: trees, facets and tags. Additionally, I also want to mention the concept of references since they are supported by Contentful.
Trees are the oldest of the methods mentioned above. The reason for this is likely that they work great in the physical world, ie. good old paper books, as they require no content duplication. That is, every piece of information is placed in exactly one place. The fact that trees have been around so long also gives them one distinct advantage: everyone knows how they work. Facets and tags, however, very much leverage the new possibilities of the digital age in that content can easily live in several places at once. But just because trees predate the digital age does not make them a dinosaur waiting for extinction. Let us first look at some of the advantages and disadvantages of facets and tags.
Lets start with the latter. Tags likely gained the most popularity with the advent of blogs. Fundamentally, blogs are otherwise a flat, chronologically sorted list of content pieces. Tags added an effective way to denote the main focus topics of a given article while also providing a useful filter criteria. By combining multiple tags in a filter it becomes possible in many cases to quickly drill down. Moreover, as each tag essentially stands on its own, adding a new tag is trivial. Simply begin using the new tag and it exists. This is simpler than a tree structure, where it is necessary to decide where in the tree a new topic best fits.
As such tags are also useful for allowing crowd sourced categorization. But here we also come to the main pain point of tagging: its inherently messy. Trying to stay on top of synonyms and abbreviations and typos that unintentionally place content in different “buckets” requires almost as much work as placing a topic into a tree structure and can lead to confusion when tags are later renamed/merged. Another approach can, of course, be to strictly control the creation of tags to prevent these issues from occurring but then one loses a lot of the reasons why tags are useful. Furthermore, homonyms cause major problems with tagging. For example the tag “apple” could relate to a fruit or to the computer company. A common solution is to then introduce tags like “apple fruit”, but with that tags lose a lot of their elegance. This brings us back to exactly why tags are so popular on blogs. Blogs were originally used for personal digital diaries, thereby reducing the risk of synonyms by different authors causing duplicate tags for the same topic. Also they usually focused on a specific topic which thereby reduced the chances of homonyms.
Facets have become especially popular in e-commerce sites to allow users to filter based on multiple dimensions in the order they prefer. However, they basically require content to be somewhat structured to be effective. Whereas chapters in a book usually just provide a title following a lot of text, for facets, one should further work to split the text into more structured pieces of information. It is not necessary to have the same structure for all pieces of content, however. Furthermore, just like with tags, with facets it becomes possible to find the same piece of information in different places.
Facets are specifically useful when it is very hard to anticipate which strategy someone will use to find the given piece of content. Going back to the e-commerce example – one user might focus on the price first, then on the color and then on the fabric with the next user potentially wanting to drill down in a totally different order. Furthermore facets are great because they allow non domain expert users to discover the relevant dimensions simply by looking at the left over facets as they add filters. As a content provider, it also becomes quite easy to offer new facets by simply starting to fill in some new “facets”. That being said it is also possible to run into issues with homonyms when searching across different content types, but its much less likely than with tags. For example a status property might be a numeric value for some pieces of content and or a simple flag for others. In this case, with some additional work, it might even be possible to translate the flag to a numeric value on the fly.
References have also been around for a long time. With the digital age, it has become much easier to follow a reference. Since the pre-digital age, they’ve been a popular addition to physical books in the form of footnotes and indexes. On the web, a reference is just a click away and can even be inlined if needed (for example, browsers inline image references). Images, or rather media content in general, are a good example of references used in many CMS.
Often text content and media content is kept in separate storage containers that are just connected via references. This is likely because creation of media content requires different skills and resources while also requiring significantly more storage, which means the technical challenges are also not the same. As such, media content is often reused, hence the logical use of references. References are a very powerful tool from which one can effectively build not only tree but also graph structures. But this additional power also means it becomes very hard to visualize and therefore comprehend the actual data structure without actively traversing it. Querying a graph structure tends to require expert knowledge and providing a performant experience is also a non trivial challenge solved only by very specialized systems.
Which brings us back to trees. The main drawback of trees is, to some extent, also their biggest advantage: the rigid allocation of content to a single place in the tree. This requires careful planning and can lead to iffy situations where one piece of content could be placed in multiple categories. For example, an article about the economics in sports, could be placed under “economics” or “sports”. This can of course be solved via references, but as pointed out above, the overuse of references as a means to structure content can cause problems.
On the other hand, this rigidness also gives things a lot of clarity. Most importantly, trees can be used as a very simple way to model inheritance that is understandable even for non developers. In this way the location, the context of the content in the tree, provides an important piece of information. For example, placing an article under “sports” expresses that this article is about sports. But it can express multiple things on top of that. Going back to the above dilemma about the “economics in sports” article, placing the article in one or the other category can also be used to determine responsibility. That is, by placing it under “sports” it can also automatically assign rights to all the sports editors. Interestingly it can also help to bridge back to the physical world to, for example, determine where in the print version the article will appear.
The categorization inside a tree also enables weighting of facets so to say. If I use economics vs. sports in the first dimension and the publication date as the second, I steer people towards the ideal way to explore the content. Is the emphasis of the content on giving a snapshot of information for a specific date or is it rather on the specific topic? Obviously in practice, most tree structures also support references and as I said before, references can be used to build tree structures themselves. But the true power of a tree structure lies in the context and the natural visualization they provide. Forcing as much content to remain within the limitations of a tree ensures that this metaphor remains useful, where as overusing references will mean that it becomes ineffective requiring a much more complicated visualization of a graph.
In summary, it becomes clear that all above mentioned systems have their advantages.
Tags are great for managing content structure that exhibits one or all of the below criteria:
- focused on a specific topic
- small data set
- categorization can be done after content creation
Faceting is mostly useful for content with the following attributes:
- content is “structured” in the sense that different facets of the content can be sensible separated and be given attribute names
- there is no singular way that users are expected to explore the content
References or rather graphs are ideal when:
- the content creation is very distinct
- the content itself is highly interrelated
Finally, trees are ideal:
- when most content fits into a rigid structure
- when there are experts with sufficient amount of time to properly place content into the structure
In practice, we of course see a lot of hybrid systems. That is, many blogs support tagging along with categories (essentially a tree with a depth of one). Many tree storage systems also support references which effectively also enable graphs.
My personal takeaway is that any CMS managing any sizeable amount of data needs to support trees. Anything else will lead to an unmanageable mess. However, systems with smaller sets of content, especially with a smaller group of authors, can get away with tagging as well. Facetting only really works well with a system that stores content that is highly structured at least on a per node type basis. In this spirit, I maintain that repository as a service providers will need to provide full support for trees, both to structure as well as visualized content, in order to become able to handle larger volumes of data. Faceting will also need to be provided if they intend to make inroads from doing more than just serving large chunks of text and media content.
Lukas Kahwe Smith is a partner at Liip. His enthusiasm for the web and open source is only rivaled by his love of ultimate frisbee. He was co-release-manager for PHP 5.3. He is part of the Symfony core team, is co-leading PHPCR and Symfony CMF and has contributed to various parts of the Doctrine project. Basically whatever he manages to fit into his busy life as an ultimate frisbee player.