CMS Content Organization Structures: Trees vs Facets vs Tags

By Lukas Smith

This article discusses the state of trees as a content organization structure in modern CMS as opposed to other approaches.

Update 18th Feb, 2015: This post got a reply from Contentful, which you can read here.

For several years I have been interested in content repositories as a key aspect of modern CMS. By “modern”, I mean CMS that are not just “page management systems” but that actually manage content, thereby enabling authors to reuse their content on different devices and even in different applications. This interest culminated in the creation of PHPCR and its reference implementation Jackalope. In this spirit, I was very intrigued by services like Prismic and Contentful that essentially provide a content repository as a service. I was especially impressed by Prismic’s UI. But when evaluating these systems, I noticed a surprising trend: they do not leverage trees, neither as a native storage concept nor as a visualization concept. Instead, they for the most part rely on flat structures with tagging. My gut feeling was telling me that this was a mistake, especially when managing larger content repositories. At the same time I wondered: “Am I just a dinosaur that is missing the ark?”

I discussed the topic with Ekke at a conference last fall, and after a short Twitter exchange we decided to write down our thoughts. I found additional inspiration in an article by David Weinberger, who helped put my feelings in a historical context and explained the advantages of different approaches to content organization, namely trees, facets and tags. Additionally, I want to mention the concept of references, since they are supported by Contentful.


Trees are the oldest of the methods mentioned above. The reason for this is likely that they work great in the physical world, i.e. good old paper books, as they require no content duplication. That is, every piece of information is placed in exactly one place. The fact that trees have been around so long also gives them one distinct advantage: everyone knows how they work. Facets and tags, however, very much leverage the new possibilities of the digital age in that content can easily live in several places at once. But just because trees predate the digital age does not make them a dinosaur waiting for extinction. Let us first look at some of the advantages and disadvantages of facets and tags.


Let’s start with the latter. Tags likely gained the most popularity with the advent of blogs. Fundamentally, blogs are otherwise a flat, chronologically sorted list of content pieces. Tags added an effective way to denote the main focus topics of a given article while also providing a useful filter criterion. By combining multiple tags in a filter, it often becomes possible to drill down quickly. Moreover, as each tag essentially stands on its own, adding a new tag is trivial. Simply begin using the new tag and it exists. This is simpler than a tree structure, where it is necessary to decide where in the tree a new topic best fits.
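As a rough illustration, combining tags in a filter can be sketched in a few lines of Python. The sample articles and the function name filter_by_tags are made up for this example and do not come from any particular CMS.

```python
# Hypothetical sample data: each article carries a free-form set of tags.
articles = {
    "World Cup ticket prices": {"sports", "economics"},
    "Transfer market bubble": {"sports", "economics", "football"},
    "Interest rates explained": {"economics"},
}


def filter_by_tags(items, required):
    """Return the titles whose tag set contains every required tag."""
    required = set(required)
    return sorted(title for title, tags in items.items() if required <= tags)


# A new tag exists as soon as someone uses it; no structure has to be updated.
articles["Stadium financing"] = {"sports", "economics", "architecture"}

print(filter_by_tags(articles, ["sports", "economics"]))
print(filter_by_tags(articles, ["architecture"]))
```

Each additional tag in the filter can only narrow the result, which is what makes drilling down feel natural.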


As such, tags are also useful for allowing crowdsourced categorization. But here we also come to the main pain point of tagging: it’s inherently messy. Trying to stay on top of synonyms, abbreviations and typos that unintentionally place content in different “buckets” requires almost as much work as placing a topic into a tree structure, and can lead to confusion when tags are later renamed or merged. Another approach can, of course, be to strictly control the creation of tags to prevent these issues from occurring, but then one loses a lot of the reasons why tags are useful. Furthermore, homonyms cause major problems with tagging. For example, the tag “apple” could relate to a fruit or to the computer company. A common solution is to then introduce tags like “apple fruit”, but with that, tags lose a lot of their elegance. This brings us back to exactly why tags are so popular on blogs. Blogs were originally used as personal digital diaries, which reduced the risk of different authors using synonyms and thereby creating duplicate tags for the same topic. They also usually focused on a specific topic, which reduced the chances of homonyms.
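To make the maintenance cost concrete, here is a minimal sketch of the kind of curated synonym table one ends up maintaining. The table contents, the canonical tag names and the helper functions are all assumptions made for this illustration.

```python
# Hypothetical curated synonym table: raw spelling -> canonical tag.
SYNONYMS = {
    "js": "javascript",
    "java script": "javascript",
    "javascirpt": "javascript",  # a typo someone has to notice and add
    "apple inc": "apple-company",
    "apple fruit": "apple-fruit",
}


def canonical_tag(raw):
    """Normalize case and whitespace, then map known synonyms to one tag."""
    cleaned = " ".join(raw.lower().split())
    return SYNONYMS.get(cleaned, cleaned)


def bucket(tagged_items):
    """Group item titles by canonical tag."""
    buckets = {}
    for title, raw_tags in tagged_items:
        for tag in {canonical_tag(t) for t in raw_tags}:
            buckets.setdefault(tag, []).append(title)
    return buckets


items = [
    ("Intro to closures", ["JS"]),
    ("Promises in depth", ["Java  Script"]),
    ("Pie recipes", ["apple fruit"]),
    ("Quarterly results", ["Apple Inc"]),
    ("Ambiguous post", ["apple"]),
]
buckets = bucket(items)
```

The synonyms collapse into one bucket, but the bare “apple” tag stays ambiguous: no table can decide whether the author meant the fruit or the company.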


Facets have become especially popular on e-commerce sites, where they allow users to filter based on multiple dimensions in the order they prefer. However, they basically require content to be somewhat structured to be effective. Whereas chapters in a book usually just provide a title followed by a lot of text, for facets one needs to split the text further into more structured pieces of information. It is not necessary to have the same structure for all pieces of content, however. Furthermore, just like with tags, with facets it becomes possible to find the same piece of information in different places.


Facets are specifically useful when it is very hard to anticipate which strategy someone will use to find the given piece of content. Going back to the e-commerce example: one user might focus on the price first, then on the color and then on the fabric, while the next user might want to drill down in a totally different order. Furthermore, facets are great because they allow users who are not domain experts to discover the relevant dimensions simply by looking at the leftover facets as they add filters. As a content provider, it also becomes quite easy to offer new facets by simply starting to fill in some new “facets”. That being said, it is also possible to run into issues with homonyms when searching across different content types, but it’s much less likely than with tags. For example, a status property might be a numeric value for some pieces of content and a simple flag for others. In this case, with some additional work, it might even be possible to translate the flag to a numeric value on the fly.
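The drill-down behaviour described above can be sketched as follows. The product records, field names and both helper functions are hypothetical; a real system would typically delegate this to a search engine rather than loop over records in memory.

```python
from collections import Counter

# Hypothetical product records; not every record needs the same fields.
products = [
    {"name": "Shirt A", "color": "blue", "fabric": "cotton", "price_band": "low"},
    {"name": "Shirt B", "color": "blue", "fabric": "linen", "price_band": "high"},
    {"name": "Shirt C", "color": "red", "fabric": "cotton", "price_band": "low"},
    {"name": "Scarf D", "color": "red", "fabric": "wool"},
]


def apply_filters(items, filters):
    """Keep items whose fields match every chosen facet value."""
    return [i for i in items if all(i.get(k) == v for k, v in filters.items())]


def remaining_facets(items, filters):
    """Count the values of every facet that has not been filtered on yet."""
    counts = {}
    for item in items:
        for key, value in item.items():
            if key == "name" or key in filters:
                continue
            counts.setdefault(key, Counter())[value] += 1
    return counts


chosen = {"price_band": "low"}
matches = apply_filters(products, chosen)
facets = remaining_facets(matches, chosen)
```

The counts in facets are what a user sees as the leftover dimensions, and because each filter is an independent condition, the order in which filters are applied does not change the result.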


References have also been around for a long time. Since the pre-digital age, they’ve been a popular addition to physical books in the form of footnotes and indexes. With the digital age, it has become much easier to follow a reference. On the web, a reference is just a click away and can even be inlined if needed (for example, browsers inline image references). Images, or rather media content in general, are a good example of references used in many CMS.

Often text content and media content are kept in separate storage containers that are just connected via references. This is likely because the creation of media content requires different skills and resources while also requiring significantly more storage, which means the technical challenges are also not the same. As such, media content is often reused, hence the logical use of references. References are a very powerful tool from which one can effectively build not only tree but also graph structures. But this additional power also means it becomes very hard to visualize, and therefore comprehend, the actual data structure without actively traversing it. Querying a graph structure tends to require expert knowledge, and providing a performant experience is a non-trivial challenge solved only by very specialized systems.
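A small sketch shows why a reference graph has to be traversed to be understood. The entry ids and the two helper functions are invented for this example; the naive full scan in referenced_by is deliberately simplistic.

```python
# Hypothetical content store: each entry lists the ids it references.
entries = {
    "article-1": {"refs": ["image-1", "author-1"]},
    "article-2": {"refs": ["image-1", "article-1"]},
    "author-1": {"refs": ["image-2", "article-1"]},  # cycle back to article-1
    "image-1": {"refs": []},
    "image-2": {"refs": []},
}


def reachable(store, start):
    """Return every id reachable from start, visiting each entry once."""
    seen = []
    stack = [start]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against cycles and shared targets
        seen.append(current)
        stack.extend(reversed(store[current]["refs"]))
    return seen


def referenced_by(store, target):
    """Answer the reverse question, which here needs a scan of the whole store."""
    return sorted(k for k, v in store.items() if target in v["refs"])
```

Even in this tiny store, the shared image and the cycle between the article and its author are invisible until the graph is walked, and the reverse lookup is the kind of query that needs an index to stay fast at scale.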


Which brings us back to trees. The main drawback of trees is, to some extent, also their biggest advantage: the rigid allocation of content to a single place in the tree. This requires careful planning and can lead to iffy situations where one piece of content could be placed in multiple categories. For example, an article about the economics of sports could be placed under “economics” or “sports”. This can of course be solved via references, but as pointed out above, the overuse of references as a means to structure content can cause problems.


On the other hand, this rigidity also gives things a lot of clarity. Most importantly, trees can be used as a very simple way to model inheritance that is understandable even for non-developers. In this way the location, the context of the content in the tree, provides an important piece of information. For example, placing an article under “sports” expresses that this article is about sports. But it can express multiple things on top of that. Going back to the above dilemma about the “economics in sports” article, placing the article in one or the other category can also be used to determine responsibility. That is, placing it under “sports” can also automatically assign rights to all the sports editors. Interestingly, it can also help to bridge back to the physical world to, for example, determine where in the print version the article will appear.
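Inheritance through position in the tree can be sketched with a minimal node class. The class, the property names (editors, print_section) and the sample tree are assumptions for illustration only and are not how PHPCR or any specific repository models this.

```python
class Node:
    """A tree node with exactly one parent and optional local properties."""

    def __init__(self, name, parent=None, **props):
        self.name = name
        self.parent = parent
        self.props = props

    def path(self):
        """Build the path from the root, e.g. /sports/economics-in-sports."""
        if self.parent is None:
            return "/"
        return self.parent.path().rstrip("/") + "/" + self.name

    def resolve(self, key, default=None):
        """Look up a property locally, then fall back to the ancestors."""
        node = self
        while node is not None:
            if key in node.props:
                return node.props[key]
            node = node.parent
        return default


root = Node("", editors=["admins"])
sports = Node("sports", root, editors=["sports-desk"], print_section="C")
economics = Node("economics", root, editors=["economics-desk"])
article = Node("economics-in-sports", sports)

print(article.path())
print(article.resolve("editors"))
print(article.resolve("print_section"))
```

The article itself stores neither editors nor a print section. Moving it under the economics node would change both answers without touching the article, which is exactly the context a single, fixed location provides.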

The categorization inside a tree also enables a weighting of facets, so to say. If I use economics vs. sports in the first dimension and the publication date as the second, I steer people towards the ideal way to explore the content. Is the emphasis of the content on giving a snapshot of information for a specific date, or is it rather on the specific topic? Obviously, in practice most tree structures also support references and, as I said before, references can be used to build tree structures themselves. But the true power of a tree structure lies in the context and the natural visualization it provides. Forcing as much content as possible to remain within the limitations of a tree ensures that this metaphor remains useful, whereas overusing references will make it ineffective, requiring a much more complicated visualization of a graph.


In summary, it becomes clear that all of the above-mentioned systems have their advantages.

Tags are great for managing content structure that meets one or more of the criteria below:

  • focused on a specific topic
  • small data set
  • categorization can be done after content creation

Faceting is mostly useful for content with the following attributes:

  • content is “structured” in the sense that different facets of the content can be sensibly separated and given attribute names
  • there is no singular way that users are expected to explore the content

References or rather graphs are ideal when:

  • the different kinds of content are created in very distinct ways (for example, text versus media)
  • the content itself is highly interrelated

Finally, trees are ideal:

  • when most content fits into a rigid structure
  • when there are experts with a sufficient amount of time to properly place content into the structure

In practice, we of course see a lot of hybrid systems. For example, many blogs support tagging along with categories (essentially a tree with a depth of one). Many tree storage systems also support references, which effectively enable graphs as well.


My personal takeaway is that any CMS managing any sizeable amount of data needs to support trees. Anything else will lead to an unmanageable mess. However, systems with smaller sets of content, especially with a smaller group of authors, can get away with tagging as well. Faceting only really works well with a system that stores content that is highly structured, at least on a per-node-type basis. In this spirit, I maintain that repository-as-a-service providers will need to provide full support for trees, both to structure and to visualize content, in order to handle larger volumes of data. Faceting will also need to be provided if they intend to move beyond just serving large chunks of text and media content.

I would like to thank Ekke and David for reviewing this article and proposing various improvements.


Interesting recurring topic indeed, thanks for sharing.
I am also thinking trees have a lot to offer, plus the advantage of not preventing anyone from using tagging or categorization & taxonomies.
I think facets are another beast though, purely intended to be used for Search & Find. I would not really consider them as a solution to organize our content upfront. Are they?

It's also interesting to see how file systems and other software (besides CMS) are evolving. File system UIs, which are by definition trees, are less broadly used by end-user apps. Apps such as iCloud, GDoc, IA Writer, the Mac Finder and many others are more and more trying to leave the concept of a "physical tree" to get to something based more on meta-information, which can sometimes also be a virtual tree but with different parallel organizations. From a UX standpoint, trees have a lot of flaws, as you mention.

To be continued...


Fascinating stuff. I would love to read a follow-up that covers hybrid systems in depth, as that's an issue that many site owners confront every day.

As a manager of (lots and lots of) content, I struggle with this daily. In fact, there's something deliciously meta about this article being listed under the PHP section when it has very little to do with PHP! I suspect it has to do with editorial assignment (and @swader can chime in), as you describe in the Tree section.

It becomes messy when you factor in multiple platforms: content that "lives" in one CMS (let's say WordPress) will be distributed via many channels (RSS feeds, either via a full feed, a category feed, or a tag feed), found via search engines, and, in our case, listed on a forum with similar categories.


Indeed, in practice it likely makes the most sense for organizations to look towards hybrid systems to get the best of all worlds that are relevant to them. For example, on Twitter someone suggested using a curated tree structure of tags to overcome some of the issues I mentioned with tags. This of course can be a huge help, but it also means you can no longer easily create new tags ad hoc, which is one of the big advantages of tags.


While it would be easy to say I was driven by the fact that the PHP world is so dense with CMS solutions or attempts at them and as such found this topic fitting to the general gist of the channel, it would probably be more honest to admit it was simple human error and bias - having recognized the article as excellent, I neglected to even think about other channels, enthusiastically giving it only one main category - mine. : )

Regarding the categories, I was always fond of a nested tags approach which worked for me in the past on large repositories of content. Tags with children (which essentially translates to categories and subcategories) solve all the problems I can envision in large scale CMS efforts, as long as the creation of tags is centralized and well defined. Internally to a site, there needs to be an approval flow to new tags, and the tags structure needs to support synonyms. Their IDs (and, by extension, URL slugs) need to differ if they're homonyms and the problem is solved indefinitely.

Furthermore, such tags can be given root (or "meta") tags that define their purpose, which themselves may be nested. Thereby, a CMS is given the flexibility to define which tags are visible to the end user in the site's search engine, which are to be interpreted as forum categories, which are to be considered statistics-tracking-related, and so on.

In short, a tagTree-on-tagTree model has worked very well for me in the past, covering thousands of books each with dozens of chapters, hundreds of thousands of users, dozens of journals with hundreds of entries and dozens of quarterlies each, all in a single system, and every entity tagged to some extent. The same tag engine powered the user-facing search, the employee-facing search and CRM, and the system-facing invoice tracker and more.


I just read over some PHPCR...interesting...but how would you express many-to-many relationships? It seems the PHP version does not implement the shareable nodes option -- which I understand is what would allow nodes to have more than one parent???

Drupal (as you likely know) has a very powerful data model: a cross between entity-attribute-value and a graph. Its taxonomy module allows unlimited arbitrary categorization of data.

Interesting article :)


Very thought-provoking stuff. Having a software development background, I'm definitely a fan of trees (and especially the filesystem) as an organizational tool. The single-location limitation was solved in filesystems by using links (hard or symbolic) and I see no reason you couldn't model taxonomies in a content repository the same way (e.g. every item existing at one or more locations in the tree).

Disclaimer: I work for Contentful, but the views above are my own and don't reflect any sort of official position.


Indeed PHPCR currently does not support multiple parents. At least in all projects I have done so far, I didn't really miss this.


Yeah, in PHPCR the solution we have for this is references. I think references are more expressive than multiple parents, since it still means that "ownership" is clearly assigned to a specific place in the tree.


I think that facets, tags and category trees are just different expressions of basically the same paradigm. Facets are tags organized in groups with some enforcement related to specific information types (one-of, many-of, required facet). Strict category trees are tag hierarchies which only allow one tag per object.

The distinction between the three is made with consideration to UI, performance and data structure in mind, but I don't see a reason to treat them as completely separate entities.


I generally agree with this statement - it's all about how you use them, and you can turn each of those into the other.

Note that this post got a reply from Contentful and you can read it here.


Great post!

It took us some time, but since you've been calling us out directly, here are our official Contentful thoughts on trees. Long story short: We actually do like (and support) trees as one form of creating structure. True, not in your drag and drop type of web CMS visual UI, but conceptually for sure. However, we also think of trees as just one specific organizing principle with certain use cases (e.g. your collection of evergreen pages type of website). Other scenarios may call for very different structures (think recipe app).


Indeed, references can be used to build tree structures, or at least make it possible to navigate as if it were a tree. I mentioned this in the article above. This can indeed be a great solution; however, to gain all the above-mentioned advantages of trees, one needs an actual tree structure (which then also means one gets the disadvantages as well). For example, while a proper tree structure can also serve to "inherit" properties, the same cannot really be done with a graph/tree (where a graph is of course a superset of a tree structure) built via references.

That being said, some of the advantages of references you mention in your response would of course be compatible with a tree structure as well. Take the Roger Federer example. If one creates a node "Roger Federer" using references one can of course point articles to that node and then one can make all such references query-able. Using tags in this scenario is imho not ideal as likely as a content author I would actually like to be able to point to a specific place to actually describe who this Roger Federer guy is and what he has done. As such pointing to an actual node is imho way superior in this case.

Additionally, to me tags can become a problem when rewording an article. Let's say an article originally mentioned Roger Federer, but that sentence is removed. The tags might be overlooked. Now if instead the reference were set by essentially putting a reference in the actual text, then if I remove the sentence, the reference is removed as well.

But I guess I am just a tag-skeptic :)

At any rate, which approach makes the most sense depends on the specific use case. As such, Contentful offers quite a range of options, but the lack of trees means that for me it's not useful in many cases, especially when dealing with large data sets.


