Intelligent String Abbreviation

James Edwards

For the seventh article in the small-and-sweet functions series, I’d like to show you a function called abbreviate(), the main purpose of which I’m sure you can guess! It abbreviates a string to a specified maximum length, but it does so intelligently: it ensures that the split never occurs in the middle of a word, and it pre-processes the string to remove extraneous whitespace.

Here’s the abbreviate function’s code:

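function abbreviate(str, max, suffix)
{
  // normalise whitespace: trim both ends, then collapse line-breaks, tabs
  // and runs of spaces into single spaces; return early if the result fits
  if((str = str.replace(/^\s+|\s+$/g, '').replace(/[\r\n]*\s*[\r\n]+/g, ' ').replace(/[ \t]+/g, ' ')).length <= max)
  {
    return str;
  }

  var abbr = '', words = str.split(' ');

  // the default suffix is a space followed by three dots
  suffix = (typeof suffix !== 'undefined' ? suffix : ' ...');

  // reserve room for the suffix within the maximum length
  max = (max - suffix.length);

  // add whole words for as long as the result stays shorter than the remaining length
  for(var len = words.length, i = 0; i < len; i ++)
  {
    if((abbr + words[i]).length < max)
    {
      abbr += words[i] + ' ';
    }
    else { break; }
  }

  // trim the residual trailing space, then add the suffix
  return abbr.replace(/[ ]$/g, '') + suffix;
}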

The function takes three arguments — the original input string, the maximum output length, and an optional suffix to add to the end of the abbreviated string. If the suffix is not defined then it defaults to " ..." (a space followed by three dots), which is a common and recognisable way of indicating abbreviation.

What the Function’s For

The function can be used whenever you need to limit the length of a string, as a more-intelligent alternative to a simple substr expression. There are any number of possible applications — such as processing form input, creating custom tooltips, displaying message subjects in a web-based email list, or pre-processing data to be sent via Ajax.

For example, to limit a string to 100 characters and add the default suffix, we’d call it like this:

str = abbreviate(str, 100);

Which is notionally equivalent to this substr expression:

str = str.substr(0, 96) + " ...";

But that’s a very blunt instrument, as it will often result in an output string which is split in the middle of a word. The abbreviate function is specifically designed not to do that, and will split the string before the last word rather than in the middle of it. So the output string produced by abbreviate() will often be shorter than the specified maximum — but it will never be longer.

The function also accounts for the space required by the abbreviation suffix, i.e. if the specified maximum is 100 but the suffix itself is 4 characters, then we can only use up to 96 characters of the main input string.
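To see why the plain substr approach is blunt, here is a minimal sketch using a smaller maximum of 20 characters. The variable names (input, max, suffix, budget, naive) and the sample string are only for illustration and are not part of the abbreviate function.

```javascript
// illustrative values, not taken from the article
var input = 'Intelligent string abbreviation';
var max = 20;
var suffix = ' ...';

// the suffix takes 4 of the 20 characters, leaving 16 for the input
var budget = max - suffix.length;

// a plain substr cut ignores where the words begin and end
var naive = input.substr(0, budget) + suffix;

console.log(budget); // 16
console.log(naive);  // "Intelligent stri ..."
```

The result is exactly 20 characters long, but the word "string" has been cut in half, which is the case abbreviate() is designed to avoid.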

You can specify no suffix at all by passing an empty string, or if you want to abbreviate a markup string, you can define the suffix as an HTML close-tag. For example, the following input:

abbreviate("<p>One two three four five</p>", 15, "</p>");

Would produce this output:

<p>One two</p>

How the Function Works

The key to the abbreviate function is the ability to split an input string into individual words, then to re-compile as many of the words as will fit into the maximum length.

To make this effective, we need to ensure that the splits between words are predictable, and the simplest way to do that is by minimising internal whitespace: converting line-breaks and tabs to spaces, and then reducing contiguous spaces, so that every chunk of internal whitespace becomes a single space. There are other ways of handling that, of course. For example, we could define a more flexible regular-expression for the split, that accounts for all the different kinds of character we might find between words. There’s even a word-boundary assertion for regular-expressions ("\b"), so we could just use that.

But I’ve found that the whitespace pre-processing is useful in its own right, especially when it comes to user input. And splitting by word-boundary doesn’t produce the desired results, since dashes, dots, commas, and in fact most special characters, count as word-boundaries. I don’t think it’s appropriate to split the words by punctuation characters unless the character is followed by a space, so that things like hyphenated words and code-fragments are not split in the middle.
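As a rough illustration of that difference, the sketch below normalises the whitespace with the same three replacements the function uses, then splits the result both ways. The helper name normaliseWhitespace and the sample string are assumptions made for this example.

```javascript
// hypothetical helper wrapping the article's three replace() calls
function normaliseWhitespace(str)
{
  return str
    .replace(/^\s+|\s+$/g, '')           // trim both ends
    .replace(/[\r\n]*\s*[\r\n]+/g, ' ')  // line-breaks, plus any whitespace before them, to a space
    .replace(/[ \t]+/g, ' ');            // runs of spaces and tabs to a single space
}

var clean = normaliseWhitespace('  state-of-the-art \r\n\t code  ');

var bySpace = clean.split(' ');
var byBoundary = clean.split(/\b/);

console.log(clean);      // "state-of-the-art code"
console.log(bySpace);    // [ 'state-of-the-art', 'code' ]
console.log(byBoundary); // [ 'state', '-', 'of', '-', 'the', '-', 'art', ' ', 'code' ]
```

Splitting by spaces keeps the hyphenated word whole, whereas splitting by word-boundary breaks it into seven pieces.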

So the function’s first job is to do that whitespace pre-processing, and then if the result is already no longer than the specified maximum, we can return it straight away:

if((str = str.replace(/^\s+|\s+$/g, '').replace(/[\r\n]*\s*[\r\n]+/g, ' ').replace(/[ \t]+/g, ' ')).length <= max)
{
  return str;
}

If we didn’t do that, then we might get cases where the string becomes abbreviated when it doesn’t have to be, for example:

abbreviate("Already long enough", 20)

Without that first condition we’d get abbreviated output, since the specified maximum has to account for the length of the suffix:

Already long ...

Whereas adding that first condition produces unmodified output:

Already long enough
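That comparison can be reproduced with a sketch like the one below. It is a copy of the function with a hypothetical fourth argument, skipEarlyReturn, added purely so that both behaviours can be run side by side; the name abbreviateSketch and that argument are not part of the original function.

```javascript
// copy of abbreviate() with a hypothetical skipEarlyReturn flag, for comparison only
function abbreviateSketch(str, max, suffix, skipEarlyReturn)
{
  str = str.replace(/^\s+|\s+$/g, '').replace(/[\r\n]*\s*[\r\n]+/g, ' ').replace(/[ \t]+/g, ' ');

  // the first condition: return the cleaned string if it already fits
  if(!skipEarlyReturn && str.length <= max)
  {
    return str;
  }

  var abbr = '', words = str.split(' ');
  suffix = (typeof suffix !== 'undefined' ? suffix : ' ...');
  max = (max - suffix.length);

  for(var len = words.length, i = 0; i < len; i ++)
  {
    if((abbr + words[i]).length < max)
    {
      abbr += words[i] + ' ';
    }
    else { break; }
  }

  return abbr.replace(/[ ]$/g, '') + suffix;
}

var withCheck = abbreviateSketch('Already long enough', 20);
var withoutCheck = abbreviateSketch('Already long enough', 20, undefined, true);

console.log(withCheck);    // "Already long enough"
console.log(withoutCheck); // "Already long ..."
```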

So unless we return at that point, we proceed to compile the abbreviated string: splitting the input string by spaces to create individual words, then iteratively adding each word-space pair back together, for as long as the abbreviated string is shorter than the specified maximum minus the length of the suffix.

Once we’ve compiled as much as we need, we can break iteration, and then trim the residual space from the end of the abbreviated string, before adding the suffix and finally returning the result. It may seem a little wasteful to right-trim that residual space, only to add it back with the default suffix, but by doing so we allow for an input suffix to have no space at all.
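The final step on its own can be sketched as follows. The helper name finish is hypothetical, and the abbr value stands in for what the loop would have built up.

```javascript
// hypothetical helper isolating the last line of abbreviate()
function finish(abbr, suffix)
{
  // remove the single trailing space left by the loop, then append the suffix
  return abbr.replace(/[ ]$/g, '') + suffix;
}

var abbr = 'One two ';

console.log(finish(abbr, ' ...')); // "One two ..."
console.log(finish(abbr, '</p>')); // "One two</p>"
console.log(finish(abbr, ''));     // "One two"
```

Because the space is trimmed first, a suffix such as a close-tag sits directly against the last word, while the default suffix brings its own space.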

Conclusion

So there you have it — a simple but intelligent function for abbreviating strings, which also pre-processes the input to remove extraneous whitespace. In my experience, these two requirements are often found together, and that’s why I’ve developed the function to work this way.