URL safe characters

I am building a site where a user’s profile page’s url is determined by the user’s username. But what characters should I exclude in a user’s name to make it websafe?

One site ( http://www.moss2007.be/blogs/vandest/archive/2007/12/27/illegal-characters-in-site-url.aspx ) said these:

# % & * { } \ : < > ? / +

but he also said:

"In addition to this list it is prohibited to have consecutive periods in the URL. It’s also prohibited to start or end with a space or underline, or to end with a period. "

But I tested a url beginning with an underline example.com/_mypage and it worked fine and displayed the content. Same thing for consecutive periods. So I am doubting that this list of characters provided is up-to-date or authentic. Can someone link me to a more up to date listing or describe which characters are safe? (I am guessing that this list must be extended to overlap with what is safe in a linux filesystem because the site is hosted on a LAMP server.)

Forget about the consecutive periods bit. I forgot that “..” can mean “go up one directory”.

Which characters you may use in a URI, and how, is described in RFC 3986, section 2.

What the author is talking about there are the specific characters that some IIS filters strip out by default.

Run urlencode on everything; it doesn’t matter whether the input was legal or not, because the output will be.

Running urlencode on everything is a good idea because it’ll always leave you with valid URLs… but they may be ugly.
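To show what “valid but ugly” looks like, here is a minimal sketch in Python rather than PHP, using the standard library's `urllib.parse.quote` as a stand-in for urlencode. The `profile_url` helper and the example.com base are made up for illustration.

```python
from urllib.parse import quote

def profile_url(username: str) -> str:
    """Build a profile URL by percent-encoding the whole username.

    profile_url is a hypothetical helper, not part of any framework.
    """
    # safe="" makes quote() escape "/" too; by default it leaves "/" alone,
    # which would let a username add extra path segments.
    return "http://example.com/" + quote(username, safe="")

print(profile_url("bob"))         # http://example.com/bob
print(profile_url("Bob Smith?"))  # http://example.com/Bob%20Smith%3F
print(profile_url("a/b#c"))       # http://example.com/a%2Fb%23c
```

Every result is a legal URL, but the last two are not something you would want to read aloud or print on a business card.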

When I have something like that, I generally restrict it just to a-z0-9_-, lowercase the name and strip out anything else.

Since this could cause conflicts, I generally store a “slug” aside from their username which must also be unique. So, if I get someone that is “Bob” and someone that is “bob”, I would force the second person to pick a different slug (since “Bob” would have a slug of “bob”).
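Roughly, the slug approach could look like the sketch below (Python for brevity). The names `make_slug` and `claim_slug` are invented for this example, and the in-memory set stands in for what would really be a unique column in the database.

```python
import re

def make_slug(username: str) -> str:
    """Lowercase the name and strip everything outside a-z, 0-9, "_" and "-"."""
    return re.sub(r"[^a-z0-9_-]", "", username.lower())

def claim_slug(username: str, taken: set):
    """Return the slug if it is free, or None if the user must pick another.

    `taken` is an assumed stand-in for a uniqueness check against storage.
    """
    slug = make_slug(username)
    if not slug or slug in taken:
        return None  # empty after stripping, or already used by someone else
    taken.add(slug)
    return slug

taken = set()
print(claim_slug("Bob", taken))           # bob
print(claim_slug("bob", taken))           # None (conflicts with "Bob")
print(claim_slug("J. R. Smith!", taken))  # jrsmith
```

Note that a name made up entirely of stripped characters ends up with an empty slug, so that case has to be rejected as well.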

Some of those characters you mentioned can be used in a URL, but not the way you want.

# is the hash sign, so if you have a URL like http://example.com/#bob it will go to “http://example.com” and look for an element with an id of “bob”.
? indicates a query string, so if you have a URL like http://example.com/?bob it will go to “http://example.com” and will register a GET key of “bob”.
& is a query string pair divider. I’m actually not sure how it reacts when there is no ? present. It seems like it can be used validly, but I would still avoid it.
+ in a URL also indicates a space when it appears in a query string (generally the same as %20).
% in a URL is used for encoding a symbol, like spaces. If you went to http://example.com/%55, it would convert that %55 to a U, so it’d look for “http://example.com/U”.
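You can check these behaviours for yourself. Here is a small sketch using Python's standard `urllib.parse` module; the URLs are the same made-up example.com ones as above, and PHP's own functions may differ in the details.

```python
from urllib.parse import urlsplit, parse_qs, unquote, unquote_plus

# "#" starts the fragment: "bob" never becomes part of the path.
hash_parts = urlsplit("http://example.com/#bob")
print(hash_parts.path, hash_parts.fragment)   # / bob

# "?" starts the query string: "bob" becomes a key with an empty value.
query_parts = urlsplit("http://example.com/?bob")
bare_key = parse_qs(query_parts.query, keep_blank_values=True)
print(bare_key)                               # {'bob': ['']}

# "&" separates pairs, and "+" decodes to a space inside a query string.
pairs = parse_qs("name=a+b&x=1")
print(pairs)                                  # {'name': ['a b'], 'x': ['1']}

# "%" introduces a percent-escape: %55 is the letter U.
decoded_path = unquote("/%55")
print(decoded_path)                           # /U

# Plain unquote() leaves "+" alone; only the form-style decoder maps it to a space.
print(unquote("a+b"), "|", unquote_plus("a+b"))  # a+b | a b
```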

Like wwb_99 said, urlencode is your friend.

I didn’t mean encode everything; rather, most stacks have some sort of URL-encoding function to sanitize strings for URLs. It should be run on any untrusted variable that gets written out as part of a URL.