I am building a site where a user’s profile page’s url is determined by the user’s username. But what characters should I exclude in a user’s name to make it websafe?
"In addition to this list it is prohibited to have consecutive periods in the URL. It’s also prohibited to start or end with a space or underline, or to end with a period. "
But I tested a url beginning with an underline example.com/_mypage and it worked fine and displayed the content. Same thing for consecutive periods. So I am doubting that this list of characters provided is up-to-date or authentic. Can someone link me to a more up to date listing or describe which characters are safe? (I am guessing that this list must be extended to overlap with what is safe in a linux filesystem because the site is hosted on a LAMP server.)
Running urlencode on everything is a good idea because it’ll always leave you with valid URLs… but they may be ugly.
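To make the "valid but ugly" trade-off concrete, here is a small sketch in Python rather than PHP. It uses the standard library's `urllib.parse.quote`, which plays roughly the role of PHP's `rawurlencode`. The helper name `encode_username` and the sample names are made up for illustration.

```python
from urllib.parse import quote, unquote


def encode_username(name: str) -> str:
    """Percent-encode a username so it can be used as one URL path segment."""
    # safe="" also escapes "/", which quote() leaves untouched by default.
    return quote(name, safe="")


# Hypothetical usernames, chosen only to show what the encoded form looks like.
for name in ["bob", "Bob Smith", "a/b?c#d", "José"]:
    encoded = encode_username(name)
    # Decoding gives back the original name unchanged.
    assert unquote(encoded) == name
    print(f"{name!r} -> example.com/{encoded}")
```

Plain names such as "bob" come through unchanged, while anything with spaces, punctuation or non-ASCII letters turns into a run of `%XX` escapes.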
When I have something like that, I generally restrict it just to a-z0-9_-, lowercase the name and strip out anything else.
Since this could cause conflicts, I generally store a “slug” aside from their username which must also be unique. So, if I get someone that is “Bob” and someone that is “bob”, I would force the second person to pick a different slug (since “Bob” would have a slug of “bob”).
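A minimal sketch of that slug approach, in Python for illustration. The function names `slugify` and `claim_slug` are hypothetical, and the in-memory `taken` set stands in for what would normally be a unique column in the database.

```python
import re


def slugify(username: str) -> str:
    """Lowercase the name and strip everything outside a-z, 0-9, _ and -."""
    return re.sub(r"[^a-z0-9_-]", "", username.lower())


def claim_slug(username: str, taken: set) -> str:
    """Return the slug for a username, or raise if it cannot be used.

    `taken` is an assumption for this sketch: a set of slugs already in use.
    """
    slug = slugify(username)
    if not slug:
        raise ValueError("username contains no URL-safe characters")
    if slug in taken:
        raise ValueError(f"slug {slug!r} is already taken, pick another")
    taken.add(slug)
    return slug


taken = set()
print(claim_slug("Bob", taken))         # first "Bob" gets the slug "bob"
try:
    claim_slug("bob", taken)            # second user collides with it
except ValueError as err:
    print(err)
```

Note that stripping can leave an empty slug (for example a name made only of punctuation), so that case has to be rejected as well.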
Some of those characters you mentioned can be used in a URL, but not in the way you want.
# is the hash sign, so if you have a URL like http://example.com/#bob it will go to “http://example.com” and look for an element with an id of “bob”.
? indicates a query string, so if you have a URL like http://example.com/?bob it will go to “http://example.com” and will register a GET key of “bob”.
& is a query string pair divider. I’m actually not sure how it reacts when there is no ? present. It seems like it can be used validly, but I would still avoid it.
+ in a URL also indicates a space (generally the same as %20), at least in a query string.
% in a URL is used for encoding a symbol, like spaces. If you went to http://example.com/%55, it would convert that %55 to a U, so it’d look for “http://example.com/U”.
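The special characters described above can be checked with Python's standard `urllib.parse` module. This is only a sketch of how one URL parser splits things up; the helper name `explain` is made up, and a given web server or framework may decode paths slightly differently.

```python
from urllib.parse import parse_qs, unquote, urlsplit


def explain(url: str) -> dict:
    """Split a URL and show how its path, query and fragment are interpreted."""
    parts = urlsplit(url)
    return {
        # Percent-escapes in the path are decoded; "+" is left as-is here.
        "path": unquote(parts.path),
        # parse_qs splits on "&" and decodes "+" as a space.
        "query": parse_qs(parts.query, keep_blank_values=True),
        # Everything after "#" stays in the browser and is not part of the path.
        "fragment": parts.fragment,
    }


for url in [
    "http://example.com/#bob",
    "http://example.com/?bob",
    "http://example.com/a&b",
    "http://example.com/?name=bob+smith",
    "http://example.com/%55",
]:
    print(url, "->", explain(url))
```

With this parser, an & that appears with no ? in front of it is simply kept as part of the path, which matches the guess that it is valid but best avoided in usernames.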
I didn’t mean encode everything – rather that most stacks have some sort of url encoding functionality to sanitize strings for urls. This should be run on any untrusted variable that gets written out as a url.