  1. #1
    SitePoint Member
    Join Date
    Apr 2001
    Location
    New York, NY
    Posts
    18

    Problems with javascript & utf-8 encoding

    Sorry in advance for the length of this... you can skip to "the problem" if you want to avoid extraneous backstory.

    The background...

    I'm a designer with some moderate knowledge of programming working with my client's ASP/javascript programmer. The client's site was created back in 1999 or so, when he used FrontPage 98 to develop all the pages. When I was brought on board I was also forced to use FP, reluctantly, because that was basically the only way to edit / upload his pages.

    Fast forward to a few months ago, when I got a new computer with Vista. This meant I had to switch to ExpressionWeb, since everything I'd read about FP & Vista compatibility was not very good. Anyway, xWeb was changing everything to UTF-8 by default, despite there being inconsistent charset definitions throughout the site; most were missing, some were ISO-8859-1, others were something else. This inconsistency was revealed once the pages went live, thanks to a bunch of odd characters, primarily related to curly quotes, trademarks, and other symbols that had been carried over when the client copied/pasted stuff from MS Word.

    The problem...

    I suggested that we go through page by page and switch everything to UTF-8. I argued that we've been lucky to have gotten away with crappy haphazard coding as long as we have; we need to standardize already. Fortunately a consensus was reached and the project began.

    I was in charge of converting the static pages (i.e. not our shopping cart or other script-laden pages), which I did by opening the pages in xWeb, adding the proper charset declaration, and resaving/encoding as UTF-8. The pages I did this with ended up working fine, except for one or two places where old MS Word code was still used. Once the extra stuff was removed it worked fine.

    Meanwhile the ASP programmer understandably preferred to do the conversion of the vital ASP- and Javascript-laden pages herself. She uses Microsoft Script Editor rather than FP, specifically because MSE doesn't add extra bloated code.

    But when she tested these pages last night, they were broken. There was a new line of code at the top (which I believe was something like %codePage="65001") and boxes (square characters) in the middle of her javascript where there should have been blank spaces. She's at a loss to understand what happened.

    Now again, I'm no programmer. I completely cede to her knowledge of ASP and javascript. Nevertheless, it struck me that these errors implied that the pages weren't correctly saved as UTF-8. When we were trying to figure out what caused the problem, I asked if, in addition to adding the meta declaration tag, she actually encoded the files as utf-8.

    She said she opened up the pages, added the meta charset definition, and closed them again. This concerned me, since I didn't hear anything about 'encoding' in there. So I asked if they were saved as UTF-8 encoded pages, and she said "MS Script Editor doesn't do Save As, it just saves the file."

    I thought the problem was that simply adding the charset declaration isn't enough; the pages have to be specifically encoded to match. You usually have to tell your editor how you want pages to be encoded (i.e. which character encoding to use). The programmer seemed to get irritated by my suggestion -- of course, she was frustrated, understandably -- and said that MS Script Editor is more advanced than FP and that's why she uses it, and basically implied that it knows what to do.
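
A minimal sketch of that mismatch, in Python purely for illustration (the thread itself contains no code). The sample text and the helper name `show_mismatch` are made up; the idea is to encode a string the way an editor might save it and then decode it the way the page declares it, which is roughly what a browser does.

```python
# Hypothetical sample: curly quotes and a trademark sign, as pasted from Word.
SAMPLE = "\u201cHello\u201d \u2122"

def show_mismatch(text: str, saved_as: str, declared_as: str) -> str:
    """Encode text as the editor saved it, then decode it as the page declares."""
    raw = text.encode(saved_as)
    # errors="replace" stands in for a browser showing a placeholder glyph
    # when it meets bytes that are invalid in the declared encoding.
    return raw.decode(declared_as, errors="replace")

# Saved as Windows-1252 but declared as UTF-8: the symbols become placeholders.
print(show_mismatch(SAMPLE, "cp1252", "utf-8"))

# Saved as UTF-8 but declared as Windows-1252: each symbol turns into
# a short run of unrelated characters.
print(show_mismatch(SAMPLE, "utf-8", "cp1252"))

# Saved and declared consistently: the text survives intact.
print(show_mismatch(SAMPLE, "utf-8", "utf-8") == SAMPLE)
```

Only the last case round-trips cleanly, which is why adding a declaration without re-saving the file in the matching encoding can make things worse rather than better.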

    Since I'm not a programmer and I'm not nearly as knowledgeable about ASP or Javascript as she is (and I have no experience whatsoever with Script Editor), I really couldn't argue with that, or offer any other suggestions. Also I think she resents me for the whole encoding mess anyway. Maybe she's right.

    So my question to you gurus is: anyone have ideas about what might have gone wrong? Does anyone have experience in changing files w/javascript & ASP coding to unicode? Is MSE able to encode files in utf-8?
    Need a break from work? Visit About Schuyler Falls.

  2. #2
    SitePoint Guru
    Join Date
    Apr 2006
    Posts
    802
    And you still have a job?

    It would have been better to learn this before you 'standardized' your client's site.

  3. #3
    SitePoint Member
    Join Date
    Apr 2001
    Location
    New York, NY
    Posts
    18
    Yeah, believe it or not! I guess they're as incompetent as I am.

    In my defense, I didn't "standardize" the site on my own decision. (And by the way, all these changes were made to a test/backup site, not the live site. So nothing is permanently broken or anything.)

    If the programmer had said flatly that "no, the asp/javascript will be screwed up if we do that," we'd of course have stuck with the old files and I'd have had to either resign or use an old computer or something.

    But she never said that the asp/javascript code would be screwed up, and indeed the pages on the test site that I converted that do have javascript & ASP are working without a problem. So I don't think it was "standardizing" the site that caused the issue -- it seems to be something to do with Microsoft Script Editor or the method the programmer used to make the change that caused the problem.

    Which is the main question here. What could have caused this?

  4. #4
    Avid Logophile silver trophy
    ParkinT's Avatar
    Join Date
    May 2006
    Location
    Central Florida
    Posts
    2,337
    First, regardless of any preference for or against a particular editor, ASP, Javascript, and HTML should all be saved as plain text. The "square boxes" you described are a sign that the MS editor saved some characters outside the 128-character ASCII set. (Isn't that encoding?)
    Can't you open those files in notepad and resave the changes?
    Don't be yourself. Be someone a little nicer. -Mignon McLaughlin, journalist and author (1913-1983)


    Git is for EVERYONE
    Literally, the best app for readers.
    Make Your P@ssw0rd Secure
    Leveraging SubDomains

  5. #5
    SitePoint Member
    Join Date
    Apr 2001
    Location
    New York, NY
    Posts
    18
    That's probably what she'll be doing. (I ain't going near those files, myself!) Actually she'll probably just use the backup files. I believe that even Notepad gives you the choice to save things in ANSI or other encodings.

    I have no idea what those boxes represent; there shouldn't be ANYthing except that blank space. They seem to have been added instead of the usual indenting one finds in scripts (be they php, asp or javascript).

  6. #6
    Avid Logophile silver trophy
    ParkinT's Avatar
    Join Date
    May 2006
    Location
    Central Florida
    Posts
    2,337
    Those boxes represent "non-printable" characters, that is, characters whose codes fall outside the printable ASCII range of 32 to 126 (apart from a few control characters such as tab, which is 9). Even Notepad respects Tab!
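
One way to hunt for such characters is to scan the raw bytes of a file. The following is a rough Python sketch, not a tool anyone in this thread used; the function name `find_suspect_bytes` and the sample script fragment are hypothetical.

```python
def find_suspect_bytes(data: bytes) -> list[tuple[int, int, int]]:
    """Return (line, column, byte value) for every byte that is not printable ASCII.

    Tab, line feed and carriage return are treated as harmless whitespace.
    """
    allowed_controls = {0x09, 0x0A, 0x0D}
    hits = []
    line, col = 1, 1
    for b in data:
        if not (0x20 <= b <= 0x7E or b in allowed_controls):
            hits.append((line, col, b))
        if b == 0x0A:
            # A line feed starts a new line; reset the column counter.
            line, col = line + 1, 1
        else:
            col += 1
    return hits

# Hypothetical fragment: no-break spaces (0xA0) where an indent should be.
sample = b"function f() {\n\xa0\xa0return 1;\n}\n"
for line, col, b in find_suspect_bytes(sample):
    print(f"line {line}, col {col}: 0x{b:02X}")
```

On a real file you would read it in binary mode, for example `open(path, "rb").read()`, so that no decoding happens before the scan.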

  7. #7
    SitePoint Member
    Join Date
    Apr 2001
    Location
    New York, NY
    Posts
    18
    Thanks, ParkinT! I really appreciate your help.

    Turns out that I was right -- the programmer did just add the meta tag without saving the files using utf-8 encoding. The important thing is that she was able to open up the files, save them as ASCII and that cleared up the formatting issues with the javascript.

    For now we've decided to go back to Western European ISO. Which means I go back in and save/re-encode all of the site's pages. Lesson learned: sometimes you have to go backward in order to go forward!
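
For anyone re-encoding a batch of pages, the core step can be sketched in a few lines of Python. This is an illustration under assumptions, not what was actually run here: the helper name `reencode` and the sample bytes are made up, and the source encoding has to be known in advance because it cannot be reliably guessed from the bytes.

```python
def reencode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target one."""
    # Strict decoding: raises UnicodeDecodeError if the bytes are not valid in `source`.
    text = data.decode(source)
    # Strict encoding: raises UnicodeEncodeError if `target` cannot represent a character.
    return text.encode(target)

# Hypothetical input: curly quotes stored as Windows-1252 bytes.
word_bytes = b"\x93quoted\x94"
utf8_bytes = reencode(word_bytes, "cp1252", "utf-8")
print(utf8_bytes)

# Going back to ISO-8859-1 fails for curly quotes, because that
# character set has no code points for them.
try:
    reencode(utf8_bytes, "utf-8", "iso-8859-1")
except UnicodeEncodeError as exc:
    print("cannot store in ISO-8859-1:", exc.reason)
```

The second call is worth noting when moving back to a Western European charset: symbols pasted from Word may need to be replaced with plain equivalents or HTML entities first.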

  8. #8
    SitePoint Addict Mirek Komárek's Avatar
    Join Date
    Dec 2006
    Location
    Prague
    Posts
    210
    Last edited by Mirek Komárek; Nov 13, 2007 at 15:23. Reason: ups wrong url in clipboard

  9. #9
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Quote Originally Posted by kira View Post
    The important thing is that she was able to open up the files, save them as ASCII and that cleared up the formatting issues with the javascript.
    There is no such thing as plain ASCII. No such thing. It's a myth. What she probably saved the files as is CP-1252 (which, coincidentally, is almost the same as ISO-8859-1). Judging from your description so far, you're probably better off using ISO-8859-1 for the charset, since it tends to be the default in most systems (no guarantees though).
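
How close the two are can be checked directly. This Python sketch (the function name `differing_bytes` is made up) compares how each of the 256 byte values decodes under Python's `cp1252` and `latin-1` codecs, where `latin-1` is Python's name for ISO-8859-1.

```python
def differing_bytes() -> list[int]:
    """List the byte values that cp1252 and ISO-8859-1 decode differently."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        latin1 = raw.decode("latin-1")  # defined for all 256 byte values
        try:
            cp1252 = raw.decode("cp1252")
        except UnicodeDecodeError:
            cp1252 = None  # a handful of bytes are unassigned in Python's cp1252 codec
        if cp1252 != latin1:
            diffs.append(value)
    return diffs

diffs = differing_bytes()
print(f"{len(diffs)} differing bytes, 0x{min(diffs):02X}..0x{max(diffs):02X}")

# The curly quotes live in exactly that range:
print(hex(ord(b"\x93".decode("cp1252"))))   # left double quotation mark
print(hex(ord(b"\x93".decode("latin-1"))))  # a C1 control character
```

The difference is confined to the range 0x80 to 0x9F, which is where CP-1252 puts curly quotes, the trademark sign and similar symbols, so Word-pasted text is precisely the content that exposes it.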

    Oh, and just to save you the grief later on: meta tags are only relevant when the page isn't served by a web server, which sends an HTTP header. When it is, the header takes precedence.

  10. #10
    SitePoint Member
    Join Date
    Apr 2001
    Location
    New York, NY
    Posts
    18
    Quote Originally Posted by Mirek Komárek
    That was slightly separate -- we had that, too, at least in some pages. Not all of them. The BOM was mostly at the top of the page and just caused a few extra characters at the top; these blank/tab characters were breaking things altogether. It was like finding a needle in a haystack!
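
For reference, the UTF-8 BOM is the three bytes EF BB BF at the very start of a file. A small Python sketch of detecting and removing it follows; the helper name `strip_utf8_bom` and the sample page are hypothetical.

```python
import codecs

def strip_utf8_bom(data: bytes) -> tuple[bytes, bool]:
    """Return the data without a leading UTF-8 BOM, plus whether one was found."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):], True
    return data, False

# Hypothetical page that an editor saved with a BOM.
page = codecs.BOM_UTF8 + b"<html></html>"
cleaned, had_bom = strip_utf8_bom(page)
print(had_bom, cleaned)

# When the BOM bytes are read as Windows-1252 they show up as three
# stray characters, which matches "a few extra characters at the top".
print(codecs.BOM_UTF8.decode("cp1252"))
```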

    Quote Originally Posted by kyberfabrikken View Post
    There is no such thing as plain ASCII. No such thing. It's a myth. What she probably saved the files as is CP-1252 (which, coincidentally, is almost the same as ISO-8859-1). Judging from your description so far, you're probably better off using ISO-8859-1 for the charset, since it tends to be the default in most systems (no guarantees though).
    Truthfully I *think* she might be talking about ANSI instead. Notepad offers that as an option instead of unicode, I know that. Not sure. Honestly, I dunno what the story is, there's kind of a political situation here and the less I question her at this point, the better.

    Oh, and just to save you the grief later on: meta tags are only relevant when the page isn't served by a web server, which sends an HTTP header. When it is, the header takes precedence.
    Unless I'm mistaken (and God knows, I could be!), isn't that only as long as the server is sending a charset along with the HTTP header? Our server isn't; I've checked, believe me! All it's sending is:

    Content-Type: text/html

    Therefore, the meta tag is important, at least in our situation. Heck, what started us off on this merry adventure in the first place was my discovery that in the pages without any charset declaration (a majority), things were getting royally screwed up. Sigh. We were so innocent back then!

    Thanks for your help, guys.

  11. #11
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Quote Originally Posted by kira View Post
    Unless I'm mistaken (and God knows, I could be!), isn't that only as long as the server is sending a charset along with the HTTP header? Our server isn't; I've checked, believe me! All it's sending is:
    Yes, but most servers would send a charset as part of the header. You can configure it not to (As you have in your case), but that's an odd choice. To prevent any ambiguity, I'd always send the proper charset as part of the content-type header.
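
The precedence discussed above can be sketched as follows, in Python for illustration only. The function names are made up, and the fallback logic is a simplification of what browsers really do (they also look at a BOM and may sniff the content), so treat it as a model of the argument rather than of any particular browser.

```python
from email.message import Message
from typing import Optional

def charset_from_content_type(header_value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value, if any."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the lower-cased charset, or None when absent.
    return msg.get_content_charset()

def effective_charset(header_value: str, meta_charset: Optional[str]) -> Optional[str]:
    """Simplified rule: a charset in the HTTP header wins, otherwise use the meta tag."""
    return charset_from_content_type(header_value) or meta_charset

# The header from this thread carries no charset, so the meta tag decides.
print(effective_charset("text/html", "iso-8859-1"))

# Once the server sends a charset, the meta tag no longer matters.
print(effective_charset("text/html; charset=UTF-8", "iso-8859-1"))
```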

