  1. #1
    Keeper of the SFL StarLion's Avatar
    Join Date
    Feb 2006
    Location
    Atlanta, GA, USA
    Posts
    3,748
    Mentioned
    73 Post(s)
    Tagged
    0 Thread(s)

    2-character NBSP?

    For some odd reason, I managed to find a way to make NBSP two-characters long.

    The original string was (With 'start' and 'end' appended for display purposes):
    Code:
    start&nbsp;5TS50568A30099246end
    Okay, not a problem... obviously, it's treating &nbsp; as 5 characters; substr from 5 (remember, 'start' isn't actually there), and you get the number, right? No... you lose 5TS.
    Well, then it must be treating the non-breaking space as a single ASCII character, so substr starting at 1. Nope... I get a question mark in front of the string.
    Try 2, since I lost 3 characters with 5: works fine.

    Is this a case of some odd substr indexing? Or did I really end up with a 2-character-long NBSP?
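
    The three attempts above can be reproduced with a short sketch. It assumes the decoded string is UTF-8, so the test string is hard-coded with the NBSP as the two bytes 0xC2 0xA0; that is an assumption about the original data, not something taken from it.

```php
<?php
// Assumption: the decoded string is UTF-8, so the NBSP is the bytes 0xC2 0xA0.
$s = "\xC2\xA0" . '5TS50568A30099246';

// substr() counts bytes, not characters.
$from5 = substr($s, 5); // skips the 2 NBSP bytes plus "5TS"
$from1 = substr($s, 1); // leaves the stray byte 0xA0 in front
$from2 = substr($s, 2); // skips exactly the NBSP

echo $from5, "\n";             // 50568A30099246
echo bin2hex($from1[0]), "\n"; // a0
echo $from2, "\n";             // 5TS50568A30099246
```

    The lone 0xA0 byte left by the offset-1 attempt is not valid UTF-8 on its own, which would explain a question mark being displayed.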

  2. #2
    . shoooo... silver trophy logic_earth's Avatar
    Join Date
    Oct 2005
    Location
    CA
    Posts
    9,013
    Mentioned
    8 Post(s)
    Tagged
    0 Thread(s)
    &nbsp; is an HTML entity; you need to convert it first.

    html_entity_decode
    http://www.php.net/manual/en/functio...ity-decode.php
    Logic without the fatal effects.
    All code snippets are licensed under WTFPL.


  3. #3
    Keeper of the SFL StarLion's Avatar
    This is even after entity decoding.
    Character 1 of the string returned ord 194; character 2 returned 160.

    For clarity:
    $simba is initialized to the HTML output of a webpage.
    PHP Code:
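    $simba = strip_tags($simba);
    $simba = html_entity_decode($simba, ENT_NOQUOTES);
    // Take the text between 'ID #' and the next ')'.
    $id = explode('ID #', $simba);
    $id = trim($id[1]);
    $id = explode(')', $id);
    $id = $id[0];
    // Print the values of the first two bytes of the ID.
    echo "ID 1:" . ord($id[0]) . " ID 2:" . ord($id[1]) . "<br>";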

  4. #4
    . shoooo... silver trophy logic_earth's Avatar
    Are you using UTF-8?

    got this from the manual:
    Note: You might wonder why trim(html_entity_decode('&nbsp;')); doesn't reduce the string to an empty string, that's because the '&nbsp;' entity is not ASCII code 32 (which is stripped by trim()) but ASCII code 160 (0xa0) in the default ISO 8859-1 characterset.
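
    One possible way to strip the NBSP along with ordinary whitespace is sketched below. The helper name trim_nbsp is hypothetical (it is not a PHP built-in), and the sketch assumes the string is valid UTF-8.

```php
<?php
// Hypothetical helper (not part of PHP): trims ASCII whitespace and U+00A0
// from both ends. Assumes $s is valid UTF-8; the /u modifier makes PCRE
// treat the subject as UTF-8 so \x{00A0} matches the whole character.
function trim_nbsp(string $s): string
{
    // preg_replace() returns null on invalid UTF-8; fall back to the input.
    return preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $s) ?? $s;
}

$decoded = html_entity_decode('&nbsp;5TS50568A30099246 ', ENT_NOQUOTES, 'UTF-8');

var_dump(strlen(trim($decoded)));      // int(19): the 2 NBSP bytes are still there
var_dump(strlen(trim_nbsp($decoded))); // int(17): only the ID remains
```

    A regular expression is used here rather than passing "\xC2\xA0" as trim()'s character list, because trim() works on individual bytes and could cut another multi-byte character in half.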

  5. #5
    Keeper of the SFL StarLion's Avatar
    The database I'm pulling $simba from is in the latin1_swedish_ci collation (the MySQL server's default, apparently); other than that I'm not modifying the charset anywhere.

    Edit: ah, okay... odd. You'd think they'd include that in the trim() code... don't think it would hurt UTF-8 coding?

  6. #6
    . shoooo... silver trophy logic_earth's Avatar
    Conducting my own test, UTF-8 results in 194 160 for &nbsp; and ISO-8859-1 results in just 160.

    Test Code:
    Code php:
    header('Content-type: text/plain');
     
    $s = '&nbsp;5TS50568A30099246';
     
    print $s . ' | ' . strlen($s);
     
    $i = html_entity_decode($s, ENT_NOQUOTES, 'ISO-8859-1');
    $u = html_entity_decode($s, ENT_NOQUOTES, 'UTF-8');
     
    print "\n\n" . $i . ' | ' . strlen($i);
    print "\n" . $u . ' | ' . strlen($u);
     
    function tex2dec($data) {
        $r = '';
        $l = strlen($data);
     
        for ($i = 0; $i < $l; $i++) {
            $r .= ord(substr($data, $i, 1)) . ' ';
        }
     
        return $r;
     
    }
     
    print "\n\n" . tex2dec($i);
    print "\n" . tex2dec($u);

    Also, I was using UTF-8 as the encoding for writing the page and sending it via the HTTP charset.

  7. #7
    Keeper of the SFL StarLion's Avatar
    Interesting - so in UTF-8, a NBSP IS 2-characters long.... very weird.

  8. #8
    SitePoint Enthusiast Bellthorpe's Avatar
    Join Date
    Aug 2006
    Posts
    82
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Why is that weird? It's not an ASCII character.

  9. #9
    Keeper of the SFL StarLion's Avatar
    No, but it's represented as a single-character space. Why would a single character space take two values to interpret, especially when one of those values is completely unnecessary in the other charset?

  10. #10
    Programming Since 1978 silver trophybronze trophy felgall's Avatar
    Join Date
    Sep 2005
    Location
    Sydney, NSW, Australia
    Posts
    16,875
    Mentioned
    25 Post(s)
    Tagged
    1 Thread(s)
    Unicode uses up to four bytes per character because it has room for over a million different characters. ASCII only supports 128 characters and so doesn't even use a full byte. If you can find a way to fit a million values into a single byte that can only hold 256 different values, then you can define an alternative to Unicode that only uses single-byte characters, and you will be able to make a fortune.

    Since UTF-8 and UTF-16 do not use 4 bytes consistently for all characters, certain bit patterns are reserved in each encoding to mark that a character uses two or more bytes instead of one. In UTF-8, every character above 7F (127) needs more than one byte; A0 (or 160) is within that range and therefore uses a two-byte representation (194 160). It is still a single character, it just uses two bytes to hold the character instead of one.
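
    To make the byte layout concrete, here is a minimal sketch of UTF-8 encoding done by hand. The function name utf8_encode_codepoint is hypothetical, and the sketch does no validation (it does not reject surrogates or values above 10FFFF).

```php
<?php
// Hypothetical helper for illustration: encode one Unicode code point as UTF-8.
function utf8_encode_codepoint(int $cp): string
{
    if ($cp < 0x80) {      // 1 byte:  0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {     // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {   // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

$nbsp = utf8_encode_codepoint(0xA0);
echo ord($nbsp[0]), ' ', ord($nbsp[1]), "\n";   // 194 160
echo strlen(utf8_encode_codepoint(0x20)), "\n"; // 1 (ordinary space)
```

    The first byte carries the length marker (110 for two bytes), and every following byte starts with 10, which is why 160 shows up as the second byte rather than on its own.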
    Stephen J Chapman

    javascriptexample.net, Book Reviews, follow me on Twitter
    HTML Help, CSS Help, JavaScript Help, PHP/mySQL Help, blog
    <input name="html5" type="text" required pattern="^$">

  11. #11
    SitePoint Enthusiast Bellthorpe's Avatar
    Originally posted by StarLion:
    No, but it's represented as a single-character space. Why would a single character space take two values to interpret, especially when one of those values is completely unnecessary in the other charset?
    What do you mean by a single-character space? ❶ is a single character. So is 中. Do you propose that each of them could somehow be represented by a single byte, considering that there are no more than 256 combinations of bits in a byte?
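
    A small sketch of how many bytes each of these single characters takes in UTF-8. The characters are written as explicit byte sequences so the result does not depend on how the script file itself is saved; the array keys are just illustrative labels.

```php
<?php
// Each entry is one character, spelled out as its UTF-8 bytes.
$samples = [
    'space' => "\x20",         // U+0020
    'nbsp'  => "\xC2\xA0",     // U+00A0
    'one'   => "\xE2\x9D\xB6", // U+2776 (the circled digit one)
    'zhong' => "\xE4\xB8\xAD", // U+4E2D (the CJK character)
];

foreach ($samples as $name => $char) {
    // strlen() reports bytes, so one character can report 1, 2 or 3 here.
    printf("%-5s bytes=%d hex=%s\n", $name, strlen($char), bin2hex($char));
}
```

    If the mbstring extension is available, mb_strlen($char, 'UTF-8') reports 1 for every entry, which is the character count rather than the byte count.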

