  1. #1
    SitePoint Wizard bronze trophy
    Join Date
    Jul 2006
    Location
    Augusta, Georgia, United States
    Posts
    4,151
    Mentioned
    16 Post(s)
    Tagged
    3 Thread(s)

    ISO-8859-1 to utf-8 smart replacement/conversion

    Anyone have any good resources on the topic of converting ISO-8859-1 to utf8?

    I'm ending up with weird characters using this function I swiped from the PHP doc comments:

    PHP Code:
    function fixEncoding($in_str)
    {
      $cur_encoding = mb_detect_encoding($in_str);
      if ($cur_encoding == "UTF-8" && mb_check_encoding($in_str, "UTF-8"))
        return $in_str;
      else
        return utf8_encode($in_str);
    }

    Is there any "straightforward" way to prevent the "weird" characters?

    I'm converting information stored in a table into XML. If I leave out the conversion above, the XML fails with this error:

    HTML Code:
    XML Parsing Error: junk after document element
    Location: http://local.project-padv5/cascade_module.php
    Line Number 10, Column 1:
    I'm not sure which route to take here, since the conversion function above isn't really "smart": it leads to weird characters (excuse my lack of proper terminology) being output in certain places in the XML.

    These are the current functions to build the XML hierarchy from the domain level objects.

    PHP Code:
    function parse_object(IActiveRecordDataEntity $entity,DOMDocument $dom,DOMElement $node=null) {

        if(is_null($node)) {
            $node = $dom->createElement(Inflector::underscore(get_class($entity)));
            $dom->appendChild($node);
        }

        foreach($entity as $property=>$value) {

            $objectNode = $dom->createElement($property);
            $node->appendChild($objectNode);

            if($value instanceof IActiveRecordDataEntity) {

                parse_object($value,$dom,$objectNode);

            } else if($value instanceof ActiveRecordCollection) {

                parse_collection($value,$dom,$objectNode);

            } else {

                // this line is the problem
                $textNode = $dom->createTextNode(is_null($value)?'':$value);
                $objectNode->appendChild($textNode);

            }

        }

        return $node;

    }

    function parse_collection(ActiveRecordCollection $collection,DOMDocument $dom,DOMElement $node=null) {

        if(is_null($node)) {
            $node = $dom->createElement('active_records');
            $dom->appendChild($node);
        }

        if(count($collection)!=0) {
            foreach($collection as $object) {

                $childNode = parse_object($object,$dom);
                $node->appendChild($childNode);

            }
        }

    }

    I'm not really familiar with the specifics of character encoding so if someone could help me it would be appreciated.

    thanks

  2. #2
    SitePoint Wizard
    Join Date
    Mar 2008
    Posts
    1,149
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    You need to pass a list of encodings to test for to mb_detect_encoding():
    PHP Code:
    mb_detect_encoding($in_str, 'ascii, iso-8859-1, cp1252, utf-8');
    The default list is tailored for Japanese.

    Plus, utf8_encode() assumes that the input is always ISO-8859-1. Use mb_convert_encoding() instead.

    Although in actuality, mb_convert_encoding() can detect and convert in one function call. See the documentation.
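
A minimal sketch of that suggestion, assuming the mbstring extension is loaded. The third argument of mb_convert_encoding() names the encoding the bytes are currently in; it can also be a comma-separated candidate list, in which case mbstring picks one by detection. The helper name toUtf8() is made up for illustration and is not from the thread.

```php
<?php
// Hypothetical helper (not from the thread): convert a string whose
// source encoding is known to UTF-8 using mbstring.
function toUtf8(string $value, string $from = 'ISO-8859-1'): string
{
    // The third argument names the encoding the bytes are currently in.
    return mb_convert_encoding($value, 'UTF-8', $from);
}

// "é" is the single byte 0xE9 in ISO-8859-1 and the two bytes C3 A9 in UTF-8.
$latin1 = "caf\xE9";
$utf8   = toUtf8($latin1);
```

Naming the source encoding explicitly avoids detection entirely, which is the safer choice whenever the encoding of the table is actually known.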

  3. #3
    SitePoint Wizard bronze trophy
    Join Date
    Jul 2006
    Location
    Augusta, Georgia, United States
    Posts
    4,151
    Mentioned
    16 Post(s)
    Tagged
    3 Thread(s)
    Are there any decent resources available online for handling unsupported characters in an ISO-8859-1 to UTF-8 conversion?

  4. #4
    SitePoint Wizard
    Join Date
    Mar 2008
    Posts
    1,149
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    What do you mean by non-supported characters?

  5. #5
    SitePoint Wizard bronze trophy
    Join Date
    Jul 2006
    Location
    Augusta, Georgia, United States
    Posts
    4,151
    Mentioned
    16 Post(s)
    Tagged
    3 Thread(s)
    A weird box-looking character with a fraction-like thing inside…

  6. #6
    SitePoint Wizard
    Join Date
    Mar 2008
    Posts
    1,149
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Well, did you try my suggestion? The function you were using originally would have given you such characters...

  7. #7
    SitePoint Wizard bronze trophy
    Join Date
    Jul 2006
    Location
    Augusta, Georgia, United States
    Posts
    4,151
    Mentioned
    16 Post(s)
    Tagged
    3 Thread(s)
    Yeah, still getting those weird characters. I'm thinking I need some type of conversion mechanism or something. However, I'm not really sure what that would entail at the moment. I've placed the conversion call at the spot marked "// here" in the code below.

    PHP Code:
    <?php
    require_once('active_record_model_config.interface.php');
    require_once('active_record_model_config.class.php');

    class ActiveRecordDOMElement extends DOMDocument {

        public function __construct($active,$version=null,$encoding=null) {

            parent::__construct($version,$encoding);

            $this->_init($active);

        }

        protected function _init($value) {

            if($value instanceof IActiveRecordDataEntity) {

                $this->_parseObject($value);

            } else if($value instanceof ActiveRecordCollection) {

                $this->_parseCollection($value);

            } else {

                throw new Exception('First argument of '.__CLASS__.' must be instance of either IActiveRecordDataEntity or ActiveRecordCollection. Exception thrown from method '.__METHOD__.' at line '.__LINE__.'.');

            }

        }

        protected function _parseObject(ActiveRecord $entity,DOMElement $node=null) {

            $className = get_class($entity);
            if(is_null($node)) {
                $node = $this->createElement(Inflector::underscore($className));
                $this->appendChild($node);
            }

            $config = ActiveRecordModelConfig::getModelConfig($className);
            $node->setAttribute('model',$config->getClassName());
            $node->setAttribute('table',$config->getTable());

            $pk = false;

            foreach($entity as $property=>$value) {

                $objectNode = $this->createElement($property);
                $node->appendChild($objectNode);

                if($pk===false && strcmp($config->getPrimaryKey(),$property)==0) {
                    $node->setAttribute(IActiveRecordModelConfig::defaultPrimaryKeyName,$value);
                    $pk = true;
                }

                $config = ActiveRecordModelConfig::getModelConfig(get_class());

                if($value instanceof IActiveRecordDataEntity) {

                    $this->_parseObject($value,$objectNode);

                } else if($value instanceof ActiveRecordCollection) {

                    $this->_parseCollection($value,$objectNode);

                } else {

                    // here
                    $convertedString = is_null($value)?'':mb_convert_encoding($value,'UTF-8',"iso-8859-1");
                    $textNode = $this->createTextNode($convertedString);
                    $objectNode->appendChild($textNode);

                }

            }

            return $node;

        }

        protected function _parseCollection(ActiveRecordCollection $collection,DOMElement $node=null) {

            if(is_null($node)) {
                $name = count($collection)!=0?Inflector::pluralize(Inflector::underscore(get_class($collection[0]))):'active_records';
                $node = $this->createElement($name);
                $this->appendChild($node);
            }

            if(count($collection)!=0) {
                foreach($collection as $object) {

                    $childNode = $this->_parseObject($object);
                    $node->appendChild($childNode);

                }
            }

        }

    }
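
PHP's DOM extension works with UTF-8 strings, so whatever is handed to createTextNode() should already be valid UTF-8. The sketch below isolates that step from the class above. The helper name appendTextElement(), the element names, and the assumption that the stored data is Windows-1252 are all illustrative, not taken from the original code.

```php
<?php
// Hypothetical helper: append <$name>text</$name> to $parent, converting
// the text from $sourceEncoding to UTF-8 before it reaches the DOM.
function appendTextElement(
    DOMDocument $dom,
    DOMElement $parent,
    string $name,
    ?string $value,
    string $sourceEncoding = 'Windows-1252' // assumption about the stored data
): DOMElement {
    $element = $dom->createElement($name);
    $parent->appendChild($element);

    // A null column value becomes an empty text node, as in the class above.
    $text = $value === null ? '' : mb_convert_encoding($value, 'UTF-8', $sourceEncoding);
    $element->appendChild($dom->createTextNode($text));

    return $element;
}

$dom  = new DOMDocument('1.0', 'UTF-8');
$root = $dom->appendChild($dom->createElement('record'));
appendTextElement($dom, $root, 'title', "It\x92s here"); // 0x92: CP1252 curly apostrophe
$xml = $dom->saveXML();
```

Keeping the conversion in one small function makes it easy to swap the source encoding once the real encoding of the table has been confirmed.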

  8. #8
    SitePoint Wizard bronze trophy
    Join Date
    Jul 2006
    Location
    Augusta, Georgia, United States
    Posts
    4,151
    Mentioned
    16 Post(s)
    Tagged
    3 Thread(s)
    example of the output:
    Attached Images

  9. #9
    SitePoint Wizard
    Join Date
    Mar 2008
    Posts
    1,149
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Looks like CP1252. CP1252 is Microsoft's version of ISO-8859-1, but with more characters (curly quotes, for one).

    You need to do this:
    PHP Code:
    mb_convert_encoding($value, 'utf-8', 'ascii, iso-8859-1, cp1252, utf-8');
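
To make the difference between the two encodings concrete, here is a small sketch (assuming mbstring is available) that converts the same byte under each interpretation. The byte 0x92 is chosen as a sample because it falls in the 0x80-0x9F range, where CP1252 has printable characters and ISO-8859-1 has control codes.

```php
<?php
// The same byte interpreted under the two encodings discussed above.
$byte = "\x92";

// As ISO-8859-1 it maps to the control code U+0092 (UTF-8 bytes C2 92).
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

// As Windows-1252 it maps to U+2019, the right single quotation mark
// (UTF-8 bytes E2 80 99).
$asCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');
```

A control code such as U+0092 has no glyph, which is one plausible reason a viewer might draw a box where a curly quote was intended.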

  10. #10
    SitePoint Wizard bronze trophy
    Join Date
    Jul 2006
    Location
    Augusta, Georgia, United States
    Posts
    4,151
    Mentioned
    16 Post(s)
    Tagged
    3 Thread(s)
    That didn't change anything.

  11. #11
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    iso-8859-1 is an encoding that covers a charset consisting of ascii + 128 extra code points (96 printable characters plus 32 control codes). utf-8 is an encoding that covers unicode (which covers ~ all characters). So there are no characters in an iso-8859-1 encoded string that can't be represented in utf-8. However, if you have a string of bytes that you treat as if it were iso-8859-1 and it isn't in fact iso-8859-1, then it will be interpreted wrongly when you try to convert it.

    That all aside, the simplest way to convert from iso-8859-1 to utf-8 is with utf8_encode, and going from utf-8 to iso-8859-1 can be done with utf8_decode. If that doesn't work, then the input isn't iso-8859-1 (in that case, it might be cp-1252, or it might be random garbage).
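
The misinterpretation described above can be reproduced in a few lines. This is only a sketch, assuming mbstring is loaded: it takes a string that is already UTF-8 and converts it as though it were iso-8859-1, which is one common way such garbage arises.

```php
<?php
// "é" already encoded as UTF-8 (bytes C3 A9).
$alreadyUtf8 = "\xC3\xA9";

// Wrongly treating those bytes as ISO-8859-1 converts each byte on its own:
// 0xC3 becomes "Ã" (C3 83) and 0xA9 becomes "©" (C2 A9).
$doubleEncoded = mb_convert_encoding($alreadyUtf8, 'UTF-8', 'ISO-8859-1');

// Converting back from UTF-8 to ISO-8859-1 undoes the mistake.
$restored = mb_convert_encoding($doubleEncoded, 'ISO-8859-1', 'UTF-8');
```

If the output shows pairs like "Ã©" where a single accented letter was expected, double encoding of this kind is a likely suspect.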

  12. #12
    SitePoint Wizard bronze trophy
    Join Date
    Jul 2008
    Posts
    5,757
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Try to isolate a small amount of text that has the garbage characters in it. Then do a hex dump or base64 of it and post it.
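
One way to produce such a dump, sketched with PHP's built-in bin2hex() and base64_encode(). The helper name hexDump() and the sample string are made up for illustration; the real input would be the suspect value read from the table.

```php
<?php
// Hypothetical helper: render a byte string as space-separated hex pairs
// so the exact bytes can be pasted into a forum post.
function hexDump(string $bytes): string
{
    if ($bytes === '') {
        return '';
    }
    // bin2hex() yields two hex digits per byte; split them into pairs.
    return implode(' ', str_split(bin2hex($bytes), 2));
}

$sample = "It\x92s";                // stand-in for the suspect text
$hex    = hexDump($sample);         // one hex pair per byte
$b64    = base64_encode($sample);   // alternative that survives copy/paste
```

Either form shows the raw bytes regardless of how the browser or editor chooses to render them.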

  13. #13
    SitePoint Wizard
    Join Date
    Mar 2008
    Posts
    1,149
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Err.

    PHP Code:
    mb_convert_encoding($value, 'utf-8', 'ascii, cp1252, utf-8');
    No clue why mbstring was detecting that string as ISO-8859-1, since 0x0092 does not exist as a character in ISO-8859-1 (only in Windows-1252).
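
Single-byte encodings such as CP1252 accept nearly any byte sequence, so the order of a detection list can matter a great deal. An alternative sketch, assuming the data is either already valid UTF-8 or else CP1252, checks for UTF-8 first and converts only when that check fails. The function name toUtf8FromCp1252() is hypothetical.

```php
<?php
// Hypothetical helper: return $value as UTF-8, assuming anything that is
// not already valid UTF-8 is Windows-1252 (CP1252).
function toUtf8FromCp1252(?string $value): string
{
    if ($value === null || $value === '') {
        return '';
    }
    // Leave strings that are already valid UTF-8 untouched,
    // so they are not encoded a second time.
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    return mb_convert_encoding($value, 'UTF-8', 'Windows-1252');
}
```

The trade-off is that a CP1252 string whose bytes happen to form valid UTF-8 would pass through unchanged; for ordinary text that case is rare, but it is an assumption worth checking against real data.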

