  1. #1
    SitePoint Guru
    Join Date
    Nov 2004
    Location
    Parry Sound, ON
    Posts
    725

    Translating non-SGML characters

    I have a CMS for a newspaper site in which users from production, who use Macs, input the data by pasting text in from the actual Quark files that were used to lay out the print paper.

    There are lots of funny characters in there, such as typographically correct apostrophes and dashes, among others. These make the pages built from that data fail the validator with errors like "non SGML character number 146". My question is: what methods do people use to translate the data into something that will pass the validator (XHTML 1.0 Strict)?

    I have created a function like this to take care of the biggest offenders:
    PHP Code:
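    function prep_text($text)
    {
        // $text = htmlentities($text);

        // Escape ampersands first so the entities added below are not escaped again.
        $text = preg_replace("/[&]/", "&amp;", $text);
        $text = preg_replace("/[']/", "&#39;", $text);

        // Windows-1252 curly single quotes (0x91, 0x92)
        $text = preg_replace("/[\x91]/", "&#39;", $text);
        $text = preg_replace("/[\x92]/", "&#39;", $text);

        // Windows-1252 curly double quotes (0x93, 0x94)
        $text = preg_replace("/[\x93]/", "&quot;", $text);
        $text = preg_replace("/[\x94]/", "&quot;", $text);

        // Windows-1252 en dash (0x96)
        $text = preg_replace("/[\x96]/", "-", $text);

        return $text;
    }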

    But I'm certain that's ugly and that there's a cleaner way to do it. And what if they use some character that I haven't accounted for here? Maybe I'm missing something altogether, such as serving the content with a different character encoding.

    Help please.
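
    One possible cleaner approach, offered as a sketch rather than a definitive fix: byte values such as 146 (0x92) suggest the pasted text is arriving as Windows-1252, so instead of listing offenders one by one, the whole string can be converted to UTF-8, escaped, and then have every remaining non-ASCII character turned into a numeric entity. The function name prep_text_entities() is hypothetical, the Windows-1252 source encoding is an assumption about the CMS input, and the code relies on PHP's mbstring extension being available.

```php
<?php
// Sketch only: assumes the input bytes are Windows-1252 and that the
// mbstring extension is loaded. prep_text_entities() is a hypothetical
// name, not part of the original CMS.
function prep_text_entities(string $text): string
{
    // Reinterpret the Windows-1252 bytes (0x91-0x97 are curly quotes,
    // bullets and dashes) as proper Unicode characters in UTF-8.
    $utf8 = mb_convert_encoding($text, 'UTF-8', 'Windows-1252');

    // Escape &, <, > and both quote styles for use in XHTML.
    $escaped = htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8');

    // Turn every character above ASCII into a decimal numeric entity,
    // so the output is plain ASCII whatever charset the page declares.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($escaped, $map, 'UTF-8');
}

echo prep_text_entities("It\x92s \x93fine\x94 \x96 really"), "\n";
```

    With this approach a pasted "It\x92s" should come out as "It&#8217;s", and characters not covered by a hand-written list are handled the same way. If the site is served as UTF-8 throughout, the last step could arguably be dropped and the converted text stored as UTF-8 directly.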

  2. #2
    Ahem...(or some similar bump phrase)

