  1. #1
    SitePoint Addict raydenx's Avatar
    Join Date
    Jun 2003
    Location
    Singapore
    Posts
    208

    Replacing Ampersand & in XML documents

    I am trying to replace ampersands (&) with &amp; in my string. It seems easy, but it must be smart enough not to turn strings like &nbsp; into &amp;nbsp;, and the same goes for other entities like &Iuml;.

    Currently I have:

    $message = ereg_replace("&", "&amp;", $message);

    I want text to be like:

    tom & jerry = tom &amp; jerry
    & = &amp; (& on its own with nothing before or after it)
    &nbsp; = &nbsp; (stays unchanged, as do other HTML entities that start with &)

    Any ideas anyone??

  2. #2
    SitePoint Zealot zalucius's Avatar
    Join Date
    Jul 2007
    Location
    Denmark
    Posts
    162
    I once did it by adding a space after both the search and replace strings; that way it only replaces a standalone & and not the & at the start of an entity such as &nbsp;.

    I know it's not the best solution; some sort of regular expression would probably be best...

    Code:
    $message = ereg_replace("& ", "&amp; ", $message);
    It did the job for me, hope it can help you as well.
    zalucius

  3. #3
    Working on it... Contrid's Avatar
    Join Date
    Apr 2006
    Location
    Online
    Posts
    955
    How about :

    Code php:
    htmlentities($string);
    And so I got lost in code...completely asphyxiated by it...

    Premium WordPress plugins - Tribulant Software

  4. #4
    Working on it... Contrid's Avatar
    Or...if you just wanted to replace the ampersand, you could probably go with something like this :

    Code php:
    // ^ and $ anchor the pattern to the whole string, so this only matches a string that is just "&"
    $newstring = preg_replace("/^[\&]$/i", "", $old_string);

    I haven't tried it. My regex also isn't up to standard. ...so you might want to play around with it.

  5. #5
    SitePoint Guru
    Join Date
    Jun 2004
    Location
    Finland
    Posts
    703
    If you are lucky enough to be running PHP 5.2.3+, you can use htmlspecialchars() or htmlentities() with double_encode set to false. You might also want to run html_entity_decode() on the string first and then apply htmlspecialchars() for a somewhat similar effect. Or you could just write a suitable regex.
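To illustrate the double_encode suggestion, here is a minimal sketch, assuming PHP 5.2.3 or later. The helper name escape_once() is made up for this example; htmlspecialchars() and its fourth parameter (double_encode) are the real API.

```php
<?php
// Hypothetical helper (the name is illustrative, not from the thread).
// Escapes bare ampersands but leaves existing entities such as &nbsp; alone.
function escape_once($text)
{
    // The fourth argument is double_encode; false means
    // "do not re-encode entities that are already present".
    return htmlspecialchars($text, ENT_NOQUOTES, 'UTF-8', false);
}

echo escape_once('tom & jerry'), "\n";
echo escape_once('&nbsp; and &Iuml; stay as they are'), "\n";
```

Note that htmlspecialchars() still converts < and >, so this suits plain text rather than text that already contains markup you want to keep.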

  6. #6
    SitePoint Addict raydenx's Avatar
    "When double_encode is turned off PHP will not encode existing html entities. The default is to convert everything."

    Cool! But the PHP version on my web host is 5.2.0. Only version 5.2.3 and up support double_encode. Might be worth the upgrade just for this function to work properly.

  7. #7
    SitePoint Addict raydenx's Avatar
    PHP Code:
    $message = htmlentities(trim($message), ENT_NOQUOTES, "UTF-8", false);
    I installed PHP 5.2.3 on my development server and tried the above code. However, I still get crappy code like:

    PHP Code:
    &amp;nbsp;
    Any idea why htmlentities is still replacing existing HTML entities? I tried double_encode = true and false. Setting it to true was much worse.

  8. #8
    SitePoint Addict raydenx's Avatar
    PHP Code:
    $message = html_entity_decode(trim($message));
    I fixed my previous bug by making the string purely HTML first. The next bug is allowing certain tags to remain untouched...

    When I use htmlentities(), it turns all my hyperlinks into crappy code like:

    PHP Code:
    &lt;a target="_blank" href="http://www.youtube.com/profile?user=oasisvideos"&gt;Oasis Fanatic Youtube account&lt;/a&gt;
    That is why I tried not to use this htmlentities function. Any alternatives?

  9. #9
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Quote: Originally posted by raydenx
    That is why I tried not to use this htmlentities function. Any alternatives?
    The root of your problem is that you encoded the data too early. You should never need the functionality you're describing. Where do you get your data from?

  10. #10
    SitePoint Addict raydenx's Avatar
    My data is from my web site's forum posts and the posts may have URLs and other tags that I want to preserve.

  11. #11
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Quote: Originally posted by raydenx
    My data is from my web site's forum posts and the posts may have URLs and other tags that I want to preserve.
    I see. You could try to prevent posters from posting invalid markup then, e.g. validate it and give an error message. I'm not sure if that's feasible; it probably depends on your audience.

    Otherwise, you can use htmltidy, which is a tool for cleaning up malformed HTML.

  12. #12
    SitePoint Addict raydenx's Avatar
    The user's input is validated. I just want to allow URLs, images and bullet tags without replacing the < > characters into html entities.

  13. #13
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Quote: Originally posted by raydenx
    The user's input is validated. I just want to allow URLs, images and bullet tags without replacing the < > characters into html entities.
    You're accepting HTML (or a subset thereof) as input, so you should validate that this input is valid HTML. That includes encoding ampersands as entities. As it stands, if the input text is &amp;, you really have no way of knowing whether the user wanted to write an & or the literal text &amp;. It's not a major thing, but it's bad practice to mix different levels of abstraction like that.

  14. #14
    SitePoint Addict raydenx's Avatar
    In the forum posts, there isn't any HTML code, just BBCode like [ U R L ] http://www.whatever.com [/ U R L]. I convert some of the BBCode into HTML code. Actually, it is not the forum posts that are giving me problems but the ampersand character.

    Do you guys know a regular expression that could solve the problem I pointed out at the top of this topic?
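One possible regular expression for the question at the top of the thread uses a negative lookahead: match an & only when it is not followed by something shaped like an entity. This is a sketch rather than a tested production fix. The function name encode_bare_ampersands() is invented for the example, and the pattern only checks the shape of an entity, not whether the name really exists.

```php
<?php
// Hypothetical helper: encodes "&" unless it already starts an entity.
function encode_bare_ampersands($text)
{
    // (?! ... ) is a negative lookahead. It rejects an "&" followed by:
    //   a named entity     e.g. &nbsp;
    //   a decimal entity   e.g. &#207;
    //   a hex entity       e.g. &#xCF;
    $pattern = '/&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)/';
    return preg_replace($pattern, '&amp;', $text);
}

echo encode_bare_ampersands('tom & jerry &nbsp; &#207; AT&T'), "\n";
```

Text such as a&b;c would be left alone because &b; looks like an entity, which is the ambiguity kyberfabrikken describes in post #13.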

  15. #15
    SitePoint Addict raydenx's Avatar
    PHP Code:
    $message = ereg_replace("[\&]{2,}", "", $message);
    $message = str_replace(array(" &", "& ", " & "), array(" &amp;", "&amp; ", " &amp; "), trim($message));
    $message = trim($message);
    I have fixed the ampersand problem temporarily with the code above.

  16. #16
    SitePoint Addict raydenx's Avatar
    PHP Code:
    #FORMAT STRING INTO PURE HTML FIRST
    $message = trim(html_entity_decode($message));

    #REPLACE HTML ENTITIES WITH HTML CODES
    $message = htmlentities($message, ENT_NOQUOTES);

    #REPLACE < & > HTML CODES WITH THE ACTUAL CHARACTERS
    $message = str_replace(array("&lt;", "&gt;"), array("<", ">"), $message);
    I think the above code works way better than anything I have tried so far.
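Since the posts are stored as BBCode rather than HTML, another option is to escape the raw text once and only then turn BBCode into tags, so the generated < and > never pass through the escaping step. The sketch below assumes a simple [url]...[/url] tag with an http or https address, and that the stored text has not already been entity-encoded. The function name render_post() and the BBCode pattern are illustrative and not taken from raydenx's forum software.

```php
<?php
// Hypothetical renderer: escape first, then convert BBCode to HTML.
function render_post($raw)
{
    // Step 1: escape the untrusted text (&, <, > and both quote types).
    $safe = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

    // Step 2: convert [url]...[/url] into a link. The URL was already
    // escaped in step 1, so it can be placed inside the href attribute.
    return preg_replace(
        '#\[url\](https?://[^\s\[\]]+)\[/url\]#i',
        '<a href="$1">$1</a>',
        $safe
    );
}

echo render_post('tom & jerry [url]http://www.whatever.com/?a=1&b=2[/url]'), "\n";
```

With this ordering there is no need to swap &lt; and &gt; back afterwards, because the only angle brackets in the output are the ones the BBCode step generates.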

