Results 1 to 19 of 19
  1. #1
    SitePoint Enthusiast
    Join Date
    Apr 2011
    Posts
    36
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Regular expression to remove www.xxxxx.com from string

    I have searched but I can't find a simple method to remove links of the form www.website.com or .co.uk etc from a string. I have found regular expressions that remove urls that start with http:// but not straightforward www ones. Any suggestions?

    This is the code I have got to remove urls from a string called $data:

    $data = preg_replace('/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i', '', $data);

    This simply deletes any links, so it will remove for example http://www.junkwebsite.com/ but not www.junkwebsite.com.

    Thanks for any suggestions.

  2. #2
    SitePoint Zealot 2ndmouse's Avatar
    Join Date
    Jan 2007
    Location
    West London
    Posts
    196
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Would it be acceptable to simply disable the link by stripping dots and slashes?
    Detect file changes remotely. SimpleSiteAudit is an early
    warning anti-hacker system which sends an alert on detection.

    PHP Find Orphan Files - Finds all the unreferenced files on your site.

  3. #3
    SitePoint Enthusiast
    Join Date
    Apr 2011
    Posts
    36
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Thanks for the reply. I need to remove any link in its entirety. There will also be HTML in the string so I can't simply remove any HTML. I don't want to leave a link that is human readable either.

  4. #4
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    4,813
    Mentioned
    141 Post(s)
    Tagged
    0 Thread(s)
    I haven't had a chance to fully test this one yet (I will later), but this is my quick attempt before my morning coffee.
    PHP Code:
    $data = preg_replace('/\b((https?|ftp|file):\/\/|www\.)[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i', '', $data);
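To make the effect of the added `www\.` alternative concrete, here is a small sketch that runs the pattern over three sample strings. The sample strings are made up for illustration; only the pattern comes from the post.

```php
<?php
// Pattern from the post: a scheme (http, https, ftp, file) or a literal "www." prefix,
// followed by a run of URL characters that ends in a non-punctuation URL character.
$pattern = '/\b((https?|ftp|file):\/\/|www\.)[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i';

// Illustrative inputs (assumed, not from the thread).
$samples = [
    'See http://www.junkwebsite.com/ for details',
    'See www.junkwebsite.com for details',
    'See junkwebsite.com for details',
];

foreach ($samples as $sample) {
    // The first two lines lose their URL; the third is printed unchanged
    // because it has neither a scheme nor a "www." prefix.
    echo preg_replace($pattern, '', $sample), PHP_EOL;
}
```

Note that removing a URL leaves the surrounding whitespace in place, so the first two results contain a double space.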
    Be sure to congratulate xMog on earning April's Member of the Month
    Go ahead and blame me, I still won't lose any sleep over it
    My Blog | My Technical Notes

  5. #5
    SitePoint Wizard silver trophybronze trophy Cups's Avatar
    Join Date
    Oct 2006
    Location
    France, deep rural.
    Posts
    6,869
    Mentioned
    17 Post(s)
    Tagged
    1 Thread(s)
    Do you actually want to remove all <a href=""></a> tags from the string?

  6. #6
    SitePoint Enthusiast
    Join Date
    Apr 2011
    Posts
    36
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote: Originally posted by cpradio
    I haven't had a chance to fully test this one yet (I will later), but this is my quick attempt before my morning coffee.
    PHP Code:
    $data = preg_replace('/\b((https?|ftp|file):\/\/|www\.)[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i', '', $data);
    Thanks for that. That gets rid of pretty much any link, the only exception being if someone types in website.com for example. Would it be easy to catch anything with a .com or .co.uk etc on the end as well?

  7. #7
    SitePoint Enthusiast
    Join Date
    Apr 2011
    Posts
    36
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote: Originally posted by Cups
    Do you actually want to remove all <a href=""></a> tags from the string?
    No. I just want to get rid of any actual websites, human or machine readable. What I want to avoid is someone adding a link or saying something like "check out this great website xxxxxx.com". I don't mind if the <a> tags are still there afterwards.

  8. #8
    I solve practical problems. bronze trophy
    Michael Morris's Avatar
    Join Date
    Jan 2008
    Location
    Knoxville TN
    Posts
    2,011
    Mentioned
    56 Post(s)
    Tagged
    0 Thread(s)
    Quote: Originally posted by SandyPodenco
    Thanks for the reply. I need to remove any link in its entirety. There will also be HTML in the string so I can't simply remove any HTML. I don't want to leave a link that is human readable either.
    Not really possible. Humans can parse out www dot example dot com quite easily after all. You'll never come up with an expression that stops all possible ways of including a link reference in a message. You can strip anchor tags and perhaps any string ending in .com, but a persistent spammer will find a way to include the link.

  9. #9
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    4,813
    Mentioned
    141 Post(s)
    Tagged
    0 Thread(s)
    You could always use strip_tags, giving it a list of tags you want to allow
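As a rough illustration of that suggestion, the sketch below calls `strip_tags` with an allow-list. The sample markup and the choice of allowed tags are assumptions made for the example.

```php
<?php
// Illustrative input (assumed): HTML containing a link and some formatting.
$data = '<p>Visit <a href="http://www.junkwebsite.com/">my site</a> for <b>great</b> deals</p>';

// The second argument lists the tags to keep; every other tag, including <a>, is removed.
// Only the tags are stripped: the text between them ("my site") stays.
$clean = strip_tags($data, '<p><b>');

echo $clean, PHP_EOL;
// <p>Visit my site for <b>great</b> deals</p>
```

This only deals with markup. A URL typed as plain text is not a tag, so it would still need the regular expression approach discussed earlier in the thread.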

  10. #10
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    4,813
    Mentioned
    141 Post(s)
    Tagged
    0 Thread(s)
    Quote: Originally posted by SandyPodenco
    Thanks for that. That gets rid of pretty much any link, the only exception being if someone types in website.com for example. Would it be easy to catch anything with a .com or .co.uk etc on the end as well?
    You could try this
    PHP Code:
    $data = preg_replace('/\b((https?|ftp|file):\/\/|www\.)?[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i', '', $data);
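A caution on the pattern above: once the prefix group is optional, the remaining character classes match any run of letters or digits, so ordinary words are removed along with domains. The sketch below shows that behaviour and one possible alternative that requires either a prefix or a hostname ending in a listed suffix. The function name `removeLinks` and the suffix list are illustrative assumptions, not something from the thread, and the list is far from complete.

```php
<?php
// Pattern from the post, with the scheme/www prefix made optional.
$optionalPrefix = '/\b((https?|ftp|file):\/\/|www\.)?[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i';

// Every word matches, so only the spaces between the words survive.
var_dump(preg_replace($optionalPrefix, '', 'check out website.com today'));

// Hypothetical alternative: match a prefixed URL, or a bare hostname with a known suffix.
function removeLinks(string $text): string
{
    // Same URL character run as in the thread's pattern.
    $urlChars = '[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]';
    // Illustrative suffix list only; extend it to suit the site.
    $suffixes = 'com|net|org|co\.uk';

    $pattern = '/\b(?:(?:https?|ftp|file):\/\/|www\.)' . $urlChars
             . '|\b(?:[A-Z0-9-]+\.)+(?:' . $suffixes . ')\b(?:\/(?:' . $urlChars . ')?)?/i';

    // preg_replace returns null on error; fall back to the original text in that case.
    return preg_replace($pattern, '', $text) ?? $text;
}

echo removeLinks('check out website.com today'), PHP_EOL;
echo removeLinks('visit example.co.uk now'), PHP_EOL;
```

As Michael Morris points out, no pattern of this kind can catch every way of writing a domain, so this is best treated as a deterrent rather than a guarantee.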

  11. #11
    SitePoint Wizard silver trophybronze trophy Cups's Avatar
    Join Date
    Oct 2006
    Location
    France, deep rural.
    Posts
    6,869
    Mentioned
    17 Post(s)
    Tagged
    1 Thread(s)
    Quote: Originally posted by cpradio
    You could always use strip_tags, giving it a list of tags you want to allow
    Yeah, that was the solution I was angling towards - seems to offer a greater degree of protection too.

    I wonder how effective a security solution that really is?

  12. #12
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    4,813
    Mentioned
    141 Post(s)
    Tagged
    0 Thread(s)
    Quote: Originally posted by Cups
    Yeah, that was the solution I was angling towards - seems to offer a greater degree of protection too.

    I wonder how effective a security solution that really is?
    Not very, as you can place onmouseover attributes on any allowed tag and pretty much get XSS attacks to varying degrees. That is why it is very important NOT to allow your users to write HTML that will be displayed directly, and instead to force the use of shortcodes (as in WP) or bbcodes (as in forums). At least you can then use strip_tags and have full control of the output of the shortcodes and bbcodes.
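The point can be sketched as follows. The first part shows that `strip_tags` keeps the attributes of any tag it is told to allow. The second part is a hypothetical `renderPost` helper (not from the thread) that escapes all HTML with `htmlspecialchars`, swapped in here in place of `strip_tags`, and then converts a tiny, assumed set of bbcodes, so the emitted markup is entirely under the site's control.

```php
<?php
// strip_tags keeps allowed tags as written, attributes included.
$unsafe = strip_tags('<b onmouseover="alert(1)">hi</b>', '<b>');
echo $unsafe, PHP_EOL; // the onmouseover attribute is still there

// Hypothetical helper: escape everything, then enable a small whitelist of bbcodes.
function renderPost(string $raw): string
{
    // Escape all HTML so user-supplied tags and attributes are displayed as text.
    $safe = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

    // Convert only the bbcodes we choose to support (an assumed, minimal set).
    $patterns     = ['/\[b\](.*?)\[\/b\]/is', '/\[i\](.*?)\[\/i\]/is'];
    $replacements = ['<strong>$1</strong>', '<em>$1</em>'];

    // preg_replace returns null on error; fall back to the escaped text.
    return preg_replace($patterns, $replacements, $safe) ?? $safe;
}

echo renderPost('[b]Hello[/b] <b onmouseover="alert(1)">world</b>'), PHP_EOL;
```

Because the replacement strings are fixed, users cannot introduce attributes of their own, whichever bbcodes they type.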

  13. #13
    SitePoint Enthusiast
    Join Date
    Apr 2011
    Posts
    36
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote: Originally posted by cpradio
    You could try this
    PHP Code:
    $data = preg_replace('/\b((https?|ftp|file):\/\/|www\.)?[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i', '', $data);
    Thanks for that. What is it supposed to do differently to the code above?

  14. #14
    SitePoint Enthusiast
    Join Date
    Apr 2011
    Posts
    36
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote: Originally posted by Michael Morris
    Not really possible. Humans can parse out www dot example dot com quite easily after all. You'll never come up with an expression that stops all possible ways of including a link reference in a message. You can strip anchor tags and perhaps any string ending in .com, but a persistent spammer will find a way to include the link.
    Agreed. These are all users who have registered and have had their identities verified in some way. I just want to make it impossible to add a working link and harder to add a human readable link, which the code does do quite well.

  15. #15
    I solve practical problems. bronze trophy
    Michael Morris's Avatar
    Join Date
    Jan 2008
    Location
    Knoxville TN
    Posts
    2,011
    Mentioned
    56 Post(s)
    Tagged
    0 Thread(s)
    Outside of spammers, why is link sharing among users a problem for your site?

  16. #16
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    4,813
    Mentioned
    141 Post(s)
    Tagged
    0 Thread(s)
    Quote: Originally posted by SandyPodenco
    Thanks for that. What is it supposed to do differently to the code above?
    It should remove website.com or .co.uk, whereas the original didn't.

  17. #17
    SitePoint Enthusiast
    Join Date
    Apr 2011
    Posts
    36
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote: Originally posted by Michael Morris
    Outside of spammers, why is link sharing among users a problem for your site?
    Because the pages are open to visitors as well as the users and I don't want visitors to the site distracted by endless links to other sites. If someone wants to advertise something there are paid options.

  18. #18
    I solve practical problems. bronze trophy
    Michael Morris's Avatar
    Join Date
    Jan 2008
    Location
    Knoxville TN
    Posts
    2,011
    Mentioned
    56 Post(s)
    Tagged
    0 Thread(s)
    Well, your site, your rules. I don't know enough about the place to know what will and won't work in your specific case, but in general the more you try to control what users can post the more likely they will simply choose to post elsewhere. It's one thing to disallow link tags, but disallowing mentions of other domains is the sort of thing that would send me to your competitor pretty much immediately.

  19. #19
    SitePoint Enthusiast
    Join Date
    Apr 2011
    Posts
    36
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote: Originally posted by Michael Morris
    Well, your site, your rules. I don't know enough about the place to know what will and won't work in your specific case, but in general the more you try to control what users can post the more likely they will simply choose to post elsewhere. It's one thing to disallow link tags, but disallowing mentions of other domains is the sort of thing that would send me to your competitor pretty much immediately.
    I understand the point you are making but that's not a problem. Their information should be on their page, not on another page that they want to link to - there is no legitimate reason to want to add a link and I doubt if anyone will complain. In fact if I automatically remove links, they are less likely to be annoyed than if I manually removed them later. Rules is rules!

