Untested, so there may be an error or two (but it should be darn close)
$matches = array();       // filled by preg_match_all() on each pass
$total_matches = array(); // accumulates every hit across all patterns
foreach ($badWordsArray as $pattern) {
    preg_match_all($pattern, $text, $matches);
    // $matches[0] holds the full-pattern matches; this may need tweaking based on your bad word patterns
    $total_matches = array_merge($total_matches, $matches[0]);
}
// $total_matches has all of the words that were matched.
Now, one thing to note: it will include each and every match! So if they used the same bad word 4 times, you will have it 4 times in your $total_matches. You can get down to the unique words by using $unique_matches = array_unique($total_matches);
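For what it's worth, here is a rough sketch of the unique-words step, plus a per-word count in case that is useful. The patterns and sample text are made up for illustration, and lowercasing the matches is my own assumption (without it, "Darn" and "darn" would count as different words).

```php
<?php
// Hypothetical sample data; swap in your own patterns and text.
$badWordsArray = ['/\bdarn\b/i', '/\bheck\b/i'];
$text = 'Darn it, what the heck. Darn!';

$total_matches = [];
foreach ($badWordsArray as $pattern) {
    preg_match_all($pattern, $text, $matches);
    $total_matches = array_merge($total_matches, $matches[0]);
}

// Normalize case so "Darn" and "darn" are treated as the same word.
$normalized = array_map('strtolower', $total_matches);

// array_unique() keeps the original keys, so re-index with array_values().
$unique_matches = array_values(array_unique($normalized));

// How many times each word was used.
$counts = array_count_values($normalized);

print_r($unique_matches);
print_r($counts);
```

array_count_values() is handy if you want to treat a single slip differently from a post that is full of hits.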
There are a few problems with auto-masking: Charles Dickens, shitake mushrooms, Dick Cheney, the University of South Carolina Gamecocks, “cock the gun.” I could go on, of course, but I think that makes my point: the stupid things just become a thorn in the side of normal conversation, and the kids with potty mouths will use !33+ to bypass the filters. Also, you can gravely insult someone without cursing at all if you know how to express yourself.
Boundary detection won’t help with former Vice President Cheney’s first name, though. Also, boundary detection lets in curse verbs in different tenses and plurals.
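To make both points concrete, here is a small sketch using \b word boundaries. The helper name hits() and the patterns are just assumptions for the demo, not anything from the code posted earlier in the thread.

```php
<?php
// Hypothetical helper: number of times $pattern matches in $text.
function hits(string $pattern, string $text): int
{
    return (int) preg_match_all($pattern, $text);
}

// Boundaries do leave longer words alone...
$glasses = hits('/\bass\b/i', 'I lost my glasses');      // 0: no match inside "glasses"
$dickens = hits('/\bdick\b/i', 'Charles Dickens');       // 0: no match inside "Dickens"

// ...but a standalone name is still a whole word, so it is still caught.
$cheney = hits('/\bdick\b/i', 'Dick Cheney');            // 1: false positive

// And plurals or other endings slip straight past the trailing \b.
$singular = hits('/\bass\b/i', 'what an ass');           // 1: caught
$plural   = hits('/\bass\b/i', 'what a bunch of asses'); // 0: missed
```

You could add optional suffixes to each pattern to catch the plurals, but every suffix you add widens the net for false positives again.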
Well, since you brought it up, my original code had a more sophisticated approach, but I created a simplified example to try and get @cpradio’s suggestions working…
Last time I ran a forum, I had the system notify the moderators when a message had a hit, but it didn’t actually do anything to the message. Still, no matter how sophisticated the approach, some things will get through and some things you don’t want masked will be. It’s unavoidable.
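A flag-only check like that might look roughly like this. The function name and the shape of the returned array are invented for the example, and the actual moderator notification is left out because it depends entirely on the forum software.

```php
<?php
// Hypothetical helper: reports hits but hands the message back untouched.
function checkMessage(string $message, array $patterns): array
{
    $hits = [];
    foreach ($patterns as $pattern) {
        if (preg_match_all($pattern, $message, $m)) {
            $hits = array_merge($hits, $m[0]);
        }
    }

    return [
        'message'      => $message,     // never modified
        'needs_review' => $hits !== [], // true if anything matched
        'hits'         => $hits,        // what the moderators should look at
    ];
}

$result = checkMessage('Well, darn.', ['/\bdarn\b/i']);
if ($result['needs_review']) {
    // Notify the moderators here (email, review queue, etc.); not shown.
}
```

Because the post is never altered, a false positive only costs a moderator a few seconds instead of mangling someone's message.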
In the approach I had working last night, I have a database table of words, each flagged to say whether it should be replaced even when it appears as a “substring” of a longer word.
So if I see the word “f*ck”, I replace it anywhere, any time. But “ass” has to be a standalone word, so that “glasses” is left alone.
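As a rough illustration of that idea (not the actual code, which wasn't posted), the table rows can be turned into patterns like this. The array stands in for the database rows, the column names word and match_substring are hypothetical, and "darn" is a mild stand-in for the word that gets replaced anywhere. It needs PHP 7.4+ for the arrow function.

```php
<?php
// Stand-in for rows fetched from the bad-words table (column names are made up).
$badWords = [
    ['word' => 'darn', 'match_substring' => true],  // replace anywhere, any time
    ['word' => 'ass',  'match_substring' => false], // standalone word only
];

function maskBadWords(string $text, array $badWords): string
{
    foreach ($badWords as $row) {
        // Escape the word so it is safe to drop into a regex.
        $quoted = preg_quote($row['word'], '/');

        // Substring words match anywhere; the rest need word boundaries.
        $pattern = $row['match_substring']
            ? '/' . $quoted . '/i'
            : '/\b' . $quoted . '\b/i';

        // Replace each hit with the same number of asterisks.
        $text = preg_replace_callback(
            $pattern,
            fn (array $m): string => str_repeat('*', strlen($m[0])),
            $text
        );
    }

    return $text;
}

$masked = maskBadWords('Goldarnit, that ass broke my glasses.', $badWords);
echo $masked, PHP_EOL; // Gol****it, that *** broke my glasses.
```

The flag is the whole trick: the same loop handles both kinds of word, and “glasses” survives because its entry demands word boundaries.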
I think my approach will catch 80% of the issues.
Now I just need to get cpradio’s code working and understand it!
@mike_w
May I suggest that your efforts will never be completely resolved, because the internet is forever changing and defeats even knowledgeable programmers.
According to your previous posts, you already allow only registered users to post and have also adopted SitePoint’s approach of letting users flag posts.
To reiterate, perfect Internet security will never be achieved, and it is best to adopt a simple solution, such as only publishing approved posts.
My opinion is to publish your site and try to encourage a literate community. There will be far more important problems to resolve once your site is live.
As is usually the case, my struggles to get things done are a function of my coding weaknesses - just ask @Mittineague how lowly I am!!!
Devising a decent bad word strategy hasn’t taken me long and I had it working last night - until I decided to expand it and make things better!
I have to leave for tonight, but will hopefully get @cpradio’s suggestions working in the morning. (That, and I found another way to do things based on @cpradio’s earlier suggestions, and would like to better understand that as well.)
It’s all a good learning experience, before I turn things over to the wolves!!!