How can I use JavaScript to check whether a string contains words from a blacklist?
I was also thinking of creating a whitelist to run with this script.
I want to remove spaces from the beginning and end of the bios and/or check for other characters.
Note: I do not have access to the webpage to change it.
// tests bio info
// -----------------
let bio;
// bio = 'This is a string. 😀'; // false
// bio = "I'm a Freelance translator"; // true
// bio = 'application developer'; // false
bio = 'I use apple-cyder everyday 😀 '; // true
// bio = 'every day I listen The Beatles on the radio'; // true
const blacklist = [
  'trans',
  'apple',
  'beer',
  'beat',
];
// Case-insensitive substring check against every blacklist entry
for (const item of blacklist) {
  if (bio.toLowerCase().indexOf(item) !== -1) {
    console.log("%cThis bio matched blacklist keyword %c" + item, "color: red; font-size: 14px; font-weight: bold", "color: blue; font-size: 14px; font-weight: bold");
  }
}
console.log("%cDone with this Bio ", "color: black; font-size: 14px; font-weight: bold");
A small addition, split(" "), to your very clean code revision (very neat, I like it) helps the script find exact matches between the words in the blacklist and the words in the string.
if (bio.toLowerCase().split(" ").indexOf(item) !== -1)
Next challenge… what if the blacklist has entries like “morning person”, “night owl”, “candy bar” or “potato chips”?
PS: I was reading about Array.prototype.filter(). I still need to review it in more depth and try it.
You could move the blacklist entries that contain a space out to a separate list and remove them from the original blacklist array. That way you can deal separately with the ones that have a space in them.
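A minimal sketch of that split, assuming a small sample list built from the examples in this thread. It uses filter() to build two new arrays instead of mutating the original one, and the name matchBlacklist is only illustrative:

```javascript
// Hypothetical sample list; the entries come from the examples in this thread.
const blacklist = ['apple', 'night owl', 'beer', 'potato chips'];

// Entries containing whitespace are treated as phrases, the rest as single words.
const phrases = blacklist.filter(item => /\s/.test(item));
const words = blacklist.filter(item => !/\s/.test(item));

// Single words: compare against the bio split into whitespace-separated tokens.
// Phrases: fall back to a plain substring search.
function matchBlacklist (bio) {
  const text = bio.toLowerCase();
  const tokens = text.split(/\s+/);
  return [
    ...words.filter(word => tokens.includes(word)),
    ...phrases.filter(phrase => text.includes(phrase))
  ];
}

console.log(matchBlacklist('A night owl who likes beer')); // [ 'beer', 'night owl' ]
```

Note that the token comparison still misses a word followed directly by punctuation, such as "beer,".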
Another possibility would be to map your blacklisted words to regular expressions wrapped in word boundaries; this way you’ll also match a word if it is followed by, say, a comma or hyphen, not only white space:
const blacklist = ['apple' ,'night owl'].map(item => new RegExp(`\\b${item}\\b`))
const test = string => blacklist.some(item => item.test(string))
console.log(test('beer and apple-cider')) // true
console.log(test('the bear and the night owl')) // true
console.log(test('no more pineapples')) // false
True. :-D As soon as the blacklist contains non-word characters, we’ll have to properly escape everything (probably again using regular expressions).
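A rough sketch of what that escaping could look like. The helper names escapeRegExp and toPattern are made up for illustration, and the lookarounds are my own substitution for \b, which only works next to word characters and so fails for an entry like "c++":

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp (text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build one case-insensitive pattern per entry. The lookarounds require that
// the match is not directly preceded or followed by a letter or a number.
function toPattern (item) {
  return new RegExp(`(?<![\\p{L}\\p{N}])${escapeRegExp(item)}(?![\\p{L}\\p{N}])`, 'iu');
}

// Hypothetical sample entries, including one with non-word characters.
const patterns = ['c++', 'night owl', 'apple'].map(toPattern);
const isBlacklisted = text => patterns.some(pattern => pattern.test(text));

console.log(isBlacklisted('I write C++ daily'));    // true
console.log(isBlacklisted('beer and apple-cider')); // true
console.log(isBlacklisted('no more pineapples'));   // false
```

Lookbehind assertions need a reasonably recent JavaScript engine, so this is an assumption about the environment the script runs in.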
I suppose so… but if you want to get the full list of words, we might use match() along with a g flag instead.
I should possibly have used the word “advantage” instead of “difference”. My thinking was that “some” breaking early is more efficient. I’m suffering with brain fog here.
It depends on the use case, I guess. If you want to check a user comment field, for instance, it would probably be desirable to show the complete list of problematic words right away, rather than having to click “submit” again and again until all the included curse words have been rejected. :-P
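For reference, the match() variant mentioned above might look roughly like this. The list and the function name findBlacklisted are only placeholders, and the entries contain just letters and spaces, so no escaping is needed here:

```javascript
// Hypothetical sample list.
const blacklist = ['apple', 'night owl', 'beer'];

// One alternation wrapped in word boundaries; "g" finds every occurrence,
// "i" ignores case.
const pattern = new RegExp(`\\b(?:${blacklist.join('|')})\\b`, 'gi');

function findBlacklisted (text) {
  // match() returns null when nothing matches, so fall back to an empty array.
  const found = text.match(pattern) || [];
  // Lowercase and de-duplicate the hits.
  return [...new Set(found.map(word => word.toLowerCase()))];
}

console.log(findBlacklisted('Beer, more beer and apple-cider for the night owl'));
// [ 'beer', 'apple', 'night owl' ]
```

This returns everything in one pass, whereas some() stops at the first hit, so which one fits depends on whether you need the full list or just a yes/no answer.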
Thank you for all your input. This is an interesting topic and the debate is getting nicer.
I’ve been thinking about each comment.
One thing I know for sure is that the blacklist will grow. I don’t know yet how much it will grow, but I’d like to plan for the worst-case scenario.
The first thing I’d like to do is clean the bio/profile info: remove those funny characters and emojis, keep a clean text string, and probably add spaces between a word and a punctuation mark.
For example, “facebook:” would be treated differently from “facebook :” or “facebook=”.
Maybe an external blacklist file is the best approach…
That way I don’t have to touch the code file anymore, or only touch the code for improvements.
Next questions:
How do I create that file? As a CSV (delimited) file?
What should the rules be? A semicolon as the field delimiter,
and the words listed one under the other, separated by a carriage return, so that the file reads vertically instead of horizontally…
How do I load that CSV file into an array?
And then proceed with one of the checking procedures.
I don’t know which one yet; I don’t know which is the most efficient or the easiest to implement.
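As a starting point for the file question, here is a small sketch that turns the text of such a file into an array. The format (one entry per line, with ";" also accepted as a delimiter) is an assumption based on the questions above, and parseBlacklist is a made-up name:

```javascript
// Parse the text of a blacklist file into an array of lowercase entries.
function parseBlacklist (fileText) {
  return fileText
    .split(/[;\r\n]+/)                        // split on semicolons and line breaks
    .map(entry => entry.trim().toLowerCase()) // normalise each entry
    .filter(entry => entry !== '');           // drop blank lines and trailing delimiters
}

// Hypothetical file content, mixing Unix and Windows line endings.
const sample = 'trans;\napple;\r\nnight owl\n\nbeer;candy bar\n';
console.log(parseBlacklist(sample));
// [ 'trans', 'apple', 'night owl', 'beer', 'candy bar' ]
```

How the text gets loaded depends on where the script runs: in a browser it could come from fetch(), in Node.js from fs.readFileSync(). The parsing step stays the same either way.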
I agree, and it was kind of what I was experimenting with last night, pre word boundaries
/**
 * @param {String} strg - String to match
 * @param {Array} listed - Blacklisted words array
 * @returns {Array} An array of matching words, each in the form of
 * {word: matched word, index: matched index, length: length of matched word}
 */
const isBlackListed = (strg = '', listed = []) => (
  listed.reduce(
    (matches, word) => {
      const index = strg.indexOf(word)
      if (index >= 0) matches.push({ word, index, length: word.length })
      return matches
    },
    [] // <-- matches array
  )
)

isBlackListed(bio, blacklist) // returns e.g. [{word: 'trans', index: 16, length: 5}, {word: 'apple', index: 32, length: 5}], depending on bio
  .forEach(({ word, index, length } /* destructure each item */) => {
    console.log(
      `This bio matched %c${word}`,
      'background-color: teal; font-weight:bold; padding: 3px 8px;',
      `with a word length of ${length} at index ${index}.`
    )
  })
It was late, and this was possibly not entirely thought out, but I had the idea that knowing the index as well might be useful. The length is not really needed, as the matched string already has that property.
I am very slow compared with you guys. Here is my small contribution; maybe you already know it, but I don’t want to assume. Cleaning a string of emojis and other things…
let a = 'I was 🤔, represent “me”. 🤷‍♂️ mean “I’m running to work. Instagram:@findme';
document.write(a);
document.write('<br><br>');
a = a.replace(/[^\p{L}\p{N}\p{P}\p{Z}]/gu, '');
document.write(a);
Result: I was , represent “me”. mean “I’m running to work. Instagram:@findme
\p{L}: allows all letters from any language
\p{N}: numbers
\p{P}: punctuation
\p{Z}: whitespace separators
^ negates the character class, so every character outside these categories is removed; in other words, these categories are whitelisted
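Building on that pattern, a cleaning step that also trims the ends (as asked in the first post) could be sketched like this. cleanBio is just an illustrative name, and replacing tabs and newlines first is my own addition, since they are not part of \p{Z} and would otherwise be deleted, gluing words together:

```javascript
function cleanBio (bio) {
  return bio
    .replace(/\s+/g, ' ')                     // turn tabs and newlines into plain spaces first
    .replace(/[^\p{L}\p{N}\p{P}\p{Z}]/gu, '') // keep letters, numbers, punctuation, separators
    .replace(/\p{Z}+/gu, ' ')                 // collapse runs of separators into one space
    .trim();                                  // drop leading and trailing spaces
}

console.log(cleanBio('  I use apple-cyder everyday 😀  ')); // 'I use apple-cyder everyday'
console.log(cleanBio('I was 🤔,\tfine'));                   // 'I was , fine'
```

The second replace is needed because removing an emoji that sat between two spaces leaves a double space behind.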
I have tried to test everything you guys talk about or post, and at the same time I keep researching and learning…
I like that the code can catch two or more words from the blacklist.
I am looking for a way for the script to also catch only a single complete word.
Catching part of a word is too risky, i.e. the script will ban / catch
“trans” words like transit, transport, transfer and so on,
“app” words like application, appointment, appropriate, including apples, and so on… Note that I am not adding e.g. “bee”, because that root word would ban “beer”, and that could be a big problem…
I keep looking and trying different things…
Following your code, maybe check the length of the blacklist word and compare it with the length of the part of the string that the script caught? Maybe that can work.
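One way to read that length idea: a hit only counts when the blacklist word is exactly as long as the word it was found in, which is the same as requiring the whole word to be equal. A sketch under that assumption, with made-up names tokenize and findWholeWords and a sample list taken from the words discussed here:

```javascript
// A Set gives constant-time lookups, which helps if the blacklist grows.
const blacklist = new Set(['trans', 'app', 'apple', 'beer']);

// Split on anything that is not a letter or a number, so "apple-cider",
// "facebook:" and "facebook =" all yield clean tokens.
function tokenize (text) {
  return text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
}

// Return only the tokens that equal a blacklist entry in full.
function findWholeWords (text) {
  return [...new Set(tokenize(text))].filter(token => blacklist.has(token));
}

console.log(findWholeWords('My transport app: apples and Beer!')); // [ 'app', 'beer' ]
```

Entries with a space in them (like “night owl”) are not covered by this and would still need a separate phrase check.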
If you are interested in regular expressions, though, this is quite informative: https://www.regular-expressions.info/tutorial.html and I would say the go-to book is O’Reilly’s Mastering Regular Expressions; I picked up a decent second-hand copy about six months ago.
This is indeed an interesting idea… but to completely avoid the Scunthorpe problem I think you have no other choice than including every unwanted combination manually. So at the end of the day you have to weigh up whether you’re willing to accept false positives, or possibly have bad words in your bios.
m3g4p0p, I thought about the endless blacklist, or when to stop adding words to the list… It will probably be a personal decision based on the ban / no-ban experience and its consequences (the Scunthorpe problem). Excellent article…
Thinking deeper… caffeine is getting into my veins… I will use this blacklist code just to catch some bios / profiles / descriptions. If I can make this strict code work well, it will be easy to remove words from the blacklist itself.
rpg_digital, I’ll look at, read and try “Mastering Regular Expressions”. Thank you for the ideas, I’ll try them as well.
I’ll keep working and reading… and post ideas / solutions.
I’ve been toying with that idea and came up with the following:
function trimPunctuation (word) {
  return word.replace(/^\W*|\W*$/g, '')
}

function getFullWord (value, index) {
  const start = value.slice(0, index).search(/\S*$/)
  const end = value.slice(index).search(/\s|$/) + index
  const word = value.slice(start, end)

  return trimPunctuation(word)
}

function calculateWeighting (value, word) {
  return word.length / value.length
}

function accumulateMatches (value, word) {
  const index = value.indexOf(word)

  if (index === -1) {
    return {}
  }

  const fullWord = getFullWord(value, index)
  const remaining = value.slice(index + 1)

  return {
    [fullWord]: calculateWeighting(fullWord, word),
    ...accumulateMatches(remaining, word)
  }
}

function getWeightedMatches (list, value) {
  return list.reduce((result, word) => ({
    ...result,
    ...accumulateMatches(value, word)
  }), {})
}

const matches = getWeightedMatches(['apple', 'bee'], 'the beer drinking bee does not like apple-cider')
console.log(matches) // { 'apple-cider': 0.45454545454545453, beer: 0.75, bee: 1 }
Not sure if this is particularly useful like this though, with “beer” having a higher weighting than “apple-cider” in the example. So there’s certainly room for improvement regarding the weighting function, such as taking the difference of the word lengths into account as well. It’s quite a fun exercise anyway. :-)