Check a string for one or more words from a blacklist or whitelist

How can I use JavaScript to check if a string contains words from a blacklist?
I was thinking of creating a whitelist as well to run with this script.
I also want to remove spaces from the beginning and the end of the bio and/or check for other characters.

Note: I do not have access to the webpage to change it.

// tests bio info
// -----------------  
// bio = 'This is a string. 👏'; // false
// bio = "I'm a Freelance translator"; // true
// bio = 'application developer' ; //  false 
bio = 'I use apple-cyder everyday  💁 ' ; //  true
// bio = 'every day I listen The Beatles on the radio' ; // true


const blacklist = [
  'trans',
  'apple',
  'beer',
  'beat',
];

for (const item of blacklist) {
  if (bio.toLowerCase().indexOf(item) !== -1) {
    console.log(
      '%cThis bio matched blacklist keyword %c' + item,
      'color: red; font-size: 14px; font-weight: bold',
      'color: blue; font-size: 14px; font-weight: bold'
    );
  }
}
console.log("%cDone with this Bio ","color: Black; font-size: 14px; font-weight:bold ");

trim is what you want, e.g.

" text ".trim()
or

const strg = "       text      "
console.log(strg.trim()) // text

I think your code is fine; it's clear, isn't it?

You could possibly store your results in an array

const bio = "I'm a freelance translator and like apples"

const blacklist = [
  'trans',
  'apple',
  'beer',
  'beat'
]

const matches = []

for (const item of blacklist) {
  if (bio.toLowerCase().indexOf(item) !== -1) {
    matches.push(item)
  }
}

console.log(
  '%cThis bio matched blacklist keywords %c' + matches.join(', '),
  'color: red; font-size: 14px ; font-weight:bold',
  'color: blue; font-size: 14px ; font-weight:bold '
) // This bio matched blacklist keywords trans, apple

A nicer solution might be to use Array's higher-order function filter:
Array.prototype.filter()

I will leave you to read up on that :smiley:
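To make the suggestion concrete, here is a minimal sketch of what the filter version might look like (same bio and blacklist as above):

```javascript
const bio = "I'm a freelance translator and like apples"
const blacklist = ['trans', 'apple', 'beer', 'beat']

// filter keeps only the blacklist entries found in the lowercased bio
const matches = blacklist.filter(
  item => bio.toLowerCase().includes(item)
)

console.log(matches) // ['trans', 'apple']
```

includes is a slightly more readable alternative to indexOf(item) !== -1 and works the same way here.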

edit: just noticed ‘beer’, ‘like beer’ not ‘apples’


I'll start my research about Array.prototype.filter().
Thanks for the guideline. When I put something together I'll re-post it.


A small addition, split(" "), to your very clean code revision (very neat, I like it) helps the script find exact matches between words in the blacklist and words in the string.

if (bio.toLowerCase().split(" ").indexOf(item) !== -1)
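Put together as a complete check, the split approach might look like this (a sketch, reusing the blacklist from earlier posts):

```javascript
const bio = 'I like beer in the morning'
const blacklist = ['trans', 'apple', 'beer', 'beat']

// Splitting the bio into words means only exact words match,
// not substrings like 'trans' inside 'translator'
const words = bio.toLowerCase().split(' ')
const matches = blacklist.filter(item => words.includes(item))

console.log(matches) // ['beer']
```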

Next challenge… what if the blacklist has words like "morning person", "night owl", "candy bar", "potato chips"?

PS: I was reading about Array.prototype.filter(); I still need to review it more deeply and try it.

You could map the blacklist entries containing a space out to a separate list, and remove them from the original blacklist array. That way you can deal separately with the ones that have a space in them.
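A sketch of that idea (the variable names are my own, just for illustration):

```javascript
const blacklist = ['apple', 'night owl', 'beer', 'candy bar']

// Separate multi-word phrases from single words
const phrases = blacklist.filter(item => item.includes(' '))
const singles = blacklist.filter(item => !item.includes(' '))

const bio = 'a night owl drinking beer'
const words = bio.toLowerCase().split(' ')

// Single words are matched against the split-up bio,
// phrases against the whole string
const matches = [
  ...singles.filter(item => words.includes(item)),
  ...phrases.filter(item => bio.toLowerCase().includes(item))
]

console.log(matches) // ['beer', 'night owl']
```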


Another possibility would be to map your blacklisted words to regular expressions wrapped in word boundaries; this way you’ll also match a word if it is followed by say a comma or hyphen, not only white space:

const blacklist = ['apple' ,'night owl'].map(item => new RegExp(`\\b${item}\\b`))
const test = string => blacklist.some(item => item.test(string))

console.log(test('beer and apple-cider'))       // true
console.log(test('the bear and the night owl')) // true
console.log(test('no more pineapples'))         // false

I considered regular expressions in the original question, using pipes:
const blackListRx = /trans|apple|beer|beat/

I had resisted though as I thought it might be opening up a can of worms.

Just playing with an alternative here using the built-in test:

// builds \bapple\b|\bnight owl\b
const blacklistRx = new RegExp(
  ['apple', 'night owl'].map(item => `\\b${item}\\b`).join('|')
)

console.log(blacklistRx.test('beer and apple-cider'))       // true
console.log(blacklistRx.test('the bear and the night owl')) // true
console.log(blacklistRx.test('no more pineapples'))         // false

The difference is that Array's 'some' breaks on the first match; I'm not sure if the regex engine does the same with test or not.


True. :-D As soon as the blacklist contains non-word characters, we’ll have to properly escape everything (probably again using regular expressions).
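A common way to do that escaping, sketched here with a made-up helper name:

```javascript
// Escape regex metacharacters so blacklist entries like 'a.b'
// are treated literally when turned into a RegExp
const escapeRx = str => str.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')

console.log(new RegExp('a.b').test('a-b'))            // true: '.' matches any character
console.log(new RegExp(escapeRx('a.b')).test('a-b'))  // false: only a literal 'a.b' matches
console.log(new RegExp(escapeRx('a.b')).test('a.b'))  // true
```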

I suppose so… but if you want to get the full list of matched words, we might use match() along with a g flag instead.
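For example, combining the pattern-building from earlier in the thread with a g flag (note that match() returns null when nothing matches):

```javascript
const blacklistRx = new RegExp(
  ['apple', 'night owl', 'beer'].map(item => `\\b${item}\\b`).join('|'),
  'g'
)

// With the g flag, match() returns every matched word, not just the first
const matches = 'beer and apple-cider'.match(blacklistRx)
console.log(matches) // ['beer', 'apple']
```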


I should perhaps have used the word 'advantage' instead of 'difference', my thinking being that 'some' breaking early is more efficient. I'm suffering with brain fog here.


It depends on the use case I guess. If you want to check a user comment field for instance, it would probably be desirable to show the complete list of problematic words right away, rather than having to click “submit” again and again until all the included curse words have been rejected. :-P


Thank you for all your input; this is an interesting topic and the discussion is getting better.
I've been thinking about each comment.
One thing I know for sure is that the blacklist will grow. I don't know yet by how much, but I'd like to plan for the worst-case scenario.
The first thing I'd like to do is clean the bio / profile info: remove those funny characters and emojis, keep a clean text string, and probably add spaces between a word and a punctuation mark.

  • facebook: would be different from facebook : or facebook=

  • Maybe an external blacklist file is the best approach…
    that way I don't have to touch the code file anymore, or only touch it for improvements

Next questions:

  • how to create that file? A CSV-style delimited file?

  • what should the rules be? A semicolon as the field delimiter,
    and the words listed one under the other (one per line, separated by a carriage return), so the file reads vertically instead of horizontally…

  • load that CSV file into an array?

  • and then proceed with one of the checking procedures.

I don't know which one yet; I don't know which one is the most efficient / easiest one to implement.
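One simple sketch of the external-file idea, assuming a plain text file with one entry per line (which is easier to parse than a semicolon-delimited CSV; parseBlacklist is a name I made up):

```javascript
// Turn the file contents into an array: one entry per line,
// trimmed, with blank lines dropped
const parseBlacklist = text =>
  text.split(/\r?\n/).map(item => item.trim()).filter(Boolean)

const fileText = 'trans\napple\n\nnight owl\n' // example file contents
console.log(parseBlacklist(fileText)) // ['trans', 'apple', 'night owl']

// In the browser the file could then be loaded with fetch
// (the URL is made up):
// fetch('https://example.com/blacklist.txt')
//   .then(res => res.text())
//   .then(text => { const blacklist = parseBlacklist(text) })
```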



I agree, and it was kind of what I was experimenting with last night, pre word boundaries

/**
 * @param {String} strg - String to match
 * @param {Array} listed - Blacklisted words array
 * @returns {Array} An array of matching words, each in the form of
 * {word: matched word, index: matched index, length: length of matched word}
 */
const isBlackListed = (strg = '', listed = []) => (
  listed.reduce(
    (matches, word) => {
      const index = strg.indexOf(word)
      if (index >= 0) matches.push({ word, index, length: word.length })
      return matches
    },
    [] // <-- matches array
  )
)

isBlackListed(bio, blacklist) // returns [{word: 'trans', index: 16, length: 5}, {word: 'apple', index: 32, length: 5}]
  .forEach(({ word, index, length } /* destructure each item */) => {
    console.log(
      `This bio matched %c${word}`,
      'background-color: teal; font-weight:bold; padding: 3px 8px;',
      `with a word length of ${length} at index ${index}.`
    )
  })

It was late, and possibly not entirely thought out, but I had the idea that knowing the index as well might be useful. Length is not really needed, as the matched string would have that property anyway.

Just playing really


I am very slow compared with you guys. Here is my small contribution; maybe you already know it, I don't want to assume.
Cleaning a string of emojis and other things…

let a = 'I was 🤓, represent “me”.   🤓🏃‍🏢  mean “I’m running to work. Instagram:@findme';
document.write(a);
document.write('<br><br>');
a = a.replace(/[^\p{L}\p{N}\p{P}\p{Z}]/gu, '');
document.write(a);

Result:  I was , represent “me”. mean “I’m running to work. Instagram:@findme
  • \p{L} – to allow all letters from any language
  • \p{N} – for numbers
  • \p{P} – for punctuation
  • \p{Z} – for whitespace separators
  • ^ is for negation, so these classes are whitelisted and everything outside them gets removed

Info source click here


Unicode property escapes are a new one on me romeoanc, thanks for that. :+1:

I did initially mistake them for named capturing groups (?P<name>...), which I have used in PHP, but which I don't think are available in JS yet.


Yes they are (with a slightly different syntax: (?<name>...)), but browser support is still rather meager AFAIK…


I've tried to test everything you guys talk about or post, and at the same time I keep researching and learning…

  • I like that the code can catch 2+ words from the blacklist.
  • I wish for, and am looking for, a solution for the script to also catch a single complete word.

Catching part of a word is too risky, i.e. the script will ban / catch

  • "trans" words like transit, transport, transfer and so on
  • "app" words like application, appointment, appropriate, including apples! and so on…
    Note I am not adding e.g. "bee", because that root word would ban "beer", and that could be a big problem…

I keep looking and trying different things…

  • following your code, maybe checking the length of the blacklist word and comparing it with the length of the string part that the script caught? Maybe that could work.

It is just a thought…

For a single complete word, again use a word boundary:
\bsuper\b

matches: super, but not superficial or superintendent

You also have lookaheads, for example a negative lookahead
/mega(?!phone|tron)\w*/g

matches: mega, megabytes, but not megaphone or megatron
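Both can be tried directly in the console:

```javascript
// Word boundaries: only the complete word matches
console.log(/\bsuper\b/.test('super market')) // true
console.log(/\bsuper\b/.test('superficial'))  // false

// Negative lookahead: 'mega' only when NOT followed by 'phone' or 'tron'
const megaRx = /mega(?!phone|tron)\w*/
console.log(megaRx.test('megabytes')) // true
console.log(megaRx.test('megaphone')) // false
```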

As mentioned above we are opening a can of worms here.

Whilst I think it’s good to learn regular expressions, they can become lengthy and complicated to read.

You may find more straightforward alternatives such as split, indexOf, filter etc., or, as m3g4p0p posted above, a mix of the two.

Furthermore by being overly strict with your blacklist, you can make the user experience and your job a PITA. For example what about names and places?

Reading Rude and Funny British place names highlights this issue and may give you a bit of a giggle at the same time.

If you are interested in regular expressions though this is quite informative https://www.regular-expressions.info/tutorial.html and I would say the go to book is O’Reilly’s Mastering Regular Expressions — I picked up a decent second-hand copy about six months ago :smiley:


This is indeed an interesting idea… but to completely avoid the Scunthorpe Problem I think you have no other choice than to include every unwanted combination manually. So at the end of the day you have to weigh up whether you're willing to accept false positives, or possibly have bad words in your bios.


m3g4p0p I thought about an endless blacklist, and about when to stop adding words to the list… probably it will be a personal decision based on the ban / no-ban experience and its consequences (the Scunthorpe problem): excellent article…
Thinking deeper… caffeine is getting into my veins… I will use this blacklist code just to vet some bios / profiles / descriptions. If I can make this strict code work well, it will be easy to remove words from the blacklist itself.

  • rpg_digital I'll look up, read, and try "Mastering Regular Expressions"; thank you for the ideas. I'll try them as well.

I’ll keep working and reading… and post ideas / solutions


I’ve been toying with that idea and came up with the following:

function trimPunctuation (word) {
  return word.replace(/^\W*|\W*$/g, '')
}

function getFullWord (value, index) {
  const start = value.slice(0, index).search(/\S*$/)
  const end = value.slice(index).search(/\s|$/) + index
  const word = value.slice(start, end)

  return trimPunctuation(word)
}

function calculateWeighting (value, word) {
  return word.length / value.length
}

function accumulateMatches (value, word) {
  const index = value.indexOf(word)

  if (index === -1) {
    return {}
  }

  const fullWord = getFullWord(value, index)
  const remaining = value.slice(index + 1)

  return {
    [fullWord]: calculateWeighting(fullWord, word),
    ...accumulateMatches(remaining, word)
  }
}

function getWeightedMatches (list, value) {
  return list.reduce((result, word) => ({
    ...result,
    ...accumulateMatches(value, word)
  }), {})
}

const matches = getWeightedMatches(['apple', 'bee'], 'the beer drinking bee does not like apple-cider')
console.log(matches) // { 'apple-cider': 0.45454545454545453, beer: 0.75, bee: 1 }

Not sure if this is particularly useful like this though – “beer” having a higher weighting than “apple-cider” in the example. So there’s certainly room for improvement regarding the weighting function, such as taking the difference of the word lengths into account as well. It’s quite a fun exercise anyway. :-)