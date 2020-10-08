Check string for one or more words from blacklist or whitelist

JavaScript
#1

How can I use JavaScript to check if a string contains words from a blacklist?
I was thinking of creating a whitelist as well to run with this script.
I also want to remove spaces from the beginning and the end of the bios and/or check for other characters.

Note: I do not have access to the webpage to change it.

// test bio info
// -----------------
let bio;
// bio = 'This is a string. 👏'; // false
// bio = "I'm a Freelance translator"; // true
// bio = 'application developer'; // false
bio = 'I use apple-cyder everyday  💁 '; // true
// bio = 'every day I listen The Beatles on the radio'; // true


const blacklist = [
  'trans',
  'apple',
  'beer',
  "beat",
 ];

for (const item of blacklist) {
  if (bio.toLowerCase().indexOf(item) !== -1) {
    console.log("%cThis bio matched blacklist keyword %c" + item, "color: red; font-size: 14px ; font-weight:bold", "color: blue; font-size: 14px ; font-weight:bold ");
  }
}
console.log("%cDone with this Bio ","color: Black; font-size: 14px; font-weight:bold ");
#2

trim is what you want e.g.

" text ".trim()
or

const strg = "       text      "
console.log(strg.trim()) // text

I think your code is fine; it’s clear, isn’t it?

You could possibly store your results in an array

const bio = "I'm a freelance translator and like apples"

const blacklist = [
  'trans',
  'apple',
  'beer',
  'beat'
]

const matches = []

for (const item of blacklist) {
  if (bio.toLowerCase().indexOf(item) !== -1) {
    matches.push(item)
  }
}

console.log(
  '%cThis bio matched blacklist keywords %c' + matches.join(', '),
  'color: red; font-size: 14px ; font-weight:bold',
  'color: blue; font-size: 14px ; font-weight:bold '
) // This bio matched blacklist keywords trans, apple

A nicer solution might be to use the array higher-order function ‘filter’:
Array.prototype.filter()

I will leave you to read up on that :smiley:
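For anyone reading along later, here is a minimal sketch of what a filter version could look like, reusing the bio and blacklist from above. The helper name findMatches is made up for this example, and includes() stands in for the indexOf() check.

```javascript
// Hypothetical helper: returns every blacklist entry found in the bio.
const findMatches = (bio, blacklist) => {
  const text = bio.trim().toLowerCase()
  // filter keeps only the entries for which the callback returns true
  return blacklist.filter(item => text.includes(item))
}

const blacklist = ['trans', 'apple', 'beer', 'beat']
const matches = findMatches("I'm a freelance translator and like apples", blacklist)

console.log(matches.join(', ')) // trans, apple
```

An empty result array means nothing matched, so matches.length can double as the yes/no check.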

edit: just noticed ‘beer’, ‘like beer’ not ‘apples’

#3

I’ll start my research on Array.prototype.filter().
Thanks for the guidance. When I put something together I’ll post it here.

#4

A small addition, split(" "), to your very clean code revision (very neat, I like it) helps the script find exact matches between the words in the blacklist and the words in the string.

if (bio.toLowerCase().split(" ").indexOf(item) !== -1)
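One caveat, sketched here on the assumption that bios contain punctuation: splitting on a single space leaves tokens such as "apple-cyder" or "apple," intact, so they no longer equal "apple". Splitting on runs of characters that are neither letters nor digits is one possible workaround. The helper name toWords is made up for this example.

```javascript
// Hypothetical helper: split a bio into lower-case words, treating any run of
// characters that are not letters or digits as a separator.
const toWords = bio =>
  bio.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean)

const blacklist = ['trans', 'apple', 'beer', 'beat']
const words = toWords('I use apple-cyder everyday  💁 ')

// Exact word comparison: 'apple' matches, but 'trans' would not match 'translator'.
const exact = blacklist.filter(item => words.includes(item))

console.log(words) // [ 'i', 'use', 'apple', 'cyder', 'everyday' ]
console.log(exact) // [ 'apple' ]
```

Note that this also splits contractions ("I'm" becomes "i" and "m"), which may or may not matter for a given blacklist.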

Next challenge… what if the blacklist has entries like “morning person”, “night owl”, “candy bar” or “potato chips”?

PS: I was reading about Array.prototype.filter(); I still need to review it in more depth and try it.

#5

You could map the blacklist entries that contain a space out to a separate list and remove them from the original blacklist array. That way you can deal separately with the ones that have a space in them.
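A rough sketch of that idea, assuming single words are compared token by token and phrases are looked up as substrings. The sample entries and the name findMatches are made up, and this version builds two new lists rather than removing anything from the original array.

```javascript
const blacklist = ['apple', 'night owl', 'beer', 'candy bar']

// Entries containing a space are treated as phrases, the rest as single words.
const phrases = blacklist.filter(item => item.includes(' '))
const singleWords = blacklist.filter(item => !item.includes(' '))

const findMatches = bio => {
  const text = bio.toLowerCase()
  const tokens = text.split(' ')
  return [
    ...singleWords.filter(word => tokens.includes(word)), // exact word match
    ...phrases.filter(phrase => text.includes(phrase))    // substring match
  ]
}

console.log(findMatches('Night owl who likes beer')) // [ 'beer', 'night owl' ]
```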

#6

Another possibility would be to map your blacklisted words to regular expressions wrapped in word boundaries; this way you’ll also match a word if it is followed by say a comma or hyphen, not only white space:

const blacklist = ['apple' ,'night owl'].map(item => new RegExp(`\\b${item}\\b`))
const test = string => blacklist.some(item => item.test(string))

console.log(test('beer and apple-cider'))       // true
console.log(test('the bear and the night owl')) // true
console.log(test('no more pineapples'))         // false
#7

For the original question I had considered a regular expression using the pipe (alternation):
const blackListRx = /trans|apple|beer|beat/

I had resisted though as I thought it might be opening up a can of worms.

Just playing with an alternative here using the built-in test method:

// builds \bapple\b|\bnight owl\b
const blacklistRx = new RegExp(
  ['apple', 'night owl'].map(item => `\\b${item}\\b`).join('|')
)

console.log(blacklistRx.test('beer and apple-cider'))       // true
console.log(blacklistRx.test('the bear and the night owl')) // true
console.log(blacklistRx.test('no more pineapples'))         // false

The difference is that Array’s ‘some’ breaks on the first match; I’m not sure whether the regex engine does the same with test or not.

#8

True. :-D As soon as the blacklist contains non-word characters, we’ll have to properly escape everything (probably again using regular expressions).

I suppose so… but if you want to get the full list of words we might use match() along with a g flag instead.
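To make both points concrete, here is a hedged sketch that escapes each entry before building the pattern and then uses match() with the g flag to collect every hit. The names escapeRegExp, buildBlacklistRx and findAll are made up for this example, and the i flag is an added assumption so that "Beer" matches "beer".

```javascript
// Escape characters that have a special meaning in a regular expression.
const escapeRegExp = text => text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')

// Hypothetical helper: build one case-insensitive, global pattern from the list.
const buildBlacklistRx = list =>
  new RegExp(list.map(item => `\\b${escapeRegExp(item)}\\b`).join('|'), 'gi')

const blacklistRx = buildBlacklistRx(['apple', 'night owl', 'beer'])

// With the g flag, match() returns every match, or null when there is none.
const findAll = text => text.match(blacklistRx) || []

console.log(findAll('Beer, apple-cider and a night owl')) // [ 'Beer', 'apple', 'night owl' ]
console.log(findAll('no more pineapples'))                // []
```

Escaping only keeps the pattern valid. The \b boundaries still assume each entry starts and ends with a word character, so an entry that begins or ends with punctuation would need different handling.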

#9

I should have possibly used the word ‘advantage’ instead of ‘difference’. My thinking being ‘some’ breaking early is more efficient — I’m suffering with brain fog here.

#10

It depends on the use case I guess. If you want to check a user comment field for instance, it would probably be desirable to show the complete list of problematic words right away, rather than having to click “submit” again and again until all the included curse words have been rejected. :-P

#11

Thank you for all your input. This is an interesting topic and the discussion keeps getting better.
I’ve been thinking about each comment.
One thing I know for sure is that the blacklist will grow. I don’t know yet by how much, but I’d like to plan for the worst-case scenario.
The first thing I’d like to do is clean the bio/profile info: remove the funny characters and emojis, keep a clean text string, and probably add spaces between a word and a punctuation mark.

  • “facebook:” would be treated differently from “facebook :” or “facebook=”
  • Maybe an external blacklist file is the best approach…
    that way I don’t have to touch the code file anymore, except for improvements
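The clean-up step described above could be sketched roughly as follows. This is only one possible approach: the helper name cleanBio and the list of punctuation marks are assumptions, and the Extended_Pictographic property covers most emoji but not necessarily every emoji-related character.

```javascript
// Hypothetical helper: strip emoji, pad punctuation with spaces, tidy whitespace.
const cleanBio = bio =>
  bio
    .replace(/\p{Extended_Pictographic}/gu, ' ') // most emoji and other pictographs
    .replace(/[\u200D\uFE0F]/g, '')              // leftover joiners / variation selectors
    .replace(/([:=,;.!?])/g, ' $1 ')             // "facebook:" -> "facebook : "
    .replace(/\s+/g, ' ')                        // collapse runs of whitespace
    .trim()
    .toLowerCase()

console.log(cleanBio('  I use apple-cyder everyday  💁 ')) // i use apple-cyder everyday
console.log(cleanBio('Facebook: me'))                      // facebook : me
console.log(cleanBio('facebook=me'))                       // facebook = me
```

After this step, "facebook:", "facebook :" and "facebook=" all yield a standalone "facebook" token, so a whole-word check treats them the same.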

Next questions:

  • How do I create that file? As a delimited (CSV) file?

  • What should the rules be? For example, a semicolon as the field delimiter,
    with the words listed one under the other (one per line), so that the file reads vertically instead of horizontally…

  • How do I load that CSV file into an array?

  • And then proceed with one of the checking procedures.

I don’t know which one yet; I don’t know which is the most efficient or the easiest to implement.
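As a starting point for the file questions, here is a minimal sketch that turns the text of such a file into an array. It assumes one entry per line with an optional trailing semicolon, and it deliberately leaves open how the text is obtained. The name parseBlacklist and the sample contents are made up.

```javascript
// Hypothetical parser: one entry per line, with an optional trailing semicolon.
// How the text is obtained (file read, fetch, paste) is left open here.
const parseBlacklist = text =>
  text
    .split(/\r?\n/)                                       // handles both \n and \r\n
    .map(line => line.replace(/;\s*$/, '').trim().toLowerCase())
    .filter(line => line !== '')                          // skip blank lines

const fileContents = 'trans;\napple;\n\nnight owl;\r\nCandy Bar\n'
const blacklist = parseBlacklist(fileContents)

console.log(blacklist) // [ 'trans', 'apple', 'night owl', 'candy bar' ]
```

The resulting array can be passed to any of the checking approaches discussed in this thread.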