Check string for one or more words from blacklist or whitelist

rpg_digital · October 10, 2020, 9:51pm

I like the above use of spread with the accumulated object, nice.

Had to test it

const fruit = ['apple', 'banana', 'cantaloupe', 'durian']
  .reduce(
    (result, word) => 
      ({
        ...result,
        [word.charAt(0)]: word
      }), 
      {}
    )

console.log(fruit) // {a: "apple", b: "banana", c: "cantaloupe", d: "durian"}
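
As an aside, the same lookup object can be built without copying the accumulator on every step. A minimal sketch using `Object.fromEntries`, assuming an ES2019-capable environment; the name `fruitByInitial` is only for illustration:

```javascript
const fruitByInitial = Object.fromEntries(
  ['apple', 'banana', 'cantaloupe', 'durian']
    // build [key, value] pairs: first letter -> word
    .map(word => [word.charAt(0), word])
)

console.log(fruitByInitial) // { a: 'apple', b: 'banana', c: 'cantaloupe', d: 'durian' }
```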

romeoanc · October 11, 2020, 12:40pm

My small contribution (I was too tired last night to update you guys):

let words = ['transit', 'transport ', 'prevent', 'transfer ', 'application', 
    'appointment', 'appropriate', 'event','translator','morning person']
 
var bl='morning person';
console.log("bl: "+ bl);

var bl1=("/^"+ bl +"$/");
console.log("bl1: "+bl1);


var bl = new RegExp("^"+bl+"$"); //     ^ match first part  ^after  => afterhour, afternoon, aftershave
                                 //     $ match the last part   on$ => icon, clon, neon, upon    
                                 //     ^variable$  matches the first part and the last part of a word

console.log('bl after RegExp: ', bl);
 
words.forEach(word => {
    if (bl.exec(word)) {
        console.log(`- Found: ${word}`);
    } 
     
})
console.log('- Done searching...');

The idea could work… I was trying to take this idea and blend it into your code, guys, to search inside a sentence (string). I couldn’t get it to work; I’ll keep trying today.
Then we can make the code less strict just by changing or removing the ^ and/or $.

m3g4p0p · October 11, 2020, 6:31pm

Actually this will only match the exact string “variable” (i.e. word === 'variable'); the ^ assertion matches the beginning of the tested string, and $ the very end. So this would only work for sentences if the complete sentence is included in the blacklist.
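
To illustrate the difference, here is a minimal sketch comparing the anchored pattern with one that uses word boundaries (`\b`) instead. The variable names are made up for this example:

```javascript
const sentence = 'I am a morning person'

// ^ and $ anchor the pattern to the start and end of the whole tested string
const exactRx = new RegExp('^morning person$')
// \b only requires a word boundary on either side of the phrase
const boundaryRx = new RegExp('\\bmorning person\\b')

console.log(exactRx.test(sentence))         // false
console.log(exactRx.test('morning person')) // true
console.log(boundaryRx.test(sentence))      // true
```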

BTW, while it is not forbidden you should avoid redeclaring variables. Actually, there’s no need to use var at all if you’re using let anyway, which is preferable in every respect – including that it will throw an error when attempting to redeclare it within the same scope.
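
A small sketch of that difference. Since redeclaring a `let` is reported when the code is parsed, the snippets are compiled with `new Function` so the error can be caught; `tryCompile` is a hypothetical helper written just for this demonstration:

```javascript
// Hypothetical helper: try to compile a snippet and report the outcome
function tryCompile (source) {
  try {
    new Function(source) // only parses the body, does not run it
    return 'ok'
  } catch (error) {
    return error.name
  }
}

console.log(tryCompile('var bl = 1; var bl = 2;')) // 'ok'
console.log(tryCompile('let bl = 1; let bl = 2;')) // 'SyntaxError'
```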

romeoanc · October 11, 2020, 9:03pm

"So after my talk about not using regexes" oops! During these 2 or 3 weeks I have been reading and researching so much about JavaScript that at some point all this info is in my head… but I can’t easily link or relate it to our conversation; it’s a big mix of information… but that didn’t stop me from moving forward and learning. I double-checked the spelling of the word; yes, it is correct: appropriate, e.g. “what is appropriate to wear to work?”
Thanks rpg_digital for all your help and input. I’m going to read and try the code.

romeoanc · October 12, 2020, 1:18am

I’ll try it soon, and thanks for your support.
Today, Sunday, I dedicated a couple of hours just to reading about RegExp. Very interesting things: the pros and cons, and the controversial part of it.

m3g4p0p · October 12, 2020, 12:01pm

Hm isn’t the lookahead assertion kinda redundant at the beginning of the expression, meaning “anything followed by x”? If I’m not mistaken the same could be achieved like

/(\bbee|\bapple)\S*/g

Anyway, maybe another approach would be using an actual comparison algorithm for the heavy lifting, such as the Levenshtein distance or the Sørensen-Dice coefficient… the latter probably being more useful here as it gives us a percentage value.

This would also allow for fuzzy matches if desired; here’s an example using the Sørensen-Dice-based string-similarity package (too lazy to re-implement the wheel right now hehe):

const { findBestMatch } = require('string-similarity')

function trimPunctuation (word) {
  // Strip surrounding non-word characters; e.g. remove the
  // exclamation mark from "something!" but not "someth!ng"
  return word.replace(/^\W*|\W*$/g, '')
}

function getMatches (list, value, { fuzzy = false } = {}) {
  const words = value.split(/\s+/).map(trimPunctuation)

  return words.reduce((matches, word) => {
    const testWords = fuzzy ? list : list.filter(listed => word.includes(listed))

    if (testWords.length === 0) {
      return matches
    }

    const { bestMatch } = findBestMatch(word, testWords)
    const { rating, target } = bestMatch

    if (rating > 0) {
      matches[word] = { rating, target }
    }

    return matches
  }, {})
}

const blacklist = ['apple', 'bee', 'like']
const input = 'the beer drinking bee does not l!ke apple-cider'

console.log(getMatches(blacklist, input))
// {
//   beer: { rating: 0.8, target: 'bee' },
//   bee: { rating: 1, target: 'bee' },
//   'apple-cider': { rating: 0.5714285714285714, target: 'apple' }
// }

console.log(getMatches(blacklist, input, { fuzzy: true }))
// {
//   beer: { rating: 0.8, target: 'bee' },
//   bee: { rating: 1, target: 'bee' },
//   'l!ke': { rating: 0.3333333333333333, target: 'like' },
//   'apple-cider': { rating: 0.5714285714285714, target: 'apple' }
// }
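
For anyone curious about what happens under the hood, here is a minimal sketch of a Sørensen-Dice coefficient computed on character bigrams. It is an illustration of the general technique, not the package's actual source, and the function names are made up for this example:

```javascript
// Count the two-character sequences (bigrams) of a string
function bigrams (text) {
  const counts = new Map()
  for (let i = 0; i < text.length - 1; i++) {
    const pair = text.slice(i, i + 2)
    counts.set(pair, (counts.get(pair) || 0) + 1)
  }
  return counts
}

// 2 * shared bigrams / (bigrams in a + bigrams in b)
function diceCoefficient (a, b) {
  if (a === b) return 1
  if (a.length < 2 || b.length < 2) return 0

  const first = bigrams(a)
  const second = bigrams(b)
  let shared = 0

  for (const [pair, count] of second) {
    shared += Math.min(count, first.get(pair) || 0)
  }

  return (2 * shared) / ((a.length - 1) + (b.length - 1))
}

console.log(diceCoefficient('beer', 'bee')) // 0.8
console.log(diceCoefficient('the', 'bee'))  // 0
```

"beer" has the bigrams be, ee, er and "bee" has be, ee; two are shared, so the score is 2 * 2 / (3 + 2) = 0.8.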

rpg_digital · October 12, 2020, 12:50pm

That’s embarrassing.

[...'the beer drinking bee does not like apple-cider.'.matchAll(/(\bbee|\bapple)\S*/g)]

// Output
(3) [Array(2), Array(2), Array(2)]
0: (2) ["beer", "bee", index: 4, input: "the beer drinking bee does not like apple-cider.", groups: undefined]
1: (2) ["bee", "bee", index: 18, input: "the beer drinking bee does not like apple-cider.", groups: undefined]
2: (2) ["apple-cider.", "apple", index: 36, input: "the beer drinking bee does not like apple-cider.", groups: undefined]
length: 3

Back to school

You know how long it took me to write that &*?#*cks, carefully breaking down, step by step, how not to do it

rpg_digital · October 12, 2020, 12:57pm

@romeoanc It has lost some of its value, and checking m3g4p0p’s post #31 is probably prudent, but here is the former script amended accordingly

Potentially how not to do it:

// removing the code out of the global name space into an IIFE function
(function () {

  // can use 'const' for an array
  const words = [
    'transit',
    'transport ',
    'prevent',
    'transfer ',
    'application',
    'appointment',
    'appropriate',
    'event',
    'translator',
    'morning person'
  ]

  function buildRegex (start, words, end) {

    const wordsGroup = words
      .map(word => `\\b${word.trim()}`)
      .join('|')

    return new RegExp(`${start}${wordsGroup}${end}`, 'g')
  }

  const matchWordsRx = buildRegex('(', words, ')s?')

  // Check your console to see the outputs
  // Note: .trim removes leading and trailing whitespace (see transport and transfer above)
  console.log('%cwords array --->', 'color: Aquamarine', words)
  console.log('%cwords.map(word => `\\b${word.trim()}`) --->', 'color: Aquamarine', words.map(word => `\\b${word.trim()}`))
  console.log('%cjoin(\'|\') --->', 'color: Aquamarine', words.map(word => `\\b${word.trim()}`).join('|'))
  console.log('%cnew RegExp(`(${wordsGroup})s?`) --->', 'color: Aquamarine', matchWordsRx)

}())
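
One caveat when building a regex from a word list: any word containing regex metacharacters (for example `c++`) would be interpreted as pattern syntax. A minimal sketch of escaping each word first; `escapeRegExp` and `buildEscapedRegex` are hypothetical helpers for this example, not part of the script above:

```javascript
// Hypothetical helper: escape characters that have a special meaning in a regex
function escapeRegExp (text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

// Same idea as buildRegex above, but each word is escaped before joining
function buildEscapedRegex (words) {
  const wordsGroup = words
    .map(word => `\\b${escapeRegExp(word.trim())}`)
    .join('|')

  return new RegExp(`(${wordsGroup})s?`, 'g')
}

const escapedRx = buildEscapedRegex(['c++', 'transport '])

console.log('I transport c++ books'.match(escapedRx)) // ['transport', 'c++']
```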

romeoanc · October 12, 2020, 3:58pm

That is one of the things I’ve been reading a lot about… how hard it is to write a regex correctly, because sometimes each symbol can have more than one meaning…

m3g4p0p · October 12, 2020, 5:48pm

Yeah well that’s regular expressions. ¯\_(ツ)_/¯ Maybe a bit off-topic, but there’s a neat library called Super Expressive that makes working with these a bit easier; the above expression would then look something like this:

const SuperExpressive = require('super-expressive')

const regex = SuperExpressive()
  .wordBoundary
  .capture
    .anyOf
      .string('apple')
      .string('bee')
    .end()
  .end()
  .zeroOrMore.nonWhitespaceChar
  .allowMultipleMatches
  .caseInsensitive
  .toRegex()

const input = 'the beer drinking bee does not l!ke apple-cider'
console.log([...input.matchAll(regex)])

romeoanc · October 12, 2020, 6:43pm

I think we are on topic… very interesting library… I’m reading about it…

rpg_digital · October 13, 2020, 1:48am

I confess it got under my skin a bit, because your correct solution from memory (a false memory) is where I started. I thought I was losing the plot, to the point where I was thinking it would be ridiculous if I had to do the following

/transports?|transfers?|prevents?/

The only thing I can put it down to is that I may have been doing something daft along the lines of

/(?:\btransport\b|\btransfer\b|\bprevent\b)s?/

A break from the computer might be beneficial. lol
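
A short sketch of why a trailing `\b` inside each alternative gets in the way of the optional `s`: there is no word boundary between the final "t" and a following "s", so plurals never match. The names `daftRx` and `fixedRx` are made up for this illustration:

```javascript
// \b directly after each word, then an optional s
const daftRx = /(?:\btransport\b|\bprevent\b)s?/
// boundaries moved outside the group, after the optional s
const fixedRx = /\b(?:transport|prevent)s?\b/

console.log(daftRx.test('two transports'))      // false
console.log(fixedRx.test('two transports'))     // true
console.log('two transports'.match(fixedRx)[0]) // 'transports'
```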

romeoanc · October 14, 2020, 7:07pm

After the break, should we start making noise?
I’m open to hearing other alternatives / approaches, or ways to make this code better, but… this is working so far to find the exact match of 1 or 2 words in a sentence (I didn’t try to find more than 2 words).

// ****  bio samples to test  the script **** 
// bio = 'I am an application developer and I like beer' ;  
// bio = 'application developer' ; 
// bio = 'application developer and I dont like Apples and I love Beer' ; 
var bio = 'I translate texts, drink beer, beer and beer. I am a morning person';
// bio = 'I use apple-cyder everyday' ; 
// bio = 'every day I listen The Beatles on the radio' ; 
// bio =  'every month I buy potato chips';
// bio = "find me dont have instagram : @asas__li"
// bio = "find me dont have instagram"
// bio = 'Im a Freelance translator beer fr' ;  
const blacklist = [
  'trans',
  'apples',
  'beer',
  "beat",
  "morning person",
  "potato chips",
  "instagram",
   ];

// -----  no need  --- FROM here -----
console.log(blacklist);
console.log(bio);
// print a separator line as long as the bio
var linea = "";
for (var j = 0; j < bio.length; j++) {
  linea = linea + "-";
}
console.log(linea);
// -----  no need ----  TO here -----

var hits = 0;
for (var i = 0; i < blacklist.length; i++) {
  var re = new RegExp("\\b" + blacklist[i] + "\\b", "g");
  var quant = (bio.match(re) ? bio.match(re).length : 0);
  if (quant !== 0) {
    hits++;
    console.log(quant, "EXACT Match/es for " + blacklist[i]);
  } else {
    // console.log("No Matches");
  }
}
console.log('Total Found:', hits);

rpg_digital · October 17, 2020, 11:52am

You’re certainly picking things up romeoanc

Just a bit of a look at your script and playing around with alternatives

var hits = 0

// instead of checking the length property each time
// we could assign it to a variable
// e.g. for (let i = 0, len = blacklist.length; i < len; i++)
for (var i = 0; i < blacklist.length; i++) {

  var re = new RegExp('\\b' + blacklist[i] + '\\b', 'g')

  // below we are doing the same match twice 'bio.match(re)'
  // we could assign the match to a variable instead
  // const match = bio.match(re)
  // const quant = (match ? match.length : 0)
  var quant = (bio.match(re) ? bio.match(re).length : 0)
  if (quant !== 0) {
    hits++
    console.log(quant, 'EXACT Match/es for ' + blacklist[i])
    continue // will jump to the next iteration of the loop
  }

  console.log('No Matches')
}

console.log('Total Found:', hits);

The above refactored into a forEach loop

// wrapped in an IIFE (immediately invoked function expression)
// this takes the code out of the global namespace
(function () {
  const blacklist = [
    'trans',
    'apples',
    'beer',
    'beat',
    'morning person',
    'potato chips',
    'instagram'
  ]

  const bio = 'I translate texts, drink beer, beer and beer. I am not a morning person'

  // not great having 'hits' be changed from inside the forEach function
  // but for now....
  let hits = 0

  blacklist.forEach(function (word) {
    const wordRx = new RegExp('\\b' + word + '\\b', 'g')
    const matches = bio.match(wordRx) // will return an array or null if no matches

    if (matches !== null) {
      const matchCount = matches.length

      console.log(matchCount + ' exact match/es for ' + word)
      hits += matchCount
      return // early return if a match (a bit like 'continue')
    }

    console.log('No match for ' + word)
  })

  console.log('Total Found', hits)
}());

Going for a more functional and declarative approach using filter and map and flat

(function () {
  const blacklist = [
    'trans',
    'apples',
    'beer',
    'beat',
    'morning person',
    'potato chips',
    'instagram'
  ]

  const bio = 'I translate texts, drink beer, beer and beer. I am not a morning person'

  function isMatch (needle, haystack) {
    return new RegExp('\\b' + needle + '\\b', 'g').test(haystack)
  }

  function matches (needle, haystack) {
    return haystack.match(new RegExp('\\b' + needle + '\\b', 'g'))
  }

  const hits = blacklist
    .filter(function (word) { return isMatch(word, bio) })
    .map(function (word) { return matches(word, bio) })

  console.dir(hits)
  // Array(2)
  //   0: (3) ["beer", "beer", "beer"]
  //   1: ["morning person"]
  console.log('Total Found', hits.flat().length) // 4
}())
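
A side note on the `g` flag with `test`: the helpers above create a fresh RegExp on every call, which sidesteps a known pitfall. A regex with the `g` flag remembers where the last match ended (`lastIndex`), so reusing one instance can give alternating results. A minimal sketch, with a made-up variable name:

```javascript
const sharedRx = /\bbeer\b/g

const first = sharedRx.test('beer')  // true, lastIndex is now 4
const second = sharedRx.test('beer') // false, search resumed at index 4; lastIndex resets to 0
const third = sharedRx.test('beer')  // true again

console.log(first, second, third) // true false true
```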

I’m moving on to currying here, to refactor further; maybe a bit advanced. I would recommend looking into ‘closures’ first

(function () {
  const blacklist = [
    'trans',
    'apples',
    'beer',
    'beat',
    'morning person',
    'potato chips',
    'instagram'
  ]

  const bio = 'I translate texts, drink beer, beer and beer. I am not a morning person'

  // this is currying.
  const isMatch = haystack => needle => new RegExp('\\b' + needle + '\\b', 'g').test(haystack)

  const matches = haystack => needle => haystack.match(new RegExp('\\b' + needle + '\\b', 'g'))

  // a lot cleaner
  const hits = blacklist
    .filter(isMatch(bio)) // return matched words only
    .map(matches(bio)) // return actual matches

  console.dir(hits)
  console.log('Total Found', hits.flat().length) // 4
}())
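
For anyone new to closures, a stripped-down sketch of the same pattern may help. The outer function receives the haystack and returns a new function that still "remembers" it; the names `contains` and `inBio` are invented for this example:

```javascript
// Long form of: const contains = haystack => needle => haystack.includes(needle)
function contains (haystack) {
  // the returned function closes over 'haystack'
  return function (needle) {
    return haystack.includes(needle)
  }
}

const inBio = contains('I drink beer') // haystack is fixed here

console.log(inBio('beer'))                   // true
console.log(inBio('wine'))                   // false
console.log(['beer', 'wine'].filter(inBio))  // ['beer']
```

Because `inBio` takes a single argument, it can be handed straight to `filter` or `map`, which is what makes the chained version above so compact.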

Finally, we could go back to building a single regex from all the words, joined with the or operator (‘|’)

(function () {
  const blacklist = [
    'trans',
    'apples',
    'beer',
    'beat',
    'morning person',
    'potato chips',
    'instagram'
  ]

  // \btrans\b|\bapples\b|\bbeer\b|\bbeat\b|\bmorning person\b|\bpotato chips\b|\binstagram\b
  const blacklistRx = new RegExp('\\b' + blacklist.join('\\b|\\b') + '\\b', 'g')

  const bio = 'I translate texts, drink beer, beer and beer. I am not a morning person'

  // ['beer', 'beer', 'beer', 'morning person']
  // match returns null when nothing matches, so fall back to an empty array
  const hits = bio.match(blacklistRx) || []

  console.log(hits) // (4) ["beer", "beer", "beer", "morning person"]
  console.log('Total hits', hits.length) // 4
  // sets only accept unique elements, so duplicates are ignored
  // 'size' returns the number of elements, like length on an array
  console.log('Total word matches', new Set(hits).size) // 2
}())

We could also use array’s reduce method and a whole lot of alternatives, but hopefully there is something of use to you there.
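
As one possible take on the reduce idea, here is a minimal sketch that collects a match count per blacklisted word into a single object. It reuses the blacklist and bio from the examples above; `countsByWord` is a name made up for this sketch:

```javascript
const blacklist = [
  'trans',
  'apples',
  'beer',
  'beat',
  'morning person',
  'potato chips',
  'instagram'
]

const bio = 'I translate texts, drink beer, beer and beer. I am not a morning person'

const countsByWord = blacklist.reduce(function (result, word) {
  const found = bio.match(new RegExp('\\b' + word + '\\b', 'g'))

  // only record words that actually occur
  if (found !== null) {
    result[word] = found.length
  }

  return result
}, {})

console.log(countsByWord) // { beer: 3, 'morning person': 1 }
```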

romeoanc · October 20, 2020, 12:34pm

What a great coincidence! Last weekend I was reading about it and I couldn’t understand it. Now all the pieces fall into place thanks to your code.

romeoanc · October 22, 2020, 2:51pm

I’ve been studying and testing those alternative solutions.
They are simple and much cleaner. Thanks!

system · January 21, 2021, 9:51pm

This topic was automatically closed 91 days after the last reply. New replies are no longer allowed.