Unicode property escapes are a new one on me romeoanc, thanks for that.
I did initially mistake them for named capturing groups
(?P<name>...), which I have used in PHP, but I don’t think are available in JS yet
Yes they are (with a slightly different syntax), but browser support is still rather meager AFAIK…
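For reference, here is a minimal sketch of both syntaxes, assuming an engine with ES2018 regex support (a recent Node or evergreen browser). The sample strings and variable names are made up for illustration:

```javascript
// Named capturing group: JS uses (?<name>...) rather than PHP's (?P<name>...)
const dateRx = /(?<year>\d{4})-(?<month>\d{2})/
const dateMatch = dateRx.exec('released 2018-06')
console.log(dateMatch.groups.year, dateMatch.groups.month) // 2018 06

// Unicode property escape: \p{...} only works with the "u" flag
const letterRx = /^\p{L}+$/u
console.log(letterRx.test('Zo\u00EB')) // true, "ë" counts as a letter
console.log(letterRx.test('Zo\u00EB!')) // false, "!" is not a letter
```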
I try to test everything you guys talk about or post, and at the same time I keep researching and learning…
Catching part of a word is too risky, i.e. the script will ban / catch
I keep looking and trying different things…
It is just a thought…
For a single complete word, again a word boundary
\bsuper\b
matches: super, but not superficial or superintendent
You also have lookaheads, for example a negative lookahead
/mega(?!phone|tron)\w*/g
matches: mega, megabytes, but not megaphone or megatron
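A quick way to check both patterns, as a rough sketch (the sample string is made up for the test):

```javascript
const sample = 'super superficial mega megabytes megaphone megatron'

// \b on both sides: only the standalone word matches
const superMatches = sample.match(/\bsuper\b/g)
console.log(superMatches) // [ 'super' ]

// Negative lookahead: "mega" must not be followed by "phone" or "tron"
const megaMatches = sample.match(/mega(?!phone|tron)\w*/g)
console.log(megaMatches) // [ 'mega', 'megabytes' ]
```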
As mentioned above we are opening a can of worms here.
Whilst I think it’s good to learn regular expressions, they can become lengthy and complicated to read.
You may find more straightforward alternatives such as split, indexOf, filter etc., or, as m3g4p0p posted above, a mix of the two.
Furthermore, by being overly strict with your blacklist you can make the user experience and your job a PITA. For example, what about names and places?
Reading Rude and Funny British place names highlights this issue and may give you a bit of a giggle at the same time.
If you are interested in regular expressions though this is quite informative https://www.regular-expressions.info/tutorial.html and I would say the go to book is O’Reilly’s Mastering Regular Expressions — I picked up a decent second-hand copy about six months ago
This is indeed an interesting idea… but to completely avoid the Scunthorpe Problem I think you have no other choice than including every unwanted combination manually. So at the end of the day you have to weigh up if you’re willing to accept false positives, or possibly have bad words in your bios.
m3g4p0p, I thought about the endless blacklist, or when to stop adding words to the list… probably it will be a personal decision based on the ban / no-ban experience and its consequences (the Scunthorpe problem). Excellent article…
Thinking deeper… caffeine is getting into my veins… I will use this blacklist code just to catch some bio / profile / descriptions. If I can make this strict code work well, it will be easy to remove words from the blacklist itself.
I’ll keep working and reading… and post ideas / solutions
I’ve been toying with that idea and came up with the following:
function trimPunctuation (word) {
  return word.replace(/^\W*|\W*$/g, '')
}

function getFullWord (value, index) {
  const start = value.slice(0, index).search(/\S*$/)
  const end = value.slice(index).search(/\s|$/) + index
  const word = value.slice(start, end)
  return trimPunctuation(word)
}

function calculateWeighting (value, word) {
  return word.length / value.length
}

function accumulateMatches (value, word) {
  const index = value.indexOf(word)
  if (index === -1) {
    return {}
  }
  const fullWord = getFullWord(value, index)
  const remaining = value.slice(index + 1)
  return {
    [fullWord]: calculateWeighting(fullWord, word),
    ...accumulateMatches(remaining, word)
  }
}

function getWeightedMatches (list, value) {
  return list.reduce((result, word) => ({
    ...result,
    ...accumulateMatches(value, word)
  }), {})
}

const matches = getWeightedMatches(['apple', 'bee'], 'the beer drinking bee does not like apple-cider')
console.log(matches) // { 'apple-cider': 0.45454545454545453, beer: 0.75, bee: 1 }
Not sure if this is particularly useful like this though – “beer” having a higher weighting than “apple-cider” in the example. So there’s certainly room for improvement regarding the weighting function, such as taking the difference of the word lengths into account as well. It’s quite a fun exercise anyway.
:-)
I like the above use of spread with the accumulated object, nice.
Had to test it
const fruit = ['apple', 'banana', 'cantaloupe', 'durian']
  .reduce(
    (result, word) =>
      ({
        ...result,
        [word.charAt(0)]: word
      }),
    {}
  )
console.log(fruit) // {a: "apple", b: "banana", c: "cantaloupe", d: "durian"}
my small contribution ( I was too tired last night to update you guys )
let words = ['transit', 'transport ', 'prevent', 'transfer ', 'application',
  'appointment', 'appropriate', 'event', 'translator', 'morning person']
var bl = 'morning person';
console.log("bl: " + bl);
var bl1 = ("/^" + bl + "$/");
console.log("bl1: " + bl1);
var bl = new RegExp("^" + bl + "$"); // ^ match first part ^after => afterhour, afternoon, aftershave
// $ match the last part on$ => icon, clon, neon, upon
// ^variable$ matches the first part and the last part of a word
console.log('bl after RegExp: ', bl);
words.forEach(word => {
  if (bl.exec(word)) {
    console.log(`- Found: ${word}`);
  }
})
console.log('- Done searching...');
The idea could work… I was trying to blend it into your code, guys, searching inside a sentence (string). I couldn’t make it work; I’ll keep trying today.
Then we can make the code less strict just by switching or removing the ^ and/or $.
Actually this will only match the exact string “variable” (i.e. word === 'variable'); the ^ assertion matches the beginning of the tested string, and $ the very end. So this would only work for sentences if the complete sentence is included in the blacklist.
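To illustrate the difference, here is a small sketch comparing the ^…$ anchors with word boundaries. The phrase and sentences are made-up samples, and the phrase is assumed to contain no regex special characters:

```javascript
const phrase = 'morning person'
const sentence = 'I am a morning person, really'

// ^...$ only matches when the whole string equals the phrase
const exactRx = new RegExp(`^${phrase}$`)
console.log(exactRx.test(phrase)) // true
console.log(exactRx.test(sentence)) // false

// \b...\b matches the phrase anywhere, as long as it sits on word boundaries
const boundaryRx = new RegExp(`\\b${phrase}\\b`)
console.log(boundaryRx.test(sentence)) // true
console.log(boundaryRx.test('a morning personality')) // false
```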
BTW, while it is not forbidden you should avoid redeclaring variables. Actually, there’s no need to use var at all if you’re using let anyway, which is preferable in every respect, including that it will throw an error when attempting to redeclare it within the same scope.
"So after my talk about not using regexes" oops! During these 2 or 3 weeks I have been reading / researching so much about JavaScript that at some point all this info is in my head… but I cannot easily link or relate it to our conversation; it is a big mix of information… but that didn’t stop me from moving forward and learning. I double-checked the spelling of the word; yes, it is correct: appropriate, e.g. what is appropriate to wear to work?
Thanks rpg_digital for all your help and input. I’m going to read and try the code.
I’ll try it soon, and thanks for your support.
Today, Sunday, I dedicated a couple of hours just to reading about RegExp. Very interesting things: the pros and cons and the controversial part of it.
Hm isn’t the lookahead assertion kinda redundant at the beginning of the expression, meaning “anything followed by x”? If I’m not mistaken the same could be achieved like
/(\bbee|\bapple)\S*/g
Anyway, maybe another approach would be using an actual comparison algorithm for the heavy lifting, such as the Levenshtein distance or the Sørensen-Dice coefficient… the latter probably being more useful here as it gives us a percentage value.
This would also allow for fuzzy matches if desired; here’s an example using the Sørensen-Dice-based string-similarity package (too lazy to re-implement the wheel right now hehe):
const { findBestMatch } = require('string-similarity')

function trimPunctuation (word) {
  // Strip surrounding non-word characters; e.g. remove the
  // exclamation mark from "something!" but not "someth!ng"
  return word.replace(/^\W*|\W*$/g, '')
}

function getMatches (list, value, { fuzzy = false } = {}) {
  const words = value.split(/\s+/).map(trimPunctuation)
  return words.reduce((matches, word) => {
    const testWords = fuzzy ? list : list.filter(listed => word.includes(listed))
    if (testWords.length === 0) {
      return matches
    }
    const { bestMatch } = findBestMatch(word, testWords)
    const { rating, target } = bestMatch
    if (rating > 0) {
      matches[word] = { rating, target }
    }
    return matches
  }, {})
}

const blacklist = ['apple', 'bee', 'like']
const input = 'the beer drinking bee does not l!ke apple-cider'

console.log(getMatches(blacklist, input))
// {
//   beer: { rating: 0.8, target: 'bee' },
//   bee: { rating: 1, target: 'bee' },
//   'apple-cider': { rating: 0.5714285714285714, target: 'apple' }
// }

console.log(getMatches(blacklist, input, { fuzzy: true }))
// {
//   beer: { rating: 0.8, target: 'bee' },
//   bee: { rating: 1, target: 'bee' },
//   'l!ke': { rating: 0.3333333333333333, target: 'like' },
//   'apple-cider': { rating: 0.5714285714285714, target: 'apple' }
// }
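For anyone curious what happens under the hood, here is a hand-rolled sketch of a Sørensen-Dice coefficient over character bigrams. It is not the package's own implementation, and the function names are made up for illustration; it just shows where ratings like 0.8 for "beer" vs. "bee" come from:

```javascript
// Collect adjacent character pairs (bigrams) with their counts
function getBigrams (text) {
  const bigrams = new Map()
  for (let i = 0; i < text.length - 1; i++) {
    const pair = text.slice(i, i + 2)
    bigrams.set(pair, (bigrams.get(pair) || 0) + 1)
  }
  return bigrams
}

// Dice = 2 * shared bigrams / (bigrams in a + bigrams in b)
function diceCoefficient (a, b) {
  if (a === b) return 1
  if (a.length < 2 || b.length < 2) return 0
  const bigramsA = getBigrams(a)
  let shared = 0
  for (let i = 0; i < b.length - 1; i++) {
    const pair = b.slice(i, i + 2)
    const count = bigramsA.get(pair) || 0
    if (count > 0) {
      // Use up one occurrence so repeated bigrams are not counted twice
      bigramsA.set(pair, count - 1)
      shared++
    }
  }
  return (2 * shared) / (a.length - 1 + b.length - 1)
}

console.log(diceCoefficient('beer', 'bee')) // 0.8
console.log(diceCoefficient('l!ke', 'like')) // one shared bigram out of six, roughly 0.33
```

"beer" has the bigrams be, ee, er and "bee" has be, ee; two are shared, so the result is 2 * 2 / (3 + 2) = 0.8.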
That’s embarrassing.
[...'the beer drinking bee does not like apple-cider.'.matchAll(/(\bbee|\bapple)\S*/g)]
// Output
(3) [Array(2), Array(2), Array(2)]
0: (2) ["beer", "bee", index: 4, input: "the beer drinking bee does not like apple-cider.", groups: undefined]
1: (2) ["bee", "bee", index: 18, input: "the beer drinking bee does not like apple-cider.", groups: undefined]
2: (2) ["apple-cider.", "apple", index: 36, input: "the beer drinking bee does not like apple-cider.", groups: undefined]
length: 3
Back to school
You know how long it took me to write that &*?#*cks, carefully breaking down how not to do it step by step
@romeoanc It has lost some of its value, and checking m3g4p0p’s post #31 is probably prudent, but here is the former script amended accordingly
Potentially how not to do it:
// Moving the code out of the global namespace into an IIFE
(function () {
  // can use 'const' for an array
  const words = [
    'transit',
    'transport ',
    'prevent',
    'transfer ',
    'application',
    'appointment',
    'appropriate',
    'event',
    'translator',
    'morning person'
  ]

  function buildRegex (start, words, end) {
    const wordsGroup = words
      .map(word => `\\b${word.trim()}`)
      .join('|')
    return new RegExp(`${start}${wordsGroup}${end}`, 'g')
  }

  const matchWordsRx = buildRegex('(', words, ')s?')

  // Check your console to see the outputs
  // Note: .trim removes trailing spaces (see transport and transfer above)
  console.log('%cwords array --->', 'color: Aquamarine', words)
  console.log('%cwords.map(word => `\\b${word.trim()}`) --->', 'color: Aquamarine', words.map(word => `\\b${word.trim()}`))
  console.log('%cjoin(\'|\') --->', 'color: Aquamarine', words.map(word => `\\b${word.trim()}`).join('|'))
  console.log('%cnew RegExp(`(${wordsGroup})s?`) --->', 'color: Aquamarine', matchWordsRx)
}())
That is one of the things I’ve been reading a lot about… the difficulty of writing a regex correctly, because sometimes each symbol can have more than one meaning…
Yeah well that’s regular expressions.
¯\_(ツ)_/¯ Maybe a bit off-topic, but there’s a neat library called Super Expressive that makes working with these a bit easier; the above expression would then look something like this:
const SuperExpressive = require('super-expressive')

const regex = SuperExpressive()
  .wordBoundary
  .capture
    .anyOf
      .string('apple')
      .string('bee')
    .end()
  .end()
  .zeroOrMore.nonWhitespaceChar
  .allowMultipleMatches
  .caseInsensitive
  .toRegex()

const input = 'the beer drinking bee does not l!ke apple-cider'
console.log([...input.matchAll(regex)])
I think we are on topic… very interesting library… I’m reading about it…
I confess it got under my skin a bit, because your correct solution from memory (a false memory) is where I started. I thought I was losing the plot, to the point where I was thinking it would be ridiculous if I had to do the following
/transports?|transfers?|prevents?/
The only thing I can put it down to is that I may have been doing something daft along the lines of
/(?:\btransport\b|\btransfer\b|\bprevent\b)s?/
A break from the computer might be beneficial. lol
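For what it's worth, a quick sketch of why a word boundary after each word would get in the way of the optional plural (the test string is made up):

```javascript
const text = 'transport transports'

// \b after the word forbids a following "s", so the optional s? can never match
const tooStrict = /(?:\btransport\b|\btransfer\b|\bprevent\b)s?/g
console.log(text.match(tooStrict)) // [ 'transport' ]

// Boundary only in front of the word: the plural is picked up as well
const withPlural = /\b(?:transport|transfer|prevent)s?/g
console.log(text.match(withPlural)) // [ 'transport', 'transports' ]
```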
After the break, should we start making noise?
I’m open to hearing other alternatives / approaches, or to making this code better, but… this is working so far to find the exact match of 1 or 2 words in a sentence (I didn’t try to find more than 2 words).
// bio info to test samples //
// bio = 'I am an application developer and I like beer' ; // true
// bio = 'application developer' ; // false
// bio = 'application developer and I dont like Apples and I love Beer' ; // true
let bio = 'I translate texts, drink beer, beer and beer. I am a morning person'; // true
// bio = 'I use apple-cyder everyday' ; // false
// bio = 'every day I listen The Beatles on the radio' ; // 4- true
// bio = 'every month I buy potato chips';
// bio = "find me dont have instagram : @asas__li"
// bio = "find me dont have instagram"
// bio = 'Im a Freelance translator beer fr' ; // true

const blacklist = [
  'trans',
  'apples',
  'beer',
  "beat",
  "morning person",
  "potato chips",
  "instagram",
];

// ----- no need --- FROM here -----
console.log(blacklist);
console.log(bio);
let linea = "";
for (let j = 0; j < bio.length; j++) {
  linea = linea + "-";
}
console.log(linea);
// ----- no need ---- TO here -----

// hits counts how many blacklist entries were found at least once
let hits = 0;
for (let i = 0; i < blacklist.length; i++) {
  // \b on both sides, so only whole words / phrases count
  const re = new RegExp("\\b" + blacklist[i] + "\\b", "g");
  const found = bio.match(re);
  const quant = found ? found.length : 0;
  if (quant !== 0) {
    hits++;
    console.log(quant, "EXACT Match/es for " + blacklist[i]);
  } else {
    // console.log("No Matches");
  }
}
console.log('Total Found:', hits);
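One possible refinement, purely as a sketch: escape the blacklist entries before building the regex (an entry like 'c++' would otherwise make new RegExp throw) and add the i flag so that "Beer" and "beer" both count. The helper names and sample data below are made up; whether case-insensitive matching is wanted here is an assumption:

```javascript
// Escape characters that have a special meaning in a regular expression
function escapeRegExp (text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

// Count whole-word, case-insensitive occurrences of each listed term
function countMatches (list, text) {
  return list.reduce((result, term) => {
    const rx = new RegExp(`\\b${escapeRegExp(term)}\\b`, 'gi')
    const found = text.match(rx)
    if (found) {
      result[term] = found.length
    }
    return result
  }, {})
}

const sampleBio = 'I translate texts, drink Beer, beer and beer. I am a morning person'
const sampleList = ['trans', 'beer', 'morning person', 'c++']
console.log(countMatches(sampleList, sampleBio))
// { beer: 3, 'morning person': 1 }
```

Returning an object keyed by term keeps the per-term counts around, so the total number of hits is just Object.keys(result).length.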