["'“]?([A-Z]
First, a rule for a typical sentence beginning: it starts with some optional quotation marks (or whatever else you want to allow here, which is one reason why there is no bulletproof solution; see the conclusions) and, for sure, a leading capital letter.
Then define the conditions that tell you when you are coming to the end of a sentence.
My definition (not all-inclusive, as you will see) is: “Looking ahead, you are coming close to a sentence end if you see a word of at least two letters, or a number, followed by the usual punctuation marks, with optional quotation marks before and after them, then whitespace and, finally, a capital letter that starts the next sentence.” You will also see why I don’t care for the end-of-string anchor…
((?!([A-Za-z]{2,}|\d+)[‘“]?[.?!]+[”’]?\s+["']?[A-Z]).)*)
The negative lookahead is grouped with the universal selector (the dot), and the group repeats zero or more times. Because the lookahead itself is zero-length, the regexp in effect moves in lockstep: at each character it checks that we are not yet approaching a sentence end as defined above, and only then lets the dot consume that character. The lookahead then repeats the same check one character further on…
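A minimal sketch of this lockstep idea in Python, using a deliberately simplified end-of-sentence test (punctuation + whitespace + capital letter) rather than the full one above. The name `tempered` is only illustrative.

```python
import re

# Consume one character at a time, but only while the lookahead does NOT
# see "punctuation, whitespace, capital letter" starting at this position.
tempered = re.compile(r"(?:(?![.?!]\s+[A-Z]).)*")

text = "It rained all day. Then the sun came out."
print(tempered.match(text).group(0))         # It rained all day

# If the lookahead pattern never matches, the dot runs to the end of the string.
print(tempered.match("Last one.").group(0))  # Last one.
```

The match stops right before the first full stop because that is the first position where the lookahead pattern succeeds, so the negative lookahead refuses to let the dot go on.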
The problem is that this way in-sentence abbreviations would be regarded as sentence ends, which is unacceptable for me.
Also note that this part of the regexp stops short of the end of any “ordinary” sentence, at least two letters before the closing punctuation, and similarly leaves out any in-sentence abbreviation of two or more letters!
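To see where this first part actually stops, here is a small Python check. It assumes the first two fragments are joined and that the repetition star follows the lookahead/dot group. The name `first_stage` is made up for the example.

```python
import re

# Opening quote + capital letter + lockstep group (star assumed after the group).
first_stage = re.compile(
    r"""["'“]?([A-Z]((?!([A-Za-z]{2,}|\d+)[‘“]?[.?!]+[”’]?\s+["']?[A-Z]).)*)"""
)

# Ordinary sentence: the match ends before the last word, not at the full stop.
print(repr(first_stage.match("It rained all day. Then the sun came out.").group(0)))
# 'It rained all '

# In-sentence abbreviation: "Dr. S" looks like a sentence end, so we stop before "Dr".
print(repr(first_stage.match("I met Dr. Smith today. He was late.").group(0)))
# 'I met '
```

Both results are incomplete on purpose: the rest of the expression has to pick up from these stopping points.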
So this is where I cater for the possible abbreviations first.
(((Mr|Ms|Mrs|Dr|Capt|Col)\.\s+((?!\w{2,}[.?!][‘“]?\s+[”’]?[A-Z]).)*)?)*
Here the logic is similar to the one above.
If the capital letter at which the previous part of the regexp stopped belongs to one of the named abbreviations (Mr|Ms|Mrs|Dr|Capt|Col), it must be followed by a dot and one or more whitespace characters.
Then, in the same lockstep manner, I keep looking ahead and select everything until I arrive at the next critical point: a word of at least two characters (the last word of a normal sentence, or another abbreviation of two or more letters) + a punctuation mark + optional quotation mark + whitespace + optional quotation mark + a capital letter, which either starts the next sentence or belongs to another abbreviation. The outer asterisk means zero or more repetitions of the whole abbreviation part, so more abbreviations can follow…
In the end I only have to select the remaining characters of the sentence.
((?![.?!][“']?\s+[”‘]?[A-Z]).)*
This selects everything preceding the punctuation marks at the very end of the sentence.
[.?!]+["’”]?
Then come the punctuation marks plus an optional quotation mark.
If the pattern in the negative lookahead no longer matches anywhere, the lookahead/dot combination selects everything to the end of the string, which is why I do not need the end-of-string anchor…
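Putting the pieces together, a rough Python sketch could look like this. It assumes the fragments are simply concatenated in the order shown, with the repetition stars placed after each lookahead/dot group and after the whole abbreviation part. The constant names and `split_sentences` are only there for the example.

```python
import re

START  = r"""["'“]?([A-Z]"""
BODY   = r"""((?!([A-Za-z]{2,}|\d+)[‘“]?[.?!]+[”’]?\s+["']?[A-Z]).)*)"""
ABBREV = r"""(((Mr|Ms|Mrs|Dr|Capt|Col)\.\s+((?!\w{2,}[.?!][‘“]?\s+[”’]?[A-Z]).)*)?)*"""
REST   = r"""((?![.?!][“']?\s+[”‘]?[A-Z]).)*"""
END    = r"""[.?!]+["’”]?"""

SENTENCE = re.compile(START + BODY + ABBREV + REST + END)

def split_sentences(text):
    """Return the whole-sentence matches, in order of appearance."""
    # finditer + group(0) is used because the pattern has capturing groups,
    # so findall would return tuples of groups instead of whole matches.
    return [m.group(0) for m in SENTENCE.finditer(text)]

print(split_sentences("He met Dr. Smith. Then he left."))
# ['He met Dr. Smith.', 'Then he left.']
```

Note that the last sentence is found without any end-of-string anchor: the lookaheads simply never match there, and the final punctuation is picked up by backtracking.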
Basically, I could only arrive at a very good approximation for selecting whole sentences; there is no all-inclusive solution, because:
1.) Sentences can start with numbers (“3.14 is used as a special value in mathematics.”), but numbers quite commonly occur inside sentences too, don’t they? How will you distinguish between the two situations? You can create a list of possible in-sentence abbreviations, but numbers are just numbers, regardless of where they are…
2.) There is a whole range of special characters (various types of quotation marks, long dashes and the like) that can precede or follow the punctuation marks at the end of a sentence. I have now changed the regexp to allow a quotation mark before the sentence-closing punctuation marks to solve your problem, but you can go on and on adding more of these characters to the square brackets, both before and after the punctuation marks! There will always be newspaper articles, for instance, that use some unexpected special character in these positions, so it is a matter of your own judgment what you put in the square brackets…
3.) A possible solution: you could replace the special characters in your texts with a well-maintained, limited set before matching.
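For point 3, the replacement could be sketched like this in Python. The mapping is only an example and certainly not complete; `QUOTE_MAP` and `normalize_punctuation` are names made up here.

```python
# Hypothetical, incomplete mapping from typographic characters to a small ASCII set.
QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u2014": "-",  # long dash
    "\u2013": "-",  # shorter dash
})

def normalize_punctuation(text):
    """Replace the mapped characters so the regexp only has to know a few of them."""
    return text.translate(QUOTE_MAP)

print(normalize_punctuation("\u201cYes,\u201d she said \u2014 \u2018no\u2019."))
# "Yes," she said - 'no'.
```

After such a step the character classes in the regexp can stay short, at the cost of losing the original typography in the matched text.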
I hope this helps. If not, feel free to ask.