Choose whole sentences and ONLY whole sentences RELIABLY with regex
I have found a solution to select ANY whole English sentence reliably, regardless of quotation marks, or even of punctuation marks used inside the sentence for abbreviations, decimals or whatever other purposes! It tests reliably on any non-accented string!
In human language it reads as follows:
Find a non-accented capital letter that may be preceded by a quotation mark, and check that it is not directly followed by a punctuation mark, to exclude capital-letter abbreviations inside sentences. Then crawl forward by repeating a group consisting of a negative look-ahead and the universal selector character (the dot) until you arrive at the end of the sentence you are in. You will know you are there when you find the sequence of a possible quotation mark (the one closing its pair at the start of your sentence), the sentence-closing punctuation mark, and the white space that necessarily separates your sentence from the next one. After the white space you repeat the criteria for the start of a sentence, to confirm that a new one has already begun! Because of the negative condition in the look-ahead, the repeated group (really just the universal selector) does not consume the closing punctuation mark and the possible quotation mark, so you have to match these separately.
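The description above can be sketched in Python roughly as follows. This is only one possible reading of it, not the exact expression I use: the names QUOTE, START, END and find_sentences are made up for the illustration, the set of quotation and punctuation characters is an assumption, a quotation mark is accepted on either side of the closing punctuation mark, and the end of the string is also accepted in place of a following sentence.

```python
import re

# Assumption: straight and curly quotes all count as "quotation marks".
QUOTE = "[\"'\u201c\u201d\u2018\u2019]"

# Start of a sentence: optional quote, a non-accented capital letter,
# not directly followed by a punctuation mark (rules out "U.S." etc.).
START = rf"{QUOTE}?[A-Z](?![.!?,;:])"

# End of a sentence: closing punctuation with an optional quote around it.
END = rf"{QUOTE}?[.!?]{QUOTE}?"

SENTENCE = re.compile(
    # Crawl forward one character at a time, as long as we are NOT standing
    # on an END that is followed by white space and a new START (or by the
    # end of the text). Then match the END itself separately.
    rf"{START}(?:(?!{END}(?:\s+{START}|\s*$)).)*{END}",
    re.DOTALL,
)


def find_sentences(text):
    """Return every whole sentence found in text, in order."""
    return SENTENCE.findall(text)


sample = ('The price rose by 3.5 percent in the U.S. last year. '
          '"Is that right?" he asked. Yes, it is!')
for sentence in find_sentences(sample):
    print(sentence)
```

An abbreviation directly followed by a capitalised word (a title before a name, for instance) still splits the sentence in this sketch, which is one of the rare exceptions.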
SUGGESTION FOR FURTHER DEVELOPMENT:
Together with the non-accented starting capital letters, you can also use hexadecimal notation to describe accented ANSI capital letters, in order to select sentences in other European languages. But this is not an issue for me at present.
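As a rough illustration of the hexadecimal idea, assuming that "ANSI" here means the Latin-1 range, the class of starting letters could be widened as below. CAPITAL and starts_with_capital are names invented for the sketch. The two ranges cover the accented capitals from U+00C0 to U+00DE and skip U+00D7, which is the multiplication sign.

```python
import re

# A-Z plus the Latin-1 accented capitals, written in hexadecimal notation.
# \xC0-\xD6 covers the capitals before the multiplication sign (\xD7),
# \xD8-\xDE covers the ones after it.
CAPITAL = re.compile(r"[A-Z\xC0-\xD6\xD8-\xDE]")


def starts_with_capital(text):
    """True if the first character is a plain or Latin-1 accented capital."""
    return CAPITAL.match(text) is not None


print(starts_with_capital("École"))   # accented capital
print(starts_with_capital("école"))   # accented lower-case letter
```

Capitals outside Latin-1 (the Hungarian double-acute letters, for example) are not covered by these two ranges and would need their own code points.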
some reasons to try to validate whole sentences
There are at least two reasons.
The first is a general one: it is easy to match whole words and paragraphs with regular expressions, and I never understood why one shouldn’t attempt to work in between the two and choose whole sentences, too. In everyday life we often need to know how many sentences a text may consist of ("write only 10-15 sentences", etc.). Then we should be able to find and count sentences in programming, shouldn’t we?
The practical reason is that I am extending a browser-based tool for making language-learning materials with a feature that allows teachers to take paragraphs of text and freely manipulate any part of them (drop vowels or consonants, mix letters, omit words, etc.) before they automatically get a crossword as one possible output. This involves lots of things, but the point here is that, when filling in the crossword, people should each time see only the exact sentence that the word(s) in a row or column came from.
some examples for how selecting sentences could be useful
I don't think the code is beautiful or fast; in fact, just the opposite. But at least it does the job in, let's say, 98% of all cases, with the rare exceptions mentioned, and it can be expanded further for use with accented character sets.
If you ever want to design desktop-like applications to manipulate texts (to add a variety of interactions to some parts, or to allow transformations, for instance), you will soon need to select sentences. Believe me! Interactive tests are a good example.
there are always rules and definitions behind..
1.) I wrote about a GENERAL NEED to choose sentences, not just whole words and paragraphs.
2.) You can only select a word or a paragraph with any certainty because you can define very clearly what they are, and you insist on these definitions when you write regular expressions to select them: a word is a continuous part of a string without any white space in it, and a paragraph is anything between the start or end of the string and a newline character (optionally preceded by a carriage return), or between two such newlines.
3.) It is the same basic issue with sentences, regardless of whether people use proper punctuation and grammar or not. Unfortunately, even syntactically perfect sentences defy an all-inclusive, clear definition, as I discussed at length above. Your example belongs to the problem of the wide range of possible ANSI characters around punctuation marks; see my last post.
4.) It is very nice of you to recommend a parser, but all parsers must also base their algorithms on some rules and definitions to choose different parts of strings; they just keep these details hidden from us, don't they? Regular expressions are used in lots of (all?) programming languages, and I do not want to use a parser but a formula that can be easily adapted to very different needs on the client, server and database sides.
5.) Please provide a better solution to choose syntactically perfect sentences with regular expressions, and I will gladly give all the kudos to you!!
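To make point 2 concrete, here is a minimal Python sketch of the word and paragraph definitions given there. WORD and PARAGRAPH_BREAK are names chosen only for this illustration, and the sample text is made up.

```python
import re

# A word: a continuous run of characters with no white space in it.
WORD = re.compile(r"\S+")

# A paragraph boundary: a newline, optionally preceded by a carriage return.
PARAGRAPH_BREAK = re.compile(r"\r?\n")

text = "First paragraph here.\r\nSecond one.\nThird."

words = WORD.findall(text)
paragraphs = PARAGRAPH_BREAK.split(text)

print(words)
print(paragraphs)
```

Both definitions fit in one short character class or escape sequence, which is exactly what is missing for sentences.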