  1. #1
    SitePoint Enthusiast
    Join Date
    Jan 2006
    Posts
    28

    cut up a text into sentences with regex

    I need to cut up paragraphs into their constituent sentences. I also want to account for abbreviations and decimals that contain dots in the middle of a sentence.

    My simplistic definition: a sentence starts with a capital letter and finishes with one of (.?!), followed by whitespace and the capital letter of the next sentence, plus one character that is NOT (.?!).
    I came up with the following negative look-ahead solution, which works on decimals but fails on abbreviations.

    [A-Z]((?![.?!]\s+[A-Z][^.?!]).)+

    The pattern inside the negative look-ahead should fail to match, so that the look-ahead itself comes back with true, until it arrives at the sentence-ending position in my definition, but it doesn't.
    I would appreciate an explanation for its failure.


    /**********************************************************/
    Example:
    Just after daybreak in Nags Head on the Outer Banks, about 200 miles northeast of Jacksonville, winds 85.43 miles / hour whipped heavy rain across the resort town. Tall waves covered what had been the beach, and the surf pushed as high as the backs of some of the N.Y. dt. houses and hotels fronting the strand. Lights flickered in one hotel, but the power was still on.
    /****************************************************/
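    To make the behaviour reproducible, here is a minimal sketch that runs the pattern above with the global flag over two short test strings. The helper name `matchChunks` and both test strings are made up for illustration; they are not part of the original text. The sketch suggests two things: the look-ahead stops before the closing punctuation, so that character is left out of every match except the last, and an abbreviation followed by a capitalised word ("N.Y. Times") satisfies the sentence-end definition just like a real sentence boundary does.

```javascript
// The pattern from the post, with the global flag added so match() returns every hit.
const sentencePattern = /[A-Z]((?![.?!]\s+[A-Z][^.?!]).)+/g;

// Hypothetical helper: returns every chunk the pattern matches in `text`.
function matchChunks(text) {
  return text.match(sentencePattern) || [];
}

// Abbreviations followed by a lowercase word: the look-ahead never matches at them.
const lowerAfterAbbrev = "Waves hit the N.Y. dt. houses. Lights flickered.";
// Abbreviation followed by a capitalised word: looks like a sentence end.
const upperAfterAbbrev = "He reads the N.Y. Times daily. Lights flickered.";

console.log(matchChunks(lowerAfterAbbrev));
// [ 'Waves hit the N.Y. dt. houses', 'Lights flickered.' ]

console.log(matchChunks(upperAfterAbbrev));
// [ 'He reads the N.Y', 'Times daily', 'Lights flickered.' ]
```

    Under this definition a pure regex cannot tell "N.Y. Times" from a real boundary, so a list of known abbreviations would probably be needed on top of it.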

  2. #2
    SitePoint Zealot Gar onn's Avatar
    Join Date
    Feb 2011
    Location
    Belgium
    Posts
    130
    split on '.'
    str.split(/[.|!|?]\s/)

    This splits on every occurrence of '.', '!' or '?' followed by a space.
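    As a rough illustration of what that split does in practice, the sketch below applies the posted pattern to a made-up test string (the string and the function name `splitOnTerminators` are assumptions, not from the thread). Decimals survive because the dot in "85.43" is not followed by a space, but every abbreviation dot that is followed by a space becomes a split point, and the terminating character is consumed.

```javascript
// Hypothetical wrapper around the split suggested in the post.
function splitOnTerminators(text) {
  return text.split(/[.|!|?]\s/);
}

const sample =
  "Winds hit 85.43 miles / hour. The N.Y. dt. houses flooded! Was the power on?";

console.log(splitOnTerminators(sample));
// [
//   'Winds hit 85.43 miles / hour',
//   'The N.Y',
//   'dt',
//   'houses flooded',
//   'Was the power on?'
// ]
```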

  3. #3
    SitePoint Enthusiast
    Join Date
    Jan 2006
    Posts
    28
    Sorry, but I can't find any better words for this than STUPID! Are there any sitepoint moderators at work here? And where is a link to report abuse?

    I came here for professional advice, but there seems to be a need for a basic regexp lesson, instead!

    First of all, simple alternation - (\.|\?|\!) - goes in round brackets, where the alternatives need to be separated by the pipe sign (|). In square-bracketed character classes, on the other hand, there are NO pipe signs, since alternation is IMPLICIT!

    In alternation, notice the need to escape metacharacters - characters with a special use in regexp.
    The same metacharacters do NOT need to be escaped in character classes - the stuff between square brackets - because almost any character there is taken literally!! (The closing square bracket, the caret for negation, the hyphen for ranges and the backslash are still special there, but let's leave them for now.)
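    A small sketch of that difference, using made-up variable names and test strings: a pipe inside square brackets is just one more literal member of the class, so the class with pipes also matches "|", while the plain class and the escaped alternation do not.

```javascript
// Character class with pipes: the | is a literal member of the class.
const withPipes = /[.|!|?]/;
// Character class without pipes: matches only . ! or ?
const withoutPipes = /[.?!]/;
// Equivalent alternation: . and ? must be escaped outside a class.
const alternation = /(\.|\?|!)/;

console.log(withPipes.test("a|b"));     // true
console.log(withoutPipes.test("a|b"));  // false
console.log(alternation.test("a|b"));   // false

// All three agree on a real terminator.
console.log(withPipes.test("Stop!"));    // true
console.log(withoutPipes.test("Stop!")); // true
console.log(alternation.test("Stop!"));  // true

// Consequence for a split: a pipe followed by a space becomes a split point too.
const pipeSplit = "yes | no. Maybe".split(/[.|!|?]\s/);
console.log(pipeSplit); // [ 'yes ', 'no', 'Maybe' ]
```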

    I left sitepoint forums years ago for these spammers and people who do not even bother to read a detailed description of a professional problem, but jot in their incompetent stuff like this!

    By the way, do YOU know anything about look-arounds and things like atomic grouping?

    Still waiting for expert help and an in-depth explanation for the failure of the regexp!

    Thanks!

