  1. #1
    SitePoint Zealot soapergem's Avatar
    Join Date
    Mar 2005
    Location
    Madison, WI
    Posts
    165

    Deceivingly difficult regex

    I have some long lines of data, and in the middle of each of these lines are occasional pseudo-fixed-length numbers. By that I mean that they always occupy a fixed number of characters (e.g. 5 chars) but may or may not be padded on the left with spaces (e.g. 2 spaces and 3 digits). For example, I might have some lines of data like this (the pound signs represent other alphanumeric data; I'm just highlighting the portion that I'm referring to):
    Code:
    #####12345#####
    #####  123#####
    ##### 1234#####
    I need a regex that will create a consistent back reference to just the number part of that and exclude the spaces. My first thought was, of course, to use something like this:
    Code:
    /\s*(\d+)/
    If that worked, the number would be put into backreference \1. However, it doesn't always work, since the pound signs represent other bits of alphanumeric data. I run into a problem when the data just beyond this number is also numeric--the expression above wouldn't be able to tell the difference. Example:

    Code:
    abcde  1234a4a4
    I would want to just match 123, but my expression above would spill into the next piece of data and give me 1234.

    So what I really want to do is something more like this (this is only a pseudo-regular expression):
    Code:
    /\s*(\d{(5 - number of spaces matched)})/
    Except I don't know if anything like that can be done. I even thought of compiling a grouping of a lot of different possibilities OR'ed together, but I'm not sure how to consistently retrieve the backreference there either. Something like this:

    Code:
    /(?:(\d{5})|\s(\d{4})|\s{2}(\d{3})|\s{3}(\d{2})|\s{4}(\d))/
    That would match perfectly every time, but it also creates a new problem: the number would be stored in either \1, \2, \3, \4, or \5 depending on how many digits it was. I would like it to always be in the same place so that I can actually do something with it.
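One way to keep the number in a single backreference is Perl's branch-reset group, `(?|...)`, available from Perl 5.10. The following is only a sketch: the helper name `padded_number` is hypothetical, and it assumes the 5-character field starts right after the first 5 characters of the line, which the question itself does not state.

```perl
#!/usr/bin/perl
use 5.010;      # branch-reset groups need Perl 5.10 or later
use strict;
use warnings;

# Hypothetical helper. ASSUMPTION: the field is 5 characters wide and
# begins at offset 5 of the line.
sub padded_number {
    my ($line) = @_;
    # Inside (?|...), every alternative numbers its captures from 1,
    # so whichever branch matches, the digits end up in $1.
    return $line =~ /^.{5}(?|(\d{5})|\s(\d{4})|\s{2}(\d{3})|\s{3}(\d{2})|\s{4}(\d))/
        ? $1
        : undef;
}

print padded_number('abcde  1234a4a4'), "\n";   # prints 123
```

Without the `^.{5}` anchor the alternation would match at the first position where any branch fits, which is not necessarily the intended field.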

    Let me know if any of this is unclear. Thanks in advance!
    // useless crap about my relationships, philosophy,
    // theology, music and programming projects:

    my $blog = 'http://gordon-myers.com/';

  2. #2
    SitePoint Wizard bronze trophy KevinR's Avatar
    Join Date
    Nov 2004
    Location
    Moon Base Alpha
    Posts
    1,053
    Unless the lines/records are fixed length, I see no way of doing this consistently. If the records are fixed length, you would want to use substr() anyway instead of a regexp.

    Can you post some real lines of the data?

  3. #3
    SitePoint Wizard bronze trophy KevinR's Avatar
    Join Date
    Nov 2004
    Location
    Moon Base Alpha
    Posts
    1,053
    If spaces only ever occur in the part of the string you are interested in, or if a run of 5 digits occurs only once (or only the first such run is a valid match), this is doable with a regexp.

    Code:
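    # Whitespace found: at most (5 - number of spaces) digits can follow it.
    if (/(\s+)/) {
       my $max = 5 - length($1);
       # Check the second match so a stale $1 (the spaces) is never printed.
       print $1 if $max > 0 && /\s+(\d{1,$max})/;
    }
    # No whitespace at all: take the first run of 5 digits.
    elsif (/(\d{5})/) {
       print $1;
    }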

  4. #4
    SitePoint Zealot soapergem's Avatar
    Join Date
    Mar 2005
    Location
    Madison, WI
    Posts
    165
    Well, here's some oversimplified data to work with:
    Code:
    abcde  1234a4a4
    1234567890bbbbb
      4j5 888899cc7
    And let's say that said data consists of three parts: first a five-digit, right-aligned alphanumeric value, followed by a five-digit, right-aligned numeric value, followed by a fixed (consistently) five-digit alphanumeric value.

    And for the sake of clarity, let's only worry about the middle value, the numeric one. I will actually be retrieving all the separate values in succession, but if we're able to write an expression to handle one of them, I'll be able to copy it for the others as well.

    Finally, I'm curious about something that you said: that your suggestion would only work if there is at least one space. Would the following modification work, or not?:

    Code:
    if (/(\s*)/) {
       my $len = (5 - length($1));
       /\s*(\d{$len})/;
       print $1;
    }
    Finally, like I said, I only gave you oversimplified examples of the real data I'm working with. The real data is actually about 50GB of text data from the U.S. Census Bureau, found here. (If you want to see it, pick a state, pick any zip file, and then open up any of the data files contained within except for the .MET file, which is only metadata.)

  5. #5
    SitePoint Wizard bronze trophy KevinR's Avatar
    Join Date
    Nov 2004
    Location
    Moon Base Alpha
    Posts
    1,053
    And let's say that said data consists of three parts: first a five-digit, right-aligned alphanumeric value, followed by a five-digit, right-aligned numeric value, followed by a fixed (consistently) five-digit alphanumeric value.

    If that is true, you would not want to use a regexp; you would want to use substr().

    Code:
    my @data = ('abcde  1234a4a4','1234567890bbbbb','  4j5 888899cc7');
    
    foreach my $foo (@data) {
       my $middle = substr($foo,5,5);
       print "$middle\n";
    }

    Finally, I'm curious about something that you said: that your suggestion would only work if there is at least one space. Would the following modification work, or not?:

    No, it would not work.
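A likely reason, offered here as an interpretation rather than the poster's own explanation: `\s*` can match zero characters, so `/(\s*)/` always succeeds at the very start of the string, usually with an empty `$1`. The sketch below wraps the proposed modification in a hypothetical helper, `star_variant`, with two small guards added so that a failed match returns `undef` instead of a stale `$1`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the \s* variant from the question.
sub star_variant {
    my ($line) = @_;
    # Always succeeds at position 0; $1 is '' unless the line
    # happens to begin with whitespace.
    return undef unless $line =~ /(\s*)/;
    my $len = 5 - length($1);
    return undef if $len < 1;               # guard added for this sketch
    return $line =~ /\s*(\d{$len})/ ? $1 : undef;
}

for my $line ('abcde  1234a4a4', '1234567890bbbbb', '  4j5 888899cc7') {
    my $got = star_variant($line);
    print "'$line' => ", (defined $got ? $got : '(no match)'), "\n";
}
# Prints, in order: (no match), 12345, 888
```

None of the three results is the intended middle field (123, 67890 and 8888 respectively): the first line has no run of 5 digits, the second matches the wrong column, and the third measures the padding of the first column instead of the second.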

  6. #6
    SitePoint Zealot soapergem's Avatar
    Join Date
    Mar 2005
    Location
    Madison, WI
    Posts
    165
    Okay, thank you. That helps a lot. I'll end up using substrings instead.

  7. #7
    SitePoint Zealot
    Join Date
    Sep 2003
    Location
    An Clár, Éire
    Posts
    137
    That's fine for only extracting one column but gets a bit unwieldy if you want all the cols. For fixed-width columns, "unpack" is ideal. A simple example with your simplified data:
    Code:
    #!/usr/bin/perl -w
    use strict;
    
    while(<DATA>) {
       my ( $first, $second, $third ) = unpack 'A5A5A5', $_;
       print "First:$first\nSecond:$second\nThird:$third\n";
    }
    
    __DATA__
    abcde  1234a4a4
    1234567890bbbbb
      4j5 888899cc7
    Once you have it split up, it's easy to deal with the left- or right-alignment by removing the spaces.
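The space-removal step might look like the sketch below. The helper name `split_record` is hypothetical, and the three 5-character columns are taken from the simplified sample data, not from the real Census files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split one record into three 5-character
# columns and strip the left padding from each.
sub split_record {
    my ($line) = @_;
    # 'A5' takes 5 characters and drops trailing spaces only,
    # so leading padding still has to be removed by hand.
    my @fields = unpack 'A5A5A5', $line;
    s/^\s+// for @fields;    # $_ aliases each element, so this edits in place
    return @fields;
}

my ($first, $second, $third) = split_record('abcde  1234a4a4');
print "Second:$second\n";    # prints Second:123
```

Once the column boundaries are fixed by `unpack`, neighbouring digits in the next column can no longer leak into the number.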

  8. #8
    SitePoint Wizard bronze trophy KevinR's Avatar
    Join Date
    Nov 2004
    Location
    Moon Base Alpha
    Posts
    1,053
    Yes, unpack is a better solution for extracting multiple columns. unpack() should also be more efficient than substr().
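The efficiency claim can be checked rather than assumed. This is a minimal sketch using the core Benchmark module; the sub names `via_substr` and `via_unpack` are hypothetical, and the outcome will depend on the Perl version and on the real record layout, so no result is shown here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $line = 'abcde  1234a4a4';

# Three separate substr() calls, one per column.
sub via_substr {
    return (substr($line, 0, 5), substr($line, 5, 5), substr($line, 10, 5));
}

# One unpack() call for all three columns.
sub via_unpack {
    return unpack 'A5A5A5', $line;
}

# A negative count means "run each sub for at least that many CPU seconds".
cmpthese(-1, {
    'substr' => \&via_substr,
    'unpack' => \&via_unpack,
});
```

With only three columns the difference may be small; the comparison is more meaningful on a template that matches the actual file format.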

