  1. #1
    SitePoint Enthusiast ganesch's Avatar
    Join Date
    Feb 2004
    Location
    Zürich, Switzerland
    Posts
    66

    Convert text file into an array

    I've been trying to learn regular expressions for quite some time now, but I'm still having trouble converting a very large text file with the following elements into an array and then loading the array into a MySQL database:
    20 17 Mar 2004 We Suddha Ekadasi K Siva Sravana *
    21 Fasting for Papamocani Ekadasi
    41 1 Apr 2004 Th Suddha Ekadasi G Dhriti Aslesa *
    42 Fasting for Kamada Ekadasi
    72 15 Apr 2004 Th Suddha Ekadasi K Sukla Satabhisa *
    73 Fasting for Varuthini Ekadasi
    I've used regular expressions to get this far. What I need now is a comma-separated list for each pair of lines (e.g. 41 and 42). It should look something like this:
    '1', 'Apr', '2004', 'Th', 'Kamada Ekadasi'
    Or if someone has another idea for solving such a task, I would be glad to hear about it. Might it also be better to convert 1 Apr 2004 into a digit-based form?

    Thanks,
    Nick
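On the question of a digit-based date: a minimal sketch, assuming PHP 7.1 or later and that the month abbreviations in the file are the English three-letter forms shown above. The helper name `toIsoDate` is hypothetical; `DateTimeImmutable::createFromFormat` is the standard PHP call. The resulting `YYYY-MM-DD` string is the form a MySQL DATE column accepts.

```php
<?php
// Hypothetical helper: turn the parts of "1 Apr 2004" into "2004-04-01".
function toIsoDate(string $day, string $month, string $year): ?string
{
    // 'j' = day of month, 'M' = three-letter English month, 'Y' = four-digit year.
    // The leading '!' resets all time fields, so only the date is kept.
    $date = DateTimeImmutable::createFromFormat('!j M Y', "$day $month $year");

    // createFromFormat() returns false when the input cannot be parsed.
    return $date === false ? null : $date->format('Y-m-d');
}

echo toIsoDate('1', 'Apr', '2004'), "\n";   // 2004-04-01
echo toIsoDate('17', 'Mar', '2004'), "\n";  // 2004-03-17
```

Note that this only parses the date; it does not reject impossible dates such as 31 Feb, which PHP rolls over into the following month.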

  2. #2
    SitePoint Addict silent's Avatar
    Join Date
    Jun 2004
    Location
    Roaming North America
    Posts
    220
    Some questions:

    1) What does this data represent?
    2) Why, if pairs of lines are related, are they on different lines?
    3) Why do some lines have dates?
    4) Could we see the DB schema?
    5) What is the overall problem, in English, that you are trying to solve?

    cheers,

    jay

  3. #3
    SitePoint Enthusiast ganesch's Avatar
    1) What does this data represent?
    2) Why, if pairs of lines are related, are they on different lines?
    3) Why do some lines have dates?
    Ekadasi is a special day on the Hindu calendar. I have a small Windows application that can calculate different Hindu "holidays" according to the geographical position. Every relevant line in the text file has one date at the beginning followed by the word Ekadasi. The supplementary line without date simply mentions the exact name of that particular Ekadasi day.

    So the best approach would be to join these two lines and then comma-separate everything after stripping the unnecessary words and numbers.
    4) Could we see the DB schema?
    This would be a very simple schema with only 3 columns: ID, DATE and NAME. The NAME holds values such as Papamocani Ekadasi or Varuthini Ekadasi.
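A rough sketch of how rows for such a table might be inserted with PDO prepared statements. The table name `ekadasi`, the column types in the comment, and the function name `insertEkadasiRows` are all assumptions for illustration; only the three column names come from the description above.

```php
<?php
// Hypothetical table layout for the ID, DATE and NAME columns:
//   CREATE TABLE ekadasi (
//       id     INT AUTO_INCREMENT PRIMARY KEY,
//       `date` DATE NOT NULL,
//       `name` VARCHAR(64) NOT NULL
//   );

// Insert rows of the form ['date' => 'YYYY-MM-DD', 'name' => '...'].
// Returns the number of rows handed to the database.
function insertEkadasiRows(PDO $pdo, array $rows): int
{
    // One prepared statement, executed once per row; values are bound,
    // so no manual quoting or escaping is needed.
    $stmt = $pdo->prepare('INSERT INTO ekadasi (`date`, `name`) VALUES (:date, :name)');
    $count = 0;
    foreach ($rows as $row) {
        $stmt->execute([':date' => $row['date'], ':name' => $row['name']]);
        $count++;
    }
    return $count;
}

// Usage (the DSN, user and password are placeholders, not real values):
// $pdo = new PDO('mysql:host=localhost;dbname=calendar', 'user', 'secret');
// $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// insertEkadasiRows($pdo, [['date' => '2004-04-01', 'name' => 'Kamada Ekadasi']]);
```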

    5) What is the overall problem, in English, that you are trying to solve?
    The overall problem is to send an automatic e-mail message one day before Ekadasi in order to remind Hindus (or other interested people) of the special occasion (usually twice a month). Ekadasi is astrologically and spiritually a good day for meditation and fasting.
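The "one day before" rule can be sketched as a small date check. This assumes the dates are already stored in `YYYY-MM-DD` form and that some scheduler (cron, for example) runs the script once a day; the function name `dueReminders` is hypothetical, and the actual sending of mail is left out.

```php
<?php
// Given [isoDate => name] pairs, return the entries whose reminder
// (one day before the Ekadasi date) falls on $today.
function dueReminders(array $ekadasiDates, DateTimeImmutable $today): array
{
    $due = [];
    foreach ($ekadasiDates as $isoDate => $name) {
        // The reminder goes out the day before the Ekadasi itself.
        $reminderDay = (new DateTimeImmutable($isoDate))->modify('-1 day');
        if ($reminderDay->format('Y-m-d') === $today->format('Y-m-d')) {
            $due[$isoDate] = $name;
        }
    }
    return $due;
}

$dates = [
    '2004-03-17' => 'Papamocani Ekadasi',
    '2004-04-01' => 'Kamada Ekadasi',
    '2004-04-15' => 'Varuthini Ekadasi',
];

// On 31 Mar 2004 only the Kamada Ekadasi reminder is due.
print_r(dueReminders($dates, new DateTimeImmutable('2004-03-31')));
```

With the data in MySQL, the same check could probably be pushed into the query instead, for instance by comparing the date column against `CURDATE() + INTERVAL 1 DAY`.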

    I hope this gives you a clearer picture of what I'm trying to do

  4. #4
    eschew sesquipedalians silver trophy sweatje's Avatar
    Join Date
    Jun 2003
    Location
    Iowa, USA
    Posts
    3,749
    PHP Code:
    $txt = '20 17 Mar 2004 We Suddha Ekadasi K Siva Sravana *
    21 Fasting for Papamocani Ekadasi
    41 1 Apr 2004 Th Suddha Ekadasi G Dhriti Aslesa *
    42 Fasting for Kamada Ekadasi
    72 15 Apr 2004 Th Suddha Ekadasi K Sukla Satabhisa *
    73 Fasting for Varuthini Ekadasi';

    $re = <<<EOS
    ~
    ^        # start of a line
    \d+        # one or more digits
    \s+        # some white space
    (        # start capture 1
    \d{1,2}        # one or two digits
    )        # end capture 1
    \s+        # some white space
    (        # start capture 2
    \w+        # one or more word characters
    )        # end capture 2
    \s+        # some white space
    (        # start capture 3
    \d{4}    # four digits
    )        # end capture 3
    \s+        # some white space
    (        # start capture 4
    .*?        # zero or more characters, ungreedy
    )        # end capture 4
    \n        # a new line
    \d+        # one or more digits
    \s+        # some white space
    (        # start capture 5
    [^\n]*    # zero or more non-linefeed characters
    )        # end capture 5
            # end regex, extended whitespace parsing
    ~xms
    EOS;

    preg_match_all($re, $txt, $match);
    var_dump($match); 
    results in output of:
    Code:
      0 => 
        array
          0 => '20 17 Mar 2004 We Suddha Ekadasi K Siva Sravana *
    21 Fasting for Papamocani Ekadasi'
          1 => '41 1 Apr 2004 Th Suddha Ekadasi G Dhriti Aslesa *
    42 Fasting for Kamada Ekadasi'
          2 => '72 15 Apr 2004 Th Suddha Ekadasi K Sukla Satabhisa *
    73 Fasting for Varuthini Ekadasi'
      1 => 
        array
          0 => '17'
          1 => '1'
          2 => '15'
      2 => 
        array
          0 => 'Mar'
          1 => 'Apr'
          2 => 'Apr'
      3 => 
        array
          0 => '2004'
          1 => '2004'
          2 => '2004'
      4 => 
        array
          0 => 'We Suddha Ekadasi K Siva Sravana *
    '
          1 => 'Th Suddha Ekadasi G Dhriti Aslesa *
    '
          2 => 'Th Suddha Ekadasi K Sukla Satabhisa *
    '
      5 => 
        array
          0 => 'Fasting for Papamocani Ekadasi'
          1 => 'Fasting for Kamada Ekadasi'
          2 => 'Fasting for Varuthini Ekadasi'
    Jason Sweat ZCE - jsweat_php@yahoo.com
    Book: PHP Patterns
    Good Stuff: SimpleTest PHPUnit FireFox ADOdb YUI
    Detestable (adjective): software that isn't testable.
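The output above is grouped by capture (all days, then all months, and so on). If one array per pair of lines is more convenient, `preg_match_all` also accepts the `PREG_SET_ORDER` flag. The sketch below uses a compact single-quoted pattern that is an assumption written for this example, not the pattern from the post, and it assumes Unix line endings in the input.

```php
<?php
$txt = implode("\n", [
    '20 17 Mar 2004 We Suddha Ekadasi K Siva Sravana *',
    '21 Fasting for Papamocani Ekadasi',
    '41 1 Apr 2004 Th Suddha Ekadasi G Dhriti Aslesa *',
    '42 Fasting for Kamada Ekadasi',
    '72 15 Apr 2004 Th Suddha Ekadasi K Sukla Satabhisa *',
    '73 Fasting for Varuthini Ekadasi',
]);

// Captures: 1 = day, 2 = month, 3 = year, 4 = weekday, 5 = name.
// Single quotes keep \n as backslash-n, so the regex engine sees it.
$re = '~^\d+\s+(\d{1,2})\s+(\w+)\s+(\d{4})\s+(\w{2})\s[^\n]*\n\d+\s+Fasting\s+for\s+([^\n]*)~m';

// PREG_SET_ORDER yields one array per match instead of one per capture.
preg_match_all($re, $txt, $rows, PREG_SET_ORDER);

foreach ($rows as $row) {
    // $row[0] is the full match; drop it and keep the five captures.
    echo implode(',', array_slice($row, 1)), "\n";
}
```

Each `$rows[$i]` then holds the day, month, year, weekday and name of one entry, which maps directly onto one database row.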

  5. #5
    SitePoint Addict silent's Avatar
    nice job Jason! I am adding one little thing to account for the day characters ("We","Th") that aren't part of the name...

    revised:
    PHP Code:
    $txt = '20 17 Mar 2004 We Suddha Ekadasi K Siva Sravana *
    21 Fasting for Papamocani Ekadasi
    41 1 Apr 2004 Th Suddha Ekadasi G Dhriti Aslesa *
    42 Fasting for Kamada Ekadasi
    72 15 Apr 2004 Th Suddha Ekadasi K Sukla Satabhisa *
    73 Fasting for Varuthini Ekadasi';

    $re = <<<EOS
    ~
    ^        # start of a line
    \d+        # one or more digits
    \s+        # some white space
    (        # start capture 1
    \d{1,2}        # one or two digits
    )        # end capture 1
    \s+        # some white space
    (        # start capture 2
    \w+        # one or more word characters
    )        # end capture 2
    \s+        # some white space
    (        # start capture 3
    \d{4}    # four digits
    )        # end capture 3
    \s+        # some white space
    (        # start capture 4
    Mn|Tu|We|Th|Fr|Sa|Su    # day characters
    )        # end capture 4
    \s+        # some white space
    (        # start capture 5
    .*?        # zero or more characters, ungreedy
    )        # end capture 5
    \n        # a new line
    \d+        # one or more digits
    \s+        # some white space
    (        # start capture 6
    [^\n]*    # zero or more non-linefeed characters
    )        # end capture 6
            # end regex, extended whitespace parsing
    ~xms
    EOS;

    preg_match_all($re, $txt, $match);
    var_dump($match); 
    which produces:
    Code:
    array(7) {
      [0]=>
      array(3) {
        [0]=>
        string(83) "20 17 Mar 2004 We Suddha Ekadasi K Siva Sravana *
    21 Fasting for Papamocani Ekadasi"
        [1]=>
        string(79) "41 1 Apr 2004 Th Suddha Ekadasi G Dhriti Aslesa *
    42 Fasting for Kamada Ekadasi"
        [2]=>
        string(85) "72 15 Apr 2004 Th Suddha Ekadasi K Sukla Satabhisa *
    73 Fasting for Varuthini Ekadasi"
      }
      [1]=>
      array(3) {
        [0]=>
        string(2) "17"
        [1]=>
        string(1) "1"
        [2]=>
        string(2) "15"
      }
      [2]=>
      array(3) {
        [0]=>
        string(3) "Mar"
        [1]=>
        string(3) "Apr"
        [2]=>
        string(3) "Apr"
      }
      [3]=>
      array(3) {
        [0]=>
        string(4) "2004"
        [1]=>
        string(4) "2004"
        [2]=>
        string(4) "2004"
      }
      [4]=>
      array(3) {
        [0]=>
        string(2) "We"
        [1]=>
        string(2) "Th"
        [2]=>
        string(2) "Th"
      }
      [5]=>
      array(3) {
        [0]=>
        string(32) "Suddha Ekadasi K Siva Sravana *
    "
        [1]=>
        string(33) "Suddha Ekadasi G Dhriti Aslesa *
    "
        [2]=>
        string(35) "Suddha Ekadasi K Sukla Satabhisa *
    "
      }
      [6]=>
      array(3) {
        [0]=>
        string(30) "Fasting for Papamocani Ekadasi"
        [1]=>
        string(26) "Fasting for Kamada Ekadasi"
        [2]=>
        string(29) "Fasting for Varuthini Ekadasi"
      }
    }
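A likely reason the strings in capture 5 end with a line break: inside a heredoc PHP converts `\n` into a real line feed before the regex engine sees the pattern, and under the x modifier a literal line feed outside a character class is ignored as whitespace. The `\n` line of the pattern therefore matches nothing, and the ungreedy `.*?` runs on until the digits of the next line. The small demonstration below uses made-up patterns and a nowdoc (`<<<'EOS'`, available since PHP 5.3) for comparison.

```php
<?php
// Heredoc: PHP turns \n into a real line feed, which the x modifier skips.
$heredocPattern = <<<EOS
~a\nb~x
EOS;

// Nowdoc: no escape processing, so the regex engine receives backslash-n.
$nowdocPattern = <<<'EOS'
~a\nb~x
EOS;

var_dump(preg_match($heredocPattern, "ab"));    // int(1): the pattern is effectively "ab"
var_dump(preg_match($heredocPattern, "a\nb"));  // int(0): no line break is expected
var_dump(preg_match($nowdocPattern, "a\nb"));   // int(1): \n matches the line break
```

Inside a character class such as `[^\n]` the converted line feed is still significant, which is why the last capture stops at the end of its line as intended.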
    Now, how to get rid of the \n and the * from the strings in array[5] and print out the comma delimited lines you wanted in your first post:
    PHP Code:
    // indexes
    define('DATE_DAY', 1);
    define('DATE_MONTH', 2);
    define('DATE_YEAR', 3);
    define('DATE_WEEKDAY', 4);
    define('HINDI_NAME', 6);

    $iNumLines = count($match[0]);

    echo "<pre>";
    for ($i = 0; $i < $iNumLines; $i++) {
        echo $match[DATE_DAY][$i] . ',' .
             $match[DATE_MONTH][$i] . ',' .
             $match[DATE_YEAR][$i] . ',' .
             $match[DATE_WEEKDAY][$i] . ',' .
             str_replace(array('Fasting for ', '*'), '', $match[HINDI_NAME][$i]) . "\n";
    }
    echo "</pre>";
    which produces:
    Code:
    17,Mar,2004,We,Papamocani Ekadasi
    1,Apr,2004,Th,Kamada Ekadasi
    15,Apr,2004,Th,Varuthini Ekadasi
    Jason, I'd love to get an explanation of the "ungreedy" part of your expression... and also to see if you could remove the asterisk from the output of the sixth match above by changing the expression. I tried, but I am not nearly as competent with regexes as you are.

    BTW, Jason, I really like that style of presenting regexes. Nicely done.

    jay
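For the quoted form asked for in the first post ('1', 'Apr', '2004', 'Th', 'Kamada Ekadasi'), a small sketch using `array_map` and `implode`. The helper name `quotedList` is hypothetical, and the quoting is for display only: it does not escape quotes inside the values, so it should not be used to build SQL.

```php
<?php
// Wrap each field in single quotes and join the fields with ", ".
function quotedList(array $fields): string
{
    $quoted = array_map(function ($field) {
        // trim() removes stray whitespace such as a trailing line break.
        return "'" . trim($field) . "'";
    }, $fields);

    return implode(', ', $quoted);
}

echo quotedList(['1', 'Apr', '2004', 'Th', "Kamada Ekadasi\n"]), "\n";
// prints: '1', 'Apr', '2004', 'Th', 'Kamada Ekadasi'
```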

  6. #6
    eschew sesquipedalians silver trophy sweatje's Avatar
    Hi Jay,

    A little tweak to the re to get rid of the * and also optionally get rid of "Fasting for " if present in capture 6:

    PHP Code:
    $re = <<<EOS
    ~
    ^        # start of a line
    \d+        # one or more digits
    \s+        # some white space
    (        # start capture 1
    \d{1,2}        # one or two digits
    )        # end capture 1
    \s+        # some white space
    (        # start capture 2
    \w+        # one or more word characters
    )        # end capture 2
    \s+        # some white space
    (        # start capture 3
    \d{4}    # four digits
    )        # end capture 3
    \s+        # some white space
    (        # start capture 4
    Mn|Tu|We|Th|Fr|Sa|Su    # day characters
    )        # end capture 4
    \s+        # some white space
    (        # start capture 5
    .*?        # zero or more characters, ungreedy
    )        # end capture 5
    \s+        # some white space
    \*        # a literal *
    \s*        # optionally some white space
    \n        # a new line
    \d+        # one or more digits
    \s+        # some white space
    (?:        # non-capturing group
    Fasting    # literal
    \s+        # some white space
    for        # literal
    \s+        # some white space
    )?        # make the fasting group optional
    (        # start capture 6
    [^\n]*    # zero or more non-linefeed characters
    )        # end capture 6
            # end regex, extended whitespace parsing
    ~xms
    EOS; 
    The ungreedy qualifier (the ? after the .* in capture 5) is needed because the m and s modifiers apply to the entire re: m is multiline, and s makes . match any character, including \n. If that .* were not ungreedy, it would capture everything in the file all the way to the last match.
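The difference is easy to see on a tiny input. The sample text and patterns below are made up for the demonstration; only the s modifier and the `.*` versus `.*?` contrast are taken from the explanation above.

```php
<?php
$txt = "A one *\nname 1\nA two *\nname 2";

// Greedy: with the s modifier, .* runs to the end of the text and then
// backtracks only as far as the LAST " *", swallowing the first entry.
preg_match_all('~A (.*) \*~s', $txt, $greedy);

// Ungreedy: .*? stops at the FIRST " *" it can, so each entry matches separately.
preg_match_all('~A (.*?) \*~s', $txt, $lazy);

var_dump($greedy[1]);  // one element: "one *\nname 1\nA two"
var_dump($lazy[1]);    // two elements: "one" and "two"
```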

  7. #7
    SitePoint Addict silent's Avatar
    nice. very nice. adding to your rep now...

  8. #8
    SitePoint Enthusiast ganesch's Avatar
    Thanks for your wonderful regex tennis. I really enjoyed it and that's exactly what I was looking for.

    Namaste
    Nick

