  1. #1
    SitePoint Addict
    Join Date
    Jun 2005
    Posts
    257

    simplify this using regex?

    Hello,

    I have this code, which grabs the date that follows several variations of the text "last updated:" so I can store it in MySQL date format in the DB.

    The problem is that it's not very efficient when it comes to the dates, since different websites format them differently.

    Here is the current code:

    PHP Code:
    $upos = strpos($whois['data'], "updated date:");
    if (!($upos === false)) {
        $upddate = substr($whois['data'], $upos+14, 11);
    } else {
        $upos2 = strpos($whois['data'], "record last updated on");
        if (!($upos2 === false)) {
            $upddate = substr($whois['data'], $upos2+24, 11);
        } else {
            $upos3 = strpos($whois['data'], "last updated on:");
            if (!($upos3 === false)) {
                $upddate = substr($whois['data'], $upos3+16, 12);
            } else {
                $upos4 = strpos($whois['data'], "last updated on");
                if (!($upos4 === false)) {
                    $upddate = substr($whois['data'], $upos4+16, 11);
                } else {
                    $upddate = "";
                }
            }
        }
    }
    90% of this is repetition. Maybe I could run it through four regexes instead, because this approach has the following problems:

    The dates are all formatted differently, so taking 11 characters doesn't always catch the whole date and cuts some info off. I run them through strtotime, but that doesn't always work because the date is cut off.

    Sometimes the dates are 12-jan-2006 or sat, jan 12, 2006 or jan 12, 2006, and so on.

    Is the way I'm doing this the best way, or would multiple regexes be better? If so, can you show me an example of one that would work?
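
Before reaching for regex, the repetition in the nested if/else can be removed by looping over the labels. This is only a minimal sketch, assuming a modern PHP (7+); the function name find_update_text, the $labels list and the 30-character window are hypothetical choices, not part of the original code.

```php
<?php
// Hypothetical helper: return the raw text that follows the first known label.
function find_update_text(string $data, array $labels, int $length = 30): string
{
    foreach ($labels as $label) {
        $pos = stripos($data, $label); // case-insensitive search
        if ($pos !== false) {
            // Take a generous slice after the label and trim the whitespace.
            return trim(substr($data, $pos + strlen($label), $length));
        }
    }
    return ''; // no label found
}

// "last updated on:" is listed before "last updated on" so the colon is skipped.
$labels = ['updated date:', 'record last updated on', 'last updated on:', 'last updated on'];

echo find_update_text('Updated date: 12-jan-2006', $labels), "\n";
```

The slice can still contain whatever text follows the date, which is the part a regex handles better.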

  2. #2
    Keep it simple, stupid! bokehman's Avatar
    Join Date
    Jul 2005
    Posts
    1,935
    What exactly are you trying to do?

  3. #3
    SitePoint Addict
    Join Date
    Jun 2005
    Posts
    257
    Pull the date out of these:

    Last Update: 12-jan-2006
    Record Last updated on sat, jan 12, 2006
    Last updated on jan 12, 2006
    Updated date: 12-jan-2006

    These are just examples, but basically it's these 4 phrases followed by a date of some kind; it's that date that I want.

  4. #4
    SitePoint Addict
    Join Date
    Jun 2005
    Posts
    257
    Does anybody have any ideas or suggestions on this?

  5. #5
    Keep it simple, stupid! bokehman's Avatar
    Join Date
    Jul 2005
    Posts
    1,935
    The following works for your examples:
    PHP Code:
    $timestamp = strtotime(
        preg_replace(
            '/^.*(\d?\d-[a-z]{3}-\d{4}|[a-z]{3} \d?\d, \d{4}).*$/is',
            '$1',
            $whois['data']
        )
    );

  6. #6
    SitePoint Addict
    Join Date
    Jun 2005
    Posts
    257
    It is very, very close, but these are some of the results I have come up with:

    Code:
    Updated date: 12-jan-2006 :: 2-jan-2006
    This one is missing the 1 in front of the 12.

    Code:
    Last Update: 12-jan-2006 :: 2-jan-2006
    This one is also missing the 1.

    I am not sure if this searches for just any date, but if it does, I need it to include "last updated:" and its different forms, because there are many dates on the page and I am looking only for the one that deals with updates.
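
The missing digit is most likely down to the greedy .* at the start of that pattern: it swallows as much as it can, and the optional \d? is then happy to match nothing. A small sketch comparing greedy and lazy versions, reduced to the dd-mon-yyyy alternative only (the $line sample is made up from the example above):

```php
<?php
$line = 'Updated date: 12-jan-2006';

// Greedy: ".*" takes as much as it can, so the optional "\d?" matches nothing.
$greedy = preg_replace('/^.*(\d?\d-[a-z]{3}-\d{4}).*$/is', '$1', $line);

// Lazy: ".*?" stops as early as it can, so "\d?\d" gets both digits.
$lazy = preg_replace('/^.*?(\d?\d-[a-z]{3}-\d{4}).*$/is', '$1', $line);

echo $greedy, "\n"; // 2-jan-2006
echo $lazy, "\n";   // 12-jan-2006
```

The lazy version still returns the first date it finds anywhere in the text, so it does not solve the "only the update date" requirement on its own.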

  7. #7
    SitePoint Guru aamonkey's Avatar
    Join Date
    Sep 2004
    Location
    kansas
    Posts
    953
    this will work for the date formats you posted:
    PHP Code:
    $pattern = '#update.*?((?:\d{1,2}-.*?-\d{4})|(?:[a-z]{3},\s+.*?\d{4})|(?:[a-z]{3}\s+\d{1,2},\s+\d{4}))#i';

    preg_match_all($pattern, $whois['data'], $matches);
    // matches will be in the $matches[1] array

  8. #8
    SitePoint Addict
    Join Date
    Jun 2005
    Posts
    257
    Ok I got it this far but i am going to need help finishing it:

    PHP Code:
    $test[] = "asdfasdfasdfadsasdf last update: 12-jan-2006 asdfasdfasdfasdfasdfasdf";
    $test[] = "fsdfasdfasdfasdfasdfa record last updated on sat, jan 12, 2006 vavsdfasdfasdfasdfasdfasdfasdf";
    $test[] = "dafasdfasdfadf last updated on jan 12, 2006 asdfasdfasdfasdfasdfasdf";
    $test[] = "<table><tr><td>Other Junk 15-jan-2006</td><td>Updated date: 12-jan-2006 </td></tr></table>";
    $test[] = "<TR><TD width='30%' valign='top'>Creation Date:</TD><TD width='70%'>Sep 18 2004 </TD></TR>";

    foreach ($test as $searchstr)
    {
        $search = array('~updated date: (.*?)~si',
                        '~last updated on (.*?)~si',
                        '~last update: (.*?)~si',
                        '~Creation Date:</TD><TD width=\'70%\'>(.*?)~si');

        $replace = '$1';

        $text = preg_replace($search, $replace, $searchstr);

        echo "Search: ".htmlentities($searchstr)."<br />Result: ";
        print_r(htmlentities($text));
        echo "<br /><br />";
    }

    The problems with this are:

    It is only removing the words in the array; it's not removing everything except the date.

    I need to keep the 12 characters matched by the (.*?), starting from that location.

    So the end result should be the date, in whatever format it is in, every time.
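
The reason only the label disappears is that a lazy (.*?) with nothing after it is satisfied by zero characters. A minimal sketch of the difference, assuming a fixed 12-character window is really what is wanted; the sample string and the variable names are invented for illustration:

```php
<?php
$text = 'junk Updated date: 12-jan-2006 more junk';

// A lazy group at the very end of a pattern matches nothing,
// so only the label (and its trailing space) is removed.
$removed = preg_replace('~updated date: (.*?)~si', '$1', $text);

// Capture a fixed window instead: up to 12 characters after the label.
$window = '';
if (preg_match('~updated date:\s*(.{1,12})~si', $text, $m)) {
    $window = trim($m[1]);
}

echo $removed, "\n"; // junk 12-jan-2006 more junk
echo $window, "\n";  // 12-jan-2006
```

Using preg_match with a capture group returns just the captured part, which avoids having to strip the rest of the string with preg_replace.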

  9. #9
    SitePoint Addict
    Join Date
    Jun 2005
    Posts
    257
    Thank you, aamonkey. Let me test that out real quick with the example I posted.

  10. #10
    SitePoint Addict
    Join Date
    Jun 2005
    Posts
    257
    Awesome, I tried this code:

    PHP Code:
    $test[] = "asdfasdfasdfadsasdf last update: 12-jan-2006 asdfasdfasdfasdfasdfasdf";
    $test[] = "fsdfasdfasdfasdfasdfa record last updated on sat, jan 12, 2006 vavsdfasdfasdfasdfasdfasdfasdf";
    $test[] = "dafasdfasdfadf last updated on jan 12, 2006 asdfasdfasdfasdfasdfasdf";
    $test[] = "<table><tr><td>Other Junk 15-jan-2006</td><td>Updated date: 12-jan-2006 </td></tr></table>";
    $test[] = "<TR><TD width='30%' valign='top'>Updated Date:</TD><TD width='70%'>Sep 18 2004 </TD></TR>";


    $pattern = '#update.*?((?:\d{1,2}-.*?-\d{4})|(?:[a-z]{3},\s+.*?\d{4})|(?:[a-z]{3}\s+\d{1,2},\s+\d{4}))#i';

    foreach ($test as $searchstr)
    {
        preg_match_all($pattern, $searchstr, $matches);
        // matches will be in the $matches[1] array

        echo "Search: ".htmlentities($searchstr)."<br />Result: ";
        print_r($matches[1]);
        echo "<br /><br />";
    }


    and the output:

    Code:
    Search: asdfasdfasdfadsasdf last update: 12-jan-2006 asdfasdfasdfasdfasdfasdf
    Result: Array ( [0] => 12-jan-2006 ) 
    
    Search: fsdfasdfasdfasdfasdfa record last updated on sat, jan 12, 2006 vavsdfasdfasdfasdfasdfasdfasdf
    Result: Array ( [0] => sat, jan 12, 2006 ) 
    
    Search: dafasdfasdfadf last updated on jan 12, 2006 asdfasdfasdfasdfasdfasdf
    Result: Array ( [0] => jan 12, 2006 ) 
    
    Search: <table><tr><td>Other Junk 15-jan-2006</td><td>Updated date: 12-jan-2006 </td></tr></table>
    Result: Array ( [0] => 12-jan-2006 ) 
    
    Search: <TR><TD width='30%' valign='top'>Updated Date:</TD><TD width='70%'>Sep 18 2004 </TD></TR>
    Result: Array ( )
    Would it be possible to get it to work with the last one as well?

  11. #11
    SitePoint Guru aamonkey's Avatar
    Join Date
    Sep 2004
    Location
    kansas
    Posts
    953
    sure, just change the pattern to this:
    PHP Code:
    $pattern = '#update.*?((?:\d{1,2}-.*?-\d{4})|(?:[a-z]{3},?\s+.*?\d{4})|(?:[a-z]{3}\s+\d{1,2},\s+\d{4}))#i';
    just a warning: depending on how much stuff is between the word "update" and the actual date, this expression may or may not work....it kind of assumes that there will be very little between them.

  12. #12
    SitePoint Addict
    Join Date
    Jun 2005
    Posts
    257
    I have never seen more than 50 - 60 characters between them. Should that be OK?

    That pattern worked perfectly, by the way. Thank you so much!
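
One way to make that distance explicit is to bound the gap after "update" instead of leaving it open-ended. This is only a sketch: the function name find_update_date and the default of 60 characters (taken from the estimate above) are assumptions, and the date alternatives are copied unchanged from the pattern in this thread.

```php
<?php
// Hypothetical helper: allow at most $maxGap characters between "update" and the date.
function find_update_date(string $text, int $maxGap = 60): ?string
{
    // Same three date alternatives as the pattern posted above.
    $dates = '(?:\d{1,2}-.*?-\d{4})|(?:[a-z]{3},?\s+.*?\d{4})|(?:[a-z]{3}\s+\d{1,2},\s+\d{4})';
    $pattern = '#update.{0,' . $maxGap . '}?(' . $dates . ')#i';

    return preg_match($pattern, $text, $m) ? $m[1] : null;
}

var_dump(find_update_date('last updated on jan 12, 2006'));
// A date more than 60 characters away from "update" is ignored.
var_dump(find_update_date('update' . str_repeat('.', 100) . ' 12-jan-2006'));
```

Without the s modifier the dot does not match a newline, so the gap also cannot run over into the next line of the whois output.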

  13. #13
    Keep it simple, stupid! bokehman's Avatar
    Join Date
    Jul 2005
    Posts
    1,935
    Quote Originally Posted by Xiosen
    because there are many dates on the page I am looking for only the one that deals with the updates.
    Why didn't you say that in the first place instead of wasting my time?

  14. #14
    SitePoint Addict
    Join Date
    Jun 2005
    Posts
    257
    Because I don't understand regex or what's needed for it. You asked me and I replied with 4 examples. I said there are 4 phrases and it's the date AFTER each one that I want. I never said that I wanted every date.

    I appreciate you taking the time, and I'm sorry that we had a misunderstanding, but I can't pretend it's what I wanted when it will not work.

  15. #15
    SitePoint Evangelist DMacedo's Avatar
    Join Date
    May 2004
    Location
    Braga, Portugal
    Posts
    596
    If you already made the jump to PHP 5 you can use strptime() which will give you an array you can use however you wish
    ~ Daniel Macedo

  16. #16
    SitePoint Addict
    Join Date
    Jun 2005
    Posts
    257
    I wish I had made the jump lol, sounds a lot better.

    aamonkey, or anyone else reading this thread: it looks like I need one more modification. For example, look at the current output:

    Code:
    Search: asdkflaskdfasldflk last updated on sept 18 2008 asdfjaklsdfjalk;sfd
    Result: Array ( [0] => ept 18 2008 ) 
    
    Search: kadlkfajlkdfjlds last updated on september 17, 2008 asdfaksjdflaskd
    Result: Array ( [0] => ber 17, 2008 )
    It looks like it keeps only 3 characters of the month. Is there some way I can modify it to keep the whole month word? Or could I convert the long month to the short abbreviation before running the regex?

    I guess the question is: what's the best way to detect the month, whether it's written in full or abbreviated?

  17. #17
    SitePoint Evangelist DMacedo's Avatar
    Join Date
    May 2004
    Location
    Braga, Portugal
    Posts
    596
    Check the PEAR PHP_Compat to see if it's already there...
    ~ Daniel Macedo

  18. #18
    SitePoint Addict
    Join Date
    Jun 2005
    Posts
    257
    Code:
    Sorry, but we didn't find anything that matches "strptime"

  19. #19
    SitePoint Guru aamonkey's Avatar
    Join Date
    Sep 2004
    Location
    kansas
    Posts
    953
    $pattern = '#update.*?((?:\d{1,2}-.+?-\d{4})|(?:[a-z]{3,},?\s+.*?\d{4})|(?:[a-z]{3,}\s+\d{1,2},\s+\d{4}))#i';

  20. #20
    SitePoint Addict
    Join Date
    Jun 2005
    Posts
    257
    Your code worked perfectly for that last part!!

    Now I've got a new problem (I think it's the last one, haha): what do I do if the date is already in YYYY-MM-DD format, such as:

    Code:
    Search: asdfaksdfjlaskdf updates: 2008-07-27 00:04:28 alsdjfalksdjflaksjfdkasdf
    Result: Array ( )
    Thanks again everyone for your contribution in this thread.

    EDIT: Oops, aamonkey, I didn't receive a notification of your last post. Let me try that out!! Thank you so much!!!!!!!!!

  21. #21
    SitePoint Guru aamonkey's Avatar
    Join Date
    Sep 2004
    Location
    kansas
    Posts
    953
    do you want the HH:MM:SS to be grabbed, too?

  22. #22
    SitePoint Addict
    Join Date
    Jun 2005
    Posts
    257
    No thank you, just the date if you could! You're a lifesaver!

  23. #23
    SitePoint Guru aamonkey's Avatar
    Join Date
    Sep 2004
    Location
    kansas
    Posts
    953
    $pattern = '#update.*?((?:\d{1,2}-.+?-\d{4})|(?:[a-z]{3,},?\s+.*?\d{4})|(?:[a-z]{3,}\s+\d{1,2},\s+\d{4})|(?:\d{4}-\d{2}-\d{2}))#i';
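
To tie this back to the original goal of storing a MySQL date, the captured text can be passed through strtotime() and date(). A rough sketch, assuming a modern PHP (7.1+); the function name update_date_for_mysql is made up here, and the pattern is the one posted above.

```php
<?php
// Hypothetical helper: find the update date and return it as YYYY-MM-DD, or null.
function update_date_for_mysql(string $text): ?string
{
    $pattern = '#update.*?((?:\d{1,2}-.+?-\d{4})|(?:[a-z]{3,},?\s+.*?\d{4})'
             . '|(?:[a-z]{3,}\s+\d{1,2},\s+\d{4})|(?:\d{4}-\d{2}-\d{2}))#i';

    if (!preg_match($pattern, $text, $m)) {
        return null; // no "update ... date" found
    }

    // strtotime() returns false when it cannot parse the captured text.
    $timestamp = strtotime($m[1]);
    return $timestamp === false ? null : date('Y-m-d', $timestamp);
}

var_dump(update_date_for_mysql('asdf last update: 12-jan-2006 asdf'));
var_dump(update_date_for_mysql('asdf updates: 2008-07-27 00:04:28 asdf'));
```

preg_match is used instead of preg_match_all because only the first update date is needed; returning null keeps a failed match distinct from a valid date.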

  24. #24
    SitePoint Addict
    Join Date
    Jun 2005
    Posts
    257
    Awesome! I owe ya big time! You're like a regex god.

