  1. #1
    SitePoint Enthusiast
    Join Date
    Nov 2010
    Location
    Largo
    Posts
    67
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Extracting Text with Regex

    I am trying to extract the ASIN from an Amazon URL. I spent all day writing four functions with regular expressions (the others are for other extractions) and was feeling pretty confident until I loaded WAMP this morning and ran the functions. Everything seems to work OK, except the output prints "Array" for each record instead of the text it should be outputting. This happens in all four functions.

    I could have sworn I used a similar function for this before. I am sure it's something silly I missed. I am still a noob, but here is my code. Can anyone see what I am doing wrong?

    Code PHP:
    <?php
     
    	function get_asin($url)
    		{
                    	preg_match('/\/dp\/(.*)\/ref=tag/', $url, $matches[1]);
                    	return $matches;
    		}	
    ?>
     
    <?php 
     
    	$url = "http://www.amazon.com/Vulli-Sophie-the-Giraffe-Teether/dp/B000IDSLOG/ref=tag_rso_rs_edpp_url"; 
    	echo get_asin($url);
    ?>

    Am I even using the right function? Regex isn't one of my strong points.

  2. #2
    Utopia, Inc. silver trophy
    ScallioXTX's Avatar
    Join Date
    Aug 2008
    Location
    The Netherlands
    Posts
    9,087
    Mentioned
    153 Post(s)
    Tagged
    2 Thread(s)
    (.*) should only be used as a very last resort, when you don't know exactly what you'll need to match. In your case it seems you do: letters and digits.
    So instead of (.*) I'd go for ([a-zA-Z0-9]+). The plus also ensures that there needs to be at least one character to match. If you know the number of characters that need to be matched, you can replace the + with {n}, where n is that number.

    As for the return value, you need to pass $matches to the preg_match function and then return $matches[1], not the other way around.

    BTW, if you use ~ as the delimiter instead of /, you don't have to escape slashes in the regex, which makes it much more readable IMHO.

    So:

    PHP Code:
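    function get_asin($url)
    {
        // Pass $matches itself; preg_match fills it with the results
        preg_match('~/dp/([a-zA-Z0-9]+)/ref=tag~', $url, $matches);
        // Index 1 holds the text captured by the first group
        return $matches[1];
    }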
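
    One thing the function above does not cover is a URL that does not match at all, in which case $matches[1] is not set. A minimal sketch of a more defensive variant follows. The name get_asin_or_null is made up for illustration, the nullable return type needs PHP 7.1 or later, and the {10} quantifier is an assumption based on the example ASIN in this thread (B000IDSLOG is 10 characters long), not something confirmed about every ASIN.

```php
<?php
// Hypothetical helper name; not part of the original thread.
function get_asin_or_null(string $url): ?string
{
    // {10} is an assumption: the example ASIN in this thread has 10 characters.
    // preg_match returns 1 on a match, 0 on no match, false on error.
    if (preg_match('~/dp/([a-zA-Z0-9]{10})/ref=tag~', $url, $matches) === 1) {
        return $matches[1]; // text captured by the first group
    }
    return null; // nothing matched (or the regex failed)
}

$url = "http://www.amazon.com/Vulli-Sophie-the-Giraffe-Teether/dp/B000IDSLOG/ref=tag_rso_rs_edpp_url";
var_dump(get_asin_or_null($url));                     // string(10) "B000IDSLOG"
var_dump(get_asin_or_null("http://www.amazon.com/")); // NULL
```

    Returning null lets the caller tell "no ASIN found" apart from an empty string, but any other sentinel would work just as well.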

    Rémon - Hosting Advisor


  3. #3
    SitePoint Wizard lorenw's Avatar
    Join Date
    Feb 2005
    Location
    was rainy Oregon now sunny Florida
    Posts
    1,101
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    About the array:
    $matches[0] will contain /dp/B000IDSLOG/ref=tag, which is the text matched by the full pattern,
    and $matches[1] will contain B000IDSLOG, which is only the text captured by the parenthesized group.
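
    To see this for yourself, here is a small sketch that uses the URL from the first post and the ~ delimited pattern suggested earlier in the thread. It also hints at why the original code printed "Array": echoing an array in PHP prints the word Array rather than its contents, so you have to echo a single element.

```php
<?php
$url = "http://www.amazon.com/Vulli-Sophie-the-Giraffe-Teether/dp/B000IDSLOG/ref=tag_rso_rs_edpp_url";

// preg_match fills $matches: index 0 is the whole match, index 1 the first group.
preg_match('~/dp/([a-zA-Z0-9]+)/ref=tag~', $url, $matches);

echo $matches[0], "\n";     // /dp/B000IDSLOG/ref=tag
echo $matches[1], "\n";     // B000IDSLOG
echo count($matches), "\n"; // 2
```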
    What I lack in acuracy I make up for in misteaks

  4. #4
    SitePoint Enthusiast
    Join Date
    Nov 2010
    Location
    Largo
    Posts
    67
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Thanks guys, I did log in yesterday morning to read what you said and then played with it for a bit before work. Didn't get to the PC last night, but I did manage to get everything going and working properly.

    I think the biggest problem is that I wasn't being strict enough with my search parameters; I thought that if I typed in / it would match the first one, not the last one. That being said,

    ScallioXTX,

    I actually had my original expression (idk what it's actually called) search for only the 9 digit string of uppercase letters and numbers, because Amazon actually has different /dp/ variables for different countries, but the way you have shown would probably be the stricter policy and safer?

    and the ~ do make it easier to read, thanks bro

  5. #5
    Utopia, Inc. silver trophy
    ScallioXTX's Avatar
    Join Date
    Aug 2008
    Location
    The Netherlands
    Posts
    9,087
    Mentioned
    153 Post(s)
    Tagged
    2 Thread(s)
    Originally posted by CyberToolz:
    I actually had my original expression (idk what it's actually called) search for only the 9 digit string of uppercase letters and numbers, because Amazon actually has different /dp/ variables for different countries, but the way you have shown would probably be the stricter policy and safer?
    Yes, because it will only match if the characters that are there actually adhere to the pattern you're looking for, whereas (.*) will match all kinds of garbage, giving the impression that you found something when you didn't; you found a string of garbage where a string of 9 letters and digits should have been.
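
    A short sketch of the difference, for anyone who wants to try it. The URL below is made up for illustration (it has an extra path segment between the ASIN and /ref=tag). Because .* is greedy and also matches slashes, it swallows everything up to the last /ref=tag it can find, while the character class simply refuses to match.

```php
<?php
// Hypothetical URL with an extra path segment, invented for this example.
$url = "http://www.amazon.com/dp/B000IDSLOG/some-junk/ref=tag_x";

// Loose pattern: .* also matches "/", so the capture runs past the ASIN.
preg_match('~/dp/(.*)/ref=tag~', $url, $loose);
echo $loose[1], "\n"; // B000IDSLOG/some-junk

// Strict pattern: only letters and digits are allowed, so there is no match.
$found = preg_match('~/dp/([a-zA-Z0-9]+)/ref=tag~', $url, $strict);
echo $found, "\n"; // 0
```

    With the strict pattern a bad URL shows up as "no match", which is easier to handle than a capture that looks valid but is not.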