  1. #1
    Twitter: @AnthonySterling silver trophy AnthonySterling's Avatar
    Join Date
    Apr 2008
    Location
    North-East, UK.
    Posts
    6,111
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)

    Yet Another Regular Expression Problem!

    Hello all,

    I seem to be having a slight problem with the following Regular Expression.

    PHP Code:
    (?<=<.*?)$sKeyword(?=.*>|/>) 
    I basically want to match a keyword wherever it appears inside an XML tag (sometimes as the tag name itself), but not in the content the tags surround.

    So...a quick example:-

    Code:
    <keyword keyword="foo" />
    </keyword keyword="bar" >
    The RegEx works fine in my desktop application, but when using it with PHP I get the following error:-

    Compilation failed: lookbehind assertion is not fixed length at offset 8
    A spot of Googling has determined that PHP's PCRE library does not support variable-length lookbehinds.

    Can anyone help with a PHP compatible RegEx?

    Thanks,

    SilverB.
    Last edited by AnthonySterling; Jul 14, 2008 at 13:39.
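For reference, the error can be reproduced in isolation. The sketch below assumes a stock PHP build with PCRE, and the sample string is taken from the example above. It suggests that an unbounded lookbehind such as `(?<=<.*?)` fails to compile, while a lookbehind whose alternatives each have a fixed length is accepted.

```php
<?php
// $sKeyword mirrors the variable name used in the post; the value is illustrative.
$sKeyword = 'keyword';
$sSample  = '<keyword keyword="foo" />';

// Unbounded lookbehind: PCRE cannot compile this, so preg_match() returns
// false. The @ only hides the warning for this demonstration.
$bad = @preg_match('/(?<=<.*?)' . $sKeyword . '/', $sSample);

// Each alternative here has a fixed length (1 and 2 characters),
// which PCRE accepts.
$good = preg_match('/(?<=<|<\/)' . $sKeyword . '/', $sSample);

var_dump($bad);   // result of the pattern that fails to compile
var_dump($good);  // result of the fixed-length variant
```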

  2. #2
    Twitter: @AnthonySterling silver trophy AnthonySterling's Avatar
    Nobody?

  3. #3
    SitePoint Guru mmarif4u's Avatar
    Join Date
    Dec 2006
    Location
    /dev/swat
    Posts
    619
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Did you go through this:
    link

  4. #4
    Twitter: @AnthonySterling silver trophy AnthonySterling's Avatar
    The solution was to downgrade PCRE from 7.0 to an older version.
    Nope, I can't do that.

    I'm guessing an alternate expression would be suitable, but I'm afraid my skills are not up to it!

  5. #5
    Twitter: @AnthonySterling silver trophy AnthonySterling's Avatar
    *bump*

  6. #6
    dooby dooby doo silver trophybronze trophy
    spikeZ's Avatar
    Join Date
    Aug 2004
    Location
    Manchester UK
    Posts
    13,807
    Mentioned
    158 Post(s)
    Tagged
    3 Thread(s)
    So kind of a skewed HTML tag thing....

    One of these any good?
    PHP Code:
    //HTML comment
    '<!--.*?-->'

    //HTML file
    //Matches a complete HTML file.  Place round brackets around the .*? parts you want to extract from the file.
    //Performance will be terrible on HTML files that miss some of the tags 
    //(and thus won't be matched by this regular expression).  Use the atomic version instead when your search 
    //includes such files (the atomic version will also fail invalid files, but much faster).
    '<html>.*?<head>.*?<title>.*?</title>.*?</head>.*?<body[^>]*>.*?</body>.*?</html>'

    //HTML file (atomic)
    //Matches a complete HTML file.  Place round brackets around the .*? parts you want to extract from the file.
    //Atomic grouping maintains the regular expression's performance on invalid HTML files.
    '<html>(?>.*?<head>)(?>.*?<title>)(?>.*?</title>)(?>.*?</head>)(?>.*?<body[^>]*>)(?>.*?</body>).*?</html>'

    //HTML tag
    //Matches the opening and closing pair of whichever HTML tag comes next.
    //The name of the tag is stored into the first capturing group.
    //The text between the tags is stored into the second capturing group.
    '<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>'

    //HTML tag
    //Matches the opening and closing pair of a specific HTML tag.
    //Anything between the tags is stored into the first capturing group.
    //Does NOT properly match tags nested inside themselves.
    '<%TAG%[^>]*>(.*?)</%TAG%>'

    //HTML tag
    //Matches any opening or closing HTML tag, without its contents.
    '</?[a-z][a-z0-9]*[^<>]*>' 
    Mike Swiffin - Community Team Advisor
    Only a woman can read between the lines of a one word answer.....
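The patterns in this list are written without delimiters or modifiers, so they need both before PHP's preg_* functions will accept them. A minimal sketch for the paired-tag pattern, assuming `~` as the delimiter and the `i` and `s` modifiers (choices made for this example, not part of the list); the sample markup is made up:

```php
<?php
// Made-up sample markup for the demonstration.
$sHtml = '<p class="intro">Hello</p>';

// "~" avoids having to escape the "/" in the closing tag;
// "i" makes [A-Z] match lowercase tag names, "s" lets "." span newlines.
$sPattern = '~<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>~is';

if (preg_match($sPattern, $sHtml, $aMatches)) {
    echo $aMatches[1], "\n"; // first capturing group: the tag name
    echo $aMatches[2], "\n"; // second capturing group: text between the tags
}
```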

  7. #7
    Twitter: @AnthonySterling silver trophy AnthonySterling's Avatar
    I tried to manipulate those last night to no avail. I didn't think it would be that hard, to be honest...

    I shouldn't have volunteered!

    Thanks for trying anyway.

  8. #8
    dooby dooby doo silver trophybronze trophy
    spikeZ's Avatar
    If I am honest, I don't really get what you are trying to do - hence the wild punt in the dark!

  9. #9
    . shoooo... silver trophy logic_earth's Avatar
    Join Date
    Oct 2005
    Location
    CA
    Posts
    9,013
    Mentioned
    8 Post(s)
    Tagged
    0 Thread(s)
    The only way I can see to do this is to not use lookarounds...
    This kinda works...
    Code:
    (?:<.*?)(?:toolbox|command)(?:.*?>)
    Logic without the fatal effects.
    All code snippets are licensed under WTFPL.


  10. #10
    Twitter: @AnthonySterling silver trophy AnthonySterling's Avatar
    This kinda works...
    Kinda, but not of any use I'm afraid.

    If I am honest, I don't really get what you are trying to do - hence the wild punt in the dark!
    Ok, I'll try to explain more...given the sample XML below:-
    Code:
    <bar foo="bar">
        I went into the bar the other day, got drunk, fell over, ate kebab.
    </keyword>
    < bar="foo" />
    I need a RegEx to match 'bar' only if it sits between a '<' and its closing '>', essentially.

    So...

    Code:
    <bar foo="bar">
        I went into the bar the other day, got drunk, fell over, ate kebab.
    </bar>
    <foo bar="foo" />
    Sorry if I'm not being clear.

    SilverB.
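One way to get this behaviour without a lookbehind is to rely on a lookahead alone: a keyword is inside a tag if a '>' follows it before any '<' does. This is only a sketch. It assumes literal '<' and '>' never appear unescaped in text or attribute values, and the variable names are illustrative.

```php
<?php
$sXML = <<<'EOXML'
<bar foo="bar">
    I went into the bar the other day, got drunk, fell over, ate kebab.
</bar>
<foo bar="foo" />
EOXML;

$sKeyword = 'bar';

// Keyword followed by any run of non-angle-bracket characters and then ">".
// The "bar" in the text content is followed by "<" first, so it is skipped.
$sPattern = '/' . preg_quote($sKeyword, '/') . '(?=[^<>]*>)/';

$iCount = preg_match_all($sPattern, $sXML, $aMatches, PREG_OFFSET_CAPTURE);

foreach ($aMatches[0] as $aMatch) {
    echo "'{$aMatch[0]}' at offset {$aMatch[1]}\n";
}
```

Because the match itself is just the keyword, the same pattern can be handed to preg_replace() to change only the keyword and leave the rest of the tag alone.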

  11. #11
    Theoretical Physics Student bronze trophy Jake Arkinstall's Avatar
    Join Date
    May 2006
    Location
    Lancaster University, UK
    Posts
    7,062
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    So you're looking for something along the lines of...
    Code:
    (<[^>]*?)bar(.*?>)
    Jake Arkinstall
    "Sometimes you don't need to reinvent the wheel;
    Sometimes its enough to make that wheel more rounded"-Molona

  12. #12
    Twitter: @AnthonySterling silver trophy AnthonySterling's Avatar
    Quote: Originally Posted by arkinstall
    So you're looking for something along the lines of...
    Code:
    (<[^>]*?)bar(.*?>)
    Nearly, but I just want it to match 'bar' and not the entire node. The pattern you supplied, arkinstall, seems to match the complete node where the keyword is present.

  13. #13
    Theoretical Physics Student bronze trophy Jake Arkinstall's Avatar
    In which case:
    Code:
    <[^>]*?bar.*?>
    The expressions in brackets in my previous pattern are capturing groups; those are what later get stored as members of the $matches array.

    What do you want to do with it once you've found it? Just find it, or grab an area around it?
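To illustrate what ends up in $matches, and how the bracketed groups can be used to change only the keyword, here is a small sketch. The sample tag and the replacement text 'baz' are made up for the example.

```php
<?php
// Made-up sample tag.
$sTag = '<foo bar="foo" />';

preg_match('/(<[^>]*?)bar(.*?>)/', $sTag, $matches);

// $matches[0] is the whole match; [1] and [2] are the bracketed groups,
// i.e. the parts of the tag before and after the keyword.
print_r($matches);

// Put both groups back around a replacement so only 'bar' changes.
// Note: this replaces just the first 'bar' in each tag.
$sResult = preg_replace('/(<[^>]*?)bar(.*?>)/', '$1baz$2', $sTag);
echo $sResult, "\n";
```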

  14. #14
    Twitter: @AnthonySterling silver trophy AnthonySterling's Avatar
    This is what I have, I was tasked to create an object that would take namespaced XML and remove all occurrences of the namespace.

    It works, in fact, it works really well....I've extended DOMDocument so as to allow easier manipulation with its methods.

    I've tested it on internal, Youtube and Yahoo namespaced XML and it works a treat, but there's still work to be done I think.

    class.NamespaceNulledXMLDocument.php
    PHP Code:
    <?php
    /**
     * @author      Anthony David Sterling
     *
     * @copyright   Anthony David Sterling
     *
     * @version     1.0
     *
     * @desc        Removes namespace prefixes
     *              programmatically based on the
     *              declared namespaces contained
     *              within the document itself.
     *
     * @todo
     */
    class NamespaceNulledXMLDocument extends DOMDocument
    {
        /**
         * @var        Boolean        Indicates whether we found any namespaces to remove.
         */
        public $foundNamespaces = false;

        /**
         * @var        Array        Holds all namespaces we found and processed.
         */
        public $removedNamespaces = array();

        /**
         * @param    String    $sXML    An XML document in string form.
         * @return   Void
         */
        public function __construct( $sXML )
        {
            //--> Call the DOMDocument constructor to instantiate.
            parent::__construct();

            //--> Load our 'cleaned' XML into the DOMDocument.
            parent::loadXML( self::removeNamespaces( $sXML ) );
        }

        /**
         * @param    String    $sXML    An XML document in string form.
         * @return   String
         */
        private function removeNamespaces( $sXML )
        {
            //--> Collecting all 'declared' namespace prefixes from within the XML document itself.
            $aNamespaces = ( preg_match_all( '/(?<=xmlns:).*?(?==)/', $sXML, $aMatches, PREG_PATTERN_ORDER ) ) ? $aMatches[0] : array();

            //--> Before continuing, we'll check if we actually have any prefixes to remove.
            if ( count( $aNamespaces ) > 0 )
            {
                //--> Logging the fact we found some namespaces to remove.
                $this->foundNamespaces = true;

                //--> Walking through each of the matched namespace prefixes and removing the prefix from every element.
                foreach ( $aNamespaces as $sNamespace )
                {
                    //--> Log the namespace we are processing.
                    $this->removedNamespaces[] = $sNamespace;

                    //--> Replace any node with a namespace.
                    $sXML = preg_replace( "%(?<=<|</){$sNamespace}:%", '', $sXML );

                    //--> Replace any attribute with a namespace.
                    $sXML = preg_replace( "/(?<=\\s){$sNamespace}:(?=.*?=\".*?\")/", '', $sXML );
                }

                //--> Return the new, clean, all singing, all dancing namespace nulled XML.
                return $sXML;
            }
            else
            {
                //--> Hmm, no namespaces to remove, so we'll just return the original XML that was supplied.
                return $sXML;
            }
        }
    }
    ?>
    Usage.php
    PHP Code:
    <?php
    //--> Load the class.
    require_once( 'class.NamespaceNulledXMLDocument.php' );
    //--> Obtain the namespaced XML.
    $sXMLData = @file_get_contents('sample.xml');
    //--> Create a NamespaceNulledXMLDocument, passing it the namespaced XML.
    $oXML = new NamespaceNulledXMLDocument( $sXMLData );
    //--> Using DOMDocument methods, format the output.
    $oXML->formatOutput = true;
    //--> Output the XML.
    echo $oXML->saveXML();
    ?>
    There are a couple of niggles...I'd prefer to use one single Regular Expression in place of the following two:-

    PHP Code:
    //--> Replace any node with a namespace.
    $sXML = preg_replace( "%(?<=<|</){$sNamespace}:%", '', $sXML );

    //--> Replace any attribute with a namespace.
    $sXML = preg_replace( "/(?<=\\s){$sNamespace}:(?=.*?=\".*?\")/", '', $sXML );
    Maybe this will be of use to somebody, I think the other guys are importing the object into SimpleXML for easier traversing which is possible because it extends DOMDoc.

    SilverB.
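One possible way to fold the two replacements into a single expression is to list all three "preceded by" cases in one lookbehind and use a lookahead to check that the match sits inside a tag. This is a sketch rather than a drop-in replacement: the function name stripPrefix is hypothetical, it assumes '<' and '>' never appear unescaped in text or attribute values, and it will also strip a prefix that follows whitespace inside an attribute value.

```php
<?php
/**
 * Hypothetical helper: strips "$sNamespace:" from element and attribute names.
 */
function stripPrefix($sXML, $sNamespace)
{
    // Escape the prefix, since characters such as "." are legal in prefixes.
    $sPrefix = preg_quote($sNamespace, '%');

    // Lookbehind: "<", "</" or whitespace (each alternative is fixed length).
    // Lookahead: a ">" must follow before any other angle bracket.
    $sPattern = '%(?<=<|</|\s)' . $sPrefix . ':(?=[^<>]*>)%';

    return preg_replace($sPattern, '', $sXML);
}

// Made-up sample document.
$sXML   = '<root xmlns:yt="http://example.com/yt"><yt:a yt:id="1">see yt:b here</yt:a></root>';
$sClean = stripPrefix($sXML, 'yt');

echo $sClean, "\n";
```

The lookahead is what keeps the `yt:b` in the text content untouched, which the attribute pattern on its own would not guarantee.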

  15. #15
    SitePoint Enthusiast
    Join Date
    Feb 2005
    Location
    Glasgow, Scotland
    Posts
    97
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Whoops! Didn't read this right.

    This tool is your friend though:
    http://regex.larsolavtorvik.com/
    Last edited by Teeej; Jul 16, 2008 at 02:54. Reason: Better example:

  16. #16
    . shoooo... silver trophy logic_earth's Avatar
    I would just pull the whole tag then run it through a callback.
    PHP Code:
    <?php

    $xml = <<<EOXML
    <window
        test:id="JSConsoleWindow"
        test:xmlns="http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul"
        test:title="Error Console"
        test:windowtype="global:console"
        test:width="513"
        test:height="498"
        screenX="520"
        screenY="406"
        test:persist="screenX screenY width height sizemode"
        test:onclose="return closeWindow(false);"
        test:sizemode="normal"
    >
        <test:script type="application/javascript" src="chrome://global/content/globalOverlay.js"/>
        <test:script type="application/javascript" src="chrome://global/content/console.js"/>

        <test:stringbundle id="ConsoleBundle" test:src="chrome://global/locale/console.properties"/>

    </window>
    EOXML;

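    function stripNamespace( $m )
    {
        //--> Strip "prefix:" only where it starts a tag or attribute name, i.e.
        //--> right after "<", "/" or whitespace, so the "<" itself and values
        //--> such as "http://..." or "global:console" are left alone.
        return preg_replace( '/(?<=[<\/\s])[^\s:"<>=\/]+:/', '', $m[0] );
    }

    $xml = preg_replace_callback( '/(<\/?[^\s>]+\b(?:[\'"].*?[\'"]|[^<>\'"]+)*>)/', 'stripNamespace', $xml );
    print '<pre>' . htmlspecialchars( $xml ) . '</pre>';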