  1. #1
    SitePoint Zealot
    Join Date
    Dec 2006
    Location
    England, UK
    Posts
    160
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    grab html names & attributes

    hi,

    in short, I'd like a script that searches through a given source, identifies the tag names and attributes, and then displays them. My regexp and PHP are a little rusty, so any help would be greatly appreciated..


    regards,
    Kwah =]




    example:

    Code:
    <html>
    <head>
    <title>my first page</title>
    </head>
    
    
    <body bgcolor="#FFFFFF">
    
    </body>
    
    foo bar
    
    hello world
    
    </html>
    would become

    Code:
    html
     - head
     - - title
     - body 
     - - {text} 
     - - {line break}
     - - {text}

    essentially, that example creates a document tree of the tags and the content inside the document..

    each branch of this document tree then links to a page listing that tag's attributes / contents, such as the body's bgcolor

  2. #2
    SitePoint Zealot
    Join Date
    Dec 2006
    Location
    England, UK
    Posts
    160
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    sorry for the double post, but this one expands on my first post and lays out my options for what I want to (or could) do..


    to begin with, I'm starting simple and sticking with valid HTML


    the way I imagine it working most efficiently is to run through the code looking for an opening tag
    Code:
    <html>
    it then searches for a matching closing tag and, if it finds one, adds a coded comment after each tag..

    Code:
    <html> <!-- O:html.1 -->
    ...
    </html> <!-- C:html.1 -->
    <!-- A:B.C -->

    A:
    O=Opening, C=Closing, N=Non-Applicable / doesn't need/use one
    B:
    html=The 'name' of the tag
    C:
    1=An ID for that tag to match it up with its closing tag



    A slightly more complex example of how this would work:

    Code:
    <html> <!-- O:html.1 -->
    <head> <!-- O:head.1 -->
    <title> <!-- O:title.1 --> My New Web Page </title> <!-- C:title.1 -->
    </head> <!-- C:head.1 -->
    
    <body> <!-- O:body.1 -->
    
    <h1> <!-- O:h1.1 --> Welcome to My Web Page! </h1> <!-- C:h1.1 -->
    
    <p> <!-- O:p.1 -->
    Foobar
    </p> <!-- C:p.1 -->
    
    <p> <!-- O:p.2 -->
    There is a small graphic after the period at the end of this sentence. 
    <img src="images/mouse.gif" alt="Mousie" width="32" height="32" border="0">  <!-- N:img.1 -->
    </p> <!-- C:p.2 -->
    
    <p> <!-- O:p.3 -->
    Link: <a href="http://www.example.com/"> <!-- O:a href.1 -->example</a> <!-- C:a href.1 --> <br> <!-- N:br.1 -->
    Another link: <a href="example2.htm"> <!-- O:a href.2 -->example 2</a> <!-- C:a href.2 --> <br> <!-- N:br.2 -->
    foo bar
    </p> <!-- C:p.3 -->
    
    <p> <!-- O:p.4 -->&gt; <a href="example3.htm"> <!-- O:a href.3 -->example 3</a> <!-- C:a href.3 --></p> <!-- C:p.4 -->
    
    </body> <!-- C:body.1 -->
    </html> <!-- C:html.1 -->


    After I've finished, these comments will be removed

  3. #3
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    What you're trying to do, is called parsing. There is a ready-made HTML parser bundled with PHP. Have a look at the documentation for DomDocument:
    http://docs.php.net/manual/en/domdocument.loadhtml.php
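
    As a rough illustration of the DOM route, here is a minimal sketch that prints a tag outline with attributes. It assumes PHP 7 or later with the DOM extension enabled (so it will not run on PHP 4); the helper name outline_node and the sample markup are made up for this example.

```php
<?php
// Hypothetical helper: returns one outline line per element or text node,
// indented with " -" per nesting level, with attributes in brackets.
function outline_node(DOMNode $node, int $depth = 0): array
{
    $lines = [];
    $prefix = $depth > 0 ? str_repeat(' -', $depth) . ' ' : '';
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMElement) {
            $attrs = [];
            foreach ($child->attributes as $attr) {
                $attrs[] = $attr->nodeName . '="' . $attr->nodeValue . '"';
            }
            $label = $child->tagName;
            if ($attrs) {
                $label .= ' [' . implode(', ', $attrs) . ']';
            }
            $lines[] = $prefix . $label;
            // Recurse into the element's own children, one level deeper.
            $lines = array_merge($lines, outline_node($child, $depth + 1));
        } elseif ($child instanceof DOMText && trim($child->nodeValue) !== '') {
            $lines[] = $prefix . '{text}';
        }
    }
    return $lines;
}

// Sample markup (an assumption, loosely based on the example in post #1).
$html = '<html><head><title>my first page</title></head>'
      . '<body bgcolor="#FFFFFF">foo bar<br>hello world</body></html>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);

$lines = outline_node($doc);
echo implode("\n", $lines), "\n";
```

    Whitespace-only text nodes are skipped here; whether that is what you want for an editor is a design decision, not something the DOM forces on you.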

  4. #4
    SitePoint Zealot
    Join Date
    Dec 2006
    Location
    England, UK
    Posts
    160
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by kyberfabrikken View Post
    What you're trying to do, is called parsing. There is a ready-made HTML parser bundled with PHP. Have a look at the documentation for DomDocument:
    http://docs.php.net/manual/en/domdocument.loadhtml.php
    I've only read through a small amount from that link, but there are several problems. I've got PHP Version 4.4.7 according to phpinfo() {ie, not PHP 5} and secondly I don't think it is suitable for what I need to do.

    The goal of this is to create a tag-by-tag editor that will work with other markup and programming languages - wml and xhtml for instance, and eventually, PHP and others..

    I figured though, starting with HTML, it would make it much easier to get that done then port it to other languages rather than to try to do it all at once, and I'd get the most support in trying html first (though I'm beginning to wonder whether I should start with the more strict wml)


    My point is that I want to extract, add-to, remove and replace parts of these files and please correct me if I'm wrong, but I believe it will be better in the long term to construct my own parser.
    I just need guidance and examples of how it should be done.



    Regards,
    Kwah

  5. #5
    . shoooo... silver trophy logic_earth's Avatar
    Join Date
    Oct 2005
    Location
    CA
    Posts
    9,013
    Mentioned
    8 Post(s)
    Tagged
    0 Thread(s)
    For markup languages use the DOM as Kyber suggested.

    For PHP there is:
    http://us3.php.net/manual/en/book.tokenizer.php
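
    A small sketch of what the tokenizer gives you, assuming the tokenizer extension is available (it is enabled by default). The sample snippet in $code is made up; token_get_all() and token_name() are the real functions.

```php
<?php
// Sample PHP source to tokenize (an assumption for illustration).
$code = '<?php echo "hi"; ?>';

$names = [];
foreach (token_get_all($code) as $token) {
    if (is_array($token)) {
        // Array tokens are [token id, source text, line number].
        $names[] = token_name($token[0]);
    } else {
        // Single-character tokens such as ";" come back as plain strings.
        $names[] = $token;
    }
}

echo implode("\n", $names), "\n";
```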

    For other languages you should look for tokenizers for them too. Making your own parser to handle everything is impossible for one simple reason: every language has different rules. Since this is an ambitious project, I really must suggest that you upgrade to PHP 5 (the latest is 5.2.6).
    Logic without the fatal effects.
    All code snippets are licensed under WTFPL.


  6. #6
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by kwah View Post
    I've got PHP Version 4.4.7 according to phpinfo()
    Certainly a problem you need fixed. To quote the PHP website:
    Support for PHP 4 has been discontinued since 2007-12-31. Please consider upgrading to PHP 5.2.
    Quote Originally Posted by kwah View Post
    The goal of this is to create a tag-by-tag editor that will work with other markup and programming languages - wml and xhtml for instance, and eventually, PHP and others..
    Both WML and XHTML are XML-based languages, so you can use the DOM parser to load them. I think you'll find that a tool for editing any language is too broad a goal; such an application wouldn't be very useful. However, if you did need that, then you would indeed need a more general parser than the DOM parser.

    If you're doing this more as a training exercise, then you need to write a tokenizer and a state machine that can transform the output of the tokenizer into a tree structure. The tokenizer takes a string as input and outputs a number of smaller strings (tokens). A tokenizer for XML is pretty simple to write by hand, but there are also generic libraries for generating tokenizers for all sorts of languages.
    The state-machine tends to be slightly more complex. You could start by looking at Pear FSM.
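
    To make the two steps concrete, here is a minimal hand-written sketch, assuming PHP 7 or later. It is not Pear FSM: a plain stack of open elements stands in for the state machine, the function names tokenize_markup and build_tree are hypothetical, and the list of tags without a closing tag is a short, incomplete assumption.

```php
<?php
// Step 1, tokenizer: split the source into tag tokens and text tokens.
function tokenize_markup(string $source): array
{
    // DELIM_CAPTURE keeps the tags themselves; NO_EMPTY drops empty pieces.
    return preg_split('/(<[^>]*>)/', $source, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
}

// Step 2, tree builder: a stack of currently open elements.
function build_tree(array $tokens): stdClass
{
    $void = ['br', 'img', 'hr', 'input', 'meta', 'link']; // assumed, incomplete
    $root = (object) ['name' => '#root', 'children' => []];
    $stack = [$root];

    foreach ($tokens as $token) {
        $current = end($stack);
        if (preg_match('~^<\s*/\s*([a-z][a-z0-9]*)~i', $token, $m)) {
            // Closing tag: pop back to the nearest matching open element.
            for ($i = count($stack) - 1; $i > 0; $i--) {
                if ($stack[$i]->name === strtolower($m[1])) {
                    array_splice($stack, $i);
                    break;
                }
            }
        } elseif (preg_match('~^<\s*([a-z][a-z0-9]*)~i', $token, $m)) {
            // Opening tag: attach to the current element, then descend into it
            // unless it is a tag that never has a closing tag.
            $node = (object) ['name' => strtolower($m[1]), 'children' => []];
            $current->children[] = $node;
            if (!in_array($node->name, $void, true)) {
                $stack[] = $node;
            }
        } elseif ($token[0] === '<') {
            continue; // comments, doctype etc. are ignored in this sketch
        } elseif (trim($token) !== '') {
            $current->children[] = (object) ['name' => '{text}', 'children' => []];
        }
    }
    return $root;
}

$tree = build_tree(tokenize_markup('<html><body><p>Hi<br>there</p></body></html>'));
$p = $tree->children[0]->children[0]->children[0];
echo $p->name, ' has ', count($p->children), " children\n";
```

    A stray closing tag with no matching open element is simply skipped here; a real editor would probably want to report it instead.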

    If you don't want to muck around with the dirty details of parsing, you could also use a ready-made lexer, such as Pear LexerGenerator (see: Lexical analysis).

  7. #7
    SitePoint Zealot
    Join Date
    Dec 2006
    Location
    England, UK
    Posts
    160
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Ignore the fact that the output is not an actual HTML page (no <html>, <body> etc.) and that the coding is pretty shoddy; this is a night's worth of having a shot at it without using regex ...

    it only grabs what is in between any < and > and outputs each on a new line, indicating whether it is an opening or closing tag


    known bugs:
    - anything not inside <> does not get displayed at all
    - tags that don't need a closing tag are listed as opening tags regardless
    - closing tags that have whitespace between the < and / are listed as opening tags (something I'll need a regexp for..?)
    - if there is a stray < or >, it may falter and show everything up to the next > as part of a tag
    ==> possible workaround/fix would be to keep a list of recognised tags and display unknown code under a "code" heading

    - not really a bug, but the attributes of the tags are included with the tag name
    - also not really a bug, but some of the variable names aren't particularly descriptive - I was just trying to bash something out quickly.. comments were added afterwards
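
    Several of the bugs above (whitespace before the /, tags with no closing tag, attributes stuck to the tag name) could be handled by one classification step. This is only a sketch, assuming PHP 7 or later; the function name classify_tag and the list of tags without a closing tag are hypothetical, and the attribute regex only covers simple name=value pairs.

```php
<?php
// Hypothetical helper: takes the text found between < and > and returns
// the marker (O, C, S or N), the tag name and the attributes.
function classify_tag(string $inside): array
{
    $void = ['br', 'img', 'hr', 'input', 'meta', 'link']; // assumed, incomplete

    // Optional slash (whitespace allowed around it), the name, then the rest.
    if (!preg_match('~^\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)\s*(.*?)\s*/?\s*$~s', $inside, $m)) {
        return ['marker' => 'N', 'name' => null, 'attributes' => []];
    }

    $name = strtolower($m[2]);
    if ($m[1] === '/') {
        $marker = 'C';
    } elseif (in_array($name, $void, true)) {
        $marker = 'S';
    } else {
        $marker = 'O';
    }

    // Attributes: name="value", name='value' or name=value.
    $attributes = [];
    preg_match_all(
        '~([a-zA-Z_:][-\w:.]*)\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))~',
        $m[3], $pairs, PREG_SET_ORDER
    );
    foreach ($pairs as $pair) {
        // Only the alternative that matched leaves a trailing group behind.
        $attributes[strtolower($pair[1])] = $pair[4] ?? $pair[3] ?? $pair[2];
    }

    return ['marker' => $marker, 'name' => $name, 'attributes' => $attributes];
}

$closing = classify_tag(' / p ');
$image   = classify_tag('img src="a.gif" alt=\'x\' width=32');
$body    = classify_tag('body bgcolor="#FFFFFF"');

echo $closing['marker'], ':', $closing['name'], "\n";
echo $image['marker'], ':', $image['name'], "\n";
echo $body['marker'], ':', $body['name'], "\n";
```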


    there are a few other problems that I can't remember offhand, but does it give you an idea of what I want to do?

    and this is swinging between being a "training exercise", as you put it, and a potential implementation.. a contact of mine set up a (worldwide) team to do this approx. 2 months ago, but other commitments got in the way and we've since disbanded. I want to continue with it, though ...

    ... so here i am =]


    anyways.. this is what I've got so far .. needs a lot of work ..


    oh, and as for PHP 5, I don't know if I'll be able to ... I chose badly when picking a reseller account, and I'm just contacting them now to request an upgrade, quoting from php.net

    we'll see..

    regards,
    kwah



    PHP Code:
    <?php

    // the location of the file being edited
    $source = file_get_contents("./source.html");

    // un-needed but will be useful in future when applying char encoding etc
    $mystring = $source;

    // used when displaying the content on page
    $mystring2 = htmlentities($mystring, ENT_QUOTES);


    // display the source code being edited - has been somewhat sanitised (converted to entities) so should be safe
    echo "<pre>", $mystring2, "</pre><br><br>\n\n";

    // search terms .. may be added to later..
    $findme  = '<';
    $findme2 = '>';


    echo "<br><br>\n\n";
    echo "The text between these tags is:<br>\n";


    // indicator of what type of tag is being used - opening, closing, self-closing or n/a
    $marker_opening = "O";
    $marker_closing = "C";
    //// TODO: add checks for if it is a self-closing tag (img for example)
    $marker_self = "S";
    //// .... or doesn't need one (text for example)
    $marker_na = "N";


    $i = 0;
    $tyui = 1;


    // count number of <'s as guide to number of tags to expect
    $findme_count = substr_count($mystring, $findme);
    $max = $findme_count;


    // loop --> search for a <, search for the next > and display whatever is in between

    do {
        echo "<br>\n\n";


        //////////////////// currently unused ///////////////////////
        // search for the position of the next (first) <
        $pos = strpos($mystring, $findme, $i);
        // search for the position of the next (first) >
        $pos2 = strpos($mystring, $findme2, $pos);
        ////////////////////////// end //////////////////////////////


        // explode the source into sections based on <
        //// NOTE: the explode 'term' / delimiter is removed
        $content = explode($findme, $mystring);
        // explode the remaining sections into further breakdowns based on >
        $content2 = explode($findme2, $content[$tyui]);
        //// TODO: show instances of text outside of a tag as 'Text'
        //// TODO: list included attributes beneath each tag name


        // displayed snippets will be surrounded by <pre></pre> tags
        echo "<pre>";

        // check if the 'grabbed' code is a closing tag - ie, if it contains a / directly after the < then it will be considered as a closing tag
        //// TODO: use regex (?) to force it to work regardless of whitespace
        $isclosingtagcheck = substr($content2[0], 0, 1);

        // perform v.simple check and display the coded version accordingly
        if ($isclosingtagcheck == '/') {
            // if it is a closing tag, start from char # 1 (counting starts @ 0)
            echo $marker_closing, ":", substr($content2[0], 1);
        } else {
            echo $marker_opening, ":", $content2[0];
        }
        echo "</pre>";


        // make the next loop start checking where the previous tag supposedly finished
        $i = $pos2;

        // use $tyui as a counter, comparing the number of loops done with the expected total of tags
        $tyui++;
    } while ($tyui <= $max);


    ?>

