  1. #1
    SitePoint Enthusiast
    Join Date
    Apr 2001
    Posts
    63

    extract info from HTML

    I have an HTML file that I'm trying to pull some info out of. The HTML is formatted like this:
    <font size=+2>Airplane Spin</font><br>
    <b>Used by :</b> Mike Rotunda<br>
    <b>AKA :</b> <br>
    <b>Description :</b> The attacker lifts the victim up across their
    shoulders. The attacker starts spinning around quickly a few times
    to dizzy the victim, then drops them to the mat. <br>
    <br>

    <p><font size=+2>Airplane Spin Toss</font><br>
    <b>Used by :</b> Al Perez, Sid Vicious, Oz<br>
    <b>AKA :</b> Ally-Coptor (Perez), Human Frisbee (Sid), Twister Slam
    (Oz)<br>
    <b>Description :</b> The victim is lifted up over the attacker's
    shoulder so the victim is facing upwards and their back is held
    over the shoulder of the attacker. The attacker holds the victim in
    place and spins around a few times, then tosses the victim into the
    air dropping them back first to the mat.<br>
    </p>
    <p><font size=+2>Airplane Spin Toss, Face First</font><br>
    <b>Used by :</b> Mike Enos<br>
    <b>AKA :</b><br>
    <b>Description :</b> The attacker lifts up the victim over their
    shoulder as if for a body slam. The attacker then spins around a
    few times and then tosses the victim in the air dropping them to
    the mat face first.</p>
    <p><font size=+2>Arm Breaker</font><br>
    <b>Used by :</b><br>
    <b>AKA :</b><br>
    <b>Description :</b> The attacker has the victim's arm in a
    wristlock. The attacker steps forward and drives the victim's arm
    across their knee.</p>
    What I want is to pull out the name (within the larger font tags) and the description, but it isn't working properly. The first two entries work fine, then it skips the third, and after that it works only some of the time. Here is my code:
    PHP Code:
    $input = fopen("E:/Programming/Docs/Reg.htm", "r");
    $output = fopen("E:/Programming/Docs/Reg.txt", "w");
    while (!feof($input)) {
        $in_buf = fgets($input, 4096);
        if (strpos($in_buf, "<font size=+2>") !== FALSE) {
            $pos_start = strpos($in_buf, "<font size=+2>") + 14;
            $pos_end = strpos($in_buf, "</font>");
            $string = trim(substr($in_buf, $pos_start, $pos_end - $pos_start));
            //fputs($output, trim($in_buf)."\n");
        } elseif (strpos($in_buf, "<b>Description :</b>") !== FALSE) {
            $pos_start = strpos($in_buf, "<b>Description :</b>") + 20;
            $desc = 1;
            $string .= ' *****'.trim(substr($in_buf, $pos_start));
        } elseif ($desc == 1) {
            if ((strpos($in_buf, "<br>") == FALSE) && (strpos($in_buf, "<p>") == FALSE)) {
                $string .= trim($in_buf);
            } elseif ((strpos($in_buf, "<br>") !== FALSE)) {
                $pos_end = strpos($in_buf, "<br>");
                $string .= trim(substr($in_buf, 0, $pos_end).'*****');
                fputs($output, trim($string)."\n\n");
                $desc = 0;
            } elseif ((strpos($in_buf, "<p>") !== FALSE)) {
                $pos_end = strpos($in_buf, "<p>");
                $string .= trim(substr($in_buf, 0, $pos_end)).'*****';
                fputs($output, trim($string)."\n\n");
                $desc = 0;
            }
        }
    }
    fclose($input);
    fclose($output);
    and a sample of the output:
    Airplane Spin *****The attacker lifts the victim up across theirshoulders. The attacker starts spinning around quickly a few timesto dizzy the victim, then drops them to the mat. *****

    Airplane Spin Toss *****The victim is lifted up over the attacker'sshoulder so the victim is facing upwards and their back is heldover the shoulder of the attacker. The attacker holds the victim inplace and spins around a few times, then tosses the victim into theair dropping them back first to the mat.*****

    Arm Breaker<b>Used by :</b>*****

    Arm Breaker, Fireman's Carry<b>Used by :</b> CW Anderson*****

    Thanks in advance for any help.

  2. #2
    Herbster
    Join Date
    Oct 2001
    Location
    Bay City, Oregon
    Posts
    715
    I haven't played with your code. This is just from reading it, so I could be off base.

    You are processing the file in 4096-byte chunks, but the variable description length suggests to me that the paragraphs do not all have the same length.

    I would probably read the entire file, explode it on <p> to create an array of paragraphs and loop through the array.
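
    The read-everything-then-explode idea might look roughly like the sketch below. It is only an illustration under assumptions: the `$html` string is a hypothetical stand-in for `file_get_contents()` on the real file, the `$moves` array is an invented name, and the markers are taken from the sample in the first post.

    ```php
    <?php
    // Hypothetical sample; in practice this would come from
    // file_get_contents("E:/Programming/Docs/Reg.htm").
    $html = implode("\n", [
        '<p><font size=+2>Airplane Spin Toss, Face First</font><br>',
        '<b>Used by :</b> Mike Enos<br>',
        '<b>AKA :</b><br>',
        '<b>Description :</b> The attacker lifts up the victim over their',
        'shoulder as if for a body slam.</p>',
        '<p><font size=+2>Arm Breaker</font><br>',
        '<b>Used by :</b><br>',
        '<b>AKA :</b><br>',
        "<b>Description :</b> The attacker has the victim's arm in a",
        'wristlock.</p>',
    ]);

    $marker = "<b>Description :</b>";
    $moves = [];
    foreach (explode("<p>", $html) as $para) {
        // Skip fragments that carry no move name (e.g. text before the first <p>).
        if (!preg_match('~<font size=\+2>(.*?)</font>~s', $para, $name)) {
            continue;
        }
        $pos = strpos($para, $marker);
        if ($pos === false) {
            continue; // no description in this paragraph
        }
        // Everything after the marker is the description: drop the tags,
        // then fold line breaks and runs of whitespace into single spaces.
        $desc = strip_tags(substr($para, $pos + strlen($marker)));
        $moves[trim($name[1])] = trim(preg_replace('/\s+/', ' ', $desc));
    }
    print_r($moves);
    ```

    Because each paragraph is handled as one string, a description that wraps over several lines or ends in `</p>` no longer needs special handling. The sketch does assume every entry starts with `<p>`.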

  3. #3
    SitePoint Enthusiast
    Join Date
    Apr 2001
    Posts
    63
    I thought of that, but the problem is that the person who created the site wasn't consistent, so sometimes there are p's and other times just br's.

  4. #4
    Herbster
    Join Date
    Oct 2001
    Location
    Bay City, Oregon
    Posts
    715
    In that case, I would read the complete file and drop the paragraphs/sections with substr() as I process them.
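
    One way to read that suggestion is sketched below. It treats the opening font tag as the section boundary, so it does not matter whether an entry is closed by `<br>` or `</p>`. The sample input and the names `$rest`, `$section` and `$records` are assumptions made for illustration, not part of the original script.

    ```php
    <?php
    // Hypothetical sample mixing a <br>-terminated entry with a <p>-wrapped one.
    $rest = implode("\n", [
        '<font size=+2>Airplane Spin</font><br>',
        '<b>Used by :</b> Mike Rotunda<br>',
        '<b>AKA :</b> <br>',
        '<b>Description :</b> The attacker lifts the victim up across their',
        'shoulders. <br>',
        '<br>',
        '<p><font size=+2>Arm Breaker</font><br>',
        '<b>Used by :</b><br>',
        '<b>AKA :</b><br>',
        "<b>Description :</b> The attacker has the victim's arm in a",
        'wristlock.</p>',
    ]);

    $open = "<font size=+2>";
    $marker = "<b>Description :</b>";
    $records = [];

    while (($start = strpos($rest, $open)) !== false) {
        // Drop everything up to and including the opening font tag.
        $rest = substr($rest, $start + strlen($open));
        // The section runs to the next font tag, or to the end of the input.
        $next = strpos($rest, $open);
        $section = ($next === false) ? $rest : substr($rest, 0, $next);

        $nameEnd = strpos($section, "</font>");
        $descPos = strpos($section, $marker);
        if ($nameEnd === false || $descPos === false) {
            continue; // malformed section: skip it
        }
        $name = trim(substr($section, 0, $nameEnd));
        $desc = strip_tags(substr($section, $descPos + strlen($marker)));
        $desc = trim(preg_replace('/\s+/', ' ', $desc));
        $records[] = $name . ' *****' . $desc . '*****';
    }
    echo implode("\n\n", $records), "\n";
    ```

    Each pass shortens `$rest`, so the loop always terminates, and the name and description are cut from the same section, which keeps one entry's text from leaking into the next.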

    I suspect you may be splitting a search string by operating on uniform 4096-byte chunks.
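
    Whether a marker can actually be split depends on how `fgets()` cuts its input: it stops at a newline or after length - 1 bytes, whichever comes first. The sketch below makes that visible with a deliberately tiny limit of 16. The in-memory stream and the sample text are assumptions for illustration only.

    ```php
    <?php
    // Hypothetical two-line input held in memory instead of a file on disk.
    $text = "<b>Description :</b> short line\n" . str_repeat("x", 20) . "\n";
    $h = fopen("php://memory", "w+");
    fwrite($h, $text);
    rewind($h);

    $chunks = [];
    // With a limit of 16, fgets() returns at most 15 bytes per call,
    // and fewer when it reaches a newline first.
    while (($buf = fgets($h, 16)) !== false) {
        $chunks[] = $buf;
    }
    fclose($h);
    print_r($chunks);
    ```

    Here the first chunk ends partway through the marker, so a `strpos()` test on that chunk fails. With a limit of 4096 the same thing would happen only on a line longer than 4095 bytes; shorter lines come back whole.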

