Extending preg_match_all()

Hi all,

I’ll try to explain this concisely; hopefully this will suffice for now.

Basically, I am writing a parser. It already works and produces a valid RSS 2.0 file.

The next step is to go beyond the custom tags I already have in my PHP web page and add more. I can then optionally use these, for example, to expand (append to) the <description> </description> XML data.

Every potential RSS item in my PHP web page is wrapped in <rss_content_item> </rss_content_item>, so with a simple preg_match_all() I can quickly find out how many I have (e.g. 9, 11, 25) and then focus later searches on those items instead of the whole file (to boost speed).

Now, as mentioned above, I want to add some other tags. The problem is that not every <rss_content_item></rss_content_item> features these tags; it depends on the given RSS item.

My parser reads through the entire file, finds all instances of all custom tags and writes them to an array I have made (a class). Once that is fully populated, it writes everything to a physical .xml file.

It works wonderfully if you assume each RSS item has all the tags (i.e. title, description, link etc.) and each tag occurs only once per item. This is indeed true for all standard tags (those needed to make a valid RSS XML file) but not for the additional tags I’m throwing in (i.e. those that provide additional data that may or may not find its way into the RSS XML file).

Finding and copying the data in between these additional tags is fine. The problem is knowing which rss_item contains the additional tags, and how many instances of them it contains.

My dilemma stems from the fact that preg_match_all() returns just one array of all instances found across all rss_items (see $content_items), with no indication of where each piece of data comes from (which rss_item contains it).
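
A reduced example of what I mean (the sample markup and the $page and $m variable names are invented for illustration and are much simpler than my real page):

```php
<?php
// Hypothetical sample input: two items, only the second one has extra tags.
$page = "<rss_content_item><rss_content_title>A</rss_content_title></rss_content_item>\n"
      . "<rss_content_item><rss_content_title>B</rss_content_title>"
      . "<rss_content_description_extra>x</rss_content_description_extra>"
      . "<rss_content_description_extra>y</rss_content_description_extra>"
      . "</rss_content_item>";

// Searching the whole page at once gives one flat list of captures.
$pattern = '/<rss_content_description_extra>(.*)<\/rss_content_description_extra>/Us';
preg_match_all($pattern, $page, $m);

// $m[1] now holds 'x' and 'y' in a single flat array.
```

Both values belong to the second item, but nothing in the result says so.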

Any ideas how to solve this puzzle? Without knowing what goes where I can’t populate $rss_content and ultimately write it to an XML file. Thanks!

[B]
class RSSContent
{
	public $rssTitle;
	public $rssDescription;
	public $rssHasExtra;
	public $rssDescriptionExtra;
	public $rssPubDate;
	public $rssLink;
	public $rssGUID;
	public $rssAuthor;
	public $rssCategory;
}

Here’s some code to give you an idea how it’s working so far:

//holds all RSS data from source file
$rss_content = new RSSContent();

//open source and destination files
$rss_source_file = fopen($rss_from_file, "r") or die("can't open file [SOURCE]");
$rss_write_file = fopen($rss_to_file, "a") or die("can't open file [DESTINATION]");

//read the entire contents of source file into one string buffer
$source_file_contents = "";
while (($line = fgets($rss_source_file)) !== false)
{
	//append each line; plain assignment would keep only the last line read
	$source_file_contents .= $line;
}
fclose($rss_source_file);

////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// RSS_CONTENT_ITEM
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
//U makes the match ungreedy; s lets . span line breaks inside an item
$match_pattern = '/[\\r\\n]*<rss_content_item>[\\r\\n]*(.*)[\\r\\n]*<\\/rss_content_item>/Us';
//one array element per <rss_content_item> found in the file
$content_items = get_all_content_between($source_file_contents, $match_pattern);

////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// ---> RSS_CONTENT_TITLE
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
$match_pattern = '/[\\r\\n]*<rss_content_title>[\\r\\n]*(.*)[\\r\\n]*<\\/rss_content_title>/Us';
//search all items at once; this assumes exactly one title per item
$content_title = get_all_content_between(implode(" ", $content_items), $match_pattern);
$array_count = sizeof($content_title);
for ($i = 0; $i < $array_count; $i++)
{
	//$rss_content is an object, so use property access rather than array keys
	$rss_content->rssTitle[$i] = $content_title[$i];
}

etc…
[/B]

Seriously? You’re using RegExp to parse XML? :confused:

Is there any particular reason you’re not using any of the myriad of tools designed to read XML natively?

I’m not parsing XML. I’m parsing a PHP file (or any web page, even one containing HTML) and using the data within it to make an XML RSS 2.0 file.

The DOM method didn’t work so I just went and wrote my own method.

It all works fine if each RSS item (however many there are in the web page) has the same required set of tags and each tag occurs only once. If an RSS item in the web page has 3 sets of the same tags, each containing different data (e.g. prices or product codes, whatever it may be), it becomes more complicated.

How so? Given your described problem and my interpretation of what I think you’re trying to do, some more formal tools (like DOM as mentioned) would seem more appropriate at first glance.
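
As a rough sketch of what I mean by DOM: the sample markup and the $html and $extras names below are invented, it assumes PHP's DOM extension is enabled, and I haven't run it against your actual page, so treat the tag layout as an assumption. Walking the tree item by item keeps every extra tag attached to the item it came from:

```php
<?php
// Hypothetical sample page: two items, only the second has extra tags.
$html = '<html><body>'
      . '<rss_content_item><rss_content_title>A</rss_content_title></rss_content_item>'
      . '<rss_content_item><rss_content_title>B</rss_content_title>'
      . '<rss_content_description_extra>x</rss_content_description_extra>'
      . '<rss_content_description_extra>y</rss_content_description_extra>'
      . '</rss_content_item></body></html>';

// Custom tags are not valid HTML, so collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// $extras[$i] lists the extra descriptions found inside item number $i.
$extras = [];
foreach ($doc->getElementsByTagName('rss_content_item') as $i => $item) {
    $extras[$i] = [];
    foreach ($item->getElementsByTagName('rss_content_description_extra') as $extra) {
        $extras[$i][] = $extra->textContent;
    }
}
```

An item with no extra tags simply ends up with an empty array, so there is no need for a separate "has extra" flag.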

If you’re stuck (for whatever reason) on using regular expressions then we could do with more details of precisely what you want to achieve. Your code is of an odd structure and not particularly conducive to easily being understood (at 11pm on a Saturday), as well as hiding key parts (like a function definition), which does not help.

Yes. I didn’t want to paste the whole code as it’s about 250-300 lines long, which would probably put you guys off even replying.

Not sure why DOM didn’t work, but it was frustrating and it seemed to equally puzzle some others in another thread here, so I dropped it. Looking back, I believe I would have run into this problem anyway (DOM would just mean fewer lines of code, maybe faster execution…).

Anyway, I believe I have finally cracked it, which is surprising considering how tired I am.

I’ve stripped each RSS item from $content_items into its own buffer. I then iterate through these buffers, checking whether each contains a tag saying this RSS item will have more data. If I find it, I set $rssHasExtra for that RSS item to 1; otherwise it gets 0. This now works because each buffer I search represents exactly one RSS item.
Having the above, I can now go through only the RSS items where $rssHasExtra == 1. For those I simply look for any extra-type tags (always ending with _extra>) and add them at positions [y], [y+1], [y+2] etc.

Once that’s all in place I just check which RSS item has additional data and then choose how to write it to the RSS 2.0 XML file (e.g. append the info to the title or description tags).
[B]
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// ---> RSS_CONTENT_EXTRA
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
$array_count = sizeof($content_items);
$match_pattern = '/[\\r\\n]*<rss_content_extra>[\\r\\n]*(.*)[\\r\\n]*<\\/rss_content_extra>/Us';
for ($i = 0; $i < $array_count; $i++)
{
	//each $content_items[$i] is the buffer for exactly one rss item
	$current_line = $content_items[$i];
	$content_extras = get_all_content_between($current_line, $match_pattern);

	//if $content_extras returns anything we know this content item has extra data
	if (sizeof($content_extras) == 0)
	{
		$rss_content->rssHasExtra[$i] = 0;
	}
	else
	{
		$rss_content->rssHasExtra[$i] = 1;
	}
}

////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// ---> RSS_CONTENT_DESCRIPTION_EXTRA
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
$match_pattern = '/[\\r\\n]*<rss_content_description_extra>[\\r\\n]*(.*)[\\r\\n]*<\\/rss_content_description_extra>/Us';
for ($i = 0; $i < $array_count; $i++)
{
	//only process the following if this content item has extra data
	if ($rss_content->rssHasExtra[$i] == 1)
	{
		$current_line = $content_items[$i];
		$content_extras_description = get_all_content_between($current_line, $match_pattern);

		//go through all found <rss_content_description_extra> for this content item and add them to $rss_content
		$array_count2 = sizeof($content_extras_description);
		for ($c = 0; $c < $array_count2; $c++)
		{
			$rss_content->rssDescriptionExtra[$i][$c] = $content_extras_description[$c];
		}
	}
}

[/B]
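
For anyone who finds this later, here is the same idea boiled down to a self-contained sketch. The sample markup is made up, tag_pattern() is a hypothetical convenience function, and the get_all_content_between() shown here is a simplified stand-in (a thin wrapper around preg_match_all()), not my full function:

```php
<?php
// Simplified stand-in for my helper: returns every capture of group 1, trimmed.
function get_all_content_between($text, $pattern)
{
    preg_match_all($pattern, $text, $matches);
    return array_map('trim', $matches[1]);
}

// Hypothetical convenience function: builds the pattern for one custom tag.
// (.*?) is lazy and the s modifier lets . match line breaks.
function tag_pattern($tag)
{
    $tag = preg_quote($tag, '/');
    return '/<' . $tag . '>(.*?)<\/' . $tag . '>/s';
}

// Invented sample page: two items, only the second has extra tags.
$page = "<rss_content_item>\n<rss_content_title>First</rss_content_title>\n</rss_content_item>\n"
      . "<rss_content_item>\n<rss_content_title>Second</rss_content_title>\n"
      . "<rss_content_description_extra>Price: 10</rss_content_description_extra>\n"
      . "<rss_content_description_extra>Code: AB1</rss_content_description_extra>\n"
      . "</rss_content_item>\n";

// Step 1: one buffer per rss item.
$content_items = get_all_content_between($page, tag_pattern('rss_content_item'));

// Step 2: search each buffer separately, so the index $i identifies the item.
$extras_per_item = [];
foreach ($content_items as $i => $item) {
    $extras_per_item[$i] = get_all_content_between(
        $item,
        tag_pattern('rss_content_description_extra')
    );
}
```

Because each search runs against exactly one item, $extras_per_item[$i] answers both questions at once: whether item $i has extra tags (the array is non-empty) and how many (its count).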