I have a series of regex patterns (entries in a log for various processes, function calls, etc.). All of them carry timing data (some say “executed in XXX ms”, some say “took XXX milliseconds”, etc.).
So I have an associative array of patterns, for example: $regex = array('PATTERNA' => "~This is a (\d+) pattern~", 'PATTERNB' => "~This is another (\d+) pattern~");
For each line of the log (must it be done this way? It’s a very large log, and I wouldn’t trust fread not to slice the lines into pieces if I don’t do it line by line):
If the line matches any of the patterns, record its timing data (\d+) in the -APPROPRIATE- category.
So I’m doing a preg_replace_callback on each line. But how would I determine which pattern the item matched?
I don’t really understand the problem - isn’t fgets() enough to read a file line by line?
As to your original question: if you want to extract the timing data and identify which pattern matched, why don’t you combine all your patterns into one and use preg_match()? For example:
$matches_count = preg_match('~This is a (\\d+) pattern|This is another (\\d+) pattern~', $str, $matches);
And then depending on which pattern was matched you will have your number either in $matches[1] or $matches[2] and so on. If you have many patterns you can use the x modifier to make your pattern look readable by splitting it into separate lines and possibly adding some comments.
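A minimal sketch of that idea, using the x modifier and named groups so the group name tells you which alternative matched. The group names (a, b) and the helper identifyMatch() are made up for illustration, and the PREG_UNMATCHED_AS_NULL flag assumes PHP 7.2 or later:

```php
<?php
// Combined pattern in extended (x) mode: whitespace is ignored, so literal
// spaces are escaped, and "#" starts a comment that runs to the end of the line.
$combined = '~
      This\ is\ a\ (?<a>\d+)\ pattern          # first message format
    | This\ is\ another\ (?<b>\d+)\ pattern    # second message format
~x';

// Hypothetical helper: returns [group name, number] or null when nothing matched.
function identifyMatch(string $pattern, string $line): ?array
{
    // PREG_UNMATCHED_AS_NULL reports groups that did not take part as null.
    if (preg_match($pattern, $line, $m, PREG_UNMATCHED_AS_NULL) !== 1) {
        return null;
    }
    foreach ($m as $name => $value) {
        // Named groups also appear under numeric keys; only look at the names.
        if (is_string($name) && $value !== null) {
            return [$name, (int) $value];
        }
    }
    return null;
}
```

Calling identifyMatch($combined, $line) on each log line then gives both the category and the timing value in one step.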
I don’t have control of the software on the system.
Yes, I was just saying I don’t feel comfortable pulling more than a single line at a time, even though reading bigger chunks could potentially save time on IO. This is a massive, massive file.
That makes sense… if I enforce X subpatterns per pattern (artificially if need be), then… I can do some math to find the pattern: floor((max_key_in_matches - 1) / X) = key_in_pattern_array…
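Roughly what that index math could look like, as a sketch only. It assumes every pattern has exactly X capturing groups that all take part in a match (here X = 1), and the helper name matchedPattern() is hypothetical:

```php
<?php
// Patterns without delimiters; each has exactly $groupsPerPattern capturing groups.
$patterns = [
    'PATTERNA' => 'This is a (\d+) pattern',
    'PATTERNB' => 'This is another (\d+) pattern',
];
$groupsPerPattern = 1;

// Build the OR-pattern once by imploding on "|".
$combined = '~' . implode('|', $patterns) . '~';
$names = array_keys($patterns);

// Hypothetical helper: returns [pattern name, number] or null when nothing matched.
function matchedPattern(string $combined, array $names, int $x, string $line): ?array
{
    if (preg_match($combined, $line, $m) !== 1) {
        return null;
    }
    // preg_match() drops trailing unmatched groups, so the highest key in $m
    // belongs to the alternative that matched.
    $maxKey = count($m) - 1;
    $index = intdiv($maxKey - 1, $x); // floor((maxKey - 1) / X)
    return [$names[$index], (int) $m[$maxKey]];
}
```

If the last group of a pattern is optional and does not take part in the match, the highest key no longer lines up and the math breaks, which is why the groups have to be enforced.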
I originally started going down the path described in this thread, but decided to deviate a bit.
A bit more explanation of my scenario:
Currently running on my desktop; this script will eventually be moved to a server that can process faster, however running it on the desktop allowed me to see more clearly how various strategies worked.
Parsing a 2 GB file.
Searching for 22 independent regex strings.
Combining all the strings as an OR-regex pattern (Imploding on “|”): Script execution time 969 seconds.
Singular regex pattern scan (Looking for a single message): Script execution time ~20-30 seconds, depending on pattern.
Hmmm… 30*22 = 660, which is well under 969…
Sum total of running entire script as a foreach of the patterns array: 470 seconds. Much faster…
Recalling that foreach is a looping construct, and is therefore subject to break;
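// fgets() returns false at end of file, so testing its result avoids
// running the patterns against a bogus final "line".
while (($logline = fgets($file)) !== false) {
    foreach ($patterns as $patternname => $pattern) {
        $pattern = "~" . $pattern . "~"; // For ease of readability my array does not include delimiters.
        if (preg_match($pattern, $logline, $matches)) {
            // Output handling here. $patternname now identifies the message that was matched.
            break; // Skip the remaining patterns for this line.
        }
    }
}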
Total execution time for all 22 messages: 360 seconds. Just over 1/3 the time of the original attempt!
Could probably be further optimized by sorting the pattern array by frequency of occurrence.
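One way that reordering might be done, as a sketch. The hit counts here are invented; in practice they could come from a first pass over a sample of the log. The helper name sortByHits() is hypothetical:

```php
<?php
// Hypothetical helper: returns $patterns reordered so that the patterns with
// the most recorded hits are tried first. Patterns with no count go last.
function sortByHits(array $patterns, array $hits): array
{
    uksort($patterns, function ($a, $b) use ($hits) {
        // Descending order by hit count.
        return ($hits[$b] ?? 0) <=> ($hits[$a] ?? 0);
    });
    return $patterns;
}

$patterns = [
    'PATTERNA' => 'This is a (\d+) pattern',
    'PATTERNB' => 'This is another (\d+) pattern',
    'PATTERNC' => 'Yet another (\d+) pattern',
];
$hits = ['PATTERNA' => 3, 'PATTERNB' => 120]; // made-up counts for illustration
$sorted = sortByHits($patterns, $hits);
```

With the break in the loop, the common messages would then exit after one or two preg_match() calls. How much that saves depends on how skewed the message frequencies actually are.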
That is an excellent evaluation, @StarLion.
Thanks for providing the detailed analysis for the benefit of others who may encounter a similar dilemma.
Intuitively, I am not surprised the lengthy OR process was very slow. It seems like a lot of processing effort to evaluate each-and-every expression in the list that way.
As you indicated, utilizing the loop’s capability to break provided a means of “shortcut logic” which optimized the process.
I was fairly sure that the massive OR would be slow… I was not expecting it to be twice as slow as parsing the entire file 22 times looking for a single message each time, however.
Was just browsing around for a bit and found the S modifier for preg_match. Sounds like it could help a lot in your case.
“When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching. If this modifier is set, then this extra analysis is performed. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.”
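For anyone who wants to try it, a sketch of how the modifier would be applied: it goes after the closing delimiter, and the delimited patterns can be built once before the read loop. The helper name withStudy() is made up. Whether it changes the timings above is untested here, and as far as I know the PHP manual says the flag has no effect as of PHP 7.3.0, so any gain would only show up on older versions:

```php
<?php
$patterns = [
    'PATTERNA' => 'This is a (\d+) pattern',
    'PATTERNB' => 'This is another (\d+) pattern',
];

// Hypothetical helper: wrap a pattern body in delimiters and add the S modifier.
function withStudy(string $body): string
{
    return '~' . $body . '~S';
}

// Build the delimited patterns once, outside the per-line loop.
// array_map() with a single array keeps the string keys.
$studied = array_map('withStudy', $patterns);
```

The loop would then call preg_match($studied[$patternname], $logline, $matches) without concatenating delimiters on every line.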
Thanks for posting your results, it was interesting to learn how these various methods perform. I’m still curious if the S modifier mentioned by ScallioXTX would change anything.