How to scan a text file line by line for patterns?

How can I delete all paragraphs that don't contain the string msgstr ""?

I have 4000 paragraphs as shown below. How do I erase all the paragraphs like the first one and keep only those like the 2nd and 3rd?


...

#: uc_store/uc_store.module:1059,1059,1059; uc_store/uc_store.admin.inc:399,410,410,410,410,410,410,410
msgid "-Type address-"
msgstr "-mettez addresse"

#: uc_store/uc_store.module:1059,1059,1059; uc_store/uc_store.admin.inc:399,410,410,410,410,410,410,410
msgid "-Show options-"
msgstr ""

#: uc_store/uc_store.module:1059,1059,1059; uc_store/uc_store.admin.inc:399,410,410,410,410,410,410,410
msgid "-Type address-"
msgid_plural "-Type adresses-"
msgstr [0]"-mettez addresse"
msgstr [1]"-mettez addresses"
...

You could use grep with the -v option to show lines that do not match what you’re looking for.

If some regex guru can write a single line then it would be easier, I guess. But for the moment you can use the file() function to read the lines into an array, loop through all of them, and do the needful.


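// FILE_IGNORE_NEW_LINES strips the line endings, so "\n" is added back exactly once below.
$lines = file('myfile.txt', FILE_IGNORE_NEW_LINES);
$newstring = '';
foreach ($lines as $line) {
    $newstring .= str_replace('find_msg', 'replace_text', $line) . "\n";
}
// now write the $newstring to the file back.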

This needs to be asked: is the following a single paragraph with no line breaks, or is there a newline between each line?

#: uc_store/uc_store.module:1059,1059,1059; uc_store/uc_store.admin.inc:399,410,410,410,410,410,410,410
msgid "-Type address-"
msgid_plural "-Type adresses-"
msgstr [0]"-mettez addresse"
msgstr [1]"-mettez addresses"

If there is a line break between each line, then parsing them will require caching lines until a suitably understood break (such as a blank line) is reached.
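For what it's worth, here is a minimal sketch of that caching idea. It assumes paragraphs are separated by blank lines and that a paragraph counts as translated when one of its lines starts with msgstr " followed by some text. The function name dropTranslatedParagraphs is made up for illustration.

```php
<?php
// Hypothetical helper: buffer lines until a blank line ends the paragraph,
// then keep the paragraph only if no non-empty msgstr "..." line was seen.
function dropTranslatedParagraphs(iterable $lines): array
{
    $kept = [];
    $buffer = [];
    $translated = false;

    foreach ($lines as $line) {
        $line = rtrim($line, "\r\n");
        if ($line !== '') {
            // Still inside a paragraph: cache the line and inspect it.
            $buffer[] = $line;
            if (preg_match('/^msgstr ".+"/', $line) === 1) {
                $translated = true;
            }
            continue;
        }
        // Blank line: the paragraph is complete, decide what to do with it.
        if ($buffer !== [] && !$translated) {
            $kept[] = implode("\n", $buffer);
        }
        $buffer = [];
        $translated = false;
    }

    // The last paragraph may not be followed by a blank line.
    if ($buffer !== [] && !$translated) {
        $kept[] = implode("\n", $buffer);
    }

    return $kept;
}
```

Something like dropTranslatedParagraphs(file('myfile.txt')) would then return the untranslated paragraphs, which can be joined with a blank line between them and written back out.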

Dunno about grep, PMW dude, I would have to install Linux for that. Thanks! I already have some PHP to process the text, and it will be distributable to a community with 183mn text strings to translate, so it would be cool to process it in PHP.

Thanks Rajug! I'm working on it right this moment!

Yes, each part to scan and delete is definitely its own paragraph.

I am using a trick to process the text: I'm replacing all the line breaks with a token,
$adata = str_replace("\r\n", "xNEWLINEx", $data);
so I don't have to look for a line break; I can ask PHP to search for all entries contained between the strings xNEWLINExxNEWLINEx…

And after the processing, I replace xNEWLINEx with a line break again:
str_replace("xNEWLINEx", "\r\n", $data);

I would also love some PHP that says:
If the text entry between msgstr " and xNEWLINExxNEWLINEx is less than 20 characters, delete it. That would help me order the entries by length of text too.
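A rough sketch of that length rule, without the token trick. It assumes the paragraphs have already been split into an array of strings and that each one has a single-line msgstr "..." entry. All three function names are invented for this example, and strlen counts bytes, so accented characters count as more than one.

```php
<?php
// Hypothetical helper: length of the text inside the first msgstr "..." line.
function msgstrLength(string $paragraph): int
{
    if (preg_match('/^msgstr "(.*)"\s*$/m', $paragraph, $m) === 1) {
        return strlen($m[1]);
    }
    return 0; // no msgstr "..." line found
}

// Drop every paragraph whose msgstr text is shorter than $minLength.
function dropShortEntries(array $paragraphs, int $minLength = 20): array
{
    return array_values(array_filter(
        $paragraphs,
        function (string $p) use ($minLength): bool {
            return msgstrLength($p) >= $minLength;
        }
    ));
}

// Order paragraphs from shortest to longest msgstr text.
function sortByMsgstrLength(array $paragraphs): array
{
    usort($paragraphs, function (string $a, string $b): int {
        return msgstrLength($a) <=> msgstrLength($b);
    });
    return $paragraphs;
}
```

The paragraph array could come from something like preg_split('/\R{2,}/', trim($data)), which splits on blank lines whatever the line ending is.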

I suggest then that you use array_filter to easily filter the array (as read in with file).
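One catch: file() gives you one array element per line, so array_filter on that works line by line, not paragraph by paragraph. Here is a sketch that splits on blank lines first, assuming that is what separates the paragraphs. The function name untranslatedParagraphs is made up.

```php
<?php
// Hypothetical helper: split the text into paragraphs on blank lines,
// then keep only the paragraphs without a non-empty msgstr "..." line.
function untranslatedParagraphs(string $data): array
{
    // \R matches any line ending, so \R{2,} is one or more blank lines.
    $paragraphs = preg_split('/\R{2,}/', trim($data));

    return array_values(array_filter(
        $paragraphs,
        function (string $p): bool {
            return preg_match('/^msgstr ".+"/m', $p) !== 1;
        }
    ));
}
```

The result can be glued back together with implode("\n\n", ...) before writing it to a file.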

Rajug, I tried your solution, but I have an error in the foreach loop for the moment. I think I have to sleep for a bit and try again.

How about I replace all the line breaks with XNEWLX

and then I say

str_replace strings starting and ending with XNEWLXXNEWLX that contain the value msgstr ""

:slight_smile:

Does the following work as per your requirement?


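// FILE_IGNORE_NEW_LINES strips the line endings, so "\n" is added back exactly once below.
$lines = file('test.txt', FILE_IGNORE_NEW_LINES);
$new_string = "";
foreach ($lines as $line) {
	// Blank out the translation on every msgstr "..." line.
	$new_string .= preg_replace('/^msgstr "(.*)"/i', 'msgstr ""', $line) . "\n";
}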
$fp = fopen('test1.txt', 'w+');
fwrite($fp, $new_string);
fclose($fp);

I don't really understand the OP. Don't you wanna delete the second para and keep the 1st and 3rd? But at any rate, why don't you just explode by #: and check each element, unsetting where necessary?

Rajug! That works really well! Thanks so much! :slight_smile:

Hash, the reason is that I have code that translates all the empty strings with the "Google AJAX translator API", so I have to separate out a list of all the phrases that have yet to be translated. That way, when people read through the completed list of translations, the Google ones are not mixed together with the human translations.

Thanks!

Hey, sorry, the script didn't work… it just replaces all the strings msgstr ".*" with msgstr "", so it only modifies a line rather than taking away all the paragraphs with msgstr ".*".

Sorry, was bored so 'jQueried' a solution, notice the overuse of chaining. :blush:


<?php
$string = '
#: uc_store/uc_store.module:1059,1059,1059; uc_store/uc_store.admin.inc:399,410,410,410,410,410,410,410
msgid "-Type address-"
msgstr "-mettez addresse"

#: uc_store/uc_store.module:1059,1059,1059; uc_store/uc_store.admin.inc:399,410,410,410,410,410,410,410
msgid "-Show options-"
msgstr ""

#: uc_store/uc_store.module:1059,1059,1059; uc_store/uc_store.admin.inc:399,410,410,410,410,410,410,410
msgid "-Type address-"
msgid_plural "-Type adresses-"
msgstr [0]"-mettez addresse"
msgstr [1]"-mettez addresses"
';

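// Split on "#", drop every entry with a non-empty msgstr "...", then rejoin.
echo implode(
    "\n#",
    array_map(
        'trim',
        array_filter(
            explode(
                '#',
                $string
            ),
            // create_function() was removed in PHP 8; a closure does the same job.
            function ($element) {
                return 1 !== preg_match('~msgstr ".+?"~', $element);
            }
        )
    )
);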

/*
    #: uc_store/uc_store.module:1059,1059,1059; uc_store/uc_store.admin.inc:399,410,410,410,410,410,410,410
    msgid "-Show options-"
    msgstr ""
    #: uc_store/uc_store.module:1059,1059,1059; uc_store/uc_store.admin.inc:399,410,410,410,410,410,410,410
    msgid "-Type address-"
    msgid_plural "-Type adresses-"
    msgstr [0]"-mettez addresse"
    msgstr [1]"-mettez addresses"
*/

Hi, that seems fantastic, thank you so much. I am having an information overload, learning regular expressions etc…

I’m trying to test regular expressions in this way:

echo substr_count($lines, 'msgstr ".*"');
echo substr_count($lines, "~msgstr \".+?\"~");
echo substr_count($lines, 'msgstr ""');
echo substr_count($lines, '.*msgstr.*');

but the first 2 echo "0". I guess I will keep on learning.
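The zeros are expected, as far as I can tell: substr_count() searches for the literal characters it is given and knows nothing about regular expressions, while preg_match_all() returns the number of regex matches. A small sketch with a made-up two-entry sample in $text:

```php
<?php
// Made-up sample: one translated entry and one empty entry.
$text = "msgid \"a\"\nmsgstr \"bonjour\"\n\nmsgid \"b\"\nmsgstr \"\"\n";

// Literal search: the characters .* never appear in the text, so this is 0.
$literal = substr_count($text, 'msgstr ".*"');

// Regex search: counts the lines that start with a non-empty msgstr "...".
$translated = preg_match_all('/^msgstr ".+"/m', $text);

// Literal search again, this time for text that really is there.
$empty = substr_count($text, 'msgstr ""');

echo $literal, ' ', $translated, ' ', $empty, "\n"; // prints "0 1 1"
```

Note that substr_count() also expects a string, so an array from file() would need to be read with file_get_contents() instead, or joined with implode() first.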

Okay, how about like this?


$string = file_get_contents('test.txt');
$ps = explode("#:", $string);
$new_string = '';
foreach ($ps as $p) {
	// Keep only the paragraphs that have no non-empty msgstr "..." entry.
	if (!empty($p) && !preg_match_all('/msgstr "(.+?)"/is', $p, $matches)) {
		$new_string .= "#:" . $p . "\n";
	}
}
$fp = fopen('test1.txt', 'w+');
fwrite($fp, $new_string);
fclose($fp);

I am also learning regular expressions :stuck_out_tongue: