Removing duplicate filenames

I have been able to produce a file containing all the filenames in a path, with data such as

Folder Compare
Produced: 11/01/10 05:07:56 PM

Mode: All
Base folder: ~/.kde/share/apps/kmail/mail
Name Size CRC Name Size CRC

1232364633.13472.s7ZYS:2,S 19,370 0003C905 >>
1233650469.8190.4bQ5o:2,S 974 0005586C >>
1233650255.8190.qAdNh:2,S 4,104 000835C0 >>
1233650291.8190.5f5ud:2,S 1,275 000A3AD5 >>
1233650301.8190.TCuxJ 2,308 000B6FA3 >>

Because the report that produced the file was sorted by CRC, I’d like to simply open the file and read it line by line; where the CRC is the same as on the previous line, I would add the details to an array. After reading the whole report, I would either display the array or generate code from it and write that to a bash file (for example, ‘rm filename’ commands).
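
As a rough illustration of that plan, here is a minimal sketch. It assumes the report lines have already been parsed into arrays with 'name', 'size' and 'crc' keys and are still in CRC order. The function names find_duplicates and build_rm_script, the sample records and the /home/user/mail path are all made up for the example. Note that a matching CRC only suggests identical content, so comparing sizes as well (or reviewing the script before running it) would be prudent.

```php
<?php
// Sketch only: $records is assumed to be parsed already and sorted by CRC,
// each entry shaped like ['name' => ..., 'size' => ..., 'crc' => ...].
function find_duplicates(array $records): array
{
    $duplicates = [];
    $previousCrc = null;
    foreach ($records as $record) {
        // Same CRC as the previous line: remember this one as a duplicate.
        if ($previousCrc !== null && $record['crc'] === $previousCrc) {
            $duplicates[] = $record;
        }
        $previousCrc = $record['crc'];
    }
    return $duplicates;
}

// Turn the duplicates into the text of a bash script of rm commands.
// $baseFolder is assumed to be an absolute path, because a quoted "~"
// is not expanded by the shell.
function build_rm_script(array $duplicates, string $baseFolder): string
{
    $lines = ['#!/bin/bash'];
    foreach ($duplicates as $record) {
        $lines[] = 'rm -- ' . escapeshellarg($baseFolder . '/' . $record['name']);
    }
    return implode("\n", $lines) . "\n";
}

// Made-up records, only to show the call sequence.
$records = [
    ['name' => 'first.msg',  'size' => 974,  'crc' => '0005586C'],
    ['name' => 'second.msg', 'size' => 974,  'crc' => '0005586C'],
    ['name' => 'third.msg',  'size' => 4104, 'crc' => '000835C0'],
];
echo build_rm_script(find_duplicates($records), '/home/user/mail');
```

Keeping the first file of each CRC group and listing only the later ones means one copy always survives.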

Here is the php code so far


<?php
$handle = @fopen("~/Documents/temp/Report.txt", "r");
if ($handle) {
    // fgets() returns false at end of file, so test its result directly
    while (($buffer = fgets($handle, 4096)) !== false) {
        echo $buffer;
        $pieces = explode(" ", $buffer);
        print_r($pieces);
    }
    fclose($handle);
}
?>

The echo and print_r are just to see how the data looks to php. Now, the explode does this

Array
(
[0] => 1233650161.8190.VtijH
[1] =>
[2] =>
[3] =>
[4] =>
[5] =>
[6] =>
[7] =>
[8] =>
[9] =>
[10] =>
[11] =>
[12] =>
[13] =>
[14] =>
[15] =>
[16] =>
[17] =>
[18] =>
[19] =>
[20] =>
[21] =>
[22] =>
[23] =>
[24] =>
[25] =>
[26] =>
[27] =>
[28] =>
[29] =>
[30] =>
[31] =>
[32] =>
[33] =>
[34] =>
[35] =>
[36] =>
[37] =>
[38] =>
[39] =>
[40] => 404,280
[41] =>
[42] =>
[43] => 00F267FC
[44] => >>

The filename always seems to be in element zero, but the size and CRC land in different array keys depending on the data.

How can I do the split (or similar) to get just the filename, size and CRC? I also want to skip the first 7 lines, and also skip lines like this

.Templates.index

Once I am able to get the filename, size and CRC, they can simply be stored in variables and then compared with the values from the next line read from the file.

Thanks,

Jehoshua
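
One possible approach, sketched here under assumptions rather than offered as the definitive answer: explode(" ", ...) yields an empty element for every extra space, whereas preg_split() on runs of whitespace with PREG_SPLIT_NO_EMPTY returns only the real fields. The helper name parse_report_line is invented for this example, and it assumes a data line always has the shape "name size crc" (optionally followed by ">>") with a name that starts with a digit. Header lines, the dashed separator, blank lines and entries such as .Templates.index then fall out by content, without counting the first 7 lines.

```php
<?php
// Hypothetical helper: returns name, size and CRC for a data line, or null
// for headers, separators, blank lines and dot-files such as .Templates.index.
function parse_report_line(string $line): ?array
{
    // Split on any run of whitespace and drop empty pieces.
    $fields = preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY);

    // Assumption: a data line has at least three fields and its
    // filename begins with a digit.
    if (count($fields) < 3 || !ctype_digit($fields[0][0])) {
        return null;
    }

    return [
        'name' => $fields[0],
        'size' => (int) str_replace(',', '', $fields[1]), // "404,280" becomes 404280
        'crc'  => $fields[2],
    ];
}

print_r(parse_report_line("1233650161.8190.VtijH          404,280   00F267FC >>\n"));
var_dump(parse_report_line(".Templates.index\n"));
```

Because the fields are found by position after splitting, it no longer matters how many spaces pad each column.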

Results are starting to look better; here is the code now


<?php

/*
exclude the following lines

&#65279;Folder Compare
Produced: 11/01/10 05:07:56 PM

Mode:  All
Base folder: ~/.kde/share/apps/kmail/mail
Name                                                         Size      CRC         Name Size CRC
------------------------------------------------------------------------------------------------
.Templates.index		//and other similar lines starting with a period

*/

/*
So, just check in first position, setup array
*/
$excluded_lines = array("F","P","M","B","N","-",".");	//all filenames begin with numeric, so should be safe

$handle = @fopen("~/Documents/temp/Report.txt", "r");

if ($handle) {
    // fgets() returns false at end of file, so test its result directly
    while (($buffer = fgets($handle, 4096)) !== false) {

        // check for lines we want to bypass
        $first_char = substr($buffer, 0, 1);
        if (in_array($first_char, $excluded_lines)) {
            echo "Excluded line found - " . $buffer;
        } else {
            echo $buffer;
            $words = explode(" ", $buffer);

            // print each non-empty piece on its own line
            foreach ($words as $word) {
                if ($word !== '') {
                    echo "$word\n";
                }
            }
        }
    }
    fclose($handle);
}
?>

and the output

Folder Compare
Folder
Compare

Excluded line found - Produced: 11/01/10 05:07:56 PM

Excluded line found - Mode: All
Excluded line found - Base folder: ~/.kde/share/apps/kmail/mail
Excluded line found - Name Size CRC Name Size CRC
Excluded line found - ------------------------------------------------------------------------------------------------
1232364633.13472.s7ZYS:2,S 19,370 0003C905 >>
1232364633.13472.s7ZYS:2,S
19,370
0003C905
>>

1233650469.8190.4bQ5o:2,S 974 0005586C >>
1233650469.8190.4bQ5o:2,S
974
0005586C
>>

The in_array() is not working for the first record in the file, which is

Folder Compare

The first character is “F” in that case, so the line should have been excluded. Any clues, please?

J
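
When a first-character test fails for no visible reason, one thing worth trying is to print the leading bytes of the line in hex instead of trusting what the terminal displays. The following is only a debugging sketch; leading_bytes_hex is a made-up helper name.

```php
<?php
// Hypothetical debugging helper: show the first few bytes of a line in hex,
// so bytes that do not display as visible characters can still be seen.
function leading_bytes_hex(string $line, int $count = 4): string
{
    $hex = bin2hex(substr($line, 0, $count));
    return strtoupper(implode(' ', str_split($hex, 2)));
}

// A line that really starts with "F" begins with byte 46.
echo leading_bytes_hex("Folder Compare\n"), "\n"; // prints "46 6F 6C 64"
```

If the first byte printed for the problem line is anything other than 46, the line does not actually start with "F" as far as substr() is concerned.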

A hex editor showed some ‘dirty’ bytes before the words ‘Folder Compare’:

EF BB BF

That sequence is the UTF-8 byte order mark (BOM). I removed those bytes, and it is now okay.
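
Rather than editing the report by hand each time it is regenerated, the three leading bytes (EF BB BF is the UTF-8 byte order mark) could be stripped inside the script before the first-character check. A minimal sketch, where strip_utf8_bom is an invented helper name and the idea is to apply it to the first line read:

```php
<?php
// Sketch: drop a UTF-8 byte order mark (bytes EF BB BF) if the string
// starts with one; otherwise return the string unchanged.
function strip_utf8_bom(string $line): string
{
    if (strncmp($line, "\xEF\xBB\xBF", 3) === 0) {
        return (string) substr($line, 3);
    }
    return $line;
}

$first_line = strip_utf8_bom("\xEF\xBB\xBFFolder Compare\n");
echo substr($first_line, 0, 1), "\n"; // prints "F"
```

With the mark removed, substr($buffer, 0, 1) on the first line yields "F", so the existing in_array() check would match it.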