Detect and remove BOM?

I have an application where users can write their own configuration files. These are plain text files with one entry per line. I load the config file using file(), which puts it into an array, with each array element being a line in the file. This is nice to work with.

I need to compare the text in each line with a string, and the problem is that the first line might start with a BOM, e.g. if the user saved the config file as UTF-8 in Notepad. So in the first string comparison the two strings might look the same to the naked eye, but to PHP they aren’t: string + BOM is three bytes longer than string (even though these three bytes are invisible).

Apparently this will be fixed for PHP 6, but for now how can I deal with this problem? I have no idea what encoding the text file will be saved as. Starting each file with a carriage return works, but it’s too much to ask of the users.

Ah, you got me going! I think I figured it out though:

<?php
// Removes the BOM (byte order mark) from a file, if present
function bomStrip($path, $output)
{
	$bufsize = 65536;
	$utf8bom = "\xef\xbb\xbf";

	$inf = fopen($path, 'r');
	$outf = fopen($output, 'w');

	// Read the first three bytes; write them out only if they are not the BOM
	$buf = fread($inf, strlen($utf8bom));
	if ($buf != $utf8bom)
	{
		fwrite($outf, $buf);
	}
	if ($buf == "")
	{
		exit(); // empty input; note that exit() ends the whole script
	}
	while (true)
	{
		$buf = fread($inf, $bufsize);
		if ($buf == "")
		{
			exit(); // end of input reached
		}
		fwrite($outf, $buf);
	}
}
?>

I did a quick test and it seemed to work. Let me know how it goes for you.

Thanks Hamish, that gave me the idea to simply use trim():

trim($lines[0], "\xef\xbb\xbf")

$lines is what file('config.txt') returns. Very simple and clean with trim() and it works well.
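One caveat with the trim() approach: its second argument is a list of individual bytes, not a sequence, so it strips any of those three bytes from either end of the line. A line that begins with some other character whose UTF-8 encoding starts with 0xEF (a full-width letter, for example) would lose its first byte. A minimal sketch of a stricter check, assuming the BOM can only appear at the very start of the first line; the helper name stripUtf8Bom is hypothetical and not from this thread:

```php
<?php
// Hypothetical helper: remove a leading UTF-8 BOM, and only an exact BOM.
function stripUtf8Bom(string $line): string
{
    $bom = "\xEF\xBB\xBF";

    // Compare the first three bytes against the full BOM sequence.
    if (strncmp($line, $bom, 3) === 0) {
        return substr($line, 3);
    }

    // No BOM: return the line untouched.
    return $line;
}
```

It could be applied as `$lines[0] = stripUtf8Bom($lines[0]);` once file() has returned a non-empty array.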

I’m curious about your function, though. I don’t really understand it. The first ‘if’ will output the first 3 bytes of $inf to $outf (when there’s no BOM). In the next ‘if’, if the file is empty, exit the script (makes sense, but why not make it the first ‘if’?). This means $outf will only be three bytes long. Then comes an infinite loop where $inf is read in chunks of 64 KB. Again, if empty, exit the script (why again?). Then write the current 64 KB chunk of $inf to $outf (overwriting what was there). So $outf will only contain the last 64 KB of $inf.

I’m sure I’m wrong, but I’d like to know why as I’m pretty confused by it.

Hey, that’s an even better solution!

The function is a rough translation of some Python code I found (for the same purpose). So to be honest, I don’t fully understand some of the bits myself; but I was in a rush to finish it (just leaving work, had to catch a train ;p) and since the first test worked, I just posted it as is.

I ran over it again, and I believe it works as I expected. Consecutive fwrite calls don’t (at least in my tests) overwrite previous data, they append.
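For anyone who wants to call the function from a larger script, here is a sketch of a variant that returns instead of calling exit() and closes both handles. It is an illustration under assumptions, not the version tested above: the name bomStripCopy and the boolean return value are made up for this example, and the copy loop is replaced by PHP's stream_copy_to_stream().

```php
<?php
// Hypothetical variant: copy $path to $output, dropping a leading UTF-8 BOM.
// Returns false if either file cannot be opened, true otherwise.
function bomStripCopy(string $path, string $output): bool
{
    $utf8bom = "\xEF\xBB\xBF";

    $inf = fopen($path, 'rb');
    if ($inf === false) {
        return false;
    }
    $outf = fopen($output, 'wb');
    if ($outf === false) {
        fclose($inf);
        return false;
    }

    // Read the first three bytes; write them back only if they are not the BOM.
    $head = fread($inf, strlen($utf8bom));
    if ($head !== false && $head !== $utf8bom) {
        fwrite($outf, $head);
    }

    // Copy the rest of the input, starting from the current read position.
    stream_copy_to_stream($inf, $outf);

    fclose($inf);
    fclose($outf);
    return true;
}
```

Opening in binary mode ('rb'/'wb') is a choice made here so the bytes are copied unchanged on every platform.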

Well, anyhow, I’m glad it gave you the hint you needed. :)

Oh yeah, of course, fwrite appends. I got confused - was thinking of fopen() in w mode which truncates the file to zero.
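The difference between the two can be checked with a throwaway file. This is only a small sketch; the temporary file and the variable names are made up for the demonstration:

```php
<?php
// Create a scratch file in the system temp directory.
$path = tempnam(sys_get_temp_dir(), 'bom');

// 'w' truncates the file once, at the moment it is opened.
$h = fopen($path, 'w');
fwrite($h, "first ");
fwrite($h, "second"); // continues at the current position, so nothing is overwritten
fclose($h);
$afterTwoWrites = file_get_contents($path);

// Opening the same file in 'w' mode again discards the earlier contents.
$h = fopen($path, 'w');
fwrite($h, "third");
fclose($h);
$afterReopen = file_get_contents($path);

unlink($path);
```

Here $afterTwoWrites holds both chunks in order, while $afterReopen holds only what was written after the second fopen().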

Anyway, thanks again. :)