Charset declaration html5 / php

ulthane · February 23, 2012, 3:11pm

Hey everyone, I saw that HTML5 has a new way of declaring the charset. Instead of:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

I can put:

<meta charset="UTF-8">

My question is, do old browsers support that shorthand? Should I expect any issues with it?

And a PHP question: I'm trying to get some text from external links, but the text is returned as ‘???’ even if I use

header('Content-Type: text/html; charset=utf-8');

I tried another encoding (for my language, Hebrew) and it worked:

header('Content-Type: text/html; charset=windows-1255');

But the question is whether the two will conflict with each other (the utf-8 declared in the HTML and the windows-1255 declared in PHP).

Thanks for the help,
ulthane.

Michael_Morris1 · February 23, 2012, 5:50pm

It is far more effective and far more efficient to declare HTTP headers in the actual header of the document rather than use the http-equiv tags. When applied to the content type in particular, those tags are a joke: by the time the browser reaches the tag it has already chosen a charset and language. At best you waste the client’s time restarting the page render. At worst the client happily ignores your tag (and most browsers do).

You already know how to set the headers in PHP. The http-equiv tags are redundant and unnecessary.

ulthane · February 23, 2012, 8:59pm

So I'm not sure I understood: are you saying I should delete the meta tag completely and only use the PHP header declaration? It then won't be visible to the client (I don't know if that has any downsides or not).

Michael_Morris1 · February 24, 2012, 2:11pm

… HTTP 101 …

A document transmitted via the HTTP protocol has two sections: a header and a body. The way modern browsers work, you never see the headers, but they are there. These are the response headers for Google.


Date: Fri, 24 Feb 2012 13:59:47 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=UTF-8
Content-Encoding: gzip
Server: gws
Content-Length: 22147
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN

200 OK

After this information comes the MIME encoded body of the document. If it’s text it will be relatively readable.

PHP can control, through the header() function, the contents of any of these lines. This allows you to modify the response code, caching and so on. Whatever you don’t populate, your webserver program populates for you according to its own settings.

meta http-equiv tags will, in theory, override these properties. But it’s more efficient to pass the correct desired value in the header in the first place. Also, the content-type header cannot be changed after rendering of the document has started, so http-equiv="Content-Type" is useless and will be ignored. The same applies to the Content-Encoding and Content-Length properties. Meta http-equiv tags are primarily used for setting specific caching rules in otherwise static HTML documents, and they are quite effective in that role.
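
For anyone following along, here is a minimal sketch of sending the charset in the real HTTP header from PHP. The helper name content_type_header() is made up for illustration; header() and headers_sent() are the standard PHP functions. The only assumption is that nothing has been echoed yet, since header() has to run before any output.

```php
<?php
// Hypothetical helper: builds the Content-Type header line for a charset.
function content_type_header(string $charset = 'UTF-8'): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() only works before any output has been sent to the client,
// so guard the call with headers_sent().
if (!headers_sent()) {
    header(content_type_header());
}
```

With this at the very top of the script, the browser gets the charset before it sees any markup.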

ulthane · February 24, 2012, 5:36pm

Hey, thanks for the information, I understand now. I've got another question though: is there any way to change the encoding for only part of a PHP script (like inside one function only)?
I'm using UTF-8 for my website, but I must use windows-1255 to get page titles from external links, because UTF-8 always returns ‘???’

Any clue? Or a workaround?

Michael_Morris1 · February 24, 2012, 5:41pm

Character coding must be uniform for the file.

ulthane · February 24, 2012, 9:54pm

Isn't there any way to get external page titles without being dependent on the encoding?

Michael_Morris1 · February 25, 2012, 2:23am

$_SERVER['REQUEST_URI'] holds the file name the outside world is asking for.

Jeff_Mott · February 25, 2012, 4:57am

ulthane, if I understood your latest request correctly, after your script downloads content from some external source, you’ll then need to detect and convert its encoding (http://www.php.net/manual/en/function.mb-convert-encoding.php).
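
A rough sketch of that detect-and-convert step, under a few assumptions: the remote page's charset is taken from its Content-Type header value, the function names charset_from_content_type() and to_utf8() are invented for this example, and iconv() is used instead of the linked mb_convert_encoding() only because which charset labels each one accepts depends on the PHP build.

```php
<?php
// Hypothetical helper: pull the charset label out of a Content-Type value,
// e.g. "text/html; charset=windows-1255". Returns null if none is declared.
function charset_from_content_type(string $contentType): ?string
{
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Hypothetical helper: convert downloaded bytes to UTF-8.
// If the charset is unknown or the conversion fails, the input is returned as-is.
function to_utf8(string $bytes, ?string $charset): string
{
    if ($charset === null || $charset === 'UTF-8') {
        return $bytes;
    }
    // Assumption: the local iconv implementation recognises the label.
    $converted = @iconv($charset, 'UTF-8', $bytes);
    return $converted === false ? $bytes : $converted;
}
```

The idea is that the site itself stays UTF-8 everywhere, and only the downloaded text is converted once, right after it is fetched.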

ulthane · February 25, 2012, 9:18am

Hey Jeff, thanks for the answer. However, I noticed that from 2 different pages with the same encoding I get different results (one as ‘???’ and the other as normal), so I guess it was not an encoding issue, or at least it will be hard to detect and fix.
So I just check the returned title with a preg_match, and if it doesn't contain the right characters I'm looking for, it is named “link”. If anyone is interested in the solution, here it is. It works fine but it's a little bit slow.

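function get_page_title($url)
{
	// Some sites reject requests that do not send a browser-like user agent.
	ini_set('user_agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11');
	$doc = new DOMDocument();
	// loadHTMLFile() returns false on failure; @ silences warnings about bad markup.
	if (!@$doc->loadHTMLFile($url))
		return 'link';
	$xpath = new DOMXPath($doc);
	// item(0) is null when the page has no <title> element.
	$node = $xpath->query('//title')->item(0);
	$title = $node ? trim($node->nodeValue) : '';
	// Fall back to 'link' for empty titles or anything besides ASCII letters, digits and spaces.
	if ($title == '' || preg_match('/[^a-z0-9 ]/i', $title))
		return 'link';
	return $title;
}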
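
One thing to note is that the check above also turns every Hebrew title into 'link'. If Hebrew titles should be kept once they arrive as valid UTF-8, a possible variant of the check could look like this. It is only a sketch: title_is_acceptable() is a made-up name, and it assumes the title string is UTF-8 and that PCRE was built with Unicode property support.

```php
<?php
// Hypothetical check: accept titles made of Hebrew letters, ASCII letters,
// digits and spaces. The "u" modifier makes PCRE treat the input as UTF-8;
// preg_match() returns false for invalid UTF-8, which is rejected here too.
function title_is_acceptable(string $title): bool
{
    return preg_match('/\A[\p{Hebrew}a-z0-9 ]+\z/iu', $title) === 1;
}
```

Garbled titles such as '???' and byte strings that are not valid UTF-8 are still rejected, so the 'link' fallback keeps working for those.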