Those funky Europeans and their crazy accents

Hi Guys and Girls,

I am having some fun with some Spanish text. I have tried converting it to HTML entities and replacing the funky letters.

No joy.

Ayuda por favor!



$feed_url = 'http://www.infojobs.net/trabajos.rss/kw_english/';
$xml = simplexml_load_file($feed_url);

foreach ($xml->channel->item as $item)
{
	$job_title       = htmlentities($item->title);
	$job_link        = htmlentities($item->link);
	$job_description = htmlentities($item->description);
	$job_pubDate     = htmlentities($item->pubDate);

	// try to swap the accented letters for plain ones
	$search  = array('á','ñ','ó');
	$replace = array('a','n','o');

	$job_title       = str_replace($search, $replace, $job_title);
	$job_description = str_replace($search, $replace, $job_description);

	echo "<p>$job_title <br/> $job_link <br/>$job_description <br/>$job_pubDate <br/></p>";

	// check to see if it is already in the DB
	$query = "SELECT job_link FROM job WHERE job_link = '$job_link' AND job_pubDate = '$job_pubDate'";
	echo "$query <br/>";
	$result = mysql_query($query);
	$rows = mysql_num_rows($result);

	// one matching row is enough to count as a duplicate
	if ($rows > 0) {
		echo "<h1>Job is already in the Database</h1>";
	}
	else {
		echo "Adding job $job_title<br>";
		$insert = "INSERT into job (job_title, job_link, job_description, job_pubDate) values('$job_title','$job_link','$job_description','$job_pubDate')";
		echo "$insert <br/>";
		mysql_query($insert);
	}
}

Why would you want to replace those characters?
Can’t you just use UTF-8 and save them in the DB as is?
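
A minimal sketch of that idea, assuming the feed text is valid UTF-8 and the page is served as UTF-8. The sample string and the variable names are made up for illustration. Only the HTML-special characters get escaped, so the accents survive untouched:

```php
<?php
// Hypothetical sample title, written as UTF-8 byte escapes so the example
// does not depend on the encoding of this source file: "Diseñador <b>web</b>".
$title = "Dise\xC3\xB1ador <b>web</b>";

// htmlspecialchars() escapes only &, <, > and quotes. Unlike htmlentities()
// it leaves accented letters alone, so nothing needs replacing afterwards.
$safe_title = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');

echo $safe_title, "\n";
```

For the database side the connection would need to use UTF-8 as well (for instance by sending `SET NAMES utf8` after connecting), otherwise the bytes can get reinterpreted on the way in.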

The reason it fails is that after htmlentities() the accented letters no longer exist in their original form, so str_replace() has nothing left to match. You have to call str_replace() before htmlentities(). Here is a function I use to convert almost all European accents to their non-accented counterparts. For just those three Spanish letters you can skip the long strtr() altogether; the first three lines of the function should cope with them just fine. The function expects a UTF-8 string:


// Save this file as UTF-8: the keys below are literal UTF-8 characters.
function noAccents($str) {
	// turn accented letters into named entities, then keep only the base letter
	$str = htmlentities($str, ENT_NOQUOTES, 'UTF-8');
	$str = preg_replace('/&(.)(acute|caron|cedil|circ|ring|tilde|uml);/', '$1', $str);
	$str = html_entity_decode($str, ENT_NOQUOTES, 'UTF-8');

	// letters without a matching named entity are mapped by hand
	return strtr($str,
		array(
			'ą' => 'a',   // Polish
			'ć' => 'c',
			'ę' => 'e',
			'ł' => 'l',
			'ń' => 'n',
			'ś' => 's',
			'ź' => 'z',
			'ż' => 'z',
			'Ą' => 'A',
			'Ć' => 'C',
			'Ę' => 'E',
			'Ł' => 'L',
			'Ń' => 'N',
			'Ś' => 'S',
			'Ź' => 'Z',
			'Ż' => 'Z',
			'ß' => 'ss',  // German
			'č' => 'c',   // Czech
			'ď' => 'd',
			'ě' => 'e',
			'ň' => 'n',
			'ř' => 'r',
			'ť' => 't',
			'ů' => 'u',
			'ž' => 'z',
			'Č' => 'C',
			'Ď' => 'D',
			'Ě' => 'E',
			'Ň' => 'N',
			'Ř' => 'R',
			'Ť' => 'T',
			'Ů' => 'U',
			'Ž' => 'Z',
			'ĺ' => 'l',   // Slovak
			'ľ' => 'l',
			'ŕ' => 'r',
			'Ĺ' => 'L',
			'Ľ' => 'L',
			'Ŕ' => 'R',
			'đ' => 'd',   // Croatian
			'Đ' => 'D',
			'ő' => 'o',   // Hungarian
			'ű' => 'u',
			'Ő' => 'O',
			'Ű' => 'U',
			'ă' => 'a',   // Romanian
			'ș' => 's',
			'ş' => 's',
			'ț' => 't',
			'ţ' => 't',
			'Ă' => 'A',
			'Ș' => 'S',
			'Ş' => 'S',
			'Ț' => 'T',
			'Ţ' => 'T',
			'Ğ' => 'G',   // Turkish
			'İ' => 'I',
			'ğ' => 'g',
			'ı' => 'i',
			'à' => 'a',   // French
			'è' => 'e',
			'ù' => 'u',
			'À' => 'A',
			'È' => 'E',
			'Ù' => 'U',
		)
	);
}
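
To see why the order matters, here is a small sketch with a made-up sample string (written as UTF-8 byte escapes so it does not depend on the file encoding). It assumes htmlentities() is told the text is UTF-8. It first repeats the mistake from the original script, then swaps the order, then runs the entity trick from the top of noAccents() on its own. The variable names are only for illustration:

```php
<?php
// "Año más fácil" as UTF-8 bytes.
$text = "A\xC3\xB1o m\xC3\xA1s f\xC3\xA1cil";

// The letters to strip, also as UTF-8 bytes: á, ñ, ó.
$search  = array("\xC3\xA1", "\xC3\xB1", "\xC3\xB3");
$replace = array('a', 'n', 'o');

// Wrong order: htmlentities() has already turned the letters into entities,
// so str_replace() finds nothing to replace.
$encoded = htmlentities($text, ENT_NOQUOTES, 'UTF-8');
$wrong   = str_replace($search, $replace, $encoded);
echo $wrong, "\n";   // A&ntilde;o m&aacute;s f&aacute;cil

// Right order: replace first, then encode.
$right = htmlentities(str_replace($search, $replace, $text), ENT_NOQUOTES, 'UTF-8');
echo $right, "\n";   // Ano mas facil

// The entity trick: drop the accent name from each entity and keep the
// base letter, then decode whatever entities are left.
$short = preg_replace('/&(.)(acute|caron|cedil|circ|ring|tilde|uml);/', '$1', $encoded);
$short = html_entity_decode($short, ENT_NOQUOTES, 'UTF-8');
echo $short, "\n";   // Ano mas facil
```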

@tangledman: That RSS file wrongly declares its charset as ISO-8859-15, so my first attempt, converting it from that to ISO-8859-1, got me nowhere. When I looked at the browser output and noticed the two-byte character representation, I realized the RSS is actually in UTF-8. I then used the code below to convert the descriptions to ISO-8859-1.

Note that you don’t have to convert any ISO-8859-1 characters if you plan to add the texts to your database, because the following conversion outputs everything in plain ASCII.

<?php
$url = "http://www.infojobs.net/trabajos.rss/kw_english/";
$xml = simplexml_load_file($url);

// convert every child element of every item from UTF-8 to ISO-8859-1
foreach ($xml->channel->item as $job) {
	foreach ($job as $key => $val) {
		$job->$key = iconv("UTF-8", "ISO-8859-1//TRANSLIT", (string) $val);
	}
}

print_r($xml);
?>
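
The two-byte observation can also be checked in code rather than by eye. This is only a sketch: it assumes the mbstring extension is available, and the helper name looksLikeUtf8() and the sample bytes are hypothetical, not taken from the feed:

```php
<?php
// Hypothetical helper: true when the string is valid UTF-8 and actually
// contains multi-byte sequences.
function looksLikeUtf8($bytes)
{
    // Pure ASCII is valid UTF-8 too, so also require a byte above 0x7F.
    return mb_check_encoding($bytes, 'UTF-8')
        && preg_match('/[\x80-\xFF]/', $bytes) === 1;
}

$utf8   = "Espa\xC3\xB1a";  // "España" in UTF-8: ñ takes two bytes
$latin1 = "Espa\xF1a";      // the same word in ISO-8859-1: ñ is one byte

var_dump(looksLikeUtf8($utf8));    // bool(true)
var_dump(looksLikeUtf8($latin1));  // bool(false)

// 7 bytes but only 6 characters, which is the two-byte giveaway
echo strlen($utf8), " bytes, ", mb_strlen($utf8, 'UTF-8'), " characters\n";
```

A lone high byte such as 0xF1 followed by a plain letter is not a valid UTF-8 sequence, which is why the Latin-1 sample fails the check.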

Thanks for the help… I've learnt a bit trying to crack this one.