How to remove non-UTF-8 characters?

Hi all :),
I am having a problem with non-UTF-8 characters being stored in and read from a database; they show up as �.
For example, I get � for spaces.
When I check the database it’s just a space, but when displayed in HTML it’s �!
In particular, when � is at the end, it does not go away when I trim().

It would be great to:

  1. detect all non-UTF-8 characters
  2. convert/replace them

Thanks for any advice.

Haha, I just had this problem. Here’s a nice function that strips accents by going through HTML entities (it assumes the input is ISO-8859-1):

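    function special_chars($str)
    {
        // Turn accented ISO-8859-1 characters into named entities, e.g. é becomes &eacute;
        $str = htmlentities($str, ENT_COMPAT, 'iso-8859-1');
        // Keep only the base letter of each entity, e.g. &eacute; becomes e
        $str = preg_replace('/&(.)(acute|cedil|circ|lig|grave|ring|tilde|uml);/', '$1', $str);
        return $str;
    }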

Thanks :) I will give this a try…

Your problem is that you are either not storing the data as Unicode, manipulating the string with PHP functions that are not Unicode-aware, or sending it to HTML without declaring the proper encoding. I assume it is the latter: a missing encoding declaration.
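
If the missing encoding declaration is the cause, something like the following might help. It is only a minimal sketch, assuming the page is generated by PHP and the strings really are UTF-8; `html_escape_utf8` is a made-up helper name, not something from this thread.

```php
<?php
// Tell the browser the page is UTF-8. This must happen before any output,
// so only attempt it while headers can still be sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Hypothetical helper: escape text for HTML while telling PHP the input is UTF-8.
// ENT_SUBSTITUTE swaps invalid byte sequences for U+FFFD instead of returning ''.
function html_escape_utf8(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

echo '<meta charset="utf-8">', "\n";
echo '<p>', html_escape_utf8("caf\xC3\xA9 & cr\xC3\xA8me"), '</p>', "\n";
```

The `<meta>` tag is a second hint for cases where the HTTP header gets lost, for example when the page is saved to disk.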

“Your problem is that you are not storing as Unicode”
That is correct :).
Is there a way to detect the current encoding and, if it’s not UTF-8, convert it?

It’s difficult, as you don’t really know what encoding the current string is in.

For example, if UTF-8 is stored in a latin1 table in a DB, when it comes out it’ll often be reported as latin1, even though you know it’s UTF-8.
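
To illustrate why detection is unreliable, here is a small sketch (assuming the mbstring extension is loaded; the variable names are only for illustration). The same two bytes are valid in both encodings, so a validity check alone cannot say which one was meant, and guessing wrong double-encodes the text.

```php
<?php
// "é" encoded as UTF-8 is the two bytes 0xC3 0xA9.
$utf8Bytes = "\xC3\xA9";

// Both checks pass: any byte sequence is valid ISO-8859-1 (latin1),
// so validity alone cannot tell the two encodings apart.
$validAsUtf8   = mb_check_encoding($utf8Bytes, 'UTF-8');
$validAsLatin1 = mb_check_encoding($utf8Bytes, 'ISO-8859-1');

// Treating the bytes as latin1 and converting them to UTF-8 double-encodes
// them: "é" turns into the two characters "Ã©" (bytes C3 83 C2 A9).
$doubleEncoded = mb_convert_encoding($utf8Bytes, 'UTF-8', 'ISO-8859-1');

var_dump($validAsUtf8, $validAsLatin1, bin2hex($doubleEncoded));
```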

You can try converting using mb_convert_encoding, but I’ve had bad experiences.
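
One cautious way to use mb_convert_encoding is to convert only strings that are not already valid UTF-8. This is a rough sketch, not a definitive fix: `to_utf8` is a hypothetical helper, and the ISO-8859-1 fallback is an assumption about where the bad bytes came from. It also shows why trim() may leave a trailing character behind: a latin1 non-breaking space (byte 0xA0) is not in trim()'s default list.

```php
<?php
// Hypothetical helper: return $text as valid UTF-8.
// Assumption: anything that is not already valid UTF-8 is in $fallback.
function to_utf8(string $text, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave it untouched
    }
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

// "café" followed by a non-breaking space, both as latin1 bytes.
$fixed = to_utf8("caf\xE9\xA0");

// Strip leading/trailing whitespace, including U+00A0, in UTF-8 mode (/u).
$clean = preg_replace('/^[\s\x{A0}]+|[\s\x{A0}]+$/u', '', $fixed);

var_dump(bin2hex($fixed), $clean);
```

Checking validity first avoids double-encoding strings that are already fine, but it cannot rescue data that was converted wrongly before it was stored.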