jQuery Removing Bad Characters in HTML

I previously wrote about using jQuery to Strip All HTML Tags From a Div. If you instead want to remove all bad characters from an HTML string (which may have come from a $.getScript() call or similar), here is an easy way to clean it up. This is useful when you receive HTML from somewhere and want to .match() against it, but the .match() fails or throws an error because of bad characters. We can do this with a regex and still retain our HTML tags, like so:

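//clean up string/HTML (remove bad chars but keep html tags)
//the ^ must come first inside [...] so the class means "anything NOT listed"
rawData = rawData.replace(/[^<>a-zA-Z 0-9]+/g,'');
If we want to be more specific, we can also keep other characters that commonly appear in markup (slashes, quotes, underscores, plus and minus signs, equals signs), so that closing tags and attributes survive:
//clean up HTML ready to be used with match() statement
//the hyphen is escaped so it is a literal "-" rather than a range
rawData = rawData.replace(/[^\/"_+\-<>=a-zA-Z 0-9]+/g,'');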
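
To see what a whitelist pattern of this kind does in practice, here is a minimal sketch. The sample string and the variable names are made up for illustration, and the pattern is written with the caret first and the hyphen escaped.

```javascript
// Hypothetical sample input: a byte order mark, a zero-width space and
// curly quotes are mixed into otherwise ordinary markup.
var rawSample = '\uFEFF<p class="note">Hello\u200B world \u201Cquoted\u201D</p>';

// Negated character class: remove every character that is NOT whitelisted.
var cleanedSample = rawSample.replace(/[^\/"_+\-<>=a-zA-Z 0-9]+/g, '');

console.log(cleanedSample); // <p class="note">Hello world quoted</p>
```

Keep in mind that a whitelist like this also strips legitimate punctuation and non-English letters, so it suits pre-processing for pattern matching rather than content you intend to display.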

cleanHTML() Function

I wrote this little function to help with the process of cleaning up the HTML so it is ready for running regex on it.
/* clean up HTML for use with .match() statement or regex */
var JQUERY4U = {};
JQUERY4U.UTIL = 
{
	cleanUpHTML: function(html) {
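		//replace every single quote (a string pattern would replace only the first one)
		html = html.replace(/'/g,'"');
		//strip everything outside the whitelist; "-", "[" and "]" are escaped so they are literals
		html = html.replace(/[^\/"_+\-?!<>\[\]{}()=*.|a-zA-Z 0-9]+/g,'');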
		return html;
	}
}
//usage: 
var cleanedHTML = JQUERY4U.UTIL.cleanUpHTML(htmlString);
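
As a rough sketch of the clean-then-match workflow, the snippet below cleans a fetched string and then pulls out an attribute value with .match(). The helper name cleanForMatch, the sample markup and the href pattern are all assumptions made for this example; the helper is a standalone stand-in for the same idea, not a replacement for the function above.

```javascript
// Hypothetical standalone helper: normalise quotes, then apply a whitelist.
function cleanForMatch(html) {
  return html
    .replace(/'/g, '"')
    .replace(/[^\/"_+\-?!<>\[\]{}()=*.|a-zA-Z 0-9]+/g, '');
}

// Made-up input containing a byte order mark and a zero-width space.
var fetched = "\uFEFF<a href='/docs/page.html'>Docs\u200B</a>";
var cleaned = cleanForMatch(fetched);

// .match() returns null when nothing matches, so guard before indexing.
var hrefMatch = cleaned.match(/href="([^"]*)"/);
var href = hrefMatch ? hrefMatch[1] : null;

console.log(href); // /docs/page.html
```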

Frequently Asked Questions (FAQs) about Removing Bad Characters in HTML

What are the common bad characters in HTML and how do they affect my code?

Bad characters in HTML are usually non-printable characters that can cause issues in your code. They can lead to unexpected results, such as breaking the layout, causing encoding errors, or even making your webpage unresponsive. Some common bad characters include zero-width spaces, non-breaking spaces, and other invisible characters. These characters can be accidentally inserted into your code when copying and pasting from different sources, especially from word processors.

How can I identify bad characters in my HTML code?

Identifying bad characters in your HTML code can be challenging due to their invisible nature. However, you can use various tools and techniques to spot them. For instance, you can use a text editor with a ‘show invisible characters’ feature. Alternatively, you can use online tools or scripts that highlight or remove these characters.
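
One possible script of this kind is sketched below. The function name findSuspectChars and the sample string are invented for illustration, and the definition of "suspect" (anything outside printable ASCII except tab, line feed and carriage return) is an assumption that will also flag legitimate accented or non-Latin letters.

```javascript
// Hypothetical helper: list every character outside printable ASCII,
// ignoring tab (9), line feed (10) and carriage return (13).
function findSuspectChars(text) {
  var found = [];
  for (var i = 0; i < text.length; i++) {
    var code = text.charCodeAt(i); // UTF-16 code unit at position i
    var isPrintableAscii = code >= 32 && code <= 126;
    var isCommonWhitespace = code === 9 || code === 10 || code === 13;
    if (!isPrintableAscii && !isCommonWhitespace) {
      var hex = code.toString(16).toUpperCase().padStart(4, '0');
      found.push({ index: i, code: code, hex: 'U+' + hex });
    }
  }
  return found;
}

// Made-up sample: a byte order mark at the start and a non-breaking space.
var report = findSuspectChars('\uFEFF<p>Hi\u00A0there</p>');
console.log(report);
```

Because charCodeAt() works on UTF-16 code units, a character outside the Basic Multilingual Plane (an emoji, for example) shows up as two entries.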

How can I remove bad characters using jQuery?

jQuery itself has no dedicated method for cleaning strings. Instead, you use JavaScript's native string methods on the HTML string, which works the same whether or not jQuery is loaded. For instance, you can use the replace() method combined with a regular expression to target and remove specific characters. Here’s a basic example:

var str = "your HTML string";
str = str.replace(/bad character/g, "");

In this code, “bad character” should be replaced with the actual character you want to remove.

Why is the character ‘65279’ appearing in my HTML?

The number 65279 is the decimal code point of the Unicode character U+FEFF, the zero-width no-break space, which also serves as the byte order mark (BOM). It’s often inserted at the start of files by certain text editors or picked up when copying and pasting from word processors. This character can cause issues in your HTML, such as breaking the layout or causing encoding errors. You can remove it using the methods described in the previous question.
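
As a small sketch of that removal, the following targets this one character by its escape sequence. The sample string and variable names are made up for illustration.

```javascript
// Made-up sample string that starts with U+FEFF (decimal 65279).
var withBom = '\uFEFF<h1>Title</h1>';
console.log(withBom.charCodeAt(0)); // 65279

// \uFEFF in the regex matches that exact character wherever it occurs.
var withoutBom = withBom.replace(/\uFEFF/g, '');
console.log(withoutBom.charCodeAt(0)); // 60, the code for "<"
```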

How can I prevent bad characters from being inserted into my HTML in the first place?

The best way to prevent bad characters from being inserted into your HTML is to always write your code directly in a text editor designed for coding, such as Sublime Text or Atom. These editors won’t insert invisible characters like word processors do. Also, be careful when copying and pasting code from other sources, as this can also introduce bad characters.

Can bad characters affect the SEO of my website?

Yes, bad characters can potentially affect the SEO of your website. They can cause encoding errors that make your webpage unresponsive or difficult to crawl for search engine bots. This can negatively impact your site’s ranking in search engine results.

Are there any other ways to remove bad characters besides using jQuery?

Yes, there are several other ways to remove bad characters from your HTML. For instance, you can use PHP’s preg_replace() function, or Python’s re.sub() function. These functions work similarly to JavaScript’s replace() method, using regular expressions to target and remove specific characters.

How can I remove all non-printable characters in a string?

You can remove all non-printable characters in a string using a regular expression that targets these characters. Here’s an example using JavaScript:

var str = "your string";
str = str.replace(/[^ -~]+/g, "");

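This code will remove all characters that are not in the range of printable ASCII characters (from space to tilde). Note that this also removes tabs, line breaks, and every non-ASCII character, including accented letters.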
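
If you would rather keep tabs and line breaks, one option is to add them to the character class. The sketch below compares the two patterns on a made-up string; the variable names are illustrative only.

```javascript
// Made-up sample: a line feed, a tab, a zero-width space and a bell character.
var text = 'line one\n\tline\u200B two\u0007';

// Strict pattern: printable ASCII only, so the line feed and tab go too.
var strict = text.replace(/[^ -~]+/g, '');

// Variant: also keep tab, line feed and carriage return.
var keepWhitespace = text.replace(/[^ -~\t\n\r]+/g, '');

console.log(JSON.stringify(strict));         // "line oneline two"
console.log(JSON.stringify(keepWhitespace)); // "line one\n\tline two"
```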

What is a zero-width no-break space and how can I remove it?

A zero-width no-break space is a non-printable Unicode character that takes up no space but prevents line breaks. It can cause issues in your HTML, such as breaking the layout or causing encoding errors. You can remove it using the methods described in the previous questions.

Can bad characters cause issues in programming languages as well as in HTML?

Yes, bad characters can cause issues in any programming language. They can lead to unexpected results, such as breaking your code or causing encoding errors. The methods to remove them vary depending on the language, but usually involve using some form of string manipulation or regular expression.

Sam Deering

Sam Deering has 15+ years of programming and website development experience. He was a website consultant at Console, ABC News, Flight Centre, Sapient Nitro, and the QLD Government and runs a tech blog with over 1 million views per month. Currently, Sam is the Founder of Crypto News, Australia.
