Question about htmlentities

Hi Guys!

I am using phpwordlib to extract text from Word documents (.doc files). Basically, phpwordlib converts the Word document to plain text. I am then trying to convert the plain text into HTML. Now I’ve run into a slight problem: some weird characters are being shown, like the one below.

Weird character:

�

The PHP function I am using is:


$plaintext = $phpwordlib->GetPlainText($contents);
$plaintext = htmlentities($plaintext);

It seems as though the hyphen from the plain text is being turned into �.

Any idea how to convert the plain text into HTML (the proper way :-))?

Word document (plain text):


Being an experienced & focused Senior Account Manager, with the diligence to thrive under pressure, has excellent communication and negotiation skills.  Results orientated and target driven, will use initiative, creativity and show positive organisational skills.  With sound commercial acumen, definite motivational ability, hunger for success and a determined approach, will add value to the operation of an organisation.





BT Plc							(Apr 08 – date)
Recruited as New Media Senior Sales Consultant
Achieved 134% against 1st month’s target
‘Pioneering role’ for a new on-line search marketing channel driving new sales within the SME market 

Thomson Directories Ltd				(May 07 – Apr 08)
Re-recruited as a Senior Account Manager
Achieved 105% to target in 1st Campaign
Account management for clients with a combined annual spend of £350k+ - primary focus on retention, growth & new customer acquisition.  Multi-product sales whilst being responsible for developing Account Representative Channel with planning & field accompaniment

Yellow Pages Sales Ltd				(Apr 05 – May 07)
2007 Cycle (Q1 – 4)	 New Media Portfolio Achiever
2006 Cycle (Q1 – 4))	 New Media Portfolio Achiever
Achieved personal sales objectives, client growth & secured new clients.  Created & led the regional activity regarding a key & specific business objective.  Enhanced performance by developing abilities to sell multi product solutions

Thomson Directories Ltd			(Feb 04 – Apr 05)
£4,644 over target in first Campaign
118% to Q1 target in 2005 cycle 
Lowest cancellation rate (0.83%) in Regional Office
Passionate for securing new business sales, also, effectively managed existing accounts.  Development of solution based selling techniques

Oldroyd Publishing Ltd			(Aug 03 – Feb 04)
1st from initial training course to ‘sign a deal’
Qualified for ‘super-bonus’ in 1st week
Management responsibility of local community projects; developed relationships with clients & end-users alike.  Platform for re-entry into advertising market





Electricity Direct (UK) Ltd			(Feb 02 – May 03)
2nd National sales position for period 2002 – 2003
Regional Account Manager of the week on 10 occasions
Top Regional Account Manager over 4 continuous quarters
5 out of 5 quarterly targets surpassed – achieving 150% of target

Viterra / Energy Management Services	(Mar 97 – Jan 02)
National account development & direct sales successes
Introduced customised data & client monitoring and targeting solutions
Effective team management




James E James (Liverpool) Ltd		(Oct 96 – Mar 97)
Recruited, trained & led a telesales operation bordering on closure
Improved regional sales four-fold within the first quarter

Scope	(formerly the Spastics Society)	(Apr 96 – Oct 96)
Effective management of income & profitability targets
Planned, proposed & introduced initiatives to increase profitability
Ensured a team of 14 worked effectively & productively





Yorkshire Post Magazines Ltd		(1995 – 1996)
Commercial Manager of a regional business publication
Recruited, managed & motivated a sales team of twelve
Improved sales revenues (new & existing business)
Effective budgetary control
Positively transformed the image & content of the title

Business Magazine Group			(1991 – 1995 Field Sales)
						(1989 – 1990 Telesales)
Successes in both classified & display advertisement sales
Effective advertisement design, layout & presentation




	
Nottinghamshire Constabulary		(1981 – 1989)
Developed strong interview techniques
Confidently dealt with all sectors of society
Built complex and long-term investigation & research skills
Ensured honesty, integrity & a professional attitude and outlook was conveyed at all times







 				Training with a Difference
NVQ Level 4 Management
TDLB Assessor Unit Award (D32/D33)

1990 – 1991			Derby Tertiary College
				Cambridge Information Technology Certificates
				‘Access’ to Higher Education Certificate

1981 – 1982			Ashfield Education Centre
				2 GCE ‘O’ Level passes
Sociology
Psychology

1976- 1981			Quarrydale Comprehensive School
				5 GCE ‘O’ Level passes
Mathematics
History
English Language
English Literature
French





Date of Birth			15th March 1965

Marital Status		Married

Children			Three
Eireann, 15yrs
Tiarnan, 9yrs
Lochlan, 4yrs

Nationality			British

Driving Licence		Full – no points or convictions

NI Number			NE603874B

Availabilty			1 month’s notice
	MEDIA (Advertising Sales)			Aug 03 – date

UTILITIES (B2B)				1997 - 2003

CHARITIES					1996 - 1997

MEDIA / PUBLISHING			1989 - 1996

POLICE FORCE				1981 - 1989

EDUCATION & QUALIFICATIONS

	PERSONAL INFORMATION


Thanks in advance.

If it were a plain hyphen it wouldn’t be causing this difficulty, so it is probably an en dash.

The character should display correctly in the HTML, provided the web page is displayed using a character set that includes that character.

Try changing the charset for your page to UTF-8.

Hi - thanks for the suggestion. I changed the charset to UTF-8, but I still get the same problem.

<meta http-equiv="content-type" content="text/html; charset=utf-8">

Have you tried setting the appropriate Content-Type header too?
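For anyone unsure what that looks like in PHP, here is a minimal sketch. The helper name `contentTypeHeader` is made up for illustration; the only real API used is `header()` and `headers_sent()`. The HTTP header generally takes precedence over the `<meta>` tag, so the two should agree.

```php
<?php
// Hypothetical helper: builds the Content-Type header line for a charset.
function contentTypeHeader(string $charset = 'UTF-8'): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() has to run before any output is sent to the browser.
// Skipped on the command line, where there is no HTTP response.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header(contentTypeHeader());
}
```

This only changes how the browser decodes the page. If the string coming out of the extractor is not actually UTF-8, the odd character will still show up.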

Microsoft Word uses its own characters for certain things, like the double hyphen (en dash), “smart quotes”, etc.

When I’m simply trying to get something copied from a Word document into plain text, I run it through something like this:


    // Map Word's typographic characters (and a few raw bytes) to plain ASCII.
    $find = array('#“#', '#”#', '#…#', '#–#', '#’#', '#‘#', '#\\xBD#', '#\\200#', '#(\\242|\\342)#');
    $replace = array('"', '"', '...', '--', "'", "'", '1/2', '*', ' ');
    $text = preg_replace($find, $replace, $text);

It catches the most common weird characters, but not all of them.
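A variant of the same idea that avoids pasting the special characters literally is to match them by byte value. This is only a sketch, and it assumes the extracted text is single-byte Windows-1252 (the thread never confirms what phpwordlib emits). The function name `cp1252PunctuationToAscii` is made up for illustration.

```php
<?php
// Hypothetical helper: replaces common Windows-1252 punctuation bytes
// with ASCII equivalents. Assumes $text is Windows-1252, NOT UTF-8.
function cp1252PunctuationToAscii(string $text): string
{
    static $map = [
        "\x85" => '...', // horizontal ellipsis
        "\x91" => "'",   // left single quote
        "\x92" => "'",   // right single quote / apostrophe
        "\x93" => '"',   // left double quote
        "\x94" => '"',   // right double quote
        "\x96" => '--',  // en dash
        "\x97" => '--',  // em dash
    ];

    // strtr() with an array does all replacements in a single pass.
    return strtr($text, $map);
}

echo cp1252PunctuationToAscii("(Apr 08 \x96 date)"), "\n";
```

Do not run this on UTF-8 input: those same byte values occur inside multi-byte UTF-8 sequences and would be corrupted.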

Hi aamonkey,

I tried your code but it doesn’t seem to be finding the strange characters. I think we are on the right lines though. It’s missing this character ‘#–#’ for some reason…

Oh yeah, I didn’t even think about the filters this forum has in place to do the exact same thing, essentially stripping out the weird character I pasted. I will try to get you something bulletproof tomorrow morning.

Ok thanks dude! :)

Give this a shot:

$html = mb_convert_encoding($html, 'HTML-ENTITIES', mb_detect_encoding($html));

I get this error:

Warning: mb_convert_encoding() [function.mb-convert-encoding]: Illegal character encoding specified in /var/www/vhosts/demo.mydomain.com/httpdocs/classes/main.class.php on line 886

Eeesh…it sounds like phpwordlib is not even returning a string with valid character encoding…
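One plausible cause of that warning is that `mb_detect_encoding()` could not identify the string and returned `false`, which then got passed on as the encoding name. A rough sketch of guarding against that is below; the helper name `detectEncodingOrNull` and the candidate list are assumptions for illustration, not anything phpwordlib documents.

```php
<?php
// Hypothetical helper: returns the detected encoding name, or null when
// none of the candidates fits, so the caller never passes false onward.
function detectEncodingOrNull(string $text, array $candidates = ['UTF-8', 'Windows-1252']): ?string
{
    // The third argument enables strict mode: only encodings in which
    // the whole string is valid are accepted.
    $detected = mb_detect_encoding($text, $candidates, true);

    return $detected === false ? null : $detected;
}

$encoding = detectEncodingOrNull("(Apr 08 \x96 date)");
if ($encoding === null) {
    echo "Could not detect the encoding, so pick one explicitly\n";
}
```

Detection is a heuristic, so the result is a guess to check by eye rather than something to rely on.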

Might be time to look for something else. Sorry I don’t have any better ideas at the moment.

edit:

I guess you could try taking some stabs in the dark with that same code…i.e.:


$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');


$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'ISO-8859-1');

etc. - maybe you’ll get lucky and one will convert what you need. Here’s a list of all the valid character encodings you could try.
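Note that the `'HTML-ENTITIES'` target for mbstring is deprecated as of PHP 8.2. An alternative sketch is to convert to UTF-8 and then escape with `htmlspecialchars()`. This assumes the source text is Windows-1252, which is a guess based on the en dash symptom; the function name `plainTextToHtml` is made up for illustration.

```php
<?php
// Hypothetical helper: $sourceEncoding is whatever the extractor really
// emits (assumed here to be Windows-1252, not confirmed by the thread).
function plainTextToHtml(string $text, string $sourceEncoding = 'Windows-1252'): string
{
    // Normalise to UTF-8 first so the page charset and the data agree.
    $utf8 = mb_convert_encoding($text, 'UTF-8', $sourceEncoding);

    // Escape only the HTML-significant characters. ENT_SUBSTITUTE swaps
    // any invalid sequence for U+FFFD instead of returning an empty string.
    $escaped = htmlspecialchars($utf8, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // Keep the plain text's line breaks visible in the browser.
    return nl2br($escaped);
}

echo plainTextToHtml("Experienced & focused (Apr 08 \x96 date)"), "\n";
```

With the page served as UTF-8, the en dash can go out as a real character, so there is no need to turn it into an entity.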

That worked a treat, thanks mate :)