Old Sep 9, 2004, 20:41   #1
Ted S
SitePoint Mentor
 
Join Date: Aug 2003
Location: Southern California
Posts: 2,730
ms word curly quotes

I seem to be having a problem when someone pastes from MS Word into a script and the curly quotes are saved to the database. Instead of showing up properly, they come out as random, unknown characters, and thus far I have been unable to parse out the quotes. Does anyone know how to do this?
Old Sep 9, 2004, 23:34   #2
mmj
Test cases complete. 0 fails.
 
Join Date: Feb 2001
Location: Melbourne Australia
Posts: 6,721
This will be due to a problem with the character encoding. Almost all character encodings share the same first 128 characters (the ASCII range), but curly quotes fall outside that range, so the bytes that represent them differ according to the character encoding used.

If you're seeing two or three nonsense characters for each occurrence of one curly quote character then it could be that you're storing it as UTF-8 but viewing it as ISO-8859-1. What's the character encoding of the page that you're viewing it on?
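
To illustrate the mismatch described above, here is a minimal sketch in Python (chosen only for illustration; the thread itself uses Perl). It encodes a left curly quote as UTF-8 and then decodes the bytes with a single-byte encoding, which is roughly what happens when UTF-8 data is viewed on an ISO-8859-1 page. Browsers commonly treat ISO-8859-1 as Windows-1252, so both are shown.

```python
# A left curly quote (U+201C) takes three bytes in UTF-8.
curly = "\u201c"
utf8_bytes = curly.encode("utf-8")
print(utf8_bytes)  # b'\xe2\x80\x9c'

# A single-byte encoding turns every byte into its own character.
as_cp1252 = utf8_bytes.decode("cp1252")   # Windows-1252
as_latin1 = utf8_bytes.decode("latin-1")  # strict ISO-8859-1

print(len(as_cp1252))  # 3 visible characters
print(len(as_latin1))  # 3 characters, two of them invisible control codes
```

In strict ISO-8859-1 two of the three characters are control codes that may not be drawn at all, which is why the number of visible junk characters can vary.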
Old Sep 10, 2004, 00:23   #3
Ted S
SitePoint Mentor
 
Join Date: Aug 2003
Location: Southern California
Posts: 2,730
The pages are standard ones using the ISO set, but I imagine MySQL (which is storing the data) is not using the same set. Is there any way to remove the characters, or to force a character set before saving the data?
Old Sep 10, 2004, 01:12   #4
mmj
Test cases complete. 0 fails.
 
Join Date: Feb 2001
Location: Melbourne Australia
Posts: 6,721
Unless you have a funky version of MySQL it will store your characters in the same way as they are input and output.

Are you viewing the characters in a browser or in phpMyAdmin or in a shell? What matters is how they appear in the browser. If they are incorrect, then in your browser go to "view" -> "encoding" and change the encoding until you find one where it looks right. Once you have found this you will know what encoding the characters were entered in.

It's at this time that you will realise that PHP has almost nonexistent support for converting between character encodings and you may be tempted to give up on non-ASCII characters.
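
The trial-and-error step of switching encodings until the text looks right can also be done in code. The following is a rough Python sketch, not a tested recipe: it assumes the data really is UTF-8 that was decoded with the wrong single-byte encoding, and the function name `repair_candidates` and the list of guesses are made up for this example.

```python
def repair_candidates(garbled, wrong_encodings=("cp1252", "latin-1"),
                      right_encoding="utf-8"):
    """Return {guess: repaired_text} for every guess that round-trips cleanly."""
    results = {}
    for wrong in wrong_encodings:
        try:
            # Undo the wrong decode, then decode the raw bytes the right way.
            results[wrong] = garbled.encode(wrong).decode(right_encoding)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this guess does not fit the data
    return results


# Build a garbled sample: UTF-8 bytes misread as Windows-1252.
original = "It\u2019s \u201chere"
garbled = original.encode("utf-8").decode("cp1252")

for guess, text in repair_candidates(garbled).items():
    print(guess, "->", text)
```

A guess that raises an error is simply skipped, so whatever comes back is the set of encodings under which the garbled text can be explained.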
Old Sep 10, 2004, 11:00   #5
madeonmoon
SitePoint Evangelist
 
Join Date: May 2003
Location: nyc
Posts: 463
You might need to connect to MySQL with 'UTF-8' appended to the db URL:

//localhost/<yourdbname>?useUnicode=true&characterEncoding=UTF-8

Besides that, you have to make sure that your forms are set to accept data input as UTF-8 and that your output pages are set to display data as UTF-8 as well.

good luck
james
Old Sep 10, 2004, 14:37   #6
TechInterviews
SitePoint Member
 
Join Date: Aug 2004
Location: US NorthWest
Posts: 9
Check out David Wheeler's Quoter. Basically, you need to look for certain values in the submitted text and replace them with HTML &quot;
Old Sep 14, 2004, 05:25   #7
StefanH
SitePoint Enthusiast
 
Join Date: Sep 2004
Location: UK
Posts: 78
My guess would be that he's checking the database contents using the shell. The effects on a Windows shell and on a UNIX shell are different, but they stem from the same problem.

I think it is the shell's handling of the characters that is at fault, and not MySQL. MySQL defaults to Latin 1 which should be perfectly fine for your average db application in Western Europe.

As people have already said, try it out in phpMyAdmin. If that gives the same (or still an incorrect) result then you really will have to start messing about with character sets.
Old Sep 14, 2004, 18:30   #8
Ted S
SitePoint Mentor
 
Join Date: Aug 2003
Location: Southern California
Posts: 2,730
Actually the data is all being passed via web script from an html textarea to perl to mysql back to perl and output as html. I will try the quote transforming and also take a look at the actual data in mysql to see where it's going south.
Old Sep 14, 2004, 23:46   #9
Ted S
SitePoint Mentor
 
Join Date: Aug 2003
Location: Southern California
Posts: 2,730
Ok, looking at the MySQL data via ssh I am seeing some weird characters directly in the SQL table, like:

Code:
joint heritage.Â
It seems that the data is being saved or inserted with a different character set than the HTML shows... when I pasted the content for that entry it did not show the Â character, which comes from MS Word (copied and pasted).
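
One possible explanation for a stray Â, offered as a guess since the thread does not confirm it: text copied from Word sometimes carries a non-breaking space (U+00A0). Stored as UTF-8 that is the two bytes C2 A0, and read back as ISO-8859-1 the first byte shows up as Â. A small Python sketch of that assumption:

```python
nbsp = "\u00a0"  # non-breaking space (assumed to be what Word pasted)
stored = "joint heritage." + nbsp

# Save as UTF-8, then read the bytes back as ISO-8859-1.
seen = stored.encode("utf-8").decode("latin-1")

print(stored.encode("utf-8")[-2:])     # b'\xc2\xa0'
print(seen.endswith("\u00c2\u00a0"))   # True: an Â followed by the space
```

If this is what is happening, the data itself is intact and only the encoding used to read it is wrong.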
Old Sep 15, 2004, 02:04   #10
StefanH
SitePoint Enthusiast
 
Join Date: Sep 2004
Location: UK
Posts: 78
Ted,

Do you have something like phpMyAdmin or equivalent? I ask only because I'm fairly certain that the shell will display certain characters incorrectly even though the data may be ok.

I'm now out of my depth on this topic unfortunately. Good luck!
Old Sep 15, 2004, 02:25   #11
Ted S
SitePoint Mentor
 
Join Date: Aug 2003
Location: Southern California
Posts: 2,730
Looking via phpMyAdmin I can still see the misformatted characters. To be clear, this data is coming from Word (with its odd formatting) into an HTML form, saved to MySQL via Perl and then viewed again (with PHP, ssh and Perl, which all show the bad characters).
Old Sep 16, 2004, 01:24   #12
Ted S
SitePoint Mentor
 
Join Date: Aug 2003
Location: Southern California
Posts: 2,730
I've tried a few regexp lines to no avail... any other ideas on the character encoding?
Old Sep 16, 2004, 06:51   #13
rvanderh
SitePoint Member
 
Join Date: Sep 2004
Location: Massachusetts
Posts: 11
I've had to deal with the same problem recently. The curly quotes from MS Word, when pasted into an HTML textarea, appear as straight slanted quotes. Then when you submit the form and display it on a web page they should appear as the straight up-and-down quotes. Instead I was getting little square boxes.

My solution (we use CFMX/MySQL) was to use the ColdFusion Replace() function. The curly quotes in MS Word are Unicode characters 8220 (left quote, U+201C) and 8221 (right quote, U+201D); they are not part of ASCII. The quote you want is ASCII character 34. So, in the same template where I have the textarea boxes, I also use the Replace() function.

You mentioned you are using Perl. I'm not familiar enough with Perl but I would think there is a replace function you could write up that would do the same thing I'm doing in CFMX.

Here's the syntax I'm using in CFMX (without the starting and ending brackets):

cfset form.abstract=#replace(form.abstract, chr(8220), chr(34), "all")#

where "abstract" is the name of the textarea. I have a similar line for character 8221.

Hope this is helpful.
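
For anyone not on ColdFusion, the same replacement can be sketched in Python (used here only as an illustration; a Perl version would follow the same idea). The mapping of the single curly quotes 8216 and 8217 is an addition not mentioned above, and the name `straighten_quotes` is made up. This assumes the submitted text has already been decoded to Unicode.

```python
# Map Word's typographic quotes (by code point) to plain ASCII quotes.
SMART_QUOTES = {
    8220: 34,  # U+201C left double quote  -> "
    8221: 34,  # U+201D right double quote -> "
    8216: 39,  # U+2018 left single quote  -> ' (assumed extra)
    8217: 39,  # U+2019 right single quote -> ' (assumed extra)
}


def straighten_quotes(text):
    """Return text with curly quotes replaced by straight ones."""
    return text.translate(SMART_QUOTES)


print(straighten_quotes("\u201cjoint heritage\u201d"))  # "joint heritage"
```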
Old Oct 1, 2004, 05:05   #14
StefanH
SitePoint Enthusiast
 
Join Date: Sep 2004
Location: UK
Posts: 78
rvanderh,

Your solution is similar to the one I (eventually) used myself, except that I used the SQL function REPLACE. The logic is the same.
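
A rough sketch of the SQL REPLACE approach, run here through Python's built-in sqlite3 module so it is self-contained. SQLite stands in for MySQL, and the table and column names (`entries`, `body`) are hypothetical; the exact statement used on the real database is not given in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO entries (body) VALUES (?)",
             ("\u201cjoint heritage\u201d",))

# Nested REPLACE calls, one per curly quote character.
conn.execute(
    "UPDATE entries SET body = replace(replace(body, ?, ?), ?, ?)",
    ("\u201c", '"', "\u201d", '"'),
)

cleaned = conn.execute("SELECT body FROM entries").fetchone()[0]
print(cleaned)  # "joint heritage"
```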

Another tip: opening the pasted MS Word text in XEmacs will show you the codes that Word has used, i.e. 8220, as rvanderh writes, would be shown as /220.
All times are GMT -7.

