  1. #1
    SitePoint Member
    Join Date
    Nov 2006
    Posts
    19

    regex for apostrophe character and encoding prob

    Hello forum,
    I've written a regex that captures words in an HTML document. It works almost perfectly, except for "typographer's" quotes, contractions, and possessives, which I can't seem to capture.

    I looked at the DB containing the downloaded text, and it seems to be a character encoding problem.

    My app downloads a webpage to a file using curl, then processes the text from that file and puts the processed text into a sqlite3 DB. The user can then view the processed text stored in the DB. I'm serving the pages with iso-8859-1 encoding, and the quotes and apostrophes look fine when viewed in a browser. The original HTML I download is also served as iso-8859-1.

    Taking a look at the html source with firefox the characters I'm trying to capture are . Should be easy...

    However, looking at the text directly in the DB or in the original files, the characters are replaced with question marks. I checked, and the files created using curl are encoded as utf-8. I believe the sqlite3 DB is also utf-8.

    Is there a way to set curl to encode using iso-8859-1 or is there some other fix anyone can suggest? I've been attempting to figure this out for two days and haven't gotten anywhere.

    Thanks.

  2. #2
    SitePoint Enthusiast
    Join Date
    Jun 2006
    Location
    Italy
    Posts
    44
    I have successfully used this utf8ToUnicodeEntities() function to convert UTF-8 content into Unicode entities. I think it fits your needs too.
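
    The function itself isn't shown in this thread, so as a rough illustration of the same idea, here is a minimal Python sketch rather than that function. It assumes the downloaded file really is UTF-8, and the names utf8_to_numeric_entities and WORD_RE are made up for the example. Once the typographic apostrophe (U+2019) has become the plain-ASCII reference &#8217;, the output no longer depends on whether the page is served as iso-8859-1, and a word regex can match the reference literally.

```python
import re


def utf8_to_numeric_entities(raw: bytes) -> str:
    """Decode UTF-8 bytes and replace each non-ASCII character
    with a decimal numeric character reference such as &#8217;."""
    text = raw.decode("utf-8")
    # The built-in "xmlcharrefreplace" error handler emits &#N; for every
    # character that cannot be encoded as ASCII.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# Hypothetical word pattern (not the original poster's regex): letters,
# optionally joined by a straight apostrophe or the reference for U+2019.
WORD_RE = re.compile(r"[A-Za-z]+(?:(?:'|&#8217;)[A-Za-z]+)*")

# Sample input standing in for the bytes curl wrote to disk.
sample = "the typographer\u2019s quote".encode("utf-8")

converted = utf8_to_numeric_entities(sample)
words = WORD_RE.findall(converted)

print(converted)
print(words)
```

    Note that U+2019 has no code point in iso-8859-1 at all, so simply re-encoding the file to iso-8859-1 would lose the character; converting to numeric references sidesteps that.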

