I've written a regex that captures words in an HTML document. It works almost perfectly, except for words containing "typographer's" quotes, such as contractions and possessives, which I can't seem to capture.
I looked at the DB containing the downloaded text, and it seems to be a character encoding problem.
My app downloads a webpage to a file using curl, then processes the text from that file and puts the processed text into a sqlite3 DB. The user can then view the processed text stored in the DB. I'm serving the pages with iso-8859-1 encoding, and the quotes and apostrophes look fine when viewed in a browser. The original HTML I download is also served as iso-8859-1.
Looking at the HTML source in Firefox, the characters I'm trying to capture are ’ ‘ “ ” . Should be easy...
However, when I look at the text directly in the DB or in the original files, those characters are replaced with question marks. I checked, and the files created by curl appear to be encoded as utf-8. I believe the sqlite3 DB is also utf-8.
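To narrow this down, here is a rough Python sketch of how the raw bytes in a downloaded file could be checked. It relies on two facts: ISO-8859-1 proper has no code points for ’ ‘ “ ”, while Windows-1252 stores them as the single bytes 0x91 to 0x94, and UTF-8 stores each as a three-byte sequence. The function name `diagnose` and the file name `page.html` are made up for illustration, and the check is only a heuristic, not a real encoding detector.

```python
# The four curly quotes as Unicode code points: ’ ‘ “ ”
CURLY = "\u2019\u2018\u201c\u201d"

# Their single-byte forms in Windows-1252: ‘ ’ “ ” are 0x91 0x92 0x93 0x94
CP1252_CURLY = b"\x91\x92\x93\x94"


def diagnose(raw: bytes) -> str:
    """Heuristically report how curly quotes are stored in raw bytes.

    Hypothetical helper: returns "utf-8", "cp1252", or "none".
    """
    # Check UTF-8 first, since each quote is an unambiguous 3-byte sequence.
    if any(ch.encode("utf-8") in raw for ch in CURLY):
        return "utf-8"
    # Iterating over bytes yields ints, and `int in bytes` is a valid test.
    if any(b in raw for b in CP1252_CURLY):
        return "cp1252"
    # No curly quotes found: they may already have been replaced
    # (for example by literal "?" characters) before the file was written.
    return "none"


if __name__ == "__main__":
    # "page.html" is an assumed file name for the curl output.
    try:
        with open("page.html", "rb") as f:
            print(diagnose(f.read()))
    except FileNotFoundError:
        print("page.html not found")
```

If this reports "none" for the file curl wrote, the quotes were presumably lost before or during the download. If it reports "cp1252", the bytes are intact and are only being decoded with the wrong encoding later on.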
Is there a way to set curl to encode using iso-8859-1 or is there some other fix anyone can suggest? I've been attempting to figure this out for two days and haven't gotten anywhere.