Copying Chinese from PDF into HTML


#1

Hi there,

I have a PDF which is in Chinese.

I am trying to copy it into Notepad and then paste it into HTML, but it appears as little squares when I paste.

Can anyone recommend a way of transferring this from the Chinese in the document into Notepad or another program which I can then copy out again?

Thanks


#2

Is your character set UTF-8 and are you saving the file as UTF-8?

<meta charset="UTF-8">
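
For anyone reading along, here is a minimal sketch of what "saving the file as UTF-8" amounts to, written in Python rather than done through an editor's Save dialog. The file name `chinese-test.html` and the sample text are assumptions for illustration only; they are not taken from the PDF in question.

```python
import html
from pathlib import Path

# Hypothetical sample text; replace with the text copied from the PDF.
SAMPLE_TEXT = "你好，世界"


def build_page(text: str) -> str:
    """Return a small HTML page that declares UTF-8 in its meta tag."""
    return (
        "<!DOCTYPE html>\n"
        '<html lang="zh">\n'
        "<head>\n"
        '<meta charset="UTF-8">\n'
        "<title>UTF-8 test</title>\n"
        "</head>\n"
        "<body>\n"
        f"<p>{html.escape(text)}</p>\n"
        "</body>\n"
        "</html>\n"
    )


def save_page(path: Path, text: str) -> None:
    # encoding="utf-8" makes the bytes on disk match the declared charset.
    path.write_text(build_page(text), encoding="utf-8")


if __name__ == "__main__":
    save_page(Path("chinese-test.html"), SAMPLE_TEXT)
```

If the page opens correctly in a browser with this sample text but not with the text pasted from the PDF, the problem is more likely in what was copied than in how the file was saved.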



#3

I’m trying to add it into WordPress and also Dreamweaver, but it’s still just the strange little squares.

I wondered if there is a way of exporting or copying from a PDF keeping the Chinese symbols?


#5

Select and copy the Chinese text in the pdf, paste it in a new, not yet saved, Notepad document.

Yes, it may display as bars or rectangles, depending on the language settings in Notepad. But the text actually is the Chinese characters; saving the file as UTF-8 (with any font) will keep the copy-pasted Chinese characters. Mind, though, that saving as ANSI or ISO can ruin the pasted UTF-8 characters.
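
To illustrate why the save encoding matters, here is a rough sketch in Python. It assumes that "ANSI" means Windows-1252 (`cp1252`), which is the usual case on Western-language Windows but may differ on other systems; the sample text is made up and not from the PDF.

```python
# Hypothetical sample, not taken from the PDF.
SAMPLE_TEXT = "中文"


def round_trip(text: str, encoding: str) -> str:
    """Encode to bytes as an editor would on save, then decode again on open."""
    # errors="replace" substitutes "?" for characters the encoding cannot hold.
    data = text.encode(encoding, errors="replace")
    return data.decode(encoding)


if __name__ == "__main__":
    print(round_trip(SAMPLE_TEXT, "utf-8"))   # the characters survive
    print(round_trip(SAMPLE_TEXT, "cp1252"))  # the characters are lost
```

With UTF-8 the text comes back unchanged. With `cp1252` each Chinese character is replaced by a question mark, and that loss cannot be undone by reopening the file in another encoding.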

Also try pasting directly into the HTML file and saving it as UTF-8; that should work too, as far as I can tell.

There are tools that can extract and save text, unformatted, from pdf documents.

The browser should show the Chinese characters, even if an editor like Notepad can’t.

If not, check, as @SamA74 says, that UTF-8 is used in all steps and that the HTML meta charset is also UTF-8.
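
One way to check what actually ended up in the saved file, independent of any font, is to list the code points. This is only a sketch using Python's standard `unicodedata` module; the function names are my own and hypothetical. If the pasted text consists of real Chinese characters, their names start with "CJK UNIFIED IDEOGRAPH"; if they are unnamed or replacement characters, the copy itself went wrong before any saving took place.

```python
import unicodedata


def read_utf8(path: str) -> str:
    """Read a file strictly as UTF-8; raises UnicodeDecodeError if it is not."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def describe_text(text: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode name) for each non-whitespace character."""
    rows = []
    for ch in text:
        if ch.isspace():
            continue
        # Private-use and unassigned code points have no name.
        name = unicodedata.name(ch, "UNNAMED (private use or unassigned)")
        rows.append((f"U+{ord(ch):04X}", name))
    return rows


def looks_like_cjk(text: str) -> bool:
    """True if at least one character is a CJK unified ideograph."""
    return any(
        unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")
        for ch in text
    )


if __name__ == "__main__":
    sample = "中文"  # hypothetical sample; use read_utf8("yourfile.txt") instead
    for code_point, name in describe_text(sample):
        print(code_point, name)
    print("Contains Chinese characters:", looks_like_cjk(sample))
```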

Please tell us how you get on, whether you succeed or fail. :slight_smile:


#6

Hi,

Thanks for the reply.

I have tried saving using UTF-8, but no luck :frowning:


#7

Pity you didn’t include the select character set box below showing UTF-8. :wink:

If you open the text file you originally saved as UTF-8 in the browser, what does it show there?


#8

Sorry, which character box? This is what I have when I save it:

Capture2

This is what it outputs in Chrome (both HTML and TXT)

Capture


#9

When I paste and save as UTF-8:

txt

Cromium-test.txt (65 Bytes)

The above is pasted from Google Translate, but I often copy snippets from Chinese PDF documents.

Could you maybe link to a PDF that you fail to copy-paste Chinese from, or tell us what PDF reader you use and also post the content of the PDF file properties?


#10

Hi,

I’m using Adobe Acrobat Pro.

This is a screenshot of the PDF properties:


#11

The Fonts tab would be interesting too, to get a clue as to why copy-paste fails. I assume the fonts are not embedded, but does it say what the character table is, ISO/UTF?

So, you can actually edit the PDF document’s content, fonts, security, etc. I assumed the copy-paste was done in a PDF reader that can only display the content; maybe you could try that if everything else fails.

I can’t give any advice for your Acrobat Pro; the last version I had was Acrobat Pro 1.x. Sorry, I haven’t been on Windows for a very long time.

Now I’m curious: what happens if you insert the Chinese snippet I posted above into the PDF and save as UTF-8? Will copy-pasting that snippet result in squares too?


#12

Hi there toolman,

why don’t you upload your PDF file to this thread? :wink:

Members will then have a golden opportunity to test
your problem and possibly resolve it for you. :biggrin:

coothead


#13

Hi,

Thanks for the reply.

This is one of the pages:

http://elop.co/page.pdf


#14

Hi,

The fonts are embedded, and I extracted them to see which characters were used in the document. I couldn’t find the Chinese characters displayed in the PDF in the 40kB embedded subset of the 13.6MB PingFangSC-Regular.ttf. What encoding the text originally had isn’t clear either.

I could read that the leaflet was about the qualities of Botox. But I think Acrobat might have replaced the Chinese characters with images, which could explain the file size. The file is “optimized”, so it’s not easy to debug.

Now, this is not the PDF you posted the file properties from. :thinking:

This PDF is v1.6 and it was created a few hours after @coothead suggested uploading the PDF.

If you or a colleague of yours created it, it would be reasonable to think you have other ways to get the Chinese text than copy-pasting it into Notepad.

Anyway, what I think could also be totally wrong. :slight_smile:


#15

Hi,

Many thanks for the reply.

Yes, this is a different page - a simpler one which is not as big in size as the other pages.

I have tried to download the PingFang font and also SFNSDisplay, but still no luck :frowning:


#16

Hi there toolman,

check out the attachment which contains the page basics:
an HTML file, a CSS file and the three woff fonts used.

pdf-to-html.zip (43.6 KB)

Unfortunately, the Chinese characters stubbornly refused
to be Copy & Pasted. :unhappy:

coothead


#17

Hi,

Many thanks, that worked great :smiley:

How did you export the PDF?


#18

You are allowed to download one font free of charge at the Chinese site https://en.fontke.com/. You can find the font there and download it using the browser. Note: The site is in Chinese so the download names are in Chinese too, and you might need to rename the file to open it.

About the copy-paste failure, I stumbled over an interesting GitHub thread that explains it and also has a tip on how to get around the failure by exporting the selected part as text.

I think you could find the answer, or links to it, here:


#19

I used this site…

https://www.zamzar.com/convert/pdf-to-html/

…which worked amazingly well considering the size
of the PDF file - 34.3MB :eek:

coothead


#20

Hi,

Thank you everyone for your help on this, it has been a real help and very much appreciated :smiley: