May 13, 2010, 12:15 #1
May 17, 2010, 01:42 #2
PDF documents do not have an attribute that prevents them from being searched. However, the content is often encoded or encrypted in a way that makes plain-text searches impossible.
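To illustrate why a plain-text search fails on encoded content, here is a minimal sketch in standard-library Python (not ABCpdf). It assumes the stream is stored with deflate compression, which is a common choice in PDF files. The content stream itself is made up for the example.

```python
import zlib

# A made-up PDF content stream that draws two lines of text.
content = (
    b"BT /F1 12 Tf 72 720 Td (Hello World) Tj ET\n"
    b"BT /F1 12 Tf 72 700 Td (Searchable once decoded) Tj ET\n"
)

# Assumption: the writer stored the stream deflate-compressed.
stored = zlib.compress(content)

# Searching the stored bytes directly does not find the phrase...
found_in_stored = b"Hello World" in stored

# ...but it is there again once the stream has been decompressed.
decoded = zlib.decompress(stored)
found_in_decoded = b"Hello World" in decoded

print("found in stored bytes: ", found_in_stored)
print("found in decoded bytes:", found_in_decoded)
```

The first search should fail and the second should succeed, which is why the file needs decoding before a text search can work.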
The easiest way to decode a document is to use a third-party PDF component. I use ABCpdf from webSupergoo, which supports C#, VB.NET, VBScript, ASP, and ASP.NET. It also provides a COM interface for interoperation with other languages.
The following VBScript example shows how to decompress content streams in a PDF using ABCpdf. Copy the code into a text file, change the file extension from '.txt' to '.vbs', and run it with the path of the PDF as the first argument.
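' Path of the PDF to process, taken from the first command-line argument
theFile = WScript.Arguments.Item(0)
Set theDoc = CreateObject("ABCpdf7.Doc")
theDoc.Read theFile
theCount = theDoc.GetInfo(0, "Count")
' Decompress each item in turn
For i = 1 To theCount
    theDoc.GetInfo i, "Decompress"
Next
' Save a decompressed copy next to the original
theDoc.SaveOptions.Linearize = False
theDoc.Save theFile & "_dec.pdf"
MsgBox "Done"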
Be aware that text in PDF files is typically broken into short, arbitrary fragments and might require additional work to reconstruct. ABCpdf has a GetText function that simplifies this task.
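To give a feel for the fragment problem, here is a rough sketch in plain Python (again, not ABCpdf, and not how GetText works internally). The sample line, the helper name join_tj_fragments, and the -200 gap threshold are all assumptions made for illustration. It handles only simple literal strings with no escapes, nested parentheses, or font-specific encodings.

```python
import re

# Hypothetical content-stream line: one phrase split into three fragments,
# with position adjustments (in thousandths of a text unit) between them.
line = "[(Sea) -12 (rch) -250 (me)] TJ"

# Matches either a simple literal string "(...)" or a number.
TOKEN = re.compile(r"\(([^()\\]*)\)|(-?\d+(?:\.\d+)?)")

def join_tj_fragments(tj_line, space_threshold=-200):
    """Join the string fragments of a TJ array.

    A large negative adjustment widens the gap between glyphs, so it is
    treated here as a word space. The threshold is an arbitrary guess.
    """
    parts = []
    for text, number in TOKEN.findall(tj_line):
        if number:
            if float(number) <= space_threshold:
                parts.append(" ")
        else:
            parts.append(text)
    return "".join(parts)

print("Search" in line)          # the word is split, so a raw search misses it
print(join_tj_fragments(line))   # fragments rejoined into readable text
```

Real documents are messier than this (custom encodings, text positioned by separate operators, fragments out of reading order), which is the reason to lean on a library function rather than hand-rolled parsing.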
The contents of a PDF document can also be protected by encryption. ABCpdf supports a number of encryption standards and can decrypt these files for you, if you know the password. See the documentation on the Encryption object for further details.