I am involved in a project which requires storing some text (programmatically) in PDF documents. I guess my first step would be to look at how Adobe does it. I was surprised to see that the text discovered by the Adobe OCR phase is stored one way in the PDF file, while the text discovered by another OCR company's software is stored differently. Perhaps they are trying to stay out of each other's way?
In any event, some of my questions are: Is the mechanism to store text in the PDF file documented? Is there some sort of standard?
Tools that extract such words from PDF files could be useful in my research.
In article <791c07c4-e66d-4f99-bff1-d475a764b...@m13g2000vbf.googlegroups.com>, ra...@conexus.net says...
> In any event, some of my questions are: Is the mechanism to store text
> in the PDF file documented? Is there some sort of standard?
You need to read the PDF Reference Manual, which is available from the Adobe web site. Warning: text is stored in an encoded fashion. While it *may* be ASCII or similar, it equally well may not be, and the encoding depends (amongst other things) on the font being used.
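To illustrate why the font matters, here is a minimal Python sketch. It is not a PDF parser: the byte strings and the SUBSET_MAP table are made up for the example, standing in for the character codes a text-showing operator carries and for the code-to-character mapping a font may (or may not) supply.

```python
# Illustration only: the same idea of "bytes in a text string" can mean
# different characters depending on the font's encoding.

# Hypothetical mapping for an embedded subset font that numbers its
# glyphs 1, 2, 3... in order of first use. Real files may provide such a
# mapping, or may omit it entirely.
SUBSET_MAP = {0x01: "H", 0x02: "i"}


def decode(codes: bytes, code_to_char: dict) -> str:
    """Map each character code to text, using U+FFFD where the mapping is missing."""
    return "".join(code_to_char.get(code, "\ufffd") for code in codes)


# Case 1: a font with an ASCII-compatible encoding. The bytes read as text.
ascii_like = b"Hi"
print(ascii_like.decode("latin-1"))

# Case 2: a subset font. The bytes are glyph codes, not ASCII.
subset_codes = b"\x01\x02"
print(decode(subset_codes, SUBSET_MAP))   # recoverable only via the mapping
print(decode(subset_codes, {}))           # no mapping: the text is lost
```

In the second case, a tool that assumes ASCII sees only control characters, which is why extraction from an arbitrary file can fail.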
This is a complex subject, and in the general case there is no guarantee of being able to recover text from a PDF file in any way other than printing and OCR'ing it.
That being said, since you are generating the text, it's perfectly possible to ensure that you can get it back out again. Just don't assume that you can do this with any random PDF file.
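As a rough sketch of that idea, the Python below writes a one-page PDF with an uncompressed content stream, a standard font (Helvetica) and a standard encoding (WinAnsiEncoding), then reads the text back with a regular expression. The function names build_pdf and extract_text are invented for this example, it assumes ASCII-only text on a single line, and the extractor only understands files produced by build_pdf. It is not a general PDF reader.

```python
import re


def build_pdf(text: str) -> bytes:
    """Build a one-page PDF whose text is stored uncompressed in a standard font."""
    # Backslash and parentheses are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    content = f"BT /F1 12 Tf 72 720 Td ({escaped}) Tj ET".encode("latin-1")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica "
        b"/Encoding /WinAnsiEncoding >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of each object, for the xref table
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # each xref entry is exactly 20 bytes
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


def extract_text(pdf: bytes) -> str:
    """Recover the text from a file written by build_pdf (and only such files)."""
    match = re.search(rb"\((.*)\) Tj", pdf, re.S)
    if match is None:
        raise ValueError("no text-showing operator found")
    # Undo the escaping applied in build_pdf.
    raw = re.sub(rb"\\(.)", rb"\1", match.group(1))
    return raw.decode("latin-1")


if __name__ == "__main__":
    pdf_bytes = build_pdf("Hello (PDF) world")
    print(extract_text(pdf_bytes))
```

The round trip works here only because the writer controls every choice (no compression, a known font, a known encoding). Change any of those and the naive extractor stops working, which is the general-case problem described above.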
> Tools that extract such words from PDF files could be useful in my
> research.
Ghostscript has a simple tool, ps2ascii, which can extract text, but is not well supported.