In article <791c07c4-e66d-4f99-bff1-d475a764b...@m13g2000vbf.googlegroups.com>,
ra...@conexus.net says...
> In any event, some of my questions are: Is the mechanism to store text
> in the PDF file documented? Is there some sort of standard?
You need to read the PDF Reference Manual, which is available from the
Adobe web site. Warning: text is stored in an encoded fashion. While it
*may* be ASCII or similar, it equally well may not be, and the encoding
depends (amongst other things) on the font being used.
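
To make the font dependence concrete, here is a toy sketch in Python. It is
not a PDF parser; the font names F1 and F2 and both code tables are made up
for illustration. It only shows why the bytes of a string in a page
description cannot be read as text without knowing which font is selected.

```python
# Hypothetical per-font code tables (illustrative assumption, not real PDF data).
# None means "codes happen to coincide with ASCII" for that font.
FONT_MAPS = {
    "F1": None,
    "F2": {0x01: "H", 0x02: "i"},  # e.g. a subset font with arbitrary codes
}

def decode_shown_string(raw: bytes, font: str) -> str:
    """Turn the bytes of a shown string into text using the font's table."""
    mapping = FONT_MAPS[font]
    if mapping is None:
        return raw.decode("ascii")
    # Codes with no known mapping become U+FFFD: the text is unrecoverable.
    return "".join(mapping.get(code, "\ufffd") for code in raw)

print(decode_shown_string(b"Hi", "F1"))        # ASCII-like font
print(decode_shown_string(b"\x01\x02", "F2"))  # same word, different bytes
```

If a font carries no usable table at all, every code falls into the U+FFFD
branch, which is the situation where only OCR helps.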
This is a complex subject, and in the general case there is no guarantee
of being able to recover text from a PDF file in any way other than
printing and OCR'ing it.
That being said, since you are generating the text, it's perfectly
possible to ensure that you can get it back out again. Just don't assume
that you can do this with any random PDF file.
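
As a rough illustration of "ensuring you can get it back out", the Python
sketch below round-trips text through a page content stream fragment. It
assumes several things that you would have to arrange yourself when
generating the file: the content stream is left uncompressed, the text is
plain ASCII, it is written as literal strings with the Tj operator, and the
font uses an ASCII-compatible encoding. The helper names are my own, and
the extractor only understands the three escapes the writer produces.

```python
import re

def pdf_literal(text: str) -> str:
    """Write ASCII text as a PDF literal string, escaping \\, ( and )."""
    text.encode("ascii")  # raises UnicodeEncodeError if the text is not ASCII
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"

def show_text_op(text: str) -> str:
    """Content-stream fragment that shows the text with the Tj operator."""
    return pdf_literal(text) + " Tj"

# Matches "(...) Tj" where the string body holds escaped pairs or plain bytes.
_TJ = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")

def extract_shown_text(content: bytes) -> list[str]:
    """Recover strings written by show_text_op from an uncompressed stream."""
    found = []
    for match in _TJ.finditer(content):
        unescaped = re.sub(rb"\\(.)", rb"\1", match.group(1))
        found.append(unescaped.decode("ascii"))
    return found

stream = ("BT /F1 12 Tf 72 720 Td " + show_text_op("word (one)") + " ET").encode("ascii")
print(extract_shown_text(stream))
```

This works only because the writer and the reader agree on the conventions
in advance; it will not work on PDF files from other producers.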
> Tools that extract such words from PDF files could be useful in my
> research.
Ghostscript has a simple tool, ps2ascii, which can extract text, but it
is not well supported.
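
If you want to drive ps2ascii from a script, something like the Python
sketch below should do. It assumes ps2ascii is installed and on your PATH,
and that it is invoked as "ps2ascii input [output]" as described in its man
page; the file names are placeholders. Check the result by eye, since the
output quality depends heavily on how the PDF was produced.

```python
import shutil
import subprocess

def build_ps2ascii_command(pdf_path: str, txt_path: str) -> list[str]:
    """Argument list for: ps2ascii input.pdf output.txt"""
    return ["ps2ascii", pdf_path, txt_path]

def extract_with_ps2ascii(pdf_path: str, txt_path: str) -> None:
    """Run ps2ascii; raises if the tool is missing or exits with an error."""
    if shutil.which("ps2ascii") is None:
        raise FileNotFoundError("ps2ascii not found on PATH (is Ghostscript installed?)")
    subprocess.run(build_ps2ascii_command(pdf_path, txt_path), check=True)

# Example call (placeholder file names):
# extract_with_ps2ascii("paper.pdf", "paper.txt")
```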
Ken