"DoctorC" <enco@[EMAIL PROTECTED]> wrote in message
news:456ae3ac$0$17950$f69f905@[EMAIL PROTECTED]
> Hi,
> I need some suggestions about document processing techniques.
> I need to import documents in HTML, DOC and PDF formats and would like to
> parse them and automatically create fields to fill the documents.
> Any idea how to do it?
"import documents..." "automatically create fields to fill the
documents..."
HTML, DOC and PDF are three different animals.
The easiest would probably be HTML, since it will probably have tags that
specify what the fields actually are (if my HTML memory serves me, they are
form tags like <input ...> or <textarea>, but don't quote me on that).
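A rough sketch of the HTML case, assuming the fields are marked with the
standard form tags (input, textarea, select) and using only Python's
built-in html.parser module. The names FieldFinder, find_fields and SAMPLE
are made up for this illustration, and a real page may need more handling
than this.

```python
from html.parser import HTMLParser


class FieldFinder(HTMLParser):
    """Collect the form controls found in an HTML document."""

    # Assumption: only these standard form tags count as fields.
    FIELD_TAGS = {"input", "textarea", "select"}

    def __init__(self):
        super().__init__()
        self.fields = []  # list of (tag, name attribute, type attribute)

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        if tag in self.FIELD_TAGS:
            attributes = dict(attrs)
            self.fields.append(
                (tag, attributes.get("name"), attributes.get("type"))
            )


def find_fields(html_text):
    """Return every form control in html_text, in document order."""
    finder = FieldFinder()
    finder.feed(html_text)
    finder.close()
    return finder.fields


# Hypothetical sample document, not taken from the original question.
SAMPLE = """
<form>
  Enter your name: <input type="text" name="name">
  Comments: <textarea name="comments"></textarea>
  <p>Do not write below this line:</p>
</form>
"""

if __name__ == "__main__":
    for field in find_fields(SAMPLE):
        print(field)
```

Here the markup itself says what is a field, so the plain paragraph is
ignored without any guessing.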
The problem with DOC and PDF is that nothing really states what a field
is. Take a PDF, which is (often) nothing but graphic images, as with scanned
pages. If it is graphic, you'll need some type of OCR (Optical Character
Recognition) to read the text. At least with DOC you already have the text.
But then what? How do you know what a field is?
We, as humans, see:
Name
and we know we're supposed to put our name there. How is your software
supposed to distinguish that as a field, though? How does it know that:
Enter your name:
is a field and
Do not write below this line:
isn't?
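To make the problem concrete, here is a deliberately naive sketch in Python.
It assumes a made-up rule, "any line that ends in a colon is a field"; the
rule and the names LABEL_PATTERN and guess_fields are only for illustration
and are not a recommended approach.

```python
import re

# Assumed heuristic: a line ending in a colon is a field label.
LABEL_PATTERN = re.compile(r"^(.+?)\s*:\s*$")


def guess_fields(text):
    """Return the labels of lines that the colon rule takes for fields."""
    labels = []
    for line in text.splitlines():
        match = LABEL_PATTERN.match(line.strip())
        if match:
            labels.append(match.group(1))
    return labels


# The three example lines from this post.
FORM_TEXT = """Name
Enter your name:
Do not write below this line:
"""

if __name__ == "__main__":
    print(guess_fields(FORM_TEXT))
```

The rule misses the bare "Name" and happily reports "Do not write below this
line" as a field, which is exactly the ambiguity described above.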

