Read text from pdf - any idea?


Moderator: Rathinagiri

mol
Posts: 3723
Joined: Thu Sep 11, 2008 5:31 am
Location: Myszków, Poland

Read text from pdf - any idea?

Post by mol »

Do you have any idea, or maybe a sample, of how to read data from tables in a PDF file?
Rathinagiri
Posts: 5471
Joined: Tue Jul 29, 2008 6:30 pm
DBs Used: MariaDB, SQLite, SQLCipher and MySQL
Location: Sivakasi, India

Re: Read text from pdf - any idea?

Post by Rathinagiri »

https://github.com/michaelrsweet/pdfio

This looks very promising. Can anyone build a library or DLL from it?
East or West HMG is the Best.
South or North HMG is worth.
...the possibilities are endless.
Rathinagiri
Posts: 5471
Joined: Tue Jul 29, 2008 6:30 pm
DBs Used: MariaDB, SQLite, SQLCipher and MySQL
Location: Sivakasi, India

Re: Read text from pdf - any idea?

Post by Rathinagiri »

mol
Posts: 3723
Joined: Thu Sep 11, 2008 5:31 am
Location: Myszków, Poland

Re: Read text from pdf - any idea?

Post by mol »

C is not my language :lol:
I don't even know how to start compiling the zlib library from this text.

Update:
I'm trying to compile pdf.cpp from these sites.
First, I downloaded the zlib library.
I successfully compiled it with MinGW.
But when I try to compile pdf.cpp, I get an error:

Code:

pdf.cpp:205:22: error: '_TCHAR' has not been declared
  205 | int _tmain(int argc, _TCHAR* argv[])
      |                      ^~~~~~
pdf.cpp: In function 'int _tmain(int, int**)':
pdf.cpp:233:74: warning: ISO C++ forbids converting a string constant to 'char*' [-Wwrite-strings]
  233 |                         size_t streamstart = FindStringInBuffer (buffer, "stream", filelen);
      |                                                                          ^~~~~~~~
pdf.cpp:234:74: warning: ISO C++ forbids converting a string constant to 'char*' [-Wwrite-strings]
  234 |                         size_t streamend   = FindStringInBuffer (buffer, "endstream", filelen);
      |                                                                          ^~~~~~~~~~~

I think this is where my knowledge ends :D
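
A possible way past these two diagnostics, as a sketch only: `_TCHAR` and `_tmain` are Microsoft macros declared in `<tchar.h>`, so either including that header or switching to the standard `int main(int argc, char* argv[])` signature should remove the error. The `-Wwrite-strings` warnings appear because string literals such as "stream" are passed to a `char*` parameter; taking `const char*` avoids them. The function below is a hypothetical const-correct stand-in for the sample's `FindStringInBuffer`, not the original implementation, and its "return `buflen` when not found" convention is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Standard entry point that can replace _tmain(int, _TCHAR*[]):
//     int main(int argc, char* argv[])

// Hypothetical const-correct replacement for the sample's FindStringInBuffer.
// Returns the offset of the first occurrence of `search` within the first
// `buflen` bytes of `buffer`, or `buflen` when there is no match (assumed
// convention). memcmp is used so that embedded zero bytes in the PDF data
// do not cut the search short.
std::size_t FindStringInBuffer(const char* buffer, const char* search, std::size_t buflen)
{
    assert(buffer != nullptr && search != nullptr);
    const std::size_t len = std::strlen(search);
    if (len == 0 || len > buflen) return buflen;
    for (std::size_t i = 0; i + len <= buflen; ++i) {
        if (std::memcmp(buffer + i, search, len) == 0) return i;
    }
    return buflen;
}
```

With this signature, calls like `FindStringInBuffer(buffer, "stream", filelen)` compile without the conversion warning.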
Last edited by mol on Tue Dec 12, 2023 8:29 am, edited 1 time in total.
serge_girard
Posts: 3167
Joined: Sun Nov 25, 2012 2:44 pm
DBs Used: 1 MySQL - MariaDB, 2 DBF
Location: Belgium

Re: Read text from pdf - any idea?

Post by serge_girard »

Very interesting!

I would love to have this...!
There's nothing you can do that can't be done...
mol
Posts: 3723
Joined: Thu Sep 11, 2008 5:31 am
Location: Myszków, Poland

Re: Read text from pdf - any idea?

Post by mol »

Maybe someone can help port this piece of code to Harbour?

Code:

// Now use zlib to inflate the stream data:
z_stream zstrm;
ZeroMemory(&zstrm, sizeof(zstrm));

// Input: the compressed bytes between "stream" and "endstream"
zstrm.avail_in = streamend - streamstart + 1;
zstrm.next_in = (Bytef*)(buffer + streamstart);
// Output: a buffer of outsize bytes allocated earlier
zstrm.avail_out = outsize;
zstrm.next_out = (Bytef*)output;

int rsti = inflateInit(&zstrm);
if (rsti == Z_OK)
{
	int rst2 = inflate(&zstrm, Z_FINISH);
	if (rst2 >= 0)
	{
		// Ok, got something, extract the text:
		size_t totout = zstrm.total_out;
		ProcessOutput(fileo, output, totout);
	}
	// Release zlib's internal state
	inflateEnd(&zstrm);
}
delete[] output; output = 0;
// Advance past this stream to look for the next one
buffer += streamend + 7;
filelen = filelen - (streamend + 7);
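
The fragment above depends on `streamstart` and `streamend` marking the compressed bytes. As a rough sketch of that step: in a PDF file the `stream` keyword is followed by CRLF or LF, then the payload, then an end-of-line and `endstream`. When the stream dictionary says `/Filter /FlateDecode`, that payload is zlib (deflate) data, which is why `inflate` works on it. `StreamRange` and `FindStreamData` are hypothetical names for illustration; a robust reader would use the `/Length` entry of the stream dictionary instead of searching for `endstream`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Byte range of one stream payload inside a PDF file held in memory.
struct StreamRange {
    std::size_t offset;  // index of the first payload byte
    std::size_t length;  // number of payload bytes
    bool found;          // false when no complete stream was located
};

// Hypothetical helper: locates the first stream payload at or after `from`.
// Callers should pass a `from` that lies past any previous "endstream",
// because "endstream" itself contains the text "stream".
StreamRange FindStreamData(const std::string& pdf, std::size_t from = 0)
{
    StreamRange r{0, 0, false};
    const std::size_t kw = pdf.find("stream", from);
    if (kw == std::string::npos) return r;

    // Skip the end-of-line that follows the keyword (CRLF or LF).
    std::size_t start = kw + 6;
    if (start < pdf.size() && pdf[start] == '\r') ++start;
    if (start < pdf.size() && pdf[start] == '\n') ++start;

    const std::size_t end = pdf.find("endstream", start);
    if (end == std::string::npos) return r;

    // Drop the end-of-line that precedes "endstream". This is a heuristic:
    // without /Length we cannot tell a separator from a payload byte.
    std::size_t stop = end;
    if (stop > start && pdf[stop - 1] == '\n') --stop;
    if (stop > start && pdf[stop - 1] == '\r') --stop;

    r = {start, stop - start, true};
    return r;
}
```

The returned `offset` and `length` are what would be handed to zlib as `next_in` and `avail_in`.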
Rathinagiri
Posts: 5471
Joined: Tue Jul 29, 2008 6:30 pm
DBs Used: MariaDB, SQLite, SQLCipher and MySQL
Location: Sivakasi, India

Re: Read text from pdf - any idea?

Post by Rathinagiri »

Wow! At least you managed to get this far!

I think C gurus like Grigory and edk are around. Let us ask them.
mol
Posts: 3723
Joined: Thu Sep 11, 2008 5:31 am
Location: Myszków, Poland

Re: Read text from pdf - any idea?

Post by mol »

I compiled the sample, but I'm getting garbage instead of text from the PDF:

Code:

!"#$"%&'$%()*&+,-./01&23"456$)3$75"&89:;<&:8=88:&>5?4(@7A@3"0)BCD&E:;&FGG&HIJ&H8;<&/2KD&J:J8I8L::H2/'&M"%6&NBO46$<;F&GIFI&GG:8&GIII&IIJI&J8GF&:L;F

>5?4(@7A@3"

P$)Q47)&3R4("3$)%$"DI8CI8C8I8L

S"("&4T*5)U"VRDI8CI8C8I8L

S"("&3R4("3$)%$"D

WT*5)U"37"D

/"#R37"D!"#$"%&'$%()*&+,-./01>B)"%&A@X4)&S3@*"6&Y%%"/2KD&J:J8I8L::H

23"456$)3$75"&89:;:8=88:&>5?4(@7A@3"

W$6@*46$)Z@&G[&:8=LII&PR456\3

/2KD&FHH&GJL&IL&8I!"6(X*"&]Y0&&8;:98I8L&@*RZ$%"^%"&T@U4("3$)&5"_\3$)%$"&1`&8L898I8L&5&U%$"&GJCIGC8I8L

I have no idea how to continue this work...

I think it's possible to write this in pure Harbour, but I don't know how to decompress a text variable in memory, which compression method is used, etc...
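
One possible reading of the output above, offered as a guess: the decompression itself seems to work, and the bytes are glyph codes rather than ASCII. Codes that climb from `!` upward in order of first use are typical of an embedded subset font with its own encoding, in which case the real characters have to be looked up through the font's `/ToUnicode` CMap or `/Encoding` entry. The sketch below shows the two steps under that assumption: pulling the literal strings out of an inflated content stream, then translating each byte through a lookup table. Both function names and the sample table are hypothetical; parsing the CMap, hex strings (`<...>`), and escapes such as `\n` or octal `\ddd` are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Collects the raw bytes of every literal string "( ... )" found in an
// inflated content stream, e.g. the operands of the Tj and TJ operators.
// Simplification: a backslash keeps the next byte as-is, which is right for
// \( \) and \\ but does not decode \n, \r, \t or octal escapes.
std::vector<std::string> ExtractLiteralStrings(const std::string& content)
{
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < content.size()) {
        if (content[i] != '(') { ++i; continue; }
        std::string cur;
        int depth = 1;  // PDF allows balanced, unescaped nested parentheses
        ++i;
        while (i < content.size() && depth > 0) {
            const char c = content[i++];
            if (c == '\\' && i < content.size()) {
                cur += content[i++];
            } else if (c == '(') {
                ++depth;
                cur += c;
            } else if (c == ')') {
                if (--depth > 0) cur += c;
            } else {
                cur += c;
            }
        }
        out.push_back(cur);
    }
    return out;
}

// Translates single-byte glyph codes through a code-to-text table. In a real
// reader the table would be built from the font's /ToUnicode CMap; unknown
// codes become '?'.
std::string DecodeWithTable(const std::string& raw,
                            const std::map<unsigned char, std::string>& table)
{
    std::string text;
    for (unsigned char code : raw) {
        const auto it = table.find(code);
        text += (it != table.end()) ? it->second : std::string("?");
    }
    return text;
}
```

Until the table is filled from the font data, the extracted strings will look exactly like the garbage shown above.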
hansmarc
Posts: 40
Joined: Thu Jun 23, 2016 5:38 am
Location: Belgium

Re: Read text from pdf - any idea?

Post by hansmarc »

Hi Mol,

This tool (see the link below) may help you.
It is a command-line tool; I know, not the best solution.
In 2022 I ran some tests with a lot of supplier invoice PDF files, for a project to import
invoices automatically into our management software.
The results were not bad at all, but sadly our project does not have the highest priority and so was never finished.

https://www.xpdfreader.com/pdftotext-man.html

Regards
Hans
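
For anyone who wants to try this route from a program, here is a minimal sketch of calling the tool. It assumes `pdftotext` is on the PATH and uses its `-layout` option, which tries to keep the physical layout of the page and so tends to keep table columns aligned. The function names are made up for the example, and paths that themselves contain double quotes are not handled.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Builds the command line: pdftotext -layout "<pdf>" "<txt>"
// Paths are wrapped in double quotes so that spaces survive.
std::string BuildPdfToTextCommand(const std::string& pdfPath, const std::string& txtPath)
{
    assert(!pdfPath.empty() && !txtPath.empty());
    return "pdftotext -layout \"" + pdfPath + "\" \"" + txtPath + "\"";
}

// Runs the command through the system shell. Returns true when the shell
// reports exit status 0, which is taken here to mean success.
bool ConvertPdfToText(const std::string& pdfPath, const std::string& txtPath)
{
    return std::system(BuildPdfToTextCommand(pdfPath, txtPath).c_str()) == 0;
}
```

The resulting text file can then be read line by line and split into columns, which is usually easier than decoding the PDF structure directly.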
Rathinagiri
Posts: 5471
Joined: Tue Jul 29, 2008 6:30 pm
DBs Used: MariaDB, SQLite, SQLCipher and MySQL
Location: Sivakasi, India

Re: Read text from pdf - any idea?

Post by Rathinagiri »

Please share your code, Mol. Let others try.
mol wrote: Tue Dec 12, 2023 9:29 pm I compiled sample, but I'm getting some trashes instead of text from pdf:
