Bash: search for keywords in PDF files and return pages [duplicate]

This question already has an answer here:

How can I grep in PDF files?

13 answers

Hopefully somebody can help me out with this.

I'm looking for a small script that searches for a keyword in a folder of PDF files and returns every page on which the keyword is found, together with the name of the file.

I have found the following script (over here https://ubuntuforums.org/showthread.php?t=1368062):

#!/bin/bash

# Abort when no search string is given.
[ "$*" ] || { echo "You forgot a search string!" ; exit 1 ; }

found=1

for file in ./src/*.pdf ; do
    # An unmatched glob is left as the literal pattern.
    [ "$file" = './src/*.pdf' ] && echo "No PDF files found!" && exit 1
    pages=$(pdfinfo "$file" | awk '/^Pages:/ {print $NF}')
    # Extract each page separately so that hits can be reported per page.
    for ((i=1 ; i<=pages ; i++)) ; do
        match=$(pdftotext -q -f "$i" -l "$i" "$file" - | grep -m 1 -- "$*")
        [ "$match" ] && echo "Page $i in $file" && found=0
    done
done

[ "$found" -ne 0 ] && echo "No search string matches found"

It returns most of the hits, but the search functionality in Adobe Acrobat Reader and Mac Preview returns many more matches. Does anyone recognise what the problem might be?

My guess is that it fails on a character before and/or after the search keyword, but that's just a guess.

If it also reported the number of matches per page, it would be truly perfect!

asked Sep 25 at 9:05

Erik van de Ven

marked as duplicate by Goro, countermode, RalfFriedl, Caleb, meuh Sep 26 at 17:50

This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.

bash pdf file-search

1 Answer

accepted

I would use pdfgrep:

pdfgrep -p "your search string" src/*.pdf

will output the matching page numbers, with a count per page.
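Where pdfgrep is not installed, a similar per-page count can be approximated with pdftotext and grep -o. The following is only a sketch under a few assumptions, not how pdfgrep works internally: the function names count_matches and search_pdf are made up for illustration, the keyword is treated as a fixed string rather than a regular expression, and -i assumes that a case-insensitive search is wanted (the script in the question is case-sensitive, which by itself can hide hits).

```bash
#!/bin/bash

# Count case-insensitive occurrences of a fixed string in text read from stdin.
# grep -o prints every match on its own line, so wc -l yields the match count.
count_matches() {
    grep -o -i -F -- "$1" | wc -l | tr -d ' '
}

# Print "file:page:count" for every page of one PDF that contains the keyword.
# Assumes pdfinfo and pdftotext (poppler-utils) are available.
search_pdf() {
    local keyword=$1 file=$2 pages page count
    pages=$(pdfinfo "$file" | awk '/^Pages:/ {print $NF}')
    for ((page = 1; page <= pages; page++)); do
        count=$(pdftotext -q -f "$page" -l "$page" "$file" - | count_matches "$keyword")
        if ((count > 0)); then
            echo "$file:$page:$count"
        fi
    done
}
```

It could be called in a loop such as: for f in src/*.pdf; do search_pdf "your search string" "$f"; done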

pdfgrep might not recover the missing matches either; the reasons for those depend on the way the PDFs are constructed (in particular, how the text is assembled).
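One plausible cause of missed matches, offered here as an assumption rather than something confirmed in this thread, is that grep works line by line, so a search string that is wrapped or hyphenated across a line end in the extracted text is never seen as a whole. A minimal sketch of a workaround is to join the extracted lines before searching; normalize_text is a hypothetical helper name.

```bash
#!/bin/bash

# Join all lines of the extracted text into one line so that a phrase broken
# across a line end can still be matched.
# A trailing hyphen is treated as end-of-line hyphenation and dropped; this is
# a simplification that also joins genuinely hyphenated compounds.
normalize_text() {
    awk '{
        if (sub(/-$/, "")) printf "%s", $0    # "key-" + "word" becomes "keyword"
        else               printf "%s ", $0   # otherwise join with a space
    } END { print "" }'
}
```

For a single page this could be used as: pdftotext -q -f 3 -l 3 file.pdf - | normalize_text | grep -o -i -F -- "your search string" | wc -l (page 3 and file.pdf are placeholders). Other causes, such as ligatures or text stored as images, are not addressed by this sketch.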

answered Sep 25 at 9:12

Stephen Kitt

Thanks so much! Pdfgrep is far more accurate! It seems to work perfectly.
– Erik van de Ven
Sep 25 at 9:17
