Bash: search for keywords in PDF files and return pages [duplicate]
This question already has an answer here:
How can I grep in PDF files?
13 answers
Hopefully somebody can help me out with this.
I'm looking for a small script that does a keyword search in a folder of PDF files and returns every page where the keyword is found, together with the name of the file.
I found the following script (at https://ubuntuforums.org/showthread.php?t=1368062):
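#!/bin/bash
# Abort if no search string was given (the braces group echo and exit).
[ "$*" ] || { echo "You forgot a search string!" ; exit 1 ; }
found=1
for file in ./src/*.pdf ; do
    # An unmatched glob stays literal, so this detects a folder without PDFs.
    [ "$file" = './src/*.pdf' ] && echo "No PDF files found!" && exit 1
    pages=$(pdfinfo "$file" | awk '/Pages:/ { print $NF }')
    for ((i=1 ; i<=$pages ; i++)) ; do
        # grep -m 1 stops at the first matching line, so a page is reported at most once.
        match=$(pdftotext -q -f "$i" -l "$i" "$file" - | grep -m 1 "$*")
        [ "$match" ] && echo "Page $i in $file" && found=0
    done
done
[ "$found" -ne 0 ] && echo "No search string matches found"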
It returns most of the hits, but the search functions in Adobe Acrobat Reader and Mac Preview return far more matches. Does anyone recognise what the problem might be?
My guess is that it fails on a character before and/or after the search keyword, but that's just a guess.
If it also reported the number of matches per page, it would be perfect!
bash pdf file-search
marked as duplicate by Goro, countermode, RalfFriedl, Caleb, meuh Sep 26 at 17:50
This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.
asked Sep 25 at 9:05
Erik van de Ven
1 Answer
accepted
I would use pdfgrep:
pdfgrep -p "your search string" src/*.pdf
will output the matching page numbers, with a count per page.
This might not deal with the missing matches; the reasons for those depend on the way the PDFs are constructed (in particular, how the text is assembled).
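If you want to stay with pdftotext, here is a rough sketch of one way the text assembly can hide matches: a keyword that is wrapped or hyphenated across a line break never appears on a single line, so a line-based grep misses it. The helper names `count_matches` and `search_pdf` are made up for this example. The case-insensitive matching (`-i`) is my assumption about how the viewers search by default, and joining every trailing hyphen is a heuristic that will also glue together genuine compounds such as "well-" / "known".

```shell
#!/bin/bash
# Hypothetical helper: count occurrences of a fixed string in text read from
# stdin, after undoing line wraps that would otherwise hide a match.
count_matches() {
    local needle=$1
    # awk: a line ending in "-" is joined directly to the next line;
    #      any other line break becomes a single space.
    # tr:  squeeze runs of spaces so "key  word" and "key word" compare equal.
    # grep -o prints every match on its own line; wc -l counts those lines.
    awk '{ if (sub(/-$/, "")) printf "%s", $0; else printf "%s ", $0 }
         END { print "" }' \
        | tr -s ' ' \
        | grep -o -i -F -- "$needle" \
        | wc -l \
        | tr -d ' '
}

# Hypothetical wrapper: print the match count for every page of one PDF.
# Relies on pdfinfo and pdftotext, as the script in the question does.
search_pdf() {
    local needle=$1 file=$2 pages page count
    pages=$(pdfinfo "$file" | awk '/^Pages:/ { print $NF }')
    for ((page = 1; page <= pages; page++)); do
        count=$(pdftotext -q -f "$page" -l "$page" "$file" - | count_matches "$needle")
        [ "$count" -gt 0 ] && echo "$file: page $page: $count match(es)"
    done
    return 0
}
```

Calling `search_pdf "keyword" "$file"` inside the question's `for file in ./src/*.pdf` loop would report a count per page. Matches that span two pages, or text that the PDF stores in an unusual order, would still be missed.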
Thanks so much! Pdfgrep is far more accurate! It seems to work perfectly.
– Erik van de Ven, Sep 25 at 9:17
answered Sep 25 at 9:12
Stephen Kitt