Bash: search for keywords in PDF files and return pages [duplicate]

This question already has an answer here:

  • How can I grep in PDF files? (13 answers)

Hopefully somebody can help me out with this.

I'm looking for a small script that searches a folder of PDF files for a keyword and returns every page where the keyword is found, together with the name of the file.



I have found the following script (over here https://ubuntuforums.org/showthread.php?t=1368062):



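#!/bin/bash

# Abort when no search string is given (the braces keep "exit 1" conditional).
[ "$*" ] || { echo "You forgot a search string!" ; exit 1 ; }

found=1

for file in ./src/*.pdf ; do
    # An unmatched glob stays literal, so compare against the full pattern.
    [ "$file" = './src/*.pdf' ] && echo "No PDF files found!" && exit 1
    # Page count as reported by pdfinfo (poppler-utils).
    pages=$(pdfinfo "$file" | awk '/^Pages:/ {print $NF}')
    for ((i=1 ; i<=pages ; i++)) ; do
        # Extract one page as text and look for the (case-sensitive) search string.
        match=$(pdftotext -q -f "$i" -l "$i" "$file" - | grep -m 1 -- "$*")
        [ "$match" ] && echo "Page $i in $file" && found=0
    done
done

[ "$found" -ne 0 ] && echo "No search string matches found"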


It returns most of the hits, but the search in Adobe Acrobat Reader and Mac Preview returns many more matches. Does anyone recognise what the problem might be?

My guess is that it fails on a character directly before and/or after the search keyword, but that's just a guess.

If it also reported the number of matches per page, it would be perfect!
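
For the per-page count asked about above, here is a minimal sketch rather than a definitive implementation. It assumes `pdfinfo` and `pdftotext` from poppler-utils are installed, as the script above already does; the function names `count_matches` and `search_pdf` and the `./src` default directory are hypothetical. It counts with `grep -o -i -F`, that is, case-insensitively and as a fixed string, on the untested assumption that case sensitivity is one reason the original script reports fewer hits than a PDF viewer's search box.

```bash
#!/bin/bash
# Sketch: per-page match counts using pdfinfo/pdftotext (poppler-utils assumed).

# count_matches KEYWORD
# Reads text on stdin and prints how many times KEYWORD occurs
# (case-insensitive, fixed string; several hits on one line are all counted).
count_matches() {
    grep -o -i -F -- "$1" | wc -l | tr -d '[:space:]'
}

# search_pdf KEYWORD FILE
# Prints one line per page that contains KEYWORD, with the number of hits.
search_pdf() {
    local keyword=$1 file=$2 pages i n
    pages=$(pdfinfo "$file" 2>/dev/null | awk '/^Pages:/ {print $NF}')
    for ((i = 1; i <= ${pages:-0}; i++)); do
        n=$(pdftotext -q -f "$i" -l "$i" "$file" - | count_matches "$keyword")
        if [ "$n" -gt 0 ]; then
            echo "$file: page $i: $n match(es)"
        fi
    done
}

# Run only when a keyword is given: ./script.sh KEYWORD [DIR]
if [ "$#" -ge 1 ]; then
    for file in "${2:-./src}"/*.pdf; do
        [ -e "$file" ] && search_pdf "$1" "$file"
    done
fi
```

Saved under a name such as `pdfcount.sh` (hypothetical), it would be called as `./pdfcount.sh keyword ./src`.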










Marked as duplicate by Goro, countermode, RalfFriedl, Caleb, meuh on Sep 26 at 17:50.

This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.

      bash pdf file-search

      asked Sep 25 at 9:05 by Erik van de Ven




          1 Answer (accepted)

          I would use pdfgrep:



          pdfgrep -p "your search string" src/*.pdf


          will output the matching page numbers, with a count per page.



          This might not deal with the missing matches; the reasons for those depend on the way the PDFs are constructed (in particular, how the text is assembled).
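
One way text assembly can hide matches: `pdftotext` emits a page line by line, so a keyword that is hyphenated at a line end, or a phrase that wraps onto the next line, never appears on a single line for `grep` to see. The sketch below illustrates that one assumption and is not part of the answer above; `normalize_text` is a hypothetical helper name, and gluing every line that ends in `-` will also merge genuine hyphenated compounds.

```bash
#!/bin/bash
# normalize_text: reads extracted text on stdin and prints it as one line.
# A line ending in "-" is glued to the next line (assumed to be hyphenation);
# every other line break becomes a single space.
normalize_text() {
    awk '{
        if (sub(/-$/, "")) printf "%s", $0
        else               printf "%s ", $0
    }
    END { print "" }'
}

# Hypothetical usage with poppler's pdftotext (page 3 of example.pdf):
#   pdftotext -q -f 3 -l 3 example.pdf - | normalize_text | grep -o -i -F -- "keyword" | wc -l
```

Searching the normalized text finds "searchable" even when the page breaks it as "search-" and "able", at the cost of losing the line structure.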






          • Thanks so much! pdfgrep is far more accurate! It seems to work perfectly.
            – Erik van de Ven
            Sep 25 at 9:17

          answered Sep 25 at 9:12 by Stephen Kitt










