Splitting a single large PDF file into n PDF files based on content and renaming each split file (in Bash)

I'm working on a method for splitting a single large PDF file (which contains the monthly settlements of a credit card). It is built for printing, but we'd like to split it into individual files for later use. Each settlement has a variable length: 2 pages, 3 pages, 4 pages... So we need to "read" each page, find the "Page 1 of X" marker, and take that chunk of pages until the next "Page 1 of X" appears. Also, each resulting file has to be given a unique ID (also contained on the "Page 1 of X" page).



While researching this, I found a tool named "PDF Content Split SA" that would do exactly what we need, but I'm sure there's a way to do this on Linux (we're moving towards open-source/libre software).



Thank you for reading. Any help will be extremely useful.



EDIT



So far, I've found this Nautilus script that could do exactly what we need, but I can't make it work.



#!/bin/bash
# NAUTILUS SCRIPT
# Automatically splits a PDF file into multiple files based on a search criterion,
# renaming the output files using the search criterion and some of the PDF text.

# read files
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS"); unset IFS

# process files
for file in "${filelist[@]}"; do
    pagecount=`pdfinfo "$file" | grep "Pages" | awk '{ print $2 }'`
    # MY SEARCH CRITERIA is a 10-digit-long ID number that begins with the digit 8:
    storedid=`pdftotext -f 1 -l 1 "$file" - | egrep '8?[0-9]{9}'`
    pattern=''
    pagetitle=''
    datestamp=''

    for (( pageindex=1; pageindex<=$pagecount; pageindex+=1 )); do

        header=`pdftotext -f $pageindex -l $pageindex "$file" - | head -n 1`
        pageid=`pdftotext -f $pageindex -l $pageindex "$file" - | egrep '8?[0-9]{9}'`
        let "datestamp = `date +%s%N`" # to avoid overwriting files that would get the same name

        # match the ID found on this page against the stored ID
        if [[ $pageid == $storedid ]]; then
            pattern+="$pageindex " # append the page number, separated by spaces
            pagetitle+="$header+"

            if [[ $pageindex == $pagecount ]]; then # process the last output of the file
                pdftk "$file" cat $pattern output "$storedid $pagetitle $datestamp.pdf"
                storedid=0
                pattern=''
                pagetitle=''
            fi
        else
            # process the previous set of pages to output
            pdftk "$file" cat $pattern output "$storedid $pagetitle $datestamp.pdf"
            storedid=$pageid
            pattern="$pageindex "
            pagetitle="$header+"
        fi
    done
done


I've edited the search criteria, and the script is correctly placed in the Nautilus scripts folder, but it doesn't work. I've tried debugging it using the console activity log and by adding markers to the code; apparently there's a problem with the value returned by pdfinfo, but I have no idea how to solve it.
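
As a starting point for that debugging, it can help to take Nautilus out of the loop: run the script from a terminal with the environment variable it expects set by hand, and check what the page-count pipeline returns on its own (it should print a bare number). This is only a sketch: the file and script names below are placeholders, and anchoring awk on the "Pages:" line is just one common way to parse pdfinfo output.

# Hypothetical test run outside Nautilus (paths are placeholders):
NAUTILUS_SCRIPT_SELECTED_FILE_PATHS="/path/to/settlements.pdf" bash -x ./split_statements.sh

# Verify the page-count extraction in isolation; it should print a single number:
pdfinfo "/path/to/settlements.pdf" | awk '/^Pages:/ { print $2 }'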







asked Jun 6 at 13:55, edited Jun 6 at 19:26 – RBaravalle

  • So far, I've found this Nautilus script that could do exactly what we need, but I can't make it work.
    – RBaravalle
    Jun 6 at 19:21

  • If c#.net is an option, PDFSharp with MigraDoc can do that easily
    – ajeh
    Jun 6 at 19:48


2 Answers
Is some quick Python an option? The package PyPDF2 would let you do exactly what you are asking.






answered Jun 6 at 14:25 – Joe M

  • I'm open to any solution, as long as it runs locally as a script. I'll look at that package. Thank you!
    – RBaravalle
    Jun 6 at 15:33

  • This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. - From Review
    – Jesse_b
    Jun 6 at 16:02

  • @Jesse_b my knowledge in Python is far below the knowledge I could have in Linux scripting (which is also pretty low).
    – RBaravalle
    Jun 6 at 16:14


I've made it work, at least. But now I'd like to optimize the process: it takes up to 40 minutes to process 1000 items in a single massive PDF.



#!/bin/bash
# NAUTILUS SCRIPT
# Automatically splits a PDF file into multiple files based on a search criterion,
# renaming the output files using the search criterion and some of the PDF text.

# read files
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS"); unset IFS

# process files
for file in "${filelist[@]}"; do
    pagecount=$(pdfinfo "$file" | grep "Pages" | awk '{ print $2 }')
    # MY SEARCH CRITERIA is a 10-digit-long ID number that begins with the digit 8:
    #storedid=$(pdftotext -f 1 -l 1 "$file" - | egrep '8?[0-9]{9}')
    storedid=$(pdftotext -f 1 -l 1 "$file" - | egrep 'RESUMEN DE CUENTA Nº ?[0-9]{8}')
    pattern=''
    pagetitle=''
    datestamp=''

    #for (( pageindex=1; pageindex <= $pagecount; pageindex+=1 )); do
    for (( pageindex=1; pageindex <= $pagecount+1; pageindex+=1 )); do

        header=$(pdftotext -f $pageindex -l $pageindex "$file" - | head -n 1)

        pageid=$(pdftotext -f $pageindex -l $pageindex "$file" - | egrep 'RESUMEN DE CUENTA Nº ?[0-9]{8}')

        echo $pageid
        let "datestamp = $(date +%s%N)" # to avoid overwriting files that would get the same name

        # match the ID found on this page against the stored ID
        if [[ $pageid == $storedid ]]; then
            pattern+="$pageindex " # append the page number, separated by spaces
            pagetitle+="$header+"

            if [[ $pageindex == $pagecount ]]; then # process the last output of the file
                # pdftk "$file" cat $pattern output "$storedid $pagetitle $datestamp.pdf"
                pdftk "$file" cat $pattern output "$storedid.pdf"
                storedid=0
                pattern=''
                pagetitle=''
            fi
        else
            # process the previous set of pages to output
            # pdftk "$file" cat $pattern output "$storedid $pagetitle $datestamp.pdf"
            pdftk "$file" cat $pattern output "$storedid.pdf"
            storedid=$pageid
            pattern="$pageindex "
            pagetitle="$header+"
        fi
    done
done
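
Most of that time is probably spent launching pdftotext twice for every page and pdftk once per statement, each of which has to open and parse the big PDF again. Below is a sketch of one possible speed-up, kept close to the script above: extract the text of all pages in a single pdftotext run (pdftotext separates pages with form-feed characters) and scan the in-memory text, so only the pdftk calls remain per statement. It assumes bash 4.4 or newer (for mapfile -d), assumes the statement number appears at least on the first page of each settlement, and uses a placeholder file name together with the same regex as above.

#!/bin/bash
# Sketch only: single-pass text extraction, then split with pdftk.
file="settlements.pdf"   # placeholder input name

# Read every page's text into an array in one pdftotext run;
# pdftotext separates pages with a form-feed (\f) character.
mapfile -d $'\f' -t pages < <(pdftotext "$file" -)

pagecount=${#pages[@]}
storedid=''
pattern=''

for (( i=0; i<pagecount; i++ )); do
    pagenum=$(( i + 1 ))
    # Same search criterion as above; -o prints only the matched ID text.
    pageid=$(grep -Eo 'RESUMEN DE CUENTA Nº ?[0-9]{8}' <<< "${pages[i]}" | head -n 1)

    if [[ -n $pageid && $pageid != "$storedid" ]]; then
        # A new statement starts on this page: flush the previous one, if any.
        [[ -n $pattern ]] && pdftk "$file" cat $pattern output "$storedid.pdf"
        storedid=$pageid
        pattern="$pagenum "
    else
        # Same statement (or a continuation page without the ID): keep collecting pages.
        pattern+="$pagenum "
    fi
done

# Flush the last statement.
[[ -n $pattern ]] && pdftk "$file" cat $pattern output "$storedid.pdf"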





answered Jun 7 at 18:11 – RBaravalle (accepted)