Splitting a single large PDF file into n PDF files based on content and renaming each split file (in Bash)

I'm working on a method for splitting a single large PDF file (which contains the monthly settlements of a credit card). It is built for printing, but we'd like to split it into individual files for later use. Each settlement has a variable length: 2 pages, 3 pages, 4 pages... So we need to "read" each page, find the "Page 1 of X" marker, and take that chunk of pages until the next "Page 1 of X" appears. Also, each resulting file has to be given a unique ID (also contained on the "Page 1 of X" page).



While researching this, I found a tool named "PDF Content Split SA" that would do exactly what we need, but I'm sure there's a way to do this on Linux (we're moving towards open-source/libre software).



Thank you for reading. Any help will be extremely useful.



EDIT



So far, I've found this Nautilus script that could do exactly what we need, but I can't make it work.



#!/bin/bash
# NAUTILUS SCRIPT
# Automatically splits a PDF file into multiple files based on a search criterion,
# renaming the output files using the search criterion and some of the PDF text.

# read files
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS"); unset IFS

# process files
for file in "${filelist[@]}"; do
    pagecount=`pdfinfo "$file" | grep "Pages" | awk '{ print $2 }'`
    # MY SEARCH CRITERIA is a 10-digit-long ID number that begins with the digit 8:
    storedid=`pdftotext -f 1 -l 1 "$file" - | egrep '8?[0-9]{9}'`
    pattern=''
    pagetitle=''
    datestamp=''

    for (( pageindex=1; pageindex<=$pagecount; pageindex+=1 )); do

        header=`pdftotext -f $pageindex -l $pageindex "$file" - | head -n 1`
        pageid=`pdftotext -f $pageindex -l $pageindex "$file" - | egrep '8?[0-9]{9}'`
        let "datestamp = `date +%s%N`" # to avoid overwriting files that would get the same name

        # match the ID found on this page against the stored ID
        if [[ $pageid == $storedid ]]; then
            pattern+="$pageindex " # append the page number, separated by spaces
            pagetitle+="$header+"

            if [[ $pageindex == $pagecount ]]; then # process the last output of the file
                pdftk "$file" cat $pattern output "$storedid $pagetitle $datestamp.pdf"
                storedid=0
                pattern=''
                pagetitle=''
            fi
        else
            # process the previous set of pages to output
            pdftk "$file" cat $pattern output "$storedid $pagetitle $datestamp.pdf"
            storedid=$pageid
            pattern="$pageindex "
            pagetitle="$header+"
        fi
    done
done


I've edited the search criteria, and the script is correctly placed in the Nautilus scripts folder, but it doesn't work. I've tried debugging it using the console activity log and by adding markers to the code; apparently there's a problem with the value returned by pdfinfo, but I have no idea how to solve it.
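
As a starting point for that debugging, it can help to take Nautilus out of the loop: run the script from a terminal with the environment variable it expects set by hand, and check what the page-count pipeline returns on its own (it should print a bare number). This is only a sketch: the file and script names below are placeholders, and anchoring awk on the "Pages:" line is just one common way to parse pdfinfo output.

# Hypothetical test run outside Nautilus (paths are placeholders):
NAUTILUS_SCRIPT_SELECTED_FILE_PATHS="/path/to/settlements.pdf" bash -x ./split_statements.sh

# Verify the page-count extraction in isolation; it should print a single number:
pdfinfo "/path/to/settlements.pdf" | awk '/^Pages:/ { print $2 }'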







asked Jun 6 at 13:55, edited Jun 6 at 19:26 – RBaravalle

  • So far, I've found this Nautilus script that could do exactly what we need, but I can't make it work.
    – RBaravalle
    Jun 6 at 19:21

  • If c#.net is an option, PDFSharp with MigraDoc can do that easily
    – ajeh
    Jun 6 at 19:48


2 Answers
Is some quick Python an option? The package PyPDF2 would let you do exactly what you are asking.






answered Jun 6 at 14:25 – Joe M

  • I'm open to any solution, as long as it runs locally as a script. I'll look at that package. Thank you!
    – RBaravalle
    Jun 6 at 15:33

  • This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. - From Review
    – Jesse_b
    Jun 6 at 16:02

  • @Jesse_b my knowledge in Python is far below the knowledge I could have in Linux scripting (which is also pretty low).
    – RBaravalle
    Jun 6 at 16:14


I've made it work, at least. But now I'd like to optimize the process: it takes up to 40 minutes to process 1000 items in a single massive PDF.



#!/bin/bash
# NAUTILUS SCRIPT
# Automatically splits a PDF file into multiple files based on a search criterion,
# renaming the output files using the search criterion and some of the PDF text.

# read files
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS"); unset IFS

# process files
for file in "${filelist[@]}"; do
    pagecount=$(pdfinfo "$file" | grep "Pages" | awk '{ print $2 }')
    # MY SEARCH CRITERIA is a 10-digit-long ID number that begins with the digit 8:
    #storedid=$(pdftotext -f 1 -l 1 "$file" - | egrep '8?[0-9]{9}')
    storedid=$(pdftotext -f 1 -l 1 "$file" - | egrep 'RESUMEN DE CUENTA Nº ?[0-9]{8}')
    pattern=''
    pagetitle=''
    datestamp=''

    #for (( pageindex=1; pageindex <= $pagecount; pageindex+=1 )); do
    for (( pageindex=1; pageindex <= $pagecount+1; pageindex+=1 )); do

        header=$(pdftotext -f $pageindex -l $pageindex "$file" - | head -n 1)

        pageid=$(pdftotext -f $pageindex -l $pageindex "$file" - | egrep 'RESUMEN DE CUENTA Nº ?[0-9]{8}')

        echo $pageid
        let "datestamp = $(date +%s%N)" # to avoid overwriting files that would get the same name

        # match the ID found on this page against the stored ID
        if [[ $pageid == $storedid ]]; then
            pattern+="$pageindex " # append the page number, separated by spaces
            pagetitle+="$header+"

            if [[ $pageindex == $pagecount ]]; then # process the last output of the file
                # pdftk "$file" cat $pattern output "$storedid $pagetitle $datestamp.pdf"
                pdftk "$file" cat $pattern output "$storedid.pdf"
                storedid=0
                pattern=''
                pagetitle=''
            fi
        else
            # process the previous set of pages to output
            # pdftk "$file" cat $pattern output "$storedid $pagetitle $datestamp.pdf"
            pdftk "$file" cat $pattern output "$storedid.pdf"
            storedid=$pageid
            pattern="$pageindex "
            pagetitle="$header+"
        fi
    done
done
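
Most of that time is probably spent launching pdftotext twice for every page and pdftk once per statement, each of which has to open and parse the big PDF again. Below is a sketch of one possible speed-up, kept close to the script above: extract the text of all pages in a single pdftotext run (pdftotext separates pages with form-feed characters) and scan the in-memory text, so only the pdftk calls remain per statement. It assumes bash 4.4 or newer (for mapfile -d), assumes the statement number appears at least on the first page of each settlement, and uses a placeholder file name together with the same regex as above.

#!/bin/bash
# Sketch only: single-pass text extraction, then split with pdftk.
file="settlements.pdf"   # placeholder input name

# Read every page's text into an array in one pdftotext run;
# pdftotext separates pages with a form-feed (\f) character.
mapfile -d $'\f' -t pages < <(pdftotext "$file" -)

pagecount=${#pages[@]}
storedid=''
pattern=''

for (( i=0; i<pagecount; i++ )); do
    pagenum=$(( i + 1 ))
    # Same search criterion as above; -o prints only the matched ID text.
    pageid=$(grep -Eo 'RESUMEN DE CUENTA Nº ?[0-9]{8}' <<< "${pages[i]}" | head -n 1)

    if [[ -n $pageid && $pageid != "$storedid" ]]; then
        # A new statement starts on this page: flush the previous one, if any.
        [[ -n $pattern ]] && pdftk "$file" cat $pattern output "$storedid.pdf"
        storedid=$pageid
        pattern="$pagenum "
    else
        # Same statement (or a continuation page without the ID): keep collecting pages.
        pattern+="$pagenum "
    fi
done

# Flush the last statement.
[[ -n $pattern ]] && pdftk "$file" cat $pattern output "$storedid.pdf"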





answered Jun 7 at 18:11 – RBaravalle (accepted)