Splitting a single large PDF file into n PDF files based on content and renaming each split file (in Bash)
I'm working on a method of splitting a single large PDF file (which contains the monthly settlements of a credit card). It is built for printing, but we'd like to split it into individual files for later use. Each settlement has a variable length: 2 pages, 3 pages, 4 pages... So we need to "read" each page, find the "Page 1 of X" marker, and split off the chunk that runs until the next "Page 1 of X" appears. Also, each resulting file has to have a unique ID (also contained on the "Page 1 of X" page).
While researching, I found a tool named "PDF Content Split SA" that would do exactly the task we need. But I'm sure there's a way to do this on Linux (we're moving towards open source and libre software).
Thank you for reading. Any help will be extremely useful.
EDIT
So far, I've found this Nautilus script that could do exactly what we need, but I can't make it work.
#!/bin/bash
# NAUTILUS SCRIPT
# Automatically splits a PDF file into multiple files based on a search
# criterion, naming the output files from the search match and some of the PDF text.

# Read the newline-separated list of selected files into an array.
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS"); unset IFS

# process files
for file in "${filelist[@]}"; do
    pagecount=$(pdfinfo "$file" | grep "Pages" | awk '{ print $2 }')
    # MY SEARCH CRITERIA is a 10-digit ID number that begins with 8:
    storedid=$(pdftotext -f 1 -l 1 "$file" - | egrep '8?[0-9]{9}')
    pattern=''
    pagetitle=''
    datestamp=''
    for (( pageindex=1; pageindex<=pagecount; pageindex+=1 )); do
        header=$(pdftotext -f $pageindex -l $pageindex "$file" - | head -n 1)
        pageid=$(pdftotext -f $pageindex -l $pageindex "$file" - | egrep '8?[0-9]{9}')
        datestamp=$(date +%s%N)   # to avoid overwriting files with the same name
        # match the ID found on the page against the stored ID
        if [[ $pageid == "$storedid" ]]; then
            pattern+="$pageindex "   # append the page number as text, space-separated
            pagetitle+="$header+"
            if [[ $pageindex == $pagecount ]]; then   # process the last output of the file
                # $pattern is deliberately unquoted so the page numbers become separate arguments
                pdftk "$file" cat $pattern output "$storedid $pagetitle $datestamp.pdf"
                storedid=0
                pattern=''
                pagetitle=''
            fi
        else
            # process the previous set of pages
            pdftk "$file" cat $pattern output "$storedid $pagetitle $datestamp.pdf"
            storedid=$pageid
            pattern="$pageindex "
            pagetitle="$header+"
        fi
    done
done
I've edited the search criteria, and the script is correctly placed in the Nautilus scripts folder, but it doesn't work. I've tried debugging using the console activity log and adding marks in the code; apparently there's a conflict with the resulting value of pdfinfo, but I have no idea how to solve it.
linux command-line pdf split
edited Jun 6 at 19:26
asked Jun 6 at 13:55
RBaravalle
So far, I've found this Nautilus script that could do exactly what we need, but I can't make it work.
– RBaravalle
Jun 6 at 19:21

If C#/.NET is an option, PDFSharp with MigraDoc can do that easily.
– ajeh
Jun 6 at 19:48
2 Answers
Is some quick Python an option? The package PyPDF2 would let you do exactly what you are asking.

answered Jun 6 at 14:25
Joe M
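For reference, a minimal sketch of that PyPDF2 approach (PyPDF2 2.x API assumed). The input file name is a placeholder, and the "Page 1 of" marker and ID pattern are assumptions taken from the question; the real statements appear to be in Spanish, so the marker text would need adjusting.

import re
from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("settlements.pdf")  # placeholder input file name
writer = None
current_id = "unknown"

def flush(writer, current_id):
    # write the accumulated chunk to a file named after the settlement ID
    if writer is not None:
        with open(f"{current_id}.pdf", "wb") as out:
            writer.write(out)

for page in reader.pages:
    text = page.extract_text() or ""
    if re.search(r"Page 1 of \d+", text):  # a new settlement starts here
        flush(writer, current_id)          # flush the previous settlement, if any
        writer = PdfWriter()
        m = re.search(r"8[0-9]{9}", text)  # 10-digit ID starting with 8
        current_id = m.group(0) if m else "unknown"
    if writer is not None:                 # pages before the first marker are skipped
        writer.add_page(page)

flush(writer, current_id)                  # flush the last settlement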
I'm open to any solution as long as it runs locally as a script. I'll look at that package. Thank you!
– RBaravalle
Jun 6 at 15:33

This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. - From Review
– Jesse_b
Jun 6 at 16:02

@Jesse_b my knowledge of Python is far below my knowledge of Linux scripting (which is also pretty low).
– RBaravalle
Jun 6 at 16:14
accepted

I've made it; at least, it worked. But now I'd like to optimize the process: it takes up to 40 minutes to process 1000 items in a single massive PDF.
#!/bin/bash
# NAUTILUS SCRIPT
# Automatically splits a PDF file into multiple files based on a search
# criterion, naming the output files from the search match.

# Read the newline-separated list of selected files into an array.
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS"); unset IFS

# process files
for file in "${filelist[@]}"; do
    pagecount=$(pdfinfo "$file" | grep "Pages" | awk '{ print $2 }')
    # MY SEARCH CRITERIA is a 10-digit ID number that begins with 8:
    #storedid=$(pdftotext -f 1 -l 1 "$file" - | egrep '8?[0-9]{9}')
    storedid=$(pdftotext -f 1 -l 1 "$file" - | egrep 'RESUMEN DE CUENTA Nº ?[0-9]{8}')
    pattern=''
    pagetitle=''
    datestamp=''
    # Loop one page past the end so the final chunk is flushed by the else branch.
    #for (( pageindex=1; pageindex<=pagecount; pageindex+=1 )); do
    for (( pageindex=1; pageindex<=pagecount+1; pageindex+=1 )); do
        header=$(pdftotext -f $pageindex -l $pageindex "$file" - | head -n 1)
        pageid=$(pdftotext -f $pageindex -l $pageindex "$file" - | egrep 'RESUMEN DE CUENTA Nº ?[0-9]{8}')
        echo "$pageid"
        datestamp=$(date +%s%N)   # to avoid overwriting files with the same name
        # match the ID found on the page against the stored ID
        if [[ $pageid == "$storedid" ]]; then
            pattern+="$pageindex "   # append the page number as text, space-separated
            pagetitle+="$header+"
            if [[ $pageindex == $pagecount ]]; then   # process the last output of the file
                # pdftk "$file" cat $pattern output "$storedid $pagetitle $datestamp.pdf"
                pdftk "$file" cat $pattern output "$storedid.pdf"
                storedid=0
                pattern=''
                pagetitle=''
            fi
        else
            # process the previous set of pages (skip when no pages are pending)
            # pdftk "$file" cat $pattern output "$storedid $pagetitle $datestamp.pdf"
            [[ -n $pattern ]] && pdftk "$file" cat $pattern output "$storedid.pdf"
            storedid=$pageid
            pattern="$pageindex "
            pagetitle="$header+"
        fi
    done
done
answered Jun 7 at 18:11
RBaravalle
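On the runtime question: most of those 40 minutes likely go to launching pdftotext twice for every page. One way to cut that down is to extract the whole document's text in a single pass and split it on the form-feed character that pdftotext emits between pages, matching the IDs in memory. A minimal sketch (the input file name is a placeholder; the regex is the one from the script above):

import re
import subprocess

# Extract all text once; pdftotext writes a form feed (\f) between pages.
text = subprocess.run(
    ["pdftotext", "settlements.pdf", "-"],  # placeholder input file name
    capture_output=True, text=True, check=True,
).stdout
pages = text.split("\f")

# One regex match per page, with no further subprocess calls.
for index, page_text in enumerate(pages, start=1):
    m = re.search(r"RESUMEN DE CUENTA Nº ?[0-9]{8}", page_text)
    page_id = m.group(0) if m else ""
    print(index, page_id)  # these pairs can then drive the same pdftk grouping as above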