Command line tool to search and replace text on a PDF

I have a rather long PDF with my name as an obnoxious watermark throughout. I tried replacing the text with blanks in LibreOffice Draw, but although my name does appear as text, the find and replace function seems to tank my computer, taking significant RAM and CPU time.

Is there a command line way to remove strings from a PDF? Hmm... can sed do that?

asked Dec 14 at 21:45 by j0h, edited 4 hours ago by Pablo Bianchi

command-line libreoffice pdf

2 Answers

Because in many cases the watermark is just text, you can often remove it with sed or, in fact, any text editor. Let’s say it reads “watermark”:

sed 's/watermark//g' in.pdf >out.pdf

If your PDF file is compressed, this doesn’t work; you need to uncompress it first, e.g. with pdftk (see “How can I install pdftk in Ubuntu 18.04 and later?”):

pdftk in.pdf output out.pdf uncompress

If sed’s output is not readable with your preferred PDF reader, try repairing it with pdftk:

pdftk out.pdf output out_pdftk.pdf
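Before relying on sed, it may help to check whether the watermark is stored as a literal string at all. The following is only a sketch: count_literal is a hypothetical helper, and the two sample files are synthetic stand-ins for lines of an uncompressed content stream, not real PDFs.

```shell
#!/bin/sh
# Hypothetical helper: count literal occurrences of a string in a file.
# -a treats binary data as text, -o prints each match on its own line,
# -F matches the pattern as a fixed string rather than a regex.
count_literal() {
    grep -a -o -F -e "$1" "$2" | wc -l | tr -d ' '
}

# Synthetic samples (assumptions for illustration, not real PDF files).
sample_dir=$(mktemp -d)
printf 'BT /F1 12 Tf (watermark) Tj ET\n' > "$sample_dir/literal.txt"
printf 'BT /F2 16 Tf [<01>29<0203>-2<0405>] TJ ET\n' > "$sample_dir/hex.txt"

echo "literal sample: $(count_literal watermark "$sample_dir/literal.txt") match(es)"
echo "hex sample:     $(count_literal watermark "$sample_dir/hex.txt") match(es)"

# Clean up the temporary samples.
rm -rf "$sample_dir"
```

A count of 0 on the uncompressed file suggests that the sed command above would have nothing to replace.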

Further reading: How to Edit PDFs?

Source: How to remove watermark from pdf using pdftk • Super User

answered Dec 14 at 21:58 by dessert, edited 7 hours ago

1

Sorry, your answer is as wrong as it could be. What appears to be ASCII text when displayed in a PDF viewer may be hex-encoded inside the PDF source code, or its characters might be placed individually, each with its own coordinate information sprinkled in between. Hence your sed command will not succeed, not even after uncompressing the PDF.
– Kurt Pfeifle
8 hours ago

@KurtPfeifle It may be, but it may also be just ASCII text; PDF is far from being standardized enough to be sure. I changed the wording to make it sound less universal. Of course you’re right that there are cases where sed fails here. You’re familiar with the topic, so do you have a better answer for OP? I’d love to learn about it!
– dessert
7 hours ago

PDF is standardized enough for me to be sure about what I wrote above and below. I'm quite familiar with this standard. The cases where the sed method will fail are the overwhelming majority. The OP's requirements cannot be fulfilled by a CLI tool.
– Kurt Pfeifle
7 hours ago


Accepted answer will work only in rare cases

Sorry, the answer given by @dessert is as wrong as it could be as general advice. It will not work for the general case of text replacement in PDFs (watermarks or not), and you would have to be very lucky to encounter one of the rare PDFs where it does work. (Moreover, watermarks inserted by LibreOffice are frequently converted into vector or pixel graphics, even if they look like text when printed or viewed on screen. I will not discuss that case any further; below I deal only with real text content in a PDF.)

Reasons

The reasons are these:

What appears to be ASCII text when displayed in a PDF viewer will very likely not be ASCII text inside the PDF source code. Instead, it may be hex-encoded.

Additionally, the individual characters of an ASCII string might be placed on the page consecutively, but they may just as easily be placed individually, each with its own coordinate information sprinkled in between the characters.

Also, the hex encoding of the ASCII (and non-ASCII) character table (the "mapping") is not predictable, and it may change from font to font.

Hence, in all these cases your sed command will not succeed, not even after uncompressing the PDF.

Example

Here is an example of how the "string" Watermark can appear inside a PDF created with LibreOffice:

56.8 726.989 Td /F2 16 Tf[<01>29<0203>-2<0405>6<06>-1<020507>]TJ

I'll dissect for you what that means:

56.8 726.989 Td: Td is an operator that moves the text position on the page; 56.8 726.989 are the x and y values that describe that position.

/F2 16 Tf: Tf is an operator to set a certain font as well as its size as the currently active one; in this case it is the font tagged elsewhere with the name /F2 and its size should be 16 pt.

[<01>29<0203>-2<0405>6<06>-1<020507>]TJ: TJ is an operator that shows text while also allowing individual glyph positioning. The hex snippets enclosed in angle brackets mean the following, according to the 'charmap' table specific to that PDF and the font used:
- <01>: this is the 'W'.
- <0203>: this is the 'at'.
- <0405>: this is the 'er'.
- <06>: this is the 'm'.
- <020507>: this is the 'ark'.
The numbers in between these hex snippets (29, -2, 6 and -1) are correction values which determine the individual spacings of the different characters.
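To make the dissection concrete, here is a small sketch that decodes the TJ array above back into readable text. The code-to-letter table is taken from the list above and is valid only for this one font in this one file; decode_tj is a hypothetical helper name, and the sketch assumes one byte (two hex digits) per glyph, as in this example.

```shell
#!/bin/sh
# Hypothetical helper: decode a TJ array using the file-specific table
# from this example. Codes outside the table are shown as '?'.
decode_tj() {
    # Keep only the <...> hex strings; this drops the spacing numbers.
    hex=$(printf '%s' "$1" | grep -o '<[0-9A-Fa-f]*>' | tr -d '<>\n')
    text=''
    while [ -n "$hex" ]; do
        code=$(printf '%s' "$hex" | cut -c1-2)   # next two hex digits
        hex=$(printf '%s' "$hex" | cut -c3-)     # remainder
        case $code in
            01) text="${text}W" ;;
            02) text="${text}a" ;;
            03) text="${text}t" ;;
            04) text="${text}e" ;;
            05) text="${text}r" ;;
            06) text="${text}m" ;;
            07) text="${text}k" ;;
            *)  text="${text}?" ;;
        esac
    done
    printf '%s\n' "$text"
}

decode_tj '[<01>29<0203>-2<0405>6<06>-1<020507>]TJ'
```

With this table the call prints Watermark. A different font or a different file would need a different table, which is exactly why a fixed search string cannot be known in advance.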

Now you show me how you'd replace that "string" by something else by using sed... Remember, you do not know the encoding in advance, nor the placement correction numbers, when you deal with an arbitrary PDF. You can only find out by opening its source code in an editor and analysing its content.
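As a sketch of why the naive approach fails on this example, the snippet below runs the sed substitution from the other answer (with the capitalised word) over that one content-stream line and checks whether anything changed. The variable names are illustrative only.

```shell
#!/bin/sh
# The content-stream line from the example, stored as plain text.
stream_line='56.8 726.989 Td /F2 16 Tf[<01>29<0203>-2<0405>6<06>-1<020507>]TJ'

# Try to delete the visible word, as the naive approach would.
after_sed=$(printf '%s\n' "$stream_line" | sed 's/Watermark//g')

if [ "$after_sed" = "$stream_line" ]; then
    echo 'unchanged: sed found nothing to replace'
else
    echo 'changed'
fi
```

The line passes through untouched because the word never occurs literally in it.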

Executive Summary

No, there is no command line way to reliably remove unwanted strings from a PDF!

You can only do this if...

(a) ...you are a PDF expert who is skilled at reading PDF source code;

(b) ...you are prepared to analyse the PDF file in question individually.
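Once one particular file has been analysed by hand, a file-specific pattern becomes possible. The following is only a sketch under that assumption: blank_tj is a hypothetical helper that replaces the exact TJ array from the example with an empty array, and it operates on a single line of text rather than on a real PDF. In a real file the stream would have to be uncompressed first, and because the byte counts change, the result may need repairing afterwards (for instance with pdftk, as in the other answer).

```shell
#!/bin/sh
# Hypothetical helper: replace one exact, file-specific TJ array with an
# empty array. The opening bracket is escaped because '[' starts a bracket
# expression in a regex; the closing ']' is literal outside of one.
blank_tj() {
    sed 's/\[<01>29<0203>-2<0405>6<06>-1<020507>]TJ/[]TJ/g'
}

printf '%s\n' '56.8 726.989 Td /F2 16 Tf[<01>29<0203>-2<0405>6<06>-1<020507>]TJ' | blank_tj
```

The pattern is useless for any other PDF, since the hex codes and spacing values differ from file to file.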

WARNING: The answer currently marked as 'accepted' might have worked for the specific PDF of the OP. However, it will not work in the general case. Don't take the "recipe" it advertises for granted!

answered 7 hours ago by Kurt Pfeifle, edited 7 hours ago


