Mass convert thousands of downloaded (with wget) HTML documents to DOCX

I would like to batch-convert all the HTML files that I downloaded from a URL with wget.

I want to convert a complete website to DOCX format. We are talking about 3000 HTML documents downloaded from the URL. Running Pandoc by hand on each file is tedious.

Can this be automated in some way?

edited Dec 15 at 21:48

Kurt Pfeifle

43038

asked Apr 24 at 0:38

user3127939

Do you want 3000 standalone Word docs, or do you want one massive doc with internal links, etc.?
– ivanivan
Apr 24 at 0:44

Downloading the URL with wget gives me 3000 HTML files. I would like to keep them independent if possible, with the content of each web page in DOCX.
– user3127939
Apr 24 at 2:08

Look at the headless document conversion option of OpenOffice/LibreOffice.
– ivanivan
Apr 24 at 2:17

What's the use if I can not agree with wget? You are not answering my question.
– user3127939
Apr 24 at 12:10

wget html pandoc

1 Answer

1. Convert after downloading

What's the problem with using Pandoc on your saved HTML files?

Assuming your HTML files are all in a directory named wget-html, you could do the following:


 cd wget-html

 # Run pandoc once per HTML file; "$1" is the path that find passes in.
 find . -type f -name "*.html" -exec sh -c '
   pandoc \
     --from=html \
     --to=docx \
     --toc \
     --standalone \
     --output="$1.docx" \
     "$1"
 ' sh {} \;

This will create a DOCX file for each "path/to/some.html", named "path/to/some.html.docx".
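
With 3000 files, a few conversions may fail, and you may need to restart the run. Here is a minimal sketch of a resumable variant, under some assumptions of mine that are not in the question: the output is named "some.docx" rather than "some.html.docx", files that already have a DOCX are skipped, and failed inputs are written to a log file. The names docx_name, convert_tree and failed.txt are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Sketch only: batch-convert HTML files to DOCX with pandoc.
# Assumption: file names contain no newline characters.

# Map "path/to/some.html" to "path/to/some.docx".
docx_name() {
    printf '%s\n' "${1%.html}.docx"
}

# Convert every *.html below directory $1.
# Inputs that pandoc fails on are appended to the log file $2.
convert_tree() {
    src_dir=$1
    fail_log=$2
    find "$src_dir" -type f -name '*.html' | while IFS= read -r html; do
        out=$(docx_name "$html")
        # Skip files converted by an earlier run, so the job can be resumed.
        [ -e "$out" ] && continue
        pandoc --from=html --to=docx --standalone --output="$out" "$html" </dev/null \
            || printf '%s\n' "$html" >> "$fail_log"
    done
}
```

Call it as `convert_tree wget-html failed.txt`. Afterwards, failed.txt (if it exists) lists the HTML files that need a second look.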

2. Convert while downloading

If you want to achieve this, say so. But first please indicate which exact wget command you were using.

answered Dec 15 at 19:15

Kurt Pfeifle

43038
