Mass convert thousands of downloaded (with wget) HTML documents to DOCX
I would like to convert all the HTML files that I downloaded from a URL with wget.
I want to convert the complete website to DOCX format. We are talking about roughly 3000 HTML documents downloaded from the URL. Running Pandoc by hand on each file is tedious without some automation.
Can this be done automatically in some way?
wget html pandoc
Do you want 3000 standalone Word documents, or do you want one massive document with internal links, etc.?
– ivanivan
Apr 24 at 0:44
Downloading the URL with wget gives me 3000 HTML files. I would like to keep them independent if possible, with the content of each web page in DOCX.
– user3127939
Apr 24 at 2:08
Look at the headless document conversion option of OpenOffice/LibreOffice.
– ivanivan
Apr 24 at 2:17
what's the use if I can not agree with wget, you are not answering my question.
– user3127939
Apr 24 at 12:10
edited Dec 15 at 21:48
Kurt Pfeifle
43038
asked Apr 24 at 0:38
user3127939
61
1 Answer
1. Convert after downloading
What's the problem with using Pandoc on your saved HTML files?
Assuming your HTML files are all in a directory named wget-html, you could do the following:
    cd wget-html
    # Run pandoc once per HTML file; NUL-separated names keep spaces in paths safe.
    find . -name "*.html" -print0 \
      | xargs -0 -I{} \
          pandoc \
            --from=html \
            --to=docx \
            --toc \
            --standalone \
            --output={}.docx \
            {}
This will create a DOCX file for each "path/to/some.html" named "path/to/some.html.docx".
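If the run may be interrupted and restarted, a loop that skips files already converted can save time with 3000 documents. The following is only a sketch under a few assumptions: the function name convert_tree is made up for illustration, the output name drops the .html suffix (some.html becomes some.docx, which differs from the naming above), and pandoc is expected to be on the PATH.

```shell
#!/bin/sh
# convert_tree DIR  (hypothetical helper name)
# Convert every *.html file below DIR to a DOCX file next to it.
# Existing .docx files are left alone, so the run can be resumed.
convert_tree() {
    find "$1" -type f -name '*.html' -exec sh -c '
        for src do
            out="${src%.html}.docx"      # some.html -> some.docx
            [ -e "$out" ] && continue    # already converted, skip
            pandoc --from=html --to=docx --output="$out" "$src" \
                || printf "failed: %s\n" "$src" >&2
        done
    ' sh {} +
}
```

Called as convert_tree wget-html, it walks the whole directory tree. A file that fails to convert is reported on standard error and the loop carries on with the next one.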
2. Convert while downloading
If you want to achieve this, say so. But first please indicate which exact wget command you were using.
answered Dec 15 at 19:15
Kurt Pfeifle
43038