Mass convert thousands of downloaded (with wget) HTML documents to DOCX

I would like to process and convert all the HTML files that I downloaded from a URL with wget.



I want to convert a complete website to DOCX format. We are talking about 3000 HTML documents downloaded from the URL. Running Pandoc by hand on each file is tedious without some automation.



Can this be automated in some way?










  • Do you want 3000 standalone Word docs, or do you want one massive doc with internal links, etc.?
    – ivanivan
    Apr 24 at 0:44










  • Downloading the URL with wget gives me 3000 HTML files. I would like to keep them independent if possible: the content of each web page in its own DOCX file.
    – user3127939
    Apr 24 at 2:08











  • Look at the headless document conversion option of OpenOffice/LibreOffice.
    – ivanivan
    Apr 24 at 2:17










  • what's the use if I can not agree with wget, you are not answering my question.
    – user3127939
    Apr 24 at 12:10















wget html pandoc






edited Dec 15 at 21:48









Kurt Pfeifle

asked Apr 24 at 0:38









user3127939

1 Answer
1. Convert after downloading



What's the problem with using Pandoc on your saved HTML files?



Assuming your HTML files are all in a directory named wget-html, you could do the following:




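cd wget-html

# -print0 and -0 keep file names with spaces intact;
# -I {} runs pandoc once per file, substituting the file name for {}.
find . -name "*.html" -print0 \
 | xargs -0 -I {} \
   pandoc \
     --from=html \
     --to=docx \
     --toc \
     --standalone \
     --output={}.docx \
     {}


This will create a DOCX file for each "path/to/some.html" named "path/to/some.html.docx".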
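
The same batch conversion could also be wrapped in a small shell function. This is only a sketch, assuming a POSIX sh, find, and an installed pandoc. The names docx_name and convert_tree and the PANDOC override variable are made up for illustration and are not part of Pandoc. It drops the .html suffix, so "path/to/some.html" becomes "path/to/some.docx", and it reports files that fail to convert instead of stopping.

```shell
#!/bin/sh
# Sketch only: docx_name, convert_tree and PANDOC are illustrative names.

# Map "dir/page.html" to "dir/page.docx".
docx_name() {
    printf '%s\n' "${1%.html}.docx"
}

# Convert every *.html file below the given directory (default: current one).
# Set PANDOC to point at a different pandoc binary if needed.
convert_tree() {
    find "${1:-.}" -type f -name '*.html' -exec sh -c '
        converter=$1
        shift
        # Remaining arguments are the file names handed over by find.
        for src do
            out="${src%.html}.docx"
            "$converter" --from=html --to=docx --output="$out" "$src" ||
                printf "conversion failed: %s\n" "$src" >&2
        done
    ' sh "${PANDOC:-pandoc}" {} +
}
```

After sourcing these definitions, running convert_tree wget-html would walk the download directory. Because find passes the file names as arguments rather than through a pipe, names containing spaces need no special handling.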



2. Convert while downloading



If you want to achieve this, say so. But first please indicate which exact wget command you were using.






answered Dec 15 at 19:15
Kurt Pfeifle
