Remove blocks matching a string from a huge html file

I am on a Mac and want to remove multiple <div> blocks that contain a certain string from an HTML file. I tried to do this with sed, but failed:



  1. First I escaped all the symbols in my STRING that have special regex meaning, which produced ESCAPEDSTRING.


  2. But now I am struggling to find a tool that can apply a regex across multiple lines and remove the matching lines. I guess sed won't work.


In the following example, I want to remove every <div> block that contains the string GET /thestring//index.php, while everything else (i.e., the second-to-last block, which contains GET /thisisatotallydifferentstring) stays in the HTML file. A sample foo.html looks like this:



<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:45:40 +0200] "GET /thestring//index.php?fnc=OSCImportStock&amp;session_id=1523141136.42&amp;data=%3cARTICLE_ITEM%3e%3cARTICLE_ITEM_ID%3e2000976405029%3c%2fARTICLE_ITEM_ID%3e%3cQUANTITY%3e21%3c%2fQUANTITY%3e%3cDELIVERY_DATE%2f%3e%3cMIN_STOCK_QTY%3e0%3c%2fMIN_STOCK_QTY%3e%3cACTIVE%3eTrue%3c%2fACTIVE%3e%3cEAN%3e000035010005%3c%2fEAN%3e%3cOPENSUPPLORDERS%3e0%3c%2fOPENSUPPLORDERS%3e%3c%2fARTICLE_ITEM%3e HTTP/1.1" 200 339 "-" "-"
</span><br>
</div>
<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:45:40 +0200] "GET /thestring//index.php?fnc=OSCImportStock&amp;session_id=1523141136.42&amp;data=%3cARTICLE_ITEM%3e%3cARTICLE_ITEM_ID%3e2001021500003%3c%2fARTICLE_ITEM_ID%3e%3cQUANTITY%3e1%3c%2fQUANTITY%3e%3cDELIVERY_DATE%2f%3e%3cMIN_STOCK_QTY%3e0%3c%2fMIN_STOCK_QTY%3e%3cACTIVE%3eTrue%3c%2fACTIVE%3e%3cEAN%3e501302462%3c%2fEAN%3e%3cOPENSUPPLORDERS%3e0%3c%2fOPENSUPPLORDERS%3e%3c%2fARTICLE_ITEM%3e HTTP/1.1" 200 349 "-" "-"
</span><br>
</div>
<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:50:17 +0200] "GET /thestring//index.php?fnc=OSCExportCatalog&amp;session_id=3214235353.32&amp;onlynew=y HTTP/1.1" 200 676 "-" "-"
</span><br>
</div>
<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:50:18 +0200] "GET /thestring//index.php?fnc=OSCConfirmCatalog&amp;session_id=3214235353.32&amp;date= HTTP/1.1" 200 249 "-" "-"
</span><br>
</div>
<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:50:28 +0200] "GET /thestring//index.php?fnc=OSCExportOrder&amp;session_id=3214123353.99 HTTP/1.1" 200 278 "-" "-"
</span><br>
</div>
<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:55:18 +0200] "GET /thestring//index.php?fnc=OSCExportCatalog&amp;session_id=1523141718.15&amp;onlynew=y HTTP/1.1" 200 676 "-" "-"
</span><br>
</div>
<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:55:19 +0200] "GET /thestring//index.php?fnc=OSCConfirmCatalog&amp;session_id=1523141718.15&amp;date= HTTP/1.1" 200 249 "-" "-"
</span><br>
</div>
<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:55:29 +0200] "GET /thestring//index.php?fnc=OSCExportOrder&amp;session_id=1523141729.64 HTTP/1.1" 200 278 "-" "-"
</span><br>
</div>
<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:01:00:27 +0200] "GET /thestring//index.php?fnc=OSCExportCatalog&amp;session_id=1523142027.44&amp;onlynew=y HTTP/1.1" 200 676 "-" "-"
</span><br>
</div>
<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:01:00:28 +0200] "GET /thestring//index.php?fnc=OSCConfirmCatalog&amp;session_id=1523142027.44&amp;date= HTTP/1.1" 200 249 "-" "-"
</span><br>
</div>
<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:01:00:38 +0200] "GET /thestring//index.php?fnc=OSCExportOrder&amp;session_id=1523142038.38 HTTP/1.1" 200 278 "-" "-"
</span><br>
</div>
<div class="block highlight">
Reason: <span class="reason">Detects setter usage and property overloading</span><br>
<span class="line"><b>Log line: </b>222.333.444.555 - - [03/Jan/2013:01:03:42 +0200] "GET /thisisatotallydifferentstring.html HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
</span><br>
</div>
<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:01:05:27 +0200] "GET /thestring//index.php?fnc=OSCExportCatalog&amp;session_id=1523142327.08&amp;onlynew=y HTTP/1.1" 200 676 "-" "-"
</span><br>
</div>


I would like to remove every <div></div> block that contains 'thestring'.



My regex looks like this:



<div class="block highlight">\n Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>\n <span class="line"><b>Log line: .* - - [08/Apr/2018:.*] "GET /pixi//index.php.* HTTP/1.1" 200 .* "-" "-"\n</span><br>\n </div>\n


Any suggestions?







edited Apr 12 at 3:36

asked Apr 11 at 9:14
user1192748




















          1 Answer













Use the program xsltproc; you should already have it on OS X (see man xsltproc).



For example:



          $ cat remdivs.xslt 
          <?xml version="1.0" encoding="utf-8"?>
          <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
          version="1.0">

          <xsl:output method="html" omit-xml-declaration="yes" indent="yes"/>
          <xsl:strip-space elements="*" />
          <xsl:preserve-space elements="html body div" />

          <xsl:template match="@* | node()">
          <xsl:copy>
          <xsl:apply-templates select="@* | node()"/>
          </xsl:copy>
          </xsl:template>

          <xsl:template match="div[@class='block highlight']"/>

          </xsl:stylesheet>



          $ cat input.xml
          <html>
          <div class="block highlight">
          Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
          <span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:45:40 +0200] "GET /thestring//index.php?fnc=OSCImportStock&amp;session_id=1523141136.42&amp;data=%3cARTICLE_ITEM%3e%3cARTICLE_ITEM_ID%3e2000976405029%3c%2fARTICLE_ITEM_ID%3e%3cQUANTITY%3e21%3c%2fQUANTITY%3e%3cDELIVERY_DATE%2f%3e%3cMIN_STOCK_QTY%3e0%3c%2fMIN_STOCK_QTY%3e%3cACTIVE%3eTrue%3c%2fACTIVE%3e%3cEAN%3e000035010005%3c%2fEAN%3e%3cOPENSUPPLORDERS%3e0%3c%2fOPENSUPPLORDERS%3e%3c%2fARTICLE_ITEM%3e HTTP/1.1" 200 339 "-" "-"
          </span><br>
          </div>
          <div class="no highlight">
          Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
          <span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:45:40 +0200] "GET /thestring//index.php?fnc=OSCImportStock&amp;session_id=1523141136.42&amp;data=%3cARTICLE_ITEM%3e%3cARTICLE_ITEM_ID%3e2001021500003%3c%2fARTICLE_ITEM_ID%3e%3cQUANTITY%3e1%3c%2fQUANTITY%3e%3cDELIVERY_DATE%2f%3e%3cMIN_STOCK_QTY%3e0%3c%2fMIN_STOCK_QTY%3e%3cACTIVE%3eTrue%3c%2fACTIVE%3e%3cEAN%3e501302462%3c%2fEAN%3e%3cOPENSUPPLORDERS%3e0%3c%2fOPENSUPPLORDERS%3e%3c%2fARTICLE_ITEM%3e HTTP/1.1" 200 349 "-" "-"
          </span><br>
          </div>
          <div class="block highlight">
          Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
          <span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:50:17 +0200] "GET /thestring//index.php?fnc=OSCExportCatalog&amp;session_id=3214235353.32&amp;onlynew=y HTTP/1.1" 200 676 "-" "-"
          </span><br>
          </div>
          </html>

          $ xsltproc --html remdivs.xslt input.xml
          <html>
          <body>
          <div class="no highlight">
          Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
          <span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:45:40 +0200] "GET /thestring//index.php?fnc=OSCImportStock&amp;session_id=1523141136.42&amp;data=%3cARTICLE_ITEM%3e%3cARTICLE_ITEM_ID%3e2001021500003%3c%2fARTICLE_ITEM_ID%3e%3cQUANTITY%3e1%3c%2fQUANTITY%3e%3cDELIVERY_DATE%2f%3e%3cMIN_STOCK_QTY%3e0%3c%2fMIN_STOCK_QTY%3e%3cACTIVE%3eTrue%3c%2fACTIVE%3e%3cEAN%3e501302462%3c%2fEAN%3e%3cOPENSUPPLORDERS%3e0%3c%2fOPENSUPPLORDERS%3e%3c%2fARTICLE_ITEM%3e HTTP/1.1" 200 349 "-" "-"
          </span><br>
          </div>

          </body>
          </html>


Edit: further explanation, now that the problem has been described in more detail.



xsltproc transforms the input document according to the stylesheet (remdivs.xslt). I use the --html option so that the input is parsed as HTML rather than as strict XML, since your input document contains empty <br> elements (as opposed to <br/>).



The processor first reads your input document and builds a document model in memory, then traverses the nodes in that model, applying the templates it finds in the .xslt file.
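Because the whole document is held in memory, a really huge file may be easier to handle with a streaming filter instead. The following is only a minimal sketch in Python, not part of the xsltproc approach. It assumes, as in the sample, that every block starts on a line containing <div class="block highlight">, ends on a line containing </div>, and that blocks are not nested. The names drop_blocks and filter_file are made up for this illustration.

```python
def drop_blocks(lines, needle,
                start='<div class="block highlight">', end="</div>"):
    """Yield lines, omitting every start..end block that contains needle.

    Assumption: blocks are not nested and the start and end markers
    appear on lines of their own, as in the sample foo.html.
    """
    block = None  # lines of the block being buffered, or None outside a block
    for line in lines:
        if block is None and start not in line:
            yield line            # outside any block: pass the line through
            continue
        if block is None:
            block = []            # this line opens a new block
        block.append(line)
        if end in line:           # block is complete: keep it or drop it
            if not any(needle in buffered for buffered in block):
                yield from block
            block = None
    if block is not None:
        yield from block          # unterminated block at EOF: keep it


def filter_file(src_path, dst_path, needle):
    """Hypothetical helper: stream src_path to dst_path without the blocks."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(drop_blocks(src, needle))
```

Only one block is buffered at a time, so memory use does not grow with the file size. Calling filter_file("foo.html", "foo.filtered.html", "GET /thestring//index.php") would write the filtered copy to a new file and leave the original untouched.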



          The .xslt starts with the XML declaration and the xsl:stylesheet element, followed by some general processing rules (xsl:output, xsl:strip-space, xsl:preserve-space) that define the type of output required.



          There are only two templates. The first one is:



          <xsl:template match="@* | node()">
          <xsl:copy>
          <xsl:apply-templates select="@* | node()"/>
          </xsl:copy>
          </xsl:template>


          This template has a match attribute, so it is applied only to those nodes in the input document that match the pattern. Here the pattern is "@* | node()", which matches any attribute and any child node (element, text, comment or processing instruction), in other words the whole document. The body says what to do with each matched node: xsl:copy makes a shallow copy of it, and the nested xsl:apply-templates then processes its attributes and child nodes in turn, so each of them is copied the same way unless a more specific template matches. If only this template were present, the output would be a copy of your original input document, with the output processing rules applied.



          The 2nd template does the rejection.



           <xsl:template match="div[@class='block highlight']"/>


          Here, it matches specifically <div> elements that have an attribute named class with a value of 'block highlight'. For those elements it takes precedence over the first template, because its pattern is more specific. The <div>...</div> blocks where it matches are replaced with the output this template produces; since this one is empty (it has a terminating /), no output is produced.



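          This, on the other hand,



           <xsl:template match="div[@class='block highlight']">
          suppressed div output<br/>
          </xsl:template>


          will output some text in place of each suppressed div block. Because the stylesheet must itself be well-formed XML, the line break is written as <br/>; with method="html" it is still output as <br>.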



          Here's a different suppression template based on your revised question.



          <xsl:template match="div">
          <xsl:choose>
          <xsl:when test="not(contains(span[@class='line'],'GET /thestring'))">
          <xsl:copy>
          <xsl:apply-templates select="@* | node()"/>
          </xsl:copy>
          </xsl:when>
          <xsl:otherwise><!-- Do nothing, which suppresses the output -->
          </xsl:otherwise>
          </xsl:choose>
          </xsl:template>


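          This template is applied to all div elements. It tests whether the text content of the div's child span element whose class attribute is line does not contain the string 'GET /thestring'. (If a div had several such spans, only the first would be tested, because contains() converts a node-set to the string value of its first node.)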



          When it does not contain the string, we do the same sort of copy as in the first template; otherwise we do nothing, which suppresses the output of that div block.
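
          As a sketch only, the same filter can also be written as an empty template with a predicate in the match pattern, in the style of the second template above. The stylesheet below is an untested variant. It assumes that the string to match is the full 'GET /thestring//index.php' from the question and that it always appears in a span with class line directly under the div.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <xsl:output method="html" indent="yes"/>

  <!-- Identity template: copy every attribute and node unchanged -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Empty template: drop any div that has a child span of class "line"
       whose text contains the string (assumed value, taken from the question) -->
  <xsl:template match="div[span[@class='line'][contains(., 'GET /thestring//index.php')]]"/>

</xsl:stylesheet>
```

          It would be run the same way as before, for example xsltproc --html remdivs.xslt foo.html > filtered.html, where filtered.html is just a hypothetical output name. Unlike the xsl:choose version, this pattern drops a div if any of its line spans contains the string, not only the first one.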



          Read further about XPath, which defines how to address the elements and attributes of a document, and about XSLT, which is used to write the processing templates. These examples should help to make things clearer for a beginner.




























          • Dear X Tian, this tool looks just like what I have been searching for! However, I don't want to filter on the class id of the div block as they are all the same. I rather want to filter on the contents within the div block. Especially I want to match the GET xyz part. I updated my initial question to be more precise.
            – user1192748
            Apr 12 at 3:38


















































          Use the program xsltproc you will have that on OSX, man xsltproc



          For example:



          $ cat remdivs.xslt 
          <?xml version="1.0" encoding="utf-8"?>
          <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
          version="1.0">

          <xsl:output method="html" omit-xml-declaration="yes" indent="yes"/>
          <xsl:strip-space elements="*" />
          <xsl:preserve-space elements="html body div" />

          <xsl:template match="@* | node()">
          <xsl:copy>
          <xsl:apply-templates select="@* | node()"/>
          </xsl:copy>
          </xsl:template>

          <xsl:template match="div[@class='block highlight']"/>

          </xsl:stylesheet>



          $ cat input.xml
          <html>
          <div class="block highlight">
          Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
          <span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:45:40 +0200] "GET /thestring//index.php?fnc=OSCImportStock&amp;session_id=1523141136.42&amp;data=%3cARTICLE_ITEM%3e%3cARTICLE_ITEM_ID%3e2000976405029%3c%2fARTICLE_ITEM_ID%3e%3cQUANTITY%3e21%3c%2fQUANTITY%3e%3cDELIVERY_DATE%2f%3e%3cMIN_STOCK_QTY%3e0%3c%2fMIN_STOCK_QTY%3e%3cACTIVE%3eTrue%3c%2fACTIVE%3e%3cEAN%3e000035010005%3c%2fEAN%3e%3cOPENSUPPLORDERS%3e0%3c%2fOPENSUPPLORDERS%3e%3c%2fARTICLE_ITEM%3e HTTP/1.1" 200 339 "-" "-"
          </span><br>
          </div>
          <div class="no highlight">
          Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
          <span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:45:40 +0200] "GET /thestring//index.php?fnc=OSCImportStock&amp;session_id=1523141136.42&amp;data=%3cARTICLE_ITEM%3e%3cARTICLE_ITEM_ID%3e2001021500003%3c%2fARTICLE_ITEM_ID%3e%3cQUANTITY%3e1%3c%2fQUANTITY%3e%3cDELIVERY_DATE%2f%3e%3cMIN_STOCK_QTY%3e0%3c%2fMIN_STOCK_QTY%3e%3cACTIVE%3eTrue%3c%2fACTIVE%3e%3cEAN%3e501302462%3c%2fEAN%3e%3cOPENSUPPLORDERS%3e0%3c%2fOPENSUPPLORDERS%3e%3c%2fARTICLE_ITEM%3e HTTP/1.1" 200 349 "-" "-"
          </span><br>
          </div>
          <div class="block highlight">
          Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
          <span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:50:17 +0200] "GET /thestring//index.php?fnc=OSCExportCatalog&amp;session_id=3214235353.32&amp;onlynew=y HTTP/1.1" 200 676 "-" "-"
          </span><br>
          </div>
          </html>

          $ xsltproc --html remdivs.xslt input.xml
          <html>
          <body>
          <div class="no highlight">
          Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
          <span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:45:40 +0200] "GET /thestring//index.php?fnc=OSCImportStock&amp;session_id=1523141136.42&amp;data=%3cARTICLE_ITEM%3e%3cARTICLE_ITEM_ID%3e2001021500003%3c%2fARTICLE_ITEM_ID%3e%3cQUANTITY%3e1%3c%2fQUANTITY%3e%3cDELIVERY_DATE%2f%3e%3cMIN_STOCK_QTY%3e0%3c%2fMIN_STOCK_QTY%3e%3cACTIVE%3eTrue%3c%2fACTIVE%3e%3cEAN%3e501302462%3c%2fEAN%3e%3cOPENSUPPLORDERS%3e0%3c%2fOPENSUPPLORDERS%3e%3c%2fARTICLE_ITEM%3e HTTP/1.1" 200 349 "-" "-"
          </span><br>
          </div>

          </body>
          </html>


          edit after further problem explanation.



          Further explanation.



          xsltproc performs the transformation of the input document based on the template (remdivs.xslt), I use the --html option to relax the strict xml validation since your input document contains <br> empty elements (opposed to <br/>).



          The processor first takes your input doc and builds a document model in memory, then traverses the elements in the model applying the templates it finds in the .xslt



          Looking at the .xslt, it contains the preamble declaration and then some general processing rules that help to define the type of output required.



          There are only 2 templates, the first one



          <xsl:template match="@* | node()">
          <xsl:copy>
          <xsl:apply-templates select="@* | node()"/>
          </xsl:copy>
          </xsl:template>


          This template has a match attribute, so it is only applied to those elements in the input document that match the match expression, in this case the expression is "@* | node()", which will match any attribute or any node in your document, the whole thing! What it does to those elements is laid out inside, it copies the output of applying each template selectively, but the selection criteria templates will be the name of every attribute and element. The effect is, if only this template was present would be a copy of your original input document, with the ouput processing rules applied.



          The 2nd template does the rejection.



           <xsl:template match="div[@class='block highlight']"/>


          Here, it matches specifically <div> elements that have an attribute named class, with a value of 'block highlight'. so those <div>...</div> blocks where this matches, are replaced with the output this template produces, sine this one is empty (has a terminating /), no output is produced.



          This, on the other hand,



           <xsl:template match="div[@class='block highlight']">
          suppressed div output<br>
          </xsl:template>


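          will output some text in place of each suppressed div block. Note that the stylesheet itself must be well-formed XML, so the line break has to be written as <br/> inside the template.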



          Here's a different suppression template based on your revised question.



          <xsl:template match="div">
            <xsl:choose>
              <xsl:when test="not(contains(span[@class='line'],'GET /thestring'))">
                <xsl:copy>
                  <xsl:apply-templates select="@* | node()"/>
                </xsl:copy>
              </xsl:when>
              <xsl:otherwise><!-- Just do nothing to suppress output -->
              </xsl:otherwise>
            </xsl:choose>
          </xsl:template>


          This template is applied to all div elements. It tests whether the text content of the div's child span element whose class attribute has the value line does not contain the string 'GET /thestring'. (If a div had several such spans, contains() would look only at the first one.)



          When it does not contain the string, we do the same sort of copy as in the first template; otherwise we do nothing, which suppresses the output of that div block.
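If it helps to see the same decision written out procedurally, here is a small Python sketch of the filter, again standard library only and again just an analogy for what the template does. The names NEEDLE, first_line_span_text and filtered_copy are invented for this example, it assumes well-formed XML input, and unlike the template it never tests the root element itself.

```python
import xml.etree.ElementTree as ET

NEEDLE = "GET /thestring"  # the string that marks a div for removal


def first_line_span_text(div):
    """Text content of the first child <span class="line">, or '' if none."""
    span = div.find("span[@class='line']")
    return "" if span is None else "".join(span.itertext())


def filtered_copy(elem):
    """Copy elem recursively, leaving out child divs whose line span has NEEDLE."""
    clone = ET.Element(elem.tag, dict(elem.attrib))
    clone.text = elem.text
    clone.tail = elem.tail
    for child in elem:
        if child.tag == "div" and NEEDLE in first_line_span_text(child):
            # Suppress this div, but keep any text that followed it.
            if child.tail:
                if len(clone):
                    clone[-1].tail = (clone[-1].tail or "") + child.tail
                else:
                    clone.text = (clone.text or "") + child.tail
            continue
        clone.append(filtered_copy(child))
    return clone


# Hypothetical sample modelled on the log entries in the question.
SAMPLE = (
    "<body>"
    '<div class="block highlight"><span class="line">'
    '"GET /thestring//index.php?fnc=x HTTP/1.1"</span></div>'
    '<div class="block highlight"><span class="line">'
    '"GET /thisisatotallydifferentstring HTTP/1.1"</span></div>'
    "</body>"
)

kept = filtered_copy(ET.fromstring(SAMPLE))
print(ET.tostring(kept, encoding="unicode"))
```

Only the div whose log line contains the search string is dropped; the other div is copied unchanged, which mirrors the xsl:when / xsl:otherwise split.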



          Read further about XPath, which defines how to address the elements and attributes of a document, and about XSLT, which is used to write processing templates. These examples should help make things clearer for a beginner.















          edited Apr 13 at 10:42

























          answered Apr 11 at 14:46









          X Tian












          • Dear X Tian, this tool looks just like what I have been searching for! However, I don't want to filter on the class id of the div block as they are all the same. I rather want to filter on the contents within the div block. Especially I want to match the GET xyz part. I updated my initial question to be more precise.
            – user1192748
            Apr 12 at 3:38




























           



























































































