Remove blocks matching a string from a huge html file
I am on a Mac and I want to remove multiple <div> blocks that match a certain string from an HTML file. I tried to use sed for it in the following way, but failed:
First I escaped all the symbols in my STRING that have special regex meaning, producing ESCAPEDSTRING.
But now I am struggling to find a tool that works across multiple lines and with regexes to remove the respective lines. I guess sed won't work.
In the following example, I want to remove any <div> block that contains the string GET /thestring//index.php, while everything else (i.e., the second-to-last block, containing GET /thisisatotallydifferentstring) remains part of the HTML file. A sample foo.html looks like this:
<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:45:40 +0200] "GET /thestring//index.php?fnc=OSCImportStock&session_id=1523141136.42&data=%3cARTICLE_ITEM%3e%3cARTICLE_ITEM_ID%3e2000976405029%3c%2fARTICLE_ITEM_ID%3e%3cQUANTITY%3e21%3c%2fQUANTITY%3e%3cDELIVERY_DATE%2f%3e%3cMIN_STOCK_QTY%3e0%3c%2fMIN_STOCK_QTY%3e%3cACTIVE%3eTrue%3c%2fACTIVE%3e%3cEAN%3e000035010005%3c%2fEAN%3e%3cOPENSUPPLORDERS%3e0%3c%2fOPENSUPPLORDERS%3e%3c%2fARTICLE_ITEM%3e HTTP/1.1" 200 339 "-" "-"
</span><br>
</div>
<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:45:40 +0200] "GET /thestring//index.php?fnc=OSCImportStock&session_id=1523141136.42&data=%3cARTICLE_ITEM%3e%3cARTICLE_ITEM_ID%3e2001021500003%3c%2fARTICLE_ITEM_ID%3e%3cQUANTITY%3e1%3c%2fQUANTITY%3e%3cDELIVERY_DATE%2f%3e%3cMIN_STOCK_QTY%3e0%3c%2fMIN_STOCK_QTY%3e%3cACTIVE%3eTrue%3c%2fACTIVE%3e%3cEAN%3e501302462%3c%2fEAN%3e%3cOPENSUPPLORDERS%3e0%3c%2fOPENSUPPLORDERS%3e%3c%2fARTICLE_ITEM%3e HTTP/1.1" 200 349 "-" "-"
</span><br>
</div>
<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:50:17 +0200] "GET /thestring//index.php?fnc=OSCExportCatalog&session_id=3214235353.32&onlynew=y HTTP/1.1" 200 676 "-" "-"
</span><br>
</div>
<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:50:18 +0200] "GET /thestring//index.php?fnc=OSCConfirmCatalog&session_id=3214235353.32&date= HTTP/1.1" 200 249 "-" "-"
</span><br>
</div>
<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:50:28 +0200] "GET /thestring//index.php?fnc=OSCExportOrder&session_id=3214123353.99 HTTP/1.1" 200 278 "-" "-"
</span><br>
</div>
<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:55:18 +0200] "GET /thestring//index.php?fnc=OSCExportCatalog&session_id=1523141718.15&onlynew=y HTTP/1.1" 200 676 "-" "-"
</span><br>
</div>
<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:55:19 +0200] "GET /thestring//index.php?fnc=OSCConfirmCatalog&session_id=1523141718.15&date= HTTP/1.1" 200 249 "-" "-"
</span><br>
</div>
<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:55:29 +0200] "GET /thestring//index.php?fnc=OSCExportOrder&session_id=1523141729.64 HTTP/1.1" 200 278 "-" "-"
</span><br>
</div>
<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:01:00:27 +0200] "GET /thestring//index.php?fnc=OSCExportCatalog&session_id=1523142027.44&onlynew=y HTTP/1.1" 200 676 "-" "-"
</span><br>
</div>
<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:01:00:28 +0200] "GET /thestring//index.php?fnc=OSCConfirmCatalog&session_id=1523142027.44&date= HTTP/1.1" 200 249 "-" "-"
</span><br>
</div>
<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:01:00:38 +0200] "GET /thestring//index.php?fnc=OSCExportOrder&session_id=1523142038.38 HTTP/1.1" 200 278 "-" "-"
</span><br>
</div>
<div class="block highlight">
Reason: <span class="reason">Detects setter usage and property overloading</span><br>
<span class="line"><b>Log line: </b>222.333.444.555 - - [03/Jan/2013:01:03:42 +0200] "GET /thisisatotallydifferentstring.html HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
</span><br>
</div>
<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:01:05:27 +0200] "GET /thestring//index.php?fnc=OSCExportCatalog&session_id=1523142327.08&onlynew=y HTTP/1.1" 200 676 "-" "-"
</span><br>
</div>
I would like to remove every <div></div> block that contains 'thestring'.
My regex looks like this:
<div class="block highlight">\n Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>\n <span class="line"><b>Log line: .* - - [08/Apr/2018:.*] "GET /pixi//index.php.* HTTP/1.1" 200 .* "-" "-"\n</span><br>\n </div>\n
Any suggestions?
text-processing regular-expression html
edited Apr 12 at 3:36
asked Apr 11 at 9:14
user1192748
1 Answer
Use the program xsltproc. It is already installed on OS X; see man xsltproc.
For example:
$ cat remdivs.xslt
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output method="html" omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*" />
<xsl:preserve-space elements="html body div" />
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="div[@class='block highlight']"/>
</xsl:stylesheet>
$ cat input.xml
<html>
<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:45:40 +0200] "GET /thestring//index.php?fnc=OSCImportStock&session_id=1523141136.42&data=%3cARTICLE_ITEM%3e%3cARTICLE_ITEM_ID%3e2000976405029%3c%2fARTICLE_ITEM_ID%3e%3cQUANTITY%3e21%3c%2fQUANTITY%3e%3cDELIVERY_DATE%2f%3e%3cMIN_STOCK_QTY%3e0%3c%2fMIN_STOCK_QTY%3e%3cACTIVE%3eTrue%3c%2fACTIVE%3e%3cEAN%3e000035010005%3c%2fEAN%3e%3cOPENSUPPLORDERS%3e0%3c%2fOPENSUPPLORDERS%3e%3c%2fARTICLE_ITEM%3e HTTP/1.1" 200 339 "-" "-"
</span><br>
</div>
<div class="no highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:45:40 +0200] "GET /thestring//index.php?fnc=OSCImportStock&session_id=1523141136.42&data=%3cARTICLE_ITEM%3e%3cARTICLE_ITEM_ID%3e2001021500003%3c%2fARTICLE_ITEM_ID%3e%3cQUANTITY%3e1%3c%2fQUANTITY%3e%3cDELIVERY_DATE%2f%3e%3cMIN_STOCK_QTY%3e0%3c%2fMIN_STOCK_QTY%3e%3cACTIVE%3eTrue%3c%2fACTIVE%3e%3cEAN%3e501302462%3c%2fEAN%3e%3cOPENSUPPLORDERS%3e0%3c%2fOPENSUPPLORDERS%3e%3c%2fARTICLE_ITEM%3e HTTP/1.1" 200 349 "-" "-"
</span><br>
</div>
<div class="block highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:50:17 +0200] "GET /thestring//index.php?fnc=OSCExportCatalog&session_id=3214235353.32&onlynew=y HTTP/1.1" 200 676 "-" "-"
</span><br>
</div>
</html>
$ xsltproc --html remdivs.xslt input.xml
<html>
<body>
<div class="no highlight">
Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
<span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:45:40 +0200] "GET /thestring//index.php?fnc=OSCImportStock&session_id=1523141136.42&data=%3cARTICLE_ITEM%3e%3cARTICLE_ITEM_ID%3e2001021500003%3c%2fARTICLE_ITEM_ID%3e%3cQUANTITY%3e1%3c%2fQUANTITY%3e%3cDELIVERY_DATE%2f%3e%3cMIN_STOCK_QTY%3e0%3c%2fMIN_STOCK_QTY%3e%3cACTIVE%3eTrue%3c%2fACTIVE%3e%3cEAN%3e501302462%3c%2fEAN%3e%3cOPENSUPPLORDERS%3e0%3c%2fOPENSUPPLORDERS%3e%3c%2fARTICLE_ITEM%3e HTTP/1.1" 200 349 "-" "-"
</span><br>
</div>
</body>
</html>
Edit, after the question was clarified: further explanation.
xsltproc transforms the input document according to the stylesheet (remdivs.xslt). I use the --html option to relax strict XML parsing, since your input document contains empty <br> elements (as opposed to <br/>).
The processor first reads your input document and builds a document model in memory, then traverses the nodes of that model, applying the templates it finds in the .xslt file.
The .xslt file contains the preamble declaration and then some general processing rules that help define the type of output required.
There are only two templates. The first one is:
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
This template has a match attribute, so it is applied only to those nodes in the input document that match the match expression. In this case the expression is "@* | node()", which matches any attribute or any node in your document: the whole thing! What it does to those nodes is laid out inside: it copies the current node and then applies templates to all of that node's attributes and child nodes in turn. If this were the only template, the result would be a copy of your original input document, with the output processing rules applied.
The second template does the rejection:
<xsl:template match="div[@class='block highlight']"/>
Here, it matches specifically <div> elements that have a class attribute with the value 'block highlight'. The <div>...</div> blocks that match are replaced with the output this template produces; since this one is empty (it ends with a terminating /), no output is produced.
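This, on the other hand,
<xsl:template match="div[@class='block highlight']">
suppressed div output<br/>
</xsl:template>
will output some text in place of the suppressed div block. Note that the stylesheet itself must be well-formed XML, so the empty element is written as <br/> here.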
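A step the explanation leaves implicit: both templates match these div elements, so the processor has to pick one. In XSLT 1.0 a pattern such as node() has a default priority of -0.5, while a pattern with a predicate such as div[@class='block highlight'] defaults to 0.5, so the more specific template is chosen. A minimal sketch that spells this out with an explicit priority attribute (the value 1 is an arbitrary choice for illustration, not part of the original stylesheet):

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <xsl:output method="html" indent="yes"/>

  <!-- Identity template: default priority -0.5 for both @* and node(). -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Empty template with an explicit priority (assumed value 1), so it
       is chosen over the identity template for matching div elements. -->
  <xsl:template match="div[@class='block highlight']" priority="1"/>
</xsl:stylesheet>
```

Leaving the priority attribute out gives the same result here, because the default priorities already favour the pattern with the predicate.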
Here's a different suppression template based on your revised question.
<xsl:template match="div">
<xsl:choose>
<xsl:when test="not(contains(span[@class='line'],'GET /thestring'))">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:when>
<xsl:otherwise><!-- Just do nothing to suppress output -->
</xsl:otherwise>
</xsl:choose>
</xsl:template>
This template is applied to all div elements. It tests whether the text content of the div's child span element with the class attribute value line does not contain the string 'GET /thestring'.
When it does not contain the string, we do the same sort of copy as in the first template; otherwise we do nothing, which suppresses the output of that div block.
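The same filter can also be written as a single match pattern with a predicate, which avoids the xsl:choose. The following is a sketch, assuming the log line always sits in a span with class "line" directly inside the div, as in the sample:

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <xsl:output method="html" indent="yes"/>

  <!-- Identity template: copy every node and attribute unchanged. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Empty template: drop any div that has a child span of class "line"
       whose text contains the request string from the question. -->
  <xsl:template
      match="div[span[@class='line'][contains(., 'GET /thestring//index.php')]]"/>
</xsl:stylesheet>
```

It can be run the same way as before, for example xsltproc --html remdivs.xslt foo.html > cleaned.html, where cleaned.html is a hypothetical output file name. Divs that do not match the predicate, such as the one containing GET /thisisatotallydifferentstring, fall through to the identity template and are copied.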
To go further, read about XPath, which defines how to address the elements and attributes of a document, and about XSLT, which is used to write processing templates. These examples should help make both clearer for a beginner.
Dear X Tian, this tool looks just like what I have been searching for! However, I don't want to filter on the class of the div block, as they are all the same. I would rather filter on the contents within the div block. In particular, I want to match the GET xyz part. I updated my initial question to be more precise.
– user1192748 Apr 12 at 3:38
edited Apr 13 at 10:42
answered Apr 11 at 14:46
X Tian
7,29111836