Speeding up repeated python calls (or, alternatively, porting a complex regex to sed)

I am an academic medical physicist. I do experiments that generate a fair amount of data, and they are expensive to run. My university has a backup system consisting of a robot tape library in a disused salt mine, running IBM's Spectrum Protect (invoked as dsmc), which I use for off-site backups. Although there is no limit on the total size I can send to the salt mine, there is a per-day transfer limit of 200 gigabytes. As far as I know, there is no way to get the Spectrum Protect client to respect this limit and stop once it has been reached.



If one busts this limit, the server locks the node and I have to send a grovelling apologetic email to someone to ask them to unlock it. They tell me off for using too much bandwidth, and, something like 24-48 hours later, unlock the node.



To get around the fact that I create data in discrete chunks (on experiment days) and am well under the bandwidth limit on a per-month or per-week basis, I've written a simple wrapper script to parse the output of dsmc and kill the transfer if it gets too large.



The parsing is done by feeding each line of dsmc's output to a simple python script embedded in the bash script as a here-doc:



#!/bin/bash
# A silly wrapper script to halt TSM backups
#
# Usage: sudo /path/to/script /path/to/backup/location
#
# Requires python3 accessible as python3, and the regex / os modules.
# Tested on MacOS and Linux
BYTES_SENT=0;
#MAX_SIZE_TO_SEND=150 #Bytes, for testing
MAX_SIZE_TO_SEND=$[185*(2**30)]

args=("$@")
sudo rm -f /tmp/dsmc-script.PID

function outputParser() {
python3 <<'EOF'
import os, re
rex = re.compile(r"Normal File-->\s*?([,0-9]*,?)\s*?/")
valueToParse = os.environ.get('line');
match = rex.match(valueToParse);
try:
    stringToReturn = str(match.group(1));
    stringToReturn = stringToReturn.replace(',', '');
except AttributeError:
    stringToReturn = "";
# Check for failed transfers
failedResults = re.findall(r"\*\* Unsuccessful \*\*", valueToParse);
nFailedResults = len(failedResults);
if (nFailedResults > 0):
    stringToReturn = "";
print(stringToReturn);
EOF
}
#I am sure that the above is a one-liner in sed or awk. I just don't know what the one line is.

function trapCaught() {
    #Do cleanup, not shown
    echo ", quitting."
}

trap trapCaught sigint
killCount=0
startTime=$SECONDS

while read -r line; do
    echo "$line"
    export line;
    X=$(export line=$line; outputParser)
    if [[ ! -z "$X" ]]; then
        BYTES_SENT=$[$BYTES_SENT + $X]
        echo "Sent $X bytes, $BYTES_SENT in total"
    fi
    if (( BYTES_SENT > MAX_SIZE_TO_SEND )); then
        if (( killCount < 1 )); then
            echo "STOPPED BACKUP BECAUSE $BYTES_SENT is GREATER THAN THE PERMITTED MAXIMUM OF $MAX_SIZE_TO_SEND";
            killStartTime=$(( SECONDS - startTime ))
            pid=$(cat /tmp/dsmc-script.PID)
            echo "PID is $pid"
            echo $pid | sudo xargs kill
        fi

        killCount=$[$killCount + 1];
        timeKillNow=$(( SECONDS - killStartTime ))
        rm -f /tmp/dsmc-script.PID

        if (( killCount > 100 || timeKillNow > 30 )); then
            echo "Taking too long to die; retrying"
            echo $pid | sudo xargs kill -9;
            sleep 0.1;
            sudo kill -9 0;
        fi

    fi
done < <( sudo dsmc incr "${args[0]}" & echo $! > /tmp/dsmc-script.PID )


This works, and suits my purposes. However, performance is bad bordering on terrible, and I think this is because each iteration of the while loop spins up another instance of the python interpreter / script combo.
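For a rough, machine-dependent illustration of that start-up cost, compare paying the interpreter launch price once per "line" against doing the equivalent trivial work in a single launch (the exact numbers don't matter, only the ratio):

time for i in $(seq 1000); do python3 -c 'pass'; done   # one interpreter per iteration
time python3 -c 'for i in range(1000): pass'            # one interpreter in total

The first form pays the start-up cost 1000 times, which is essentially what the loop above does for every line of dsmc output.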



Given that I can't change the limit or the behaviour of the compiled dsmc binary, I have three related questions:



(a) Is this a sensible approach for solving this problem, or is there a much easier way that I am missing, such as advanced voodoo with netstat?



(b) Given that what python does is essentially the same on each iteration of the loop, is there a way to cache the interpreter's translation of the code and hence speed the whole thing up hugely?



(c) If I were to replace the python script with an equivalent sed or awk construct, I suspect this whole thing would be much, much faster. Why? Is it possible to do this type of arithmetic easily that way, or is that another red herring?



Edit: Example output from dsmc, for those not familiar, is below -- a file is only sent if "Normal File-->" appears in a line, followed by its size in bytes. So, in the output below, the file spclicert.kdb is sent, but neither TSM.PWD nor the directory CaptiveNetworkSupport is counted:



# dsmc incr / 
< header message containing personal information>
Incremental backup of volume '/'
ANS1898I ***** Processed 79,000 files *****
Directory--> 0 /Library/Preferences/SystemConfiguration/CaptiveNetworkSupport [Sent]
Normal File--> 5,080 /Library/Preferences/Tivoli Storage Manager/Nodes/SHUG2765-MACBOOKPRO-PHYSICS/spclicert.kdb [Sent]
Updating--> 224 /Library/Preferences/Tivoli Storage Manager/BrokenOrOld/TSM.PWD (original) [Sent]


So, the above script strips out the size in bytes of each file sent and simply adds them up.










python bash performance backup

asked Jan 11 at 9:41 by Landak
migrated from stackoverflow.com Jan 13 at 22:27


This question came from our site for professional and enthusiast programmers.


















  • Surely sed or awk could speed it up to an extent, but one can't really suggest a way if we don't know what the input looks like

    – Inian
    Jan 11 at 9:49











  • @Inian sorry about that -- example output provided!

    – Landak
    Jan 11 at 9:57











  • This can probably be solved with some iptables and lsof magic, just by monitoring the total amount of traffic sent over the socket or the average bandwidth of the socket, and then doing some simple maths to kill your process before exceeding your limit. Make sure you crosspost this to ServerFault and/or Unix & Linux

    – JAAAY
    Jan 11 at 11:38











  • Even rewriting the whole script into python would make it much faster than a mix of python and bash. bash shouldn't be mistaken for a programming language.

    – Thomas Dickey
    Jan 13 at 23:02











  • It seems your Python script can be replaced with something like gawk 'match($0, /Normal\sFile-->\s+([0-9,]+)/, a) && gsub(/,/, "", a[1]) { print a[1] }' <<< "$line". This will just extract the file size (without commas) from lines containing the string "Normal File-->".

    – ozzy
    Jan 13 at 23:10
















2 Answers
Assuming the connection is reliable, a simple kludge would be to use a user-space traffic shaper. Just set it up to use no more than the maximum bandwidth per day.



An example using trickle, a big file foo, and scp:



l=$(( (200*10**6)/(24*60**2) ))
trickle -d $l scp foo username@remotehost:~/


And trickle would slow the transfer down to 2314K per second, which would top out at no more than 199,929,600,000 bytes per day. The file transfer program needn't be scp; it could be anything (even a web browser, or dsmc), as long as it's started from the command line.



An advantage of this method is that it wouldn't be necessary to break up the file foo if it was bigger than the daily limit. Of course it would take a while to send foo over, (if foo were 1TB, it would take 5 days), but it would take that long anyway.



trickle has a daemon version called trickled, which controls every subsequent run of trickle. Example:



l=$(( (200*10**6)/(24*60**2) ))
trickled -d $l
trickle scp foo username@remotehost:~/ &
trickle scp bar username@remotehost:~/ &
trickle scp baz username@remotehost:~/ &


Supposing that each of the files foo, bar, and baz were 1TB in size, trickled would still keep the transfer within the 200GB/day limit.
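Applying the same idea directly to dsmc should, in principle, look something like the sketch below. This is untested with Spectrum Protect: trickle works by preloading a shim around the process's socket calls, so it may not work if dsmc is statically linked or otherwise bypasses the preload, and the backup path is just a placeholder:

l=$(( (200*10**6)/(24*60**2) ))     # daily limit averaged over 24 hours, in KB/s
sudo trickle -s -u "$l" -d "$l" dsmc incr /path/to/backup/location

The -s flag runs trickle standalone (no trickled needed), and -u / -d cap the upload and download rates respectively; limiting both directions keeps the sketch safe whichever way the bulk of the data flows.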






answered Jan 14 at 2:37, edited Jan 14 at 2:47, by agc
  • Great suggestion; to me this seems far more sensible than killing the process arbitrarily

    – tink
    Jan 14 at 2:44


















Your input can be parsed entirely in bash. Here's a sample:



max=$[185*(2**30)]

export total=0
while read first second third rest; do
    [[ "$first" == "Normal" && "$second" == "File-->" ]] && {
        size=${third//,/}
        echo "file: $size"
        total=$(( total + size ))
        (( total > max )) && kill something
    }
done < ~/tmp/your-input


If you're truly limited by the time taken to spawn a sub-process, this avoids the overhead even of calling out to awk or sed.
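If you do end up wanting an external tool after all (question (c) in the post), a single awk process reading the whole dsmc stream achieves the same thing while starting awk only once. A rough sketch, assuming the field layout of the sample output in the question; the "** Unsuccessful **" check and the kill-by-PID handling from the original script are omitted for brevity:

sudo dsmc incr "$1" | awk -v max=$((185*2**30)) '
    { print }                       # echo every dsmc line as it arrives
    /Normal File-->/ {
        size = $3                   # e.g. "5,080" in the sample output
        gsub(/,/, "", size)         # drop the thousands separators
        total += size
        printf "Sent %s bytes, %s in total\n", size, total
        if (total > max) exit 1     # stop reading once over the limit
    }'

When awk exits, dsmc will usually die of SIGPIPE on its next write; if it must be stopped promptly and reliably, keep the explicit kill from the question's script.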






answered Jan 14 at 2:37 by wef