Speeding up repeated python calls (or, alternatively, porting a complex regex to sed)
I am an academic medical physicist. I do experiments that generate a fair amount of data, and they are expensive to run. My university's backup system consists of a robot tape library in a disused salt mine, running IBM's Spectrum Protect (invoked as dsmc), which I use for off-site backups. Although there is no limit on the total size I can send to the salt mine, there is a per-day transfer limit of 200 gigabytes. As far as I know, there is no way to get the Spectrum Protect client to respect this limit and stop once it has been reached.
If one busts this limit, the server locks the node and I have to send a grovelling, apologetic email to someone asking them to unlock it. They tell me off for using too much bandwidth and, something like 24-48 hours later, unlock the node.
To get around the fact that I create data in discrete chunks (on experiment days) and am well under the bandwidth limit on a per-month or per-week basis, I've written a simple wrapper script to parse the output of dsmc and kill the transfer if it gets too large.
The parsing is done by feeding each line of dsmc's output to a simple python script via a bash here document:
#!/bin/bash
# A silly wrapper script to halt TSM backups
#
# Usage: sudo /path/to/script /path/to/backup/location
#
# Requires python3 accessible as python3, and the regex / os modules.
# Tested on MacOS and Linux

BYTES_SENT=0;
#MAX_SIZE_TO_SEND=150 #Bytes, for testing
MAX_SIZE_TO_SEND=$[185*(2**30)]

args=("$@")
sudo rm -f /tmp/dsmc-script.PID

function outputParser() {
python3 <<'EOF'
import os, re
rex=re.compile(r"Normal File-->\s*?([,0-9]*,?)\s*?/")
valueToParse=os.environ.get('line');
match=rex.match(valueToParse);
try:
    stringToReturn = str(match.group(1));
    stringToReturn = stringToReturn.replace(',','');
except AttributeError:
    stringToReturn = "";
#Check for failed transfers
failedResults = re.findall(r"\*\* Unsuccessful \*\*", valueToParse);
nFailedResults = len(failedResults);
if (nFailedResults > 0):
    stringToReturn = "";
print(stringToReturn);
EOF
}
#I am sure that the above is a one-liner in sed or awk. I just don't know what the one line is.

function trapCaught() {
    #Do cleanup, not shown
    echo ", quitting."
}
trap trapCaught sigint

killCount=0
startTime=$SECONDS

while read -r line; do
    echo "$line"
    export line;
    X=$(export line=$line; outputParser)
    if [[ ! -z "$X" ]]; then
        BYTES_SENT=$[$BYTES_SENT + $X]
        echo "Sent $X bytes, $BYTES_SENT in total"
    fi
    if (( BYTES_SENT > MAX_SIZE_TO_SEND )); then
        if (( killCount < 1 )); then
            echo "STOPPED BACKUP BECAUSE $BYTES_SENT is GREATER THAN THE PERMITTED MAXIMUM OF $MAX_SIZE_TO_SEND";
            killStartTime=$(( SECONDS - startTime ))
            pid=$(cat /tmp/dsmc-script.PID)
            echo "PID is $pid"
            echo $pid | sudo xargs kill
        fi
        killCount=$[$killCount + 1];
        timeKillNow=$(( SECONDS - killStartTime ))
        rm -f /tmp/dsmc-script.PID
        if (( killCount > 100 || timeKillNow > 30 )); then
            echo "Taking too long to die; retrying"
            echo $pid | sudo xargs kill -9;
            sleep 0.1;
            sudo kill -9 0;
        fi
    fi
done < <( sudo dsmc incr "${args[0]}" & echo $! > /tmp/dsmc-script.PID )
This works, and suits my purposes. However, performance is bad bordering on terrible, and I think this is because each iteration through the while loop spins up another instance of the python interpreter / script combo.
Given that I can't change the limit, or the behaviour of the compiled binary blob dsmc, I have three related questions:
(a) Is this a sensible approach for solving this problem, or is there a much easier way that I am missing, such as advanced voodoo with netstat?
(b) Given that what python actually does is essentially exactly the same through each iteration of the loop, is there a way to cache the interpreter's translation of the code and hence speed the whole thing up hugely?
(c) If I were to replace the python script with an equivalent sed or awk construct, I suspect the whole thing would be much, much faster. Why? Is it possible to do this type of arithmetic easily that way, or is that another red herring?
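For (b), the kind of thing I have in mind -- entirely untested, and assuming a bash new enough to have coproc (4.0+, so not the stock macOS /bin/bash) -- is to start python once as a coprocess and feed it lines over a pipe, so the interpreter start-up cost is only paid once:
# untested sketch: one long-lived python coprocess does all the per-line parsing
coproc PARSER {
    python3 -c '
import re, sys
rex = re.compile(r"Normal File-->\s*([0-9,]+)\s")
for line in sys.stdin:
    m = rex.search(line)
    # one reply per input line: the size with commas stripped, or an empty line
    print(m.group(1).replace(",", "") if m else "", flush=True)
'
}
# ...and then, inside the while-read loop, instead of calling outputParser:
#   echo "$line" >&"${PARSER[1]}"    # hand the line to the long-lived python
#   read -r X <&"${PARSER[0]}"       # read back the parsed size (may be empty)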
Edit: Example output from dsmc, for those not familiar, is below -- a file is only sent if "Normal File" appears in a line, followed by its size in bytes. So, in the below, the file spclicert.kdb is sent, but neither TSM.PWD nor the directory CaptiveNetworkSupport is:
# dsmc incr /
< header message containing personal information>
Incremental backup of volume '/'
ANS1898I ***** Processed 79,000 files *****
Directory--> 0 /Library/Preferences/SystemConfiguration/CaptiveNetworkSupport [Sent]
Normal File--> 5,080 /Library/Preferences/Tivoli Storage Manager/Nodes/SHUG2765-MACBOOKPRO-PHYSICS/spclicert.kdb [Sent]
Updating--> 224 /Library/Preferences/Tivoli Storage Manager/BrokenOrOld/TSM.PWD (original) [Sent]
So, the above script strips out the size in bytes of each file sent and simply adds them up.
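To make the extraction concrete: for the sample above, only the 5,080 should be counted, and something along these lines (untested) is roughly what the python is doing to each line:
# hypothetical per-line equivalent: print the size, sans commas, from a "Normal File-->" line
printf '%s\n' "$line" | sed -n 's/^Normal File-->[[:space:]]*\([0-9,]*\)[[:space:]].*/\1/p' | tr -d ,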
python bash performance backup
migrated from stackoverflow.com Jan 13 at 22:27
This question came from our site for professional and enthusiast programmers.
Surely sed or awk could speed it up to an extent, but one can't really suggest a way if we don't know what the input looks like.
– Inian
Jan 11 at 9:49
@Inian sorry about that -- example output provided!
– Landak
Jan 11 at 9:57
This can probably be solved with some iptables and lsof magic, just by monitoring the total amount of traffic sent over the socket, or by monitoring the average bandwidth of the socket and then doing some simple maths to kill your process before exceeding your limit. Make sure you crosspost this to ServerFault and/or Unix&Linux.
– JAAAY
Jan 11 at 11:38
Even rewriting the whole script in python would make it much faster than a mix of python and bash; bash shouldn't be mistaken for a programming language.
– Thomas Dickey
Jan 13 at 23:02
It seems your Python script can be replaced with something like gawk 'match($0, /Normal\sFile-->\s+([0-9,]+)/, a) && gsub(/,/, "", a[1]) { print a[1] }' <<< "$line". This will just extract the file size (without commas) from lines containing the string "Normal File-->".
– ozzy
Jan 13 at 23:10
2 Answers
Assuming the connection is reliable, a simple kludge would be to use a user-space traffic shaper. Just set it up to use no more than the maximum bandwidth per day.
An example using trickle, a big file foo, and scp:
l=$(( (200*10**6)/(24*60**2) ))
trickle -d $l scp foo username@remotehost:~/
And trickle would slow down the transfer to 2314K per second, which would top out at no more than 199,929,600,000 bytes per day. The file transfer program needn't be scp; it could be anything (even a web browser, or dsmc), just so long as it's started from the command line.
An advantage of this method is that it wouldn't be necessary to break up the file foo if it were bigger than the daily limit. Of course it would take a while to send foo over (if foo were 1TB, it would take 5 days), but it would take that long anyway.
trickle has a daemon version called trickled, which controls every subsequent run of trickle. Example:
l=$(( (200*10**6)/(24*60**2) ))
trickled -d $l
trickle scp foo username@remotehost:~/ &
trickle scp bar username@remotehost:~/ &
trickle scp baz username@remotehost:~/ &
Supposing that each of the files foo, bar, and baz were 1TB in size, trickled would still keep the transfer within the 200GB/day limit.
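Applied to dsmc from the question, the arithmetic and invocation might look like this (untested; trickle works via LD_PRELOAD, so it can only shape dynamically linked, non-setuid programs, and whether that holds for a given dsmc install is an assumption here -- note also that a backup is upstream traffic, hence -u):
# 185 GiB/day expressed as KB/s, to stay safely under the 200 GB server limit
l=$(( (185 * 2**30) / (24*60*60) / 1000 ))    # ~2299 KB/s
sudo trickle -s -u "$l" dsmc incr /path/to/backup/location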
Great suggestion; this seems far more sensible to me than killing the process arbitrarily
– tink
Jan 14 at 2:44
Your input can be parsed entirely in bash. Here's a sample:
max=$[185*(2**30)]
export total=0
while read first second third rest; do
    [[ "$first" == "Normal" && "$second" == "File-->" ]] && {
        size=${third//,/}
        echo "file: $size"
        total=$(( total + size ))
        (( total > max )) && kill something
    }
done < ~/tmp/your-input
If you're truly limited by the time taken to spawn a sub-process, this avoids the overhead even of calling out to awk or sed.
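If you'd rather keep the parsing out of bash proper, the same trick works with one long-lived awk doing the totalling and the kill, so nothing is forked per line either. A rough, untested sketch, reusing the 185 GiB cap and the PID-file convention from the question's script:
max=$(( 185 * 2**30 ))
rm -f /tmp/dsmc-script.PID

( sudo dsmc incr "$1" & echo $! > /tmp/dsmc-script.PID; wait ) |
awk -v max="$max" '
    { print }                               # pass the dsmc output through unchanged
    /Normal File-->/ {
        size = $3
        gsub(/,/, "", size)                 # strip the thousands separators
        sent += size
        printf "Sent %.0f bytes, %.0f in total\n", size, sent
        if (sent > max) {
            printf "Stopping: %.0f exceeds the permitted %.0f\n", sent, max
            system("cat /tmp/dsmc-script.PID | sudo xargs kill")
            exit
        }
    }'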