When counting source files and LOC with locate and find - why do Python files come up different?
I am having trouble understanding why find and locate would work differently for C and Python source files. My goal is to count the number of source files and the sum of their source code lines for a given language. I used both find and locate to compare outputs (updatedb was run with sudo just prior to this to make sure locate reports current results).
For C files this works as expected; the number of source files is the same:
$ find / -name *.c |& grep -v "Permission denied" | wc -l
1056
$ locate *.c | wc -l
1056
Using xargs, the sum of source code lines also comes out the same.
$ locate *.c | xargs wc -l | tail -3
138 /usr/src/kernels/3.10.0-693.el7.ppc64/scripts/selinux/genheaders/genheaders.c
147 /usr/src/kernels/3.10.0-693.el7.ppc64/scripts/selinux/mdp/mdp.c
705376 total
$ find / -name *.c |& grep -v "Permission denied" | xargs wc -l | tail -3
2994 /opt/Python-3.6.2/Objects/listobject.c
821 /opt/Python-3.6.2/Objects/bytes_methods.c
705376 total
Just to test, this also works for files with a .java extension: I get the same consistent results. However, things differ when I repeat the same for Python files (i.e. the .py extension).
The number of source files matches.
$ find / -name *.py |& grep -v "Permission denied" | wc -l
9249
$ locate *.py | wc -l
9249
But the sum of lines of code for Python files gives very different results.
$ locate *.py | xargs wc -l | tail -3
wc: /usr/lib/python2.7/site-packages/setuptools/script: No such file or directory
wc: template: No such file or directory
wc: (dev).py: No such file or directory
wc: /usr/lib/python2.7/site-packages/setuptools/script: No such file or directory
wc: template.py: No such file or directory
220 /usr/src/kernels/3.10.0-693.el7.ppc64/scripts/rt-tester/rt-tester.py
129 /usr/src/kernels/3.10.0-693.el7.ppc64/scripts/tracing/draw_functrace.py
753350 total
$ find / -name *.py |& grep -v "Permission denied" | xargs wc -l | tail -3
wc: /usr/lib/python2.7/site-packages/setuptools/script: No such file or directory
wc: template: No such file or directory
wc: (dev).py: No such file or directory
wc: /usr/lib/python2.7/site-packages/setuptools/script: No such file or directory
wc: template.py: No such file or directory
1919 /opt/Python-3.6.2/python-gdb.py
69 /opt/Python-3.6.2/python-config.py
1034101 total
Can someone explain why this is the case? What's so different about Python files (I can't really believe it has to do with the file type, but I'm stumped). What am I missing here?
I get the same odd results under Ubuntu and RHEL.
I run updatedb with sudo, but I'm running all of these commands as a regular user.
ubuntu rhel find locate
Your results for both commands should be inconsistent because once the maximum number of arguments for a command is reached (it's called MAXARGS), xargs will start a new wc command. So it outputs "total" several times during the process, and that gets lost because of tail. Try removing tail and see if multiple "total" lines come up.
– Sergiy Kolodyazhnyy
Oct 28 '17 at 0:06
@SergiyKolodyazhnyy When I read your comment it made intuitive sense to me (too), but I just re-ran both commands without the final pipe that uses tail and I end up with the same results :-/
– Levon
Oct 28 '17 at 0:41
asked Oct 27 '17 at 23:35
Levon
1 Answer (accepted)
There are many problems with your commands.
First, locate *.c only looks for files matching *.c if you run it in a directory that doesn't contain any file whose name matches *.c. Otherwise the shell expands *.c to the list of matching files. That's probably not happening, otherwise you'd get a lot fewer matches, but leaving unquoted globs like this is a bad habit because it will bite you one day. (It's a frequent topic on this site.) The same goes for find -name *.c. Instead, write
locate '*.c' …
find / -name '*.c' …
or something similar.
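To see the effect of an unquoted glob in isolation, here is a small sketch. The scratch directory and the file names main.c and sub/util.c are made up for the demonstration; nothing outside the temporary directory is touched.

```shell
# Hypothetical scratch tree: one .c file at the top, one in a subdirectory.
demo=$(mktemp -d)
mkdir "$demo/sub"
touch "$demo/main.c" "$demo/sub/util.c"

# Unquoted: the shell expands *.c to "main.c" before find starts,
# so find only ever looks for files named exactly main.c.
unquoted_count=$(cd "$demo" && find . -name *.c | wc -l | tr -d ' ')

# Quoted: find receives the pattern itself and matches it at every depth.
quoted_count=$(cd "$demo" && find . -name '*.c' | wc -l | tr -d ' ')

echo "unquoted: $unquoted_count  quoted: $quoted_count"
rm -rf "$demo"
```

With two or more matching files in the current directory, the unquoted form usually fails outright instead, because find then sees extra arguments after the pattern.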
There are some common reasons why locate and find might give different results. They don't seem to apply in your case since you're getting the same number of hits, but once again this is something you need to be aware of.
- locate results are cached from the last run of updatedb. This usually runs once at night. find results are calculated each time you run the command.
- Depending on the system, on which locate implementation you have and on how it's configured, it may let you see only publicly accessible files (e.g. GNU findutils, rather than mlocate or slocate), or it may make an approximation of the files that you're allowed to access (e.g. because there's a complex setup involving Linux security modules that distinguish between applications trying to access the file).
- The pattern *SUFFIX means the same thing to locate and to find -name (assuming that SUFFIX doesn't contain slashes or wildcards), but other patterns don't. For example locate foo is equivalent to find / -path '*foo*' (locate matches against the whole path name), not to find / -name 'foo'.
Another thing that might, but probably doesn't, cause problems is that you've piped error messages from find into the data processing part of your command. You strip out lines containing Permission denied, which causes you to miss files containing this as part of their name (ok, you probably don't have any), and causes any error message that doesn't contain Permission denied to be interpreted as an input line. It is rarely a good idea to mix data output with error output, and it's absurd here. If you want to ignore errors, redirect them to /dev/null:
find … 2>/dev/null | …
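A minimal sketch of the difference, assuming a made-up scratch directory and a deliberately nonexistent start path ("missing") so that find emits an error even when run as root:

```shell
# Hypothetical scratch directory with a single Python file.
demo=$(mktemp -d)
touch "$demo/a.py"

# Errors merged into the data stream: the "No such file or directory"
# message is counted as if it were a file name.
with_errors=$(find "$demo" "$demo/missing" -name '*.py' 2>&1 | wc -l | tr -d ' ')

# Errors discarded: only real matches reach the pipeline.
clean=$(find "$demo" "$demo/missing" -name '*.py' 2>/dev/null | wc -l | tr -d ' ')

echo "with errors: $with_errors  clean: $clean"
rm -rf "$demo"
```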
What is definitely biting you is that xargs expects an input syntax that's different from what find produces. In the input of xargs, any whitespace separates items, not just line breaks. The three characters ' " \ (single quote, double quote and backslash) are also parsed specially. Spaces are common in file names and all other characters are permitted apart from / and from null bytes. One of the lines that xargs receives as input is
/usr/lib/python2.7/site-packages/setuptools/script template (dev).py
For xargs, that's three items: /usr/lib/python2.7/site-packages/setuptools/script, template and (dev).py. The reason for the error messages from wc should now be clear.
There are several solutions for this. One is to use the null-delimited format for find and xargs. This works with any file name, even file names containing newlines (which are permitted, but uncommon).
find / -name '*.py' -print0 | xargs -0 wc -l | tail -3
Another is to forget about the problematic xargs and make find invoke the command directly.
find / -name '*.py' -exec wc -l {} + | tail -3
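As a quick check that both approaches cope with awkward names, here is a sketch run against a made-up scratch directory rather than /. The file names and line counts are invented for the test.

```shell
# Hypothetical scratch directory: one name with spaces and parentheses, one plain.
demo=$(mktemp -d)
printf 'a\nb\n' > "$demo/script template (dev).py"
printf 'c\n' > "$demo/plain.py"

# Null-delimited hand-off from find to xargs.
via_print0=$(find "$demo" -name '*.py' -print0 | xargs -0 wc -l |
  awk '$2 == "total" {print $1}')

# find runs wc itself; {} + passes many file names per invocation.
via_exec=$(find "$demo" -name '*.py' -exec wc -l {} + |
  awk '$2 == "total" {print $1}')

echo "print0/xargs -0: $via_print0  -exec +: $via_exec"
rm -rf "$demo"
```

Both variants see two files and three lines in total. With so few files there is only one batch, so a single total line is enough here.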
The first solution may be applicable to your locate implementation; check if it has a -0 option. The second solution is specific to find. If you're stuck with newline-delimited output from locate, and you have the GNU version of xargs, then you can use -d '\n' to make it parse the input as newline-delimited without any form of quoting.
locate '*.py' | xargs -d '\n' wc -l | tail -3
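The following sketch tries the -d '\n' variant. It assumes GNU xargs, and it uses find on a made-up scratch directory as a stand-in for locate, since locate depends on an up-to-date database; both print one path per line.

```shell
# Hypothetical scratch directory with a file name containing spaces.
demo=$(mktemp -d)
printf 'a\nb\n' > "$demo/script template (dev).py"
printf 'c\n' > "$demo/plain.py"

# Newline-delimited input, parsed by GNU xargs with no quote or space handling.
newline_total=$(find "$demo" -name '*.py' | xargs -d '\n' wc -l |
  awk '$2 == "total" {print $1}')

echo "total lines: $newline_total"
rm -rf "$demo"
```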
This was your main problem. An additional problem is that there's a maximum length to the command line. The xargs command (or the -exec … + action of find) puts as many file names as it can on a command line, and if they don't all fit, then the command (here, wc -l) is executed multiple times, once for each batch of files. With tail -3, you're only seeing the last two files and the total for the last batch (assuming that there are at least two files in the last batch). The files in the previous batches are not reflected in this output. Since find and locate may not report files in the same order, you may see different results.
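Batching is hard to trigger with a handful of files, so this sketch forces it with xargs -n 2, which caps each wc invocation at two file names. The scratch directory and the six one-line files are invented for the demonstration.

```shell
# Hypothetical scratch directory: six files, one line each.
demo=$(mktemp -d)
for i in 1 2 3 4 5 6; do echo "line" > "$demo/f$i.py"; done

# -n 2 simulates hitting the command-line limit: wc runs three times.
batched=$(find "$demo" -name '*.py' -print0 | xargs -0 -n 2 wc -l)

# Each wc run prints its own "total" line.
total_lines=$(printf '%s\n' "$batched" | grep -c ' total$')

# tail only shows the total of the last batch, not the grand total.
last_total=$(printf '%s\n' "$batched" | tail -n 1 | awk '{print $1}')

echo "total lines printed: $total_lines  last total: $last_total (real sum is 6)"
rm -rf "$demo"
```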
How to solve the maximum length problem depends on what you want to do with the data. If all you want is grand totals, then one way (assuming no newlines in file names) is to add up all the total lines.
… | xargs -d '\n' wc -l | awk '/^ *[0-9]+ total$/ {total += $1} END {print total}'
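Here is a sketch of that summation on invented data, again using xargs -n 2 to force several batches. It also shows an alternative that is not from the answer above: concatenating every file and counting once, which avoids parsing total lines altogether.

```shell
# Hypothetical scratch directory: file fN.py has N lines, 21 lines overall.
demo=$(mktemp -d)
for i in 1 2 3 4 5 6; do seq "$i" > "$demo/f$i.py"; done

# Add up the "total" line of every batch.
grand_total=$(find "$demo" -name '*.py' -print0 | xargs -0 -n 2 wc -l |
  awk '/^ *[0-9]+ total$/ {total += $1} END {print total}')

# Alternative: concatenate all files and count lines once.
cross_check=$(find "$demo" -name '*.py' -exec cat {} + | wc -l | tr -d ' ')

echo "sum of totals: $grand_total  cat | wc -l: $cross_check"
rm -rf "$demo"
```

One caveat with summing total lines: wc prints no total line when it is given a single file, so a batch that happens to contain exactly one file would be missed. The cat variant does not have that problem.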
locate doesn't include hidden files like find, does it?
– Sergiy Kolodyazhnyy
Oct 28 '17 at 0:25
@SergiyKolodyazhnyy locate and find both include hidden files. It's basically only shell wildcards (i.e. * in the shell, not * in find) and ls that ignore dot files.
– Gilles
Oct 28 '17 at 0:44
Thank you for all this info and advice re pushing error messages to /dev/null rather than handling it like I did. It'll take me a while to digest this wealth of information - I did realize after I posted that using quotes around my "*.c" would be advisable. And the alternatives offered to find the numbers I was looking for. My quick initial question is, why did I not run into problems with the .java and .c files? I apologize if the answer is in your extensive reply, in which case just ignore my question, I will read your answer carefully.
– Levon
Oct 28 '17 at 0:53
PS: Feels pretty humbling.
– Levon
Oct 28 '17 at 0:53
@Levon You didn't run into problems with .java and .c files because none of them had a name containing a space, or was contained in a directory whose name contains a space. It's rare for source code files to contain spaces in their name; usually people stick to valid identifiers in the language. I wonder how this Python file (which is part of a distribution package) got its name.
– Gilles
Oct 28 '17 at 1:02
show 2 more comments
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
2
down vote
accepted
There are many problems with your commands.
First, locate *.c
only looks for files matching *.c
if you run it in a directory that doesn't contain any file whose name matches *.c
. Otherwise the shell expands *.c
to the list of matching files. That's probably not happening, otherwise you'd get a lot fewer matches, but leaving unquoted globs like this is a bad habit because it will bite you one day. (It's a frequent topic on this site.) The same goes for find -name *.c
. Instead, write
locate '*.c' â¦
find / -name '*.c' â¦
or something similar.
There are some common why locate
and find
might give different results. They don't seem to apply in your case since you're getting the same number of hits, but once again this is something you need to be aware of.
locate
results are cached from the last run ofupdatedb
. This usually runs once at night.find
results calculated each time you run the command.- Depending on the system, on which
locate
implementation you have and on how it's configured, it may let you see only publicly accessible files (e.g. GNU findutils, rather than mlocate or slocate), or it may make an approximation of the files that you're allowed to access (e.g. because there's a complex setup involving Linux security modules that distinguish between applications trying to access the file). - The pattern
*SUFFIX
mean the same thing tolocate
and tofind -name
(assuming thatSUFFIX
doesn't contain slashes or wildcards), but other patterns don't. For examplelocate foo
is equivalent tofind / -name '*foo*'
, not tofind / -name 'foo'
.
Another thing that might, but probably doesn't, cause problems is that you've piped error messages from find
into the data processing part of your command. You strip out lines containing Permission denied
, which causes you to miss files containing this as part of their name (ok, you probably don't have any), and causes any error message that doesn't contain Permission denied
to be interpreted as an input line. It is rarely a good idea to mix data output with error output, and it's absurd here. If you want to ignore errors, redirect them to /dev/null
:
find ⦠2>/dev/null | â¦
What is definitely biting you is that xargs
expects an input syntax that's different from what find
produces. In the input of xargs
, any whitespace separates items, not just line breaks. The three characters '"
are also parsed specially. Spaces are common in file names and all other characters are permitted apart from /
and from null bytes. One of the lines that xargs
receives as input is
/usr/lib/python2.7/site-packages/setuptools/script template (dev).py
For xargs
, that's three items: /usr/lib/python2.7/site-packages/setuptools/script
, template
and (dev).py
. The reason for the error messages from wc
should now be clear.
There are several solutions for this. One is to use the null-delimited format for find
and xargs
. This works with any file name, even file names containing newlines (which are permitted, but uncommon).
find / -name '*.py' -print0 | xargs -0 wc -l | tail -3
Another is to forget about the problematic xargs
and make find
invoke the command directly.
find / -name '*.py' -exec wc -l + | tail -3
The first solution may be applicable to your locate
implementation, check if it has a -0
option. The second solution is specific to find
. If you're stuck with newline-delimited output from locate
, and you have the GNU version of xargs
, then you can use -d 'n'
to make it parse the input as newline-delimited without any form of quoting.
locate '*.py' | xargs -d 'n' wc -l | tail -3
This was your main problem. An additional problem is that there's a maximum length to the command line. The xargs
command (or the -exec ⦠+
action of find
) puts as many file names as it can on a command line, and if they don't all fit, then the command (here, wc -l
) is executed multiple times, once for each batch of files. With tail -3
, you're only seeing the last two files and the total for the last batch (assuming that there are at least two files in the last batch). The files in the previous batches are not reflected in this output. Since find
and locate
may not report files in the same order, you may see different results.
How to solve the maximum length problem depends on what you want to do with the data. If all you want is grand totals, then one way (assuming no newlines in file names) is to count all total
lines.
⦠| xargs -d 'n' wc -l | awk '/^[0-9]+ttotal$/ total += $1 END print total'
locate
doesn't include hidden files like find, does it ?
â Sergiy Kolodyazhnyy
Oct 28 '17 at 0:25
@SergiyKolodyazhnyylocate
andfind
both include hidden files. It's basically only shell wildcards (i.e.*
in the shell, not*
infind
) andls
that ignore dot files.
â Gilles
Oct 28 '17 at 0:44
Thank you for all this info and advice re pushing error messages to /dev/null rather than handling it like I did. It'll take me a while to digest this wealth of information - I did realize after I posted that using quotes around my "*.c" would be advisable. And the alternatives offered to find the numbers I was looking for. My quick initial question is, why did I not run into problems with the .java and .c files? I apologize if the answer is your extensive reply, in which case just ignore my question, I will read read your answer carefully.
â Levon
Oct 28 '17 at 0:53
PS: Feels pretty humbling.
â Levon
Oct 28 '17 at 0:53
1
@Levon You didn't run into problems with.java
and.c
files because none of them had a name containing a space, or was contained in a directory whose name contains a space. It's rare for source code files to contain spaces in their name, usually people stick to valid identifiers in the language. I wonder how this Python file (which is part of a distribution package) got its name.
â Gilles
Oct 28 '17 at 1:02
 |Â
show 2 more comments
up vote
2
down vote
accepted
There are many problems with your commands.
First, locate *.c
only looks for files matching *.c
if you run it in a directory that doesn't contain any file whose name matches *.c
. Otherwise the shell expands *.c
to the list of matching files. That's probably not happening, otherwise you'd get a lot fewer matches, but leaving unquoted globs like this is a bad habit because it will bite you one day. (It's a frequent topic on this site.) The same goes for find -name *.c
. Instead, write
locate '*.c' â¦
find / -name '*.c' â¦
or something similar.
There are some common why locate
and find
might give different results. They don't seem to apply in your case since you're getting the same number of hits, but once again this is something you need to be aware of.
locate
results are cached from the last run ofupdatedb
. This usually runs once at night.find
results calculated each time you run the command.- Depending on the system, on which
locate
implementation you have and on how it's configured, it may let you see only publicly accessible files (e.g. GNU findutils, rather than mlocate or slocate), or it may make an approximation of the files that you're allowed to access (e.g. because there's a complex setup involving Linux security modules that distinguish between applications trying to access the file). - The pattern
*SUFFIX
mean the same thing tolocate
and tofind -name
(assuming thatSUFFIX
doesn't contain slashes or wildcards), but other patterns don't. For examplelocate foo
is equivalent tofind / -name '*foo*'
, not tofind / -name 'foo'
.
Another thing that might, but probably doesn't, cause problems is that you've piped error messages from find
into the data processing part of your command. You strip out lines containing Permission denied
, which causes you to miss files containing this as part of their name (ok, you probably don't have any), and causes any error message that doesn't contain Permission denied
to be interpreted as an input line. It is rarely a good idea to mix data output with error output, and it's absurd here. If you want to ignore errors, redirect them to /dev/null
:
find ⦠2>/dev/null | â¦
What is definitely biting you is that xargs
expects an input syntax that's different from what find
produces. In the input of xargs
, any whitespace separates items, not just line breaks. The three characters '"
are also parsed specially. Spaces are common in file names and all other characters are permitted apart from /
and from null bytes. One of the lines that xargs
receives as input is
/usr/lib/python2.7/site-packages/setuptools/script template (dev).py
For xargs
, that's three items: /usr/lib/python2.7/site-packages/setuptools/script
, template
and (dev).py
. The reason for the error messages from wc
should now be clear.
There are several solutions for this. One is to use the null-delimited format for find
and xargs
. This works with any file name, even file names containing newlines (which are permitted, but uncommon).
find / -name '*.py' -print0 | xargs -0 wc -l | tail -3
Another is to forget about the problematic xargs
and make find
invoke the command directly.
find / -name '*.py' -exec wc -l + | tail -3
The first solution may be applicable to your locate
implementation, check if it has a -0
option. The second solution is specific to find
. If you're stuck with newline-delimited output from locate
, and you have the GNU version of xargs
, then you can use -d 'n'
to make it parse the input as newline-delimited without any form of quoting.
locate '*.py' | xargs -d 'n' wc -l | tail -3
This was your main problem. An additional problem is that there's a maximum length to the command line. The xargs
command (or the -exec ⦠+
action of find
) puts as many file names as it can on a command line, and if they don't all fit, then the command (here, wc -l
) is executed multiple times, once for each batch of files. With tail -3
, you're only seeing the last two files and the total for the last batch (assuming that there are at least two files in the last batch). The files in the previous batches are not reflected in this output. Since find
and locate
may not report files in the same order, you may see different results.
How to solve the maximum length problem depends on what you want to do with the data. If all you want is grand totals, then one way (assuming no newlines in file names) is to count all total
lines.
⦠| xargs -d 'n' wc -l | awk '/^[0-9]+ttotal$/ total += $1 END print total'
locate
doesn't include hidden files like find, does it ?
â Sergiy Kolodyazhnyy
Oct 28 '17 at 0:25
@SergiyKolodyazhnyylocate
andfind
both include hidden files. It's basically only shell wildcards (i.e.*
in the shell, not*
infind
) andls
that ignore dot files.
â Gilles
Oct 28 '17 at 0:44
Thank you for all this info and advice re pushing error messages to /dev/null rather than handling it like I did. It'll take me a while to digest this wealth of information - I did realize after I posted that using quotes around my "*.c" would be advisable. And the alternatives offered to find the numbers I was looking for. My quick initial question is, why did I not run into problems with the .java and .c files? I apologize if the answer is your extensive reply, in which case just ignore my question, I will read read your answer carefully.
â Levon
Oct 28 '17 at 0:53
PS: Feels pretty humbling.
â Levon
Oct 28 '17 at 0:53
1
@Levon You didn't run into problems with.java
and.c
files because none of them had a name containing a space, or was contained in a directory whose name contains a space. It's rare for source code files to contain spaces in their name, usually people stick to valid identifiers in the language. I wonder how this Python file (which is part of a distribution package) got its name.
â Gilles
Oct 28 '17 at 1:02
 |Â
show 2 more comments
up vote
2
down vote
accepted
up vote
2
down vote
accepted
There are many problems with your commands.
First, locate *.c
only looks for files matching *.c
if you run it in a directory that doesn't contain any file whose name matches *.c
. Otherwise the shell expands *.c
to the list of matching files. That's probably not happening, otherwise you'd get a lot fewer matches, but leaving unquoted globs like this is a bad habit because it will bite you one day. (It's a frequent topic on this site.) The same goes for find -name *.c
. Instead, write
locate '*.c' â¦
find / -name '*.c' â¦
or something similar.
There are some common reasons why locate and find might give different results. They don't seem to apply in your case since you're getting the same number of hits, but once again this is something you need to be aware of.
- locate results are cached from the last run of updatedb. This usually runs once at night. find results are calculated each time you run the command.
- Depending on the system, on which locate implementation you have and on how it's configured, it may let you see only publicly accessible files (e.g. GNU findutils, rather than mlocate or slocate), or it may make an approximation of the files that you're allowed to access (e.g. because there's a complex setup involving Linux security modules that distinguish between applications trying to access the file).
- The pattern *SUFFIX means the same thing to locate and to find -name (assuming that SUFFIX doesn't contain slashes or wildcards), but other patterns don't. For example locate foo is equivalent to find / -name '*foo*', not to find / -name 'foo'.
Another thing that might, but probably doesn't, cause problems is that you've piped error messages from find into the data processing part of your command. You strip out lines containing Permission denied, which causes you to miss files containing this as part of their name (ok, you probably don't have any), and causes any error message that doesn't contain Permission denied to be interpreted as an input line. It is rarely a good idea to mix data output with error output, and it's absurd here. If you want to ignore errors, redirect them to /dev/null:
find … 2>/dev/null | …
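To illustrate the difference, here is a small sketch. fake_find is a hypothetical stand-in for find (it is not a real command): it prints one result on standard output and one error that is not a permission error on standard error.

```shell
#!/bin/sh
# fake_find is a made-up stand-in for find: one real result on stdout,
# one error message (not "Permission denied") on stderr.
fake_find() {
    printf '%s\n' '/src/real_file.c'
    printf '%s\n' 'find: /proc/123: No such file or directory' >&2
}

# Merging stderr into the pipe (what |& does in bash) and filtering on
# "Permission denied" lets the other error through as if it were a file name.
mixed=$(fake_find 2>&1 | grep -v 'Permission denied' | grep -c '')

# Discarding stderr keeps only real results.
clean=$(fake_find 2>/dev/null | grep -c '')

echo "lines seen with merged stderr: $mixed"
echo "lines seen with stderr discarded: $clean"
```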
What is definitely biting you is that xargs expects an input syntax that's different from what find produces. In the input of xargs, any whitespace separates items, not just line breaks. The three characters ' " \ are also parsed specially. Spaces are common in file names, and all other characters are permitted apart from / and null bytes. One of the lines that xargs receives as input is
/usr/lib/python2.7/site-packages/setuptools/script template (dev).py
For xargs, that's three items: /usr/lib/python2.7/site-packages/setuptools/script, template and (dev).py. The reason for the error messages from wc should now be clear.
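The splitting can be observed without touching the file system. This sketch feeds the path as a plain string to xargs and counts how many items come out; the variable names are only for illustration.

```shell
#!/bin/sh
line='/usr/lib/python2.7/site-packages/setuptools/script template (dev).py'

# Default parsing: xargs splits on blanks, so one path becomes several items.
# -n 1 runs printf once per item; grep -c '' counts the resulting lines.
default_items=$(printf '%s\n' "$line" | xargs -n 1 printf '%s\n' | grep -c '')

# Null-delimited parsing: the whole path stays one item.
null_items=$(printf '%s\0' "$line" | xargs -0 -n 1 printf '%s\n' | grep -c '')

echo "items with default parsing: $default_items"
echo "items with -0: $null_items"
```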
There are several solutions for this. One is to use the null-delimited format for find and xargs. This works with any file name, even file names containing newlines (which are permitted, but uncommon).
find / -name '*.py' -print0 | xargs -0 wc -l | tail -3
Another is to forget about the problematic xargs and make find invoke the command directly.
find / -name '*.py' -exec wc -l {} + | tail -3
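Both approaches can be tried on a throwaway directory. This is a sketch under the assumption that the two files below fit into a single wc invocation; the file names are made up, with one deliberately containing spaces and parentheses.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Hypothetical test files: 2 lines and 1 line.
printf 'a\nb\n' > "$tmp/script template (dev).py"
printf 'c\n'    > "$tmp/plain.py"

# Null-delimited hand-off from find to xargs.
total_null=$(find "$tmp" -name '*.py' -print0 | xargs -0 wc -l |
    awk 'NF == 2 && $2 == "total" {print $1}')

# Letting find run wc itself.
total_exec=$(find "$tmp" -name '*.py' -exec wc -l {} + |
    awk 'NF == 2 && $2 == "total" {print $1}')

echo "total via -print0 | xargs -0: $total_null"
echo "total via -exec ... {} +:     $total_exec"
rm -rf "$tmp"
```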
The first solution may be applicable to your locate implementation; check whether it has a -0 option. The second solution is specific to find. If you're stuck with newline-delimited output from locate, and you have the GNU version of xargs, then you can use -d '\n' to make it parse the input as newline-delimited without any form of quoting.
locate '*.py' | xargs -d '\n' wc -l | tail -3
This was your main problem. An additional problem is that there's a maximum length to the command line. The xargs command (or the -exec … {} + action of find) puts as many file names as it can on a command line, and if they don't all fit, then the command (here, wc -l) is executed multiple times, once for each batch of files. With tail -3, you're only seeing the last two files and the total for the last batch (assuming that there are at least two files in the last batch). The files in the previous batches are not reflected in this output. Since find and locate may not report files in the same order, you may see different results.
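The batching effect is easy to reproduce on a small scale. In this sketch, -n 2 is used only to force tiny batches; in real use the batch size is determined by the command-line length limit. The six one-line files are made up for the demonstration.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Six hypothetical one-line files.
for i in 1 2 3 4 5 6; do
    printf 'x\n' > "$tmp/f$i.py"
done

# Three batches of two files each: wc runs three times and prints
# three separate "total" lines, none of which is the grand total.
total_lines=$(find "$tmp" -name '*.py' -print0 | xargs -0 -n 2 wc -l |
    grep -c ' total$')

echo "number of total lines printed: $total_lines"
rm -rf "$tmp"
```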
How to solve the maximum length problem depends on what you want to do with the data. If all you want is grand totals, then one way (assuming no newlines in file names) is to add up all the total lines.
… | xargs -d '\n' wc -l | awk '/^ *[0-9]+ total$/ {total += $1} END {print total}'
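One caveat worth checking on your own system: wc prints no total line when it is given a single file, so a batch that happens to contain one file is missed when only total lines are summed. The sketch below shows this on made-up files, again with -n 2 just to force small batches, and compares it with a possible alternative that concatenates the files and counts once.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Five hypothetical two-line files: 10 lines overall.
for i in 1 2 3 4 5; do
    printf 'x\ny\n' > "$tmp/f$i.py"
done

# Batches of 2, 2 and 1 files: the single-file batch has no "total" line.
sum_totals=$(find "$tmp" -name '*.py' -print0 | xargs -0 -n 2 wc -l |
    awk '/^ *[0-9]+ total$/ {t += $1} END {print t + 0}')

# Alternative: concatenate everything and count once, so batching is irrelevant.
sum_cat=$(find "$tmp" -name '*.py' -print0 | xargs -0 -n 2 cat | wc -l | tr -d ' ')

echo "sum of total lines: $sum_totals"
echo "lines via cat | wc: $sum_cat"
rm -rf "$tmp"
```

The cat variant gives only the grand total, not per-file counts, so it fits the "all you want is grand totals" case.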
edited Oct 29 '17 at 18:19
answered Oct 28 '17 at 0:13
Gilles
Your result for both commands should be inconsistent because once the maximum command-line length is reached (the system limit is called ARG_MAX), xargs will start a new wc command. So it outputs "total" several times during the process, and it gets lost because of tail. Try removing tail and see if multiple "total" lines come up.
– Sergiy Kolodyazhnyy
Oct 28 '17 at 0:06
@SergiyKolodyazhnyy When I read your comment it made intuitive sense to me (too), but I just re-ran both commands without the final pipe that uses tail and I end up with the same results :-/
– Levon
Oct 28 '17 at 0:41