When counting source files and LOC with locate and find - why do Python files come up different?

I am having trouble understanding why find and locate would work differently for C and Python source files. My goal is to count the number of source files and the sum of their source code lines for a given language. I used both find and locate to compare outputs (updatedb was run with sudo just before this to make sure locate reports current results).



For C files this works as expected: the number of source files is the same.



$ find / -name *.c |& grep -v "Permission denied" | wc -l
1056
$ locate *.c | wc -l
1056


Using xargs, the sum of source code lines also comes up the same.



$ locate *.c | xargs wc -l | tail -3
138 /usr/src/kernels/3.10.0-693.el7.ppc64/scripts/selinux/genheaders/genheaders.c
147 /usr/src/kernels/3.10.0-693.el7.ppc64/scripts/selinux/mdp/mdp.c
705376 total

$ find / -name *.c |& grep -v "Permission denied" | xargs wc -l | tail -3
2994 /opt/Python-3.6.2/Objects/listobject.c
821 /opt/Python-3.6.2/Objects/bytes_methods.c
705376 total


Just to test, this also works for files with a .java extension - I get the same consistent results. However, things change when I repeat the same for Python files (i.e. the .py extension).



The number of source files matches.



$ find / -name *.py |& grep -v "Permission denied" | wc -l
9249
$ locate *.py | wc -l
9249


But the sum of lines of code for Python files gives very different results.



$ locate *.py | xargs wc -l | tail -3
wc: /usr/lib/python2.7/site-packages/setuptools/script: No such file or directory
wc: template: No such file or directory
wc: (dev).py: No such file or directory
wc: /usr/lib/python2.7/site-packages/setuptools/script: No such file or directory
wc: template.py: No such file or directory
220 /usr/src/kernels/3.10.0-693.el7.ppc64/scripts/rt-tester/rt-tester.py
129 /usr/src/kernels/3.10.0-693.el7.ppc64/scripts/tracing/draw_functrace.py
753350 total

$ find / -name *.py |& grep -v "Permission denied" | xargs wc -l | tail -3
wc: /usr/lib/python2.7/site-packages/setuptools/script: No such file or directory
wc: template: No such file or directory
wc: (dev).py: No such file or directory
wc: /usr/lib/python2.7/site-packages/setuptools/script: No such file or directory
wc: template.py: No such file or directory
1919 /opt/Python-3.6.2/python-gdb.py
69 /opt/Python-3.6.2/python-config.py
1034101 total


Can someone explain why this is the case? What's so different about Python files (I can't really believe it has to do with the file type, but I'm stumped). What am I missing here?



I get the same odd results under Ubuntu and Red Hat.



I ran updatedb with sudo, but I'm running all of these commands as a regular user.







  • Your results for both commands should be inconsistent, because once the maximum number of arguments for a command is reached (it's called MAXARGS), xargs will start a new wc command. So it outputs "total" several times during the process, and the earlier totals get lost because of tail. Try removing tail and see if multiple "total" lines come up.
    – Sergiy Kolodyazhnyy
    Oct 28 '17 at 0:06










  • @SergiyKolodyazhnyy When I read your comment it made intuitive sense to me (too), but I just re-ran both commands without the final pipe that uses tail and I end up with the same results :-/
    – Levon
    Oct 28 '17 at 0:41














asked Oct 27 '17 at 23:35 by Levon











1 Answer



accepted










There are many problems with your commands.



First, locate *.c only looks for files matching *.c if you run it in a directory that doesn't contain any file whose name matches *.c. Otherwise the shell expands *.c to the list of matching files. That's probably not happening, otherwise you'd get a lot fewer matches, but leaving unquoted globs like this is a bad habit because it will bite you one day. (It's a frequent topic on this site.) The same goes for find -name *.c. Instead, write



locate '*.c' …
find / -name '*.c' …


or something similar.
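To see the difference the quotes make, here is a minimal sketch that runs in a throwaway directory. The file names top.c and sub/deep.c are made up for the demonstration; the point is only that the unquoted pattern is rewritten by the shell before find ever sees it.

```shell
# Scratch directory with one .c file at the top and one in a subdirectory.
demo=$(mktemp -d)
mkdir "$demo/sub"
touch "$demo/top.c" "$demo/sub/deep.c"
cd "$demo" || exit 1

# Unquoted: the shell rewrites *.c to top.c, so find only looks for that name.
unquoted=$(( $(find . -name *.c | wc -l) ))
# Quoted: find receives the pattern itself and matches both files.
quoted=$(( $(find . -name '*.c' | wc -l) ))

echo "unquoted=$unquoted quoted=$quoted"
```

With two or more matching files in the current directory, the unquoted form would not just undercount: find would receive extra arguments and most likely fail with a usage error.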



There are some common reasons why locate and find might give different results. They don't seem to apply in your case since you're getting the same number of hits, but once again this is something you need to be aware of.




  • locate results are cached from the last run of updatedb. This usually runs once at night. find results are calculated each time you run the command.

  • Depending on the system, on which locate implementation you have, and on how it's configured, it may let you see only publicly accessible files (e.g. GNU findutils, rather than mlocate or slocate), or it may make an approximation of the files that you're allowed to access (e.g. because there's a complex setup involving Linux security modules that distinguish between applications trying to access the file).

  • The pattern *SUFFIX means the same thing to locate and to find -name (assuming that SUFFIX doesn't contain slashes or wildcards), but other patterns don't. For example, locate foo matches foo anywhere in the path, so it is equivalent to find / -path '*foo*', not to find / -name 'foo'.

Another thing that might, but probably doesn't, cause problems is that you've piped error messages from find into the data processing part of your command. You strip out lines containing Permission denied, which causes you to miss files containing this as part of their name (ok, you probably don't have any), and causes any error message that doesn't contain Permission denied to be interpreted as an input line. It is rarely a good idea to mix data output with error output, and it's absurd here. If you want to ignore errors, redirect them to /dev/null:



find … 2>/dev/null | …
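The effect of mixing the two streams can be shown without touching the real file system. In this sketch, emit is a hypothetical stand-in for find: it writes two paths to standard output and one error message to standard error.

```shell
# emit is a hypothetical stand-in for find: two paths on stdout, one error on stderr.
emit() {
    echo "/tmp/a.py"
    echo "find: '/root': Permission denied" >&2
    echo "/tmp/b.py"
}

# Merging stderr into the pipe turns the error message into a third "file name".
mixed=$(( $(emit 2>&1 | wc -l) ))
# Discarding stderr leaves only the real data.
clean=$(( $(emit 2>/dev/null | wc -l) ))

echo "mixed=$mixed clean=$clean"
```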


What is definitely biting you is that xargs expects an input syntax that's different from what find produces. In the input of xargs, any whitespace separates items, not just line breaks. The three characters \, ' and " are also parsed specially. Spaces are common in file names, and all other characters are permitted apart from / and null bytes. One of the lines that xargs receives as input is



/usr/lib/python2.7/site-packages/setuptools/script template (dev).py


For xargs, that's three items: /usr/lib/python2.7/site-packages/setuptools/script, template and (dev).py. The reason for the error messages from wc should now be clear.
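The splitting can be reproduced in isolation. This sketch feeds just the base name from the error messages to xargs and asks a child shell how many arguments it received; the sh -c 'echo "$#"' sh helper is only there for counting and is not part of the original commands.

```shell
name='script template (dev).py'

# Default xargs parsing: blanks separate items, so the child sees three arguments.
split=$(printf '%s\n' "$name" | xargs sh -c 'echo "$#"' sh)
# Null-delimited parsing: the whole name arrives as one argument.
whole=$(printf '%s\0' "$name" | xargs -0 sh -c 'echo "$#"' sh)

echo "split=$split whole=$whole"
```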



There are several solutions for this. One is to use the null-delimited format for find and xargs. This works with any file name, even file names containing newlines (which are permitted, but uncommon).



find / -name '*.py' -print0 | xargs -0 wc -l | tail -3


Another is to forget about the problematic xargs and make find invoke the command directly.



find / -name '*.py' -exec wc -l {} + | tail -3


The first solution may be applicable to your locate implementation; check whether it has a -0 option. The second solution is specific to find. If you're stuck with newline-delimited output from locate, and you have the GNU version of xargs, then you can use -d '\n' to make it parse the input as newline-delimited without any form of quoting.



locate '*.py' | xargs -d '\n' wc -l | tail -3
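The null-delimited and -exec variants can be tried safely on a scratch directory rather than on /. The sketch below assumes two made-up files, one of which reuses the awkward name from the error messages, and checks that both approaches report the same total.

```shell
demo=$(mktemp -d)
printf 'a\nb\n'    > "$demo/plain.py"
printf 'c\nd\ne\n' > "$demo/script template (dev).py"

# Null-delimited hand-off from find to xargs.
via_xargs=$(find "$demo" -name '*.py' -print0 | xargs -0 wc -l | tail -n 1 | awk '{print $1}')
# find invoking wc directly, with no xargs involved.
via_exec=$(find "$demo" -name '*.py' -exec wc -l {} + | tail -n 1 | awk '{print $1}')

echo "via_xargs=$via_xargs via_exec=$via_exec"
```

Both pipelines read the first field of the final "total" line, which is only the grand total when everything fits in a single wc invocation, as it does here.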


This was your main problem. An additional problem is that there's a maximum length to the command line. The xargs command (or the -exec … + action of find) puts as many file names as it can on a command line, and if they don't all fit, then the command (here, wc -l) is executed multiple times, once for each batch of files. With tail -3, you're only seeing the last two files and the total for the last batch (assuming that there are at least two files in the last batch). The files in the previous batches are not reflected in this output. Since find and locate may not report files in the same order, you may see different results.
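Real batches only appear with a very large number of files, but the effect can be imitated on a small scale. This sketch uses xargs -n 2 to force a new wc invocation every two files; the files f1 to f4 are made up and hold 1, 2, 3 and 4 lines.

```shell
demo=$(mktemp -d)
printf '1\n'          > "$demo/f1"
printf '1\n2\n'       > "$demo/f2"
printf '1\n2\n3\n'    > "$demo/f3"
printf '1\n2\n3\n4\n' > "$demo/f4"

# -n 2 caps each wc invocation at two files, which forces two batches.
out=$(printf '%s\0' "$demo/f1" "$demo/f2" "$demo/f3" "$demo/f4" | xargs -0 -n 2 wc -l)

# Each batch prints its own "total" line; tail only ever sees the last one.
batches=$(printf '%s\n' "$out" | grep -c ' total$')
last_total=$(printf '%s\n' "$out" | tail -n 1 | awk '{print $1}')

echo "batches=$batches last_total=$last_total"
```

The four files hold 10 lines altogether, yet the last "total" line only covers f3 and f4.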



How to solve the maximum length problem depends on what you want to do with the data. If all you want is grand totals, then one way (assuming no newlines in file names) is to count all total lines.



… | xargs -d '\n' wc -l | awk '/^[[:space:]]*[0-9]+[[:space:]]+total$/ {total += $1} END {print total}'
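If only the grand total matters, another option that does not rely on parsing "total" lines is to concatenate the files and count once. This is a sketch of that alternative, not something taken from the commands above; it reuses the forced two-file batches to show that batching no longer affects the result.

```shell
demo=$(mktemp -d)
printf '1\n'          > "$demo/f1"
printf '1\n2\n'       > "$demo/f2"
printf '1\n2\n3\n'    > "$demo/f3"
printf '1\n2\n3\n4\n' > "$demo/f4"

# cat may run several times, but everything flows into a single wc.
grand=$(( $(printf '%s\0' "$demo/f1" "$demo/f2" "$demo/f3" "$demo/f4" | xargs -0 -n 2 cat | wc -l) ))

echo "grand=$grand"
```

The trade-off is that per-file counts are lost, and every file's contents pass through the pipe, which is slower on large trees.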



























  • locate doesn't include hidden files like find, does it ?
    – Sergiy Kolodyazhnyy
    Oct 28 '17 at 0:25










  • @SergiyKolodyazhnyy locate and find both include hidden files. It's basically only shell wildcards (i.e. * in the shell, not * in find) and ls that ignore dot files.
    – Gilles
    Oct 28 '17 at 0:44










  • Thank you for all this info and advice re pushing error messages to /dev/null rather than handling it like I did. It'll take me a while to digest this wealth of information - I did realize after I posted that using quotes around my "*.c" would be advisable. Thanks also for the alternatives offered to find the numbers I was looking for. My quick initial question is, why did I not run into problems with the .java and .c files? I apologize if the answer is in your extensive reply, in which case just ignore my question; I will read your answer carefully.
    – Levon
    Oct 28 '17 at 0:53










  • PS: Feels pretty humbling.
    – Levon
    Oct 28 '17 at 0:53






  • @Levon You didn't run into problems with .java and .c files because none of them had a name containing a space, or was contained in a directory whose name contains a space. It's rare for source code files to contain spaces in their name, usually people stick to valid identifiers in the language. I wonder how this Python file (which is part of a distribution package) got its name.
    – Gilles
    Oct 28 '17 at 1:02










Your Answer







StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "106"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













 

draft saved


draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f400980%2fwhen-counting-source-files-and-loc-with-locate-and-find-why-do-python-files-co%23new-answer', 'question_page');

);

Post as a guest






























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
2
down vote



accepted










There are many problems with your commands.



First, locate *.c only looks for files matching *.c if you run it in a directory that doesn't contain any file whose name matches *.c. Otherwise the shell expands *.c to the list of matching files. That's probably not happening, otherwise you'd get a lot fewer matches, but leaving unquoted globs like this is a bad habit because it will bite you one day. (It's a frequent topic on this site.) The same goes for find -name *.c. Instead, write



locate '*.c' …
find / -name '*.c' …


or something similar.



There are some common why locate and find might give different results. They don't seem to apply in your case since you're getting the same number of hits, but once again this is something you need to be aware of.




  • locate results are cached from the last run of updatedb. This usually runs once at night. find results calculated each time you run the command.

  • Depending on the system, on which locate implementation you have and on how it's configured, it may let you see only publicly accessible files (e.g. GNU findutils, rather than mlocate or slocate), or it may make an approximation of the files that you're allowed to access (e.g. because there's a complex setup involving Linux security modules that distinguish between applications trying to access the file).

  • The pattern *SUFFIX mean the same thing to locate and to find -name (assuming that SUFFIX doesn't contain slashes or wildcards), but other patterns don't. For example locate foo is equivalent to find / -name '*foo*', not to find / -name 'foo'.

Another thing that might, but probably doesn't, cause problems is that you've piped error messages from find into the data processing part of your command. You strip out lines containing Permission denied, which causes you to miss files containing this as part of their name (ok, you probably don't have any), and causes any error message that doesn't contain Permission denied to be interpreted as an input line. It is rarely a good idea to mix data output with error output, and it's absurd here. If you want to ignore errors, redirect them to /dev/null:



find … 2>/dev/null | …


What is definitely biting you is that xargs expects an input syntax that's different from what find produces. In the input of xargs, any whitespace separates items, not just line breaks. The three characters '" are also parsed specially. Spaces are common in file names and all other characters are permitted apart from / and from null bytes. One of the lines that xargs receives as input is



/usr/lib/python2.7/site-packages/setuptools/script template (dev).py


For xargs, that's three items: /usr/lib/python2.7/site-packages/setuptools/script, template and (dev).py. The reason for the error messages from wc should now be clear.



There are several solutions for this. One is to use the null-delimited format for find and xargs. This works with any file name, even file names containing newlines (which are permitted, but uncommon).



find / -name '*.py' -print0 | xargs -0 wc -l | tail -3


Another is to forget about the problematic xargs and make find invoke the command directly.



find / -name '*.py' -exec wc -l + | tail -3


The first solution may be applicable to your locate implementation, check if it has a -0 option. The second solution is specific to find. If you're stuck with newline-delimited output from locate, and you have the GNU version of xargs, then you can use -d 'n' to make it parse the input as newline-delimited without any form of quoting.



locate '*.py' | xargs -d 'n' wc -l | tail -3


This was your main problem. An additional problem is that there's a maximum length to the command line. The xargs command (or the -exec … + action of find) puts as many file names as it can on a command line, and if they don't all fit, then the command (here, wc -l) is executed multiple times, once for each batch of files. With tail -3, you're only seeing the last two files and the total for the last batch (assuming that there are at least two files in the last batch). The files in the previous batches are not reflected in this output. Since find and locate may not report files in the same order, you may see different results.



How to solve the maximum length problem depends on what you want to do with the data. If all you want is grand totals, then one way (assuming no newlines in file names) is to count all total lines.



… | xargs -d 'n' wc -l | awk '/^[0-9]+ttotal$/ total += $1 END print total'





share|improve this answer






















  • locate doesn't include hidden files like find, does it ?
    – Sergiy Kolodyazhnyy
    Oct 28 '17 at 0:25










  • @SergiyKolodyazhnyy locate and find both include hidden files. It's basically only shell wildcards (i.e. * in the shell, not * in find) and ls that ignore dot files.
    – Gilles
    Oct 28 '17 at 0:44










  • Thank you for all this info and advice re pushing error messages to /dev/null rather than handling it like I did. It'll take me a while to digest this wealth of information - I did realize after I posted that using quotes around my "*.c" would be advisable. And the alternatives offered to find the numbers I was looking for. My quick initial question is, why did I not run into problems with the .java and .c files? I apologize if the answer is your extensive reply, in which case just ignore my question, I will read read your answer carefully.
    – Levon
    Oct 28 '17 at 0:53










  • PS: Feels pretty humbling.
    – Levon
    Oct 28 '17 at 0:53






  • 1




    @Levon You didn't run into problems with .java and .c files because none of them had a name containing a space, or was contained in a directory whose name contains a space. It's rare for source code files to contain spaces in their name, usually people stick to valid identifiers in the language. I wonder how this Python file (which is part of a distribution package) got its name.
    – Gilles
    Oct 28 '17 at 1:02














up vote
2
down vote



accepted










There are many problems with your commands.



First, locate *.c only looks for files matching *.c if you run it in a directory that doesn't contain any file whose name matches *.c. Otherwise the shell expands *.c to the list of matching files. That's probably not happening, otherwise you'd get a lot fewer matches, but leaving unquoted globs like this is a bad habit because it will bite you one day. (It's a frequent topic on this site.) The same goes for find -name *.c. Instead, write



locate '*.c' …
find / -name '*.c' …


or something similar.



There are some common why locate and find might give different results. They don't seem to apply in your case since you're getting the same number of hits, but once again this is something you need to be aware of.




  • locate results are cached from the last run of updatedb. This usually runs once at night. find results calculated each time you run the command.

  • Depending on the system, on which locate implementation you have and on how it's configured, it may let you see only publicly accessible files (e.g. GNU findutils, rather than mlocate or slocate), or it may make an approximation of the files that you're allowed to access (e.g. because there's a complex setup involving Linux security modules that distinguish between applications trying to access the file).

  • The pattern *SUFFIX mean the same thing to locate and to find -name (assuming that SUFFIX doesn't contain slashes or wildcards), but other patterns don't. For example locate foo is equivalent to find / -name '*foo*', not to find / -name 'foo'.

Another thing that might, but probably doesn't, cause problems is that you've piped error messages from find into the data processing part of your command. You strip out lines containing Permission denied, which causes you to miss files containing this as part of their name (ok, you probably don't have any), and causes any error message that doesn't contain Permission denied to be interpreted as an input line. It is rarely a good idea to mix data output with error output, and it's absurd here. If you want to ignore errors, redirect them to /dev/null:



find … 2>/dev/null | …


What is definitely biting you is that xargs expects an input syntax that's different from what find produces. In the input of xargs, any whitespace separates items, not just line breaks. The three characters '" are also parsed specially. Spaces are common in file names and all other characters are permitted apart from / and from null bytes. One of the lines that xargs receives as input is



/usr/lib/python2.7/site-packages/setuptools/script template (dev).py


For xargs, that's three items: /usr/lib/python2.7/site-packages/setuptools/script, template and (dev).py. The reason for the error messages from wc should now be clear.



There are several solutions for this. One is to use the null-delimited format for find and xargs. This works with any file name, even file names containing newlines (which are permitted, but uncommon).



find / -name '*.py' -print0 | xargs -0 wc -l | tail -3


Another is to forget about the problematic xargs and make find invoke the command directly.



find / -name '*.py' -exec wc -l + | tail -3


The first solution may be applicable to your locate implementation, check if it has a -0 option. The second solution is specific to find. If you're stuck with newline-delimited output from locate, and you have the GNU version of xargs, then you can use -d 'n' to make it parse the input as newline-delimited without any form of quoting.



locate '*.py' | xargs -d 'n' wc -l | tail -3


This was your main problem. An additional problem is that there's a maximum length to the command line. The xargs command (or the -exec … + action of find) puts as many file names as it can on a command line, and if they don't all fit, then the command (here, wc -l) is executed multiple times, once for each batch of files. With tail -3, you're only seeing the last two files and the total for the last batch (assuming that there are at least two files in the last batch). The files in the previous batches are not reflected in this output. Since find and locate may not report files in the same order, you may see different results.



How to solve the maximum length problem depends on what you want to do with the data. If all you want is grand totals, then one way (assuming no newlines in file names) is to add up the total lines from all the batches.



… | xargs -d '\n' wc -l | awk '/^ *[0-9]+ total$/ {total += $1} END {print total}'
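One caveat when summing the total lines: wc prints a total only when it is given more than one file, so a batch that happens to contain a single file contributes nothing. The sketch below makes this visible by forcing tiny batches with -n 2, which is an artificial stand-in for the real command-line length limit. It also shows one possible alternative that is not part of the approach above: concatenating the files and counting once. All file names are hypothetical.

```shell
# Scratch directory with five one-line files (hypothetical names).
dir=$(mktemp -d)
for i in 1 2 3 4 5; do printf 'line\n' > "$dir/f$i.py"; done

# -n 2 forces small batches: wc runs three times and prints a "total"
# line only for the two batches that hold two files each.
sum_of_totals=$(find "$dir" -name '*.py' -print0 | xargs -0 -n 2 wc -l |
  awk '/^ *[0-9]+ total$/ {total += $1} END {print total + 0}')

# Concatenating the files and counting once does not depend on batching.
grand_total=$(find "$dir" -name '*.py' -exec cat {} + | wc -l)

echo "sum of total lines: $sum_of_totals, cat | wc -l: $grand_total"
rm -rf "$dir"
```

The two numbers differ by the one file that ended up alone in the last batch. The cat variant gives up the per-file counts, so it only fits when the grand total is all you need.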





  • locate doesn't include hidden files like find, does it ?
    – Sergiy Kolodyazhnyy
    Oct 28 '17 at 0:25










  • @SergiyKolodyazhnyy locate and find both include hidden files. It's basically only shell wildcards (i.e. * in the shell, not * in find) and ls that ignore dot files.
    – Gilles
    Oct 28 '17 at 0:44










  • Thank you for all this info and advice re pushing error messages to /dev/null rather than handling it like I did. It'll take me a while to digest this wealth of information - I did realize after I posted that using quotes around my "*.c" would be advisable. And the alternatives offered to find the numbers I was looking for. My quick initial question is, why did I not run into problems with the .java and .c files? I apologize if the answer is your extensive reply, in which case just ignore my question, I will read your answer carefully.
    – Levon
    Oct 28 '17 at 0:53










  • PS: Feels pretty humbling.
    – Levon
    Oct 28 '17 at 0:53






  • @Levon You didn't run into problems with .java and .c files because none of them had a name containing a space, or was contained in a directory whose name contains a space. It's rare for source code files to contain spaces in their name, usually people stick to valid identifiers in the language. I wonder how this Python file (which is part of a distribution package) got its name.
    – Gilles
    Oct 28 '17 at 1:02












edited Oct 29 '17 at 18:19

























answered Oct 28 '17 at 0:13









Gilles
