Find the smallest set of values where at least one value is present in every row of a file

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP











up vote
2
down vote

favorite
1












I have a file df like:



1 4
1 6
1 7
1 10
2 3
2 9
2 10
3 4
4 7
9 10


I need to identify the smallest set of values such that every row in df contains at least one of these values.



From the above df the desired out is:



1
2
4
9


I have a process that works but is terribly slow. Is there a way I can execute this more quickly or parallelize the process?



x=1 
while [ $x -gt 0 ]
do
i=$(paste df | tr 't' 'n' | sort | uniq -c | sort -r -k1,1 -k2,2n | awk 'NR==1print $2')
echo $i >> out
grep -vw $i df > tmpdf
cat tmpdf > df
x=$(paste df | wc -l)
done






share|improve this question





















  • Would 1 2 4 10 be an acceptable solution?
    – icarus
    Jun 27 at 10:37










  • due to the condition at least one - all values seem to be valid. Elaborate, why 7 should be ignored and 9 be in result?
    – RomanPerekhrest
    Jun 27 at 10:39











  • You are looking for the strongly connected nodes in a connected undirected graph. Or something similar.
    – Kusalananda
    Jun 27 at 10:51











  • @icarus Yes, 1 2 4 10 would be fine. The current code just takes the lowest value when values are in equal abundance, but taking the highest would do fine.
    – Jay
    Jun 27 at 11:23










  • @RomanPerekhrest I identify the most common value and use a subtractive process to find the next most common value. 1 appears 4 times, and then 2 appears 3 times. When rows containing 1 or 2 are removed, the next most frequently occuring value is 4. Rows containing 4 are thus removed, which concomitantly removes 7. The remaining row has value 9 and 10. From here, because 9 and 10 are equally abundant, I choose the lowest value 9, but as I mention to @icarus, taking the higher of equally abundant values would be fine.
    – Jay
    Jun 27 at 11:35















up vote
2
down vote

favorite
1












I have a file df like:



1 4
1 6
1 7
1 10
2 3
2 9
2 10
3 4
4 7
9 10


I need to identify the smallest set of values such that every row in df contains at least one of these values.



From the above df the desired out is:



1
2
4
9


I have a process that works but is terribly slow. Is there a way I can execute this more quickly or parallelize the process?



x=1 
while [ $x -gt 0 ]
do
i=$(paste df | tr 't' 'n' | sort | uniq -c | sort -r -k1,1 -k2,2n | awk 'NR==1print $2')
echo $i >> out
grep -vw $i df > tmpdf
cat tmpdf > df
x=$(paste df | wc -l)
done






share|improve this question





















  • Would 1 2 4 10 be an acceptable solution?
    – icarus
    Jun 27 at 10:37










  • due to the condition at least one - all values seem to be valid. Elaborate, why 7 should be ignored and 9 be in result?
    – RomanPerekhrest
    Jun 27 at 10:39











  • You are looking for the strongly connected nodes in a connected undirected graph. Or something similar.
    – Kusalananda
    Jun 27 at 10:51











  • @icarus Yes, 1 2 4 10 would be fine. The current code just takes the lowest value when values are in equal abundance, but taking the highest would do fine.
    – Jay
    Jun 27 at 11:23










  • @RomanPerekhrest I identify the most common value and use a subtractive process to find the next most common value. 1 appears 4 times, and then 2 appears 3 times. When rows containing 1 or 2 are removed, the next most frequently occuring value is 4. Rows containing 4 are thus removed, which concomitantly removes 7. The remaining row has value 9 and 10. From here, because 9 and 10 are equally abundant, I choose the lowest value 9, but as I mention to @icarus, taking the higher of equally abundant values would be fine.
    – Jay
    Jun 27 at 11:35













up vote
2
down vote

favorite
1









up vote
2
down vote

favorite
1






1





I have a file df like:



1 4
1 6
1 7
1 10
2 3
2 9
2 10
3 4
4 7
9 10


I need to identify the smallest set of values such that every row in df contains at least one of these values.



From the above df the desired out is:



1
2
4
9


I have a process that works but is terribly slow. Is there a way I can execute this more quickly or parallelize the process?



x=1 
while [ $x -gt 0 ]
do
i=$(paste df | tr 't' 'n' | sort | uniq -c | sort -r -k1,1 -k2,2n | awk 'NR==1print $2')
echo $i >> out
grep -vw $i df > tmpdf
cat tmpdf > df
x=$(paste df | wc -l)
done






share|improve this question













I have a file df like:



1 4
1 6
1 7
1 10
2 3
2 9
2 10
3 4
4 7
9 10


I need to identify the smallest set of values such that every row in df contains at least one of these values.



From the above df the desired out is:



1
2
4
9


I have a process that works but is terribly slow. Is there a way I can execute this more quickly or parallelize the process?



x=1 
while [ $x -gt 0 ]
do
i=$(paste df | tr 't' 'n' | sort | uniq -c | sort -r -k1,1 -k2,2n | awk 'NR==1print $2')
echo $i >> out
grep -vw $i df > tmpdf
cat tmpdf > df
x=$(paste df | wc -l)
done








share|improve this question












share|improve this question




share|improve this question








edited Jun 27 at 11:43









Jeff Schaller

30.8k846104




30.8k846104









asked Jun 27 at 10:16









Jay

111




111











  • Would 1 2 4 10 be an acceptable solution?
    – icarus
    Jun 27 at 10:37










  • due to the condition at least one - all values seem to be valid. Elaborate, why 7 should be ignored and 9 be in result?
    – RomanPerekhrest
    Jun 27 at 10:39











  • You are looking for the strongly connected nodes in a connected undirected graph. Or something similar.
    – Kusalananda
    Jun 27 at 10:51











  • @icarus Yes, 1 2 4 10 would be fine. The current code just takes the lowest value when values are in equal abundance, but taking the highest would do fine.
    – Jay
    Jun 27 at 11:23










  • @RomanPerekhrest I identify the most common value and use a subtractive process to find the next most common value. 1 appears 4 times, and then 2 appears 3 times. When rows containing 1 or 2 are removed, the next most frequently occuring value is 4. Rows containing 4 are thus removed, which concomitantly removes 7. The remaining row has value 9 and 10. From here, because 9 and 10 are equally abundant, I choose the lowest value 9, but as I mention to @icarus, taking the higher of equally abundant values would be fine.
    – Jay
    Jun 27 at 11:35

















  • Would 1 2 4 10 be an acceptable solution?
    – icarus
    Jun 27 at 10:37










  • due to the condition at least one - all values seem to be valid. Elaborate, why 7 should be ignored and 9 be in result?
    – RomanPerekhrest
    Jun 27 at 10:39











  • You are looking for the strongly connected nodes in a connected undirected graph. Or something similar.
    – Kusalananda
    Jun 27 at 10:51











  • @icarus Yes, 1 2 4 10 would be fine. The current code just takes the lowest value when values are in equal abundance, but taking the highest would do fine.
    – Jay
    Jun 27 at 11:23










  • @RomanPerekhrest I identify the most common value and use a subtractive process to find the next most common value. 1 appears 4 times, and then 2 appears 3 times. When rows containing 1 or 2 are removed, the next most frequently occuring value is 4. Rows containing 4 are thus removed, which concomitantly removes 7. The remaining row has value 9 and 10. From here, because 9 and 10 are equally abundant, I choose the lowest value 9, but as I mention to @icarus, taking the higher of equally abundant values would be fine.
    – Jay
    Jun 27 at 11:35
















Would 1 2 4 10 be an acceptable solution?
– icarus
Jun 27 at 10:37




Would 1 2 4 10 be an acceptable solution?
– icarus
Jun 27 at 10:37












due to the condition at least one - all values seem to be valid. Elaborate, why 7 should be ignored and 9 be in result?
– RomanPerekhrest
Jun 27 at 10:39





due to the condition at least one - all values seem to be valid. Elaborate, why 7 should be ignored and 9 be in result?
– RomanPerekhrest
Jun 27 at 10:39













You are looking for the strongly connected nodes in a connected undirected graph. Or something similar.
– Kusalananda
Jun 27 at 10:51





You are looking for the strongly connected nodes in a connected undirected graph. Or something similar.
– Kusalananda
Jun 27 at 10:51













@icarus Yes, 1 2 4 10 would be fine. The current code just takes the lowest value when values are in equal abundance, but taking the highest would do fine.
– Jay
Jun 27 at 11:23




@icarus Yes, 1 2 4 10 would be fine. The current code just takes the lowest value when values are in equal abundance, but taking the highest would do fine.
– Jay
Jun 27 at 11:23












@RomanPerekhrest I identify the most common value and use a subtractive process to find the next most common value. 1 appears 4 times, and then 2 appears 3 times. When rows containing 1 or 2 are removed, the next most frequently occuring value is 4. Rows containing 4 are thus removed, which concomitantly removes 7. The remaining row has value 9 and 10. From here, because 9 and 10 are equally abundant, I choose the lowest value 9, but as I mention to @icarus, taking the higher of equally abundant values would be fine.
– Jay
Jun 27 at 11:35





@RomanPerekhrest I identify the most common value and use a subtractive process to find the next most common value. 1 appears 4 times, and then 2 appears 3 times. When rows containing 1 or 2 are removed, the next most frequently occuring value is 4. Rows containing 4 are thus removed, which concomitantly removes 7. The remaining row has value 9 and 10. From here, because 9 and 10 are equally abundant, I choose the lowest value 9, but as I mention to @icarus, taking the higher of equally abundant values would be fine.
– Jay
Jun 27 at 11:35











1 Answer
1






active

oldest

votes

















up vote
0
down vote













Here's some "line-noisy" perl:



perl -lane '
for $f (@F) push @$x$f, $. # 1.
} END
use List::Util qw/any first/;
sub sort_by_count_desc
map $_->[0]
sort $a->[0] <=> $b->[0]
map [$_, scalar(@$x$_)]
@_

@ordered = sort_by_count_desc(keys %x); # 2.
%result = ();
for ($i=1; $i<=$.; $i++) # 3.
$node = first any $_ == $i @$x$_ @ordered; # 4.
$result$node = 1;

print join "n", sort_by_count_desc(keys %result);
' df


Where



  1. loops over the lines of the file and maps each value to the list of lines it appears on

  2. creates an ordered line of the values sorted by the size of the list of appearances, descending

  3. loops over the range of line numbers, and foreach line number

  4. finds the first value for which the line number appears

This outputs



1
2
4
10





share END improve this answer





















  • Thanks for your answer. However, I tested this on a different dataset and it outputs a set of values greater than the minimal number required to disconnect all nodes.
    – Jay
    Jun 27 at 19:57










  • Ah well. I did not follow a known algorithm with this implementation.
    – glenn jackman
    Jun 27 at 20:51












up vote
0
down vote










up vote
0
down vote









Here's some "line-noisy" perl:



perl -lane '
for $f (@F) push @$x$f, $. # 1.
END improve this answer













Here's some "line-noisy" perl:



perl -lane '
for $f (@F) push @$x$f, $. # 1.
END {
use List::Util qw/any first/;
sub sort_by_count_desc
map $_->[0]
sort $a->[0] <=> $b->[0]
map [$_, scalar(@$x$_)]
@_

@ordered = sort_by_count_desc(keys %x); # 2.
%result = ();
for ($i=1; $i<=$.; $i++) # 3.
$node = first any $_ == $i @$x$_ @ordered; # 4.
$result$node = 1;

print join "n", sort_by_count_desc(keys %result);
' df


Where



  1. loops over the lines of the file and maps each value to the list of lines it appears on

  2. creates an ordered line of the values sorted by the size of the list of appearances, descending

  3. loops over the range of line numbers, and foreach line number

  4. finds the first value for which the line number appears

This outputs



1
2
4
10






share|improve this answer













share|improve this answer



share|improve this answer











answered Jun 27 at 14:53









glenn jackman

45.6k264100




45.6k264100











  • Thanks for your answer. However, I tested this on a different dataset and it outputs a set of values greater than the minimal number required to disconnect all nodes.
    – Jay
    Jun 27 at 19:57










  • Ah well. I did not follow a known algorithm with this implementation.
    – glenn jackman
    Jun 27 at 20:51
















  • Thanks for your answer. However, I tested this on a different dataset and it outputs a set of values greater than the minimal number required to disconnect all nodes.
    – Jay
    Jun 27 at 19:57










  • Ah well. I did not follow a known algorithm with this implementation.
    – glenn jackman
    Jun 27 at 20:51















Thanks for your answer. However, I tested this on a different dataset and it outputs a set of values greater than the minimal number required to disconnect all nodes.
– Jay
Jun 27 at 19:57




Thanks for your answer. However, I tested this on a different dataset and it outputs a set of values greater than the minimal number required to disconnect all nodes.
– Jay
Jun 27 at 19:57












Ah well. I did not follow a known algorithm with this implementation.
– glenn jackman
Jun 27 at 20:51




Ah well. I did not follow a known algorithm with this implementation.
– glenn jackman
Jun 27 at 20:51












 

draft saved


draft discarded


























 


draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f452188%2ffind-the-smallest-set-of-values-where-at-least-one-value-is-present-in-every-row%23new-answer', 'question_page');

);

Post as a guest













































































Popular posts from this blog

How to check contact read email or not when send email to Individual?

Bahrain

Postfix configuration issue with fips on centos 7; mailgun relay