Find the smallest set of values where at least one value is present in every row of a file

up vote
2
down vote

favorite

I have a file df like:

I need to identify the smallest set of values such that every row in df contains at least one of these values.

From the above df the desired out is:

I have a process that works but is terribly slow. Is there a way I can execute this more quickly or parallelize the process?

x=1 
while [ $x -gt 0 ] 
do 
i=$(paste df | tr 't' 'n' | sort | uniq -c | sort -r -k1,1 -k2,2n | awk 'NR==1print $2') 
echo $i >> out 
grep -vw $i df > tmpdf 
cat tmpdf > df 
x=$(paste df | wc -l) 
done

edited Jun 27 at 11:43

Jeff Schaller

30.8k846104

asked Jun 27 at 10:16

Jay

111

Would 1 2 4 10 be an acceptable solution?
â€“Â icarus
Jun 27 at 10:37

due to the condition at least one - all values seem to be valid. Elaborate, why 7 should be ignored and 9 be in result?
â€“Â RomanPerekhrest
Jun 27 at 10:39

You are looking for the strongly connected nodes in a connected undirected graph. Or something similar.
â€“Â Kusalananda
Jun 27 at 10:51

@icarus Yes, 1 2 4 10 would be fine. The current code just takes the lowest value when values are in equal abundance, but taking the highest would do fine.
â€“Â Jay
Jun 27 at 11:23

@RomanPerekhrest I identify the most common value and use a subtractive process to find the next most common value. 1 appears 4 times, and then 2 appears 3 times. When rows containing 1 or 2 are removed, the next most frequently occuring value is 4. Rows containing 4 are thus removed, which concomitantly removes 7. The remaining row has value 9 and 10. From here, because 9 and 10 are equally abundant, I choose the lowest value 9, but as I mention to @icarus, taking the higher of equally abundant values would be fine.
â€“Â Jay
Jun 27 at 11:35

Â |Â
show 3 more comments

up vote
2
down vote

favorite

I have a file df like:

I need to identify the smallest set of values such that every row in df contains at least one of these values.

From the above df the desired out is:

I have a process that works but is terribly slow. Is there a way I can execute this more quickly or parallelize the process?

x=1 
while [ $x -gt 0 ] 
do 
i=$(paste df | tr 't' 'n' | sort | uniq -c | sort -r -k1,1 -k2,2n | awk 'NR==1print $2') 
echo $i >> out 
grep -vw $i df > tmpdf 
cat tmpdf > df 
x=$(paste df | wc -l) 
done

edited Jun 27 at 11:43

Jeff Schaller

30.8k846104

asked Jun 27 at 10:16

Jay

111

Would 1 2 4 10 be an acceptable solution?
â€“Â icarus
Jun 27 at 10:37

due to the condition at least one - all values seem to be valid. Elaborate, why 7 should be ignored and 9 be in result?
â€“Â RomanPerekhrest
Jun 27 at 10:39

You are looking for the strongly connected nodes in a connected undirected graph. Or something similar.
â€“Â Kusalananda
Jun 27 at 10:51

@icarus Yes, 1 2 4 10 would be fine. The current code just takes the lowest value when values are in equal abundance, but taking the highest would do fine.
â€“Â Jay
Jun 27 at 11:23

@RomanPerekhrest I identify the most common value and use a subtractive process to find the next most common value. 1 appears 4 times, and then 2 appears 3 times. When rows containing 1 or 2 are removed, the next most frequently occuring value is 4. Rows containing 4 are thus removed, which concomitantly removes 7. The remaining row has value 9 and 10. From here, because 9 and 10 are equally abundant, I choose the lowest value 9, but as I mention to @icarus, taking the higher of equally abundant values would be fine.
â€“Â Jay
Jun 27 at 11:35

Â |Â
show 3 more comments

up vote
2
down vote

favorite

I have a file df like:

I need to identify the smallest set of values such that every row in df contains at least one of these values.

From the above df the desired out is:

I have a process that works but is terribly slow. Is there a way I can execute this more quickly or parallelize the process?

x=1 
while [ $x -gt 0 ] 
do 
i=$(paste df | tr 't' 'n' | sort | uniq -c | sort -r -k1,1 -k2,2n | awk 'NR==1print $2') 
echo $i >> out 
grep -vw $i df > tmpdf 
cat tmpdf > df 
x=$(paste df | wc -l) 
done

edited Jun 27 at 11:43

Jeff Schaller

30.8k846104

asked Jun 27 at 10:16

Jay

111

I have a file df like:

I need to identify the smallest set of values such that every row in df contains at least one of these values.

From the above df the desired out is:

I have a process that works but is terribly slow. Is there a way I can execute this more quickly or parallelize the process?

x=1 
while [ $x -gt 0 ] 
do 
i=$(paste df | tr 't' 'n' | sort | uniq -c | sort -r -k1,1 -k2,2n | awk 'NR==1print $2') 
echo $i >> out 
grep -vw $i df > tmpdf 
cat tmpdf > df 
x=$(paste df | wc -l) 
done

edited Jun 27 at 11:43

Jeff Schaller

30.8k846104

asked Jun 27 at 10:16

Jay

111

edited Jun 27 at 11:43

Jeff Schaller

30.8k846104

edited Jun 27 at 11:43

Jeff Schaller

30.8k846104

edited Jun 27 at 11:43

Jeff Schaller

30.8k846104

asked Jun 27 at 10:16

Jay

111

asked Jun 27 at 10:16

Jay

111

asked Jun 27 at 10:16

Jay

111

Would 1 2 4 10 be an acceptable solution?
â€“Â icarus
Jun 27 at 10:37

due to the condition at least one - all values seem to be valid. Elaborate, why 7 should be ignored and 9 be in result?
â€“Â RomanPerekhrest
Jun 27 at 10:39

You are looking for the strongly connected nodes in a connected undirected graph. Or something similar.
â€“Â Kusalananda
Jun 27 at 10:51

@icarus Yes, 1 2 4 10 would be fine. The current code just takes the lowest value when values are in equal abundance, but taking the highest would do fine.
â€“Â Jay
Jun 27 at 11:23

@RomanPerekhrest I identify the most common value and use a subtractive process to find the next most common value. 1 appears 4 times, and then 2 appears 3 times. When rows containing 1 or 2 are removed, the next most frequently occuring value is 4. Rows containing 4 are thus removed, which concomitantly removes 7. The remaining row has value 9 and 10. From here, because 9 and 10 are equally abundant, I choose the lowest value 9, but as I mention to @icarus, taking the higher of equally abundant values would be fine.
â€“Â Jay
Jun 27 at 11:35

Â |Â
show 3 more comments

Would 1 2 4 10 be an acceptable solution?
â€“Â icarus
Jun 27 at 10:37

due to the condition at least one - all values seem to be valid. Elaborate, why 7 should be ignored and 9 be in result?
â€“Â RomanPerekhrest
Jun 27 at 10:39

You are looking for the strongly connected nodes in a connected undirected graph. Or something similar.
â€“Â Kusalananda
Jun 27 at 10:51

@icarus Yes, 1 2 4 10 would be fine. The current code just takes the lowest value when values are in equal abundance, but taking the highest would do fine.
â€“Â Jay
Jun 27 at 11:23

@RomanPerekhrest I identify the most common value and use a subtractive process to find the next most common value. 1 appears 4 times, and then 2 appears 3 times. When rows containing 1 or 2 are removed, the next most frequently occuring value is 4. Rows containing 4 are thus removed, which concomitantly removes 7. The remaining row has value 9 and 10. From here, because 9 and 10 are equally abundant, I choose the lowest value 9, but as I mention to @icarus, taking the higher of equally abundant values would be fine.
â€“Â Jay
Jun 27 at 11:35

Would 1 2 4 10 be an acceptable solution?
â€“Â icarus
Jun 27 at 10:37

due to the condition at least one - all values seem to be valid. Elaborate, why 7 should be ignored and 9 be in result?
â€“Â RomanPerekhrest
Jun 27 at 10:39

You are looking for the strongly connected nodes in a connected undirected graph. Or something similar.
â€“Â Kusalananda
Jun 27 at 10:51

@icarus Yes, 1 2 4 10 would be fine. The current code just takes the lowest value when values are in equal abundance, but taking the highest would do fine.
â€“Â Jay
Jun 27 at 11:23

@RomanPerekhrest I identify the most common value and use a subtractive process to find the next most common value. 1 appears 4 times, and then 2 appears 3 times. When rows containing 1 or 2 are removed, the next most frequently occuring value is 4. Rows containing 4 are thus removed, which concomitantly removes 7. The remaining row has value 9 and 10. From here, because 9 and 10 are equally abundant, I choose the lowest value 9, but as I mention to @icarus, taking the higher of equally abundant values would be fine.
â€“Â Jay
Jun 27 at 11:35

Â |Â
show 3 more comments

1 Answer
1

active

oldest

votes

up vote
0
down vote

Here's some "line-noisy" perl:

perl -lane '
 for $f (@F) push @$x$f, $. # 1.
 } END 
 use List::Util qw/any first/;
 sub sort_by_count_desc 
 map $_->[0] 
 sort $a->[0] <=> $b->[0] 
 map [$_, scalar(@$x$_)] 
 @_ 
 
 @ordered = sort_by_count_desc(keys %x); # 2.
 %result = ();
 for ($i=1; $i<=$.; $i++) # 3.
 $node = first any $_ == $i @$x$_ @ordered; # 4.
 $result$node = 1;
 
 print join "n", sort_by_count_desc(keys %result);
' df

Where

loops over the lines of the file and maps each value to the list of lines it appears on

creates an ordered line of the values sorted by the size of the list of appearances, descending

loops over the range of line numbers, and foreach line number

finds the first value for which the line number appears

This outputs

answered Jun 27 at 14:53

glenn jackman

45.6k264100

Thanks for your answer. However, I tested this on a different dataset and it outputs a set of values greater than the minimal number required to disconnect all nodes.
â€“Â Jay
Jun 27 at 19:57

Ah well. I did not follow a known algorithm with this implementation.
â€“Â glenn jackman
Jun 27 at 20:51

add a commentÂ END Â

up vote
0
down vote

Here's some "line-noisy" perl:

perl -lane '
 for $f (@F) push @$x$f, $. # 1.
 END improve this answer

answered Jun 27 at 14:53

glenn jackman

45.6k264100

Here's some "line-noisy" perl:

perl -lane '
 for $f (@F) push @$x$f, $. # 1.
 END {
 use List::Util qw/any first/;
 sub sort_by_count_desc 
 map $_->[0] 
 sort $a->[0] <=> $b->[0] 
 map [$_, scalar(@$x$_)] 
 @_ 
 
 @ordered = sort_by_count_desc(keys %x); # 2.
 %result = ();
 for ($i=1; $i<=$.; $i++) # 3.
 $node = first any $_ == $i @$x$_ @ordered; # 4.
 $result$node = 1;
 
 print join "n", sort_by_count_desc(keys %result);
' df

Where

loops over the lines of the file and maps each value to the list of lines it appears on

creates an ordered line of the values sorted by the size of the list of appearances, descending

loops over the range of line numbers, and foreach line number

finds the first value for which the line number appears

This outputs

answered Jun 27 at 14:53

glenn jackman

45.6k264100

answered Jun 27 at 14:53

glenn jackman

45.6k264100

answered Jun 27 at 14:53

glenn jackman

45.6k264100

answered Jun 27 at 14:53

glenn jackman

45.6k264100

Thanks for your answer. However, I tested this on a different dataset and it outputs a set of values greater than the minimal number required to disconnect all nodes.
â€“Â Jay
Jun 27 at 19:57

Ah well. I did not follow a known algorithm with this implementation.
â€“Â glenn jackman
Jun 27 at 20:51

add a commentÂ |Â

Thanks for your answer. However, I tested this on a different dataset and it outputs a set of values greater than the minimal number required to disconnect all nodes.
â€“Â Jay
Jun 27 at 19:57

Ah well. I did not follow a known algorithm with this implementation.
â€“Â glenn jackman
Jun 27 at 20:51

Thanks for your answer. However, I tested this on a different dataset and it outputs a set of values greater than the minimal number required to disconnect all nodes.
â€“Â Jay
Jun 27 at 19:57

Ah well. I did not follow a known algorithm with this implementation.
â€“Â glenn jackman
Jun 27 at 20:51

add a commentÂ |Â

draft saved

draft discarded

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f452188%2ffind-the-smallest-set-of-values-where-at-least-one-value-is-present-in-every-row%23new-answer', 'question_page');

);

Post as a guest

Name

搜尋此網誌

mjhjmtu