Find the smallest set of values where at least one value is present in every row of a file
Clash Royale CLAN TAG#URR8PPP
up vote
2
down vote
favorite
I have a file df
like:
1 4
1 6
1 7
1 10
2 3
2 9
2 10
3 4
4 7
9 10
I need to identify the smallest set of values such that every row in df contains at least one of these values.
From the above df
the desired out
is:
1
2
4
9
I have a process that works but is terribly slow. Is there a way I can execute this more quickly or parallelize the process?
x=1
while [ $x -gt 0 ]
do
i=$(paste df | tr 't' 'n' | sort | uniq -c | sort -r -k1,1 -k2,2n | awk 'NR==1print $2')
echo $i >> out
grep -vw $i df > tmpdf
cat tmpdf > df
x=$(paste df | wc -l)
done
text-processing numeric-data
 |Â
show 3 more comments
up vote
2
down vote
favorite
I have a file df
like:
1 4
1 6
1 7
1 10
2 3
2 9
2 10
3 4
4 7
9 10
I need to identify the smallest set of values such that every row in df contains at least one of these values.
From the above df
the desired out
is:
1
2
4
9
I have a process that works but is terribly slow. Is there a way I can execute this more quickly or parallelize the process?
x=1
while [ $x -gt 0 ]
do
i=$(paste df | tr 't' 'n' | sort | uniq -c | sort -r -k1,1 -k2,2n | awk 'NR==1print $2')
echo $i >> out
grep -vw $i df > tmpdf
cat tmpdf > df
x=$(paste df | wc -l)
done
text-processing numeric-data
Would 1 2 4 10 be an acceptable solution?
â icarus
Jun 27 at 10:37
due to the condition at least one - all values seem to be valid. Elaborate, why7
should be ignored and9
be in result?
â RomanPerekhrest
Jun 27 at 10:39
You are looking for the strongly connected nodes in a connected undirected graph. Or something similar.
â Kusalananda
Jun 27 at 10:51
@icarus Yes, 1 2 4 10 would be fine. The current code just takes the lowest value when values are in equal abundance, but taking the highest would do fine.
â Jay
Jun 27 at 11:23
@RomanPerekhrest I identify the most common value and use a subtractive process to find the next most common value.1
appears 4 times, and then2
appears 3 times. When rows containing1
or2
are removed, the next most frequently occuring value is4
. Rows containing4
are thus removed, which concomitantly removes7
. The remaining row has value9
and10
. From here, because9
and10
are equally abundant, I choose the lowest value9
, but as I mention to @icarus, taking the higher of equally abundant values would be fine.
â Jay
Jun 27 at 11:35
 |Â
show 3 more comments
up vote
2
down vote
favorite
up vote
2
down vote
favorite
I have a file df
like:
1 4
1 6
1 7
1 10
2 3
2 9
2 10
3 4
4 7
9 10
I need to identify the smallest set of values such that every row in df contains at least one of these values.
From the above df
the desired out
is:
1
2
4
9
I have a process that works but is terribly slow. Is there a way I can execute this more quickly or parallelize the process?
x=1
while [ $x -gt 0 ]
do
i=$(paste df | tr 't' 'n' | sort | uniq -c | sort -r -k1,1 -k2,2n | awk 'NR==1print $2')
echo $i >> out
grep -vw $i df > tmpdf
cat tmpdf > df
x=$(paste df | wc -l)
done
text-processing numeric-data
I have a file df
like:
1 4
1 6
1 7
1 10
2 3
2 9
2 10
3 4
4 7
9 10
I need to identify the smallest set of values such that every row in df contains at least one of these values.
From the above df
the desired out
is:
1
2
4
9
I have a process that works but is terribly slow. Is there a way I can execute this more quickly or parallelize the process?
x=1
while [ $x -gt 0 ]
do
i=$(paste df | tr 't' 'n' | sort | uniq -c | sort -r -k1,1 -k2,2n | awk 'NR==1print $2')
echo $i >> out
grep -vw $i df > tmpdf
cat tmpdf > df
x=$(paste df | wc -l)
done
text-processing numeric-data
edited Jun 27 at 11:43
Jeff Schaller
30.8k846104
30.8k846104
asked Jun 27 at 10:16
Jay
111
111
Would 1 2 4 10 be an acceptable solution?
â icarus
Jun 27 at 10:37
due to the condition at least one - all values seem to be valid. Elaborate, why7
should be ignored and9
be in result?
â RomanPerekhrest
Jun 27 at 10:39
You are looking for the strongly connected nodes in a connected undirected graph. Or something similar.
â Kusalananda
Jun 27 at 10:51
@icarus Yes, 1 2 4 10 would be fine. The current code just takes the lowest value when values are in equal abundance, but taking the highest would do fine.
â Jay
Jun 27 at 11:23
@RomanPerekhrest I identify the most common value and use a subtractive process to find the next most common value.1
appears 4 times, and then2
appears 3 times. When rows containing1
or2
are removed, the next most frequently occuring value is4
. Rows containing4
are thus removed, which concomitantly removes7
. The remaining row has value9
and10
. From here, because9
and10
are equally abundant, I choose the lowest value9
, but as I mention to @icarus, taking the higher of equally abundant values would be fine.
â Jay
Jun 27 at 11:35
 |Â
show 3 more comments
Would 1 2 4 10 be an acceptable solution?
â icarus
Jun 27 at 10:37
due to the condition at least one - all values seem to be valid. Elaborate, why7
should be ignored and9
be in result?
â RomanPerekhrest
Jun 27 at 10:39
You are looking for the strongly connected nodes in a connected undirected graph. Or something similar.
â Kusalananda
Jun 27 at 10:51
@icarus Yes, 1 2 4 10 would be fine. The current code just takes the lowest value when values are in equal abundance, but taking the highest would do fine.
â Jay
Jun 27 at 11:23
@RomanPerekhrest I identify the most common value and use a subtractive process to find the next most common value.1
appears 4 times, and then2
appears 3 times. When rows containing1
or2
are removed, the next most frequently occuring value is4
. Rows containing4
are thus removed, which concomitantly removes7
. The remaining row has value9
and10
. From here, because9
and10
are equally abundant, I choose the lowest value9
, but as I mention to @icarus, taking the higher of equally abundant values would be fine.
â Jay
Jun 27 at 11:35
Would 1 2 4 10 be an acceptable solution?
â icarus
Jun 27 at 10:37
Would 1 2 4 10 be an acceptable solution?
â icarus
Jun 27 at 10:37
due to the condition at least one - all values seem to be valid. Elaborate, why
7
should be ignored and 9
be in result?â RomanPerekhrest
Jun 27 at 10:39
due to the condition at least one - all values seem to be valid. Elaborate, why
7
should be ignored and 9
be in result?â RomanPerekhrest
Jun 27 at 10:39
You are looking for the strongly connected nodes in a connected undirected graph. Or something similar.
â Kusalananda
Jun 27 at 10:51
You are looking for the strongly connected nodes in a connected undirected graph. Or something similar.
â Kusalananda
Jun 27 at 10:51
@icarus Yes, 1 2 4 10 would be fine. The current code just takes the lowest value when values are in equal abundance, but taking the highest would do fine.
â Jay
Jun 27 at 11:23
@icarus Yes, 1 2 4 10 would be fine. The current code just takes the lowest value when values are in equal abundance, but taking the highest would do fine.
â Jay
Jun 27 at 11:23
@RomanPerekhrest I identify the most common value and use a subtractive process to find the next most common value.
1
appears 4 times, and then 2
appears 3 times. When rows containing 1
or 2
are removed, the next most frequently occuring value is 4
. Rows containing 4
are thus removed, which concomitantly removes 7
. The remaining row has value 9
and 10
. From here, because 9
and 10
are equally abundant, I choose the lowest value 9
, but as I mention to @icarus, taking the higher of equally abundant values would be fine.â Jay
Jun 27 at 11:35
@RomanPerekhrest I identify the most common value and use a subtractive process to find the next most common value.
1
appears 4 times, and then 2
appears 3 times. When rows containing 1
or 2
are removed, the next most frequently occuring value is 4
. Rows containing 4
are thus removed, which concomitantly removes 7
. The remaining row has value 9
and 10
. From here, because 9
and 10
are equally abundant, I choose the lowest value 9
, but as I mention to @icarus, taking the higher of equally abundant values would be fine.â Jay
Jun 27 at 11:35
 |Â
show 3 more comments
1 Answer
1
active
oldest
votes
up vote
0
down vote
Here's some "line-noisy" perl:
perl -lane '
for $f (@F) push @$x$f, $. # 1.
} END
use List::Util qw/any first/;
sub sort_by_count_desc
map $_->[0]
sort $a->[0] <=> $b->[0]
map [$_, scalar(@$x$_)]
@_
@ordered = sort_by_count_desc(keys %x); # 2.
%result = ();
for ($i=1; $i<=$.; $i++) # 3.
$node = first any $_ == $i @$x$_ @ordered; # 4.
$result$node = 1;
print join "n", sort_by_count_desc(keys %result);
' df
Where
- loops over the lines of the file and maps each value to the list of lines it appears on
- creates an ordered line of the values sorted by the size of the list of appearances, descending
- loops over the range of line numbers, and foreach line number
- finds the first value for which the line number appears
This outputs
1
2
4
10
Thanks for your answer. However, I tested this on a different dataset and it outputs a set of values greater than the minimal number required to disconnect all nodes.
â Jay
Jun 27 at 19:57
Ah well. I did not follow a known algorithm with this implementation.
â glenn jackman
Jun 27 at 20:51
add a comment END Â
up vote
0
down vote
up vote
0
down vote
Here's some "line-noisy" perl:
perl -lane '
for $f (@F) push @$x$f, $. # 1.
END improve this answer
Here's some "line-noisy" perl:
perl -lane '
for $f (@F) push @$x$f, $. # 1.
END {
use List::Util qw/any first/;
sub sort_by_count_desc
map $_->[0]
sort $a->[0] <=> $b->[0]
map [$_, scalar(@$x$_)]
@_
@ordered = sort_by_count_desc(keys %x); # 2.
%result = ();
for ($i=1; $i<=$.; $i++) # 3.
$node = first any $_ == $i @$x$_ @ordered; # 4.
$result$node = 1;
print join "n", sort_by_count_desc(keys %result);
' df
Where
- loops over the lines of the file and maps each value to the list of lines it appears on
- creates an ordered line of the values sorted by the size of the list of appearances, descending
- loops over the range of line numbers, and foreach line number
- finds the first value for which the line number appears
This outputs
1
2
4
10
answered Jun 27 at 14:53
glenn jackman
45.6k264100
45.6k264100
Thanks for your answer. However, I tested this on a different dataset and it outputs a set of values greater than the minimal number required to disconnect all nodes.
â Jay
Jun 27 at 19:57
Ah well. I did not follow a known algorithm with this implementation.
â glenn jackman
Jun 27 at 20:51
add a comment |Â
Thanks for your answer. However, I tested this on a different dataset and it outputs a set of values greater than the minimal number required to disconnect all nodes.
â Jay
Jun 27 at 19:57
Ah well. I did not follow a known algorithm with this implementation.
â glenn jackman
Jun 27 at 20:51
Thanks for your answer. However, I tested this on a different dataset and it outputs a set of values greater than the minimal number required to disconnect all nodes.
â Jay
Jun 27 at 19:57
Thanks for your answer. However, I tested this on a different dataset and it outputs a set of values greater than the minimal number required to disconnect all nodes.
â Jay
Jun 27 at 19:57
Ah well. I did not follow a known algorithm with this implementation.
â glenn jackman
Jun 27 at 20:51
Ah well. I did not follow a known algorithm with this implementation.
â glenn jackman
Jun 27 at 20:51
add a comment |Â
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f452188%2ffind-the-smallest-set-of-values-where-at-least-one-value-is-present-in-every-row%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Would 1 2 4 10 be an acceptable solution?
â icarus
Jun 27 at 10:37
due to the condition at least one - all values seem to be valid. Elaborate, why
7
should be ignored and9
be in result?â RomanPerekhrest
Jun 27 at 10:39
You are looking for the strongly connected nodes in a connected undirected graph. Or something similar.
â Kusalananda
Jun 27 at 10:51
@icarus Yes, 1 2 4 10 would be fine. The current code just takes the lowest value when values are in equal abundance, but taking the highest would do fine.
â Jay
Jun 27 at 11:23
@RomanPerekhrest I identify the most common value and use a subtractive process to find the next most common value.
1
appears 4 times, and then2
appears 3 times. When rows containing1
or2
are removed, the next most frequently occuring value is4
. Rows containing4
are thus removed, which concomitantly removes7
. The remaining row has value9
and10
. From here, because9
and10
are equally abundant, I choose the lowest value9
, but as I mention to @icarus, taking the higher of equally abundant values would be fine.â Jay
Jun 27 at 11:35