Sort function doesn't work
A week ago I asked a question here about sorting. If I use this code for sorting and creating files:
tail -n +2 File1.txt |
split -l1 --filter='
head -n 1 File2.txt &&
cat <(tail -n +2 File2.txt) - > "$FILE"'
it works on the files I used in that example, but when I use it on my real files, which are bigger, the sorting doesn't seem to work.
I fixed this problem before by using LC_ALL=C, but it seems that only worked once, so I don't know what the real problem is. If I specifically print and sort the column, it works, but not inside this code.
Maybe it is too much to do in one pass? I have 151 columns of annotated data, and I only want to sort on columns 43 and 151, but I still need the new sorted files. Please help me out.
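For reference, the kind of header-preserving sort on columns 43 and 151 being asked about would look roughly like the following. This is only a sketch: it assumes whitespace-separated fields with none of them blank, and File1_sorted.txt is just a placeholder output name.
# sketch: sort the body of File1.txt numerically on column 43, then column 151,
# keeping the header line first; LC_ALL=C forces plain byte-wise comparisons
{
  head -n 1 File1.txt
  tail -n +2 File1.txt | LC_ALL=C sort -k43,43g -k151,151g
} > File1_sorted.txt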
text-processing files sort locale
asked Jan 4 at 9:13, edited Jan 4 at 10:08 – Osman Altun
The files can't be found if I do that
– Osman Altun
Jan 4 at 10:47
Is that the actual command you've tried using? Have you tried changing the sort command, replacing -k4 (which sorts on the contents of the 4th field through to the end of the line) with '-k43,43 -k151,151' (which sorts on just column 43, then just column 151), I think.
– Guy
Jan 16 at 1:49
@Guy I also tried -k43,43; the problem was that I had columns with empty fields, and that caused a problem for the sort function.
– Osman Altun
Jan 16 at 6:24
I presume the problem is that, because sort counts fields as starting at the transition between a word and whitespace, a blank field isn't seen; the row is simply treated as having fewer fields overall (see the sketch after these comments).
– Guy
Jan 16 at 12:00
Are all the columns laid out like the previous example showed, with each starting at a particular character in the line?
– Guy
Jan 16 at 12:11
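To make the field-counting behaviour from the comments concrete, here is a rough sketch; the two sample rows are modelled on the example data in the answer below, with the second row's fourth column left blank.
# with default (whitespace) field splitting, the blank 4th column in the second
# line is simply not counted, so its last value is what sort sees as field 4
printf 'chr3 31663820 31663820 0.713 3\nchr3 33093371 33093371         5\n' |
  sort -k4,4g
# the -k4,4 key is therefore 0.713 (phylop) for the first row but 5 (GPS) for
# the second, i.e. two different columns end up being compared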
1 Answer
Right, for this I'm going from the format of the previous examples of data, where a column is defined by position, i.e. how many characters from the start of the line it begins. Unfortunately for you, as you have found, if any of these columns are blank, the tools you have been trying to use don't count them as a column at all:
col 1 col 2 col 3 col 4 col 5
chr3 31663820 31663820 0.713 3
col 1 col 2 col 3 col 4
chr3 33093371 3.753 4
I've written up a quick script in Python, as it felt a bit easier to work out. When given your two files on the command line, it sorts them according to a hard-coded part of the line, but this can obviously be changed.
At the moment you would also have to sort the list once for each field you wanted. But again, it would be possible to update the key function to return a tuple of floats in the desired order, rather than a single float, for the comparisons.
#! /usr/bin/python
# own_sort.py
# usage: ./own_sort.py 'unique values file' 'duplicate values file'

# allows access to command line arguments
import sys

# this is just to get some example input: locate the phylop column's start and
# end character positions from a sample line
test_line = 'chr3 39597927 39597927 8.721 5'
phylop_col = (test_line.find('8.721'), test_line.find('8.721') + 7)


# this returns a sorting function for the particular column start and end
# positions desired, so it's easy to change
def return_sorting_func(col_start, col_end):
    # a sorting key for python's built-in sort; the key must take a single
    # element and return something for the sort function to compare
    def sorting_func(line):
        # use the exact location, i.e. how many characters from the start of the line
        field = line[col_start:col_end]
        try:
            # if this field holds a float, return it
            return float(field)
        except ValueError:
            # else return a default
            return float('-inf')  # gives a default of lowest rank
            # return 0.0          # or a default value of 0
    return sorting_func


if __name__ == '__main__':
    # read both files into their own lists
    with open(sys.argv[1]) as uniqs, open(sys.argv[2]) as dups:
        uniq_list = list(uniqs.readlines())
        dups_list = list(dups.readlines())

    # sort everything after the header lines, using the key function from above
    # with the relevant start and end positions, and reverse the resulting list
    combined_list = sorted(uniq_list[1:] + dups_list[1:],
                           key=return_sorting_func(phylop_col[0], phylop_col[1]),
                           reverse=True)

    # to print out, cut off the trailing newline and print the header and footer
    # around the other results, which can then be piped from stdout
    print(dups_list[0][:-1])
    for line in combined_list:
        print(line[:-1])
    print(dups_list[0][:-1])
So, using the given files from the other question, I've ended up with:
~$>cat unique_data.txt
chromosoom start end phylop GPS
chr1 28745756 28745756 7.905 5
chr1 31227215 31227215 10.263 5
chr1 47562402 47562402 2.322 4
chr1 64859630 64859630 1.714 3
chr1 70805699 70805699 1.913 2
chr1 89760653 89760653 -0.1 0
chr1 95630169 95630169 -1.651 -1
~$>cat dups_data.txt
chromosoom start end phylop GPS
chr3 15540407 15540407 -1.391 -1
chr3 30648039 30648039 2.214 3
chr3 31663820 31663820 0.713 3
chr3 33093371 33093371 3.753 4
chr3 37050398 37050398 1.650 2
chr3 38053456 38053456 1.1 1
chr3 39597927 39597927 8.721 5
~$>cat dups_data_with_gaps_1.txt
chromosoom start end phylop GPS
chr3 15540407 15540407 -1.391 -1
chr3 30648039 30648039 2.214 3
chr3 31663820 31663820 0.713 3
chr3 33093371 3.753 4
chr3 37050398 37050398 1.650 2
chr3 38053456 38053456 1.1 1
chr3 39597927 8.721 5
which both give the same output:
~$>./own_sort.py unique_data.txt dups_data_with_gaps_1.txt
chromosoom start end phylop GPS
chr1 31227215 31227215 10.263 5
chr3 39597927 39597927 8.721 5
chr1 28745756 28745756 7.905 5
chr3 33093371 33093371 3.753 4
chr1 47562402 47562402 2.322 4
chr3 30648039 30648039 2.214 3
chr1 70805699 70805699 1.913 2
chr1 64859630 64859630 1.714 3
chr3 37050398 37050398 1.650 2
chr3 38053456 38053456 1.1 1
chr3 31663820 31663820 0.713 3
chr1 89760653 89760653 -0.1 0
chr3 15540407 15540407 -1.391 -1
chr1 95630169 95630169 -1.651 -1
chromosoom start end phylop GPS
But if the sort column itself has blanks, as in the following, then that row will end up as the last row:
~$>cat dups_data_with_gaps_2.txt
chromosoom start end phylop GPS
chr3 15540407 15540407 -1.391 -1
chr3 30648039 30648039 3
chr3 31663820 31663820 0.713 3
chr3 33093371 33093371 3.753 4
chr3 37050398 37050398 1.650 2
chr3 38053456 38053456 1.1 1
chr3 39597927 39597927 8.721 5
~$>./own_sort.py unique_data.txt dups_data_with_gaps_2.txt
chromosoom start end phylop GPS
chr1 31227215 31227215 10.263 5
chr3 39597927 39597927 8.721 5
chr1 28745756 28745756 7.905 5
chr3 33093371 33093371 3.753 4
chr1 47562402 47562402 2.322 4
chr1 70805699 70805699 1.913 2
chr1 64859630 64859630 1.714 3
chr3 37050398 37050398 1.650 2
chr3 38053456 38053456 1.1 1
chr3 31663820 31663820 0.713 3
chr1 89760653 89760653 -0.1 0
chr3 15540407 15540407 -1.391 -1
chr1 95630169 95630169 -1.651 -1
chr3 30648039 30648039 3
chromosoom start end phylop GPS
You can then also run the output of this through a pipeline to list where the lines from the 'unique' file have ended up in the overall listing.
~$>./own_sort.py unique_data.txt dups_data.txt | head -n -1 | tail -n +2 | grep -Fn -f unique_data.txt
1:chr1 31227215 31227215 10.263 5
3:chr1 28745756 28745756 7.905 5
5:chr1 47562402 47562402 2.322 4
7:chr1 70805699 70805699 1.913 2
8:chr1 64859630 64859630 1.714 3
12:chr1 89760653 89760653 -0.1 0
14:chr1 95630169 95630169 -1.651 -1
Here grep searches for fixed strings (-F), prints the line number of each match (-n), and reads the strings to search for from a file (-f unique_data.txt).
Sorry, there's a lot there with the examples. The awkward thing you need to do, if you have a lot of fields, is to make sure you have a reliable way of identifying the start and end of each field, and to work that out for your larger files.
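If you do have reliable character offsets, the same decorate-sort-undecorate idea can also be sketched in plain shell rather than Python. This is only a sketch: data.txt stands for the real input, and the offsets 43 and 7 are placeholders for the start position and width of whatever column you actually want.
# prepend the fixed character range as a tab-separated key, sort on it
# numerically in reverse, then strip the key off again (the header is dropped
# here, but it could be re-attached with head -n 1 as in the other examples)
tail -n +2 data.txt |
  awk '{ printf "%s\t%s\n", substr($0, 43, 7), $0 }' |
  LC_ALL=C sort -t "$(printf '\t')" -k1,1gr |
  cut -f2-
The same pattern extends to several keys: decorate with each character range and give sort a -k option per key, which is the shell counterpart of having the Python key function return a tuple.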
answered Jan 16 at 16:34 – Guy