Sort function doesn't work

So a week ago I asked a question here; it was about sorting. If I use this code for sorting and creating files:



tail -n +2 File1.txt |
split -l1 --filter='
    head -n 1 File2.txt &&
    cat <(tail -n +2 File2.txt) - > "$FILE"'


It does work on the files I used in that example, but when I use it on my real files, which are much bigger, the sorting does not seem to work.



I fixed this problem before by using LC_ALL=C, but it seems that only worked once, so I don't know what the real problem is. If I specifically print and sort the column on its own it works, but not inside this code.



Maybe it is too much to do in one pass? I have 151 columns with different annotated data, and I only want to sort on columns 43 and 151, but I still need the new sorted files. Please help me out.







asked Jan 4 at 9:13 (edited Jan 4 at 10:08)
Osman Altun

  • The files can't be found if I do that.
    – Osman Altun
    Jan 4 at 10:47

  • Is that the actual command you've tried using? Have you tried changing the sort command, replacing -k4 (which sorts on the contents of the 4th field through to the end of the line) with -k43,43 -k151,151, which I think will sort on just column 43 and then just column 151?
    – Guy
    Jan 16 at 1:49

  • @Guy I also tried -k43,43; the problem was that I had columns with empty rows, and that caused a problem for the sort function.
    – Osman Altun
    Jan 16 at 6:24

  • I presume the problem is that, because sort counts fields as starting at the transition between a word and whitespace, a blank field isn't seen at all; the row is simply treated as having fewer fields overall.
    – Guy
    Jan 16 at 12:00

  • Are all the columns laid out like the previous example showed, with each starting at a particular character position in the line?
    – Guy
    Jan 16 at 12:11

1 Answer

Right, for this I'm going by the format of the previous data examples, where a column is defined by its position, i.e. how many characters from the start of the line it begins at. Unfortunately, as you have found, if any of these columns are blank, the tools you have been trying to use don't count them as a column at all:



col 1 col 2 col 3 col 4 col 5 
chr3 31663820 31663820 0.713 3

col 1 col 2 col 3 col 4
chr3 33093371 3.753 4
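

One thing worth checking first, as an aside: if the real annotation file happens to be tab-delimited rather than space-padded, the empty fields are still delimited by tabs, and telling sort to split on tabs keeps the field numbering stable even when a field is blank. That is an assumption about your file's format, not something the example above shows, but if it holds, something along these lines should key on just columns 43 and 151 (the header line would still need handling separately, as in your split pipeline):



# assumes the file is tab-separated: -t keeps empty fields in place,
# g = general numeric comparison, -k43,43 restricts the key to column 43 only
LC_ALL=C sort -t $'\t' -k43,43g -k151,151g File2.txt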


I've written up a quick script in Python, as that felt easier to work with. Given your two files on the command line, it sorts them according to a hard-coded slice of the line, but that can obviously be changed.
As it stands you would have to sort the list once for each field you were interested in, but it would be possible to update the key function to return a tuple of floats in the desired order, rather than a single float, so that several columns are compared in one pass (a sketch of that follows the script).



#! /usr/bin/python

# own_sort.py
# ./own_sort.py 'unique values file' 'duplicate values file'

# allows access to command line arguments.
import sys


# this is just to get some example inputs
test_line = 'chr3 39597927 39597927 8.721 5'
phylop_col = (test_line.find('8.721'), test_line.find('8.721') + 7)


# this will return a sorting function with the particular column start and end
# positions desired, so it's easy to change
def return_sorting_func(col_start, col_end):
    # a sorting key for python's built-in sort. the key must take a single
    # element, and return something for the sort function to compare.
    def sorting_func(line):
        # use the exact location, i.e. how many characters from the start of the line.
        field = line[col_start:col_end]
        try:
            # if this field has a float, return it
            return float(field)
        except ValueError:
            # else return a default
            return float('-inf')  # will give default of lowest rank
            # return 0.0          # default value of 0
    return sorting_func


if __name__ == '__main__':
    uniq_list = []
    dups_list = []

    # read both files into their own lists
    with open(sys.argv[1]) as uniqs, open(sys.argv[2]) as dups:
        uniq_list = list(uniqs.readlines())
        dups_list = list(dups.readlines())

    # and sort, using our key function from above, with the relevant start and
    # end positions, and reverse the resulting list.
    combined_list = sorted(uniq_list[1:] + dups_list[1:],
                           key=return_sorting_func(phylop_col[0], phylop_col[1]),
                           reverse=True)

    # to print out, cut off the end of line (newline) and print the header and
    # footer around the other results, which can then be piped from stdout.
    print(dups_list[0][:-1])
    for line in combined_list:
        print(line[:-1])
    print(dups_list[0][:-1])
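

For the two columns you actually care about (43 and 151), a minimal sketch of the tuple-of-floats variant mentioned above could look like the following. The (start, end) character offsets are placeholders for illustration only; they would have to be measured from your real file's layout:



# hypothetical character spans for the two columns of interest -- adjust to the real layout
col_43 = (250, 258)
col_151 = (1200, 1208)

def return_multi_sorting_func(*col_spans):
    # like return_sorting_func above, but the key is a tuple of floats,
    # so ties on the first column fall back to the next one
    def sorting_func(line):
        key = []
        for col_start, col_end in col_spans:
            field = line[col_start:col_end]
            try:
                key.append(float(field))
            except ValueError:
                key.append(float('-inf'))  # blank or non-numeric fields rank last
        return tuple(key)
    return sorting_func

# combined_list = sorted(uniq_list[1:] + dups_list[1:],
#                        key=return_multi_sorting_func(col_43, col_151),
#                        reverse=True)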


So, using the given files from the other question, plus a copy with a couple of gaps, I end up with:



~$>cat unique_data.txt 
chromosoom start end phylop GPS
chr1 28745756 28745756 7.905 5
chr1 31227215 31227215 10.263 5
chr1 47562402 47562402 2.322 4
chr1 64859630 64859630 1.714 3
chr1 70805699 70805699 1.913 2
chr1 89760653 89760653 -0.1 0
chr1 95630169 95630169 -1.651 -1
~$>cat dups_data.txt
chromosoom start end phylop GPS
chr3 15540407 15540407 -1.391 -1
chr3 30648039 30648039 2.214 3
chr3 31663820 31663820 0.713 3
chr3 33093371 33093371 3.753 4
chr3 37050398 37050398 1.650 2
chr3 38053456 38053456 1.1 1
chr3 39597927 39597927 8.721 5
~$>cat dups_data_with_gaps_1.txt
chromosoom start end phylop GPS
chr3 15540407 15540407 -1.391 -1
chr3 30648039 30648039 2.214 3
chr3 31663820 31663820 0.713 3
chr3 33093371 3.753 4
chr3 37050398 37050398 1.650 2
chr3 38053456 38053456 1.1 1
chr3 39597927 8.721 5


which both give the same output:



~$>./own_sort.py unique_data.txt dups_data_with_gaps_1.txt
chromosoom start end phylop GPS
chr1 31227215 31227215 10.263 5
chr3 39597927 39597927 8.721 5
chr1 28745756 28745756 7.905 5
chr3 33093371 33093371 3.753 4
chr1 47562402 47562402 2.322 4
chr3 30648039 30648039 2.214 3
chr1 70805699 70805699 1.913 2
chr1 64859630 64859630 1.714 3
chr3 37050398 37050398 1.650 2
chr3 38053456 38053456 1.1 1
chr3 31663820 31663820 0.713 3
chr1 89760653 89760653 -0.1 0
chr3 15540407 15540407 -1.391 -1
chr1 95630169 95630169 -1.651 -1
chromosoom start end phylop GPS


But if the sort column itself has blanks, like the following, then that line will end up as the last row:



~$>cat dups_data_with_gaps_2.txt 
chromosoom start end phylop GPS
chr3 15540407 15540407 -1.391 -1
chr3 30648039 30648039 3
chr3 31663820 31663820 0.713 3
chr3 33093371 33093371 3.753 4
chr3 37050398 37050398 1.650 2
chr3 38053456 38053456 1.1 1
chr3 39597927 39597927 8.721 5
~$>./own_sort.py unique_data.txt dups_data_with_gaps_2.txt
chromosoom start end phylop GPS
chr1 31227215 31227215 10.263 5
chr3 39597927 39597927 8.721 5
chr1 28745756 28745756 7.905 5
chr3 33093371 33093371 3.753 4
chr1 47562402 47562402 2.322 4
chr1 70805699 70805699 1.913 2
chr1 64859630 64859630 1.714 3
chr3 37050398 37050398 1.650 2
chr3 38053456 38053456 1.1 1
chr3 31663820 31663820 0.713 3
chr1 89760653 89760653 -0.1 0
chr3 15540407 15540407 -1.391 -1
chr1 95630169 95630169 -1.651 -1
chr3 30648039 30648039 3
chromosoom start end phylop GPS


On the output of this you can then also run a pipeline to list where the lines from the 'unique' file have ended up in the overall listing.



~$>./own_sort.py unique_data.txt dups_data.txt | head -n -1 | tail -n +2 | grep -Fn -f unique_data.txt 
1:chr1 31227215 31227215 10.263 5
3:chr1 28745756 28745756 7.905 5
5:chr1 47562402 47562402 2.322 4
7:chr1 70805699 70805699 1.913 2
8:chr1 64859630 64859630 1.714 3
12:chr1 89760653 89760653 -0.1 0
14:chr1 95630169 95630169 -1.651 -1


grep will search for fixed strings (-F), output the line number of each match (-n), and read the strings to search for from a file (-f unique_data.txt).



Sorry, there's a lot there with the examples. The awkward part, if you have a lot of fields, is making sure you have a reliable way to identify the start and end of the field you want, and getting that right for your larger files.
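

If hard-coding the slice with test_line.find('8.721') feels too fragile for the big files, one possible approach (an assumption about your layout, since I haven't seen the full files) is to derive a column's start position from its name in the header line, provided the header names are aligned above the data:



# sketch: derive the (start, end) character span of a column from the header line,
# assuming the header names line up with the data columns underneath (not verified here)
def column_span(header_line, name, width):
    start = header_line.find(name)
    if start < 0:
        raise ValueError('column %r not found in header' % name)
    return start, start + width

# e.g. with the example files above:
# with open('dups_data.txt') as f:
#     header = f.readline()
# phylop_col = column_span(header, 'phylop', 7)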






answered Jan 16 at 16:34
Guy