Sort function doesn't work
A week ago I asked a question here about sorting. If I use this code for sorting and creating files:
tail -n +2 File1.txt |
split -l1 --filter='
head -n 1 File2.txt &&
cat <(tail -n +2 File2.txt) - > "$FILE"'
it works on the files I used in that example, but when I use it on my real files, which are bigger, the sorting doesn't seem to work.
I fixed this problem before by using LC_ALL=C, but it seems that only worked once, so I don't know what the real problem is. If I specifically print and sort the column, it works, but not inside this code.
Maybe it is too much to do in one pass? I have 151 columns of annotated data, and I only want to sort on columns 43 and 151, but I still need the new sorted files. Please help me out.
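For reference, the kind of header-preserving sort on columns 43 and 151 being asked about would look roughly like the following. This is only a sketch: it assumes whitespace-separated fields with none of them blank, and File1_sorted.txt is just a placeholder output name.
# sketch: sort the body of File1.txt numerically on column 43, then column 151,
# keeping the header line first; LC_ALL=C forces plain byte-wise comparisons
{
  head -n 1 File1.txt
  tail -n +2 File1.txt | LC_ALL=C sort -k43,43g -k151,151g
} > File1_sorted.txt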
text-processing files sort locale
asked Jan 4 at 9:13, edited Jan 4 at 10:08 – Osman Altun
The files can't be found if I do that
– Osman Altun
Jan 4 at 10:47
Is that the actual command you've tried using? Have you tried changing the sort command, replacing -k4 (which sorts on the contents of the 4th field through to the end of the line) with '-k43,43 -k151,151' (which sorts on just column 43, then just column 151), I think.
– Guy
Jan 16 at 1:49
@Guy I also tried -k43,43; the problem was that I had columns with empty fields, and that caused a problem for the sort function.
– Osman Altun
Jan 16 at 6:24
I presume the problem is that, because sort counts fields as starting at the transition between a word and whitespace, a blank field isn't seen; the row is simply treated as having fewer fields overall (see the sketch after these comments).
– Guy
Jan 16 at 12:00
Are all the columns laid out like the previous example showed, with each starting at a particular character in the line?
– Guy
Jan 16 at 12:11
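To make the field-counting behaviour from the comments concrete, here is a rough sketch; the two sample rows are modelled on the example data in the answer below, with the second row's fourth column left blank.
# with default (whitespace) field splitting, the blank 4th column in the second
# line is simply not counted, so its last value is what sort sees as field 4
printf 'chr3 31663820 31663820 0.713 3\nchr3 33093371 33093371         5\n' |
  sort -k4,4g
# the -k4,4 key is therefore 0.713 (phylop) for the first row but 5 (GPS) for
# the second, i.e. two different columns end up being compared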
1 Answer
Right, for this I'm going from the format of the previous examples of data, where a column is defined by position, i.e. how many characters from the start of the line it begins. Unfortunately for you, as you have found, if any of these columns are blank, the tools you have been trying to use don't count them as a column at all:
col 1 col 2 col 3 col 4 col 5
chr3 31663820 31663820 0.713 3
col 1 col 2 col 3 col 4
chr3 33093371 3.753 4
I've written up a quick script in Python, as it felt a bit easier to work out. When given your two files on the command line, it sorts them according to a hard-coded part of the line, but this can obviously be changed.
At the moment you would also have to sort the list once for each field you wanted. But again, it would be possible to update the key function to return a tuple of floats in the desired order, rather than a single float, for the comparisons.
#! /usr/bin/python
# own_sort.py
# usage: ./own_sort.py 'unique values file' 'duplicate values file'

# allows access to command line arguments
import sys

# this is just to get some example input: locate the phylop column's start and
# end character positions from a sample line
test_line = 'chr3 39597927 39597927 8.721 5'
phylop_col = (test_line.find('8.721'), test_line.find('8.721') + 7)


# this returns a sorting function for the particular column start and end
# positions desired, so it's easy to change
def return_sorting_func(col_start, col_end):
    # a sorting key for python's built-in sort; the key must take a single
    # element and return something for the sort function to compare
    def sorting_func(line):
        # use the exact location, i.e. how many characters from the start of the line
        field = line[col_start:col_end]
        try:
            # if this field holds a float, return it
            return float(field)
        except ValueError:
            # else return a default
            return float('-inf')  # gives a default of lowest rank
            # return 0.0          # or a default value of 0
    return sorting_func


if __name__ == '__main__':
    # read both files into their own lists
    with open(sys.argv[1]) as uniqs, open(sys.argv[2]) as dups:
        uniq_list = list(uniqs.readlines())
        dups_list = list(dups.readlines())

    # sort everything after the header lines, using the key function from above
    # with the relevant start and end positions, and reverse the resulting list
    combined_list = sorted(uniq_list[1:] + dups_list[1:],
                           key=return_sorting_func(phylop_col[0], phylop_col[1]),
                           reverse=True)

    # to print out, cut off the trailing newline and print the header and footer
    # around the other results, which can then be piped from stdout
    print(dups_list[0][:-1])
    for line in combined_list:
        print(line[:-1])
    print(dups_list[0][:-1])
So, using the given files from the other question, I've ended up with:
~$>cat unique_data.txt
chromosoom start end phylop GPS
chr1 28745756 28745756 7.905 5
chr1 31227215 31227215 10.263 5
chr1 47562402 47562402 2.322 4
chr1 64859630 64859630 1.714 3
chr1 70805699 70805699 1.913 2
chr1 89760653 89760653 -0.1 0
chr1 95630169 95630169 -1.651 -1
~$>cat dups_data.txt
chromosoom start end phylop GPS
chr3 15540407 15540407 -1.391 -1
chr3 30648039 30648039 2.214 3
chr3 31663820 31663820 0.713 3
chr3 33093371 33093371 3.753 4
chr3 37050398 37050398 1.650 2
chr3 38053456 38053456 1.1 1
chr3 39597927 39597927 8.721 5
~$>cat dups_data_with_gaps_1.txt
chromosoom start end phylop GPS
chr3 15540407 15540407 -1.391 -1
chr3 30648039 30648039 2.214 3
chr3 31663820 31663820 0.713 3
chr3 33093371 3.753 4
chr3 37050398 37050398 1.650 2
chr3 38053456 38053456 1.1 1
chr3 39597927 8.721 5
which both give the same output:
~$>./own_sort.py unique_data.txt dups_data_with_gaps_1.txt
chromosoom start end phylop GPS
chr1 31227215 31227215 10.263 5
chr3 39597927 39597927 8.721 5
chr1 28745756 28745756 7.905 5
chr3 33093371 33093371 3.753 4
chr1 47562402 47562402 2.322 4
chr3 30648039 30648039 2.214 3
chr1 70805699 70805699 1.913 2
chr1 64859630 64859630 1.714 3
chr3 37050398 37050398 1.650 2
chr3 38053456 38053456 1.1 1
chr3 31663820 31663820 0.713 3
chr1 89760653 89760653 -0.1 0
chr3 15540407 15540407 -1.391 -1
chr1 95630169 95630169 -1.651 -1
chromosoom start end phylop GPS
But if the sort column itself has blanks, as in the following, then that row will end up as the last row:
~$>cat dups_data_with_gaps_2.txt
chromosoom start end phylop GPS
chr3 15540407 15540407 -1.391 -1
chr3 30648039 30648039 3
chr3 31663820 31663820 0.713 3
chr3 33093371 33093371 3.753 4
chr3 37050398 37050398 1.650 2
chr3 38053456 38053456 1.1 1
chr3 39597927 39597927 8.721 5
~$>./own_sort.py unique_data.txt dups_data_with_gaps_2.txt
chromosoom start end phylop GPS
chr1 31227215 31227215 10.263 5
chr3 39597927 39597927 8.721 5
chr1 28745756 28745756 7.905 5
chr3 33093371 33093371 3.753 4
chr1 47562402 47562402 2.322 4
chr1 70805699 70805699 1.913 2
chr1 64859630 64859630 1.714 3
chr3 37050398 37050398 1.650 2
chr3 38053456 38053456 1.1 1
chr3 31663820 31663820 0.713 3
chr1 89760653 89760653 -0.1 0
chr3 15540407 15540407 -1.391 -1
chr1 95630169 95630169 -1.651 -1
chr3 30648039 30648039 3
chromosoom start end phylop GPS
You can then also run the output of this through a pipeline to list where the lines from the 'unique' file have ended up in the overall listing.
~$>./own_sort.py unique_data.txt dups_data.txt | head -n -1 | tail -n +2 | grep -Fn -f unique_data.txt
1:chr1 31227215 31227215 10.263 5
3:chr1 28745756 28745756 7.905 5
5:chr1 47562402 47562402 2.322 4
7:chr1 70805699 70805699 1.913 2
8:chr1 64859630 64859630 1.714 3
12:chr1 89760653 89760653 -0.1 0
14:chr1 95630169 95630169 -1.651 -1
Here grep searches for fixed strings (-F), prints the line number of each match (-n), and reads the strings to search for from a file (-f unique_data.txt).
Sorry, there's a lot there with the examples. The awkward thing you need to do, if you have a lot of fields, is to make sure you have a reliable way of identifying the start and end of each field, and to work that out for your larger files.
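If you do have reliable character offsets, the same decorate-sort-undecorate idea can also be sketched in plain shell rather than Python. This is only a sketch: data.txt stands for the real input, and the offsets 43 and 7 are placeholders for the start position and width of whatever column you actually want.
# prepend the fixed character range as a tab-separated key, sort on it
# numerically in reverse, then strip the key off again (the header is dropped
# here, but it could be re-attached with head -n 1 as in the other examples)
tail -n +2 data.txt |
  awk '{ printf "%s\t%s\n", substr($0, 43, 7), $0 }' |
  LC_ALL=C sort -t "$(printf '\t')" -k1,1gr |
  cut -f2-
The same pattern extends to several keys: decorate with each character range and give sort a -k option per key, which is the shell counterpart of having the Python key function return a tuple.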
answered Jan 16 at 16:34 – Guy