代码之家 › 专栏 › 技术社区 › Gloom

使用python在单个BLAST文件中查找最佳交互命中

blast bioinformatics python

Gloom · 技术社区 · 7 年前

我有一个标准格式的BLAST OUTPMT 6输出文件,我想找到一种方法,在文件中循环,选择每个命中,找到它的相互命中,并解密哪个是最好的命中存储。

例如:

d = {}
for line in input_file:
    term = line.split('\t')
    qseqid = term[0]
    sseqid = term[1]
    hit = qseqid, sseqid
    recip_hit = sseqid, qseqid
    for line in input_file:
        if recip_hit in line:
            compare both lines
done

输入示例(制表符分隔):

Seq1    Seq2    80    1000   10    3   1    1000    100    1100    0.0    500
Seq2    Seq1    95    1000   10    3   100    1100    1    1000    1e-100    500

有人能提供如何有效解决这个问题的见解吗?

非常感谢

1 回复 | 直到 7 年前

Mr. T Andres Pinzon 7 年前

您可以解决问题,找到这些对,然后像这样比较这些线:

#create a dictionary to store pairs
line_dict = {}
#iterate over your file
for line in open("test.txt", "r"):
    line = line[:-1].split("\t")
    #ignore line, if not at least one value apart from the two sequence IDs
    if len(line) < 3:
        continue
    #identify the two sequences
    seq = tuple(line[0:2])
    #is reverse sequence already in dictionary?
    if seq[::-1] in line_dict:
        #append new line
        line_dict[seq[::-1]].append(line)
    else:
        #create new entry
        line_dict[seq] = [line]

#remove entries, for which no counterpart exists
pairs = {k: v for k, v in line_dict.items() if len(v) > 1}

#and do things with these pairs
for pair, seq in pairs.items():
    print(pair, "found in:")
    for item in seq:
        print(item)

优点是,您只需在文件上迭代一次,因为您只需存储所有数据,并在未找到匹配的反向对时丢弃它们。缺点是这会占用空间,因此对于非常大的文件,这种方法可能不可行。

类似的方法——将所有数据存储在工作记忆中——使用熊猫。这应该更快,因为排序算法是为熊猫优化的。pandas的另一个优点是,所有其他值都已包含在pandas列中,因此更容易进行进一步的分析。我当然更喜欢熊猫版,但我不知道你的系统上是否安装了熊猫版。为了便于沟通,我分配了 a 和 b 到包含序列的列 Seq1 和 Seq2 .

import pandas as pd
#read data into a dataframe
#not necessary: drop the header of the file, use custom columns names
df = pd.read_csv("test.txt", sep='\t', names=list("abcde"), header = 0)

#create a column that joins Seq1 - Seq2 or Seq2 - Seq1 to Seq1Seq2
df["pairs"] = df.apply(lambda row: ''.join(sorted([row["a"], row["b"]])), axis = 1)
#remove rows with no matching pair and sort the database
only_pairs = df[df["pairs"].duplicated(keep = False)].sort_values(by = "pairs")

print(only_pairs)

推荐文章

Google User · Django管理员在`list_display中未显示`creation_date`字段`

4 月前

user29747013 · 如何创建一个新的数据框架,其中包含原始数据框架中列的聚合列?

4 月前

ÎÎÎ½Î· ÎÎ®Î¹Î½Î¿Ï · Python lxml.html语法错误:使用lxml find时XPATH的谓词无效

4 月前

user29715306 · from_users=和chats=电视节目中的差异

4 月前

Redshoe · 当执行numpy.genfromtxt()时,python是否会读取文件的所有行?

4 月前

RASEL MAHMUD · 为什么以及如何在is_even()函数内的IF条件中递归X变量在满足0后递增?[副本]

4 月前

prayner · 更新嵌套字典包含列表中的项

4 月前

Bringo Jr · 我可以在O(n)中解决这个问题吗?

5 月前

Dave · 如何在for循环中修改列表值

5 月前

Shukurullox Komiljonov · 从记录中获得相互和解。使用SQL

5 月前