
Output is written repeatedly when using a list of characters - Python

  •  0
  • F.Lira  · Tech Community  · 6 years ago

    In a file, I have some characters that need to be replaced.

    letters = ["B", "Z", "J", "U", "O"]

    for record in SeqIO.parse(inFile, "fasta"):
        for letter in letters:
            if letters in str(record.seq):
                print record.id 
                record.seq = str(record.seq).replace(letter, "X")
                outFile.write(">%s\n%s\n" % (record.description, record.seq))
            else:
                outFile.write(">%s\n%s\n" % (record.description, record.seq))
                #pass
    

    > >ID:WP_004160595.1|Erwinia_amylovora_01SFR-BO|01SFR-BO|50S_ribosomal_protei..|630|NZ_CAPA01000010(58437):26053-26682:-1
    > MIGLVGKKVGMTRIFTEDGVSIPVTVIEIEANRVTQVKGLENDGYTAIQVTTGAKKANRVTKPAAGHFAKAGVEAGRGLWEFRTAEGAEFTVGQSINVDIFADVKKVDVTGTSKGKGFAGTVKRWNFRTQDATHGNSLSHRVPGSIGQNQTPGKVFKGKKMAGQLGNERVTVQSLDVVRVDAERNLLLVKGAVPGATGSDLIVKPAVKA
    > >ID:WP_004160595.1|Erwinia_amylovora_01SFR-BO|01SFR-BO|50S_ribosomal_protei..|630|NZ_CAPA01000010(58437):26053-26682:-1
    > MIGLVGKKVGMTRIFTEDGVSIPVTVIEIEANRVTQVKGLENDGYTAIQVTTGAKKANRVTKPAAGHFAKAGVEAGRGLWEFRTAEGAEFTVGQSINVDIFADVKKVDVTGTSKGKGFAGTVKRWNFRTQDATHGNSLSHRVPGSIGQNQTPGKVFKGKKMAGQLGNERVTVQSLDVVRVDAERNLLLVKGAVPGATGSDLIVKPAVKA
    > >ID:WP_004160595.1|Erwinia_amylovora_01SFR-BO|01SFR-BO|50S_ribosomal_protei..|630|NZ_CAPA01000010(58437):26053-26682:-1
    > MIGLVGKKVGMTRIFTEDGVSIPVTVIEIEANRVTQVKGLENDGYTAIQVTTGAKKANRVTKPAAGHFAKAGVEAGRGLWEFRTAEGAEFTVGQSINVDIFADVKKVDVTGTSKGKGFAGTVKRWNFRTQDATHGNSLSHRVPGSIGQNQTPGKVFKGKKMAGQLGNERVTVQSLDVVRVDAERNLLLVKGAVPGATGSDLIVKPAVKA
    > >ID:WP_004160595.1|Erwinia_amylovora_01SFR-BO|01SFR-BO|50S_ribosomal_protei..|630|NZ_CAPA01000010(58437):26053-26682:-1
    > MIGLVGKKVGMTRIFTEDGVSIPVTVIEIEANRVTQVKGLENDGYTAIQVTTGAKKANRVTKPAAGHFAKAGVEAGRGLWEFRTAEGAEFTVGQSINVDIFADVKKVDVTGTSKGKGFAGTVKRWNFRTQDATHGNSLSHRVPGSIGQNQTPGKVFKGKKMAGQLGNERVTVQSLDVVRVDAERNLLLVKGAVPGATGSDLIVKPAVKA
    > >ID:WP_004160595.1|Erwinia_amylovora_01SFR-BO|01SFR-BO|50S_ribosomal_protei..|630|NZ_CAPA01000010(58437):26053-26682:-1
    > MIGLVGKKVGMTRIFTEDGVSIPVTVIEIEANRVTQVKGLENDGYTAIQVTTGAKKANRVTKPAAGHFAKAGVEAGRGLWEFRTAEGAEFTVGQSINVDIFADVKKVDVTGTSKGKGFAGTVKRWNFRTQDATHGNSLSHRVPGSIGQNQTPGKVFKGKKMAGQLGNERVTVQSLDVVRVDAERNLLLVKGAVPGATGSDLIVKPAVKA
    
    2 replies  |  as of 6 years ago
        1
  •  4
  •   Chris_Rands    6 years ago

    I think what you want to do is replace the ambiguous IUPAC amino acid codes (plus some extra letters you have somehow acquired?) with 'X'.

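    It is better to use str.translate() (in Python 3) to do all the replacements in one pass. Also, since you are reading the file with Biopython, you can just as easily write the output file with Biopython.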

    from Bio import SeqIO
    from Bio.Seq import Seq
    
    letters = ["B", "Z", "J", "U", "O"]
    trans_tab = str.maketrans(''.join(letters), 'X'*len(letters))
    
    def yield_seqs(in_file):
        for record in SeqIO.parse(in_file, 'fasta'):
            record.seq = Seq(str(record.seq).translate(trans_tab))
            yield record
    
    SeqIO.write(yield_seqs('input.fasta'), 'output.fasta', 'fasta')
    

    $ cat input.fasta 
    >1
    MBZJ
    $ python3 myscript.py
    $ cat output.fasta 
    >1
    MXXX
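
To see the translation table on its own, here is a minimal standalone sketch that needs no Biopython. It reuses the `letters` list and the `str.maketrans` call from the answer; the helper name `mask_ambiguous` and the sample strings are assumptions made for illustration only.

```python
# Standalone illustration of the translation table (plain strings, no Biopython).
letters = ["B", "Z", "J", "U", "O"]

# Map each listed letter to "X"; both arguments must have the same length.
trans_tab = str.maketrans("".join(letters), "X" * len(letters))


def mask_ambiguous(sequence):
    """Hypothetical helper: return `sequence` with every listed letter replaced by 'X'."""
    return sequence.translate(trans_tab)


print(mask_ambiguous("MBZJ"))
print(mask_ambiguous("MIGLV"))
```

Because `translate` handles every listed letter in a single pass, no loop over `letters` and no membership check is needed.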
    
        2
  •  1
  •   blue_note    6 years ago

    You have a typo.

    if letters in str(record.seq):

    should be

    if letter in str(record.seq):
    

    So your check never passes, and the else part then writes the record once for every letter.
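
Apart from the typo, note that the code in the question calls `outFile.write` inside the `for letter in letters` loop in both branches, so each record is written once per letter, which is consistent with the five copies shown in the question. The following is a rough sketch of one way to restructure the loop so that each record is written exactly once. To keep it self-contained, it assumes records are plain `(description, sequence)` tuples rather than Biopython records; `write_masked` and the sample data are hypothetical.

```python
import io

letters = ["B", "Z", "J", "U", "O"]


def write_masked(records, out_file):
    """Hypothetical helper: write each (description, sequence) pair once."""
    for description, sequence in records:
        for letter in letters:
            # str.replace leaves the string unchanged when the letter is absent,
            # so no membership check is needed here.
            sequence = sequence.replace(letter, "X")
        # The write sits outside the inner loop, so it runs once per record.
        out_file.write(">%s\n%s\n" % (description, sequence))


# Usage with an in-memory file standing in for the real output file.
out = io.StringIO()
write_masked([("1", "MBZJ"), ("2", "MIGLV")], out)
print(out.getvalue(), end="")
```

With real FASTA input, the same shape applies: do all replacements for a record first, then write that record a single time.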