I want to create two lists using data from a '.fasta' file containing around 14,000 proteins and their sequences in FASTA format. This is the proteome of an organism.
One list will contain the identifiers, i.e. every line in the file that starts with '>'. The other list needs to contain the full sequence of each protein as one element. As a beginner I find this fairly difficult, because the sequence of one protein spans more than one line in the file and every protein varies in length.
Here's a toy example (these are the first three proteins from the proteome):
>tr|A0A087QGI9|A0A087QGI9_APTFO Neuroblast differentiation-associated protein AHNAK (Fragment) OS=Aptenodytes forsteri
>tr|A0A087QH03|A0A087QH03_APTFO Bromodomain-containing protein 9 (Fragment) OS=Aptenodytes forsteri OX=9233 GN=AS27_06589 PE=4 SV=1
DYVDKPLEKPLKLVLKVGGSEVTELSGSGHDSSYYDDRSDHERERHKEKKKKKKKKSEKE
KEKHLDDEERRKRKEEKKRKREKEQCDTEGETDDFDPGKKVEVEPPPDRPVRACRTQPAE
NESTPIQQLLEHFLRQLQRKDPHGFFAFPVTDAIAPGYSMIIKHPMDFGTMKEKIAANEY
KSVTEFKADFKLMCDNAMTYNRPDTVYYKLAKKILHTGFKMMSKAALLGDEDTAVEEPVP
EVVPVQAETTKKSKKQNKEVISCIFEPEGNACSLTDSTAEEHVLALVEHAADEARDRINR
YLPNSKIGYLKKNGDGTLLFSVVNSSDPEAEEEETHPVDLSSLSSKLLPGFTTLGFKDER
RNKVTFLTSASTAPSMQNNSIFHDLKSDEMELLYSAYGDETGIQCALSLQEFVKDAGNYS
KKIVDDLLDQITSGDHSKTIYQLKQRRNIPVKPLDEVKVGESAGDSNTSDLDFLSMKPYS
DVSLDISMLSSLGKVKKELDHDDNHLHLDETTKLLQDLHEAQADRVGSRPSSNLSSLSNT
SERDQHHLGSPSHLSVGEQQDMVHDPYEFLQSPETSNTTTN
>tr|A0A087QHB6|A0A087QHB6_APTFO Leucine-rich repeat-containing protein 2 OS=Aptenodytes forsteri OX=9233 GN=AS27_03020 PE=4 SV=1
MGHQVIIFDFSLIRGLWETRVKKNKERQRKEKERLEKSSLEKIKQEWNFILECRKKGIPQ
SKYLKNGFVDTDEKILDTYGKTQLQKRHALSHETNKKRNKFIFQLSGEQWTEFPDSLKEQ
IYLKEWHVYNTLIQTIPAYIALFQDLRVLELSKNQINHLPLEIGCLKNLKVLNVSFNNLK
SVPPELGDCESLEKLDLSGNMEITELPFELSNLKQVTVVDVSANKFHSIPICVLRMSNLQ
WLDISSNNLKDLPEDIDRLDQLQTLLLQKNKLTYLPRALVNMPKLSLLVVSGDDLVEIPT
AVCESTTGLKFISLKDSPVETIVCEDTEEIIESEREREEFEKEFMKAYIEDLKERDSTPS
YTTKVLLSLQL
I want the output of the lists to be this:
ID = ['>tr|A0A087QGI9|A0A087QGI9_APTFO Neuroblast differentiation-associated protein AHNAK (Fragment) OS=Aptenodytes forsteri OX=9233 GN=AS27_15363 PE=4 SV=1', '>tr|A0A087QH03|A0A087QH03_APTFO Bromodomain-containing protein 9 (Fragment) OS=Aptenodytes forsteri OX=9233 GN=AS27_06589 PE=4 SV=1', '>tr|A0A087QHB6|A0A087QHB6_APTFO Leucine-rich repeat-containing protein 2 OS=Aptenodytes forsteri OX=9233 GN=AS27_03020 PE=4 SV=1']
sequences = ['GDVDVSVPKLEGDLKGPEVDIKGPKVDIEAPDVDIHGPEGKIKIPKFKMPKFGLSGLKGEGPEVDVNLPEADVALSGPKVDVTLPDLDVEGPEGKLKGPKFKKPDVQFNVPKISMPEIDLNLKGPKLKADLDPSLPKIEGELKGPEVDIKGPKVDIEAPDVDFHGPEGKLKMPKFKMPKFGASGFKAEGPEVDVSVPKGELDVSGPKLDSEGAGFQIEGPEGKFKGPQFKMPEMNVKAPKISVPDVDLNLKGPKAKADMDMSLPQVDLKGSDLHVKGPKVEVEVPDLDLEGPKGKLKGPKFKMPEMNIKAPNISMPDFDLNLKGPKSKGGADVDLSVPKLEGDLSGPDVSIKGPKVDMDGPELDVEGREGKLKGPKFKMPEMHIKAPKISMPDASLNLKGPKLKGDVEGDVEVSVPKLEGDLKGPELDIKGPKVDIDVPDVEIEGPEGKLKGPKLKMPEMHFKAPKISMPDFDLNLKGPKVKGDVDVSVPCVEGDLKGPKIDVKGPELDVSAPDVQIEGPEGKVKGPKFKMPDMHFKAPKISMPDFDLNLKGPKVDVNLPTGDLDVSVPKVDIEGPELDIEVPEGKLKGPKFKMPEMHIKAPKISMPDIDLNLKGPKVKGDVEGDVDVSVPKLEGDLKGPEVDIKGPKVDIEAPDMDIHGPEGKFKMPKFKMPEMHIKAPKVSVPDVDFNIKGPQLKGGVDVSGPRLEGDLKAPQMDVQAPKIDVEAPDVTIEGPEGKLKGPKFKMPEMHVKAPKFSMPDVDFNLKGPRLKGDVDVSAPKLEGDLQRPGVDIEGPKLDLRAPDVNIEGPDGKLKGPGIKMPEMHVKTPQISMPDIDLNLKGLRLKGDADLQGPKLEGGLKGPEIDIKGPKVDIEGPEVDVDVDVECPDLNVEGPEANVKLPKFKKSKFGFGMKSPKAEIKVPGAEVDLPEAELNAESPDVNVAGKGKKSKFKMPKLHMSGPKVKGKKGGFDVNVAGGELDANLKSPDLDVNVAGPDVTIKGDATVKSPKGKKPMFGKISFPDVEFDLKSHRFRGDASLAGPKIEGELKAPDLEVSVPTAKGDVKGPSLNVDLDASDVSLKKPKFKLPGGQVGVGDLKMEGDLKGPAL', 'DYVDKPLEKPLKLVLKVGGSEVTELSGSGHDSSYYDDRSDHERERHKEKKKKKKKKSEKEKEKHLDDEERRKRKEEKKRKREKEQCDTEGETDDFDPGKKVEVEPPPDRPVRACRTQPAENESTPIQQLLEHFLRQLQRKDPHGFFAFPVTDAIAPGYSMIIKHPMDFGTMKEKIAANEY KSVTEFKADFKLMCDNAMTYNRPDTVYYKLAKKILHTGFKMMSKAALLGDEDTAVEEPVPEVVPVQAETTKKSKKQNKEVISCIFEPEGNACSLTDSTAEEHVLALVEHAADEARDRINRYLPNSKIGYLKKNGDGTLLFSVVNSSDPEAEEEETHPVDLSSLSSKLLPGFTTLGFKDERRNKVTFLTSASTAPSMQNNSIFHDLKSDEMELLYSAYGDETGIQCALSLQEFVKDAGNYSKKIVDDLLDQITSGDHSKTIYQLKQRRNIPVKPLDEVKVGESAGDSNTSDLDFLSMKPYSDVSLDISMLSSLGKVKKELDHDDNHLHLDETTKLLQDLHEAQADRVGSRPSSNLSSLSNTSERDQHHLGSPSHLSVGEQQDMVHDPYEFLQSPETSNTTTN', 
'MGHQVIIFDFSLIRGLWETRVKKNKERQRKEKERLEKSSLEKIKQEWNFILECRKKGIPQSKYLKNGFVDTDEKILDTYGKTQLQKRHALSHETNKKRNKFIFQLSGEQWTEFPDSLKEQIYLKEWHVYNTLIQTIPAYIALFQDLRVLELSKNQINHLPLEIGCLKNLKVLNVSFNNLKSVPPELGDCESLEKLDLSGNMEITELPFELSNLKQVTVVDVSANKFHSIPICVLRMSNLQWLDISSNNLKDLPEDIDRLDQLQTLLLQKNKLTYLPRALVNMPKLSLLVVSGDDLVEIPTAVCESTTGLKFISLKDSPVETIVCEDTEEIIESEREREEFEKEFMKAYIEDLKERDSTPSYTTKVLLSLQL']
So this is what I tried, but I ran into some problems:
#Empty lists:
ID = []
sequences = []
#Empty string:
sequence = ""
#Open fasta-file:
with open("Proteome.fasta", "r") as proteome: #Opens .fasta file as proteome
    for line in proteome: #for each line in proteome
        if line.startswith(">"): #If the line starts with '>'
            ID.append(line) #It will append the line to the list 'ID'
            sequences.append(sequence) #This will append the sequence made to the list 'sequences'
            sequence = "" #This should reset the sequence every time it encounters '>' (the sequences will not be one giant sequence)
        else:
            sequence += str(line[:-1]) #makes one string from the different lines that make up one protein sequence
The list 'sequences' starts with an empty element, which I think I can remove with the .remove() method. The list is also missing the last sequence, probably because no '>' identifier follows the last protein, so its sequence is never appended.
Does anyone know what went wrong here, and how I can fix it? I would also rather not use any imports, just "vanilla" Python in the IDLE Shell. Thanks in advance!
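One possible way to address both symptoms, as a minimal sketch without imports: store the previous sequence only when a header has already been seen (this avoids the empty first element), and store the final sequence once more after the loop ends (this keeps the last protein). The function name parse_fasta and the toy records below are made up for illustration; they are not from the real file. The sketch also uses strip() rather than line[:-1], on the assumption that the last line of the file may lack a trailing newline, in which case [:-1] would cut off a real residue.

```python
def parse_fasta(lines):
    """Split FASTA lines into parallel lists of headers and sequences."""
    ids = []
    sequences = []
    chunks = []  # pieces of the sequence currently being read
    for line in lines:
        line = line.strip()  # drop the newline and surrounding whitespace
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if ids:  # a previous record exists, so store its sequence
                sequences.append("".join(chunks))
            ids.append(line)
            chunks = []  # start collecting the next sequence
        else:
            chunks.append(line)
    if ids:  # store the last record, which no further '>' line follows
        sequences.append("".join(chunks))
    return ids, sequences


# Hypothetical toy input standing in for the lines of the real file.
toy_lines = [
    ">protein_1 example header\n",
    "MGHQVI\n",
    "IFDFSL\n",
    ">protein_2 example header\n",
    "DYVDKP",  # last line without a trailing newline
]
ID, sequences = parse_fasta(toy_lines)
print(ID)
print(sequences)

# With the real file, the open file object can be passed in directly:
# with open("Proteome.fasta", "r") as proteome:
#     ID, sequences = parse_fasta(proteome)
```

Collecting the pieces in a list and joining them once per protein is a design choice rather than a requirement; repeated += on a string, as in the original attempt, would also work for a file of this size.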
source https://stackoverflow.com/questions/74843751/python-creating-lists-from-a-fasta-file-containing-the-proteome-of-an-organ