
Python - creating lists from a '.fasta' file containing the proteome of an organism

I want to create two lists using data from a '.fasta' file that contains around 14,000 proteins and their sequences in FASTA format. This is the proteome of an organism.

One list will contain the identifiers, i.e. every line in the file that starts with '>'. The other list needs to contain the full sequence of each protein as a single element. As a beginner I find this fairly difficult, because the sequence of one protein spans more than one line in the file and every protein has a different length.

Here's a toy example (these are the first three proteins from the proteome):

>tr|A0A087QGI9|A0A087QGI9_APTFO Neuroblast differentiation-associated protein AHNAK (Fragment) OS=Aptenodytes forsteri OX=9233 GN=AS27_15363 PE=4 SV=1
GDVDVSVPKLEGDLKGPEVDIKGPKVDIEAPDVDIHGPEGKIKIPKFKMPKFGLSGLKGE
GPEVDVNLPEADVALSGPKVDVTLPDLDVEGPEGKLKGPKFKKPDVQFNVPKISMPEIDL
NLKGPKLKADLDPSLPKIEGELKGPEVDIKGPKVDIEAPDVDFHGPEGKLKMPKFKMPKF
GASGFKAEGPEVDVSVPKGELDVSGPKLDSEGAGFQIEGPEGKFKGPQFKMPEMNVKAPK
ISVPDVDLNLKGPKAKADMDMSLPQVDLKGSDLHVKGPKVEVEVPDLDLEGPKGKLKGPK
FKMPEMNIKAPNISMPDFDLNLKGPKSKGGADVDLSVPKLEGDLSGPDVSIKGPKVDMDG
PELDVEGREGKLKGPKFKMPEMHIKAPKISMPDASLNLKGPKLKGDVEGDVEVSVPKLEG
DLKGPELDIKGPKVDIDVPDVEIEGPEGKLKGPKLKMPEMHFKAPKISMPDFDLNLKGPK
VKGDVDVSVPCVEGDLKGPKIDVKGPELDVSAPDVQIEGPEGKVKGPKFKMPDMHFKAPK
ISMPDFDLNLKGPKVDVNLPTGDLDVSVPKVDIEGPELDIEVPEGKLKGPKFKMPEMHIK
APKISMPDIDLNLKGPKVKGDVEGDVDVSVPKLEGDLKGPEVDIKGPKVDIEAPDMDIHG
PEGKFKMPKFKMPEMHIKAPKVSVPDVDFNIKGPQLKGGVDVSGPRLEGDLKAPQMDVQA
PKIDVEAPDVTIEGPEGKLKGPKFKMPEMHVKAPKFSMPDVDFNLKGPRLKGDVDVSAPK
LEGDLQRPGVDIEGPKLDLRAPDVNIEGPDGKLKGPGIKMPEMHVKTPQISMPDIDLNLK
GLRLKGDADLQGPKLEGGLKGPEIDIKGPKVDIEGPEVDVDVDVECPDLNVEGPEANVKL
PKFKKSKFGFGMKSPKAEIKVPGAEVDLPEAELNAESPDVNVAGKGKKSKFKMPKLHMSG
PKVKGKKGGFDVNVAGGELDANLKSPDLDVNVAGPDVTIKGDATVKSPKGKKPMFGKISF
PDVEFDLKSHRFRGDASLAGPKIEGELKAPDLEVSVPTAKGDVKGPSLNVDLDASDVSLK
KPKFKLPGGQVGVGDLKMEGDLKGPAL
>tr|A0A087QH03|A0A087QH03_APTFO Bromodomain-containing protein 9 (Fragment) OS=Aptenodytes forsteri OX=9233 GN=AS27_06589 PE=4 SV=1
DYVDKPLEKPLKLVLKVGGSEVTELSGSGHDSSYYDDRSDHERERHKEKKKKKKKKSEKE
KEKHLDDEERRKRKEEKKRKREKEQCDTEGETDDFDPGKKVEVEPPPDRPVRACRTQPAE
NESTPIQQLLEHFLRQLQRKDPHGFFAFPVTDAIAPGYSMIIKHPMDFGTMKEKIAANEY
KSVTEFKADFKLMCDNAMTYNRPDTVYYKLAKKILHTGFKMMSKAALLGDEDTAVEEPVP
EVVPVQAETTKKSKKQNKEVISCIFEPEGNACSLTDSTAEEHVLALVEHAADEARDRINR
YLPNSKIGYLKKNGDGTLLFSVVNSSDPEAEEEETHPVDLSSLSSKLLPGFTTLGFKDER
RNKVTFLTSASTAPSMQNNSIFHDLKSDEMELLYSAYGDETGIQCALSLQEFVKDAGNYS
KKIVDDLLDQITSGDHSKTIYQLKQRRNIPVKPLDEVKVGESAGDSNTSDLDFLSMKPYS
DVSLDISMLSSLGKVKKELDHDDNHLHLDETTKLLQDLHEAQADRVGSRPSSNLSSLSNT
SERDQHHLGSPSHLSVGEQQDMVHDPYEFLQSPETSNTTTN
>tr|A0A087QHB6|A0A087QHB6_APTFO Leucine-rich repeat-containing protein 2 OS=Aptenodytes forsteri OX=9233 GN=AS27_03020 PE=4 SV=1
MGHQVIIFDFSLIRGLWETRVKKNKERQRKEKERLEKSSLEKIKQEWNFILECRKKGIPQ
SKYLKNGFVDTDEKILDTYGKTQLQKRHALSHETNKKRNKFIFQLSGEQWTEFPDSLKEQ
IYLKEWHVYNTLIQTIPAYIALFQDLRVLELSKNQINHLPLEIGCLKNLKVLNVSFNNLK
SVPPELGDCESLEKLDLSGNMEITELPFELSNLKQVTVVDVSANKFHSIPICVLRMSNLQ
WLDISSNNLKDLPEDIDRLDQLQTLLLQKNKLTYLPRALVNMPKLSLLVVSGDDLVEIPT
AVCESTTGLKFISLKDSPVETIVCEDTEEIIESEREREEFEKEFMKAYIEDLKERDSTPS
YTTKVLLSLQL

I want the output of the lists to be this:

ID = ['>tr|A0A087QGI9|A0A087QGI9_APTFO Neuroblast differentiation-associated protein AHNAK (Fragment) OS=Aptenodytes forsteri OX=9233 GN=AS27_15363 PE=4 SV=1', '>tr|A0A087QH03|A0A087QH03_APTFO Bromodomain-containing protein 9 (Fragment) OS=Aptenodytes forsteri OX=9233 GN=AS27_06589 PE=4 SV=1', '>tr|A0A087QHB6|A0A087QHB6_APTFO Leucine-rich repeat-containing protein 2 OS=Aptenodytes forsteri OX=9233 GN=AS27_03020 PE=4 SV=1']

sequences = ['GDVDVSVPKLEGDLKGPEVDIKGPKVDIEAPDVDIHGPEGKIKIPKFKMPKFGLSGLKGEGPEVDVNLPEADVALSGPKVDVTLPDLDVEGPEGKLKGPKFKKPDVQFNVPKISMPEIDLNLKGPKLKADLDPSLPKIEGELKGPEVDIKGPKVDIEAPDVDFHGPEGKLKMPKFKMPKFGASGFKAEGPEVDVSVPKGELDVSGPKLDSEGAGFQIEGPEGKFKGPQFKMPEMNVKAPKISVPDVDLNLKGPKAKADMDMSLPQVDLKGSDLHVKGPKVEVEVPDLDLEGPKGKLKGPKFKMPEMNIKAPNISMPDFDLNLKGPKSKGGADVDLSVPKLEGDLSGPDVSIKGPKVDMDGPELDVEGREGKLKGPKFKMPEMHIKAPKISMPDASLNLKGPKLKGDVEGDVEVSVPKLEGDLKGPELDIKGPKVDIDVPDVEIEGPEGKLKGPKLKMPEMHFKAPKISMPDFDLNLKGPKVKGDVDVSVPCVEGDLKGPKIDVKGPELDVSAPDVQIEGPEGKVKGPKFKMPDMHFKAPKISMPDFDLNLKGPKVDVNLPTGDLDVSVPKVDIEGPELDIEVPEGKLKGPKFKMPEMHIKAPKISMPDIDLNLKGPKVKGDVEGDVDVSVPKLEGDLKGPEVDIKGPKVDIEAPDMDIHGPEGKFKMPKFKMPEMHIKAPKVSVPDVDFNIKGPQLKGGVDVSGPRLEGDLKAPQMDVQAPKIDVEAPDVTIEGPEGKLKGPKFKMPEMHVKAPKFSMPDVDFNLKGPRLKGDVDVSAPKLEGDLQRPGVDIEGPKLDLRAPDVNIEGPDGKLKGPGIKMPEMHVKTPQISMPDIDLNLKGLRLKGDADLQGPKLEGGLKGPEIDIKGPKVDIEGPEVDVDVDVECPDLNVEGPEANVKLPKFKKSKFGFGMKSPKAEIKVPGAEVDLPEAELNAESPDVNVAGKGKKSKFKMPKLHMSGPKVKGKKGGFDVNVAGGELDANLKSPDLDVNVAGPDVTIKGDATVKSPKGKKPMFGKISFPDVEFDLKSHRFRGDASLAGPKIEGELKAPDLEVSVPTAKGDVKGPSLNVDLDASDVSLKKPKFKLPGGQVGVGDLKMEGDLKGPAL', 'DYVDKPLEKPLKLVLKVGGSEVTELSGSGHDSSYYDDRSDHERERHKEKKKKKKKKSEKEKEKHLDDEERRKRKEEKKRKREKEQCDTEGETDDFDPGKKVEVEPPPDRPVRACRTQPAENESTPIQQLLEHFLRQLQRKDPHGFFAFPVTDAIAPGYSMIIKHPMDFGTMKEKIAANEY KSVTEFKADFKLMCDNAMTYNRPDTVYYKLAKKILHTGFKMMSKAALLGDEDTAVEEPVPEVVPVQAETTKKSKKQNKEVISCIFEPEGNACSLTDSTAEEHVLALVEHAADEARDRINRYLPNSKIGYLKKNGDGTLLFSVVNSSDPEAEEEETHPVDLSSLSSKLLPGFTTLGFKDERRNKVTFLTSASTAPSMQNNSIFHDLKSDEMELLYSAYGDETGIQCALSLQEFVKDAGNYSKKIVDDLLDQITSGDHSKTIYQLKQRRNIPVKPLDEVKVGESAGDSNTSDLDFLSMKPYSDVSLDISMLSSLGKVKKELDHDDNHLHLDETTKLLQDLHEAQADRVGSRPSSNLSSLSNTSERDQHHLGSPSHLSVGEQQDMVHDPYEFLQSPETSNTTTN', 
'MGHQVIIFDFSLIRGLWETRVKKNKERQRKEKERLEKSSLEKIKQEWNFILECRKKGIPQSKYLKNGFVDTDEKILDTYGKTQLQKRHALSHETNKKRNKFIFQLSGEQWTEFPDSLKEQIYLKEWHVYNTLIQTIPAYIALFQDLRVLELSKNQINHLPLEIGCLKNLKVLNVSFNNLKSVPPELGDCESLEKLDLSGNMEITELPFELSNLKQVTVVDVSANKFHSIPICVLRMSNLQWLDISSNNLKDLPEDIDRLDQLQTLLLQKNKLTYLPRALVNMPKLSLLVVSGDDLVEIPTAVCESTTGLKFISLKDSPVETIVCEDTEEIIESEREREEFEKEFMKAYIEDLKERDSTPSYTTKVLLSLQL']

So this is what I tried, but I ran into some problems:

# Empty lists:
ID = []
sequences = []

# Empty string:
sequence = ""

# Open the FASTA file:
with open("Proteome.fasta", "r") as proteome:       # Opens the .fasta file as 'proteome'
    for line in proteome:                           # For each line in the file
        if line.startswith(">"):                    # If the line starts with '>'
            ID.append(line)                         # append the line to the list 'ID'
            sequences.append(sequence)              # append the sequence built so far to the list 'sequences'
            sequence = ""                           # reset the string at every '>' so the sequences do not become one giant sequence
        else:
            sequence += str(line[:-1])              # join the lines that make up one protein sequence into one string

The list 'sequences' starts with an empty element, which I think I can remove with the .remove() function. The list also does not contain the last sequence, probably because there is no further '>' identifier after the last protein to trigger the append.

Does anyone know what went wrong here, and how I can fix it? I would also rather not use any imports, just "vanilla" Python in the IDLE Shell. Thanks in advance!
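One possible fix, as a minimal sketch rather than a definitive answer: only store a sequence once at least one header has been seen, and store the final sequence after the loop ends, since no further '>' line follows the last protein. The function name read_fasta and the two short demo records are made up for illustration and are not part of the original file. The sketch uses strip() instead of line[:-1], because line[:-1] would cut off a real residue if the last line of the file has no trailing newline.

```python
def read_fasta(lines):
    """Split FASTA lines into two parallel lists: headers and sequences.

    'lines' can be any iterable of strings, for example an open file.
    (read_fasta is a hypothetical helper name, not from the question.)
    """
    ids = []
    sequences = []
    chunks = []  # pieces of the sequence that is currently being read

    for line in lines:
        line = line.strip()            # drop the newline and surrounding spaces
        if not line:
            continue                   # ignore blank lines
        if line.startswith(">"):
            if ids:                    # a previous record exists, so store it
                sequences.append("".join(chunks))
            ids.append(line)
            chunks = []                # start collecting the next sequence
        elif ids:                      # sequence line belonging to the last header
            chunks.append(line)

    if ids:                            # store the final record after the loop
        sequences.append("".join(chunks))
    return ids, sequences


# Made-up toy records, used only to show the behaviour.
demo = [
    ">protein_1 first toy record\n",
    "MKT\n",
    "AYI\n",
    ">protein_2 second toy record\n",
    "GDV",  # last line without a trailing newline
]
ids, sequences = read_fasta(demo)
print(ids)
print(sequences)
```

Because an open file is itself an iterable of lines, the same function should work on the real data by passing the file object from open("Proteome.fasta", "r") to read_fasta inside the with block. Collecting the pieces in a list and joining them once per protein avoids repeated string concatenation, although for a file of this size either approach is likely fast enough.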



source https://stackoverflow.com/questions/74843751/python-creating-lists-from-a-fasta-file-containing-the-proteome-of-an-organ
