Python - creating lists from a '.fasta' file containing the proteome of an organism

I want to create two lists from a '.fasta' file that contains the proteome of an organism: around 14,000 proteins and their sequences in FASTA format.

One list will contain the identifiers, i.e. every line in the file that starts with '>'. The other list needs to contain the full sequence of each protein as a single element. For me, as a beginner, this is fairly difficult, because the sequence of one protein spans more than one line in the file and every protein varies in length.

Here's a toy example (these are the first three proteins from the proteome):

>tr|A0A087QGI9|A0A087QGI9_APTFO Neuroblast differentiation-associated protein AHNAK (Fragment) OS=Aptenodytes forsteri
>tr|A0A087QH03|A0A087QH03_APTFO Bromodomain-containing protein 9 (Fragment) OS=Aptenodytes forsteri OX=9233 GN=AS27_06589 PE=4 SV=1
DYVDKPLEKPLKLVLKVGGSEVTELSGSGHDSSYYDDRSDHERERHKEKKKKKKKKSEKE
KEKHLDDEERRKRKEEKKRKREKEQCDTEGETDDFDPGKKVEVEPPPDRPVRACRTQPAE
NESTPIQQLLEHFLRQLQRKDPHGFFAFPVTDAIAPGYSMIIKHPMDFGTMKEKIAANEY
KSVTEFKADFKLMCDNAMTYNRPDTVYYKLAKKILHTGFKMMSKAALLGDEDTAVEEPVP
EVVPVQAETTKKSKKQNKEVISCIFEPEGNACSLTDSTAEEHVLALVEHAADEARDRINR
YLPNSKIGYLKKNGDGTLLFSVVNSSDPEAEEEETHPVDLSSLSSKLLPGFTTLGFKDER
RNKVTFLTSASTAPSMQNNSIFHDLKSDEMELLYSAYGDETGIQCALSLQEFVKDAGNYS
KKIVDDLLDQITSGDHSKTIYQLKQRRNIPVKPLDEVKVGESAGDSNTSDLDFLSMKPYS
DVSLDISMLSSLGKVKKELDHDDNHLHLDETTKLLQDLHEAQADRVGSRPSSNLSSLSNT
SERDQHHLGSPSHLSVGEQQDMVHDPYEFLQSPETSNTTTN
>tr|A0A087QHB6|A0A087QHB6_APTFO Leucine-rich repeat-containing protein 2 OS=Aptenodytes forsteri OX=9233 GN=AS27_03020 PE=4 SV=1
MGHQVIIFDFSLIRGLWETRVKKNKERQRKEKERLEKSSLEKIKQEWNFILECRKKGIPQ
SKYLKNGFVDTDEKILDTYGKTQLQKRHALSHETNKKRNKFIFQLSGEQWTEFPDSLKEQ
IYLKEWHVYNTLIQTIPAYIALFQDLRVLELSKNQINHLPLEIGCLKNLKVLNVSFNNLK
SVPPELGDCESLEKLDLSGNMEITELPFELSNLKQVTVVDVSANKFHSIPICVLRMSNLQ
WLDISSNNLKDLPEDIDRLDQLQTLLLQKNKLTYLPRALVNMPKLSLLVVSGDDLVEIPT
AVCESTTGLKFISLKDSPVETIVCEDTEEIIESEREREEFEKEFMKAYIEDLKERDSTPS
YTTKVLLSLQL

I want the output of the lists to be this:

ID = ['>tr|A0A087QGI9|A0A087QGI9_APTFO Neuroblast differentiation-associated protein AHNAK (Fragment) OS=Aptenodytes forsteri OX=9233 GN=AS27_15363 PE=4 SV=1', '>tr|A0A087QH03|A0A087QH03_APTFO Bromodomain-containing protein 9 (Fragment) OS=Aptenodytes forsteri OX=9233 GN=AS27_06589 PE=4 SV=1', '>tr|A0A087QHB6|A0A087QHB6_APTFO Leucine-rich repeat-containing protein 2 OS=Aptenodytes forsteri OX=9233 GN=AS27_03020 PE=4 SV=1']

sequences = ['GDVDVSVPKLEGDLKGPEVDIKGPKVDIEAPDVDIHGPEGKIKIPKFKMPKFGLSGLKGEGPEVDVNLPEADVALSGPKVDVTLPDLDVEGPEGKLKGPKFKKPDVQFNVPKISMPEIDLNLKGPKLKADLDPSLPKIEGELKGPEVDIKGPKVDIEAPDVDFHGPEGKLKMPKFKMPKFGASGFKAEGPEVDVSVPKGELDVSGPKLDSEGAGFQIEGPEGKFKGPQFKMPEMNVKAPKISVPDVDLNLKGPKAKADMDMSLPQVDLKGSDLHVKGPKVEVEVPDLDLEGPKGKLKGPKFKMPEMNIKAPNISMPDFDLNLKGPKSKGGADVDLSVPKLEGDLSGPDVSIKGPKVDMDGPELDVEGREGKLKGPKFKMPEMHIKAPKISMPDASLNLKGPKLKGDVEGDVEVSVPKLEGDLKGPELDIKGPKVDIDVPDVEIEGPEGKLKGPKLKMPEMHFKAPKISMPDFDLNLKGPKVKGDVDVSVPCVEGDLKGPKIDVKGPELDVSAPDVQIEGPEGKVKGPKFKMPDMHFKAPKISMPDFDLNLKGPKVDVNLPTGDLDVSVPKVDIEGPELDIEVPEGKLKGPKFKMPEMHIKAPKISMPDIDLNLKGPKVKGDVEGDVDVSVPKLEGDLKGPEVDIKGPKVDIEAPDMDIHGPEGKFKMPKFKMPEMHIKAPKVSVPDVDFNIKGPQLKGGVDVSGPRLEGDLKAPQMDVQAPKIDVEAPDVTIEGPEGKLKGPKFKMPEMHVKAPKFSMPDVDFNLKGPRLKGDVDVSAPKLEGDLQRPGVDIEGPKLDLRAPDVNIEGPDGKLKGPGIKMPEMHVKTPQISMPDIDLNLKGLRLKGDADLQGPKLEGGLKGPEIDIKGPKVDIEGPEVDVDVDVECPDLNVEGPEANVKLPKFKKSKFGFGMKSPKAEIKVPGAEVDLPEAELNAESPDVNVAGKGKKSKFKMPKLHMSGPKVKGKKGGFDVNVAGGELDANLKSPDLDVNVAGPDVTIKGDATVKSPKGKKPMFGKISFPDVEFDLKSHRFRGDASLAGPKIEGELKAPDLEVSVPTAKGDVKGPSLNVDLDASDVSLKKPKFKLPGGQVGVGDLKMEGDLKGPAL', 'DYVDKPLEKPLKLVLKVGGSEVTELSGSGHDSSYYDDRSDHERERHKEKKKKKKKKSEKEKEKHLDDEERRKRKEEKKRKREKEQCDTEGETDDFDPGKKVEVEPPPDRPVRACRTQPAENESTPIQQLLEHFLRQLQRKDPHGFFAFPVTDAIAPGYSMIIKHPMDFGTMKEKIAANEY KSVTEFKADFKLMCDNAMTYNRPDTVYYKLAKKILHTGFKMMSKAALLGDEDTAVEEPVPEVVPVQAETTKKSKKQNKEVISCIFEPEGNACSLTDSTAEEHVLALVEHAADEARDRINRYLPNSKIGYLKKNGDGTLLFSVVNSSDPEAEEEETHPVDLSSLSSKLLPGFTTLGFKDERRNKVTFLTSASTAPSMQNNSIFHDLKSDEMELLYSAYGDETGIQCALSLQEFVKDAGNYSKKIVDDLLDQITSGDHSKTIYQLKQRRNIPVKPLDEVKVGESAGDSNTSDLDFLSMKPYSDVSLDISMLSSLGKVKKELDHDDNHLHLDETTKLLQDLHEAQADRVGSRPSSNLSSLSNTSERDQHHLGSPSHLSVGEQQDMVHDPYEFLQSPETSNTTTN', 
'MGHQVIIFDFSLIRGLWETRVKKNKERQRKEKERLEKSSLEKIKQEWNFILECRKKGIPQSKYLKNGFVDTDEKILDTYGKTQLQKRHALSHETNKKRNKFIFQLSGEQWTEFPDSLKEQIYLKEWHVYNTLIQTIPAYIALFQDLRVLELSKNQINHLPLEIGCLKNLKVLNVSFNNLKSVPPELGDCESLEKLDLSGNMEITELPFELSNLKQVTVVDVSANKFHSIPICVLRMSNLQWLDISSNNLKDLPEDIDRLDQLQTLLLQKNKLTYLPRALVNMPKLSLLVVSGDDLVEIPTAVCESTTGLKFISLKDSPVETIVCEDTEEIIESEREREEFEKEFMKAYIEDLKERDSTPSYTTKVLLSLQL']

So this is what I tried, but I ran into some problems:

#Empty lists:
ID = []
sequences = []


#Empty string:
sequence = ""

#Open fasta-file:
with open("Proteome.fasta", "r") as proteome:       #Opens .fasta file as proteome
    for line in proteome:                           #for each line in proteome                            
        if line.startswith(">"):                    #If the line starts with '>'
            ID.append(line)                         #It will append the line to the list 'ID'
            sequences.append(sequence)              #This will append the sequence made to the list 'sequences'
            sequence = ""                           #This should end the loop everytime it encounters '>' (the sequences will not be one giant sequence)
        else:                                       
            sequence += str(line[:-1])              #makes one string from the different lines that make up one proteinsequence

The list 'sequences' starts with an empty element, which I think I can remove with the .remove() function. The list also does not contain the last sequence, probably because there is no further '>' identifier after the last protein.

Does anyone know what went wrong here and how I can fix it? I would also rather not use any imports, just "vanilla" Python in the IDLE Shell. Thanks in advance!
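
For what it's worth, here is one possible way to restructure the loop, offered as a minimal sketch rather than a definitive answer. Both symptoms come from the same place: the attempt above appends `sequence` when it meets a '>' line, so the first header stores the still-empty string and nothing ever stores the sequence that follows the last header. The sketch stores a sequence only once a header has already been seen, and stores the final one after the loop ends. The function name `read_fasta`, the helper list `chunks`, and the demo records are made up for illustration; no imports are needed.

```python
def read_fasta(lines):
    """Return two parallel lists: header lines and their joined sequences."""
    ids = []
    sequences = []
    chunks = []                     # sequence lines of the record being read

    for line in lines:
        line = line.strip()         # drops the newline without cutting real characters
        if not line:
            continue                # skip blank lines
        if line.startswith(">"):
            if ids:                 # a previous record exists, so store its sequence
                sequences.append("".join(chunks))
            ids.append(line)
            chunks = []
        elif ids:                   # ignore any text before the first header
            chunks.append(line)

    if ids:                         # the last record has no following '>' to store it
        sequences.append("".join(chunks))

    return ids, sequences


# Small in-memory demo with made-up records (hypothetical data, not from the proteome)
demo = [">protein_1 example\n", "MKV\n", "LLA\n", ">protein_2 example\n", "GHT"]
ID, sequences = read_fasta(demo)
print(ID)
print(sequences)
```

With the real file, the same function could be called as `ID, sequences = read_fasta(proteome)` inside the existing `with open("Proteome.fasta", "r") as proteome:` block, since a file object yields its lines one by one. Using `strip()` instead of `line[:-1]` also avoids losing the last amino acid when the file does not end with a newline, and it removes the trailing newline from the header lines.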



source https://stackoverflow.com/questions/74843751/python-creating-lists-from-a-fasta-file-containing-the-proteome-of-an-organ
