Python - creating lists from a '.fasta' file containing the proteome of an organism

I want to create two lists using data from a '.fasta' file containing around 14,000 proteins and their sequences in FASTA format. This is the proteome of an organism.

One list should contain the identifiers, that is, every line in the file that starts with '>'. The other list should contain the full sequence of each protein as a single element. As a beginner I find this fairly difficult, because the sequence of one protein spans several lines in the file and every protein has a different length.

Here's a toy example (these are the first three proteins from the proteome):

>tr|A0A087QGI9|A0A087QGI9_APTFO Neuroblast differentiation-associated protein AHNAK (Fragment) OS=Aptenodytes forsteri OX=9233 GN=AS27_15363 PE=4 SV=1
GDVDVSVPKLEGDLKGPEVDIKGPKVDIEAPDVDIHGPEGKIKIPKFKMPKFGLSGLKGE
GPEVDVNLPEADVALSGPKVDVTLPDLDVEGPEGKLKGPKFKKPDVQFNVPKISMPEIDL
NLKGPKLKADLDPSLPKIEGELKGPEVDIKGPKVDIEAPDVDFHGPEGKLKMPKFKMPKF
GASGFKAEGPEVDVSVPKGELDVSGPKLDSEGAGFQIEGPEGKFKGPQFKMPEMNVKAPK
ISVPDVDLNLKGPKAKADMDMSLPQVDLKGSDLHVKGPKVEVEVPDLDLEGPKGKLKGPK
FKMPEMNIKAPNISMPDFDLNLKGPKSKGGADVDLSVPKLEGDLSGPDVSIKGPKVDMDG
PELDVEGREGKLKGPKFKMPEMHIKAPKISMPDASLNLKGPKLKGDVEGDVEVSVPKLEG
DLKGPELDIKGPKVDIDVPDVEIEGPEGKLKGPKLKMPEMHFKAPKISMPDFDLNLKGPK
VKGDVDVSVPCVEGDLKGPKIDVKGPELDVSAPDVQIEGPEGKVKGPKFKMPDMHFKAPK
ISMPDFDLNLKGPKVDVNLPTGDLDVSVPKVDIEGPELDIEVPEGKLKGPKFKMPEMHIK
APKISMPDIDLNLKGPKVKGDVEGDVDVSVPKLEGDLKGPEVDIKGPKVDIEAPDMDIHG
PEGKFKMPKFKMPEMHIKAPKVSVPDVDFNIKGPQLKGGVDVSGPRLEGDLKAPQMDVQA
PKIDVEAPDVTIEGPEGKLKGPKFKMPEMHVKAPKFSMPDVDFNLKGPRLKGDVDVSAPK
LEGDLQRPGVDIEGPKLDLRAPDVNIEGPDGKLKGPGIKMPEMHVKTPQISMPDIDLNLK
GLRLKGDADLQGPKLEGGLKGPEIDIKGPKVDIEGPEVDVDVDVECPDLNVEGPEANVKL
PKFKKSKFGFGMKSPKAEIKVPGAEVDLPEAELNAESPDVNVAGKGKKSKFKMPKLHMSG
PKVKGKKGGFDVNVAGGELDANLKSPDLDVNVAGPDVTIKGDATVKSPKGKKPMFGKISF
PDVEFDLKSHRFRGDASLAGPKIEGELKAPDLEVSVPTAKGDVKGPSLNVDLDASDVSLK
KPKFKLPGGQVGVGDLKMEGDLKGPAL
>tr|A0A087QH03|A0A087QH03_APTFO Bromodomain-containing protein 9 (Fragment) OS=Aptenodytes forsteri OX=9233 GN=AS27_06589 PE=4 SV=1
DYVDKPLEKPLKLVLKVGGSEVTELSGSGHDSSYYDDRSDHERERHKEKKKKKKKKSEKE
KEKHLDDEERRKRKEEKKRKREKEQCDTEGETDDFDPGKKVEVEPPPDRPVRACRTQPAE
NESTPIQQLLEHFLRQLQRKDPHGFFAFPVTDAIAPGYSMIIKHPMDFGTMKEKIAANEY
KSVTEFKADFKLMCDNAMTYNRPDTVYYKLAKKILHTGFKMMSKAALLGDEDTAVEEPVP
EVVPVQAETTKKSKKQNKEVISCIFEPEGNACSLTDSTAEEHVLALVEHAADEARDRINR
YLPNSKIGYLKKNGDGTLLFSVVNSSDPEAEEEETHPVDLSSLSSKLLPGFTTLGFKDER
RNKVTFLTSASTAPSMQNNSIFHDLKSDEMELLYSAYGDETGIQCALSLQEFVKDAGNYS
KKIVDDLLDQITSGDHSKTIYQLKQRRNIPVKPLDEVKVGESAGDSNTSDLDFLSMKPYS
DVSLDISMLSSLGKVKKELDHDDNHLHLDETTKLLQDLHEAQADRVGSRPSSNLSSLSNT
SERDQHHLGSPSHLSVGEQQDMVHDPYEFLQSPETSNTTTN
>tr|A0A087QHB6|A0A087QHB6_APTFO Leucine-rich repeat-containing protein 2 OS=Aptenodytes forsteri OX=9233 GN=AS27_03020 PE=4 SV=1
MGHQVIIFDFSLIRGLWETRVKKNKERQRKEKERLEKSSLEKIKQEWNFILECRKKGIPQ
SKYLKNGFVDTDEKILDTYGKTQLQKRHALSHETNKKRNKFIFQLSGEQWTEFPDSLKEQ
IYLKEWHVYNTLIQTIPAYIALFQDLRVLELSKNQINHLPLEIGCLKNLKVLNVSFNNLK
SVPPELGDCESLEKLDLSGNMEITELPFELSNLKQVTVVDVSANKFHSIPICVLRMSNLQ
WLDISSNNLKDLPEDIDRLDQLQTLLLQKNKLTYLPRALVNMPKLSLLVVSGDDLVEIPT
AVCESTTGLKFISLKDSPVETIVCEDTEEIIESEREREEFEKEFMKAYIEDLKERDSTPS
YTTKVLLSLQL

I want the output of the lists to be this:

ID = ['>tr|A0A087QGI9|A0A087QGI9_APTFO Neuroblast differentiation-associated protein AHNAK (Fragment) OS=Aptenodytes forsteri OX=9233 GN=AS27_15363 PE=4 SV=1', '>tr|A0A087QH03|A0A087QH03_APTFO Bromodomain-containing protein 9 (Fragment) OS=Aptenodytes forsteri OX=9233 GN=AS27_06589 PE=4 SV=1', '>tr|A0A087QHB6|A0A087QHB6_APTFO Leucine-rich repeat-containing protein 2 OS=Aptenodytes forsteri OX=9233 GN=AS27_03020 PE=4 SV=1']

sequences = ['GDVDVSVPKLEGDLKGPEVDIKGPKVDIEAPDVDIHGPEGKIKIPKFKMPKFGLSGLKGEGPEVDVNLPEADVALSGPKVDVTLPDLDVEGPEGKLKGPKFKKPDVQFNVPKISMPEIDLNLKGPKLKADLDPSLPKIEGELKGPEVDIKGPKVDIEAPDVDFHGPEGKLKMPKFKMPKFGASGFKAEGPEVDVSVPKGELDVSGPKLDSEGAGFQIEGPEGKFKGPQFKMPEMNVKAPKISVPDVDLNLKGPKAKADMDMSLPQVDLKGSDLHVKGPKVEVEVPDLDLEGPKGKLKGPKFKMPEMNIKAPNISMPDFDLNLKGPKSKGGADVDLSVPKLEGDLSGPDVSIKGPKVDMDGPELDVEGREGKLKGPKFKMPEMHIKAPKISMPDASLNLKGPKLKGDVEGDVEVSVPKLEGDLKGPELDIKGPKVDIDVPDVEIEGPEGKLKGPKLKMPEMHFKAPKISMPDFDLNLKGPKVKGDVDVSVPCVEGDLKGPKIDVKGPELDVSAPDVQIEGPEGKVKGPKFKMPDMHFKAPKISMPDFDLNLKGPKVDVNLPTGDLDVSVPKVDIEGPELDIEVPEGKLKGPKFKMPEMHIKAPKISMPDIDLNLKGPKVKGDVEGDVDVSVPKLEGDLKGPEVDIKGPKVDIEAPDMDIHGPEGKFKMPKFKMPEMHIKAPKVSVPDVDFNIKGPQLKGGVDVSGPRLEGDLKAPQMDVQAPKIDVEAPDVTIEGPEGKLKGPKFKMPEMHVKAPKFSMPDVDFNLKGPRLKGDVDVSAPKLEGDLQRPGVDIEGPKLDLRAPDVNIEGPDGKLKGPGIKMPEMHVKTPQISMPDIDLNLKGLRLKGDADLQGPKLEGGLKGPEIDIKGPKVDIEGPEVDVDVDVECPDLNVEGPEANVKLPKFKKSKFGFGMKSPKAEIKVPGAEVDLPEAELNAESPDVNVAGKGKKSKFKMPKLHMSGPKVKGKKGGFDVNVAGGELDANLKSPDLDVNVAGPDVTIKGDATVKSPKGKKPMFGKISFPDVEFDLKSHRFRGDASLAGPKIEGELKAPDLEVSVPTAKGDVKGPSLNVDLDASDVSLKKPKFKLPGGQVGVGDLKMEGDLKGPAL', 'DYVDKPLEKPLKLVLKVGGSEVTELSGSGHDSSYYDDRSDHERERHKEKKKKKKKKSEKEKEKHLDDEERRKRKEEKKRKREKEQCDTEGETDDFDPGKKVEVEPPPDRPVRACRTQPAENESTPIQQLLEHFLRQLQRKDPHGFFAFPVTDAIAPGYSMIIKHPMDFGTMKEKIAANEY KSVTEFKADFKLMCDNAMTYNRPDTVYYKLAKKILHTGFKMMSKAALLGDEDTAVEEPVPEVVPVQAETTKKSKKQNKEVISCIFEPEGNACSLTDSTAEEHVLALVEHAADEARDRINRYLPNSKIGYLKKNGDGTLLFSVVNSSDPEAEEEETHPVDLSSLSSKLLPGFTTLGFKDERRNKVTFLTSASTAPSMQNNSIFHDLKSDEMELLYSAYGDETGIQCALSLQEFVKDAGNYSKKIVDDLLDQITSGDHSKTIYQLKQRRNIPVKPLDEVKVGESAGDSNTSDLDFLSMKPYSDVSLDISMLSSLGKVKKELDHDDNHLHLDETTKLLQDLHEAQADRVGSRPSSNLSSLSNTSERDQHHLGSPSHLSVGEQQDMVHDPYEFLQSPETSNTTTN', 
'MGHQVIIFDFSLIRGLWETRVKKNKERQRKEKERLEKSSLEKIKQEWNFILECRKKGIPQSKYLKNGFVDTDEKILDTYGKTQLQKRHALSHETNKKRNKFIFQLSGEQWTEFPDSLKEQIYLKEWHVYNTLIQTIPAYIALFQDLRVLELSKNQINHLPLEIGCLKNLKVLNVSFNNLKSVPPELGDCESLEKLDLSGNMEITELPFELSNLKQVTVVDVSANKFHSIPICVLRMSNLQWLDISSNNLKDLPEDIDRLDQLQTLLLQKNKLTYLPRALVNMPKLSLLVVSGDDLVEIPTAVCESTTGLKFISLKDSPVETIVCEDTEEIIESEREREEFEKEFMKAYIEDLKERDSTPSYTTKVLLSLQL']

So this is what I tried, but I ran into some problems:

#Empty lists:
ID = []
sequences = []


#Empty string:
sequence = ""

#Open fasta-file:
with open("Proteome.fasta", "r") as proteome:       #Opens .fasta file as proteome
    for line in proteome:                           #For each line in proteome
        if line.startswith(">"):                    #If the line starts with '>'
            ID.append(line)                         #It will append the line to the list 'ID'
            sequences.append(sequence)              #This will append the sequence made so far to the list 'sequences'
            sequence = ""                           #This resets the string every time it encounters '>' (so the sequences will not be one giant sequence)
        else:
            sequence += str(line[:-1])              #Makes one string from the different lines that make up one protein sequence

The list 'sequences' starts with an empty element, which I think I can remove with the .remove() method. The list is also missing the last sequence, probably because no '>' identifier follows the last protein, so its sequence is never appended.
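Both symptoms can be reproduced without the real file. The sketch below runs the same loop over a small list of strings; the two records and their names (`protein_1`, `protein_2`) are made up for illustration and are not taken from the proteome.

```python
# Hypothetical two-record input standing in for the lines of the real file.
lines = [">protein_1\n", "AAA\n", "BBB\n", ">protein_2\n", "CCC\n"]

ID = []
sequences = []
sequence = ""

for line in lines:
    if line.startswith(">"):
        ID.append(line)                 # note: the trailing "\n" is kept here
        sequences.append(sequence)      # on the first header this appends ""
        sequence = ""
    else:
        sequence += line[:-1]           # drop the last character (the newline)

print(sequences)   # ['', 'AAABBB']
print(sequence)    # CCC is still held here and was never appended
```

The empty string comes from appending before any sequence has been read, and the last sequence is lost because appending only ever happens when a new '>' line is seen.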

Does anyone know what went wrong here, and how I can fix it? I would also rather not use any imports, just "vanilla" Python in the IDLE Shell. Thanks in advance!
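One possible fix, as a minimal sketch with no imports: store a sequence only once a header has already been seen, and store the final sequence after the loop ends. The function name `read_fasta` and the toy records are assumptions made for this example, not part of the original code. Lines are stripped with `.strip()` rather than `line[:-1]`, so a last line without a trailing newline does not lose a real character.

```python
def read_fasta(lines):
    """Split FASTA lines into parallel lists of headers and sequences."""
    ids = []
    sequences = []
    chunks = []  # pieces of the sequence currently being read

    for line in lines:
        line = line.strip()          # remove the newline and surrounding whitespace
        if not line:
            continue                 # skip blank lines
        if line.startswith(">"):
            if ids:                  # a previous record exists, so store its sequence
                sequences.append("".join(chunks))
            ids.append(line)
            chunks = []              # start collecting the next sequence
        else:
            chunks.append(line)

    if ids:                          # store the sequence of the last record
        sequences.append("".join(chunks))
    return ids, sequences


# Hypothetical three-record example standing in for the real file.
example = [
    ">protein_1 first toy record\n",
    "GDVDVS\n",
    "VPKLEG\n",
    ">protein_2 second toy record\n",
    "DYVDKP\n",
    ">protein_3 third toy record\n",
    "MGHQVI\n",
    "IFDFSL",   # last line without a trailing newline
]
ID, sequences = read_fasta(example)
print(ID)
print(sequences)
```

Because a file object can be iterated line by line, the same function should work on the real file by passing `proteome` from `with open("Proteome.fasta", "r") as proteome:` instead of `example`. Any sequence lines that appear before the first '>' header are discarded in this sketch.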



source https://stackoverflow.com/questions/74843751/python-creating-lists-from-a-fasta-file-containing-the-proteome-of-an-organ
