chapter i – Amino Acids & Sequence Alignment

Multiple Sequence Alignment (MSA): Tools & Databases

This discussion covers protein multiple sequence alignment and how to explore a protein’s domain content, the family it belongs to, and the presence of characteristic conserved sequence motifs. Using online bioinformatics tools and databases, I provide a step-by-step guide to generating a multiple sequence alignment, analyzing conserved features, and examining sequence-structure relationships.

Preparing a Multiple Sequence Alignment: Steps, Tools & Databases

In this section, we examine multiple sequence alignment (MSA) and demonstrate how it helps to infer information about a protein’s structure, function, and evolutionary history. When studying a protein, some questions we might want to answer include:

What is the protein’s evolutionary origin (the protein family to which it belongs)?
What is the domain content and the domain architecture?
What conserved sequence patterns are present in the sequence?
Where in the sequence can we find functionally essential sites, such as substrate binding, cofactor binding, metal binding, interactions with other proteins, etc.?
Is there a three-dimensional structure of the protein? Is it an experimental or a predicted structure?
And, of course, which tools and bioinformatics databases do we plan to use in the work?

These questions are general and can serve as a guide for any work involving a specific protein. Naturally, there are many more specific questions that depend on the type of project and the targeted protein. With modern bioinformatics tools and databases, we can answer many of these questions using a multiple sequence alignment. With the development of protein tertiary structure prediction methods such as AlphaFold, and of resources such as the ESM Metagenomic Atlas, reasonably accurate protein structures can be readily found and examined. These structures can be used, e.g., to analyze the protein’s domain content, the presence of disordered regions, and the location of functionally important residues (which are usually conserved within a protein family). Structural insights can subsequently be compared with the insights we gain from a multiple sequence alignment.

Before starting the alignment, we first need to decide which tools and databases to use. Usually, I run standard multiple sequence alignments at UniProt, which is maintained by a consortium that includes the SIB Swiss Institute of Bioinformatics. I also sometimes use the European Bioinformatics Institute (EBI) multiple sequence alignment server for more complex alignments. By “complex”, I mean that the percentage identity between the sequences is low (evolutionarily distant proteins) and the alignment contains many insertions and deletions of various lengths. Larger insertions and deletions could, for example, result from longer surface loops present in some family members (see the example MSA in the introduction to protein sequence alignment). In such cases, we may need to change the gap opening and gap extension penalties, and even the substitution matrix used to calculate the alignment score. The alignment tools at EBI offer greater flexibility in controlling these parameters. When the sequences being aligned contain many insertions and deletions, changes in these parameters can considerably affect the final alignment. For this reason, we may sometimes need to try different parameter values until we are satisfied with the result.
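To see why these parameters matter, here is a minimal Python sketch that scores two candidate alignments of the same pair of toy sequences under an affine gap model. It is an illustration only: the sequences are made up, a flat match/mismatch score stands in for a real substitution matrix such as BLOSUM62, and the function name and default values are my own assumptions rather than the settings of any particular server. The function scores alignments that are already given; it does not search for the optimal one.

```python
def score_alignment(aligned_a, aligned_b, match=2.0, mismatch=-1.0,
                    gap_open=10.0, gap_extend=0.5):
    """Score a given pairwise alignment with affine gap penalties.

    Assumed convention: the first position of a gap costs `gap_open`,
    every further position of the same gap costs `gap_extend`.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    score = 0.0
    previous_gap_row = None  # "a" or "b" if the previous column was a gap
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == "-" and res_b == "-":
            raise ValueError("a column cannot contain two gaps")
        if res_a == "-" or res_b == "-":
            gap_row = "a" if res_a == "-" else "b"
            # Extending an existing gap is cheaper than opening a new one.
            score -= gap_extend if gap_row == previous_gap_row else gap_open
            previous_gap_row = gap_row
        else:
            score += match if res_a == res_b else mismatch
            previous_gap_row = None
    return score


# Two alternative alignments of the same toy sequences (hypothetical data).
one_gap = ("ACDEFGHIK", "ACEFHIK--")   # one terminal gap, several mismatches
two_gaps = ("ACDEFGHIK", "AC-EF-HIK")  # two short internal gaps, no mismatch

for penalty in (10.0, 20.0):
    print("gap_open =", penalty,
          "| one gap:", score_alignment(*one_gap, gap_open=penalty),
          "| two gaps:", score_alignment(*two_gaps, gap_open=penalty))
```

With a gap opening penalty of 10, the alignment with two short gaps scores higher (-6.0 against -11.5). Raising the penalty to 20 reverses the ranking (-26.0 against -21.5), which is the kind of effect that makes it worth trying several parameter values on difficult alignments.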

We begin by examining information about the protein in UniProt and other related databases to gain a deeper understanding, and only after that can we generate the multiple sequence alignment.

Choosing a Protein: Magnesium Chelatase

As an exercise for the multiple sequence alignment, I will use two related proteins mentioned earlier in the introduction to sequence alignment: BchI and BchD, two subunits of the enzyme magnesium chelatase. Magnesium chelatase catalyzes the insertion of a Mg²⁺ ion into protoporphyrin IX at the first committed step of chlorophyll biosynthesis. It consists of three subunits. The largest, BchH (approximately 120 kDa), is the catalytic subunit: it binds protoporphyrin IX and magnesium and catalyzes the insertion of the metal into the porphyrin ring. For our sequence alignment, we will use the two other subunits, BchI (about 38 kDa) and BchD (70 kDa). Together, these subunits form a large 600 kDa oligomeric complex that hydrolyzes ATP and undergoes large conformational changes during the catalytic cycle (see image below and the related publication for details).

Based on their size, we can expect BchI and BchD to be multidomain proteins. In such cases, it is crucial to identify the individual domains and analyze their conservation patterns (conserved motifs), which sheds light on their functions within the enzyme complex and on their evolutionary origins. Insights gained from this analysis could enhance our understanding of how the enzyme works.

3D structure of BchI-BchD complex of magnesium chelatase

Cryo-electron microscopy reconstruction of the complex of subunits BchI and BchD of Rhodobacter capsulatus magnesium chelatase. Where appropriate, the available X-ray structure of subunit BchI of the enzyme (shown in ribbon representation) was docked into the EM density (about 7 Å resolution). Other domains were modeled based on known structures from other proteins. Published in Lundqvist et al., Structure 2010.

Analysing The Protein In UniProt

First, we need to select the sequences for the multiple sequence alignment. To start, we type the name of the protein (BchD) into the UniProt search window. The search will return a large number of entries. You may notice on the left, under “Status”, that there are “Reviewed” and “Unreviewed” sequences. This is one of my favorite features: when there are enough reviewed sequences, I usually choose them for further analysis, because their annotation has been manually checked. Many automatically annotated sequences are in the Unreviewed section (see the image below). I usually try to avoid them because they may contain mistakes, such as sequences erroneously assigned to BchD.

We will select BCHD_RHOCB (entry P26175), which refers to the BchD subunit of Rhodobacter capsulatus magnesium chelatase. In plants, the equivalent subunit is called ChlD. The “B” in BchD indicates that the protein is involved in bacteriochlorophyll synthesis. Clicking the entry ID P26175 opens a page with detailed information on the enzyme magnesium chelatase, including its biological function (photosynthesis, magnesium chelatase activity), the type of ligands/substrates it binds, its catalytic function (insertion of magnesium and ATP hydrolysis), links to published works, links to related entries in other databases, the amino acid sequence of BchD, and even the predicted AlphaFold model of the tertiary structure of the protein.

Exploring The Protein Family, Conserved Domains & Sequence Motifs

Before making a multiple sequence alignment, it is essential to understand the protein’s domain content. If you click “Family & Domains” in the left menu, you’ll see BchD’s domain content (see image below). In this case, only the vWFA domain (von Willebrand Factor A-like domain superfamily) at the C-terminus of the protein (residues 379-559) is shown. Following the “View protein in InterPro” link, which takes us to the family classification database InterPro, we arrive at the BCHD_RHOCB page. Here, we get a much more detailed analysis of the protein’s domain composition. Apart from the vWFA domain, InterPro also identifies a P-loop NTPase family domain in the N-terminal part of the sequence, residues 78-235. The characteristic P-loop sequence motif, defined in the Prosite database as [AG]-x(4)-G-K-[ST] (x means that any residue is allowed at that position), is not conserved in R. capsulatus BchD. However, the motif remains intact in other family members. Many ATP- and GTP-binding proteins contain the P-loop sequence motif. We will examine this further when we analyze the sequence alignment. We can also note that InterPro classifies a region between these two domains as disordered.
The page also links to several other InterPro entries: the BchD/ChlD_VWA domain; ChlI/MoxR_AAA_lid, the α-helical AAA+ lid domain found in all AAA+ ATPases; and the general superfamily of P-loop containing nucleoside triphosphatases, P-loop_NTPase.
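A Prosite-style pattern such as the P-loop motif can be checked against a sequence with a few lines of Python. The sketch below translates the pattern into a regular expression and searches the first 60 residues of BchI, taken from the FASTA record at the bottom of this page. The function names are my own, and the converter handles only a small subset of the Prosite syntax (single residues, x, [..], {..}, and repeat counts), so treat it as an illustration rather than a full parser.

```python
import re

# One pattern element: a residue, x, [allowed] or {forbidden},
# optionally followed by a repeat count such as (4) or (2,4).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a simple Prosite pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        parsed = _ELEMENT.fullmatch(element)
        if parsed is None:
            raise ValueError("unsupported pattern element: " + element)
        residue, low, high = parsed.groups()
        if residue == "x":
            core = "."                         # any residue
        elif residue.startswith("{"):
            core = "[^" + residue[1:-1] + "]"  # any residue except these
        else:
            core = residue                     # literal residue or [set]
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        parts.append(core)
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return (1-based start, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


P_LOOP = "[AG]-x(4)-G-K-[ST]"
# First 60 residues of BchI (BCHI_RHOCB), copied from the FASTA record.
BCHI_START = "MTTAVARLQPSASGAKTRPVFPFSAIVGQEDMKLALLLTAVDPGIGGVLVFGDRGTGKST"

print(prosite_to_regex(P_LOOP))
print(find_motif(P_LOOP, BCHI_START))
```

For BchI, the search reports a single hit, GDRGTGKS, starting at residue 52. Running the same function on the BchD sequence downloaded from UniProt is a quick way to check whether the motif is present there.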

The vWFA domain is characterized by the conservation of the MIDAS motif, which stands for Metal Ion-Dependent Adhesion Site. This motif includes the DXSXS sequence motif and additional threonine (T) and aspartate (D) residues located further down the sequence. The multiple sequence alignment shown in the image below is from NCBI’s Conserved Protein Domain Family server. From the alignment, we can see that in BchD, the DXSXS motif residues are D385, S387, and S389, located close to the N-terminus of the vWFA domain (marked in yellow and with a # on top of the alignment in the image below). Threonine T452 and aspartate D482 are found further down in the sequence. These residues are involved in binding metal ions such as Ca²⁺ or Mg²⁺. However, their exact function within the magnesium chelatase complex remains unknown. For a detailed description of the vWFA domain, refer to the paper by Lacy et al. (2004).

Multiple sequence alignment and the Metal Ion-Dependent Adhesion Site of BchD

Analysing The AlphaFold Tertiary Structure

AlphaFold has provided predicted structures for virtually all amino acid sequences in the database. To find the model, click “Structure” in the left menu in UniProt (as shown in the image above). This will bring us to structure-related links, including one to the AlphaFold site. We can examine the predicted structure of BchD to confirm the domain assignments made by InterPro. The model (image on the right) clearly confirms the domain predictions. Both the N-terminal (residues 78-235) and the C-terminal (residues 375-559) domains, which were recognized as P-loop-containing and vWFA-type domains, respectively, are well separated from the rest of the structure and show typical folds for these domain classes. The predicted structure also reveals that a long stretch of sequence lacking significant secondary structure connects the N-terminal domain and the C-terminal vWFA domain. The residues from approximately 232 to 309 (orange indicates unreliable predictions) are followed by a region with some secondary structure, which continues until Met375, where the vWFA domain begins. Examining the sequence alignment, we can observe that between the two domains of the protein, up to Glu309, lies a so-called low-complexity region. This region is poorly conserved, with numerous repeats and a high proportion of acidic residues. Such regions are usually very flexible and are often involved in protein-protein interactions. This suggests that the region may interact with other subunits within the large 600-kDa magnesium chelatase complex, as shown in the cryo-EM-reconstructed model above. The rest of the sequence, up to the vWFA domain, appears to have a more ordered structure, although, judging by the color, the reliability of the prediction remains low.
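The color of an AlphaFold model reflects the per-residue confidence score (pLDDT, on a scale from 0 to 100), which the downloadable model files store in the B-factor column; the AlphaFold site shows residues with pLDDT below 50 in orange. If you have extracted these scores, a short Python sketch like the one below can list the low-confidence stretches. The function name, the minimum segment length, and the scores themselves are hypothetical and serve only to illustrate the idea.

```python
def low_confidence_segments(plddt, threshold=50.0, min_length=5):
    """Return (first, last) residue numbers of stretches below `threshold`.

    `plddt` is a list of per-residue scores; index 0 is residue 1.
    Stretches shorter than `min_length` residues are ignored.
    """
    segments = []
    start = None  # first residue of the stretch currently being extended
    for residue, score in enumerate(plddt, start=1):
        if score < threshold:
            if start is None:
                start = residue
        elif start is not None:
            if residue - start >= min_length:
                segments.append((start, residue - 1))
            start = None
    # Close a stretch that runs to the end of the sequence.
    if start is not None and len(plddt) - start + 1 >= min_length:
        segments.append((start, len(plddt)))
    return segments


# Hypothetical scores: a confident domain, a low-confidence linker,
# another confident stretch, and a short low-confidence tail.
toy_plddt = [92.0] * 10 + [40.0] * 8 + [85.0] * 5 + [45.0] * 2

print(low_confidence_segments(toy_plddt))
print(low_confidence_segments(toy_plddt, min_length=1))
```

Comparing such segments with the domain boundaries reported by InterPro is a simple cross-check between the structure prediction and the sequence-based annotation.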

It is interesting to note, as demonstrated in the publication by Fodje et al., that the N-terminal domain of BchD is homologous to BchI and has the conserved fold of the so-called AAA+ family of proteins. This similarity has implications for the quaternary structure and function of the enzyme magnesium chelatase. We will discuss this below.

Finally, The Protein Sequence

From the analysis above, we learned that BchD contains two distinct domains, a P-loop-type domain at the N-terminus and a vWFA domain closer to the C-terminus, with a disordered region between them. With this knowledge, we can build our strategy for making the multiple sequence alignment.
To start, we need, of course, the amino acid sequence of the protein. It can be found closer to the bottom of the UniProt page.

The amino acid sequence of BchD at the UniProt database

To use the sequence for a database search, click “Download” at the top to download it. The downloaded sequence is in a different format from the one shown on the page, known as the FASTA format. A brief description of this format is available at the bottom of this page. Since multiple sequences are needed for the multiple sequence alignment, we can use BLAST (Basic Local Alignment Search Tool) to obtain a list of related sequences. It is also possible to choose the sequences from the list we get after searching UniProt with BchD. The advantage of using BLAST is that the output is a list of sequences with pairwise alignments to our search query, with the percentage of sequence identity and the alignment score shown for each alignment (see the introduction to this chapter). This allows us to choose reasonably distant sequences with lower sequence identity, which I find more useful for a multiple sequence alignment than including only closely related sequences.
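The percentage identity reported for each pairwise alignment is simple to compute by hand. The Python sketch below uses one common convention, which I assume here: the number of identical columns divided by the alignment length, gaps included. Other tools may divide by a different length, so the numbers are not always directly comparable. The aligned strings and the hit names are hypothetical.

```python
def percent_identity(aligned_query, aligned_hit):
    """Identical columns as a percentage of the alignment length."""
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(
        1 for q, h in zip(aligned_query, aligned_hit) if q == h and q != "-"
    )
    return 100.0 * identical / len(aligned_query)


def select_by_identity(hits, low, high):
    """Keep the names of hits whose identity lies within [low, high]."""
    return [name for name, identity in hits if low <= identity <= high]


# Hypothetical pairwise alignment: 8 identical columns out of 11.
query = "MKTAYIAK-QR"
hit = "MKSAY-AKLQR"
print(round(percent_identity(query, hit), 1))

# Hypothetical hit list with (name, percent identity) pairs.
hits = [("HIT_A", 87.2), ("HIT_B", 48.5), ("HIT_C", 41.0), ("HIT_D", 22.3)]
print(select_by_identity(hits, 40.0, 50.0))
```

Filtering on an identity window like this mirrors what I do by eye in the BLAST results list when picking reasonably distant relatives.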

To run BLAST, select “Tools” above the sequence (image above) and click “BLAST.” In this case, you don’t need to download the sequence; it is automatically pasted into the BLAST window. You can also choose the substitution matrix for the search (BLOSUM62 is used by default) and the type of pairwise alignment, gapped or ungapped. BLAST searches the entire database and returns a list of sequences aligned to our query sequence using pairwise alignment. If you only have an amino acid sequence, e.g., from a sequencing project, you can also paste it into the search field on the BLAST page. This is useful if you have sequenced an unknown protein and want to find its relatives in the database. Once the BLAST search has finished, you can select several sequences from the list based on the organism and percentage sequence identity and add them to the Basket for later use.

Running Multiple Sequence Alignment At UniProt

I recommend running your own MSA after working through this exercise.
To get an idea of the conservation pattern within the protein family, I choose at least 3-4 sequences. I select them from the UniProt search results (after searching for BchD) or from the BLAST search results, as shown in the introduction to this chapter. BLAST results are preferable because they explicitly display the percentage sequence identity between the search query (BchD from Rhodobacter capsulatus) and the database sequence. I prefer to use sequences with lower sequence identity, since this provides more information about the protein overall. After selecting a sequence, I click “Add” to add it to the Basket. You will find the basket at the top of the page on the right-hand side.
Clicking the basket opens a small window showing the selected sequences. There, we can select the sequences we want to align, and in the Tools drop-down menu, we can simply click Align. Although it is possible to run the alignment directly from the search results page without saving the entries to the basket, I prefer to save them because they will remain there regardless of what I do in the main window.
Clicking “Align” will take us to the alignment page. An important option I like there is “Output sequence order” under “Advanced parameters”. I usually select “input order” and group the sequences as I want them, e.g., bacterial and plant sequences separately, simply using cut and paste. This ordering facilitates the analysis of conservation patterns across different kingdoms of life, as discussed in the MSA example in the introduction.
The sequences in the alignment window are in FASTA format, which differs from the default format shown on the UniProt page (see image above), and they can be edited if needed. I typically remove most of the information from the top row after the “>” sign and retain only the protein name, for example, BCHD_RHOCB. This gives a cleaner view of the alignment and simplifies its analysis. I usually place the query protein (BCHD_RHOCB) at the top before running the alignment. The results are shown in the image below:

multiple sequence alignment, BchD subunit of magnesium chelatase

The alignment in the image is colored according to identity/similarity. I ran a BLAST search and, from the results list, selected sequences with 40-50% identity to the R. capsulatus sequence. This coloring scheme shows that most BchD sequences are well conserved. The start of each of the two domains and the last residue of the N-terminal domain are marked by a star and a green line (N-terminal domain, residues 78-235; C-terminal domain, residues 375-559; R. capsulatus BchD numbering). We can also see that the region between the two domains (residues 236-374), which AlphaFold predicts to be disordered, is the least conserved and is rich in proline and charged residues. Other coloring options are available in the “Highlight property” drop-down menu at the top of the alignment. We can color by residue type or show hydrophobic, charged, and polar amino acids.
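The idea behind identity-based coloring can be reproduced with a simple per-column conservation score. The Python sketch below computes, for each alignment column, the fraction of sequences that carry the most common residue, with gaps counted as non-matching. This is only one of many possible conservation measures and is not necessarily the one used by the UniProt viewer; the function name and the toy alignment are hypothetical.

```python
from collections import Counter


def column_conservation(aligned_sequences):
    """Fraction of sequences sharing the most common residue, per column."""
    if len({len(seq) for seq in aligned_sequences}) != 1:
        raise ValueError("all aligned sequences must have the same length")

    scores = []
    for column in zip(*aligned_sequences):
        residues = [res for res in column if res != "-"]
        if not residues:
            scores.append(0.0)  # column made of gaps only
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        # Dividing by the number of sequences penalizes gapped columns.
        scores.append(top_count / len(column))
    return scores


# Hypothetical three-sequence alignment.
toy_alignment = [
    "MKTAG-L",
    "MKSAGEL",
    "MRTAG-L",
]

for position, score in enumerate(column_conservation(toy_alignment), start=1):
    print(position, round(score, 2))
```

Averaging these scores over a window of residues would show a poorly conserved linker as a dip between two well-conserved domains.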

Multiple Sequence Alignment of the BchD N-terminal Domain and BchI subunit

As an example of a more complex alignment, I aligned BchD with the smaller magnesium chelatase subunit, BchI. The sequence similarity is low, about 29%. The alignment (shown below; only part of the BchD sequence is included) shows that the ATP binding site residues (the P-loop motif, marked on the alignment) in BchI are not conserved in BchD. This suggests that R. capsulatus BchD is unlikely to hydrolyze ATP. However, it is still possible that ATP binds to the protein and drives the oligomerization of the complex (as it does for BchI). The alignment also shows large insertions in the BchI sequence compared to the BchD N-terminal domain. This is a good reason for submitting the predicted structure of the BchD N-terminal domain to a fold recognition server. This way, we can determine which AAA+ proteins its three-dimensional structure is most similar to.

amino acid sequence alignment: BchI-BchD

The FASTA Sequence Format

When working with various applications, it’s often necessary to have the amino acid sequence in FASTA format. This format consists of a header line followed by the amino acid sequence in one-letter code, typically with 60 letters per line. The most important feature is the “>” symbol at the beginning of the first line. When using an alignment program on the EBI MSA server or the UniProt BLAST tool, the text immediately after the “>” symbol (sp|P26239|BCHI_RHOCB) will appear as the alignment title for the sequence. To make things more convenient, you can delete most of the text there and leave only the UniProt entry name of the protein (BCHI_RHOCB) on that line. It will serve as a helpful sequence identifier after the alignment is completed. You can also write your own text, e.g., if you are trying to identify a sequence of an unknown protein. Below is the BchI sequence in FASTA format:

>sp|P26239|BCHI_RHOCB Magnesium-chelatase 38 kDa subunit OS=Rhodobacter capsulatus (strain ATCC BAA-309 / NBRC 16581 / SB1003) OX=272942 GN=bchI PE=1 SV=1
MTTAVARLQPSASGAKTRPVFPFSAIVGQEDMKLALLLTAVDPGIGGVLVFGDRGTGKST
AVRALAALLPEIEAVEGCPVSSPNVEMIPDWATVLSTNVIRKPTPVVDLPLGVSEDRVVG
ALDIERAISKGEKAFEPGLLARANRGYLYIDECNLLEDHIVDLLLDVAQSGENVVERDGL
SIRHPARFVLVGSGNPEEGDLRPQLLDRFGLSVEVLSPRDVETRVEVIRRRDTYDADPKA
FLEEWRPKDMDIRNQILEARERLPKVEAPNTALYDCAALCIALGSDGLRGELTLLRSARA
LAALEGATAVGRDHLKRVATMALSHRLRRDPLDEAGSTARVARTVEETLP
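The header clean-up described above can also be done with a few lines of Python, which is convenient when you have many sequences. The sketch below reads the BchI record shown above, keeps only the entry name from the header, and writes the sequence back with 60 residues per line. The function names are my own, and the header handling assumes the three-part db|accession|entry-name layout seen in this record.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def short_name(header):
    """Extract the entry name, e.g. BCHI_RHOCB, from a UniProt-style header."""
    first_field = header.split()[0]
    parts = first_field.split("|")
    return parts[2] if len(parts) == 3 else first_field


def to_fasta(name, sequence, width=60):
    """Format one record with `width` residues per line."""
    lines = [">" + name]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)


BCHI_FASTA = """\
>sp|P26239|BCHI_RHOCB Magnesium-chelatase 38 kDa subunit OS=Rhodobacter capsulatus (strain ATCC BAA-309 / NBRC 16581 / SB1003) OX=272942 GN=bchI PE=1 SV=1
MTTAVARLQPSASGAKTRPVFPFSAIVGQEDMKLALLLTAVDPGIGGVLVFGDRGTGKST
AVRALAALLPEIEAVEGCPVSSPNVEMIPDWATVLSTNVIRKPTPVVDLPLGVSEDRVVG
ALDIERAISKGEKAFEPGLLARANRGYLYIDECNLLEDHIVDLLLDVAQSGENVVERDGL
SIRHPARFVLVGSGNPEEGDLRPQLLDRFGLSVEVLSPRDVETRVEVIRRRDTYDADPKA
FLEEWRPKDMDIRNQILEARERLPKVEAPNTALYDCAALCIALGSDGLRGELTLLRSARA
LAALEGATAVGRDHLKRVATMALSHRLRRDPLDEAGSTARVARTVEETLP
"""

header, sequence = parse_fasta(BCHI_FASTA)[0]
cleaned = to_fasta(short_name(header), sequence)
print(cleaned.splitlines()[0], len(sequence))
```

The cleaned record starts with >BCHI_RHOCB and contains 350 residues, so it can be pasted directly into the alignment window.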

Concluding remarks
This tutorial demonstrates the capabilities of protein analysis using multiple sequence alignment and other tools and databases at UniProt and other servers, including the AlphaFold structure prediction. As noted in the text, the analysis I provide here is general. Depending on a specific task and purpose, we may search for other properties of a protein. The protein universe is really endless!
There are also many other methods for combining sequence and structure analysis. I hope this introduction can help you to design your own approach.