The time an ion takes to pass down the tube depends on the ratio of its mass to its charge: its mass/charge ratio, m/z. Thus what the spectrometer actually observes is a time, the time of flight of an ion as it travels from anode or cathode to detector. However, this is usually converted by the spectrometer's software to an m/z ratio, and it is this ratio that is presented to the experimenter. Here m is the mass of the ion in Daltons, and z is the charge of the ion, with the charge of a proton being 1. In this and the next chapter we will mainly be considering ions which have a single positive charge, so that z=1, and we may regard the observed values as being masses of the ions; but we should remember that they are really m/z ratios.
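To make the link between flight time and m/z concrete, here is a minimal Python sketch of an idealised linear time-of-flight tube. It assumes that the ion gains all its energy from a single accelerating voltage and then drifts at constant speed. The tube length and voltage are hypothetical defaults, not values from any particular instrument, and a real spectrometer's software will apply further corrections.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms

def flight_time(m_over_z, tube_length_m=1.0, accel_voltage=20000.0):
    """Flight time in seconds for an ion of the given m/z (Da per unit charge).

    Idealised model: the ion gains energy z*e*V, then drifts at constant speed.
    The default length and voltage are illustrative assumptions.
    """
    mass_per_charge = m_over_z * DALTON / ELEMENTARY_CHARGE  # kg per coulomb
    return tube_length_m * math.sqrt(mass_per_charge / (2.0 * accel_voltage))

def m_over_z_from_time(t, tube_length_m=1.0, accel_voltage=20000.0):
    """Invert flight_time: m/z is proportional to the square of the time."""
    return (2.0 * ELEMENTARY_CHARGE * accel_voltage
            * (t / tube_length_m) ** 2 / DALTON)
```

In this model an ion with four times the m/z takes twice as long to arrive, which is why the conversion from time to m/z is quadratic.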
On this page, we discuss how MALDI mass spectrometry is used to analyse picomole samples of relatively pure protein. The protein will have been purified by a method such as 2-d gel electrophoresis. The technique is most useful if the specimen contains only one protein, or no more than a handful of different proteins.
Before mass spectrometry, the reactivity of the cysteine side-chains should be quenched. This is often done by either carbamidation or carboxylation.
The protein is then cleaved using an endopeptidase that has highly specific cleavage points. The most appropriate enzyme for general use is trypsin. Trypsin cleaves after a K or R that is not followed by a P. Table 1 shows some other enzyme specificities.
|Enzyme||Cleaves at:||except if:|
|arg C||after R||before P|
|asp N||before D|
|chymotrypsin||after F, (L, M,) W or Y||before P; after PY|
|cyanogen bromide||after M|
|Glu C (basic)||after E||before P or E|
|Glu C (acidic)||after D or E||before D or E|
|Lys C||after K|
|pepsin (high acidity)||after F or L|
|pepsin (low acidity)||after A, E, F, L, Q, W or Y|
|proteinase K||after A, C, F, G, M, S, W or Y|
|trypsin||after K or R||before P|
For the rest of this chapter, we will be assuming that trypsin is the enzyme used.
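The trypsin rule from Table 1 (cleave after K or R, except before P) is simple enough to sketch in Python. The function names and the toy sequence in the usage note are illustrative only; the sketch assumes a complete digest with no missed cleavages.

```python
def tryptic_cleavage_sites(sequence):
    """Indices i such that trypsin cuts between sequence[i-1] and sequence[i]."""
    sites = []
    for i in range(1, len(sequence)):
        # Cleave after K or R, unless the next residue is P.
        if sequence[i - 1] in "KR" and sequence[i] != "P":
            sites.append(i)
    return sites

def tryptic_digest(sequence):
    """Peptides from a complete digest, in order from N- to C-terminus."""
    bounds = [0] + tryptic_cleavage_sites(sequence) + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]
```

For the made-up sequence `MKWVTFRPLLKAAR`, `tryptic_digest` gives `MK`, `WVTFRPLLK` and `AAR`: the R followed by P is not cleaved.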
Trypsin is particularly appropriate for positive-ion mass spectrometry. If a peptide is to be observed in a mass spectrometer, the unionised peptide must be able to become ionised by trapping and retaining a proton: colloquially, it must be able to 'fly'. It can best do this if it contains a strongly basic region. Arginine has a strongly basic region; histidine and lysine are more weakly basic. Thus trypsin, by cleaving after each arginine and lysine, ensures that each peptide will have a site capable of retaining a proton.
Trypsin cleaves the protein, as detailed above, into pieces (referred to as peptides) with a mean mass of about 1000 Da. The solution, which contains these peptides and the trypsin, is then run on the mass spectrometer, which is operated in positive-ion mode, so that the peptides become protonated.
Ideally, we would observe a spectrum containing a set of between five and 50 precise and accurate m/z ratios, each caused by a peptide derived from our original protein. Later we will discuss how to interpret such an ideal spectrum. The following section is concerned with the various ways that a real spectrum will fall short of this ideal state, and how to deal with this.
Some maximum entropy deconvolution software assumes that peaks are broadened into Gaussians, also known as normal distribution curves or bell curves. In fact, though this is unlikely to be far wrong, it may not be the most appropriate assumption. It is better to see what kind of peak broadening is found in a particular mass spectrometer, and to use this information in setting up the deconvolution software.
With some mass spectrometers, the degree and the form of the peak broadening may depend on the size of the peak. If this variation is significant enough to affect the results of the deconvolution, its characteristics should be observed from the spectra, and used in setting up the deconvolution software.
Related to instrument distortion is distortion by the software supplied by the manufacturer of the mass spectrometer. Some manufacturers, in an attempt to make the peaks 'look sharper', apply a sharpening filter to the spectra. This does not improve the data, it damages it. Unfortunately, some of this damage is irreversible. Therefore, if you suspect that such a filter is being applied, you should arrange for it to be removed or disabled.
|Figure 1. A 13C multiplet.|
The isotope frequencies of the elements found in proteins are listed in table 2.
Although carbon is not the only contributor to the multiplet nature of peptide MS peaks, it is the main contributor, and so this feature of spectra is normally referred to as 13C broadening.
For all five elements which occur in proteins, the lightest isotope is also much the commonest. When we refer to the "monoisotopic" mass of a peptide, this is the mass which it would have if all its atoms were of the commonest isotope. The masses given below for amino acids are the monoisotopic masses.
The expected form of a multiplet can be accurately predicted from its mass, so deconvolution can be used to reduce a 13C multiplet to a single peak. It should be done so as to 'reduce' a multiplet to its constituent peak of lowest isotopic mass, so that the multiplet of Figure 1 would be reduced to a single peak at 1847.9 Daltons, even though the peak at 1848.8 is larger.
The first step is to establish the nature of the miscalibration. Each peak in a miscalibrated, or uncalibrated, spectrum will have been shifted by an amount which is a function of its mass:
M' = M + f(M)
We can find the details of this function by examining a few spectra which include peptides or other substances with accurately known masses. We will find that the parameters of the function vary from spectrum to spectrum; but that the overall form of the function (linear, quadratic, etc.) is constant for any one spectrometer. Once we have established the form of the function, we can write calibration software which takes an uncalibrated spectrum, detects in it marker peaks of accurately known mass, uses these to calculate the values of the parameters of the miscalibration, and recalibrates the spectrum.
If we are lucky, the function will be a linear function:
M' = M + aM + b
Such functions are known as affine functions. These have the very helpful property that an affine function of an affine function is itself an affine function. This is helpful for us because it means that repeated attempts to recalibrate the same spectrum will not damage it; they will merely be equivalent to applying one affine function to it.
It is not only affine functions that have this desirable property: see footnote G.
Ideally, the miscalibration of the spectrum will have the form of an affine function, and any automatic recalibration done by the software supplied with the spectrometer will also have the form of an affine function. Then we can disregard the fact that an automatic attempt to recalibrate the spectrum has already been made, and do our own recalibration as described below.
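The closure property can be checked directly. The sketch below represents a function M' = M + aM + b by the pair (a, b), and works out the single pair equivalent to applying one such function and then another. The function names are illustrative, not part of any spectrometer software.

```python
def apply_affine(params, mass):
    """Apply M' = M + a*M + b, with params = (a, b)."""
    a, b = params
    return mass + a * mass + b

def compose_affine(first, second):
    """Parameters of the single affine function equal to 'first, then second'."""
    a1, b1 = first
    a2, b2 = second
    # second(first(M)) = (1 + a2) * ((1 + a1) * M + b1) + b2
    return ((1 + a1) * (1 + a2) - 1, (1 + a2) * b1 + b2)
```

So if the supplied software has already applied (a1, b1) and we then fit and apply our own correction, the net effect is still one affine function, which is why the earlier attempt can be disregarded.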
If the automatic recalibration done by the supplied software is more complicated than an affine function, an optimistic user may assume that the supplier of the software knows what they are doing, and trust in the software to do its job. This author however would not make such an assumption, but would recommend disabling the automatic recalibration, and doing the calibration properly. It is worth taking some trouble over this: a 10 p.p.m. error in calibration means an additional 10 p.p.m. error in every peak.
To recalibrate a spectrum, we need some peaks to calibrate by. Such peaks must:
|F||Very strong peak|
|i||Peptide with a missed internal cleavage point.|
|m||Peptide with an oxidized methionine.|
Some suppliers of trypsin treat it in ways that are intended to reduce autolysis. These do not prevent autolysis, but they may alter the structure of the trypsin so as to produce peaks different from those listed above.
In selecting peaks to use for calibration, we want peaks that are reliably seen in every spectrum, so it may be best to choose the strongest peaks. However, if the strongest peaks are affected by saturation that makes their positions harder to read accurately, it may be better to exclude these.
The number of calibration peaks we need depends on the number of parameters we must fit using them: if we are to fit a function of the form
M' = M + aM + b
there are two parameters, so in theory two peaks are sufficient. However it is always possible that one or more of our chosen calibration peaks will, in a particular spectrum, overlap another protein peak, so that its position cannot be read accurately. Therefore it is better to specify about three times as many calibration peaks as we have parameters to fit, and to use a procedure like this for each spectrum:
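One way such a procedure could be sketched in Python is given here. It is an illustration under stated assumptions, not necessarily the steps intended above. It fits a and b by ordinary least squares, then repeatedly discards the worst-fitting calibration peak and refits, until every remaining peak fits within a chosen tolerance. The 20 p.p.m. tolerance and the minimum of three peaks are arbitrary choices made for the example.

```python
def fit_affine(true_masses, observed_masses):
    """Least-squares fit of observed = true + a*true + b; returns (a, b)."""
    n = len(true_masses)
    mean_t = sum(true_masses) / n
    mean_o = sum(observed_masses) / n
    sxx = sum((t - mean_t) ** 2 for t in true_masses)
    sxy = sum((t - mean_t) * (o - mean_o)
              for t, o in zip(true_masses, observed_masses))
    slope = sxy / sxx
    return slope - 1.0, mean_o - slope * mean_t

def fit_with_rejection(true_masses, observed_masses, max_ppm=20.0, min_peaks=3):
    """Refit after dropping the worst peak until all residuals are within max_ppm.

    max_ppm and min_peaks are illustrative assumptions.
    Returns (a, b, list of (true, observed) pairs actually used).
    """
    pairs = list(zip(true_masses, observed_masses))
    while True:
        a, b = fit_affine([t for t, _ in pairs], [o for _, o in pairs])
        residual_ppm = [abs(o - (t + a * t + b)) / t * 1e6 for t, o in pairs]
        worst = max(range(len(pairs)), key=residual_ppm.__getitem__)
        if residual_ppm[worst] <= max_ppm or len(pairs) <= min_peaks:
            return a, b, pairs
        del pairs[worst]  # probably overlapped by another peak

def recalibrate(observed_mass, a, b):
    """Invert M' = M + a*M + b to recover the corrected mass M."""
    return (observed_mass - b) / (1.0 + a)
```

Once a and b are known, `recalibrate` is applied to every peak in the spectrum, not only to the calibration peaks.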
Before we analyse the peptide peaks in a protein MALDI spectrum, we should try to filter out peaks due to contaminants. If we are running many protein MALDI spectra, we can use a consensus of these to form a list of the more frequently seen contaminants, and filter these out. Such a consensus list should be revised periodically, so as to keep up with possible changes to the sources of contamination, e.g. a different brand of vacuum pump oil, or keratin from a different operator.
We can also filter out some peaks, as being of provably non-protein origin. This is because proteins have a characteristic ratio of their mass in Daltons to their baryon number, that is, to the number of protons and neutrons which they contain. This ratio varies somewhat from one amino acid to another, but is on average 1.00051. Thus, a protein ion may have a mass of 1000.5 Daltons, but cannot have a mass of 1000.0 Daltons or 1000.9 Daltons. The mass distribution among all those protein ions having the same number of baryons is Gaussian, and for a peptide of mass between 1000 and 1001, has a mean of 1000.5 and a standard deviation of about 0.045. Thus many contaminant peaks will have masses which are impossible, or unlikely, for peptides, while being consistent with carbohydrates, fats, or oils etc. Depending on the procedures used, it may be more convenient to filter out such peaks, or to leave them in the peak list. However it should be recognised that they convey no information about the identity of the proteins being analysed.
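This filter can be sketched in a few lines of Python, using the two figures quoted above. The standard deviation of 0.045 Da is given only for masses near 1000 Da, so the sketch takes it as a parameter rather than guessing how it changes with mass. The cut-off of three standard deviations is an arbitrary choice for illustration.

```python
MASS_PER_BARYON = 1.00051  # average for proteins, as quoted in the text

def peptide_mass_deviation(mass):
    """Distance in Da from the nearest mass expected for a peptide."""
    baryons = round(mass / MASS_PER_BARYON)
    return mass - baryons * MASS_PER_BARYON

def plausible_peptide(mass, sd=0.045, n_sd=3.0):
    """True if the mass lies within n_sd standard deviations of an expected mass.

    sd=0.045 is the value quoted for masses near 1000 Da; for other masses a
    suitable value would have to be supplied. n_sd=3.0 is an assumption.
    """
    return abs(peptide_mass_deviation(mass)) <= n_sd * sd
```

With these settings a peak at 1000.5 Da passes the filter, while peaks at 1000.0 Da and 1000.9 Da are rejected.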
|1-letter code||3-letter code||name||structure (side-chain only is shown, except for proline)||relative frequency||monoisotopic mass, Da|
The mass of a MALDI peak will be the sum of the masses of its constituent amino acids as listed in table 4, plus 19.018390 Da, the mass of a water molecule plus a proton.
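As a worked illustration of this sum, the Python sketch below adds residue masses to the constant 19.018390 Da given above. The residue masses in the dictionary are standard monoisotopic values rounded to five decimal places, supplied here as an assumption rather than copied from table 4. Cysteine is given its unmodified mass, so a carbamidated or carboxylated cysteine would need its entry changed.

```python
# Standard monoisotopic residue masses in Da (assumed values, 5 decimal places).
# "C" is unmodified cysteine; "I" has the same mass as "L".
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_PLUS_PROTON = 19.018390  # Da, the constant given in the text

def peptide_mass(sequence):
    """Expected MALDI peak mass of a singly protonated peptide."""
    return sum(RESIDUE_MASS[residue] for residue in sequence) + WATER_PLUS_PROTON
```

For example, the tripeptide `AAR` comes out at about 317.1937 Da.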
We will not normally see cysteine as a component of peptides, instead we will see either carboxymethyl cysteine or carbamido cysteine, according to how we have treated the cysteine in the proteins we are studying. We are likely to see both methionine and oxymethionine, as it is difficult to avoid the partial oxidation of methionine.
Finding a set of amino acids whose masses add up to an observed mass is a form of the 'subset sum problem'. The general form of this problem is: given a set of integers, some of them negative, to find a subset which sums to 0. This is studied by cryptographers (who call it the "knapsack problem") and has no fast solution. However, the difficulty for us is not the computer time, but the abundance of solutions. For all but the smallest peptides, it gives rise to a set of possibilities too large to be useful.
For example, suppose we observe a peak with a mass of 400.20±0.20 Da. This is small for a tryptic peptide. Nevertheless we find that this may be any of 1102 peptides, or if we ignore the ordering of the amino acids within the peptide, any of 54 sets of amino acids. These are listed in table 5. Even for such a small peptide, this is likely to be too many for us to handle easily. The numbers rapidly get worse for larger peptides, as shown in table 6.
If we recall that we are examining a tryptic digest, and that therefore each peptide (except possibly the terminal peptide of the protein) must contain at least one lysine or arginine, we can reduce the number of possibilities. The results of this are shown in the final column of table 6. This reduction of the number of possibilities becomes less the larger the peptide.
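The enumeration of possible amino acid sets can be sketched as a small recursive search. This is a minimal illustration, not the program used to produce the tables. It uses standard monoisotopic residue masses (an assumption, rounded to five decimal places, with unmodified cysteine), treats leucine and isoleucine as one residue because they have the same mass, and ignores ordering, so it counts combinations rather than permutations. Its counts need not agree with the tables, which may rest on different assumptions.

```python
# Assumed standard monoisotopic residue masses in Da; "L" stands for L or I.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_PLUS_PROTON = 19.018390  # Da, the constant given in the text

def compositions(target, tolerance, need_k_or_r=False):
    """All sets of residues whose peak mass lies within tolerance of target.

    Each result is a tuple of residue codes in order of increasing mass.
    If need_k_or_r is True, only sets containing K or R are kept.
    """
    items = sorted(RESIDUE_MASS.items(), key=lambda kv: kv[1])
    limit = target - WATER_PLUS_PROTON + tolerance
    found = []

    def extend(start, chosen, total):
        if chosen and abs(total + WATER_PLUS_PROTON - target) <= tolerance:
            if not need_k_or_r or "K" in chosen or "R" in chosen:
                found.append(tuple(chosen))
        for i in range(start, len(items)):
            code, mass = items[i]
            if total + mass > limit:
                break  # residues are sorted, so heavier ones overshoot too
            chosen.append(code)
            extend(i, chosen, total + mass)  # i, not i + 1: repeats allowed
            chosen.pop()

    extend(0, [], 0.0)
    return found
```

Even a tiny case shows the ambiguity: a peak at 133.0613 Da is matched both by a single asparagine and by two glycines, which have almost exactly the same total mass.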
|mass||amino acids||no. of permutations|
|TOTAL||54 combinations||1102 permutations|
|No. of combinations including K or R|
We can do much to reduce the number of possibilities by obtaining an accurate mass for the peak (though the accuracy of a mass can only be as good as the accuracy with which the spectrum has been calibrated). Table 7 shows how, for a mass of 900.45 Da, the number of possibilities is reduced as we increase the accuracy.
|No. of million|
Note that, for accuracies of worse than 70 p.p.m., the benefit of increasing the accuracy is relatively small. For accuracies better than 70 p.p.m., the number of possibilities drops in direct proportion to the increased accuracy.
However, there is a limit to what can be achieved in the way of de novo peptide identification, however accurately the peaks are read. Tables 5 and 6 show three reasons for this:
Therefore, libraries are usually used in interpreting MALDI peptide spectra, as described below.
A protein sequence library may be assembled from
Given a mass read from a MALDI peak, and a protein library, it is easy to write a computer algorithm to 'walk' through the library, identifying sequences of amino acids in the library that could give rise to that mass. The time taken to run such an algorithm depends directly on the size of the library. The way it does its walk will be based on the endopeptidase that has been used: if this is trypsin, it will start at the beginning of each protein, and step from there to each tryptic cleavage point (as specified in table 1 above) until it reaches the end of that protein.
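A minimal sketch of such a walk, for trypsin, might look like the following. The library is represented as a plain dictionary from protein name to sequence, which is an assumption made for the example; real libraries come in various file formats. The residue masses are standard monoisotopic values (assumed, with unmodified cysteine), and the tolerance and the number of missed cleavages allowed are illustrative defaults. Sequences containing ambiguity codes such as X would need extra handling.

```python
# Assumed standard monoisotopic residue masses in Da (unmodified cysteine).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_PLUS_PROTON = 19.018390  # Da, the constant given in the text

def peptide_mass(peptide):
    """Expected peak mass of a singly protonated peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER_PLUS_PROTON

def tryptic_peptides(sequence, max_missed=1):
    """Yield (start, peptide, missed_cleavages) for one protein sequence."""
    cuts = [0] + [i for i in range(1, len(sequence))
                  if sequence[i - 1] in "KR" and sequence[i] != "P"]
    cuts.append(len(sequence))
    for i in range(len(cuts) - 1):
        last = min(i + 2 + max_missed, len(cuts))
        for j in range(i + 1, last):
            yield cuts[i], sequence[cuts[i]:cuts[j]], j - i - 1

def search_library(observed_mass, library, tolerance_ppm=50.0, max_missed=1):
    """Return hits as (protein, start, peptide, missed, error in p.p.m.)."""
    hits = []
    for name, sequence in library.items():
        for start, peptide, missed in tryptic_peptides(sequence, max_missed):
            theoretical = peptide_mass(peptide)
            error_ppm = (observed_mass - theoretical) / theoretical * 1e6
            if abs(error_ppm) <= tolerance_ppm:
                hits.append((name, start, peptide, missed, error_ppm))
    return hits
```

Each hit records where it falls in the protein, how many cleavage points were missed, and how well the masses agree, since these are the kinds of detail needed later when weighing the hits.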
In a large library, we are likely to find a large number of 'hits', a hit being a potential match between the observed peak and a sequence in the library. We will hope to identify the protein by finding several hits from the same spectrum on the same protein within the library. For each hit that the program finds, it should note the following, which all influence how much significance we should assign to the hit.
When we have processed all the peaks in the spectrum in this way, we will have a large number of random hits, scattered throughout the proteins in the library. Also, we hope, we will have a concentration of hits in one particular protein, the one in the sample. Or we may have several identifiable proteins in the sample, all showing concentrations of hits.
In some cases, it will be clear how the results should be interpreted: there will be one or a few proteins with groups of hits well above the random background. However, in some cases we may need to use statistics to distinguish convincing sets of hits from the random background. One way to do this is to combine Bayesian measures for each hit: these can include the goodness of the fit between the observed and theoretical masses, a measure associated with the absolute mass, and a measure associated with the number and nature of missed cleavages internal to the peptide, all as listed above. The way to calculate the last two measures is best found empirically: if we have a reasonably large body of data, including meaningless random hits and confirmed genuine hits, we can assign relative weights to different types of hit.
We must also take account of the total size of each protein in the library. This is relevant for two different reasons.
The more obvious reason why the size of the protein is relevant is that larger proteins will generate more random hits. If our library contains the enormous protein known as 'titin', we will find a large number of random hits on it from almost any MALDI spectrum; and we do not want to be misled by this. When we are comparing the hypothesis 'this protein is responsible for this hit on our spectrum' with the null hypothesis 'this hit on this protein is a random coincidental match', then the likelihood of the null hypothesis depends linearly on the absolute size of the protein.
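To illustrate the effect of protein size, here is a deliberately simple Python sketch of the null hypothesis. It assumes that random hits arrive independently, at a rate proportional to the length of the protein and to the number of peaks searched, so that their number follows a Poisson distribution. Both the Poisson model and the rate parameter are assumptions made for this example; the rate would have to be estimated empirically for a real library and mass tolerance.

```python
import math

def chance_of_at_least(k_hits, protein_length, n_peaks, rate_per_residue_per_peak):
    """Probability of k_hits or more purely random hits on one protein.

    Assumes a Poisson model whose expected count grows linearly with the
    protein length; rate_per_residue_per_peak is a hypothetical parameter.
    """
    expected = rate_per_residue_per_peak * protein_length * n_peaks
    below = sum(math.exp(-expected) * expected ** i / math.factorial(i)
                for i in range(k_hits))
    return 1.0 - below
```

With a hypothetical rate of 0.00001 and 20 peaks, three or more random hits are very unlikely on a protein of 500 residues, but are the normal outcome on a protein of 30000 residues, which is the titin problem in miniature.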
There is a more subtle reason for taking account of the size of each protein in the library when we score the hits on it. The 'protein' on which we performed trypsinolysis may not have been a whole protein. Indeed, if we obtained it by 2d gel electrophoresis, we should have a rough idea of its mass, and will know that this is less than the total mass of many of the proteins in the database. We can only credit those hits which fall within a span limited by what we know about the total mass of the protein or protein fragment on which the trypsinolysis was done. So, what we should aim to calculate is 'what is the likelihood of a set of hits like the set we have observed, all falling within a total mass consistent with the presumed mass of our protein fragment, arising by chance?'
We may find that the hits on a correct protein tend to cluster within the protein, more than we would expect if the MALDI spectrum were showing peaks from tryptic peptides randomly chosen from within the protein or protein fragment present in the sample. This may be due to the tertiary structure of the protein, with its exposed hydrophilic region being more likely to give rise to tryptic fragments. If this effect is thought to be significant, we can assign extra weight to abutting or overlapping peptides.
Errors in the actual protein sequence (or the DNA or RNA sequence from which it is derived) are no doubt numerous, but do not directly affect our calculations.
A commonly seen error is that the same protein has been included in the library more than once. Those libraries which are compiled from more than one source generally try to be 'non-redundant': they have been edited to try to remove duplicate versions of the same protein. But this editing is not perfect. Two versions of the same protein may be included because there are enough errors in the sequences that they are not recognised as the same, or because they really are slightly different, occurring in different genotypes of the same species. However, some groups of different proteins are very similar in sequence, being derived from common ancestral proteins. If we find our MALDI spectrum gives hits on two similar proteins, it may be hard to tell whether what we have is a genuine hit together with another similar protein, or a genuine hit together with an erroneous version of the same protein.
If we are using a protein library that is derived from six-frame translation of DNA, we may find that our MALDI hits are grouped in two, or even three, of the library’s 'proteins', corresponding to the same section of DNA read with different reading-frames (but all in the same direction). This can happen if the original DNA data has single-base omissions, causing changes of reading frame.
An on-line version of MOWSE, using the OWL library, was once available at www.hgmp.mrc.ac.uk/Bioinformatics/Webapp/mowse/.
Modification may occur in vivo, in vitro, or in the spectrometer.
In vivo modifications are those which occur naturally in the organism which made the protein. Some, such as glycosylation, are likely to be reversed by the process of preparing the protein for mass spectrometry. Others, such as sulfation, may cause us to see ions with modified masses.
In vitro modifications are done in the laboratory in the course of preparing the protein. For example, cysteine is commonly carbamidated, to break disulfide bridges and to prevent it from being reactive. The oxidation of methionine occurs in vitro, though unintentionally: it is a consequence of exposing methionine to the atmosphere.
Some modifications occur spontaneously within the mass spectrometer. These typically involve losses of side-chains or of parts of side-chains.
Sometimes a whole spectrum may be contaminated with metal ions. The result is as if some of the protons, which provide the positive charge to the ions in the spectrometer, have been replaced by ions of the contaminating metal. For example if sodium is responsible, some of the ions will have a sodium ion of mass 22.989771 in place of a proton of mass 1.007825, so that they are too heavy by 21.981946 Da. This form of contamination can be recognised easily, as every peak is 'split' in the same way.
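Recognising this kind of splitting is easy to automate. The Python sketch below looks for pairs of peaks separated by the sodium shift computed above; the function name and the tolerance of 0.01 Da are illustrative choices. If most peaks in a spectrum have such a partner, metal-ion contamination is the likely explanation.

```python
SODIUM_SHIFT = 21.981946  # Da, Na+ in place of a proton, as computed in the text

def adduct_pairs(peak_masses, shift=SODIUM_SHIFT, tolerance=0.01):
    """Pairs (light, heavy) of peaks whose masses differ by the given shift.

    tolerance (Da) is an illustrative assumption.
    """
    peaks = sorted(peak_masses)
    pairs = []
    for i, light in enumerate(peaks):
        for heavy in peaks[i + 1:]:
            if abs(heavy - light - shift) <= tolerance:
                pairs.append((light, heavy))
    return pairs
```

The same function can be used for other metals by passing a different `shift`.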
Table 8 lists some modifications, with their mass differences.
|modification||amino acid involved||context||mass difference, Da|
|water loss||S, T||mass spec||-18.010565|
|ammonia loss||Q, K, R, N; esp. n-terminal Q||mass spec||-17.026549|
|urea loss||c-terminal R||mass spec||-60.032363|
|hydration||H, R||mass spec, esp. B ions||+18.010565|
|hydroxylation||K, P||post-translational, in collagen||+15.994915|
|hydroxylation||P||post-translational, in plant cell walls||+15.994915|
|oxidation||M||lab preparation. partial||+15.994915|
|carbamidation||C||lab preparation. complete||+57.021464|
|carboxylation||C||lab preparation. complete||+58.005479|
|N-linked glycosylation||N||post-translational. sugar usually lost before MS||large, depends on the sugar|
|O-linked glycosylation||T, S, hydroxy-K||post-translational. sugar usually lost before MS||large, depends on the sugar|
|contamination by Na+||+21.981944|
|contamination by K+||+37.955588|
|contamination by Cu+||+61.921776|
Throughout this and the next chapter, I assume that the protein is represented in the conventional orientation, with the amino-terminal end at the left and the carboxy-terminal end at the right. Thus 'after' means 'to the carboxy-terminal end of'.
'Maximum Entropy and Bayesian Methods in Science and Engineering', John Skilling, in ed. C. R. Smith and E. J. Erickson, pp. 173-187, Kluwer Academic Press, Dordrecht, 1988.
The baryon number of an atom (or molecule) is the total number of protons and neutrons which it contains. Neutrons are very slightly more massive than protons. However, for all the molecules which we will encounter, two molecules with the same baryon number will be much closer in mass than two whose baryon numbers differ.
This, and other values given in this paragraph, depend on the frequencies of amino acids in the species being studied; and on any modifications that have been done to the amino acids, such as carbamidation of cysteine.
These percentages are highly approximate. They are derived from data on human protein sequences.
These figures assume that cysteine has been carbamidated.
It is possible to distinguish leucine from isoleucine by high-energy tandem collisions, and observation of the different v ions which they form.
Footnote G: Another, and more powerful, function with this property of being "closed under composition" is
M' = (aM + b)/(cM + d)
which can conveniently be represented by the 2×2 matrix with rows (a, b) and (c, d). Composition of these functions can then be done by matrix multiplication. As with affine functions, the composition is not in general commutative, but the functions are closed under composition, and those with ad - bc ≠ 0 form a group.
This group is 3-transitive: this means that if we use only three calibration peaks, we can find a function which will cause them all to fit perfectly. This may sound good, but I am suspicious of a fit that is perfect for the calibration peaks, and I recommend that if these functions are used, you should always use at least four calibration peaks.
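The matrix representation of the functions of footnote G can be sketched as follows. Each function M' = (aM + b)/(cM + d) is stored as a pair of rows ((a, b), (c, d)), and applying one function after another corresponds to multiplying their matrices. The function names are illustrative.

```python
def apply_mobius(matrix, mass):
    """Apply M' = (a*M + b) / (c*M + d), with matrix = ((a, b), (c, d))."""
    (a, b), (c, d) = matrix
    return (a * mass + b) / (c * mass + d)

def compose_mobius(first, second):
    """Matrix for 'apply first, then second': the product second x first."""
    (a1, b1), (c1, d1) = first
    (a2, b2), (c2, d2) = second
    return ((a2 * a1 + b2 * c1, a2 * b1 + b2 * d1),
            (c2 * a1 + d2 * c1, c2 * b1 + d2 * d1))
```

An affine function is the special case with c = 0 and d = 1, so the same code covers both kinds of recalibration.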
Copyright N.S.Wedd 2003, 2004, 2011.
Last updated 2011-05-10