MALDI Peptide Spectrum Interpretation

'MALDI' stands for 'Matrix-Assisted Laser Desorption Ionization'. It is a technique for analysing substances, or mixtures of substances, and is appropriate for analysing mixtures of peptides. The specimen to be analysed is spotted onto a suitable matrix, and a laser is shone onto it. The energy from the laser helps to generate ions, which are propelled along the tube of the spectrometer by an electrostatic field. These pages assume that "positive mode" spectrometry is used, with the field such that it is the positively-charged ions that will fly.

The time an ion takes to pass down the tube depends on the ratio of its mass to its charge: its mass/charge ratio, m/z. Thus what the spectrometer actually observes is a time, the time of flight of an ion as it travels from anode or cathode to detector. However, this is usually converted by the spectrometer's software to an m/z ratio, and it is this ratio that is presented to the experimenter. Here m is the mass of the ion in Daltons, and z is the charge of the ion, with the charge of a proton being 1. In this and the next chapter we will mainly be considering ions which have a single positive charge, so that z=1, and we may regard the observed values as being masses of the ions; but we should remember that they are really m/z ratios.
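As a rough illustration of why a flight time encodes m/z, consider an idealised linear instrument in which an ion is accelerated through a potential difference U and then drifts a distance L. Equating the energy gained, zeU, with the kinetic energy gives a flight time proportional to the square root of m/z. The sketch below evaluates this relation; the tube length and voltage are illustrative assumptions, not values taken from any particular spectrometer, and real instruments need further corrections.

```python
import math

DALTON_KG = 1.66053906660e-27        # kilograms per Dalton
ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs

def flight_time(mass_da, z=1, tube_length_m=1.0, voltage=20000.0):
    """Flight time in seconds for an idealised linear time-of-flight tube.

    tube_length_m and voltage are illustrative assumptions.
    """
    mass_kg = mass_da * DALTON_KG
    energy = z * ELEMENTARY_CHARGE * voltage  # kinetic energy gained, joules
    return tube_length_m * math.sqrt(mass_kg / (2.0 * energy))

def mz_from_time(t, tube_length_m=1.0, voltage=20000.0):
    """Invert flight_time: recover m/z (Daltons per unit charge) from a time."""
    return 2.0 * ELEMENTARY_CHARGE * voltage * (t / tube_length_m) ** 2 / DALTON_KG
```

Under this model an ion of four times the m/z takes twice as long to arrive, and a doubly charged ion of mass 2000 Da arrives at the same time as a singly charged ion of mass 1000 Da.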

On this page, we discuss how MALDI mass spectrometry is used to analyse picomole samples of relatively pure protein. The protein will have been purified by a method such as 2-D gel electrophoresis. The technique is most useful if the specimen contains only one protein, or no more than a handful of different proteins.

Before mass spectrometry, the reactivity of the cysteine side-chains should be quenched. This is often done either by carbamidation (carbamidomethylation) or by carboxymethylation.

The protein is then cleaved using an endopeptidase that has highly specific cleavage points. The most appropriate enzyme for general use is trypsin. Trypsin cleaves after a K or R that is not followed by a P. Table 1 shows some other enzyme specificities.

Enzyme	Cleaves at:	except if:
arg C	after R	before P
asp N	before D
chymotrypsin	after F, (L, M,) W or Y	before P; after PY
cyanogen bromide	after M
Glu C (basic)	after E	before P or E
Glu C (acidic)	after D or E	before D or E
Lys C	after K
pepsin (high acidity)	after F or L
pepsin (low acidity)	after A, E, F, L, Q, W or Y
proteinase K	after A, C, F, G, M, S, W or Y
trypsin	after K or R	before P

Table 1. Cleavage sites of endopeptidases.
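The trypsin rule (cleave after K or R unless the next residue is P) is simple enough to express as a single pattern. The following is a minimal sketch of an in-silico tryptic digest, assuming complete cleavage at every permitted site; the function name is illustrative.

```python
import re

def tryptic_digest(sequence):
    """Split a protein sequence after each K or R that is not followed by P.

    Uses a zero-width split, which requires Python 3.7 or later.
    """
    # (?<=[KR]) : the position is preceded by K or R
    # (?!P)     : the position is not followed by P
    return [piece for piece in re.split(r"(?<=[KR])(?!P)", sequence) if piece]
```

For example, `tryptic_digest("AKRPGKR")` gives `['AK', 'RPGK', 'R']`: the R before the P is not a cleavage point. Other enzymes in table 1 could be handled by changing the pattern.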

For the rest of this chapter, we will be assuming that trypsin is the enzyme used.

Trypsin is particularly appropriate for positive-mode mass spectrometry. If a peptide is to be observed in a mass spectrometer, the unionised peptide must be able to become ionised by trapping and retaining a proton: colloquially, it must be able to 'fly'. It can best do this if it contains a strongly basic region. Arginine has a strongly basic side-chain; histidine and lysine are more weakly basic. Thus trypsin, by cleaving after each arginine and lysine, ensures that each peptide will have a site capable of retaining a proton.

Trypsin cleaves the protein, as detailed above, into pieces (referred to as peptides) with a mean mass of about 1000 Da. The solution, which contains these peptides and the trypsin, is then run on the mass spectrometer, which is operated in positive mode, so that the peptides become protonated.

Ideally, we would observe a spectrum containing a set of between five and 50 precise and accurate m/z ratios, each caused by a peptide derived from our original protein. Later we will discuss how to interpret such an ideal spectrum. The following section is concerned with the various ways that a real spectrum will fall short of this ideal state, and how to deal with this.

Cleaning a MALDI spectrum

Some causes of imperfection in MALDI spectra are:

noise
peak broadening
instrument distortion
carbon-13
saturation
miscalibration
contaminants of various kinds
cations other than protons

We will now discuss ways of allowing for these.

Noise

Noise in MALDI spectra is not a significant problem: there are plenty of genuine peaks which stand well above the noise level. However, there is a continuum from the largest peaks to the smallest. If we have a reasonably high-resolution spectrum, we are likely to see a continuous series of small peaks with a separation of just over 1 Dalton. These peaks, colloquially known as "grass", are generally peptide peaks rather than true noise. We must decide how large (and maybe how sharp) a peak must be if it is to interest us.

Peak broadening

We would like the peaks to be sharp and narrow. In practice they will be broadened. Mathematically, while the process of converting a sharp peak to a broadened one is simple, the process of reversing this broadening, while minimising the loss of accuracy, is harder to achieve (just as, given a sharp image, it is easy to create a blurred version; but given a blurred image it may be hard to create a sharp one). This process of converting a broadened peak or a blurred image to a sharp one is known as 'deconvolution', and can only practically be done by software. A commonly used, and recommended, technique is 'Maximum Entropy'. It was originally developed for clarifying blurred images; it is used for number-plate recognition.

Some maximum entropy deconvolution software assumes that peaks are broadened into Gaussians, also known as normal distribution curves or bell curves. In fact, though this is unlikely to be far wrong, it may not be the most appropriate assumption. It is better to see what kind of peak broadening is found in a particular mass spectrometer, and to use this information in setting up the deconvolution software.

With some mass spectrometers, the degree and the form of the peak broadening may depend on the size of the peak. If this variation is significant enough to affect the results of the deconvolution, its characteristics should be observed from the spectra, and used in setting up the deconvolution software.

Instrument distortion

Some mass spectrometers introduce systematic distortion to the spectrum. For example, a peak may 'ring', with one or a few spurious smaller peaks after each genuine peak. Again, this can be dealt with by using maximum entropy or a similar technique to deconvolute the spectrum, removing the spurious peaks. To allow the software to do this, we must provide it with the characteristics of the distortion.

Related to instrument distortion is distortion by the software supplied by the manufacturer of the mass spectrometer. Some manufacturers, in an attempt to make the peaks 'look sharper', apply a sharpening filter to the spectra. This does not improve the data, it damages it. Unfortunately, some of this damage is irreversible. Therefore, if you suspect that such a filter is being applied, you should arrange for it to be removed or disabled.

Figure 1. A 13C multiplet.

Carbon 13

The carbon, hydrogen, nitrogen, oxygen, and sulphur in proteins, as elsewhere, are mixtures of isotopes. About one carbon atom in 90 is 13C instead of the usual 12C, so a peptide with a mass of 1900, containing about 90 carbon atoms, will on average contain one 13C atom, but may contain none, one, two, or more. Thus such a peptide will not give a single sharp peak, but a multiplet, as in figure 1.

The isotope frequencies of the elements found in proteins are listed in table 2.

element	baryon number	abundance	baryon number	abundance	baryon number	abundance	baryon number	abundance
H	1	99.988%	2	0.012%
C	12	98.93%	13	1.07%
N	14	99.63%	15	0.37%
O	16	99.76%	17	0.04%	18	0.20%
S	32	94.93%	33	0.76%	34	4.29%	36	0.02%

Table 2. Relative abundances of isotopes of H, C, N, O, S.

Although carbon is not the only contributor to the multiplet nature of peptide MS peaks, it is the main contributor, and so this feature of spectra is normally referred to as 13C broadening.
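To get a feel for the shape of a multiplet, we can treat each carbon atom as independently having a 1.07% chance (table 2) of being 13C, so that the number of 13C atoms follows a binomial distribution. The sketch below considers carbon only; this is a simplifying assumption, since the heavy isotopes of the other elements shift some further weight towards the heavier peaks.

```python
from math import comb

C13_ABUNDANCE = 0.0107  # fraction of carbon that is 13C, from table 2

def c13_distribution(n_carbons, max_heavy=5, p=C13_ABUNDANCE):
    """Probabilities that a molecule with n_carbons carbon atoms contains
    exactly k atoms of 13C, for k = 0 .. max_heavy (carbon-only model)."""
    return [comb(n_carbons, k) * p ** k * (1 - p) ** (n_carbons - k)
            for k in range(max_heavy + 1)]

# Relative peak heights for a peptide with about 90 carbon atoms.
for k, prob in enumerate(c13_distribution(90)):
    print(f"{k} x 13C: {prob:.3f}")
```

For 90 carbon atoms this model gives nearly equal probabilities for zero and one 13C atom, which is why the monoisotopic peak of a peptide of this size is not much taller than its neighbour, and for larger peptides is no longer the tallest.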

For all five elements which occur in proteins, the lightest isotope is also much the commonest. When we refer to the "monoisotopic" mass of a peptide, this is the mass which it would have if all its atoms were of the commonest isotope. The masses given below for amino acids are the monoisotopic masses.

The expected form of a multiplet can be accurately predicted from its mass, so deconvolution can be used to reduce a 13C multiplet to a single peak. It should be done so as to 'reduce' a multiplet to its constituent peak of lowest isotopic mass, so that the multiplet of Figure 1 would be reduced to a single peak at 1847.9 Daltons, even though the peak at 1848.8 is larger.

Saturation

All mass spectrometers have a limited response range, and cannot report a peak height above a certain value. This gives rise to a characteristic peak shape with a flattened top. This can be damaging to the results because the peak shapes then do not fit the model used by the deconvolution software, so that after deconvolution the positions of these peaks are misreported. If possible it is best to set the mass spectrometer settings so that saturation does not arise. If it cannot be avoided, the loss of accuracy in the positions of the saturated peaks may be fairly small, but enough that it is best to avoid using such peaks for calibration purposes (see below under calibration).

Miscalibration

The accuracy of the peaks read from a spectrum can only be as good as the accuracy with which the spectrum is calibrated. It is therefore very important to calibrate every spectrum as accurately as possible. Some suppliers of mass spectrometers supply software which calibrates spectra automatically; but this does not necessarily work well. I recommend that users of protein MALDI spectra should do the recalibration themselves, and as accurately as possible. Fortunately the mathematics of this are fairly simple.

The first step is to establish the nature of the miscalibration. Each peak in a miscalibrated, or uncalibrated, spectrum will have been shifted by an amount which is a function of its mass:

M' = M + f(M)

We can find the details of this function by examining a few spectra which include peptides or other substances with accurately known masses. We will find that the parameters of the function vary from spectrum to spectrum; but that the overall form of the function (linear, quadratic, etc.) is constant for any one spectrometer. Once we have established the form of the function, we can write calibration software which takes an uncalibrated spectrum, detects in it marker peaks of accurately known mass, uses these to calculate the values of the parameters of the miscalibration, and recalibrates the spectrum.

If we are lucky, the function will be a linear function:

M' = M + aM + b

Such functions are known as affine functions. These have the very helpful property that an affine function of an affine function is itself an affine function. This is helpful for us because it means that repeated attempts to recalibrate the same spectrum will not damage it, they will merely be equivalent to applying one affine function to it.
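The closure property can be checked directly. Writing the correction as M' = (1 + a)M + b, applying (a1, b1) and then (a2, b2) gives (1 + a2)((1 + a1)M + b1) + b2, which is again of the same form. A small sketch, with illustrative function names:

```python
def apply_affine(mass, a, b):
    """Apply the shift M' = M + a*M + b."""
    return mass + a * mass + b

def compose_affine(a1, b1, a2, b2):
    """Parameters (a, b) of the single affine function equivalent to
    applying (a1, b1) first and then (a2, b2)."""
    a = (1 + a1) * (1 + a2) - 1
    b = (1 + a2) * b1 + b2
    return a, b
```

Applying two successive corrections to a mass therefore gives the same result as applying the one composed correction, so nothing is lost by recalibrating a spectrum that has already been recalibrated in this way.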

It is not only affine functions that have this desirable property: see footnote G.

Ideally, the miscalibration of the spectrum will have the form of an affine function, and any automatic recalibration done by the software supplied with the spectrometer will also have the form of an affine function. Then we can disregard the fact that an automatic attempt to recalibrate the spectrum has already been made, and do our own recalibration as described below.

If the automatic recalibration done by the supplied software is more complicated than an affine function, an optimistic user may assume that the supplier of the software knows what they are doing, and trust in the software to do its job. This author however would not make such an assumption, but would recommend disabling the automatic recalibration, and doing the calibration properly. It is worth taking some trouble over this: a 10 p.p.m. error in calibration means an additional 10 p.p.m. error in every peak.

To recalibrate a spectrum, we need some peaks to calibrate by. Such peaks must:

be present in every spectrum
be sharp in every spectrum
have accurately known masses

We could achieve this by adding a calibration standard to every sample. However, for trypsinised peptide spectra, there is no need to do this. Trypsin is an endopeptidase, and a protein: it therefore digests itself, producing characteristic peptides, which we will see in every spectrum. These autolytic peaks are perfect for calibration purposes. The masses of some trypsin autolytic peaks are given in table 3.

peptide	mass	notes
IQVR	515.330555	F
SRIQVR	758.463693	i
VATVSLPR	842.509975	F
NKPGVYTK	906.504889
FPTDDDDK	952.389979
LSSPATLNSR	1045.564195	F
APVLSDSSCK	1063.509383
SSGSSYPSLLQCLK	1526.752466
VCNYVNWIQQTIAAN	1793.864475
LSSPATLNSRVATVSLPR	1869.055780	i
SCAAAGTECLISGWGNTK	1882.842754
LGEHNIDVLEGNEQFINAAK	2211.104580	F
SSYPGQITGNMICVGFLEGGK	2215.052746
SSYPGQITGNM*ICVGFLEGGK	2231.047661	m
IITHPNENGNTLDNDIMLIK	2265.154902
IITHPNENGNTLDNDIM*LIK	2281.149816	m
SSGSSYPSLLQCLKAPVLSDSSCK	2571.243459	i
NKPGVYTKVCNYVNWIQQTIAAN	2681.350975	i
VATVSLPRSCAAAGTECLISGWGNTK	2706.334339	i
IQVRLGEHNIDVLEGNEQFINAAK	2707.416745	i
SRIQVRLGEHNIDVLEGNEQFINAAK	2950.549884	i

Table 3. Trypsin autolytic peptides and their masses.

Notes:
F	Very strong peak
i	Peptide with a missed internal cleavage point.
m	Peptide with an oxidized methionine.

Some suppliers of trypsin treat it in ways that are intended to reduce autolysis. These do not prevent autolysis, but they may alter the structure of the trypsin so as to produce peaks different from those listed above.

In selecting peaks to use for calibration, we want peaks that are reliably seen in every spectrum, so it may be best to choose the strongest peaks. However, if the strongest peaks are affected by saturation that makes their positions harder to read accurately, it may be better to exclude these.

The number of calibration peaks we need depends on the number of parameters we must fit using them: if we are to fit a function of the form

M' = M + aM + b

there are two parameters, so in theory two peaks are sufficient. However it is always possible that one or more of our chosen calibration peaks will, in a particular spectrum, overlap another protein peak, so that its position cannot be read accurately. Therefore it is better to specify about three times as many calibration peaks as we have parameters to fit, and to use a procedure like this for each spectrum:

1. recalibrate the spectrum using a least-squares fit to all of these peaks
2. disregard the two calibration peaks which fit least well
3. recalibrate the spectrum using a least-squares fit to the remaining peaks

In awkward cases, where there are many peaks from the sample so that our calibration peaks are likely to be overlapped by other peaks, it may be best to repeat steps 2 and 3.
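The procedure above can be sketched as follows for the affine case M' = M + aM + b. This is a minimal illustration rather than a definitive implementation: it assumes the marker peaks have already been located in the uncalibrated spectrum and paired with their known masses (for example those in table 3), and the function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def calibrate(true_masses, observed_masses, n_discard=2):
    """Fit M' = M + a*M + b, drop the worst-fitting markers, and refit.

    Returns (a, b) from the second fit.
    """
    slope, intercept = fit_line(true_masses, observed_masses)
    residuals = [abs(obs - (slope * true + intercept))
                 for true, obs in zip(true_masses, observed_masses)]
    # Keep the markers with the smallest residuals from the first fit.
    order = sorted(range(len(residuals)), key=residuals.__getitem__)
    keep = sorted(order[:len(order) - n_discard])
    slope, intercept = fit_line([true_masses[i] for i in keep],
                                [observed_masses[i] for i in keep])
    return slope - 1.0, intercept

def recalibrate(observed_mass, a, b):
    """Invert M' = M + a*M + b to recover the calibrated mass M."""
    return (observed_mass - b) / (1.0 + a)
```

With six markers and two parameters, discarding two markers still leaves four for the final fit. Repeating the discard and refit, as suggested for awkward cases, amounts to calling `calibrate` again on the markers that were kept.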

Contaminants

Ideally, a MALDI spectrum will show peaks only from the protein or proteins in the sample being analysed. In practice, it is likely to show other peaks, from contaminants. Some likely contaminants are listed below.

Matrix.

The matrix on which the sample is deposited before being ionised should be of a composition that resists forming ions. However it is present in far greater quantity than the proteins in the sample, and may form a few ions, giving rise to some contaminant peaks.

Pump oil.

The vacuum in the spectrometer is maintained by a pump, and this pump uses oil. An oil of very low volatility should be used, but it is still likely that it will volatilise in the vacuum, and that it may give rise to ions.

Trypsin autolytic fragments.

If trypsin is used, peptides derived from the autolysis of trypsin will form strong contaminant peaks in all spectra. The masses of expected trypsin autolytic fragments are given above in table 3. Of course, if a different endopeptidase is used, it will give rise to different autolytic fragments.

Keratin.

Unless very great care is taken by the operators, peaks due to keratin from their skin and hair will frequently be seen. Only keratins that are introduced prior to proteolysis will be digested into fragments that can be observed on the mass spectrometer. Therefore, clean conditions should prevail in all handling of the proteins during the purification stages, and proteolysis in particular.

Unknown.

Peaks with no known origin may be seen. There are two ways that we may know that such peaks are not due to protein from the sample: they may be seen in many different spectra which should have no common protein; and they may have masses inconsistent with protein.

Before we analyse the peptide peaks in a protein MALDI spectrum, we should try to filter out peaks due to contaminants. If we are running many protein MALDI spectra, we can use a consensus of these to form a list of the more frequently seen contaminants, and filter these out. Such a consensus list should be revised periodically, so as to keep up with possible changes to the sources of contamination, e.g. a different brand of vacuum pump oil, or keratin from a different operator.

We can also filter out some peaks, as being of provably non-protein origin. This is because proteins have a characteristic ratio of their mass in Daltons to their baryon number – that is, to the number of protons and neutrons which they contain. This ratio varies somewhat from one amino acid to another, but is on average 1.00051. Thus, a protein ion may have a mass of 1000.5 Daltons, but cannot have a mass of 1000.0 Daltons or 1000.9 Daltons. The mass distribution among all those protein ions having the same number of baryons is Gaussian, and for a peptide of mass between 1000 and 1001, has a mean of 1000.5 and a standard deviation of about 0.045. Thus many contaminant peaks will have masses which are impossible, or unlikely, for peptides, while being consistent with carbohydrates, fats, or oils etc. Depending on the procedures used, it may be more convenient to filter out such peaks, or to leave them in the peak list. However it should be recognised that they convey no information about the identity of the proteins being analysed.
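A filter of this kind might be sketched as below. It uses the average ratio of 1.00051 and the standard deviation of 0.045 quoted above; the latter is stated only for masses near 1000 Da, so the value to use at other masses, and the cut-off of three standard deviations, are assumptions to be adjusted by the user.

```python
MASS_PER_BARYON = 1.00051  # average mass/baryon-number ratio quoted in the text
SD_NEAR_1000_DA = 0.045    # standard deviation quoted for masses near 1000 Da

def peptide_mass_z_score(mass, sd=SD_NEAR_1000_DA):
    """Number of standard deviations between a mass and the nearest typical
    peptide mass (a whole number of baryons times 1.00051)."""
    baryons = round(mass / MASS_PER_BARYON)
    expected = baryons * MASS_PER_BARYON
    return abs(mass - expected) / sd

def looks_like_peptide(mass, max_z=3.0, sd=SD_NEAR_1000_DA):
    """True if the mass is plausible for a peptide (max_z is an assumed cut-off)."""
    return peptide_mass_z_score(mass, sd) <= max_z
```

With these settings a mass of 1000.5 passes, while 1000.0 and 1000.9 are rejected, in line with the example above.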

Interpreting a MALDI spectrum

Once a MALDI spectrum has been cleaned, as described above, it consists of a list of peaks, each of which, we hope, represents a peptide. Broadly, there are two methods we can now use to interpret it. One is to regard it de novo, and to try to deduce, without other input, what the peptides might be. The other method, is to use a library of protein sequences, and to try to find proteins in the library which could have given rise to the peaks in the spectrum.

de novo MALDI spectrum interpretation

To interpret a MALDI spectrum de novo, all we can do is to regard each peak separately, and ask 'what peptides could have given rise to this peak?' We know that the mass of a peptide is the sum of the masses of the amino acid residues which comprise it, as listed in table 4.

1-letter code	3-letter code	name	structure (side-chain only is shown, except for proline)	relative frequency	monoisotopic mass, Da
G	Gly	glycine	-H	7%	57.021464
A	Ala	alanine	-CH3	7%	71.037114
S	Ser	serine	-CH2-OH	8%	87.032028
P	Pro	proline		6%	97.052764
V	Val	valine	-CH(-CH3)-CH3	6%	99.068414
T	Thr	threonine	-CH(-OH)-CH3	6%	101.047678
C	Cys	cysteine	-CH2-SH	5%	103.009184
I	Ile	isoleucine	-CH(-CH3)-CH2-CH3	10%	113.084064
L	Leu	leucine	-CH2-CH(-CH3)-CH3	4%	113.084064
N	Asn	asparagine	-CH2-CO-NH2	5%	114.042927
D	Asp	aspartic acid	-CH2-COOH	5%	115.026943
Q	Gln	glutamine	-CH2-CH2-CO-NH2	6%	128.058578
K	Lys	lysine	-CH2-CH2-CH2-CH2-NH2	7%	128.094963
E	Glu	glutamic acid	-CH2-CH2-COOH	2%	129.042593
M	Met	methionine	-CH2-CH2-S-CH3	2%	131.040485
H	His	histidine		4%	137.058912
F	Phe	phenylalanine	-CH2-C6H5	2%	147.068414
R	Arg	arginine	-CH2-CH2-CH2-NH-C(-NH2)=NH	5%	156.101111
Y	Tyr	tyrosine		3%	163.063329
W	Trp	tryptophan		1%	186.079313
		carboxymethylated cysteine			161.014664
		carbamidated cysteine			160.030648
		oxidised methionine			147.035399

Table 4. Amino acids and their masses.

The mass of a MALDI peak will be the sum of the masses of its constituent amino acids as listed in table 4, plus 19.018390 Da, the mass of a water molecule plus a proton.
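As a worked example, the calculation can be written in a few lines. The residue masses are copied from table 4 and the constant 19.018390 from the text; the function name is illustrative. Unmodified cysteine is used here, so for real samples the mass of the modified cysteine should be substituted.

```python
# Monoisotopic residue masses in Daltons, from table 4.
RESIDUE_MASS = {
    "G": 57.021464, "A": 71.037114, "S": 87.032028, "P": 97.052764,
    "V": 99.068414, "T": 101.047678, "C": 103.009184, "I": 113.084064,
    "L": 113.084064, "N": 114.042927, "D": 115.026943, "Q": 128.058578,
    "K": 128.094963, "E": 129.042593, "M": 131.040485, "H": 137.058912,
    "F": 147.068414, "R": 156.101111, "Y": 163.063329, "W": 186.079313,
}
WATER_PLUS_PROTON = 19.018390  # value given in the text

def peptide_ion_mass(sequence, residue_mass=RESIDUE_MASS):
    """Expected mass of the singly protonated peptide, in Daltons."""
    return sum(residue_mass[aa] for aa in sequence) + WATER_PLUS_PROTON

print(f"{peptide_ion_mass('IQVR'):.4f}")
```

For the trypsin autolytic peptides IQVR and VATVSLPR this reproduces the masses listed in table 3 to within rounding.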

We will not normally see cysteine as a component of peptides; instead we will see either carboxymethyl cysteine or carbamido cysteine, according to how we have treated the cysteine in the proteins we are studying. We are likely to see both methionine and oxidised methionine, as it is difficult to avoid the partial oxidation of methionine.

Finding a set of amino acids whose masses add up to an observed mass is a form of the 'subset sum problem'. The general form of this problem is: given a set of integers, some of them negative, to find a subset which sums to 0. This is studied by cryptographers (who call it the "knapsack problem") and has no known fast solution. However, the difficulty for us is not the computer time, but the abundance of solutions. For all but the smallest peptides, it gives rise to a set of possibilities too large to be useful.

For example, suppose we observe a peak with a mass of 400.20±0.20 Da. This is small for a tryptic peptide. Nevertheless we find that this may be any of 1102 peptides, or if we ignore the ordering of the amino acids within the peptide, any of 54 sets of amino acids. These are listed in table 5. Even for such a small peptide, this is likely to be too many for us to handle easily. The numbers rapidly get worse for larger peptides, as shown in table 6.

If we recall that we are examining a tryptic digest, and that therefore each peptide (except possibly the terminal peptide of the protein) must contain at least one lysine or arginine, we can reduce the number of possibilities. The results of this are shown in the final column of table 6. This reduction of the number of possibilities becomes less the larger the peptide.

mass	amino acids	no. of permutations
400.1343	dnng	12
400.1343	dnggg	20
400.1343	dggggg	6
400.1417	mdpg	24
400.1594	epss	12
400.1594	dtps	24
400.1594	edvg	24
400.1594	eeaa	6
400.1594	ddig	12
400.1594	ddva	12
400.1594	ddlg	12
400.1706	nnta	12
400.1706	qnsa	24
400.1706	qntg	24
400.1706	qqsg	12
400.1706	red	6
400.1706	ntagg	60
400.1706	qsagg	60
400.1706	qtggg	20
400.1706	nsaag	60
400.1706	saaggg	60
400.1706	tagggg	30
400.1747	wdv	6
400.1780	mtpa	24
400.1958	tttp	4
400.1958	eisa	24
400.1958	eitg	24
400.1958	etva	24
400.1958	dita	24
400.1958	dvvs	12
400.1958	elsa	24
400.1958	eltg	24
400.1958	dlta	24
400.2070	knsa	24
400.2070	kntg	24
400.2070	kqsg	24
400.2070	ksagg	60
400.2070	ktggg	20
400.2111	wit	6
400.2111	wlt	6
400.2111	fvpg	24
400.2144	mivg	24
400.2144	mvva	12
400.2144	mlvg	24
400.2223	rfp	6
400.2257	rmi	6
400.2257	rml	6
400.2322	iiss	6
400.2322	itvs	24
400.2322	ttvv	6
400.2322	liss	12
400.2322	ltvs	24
400.2322	llss	6
400.2434	kksg	12
TOTAL	54 combinations	1102 permutations

Table 5. All peptides with a mass of ~400.2 Da.

Mass, Da ±0.05%	No. of combinations	No. of permutations	No of combinations including K or R
100	0	0	0
200.1	4	8	0
300.1	16	99	3
400.2	54	1,102	11
500.2	216	17,330	74
600.3	748	230,161	273
700.3	2,276	3,403,786	1,035
800.4	6,710	50,417,203	3,350
900.4	18,364	758,036,231	10,443

Table 6. Total numbers of peptides within various 1 Da mass ranges, calculated by a full search.
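A full search of this kind can be sketched with a short recursive enumeration. The masses in table 5 appear to be sums of residue masses alone (for example R + E + D = 400.1706), so the sketch does the same and does not add water or a proton. It uses the twenty unmodified residues of table 4, including cysteine, which table 5 does not seem to include, so its counts need not agree exactly with the tables. The function names are illustrative.

```python
from collections import Counter
from math import factorial

# Monoisotopic residue masses in Daltons, from table 4.
RESIDUE_MASS = {
    "G": 57.021464, "A": 71.037114, "S": 87.032028, "P": 97.052764,
    "V": 99.068414, "T": 101.047678, "C": 103.009184, "I": 113.084064,
    "L": 113.084064, "N": 114.042927, "D": 115.026943, "Q": 128.058578,
    "K": 128.094963, "E": 129.042593, "M": 131.040485, "H": 137.058912,
    "F": 147.068414, "R": 156.101111, "Y": 163.063329, "W": 186.079313,
}

def compositions(target, tolerance, residues=None):
    """All multisets of residues whose summed residue mass lies within
    target +/- tolerance, each returned as a string of 1-letter codes."""
    items = sorted((residues or RESIDUE_MASS).items(), key=lambda kv: kv[1])
    found = []

    def extend(start, chosen, total):
        if chosen and abs(total - target) <= tolerance:
            found.append("".join(chosen))
        # Only add residues at or after 'start', so each multiset appears once.
        for i in range(start, len(items)):
            code, mass = items[i]
            if total + mass > target + tolerance:
                break  # items are sorted by mass, so heavier ones overshoot too
            extend(i, chosen + [code], total + mass)

    extend(0, [], 0.0)
    return found

def permutation_count(composition):
    """Number of distinct orderings of a multiset of residues."""
    count = factorial(len(composition))
    for n in Counter(composition).values():
        count //= factorial(n)
    return count
```

Calling `compositions(400.2, 0.2)` returns candidates such as `'DER'` and `'GNND'`, and `permutation_count('GNND')` gives 12, matching the corresponding row of table 5. Restricting the results to those containing K or R is a one-line filter on the returned strings.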

We can do much to reduce the number of possibilities by obtaining an accurate mass for the peak (though the accuracy of a mass can only be as good as the accuracy with which the spectrum has been calibrated). Table 7 shows how, for a mass of 900.45 Da, the number of possibilities is reduced as we increase the accuracy.

Accuracy, +/-, Da	Accuracy, +/-, ppm	No. of combinations	No. of million permutations
0.5	555	18,364	758
0.4	444	18,364	758
0.3	333	18,364	758
0.2	222	18,359	758
0.1	111	16,849	732
0.08	89	15,223	683
0.06	67	12,767	598
0.05	56	10,901	506
0.04	44	9,098	432
0.03	33	6,788	323
0.02	22	4,913	252
0.01	11	2,467	100
0.008	9	2,148	87
0.006	7	1,922	87
0.005	6	1,327	50
0.004	4	840	29
0.003	3	542	22
0.002	2	542	22
0.001	1	200	3

Table 7. Total numbers of peptides within various ranges about 900.45 Da.

Note that, for accuracies of worse than 70 p.p.m., the benefit of increasing the accuracy is relatively small. For accuracies better than 70 p.p.m., the number of possibilities drops in direct proportion to the increased accuracy.
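Since table 7 quotes each accuracy both in Daltons and in parts per million, it may help to have the conversion to hand. A minimal sketch, with illustrative function names:

```python
def da_to_ppm(tolerance_da, mass_da):
    """Express an absolute mass tolerance as parts per million of the mass."""
    return 1e6 * tolerance_da / mass_da

def ppm_to_da(tolerance_ppm, mass_da):
    """Express a p.p.m. tolerance as an absolute mass tolerance in Daltons."""
    return tolerance_ppm * mass_da / 1e6
```

At 900.45 Da, a tolerance of 0.5 Da corresponds to about 555 p.p.m., and 70 p.p.m. corresponds to about 0.063 Da.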

However, there is a limit to what can be achieved in the way of de novo peptide identification, however accurately the peaks are read. Tables 5 and 6 show three reasons for this:

Leucine and isoleucine cannot be distinguished, because they are isomers, having exactly the same mass.
There are many sets of amino acids which cannot be distinguished, because like leucine and isoleucine they contain identical sets of atoms. For example glycine + glycine has exactly the same mass as asparagine; isoleucine + glycine the same mass as valine + alanine; etc.
We have no information about the order of the amino acids within the peptide. This is the most serious problem with this method of identifying peptides. As we see from table 6, the number of permutations corresponding to each combination of amino acids rises exponentially with the mass of the peptide, and soon becomes unmanageable.

Therefore, libraries are usually used in interpreting MALDI peptide spectra, as described below.

Library MALDI spectrum interpretation

If we are working with proteins from a particular species, identification of proteins from MALDI spectra is greatly helped if we can use a library of proteins from that species. Supposedly complete libraries exist for some species, and these are far more useful than partial libraries. However partial libraries are far more useful than nothing, so long as we remember that the protein in the spectrum may be one that is not present in the library.

A protein sequence library may be assembled from

Raw DNA sequence, with 6-frame translation
DNA sequence, with knowledge of transcription and of intron-splicing
ESTs – Expressed Sequence Tags
Protein libraries, e.g. SwissProt
a fusion of two or more of the above.

Given a mass read from a MALDI peak, and a protein library, it is easy to write a computer algorithm to 'walk' through the library, identifying sequences of amino acids in the library that could give rise to that mass. The time taken to run such an algorithm depends directly on the size of the library. The way it does its walk will be based on the endopeptidase that has been used – if this is trypsin, it will start at the beginning of each protein, and step from there to each tryptic cleavage point (as specified in table 1 above) until it reaches the end of that protein.
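Such a walk might be sketched as follows for trypsin, allowing for a limited number of missed cleavages and recording, for each hit, the protein, the position, the peptide, the number of missed cleavages and the mass error. The library is assumed to be a simple mapping from protein name to sequence; that layout and the function names are assumptions made for illustration. Residue masses are those of table 4 with unmodified cysteine, and a sequence containing any other letter would need special handling.

```python
import re

# Monoisotopic residue masses in Daltons, from table 4.
RESIDUE_MASS = {
    "G": 57.021464, "A": 71.037114, "S": 87.032028, "P": 97.052764,
    "V": 99.068414, "T": 101.047678, "C": 103.009184, "I": 113.084064,
    "L": 113.084064, "N": 114.042927, "D": 115.026943, "Q": 128.058578,
    "K": 128.094963, "E": 129.042593, "M": 131.040485, "H": 137.058912,
    "F": 147.068414, "R": 156.101111, "Y": 163.063329, "W": 186.079313,
}
WATER_PLUS_PROTON = 19.018390  # value given in the text

def tryptic_peptides(protein, max_missed=1):
    """Yield (start, peptide, missed_cleavages) for one protein sequence."""
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    starts, position = [], 0
    for piece in pieces:
        starts.append(position)
        position += len(piece)
    for i in range(len(pieces)):
        for missed in range(max_missed + 1):
            if i + missed >= len(pieces):
                break
            yield starts[i], "".join(pieces[i:i + missed + 1]), missed

def search_library(library, peak_mass, tolerance, max_missed=1):
    """Return hits as (protein_name, start, peptide, missed, mass_error)."""
    hits = []
    for name, protein in library.items():
        for start, peptide, missed in tryptic_peptides(protein, max_missed):
            mass = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER_PLUS_PROTON
            if abs(mass - peak_mass) <= tolerance:
                hits.append((name, start, peptide, missed, mass - peak_mass))
    return hits
```

As a toy example, searching `{"toy": "SRIQVRLGEHNIDVLEGNEQFINAAK"}` for a peak at 515.330555 with a tolerance of 0.01 Da finds the peptide IQVR at position 2 with no missed cleavages. A practical version would precompute the peptide masses once, rather than re-digesting the library for every peak.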

In a large library, we are likely to find a large number of 'hits', a hit being a potential match between the observed peak and a sequence in the library. We will hope to identify the protein by finding several hits from the same spectrum on the same protein within the library. For each hit that the program finds, it should note the following, which all influence how much significance we should assign to the hit.

How well it fits the observed peak. For example, if we observed a peak with a mass of 1536.83±0.02 Da, then a peptide from the library with a mass of 1536.855 Da is worth noting, but one with a mass of 1536.835 Da is more convincing.
The mass of the peak. Matches to large peptides carry somewhat more conviction than matches to small ones. This is because large peptides are relatively rare, so a match to one is less likely to occur by chance.
Whether it contains missed internal cleavage points, and if so, how many. Such a match is less convincing than a match to a perfect tryptic fragment, but still carries some weight. Some tryptic cleavage points, such as those between multiple lysines, are particularly likely to be missed; ideally, we could take account of this.
Where within the protein the match falls. The significance of this is discussed below.

When we have processed all the peaks in the spectrum in this way, we will have a large number of random hits, scattered throughout the proteins in the library. Also, we hope, we will have a concentration of hits in one particular protein – the one in the sample. Or we may have several identifiable proteins in the sample, all showing concentrations of hits.

In some cases, it will be clear how the results should be interpreted – there will be one or a few proteins with groups of hits well above the random background. However in some cases we may need to use statistics to distinguish convincing sets of hits from the random background. One way to do this is to combine Bayesian measures for each hit: these can include the goodness of the fit between the observed and theoretical masses, a measure associated with the absolute mass, and a measure associated with the number and nature of missed cleavages internal to the peptide, all as listed above. The way to calculate the last two measures is best found empirically – if we have a reasonably large body of data, including meaningless random hits and confirmed genuine hits, we can assign relative weights to different types of hit.

We must also take account of the total size of each protein in the library. This is relevant for two different reasons.

The more obvious reason why the size of the protein is relevant is that larger proteins will generate more random hits. If our library contains the enormous protein known as 'titin', we will find a large number of random hits on it from almost any MALDI spectrum; and we do not want to be misled by this. When we are comparing the hypothesis 'this protein is responsible for this hit on our spectrum' with the null hypothesis 'this hit on this protein is a random coincidental match', then the likelihood of the null hypothesis depends linearly on the absolute size of the protein.
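One simple way to make this concrete, offered here as an assumption rather than as the method used by any particular program, is to model the number of random hits on a protein as a Poisson variable whose mean is proportional to the length of the protein. The per-residue hit rate in the sketch below is a hypothetical parameter that would have to be estimated empirically for a given library and mass tolerance.

```python
import math

def chance_of_at_least(k_hits, protein_length, n_peaks, hit_rate_per_residue):
    """Poisson probability of k_hits or more random hits on one protein.

    hit_rate_per_residue is a hypothetical, empirically estimated chance that
    a single peak matches by coincidence, per residue of protein length.
    """
    expected = n_peaks * hit_rate_per_residue * protein_length
    below = sum(math.exp(-expected) * expected ** i / math.factorial(i)
                for i in range(k_hits))
    return 1.0 - below
```

With 40 peaks and an assumed rate of 0.00001 per residue, five or more random hits are almost certain on a protein of 30,000 residues but extremely unlikely on one of 300 residues, so the same number of hits carries very different weight in the two cases.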

There is a more subtle reason for taking account of the size of each protein in the library when we score the hits on it. The 'protein' on which we performed trypsinolysis may not have been a whole protein. Indeed, if we obtained it by 2d gel electrophoresis, we should have a rough idea of its mass, and will know that this is less than the total mass of many of the proteins in the database. We can only credit those hits which fall within a span limited by what we know about the total mass of the protein or protein fragment on which the trypsinolysis was done. So, what we should aim to calculate is 'what is the likelihood of a set of hits like the set we have observed, all falling within a total mass consistent with the presumed mass of our protein fragment, arising by chance?'

We may find that the hits on a correct protein tend to cluster within the protein, more than we would expect if the MALDI spectrum were showing peaks from tryptic peptides randomly chosen from within the protein or protein fragment present in the sample. This may be due to the tertiary structure of the protein, with its exposed hydrophilic region being more likely to give rise to tryptic fragments. If this effect is thought to be significant, we can assign extra weight to abutting or overlapping peptides.

More than one protein in one spectrum

We often find that the sample contained more than one protein. It may be possible to identify several proteins from the same spectrum. An effective way to do this, when we have one clearly identified protein, is to delete from the spectrum all peaks that are likely to be due to that protein. Then we re-run the modified spectrum against the library. If we find that there is now a second clearly-identified protein, we can repeat this procedure, and hope to find a third one; and so on.
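This loop can be sketched as below. The identification step itself is left to a caller-supplied function, here called `best_protein`, which is hypothetical: it is assumed to take a list of peak masses and return the name of the clearly identified protein together with the peaks it explains, or `None` when no protein stands clearly above the random background.

```python
def identify_all(peaks, best_protein, max_proteins=5):
    """Repeatedly identify a protein and strip the peaks it explains.

    best_protein is a hypothetical caller-supplied function returning
    (protein_name, explained_peaks), or None if nothing is clearly identified.
    Returns (list of protein names, list of unexplained peaks).
    """
    remaining = list(peaks)
    identified = []
    for _ in range(max_proteins):
        result = best_protein(remaining)
        if result is None:
            break
        name, explained = result
        explained = set(explained)
        if not explained:
            break  # nothing would be removed, so stop rather than loop
        identified.append(name)
        remaining = [p for p in remaining if p not in explained]
    return identified, remaining
```

The `max_proteins` limit is an assumed safeguard: each further round works with fewer peaks, so later identifications deserve progressively more caution.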

Errors in protein libraries

All of this is complicated by the fact that protein libraries have errors in them.

Errors in the actual protein sequence (or the DNA or RNA sequence from which it is derived) are no doubt numerous, but do not directly affect our calculations.

A commonly seen error is that the same protein has been included in the library more than once. Those libraries which are compiled from more than one source generally try to be 'non-redundant' – they have been edited to try to remove duplicate versions of the same protein. But this editing is not perfect. Two versions of the same protein may be included because there are enough errors in the sequences that they are not recognised as the same, or because they really are slightly different, occurring in different genotypes of the same species. However, some groups of different proteins are very similar in sequence, being derived from common ancestral proteins. If we find our MALDI spectrum gives hits on two similar proteins, it may be hard to tell whether what we have is a genuine hit together with another similar protein, or a genuine hit together with an erroneous version of the same protein.

If we are using a protein library that is derived from six-frame translation of DNA, we may find that our MALDI hits are grouped in two, or even three, of the library's 'proteins', corresponding to the same section of DNA read with different reading-frames (but all in the same direction). This can happen if the original DNA data has single-base omissions, causing changes of reading frame.
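The effect of a single-base omission can be seen without translating anything, simply by splitting the DNA into codons. The sketch below uses an invented 18-base sequence and a hypothetical helper, `codons`. It shows that after one base is dropped from the library copy, the true codons downstream of the omission reappear in a different forward reading frame.

```python
def codons(dna, frame):
    """Split dna into complete codons, starting at offset frame (0, 1 or 2)."""
    return [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]


# Invented sequence, standing for the gene as it really is.
true_dna = "ATGGCCAAGTTTGGGCGT"

# The library copy has lost one base (index 6) from the third codon.
library_dna = true_dna[:6] + true_dna[7:]

true_frame0 = codons(true_dna, 0)        # the codons actually expressed
library_frame0 = codons(library_dna, 0)  # agrees only before the omission
library_frame2 = codons(library_dna, 2)  # agrees only after the omission
```

Here the first two codons of the true gene are found in frame 0 of the library sequence, and the last three in frame 2, so peptides from the one real protein would be credited to two of the library's 'proteins'.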

MOWSE

A program which does this kind of analysis, taking as its inputs a MALDI-derived peak list and a protein library, is 'MOWSE'. MOWSE was developed by Darryl Pappin (Imperial Cancer Research Fund, UK) and Alan Bleasby (SERC Daresbury Laboratory, UK).

An on-line version of MOWSE, using the OWL library, was once available at www.hgmp.mrc.ac.uk/Bioinformatics/Webapp/mowse/.

Amino acid modifications

Sometimes an amino acid is modified in some way, causing it to have a mass different from that listed in table 4 above. This section lists some modifications, with the mass differences that they involve. This is relevant both in analysing MALDI spectra, and in analysing tandem spectra.

Modification may occur in vivo, in vitro, or in the spectrometer.

In vivo modifications are those which occur naturally in the organism which made the protein. Some, such as glycosylation, are likely to be reversed by the process of preparing the protein for mass spectrometry. Others, such as sulfation, may cause us to see ions with modified masses.

In vitro modifications are done in the laboratory in the course of preparing the protein. For example cysteine is commonly carbamidated, to break disulfide bridges and to prevent it from being reactive. The oxidation of methionine occurs in vitro, though unintentionally – it is a consequence of exposing methionine to the atmosphere.

Some modifications occur spontaneously within the mass spectrometer. These typically involve losses of side-chains or of parts of side-chains.

Sometimes a whole spectrum may be contaminated with metal ions. The result is as if some of the protons, which provide the positive charge to the ions in the spectrometer, have been replaced by ions of the contaminating metal. For example, if sodium is responsible, some of the ions will carry a sodium ion in place of a proton. Taking the atomic masses of sodium (22.989769) and hydrogen (1.007825), with the electron masses cancelling, these ions are too heavy by 21.981944 Da. This form of contamination can be recognised easily, as every peak is 'split' in the same way.
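A check for this kind of splitting can be sketched as a search for pairs of peaks separated by the sodium shift. The shift is the value given for Na+ in Table 8; the function name, the tolerance of 0.01 Da and the peak list are assumptions made for the example.

```python
SODIUM_SHIFT = 21.981944  # Da: Na+ in place of H+, as listed in Table 8


def find_adduct_pairs(peaks, shift=SODIUM_SHIFT, tolerance=0.01):
    """Return (light, heavy) pairs of peaks that differ by shift."""
    ordered = sorted(peaks)
    pairs = []
    for light in ordered:
        for heavy in ordered:
            if abs((heavy - light) - shift) <= tolerance:
                pairs.append((light, heavy))
    return pairs


# Invented peak list: two peaks each have a companion about 21.982 Da heavier.
peaks = [1000.500, 1022.482, 1500.700, 1522.682, 1800.000]
sodium_pairs = find_adduct_pairs(peaks)
```

If most peaks in a spectrum turn up as the lighter member of such a pair, sodium contamination is a likely explanation. The same function can be tried with the shifts for K+ or Cu+.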

Table 8 lists some modifications, with their mass differences.

modification	amino acid involved	context	mass difference, Da
water loss	S, T	mass spec	-18.010565
ammonia loss	Q, K, R, N; esp. n-terminal Q	mass spec	-17.026549
urea loss	c-terminal R	mass spec	-60.032363
hydration	H, R	mass spec, esp. B ions	+18.010565
methylation		various	+14.015650
hydroxylation	K, P	post-translational, in collagen	+15.994915
hydroxylation	P	post-translational, in plant cell walls	+15.994915
oxidation	M	lab preparation. partial	+15.994915
acetylation	S		+42.010565
carbamidation	C	lab preparation. complete	+57.021464
carboxylation	C	lab preparation. complete	+58.005479
phosphorylation	T,S,Y	post-translational	+79.966330
amidation	c-terminal		-0.984016
formylation			+27.994915
sulfation			+79.956815
N-linked glycosylation	N	post-translational. sugar usually lost before MS	large, depends on the sugar
O-linked glycosylation	T, S, hydroxy-K	post-translational. sugar usually lost before MS	large, depends on the sugar
fucosylation	S		+146.057908
contamination by Na+			+21.981944
contamination by K+			+37.955882
contamination by Cu+			+61.921776

Table 8. Modifications to ion masses.
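One way to use Table 8 is to take the difference between an observed mass and a predicted unmodified mass, and ask which entries could account for it. The sketch below assumes singly-charged ions, copies only a subset of the table, and uses an invented tolerance and invented masses; the function name is hypothetical.

```python
# A subset of Table 8: modification name -> mass difference in Da.
MODIFICATIONS = {
    "water loss (S, T)": -18.010565,
    "ammonia loss": -17.026549,
    "oxidation (M)": 15.994915,
    "contamination by Na+": 21.981944,
    "carbamidation (C)": 57.021464,
    "carboxylation (C)": 58.005479,
    "sulfation": 79.956815,
    "phosphorylation (T, S, Y)": 79.966330,
}


def candidate_modifications(observed, predicted, tolerance=0.02):
    """List modifications whose shift explains observed - predicted.

    The closest match is listed first.
    """
    delta = observed - predicted
    matches = [(name, shift) for name, shift in MODIFICATIONS.items()
               if abs(delta - shift) <= tolerance]
    return sorted(matches, key=lambda match: abs(delta - match[1]))


# Invented masses: about 15.995 Da and about 79.96 Da above the prediction.
oxidised = candidate_modifications(1516.695, 1500.700)
ambiguous = candidate_modifications(1580.660, 1500.700)
```

The second call returns both sulfation and phosphorylation, which differ by less than 0.01 Da. Whether they can be told apart depends on the mass accuracy of the instrument, which the tolerance stands for here.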

Conclusions

The use of MALDI spectrometry alone is rarely sufficient to identify a protein. If the protein is not mixed, and gives rise to several peptides, and several of these give matches against the same protein in a good complete library for the species, it may be possible to obtain an identification. More often, MALDI spectrometry cannot prove what protein was present, but can provide a list of likely candidates. This is useful, as then the peptides that provide evidence for these candidate proteins can be taken forward for MS/MS spectrometry, which is more likely to provide convincing proof of protein identity.

Footnotes

By definition, the mass of an atom of 12C is 12 Da. On this scale, the mass of a hydrogen atom is 1.007825032 Da.

Throughout this and the next chapter, I assume that the protein is represented in the conventional orientation, with the amino-terminal end at the left and the carboxy-terminal end at the right. Thus 'after' means 'to the carboxy-terminal end of'.

'Maximum Entropy and Bayesian Methods in Science and Engineering', John Skilling, in ed. C. R. Smith and E. J. Erickson, pp. 173-187, Kluwer Academic Press, Dordrecht, 1988.

The baryon number of an atom (or molecule) is the total number of protons and neutrons which it contains. Neutrons are very slightly more massive than protons. However, for all the molecules which we will encounter, two molecules with the same baryon number will be much closer in mass than two whose baryon numbers differ.

This, and other values given in this paragraph, depend on the frequencies of amino acids in the species being studied; and on any modifications that have been done to the amino acids, such as carbamidation of cysteine.

These percentages are highly approximate. They are derived from data on human protein sequences.

These figures assume that cysteine has been carbamidated.

It is possible to distinguish leucine from isoleucine by high-energy tandem collisions, and observation of the different d and w ions which they form.

Another, and more powerful, function with this property of being "closed under composition" is

M' = (aM + b)/(cM + d)

which can conveniently be represented by the 2×2 matrix with rows (a b) and (c d). Composition of these functions can then be done by matrix multiplication. The composition is not, in general, commutative, but the set is still closed under composition and, provided that ad − bc is not zero, every function has an inverse, so the set of such functions forms a group.
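The correspondence between composition and matrix multiplication can be checked numerically. This is a small sketch using exact fractions; the two functions f and g are arbitrary examples, and the helper names are invented.

```python
from fractions import Fraction


def apply_function(matrix, x):
    """Evaluate M' = (aM + b)/(cM + d) for matrix ((a, b), (c, d))."""
    (a, b), (c, d) = matrix
    return (a * x + b) / (c * x + d)


def multiply(p, q):
    """Ordinary 2x2 matrix product p.q."""
    (a, b), (c, d) = p
    (e, f), (g, h) = q
    return ((a * e + b * g, a * f + b * h),
            (c * e + d * g, c * f + d * h))


f = ((2, 1), (1, 3))   # M' = (2M + 1)/(M + 3)
g = ((1, 4), (2, 1))   # M' = (M + 4)/(2M + 1)
x = Fraction(5)

f_after_g = apply_function(f, apply_function(g, x))   # f(g(5))
via_product = apply_function(multiply(f, g), x)       # same value
g_after_f = apply_function(g, apply_function(f, x))   # a different value
```

Applying g and then f gives the same result as applying the single function whose matrix is the product of the two matrices, while applying them in the other order gives a different result.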

This group is 3-transitive: this means that if we use only three calibration peaks, we can find a function which will cause them all to fit perfectly. This may sound good, but I am suspicious of a fit that is perfect for the calibration peaks, and I recommend that if these functions are used, you should always use at least four calibration peaks.
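The perfect fit through three calibration peaks can be demonstrated directly. The sketch below fixes d = 1, which is an assumption made for convenience and excludes functions with d = 0. It then solves the three linear equations aM + b − cMM' = M' by Cramer's rule, using exact fractions. The calibration values are invented, and the function names are hypothetical.

```python
from fractions import Fraction


def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_three_peaks(observed, true):
    """Find (a, b, c, d), with d = 1, mapping each observed mass to its true mass."""
    rows = [[Fraction(m), Fraction(1), -Fraction(m) * Fraction(t)]
            for m, t in zip(observed, true)]
    rhs = [Fraction(t) for t in true]
    base = det3(rows)
    if base == 0:
        raise ValueError("no unique fit with d = 1 for these peaks")
    solution = []
    for col in range(3):
        # Cramer's rule: replace one column by the right-hand side.
        replaced = [row[:col] + [rhs[i]] + row[col + 1:]
                    for i, row in enumerate(rows)]
        solution.append(det3(replaced) / base)
    a, b, c = solution
    return a, b, c, Fraction(1)


def calibrate(coefficients, mass):
    a, b, c, d = coefficients
    return (a * mass + b) / (c * mass + d)


# Invented calibration data: three observed masses and their true values.
observed = [1000, 2000, 3000]
true = [1001, 2003, 3002]
coefficients = fit_three_peaks(observed, true)
fitted = [calibrate(coefficients, m) for m in observed]
```

The three calibration peaks are reproduced exactly whatever their values, which is why they give no indication of how good the calibration is. A fourth peak, passed through `calibrate` and compared with its true mass, gives the first real check.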

Copyright N.S.Wedd 2003, 2004, 2011.
Last updated 2011-05-10