Purity Evaluation of Bovine Serum Albumin Products
Fusion of Mass Spectrometry and Raman Spectroscopy
data
Ricardo Cunha
cunha@iuta.de29 November, 2024
Source:vignettes/articles/demo_ms_raman_fusion_bsa.Rmd
demo_ms_raman_fusion_bsa.Rmd
Introduction
Mass Spectrometry (MS) and Raman spectroscopy (Raman) are orthogonal techniques. While MS measures the chemical composition of a given sample, Raman is able to provide structural information of the constituents. This orthogonality is interesting when applying quality evaluation workflows, where both chemical composition and structural integrity are relevant. In this article, we show how MS and Raman can be pre-processed and fused for a combined statistical evaluation to support quality control. As test sets, we use four different bovine serum albumin (BSA) products where we evaluate the quality of the active product ingredient (API).
Pre-processing
The preliminary processing is applied differently for MS and Raman data as the data structure does not allow direct fusion of the data. Therefore, in the following two sub-chapters we describe the pre-processing workflow steps for MS and Raman data.
MS
The four different BSA products were measured in triplicate in full scan MS mode. Giving the following set of .d files.
# Raw data from an Agilent QTOF
basename(ms_files)
[1] "BSA_7AqYf_1.d" "BSA_7AqYf_2.d" "BSA_7AqYf_3.d" "BSA_m595c_1.d"
[5] "BSA_m595c_2.d" "BSA_m595c_3.d" "BSA_utpLV_1.d" "BSA_utpLV_2.d"
[9] "BSA_utpLV_3.d" "BSA_wAgwS_1.d" "BSA_wAgwS_2.d" "BSA_wAgwS_3.d"
[13] "H2O.d" "H2O_2.d"
With the .d files, a MassSpecEngine is created to
pre-process the MS data. Note that the MS files were centroided and
converted to mzML during creation of the MassSpecEngine, using
the installed msconvert
tool from the ProteoWizard in the
background. The analysis replicate names are modified to properly assign
the analyses within each replicate measurement. Two water (H2O) files
were used as blanks for the analyses.
ms <- StreamFind::MassSpecEngine$new(
headers = list(name = "BSA Quality Evaluation Project"),
analyses = ms_files,
centroid = TRUE,
levels = 1
)
# Changes the analysis replicate names
ms$add_replicate_names(sub("_\\d+$", "", ms$get_analysis_names()))
# Overview
ms
MassSpecEngine
File:
NA
Headers:name: BSA Quality Evaluation Project
author: NA
date: 2024-11-29 10:02:08.385106
Workflow:
empty
Analyses:
analysis replicate blank type polarity spectra chromatograms
<char> <char> <char> <char> <char> <num> <num>
1: BSA_7AqYf_1 BSA_7AqYf <NA> MS positive 591 7
2: BSA_7AqYf_2 BSA_7AqYf <NA> MS positive 591 7
3: BSA_7AqYf_3 BSA_7AqYf <NA> MS positive 591 7
4: BSA_m595c_1 BSA_m595c <NA> MS positive 592 7
5: BSA_m595c_2 BSA_m595c <NA> MS positive 591 7
6: BSA_m595c_3 BSA_m595c <NA> MS positive 592 7
7: BSA_utpLV_1 BSA_utpLV <NA> MS positive 591 7
8: BSA_utpLV_2 BSA_utpLV <NA> MS positive 591 7
9: BSA_utpLV_3 BSA_utpLV <NA> MS positive 591 7
10: BSA_wAgwS_1 BSA_wAgwS <NA> MS positive 592 7
11: BSA_wAgwS_2 BSA_wAgwS <NA> MS positive 591 7
12: BSA_wAgwS_3 BSA_wAgwS <NA> MS positive 591 7
13: H2O H2O <NA> MS positive 591 7
14: H2O_2 H2O <NA> MS positive 591 7
concentration
<num>
1: NA
2: NA
3: NA
4: NA
5: NA
6: NA
7: NA
8: NA
9: NA
10: NA
11: NA
12: NA
13: NA
14: NA
# Note that the TIC chromatograms are not yet loaded (see below)
# In plot_spectra_tic, the TIC from spectra headers are used
ms$plot_spectra_tic(colorBy = "replicates")
The data processing workflow is based on applied processing methods within the engine. As pre-processing, the following steps are applied:
- Find the retention time elution of BSA based on the total ion chromatogram (TIC) of each analysis;
- Find the mass-to-charge (m/z) range of BSA for deconvolution of charges;
- Deconvolute charges for estimation of the exact mass, in DA;
The individual processing steps are applied according to ProcessingSettings. Below, the integration of the TIC is shown using the dedicated ProcessingSettings after loading the TIC chromatograms from the analysis files and smoothing using a moving average approach.
# Loads the TIC chromatograms
ms$run(MassSpecSettings_LoadChromatograms_StreamFind(chromatograms = "TIC"))
ms$analyses$has_chromatograms
[1] TRUE
# Smoothing of the TIC chromatogram for improving peak finding
ms$run(MassSpecSettings_SmoothChromatograms_movingaverage(windowSize = 5))
ms$analyses$has_chromatograms
[1] TRUE
ms$plot_chromatograms(colorBy = "replicates")
# Settings for integration of the TIC
ps_ic <- MassSpecSettings_IntegrateChromatograms_StreamFind(
merge = TRUE,
closeByThreshold = 5,
minPeakHeight = 3E6,
minPeakDistance = 3,
minPeakWidth = 20,
maxPeakWidth = 200,
minSN = 5
)
show(ps_ic)
StreamFind::MassSpecSettings_IntegrateChromatograms_StreamFind
engine MassSpec
method IntegrateChromatograms
algorithm StreamFind
version 0.2.0
software StreamFind
developer Ricardo Cunha
contact cunha@iuta.de
link https://odea-project.github.io/StreamFind
doi NA
parameters:
- merge TRUE
- closeByThreshold 5
- minPeakHeight 3e+06
- minPeakDistance 3
- minPeakWidth 20
- maxPeakWidth 200
- minSN 5
ms$run(ps_ic)
ms$plot_chromatograms_peaks(colorBy = "replicates")
The average retention time of the main peak was 363.9823581 seconds with a standard deviation of 0.8060852. The following step is to estimate the m/z range of BSA for deconvolution of charges.
# Average retention time of the main peak
rt_av <- mean(data.table::rbindlist(ms$chromatograms$peaks)$rt)
rt_av
[1] 363.9824
ms$plot_spectra_ms1(
analyses = 1,
rt = rt_av,
sec = 0.5,
colorBy = "replicates",
presence = 0.1,
minIntensity = 5000,
interactive = FALSE
)
According to the MS1 spectra, the m/z from 1200 to 2000 can
be added to the dedicated ProcessingSettings for deconvolution,
as shown below. The workflow is an ordered list of
ProcessingSettings objects for each workflow step, which is
added to the engine. The exact parameters for each method can be
optimized individually by calling the specific method directly. Once the
settings are added, the workflow can be inspected with the
print_workflow()
method.
ms_workflow <- list(
MassSpecSettings_LoadSpectra_StreamFind(
mzmin = 1200,
mzmax = 2000,
rtmin = rt_av - 1,
rtmax = rt_av + 1,
levels = 1
),
MassSpecSettings_ClusterSpectra_StreamFind(
val = "mz",
clustVal = 0.005,
presence = 0.1
),
MassSpecSettings_CalculateSpectraCharges_StreamFind(
roundVal = 15,
relLowCut = 0.2,
absLowCut = 8000
),
MassSpecSettings_DeconvoluteSpectra_StreamFind(),
MassSpecSettings_SmoothSpectra_movingaverage(
windowSize = 15
),
MassSpecSettings_BinSpectra_StreamFind(
binNames = c("rt", "mass"),
binValues = c(5, 50),
byUnit = TRUE,
refBinAnalysis = 1
),
MassSpecSettings_AverageSpectra_StreamFind(),
MassSpecSettings_NormalizeSpectra_minmax(),
MassSpecSettings_NormalizeSpectra_blockweight()
)
# Replaces the settings added to integrate the TIC chromatograms
ms$workflow <- ms_workflow
ms$print_workflow()
1: LoadSpectra (StreamFind)
2: ClusterSpectra (StreamFind)
3: CalculateSpectraCharges (StreamFind)
4: DeconvoluteSpectra (StreamFind)
5: SmoothSpectra (movingaverage)
6: BinSpectra (StreamFind)
7: AverageSpectra (StreamFind)
8: NormalizeSpectra (minmax)
9: NormalizeSpectra (blockweight)
The run_workflow()
method is called to run the data
processing.
ms$run_workflow()
plot_spectra(ms$analyses, colorBy = "analyses")
Raman
The four different BSA products were measured using an online coupling of size exclusion chromatography (SEC) with capillary-enhanced Raman spectroscopy (CERS) through a liquid-core optical fibre flow cell, as reported by Thissen et al. (DOI: 10.1021/acs.analchem.3c03991). The raw spectra was converted to asc format for each retention time. A subset of the background and the main BSA elution peak was selected for pre-processing.
raman <- RamanEngine$new(analyses = raman_files)
# Modifying the replicate names to match the MS analyses
raman$add_replicate_names(unique(ms$get_analysis_names()))
raman
RamanEngine
File:
NA
Headers:name: NA
author: NA
date: 2024-11-29 10:02:20.393593
Workflow:
empty
Analyses:
analysis replicate blank type spectra
<char> <char> <char> <char> <num>
1: BSA_7AqYf_Raman BSA_7AqYf_Raman <NA> raman 4093952
2: BSA_m595c_Raman BSA_m595c_Raman <NA> raman 4093952
3: BSA_utpLV_Raman BSA_utpLV_Raman <NA> raman 4093952
4: BSA_wAgwS_Raman BSA_wAgwS_Raman <NA> raman 4093952
raman$plot_chromatograms(yLab = "Abbundance sum for each spectrum", colorBy = "analyses")
raman$plot_spectra(rt = c(430, 440), colorBy = "analyses")
As for MS data, the workflow for Raman data is also assembled using an ordered list of processing settings, as shown below.
raman_workflow <- list(
RamanSettings_BinSpectra_StreamFind(
binNames = "rt",
binValues = 20,
byUnit = FALSE
),
RamanSettings_SubtractSpectraSection_StreamFind(
sectionVal = "rt",
sectionWindow = c(0, 200)
),
RamanSettings_DeleteSpectraSection_StreamFind(
shiftmin = -100,
shiftmax = 315
),
RamanSettings_SmoothSpectra_savgol(
fl = 11,
forder = 2,
dorder = 0
),
RamanSettings_DeleteSpectraSection_StreamFind(
shiftmin = 315,
shiftmax = 330
),
RamanSettings_DeleteSpectraSection_StreamFind(
shiftmin = 2300,
shiftmax = 2600
),
RamanSettings_CorrectSpectraBaseline_baseline_als(
lambda = 3,
p = 0.015,
maxit = 10
),
RamanSettings_DeleteSpectraSection_StreamFind(
rtmin = 0,
rtmax = 420
),
RamanSettings_DeleteSpectraSection_StreamFind(
rtmin = 430,
rtmax = 900
),
RamanSettings_AverageSpectra_StreamFind(),
RamanSettings_NormalizeSpectra_minmax(),
RamanSettings_NormalizeSpectra_blockweight()
)
raman$workflow <- raman_workflow
raman$print_workflow()
1: BinSpectra (StreamFind)
2: SubtractSpectraSection (StreamFind)
3: DeleteSpectraSection (StreamFind)
4: SmoothSpectra (savgol)
5: DeleteSpectraSection (StreamFind)
6: DeleteSpectraSection (StreamFind)
7: CorrectSpectraBaseline (baseline_als)
8: DeleteSpectraSection (StreamFind)
9: DeleteSpectraSection (StreamFind)
10: AverageSpectra (StreamFind)
11: NormalizeSpectra (minmax)
12: NormalizeSpectra (blockweight)
A particularity of the workflow for Raman data is that the background
subtraction (workflow step number 3) is performed from a time region
(i.e., between 0 and 3 minutes) before the main BSA peak, which is from
approximately from 5 to 8 minutes. The workflow is then applied using
the run_workflow()
method.
raman$run_workflow()
raman$plot_spectra_baseline()
raman$plot_spectra()
Fusion
The fusion of MS and Raman data is performed by combining the processed data from both engines. The data is then merged into a single data frame for statistical evaluation using a StatisticEngine, as shown in the next chapter.
ms_mat <- get_spectra_matrix(ms$analyses)
raman_mat <- get_spectra_matrix(raman$analyses)
fused_mat <- cbind(ms_mat, raman_mat)
attr(fused_mat, "xaxis.name") = "Keys"
attr(fused_mat, "xaxis.values") = seq_len(ncol(fused_mat))
str(fused_mat)
num [1:4, 1:880] 0.00593 0.00808 0 0 0.00519 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:4] "BSA_7AqYf" "BSA_m595c" "BSA_utpLV" "BSA_wAgwS"
..$ : chr [1:880] "364-65516" "364-65566" "364-65616" "364-65666" ...
- attr(*, "xaxis.name")= chr "Keys"
- attr(*, "xaxis.values")= int [1:880] 1 2 3 4 5 6 7 8 9 10 ...
Statistic analysis
A statistic analysis is performed with the StatisticEngine for evaluation of the differences between the BSA products.
PCA
The fused data is used to create a PCA model with the StatisticEngine using processing settings based on the mdatools package.
stat <- StatisticEngine$new(analyses = fused_mat)
# Plots the data from the added fused data.frame/matrix
plot_data(stat$analyses)
# Note that scale is set to FALSE to avoid scaling the data
# as blockweight normalization was applied for each dataset
stat$run(
StatisticSettings_MakeModel_pca_mdatools(
center = TRUE,
ncomp = 2,
alpha = 0.05,
gamma = 0.02
)
)
summary(stat$model) # summary method is from mdatools
Summary for PCA model (class pca)
Type of limits: ddmoments
Alpha: 0.05
Gamma: 0.02
Eigenvals Expvar Cumexpvar Nq Nh
Comp 1 0.005 65.82 65.82 8 3
Comp 2 0.001 20.79 86.61 2 8
plot(stat$model)
# Adds a color key to the loadings plot to evaluate the contribution of each dataset (i.e., MS and Raman)
plot_loadings(stat$analyses, colorKey = rep(c("MS", "Raman"), c(ncol(ms_mat), ncol(raman_mat))))
# The mdatools S3 class model object is obtained via the active binding of the StatisticEngine
plot(stat$model$model, show.labels = TRUE) # plot method from mdatools
MCR purity
An alternative approach for the statistic analysis is the MCR pure model (based on mdatools R package) to evaluate the purity level in the analyses.
# Note that the model in the engine will be overwritten by the MCR model
stat$run(StatisticSettings_MakeModel_mcrpure_mdatools(ncomp = 2, offset = 0.05))
summary(stat$model)
Summary for MCR Purity case (class mcrpure)
Expvar Cumexpvar Varindex Purity
Comp 1 43.24 43.24 17 17.715
Comp 2 55.98 99.21 19 25.003
plot(stat$model)
# Gets the most pure features for each component
get_model_data(stat$model)$purity
vars vals feature result
<int> <num> <char> <char>
1: 17 17.71518 364-66316 model
2: 19 25.00344 364-66416 model