Comparison of different bovine serum albumin products
Fusion of mass spectrometry and raman spectroscopy
data
Ricardo Cunha
cunha@iuta.de01 August, 2024
Source:vignettes/articles/demo_ms_raman_fusion_bsa.Rmd
demo_ms_raman_fusion_bsa.Rmd
Introduction
Mass Spectrometry (MS) and Raman spectroscopy (Raman) are orthogonal techniques. While MS measures the chemical composition of a given sample, Raman is able to provide structural information of the constituents. This orthogonality is interesting when applying quality evaluation workflows, where both chemical composition and structural integrity are relevant. In this article, we show how MS and Raman can be pre-processed and fused for a combined statistical evaluation to support quality control. As test sets, we use four different bovine serum albumin (BSA) products.
Pre-processing
The preliminary processing is applied differently for MS and Raman as the raw data structure does not allow direct fusion of the data. Therefore, in the following two sub-chapters we describe the pre-processing workflow steps for MS and Raman data.
MS
The four different BSA products were measured in triplicate in full scan MS mode. Giving the following set of .d files.
# Raw data from an Agilent QTOF
basename(ms_files_d)
[1] "BSA_7AqYf_1.d" "BSA_7AqYf_2.d" "BSA_7AqYf_3.d" "BSA_m595c_1.d"
[5] "BSA_m595c_2.d" "BSA_m595c_3.d" "BSA_utpLV_1.d" "BSA_utpLV_2.d"
[9] "BSA_utpLV_3.d" "BSA_wAgwS_1.d" "BSA_wAgwS_2.d" "BSA_wAgwS_3.d"
[13] "H2O.d"
The first step is conversion to open source mzML format using MSConvert from ProteoWizard CLI, as shwon below.
# Add option to centroid MS1 data before conversion
optList <- list(filter = "peakPicking vendor msLevel=1")
# convert_ms_files(ms_files_d, outputFormat = "mzML", optList = optList)
After conversion, the mzML files are created in the same directory. The converted files are listed below.
basename(ms_files)
[1] "BSA_7AqYf_1.mzML" "BSA_7AqYf_2.mzML" "BSA_7AqYf_3.mzML" "BSA_m595c_1.mzML"
[5] "BSA_m595c_2.mzML" "BSA_m595c_3.mzML" "BSA_utpLV_1.mzML" "BSA_utpLV_2.mzML"
[9] "BSA_utpLV_3.mzML" "BSA_wAgwS_1.mzML" "BSA_wAgwS_2.mzML" "BSA_wAgwS_3.mzML"
[13] "H2O.mzML"
With the mzML files, a new MassSpecEngine is created to start the MS pre-processing project. Below the engine is created and the analysis replicate names are modified to properly assign the analyses within each replicate measurement.
ms <- MassSpecEngine$new(files = ms_files, headers = list(name = "BSA Quality Evaluation Project"))
# Changes the analysis replicate names
ms$add_replicate_names(sub("_\\d+$", "", ms$get_analysis_names()))
# Overview
ms
MassSpecEngine
name BSA Quality Evaluation Project
author NA
file NA
date 2024-08-01 10:03:40.418132
Workflow empty
Analyses
analysis replicate blank type polarity spectra chromatograms
<char> <char> <char> <char> <char> <num> <num>
1: BSA_7AqYf_1 BSA_7AqYf <NA> MS positive 591 7
2: BSA_7AqYf_2 BSA_7AqYf <NA> MS positive 591 7
3: BSA_7AqYf_3 BSA_7AqYf <NA> MS positive 591 7
4: BSA_m595c_1 BSA_m595c <NA> MS positive 592 7
5: BSA_m595c_2 BSA_m595c <NA> MS positive 591 7
6: BSA_m595c_3 BSA_m595c <NA> MS positive 592 7
7: BSA_utpLV_1 BSA_utpLV <NA> MS positive 591 7
8: BSA_utpLV_2 BSA_utpLV <NA> MS positive 591 7
9: BSA_utpLV_3 BSA_utpLV <NA> MS positive 591 7
10: BSA_wAgwS_1 BSA_wAgwS <NA> MS positive 592 7
11: BSA_wAgwS_2 BSA_wAgwS <NA> MS positive 591 7
12: BSA_wAgwS_3 BSA_wAgwS <NA> MS positive 591 7
13: H2O H2O <NA> MS positive 591 7
Results empty
ms$plot_spectra_tic(colorBy = "replicates")
The data processing workflow is based on applied processing methods within the engine. As pre-processing, the following steps are applied:
- Find the retention time elution of BSA based on the total ion chromatogram (TIC) of each analysis;
- Find the mass-to-charge (m/z) range of BSA for deconvolution of charges;
- Deconvolute charges for estimation of the exact mass, in DA;
The individual processing steps are applied according to ProcessingSettings. Below, the integration of the TIC is shown using the dedicated ProcessingSettings after loading the TIC chromatograms from the analysis files and smoothing using a moving average approach.
ms$load_chromatograms("TIC")
# Smoothing of the TIC chromatogram for improving peak finding
ms$process(MassSpecSettings_SmoothChromatograms_movingaverage(windowSize = 5))
# Settings for integration of the TIC
ps_ic <- MassSpecSettings_IntegrateChromatograms_StreamFind(
merge = TRUE,
closeByThreshold = 5,
minPeakHeight = 3E6,
minPeakDistance = 3,
minPeakWidth = 20,
maxPeakWidth = 200,
minSN = 5
)
ps_ic
ProcessingSettings
engine MassSpec
call IntegrateChromatograms
algorithm StreamFind
version 0.2.0
software StreamFind
developer Ricardo Cunha
contact cunha@iuta.de
link https://odea-project.github.io/StreamFind
doi NA
parameters:
- merge TRUE
- closeByThreshold 5
- minPeakHeight 3e+06
- minPeakDistance 3
- minPeakWidth 20
- maxPeakWidth 200
- minSN 5
ms$process(ps_ic)
ms$plot_chromatograms_peaks(colorBy = "replicates")
The average retention time of the main peak was NA seconds with a standard deviation of NA. The following step is to estimate the m/z range of BSA for deconvolution of charges. For this, the MS1 spectra of the peak retention time +/- 0.1 seconds is plotted.
ms$plot_spectra_ms1(rt = mean(rbindlist(ms$chromatograms_peaks)$rt), sec = 0.5,
colorBy = "replicates", presence = 0.1, minIntensity = 1000, interactive = FALSE
)
According to the MS1 spectra, the m/z from 1200 to 2000 can be added to the dedicated ProcessingSettings for deconvolution, as shown below. The first step is to load from the mzML files the traces within the retention time and m/z ranges selected.
# Average retention time of the main peak
av_rt <- mean(rbindlist(ms$chromatograms_peaks)$rt)
# Range of m/z and rt for deconvolution
spectra_ranges <- data.table("mzmin" = 1200, "mzmax" = 2000, "rtmin" = av_rt - 1, "rtmax" = av_rt + 1)
# Load the spectra within the ranges
ms$load_spectra(mz = spectra_ranges)
Below we create an ordered list of ProcessingSettings
objects for each processing method to be applied and add the list to the
engine. The exact parameters for each method can be optimized
individually by calling the specific method directly. Once the settings
are added, the workflow can be inspected with the
print_workflow()
method.
ms_workflow_settings <- list(
MassSpecSettings_ClusterSpectra_StreamFind(val = "mz", clustVal = 0.01, presence = 0.2),
MassSpecSettings_CalculateSpectraCharges_StreamFind(roundVal = 15, relLowCut = 0.2, absLowCut = 8000),
MassSpecSettings_DeconvoluteSpectra_StreamFind(),
MassSpecSettings_SmoothSpectra_movingaverage(windowSize = 15),
MassSpecSettings_BinSpectra_StreamFind(bins = list("rt" = 5, "mass" = 20), refBinAnalysis = 1),
MassSpecSettings_AverageSpectra_StreamFind(),
MassSpecSettings_NormalizeSpectra_minmax(),
MassSpecSettings_NormalizeSpectra_blockweight()
)
# Remove the settings added to integrate the TIC chromatograms
ms$remove_settings()
Removed all processing settings!
ms$add_settings(ms_workflow_settings)
ms$print_workflow()
Workflow
1: ClusterSpectra (StreamFind)
2: CalculateSpectraCharges (StreamFind)
3: DeconvoluteSpectra (StreamFind)
4: SmoothSpectra (movingaverage)
5: BinSpectra (StreamFind)
6: AverageSpectra (StreamFind)
7: NormalizeSpectra (minmax)
8: NormalizeSpectra (blockweight)
The run_workflow()
method is called to run the data
processing. The workflow will be applied in the order the methods were
added to the engine.
ms$run_workflow()
ms$plot_spectra(colorBy = "analyses")
Raman
The four different BSA products were measured using an online coupling of size exclusion chromatography (SEC) with capillary-enhanced Raman spectroscopy (CERS) through a liquid-core optical fibre flow cell, as reported by Thissen et al. (DOI: 10.1021/acs.analchem.3c03991). The raw spectra was converted to asc format for each retention time. A subset of the background and the main BSA elution peak was selected for pre-processing. The first pre-processing step is to merge the individual asc files into a main file for each BSA product, as shown below.
# Engine with all asc files for each spectrum and each BSA product
# raman <- RamanEngine$new(raman_files)
# Folder names of each BSA product
# base_dirs <- basename(dirname(raman_files))
# The folder name is used as replicate name, which is also used as file name for the unified data file
# raman$add_replicate_names(base_dirs)
# The spectra are merged into a time series for each BSA product
# raman$merge_spectra_time_series()
A new engine is created with the unified file for each BSA product and then plotted as a time series.
raman <- RamanEngine$new(raman_unified_files)
# Modifying the replicate names to match the MS analyses
raman$add_replicate_names(unique(ms$get_replicate_names())[1:4])
raman
RamanEngine
name NA
author NA
file NA
date 2024-08-01 10:03:58.982677
Workflow empty
Analyses
analysis replicate blank spectra traces
<char> <char> <char> <num> <num>
1: BSA_7AqYf_Raman BSA_7AqYf <NA> 3998 4093952
2: BSA_m595c_Raman BSA_m595c <NA> 3998 4093952
3: BSA_utpLV_Raman BSA_utpLV <NA> 3998 4093952
4: BSA_wAgwS_Raman BSA_wAgwS <NA> 3998 4093952
Results empty
raman$plot_chromatograms(yLab = "Abbundance sum for each spectrum", colorBy = "analyses")
raman$plot_spectra(rt = c(5.5, 8), colorBy = "analyses")
As for MS data, the workflow for Raman data is also assembled using an ordered list of processing settings, as shown below.
raman_workflow_settings <- list(
RamanSettings_BinSpectra_StreamFind(unitsVal = "rt", unitsNumber = 19),
#RamanSettings_NormalizeSpectra_minmax(),
RamanSettings_SubtractSpectraSection_StreamFind(sectionVal = "rt", sectionWindow = c(0, 3)), # time window
RamanSettings_DeleteSpectraSection_StreamFind(list("shift" = c(-100, 315))),
RamanSettings_SmoothSpectra_savgol(fl = 11, forder = 2, dorder = 0),
RamanSettings_DeleteSpectraSection_StreamFind(list("shift" = c(315, 330))),
RamanSettings_DeleteSpectraSection_StreamFind(list("shift" = c(2300, 2600))),
#RamanSettings_CorrectSpectraBaseline_airpls(lambda = 60, differences = 1),
RamanSettings_CorrectSpectraBaseline_baseline(method = "als", args = list(lambda = 3, p = 0.015, maxit = 10)),
RamanSettings_DeleteSpectraSection_StreamFind(list("rt" = c(0, 5.5))),
RamanSettings_DeleteSpectraSection_StreamFind(list("rt" = c(8, 11))),
RamanSettings_NormalizeSpectra_minmax(),
RamanSettings_NormalizeSpectra_blockweight()
)
raman$add_settings(raman_workflow_settings)
raman$print_workflow()
Workflow
1: BinSpectra (StreamFind)
2: SubtractSpectraSection (StreamFind)
3: DeleteSpectraSection (StreamFind)
4: SmoothSpectra (savgol)
5: DeleteSpectraSection (StreamFind)
6: DeleteSpectraSection (StreamFind)
7: CorrectSpectraBaseline (baseline)
8: DeleteSpectraSection (StreamFind)
9: DeleteSpectraSection (StreamFind)
10: NormalizeSpectra (minmax)
11: NormalizeSpectra (blockweight)
A particularity of the workflow for Raman data is that the background
subtraction (workflow step number 3) is performed from a time region
(i.e., between 0 and 3 minutes) before the main BSA peak, which is from
approximately from 5 to 8 minutes. The workflow is then applied using
the run_workflow()
method.
raman$run_workflow()
raman$plot_spectra_baseline()
raman$plot_spectra()
Fusion
The fusion of MS and Raman data is performed by combining the processed data from both engines. The data is then merged into a single data frame for statistical evaluation using a MachineLearningEngine, which is under development.
rSpec <- raman$get_spectra()
mrSpec <- matrix(rSpec$intensity, nrow = length(unique(rSpec$replicate)), ncol = length(unique(rSpec$shift)), byrow = TRUE, dimnames = list(as.character(unique(rSpec$replicate)), as.character(unique(rSpec$shift))))
mSpec <- ms$get_spectra()
mmSpec <- matrix(mSpec$intensity, nrow = length(unique(mSpec$replicate)), ncol = length(unique(mSpec$bins)), byrow = TRUE, dimnames = list(unique(mSpec$replicate), unique(mSpec$bins)))
fusedMat <- cbind(mmSpec, mrSpec)
attr(fusedMat, "xaxis.name") = "Keys"
attr(fusedMat, "xaxis.values") = seq_len(ncol(fusedMat))
cl <- StreamFind:::.get_colors(rownames(fusedMat))
fig <- plot_ly()
xVal <- seq_len(length(fusedMat[1, ]))
for (i in seq_len(nrow(fusedMat))) {
fig <- fig %>% add_trace(
x = xVal,
y = fusedMat[i, ],
type = "scatter", mode = "lines",
line = list(width = 0.5, color = unname(cl[i])),
name = names(cl)[i],
legendgroup = names(cl)[i],
showlegend = TRUE
)
}
xaxis <- list(
linecolor = toRGB("black"),
linewidth = 2, title = "Bins",
titlefont = list(size = 12, color = "black")
)
yaxis <- list(
linecolor = toRGB("black"),
linewidth = 2, title = "Intensity",
titlefont = list(size = 12, color = "black")
)
fig <- fig %>% plotly::layout(xaxis = xaxis, yaxis = yaxis)
fig
Statistic analysis
A statistic analysis is performed with the StatisticEngine for evaluation of the differences between the BSA products.
PCA
The fused data is used to create a PCA model with the StatisticEngine using processing settings based on the mdatools package.
stat <- StatisticEngine$new(data = fusedMat)
stat$process(StatisticSettings_MakeModel_pca_mdatools(ncomp = 2, alpha = 0.05, gamma = 0.02))
summary(stat$model) # summary method is from mdatools
Summary for PCA model (class pca)
Type of limits: ddmoments
Alpha: 0.05
Gamma: 0.02
Eigenvals Expvar Cumexpvar Nq Nh
Comp 1 0.227 95.47 95.47 97 91
Comp 2 0.005 2.01 97.48 6 9
stat$plot_model_scores()
# the mdatools S3 class model object is obtained via the active binding of the StatisticEngine
plot(stat$model) # plot method from mdatools
A clear separation of the BSA wAgwS product from the other products is observed in the PCA scores plot. The loadings plot shows the contribution of the MS and Raman data to the separation.
MCR purity
Another approach for the statistic analysis is to use the MCR pure model (based on mdatools R package) to evaluate the purity level in the analyses.
# Note that the model in the engine will be overwritten by the new mcr pure model
stat$process(StatisticSettings_MakeModel_mcrpure_mdatools(ncomp = 2, offset = 0.05))
summary(stat$model)
Summary for MCR Purity case (class mcrpure)
Expvar Cumexpvar Varindex Purity
Comp 1 47.72 47.72 222 26.69
Comp 2 49.47 97.18 48 53.08
stat$plot_model_resolved_spectra()
A clear separation of the BSA product wAgwS from the other products is observed in the resolved spectra plot. This is expected as the wAgwS is fatty-acid free in contrast to the other BSA products. This difference would not be so evident with MS data only. In contrast, the contribution of fatty-acid contamination makes the identification of BSA in the Raman data more difficult. Yet, BSA presence in the wAgwS BSA product is evident with MS data. Using fused data from MS and Raman both compositional and structural information can be combined for a more comprehensive quality evaluation.