Skip to contents




Introduction

Mass Spectrometry (MS) and Raman spectroscopy (Raman) are orthogonal techniques. While MS measures the chemical composition of a given sample, Raman is able to provide structural information of the constituents. This orthogonality is interesting when applying quality evaluation workflows, where both chemical composition and structural integrity are relevant. In this article, we show how MS and Raman can be pre-processed and fused for a combined statistical evaluation to support quality control. As test sets, we use four different bovine serum albumin (BSA) products.

Pre-processing

The preliminary processing is applied differently for MS and Raman as the raw data structure does not allow direct fusion of the data. Therefore, in the following two sub-chapters we describe the pre-processing workflow steps for MS and Raman data.

MS

The four different BSA products were measured in triplicate in full scan MS mode. Giving the following set of .d files.

# Raw data from an Agilent QTOF
basename(ms_files_d)
 [1] "BSA_7AqYf_1.d" "BSA_7AqYf_2.d" "BSA_7AqYf_3.d" "BSA_m595c_1.d"
 [5] "BSA_m595c_2.d" "BSA_m595c_3.d" "BSA_utpLV_1.d" "BSA_utpLV_2.d"
 [9] "BSA_utpLV_3.d" "BSA_wAgwS_1.d" "BSA_wAgwS_2.d" "BSA_wAgwS_3.d"
[13] "H2O.d"        

The first step is conversion to open source mzML format using MSConvert from ProteoWizard CLI, as shwon below.

# Add option to centroid MS1 data before conversion
optList <- list(filter = "peakPicking vendor msLevel=1")

# convert_ms_files(ms_files_d, outputFormat = "mzML", optList = optList)

After conversion, the mzML files are created in the same directory. The converted files are listed below.

basename(ms_files)
 [1] "BSA_7AqYf_1.mzML" "BSA_7AqYf_2.mzML" "BSA_7AqYf_3.mzML" "BSA_m595c_1.mzML"
 [5] "BSA_m595c_2.mzML" "BSA_m595c_3.mzML" "BSA_utpLV_1.mzML" "BSA_utpLV_2.mzML"
 [9] "BSA_utpLV_3.mzML" "BSA_wAgwS_1.mzML" "BSA_wAgwS_2.mzML" "BSA_wAgwS_3.mzML"
[13] "H2O.mzML"        

With the mzML files, a new MassSpecEngine is created to start the MS pre-processing project. Below the engine is created and the analysis replicate names are modified to properly assign the analyses within each replicate measurement.

ms <- MassSpecEngine$new(files = ms_files, headers = list(name = "BSA Quality Evaluation Project"))

# Changes the analysis replicate names
ms$add_replicate_names(sub("_\\d+$", "", ms$get_analysis_names()))

# Overview
ms

MassSpecEngine
name          BSA Quality Evaluation Project
author        NA
file          NA
date          2024-08-01 10:03:40.418132

Workflow empty 

Analyses
       analysis replicate  blank   type polarity spectra chromatograms
         <char>    <char> <char> <char>   <char>   <num>         <num>
 1: BSA_7AqYf_1 BSA_7AqYf   <NA>     MS positive     591             7
 2: BSA_7AqYf_2 BSA_7AqYf   <NA>     MS positive     591             7
 3: BSA_7AqYf_3 BSA_7AqYf   <NA>     MS positive     591             7
 4: BSA_m595c_1 BSA_m595c   <NA>     MS positive     592             7
 5: BSA_m595c_2 BSA_m595c   <NA>     MS positive     591             7
 6: BSA_m595c_3 BSA_m595c   <NA>     MS positive     592             7
 7: BSA_utpLV_1 BSA_utpLV   <NA>     MS positive     591             7
 8: BSA_utpLV_2 BSA_utpLV   <NA>     MS positive     591             7
 9: BSA_utpLV_3 BSA_utpLV   <NA>     MS positive     591             7
10: BSA_wAgwS_1 BSA_wAgwS   <NA>     MS positive     592             7
11: BSA_wAgwS_2 BSA_wAgwS   <NA>     MS positive     591             7
12: BSA_wAgwS_3 BSA_wAgwS   <NA>     MS positive     591             7
13:         H2O       H2O   <NA>     MS positive     591             7

Results empty 
ms$plot_spectra_tic(colorBy = "replicates")

Total ion chromatograms for each analysis replicate.

The data processing workflow is based on applied processing methods within the engine. As pre-processing, the following steps are applied:

  1. Find the retention time elution of BSA based on the total ion chromatogram (TIC) of each analysis;
  2. Find the mass-to-charge (m/z) range of BSA for deconvolution of charges;
  3. Deconvolute charges for estimation of the exact mass, in DA;

The individual processing steps are applied according to ProcessingSettings. Below, the integration of the TIC is shown using the dedicated ProcessingSettings after loading the TIC chromatograms from the analysis files and smoothing using a moving average approach.

ms$load_chromatograms("TIC")
# Smoothing of the TIC chromatogram for improving peak finding
ms$process(MassSpecSettings_SmoothChromatograms_movingaverage(windowSize = 5))
# Settings for integration of the TIC
ps_ic <- MassSpecSettings_IntegrateChromatograms_StreamFind(
  merge = TRUE,
  closeByThreshold = 5,
  minPeakHeight = 3E6,
  minPeakDistance = 3,
  minPeakWidth = 20,
  maxPeakWidth = 200,
  minSN = 5
)

ps_ic

 ProcessingSettings 
 engine       MassSpec
 call         IntegrateChromatograms
 algorithm    StreamFind
 version      0.2.0
 software     StreamFind
 developer    Ricardo Cunha
 contact      cunha@iuta.de
 link         https://odea-project.github.io/StreamFind
 doi          NA

 parameters: 
  -  merge TRUE 
  -  closeByThreshold 5 
  -  minPeakHeight 3e+06 
  -  minPeakDistance 3 
  -  minPeakWidth 20 
  -  maxPeakWidth 200 
  -  minSN 5 
ms$process(ps_ic)
ms$plot_chromatograms_peaks(colorBy = "replicates")

TIC with integrated peaks.

The average retention time of the main peak was NA seconds with a standard deviation of NA. The following step is to estimate the m/z range of BSA for deconvolution of charges. For this, the MS1 spectra of the peak retention time +/- 0.1 seconds is plotted.

ms$plot_spectra_ms1(rt = mean(rbindlist(ms$chromatograms_peaks)$rt), sec = 0.5,
  colorBy = "replicates", presence = 0.1, minIntensity = 1000, interactive = FALSE
)
Spectra of BSA for each analysis replicate at the peak maximum.

Spectra of BSA for each analysis replicate at the peak maximum.

According to the MS1 spectra, the m/z from 1200 to 2000 can be added to the dedicated ProcessingSettings for deconvolution, as shown below. The first step is to load from the mzML files the traces within the retention time and m/z ranges selected.

# Average retention time of the main peak
av_rt <- mean(rbindlist(ms$chromatograms_peaks)$rt)

# Range of m/z and rt for deconvolution
spectra_ranges <- data.table("mzmin" = 1200, "mzmax" = 2000, "rtmin" = av_rt - 1, "rtmax" = av_rt + 1)

# Load the spectra within the ranges
ms$load_spectra(mz = spectra_ranges)

Below we create an ordered list of ProcessingSettings objects for each processing method to be applied and add the list to the engine. The exact parameters for each method can be optimized individually by calling the specific method directly. Once the settings are added, the workflow can be inspected with the print_workflow() method.

ms_workflow_settings <- list(
  MassSpecSettings_ClusterSpectra_StreamFind(val = "mz", clustVal = 0.01, presence = 0.2),
  MassSpecSettings_CalculateSpectraCharges_StreamFind(roundVal = 15, relLowCut = 0.2, absLowCut = 8000),
  MassSpecSettings_DeconvoluteSpectra_StreamFind(),
  MassSpecSettings_SmoothSpectra_movingaverage(windowSize = 15),
  MassSpecSettings_BinSpectra_StreamFind(bins = list("rt" = 5, "mass" = 20), refBinAnalysis = 1),
  MassSpecSettings_AverageSpectra_StreamFind(),
  MassSpecSettings_NormalizeSpectra_minmax(),
  MassSpecSettings_NormalizeSpectra_blockweight()
)

# Remove the settings added to integrate the TIC chromatograms
ms$remove_settings()
Removed all processing settings! 
ms$add_settings(ms_workflow_settings)

ms$print_workflow()

Workflow
 1: ClusterSpectra (StreamFind)
 2: CalculateSpectraCharges (StreamFind)
 3: DeconvoluteSpectra (StreamFind)
 4: SmoothSpectra (movingaverage)
 5: BinSpectra (StreamFind)
 6: AverageSpectra (StreamFind)
 7: NormalizeSpectra (minmax)
 8: NormalizeSpectra (blockweight)

The run_workflow() method is called to run the data processing. The workflow will be applied in the order the methods were added to the engine.

ms$run_workflow()
ms$plot_spectra(colorBy = "analyses")

Deconvoluted and pre-processed spectra for each analysis.

Raman

The four different BSA products were measured using an online coupling of size exclusion chromatography (SEC) with capillary-enhanced Raman spectroscopy (CERS) through a liquid-core optical fibre flow cell, as reported by Thissen et al. (DOI: 10.1021/acs.analchem.3c03991). The raw spectra was converted to asc format for each retention time. A subset of the background and the main BSA elution peak was selected for pre-processing. The first pre-processing step is to merge the individual asc files into a main file for each BSA product, as shown below.

# Engine with all asc files for each spectrum and each BSA product
# raman <- RamanEngine$new(raman_files)

# Folder names of each BSA product
# base_dirs <- basename(dirname(raman_files))

# The folder name is used as replicate name, which is also used as file name for the unified data file
# raman$add_replicate_names(base_dirs)

# The spectra are merged into a time series for each BSA product
# raman$merge_spectra_time_series()

A new engine is created with the unified file for each BSA product and then plotted as a time series.

raman <- RamanEngine$new(raman_unified_files)

# Modifying the replicate names to match the MS analyses
raman$add_replicate_names(unique(ms$get_replicate_names())[1:4])

raman

RamanEngine
name          NA
author        NA
file          NA
date          2024-08-01 10:03:58.982677

Workflow empty 

Analyses
          analysis replicate  blank spectra  traces
            <char>    <char> <char>   <num>   <num>
1: BSA_7AqYf_Raman BSA_7AqYf   <NA>    3998 4093952
2: BSA_m595c_Raman BSA_m595c   <NA>    3998 4093952
3: BSA_utpLV_Raman BSA_utpLV   <NA>    3998 4093952
4: BSA_wAgwS_Raman BSA_wAgwS   <NA>    3998 4093952

Results empty 
raman$plot_chromatograms(yLab = "Abbundance sum for each spectrum", colorBy = "analyses")

Time series chromatogram for each BSA product. Each data point is a Raman spectrum at a specific retention time.

raman$plot_spectra(rt = c(5.5, 8), colorBy = "analyses")

Raw averaged Raman spectra for each BSA product between minute 5.5 and 8.

As for MS data, the workflow for Raman data is also assembled using an ordered list of processing settings, as shown below.

raman_workflow_settings <- list(
  RamanSettings_BinSpectra_StreamFind(unitsVal = "rt", unitsNumber = 19),
  #RamanSettings_NormalizeSpectra_minmax(),
  RamanSettings_SubtractSpectraSection_StreamFind(sectionVal = "rt", sectionWindow = c(0, 3)), # time window
  RamanSettings_DeleteSpectraSection_StreamFind(list("shift" = c(-100, 315))),
  RamanSettings_SmoothSpectra_savgol(fl = 11, forder = 2, dorder = 0),
  RamanSettings_DeleteSpectraSection_StreamFind(list("shift" = c(315, 330))),
  RamanSettings_DeleteSpectraSection_StreamFind(list("shift" = c(2300, 2600))),
  #RamanSettings_CorrectSpectraBaseline_airpls(lambda = 60, differences = 1),
  RamanSettings_CorrectSpectraBaseline_baseline(method = "als", args = list(lambda = 3, p = 0.015, maxit = 10)),
  RamanSettings_DeleteSpectraSection_StreamFind(list("rt" = c(0, 5.5))),
  RamanSettings_DeleteSpectraSection_StreamFind(list("rt" = c(8, 11))),
  RamanSettings_NormalizeSpectra_minmax(),
  RamanSettings_NormalizeSpectra_blockweight()
)

raman$add_settings(raman_workflow_settings)

raman$print_workflow()

Workflow
 1: BinSpectra (StreamFind)
 2: SubtractSpectraSection (StreamFind)
 3: DeleteSpectraSection (StreamFind)
 4: SmoothSpectra (savgol)
 5: DeleteSpectraSection (StreamFind)
 6: DeleteSpectraSection (StreamFind)
 7: CorrectSpectraBaseline (baseline)
 8: DeleteSpectraSection (StreamFind)
 9: DeleteSpectraSection (StreamFind)
 10: NormalizeSpectra (minmax)
 11: NormalizeSpectra (blockweight)

A particularity of the workflow for Raman data is that the background subtraction (workflow step number 3) is performed from a time region (i.e., between 0 and 3 minutes) before the main BSA peak, which is from approximately from 5 to 8 minutes. The workflow is then applied using the run_workflow() method.

raman$run_workflow()
raman$plot_spectra_baseline()

Baseline correction for each spectra.

raman$plot_spectra()

Processed Raman spectra for each BSA product.

Fusion

The fusion of MS and Raman data is performed by combining the processed data from both engines. The data is then merged into a single data frame for statistical evaluation using a MachineLearningEngine, which is under development.

rSpec <- raman$get_spectra()
mrSpec <- matrix(rSpec$intensity, nrow = length(unique(rSpec$replicate)), ncol = length(unique(rSpec$shift)), byrow = TRUE, dimnames = list(as.character(unique(rSpec$replicate)), as.character(unique(rSpec$shift))))

mSpec <- ms$get_spectra()
mmSpec <- matrix(mSpec$intensity, nrow = length(unique(mSpec$replicate)), ncol = length(unique(mSpec$bins)), byrow = TRUE, dimnames = list(unique(mSpec$replicate), unique(mSpec$bins)))

fusedMat <- cbind(mmSpec, mrSpec)

attr(fusedMat, "xaxis.name") = "Keys"
attr(fusedMat, "xaxis.values") = seq_len(ncol(fusedMat))
cl <- StreamFind:::.get_colors(rownames(fusedMat))
fig <- plot_ly()
xVal <- seq_len(length(fusedMat[1, ]))
for (i in seq_len(nrow(fusedMat))) {
  fig <- fig %>% add_trace(
    x = xVal,
    y = fusedMat[i, ],
    type = "scatter", mode = "lines",
    line = list(width = 0.5, color = unname(cl[i])),
    name = names(cl)[i],
    legendgroup = names(cl)[i],
    showlegend = TRUE
  )
}
xaxis <- list(
  linecolor = toRGB("black"),
  linewidth = 2, title = "Bins",
  titlefont = list(size = 12, color = "black")
)
yaxis <- list(
  linecolor = toRGB("black"),
  linewidth = 2, title = "Intensity",
  titlefont = list(size = 12, color = "black")
)
fig <- fig %>% plotly::layout(xaxis = xaxis, yaxis = yaxis)
fig

Statistic analysis

A statistic analysis is performed with the StatisticEngine for evaluation of the differences between the BSA products.

PCA

The fused data is used to create a PCA model with the StatisticEngine using processing settings based on the mdatools package.

stat <- StatisticEngine$new(data = fusedMat)
stat$process(StatisticSettings_MakeModel_pca_mdatools(ncomp = 2, alpha = 0.05, gamma = 0.02))
summary(stat$model) # summary method is from mdatools

Summary for PCA model (class pca)
Type of limits: ddmoments
Alpha: 0.05
Gamma: 0.02

       Eigenvals Expvar Cumexpvar Nq Nh
Comp 1     0.227  95.47     95.47 97 91
Comp 2     0.005   2.01     97.48  6  9
stat$plot_model_scores()

PCA scores plot for the fused data.

stat$plot_model_loadings(colorKey = rep(c("MS", "Raman"), c(ncol(mmSpec), ncol(mrSpec))))

PCA loadings plot for the fused data.

# the mdatools S3 class model object is obtained via the active binding of the StatisticEngine
plot(stat$model) # plot method from mdatools
PCA summary plot using the original mdatools R package.

PCA summary plot using the original mdatools R package.

A clear separation of the BSA wAgwS product from the other products is observed in the PCA scores plot. The loadings plot shows the contribution of the MS and Raman data to the separation.

MCR purity

Another approach for the statistic analysis is to use the MCR pure model (based on mdatools R package) to evaluate the purity level in the analyses.

# Note that the model in the engine will be overwritten by the new mcr pure model
stat$process(StatisticSettings_MakeModel_mcrpure_mdatools(ncomp = 2, offset = 0.05))
summary(stat$model)

Summary for MCR Purity case (class mcrpure)
       Expvar Cumexpvar Varindex Purity
Comp 1  47.72     47.72      222  26.69
Comp 2  49.47     97.18       48  53.08
stat$plot_model_resolved_spectra()

MCR pure model resolved spectra and the original spectra from the fused data.

A clear separation of the BSA product wAgwS from the other products is observed in the resolved spectra plot. This is expected as the wAgwS is fatty-acid free in contrast to the other BSA products. This difference would not be so evident with MS data only. In contrast, the contribution of fatty-acid contamination makes the identification of BSA in the Raman data more difficult. Yet, BSA presence in the wAgwS BSA product is evident with MS data. Using fused data from MS and Raman both compositional and structural information can be combined for a more comprehensive quality evaluation.