Skip to contents




Introduction

Mass Spectrometry (MS) and Raman spectroscopy (Raman) are orthogonal techniques. While MS measures the chemical composition of a given sample, Raman is able to provide structural information of the constituents. This orthogonality is interesting when applying quality evaluation workflows, where both chemical composition and structural integrity are relevant. In this article, we show how MS and Raman can be pre-processed and fused for a combined statistical evaluation to support quality control. As test sets, we use four different bovine serum albumin (BSA) products where we evaluate the quality of the active product ingredient (API).

Pre-processing

The preliminary processing is applied differently for MS and Raman data as the data structure does not allow direct fusion of the data. Therefore, in the following two sub-chapters we describe the pre-processing workflow steps for MS and Raman data.

MS

The four different BSA products were measured in triplicate in full scan MS mode. Giving the following set of .d files.

# Raw data from an Agilent QTOF
basename(ms_files)
 [1] "BSA_7AqYf_1.d" "BSA_7AqYf_2.d" "BSA_7AqYf_3.d" "BSA_m595c_1.d"
 [5] "BSA_m595c_2.d" "BSA_m595c_3.d" "BSA_utpLV_1.d" "BSA_utpLV_2.d"
 [9] "BSA_utpLV_3.d" "BSA_wAgwS_1.d" "BSA_wAgwS_2.d" "BSA_wAgwS_3.d"
[13] "H2O.d"         "H2O_2.d"      

With the .d files, a MassSpecEngine is created to pre-process the MS data. Note that the MS files were centroided and converted to mzML during creation of the MassSpecEngine, using the installed msconvert tool from the ProteoWizard in the background. The analysis replicate names are modified to properly assign the analyses within each replicate measurement. Two water (H2O) files were used as blanks for the analyses.

ms <- StreamFind::MassSpecEngine$new(
  headers = list(name = "BSA Quality Evaluation Project"),
  analyses = ms_files,
  centroid = TRUE,
  levels = 1
)

# Changes the analysis replicate names
ms$add_replicate_names(sub("_\\d+$", "", ms$get_analysis_names()))

# Overview
ms

MassSpecEngine
File: 
NA

Headers:name: BSA Quality Evaluation Project
author: NA
date: 2024-11-29 10:02:08.385106


Workflow: 
empty

Analyses: 
       analysis replicate  blank   type polarity spectra chromatograms
         <char>    <char> <char> <char>   <char>   <num>         <num>
 1: BSA_7AqYf_1 BSA_7AqYf   <NA>     MS positive     591             7
 2: BSA_7AqYf_2 BSA_7AqYf   <NA>     MS positive     591             7
 3: BSA_7AqYf_3 BSA_7AqYf   <NA>     MS positive     591             7
 4: BSA_m595c_1 BSA_m595c   <NA>     MS positive     592             7
 5: BSA_m595c_2 BSA_m595c   <NA>     MS positive     591             7
 6: BSA_m595c_3 BSA_m595c   <NA>     MS positive     592             7
 7: BSA_utpLV_1 BSA_utpLV   <NA>     MS positive     591             7
 8: BSA_utpLV_2 BSA_utpLV   <NA>     MS positive     591             7
 9: BSA_utpLV_3 BSA_utpLV   <NA>     MS positive     591             7
10: BSA_wAgwS_1 BSA_wAgwS   <NA>     MS positive     592             7
11: BSA_wAgwS_2 BSA_wAgwS   <NA>     MS positive     591             7
12: BSA_wAgwS_3 BSA_wAgwS   <NA>     MS positive     591             7
13:         H2O       H2O   <NA>     MS positive     591             7
14:       H2O_2       H2O   <NA>     MS positive     591             7
    concentration
            <num>
 1:            NA
 2:            NA
 3:            NA
 4:            NA
 5:            NA
 6:            NA
 7:            NA
 8:            NA
 9:            NA
10:            NA
11:            NA
12:            NA
13:            NA
14:            NA
# Note that the TIC chromatograms are not yet loaded (see below)
# In plot_spectra_tic, the TIC from spectra headers are used
ms$plot_spectra_tic(colorBy = "replicates")

Total ion chromatograms (TIC) for each analysis replicate.

The data processing workflow is based on applied processing methods within the engine. As pre-processing, the following steps are applied:

  1. Find the retention time elution of BSA based on the total ion chromatogram (TIC) of each analysis;
  2. Find the mass-to-charge (m/z) range of BSA for deconvolution of charges;
  3. Deconvolute charges for estimation of the exact mass, in DA;

The individual processing steps are applied according to ProcessingSettings. Below, the integration of the TIC is shown using the dedicated ProcessingSettings after loading the TIC chromatograms from the analysis files and smoothing using a moving average approach.

# Loads the TIC chromatograms
ms$run(MassSpecSettings_LoadChromatograms_StreamFind(chromatograms = "TIC"))
ms$analyses$has_chromatograms
[1] TRUE
# Smoothing of the TIC chromatogram for improving peak finding
ms$run(MassSpecSettings_SmoothChromatograms_movingaverage(windowSize = 5))
ms$analyses$has_chromatograms
[1] TRUE
ms$plot_chromatograms(colorBy = "replicates")

Smoothed TICs for each analysis replicate.

# Settings for integration of the TIC
ps_ic <- MassSpecSettings_IntegrateChromatograms_StreamFind(
  merge = TRUE,
  closeByThreshold = 5,
  minPeakHeight = 3E6,
  minPeakDistance = 3,
  minPeakWidth = 20,
  maxPeakWidth = 200,
  minSN = 5
)

show(ps_ic)

 StreamFind::MassSpecSettings_IntegrateChromatograms_StreamFind 
 engine       MassSpec
 method       IntegrateChromatograms
 algorithm    StreamFind
 version      0.2.0
 software     StreamFind
 developer    Ricardo Cunha
 contact      cunha@iuta.de
 link         https://odea-project.github.io/StreamFind
 doi          NA

 parameters: 
  -  merge TRUE 
  -  closeByThreshold 5 
  -  minPeakHeight 3e+06 
  -  minPeakDistance 3 
  -  minPeakWidth 20 
  -  maxPeakWidth 200 
  -  minSN 5 
ms$run(ps_ic)
ms$plot_chromatograms_peaks(colorBy = "replicates")

TIC with integrated peaks.

The average retention time of the main peak was 363.9823581 seconds with a standard deviation of 0.8060852. The following step is to estimate the m/z range of BSA for deconvolution of charges.

# Average retention time of the main peak
rt_av <- mean(data.table::rbindlist(ms$chromatograms$peaks)$rt)
rt_av
[1] 363.9824
ms$plot_spectra_ms1(
  analyses = 1,
  rt = rt_av,
  sec = 0.5,
  colorBy = "replicates",
  presence = 0.1,
  minIntensity = 5000,
  interactive = FALSE
)
Spectra of BSA for the first analysis at the peak maximum.

Spectra of BSA for the first analysis at the peak maximum.

According to the MS1 spectra, the m/z from 1200 to 2000 can be added to the dedicated ProcessingSettings for deconvolution, as shown below. The workflow is an ordered list of ProcessingSettings objects for each workflow step, which is added to the engine. The exact parameters for each method can be optimized individually by calling the specific method directly. Once the settings are added, the workflow can be inspected with the print_workflow() method.

ms_workflow <- list(

  MassSpecSettings_LoadSpectra_StreamFind(
    mzmin = 1200,
    mzmax = 2000,
    rtmin = rt_av - 1,
    rtmax = rt_av + 1,
    levels = 1
  ),

  MassSpecSettings_ClusterSpectra_StreamFind(
    val = "mz",
    clustVal = 0.005,
    presence = 0.1
  ),

  MassSpecSettings_CalculateSpectraCharges_StreamFind(
    roundVal = 15,
    relLowCut = 0.2,
    absLowCut = 8000
  ),

  MassSpecSettings_DeconvoluteSpectra_StreamFind(),

  MassSpecSettings_SmoothSpectra_movingaverage(
    windowSize = 15
  ),

  MassSpecSettings_BinSpectra_StreamFind(
    binNames = c("rt", "mass"),
    binValues = c(5, 50),
    byUnit = TRUE,
    refBinAnalysis = 1
  ),

  MassSpecSettings_AverageSpectra_StreamFind(),

  MassSpecSettings_NormalizeSpectra_minmax(),

  MassSpecSettings_NormalizeSpectra_blockweight()
)

# Replaces the settings added to integrate the TIC chromatograms
ms$workflow <- ms_workflow

ms$print_workflow()
1: LoadSpectra (StreamFind)
2: ClusterSpectra (StreamFind)
3: CalculateSpectraCharges (StreamFind)
4: DeconvoluteSpectra (StreamFind)
5: SmoothSpectra (movingaverage)
6: BinSpectra (StreamFind)
7: AverageSpectra (StreamFind)
8: NormalizeSpectra (minmax)
9: NormalizeSpectra (blockweight)

The run_workflow() method is called to run the data processing.

ms$run_workflow()
plot_spectra(ms$analyses, colorBy = "analyses")

Deconvoluted and pre-processed spectra for each analysis.

Raman

The four different BSA products were measured using an online coupling of size exclusion chromatography (SEC) with capillary-enhanced Raman spectroscopy (CERS) through a liquid-core optical fibre flow cell, as reported by Thissen et al. (DOI: 10.1021/acs.analchem.3c03991). The raw spectra was converted to asc format for each retention time. A subset of the background and the main BSA elution peak was selected for pre-processing.

raman <- RamanEngine$new(analyses = raman_files)

# Modifying the replicate names to match the MS analyses
raman$add_replicate_names(unique(ms$get_analysis_names()))

raman

RamanEngine
File: 
NA

Headers:name: NA
author: NA
date: 2024-11-29 10:02:20.393593


Workflow: 
empty

Analyses: 
          analysis       replicate  blank   type spectra
            <char>          <char> <char> <char>   <num>
1: BSA_7AqYf_Raman BSA_7AqYf_Raman   <NA>  raman 4093952
2: BSA_m595c_Raman BSA_m595c_Raman   <NA>  raman 4093952
3: BSA_utpLV_Raman BSA_utpLV_Raman   <NA>  raman 4093952
4: BSA_wAgwS_Raman BSA_wAgwS_Raman   <NA>  raman 4093952
raman$plot_chromatograms(yLab = "Abbundance sum for each spectrum", colorBy = "analyses")

Time series chromatogram for each BSA product. Each data point is a Raman spectrum at a specific retention time.

raman$plot_spectra(rt = c(430, 440), colorBy = "analyses")

Raw averaged Raman spectra for each BSA product between 430 and 440 seconds.

As for MS data, the workflow for Raman data is also assembled using an ordered list of processing settings, as shown below.

raman_workflow <- list(
  RamanSettings_BinSpectra_StreamFind(
    binNames = "rt",
    binValues = 20,
    byUnit = FALSE
  ),

  RamanSettings_SubtractSpectraSection_StreamFind(
    sectionVal = "rt",
    sectionWindow = c(0, 200)
  ),

  RamanSettings_DeleteSpectraSection_StreamFind(
    shiftmin = -100,
    shiftmax = 315
  ),

  RamanSettings_SmoothSpectra_savgol(
    fl = 11,
    forder = 2,
    dorder = 0
  ),

  RamanSettings_DeleteSpectraSection_StreamFind(
    shiftmin = 315,
    shiftmax = 330
  ),

  RamanSettings_DeleteSpectraSection_StreamFind(
    shiftmin = 2300,
    shiftmax = 2600
  ),

  RamanSettings_CorrectSpectraBaseline_baseline_als(
    lambda = 3,
    p = 0.015,
    maxit = 10
  ),

  RamanSettings_DeleteSpectraSection_StreamFind(
    rtmin = 0,
    rtmax = 420
  ),

  RamanSettings_DeleteSpectraSection_StreamFind(
    rtmin = 430,
    rtmax = 900
  ),

  RamanSettings_AverageSpectra_StreamFind(),

  RamanSettings_NormalizeSpectra_minmax(),

  RamanSettings_NormalizeSpectra_blockweight()
)

raman$workflow <- raman_workflow

raman$print_workflow()
1: BinSpectra (StreamFind)
2: SubtractSpectraSection (StreamFind)
3: DeleteSpectraSection (StreamFind)
4: SmoothSpectra (savgol)
5: DeleteSpectraSection (StreamFind)
6: DeleteSpectraSection (StreamFind)
7: CorrectSpectraBaseline (baseline_als)
8: DeleteSpectraSection (StreamFind)
9: DeleteSpectraSection (StreamFind)
10: AverageSpectra (StreamFind)
11: NormalizeSpectra (minmax)
12: NormalizeSpectra (blockweight)

A particularity of the workflow for Raman data is that the background subtraction (workflow step number 3) is performed from a time region (i.e., between 0 and 3 minutes) before the main BSA peak, which is from approximately from 5 to 8 minutes. The workflow is then applied using the run_workflow() method.

raman$run_workflow()
raman$plot_spectra_baseline()

Baseline correction for each spectra.

raman$plot_spectra()

Processed Raman spectra for each BSA product.

Fusion

The fusion of MS and Raman data is performed by combining the processed data from both engines. The data is then merged into a single data frame for statistical evaluation using a StatisticEngine, as shown in the next chapter.

ms_mat <- get_spectra_matrix(ms$analyses)
raman_mat <- get_spectra_matrix(raman$analyses)
fused_mat <- cbind(ms_mat, raman_mat)
attr(fused_mat, "xaxis.name") = "Keys"
attr(fused_mat, "xaxis.values") = seq_len(ncol(fused_mat))
str(fused_mat)
 num [1:4, 1:880] 0.00593 0.00808 0 0 0.00519 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:4] "BSA_7AqYf" "BSA_m595c" "BSA_utpLV" "BSA_wAgwS"
  ..$ : chr [1:880] "364-65516" "364-65566" "364-65616" "364-65666" ...
 - attr(*, "xaxis.name")= chr "Keys"
 - attr(*, "xaxis.values")= int [1:880] 1 2 3 4 5 6 7 8 9 10 ...

Statistic analysis

A statistic analysis is performed with the StatisticEngine for evaluation of the differences between the BSA products.

PCA

The fused data is used to create a PCA model with the StatisticEngine using processing settings based on the mdatools package.

stat <- StatisticEngine$new(analyses = fused_mat)
# Plots the data from the added fused data.frame/matrix
plot_data(stat$analyses)
# Note that scale is set to FALSE to avoid scaling the data
# as blockweight normalization was applied for each dataset
stat$run(
  StatisticSettings_MakeModel_pca_mdatools(
    center = TRUE,
    ncomp = 2,
    alpha = 0.05,
    gamma = 0.02
  )
)
summary(stat$model) # summary method is from mdatools

Summary for PCA model (class pca)
Type of limits: ddmoments
Alpha: 0.05
Gamma: 0.02

       Eigenvals Expvar Cumexpvar Nq Nh
Comp 1     0.005  65.82     65.82  8  3
Comp 2     0.001  20.79     86.61  2  8
plot(stat$model)

Overview plot for the PCA model.

# Adds a color key to the loadings plot to evaluate the contribution of each dataset (i.e., MS and Raman)
plot_loadings(stat$analyses, colorKey = rep(c("MS", "Raman"), c(ncol(ms_mat), ncol(raman_mat))))

PCA loadings plot for the fused data.

# The mdatools S3 class model object is obtained via the active binding of the StatisticEngine
plot(stat$model$model, show.labels = TRUE) # plot method from mdatools
PCA summary plot using the original mdatools R package.

PCA summary plot using the original mdatools R package.

MCR purity

An alternative approach for the statistic analysis is the MCR pure model (based on mdatools R package) to evaluate the purity level in the analyses.

# Note that the model in the engine will be overwritten by the MCR model
stat$run(StatisticSettings_MakeModel_mcrpure_mdatools(ncomp = 2, offset = 0.05))
summary(stat$model)

Summary for MCR Purity case (class mcrpure)
       Expvar Cumexpvar Varindex Purity
Comp 1  43.24     43.24       17 17.715
Comp 2  55.98     99.21       19 25.003
plot(stat$model)

Overview plot of the MCR pure model.

# Gets the most pure features for each component
get_model_data(stat$model)$purity
    vars     vals   feature result
   <int>    <num>    <char> <char>
1:    17 17.71518 364-66316  model
2:    19 25.00344 364-66416  model