



The StreamFind R package is a data processing workflow designer. Besides data processing, the platform can also be used for data management, visualization and reporting. This guide describes the general framework behind StreamFind. StreamFind is built around R6 classes that act as data processing engines (a metaphor) for different types of data, e.g. mass spectrometry (MS) and Raman spectroscopy data.

Data processing engines

Data processing engines are fundamentally reference classes with methods to manage, process, visualize and report data within a project. The CoreEngine is the parent class of all data specific engines (e.g. MassSpecEngine and RamanEngine). As the parent, the CoreEngine holds the functions that are uniform across the child engines (e.g. adding and removing analyses from the project).

core <- CoreEngine$new()

core

CoreEngine
File: 
NA

Headers:name: NA
author: NA
date: 2024-11-29 10:05:08.854806


Workflow: 
empty

Analyses: 
empty 

Note that when an empty CoreEngine is created, the required ProjectHeaders are created with name, author, path and date. ProjectHeaders can also be specified directly when creating the CoreEngine via the argument headers, or added to the engine afterwards as shown in the Project headers section. The CoreEngine does not directly handle data processing. Processing methods are data specific and are therefore used via the data dedicated engines. However, the framework to manage the data processing workflow and the results is implemented in the CoreEngine and is therefore harmonized across engines. Users will not use the CoreEngine directly, but it is important to understand that it works in the background.
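
The parent/child relationship and the reference behaviour of the engines can be sketched with a toy example. The classes ToyCoreEngine and ToyRamanEngine below are hypothetical and do not reproduce the StreamFind implementation; the sketch only assumes that the R6 package is installed.

```r
# Toy illustration only: ToyCoreEngine and ToyRamanEngine are hypothetical
# names and are not part of StreamFind.
library(R6)

ToyCoreEngine <- R6Class(
  "ToyCoreEngine",
  public = list(
    headers = NULL,
    analyses = NULL,
    initialize = function(headers = list(name = NA_character_)) {
      self$headers <- headers
      self$analyses <- character()
    },
    # management methods shared by every child engine
    add_analyses = function(x) {
      self$analyses <- c(self$analyses, x)
      invisible(self)
    },
    remove_analyses = function(x) {
      self$analyses <- setdiff(self$analyses, x)
      invisible(self)
    }
  )
)

ToyRamanEngine <- R6Class(
  "ToyRamanEngine",
  inherit = ToyCoreEngine,
  public = list(
    # data specific method, only available in the child engine
    count_analyses = function() length(self$analyses)
  )
)

engine <- ToyRamanEngine$new(headers = list(name = "Demo"))
same_engine <- engine                  # no copy: both names point to one object
same_engine$add_analyses(c("a", "b"))  # method inherited from the parent
engine$count_analyses()
```

Because R6 objects are modified in place, the change made through same_engine is visible through engine as well, which is why engine methods do not need to be re-assigned after each call.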

Project headers

The ProjectHeaders S7 class holds project information/metadata, such as description, location, etc. Users can add any kind of attribute, but each one must be named and have length one. Below, a list of headers is created and added to the CoreEngine for demonstration. Internally, the list of headers is converted to a ProjectHeaders object.

headers <- list(
  name = "Project Example", 
  author = "Person Name", 
  description = "Example of project headers"
)

core$headers <- headers

core$print_headers()
name: Project Example
author: Person Name
description: Example of project headers
date: 2024-11-29 10:05:08.936701
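
The two rules for headers (each entry named, each entry of length one) can be checked before assignment. The helper validate_headers() below is a hypothetical base R sketch written for this guide and is not the validation that StreamFind performs internally.

```r
# Hypothetical helper, not part of StreamFind: checks that every header
# is named and has length one.
validate_headers <- function(headers) {
  stopifnot(is.list(headers))
  nms <- names(headers)
  if (is.null(nms) || anyNA(nms) || any(nms == "")) {
    stop("all headers must be named")
  }
  bad <- nms[lengths(headers) != 1L]
  if (length(bad) > 0) {
    stop("headers must have length one: ", paste(bad, collapse = ", "))
  }
  invisible(TRUE)
}

# passes silently
validate_headers(list(name = "Project Example", author = "Person Name"))

# a header holding two values is rejected; the error message is captured
result <- tryCatch(
  validate_headers(list(name = "Project Example", author = c("A", "B"))),
  error = function(e) conditionMessage(e)
)
result
```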

Processing settings

A data processing workflow is represented in StreamFind by the S7 class Workflow, which is composed of an ordered list of S7 class ProcessingSettings objects. Each ProcessingSettings object is a representation of a processing method/step that transforms the data according to a specific algorithm. The ProcessingSettings objects are used to harmonize the diversity of processing methods and algorithms available for a given data type.

ProcessingSettings()
<StreamFind::ProcessingSettings>
 @ engine          : chr NA
 @ method          : chr NA
 @ algorithm       : chr NA
 @ parameters      : list()
 @ number_permitted: num NA
 @ version         : chr NA
 @ software        : chr NA
 @ developer       : chr NA
 @ contact         : chr NA
 @ link            : chr NA
 @ doi             : chr NA
 @ call            : chr "NASettings_NA_NA"

A ProcessingSettings object must always have the engine type, the processing method name, the name of the algorithm to be used, the origin software, the main developer name and contact, as well as a link to further information and the DOI, when available. Last but not least, it holds the parameters, a flexible list of conditions for applying the algorithm during data processing. As an example, the ProcessingSettings for annotating features using a native algorithm from StreamFind are shown below. Each ProcessingSettings object has a dedicated constructor with documentation to support its usage. Help pages for processing methods can be obtained with the native R function ? or help() (e.g., help(MassSpecSettings_AnnotateFeatures_StreamFind)).

# constructor for annotating features workflow step
# the constructor name gives away the engine, method and algorithm
# i.e.
# - the engine is MassSpecEngine
# - the method is AnnotateFeatures
# - the algorithm is StreamFind
MassSpecSettings_AnnotateFeatures_StreamFind()
<StreamFind::MassSpecSettings_AnnotateFeatures_StreamFind>
 @ engine          : chr "MassSpec"
 @ method          : chr "AnnotateFeatures"
 @ algorithm       : chr "StreamFind"
 @ parameters      :List of 4
 .. $ maxIsotopes      : int 8
 .. $ maxCharge        : int 1
 .. $ rtWindowAlignment: num 0.3
 .. $ maxGaps          : int 1
 @ number_permitted: num 1
 @ version         : chr "0.2.0"
 @ software        : chr "StreamFind"
 @ developer       : chr "Ricardo Cunha"
 @ contact         : chr "cunha@iuta.de"
 @ link            : chr "https://odea-project.github.io/StreamFind"
 @ doi             : chr NA
 @ call            : chr "MassSpecSettings_AnnotateFeatures_StreamFind"
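
Judging from the printed call values, the constructor names appear to follow the pattern engine + "Settings", method and algorithm, separated by underscores. The helper parse_settings_call() below is a hypothetical base R sketch that splits such a name into its parts; it assumes that only the algorithm part may contain further underscores.

```r
# Hypothetical helper, not part of StreamFind: splits a constructor name
# into engine, method and algorithm.
parse_settings_call <- function(call_name) {
  parts <- strsplit(call_name, "_", fixed = TRUE)[[1]]
  stopifnot(length(parts) >= 3)
  list(
    engine = sub("Settings$", "", parts[1]),
    method = parts[2],
    # everything after the second underscore is taken as the algorithm name
    algorithm = paste(parts[-(1:2)], collapse = "_")
  )
}

parse_settings_call("MassSpecSettings_AnnotateFeatures_StreamFind")
parse_settings_call("RamanSettings_CorrectSpectraBaseline_baseline_als")
```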

Saving and loading

The CoreEngine also holds the functionality to save the project in the engine (as an .rds or .sqlite file) and load it back. As shown below, the save() and load() methods are used for saving and loading the project, respectively.

project_file_path <- file.path(getwd(), "project.rds")
core$save(project_file_path)
file.exists(project_file_path)
[1] TRUE
new_core <- CoreEngine$new()
new_core$load(project_file_path)
# the headers are the same as in the core object, although
# new_core was created with default headers
new_core$print_headers()
name: Project Example
author: Person Name
description: Example of project headers
date: 2024-11-29 10:05:08.936701
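
For readers unfamiliar with the .rds format, the round trip can be sketched with base R alone. The project list below is a toy stand-in and is not the structure that the save() method writes to disk.

```r
# Toy project state; this is NOT the structure StreamFind writes to disk.
project <- list(
  headers = list(name = "Project Example", author = "Person Name"),
  workflow = list(),
  analyses = list()
)

# write the object to a temporary .rds file and read it back
path <- tempfile(fileext = ".rds")
saveRDS(project, path)
restored <- readRDS(path)

# the restored object is identical to the original
identical(restored, project)
unlink(path)
```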

Data specific engines

As mentioned above, the CoreEngine does not handle data processing directly. The data processing is delegated to child engines, where specific ProcessingSettings can be applied. A simple example is given below by creating a child RamanEngine and plotting the spectra from the analyses (added as full paths to .asc files on disk). Note that the workflow and results are still empty, as no data processing methods were applied.

# Example raman .asc files
raman_ex_files <- StreamFindData::get_raman_file_paths()
raman <- RamanEngine$new(analyses = raman_ex_files)
raman

RamanEngine
File: 
NA

Headers:name: NA
author: NA
date: 2024-11-29 10:05:09.579761


Workflow: 
empty

Analyses: 
                         analysis                     replicate  blank   type spectra
                           <char>                        <char> <char> <char>   <num>
 1:       raman_Bevacizumab_11731       raman_Bevacizumab_11731   <NA>  raman    1024
 2:       raman_Bevacizumab_11732       raman_Bevacizumab_11732   <NA>  raman    1024
 3:       raman_Bevacizumab_11733       raman_Bevacizumab_11733   <NA>  raman    1024
 4:       raman_Bevacizumab_11734       raman_Bevacizumab_11734   <NA>  raman    1024
 5:       raman_Bevacizumab_11735       raman_Bevacizumab_11735   <NA>  raman    1024
 6:       raman_Bevacizumab_11736       raman_Bevacizumab_11736   <NA>  raman    1024
 7:       raman_Bevacizumab_11737       raman_Bevacizumab_11737   <NA>  raman    1024
 8:       raman_Bevacizumab_11738       raman_Bevacizumab_11738   <NA>  raman    1024
 9:       raman_Bevacizumab_11739       raman_Bevacizumab_11739   <NA>  raman    1024
10:       raman_Bevacizumab_11740       raman_Bevacizumab_11740   <NA>  raman    1024
11:       raman_Bevacizumab_11741       raman_Bevacizumab_11741   <NA>  raman    1024
12: raman_blank_Bevacizumab_10005 raman_blank_Bevacizumab_10005   <NA>  raman    1024
13: raman_blank_Bevacizumab_10006 raman_blank_Bevacizumab_10006   <NA>  raman    1024
14: raman_blank_Bevacizumab_10007 raman_blank_Bevacizumab_10007   <NA>  raman    1024
15: raman_blank_Bevacizumab_10008 raman_blank_Bevacizumab_10008   <NA>  raman    1024
16: raman_blank_Bevacizumab_10009 raman_blank_Bevacizumab_10009   <NA>  raman    1024
17: raman_blank_Bevacizumab_10010 raman_blank_Bevacizumab_10010   <NA>  raman    1024
18: raman_blank_Bevacizumab_10011 raman_blank_Bevacizumab_10011   <NA>  raman    1024
19: raman_blank_Bevacizumab_10012 raman_blank_Bevacizumab_10012   <NA>  raman    1024
20: raman_blank_Bevacizumab_10013 raman_blank_Bevacizumab_10013   <NA>  raman    1024
21: raman_blank_Bevacizumab_10014 raman_blank_Bevacizumab_10014   <NA>  raman    1024
22: raman_blank_Bevacizumab_10015 raman_blank_Bevacizumab_10015   <NA>  raman    1024
                         analysis                     replicate  blank   type spectra
# when interactive is TRUE, the spectra are plotted with plotly
raman$plot_spectra(interactive = FALSE)

[Figure: raw spectra]

Managing analyses

Analyses can be added and removed from the engine with the add_analyses() and remove_analyses() methods, respectively. Below, the 1st and 12th analyses are removed from the engine and then added back.

raman$remove_analyses(c(1, 12))
length(raman$analyses)
[1] 20
raman$add_analyses(raman_ex_files[c(1, 12)])
length(raman$analyses)
[1] 22

For data processing, the analysis replicate names and the corresponding blank analysis replicates can be assigned with dedicated methods, as shown below. For instance, the replicate names are used for averaging the spectra of the corresponding analyses and the assigned blanks are used for background subtraction, as shown in the Processing workflow section.

raman$add_replicate_names(c(rep("Sample", 11), rep("Blank", 11)))
raman$add_blank_names(rep("Blank", 22))
# the replicate names are modified and the blanks are assigned
raman

RamanEngine
File: 
NA

Headers:name: NA
author: NA
date: 2024-11-29 10:05:09.579761


Workflow: 
empty

Analyses: 
                         analysis replicate  blank   type spectra
                           <char>    <char> <char> <char>   <num>
 1:       raman_Bevacizumab_11731    Sample  Blank  raman    1024
 2:       raman_Bevacizumab_11732    Sample  Blank  raman    1024
 3:       raman_Bevacizumab_11733    Sample  Blank  raman    1024
 4:       raman_Bevacizumab_11734    Sample  Blank  raman    1024
 5:       raman_Bevacizumab_11735    Sample  Blank  raman    1024
 6:       raman_Bevacizumab_11736    Sample  Blank  raman    1024
 7:       raman_Bevacizumab_11737    Sample  Blank  raman    1024
 8:       raman_Bevacizumab_11738    Sample  Blank  raman    1024
 9:       raman_Bevacizumab_11739    Sample  Blank  raman    1024
10:       raman_Bevacizumab_11740    Sample  Blank  raman    1024
11:       raman_Bevacizumab_11741    Sample  Blank  raman    1024
12: raman_blank_Bevacizumab_10005     Blank  Blank  raman    1024
13: raman_blank_Bevacizumab_10006     Blank  Blank  raman    1024
14: raman_blank_Bevacizumab_10007     Blank  Blank  raman    1024
15: raman_blank_Bevacizumab_10008     Blank  Blank  raman    1024
16: raman_blank_Bevacizumab_10009     Blank  Blank  raman    1024
17: raman_blank_Bevacizumab_10010     Blank  Blank  raman    1024
18: raman_blank_Bevacizumab_10011     Blank  Blank  raman    1024
19: raman_blank_Bevacizumab_10012     Blank  Blank  raman    1024
20: raman_blank_Bevacizumab_10013     Blank  Blank  raman    1024
21: raman_blank_Bevacizumab_10014     Blank  Blank  raman    1024
22: raman_blank_Bevacizumab_10015     Blank  Blank  raman    1024
                         analysis replicate  blank   type spectra
# the spectra are plotted with the replicates colored
raman$plot_spectra(interactive = FALSE, colorBy = "replicates")

[Figure: raw spectra colored by replicates]
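
How replicate names and blank assignments are typically used can be sketched in base R with invented numbers. This is only a conceptual illustration of group averaging and blank subtraction; the values are made up and the StreamFind algorithms may differ in detail.

```r
# Invented toy intensities: 4 analyses (rows) x 3 shift positions (columns).
intensity <- rbind(
  c(10, 20, 30),
  c(12, 22, 32),
  c(1, 2, 3),
  c(3, 4, 5)
)
replicate <- c("Sample", "Sample", "Blank", "Blank")

# blank replicate assigned to each replicate
blank <- c(Sample = "Blank", Blank = "Blank")

# average the analyses within each replicate group
groups <- split(seq_len(nrow(intensity)), replicate)
averaged <- t(sapply(groups, function(i) colMeans(intensity[i, , drop = FALSE])))

# subtract the averaged blank from the averaged sample
corrected <- averaged["Sample", ] - averaged[blank[["Sample"]], ]
corrected
```

Here the two Sample analyses average to 11, 21 and 31, the two Blank analyses average to 2, 3 and 4, and the corrected spectrum is therefore 9, 18 and 27.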

Processing workflow

As mentioned above, ProcessingSettings are used to design an ordered list of processing methods in a Workflow object. Below, we create a list of ProcessingSettings for processing the Raman spectra and add it to the raman engine.

ps <- list(
  # averages the spectra for each analysis replicate
  RamanSettings_AverageSpectra_StreamFind(),
  
  # simple normalization based on minimum and maximum intensity
  RamanSettings_NormalizeSpectra_minmax(),
  
  # background subtraction
  RamanSettings_SubtractBlankSpectra_StreamFind(),
  
  # applies smoothing based on moving average
  RamanSettings_SmoothSpectra_movingaverage(windowSize = 4),
  
  # removes a section from the spectra from -40 to 300
  RamanSettings_DeleteSpectraSection_StreamFind(shiftmin = -40, shiftmax = 300),
  
  # removes a section from the spectra from 2000 to 3000
  RamanSettings_DeleteSpectraSection_StreamFind(shiftmin = 2000, shiftmax = 3000),
 
  # performs baseline correction 
  RamanSettings_CorrectSpectraBaseline_baseline_als(lambda = 3, p = 0.06, maxit = 10)
)
# the workflow is added to the engine but not yet applied
# the results are still empty
raman$workflow <- ps

raman$print_workflow()
1: AverageSpectra (StreamFind)
2: NormalizeSpectra (minmax)
3: SubtractBlankSpectra (StreamFind)
4: SmoothSpectra (movingaverage)
5: DeleteSpectraSection (StreamFind)
6: DeleteSpectraSection (StreamFind)
7: CorrectSpectraBaseline (baseline_als)
# the data processing workflow is applied
raman$run_workflow()
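
Conceptually, running a workflow means applying each step to the output of the previous one, in list order. The base R sketch below illustrates this with two hypothetical stand-in functions, normalize_minmax() and smooth_moving_average(); they are simplified assumptions and not the algorithms behind the StreamFind settings.

```r
# Hypothetical stand-ins for processing steps; NOT the StreamFind algorithms.
normalize_minmax <- function(x) (x - min(x)) / (max(x) - min(x))

smooth_moving_average <- function(x, window_size = 3) {
  # centred moving average; edge points without a full window are kept unchanged
  weights <- rep(1 / window_size, window_size)
  smoothed <- as.numeric(stats::filter(x, weights, sides = 2))
  ifelse(is.na(smoothed), x, smoothed)
}

run_steps <- function(x, steps) {
  # apply each step to the output of the previous one, in list order
  Reduce(function(data, step) step(data), steps, x)
}

spectrum <- c(2, 4, 10, 4, 2)

# normalization followed by smoothing
run_steps(spectrum, list(normalize_minmax, smooth_moving_average))

# the reversed order gives a different result, so the step order matters
run_steps(spectrum, list(smooth_moving_average, normalize_minmax))
```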

The method run() can be used to apply a single ProcessingSettings object to the data. Note that the ProcessingSettings step is always added to the bottom of the workflow in the engine. Below, the normalization based on minimum and maximum is applied again to the Raman spectra and then the workflow is shown, now including another normalization step in the last position.

# performs again normalization using minimum and maximum
raman$run(RamanSettings_NormalizeSpectra_minmax())
# the workflow is shown with another normalization step at the end
raman$print_workflow()
1: AverageSpectra (StreamFind)
2: NormalizeSpectra (minmax)
3: SubtractBlankSpectra (StreamFind)
4: SmoothSpectra (movingaverage)
5: DeleteSpectraSection (StreamFind)
6: DeleteSpectraSection (StreamFind)
7: CorrectSpectraBaseline (baseline_als)
8: NormalizeSpectra (minmax)

Results

Once the data processing methods are applied, the results can be accessed with the dedicated, engine specific active fields, as shown below. The results are always added as child classes of the S7 class Results.

# the spectra results were added
raman

RamanEngine
File: 
NA

Headers:name: NA
author: NA
date: 2024-11-29 10:05:09.579761


Workflow: 
1: AverageSpectra (StreamFind)
2: NormalizeSpectra (minmax)
3: SubtractBlankSpectra (StreamFind)
4: SmoothSpectra (movingaverage)
5: DeleteSpectraSection (StreamFind)
6: DeleteSpectraSection (StreamFind)
7: CorrectSpectraBaseline (baseline_als)
8: NormalizeSpectra (minmax)


Analyses: 
                         analysis replicate  blank   type spectra
                           <char>    <char> <char> <char>   <num>
 1:       raman_Bevacizumab_11731    Sample  Blank  raman    1024
 2:       raman_Bevacizumab_11732    Sample  Blank  raman    1024
 3:       raman_Bevacizumab_11733    Sample  Blank  raman    1024
 4:       raman_Bevacizumab_11734    Sample  Blank  raman    1024
 5:       raman_Bevacizumab_11735    Sample  Blank  raman    1024
 6:       raman_Bevacizumab_11736    Sample  Blank  raman    1024
 7:       raman_Bevacizumab_11737    Sample  Blank  raman    1024
 8:       raman_Bevacizumab_11738    Sample  Blank  raman    1024
 9:       raman_Bevacizumab_11739    Sample  Blank  raman    1024
10:       raman_Bevacizumab_11740    Sample  Blank  raman    1024
11:       raman_Bevacizumab_11741    Sample  Blank  raman    1024
12: raman_blank_Bevacizumab_10005     Blank  Blank  raman    1024
13: raman_blank_Bevacizumab_10006     Blank  Blank  raman    1024
14: raman_blank_Bevacizumab_10007     Blank  Blank  raman    1024
15: raman_blank_Bevacizumab_10008     Blank  Blank  raman    1024
16: raman_blank_Bevacizumab_10009     Blank  Blank  raman    1024
17: raman_blank_Bevacizumab_10010     Blank  Blank  raman    1024
18: raman_blank_Bevacizumab_10011     Blank  Blank  raman    1024
19: raman_blank_Bevacizumab_10012     Blank  Blank  raman    1024
20: raman_blank_Bevacizumab_10013     Blank  Blank  raman    1024
21: raman_blank_Bevacizumab_10014     Blank  Blank  raman    1024
22: raman_blank_Bevacizumab_10015     Blank  Blank  raman    1024
                         analysis replicate  blank   type spectra

Result 1: StreamFind::Spectra
# results can be obtained with the dedicated active fields
raman$spectra
<StreamFind::Spectra>
 @ name          : chr "Spectra"
 @ software      : chr "StreamFind"
 @ version       : chr "0.2.0"
 @ spectra       :List of 2
 .. $ Sample:Classes 'data.table' and 'data.frame': 690 obs. of  5 variables:
 ..  ..$ shift    : num [1:690] 300 303 306 309 312 ...
 ..  ..$ intensity: num [1:690] 0.0886 0.0386 0.1075 0.1457 0.2491 ...
 ..  ..$ blank    : num [1:690] 0.75 0.733 0.717 0.706 0.695 ...
 ..  ..$ baseline : num [1:690] 0.0454 0.0456 0.0458 0.0461 0.0463 ...
 ..  ..$ raw      : num [1:690] 0.0452 0.0453 0.0458 0.0461 0.0467 ...
 ..  ..- attr(*, ".internal.selfref")=<externalptr> 
 .. $ Blank :Classes 'data.table' and 'data.frame': 0 obs. of  0 variables
 ..  ..- attr(*, ".internal.selfref")=<externalptr> 
 @ is_averaged   : logi TRUE
 @ is_neutralized: logi FALSE
 @ peaks         : list()
 @ has_peaks     : logi FALSE
 @ charges       : list()
# resulting spectrum
raman$plot_spectra()

Conclusion

This quick guide introduced the general framework of StreamFind. StreamFind is a data processing workflow designer that uses R6 classes to manage, process, visualize and report data within a project. The CoreEngine is the parent class of all data specific engines and manages the project information via the class ProjectHeaders. The ProcessingSettings harmonize the diversity of processing methods and algorithms, and a Workflow is assembled by combining different ProcessingSettings in a specific order. The data processing is delegated to child engines, such as the RamanEngine and MassSpecEngine. The results can be accessed with dedicated active fields (e.g. spectra) and visualized with dedicated methods (e.g. plot_spectra()). StreamFind can be used via scripting, as demonstrated in this guide, or via the embedded shiny app for a graphical user interface. See the StreamFind App Guide for more information.