StreamFind General Introduction

Introduction

The StreamFind R package is a data-agnostic workflow designer for data processing. Besides data processing, the package can also be used for data management, visualization and reporting. This guide describes the general framework of StreamFind. StreamFind is centered around R6 classes that serve as data processing engines (a metaphor) for different types of data, e.g. mass spectrometry (MS) and Raman spectroscopy (RS) data.

StreamFind Concept

Internally, the engines (R6 classes) use seven central S7 classes:

- The Metadata class flexibly holds project information, such as name, author, date and file.
- The Workflow class is an ordered list of ProcessingStep objects.
- The ProcessingStep class is a representation of a processing method/step that transforms the data according to a specific algorithm. It is used to harmonize the diversity of processing methods and algorithms available for a given data type.
- The Analyses class holds the data to be processed.
- The Results class holds the results of the data processing.
- The AuditTrail class records any modification to the project.
- The Config class holds configuration parameters for the engines, app, etc.

`CoreEngine`

Data processing engines are fundamentally reference classes (R6) with methods to manage, process, visualize and report data within a project. The CoreEngine is the parent class of all data specific engines (e.g. MassSpecEngine and RamanEngine). As the parent, the CoreEngine holds the functions that are uniform across the data specific child engines (e.g. managing the Metadata, recording the AuditTrail and applying a data processing Workflow).

# Creates an empty CoreEngine
core <- CoreEngine$new()

# Prints the engine
core


CoreEngine

Metadata
name: NA
author: NA
date: 2025-06-13 12:16:04.831486
file: NA

Workflow
empty

Analyses
empty

Note that when an empty CoreEngine or any data specific engine is created, the required Metadata entries are created with default name, author, date and file. Metadata entries can also be specified directly during creation of the CoreEngine via the argument metadata, or added to the engine afterwards as shown in the Metadata section. The CoreEngine does not directly handle data processing. Processing methods are data specific and are therefore used via the data specific engines. However, the framework to manage the data processing workflow is implemented in the CoreEngine and is therefore harmonized across engines. Users will not use the CoreEngine directly, but it is important to understand that it works in the background.
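The parent/child relationship described above can be illustrated with a toy R6 hierarchy. This is only a sketch under stated assumptions: the classes ToyCoreEngine and ToyRamanEngine, the private field .metadata and the describe() method are hypothetical and do not reproduce the StreamFind implementation. The sketch shows how a parent class can own project-level state, expose it through an active field and share it with a data specific child class.

```r
library(R6)

# Toy parent engine (hypothetical, not the StreamFind CoreEngine):
# it owns the project-level state shared by every engine.
ToyCoreEngine <- R6Class(
  "ToyCoreEngine",
  public = list(
    initialize = function(metadata = list()) {
      defaults <- list(name = NA_character_, author = NA_character_)
      private$.metadata <- modifyList(defaults, metadata)
    }
  ),
  active = list(
    # Active field: read with engine$Metadata, update with engine$Metadata <- list(...)
    Metadata = function(value) {
      if (missing(value)) {
        return(private$.metadata)
      }
      stopifnot(is.list(value))
      private$.metadata <- modifyList(private$.metadata, value)
      invisible(self)
    }
  ),
  private = list(.metadata = NULL)
)

# Toy child engine: inherits the metadata handling and adds a data specific method
ToyRamanEngine <- R6Class(
  "ToyRamanEngine",
  inherit = ToyCoreEngine,
  public = list(
    describe = function() {
      paste("Raman project:", private$.metadata$name)
    }
  )
)

engine <- ToyRamanEngine$new(metadata = list(name = "Demo"))
engine$Metadata <- list(author = "Person Name")

# R6 objects have reference semantics: `alias` points to the same engine
alias <- engine
alias$Metadata <- list(name = "Renamed")

class(engine)
engine$Metadata$name
engine$describe()
```

Because R6 objects are modified in place, the change made through alias is also visible through engine. In this toy version, assigning to the active field merges the new entries into the existing ones, which is a design choice of the sketch only.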

`Metadata`

The Metadata S7 class is meant to hold project information, such as description, location, etc. Users can add any kind of entry as a named list. Below, a list of metadata is created and added to the CoreEngine for demonstration. Internally, the list is converted to a Metadata object. Modifying the entries in the Metadata works like modifying a list in R, and the Metadata can be accessed via the active field Metadata in the CoreEngine or any data specific engine.

# Creates a named list with project metadata
mtd <- list(
  name = "Project Example",
  author = "Person Name",
  description = "Example of project description"
)

# Adds/updates the Metadata in the CoreEngine
core$Metadata <- mtd

# Show method for the Metadata class
show(core$Metadata)

name: Project Example
author: Person Name
description: Example of project description
date: 2025-06-13 12:16:04.931265
file: NA

# Adding a new entry to the Metadata
core$Metadata[["second_author"]] <- "Second Person Name"

show(core$Metadata)

name: Project Example
author: Person Name
description: Example of project description
date: 2025-06-13 12:16:04.931265
file: NA
second_author: Second Person Name

`Workflow`

A data processing workflow is represented in StreamFind by the class Workflow. A Workflow is composed of an ordered list of ProcessingStep objects. Each ProcessingStep object is a representation of a processing method/step that transforms the data according to a specific algorithm. The ProcessingStep class is used to harmonize the diversity of processing methods and algorithms available for a given data type.

# Constructor for a processing step
ProcessingStep()

<StreamFind::ProcessingStep>
 @ data_type       : chr NA
 @ method          : chr NA
 @ required        : chr NA
 @ algorithm       : chr NA
 @ parameters      : list()
 @ number_permitted: num NA
 @ version         : chr NA
 @ software        : chr NA
 @ developer       : chr NA
 @ contact         : chr NA
 @ link            : chr NA
 @ doi             : chr NA
 @ call            : chr "NAMethod_NA_NA"

A ProcessingStep object must always have the data type, the processing method name, the name of the algorithm to be used, the origin software, the main developer name and contact, as well as a link to further information and the DOI, when available. Last but not least, it holds the parameters, a flexible list of conditions for applying the algorithm during data processing.

The ProcessingStep is a generic parent class which delegates to child classes for specific data processing methods and algorithms. As an example, the ProcessingStep child class for annotating features within a non-target screening workflow using a native algorithm from StreamFind is shown below. Each ProcessingStep child class has a dedicated constructor with documentation to support its usage. Help pages for processing methods can be obtained with the native R function ? or help() (e.g., help(MassSpecMethod_AnnotateFeatures_StreamFind)).

# constructor of ProcessingStep child class
# for annotating features in a non-target screening workflow
# the constructor name gives away the engine, method and algorithm
# i.e.
# - the data type is MassSpec
# - the method name is AnnotateFeatures
# - the algorithm name is StreamFind
MassSpecMethod_AnnotateFeatures_StreamFind()

<StreamFind::MassSpecMethod_AnnotateFeatures_StreamFind>
 @ data_type       : chr "MassSpec"
 @ method          : chr "AnnotateFeatures"
 @ required        : chr "FindFeatures"
 @ algorithm       : chr "StreamFind"
 @ parameters      :List of 4
 .. $ maxIsotopes      : int 8
 .. $ maxCharge        : int 1
 .. $ rtWindowAlignment: num 0.3
 .. $ maxGaps          : int 1
 @ number_permitted: num 1
 @ version         : chr "0.2.0"
 @ software        : chr "StreamFind"
 @ developer       : chr "Ricardo Cunha"
 @ contact         : chr "cunha@iuta.de"
 @ link            : chr "https://odea-project.github.io/StreamFind"
 @ doi             : chr NA
 @ call            : chr "MassSpecMethod_AnnotateFeatures_StreamFind"
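The naming convention visible in the call values above (data type, then Method, then method name and algorithm name separated by underscores) can be sketched in base R. The helpers build_call() and parse_call() are hypothetical and are not part of StreamFind; they only mirror the pattern seen in the printed objects.

```r
# Builds the call name from its three parts, mirroring the `call` values shown above
build_call <- function(data_type, method, algorithm) {
  paste0(data_type, "Method_", method, "_", algorithm)
}

# Splits a call name back into its parts (hypothetical helper, not part of StreamFind)
parse_call <- function(call) {
  # Pattern: <data_type>Method_<method>_<algorithm>; the algorithm may contain "_"
  pattern <- "^([^_]+)Method_([^_]+)_(.+)$"
  parts <- regmatches(call, regexec(pattern, call))[[1]]
  list(data_type = parts[2], method = parts[3], algorithm = parts[4])
}

build_call("MassSpec", "AnnotateFeatures", "StreamFind")
parse_call("RamanMethod_CorrectSpectraBaseline_baseline_als")
```

The pattern assumes that the data type and the method name contain no underscore, while the algorithm name may (as in baseline_als).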

`Analyses`

As mentioned above, the CoreEngine does not handle data processing directly. The data processing is delegated to child engines. A simple example is given below by creating a child RamanEngine from a vector of paths to .asc files with Raman spectra. The Raman spectra are used internally to initiate a RamanAnalyses object (child class of Analyses), which holds the raw data and any data processing Results objects. Note that the Workflow and Results are still empty, as no data processing methods were applied.

# Example Raman .asc files
raman_ex_files <- StreamFindData::get_raman_file_paths()

# Creates a RamanEngine with the example files
raman <- RamanEngine$new(analyses = raman_ex_files)

# Show the engine class hierarchy
class(raman)

[1] "RamanEngine" "CoreEngine"  "R6"

Data specific engines have dedicated active fields to access the data. For instance, the Analyses active field in the RamanEngine is used to access the raw spectra and any Results. Note that accessing properties of S7 classes should be done with @ instead of $ but the $ operator is also available for convenience.

# Gets the length of Analyses in the RamanEngine
length(raman$Analyses)

[1] 22

# Gets the names of the Analyses in the RamanEngine
names(raman$Analyses)

        raman_Bevacizumab_11731         raman_Bevacizumab_11732 
      "raman_Bevacizumab_11731"       "raman_Bevacizumab_11732" 
        raman_Bevacizumab_11733         raman_Bevacizumab_11734 
      "raman_Bevacizumab_11733"       "raman_Bevacizumab_11734" 
        raman_Bevacizumab_11735         raman_Bevacizumab_11736 
      "raman_Bevacizumab_11735"       "raman_Bevacizumab_11736" 
        raman_Bevacizumab_11737         raman_Bevacizumab_11738 
      "raman_Bevacizumab_11737"       "raman_Bevacizumab_11738" 
        raman_Bevacizumab_11739         raman_Bevacizumab_11740 
      "raman_Bevacizumab_11739"       "raman_Bevacizumab_11740" 
        raman_Bevacizumab_11741   raman_blank_Bevacizumab_10005 
      "raman_Bevacizumab_11741" "raman_blank_Bevacizumab_10005" 
  raman_blank_Bevacizumab_10006   raman_blank_Bevacizumab_10007 
"raman_blank_Bevacizumab_10006" "raman_blank_Bevacizumab_10007" 
  raman_blank_Bevacizumab_10008   raman_blank_Bevacizumab_10009 
"raman_blank_Bevacizumab_10008" "raman_blank_Bevacizumab_10009" 
  raman_blank_Bevacizumab_10010   raman_blank_Bevacizumab_10011 
"raman_blank_Bevacizumab_10010" "raman_blank_Bevacizumab_10011" 
  raman_blank_Bevacizumab_10012   raman_blank_Bevacizumab_10013 
"raman_blank_Bevacizumab_10012" "raman_blank_Bevacizumab_10013" 
  raman_blank_Bevacizumab_10014   raman_blank_Bevacizumab_10015 
"raman_blank_Bevacizumab_10014" "raman_blank_Bevacizumab_10015"

# Access the spectrum of the first analysis in the Analyses object
head(raman$Analyses@Spectra@spectra[[1]])

       shift intensity
       <num>     <num>
1: -33.11349       569
2: -29.93873       572
3: -26.76505       573
4: -23.59243       570
5: -20.42305       573
6: -17.25473       576

The methods for data access and visualization are also implemented as public methods in the data specific engine class. Although data can be obtained directly from the Analyses child classes, using the public methods in the engine is a preferable interface. Below, the plot_spectra() method is used to plot the raw spectra from analyses 1 and 12.

# Plots the spectrum from analyses 1 and 12 in the RamanEngine
raman$plot_spectra(analyses = c(1, 12))

Managing `Analyses`

Analyses can be added and removed from the engine with the add_analyses() and remove_analyses() methods, respectively. Below, the 1st and 12th analyses are removed from the engine and then added back.

raman$remove_analyses(c(1, 12))
length(raman$Analyses)

[1] 20

raman$add_analyses(raman_ex_files[c(1, 12)])
length(raman$Analyses)

[1] 22

For data processing, the analysis replicate names and the corresponding blank analysis replicates can be assigned with dedicated methods, as shown below. For instance, the replicate names are used for averaging the spectra of the corresponding analyses and the assigned blanks are used for background subtraction.

# Adds replicate names and blank names
raman$add_replicate_names(c(rep("Sample", 11), rep("Blank", 11)))
raman$add_blank_names(rep("Blank", 22))

# the replicate names are modified
raman$Analyses$info[, c(1:3)]

                         analysis replicate  blank
                           <char>    <char> <char>
 1:       raman_Bevacizumab_11731    Sample  Blank
 2:       raman_Bevacizumab_11732    Sample  Blank
 3:       raman_Bevacizumab_11733    Sample  Blank
 4:       raman_Bevacizumab_11734    Sample  Blank
 5:       raman_Bevacizumab_11735    Sample  Blank
 6:       raman_Bevacizumab_11736    Sample  Blank
 7:       raman_Bevacizumab_11737    Sample  Blank
 8:       raman_Bevacizumab_11738    Sample  Blank
 9:       raman_Bevacizumab_11739    Sample  Blank
10:       raman_Bevacizumab_11740    Sample  Blank
11:       raman_Bevacizumab_11741    Sample  Blank
12: raman_blank_Bevacizumab_10005     Blank  Blank
13: raman_blank_Bevacizumab_10006     Blank  Blank
14: raman_blank_Bevacizumab_10007     Blank  Blank
15: raman_blank_Bevacizumab_10008     Blank  Blank
16: raman_blank_Bevacizumab_10009     Blank  Blank
17: raman_blank_Bevacizumab_10010     Blank  Blank
18: raman_blank_Bevacizumab_10011     Blank  Blank
19: raman_blank_Bevacizumab_10012     Blank  Blank
20: raman_blank_Bevacizumab_10013     Blank  Blank
21: raman_blank_Bevacizumab_10014     Blank  Blank
22: raman_blank_Bevacizumab_10015     Blank  Blank
                         analysis replicate  blank

# the spectra between shift values 700 and 800 are plotted
# the colorBy is set to replicates to legend by replicate names
raman$plot_spectra(shift = c(700, 800), colorBy = "replicates")
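To make the role of replicate and blank names more concrete, the base R sketch below averages synthetic spectra by replicate name and then subtracts the averaged blank. The intensity values are made up and the arithmetic is a generic illustration, assuming simple mean averaging and point-wise subtraction. It is not the algorithm implemented in StreamFind.

```r
# Synthetic intensities (made-up values): rows are analyses, columns are shift points
intensity <- rbind(
  sample_1 = c(10, 20, 30),
  sample_2 = c(12, 22, 32),
  blank_1  = c(1, 2, 3),
  blank_2  = c(3, 4, 5)
)

# Replicate name of each analysis and the blank replicate assigned to each
replicates <- c("Sample", "Sample", "Blank", "Blank")
blanks <- rep("Blank", 4)

# Average the analyses within each replicate (rowsum() adds rows group-wise)
group_sums <- rowsum(intensity, group = replicates)
group_sizes <- as.vector(table(replicates)[rownames(group_sums)])
averaged <- group_sums / group_sizes

# Subtract the averaged blank assigned to the "Sample" replicate
blank_of_sample <- unique(blanks[replicates == "Sample"])
subtracted <- averaged["Sample", ] - averaged[blank_of_sample, ]
subtracted
```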

Processing `Workflow`

As mentioned above, a Workflow is assembled from an ordered list of ProcessingStep child class objects. Below, a list of ProcessingStep child class objects for processing the Raman spectra is created and added to the active field Workflow of the RamanEngine.

ps <- list(
  # averages the spectra for each analysis replicate
  RamanMethod_AverageSpectra_native(),

  # simple normalization based on minimum and maximum intensity
  RamanMethod_NormalizeSpectra_minmax(),

  # background subtraction
  RamanMethod_SubtractBlankSpectra_StreamFind(),

  # applies smoothing based on moving average
  RamanMethod_SmoothSpectra_movingaverage(windowSize = 4),

  # removes a section from the spectra from -40 to 300
  RamanMethod_DeleteSpectraSection_native(min = -40, max = 300),

  # removes a section from the spectra from 2000 to 3000
  RamanMethod_DeleteSpectraSection_native(min = 2000, max = 3000),

  # performs baseline correction
  RamanMethod_CorrectSpectraBaseline_baseline_als(lambda = 3, p = 0.06, maxit = 10)
)

# The workflow is added to the engine but not yet applied
# The results are still empty
raman$Workflow <- ps

# Gets the names of the results in the Analyses object
# As data processing was not yet applied, the results field in Analyses is empty
names(raman$Analyses$results)

NULL

# Shows the workflow
show(raman$Workflow)

1: AverageSpectra (native)
2: NormalizeSpectra (minmax)
3: SubtractBlankSpectra (StreamFind)
4: SmoothSpectra (movingaverage)
5: DeleteSpectraSection (native)
6: DeleteSpectraSection (native)
7: CorrectSpectraBaseline (baseline_als)

# The data processing workflow is applied
raman$run_workflow()

# Gets the names of the results in the Analyses object
# A RamanSpectra (Results child class) is now added with the processed spectra
names(raman$Analyses@results)

[1] "RamanSpectra"

The method run() can be used to apply a single ProcessingStep object to the data. Note that the ProcessingStep is always added to the bottom of the Workflow in the engine. Below, the normalization based on minimum and maximum is applied to the Raman spectra again and then the Workflow is shown, now including another normalization step in the last position.

# performs again normalization using minimum and maximum
raman$run(RamanMethod_NormalizeSpectra_minmax())

# the workflow is shown with another normalization step at the end
show(raman$Workflow)

1: AverageSpectra (native)
2: NormalizeSpectra (minmax)
3: SubtractBlankSpectra (StreamFind)
4: SmoothSpectra (movingaverage)
5: DeleteSpectraSection (native)
6: DeleteSpectraSection (native)
7: CorrectSpectraBaseline (baseline_als)
8: NormalizeSpectra (minmax)
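The idea of a workflow as an ordered list of steps, and of a single run that appends its step to the end, can be sketched in base R. All names here (normalize_minmax, delete_section, run_workflow, run_step) are hypothetical and the data are synthetic. The normalization uses the conventional min-max formula, which is an assumption and may differ from the StreamFind implementation.

```r
# Synthetic spectrum (made-up values)
spectrum <- data.frame(
  shift = c(100, 200, 400, 500, 600),
  intensity = c(5, 7, 10, 30, 20)
)

# Each step is a function that takes a spectrum and returns the transformed spectrum
normalize_minmax <- function(s) {
  rng <- range(s$intensity)
  s$intensity <- (s$intensity - rng[1]) / (rng[2] - rng[1])
  s
}

# Returns a step that removes the rows with shift values between min and max
delete_section <- function(min, max) {
  function(s) s[s$shift < min | s$shift > max, , drop = FALSE]
}

# The workflow is an ordered, named list of steps
workflow <- list(
  DeleteSpectraSection = delete_section(min = -40, max = 300),
  NormalizeSpectra = normalize_minmax
)

# Applies the steps in order, feeding each result into the next step
run_workflow <- function(data, steps) {
  Reduce(function(current, step) step(current), steps, data)
}

# Applies one extra step and appends it to the end of the workflow
run_step <- function(data, steps, name, step) {
  list(data = step(data), steps = c(steps, setNames(list(step), name)))
}

processed <- run_workflow(spectrum, workflow)
state <- run_step(processed, workflow, "NormalizeSpectra", normalize_minmax)

processed
names(state$steps)
```

Keeping every step as a function with the same input and output type is what allows the steps to be reordered or repeated freely in this sketch.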

`Results`

Once the data processing methods are applied, the results can be accessed with the dedicated and engine specific active fields and methods, as shown below. The results are always added as S7 Results child classes in the results field of the Analyses.

# The spectra results were added
names(raman$Analyses$results)

[1] "RamanSpectra"

# Results can be obtained with the dedicated active fields
# The Results active fields are engine specific
show(raman$Spectra)

Number spectra:  2 
Averaged:  TRUE 
Number peaks:  0 
Number chrom peaks:  0

# Processed spectrum, note that the blank was subtracted
raman$plot_spectra()

Saving and loading

The CoreEngine also holds the functionality to save the project in the engine (as an .rds or .sqlite file) and load it back. As shown below, the save() and load() methods are used for saving and loading the RamanEngine, respectively.

project_file_path <- file.path(getwd(), "raman_project.rds")
raman$save(project_file_path)

file.exists(project_file_path)

[1] TRUE

new_raman <- RamanEngine$new()
new_raman$load(project_file_path)

# the Metadata are the same as in the raman object, although
# the new_raman object was created with default Metadata
show(new_raman$Metadata)

name: NA
author: NA
date: 2025-06-13 12:16:05.381731
file: C:/Users/apoli/Documents/github/StreamFind/vignettes/articles/raman_project.rds

# the results are also available in the new_raman object
show(new_raman$Spectra)

Number spectra:  2 
Averaged:  TRUE 
Number peaks:  0 
Number chrom peaks:  0
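For readers unfamiliar with .rds files, the round trip can be illustrated with base R alone. The project list below is a hypothetical stand-in and does not reflect how StreamFind structures its saved files. The sketch only shows that an R object written with saveRDS() can be restored unchanged with readRDS().

```r
# Toy project state (hypothetical structure, not the StreamFind file layout)
project <- list(
  metadata = list(name = "Project Example", author = "Person Name"),
  workflow = c("AverageSpectra", "NormalizeSpectra"),
  results = list(spectra = data.frame(shift = c(400, 500), intensity = c(0.2, 1)))
)

# Save to a temporary file to avoid writing into the working directory
project_file <- tempfile(fileext = ".rds")
saveRDS(project, project_file)

# Load into a new object and compare it with the original
restored <- readRDS(project_file)
identical(restored, project)

# Clean up the temporary file
unlink(project_file)
```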

Conclusion

This quick guide introduced the general framework of StreamFind. StreamFind is a data-agnostic workflow designer for data processing that uses R6 classes to manage, process, visualize and report data within a project. The CoreEngine is the parent class of all data specific engines and manages the project information via the Metadata class. ProcessingStep objects harmonize the diversity of processing methods and algorithms and are combined in a specific order to assemble a Workflow. The data processing is delegated to child engines, such as the RamanEngine and MassSpecEngine. The Results can be accessed with dedicated active fields and methods (e.g. Spectra and plot_spectra()). StreamFind can be used via scripting, as demonstrated in this guide, or via the embedded shiny app for a graphical user interface. See the StreamFind App Guide for more information.

Ricardo Cunha

13 June, 2025
