StreamFind General Introduction
Ricardo Cunha
cunha@iuta.de
13 June, 2025
Source: vignettes/articles/general_guide.Rmd
Introduction
The StreamFind R package is a data-agnostic workflow designer for data processing. Besides data processing, the package can also be used for data management, visualization and reporting. This guide describes the general framework of StreamFind. StreamFind is centered around R6 classes that serve as data processing engines (a metaphor) for different types of data, e.g. mass spectrometry (MS) and Raman spectroscopy (RS) data.

StreamFind Concept
Internally, the engines (R6 classes) use seven central S7 classes. The Metadata class flexibly holds project information, such as name, author, date and file. The Workflow class is an ordered list of ProcessingStep objects, which are used to harmonize the diversity of processing methods and algorithms available for a given data type. The ProcessingStep class is a representation of a processing method/step that transforms the data according to a specific algorithm. The Analyses class holds the data to be processed and the Results class holds the results of the data processing. The AuditTrail class records any modification to the project. The Config class holds configuration parameters for the engines, app, etc.
CoreEngine
Data processing engines are fundamentally reference classes with methods to manage, process, visualize and report data within a project. The CoreEngine is the parent class of all data specific engines (e.g. MassSpecEngine and RamanEngine). As the parent, the CoreEngine provides the functions that are common to all data specific child engines (e.g. managing the Metadata, recording the AuditTrail and applying a data processing Workflow).
# Creates an empty CoreEngine
core <- CoreEngine$new()
# Prints the engine
core
CoreEngine
Metadata
name: NA
author: NA
date: 2025-06-13 12:16:04.831486
file: NA
Workflow
empty
Analyses
empty
Note that when an empty CoreEngine or any data specific engine is created, the required entries of Metadata are created with default name, author, date and file. Metadata entries can also be specified directly during creation of the CoreEngine via the argument metadata, or added to the engine as shown in the Metadata section below. The CoreEngine does not directly handle data processing. Processing methods are data specific and are therefore used via the data dedicated engines. However, the framework to manage the data processing workflow is implemented in the CoreEngine and is therefore harmonized across engines. Users will not use the CoreEngine directly, but it is important to understand that it works in the background.
Metadata
The Metadata S7 class is meant to hold project information, such as description, location, etc. Users can add any kind of entry as a named list. Below, a list of metadata is created and added to the CoreEngine for demonstration. Internally, the list is converted to a Metadata object. Modifying the entries in the Metadata works like modifying a list in R, and the Metadata can be accessed through the active field Metadata in the CoreEngine or any other data specific engine.
# Creates a named list with project metadata
mtd <- list(
name = "Project Example",
author = "Person Name",
description = "Example of project description"
)
# Adds/updates the Metadata in the CoreEngine
core$Metadata <- mtd
# Show method for the Metadata class
show(core$Metadata)
name: Project Example
author: Person Name
description: Example of project description
date: 2025-06-13 12:16:04.931265
file: NA
# Adding a new entry to the Metadata
core$Metadata[["second_author"]] <- "Second Person Name"
show(core$Metadata)
name: Project Example
author: Person Name
description: Example of project description
date: 2025-06-13 12:16:04.931265
file: NA
second_author: Second Person Name
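The assignment behaviour shown above (a plain named list is accepted, checked and stored with default entries filled in) can be imitated with an R6 active binding. The following is only a minimal sketch under stated assumptions: ToyEngine and ToyRamanEngine are hypothetical names, the stored object is a plain list rather than an S7 Metadata object, and nothing here is taken from the StreamFind source code.

```r
library(R6)

# ToyEngine is a hypothetical stand-in, not the StreamFind CoreEngine
ToyEngine <- R6Class(
  "ToyEngine",
  public = list(
    initialize = function(metadata = list()) {
      self$Metadata <- metadata
    }
  ),
  active = list(
    # Acts as getter when called without a value, as validating setter otherwise
    Metadata = function(value) {
      if (missing(value)) return(private$.metadata)
      stopifnot(is.list(value))
      if (length(value) > 0) {
        stopifnot(!is.null(names(value)), all(nzchar(names(value))))
      }
      defaults <- list(
        name = NA_character_,
        author = NA_character_,
        date = Sys.time(),
        file = NA_character_
      )
      # Entries in value override the defaults; new names are appended
      private$.metadata <- utils::modifyList(defaults, value)
      invisible(self)
    }
  ),
  private = list(.metadata = NULL)
)

# A child engine inherits the Metadata handling unchanged
ToyRamanEngine <- R6Class("ToyRamanEngine", inherit = ToyEngine)

toy <- ToyRamanEngine$new(metadata = list(name = "Project Example"))
toy$Metadata[["second_author"]] <- "Second Person Name"
names(toy$Metadata)
class(toy)
```

Placing the validation in the parent class is what lets every child engine share the same metadata handling without repeating it.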
Workflow
A data processing workflow is represented in StreamFind by the class Workflow. A Workflow is composed of an ordered list of ProcessingStep objects. Each ProcessingStep object is a representation of a processing method/step that transforms the data according to a specific algorithm. The ProcessingStep class is used to harmonize the diversity of processing methods and algorithms available for a given data type.
# Constructor for a processing step
ProcessingStep()
<StreamFind::ProcessingStep>
@ data_type : chr NA
@ method : chr NA
@ required : chr NA
@ algorithm : chr NA
@ parameters : list()
@ number_permitted: num NA
@ version : chr NA
@ software : chr NA
@ developer : chr NA
@ contact : chr NA
@ link : chr NA
@ doi : chr NA
@ call : chr "NAMethod_NA_NA"
A ProcessingStep object must always have the data type, the processing method name, the name of the algorithm to be used, the origin software, the main developer name and contact, as well as a link to further information and the DOI, when available. Last but not least, it holds the parameters, a flexible list of conditions for applying the algorithm during data processing.
The ProcessingStep is a generic parent class which delegates to child classes for specific data processing methods and algorithms. As an example, the ProcessingStep child class for annotating features within a non-target screening workflow using a native algorithm from StreamFind is shown below. Each ProcessingStep child class has a dedicated constructor method with documentation to support its usage. Help pages for processing methods can be obtained with the native R function ? or help() (e.g., help(MassSpecMethod_AnnotateFeatures_StreamFind)).
# constructor of ProcessingStep child class
# for annotating features in a non-target screening workflow
# the constructor name gives away the engine, method and algorithm
# i.e.
# - the data type is MassSpec
# - the method name is AnnotateFeatures
# - the algorithm name is StreamFind
MassSpecMethod_AnnotateFeatures_StreamFind()
<StreamFind::MassSpecMethod_AnnotateFeatures_StreamFind>
@ data_type : chr "MassSpec"
@ method : chr "AnnotateFeatures"
@ required : chr "FindFeatures"
@ algorithm : chr "StreamFind"
@ parameters :List of 4
.. $ maxIsotopes : int 8
.. $ maxCharge : int 1
.. $ rtWindowAlignment: num 0.3
.. $ maxGaps : int 1
@ number_permitted: num 1
@ version : chr "0.2.0"
@ software : chr "StreamFind"
@ developer : chr "Ricardo Cunha"
@ contact : chr "cunha@iuta.de"
@ link : chr "https://odea-project.github.io/StreamFind"
@ doi : chr NA
@ call : chr "MassSpecMethod_AnnotateFeatures_StreamFind"
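The call values printed above ("NAMethod_NA_NA" for the empty constructor and "MassSpecMethod_AnnotateFeatures_StreamFind" for the child class) suggest that a constructor name is the data type, the literal "Method_", the method name and the algorithm name joined together. The sketch below reproduces that pattern in base R. The helpers step_call_name() and split_step_call() are hypothetical and not part of StreamFind, and the splitting assumes that method names contain no underscore, which holds for the names used in this guide.

```r
# Hypothetical helper: builds a constructor name from its three parts
step_call_name <- function(data_type, method, algorithm) {
  paste0(data_type, "Method_", method, "_", algorithm)
}

# Hypothetical helper: splits a constructor name back into its parts.
# The algorithm part may itself contain underscores (e.g. baseline_als).
split_step_call <- function(call) {
  m <- regmatches(call, regexec("^(.+)Method_([^_]+)_(.+)$", call))[[1]]
  list(data_type = m[2], method = m[3], algorithm = m[4])
}

step_call_name("MassSpec", "AnnotateFeatures", "StreamFind")
step_call_name(NA, NA, NA)
split_step_call("RamanMethod_CorrectSpectraBaseline_baseline_als")
```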
Analyses
As mentioned above, the CoreEngine does not handle data processing directly. The data processing is delegated to child engines. A simple example is given below by creating a child RamanEngine from a vector of paths to .asc files with Raman spectra. The Raman spectra are used internally to initiate a RamanAnalyses object (child class of Analyses), which holds the raw data and any data processing Results objects. Note that the Workflow and Results are still empty, as no data processing methods have been applied.
# Example raman .asc files
raman_ex_files <- StreamFindData::get_raman_file_paths()
# Creates a RamanEngine with the example files
raman <- RamanEngine$new(analyses = raman_ex_files)
# Show the engine class hierarchy
class(raman)
[1] "RamanEngine" "CoreEngine" "R6"
Data specific engines have dedicated active fields to access the data. For instance, the Analyses active field in the RamanEngine is used to access the raw spectra and any Results. Note that properties of S7 classes should be accessed with @ instead of $, but the $ operator is also available for convenience.
# Gets the length of Analyses in the RamanEngine
length(raman$Analyses)
[1] 22
# Gets the names of the Analyses in the RamanEngine
names(raman$Analyses)
raman_Bevacizumab_11731 raman_Bevacizumab_11732
"raman_Bevacizumab_11731" "raman_Bevacizumab_11732"
raman_Bevacizumab_11733 raman_Bevacizumab_11734
"raman_Bevacizumab_11733" "raman_Bevacizumab_11734"
raman_Bevacizumab_11735 raman_Bevacizumab_11736
"raman_Bevacizumab_11735" "raman_Bevacizumab_11736"
raman_Bevacizumab_11737 raman_Bevacizumab_11738
"raman_Bevacizumab_11737" "raman_Bevacizumab_11738"
raman_Bevacizumab_11739 raman_Bevacizumab_11740
"raman_Bevacizumab_11739" "raman_Bevacizumab_11740"
raman_Bevacizumab_11741 raman_blank_Bevacizumab_10005
"raman_Bevacizumab_11741" "raman_blank_Bevacizumab_10005"
raman_blank_Bevacizumab_10006 raman_blank_Bevacizumab_10007
"raman_blank_Bevacizumab_10006" "raman_blank_Bevacizumab_10007"
raman_blank_Bevacizumab_10008 raman_blank_Bevacizumab_10009
"raman_blank_Bevacizumab_10008" "raman_blank_Bevacizumab_10009"
raman_blank_Bevacizumab_10010 raman_blank_Bevacizumab_10011
"raman_blank_Bevacizumab_10010" "raman_blank_Bevacizumab_10011"
raman_blank_Bevacizumab_10012 raman_blank_Bevacizumab_10013
"raman_blank_Bevacizumab_10012" "raman_blank_Bevacizumab_10013"
raman_blank_Bevacizumab_10014 raman_blank_Bevacizumab_10015
"raman_blank_Bevacizumab_10014" "raman_blank_Bevacizumab_10015"
# Access the spectrum of the first analysis in the Analyses object
head(raman$Analyses@Spectra@spectra[[1]])
shift intensity
<num> <num>
1: -33.11349 569
2: -29.93873 572
3: -26.76505 573
4: -23.59243 570
5: -20.42305 573
6: -17.25473 576
The methods for data access and visualization are also implemented as public methods in the data specific engine class. Although data can be obtained directly from the Analyses child classes, the public methods in the engine are the preferred interface. Below, the plot_spectra() method is used to plot the raw spectra from analyses 1 and 12.
# Plots the spectrum from analyses 1 and 12 in the RamanEngine
raman$plot_spectra(analyses = c(1, 12))
Managing Analyses
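Analyses can be added to and removed from the engine with the add_analyses() and remove_analyses() methods, respectively. Below, the 1st and 12th analyses are removed from the engine and then added back. The two outputs give the number of analyses after each operation: 20 after removal and 22 after adding them back.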
[1] 20
[1] 22
For data processing, the analysis replicate names and the corresponding blank analysis replicates can be assigned with dedicated methods, as shown below. For instance, the replicate names are used for averaging the spectra of the corresponding analyses and the assigned blanks are used for background subtraction.
# Adds replicate names and blank names
raman$add_replicate_names(c(rep("Sample", 11), rep("Blank", 11)))
raman$add_blank_names(rep("Blank", 22))
# the replicate names are modified
raman$Analyses$info[, c(1:3)]
analysis replicate blank
<char> <char> <char>
1: raman_Bevacizumab_11731 Sample Blank
2: raman_Bevacizumab_11732 Sample Blank
3: raman_Bevacizumab_11733 Sample Blank
4: raman_Bevacizumab_11734 Sample Blank
5: raman_Bevacizumab_11735 Sample Blank
6: raman_Bevacizumab_11736 Sample Blank
7: raman_Bevacizumab_11737 Sample Blank
8: raman_Bevacizumab_11738 Sample Blank
9: raman_Bevacizumab_11739 Sample Blank
10: raman_Bevacizumab_11740 Sample Blank
11: raman_Bevacizumab_11741 Sample Blank
12: raman_blank_Bevacizumab_10005 Blank Blank
13: raman_blank_Bevacizumab_10006 Blank Blank
14: raman_blank_Bevacizumab_10007 Blank Blank
15: raman_blank_Bevacizumab_10008 Blank Blank
16: raman_blank_Bevacizumab_10009 Blank Blank
17: raman_blank_Bevacizumab_10010 Blank Blank
18: raman_blank_Bevacizumab_10011 Blank Blank
19: raman_blank_Bevacizumab_10012 Blank Blank
20: raman_blank_Bevacizumab_10013 Blank Blank
21: raman_blank_Bevacizumab_10014 Blank Blank
22: raman_blank_Bevacizumab_10015 Blank Blank
analysis replicate blank
# the spectra between shift values 700 and 800 are plotted
# the colorBy is set to replicates to legend by replicate names
raman$plot_spectra(shift = c(700, 800), colorBy = "replicates")
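To make the role of the replicate and blank names concrete, the sketch below averages spectra within each replicate group and then subtracts the averaged blank from the averaged sample, using base R only. The intensity values are made-up toy numbers and the code is an illustration of the idea, not the algorithm used by the StreamFind processing methods.

```r
# Toy intensities (made-up numbers): 4 analyses x 3 shift points
intensity <- rbind(
  sample_1 = c(10, 20, 30),
  sample_2 = c(12, 22, 32),
  blank_1 = c(1, 2, 3),
  blank_2 = c(3, 4, 5)
)

# One replicate name per analysis, as with add_replicate_names()
replicate_names <- c("Sample", "Sample", "Blank", "Blank")

# Average the spectra within each replicate group.
# rowsum() sorts the groups alphabetically, matching the order of table().
group_sizes <- as.vector(table(replicate_names))
averaged <- rowsum(intensity, replicate_names) / group_sizes

# Subtract the averaged blank from the averaged sample
corrected <- averaged["Sample", ] - averaged["Blank", ]
corrected
```

Averaging first and subtracting afterwards mirrors the order of the first steps in the workflow assembled in the next section.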
Processing Workflow
As mentioned above, a Workflow is defined by an ordered list of ProcessingStep child class objects. Below, a list of ProcessingStep child class objects for processing the Raman spectra is created and added to the active field Workflow of the RamanEngine.
ps <- list(
# averages the spectra for each analysis replicate
RamanMethod_AverageSpectra_native(),
# simple normalization based on minimum and maximum intensity
RamanMethod_NormalizeSpectra_minmax(),
# background subtraction
RamanMethod_SubtractBlankSpectra_StreamFind(),
# applies smoothing based on moving average
RamanMethod_SmoothSpectra_movingaverage(windowSize = 4),
# removes a section from the spectra from -40 to 300
RamanMethod_DeleteSpectraSection_native(min = -40, max = 300),
# removes a section from the spectra from 2000 to 3000
RamanMethod_DeleteSpectraSection_native(min = 2000, max = 3000),
# performs baseline correction
RamanMethod_CorrectSpectraBaseline_baseline_als(lambda = 3, p = 0.06, maxit = 10)
)
# The workflow is added to the engine but not yet applied
# The results are still empty
raman$Workflow <- ps
# Gets the names of the results in the Analyses object
# As data processing was not yet applied, the results field in Analyses is empty
names(raman$Analyses$results)
NULL
# Shows the workflow
show(raman$Workflow)
1: AverageSpectra (native)
2: NormalizeSpectra (minmax)
3: SubtractBlankSpectra (StreamFind)
4: SmoothSpectra (movingaverage)
5: DeleteSpectraSection (native)
6: DeleteSpectraSection (native)
7: CorrectSpectraBaseline (baseline_als)
# The data processing workflow is applied
raman$run_workflow()
# Gets the names of the results in the Analyses object
# A RamanSpectra (Results child class) is now added with the processed spectra
names(raman$Analyses@results)
[1] "RamanSpectra"
The method run() can be used to apply a single ProcessingStep object to the data. Note that the ProcessingStep is always added to the bottom of the Workflow in the engine. Below, the normalization based on minimum and maximum is applied to the Raman spectra and then the Workflow is shown, including another normalization step in the last position.
# performs again normalization using minimum and maximum
raman$run(RamanMethod_NormalizeSpectra_minmax())
# the workflow is shown with another normalization step at the end
show(raman$Workflow)
1: AverageSpectra (native)
2: NormalizeSpectra (minmax)
3: SubtractBlankSpectra (StreamFind)
4: SmoothSpectra (movingaverage)
5: DeleteSpectraSection (native)
6: DeleteSpectraSection (native)
7: CorrectSpectraBaseline (baseline_als)
8: NormalizeSpectra (minmax)
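The two behaviours described in this section, applying steps strictly in list order and appending a separately run step to the end of the workflow, can be sketched in base R with a named list of functions. This is a conceptual illustration only: the step names and functions are hypothetical and much simpler than the StreamFind processing methods. The input values are the six intensities printed earlier in this guide.

```r
# First six intensities of the spectrum printed earlier in this guide
intensity <- c(569, 572, 573, 570, 573, 576)

# A toy workflow: an ordered, named list of functions (hypothetical steps)
workflow <- list(
  SubtractMinimum = function(x) x - min(x),
  ScaleByMaximum = function(x) x / max(x)
)

# Apply the steps one after the other, in list order
processed <- Reduce(function(x, step) step(x), workflow, intensity)

# Apply a single extra step and append it to the end of the workflow
round_values <- function(x) round(x, 2)
processed <- round_values(processed)
workflow <- c(workflow, list(RoundValues = round_values))

names(workflow)
processed
```

Because each step receives the output of the previous one, reordering the list generally changes the result, which is why the workflow is kept as an ordered list.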
Results
Once the data processing methods are applied, the results can be accessed with the dedicated, engine specific active fields and methods, as shown below. The results are always added as S7 Results child classes in the results field of the Analyses.
# The spectra results were added
names(raman$Analyses$results)
[1] "RamanSpectra"
# Results can be obtained with the dedicated active fields
# The Results active fields are engine specific
show(raman$Spectra)
Number spectra: 2
Averaged: TRUE
Number peaks: 0
Number chrom peaks: 0
# Processed spectrum, note that the blank was subtracted
raman$plot_spectra()
Saving and loading
The CoreEngine also holds the functionality to save the project in the engine (as an .rds or .sqlite file) and load it back. As shown below, the save() and load() methods are used for saving and loading the RamanEngine, respectively.
file.exists(project_file_path)
[1] TRUE
new_raman <- RamanEngine$new()
new_raman$load(project_file_path)
# the Metadata are the same as in the raman object, although
# the new_raman object was created with default Metadata
show(new_raman$Metadata)
name: NA
author: NA
date: 2025-06-13 12:16:05.381731
file: C:/Users/apoli/Documents/github/StreamFind/vignettes/articles/raman_project.rds
# the results are also available in the new_raman object
show(new_raman$Spectra)
Number spectra: 2
Averaged: TRUE
Number peaks: 0
Number chrom peaks: 0
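For readers unfamiliar with the .rds format mentioned above, the following base R sketch shows a save and load round trip of a small project-like list with saveRDS() and readRDS(). It only illustrates the file format. Whether the engine save() and load() methods use these functions internally is an assumption, and the contents of the project list are hypothetical.

```r
# A hypothetical project-like object with metadata and a result summary
project <- list(
  metadata = list(name = "Project Example", author = "Person Name"),
  results = list(number_spectra = 2L, averaged = TRUE)
)

# Write the object to a temporary .rds file and read it back
path <- tempfile(fileext = ".rds")
saveRDS(project, path)
restored <- readRDS(path)

# The restored object is identical to the original
file.exists(path)
identical(project, restored)
```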
Conclusion
This quick guide introduced the general framework of StreamFind. StreamFind is a data-agnostic processing workflow designer that uses R6 classes to manage, process, visualize and report data within a project. The CoreEngine is the parent class of all data specific engines and manages the project information via the class Metadata. ProcessingStep objects are used to harmonize the diversity of processing methods and algorithms available in a Workflow object. The data processing is delegated to child engines, such as the RamanEngine and MassSpecEngine. The Workflow is assembled by combining different ProcessingStep child class objects in a specific order. The Results can be accessed with dedicated active fields and methods (e.g. Spectra and plot_spectra()).
StreamFind can be used via scripting, as demonstrated in this guide, or via the embedded shiny app for a graphical user interface. See the StreamFind App Guide for more information.