StreamFind General Introduction
Ricardo Cunha
cunha@iuta.de
29 November, 2024
Source:vignettes/articles/general_guide.Rmd
The StreamFind R package is a data processing workflow designer. Besides data processing, the platform can also be used for data management, visualization and reporting. This guide describes the general framework behind StreamFind. StreamFind is built around R6 classes that act as data processing engines (the term is used as a metaphor) for different types of data, e.g. mass spectrometry (MS) and Raman spectroscopy data.
Data processing engines
Data processing engines are fundamentally reference classes with
methods to manage, process, visualize and report data within a project.
The CoreEngine is the parent class of all data-specific engines
(e.g. MassSpecEngine and RamanEngine). As the parent, the CoreEngine
holds the functions that are uniform across the child engines
(e.g. adding and removing analyses from the project).
core <- CoreEngine$new()
core
CoreEngine
File:
NA
Headers:name: NA
author: NA
date: 2024-11-29 10:05:08.854806
Workflow:
empty
Analyses:
empty
Note that when an empty CoreEngine is created, the required
ProjectHeaders are created with name, author, path and date.
ProjectHeaders can also be specified directly when creating the
CoreEngine via the headers argument, or added to the engine afterwards
as shown in the Project headers section. The CoreEngine does not handle
data processing directly. Processing methods are data specific and are
therefore used via the data-dedicated engines. However, the framework
for managing the data processing workflow and the results is implemented
in the CoreEngine and is therefore harmonized across engines. Users will
not use the CoreEngine directly, but it is important to understand that
it works in the background.
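The parent/child relationship described above can be illustrated with a toy sketch. It uses base R reference classes (setRefClass) instead of R6 so that it runs without extra packages, and the names ToyCoreEngine and ToyRamanEngine are hypothetical. It is not the StreamFind implementation, only a picture of how a parent class can hold shared methods that a data-specific child inherits.

```r
library(methods)

# Toy illustration only; ToyCoreEngine and ToyRamanEngine are hypothetical
# names and do not reproduce StreamFind's engines.
ToyCoreEngine <- setRefClass(
  "ToyCoreEngine",
  fields = list(analyses = "character"),
  methods = list(
    # shared by every child engine
    add_analyses = function(x) {
      analyses <<- unique(c(analyses, x))
      invisible(.self)
    },
    remove_analyses = function(x) {
      analyses <<- setdiff(analyses, x)
      invisible(.self)
    }
  )
)

# The child inherits the shared methods and adds data-specific behaviour.
ToyRamanEngine <- setRefClass(
  "ToyRamanEngine",
  contains = "ToyCoreEngine",
  methods = list(
    describe = function() {
      paste("Raman engine with", length(analyses), "analyses")
    }
  )
)

toy <- ToyRamanEngine$new()
toy$add_analyses(c("a.asc", "b.asc", "c.asc"))
toy$remove_analyses("b.asc")
toy$describe()
```

The child never defines add_analyses() or remove_analyses() itself, which mirrors the idea that the uniform functions live in the parent engine.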
Project headers
The ProjectHeaders S7 class holds project information/metadata, such as
description, location, etc. Users can add any kind of attribute, but
each one must be named and have length one. Below, a list of headers is
created and added to the CoreEngine for demonstration. Internally, the
list of headers is converted to a ProjectHeaders object.
headers <- list(
  name = "Project Example",
  author = "Person Name",
  description = "Example of project headers"
)
core$headers <- headers
core$print_headers()
name: Project Example
author: Person Name
description: Example of project headers
date: 2024-11-29 10:05:08.936701
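The two rules stated above (every header is named and has length one) can be sketched as a small check. The helper check_headers() is hypothetical and written in base R; StreamFind's own validation is not shown in this guide and may differ.

```r
# Hypothetical helper, not part of StreamFind: returns TRUE only when every
# entry of the list is named and has length one.
check_headers <- function(headers) {
  nms <- names(headers)
  all_named <- !is.null(nms) && all(!is.na(nms) & nzchar(nms))
  all_length_one <- all(lengths(headers) == 1L)
  all_named && all_length_one
}

ok <- check_headers(list(name = "Project Example", author = "Person Name"))
unnamed <- check_headers(list(name = "Project Example", "unnamed entry"))
too_long <- check_headers(list(name = c("two", "values")))

c(ok = ok, unnamed = unnamed, too_long = too_long)
```

Only the first list passes; the second contains an unnamed entry and the third an entry of length two.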
Processing settings
A data processing workflow is represented in StreamFind by the S7
class Workflow, which is composed of an ordered list of S7 class
ProcessingSettings objects. Each ProcessingSettings object represents a
processing method/step that transforms the data according to a specific
algorithm. The ProcessingSettings objects are used to harmonize the
diversity of processing methods and algorithms available for a given
data type.
ProcessingSettings()
<StreamFind::ProcessingSettings>
@ engine : chr NA
@ method : chr NA
@ algorithm : chr NA
@ parameters : list()
@ number_permitted: num NA
@ version : chr NA
@ software : chr NA
@ developer : chr NA
@ contact : chr NA
@ link : chr NA
@ doi : chr NA
@ call : chr "NASettings_NA_NA"
A ProcessingSettings object must always have the engine type, the
processing method name, the name of the algorithm to be used, the origin
software, the main developer name and contact, as well as a link to
further information and the DOI, when available. Last but not least, it
holds the parameters, a flexible list of conditions for applying the
algorithm during data processing. As an example, the ProcessingSettings
for annotating features using a native algorithm from StreamFind is
shown below. Each ProcessingSettings object has a dedicated constructor
with documentation to support its usage. Help pages for processing
methods can be obtained with the native R function ? or help()
(e.g., help(MassSpecSettings_AnnotateFeatures_StreamFind)).
# constructor for annotating features workflow step
# the constructor name gives away the engine, method and algorithm
# i.e.
# - the engine is MassSpecEngine
# - the method is AnnotateFeatures
# - the algorithm is StreamFind
MassSpecSettings_AnnotateFeatures_StreamFind()
<StreamFind::MassSpecSettings_AnnotateFeatures_StreamFind>
@ engine : chr "MassSpec"
@ method : chr "AnnotateFeatures"
@ algorithm : chr "StreamFind"
@ parameters :List of 4
.. $ maxIsotopes : int 8
.. $ maxCharge : int 1
.. $ rtWindowAlignment: num 0.3
.. $ maxGaps : int 1
@ number_permitted: num 1
@ version : chr "0.2.0"
@ software : chr "StreamFind"
@ developer : chr "Ricardo Cunha"
@ contact : chr "cunha@iuta.de"
@ link : chr "https://odea-project.github.io/StreamFind"
@ doi : chr NA
@ call : chr "MassSpecSettings_AnnotateFeatures_StreamFind"
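Judging from the two printed call fields above ("NASettings_NA_NA" and "MassSpecSettings_AnnotateFeatures_StreamFind"), the constructor name appears to follow the pattern engine, "Settings_", method, "_", algorithm. The sketch below reproduces that pattern with a hypothetical helper, settings_call(). How StreamFind actually builds the field is an assumption here.

```r
# Hypothetical helper: composes a constructor-style name from its three parts.
# The pattern is inferred from the printed output, not taken from StreamFind.
settings_call <- function(engine, method, algorithm) {
  paste0(engine, "Settings_", method, "_", algorithm)
}

annotate_call <- settings_call("MassSpec", "AnnotateFeatures", "StreamFind")
empty_call <- settings_call(NA, NA, NA)

annotate_call
empty_call
```

Because paste0() turns NA into the string "NA", the empty case yields the same value as the default ProcessingSettings() printed earlier.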
Saving and loading
The CoreEngine also holds the functionality to save the project held in
the engine to disk (as an .rds or .sqlite file) and to load it back. As
shown below, the save() and load() methods are used for saving and
loading the project, respectively.
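# assumption: the path below is hypothetical, and save() is assumed to take
# the file path as its first argument, mirroring load()
project_file_path <- file.path(tempdir(), "project.rds")
core$save(project_file_path)
file.exists(project_file_path)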
[1] TRUE
new_core <- CoreEngine$new()
new_core$load(project_file_path)
# the headers are the same as in the core object, although
# new_core was created with default headers
new_core$print_headers()
name: Project Example
author: Person Name
description: Example of project headers
date: 2024-11-29 10:05:08.936701
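The .rds format mentioned above is base R's serialization format for a single object. As a minimal sketch of the round trip, assuming only base R (what StreamFind writes into the file is not shown in this guide), a list of headers can be saved and restored like this:

```r
# Base R round trip through an .rds file; the headers list is illustrative.
headers <- list(name = "Project Example", author = "Person Name")

path <- tempfile(fileext = ".rds")
saveRDS(headers, path)

restored <- readRDS(path)
round_trip_ok <- identical(headers, restored)
round_trip_ok

# remove the temporary file again
unlink(path)
```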
Data specific engines
As mentioned above, the CoreEngine does not handle data processing
directly. The data processing is delegated to child engines, where
specific ProcessingSettings can be applied. A simple example is given
below by creating a child RamanEngine and accessing the spectra from the
analyses (added as full paths to .asc files on disk). Note that the
workflow and results are still empty, as no data processing methods have
been applied.
# Example raman .asc files
raman_ex_files <- StreamFindData::get_raman_file_paths()
raman <- RamanEngine$new(analyses = raman_ex_files)
raman
RamanEngine
File:
NA
Headers:name: NA
author: NA
date: 2024-11-29 10:05:09.579761
Workflow:
empty
Analyses:
analysis replicate blank type spectra
<char> <char> <char> <char> <num>
1: raman_Bevacizumab_11731 raman_Bevacizumab_11731 <NA> raman 1024
2: raman_Bevacizumab_11732 raman_Bevacizumab_11732 <NA> raman 1024
3: raman_Bevacizumab_11733 raman_Bevacizumab_11733 <NA> raman 1024
4: raman_Bevacizumab_11734 raman_Bevacizumab_11734 <NA> raman 1024
5: raman_Bevacizumab_11735 raman_Bevacizumab_11735 <NA> raman 1024
6: raman_Bevacizumab_11736 raman_Bevacizumab_11736 <NA> raman 1024
7: raman_Bevacizumab_11737 raman_Bevacizumab_11737 <NA> raman 1024
8: raman_Bevacizumab_11738 raman_Bevacizumab_11738 <NA> raman 1024
9: raman_Bevacizumab_11739 raman_Bevacizumab_11739 <NA> raman 1024
10: raman_Bevacizumab_11740 raman_Bevacizumab_11740 <NA> raman 1024
11: raman_Bevacizumab_11741 raman_Bevacizumab_11741 <NA> raman 1024
12: raman_blank_Bevacizumab_10005 raman_blank_Bevacizumab_10005 <NA> raman 1024
13: raman_blank_Bevacizumab_10006 raman_blank_Bevacizumab_10006 <NA> raman 1024
14: raman_blank_Bevacizumab_10007 raman_blank_Bevacizumab_10007 <NA> raman 1024
15: raman_blank_Bevacizumab_10008 raman_blank_Bevacizumab_10008 <NA> raman 1024
16: raman_blank_Bevacizumab_10009 raman_blank_Bevacizumab_10009 <NA> raman 1024
17: raman_blank_Bevacizumab_10010 raman_blank_Bevacizumab_10010 <NA> raman 1024
18: raman_blank_Bevacizumab_10011 raman_blank_Bevacizumab_10011 <NA> raman 1024
19: raman_blank_Bevacizumab_10012 raman_blank_Bevacizumab_10012 <NA> raman 1024
20: raman_blank_Bevacizumab_10013 raman_blank_Bevacizumab_10013 <NA> raman 1024
21: raman_blank_Bevacizumab_10014 raman_blank_Bevacizumab_10014 <NA> raman 1024
22: raman_blank_Bevacizumab_10015 raman_blank_Bevacizumab_10015 <NA> raman 1024
analysis replicate blank type spectra
# when interactive is TRUE, the spectra are plotted with plotly
raman$plot_spectra(interactive = FALSE)
Managing analyses
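Analyses can be added to and removed from the engine with the
add_analyses() and remove_analyses() methods, respectively. Below, the
1st and 12th analyses are removed from the engine and then added back.
The two printed values are the number of analyses in the engine after
the removal and after the re-addition.
[1] 20
[1] 22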
For data processing, the analysis replicate names and the corresponding blank analysis replicates can be assigned with dedicated methods, as shown below. For instance, the replicate names are used for averaging the spectra of the corresponding analyses, and the assigned blanks are used for background subtraction, as described in the Processing workflow section.
raman$add_replicate_names(c(rep("Sample", 11), rep("Blank", 11)))
raman$add_blank_names(rep("Blank", 22))
# the replicate names are modified and the blanks are assigned
raman
RamanEngine
File:
NA
Headers:name: NA
author: NA
date: 2024-11-29 10:05:09.579761
Workflow:
empty
Analyses:
analysis replicate blank type spectra
<char> <char> <char> <char> <num>
1: raman_Bevacizumab_11731 Sample Blank raman 1024
2: raman_Bevacizumab_11732 Sample Blank raman 1024
3: raman_Bevacizumab_11733 Sample Blank raman 1024
4: raman_Bevacizumab_11734 Sample Blank raman 1024
5: raman_Bevacizumab_11735 Sample Blank raman 1024
6: raman_Bevacizumab_11736 Sample Blank raman 1024
7: raman_Bevacizumab_11737 Sample Blank raman 1024
8: raman_Bevacizumab_11738 Sample Blank raman 1024
9: raman_Bevacizumab_11739 Sample Blank raman 1024
10: raman_Bevacizumab_11740 Sample Blank raman 1024
11: raman_Bevacizumab_11741 Sample Blank raman 1024
12: raman_blank_Bevacizumab_10005 Blank Blank raman 1024
13: raman_blank_Bevacizumab_10006 Blank Blank raman 1024
14: raman_blank_Bevacizumab_10007 Blank Blank raman 1024
15: raman_blank_Bevacizumab_10008 Blank Blank raman 1024
16: raman_blank_Bevacizumab_10009 Blank Blank raman 1024
17: raman_blank_Bevacizumab_10010 Blank Blank raman 1024
18: raman_blank_Bevacizumab_10011 Blank Blank raman 1024
19: raman_blank_Bevacizumab_10012 Blank Blank raman 1024
20: raman_blank_Bevacizumab_10013 Blank Blank raman 1024
21: raman_blank_Bevacizumab_10014 Blank Blank raman 1024
22: raman_blank_Bevacizumab_10015 Blank Blank raman 1024
analysis replicate blank type spectra
# the spectra are plotted with the replicates colored
raman$plot_spectra(interactive = FALSE, colorBy = "replicates")
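To make the roles of replicate names and blanks concrete, the sketch below averages toy intensities per replicate and then subtracts the averaged blank from the averaged sample. It uses base R only, the data are invented for illustration, and it is not how StreamFind implements averaging or blank subtraction.

```r
# Toy data: two sample and two blank analyses measured at three shift values.
spectra <- data.frame(
  analysis = rep(c("s1", "s2", "b1", "b2"), each = 3),
  replicate = rep(c("Sample", "Sample", "Blank", "Blank"), each = 3),
  shift = rep(c(100, 200, 300), times = 4),
  intensity = c(10, 20, 30, 12, 22, 32, 1, 2, 3, 3, 4, 5)
)

# Average the intensity of all analyses that share a replicate name.
averaged <- aggregate(intensity ~ replicate + shift, data = spectra, FUN = mean)

sample_avg <- averaged[averaged$replicate == "Sample", ]
blank_avg <- averaged[averaged$replicate == "Blank", ]

# Align both on shift before subtracting the blank from the sample.
sample_avg <- sample_avg[order(sample_avg$shift), ]
blank_avg <- blank_avg[order(blank_avg$shift), ]
corrected <- sample_avg$intensity - blank_avg$intensity
corrected
```

The averaged sample intensities are 11, 21 and 31, the averaged blank intensities are 2, 3 and 4, so the corrected values are 9, 18 and 27.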
Processing workflow
As mentioned above, ProcessingSettings are used to design an ordered
list of processing methods in a Workflow object. Below, we create a list
of ProcessingSettings for processing the Raman spectra and add it to the
raman engine.
ps <- list(
  # averages the spectra for each analysis replicate
  RamanSettings_AverageSpectra_StreamFind(),
  # simple normalization based on minimum and maximum intensity
  RamanSettings_NormalizeSpectra_minmax(),
  # background subtraction
  RamanSettings_SubtractBlankSpectra_StreamFind(),
  # applies smoothing based on moving average
  RamanSettings_SmoothSpectra_movingaverage(windowSize = 4),
  # removes a section from the spectra from -40 to 300
  RamanSettings_DeleteSpectraSection_StreamFind(shiftmin = -40, shiftmax = 300),
  # removes a section from the spectra from 2000 to 3000
  RamanSettings_DeleteSpectraSection_StreamFind(shiftmin = 2000, shiftmax = 3000),
  # performs baseline correction
  RamanSettings_CorrectSpectraBaseline_baseline_als(lambda = 3, p = 0.06, maxit = 10)
)
# the workflow is added to the engine but not yet applied
# the results are still empty
raman$workflow <- ps
raman$print_workflow()
1: AverageSpectra (StreamFind)
2: NormalizeSpectra (minmax)
3: SubtractBlankSpectra (StreamFind)
4: SmoothSpectra (movingaverage)
5: DeleteSpectraSection (StreamFind)
6: DeleteSpectraSection (StreamFind)
7: CorrectSpectraBaseline (baseline_als)
# the data processing workflow is applied
raman$run_workflow()
The run() method can be used to apply a single ProcessingSettings object
to the data. Note that the ProcessingSettings step is always added to
the bottom of the workflow in the engine. Below, the normalization based
on minimum and maximum is applied to the Raman spectra again, and the
workflow is then shown with the additional normalization step in the
last position.
# performs again normalization using minimum and maximum
raman$run(RamanSettings_NormalizeSpectra_minmax())
# the workflow is shown with another normalization step at the end
raman$print_workflow()
1: AverageSpectra (StreamFind)
2: NormalizeSpectra (minmax)
3: SubtractBlankSpectra (StreamFind)
4: SmoothSpectra (movingaverage)
5: DeleteSpectraSection (StreamFind)
6: DeleteSpectraSection (StreamFind)
7: CorrectSpectraBaseline (baseline_als)
8: NormalizeSpectra (minmax)
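The behaviour described above, where each step that is run is appended to the bottom of the workflow, can be sketched with plain functions. Everything in the sketch is hypothetical (run_step(), the two toy processing functions and the data); it only illustrates the idea of an ordered workflow and is not StreamFind's implementation.

```r
# Toy processing steps; NA values at the edges are ignored when normalizing.
normalize_minmax <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
smooth_moving_average <- function(x, window_size = 3) {
  # centered moving average; positions without a full window become NA
  as.numeric(stats::filter(x, rep(1 / window_size, window_size), sides = 2))
}

# Hypothetical runner: applies one step and records it at the end of the workflow.
run_step <- function(state, name, step) {
  state$data <- step(state$data)
  state$workflow <- c(state$workflow, setNames(list(step), name))
  state
}

state <- list(data = c(2, 4, 6, 10, 6, 4, 2), workflow = list())
state <- run_step(state, "NormalizeSpectra", normalize_minmax)
state <- run_step(state, "SmoothSpectra", smooth_moving_average)
# running normalization again appends a second NormalizeSpectra step
state <- run_step(state, "NormalizeSpectra", normalize_minmax)

names(state$workflow)
state$data
```

Keeping the data and the list of applied steps together is one simple way to make the processing history reproducible, which is the role the workflow plays in the engine.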
Results
Once the data processing methods are applied, the results can be
accessed with the dedicated, engine-specific active fields, as shown
below. The results are always added as child classes of the S7 Results
class.
# the spectra results were added
raman
RamanEngine
File:
NA
Headers:name: NA
author: NA
date: 2024-11-29 10:05:09.579761
Workflow:
1: AverageSpectra (StreamFind)
2: NormalizeSpectra (minmax)
3: SubtractBlankSpectra (StreamFind)
4: SmoothSpectra (movingaverage)
5: DeleteSpectraSection (StreamFind)
6: DeleteSpectraSection (StreamFind)
7: CorrectSpectraBaseline (baseline_als)
8: NormalizeSpectra (minmax)
Analyses:
analysis replicate blank type spectra
<char> <char> <char> <char> <num>
1: raman_Bevacizumab_11731 Sample Blank raman 1024
2: raman_Bevacizumab_11732 Sample Blank raman 1024
3: raman_Bevacizumab_11733 Sample Blank raman 1024
4: raman_Bevacizumab_11734 Sample Blank raman 1024
5: raman_Bevacizumab_11735 Sample Blank raman 1024
6: raman_Bevacizumab_11736 Sample Blank raman 1024
7: raman_Bevacizumab_11737 Sample Blank raman 1024
8: raman_Bevacizumab_11738 Sample Blank raman 1024
9: raman_Bevacizumab_11739 Sample Blank raman 1024
10: raman_Bevacizumab_11740 Sample Blank raman 1024
11: raman_Bevacizumab_11741 Sample Blank raman 1024
12: raman_blank_Bevacizumab_10005 Blank Blank raman 1024
13: raman_blank_Bevacizumab_10006 Blank Blank raman 1024
14: raman_blank_Bevacizumab_10007 Blank Blank raman 1024
15: raman_blank_Bevacizumab_10008 Blank Blank raman 1024
16: raman_blank_Bevacizumab_10009 Blank Blank raman 1024
17: raman_blank_Bevacizumab_10010 Blank Blank raman 1024
18: raman_blank_Bevacizumab_10011 Blank Blank raman 1024
19: raman_blank_Bevacizumab_10012 Blank Blank raman 1024
20: raman_blank_Bevacizumab_10013 Blank Blank raman 1024
21: raman_blank_Bevacizumab_10014 Blank Blank raman 1024
22: raman_blank_Bevacizumab_10015 Blank Blank raman 1024
analysis replicate blank type spectra
Result 1: StreamFind::Spectra
# results can be obtained with the dedicated active fields
raman$spectra
<StreamFind::Spectra>
@ name : chr "Spectra"
@ software : chr "StreamFind"
@ version : chr "0.2.0"
@ spectra :List of 2
.. $ Sample:Classes 'data.table' and 'data.frame': 690 obs. of 5 variables:
.. ..$ shift : num [1:690] 300 303 306 309 312 ...
.. ..$ intensity: num [1:690] 0.0886 0.0386 0.1075 0.1457 0.2491 ...
.. ..$ blank : num [1:690] 0.75 0.733 0.717 0.706 0.695 ...
.. ..$ baseline : num [1:690] 0.0454 0.0456 0.0458 0.0461 0.0463 ...
.. ..$ raw : num [1:690] 0.0452 0.0453 0.0458 0.0461 0.0467 ...
.. ..- attr(*, ".internal.selfref")=<externalptr>
.. $ Blank :Classes 'data.table' and 'data.frame': 0 obs. of 0 variables
.. ..- attr(*, ".internal.selfref")=<externalptr>
@ is_averaged : logi TRUE
@ is_neutralized: logi FALSE
@ peaks : list()
@ has_peaks : logi FALSE
@ charges : list()
# resulting spectrum
raman$plot_spectra()
Conclusion
This quick guide introduced the general framework of StreamFind.
StreamFind is a data processing workflow designer that uses R6 classes
to manage, process, visualize and report data within a project. The
CoreEngine is the parent class of all data-specific engines and manages
the project information via the ProjectHeaders class. ProcessingSettings
objects are used to harmonize the diversity of processing methods and
algorithms, and a Workflow is assembled by combining different
ProcessingSettings in a specific order. The data processing is delegated
to child engines, such as the RamanEngine and MassSpecEngine. The
results can be accessed with dedicated fields and methods (e.g. spectra
and plot_spectra()). StreamFind can be used via scripting, as
demonstrated in this guide, or via the embedded shiny app for a
graphical user interface. See the StreamFind App Guide for more
information.