StreamFind Developer Guide
Implementation of processing methods and algorithms
Ricardo Cunha
cunha@iuta.de
01 August, 2024
Source: vignettes/articles/developer_guide.Rmd
StreamFind
StreamFind is an R package for data management, processing, visualization and reporting. This guide uses mass spectrometry (MS) data as an example and instructs developers on how to implement new processing modules, as well as additional processing algorithms for new or existing processing methods in StreamFind.
Setup
The R package is hosted in the StreamFind GitHub repository of the ODEA Project. For development, we recommend cloning the repository locally with git for version control. The GitHub Desktop tool can be used to install and configure git with your GitHub account more easily, which is recommended for authoring contributions. Since StreamFind is an R package, the RStudio IDE is recommended for development, although other IDEs (e.g., VS Code) will also work. In RStudio, the repository can be cloned by creating a new project, selecting version control, then git, and finally adding the GitHub URL https://github.com/odea-project/StreamFind. This creates a local copy of the StreamFind repository with git tracking, provided that git or GitHub Desktop was properly installed and configured. RStudio should directly identify the project as a package development project, with all tools available to support development. We recommend enabling the options Use devtools package functions if available and Generate documentation with Roxygen (via the Configure button, select all options), located in the Build tab under Configure Build Tools…. For other IDEs, we recommend using the devtools package. Once the local copy of the StreamFind repository is set up with git tracking, the first step for development is to create a dedicated branch for the implementation of new processing modules and/or algorithms. The master branch should not be changed directly but only through pull requests from the dedicated development branch, which gives the opportunity for code review.
Structure
The StreamFind R package is centered around the R6 class system, which brings object-oriented programming to R. For MS, the MassSpecEngine R6 class encapsulates both the data and the methods. The data is stored in private fields within the MassSpecEngine and can only be accessed and processed via the public methods. The creation of a MassSpecEngine object and the way to access and change its data are briefly shown below.
ms <- MassSpecEngine$new()
ms$add_headers(name = "Example", author = "Person A")
ms$get_headers()
##
## ProjectHeaders
## file: NA
## date: 2024-08-01 10:06:11.943092
## name: Example
## author: Person A
# print method. Note that MS data files were not yet added!
ms
##
## MassSpecEngine
## name Example
## author Person A
## file NA
## date 2024-08-01 10:06:11.943092
##
## Workflow empty
##
## Analyses empty
##
## Results empty
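The encapsulation described above can be illustrated with a minimal R6 sketch. The class ToyEngine and its field are hypothetical and are not part of StreamFind; the sketch only shows how data kept in private fields is reached through public methods.

```r
library(R6)

# Hypothetical toy class, not part of StreamFind.
ToyEngine <- R6Class(
  "ToyEngine",
  private = list(
    # Data lives in a private field and is not reachable with `$` from outside.
    .headers = NULL
  ),
  public = list(
    initialize = function() {
      private$.headers <- list(name = NA_character_, author = NA_character_)
    },
    # Public setter: the only way to modify the private headers.
    add_headers = function(...) {
      new_headers <- list(...)
      private$.headers[names(new_headers)] <- new_headers
      invisible(self)
    },
    # Public getter: returns the stored headers.
    get_headers = function() {
      private$.headers
    }
  )
)

toy <- ToyEngine$new()
toy$add_headers(name = "Example", author = "Person A")
toy$get_headers()$name
is.null(toy$.headers) # the private field is not visible from outside
```

Returning `invisible(self)` from the setter is a common R6 convention that allows chaining calls; whether StreamFind methods do the same is not stated in this guide.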
For the implementation of processing methods, the S3 class ProcessingSettings
is used to dispatch settings to processing methods. See the article Evaluation
of Wastewater Ozonation with Mass Spectrometry for a demonstration of its
usage. In essence, a given algorithm to be applied in a processing
method is added to the MassSpecEngine as a
ProcessingSettings object, which is then used to process the
data with the defined settings or parameters. The structure of a
ProcessingSettings object is exemplified below for the algorithm
openms, to be applied with the method MassSpecEngine$find_features().
ffs <- MassSpecSettings_FindFeatures_openms()
ffs
##
## ProcessingSettings
## engine MassSpec
## call FindFeatures
## algorithm openms
## version 0.2.0
## software openms
## developer Oliver Kohlbacher
## contact oliver.kohlbacher@uni-tuebingen.de
## link https://openms.de/
## doi https://doi.org/10.1038/nmeth.3959
##
## parameters:
## - noiseThrInt 1000
## - chromSNR 3
## - chromFWHM 7
## - mzPPM 15
## - reEstimateMTSD TRUE
## - traceTermCriterion sample_rate
## - traceTermOutliers 5
## - minSampleRate 1
## - minTraceLength 4
## - maxTraceLength 70
## - widthFiltering fixed
## - minFWHM 4
## - maxFWHM 35
## - traceSNRFiltering TRUE
## - localRTRange 0
## - localMZRange 0
## - isotopeFilteringModel none
## - MZScoring13C FALSE
## - useSmoothedInts FALSE
## - extraOpts
## - intSearchRTWindow 3
## - useFFMIntensities FALSE
## - verbose FALSE
As shown, the constructor of a ProcessingSettings object is a
function whose name always follows the pattern
[engine name in upper camel case]Settings_[method name in upper camel case]_[algorithm name].
More details are given in the Semantics section. The
ProcessingSettings object can then be added directly to the
MassSpecEngine.
ms$add_settings(ffs)
ms
##
## MassSpecEngine
## name Example
## author Person A
## file NA
## date 2024-08-01 10:06:11.943092
##
## Workflow
## 1: FindFeatures (openms)
##
## Analyses empty
##
## Results empty
Alternatively, the ProcessingSettings can be saved to a JSON file and imported from that file, as demonstrated below.
save_default_ProcessingSettings(
engine = "MassSpec",
call = "FindFeatures",
algorithm = "xcms3_centwave",
name = "ffs",
path = getwd()
)
## {
## "engine": [
## "MassSpec"
## ],
## "call": [
## "FindFeatures"
## ],
## "algorithm": [
## "xcms3_centwave"
## ],
## "parameters": {
## "class": [
## "CentWaveParam"
## ],
## "ppm": [
## 12
## ],
## "peakwidth": [
## 5,
## 60
## ],
## "snthresh": [
## 15
## ],
## "prefilter": [
## 5,
## 1500
## ],
## "mzCenterFun": [
## "wMean"
## ],
## "integrate": [
## 1
## ],
## "mzdiff": [
## -0.0002
## ],
## "fitgauss": [
## true
## ],
## "noise": [
## 500
## ],
## "verboseColumns": [
## true
## ],
## "roiList": [
##
## ],
## "firstBaselineCheck": [
## false
## ],
## "roiScales": [
##
## ],
## "extendLengthMSW": [
## false
## ]
## },
## "version": [
## "0.2.0"
## ],
## "software": [
## "xcms"
## ],
## "developer": [
## "Ralf Tautenhahn, Johannes Rainer"
## ],
## "contact": [
## "rtautenh@ipb-halle.de"
## ],
## "link": [
## "https://bioconductor.org/packages/release/bioc/html/xcms.html"
## ],
## "doi": [
## "https://doi.org/10.1186/1471-2105-9-504"
## ]
## }
##
ms$import_settings("ffs.json")
# "openms" replaced by "xcms3_centwave"
ms
##
## MassSpecEngine
## name Example
## author Person A
## file NA
## date 2024-08-01 10:06:11.943092
##
## Workflow
## 1: FindFeatures (xcms3_centwave)
##
## Analyses empty
##
## Results empty
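The bracketed single values in the JSON output above are consistent with how the jsonlite package serializes length-one R vectors by default. That StreamFind relies on jsonlite is an assumption made here; the sketch below only illustrates a round trip of a plain, settings-like list through a JSON file, using a hypothetical subset of the parameters.

```r
library(jsonlite)

# Hypothetical settings-like list (not a real ProcessingSettings object).
settings <- list(
  engine = "MassSpec",
  call = "FindFeatures",
  algorithm = "xcms3_centwave",
  parameters = list(ppm = 12, peakwidth = c(5, 60))
)

# With the default auto_unbox = FALSE, length-one vectors become JSON arrays.
json_file <- tempfile(fileext = ".json")
write_json(settings, json_file, pretty = TRUE)

# Reading the file back simplifies the arrays to R vectors again.
imported <- fromJSON(json_file)
imported$algorithm
imported$parameters$peakwidth
```

Because every entry is written as an array, a parameter holding one value and a parameter holding several values share the same representation in the file.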
The use of the S3 object system for ProcessingSettings gives
flexibility to the list of parameters, meaning that each parameter entry
can be a single numeric value, a vector of strings or even a full
data.frame if required. Each ProcessingSettings constructor (i.e.,
[engine name in upper camel case]Settings_[method name in upper camel case]_[algorithm name])
has a dedicated validation method to ensure that the parameters and
metadata conform to what the algorithm expects (as shown below). The
validation of a ProcessingSettings object is always performed before it
is applied in a processing method.
validate(ffs)
## [1] TRUE
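A minimal sketch of how a settings constructor and its validation method could fit together is given below. The names ToySettings_FindFeatures_demo and toy_validate, as well as the checks performed, are hypothetical and do not reproduce the StreamFind implementation.

```r
# Hypothetical constructor following the naming pattern
# [engine]Settings_[method]_[algorithm]; not part of StreamFind.
ToySettings_FindFeatures_demo <- function(noiseThrInt = 1000, mzPPM = 15) {
  settings <- list(
    engine = "Toy",
    call = "FindFeatures",
    algorithm = "demo",
    parameters = list(noiseThrInt = noiseThrInt, mzPPM = mzPPM)
  )
  # The class vector mirrors the order printed by class(ffs) in this guide.
  class(settings) <- c("ProcessingSettings", "ToySettings_FindFeatures_demo")
  settings
}

# Hypothetical validation generic with one method per settings class.
toy_validate <- function(x) UseMethod("toy_validate")

toy_validate.ToySettings_FindFeatures_demo <- function(x) {
  p <- x$parameters
  # Both parameters must be single numeric values; mzPPM must be positive.
  is.numeric(p$noiseThrInt) && length(p$noiseThrInt) == 1 &&
    is.numeric(p$mzPPM) && length(p$mzPPM) == 1 && p$mzPPM > 0
}

toy_validate(ToySettings_FindFeatures_demo())
toy_validate(ToySettings_FindFeatures_demo(mzPPM = "15"))
```

Keeping the checks in a method tied to the algorithm-specific class means each algorithm can define its own rules without affecting the others.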
Besides the S3 class ProcessingSettings, the
ProcessingSettings object receives other class names that are
used for S3 method dispatch (i.e., to direct the object to the
dedicated S3 method where the actual processing algorithm is applied).
Below we show the classes of the ffs ProcessingSettings object. The class
FindFeatures_patRoon means that the algorithm openms is applied via the
package patRoon. The class MassSpecSettings_FindFeatures_openms directs
the object to the right processing method and indicates which algorithm
is to be applied. For this, an S3 generic is used in each processing
method (e.g., MassSpecEngine$find_features() or
MassSpecEngine$group_features()) for the dispatch. This
process is not visible to the user but is essential for the developer.
The implementation of new processing methods and/or algorithms must
follow this structure. The Implementation section describes the
process of adding new methods and algorithms in more detail.
class(ffs)
## [1] "ProcessingSettings"
## [2] "MassSpecSettings_FindFeatures_openms"
## [3] "FindFeatures_patRoon"
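The dispatch mechanism can be sketched with plain S3 code. Everything below is a hypothetical toy: the generic .s3_ToyFindFeatures, the settings object and the filtering step are illustrative assumptions and do not reproduce the internals of the MassSpecEngine.

```r
# Hypothetical internal generic, named after the .s3_[module method] pattern.
.s3_ToyFindFeatures <- function(settings, data) {
  UseMethod(".s3_ToyFindFeatures")
}

# Method selected when the settings carry the class of the "demo" algorithm.
.s3_ToyFindFeatures.ToySettings_FindFeatures_demo <- function(settings, data) {
  # Toy "algorithm": keep intensities above the noise threshold.
  data[data > settings$parameters$noiseThrInt]
}

# Fallback for settings classes without an implemented algorithm.
.s3_ToyFindFeatures.default <- function(settings, data) {
  stop("No algorithm implemented for these settings.")
}

# Hypothetical settings object with a class layout similar to ffs.
toy_settings <- structure(
  list(parameters = list(noiseThrInt = 1000)),
  class = c("ProcessingSettings", "ToySettings_FindFeatures_demo")
)

# A processing method would call the generic and let S3 pick the algorithm.
intensities <- c(250, 1200, 980, 5400)
.s3_ToyFindFeatures(toy_settings, intensities)
```

With this layout, supporting another algorithm only requires a new settings class and a matching method for the generic; the generic itself stays untouched.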
Semantics
The StreamFind R package aims for consistent semantics.
Some of the class and method names were already mentioned above, where
the use of underscores to separate words in method names and the
use of Upper Camel Case for classes is visible. In this
section, we highlight the rules defined for the most important
semantic aspects. All methods available via the MassSpecEngine class
are written with underscores to separate the
words (e.g., get_analysis_names() or annotate_features()). The arguments
of methods, functions and class constructors are always written in Lower
Camel Case when more than one word is needed (e.g., colorBy or
minIntensity). Classes are written in Upper Camel
Case when two or more words are used (e.g.,
MassSpecAnalysis or ProjectHeaders), with the exception
of the specific constructor functions for the different algorithm
settings, which use the syntax
[engine name]Settings_[method name]_[algorithm name] (e.g.,
MassSpecSettings_FilterFeatures_StreamFind or
RamanSettings_BinSpectra_StreamFind). This syntax makes it easier
to associate the settings with the respective engine
and processing method. Functions or methods not available to the user
(i.e., not exported via the package NAMESPACE) are written with a dot (.)
at the beginning, followed by words separated by underscores
(e.g., .get_colors() or
.plot_spectra_ms2_static()). This also applies to the S3
generics of the processing modules, which use the syntax
.s3_[module method] (e.g., .s3_FindFeatures or .s3_GroupFeatures).
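The naming rules above can be expressed as small helper functions. The helpers below are hypothetical and are not part of StreamFind; they only assemble or check names according to the patterns described in this section.

```r
# Hypothetical helper: builds a settings constructor name from its parts.
settings_constructor_name <- function(engine, method, algorithm) {
  paste0(engine, "Settings_", method, "_", algorithm)
}

# Hypothetical helper: builds the name of the internal S3 generic of a module.
s3_generic_name <- function(method) {
  paste0(".s3_", method)
}

# Rough check that an argument name is written in lower camel case.
is_lower_camel_case <- function(x) {
  grepl("^[a-z]+([A-Z][a-z0-9]*)*$", x)
}

settings_constructor_name("MassSpec", "FilterFeatures", "StreamFind")
s3_generic_name("FindFeatures")
is_lower_camel_case(c("colorBy", "minIntensity", "min_intensity"))
```

The regular expression is a simplification: it accepts consecutive capitals (as in mzPPM) and rejects any name containing an underscore.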
Files
The file structure of the StreamFind package is in line with the official
CRAN package development guidelines. All files relevant to the
developer are in the R, src, man-roxygen, tests and vignettes folders. The R
folder contains the R scripts, the src folder contains the C++ libraries and
the Rcpp interface functions, the man-roxygen folder contains the templates
for the documentation of arguments, the tests folder contains the unit tests
that should be applied to each processing method and each algorithm
implementation and, finally, the vignettes folder contains articles,
tutorials and guides for the users. R file names in the R folder follow a
defined naming syntax according to their content/function. Class files are
named with class_[class type in capital]_[class name].R. Exported MS
function files are named with
fct_[optional engine associated]_[unique name].R. Utility
functions that are not exported are placed in files named with
utils_[unique name].R. S3 methods for processing modules
are written in files named with
methods_S3_[engine name]_[method name]_[algorithm].R.
ProcessingSettings constructors for a given engine are placed
in a file named class_S3_[engine name]Settings.R.
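As a small illustration of the file naming syntax, the sketch below maps a file name to the kind of content expected from its prefix. The helper classify_r_file is hypothetical, and the file names used are built from the patterns above; they are not guaranteed to exist in the repository.

```r
# Hypothetical helper, not part of StreamFind: returns the kind of content
# expected in a file of the R folder, based on the file name prefix.
classify_r_file <- function(file) {
  prefixes <- c(
    class_ = "class definition",
    fct_ = "exported functions",
    utils_ = "non-exported utility functions",
    methods_S3_ = "S3 methods for processing modules"
  )
  # Test each known prefix against the start of the file name.
  hit <- vapply(names(prefixes), function(p) startsWith(file, p), logical(1))
  if (any(hit)) unname(prefixes[hit][1]) else "unknown"
}

# Illustrative file names assembled from the naming patterns.
classify_r_file("methods_S3_MassSpec_FindFeatures_openms.R")
classify_r_file("class_S3_MassSpecSettings.R")
classify_r_file("README.md")
```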
Implementation
The implementation of a new processing method and the implementation of a new algorithm for an existing processing method differ in how much existing code they affect. While adding a new processing method requires changing the main MassSpecEngine class and adding a new S3 generic, adding a new algorithm to an existing module does not require changes to the existing files. Therefore, we describe the two cases in separate sections.