Documentation
The following section introduces how to use STREAM and which
input/output formats the program expects and produces. STREAM can be
used
- as a command line tool or
- through the graphical user interface (GUI).
This introduction starts with the GUI usage; the command line
usage is explained below.
STREAM : GUI
- Opening the program
- Opening a file
Opening the program
STREAM can be opened by double-clicking the STREAM.jar file or by
typing into the command line:
java -jar STREAM.jar
You should now see the empty STREAM scaffold (Fig.2) containing:
- The File menu
- The TFBS-Viewer
- The Expression-Viewer
- The Parameter-Chooser
- The Settings panel
Fig.2.
Screen shot of STREAM identifying the different display panels.
Opening a file
The input data can be presented to STREAM in two ways:
- as the output of a previous STREAM execution, by reading in a .xml file
- as individual files, to initially read in information.
(There are example input files for both methods in the archive you have
downloaded.)
To initially load a problem into the program you need three types of
information:
- The TFBS-map / PWMs+Sequence
- The TF expression information
- Target expression information
The TFBS-map can either be provided directly or obtained by STREAM from
position-weight-matrices (PWMs) and a DNA sequence. The
information in the required files can be obtained by multiple tools
and methods. The required information and format are described here.
Display
After loading the data you should see a change in the four display
panels of the program.
TFBS-Viewer
The TFBS-Viewer displays the loaded TFBS-map. Along a horizontal line,
representing the DNA, colored boxes are displayed, which represent the
individual TFBSs (Fig.3). The distance between the boxes represents the
distance between the TFBSs as given in the file. The start as well as
end position of the site is displayed above the box. Activator sites
are depicted as filled boxes, whereas repressor sites have only an
outline. The distance-dependent repression of the activator sites is
displayed as arcs, where the color identifies the repressor.
Fig.3.
Detail from the TFBS-Viewer. The figure shows one activator site
(brown) and three repressor sites (red, green and blue).
Fig.4 shows the additional information that can be displayed about each
TFBS. The first line of the text section in gray shows the unique name
of the TFBS. `KaT` and `Eb` are the current values of the two free
parameters for this TF type (association and effectiveness). `PWM` shows
the log-odds score for this TFBS. `Koef` is the scaling factor used to
determine the association coefficient, `KaS`, for the current site:
Koef = exp(S^max^-S)
KaS = KaT x Koef
See more [Inputdata_format].
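As a small numeric illustration of the two formulas above, the following Python sketch computes `KaS` from `KaT`, the score `S` of a site and the maximal score `S^max^`. The function and argument names are illustrative only and are not part of STREAM.

```python
import math

def site_association(ka_t: float, s_max: float, s: float) -> float:
    """Return KaS for one site, following the formulas as written above:
    Koef = exp(S_max - S) and KaS = KaT x Koef."""
    koef = math.exp(s_max - s)  # scaling factor derived from the log-odds score
    return ka_t * koef

# A site whose score equals the maximal score has Koef = 1, so KaS = KaT.
print(site_association(ka_t=2.0, s_max=5.0, s=5.0))
```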
Fig.4.
Information for each TFBS.
When zoomed out to see the whole TFBS-map, the TFBS info becomes too
small to read. Moving the mouse over a particular TFBS displays its
name in a popup window above the TFBS. When the TFBS is clicked, an
expanded version appears holding more detailed information about this
particular site, see Fig. 4.1.
Fig.4.1.
Extended information for each TFBS.
Expression-Viewer
The expression viewer is subdivided into two display panels labeled
`TFs` and `Target Gene`. The `TFs` panel shows the data given in
the TF expression information file, whereas the other panel
shows the Target expression information. After the free parameters are set
(by training or by manual editing), the `Target Gene` panel also
shows the prediction of the target expression information (Fig.6).
Fig.6.
Expression viewer
displaying the TF concentration, the target gene concentration (blue)
and the prediction of the target gene concentration (red).
STREAM allows the expression data to be divided in up to two dimensions,
here "time" and "tissues". Fig.6 shows all data points along the A-P
axis associated with the last time point. The data in Fig.6 are
displayed as continuous data points; however, it is also
possible to display categorical data. An example of how one dataset can be
viewed differently, depending on which dimension is displayed, is shown
in Fig. 6.1.
Fig.6.1.
Expression viewer displaying a toy example for categorical data.
- Categorical view for all cell-types at the time point 35 minutes
- Continuous view of a time course for the cell-type kidney
If it is not necessary to divide the dataset into two dimensions, one of
the dimension-specifying columns can be filled with "NA".
Control Panel
Parameter-Chooser
The Parameter-Chooser displays the current values of the set of
parameters. These values can be edited, and the information in the
Expression-Viewer and TFBS-Viewer changes accordingly after the
*Update* button is pressed.
Settings-panel
The Settings-panel controls the Expression-viewer and TFBS-Viewer. It
also displays the `Quencher range`, which regulates the distance in
which a repressor can still reduce the effectiveness of an activator
site.
The `Quencher range` can be edited and will be used in all subsequent
actions.
The lower part of the panel holds a summary of the current status of
the program.
`TFBSmap`, `Exp(TF)` and `Exp(target)` display, where appropriate, the
file name of the input data. `Model` displays the model in use and
`Transf.` shows whether a transformation of the parameters is
currently chosen. `RMS` displays the current RMS error and `CC` the
correlation coefficient. The last field indicates whether the program is
busy (`training`) or finished (`done`).
Obtaining a predictive model
A predictive model needs parameter values for the free parameters in
the system. They can be obtained either by setting values manually or
by training a model.
When training to obtain a predictive model, the objective is to
determine the set of model parameters that optimally explains the input
data.
Two different optimization methods are available via the GUI: simulated
annealing (SA) and gradient descent (GD).
Setting start parameters
By default, the optimization method samples start parameters
randomly. However, optimization can also start from manually set
parameters, or one of the following automatic methods can be chosen:
- randomly setting parameters (default)
- estimating parameters from the given input data
- estimating parameters by running gradient descent on a small subset of the data
- estimating parameters by running simulated annealing on a small subset of the data
Setting random parameters and the three ways of estimating parameters
can be accessed via the `Optimize` menu.
Gradient Descent
To perform gradient descent, choose `Train GD` in
the `Optimize` menu.
Fig.7.
Screenshot of the dialog to use gradient descent as optimizer.
Gradient Descent uses the gradient information to guide its
optimization process. Various settings can be adjusted to tune the
optimization for the given problem. The default settings are likely to
reach the same final performance as tuned settings; changing the
settings will mainly speed up convergence.
`GD updates` specifies the number of parameter updates allowed per
run. Reducing the value results in quicker program execution but
might compromise the performance.
Besides the given number of update steps, other criteria can terminate
the optimization process, e.g. if no better solution can be found or if
the gradient becomes too small. These termination criteria are
satisfied both at the globally optimal solution and at local minima.
Since the gradient descent optimization is influenced by the position
of the starting point, restarting the optimization from another start
point can result in a better solution. Two settings enable the
optimization process to restart from different positions and report
the best solution from these runs: `Fixed restarts` and `Budget`.
`Fixed restarts` specifies how many times a new optimization from a
different (random) start position should be executed. Each of these
runs has the given number of allowed updates. `Budget`, on the other
hand, restarts the optimization from different (random) start values
as long as the sum of the updates used in all runs is below the given
number of updates. `Budget` thereby ensures that the given number of
updates is fully used during the optimization.
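The difference between the two restart strategies can be sketched in Python as follows. This is only an illustration: `toy_run` is a hypothetical stand-in for one optimization run, and the assumption that each run under `Budget` is limited to the remaining budget is ours, not taken from STREAM.

```python
import random

def toy_run(rng, max_updates):
    """Hypothetical stand-in for one optimization run from a random start.
    Returns (final error, number of updates actually used)."""
    used = rng.randint(1, max_updates)  # a run may terminate early
    return rng.random(), used

def fixed_restarts(rng, restarts, max_updates):
    """Run a fixed number of restarts, each with the full update allowance."""
    return min(toy_run(rng, max_updates)[0] for _ in range(restarts))

def budget_restarts(rng, budget):
    """Keep restarting while the total number of used updates is below the budget."""
    best, spent = float("inf"), 0
    while spent < budget:
        err, used = toy_run(rng, budget - spent)  # assumed: run capped by remaining budget
        spent += used
        best = min(best, err)
    return best, spent

best, spent = budget_restarts(random.Random(1), budget=100)
print(best, spent)
```

Under `Fixed restarts`, updates left unused by a run that stops early are lost, whereas the budget variant spends them on further restarts.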
The `Learning rate` specifies the step size, `n`, for every category of
the free parameters. The step size scales the gradient, `G`, and
influences how much the new parameter value, `p'`, changes:
p' = p - nG
`Accelerate Learning Rate` toggles whether the learning rate remains
constant throughout the optimization or increases with every
successful update step:
n' = n x 1.05
An unsuccessful update step means that either the new RMS error is
larger than the previous one or the new parameter value lies
outside the specified parameter range. In case of an
unsuccessful update the learning rate is iteratively reduced until
either a successful update step can be performed or the learning rate
becomes too small. The reduction is
n' = n x 0.1
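The update rule and the learning rate adaptation described above can be sketched for a single parameter as follows. This is a minimal Python illustration with `Accelerate Learning Rate` switched on, not STREAM's implementation; the function name, the lower bound `min_n` and the toy error function are assumptions made for the example.

```python
def gd_adaptive(grad, error, p, n=0.1, updates=100, lo=-10.0, hi=10.0, min_n=1e-12):
    """Gradient descent on one parameter with an adaptive learning rate n."""
    err = error(p)
    for _ in range(updates):
        g = grad(p)
        while n >= min_n:
            cand = p - n * g                      # p' = p - nG
            if lo <= cand <= hi and error(cand) <= err:
                break                             # successful update
            n *= 0.1                              # unsuccessful: n' = n x 0.1
        else:
            return p, err                         # learning rate became too small
        p, err = cand, error(cand)
        n *= 1.05                                 # successful: n' = n x 1.05
    return p, err

# Toy problem (assumed for illustration): minimize (p - 3)^2 starting from p = 0.
p_opt, err = gd_adaptive(lambda p: 2 * (p - 3), lambda p: (p - 3) ** 2, p=0.0)
print(p_opt, err)
```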
`Rprop` is a variation of the normal gradient descent update function,
where only the sign of the gradient, rather than the gradient value, is
used to determine the new parameter value:
p' = p - n sign(G)
The value for `RpropLimit` indicates for how many iterations the Rprop
changed update function should be used before the program switches back
to using the normal gradient descent update function.
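A minimal Python sketch of how the two update functions differ, and of how an `RpropLimit`-style switch could select between them. The function names and the iteration counter starting at 0 are assumptions made for the example.

```python
def sign(x: float) -> int:
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def update(p: float, n: float, g: float, iteration: int, rprop_limit: int) -> float:
    """Use the Rprop update for the first rprop_limit iterations, then normal GD."""
    if iteration < rprop_limit:
        return p - n * sign(g)   # p' = p - n sign(G): step size independent of |G|
    return p - n * g             # p' = p - nG

# With a large gradient the Rprop step stays small, the normal step does not.
print(update(1.0, 0.1, 50.0, iteration=0, rprop_limit=5))
print(update(1.0, 0.1, 50.0, iteration=5, rprop_limit=5))
```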
`Epsion` gives the precision for the stop criterion for the
optimization.
abs(error^i^-error^i-1^) < Epsion
By default the program uses all the data given in the input set.
However, some problems might be sufficiently defined by only a subset
of the data, on which the program can be executed faster.
`Dataset fraction` hence allows only a defined fraction of the
dataset to be used. The program assembles the subset by randomly
sampling from the input data.
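Assembling such a subset can be sketched in Python as below. Sampling without replacement, the rounding of the subset size and the `seed` argument are assumptions for illustration; the text above only states that the subset is sampled randomly.

```python
import random

def sample_fraction(data, fraction, seed=1):
    """Return a random subset containing roughly `fraction` of the data points."""
    rng = random.Random(seed)
    k = max(1, round(len(data) * fraction))  # keep at least one data point
    return rng.sample(data, k)               # sampling without replacement (assumed)

subset = sample_fraction(list(range(10)), fraction=0.3)
print(len(subset))
```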
Transformation indicates for each of the parameter types whether
they should undergo a transformation, where -1 indicates no
transformation. Since gradient descent is normally an unconstrained
optimization, the optimized values for the free parameters can become
invalid (e.g. negative). Optimizing without a transformation is
therefore not advisable. Any transformation value, `a`, larger than 1 will
trigger a transformation from the _constrained_ space (`p`) to an
_unconstrained_ space (`p^u^`), which is used during the optimization
process:
p^u^ = -log( (a/p) - 1.0 ) / 0.01 .
After the optimization is done the parameters are transformed back to
be displayed as the optimized free parameters:
p = a / (1.0 + exp(-b*p^u^)) ,
which inverts the first transformation for b = 0.01.
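The following Python sketch illustrates the pair of transformations and checks that they invert each other. It assumes `b = 0.01`, which is the value for which the back-transformation undoes the forward one; the function names are illustrative.

```python
import math

B = 0.01  # assumed value of b, matching the divisor of the forward transformation

def to_unconstrained(p: float, a: float) -> float:
    """Map p from the constrained interval (0, a) to the unconstrained space."""
    return -math.log(a / p - 1.0) / B

def to_constrained(p_u: float, a: float) -> float:
    """Map an unconstrained value back into the interval (0, a)."""
    return a / (1.0 + math.exp(-B * p_u))

p_u = to_unconstrained(2.5, a=10.0)
print(p_u, to_constrained(p_u, a=10.0))
```

Whatever value the optimizer reaches in the unconstrained space, the back-transformed parameter stays between 0 and `a`, which is why the transformation keeps the free parameters valid.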
`Cross Validation Folds` indicates into how many subsets (folds) the input
data should be divided. The training is performed on all but one fold,
which is used for testing. The training is repeated on different
subsets until all folds have been used for testing.
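The fold scheme described above can be sketched in Python as follows. The interleaved assignment of data points to folds is an assumption for illustration; STREAM may assign data points to folds differently.

```python
def cv_folds(n_items: int, folds: int):
    """Yield (train, test) index lists so that every fold is used once for testing."""
    indices = list(range(n_items))
    for f in range(folds):
        test = indices[f::folds]                              # every folds-th item, offset f
        train = [i for i in indices if i % folds != f]        # all remaining items
        yield train, test

for train, test in cv_folds(n_items=10, folds=5):
    print(test, train)
```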
Each of the models that can be chosen from the pull-down menu
represents a variation of the original function calculating the
transcription rate `R`. Fig.8 shows a comparison of the model
variations.
The original function, here called `JanssensModelCapped`, calculates `R`
as the fraction of the maximal transcription rate `R0` that is used, as
follows:
R = R0 x scaling   if scaling <= 1,
R = R0             otherwise.
This function is not differentiable and hence cannot be used for
gradient descent optimization. The uncapped version of
this function, called `JanssensModel`, can be used instead. However, the
calculated transcription rate can then become larger than `R0`, and hence
the error as well as the gradient can become very large, which might
perturb the optimization. We therefore introduce `JanssensModelRSigm`,
which 'caps' the calculated rate using a sigmoid function:
R = R0 x sigm(scaling),
where sigm(x) = 1.0/(1+exp(-(x*2.5)-2)).
The scaling values (2.5 and 2) are used to shift the sigmoid-transformed
transcription rate as close as possible to the
untransformed values (see Fig.8).
Fig.8.
Figure illustrating the difference of the calculated transcription rate
using the different models.
Simulated Annealing
To perform simulated annealing, choose the simulated annealing training
entry in the `Optimize` menu.
Simulated Annealing has considerably fewer parameters to adjust.
`Iterations` gives the maximal number of iterations for the
optimization. The number of iterations also influences the cooling rate:
rate = exp( log(0.006) / maxIter ) ,
which ensures that there is still enough temperature left to make
changes to the parameters towards the end of the optimization.
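The effect of this cooling rate can be checked with a short Python sketch. It assumes a geometric schedule in which the temperature is multiplied by `rate` once per iteration; the start temperature `t0` and the function names are illustrative.

```python
import math

def cooling_rate(max_iter: int) -> float:
    """rate = exp(log(0.006) / maxIter), as given above."""
    return math.exp(math.log(0.006) / max_iter)

def temperature(t0: float, max_iter: int, i: int) -> float:
    """Temperature after i iterations, assuming T is multiplied by rate each step."""
    return t0 * cooling_rate(max_iter) ** i

# Under this assumption the temperature at the last iteration is 0.006 x t0,
# independent of how many iterations were requested.
print(temperature(1.0, 10000, 10000))
print(temperature(1.0, 500, 500))
```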
`Dataset fraction` and `Cross Validation Fold` have the same function
as explained above. As for the model option,
`JanssensModelCapped` is now available and is the preferable option.
Fig.8.
Screenshot of the dialog to use simulated annealing as optimizer.
Changing/adjusting data/parameters
Manually updating parameters
Parameters can be updated manually at any time by changing the values
in the Parameter-Chooser and clicking 'Update' to apply.
Setting parameters from string
Parameters can also be given to the program as a formatted string. The
string can be inserted via the dialog 'Edit -> Set Parameters'. For
more details on the format see Annotated parameter string.
TF status
A TF can be either a repressor or an activator. Which status a TF has
is specified by the input data. The dialog 'Edit -> Edit TFs'
allows this specification to be changed.
Changing parameter ranges
To change the default range of parameters go to 'Edit -> Set
Ranges'. These ranges are used to sample the random start parameters,
limit the optimization in SA and transform the parameters for the
GD optimization. They also limit the values that can be set in the
Parameter-Chooser.
Adding Complexes
to be added
Managing problems
to be added
Obtaining results
Displaying Log
The log file can be obtained from 'Result -> return Log'
Extracting Parameters
to be added
Saving to .xml-file
The xml file contains all necessary information to continue from where
you left off. For more information see Xml-file.
STREAM : command-line
More efficient computational experiments can be performed by using
STREAM in command-line mode. The same jar file is used, but this
time arguments are given to STREAM.jar:
java -jar STREAM.jar -h
prints the command-line usage.
java -jar STREAM.jar <file.stream>
executes an experiment defined in an xml-file.
If you want to run STREAM from scratch, you need to provide the
arguments as follows:
java -jar STREAM.jar -S <TFBS-map file> -TF <TF-Expression file> -G <Target-Expression file> [optimization method] [model definition] [transformation] [options]
Input data arguments
|| -S <file> || File containing the TFBSs in the TFBS-map format. ||
|| -TF <file> || File containing the expression information of the TFs in the TF expression format. ||
|| -G <file> || File containing the expression information of the gene in the Gene expression format. ||
*optimization method*
|| -gradientDescent || Using gradient descent for the optimization ||
|| -SA || Using simulated annealing for the optimization ||
|| -hybrid || Using simulated annealing first and then gradient descent for the optimization ||
|| -LBFGS || Limited-memory quasi-Newton unconstrained optimization (buggy) ||
*model definition*
|| -cR || Using the normal R-function as defined by Janssens et al.^1. ||
|| -nR || Using the uncapped version of R. ||
|| -sR || Using the sigmoid transformed R. ||
*transformation*
|| -nT || No transformation of the parameters; cannot be used with gradient descent ||
|| -sT || Sigmoid transformation of the parameters according to fixed ranges ||
*options*
|| -silent || No output during run ||
|| -detailed || Print values from model evaluation ||
|| -V <string> || Annotated start variable list ||
|| -graphic || Launch visualizer at the end ||
|| -justPlot || Plots the prediction for the given variables without performing optimization (requires -V) ||
|| -cv <number> || CV folds _default 0_ ||
|| -multistart <number> || Perform multistart: starting <number> times with randomly picked parameters within range _default 1_ ||
|| -multistartbudget <number> || Performs optimization runs and keeps restarting as long as the budget of iterations is not reached _default 0_ ||
|| -seed <number> || Random number seed _default 1_ ||
|| -save <filename> || Save results and configuration to <filename>.stream file ||
*SA specific options*
|| -maxiter <number> || Number of SA iteration steps per restart _default 10000_ ||
*GD specific options*
|| -updatas <number> || Number of GD iteration steps per restart _default 10000_ ||
|| -momentum || Momentum _default 0_ ||
|| -changeNu || Change learning rate: nu=nu*1.05 iff successfully updated; nu=nu*0.1 otherwise _default_ ||
|| -rprop || Use only the sign of the gradient ||
|| -rpropLimit <number> || Use only the sign of the gradient for the first <number> iterations, then perform normal GD _default 0_ ||
=FAQ=
Please go to [FAQ]
=References=
(5) Denis C Bauer and Timothy L Bailey. Assessing quality and
efficiency of different methods for the optimization of thermodynamic
models for transcriptional regulation. In preparation.