
Documentation


The following section introduces how to use STREAM and describes the input and output formats the program expects and produces. STREAM can be used
  • as a command line tool or
  • through the graphical user interface (GUI).

This introduction starts with the GUI usage; the command line usage is explained below.


STREAM : GUI

  • Opening the program
  • Opening a file

Opening the program

STREAM can be opened by double-clicking the STREAM.jar file or by typing into the command line:

java -jar STREAM.jar


You now should see the empty STREAM scaffold (Fig.2) containing:
  • The File menu
  • The TFBS-Viewer
  • The Expression-Viewer
  • The Parameter-Chooser
  • The Settings panel

Fig.2. Screen shot of STREAM identifying the different display panels.


Opening a file

The input data can be presented to STREAM in two ways:
  •  as the output of a previous STREAM execution, by reading in a .xml file
  •  as individual files, to initially read in information.

(There are example input files for both methods in the archive you have downloaded.)

To initially load a problem into the program you need three types of information:
  • The TFBS-map / PWMs+Sequence
  • The TF expression information
  • Target expression information

The TFBS-map can either be provided directly or be derived by STREAM from position-weight-matrices (PWMs) and a DNA sequence. The information in the required files can be obtained with multiple tools and methods. The required information and formats are described here.


Display

After loading the data you should see a change in the four display panels of the program.

TFBS-Viewer

The TFBS-Viewer displays the loaded TFBS-map. Colored boxes along a horizontal line, which represents the DNA, mark the individual TFBSs (Fig.3). The distance between the boxes corresponds to the distance between the TFBSs as given in the file. The start and end positions of each site are displayed above its box. Activator sites are depicted as filled boxes, whereas repressor sites have only an outline. The distance-dependent repression of the activator sites is displayed as arcs, whose color identifies the repressor.

Fig.3. Detail from the TFBS-Viewer. The figure shows one activator site (brown) and three repressor sites (red, green and blue).


Fig.4 shows the additional information that can be displayed for each TFBS. The first line of the text section, in gray, shows the unique name of the TFBS. `KaT` and `Eb` are the current values of the two free parameters for this TF type (association and effectiveness). `PWM` shows the log-odds score for this TFBS. `Koef` is the scaling factor used to determine the association coefficient, `KaS`, for the current site.

  Koef = exp(S^max^-S)

  KaS = KaT x Koef

For more details see [Inputdata_format].
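The two equations for `Koef` and `KaS` can be sketched in Python as follows. This is only an illustration of the formulas as written in this section, not STREAM's own code. The function and argument names are hypothetical, and the sketch assumes that `S` is the log-odds score of the site and `S^max^` the maximal score for the corresponding PWM.

```python
import math

def site_association(ka_t, s_max, s):
    """Return KaS = KaT x Koef, with Koef = exp(S_max - S) as written above.

    ka_t  : association parameter of the TF type (KaT)
    s_max : assumed to be the maximal log-odds score for the PWM
    s     : assumed to be the log-odds score of this particular site
    """
    koef = math.exp(s_max - s)   # scaling factor Koef
    return ka_t * koef           # site-specific association coefficient KaS

# A site whose score equals the maximal score gets Koef = 1, so KaS == KaT.
print(site_association(2.0, 5.0, 5.0))
```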

Fig.4. Information for each TFBS.

When zoomed out to see the whole TFBS-map, the TFBS info becomes too small to read. Moving the mouse over a particular TFBS displays its name in a popup window above the TFBS. Clicking a TFBS opens an expanded version holding more detailed information about this particular site, see Fig. 4.1.

Fig.4.1. Extended information for each TFBS.




Expression-Viewer


The expression viewer is subdivided into two display panels labeled `TFs` and `Target Gene`. The `TFs` panel shows the data given in the TF expression information file, whereas the other panel shows the Target expression information. After the free parameters are set (after training or manually editing) the `Target Gene` panel will also show the prediction of the target expression information (Fig.6).

Fig.6. Expression viewer displaying the TF concentration, the target gene concentration (blue) and the prediction of the target gene concentration (red).


STREAM allows the expression data to be divided into up to two dimensions, here "time" and "tissues". Fig.6 shows all data points along the A-P axis associated with the last time point. The data in Fig.6 are displayed as continuous data points; however, it is also possible to display categorical data. Fig. 6.1 shows an example of how one dataset can be viewed differently depending on which dimension is displayed.

Fig.6.1. Expression viewer displaying a toy example for categorical data. Panels: categorical view for all cell-types at the time point 35 minutes; continuous view of a time course for the cell-type kidney.

If it is not necessary to divide the dataset into two dimensions, one of the dimension-specifying columns can be filled with "NA".



Control Panel


Parameter-Chooser
The Parameter-Chooser displays the current values of the parameters. These values can be edited; the information in the Expression-Viewer and TFBS-Viewer changes accordingly after the *Update* button is pressed.

Settings-panel
The Settings-panel controls the Expression-viewer and TFBS-Viewer. It also displays the `Quencher range`, which regulates the distance in which a repressor can still reduce the effectiveness of an activator site.
The `Quencher range` can be edited and will be used in all subsequent actions.

The lower part of the panel holds a summary of the current status of the program.
`TFBSmap`, `Exp(TF)` and `Exp(target)` display, where appropriate, the file name of the input data. `Model` displays the model in use and `Transf.` shows whether a transformation of the parameters is currently chosen. `RMS` displays the current RMS error and `CC` the correlation coefficient. The last field indicates whether the program is busy, `training`, or finished, `done`.
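As a reading aid for the `RMS` and `CC` fields, the following minimal sketch computes a root-mean-square error and a correlation coefficient between observed and predicted target expression. It assumes that `CC` is the Pearson correlation coefficient, which this documentation does not state explicitly. The function names are hypothetical and not part of STREAM.

```python
import math

def rms_error(observed, predicted):
    """Root-mean-square error between two equally long value lists."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def correlation(observed, predicted):
    """Pearson correlation coefficient (assumed meaning of CC)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)

# Toy values, not taken from any STREAM example file.
print(rms_error([0.1, 0.5, 0.9], [0.2, 0.4, 0.9]))
print(correlation([0.1, 0.5, 0.9], [0.2, 0.4, 0.9]))
```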


Obtaining a predictive model

A predictive model needs values for the free parameters in the system. They can be obtained either by setting values manually or by training a model.

When training to obtain a predictive model, the objective is to determine the set of model parameters that optimally explains the input data.

Two different optimization methods are available via the GUI: simulated annealing (SA) and gradient descent (GD).

Setting start parameters

By default, the optimization method samples the start parameters randomly. Alternatively, the optimization can start from manually set parameters, or one of the following automatic methods can be chosen:
  • randomly setting parameters (default)
  • estimating parameters from the given input data
  • estimating parameters by running gradient descent on a small subset of the data
  • estimating parameters by running simulated annealing on a small subset of the data

Setting random parameters and the three ways of estimating parameters can be accessed via the `Optimize` menu.


Gradient Descent

To perform gradient descent choose `Train GD` in the `Optimize` menu.

Fig.7. Screenshot of the dialog to use gradient descent as optimizer.


Gradient descent uses the gradient information to guide its optimization process. Various settings can be adjusted to tune the optimization for the given problem. The default settings are likely to yield the same final performance; changing them will mainly affect how fast the optimization converges.

`GD updates` specifies the number of parameter updates allowed per run. Reducing the value results in quicker program execution but might compromise the performance.

Besides the given number of update steps, other criteria can terminate the optimization process, e.g. if no better solution can be found or if the gradient becomes too small. These termination criteria are satisfied by both the globally optimal solution and local minima. Since gradient descent is influenced by the position of the starting point, restarting the optimization from another start point can result in a better solution. Two settings enable the optimization process to restart from different positions and report the best solution from these runs: `Fixed restarts` and `Budget`. `Fixed restarts` specifies how many times a new optimization from a different (random) start position should be executed. Each of these runs has the given number of allowed updates. `Budget`, on the other hand, restarts the optimization from different (random) start values as long as the sum of the updates used in all runs is below the given number of updates. `Budget` thus ensures that the given number of updates is used optimally during the optimization.
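The difference between `Fixed restarts` and `Budget` can be sketched as follows. This is a simplified illustration, not STREAM's implementation: `run_once` is a hypothetical stand-in for one gradient descent run that may stop early, and capping the last run at the remaining budget is an assumption made here.

```python
import random

def run_once(max_updates, rng):
    """Hypothetical stand-in for one GD run from a random start point.

    Returns (error, updates_used); a run may stop before max_updates.
    """
    used = rng.randint(1, max_updates)
    return rng.random(), used

def fixed_restarts(n_restarts, max_updates, rng):
    """Run a fixed number of restarts, each with the full update allowance."""
    best = float("inf")
    for _ in range(n_restarts):
        error, _used = run_once(max_updates, rng)
        best = min(best, error)
    return best

def budget_restarts(budget, max_updates, rng):
    """Keep restarting while the total number of used updates is below budget."""
    best, spent, runs = float("inf"), 0, 0
    while spent < budget:
        # Assumption: a run never gets more updates than the budget has left.
        error, used = run_once(min(max_updates, budget - spent), rng)
        spent += used
        runs += 1
        best = min(best, error)
    return best, spent, runs

rng = random.Random(1)
print(fixed_restarts(5, 30, rng))
print(budget_restarts(100, 30, rng))
```

With `Fixed restarts`, runs that stop early leave updates unused; with `Budget`, those unused updates are spent on additional restarts.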

The `Learning rate` specifies the step size, `n`, for every category of the free parameters. The step size scales the gradient, `G`, and thereby influences how much the new parameter value, `p'`, is changed.

  p' = p - nG
 
`Accelerate Learning Rate` toggles whether the learning rate remains constant throughout the optimization or increases with every successful update step according to

  n' = n x 1.05

An unsuccessful update step means that either the new RMS error is larger than the previous one or the new parameter value lies outside the specified parameter range. After an unsuccessful update the learning rate is reduced iteratively until either a successful update step can be performed or the learning rate becomes too small. The reduction is

  n' = n x 0.1
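The update rule and the learning rate adjustment described above can be sketched for a single parameter as follows. This is a minimal illustration under stated assumptions, not STREAM's code: the function name, the default range `[lo, hi]` and the threshold `min_rate` for a "too small" learning rate are hypothetical.

```python
def gd_adaptive(grad, error, p, n=0.01, updates=1000,
                lo=0.0, hi=10.0, accelerate=True, min_rate=1e-12):
    """Minimise error(p) for one parameter p restricted to [lo, hi]."""
    current = error(p)
    for _ in range(updates):
        g = grad(p)
        while True:
            candidate = p - n * g                  # p' = p - nG
            if lo <= candidate <= hi and error(candidate) <= current:
                break                              # successful update step
            n *= 0.1                               # unsuccessful: n' = n x 0.1
            if n < min_rate:
                return p, current                  # learning rate too small
        p, current = candidate, error(candidate)
        if accelerate:
            n *= 1.05                              # successful: n' = n x 1.05
    return p, current

# Toy problem: error (p - 3)^2 with gradient 2 (p - 3), starting at p = 8.
p_opt, err = gd_adaptive(lambda p: 2.0 * (p - 3.0),
                         lambda p: (p - 3.0) ** 2, 8.0)
print(p_opt, err)
```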

`Rprop` is a variation of the normal gradient descent update function in which only the sign of the gradient, rather than the gradient value, is used to determine the new parameter value.

  p' = p - n sign(G)

The value for `RpropLimit` indicates for how many iterations the Rprop update function should be used before the program switches back to the normal gradient descent update function.
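The switch between the Rprop update and the normal update can be sketched as follows. The function names are hypothetical, and counting iterations from zero is an assumption of this sketch.

```python
def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def update(p, g, n, iteration, rprop_limit):
    """One parameter update with gradient g and learning rate n.

    For the first rprop_limit iterations only the sign of the gradient
    is used (p' = p - n sign(G)); afterwards the normal update applies
    (p' = p - nG).
    """
    if iteration < rprop_limit:
        return p - n * sign(g)
    return p - n * g

# A very large gradient moves the parameter by only n while Rprop is active.
print(update(1.0, 250.0, 0.1, iteration=0, rprop_limit=5))
print(update(1.0, 250.0, 0.1, iteration=5, rprop_limit=5))
```

Using only the sign makes the first steps insensitive to very large gradient values, which can otherwise throw the parameter far away from its start value.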

`Epsilon` gives the precision for the stop criterion of the optimization:

  abs(error^i^ - error^i-1^) < Epsilon

By default the program uses all the data given in the input set. However, some problems might be sufficiently defined by only a subset of the data, on which the program can be executed faster. `Dataset fraction` therefore allows only a defined fraction of the dataset to be used. The program assembles the subset by randomly sampling from the input data.

`Transformation` indicates for each of the parameter types whether it should undergo a transformation, where -1 indicates no transformation. Since gradient descent is normally an unconstrained optimization, the optimized values for the free parameters can become invalid (e.g. negative). Disabling the transformation is therefore not advisable. Any transformation value, `a`, larger than 1 triggers a transformation from the _constrained_ space (`p`) to an _unconstrained_ space (`p^u^`), which is used during the optimization process:

  p^u^ = -log( (a/p) - 1.0 ) / 0.01 .

After the optimization is done the parameters are transformed back to be displayed as the optimized free parameters:

  p = a / (1.0 + exp(-b * p^u^)) ,

where `b` = 0.01 is the constant used in the first equation.
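The two transformation formulas can be sketched in Python as follows. The sketch assumes that the constant `b` in the back-transformation equals the 0.01 that appears in the forward transformation, since only then are the two formulas inverses of each other. The function names are hypothetical and not part of STREAM.

```python
import math

B = 0.01  # assumed to be the same constant in both directions

def to_unconstrained(p, a, b=B):
    """Map p from the constrained range (0, a) to the unconstrained space."""
    return -math.log(a / p - 1.0) / b

def to_constrained(p_u, a, b=B):
    """Map an unconstrained value back into the range (0, a)."""
    return a / (1.0 + math.exp(-b * p_u))

a = 10.0
p = 2.5
p_u = to_unconstrained(p, a)
print(p_u, to_constrained(p_u, a))  # the round trip recovers p up to rounding
```

Because the back-transformation is a scaled sigmoid, any value reached by the optimizer in the unconstrained space maps to a value between 0 and `a`, so the optimized parameters cannot become negative.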

`Cross Validation Folds` indicates into how many subsets (folds) the input data should be divided. The training is performed on all but one fold, which is used for testing. The training is repeated until every fold has been used for testing once.
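The fold handling described above can be sketched as follows. How STREAM assigns data points to folds is not documented here; the random shuffling and the function name are assumptions of this sketch.

```python
import random

def cross_validation_folds(n_points, k, seed=1):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)          # assumed: random assignment
    folds = [indices[i::k] for i in range(k)]     # k roughly equal folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Every data point is used for testing exactly once.
for train, test in cross_validation_folds(10, 5):
    print(sorted(test), "tested, trained on", len(train), "points")
```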

Each of the models that can be chosen from the pull-down menu represents a variation of the original function calculating the transcription rate `R`. Fig.8 shows a comparison of the model variations.

The original function, here called `JanssensModelCapped`, calculates `R` as the fraction of the maximal transcription rate `R0` that is used:

  R = R0 x scaling   if scaling <= 1,
  R = R0             otherwise.

This function is not differentiable and hence cannot be used for the gradient descent optimization. Instead, the uncapped version of this function, called `JanssensModel`, can be used. However, the calculated transcription rate can then become larger than `R0`, and hence the error as well as the gradient can become very large, which might perturb the optimization. We therefore introduce `JanssensModelRSigm`, which 'caps' the calculated rate using a sigmoid function:

  R = R0 x sigm(scaling),

  where sigm(x) = 1.0/(1+exp(-(x*2.5)-2))

The scaling values (2.5 and 2) are used to shift the sigmoid-transformed transcription rate as close as possible to the untransformed values (see Fig.8).

Fig.8. Figure illustrating the difference of the calculated transcription rate using the different models.


Simulated Annealing

To perform simulated annealing choose the corresponding training entry in the `Optimize` menu.

Simulated annealing has considerably fewer parameters to adjust.

`Iterations` gives the maximal number of iterations for the optimization. The number of iterations also influences the cooling rate:

  rate = e^(log(0.006)/maxIter)

which ensures that there is still enough temperature left to make changes to the parameters towards the end of the optimization.
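The effect of the cooling rate formula can be checked with a short sketch. It assumes a geometric cooling schedule in which the temperature is multiplied by `rate` once per iteration; the documentation gives the rate but does not state how it is applied, so this is an assumption, and the function names are hypothetical.

```python
import math

def cooling_rate(max_iter):
    """rate = e^(log(0.006)/maxIter), as given above."""
    return math.exp(math.log(0.006) / max_iter)

def temperature(t0, max_iter, iteration):
    """Temperature after `iteration` steps of assumed geometric cooling."""
    return t0 * cooling_rate(max_iter) ** iteration

# After maxIter iterations the temperature has dropped to 0.6% of its start
# value, independent of how many iterations were chosen.
print(temperature(1.0, 10000, 10000))
print(temperature(1.0, 500, 500))
```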

`Dataset fraction` and `Cross Validation Folds` have the same function as explained above. As for the model option, `JanssensModelCapped` is now available and is the preferable option.

Fig.9. Screenshot of the dialog to use simulated annealing as optimizer.


Changing/adjusting data/parameters


Manually updating parameters

Parameters can be updated manually at any time by changing the values in the Parameter-Chooser and clicking 'Update' to apply.

Setting parameters from string

Parameters can also be given to the program as a formatted string. The string can be entered via the dialog 'Edit -> Set Parameters'. For more details on the format see Annotated parameter string.

TF status

A TF can be either a repressor or an activator. The status of a TF is specified by the input data. The dialog 'Edit -> Edit TFs' allows this specification to be changed.

Changing parameter ranges

To change the default parameter ranges go to 'Edit -> Set Ranges'. These ranges are used to sample the random start parameters, to limit the optimization in SA and to transform the parameters for the GD optimization. They also limit the values that can be set in the Parameter-Chooser.

Adding Complexes

to be added


Managing problems

to be added

Obtaining results

Displaying Log

The log file can be obtained from 'Result -> return Log'

Extracting Parameters

to be added

Saving to .xml-file

The xml file contains all necessary information to continue from where you left off. For more information see Xml-file.

STREAM : command-line

More efficient computational experiments can be performed by using STREAM in command line mode. The same jar file is used, but this time arguments are passed to STREAM.jar:

java -jar STREAM.jar -h

prints the command-line usage.

java -jar STREAM.jar <file.stream>

executes an experiment defined in an xml-file.

If you want to run STREAM from scratch you need to type in the arguments as follows:

java -jar STREAM.jar -S <TFBS-map file> -TF <TF-Expression file> -G <Target-Expression file> [optimization method] [model definition] [transformation] [options]

Input data arguments


-S <file> File containing the TFBSs in the TFBS-map format.
-TF <file> File containing the expression information of the TFs in the TF expression format.
-G <file> File containing the expression information of the gene in the Gene expression format.



*optimization method*      
|| -gradientDescent || Using gradient descent for the optimization ||
|| -SA || Using simulated annealing for the optimization ||
|| -hybrid || Using simulated annealing first and then gradient descent for the optimization ||
|| -LBFGS || limited-memory quasi-Newton unconstrained optimization (buggy) ||

*model definition*        
|| -cR || Using the normal R-function as defined by Janssens et al.^1.  ||
|| -nR || Using the uncapped version of R. ||
|| -sR || Using the sigmoid transformed R. ||

*transformation*        
|| -nT || no transformation of the parameters - cannot be used with gradient descent ||
|| -sT || sigmoid transformation of the parameters according to fixed ranges ||

*options*
|| -silent || No output during run ||
|| -detailed || Print values from model evaluation ||
|| -V <string> || Annotated start variable list ||
|| -graphic || Launch visualizer at the end ||
|| -justPlot || Plots the prediction for the given variables without performing optimization (requires -V) ||
|| -cv <number> || CV folds  _default 0_||
|| -multistart <number> || Perform multistart: starting <number> times with randomly picked parameters within range _default 1_||
|| -multistartbudget <number>  || Performs  optimization runs and keeps restarting as long as the budget of iterations is not reached _default 0_||
|| -seed <number>|| Random number seed _default 1_||
|| -save <filename> || save results and configuration to <filename>.stream file  ||

*SA specific options*
|| -maxiter <number> || Number of SA iteration steps per restart _default 10000_ ||

*GD specific options*
|| -updatas <number> || Number of GD iteration steps per restart _default 10000_ ||
|| -momentum || Momentum _default 0_ ||
|| -changeNu || Change learning rate : nu=nu*1.05 iff successfully updated; nu=nu*0.1 otherwise _default_||
|| -rprop || Use only the sign of the gradient ||
|| -rpropLimit <number> || Use only the sign of the gradient for the first <number> iterations then perform normal GD _default 0_ ||
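For scripted experiments it can be convenient to assemble the command line programmatically. The following sketch builds the argument list for a from-scratch run using only flags listed above. The helper name and the input file names are hypothetical, and the sketch assumes that the arguments are accepted in the order given by the usage line.

```python
import shlex

def stream_command(tfbs_map, tf_expr, target_expr,
                   method="-gradientDescent", model="-sR",
                   transformation="-sT", options=()):
    """Assemble the argument list for a from-scratch STREAM run."""
    return ["java", "-jar", "STREAM.jar",
            "-S", tfbs_map, "-TF", tf_expr, "-G", target_expr,
            method, model, transformation, *options]

# Hypothetical input file names; replace them with your own files.
cmd = stream_command("tfbs_map.txt", "tf_expression.txt", "target_expression.txt",
                     options=["-multistart", "5", "-seed", "1", "-save", "run1"])
print(shlex.join(cmd))
# To launch the run: import subprocess; subprocess.run(cmd, check=True)
```

The defaults combine gradient descent with the sigmoid-transformed R and the sigmoid parameter transformation, because `-nT` cannot be used with gradient descent.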



=FAQ =
Please go to [FAQ]

=References=
(5) Denis C Bauer and Timothy L Bailey. Assessing quality and efficiency of different methods for the optimization of thermodynamic models for transcriptional regulation. In preparation.