OARTTRAIN

Object-based Random Trees training


Environments: PYTHON :: EASI

Description


OARTTRAIN uses a set of object samples that are stored in a segmentation-attribute table. The attributes are used to develop a predictive model by creating a set of random decision trees and taking the majority vote for the predicted class. The model is stored in an Extensible Markup Language (.xml) file.

Parameters


oarttrain (filv, dbvs, trnfld, tfile, maxdepth, minsamp, actvars, termcrit, maxtrees, treesacc, trnmodel)

Name       Type          Caption                                  Length   Value range
FILV*      str           Input vector file                        1 -
DBVS*      List[int]     Segment number of vector layer           1 - 1    1 -
TRNFLD     str           Name of training field                   0 -      Default: Training
TFILE*     str           Text file of field names                 1 -
MAXDEPTH   List[int]     Maximum possible depth of the tree       0 - 1    5 - 25; Default: 10
MINSAMP    List[int]     Minimum sample count                     0 - 1    1 - ; Default: 5
ACTVARS    List[int]     Active variables                         0 - 1    0 - ; Default: 0
TERMCRIT   str           Termination criteria                     0 - 8    BOTH | MAXTREES | TREESACC; Default: BOTH
MAXTREES   List[int]     Maximum number of trees                  0 - 1    10 - ; Default: 100
TREESACC   List[float]   Tree accuracy                            0 - 1    0.01 - ; Default: 0.05
TRNMODEL*  str           Output file containing training model    1 -

* Required parameter

Parameter descriptions

FILV

The name of the vector file that contains the segmentation layer and training field.

DBVS

The segment number of the vector layer that contains the segmentation and training field.

TRNFLD

The name of the field containing the training samples.

The default name is Training.

TFILE

The name of a text file that contains the names of the fields to use for training.

The file name extension must be .txt.

Typically, this is a file generated by running the OAFLDNMEXP algorithm.
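
As a quick check before training, you can print the contents of this file to confirm which attribute fields will be used. The sketch below is ordinary Python; the file name is the one used in the example later on this page, and the exact layout of the exported names depends on OAFLDNMEXP.

tfile = "l7_ms_seg25_0.5_0.5_attributes.txt"   # file exported by OAFLDNMEXP
with open(tfile) as f:
    print(f.read())                            # list of attribute-field names to train on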

MAXDEPTH

The maximum tree depth. This parameter specifies the maximum number of levels below the root node. The training algorithm attempts to split a node while its depth is less than the value you specify. The depth of the optimal tree may be less than the value you specify if the other termination criteria are met before the maximum depth is reached.

A value that is too small can underfit the training data, while, conversely, a value that is too large will tend to overfit the training model.

MINSAMP

The minimum number of samples at a leaf node to allow it to be further split into child nodes.

A reasonable value is a small percentage of the total number of training samples, such as one percent. For example, if the attribute table contains 2000 training samples, a reasonable value would be 20.
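
As a rough illustration of the one-percent guideline, the value can be derived directly from the sample count; the count below is the example value from the preceding paragraph.

n_training_samples = 2000                              # example value from above
minsamp = [max(1, round(0.01 * n_training_samples))]   # -> [20]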

ACTVARS

The size of the randomly selected subset of attributes evaluated at each tree node to decide whether that node should be split.

When set to 0 (zero), the default, the size is set to the square root of the total number of input attributes, which generally gives near-optimum results according to Breiman (2001). Specifying a higher number is recommended when many attributes are noisy.
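
If you prefer to pass the square-root value explicitly instead of relying on the default of 0, a minimal calculation looks like this; the attribute count is an illustrative value.

import math

n_attributes = 25                          # illustrative count of attribute fields listed in TFILE
actvars = [int(math.sqrt(n_attributes))]   # -> [5]; passing [0] lets OARTTRAIN apply the same rule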

TERMCRIT

The termination criteria that determine when to stop the training algorithm.

Training can stop when the specified maximum number of trees has been trained and added to the ensemble (MAXTREES), when sufficient accuracy is achieved (TREESACC), or when either of the two conditions is met (BOTH).
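
Conceptually, the three settings combine as in the sketch below. This is an illustration of the stopping logic only, under the assumption that the classifier tracks an ensemble error estimate; it is not the actual OARTTRAIN implementation.

# Conceptual sketch of the TERMCRIT stopping logic (illustration only).
def should_stop(termcrit, n_trees, maxtrees, forest_error, treesacc):
    trees_done = n_trees >= maxtrees           # MAXTREES criterion
    accuracy_done = forest_error <= treesacc   # TREESACC criterion
    if termcrit == "MAXTREES":
        return trees_done
    if termcrit == "TREESACC":
        return accuracy_done
    return trees_done or accuracy_done         # BOTH: stop at whichever criterion is met first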

MAXTREES

Specifies the maximum number of trees to generate during the classification process.

Typically, the more trees you have, the better the accuracy. However, the improvement in accuracy generally diminishes and levels off beyond a certain number of trees.

Since the Random Trees classifier is not as computationally intensive as the SVM classifier, a value in the range of 3000 to 5000 is reasonable.

This parameter might not be used if the termination criteria (TERMCRIT) is set to BOTH and the tree accuracy is reached before the maximum number of trees.

TREESACC

Specifies the accuracy value at which to stop adding trees once it is reached.

This parameter might not be used if the termination criteria (TERMCRIT) is set to BOTH and the maximum number of trees is reached before the tree accuracy reaches your specified limit.

TRNMODEL

The name of the file to create that contains the Random Trees training model.

A file of the same name must not exist in the output folder.

The file name extension must be .xml.
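
Because a file of the same name must not already exist in the output folder, a simple pre-flight check in ordinary Python (not part of the algorithm) can avoid a failed run; the output name is the one used in the example below.

import os

trnmodel = "RT_training_set1.xml"   # same output name as in the example below
if os.path.exists(trnmodel):
    raise FileExistsError(trnmodel + " already exists; choose a new name or remove the old model.")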


Details

Random Trees workflow

A typical workflow starts by running the OASEG algorithm, to segment your image into a series of object polygons. Next you would calculate a set of attributes (statistical, geometrical, textural, and so on) by running the OACALCATT algorithm. Alternatively, when you are working with SAR data, you would use OASEGSAR and OACALCATTSAR. You can then, in Focus Object Analyst, manually collect or import training samples for some land-cover or land-use classes; alternatively, use OAGTIMPORT for this task. The training samples are stored in a field of the segmentation attribute table with a default name of Training.

To train a Random Trees model with OARTTRAIN, the following is required as input:
  • A segmentation with a field containing training samples
  • A list of attributes

You can create the list of attributes by running OAFLDNMEXP. Alternatively, the list can be read directly from the table of segmentation attributes using field metadata that was created by OACALCATT or OACALCATTSAR.

Figure 1. Workflow of Random Trees training


Random Trees classification

A single decision tree is easy to conceptualize but typically suffers from high variance, which makes it uncompetitive in terms of accuracy.

One way to overcome this limitation, in the context of randomization-based ensemble methods (Breiman, 2001), is to produce many variants of a single decision tree by selecting a different subset of the same training set each time. Random Trees (RT), also called Random Forest Trees (RFT), is a machine-learning algorithm based on decision trees that belongs to the class of ensemble classification methods. The term ensemble implies a method that makes predictions by averaging over the predictions of several independent base models.

The Random Forest algorithm, hereafter called Random Trees for trademark reasons, was originally conceived by Breiman (2001) "as a method of combining several CART style decision trees using bagging [...] Since its introduction by Breiman (2001) the random forests framework has been extremely successful as a general-purpose classification and regression method" (Denil et al., 2014).

The fundamental principle of ensemble methods based on randomization "is to introduce random perturbations into the learning procedure in order to produce several different models from a single learning set L and then to combine the predictions of those models to form the prediction of the ensemble" (Louppe, 2014). In other words, "significant improvements in classification accuracy have resulted from growing an ensemble of trees and letting them vote for the most popular class. In order to grow these ensembles, often random vectors are generated that govern the growth of each tree in the ensemble" (Breiman, 2001).
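
The voting idea in the quotation above can be illustrated with a few lines of ordinary Python; the class labels and per-tree predictions are made up for the illustration.

from collections import Counter

# Hypothetical predictions for one object from five independent trees.
tree_votes = ["forest", "water", "forest", "forest", "urban"]
predicted_class = Counter(tree_votes).most_common(1)[0][0]   # majority vote -> "forest"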

"There are three main choices to be made when constructing a random tree. These are (1) the method for splitting the leafs, (2) the type of predictor to use in each leaf, and (3) the method for injecting randomness into the trees" (Denil et al., 2014). A common technique for introducing randomness in a Tree "is to build each tree using a bootstrapped or sub-sampled data set. In this way, each tree in the forest is trained on slightly different data, which introduces differences between the trees" (Denil et al., 2014). Randomization can also occur by randomizing "the choice of the best split at a given node... experiments show however that when noise is important, Bagging usually yield better results" (Louppe, 2014).

When optimizing a Random Trees model, "special care must be taken so that the resulting model is neither too simple nor too complex. In the former case, the model is indeed said to underfit the data, i.e., to be not flexible enough to capture the structure between X and Y. In the latter case, the model is said to overfit the data, i.e., to be too flexible and to capture isolated structures (i.e., noise) that are specific to the learning set" (Louppe, 2014).

There is then a need to define stopping criteria to stop the growth of a tree before it reaches too many levels, to prevent overfitting: "Stopping criteria are defined in terms of user defined hyper-parameters" (Louppe, 2014). Among those parameters, the most common are:
  • The minimum number of samples in a terminal node to allow it to split
  • The minimum number of samples in a leaf node when the terminal node is split
  • The maximum tree depth, that is, the maximum number of levels a tree can grow
  • The tree accuracy (defined by the Gini impurity index) falling below a fixed threshold
"These parameters have to be tuned in order to find the right trade-off, they need to be such that they are neither too strict nor too loose for the tree to be neither too shallow nor too deep" (Louppe, 2014). As described in Breiman (2002), some of the key features of Random Trees are:
  • It is an excellent classifier, comparable in accuracy to support vector machines.
  • It generates an internal unbiased estimate of the generalization error as the forest building progresses.
  • It has an effective method for estimating missing data and maintains accuracy when up to 80% of the data are missing.
  • It has a method for balancing error in unbalanced class population data sets.
  • Generated forests can be saved for future use on other data.
  • It gives estimates of what variables are important in the classification.
  • Output is generated that gives information about the relation between the variables and the classification.
  • It computes proximities between pairs of cases that can be used in clustering, locating outliers, or, by scaling, giving interesting views of the data.

In general, the Random Trees classifier, unlike the Support Vector Machine (SVM), can handle a mix of categorical and numerical variables. Random Trees is also less sensitive to data scaling, whereas SVM often requires data to be normalized before training and classification. However, SVM is reported to perform better when the training set is small or unbalanced. The Random Trees classifier is computationally less intensive than SVM and works better and faster with large training sets.

Many versions of the Random Trees algorithm exist. Object Analyst uses the OpenCV implementation, which uses the Gini impurity index to determine a good split point for a node of the classification tree, and uses the minimum number of samples, the maximum tree depth, and the accuracy of the trees as stopping criteria. An in-depth review of popular implementations of Random Trees is provided in section 5.4.2 of Louppe (2014).
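
For reference, the Gini impurity of a set of class labels is 1 minus the sum of the squared class proportions; the sketch below is a generic illustration of that formula, not the OpenCV code.

from collections import Counter

def gini_impurity(labels):
    # Gini impurity = 1 - sum(p_k ** 2) over the class proportions p_k.
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["a", "a", "a", "a"]))   # 0.0 (pure node)
print(gini_impurity(["a", "a", "b", "b"]))   # 0.5 (two classes, maximally mixed)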


Example

from pci.oarttrain import oarttrain
filv="l7_ms_seg25_0.5_0.5.pix"
dbvs=[2]
trnfld="Training_set1"                          # Field containing the training samples
tfile="l7_ms_seg25_0.5_0.5_attributes.txt"      # List of oaattributes to use to train the random trees (RT) model
maxdepth=[20]
minsamp=[5]
actvars=[0]
termcrit="both"
maxtrees=[2000]
treesacc=[0.05]
trnmodel="RT_training_set1.xml"                 # Output RT model

oarttrain (filv, dbvs, trnfld, tfile, maxdepth, minsamp, actvars, termcrit, maxtrees, treesacc, trnmodel)
      

References

The core Random Trees algorithm described herein was originally based on the OpenCV Random Trees implementation (docs.opencv.org/2.4/modules/ml/doc/random_trees.html) in C++, which in turn is based on Breiman and Cutler's Random Forests for Classification and Regression, originally coded in Fortran.

© PCI Geomatics Enterprises, Inc.®, 2024. All rights reserved.