Index


NAME

MachineLearning::DecisionTrees – create, test, print out, store, and employ decision trees

SYNOPSIS

  use MachineLearning::DecisionTrees;
  my $grove = MachineLearning::DecisionTrees->new({
    "InputFieldNames" => ["Input1", "Input2", "Input3"],
    "OutputFieldNames" => ["Output1", "Output2"],
    "InputDataTypes" => {
      "Input1" => ["male", "female"],
      "Input2" => "number",
      "Input3" => "special_number"},
    "TrainingData" => $training_data_file_path});
  my $all_is_well = $grove->get_success();
  my $message = $grove->get_message();
  my $saved_ok = $grove->save($file_path);
  my $grove = MachineLearning::DecisionTrees->open($file_path);
  my $tree_printout = $grove->print_out($tree_name);
  my $tree_test_results = $grove->test({
    "TreeName" => $tree_name,
    "ValidationData" => $validation_data_file_path,
    "NodeList" => \@selected_node_numbers});
  my $completed_ok = $grove->employ({
    "TreeName" => $tree_name,
    "TargetData" => $writable_data_file_path,
    "NodeList" => \@selected_node_numbers});

DESCRIPTION

This module defines a package for creating a MachineLearning::DecisionTrees object.

Decision trees as implemented by the MachineLearning::DecisionTrees module favor quality over quantity. That is, they are optimized to find the highest quality predictions for a single output value without undue regard to how many data records (or instances) get screened out in the process. This approach is highly useful for many applications. For example, out of many potential financial investments, you might want to identify a small group that have unusually good prospects for success.

The output values supported by the MachineLearning::DecisionTrees module are 1 and 0, and the accuracy of the results is important only for the 1 values. When choosing the attribute on which to split a node, the criterion is maximization of the highest resulting ones percentage. (In the case of a tie, the highest quantity of records breaks the tie.) NOTE: The selection of the first three attributes (resulting in the children, grandchildren, and great-grandchildren of the root node) is performed organically. That is, the best out of all possible permutations of attributes for the first three slots "wins". Subsequent attributes are chosen individually.

Decision trees as implemented by the MachineLearning::DecisionTrees module come in groves. A grove is one or more decision trees, all built from the same data file. The data file used to build (or train) a grove contains one or more output fields, and each output field corresponds to one tree. When printed out, a decision tree identifies itself using the name of the output field for which it was built.

A decision tree is made up of nodes, starting with a root node (which typically appears at the top in graphical representations, highlighting the fact that trees in nature are upside down :-) and branching from there via child nodes. During the building, testing, or employment of a decision tree, each parent node divides data records associated with it among its children according to each record's value for a particular input field (or attribute). A node with no children is a leaf node and terminates a particular path through the tree.

A path through a decision tree always starts at the root node and ends when one of the following conditions occurs during tree creation, which comprises training and pruning phases:

Pruning

After a tree has been built, it gets pruned from the bottom up. The pruning algorithm removes any leaf node if the records associated with one of the node's ancestors (other than the root node) have an equal or higher percentage of ones for the output field on which the tree is based than do the records associated with the leaf node. Any nodes that become leaf nodes during the pruning process are themselves subject to pruning.

Once a tree has been pruned, the remaining leaf nodes are sorted by quality (defined as the magnitude of the ones percentage for the records associated with the leaf node). In a printout of the tree, the leaf nodes appear in order by quality (highest to lowest) together with the exact ones percentage for each leaf node based on the training data. For each leaf node, the printout also gives the number of associated training records and the path through the tree.

Input fields and output fields

The data used to build, test, or employ decision trees must include at least two input fields, which are used as attributes by nodes during spawning.

The data must also include one or more output fields in which each value is either a 1 or a 0. (The output values can be blank when the data is passed to the employ() method.) For example, there might be a field for whether you enjoy certain TV shows that have been on in the past and another field for whether your spouse enjoys them. You could create a grove of two decision trees from this data: one tree to select shows that you would most likely enjoy out of the new television offerings starting up in the fall, and another tree to select shows that your spouse would most likely enjoy. (With luck, there will be some that you both enjoy!)

Input data types

Each input field uses one of three data types: enumeration, number, or special_number.

An enumeration type comprises a list of two or more possible unique text values (for example, male and female for a Gender field).

A number type comprises decimal numbers. The tree-building algorithm divides the values in a number field into the following categories after statistically analyzing all the values for that field in the data set. (In the Criteria column below, m is the mean and d is the population standard deviation.)

    Category           Criteria
    --------           --------

    abnormally_low     Less than m - 2d

    low                Greater than or equal to m - 2d, and less than
                       m - d

    medium             Greater than or equal to m - d, and less than
                       or equal to m + d (that is, within one standard
                       deviation of the mean)

    high               Greater than m + d, and less than or equal to
                       m + 2d

    abnormally_high    Greater than m + 2d

A special_number type is a decimal number for which the sign and the absolute value have separate significance. For example, the sign might indicate direction (up or down) while the absolute value indicates momentum, or the sign might indicate approval versus disapproval while the absolute value indicates the intensity of the response. The categories for this type are abnormally_strong_negative, strong_negative, medium_negative, weak_negative, abnormally_weak_negative, abnormally_weak_positive_or_zero, weak_positive, medium_positive, strong_positive, and abnormally_strong_positive.

The categories into which the algorithm places numeric inputs act as enumerated values.

PREREQUISITES

To use this module, you must have the MachineLearning module installed.

METHODS

MachineLearning::DecisionTrees->new($args);

This is the constructor.

In addition to the class-name argument, which is passed automatically when you use the MachineLearning::DecisionTrees->new() syntax, the constructor takes a reference to a hash containing the following keys:

  InputFieldNames
  OutputFieldNames
  InputDataTypes
  TrainingData

The values associated with the InputFieldNames and OutputFieldNames keys must each be a reference to an array of field names. All field names (for input and output fields combined) must be unique. Field names must not contain commas or line-break characters. There must be at least two input field names and at least one output field name.

The value associated with the InputDataTypes key must be a reference to a hash in which the keys are input field names (which must match those specified by the InputFieldNames argument) and each value indicates the data type for the corresponding input field. If the value is an array reference, the field has a data type of enumeration and the array must contain two or more strings, all unique, representing the possible values for the input field. Otherwise, the value indicating the data type must be the string number or the string special_number.

The value associated with the TrainingData key must be the path to a file containing CSV-format training data. The first line in the file must contain field names, which must match the field names specified by the InputFieldNames and OutputFieldNames arguments.

The constructor returns a reference to a MachineLearning::DecisionTrees object (a grove), which is implemented internally as a hash. All functionality is exposed through methods.

If the constructor receives a valid hash reference providing all required information, the get_success() instance method returns true (1) and the get_message() instance method returns an empty string; otherwise, get_success() returns false (0) and get_message() returns a string containing an error message.

$grove->get_success();

This returns true (1) if the grove was initialized successfully; otherwise, it returns false (0).

$grove->get_message();

When get_success() returns true (1), this returns an empty string. When get_success() returns false (0), get_message() returns an error message.

$grove->save($file_path);

This method saves (serializes) the grove to the specified file.

If the serialization and file-writing operations succeed, this method returns true (1); otherwise, the method sets the _success field to false (0), places an error message in the _message field, and returns false (0).

MachineLearning::DecisionTrees->open($file_path);

This method returns a new MachinLearning::DecisionTrees object (a grove) created by restoring such an object from the specified file. If the file-reading and deserialization operations succeed, the resulting object's get_success() method returns true (1) and the get_message() method returns an empty string. Otherwise, the open() method creates and returns a new object that has a false value in the _success field and an error message in the _message field.

$grove->print_out($tree_name);

This method prints out a particular tree within the grove. The name that you provide to the method is the name of the output field corresponding to the tree that you want to study.

The printout lists each leaf node in order by quality (highest to lowest). For each leaf node, the printout gives a number with which you can specify the node when you call the test() method or the employ() method, the ones percentage for the node after training, the number of records associated with the node after training, and the path through the tree leading to the node.

This method returns a string.

$grove->test($args);

This method takes a reference to a hash with the following fields:

    TreeName – The name of the tree, which is the same as the name
    of the output field for which the tree was built.

    ValidationData – The path to a file containing CSV-format
    validation data.  This must have the same fields and data types
    as the training data used to create the grove.

    NodeList – A reference to an array containing the numbers of the
    nodes that you want to use in the test to indicate "1" results.

This method returns a string giving the test results, which comprise the number of records in the test data, the number of records assigned a "1" result during the test, and the accuracy (as a percentage) of those "1" results based on the validation data.

$grove->employ($args);

This method takes a reference to a hash with the following fields:

    TreeName – The name of the tree, which is the same as the name
    of the output field for which the tree was built.

    TargetData – The path to a file containing CSV-format data.
    This must have the same fields and data types as the training
    data used to create the grove.

    NodeList – A reference to an array containing the numbers of the
    nodes that you want to use to generate "1" results.

This method divides the records from the specified data file among the leaf nodes of the specified tree, assigning a "1" result to any record associated with one of the nodes given by the NodeList argument. For those records, the method places a 1 in the output field associated with the tree. For the remaining records, the method places a 0 in that field.

The data file that you provide to this method must contain an output field with the same name as the tree; however, the values in that field can be blank. If an output value is not blank, it must be a 1 or a 0 to conform to the data type defined for output fields. This method populates or overwrites the values in the output field with new results.

If everything goes OK, this method returns true (1); otherwise, the method sets the _success field to false (0), places an error message in the _message field, and returns false (0).

AUTHOR

Benjamin Fitch, <blernflerkl@yahoo.com>

COPYRIGHT AND LICENSE