
C4.5 Algorithm

  • Writer: Mark Mauricio
  • Feb 24, 2018
  • 3 min read

C4.5 is an extension of the ID3 algorithm designed by Quinlan. It is a decision tree-based algorithm that can be used for classification problems. C4.5 deals with both continuous and discrete attributes, handles missing values, and prunes trees after construction [73]. A decision tree has several parts: the root, nodes, edges, and leaves. The root is the initial decision node; each internal node represents a test on an attribute; an edge is the rule (outcome) to follow; a leaf represents a classification or decision.




Root Node: Represents the entire population or sample, which is further divided into two or more homogeneous sets.

Splitting: The process of dividing a node into two or more sub-nodes.

Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.

Leaf/Terminal Node: A node with no children (no further splits) is called a leaf or terminal node.

Pruning: Reducing the size of a decision tree by removing nodes (the opposite of splitting) is called pruning.

Branch / Sub-Tree: A sub-section of a decision tree is called a branch or sub-tree.

Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, while the sub-nodes are the children of the parent node. A minimal node structure capturing these parts is sketched below.
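To make these terms concrete, here is a minimal Python sketch (not from the original post; the class name TreeNode and its fields are illustrative assumptions) of how a decision tree node might be represented:

```python
class TreeNode:
    """A single node of a decision tree (illustrative sketch)."""

    def __init__(self, attribute=None, label=None):
        self.attribute = attribute   # attribute tested at this node (None for a leaf)
        self.label = label           # class label if this node is a leaf
        self.children = {}           # edge value -> child TreeNode

    def is_leaf(self):
        # A leaf/terminal node has no children (no further splits).
        return not self.children

    def add_child(self, edge_value, child):
        # Splitting: attach a sub-node reached by following the given edge.
        self.children[edge_value] = child
```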


Algorithm: C4.5 decision tree

Input: an attribute–value dataset D

  1. Tree = {}

  2. If D is “pure” or stopping criteria met then

  3. Terminate

  4. end if

  5. for all attribute a ∈ D do

  6. Compute information-theoretic (splitting) criteria if we split on a

  7. end for

  8. a_best = Best attribute according to the above computed criteria

  9. Tree = Create a decision node that tests a_best in the root

  10. Dv = Induced sub-datasets from D based on a_best

  11. for all Dv do

  12. Treev = C4.5(Dv)

  13. Attach Treev to the corresponding branch of Tree

  14. end for

  15. return Tree
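To make the recursion above more concrete, here is a simplified Python sketch (my own illustration, not the post's implementation) that builds a tree over discrete attributes using gain ratio; it deliberately omits continuous attributes, missing values, and pruning, and the function and dictionary key names are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """Information gain of splitting on attr, divided by the split's intrinsic information."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    split_entropy = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - split_entropy
    intrinsic = -sum((len(g) / total) * math.log2(len(g) / total) for g in groups.values())
    return gain / intrinsic if intrinsic > 0 else 0.0

def c45(rows, labels, attributes):
    """Simplified C4.5-style recursion for discrete attributes (no pruning)."""
    # Stopping criteria: the node is "pure" or no attributes remain to test.
    if len(set(labels)) == 1 or not attributes:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # Choose a_best: the attribute with the highest gain ratio.
    a_best = max(attributes, key=lambda a: gain_ratio(rows, labels, a))
    node = {"attribute": a_best, "branches": {}}
    remaining = [a for a in attributes if a != a_best]
    # Induce sub-datasets Dv from D based on a_best and recurse on each.
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[a_best], ([], []))
        subsets[row[a_best]][0].append(row)
        subsets[row[a_best]][1].append(label)
    for value, (sub_rows, sub_labels) in subsets.items():
        node["branches"][value] = c45(sub_rows, sub_labels, remaining)
    return node
```

Calling c45 on a list of attribute dictionaries and a parallel list of class labels returns a nested dictionary: internal nodes carry the tested attribute and their branches, while leaves carry a class label.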


The C4.5 algorithm checks whether the stopping criteria are met, finds the best attribute for the current node using the splitting criterion, recurses on the induced sub-datasets, and then performs tree pruning.


In order to handle continuous attributes, C4.5 splits the attribute values into two partitions based on a selected threshold, such that all values above the threshold go to one child and the remaining values to the other. It also handles missing attribute values. C4.5 uses gain ratio as the attribute selection measure to build the decision tree.
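As a rough sketch of how such a threshold could be chosen (again my own illustration; the function name best_threshold is an assumption), the snippet below tries each midpoint between consecutive sorted values and keeps the one that maximizes information gain:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Pick the binary split threshold on a continuous attribute that maximizes information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_t = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no split boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2  # candidate threshold at the midpoint
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        gain = base - (len(left) / len(pairs)) * entropy(left) - (len(right) / len(pairs)) * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain
```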


Gain Ratio:

IntrinsicInfo_A(D) = -Σ_{j=1..v} (|D_j| / |D|) · log2(|D_j| / |D|), where attribute A splits D into v partitions D_1, ..., D_v

GainRatio(A) = Gain(A) / IntrinsicInfo_A(D)

Gain ratio removes the bias of information gain toward attributes with many outcome values. First, the gain ratio of each attribute is calculated; the attribute with the maximum gain ratio becomes the root node. C4.5 uses pessimistic pruning to remove unnecessary branches from the decision tree and improve classification accuracy.
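As a small worked example of these formulas, with numbers chosen purely for illustration: suppose attribute A splits a dataset of 10 records into subsets of sizes 4 and 6, and the information gain of that split is 0.25 bits. The intrinsic information and gain ratio then work out roughly as follows:

```python
import math

# Hypothetical split: attribute A divides 10 records into subsets of size 4 and 6,
# and the split's information gain is assumed to be 0.25 bits.
subset_sizes = [4, 6]
total = sum(subset_sizes)
gain_A = 0.25

intrinsic_info = -sum((n / total) * math.log2(n / total) for n in subset_sizes)
gain_ratio = gain_A / intrinsic_info

print(round(intrinsic_info, 3))  # ~0.971 bits
print(round(gain_ratio, 3))      # ~0.257
```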

One of the questions that arises in a decision tree algorithm is the optimal size of the final tree. A tree that is too large risks overfitting the training data and generalizing poorly to new samples, while a small tree might not capture important structural information about the sample space. However, it is hard to tell when the algorithm should stop, because it is impossible to know whether adding a single extra node will dramatically decrease error. This problem is known as the horizon effect. A common strategy is to grow the tree until each node contains a small number of instances and then use pruning to remove nodes that do not provide additional information. Pruning should reduce the size of the learned tree without reducing predictive accuracy as measured by a cross-validation set. There are many techniques for tree pruning that differ in the measurement used to optimize performance.

Reduced error pruning: One of the simplest forms of pruning is reduced error pruning. Starting at the leaves, each node is replaced with its most popular class. If the prediction accuracy is not affected then the change is kept. While somewhat naive, reduced error pruning has the advantage of simplicity and speed.
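A minimal sketch of reduced error pruning, assuming the dictionary-based tree produced by the earlier c45 sketch and a held-out validation set; approximating each node's most popular class by the majority label among its leaves is also an assumption of this sketch:

```python
from collections import Counter

def predict(tree, row, default=None):
    """Follow branches until a leaf is reached; fall back to default for unseen attribute values."""
    while "leaf" not in tree:
        tree = tree["branches"].get(row.get(tree["attribute"]))
        if tree is None:
            return default
    return tree["leaf"]

def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(labels)

def leaf_labels(tree):
    """Collect the leaf labels under a node (stand-in for its most popular class)."""
    if "leaf" in tree:
        return [tree["leaf"]]
    return [l for child in tree["branches"].values() for l in leaf_labels(child)]

def reduced_error_prune(node, root, val_rows, val_labels):
    """Bottom-up: replace a subtree with its most popular class if validation accuracy does not drop."""
    if "leaf" in node:
        return
    for child in node["branches"].values():
        reduced_error_prune(child, root, val_rows, val_labels)
    before = accuracy(root, val_rows, val_labels)
    backup = dict(node)                          # remember the subtree
    node.clear()
    node["leaf"] = Counter(leaf_labels(backup)).most_common(1)[0][0]
    if accuracy(root, val_rows, val_labels) < before:
        node.clear()
        node.update(backup)                      # pruning hurt accuracy, so keep the original subtree
```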

Cost complexity pruning: Unlike reduced error pruning, this approach is not as straightforward. The idea is to trade off the size of the tree against its estimated error rate.
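As a rough illustration of that trade-off (my own sketch, with a hypothetical penalty parameter alpha), a cost-complexity score can be written as the estimated error rate plus a penalty per leaf; a subtree is kept only if its penalized cost beats that of collapsing it into a single leaf:

```python
def cost_complexity(error_rate, num_leaves, alpha):
    """Penalized cost: estimated error rate plus alpha per leaf (alpha is a hypothetical tuning knob)."""
    return error_rate + alpha * num_leaves

# Example: a subtree with 5 leaves and 10% error vs. collapsing it into 1 leaf with 18% error.
subtree_cost = cost_complexity(0.10, 5, alpha=0.02)    # 0.10 + 0.10 = 0.20
collapsed_cost = cost_complexity(0.18, 1, alpha=0.02)  # 0.18 + 0.02 = 0.20
# With alpha = 0.02 the two options tie; a larger alpha favors the smaller (collapsed) tree.
```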

