C4.5 Algorithm
- Mark Mauricio

- Feb 24, 2018
- 3 min read

C4.5 is an extension of the ID3 algorithm designed by Quinlan. It is a decision tree-based algorithm that can be used for classification problems. C4.5 handles both continuous and discrete attributes, deals with missing values, and prunes trees after construction [73]. A decision tree has several parts: the root, nodes, edges, and leaves. The root is the initial decision node; a node represents a test on an attribute; an edge is the rule to follow; a leaf represents a classification or decision.
Root Node: It represents the entire population or sample, which further gets divided into two or more homogeneous sets.
Splitting: The process of dividing a node into two or more sub-nodes.
Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
Leaf/Terminal Node: Nodes with no children (no further splits) are called leaf or terminal nodes.
Pruning: When we reduce the size of a decision tree by removing nodes (the opposite of splitting), the process is called pruning.
Branch / Sub-Tree: A subsection of a decision tree is called a branch or sub-tree.
Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are the children of the parent node.
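These parts map naturally onto a simple data structure. Below is a minimal Python sketch of a decision tree node; the class and field names are illustrative, not part of Quinlan's original implementation.

class DecisionNode:
    """A node in a decision tree.

    A node with children tests an attribute; a node with no
    children is a leaf that reports a class label.
    """
    def __init__(self, attribute=None, label=None):
        self.attribute = attribute   # attribute tested at this node (None for a leaf)
        self.label = label           # majority class label at this node
        self.children = {}           # edge value -> child DecisionNode

    def is_leaf(self):
        return not self.children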
Algorithm: C4.5 decision tree
Input: an attribute-value dataset D
Tree = {}
if D is "pure" or stopping criteria are met then
    terminate
end if
for all attributes a in D do
    compute the information-theoretic (splitting) criterion for a split on a
end for
a_best = best attribute according to the criterion computed above
Tree = create a decision node that tests a_best in the root
D_v = induced sub-datasets of D based on a_best
for all D_v do
    Tree_v = C4.5(D_v)
    attach Tree_v to the corresponding branch of Tree
end for
return Tree
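The recursion above translates almost line-for-line into code. Here is a minimal Python sketch, assuming the DecisionNode class sketched earlier and a splitting-criterion function score (for example, the gain ratio computed in the next section); these helpers are illustrative, not Quinlan's original implementation.

from collections import Counter

def c45(rows, attributes, target, score):
    """Recursively build a decision tree from `rows`.

    rows       -- list of dicts mapping attribute name -> value
    attributes -- candidate attributes still available for splitting
    target     -- name of the class attribute
    score      -- splitting criterion: score(rows, attribute, target) -> float
    """
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping criteria: the node is pure or no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return DecisionNode(label=majority)
    # Best attribute according to the splitting criterion.
    a_best = max(attributes, key=lambda a: score(rows, a, target))
    # Keep the majority class on internal nodes too, so pruning
    # can later collapse them into leaves.
    node = DecisionNode(attribute=a_best, label=majority)
    remaining = [a for a in attributes if a != a_best]
    # Induce one sub-dataset per value of a_best and recurse.
    for value in set(r[a_best] for r in rows):
        subset = [r for r in rows if r[a_best] == value]
        node.children[value] = c45(subset, remaining, target, score)
    return node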
At each recursive call, the C4.5 algorithm checks whether the stopping criteria are met and finds the best attribute (the parent node) using the splitting criterion; after construction, it performs tree pruning.
To handle continuous attributes, C4.5 splits the attribute's values into two partitions based on a selected threshold: all values above the threshold go to one child and the remaining values to the other. It also handles missing attribute values. C4.5 uses the gain ratio as its attribute selection measure to build the decision tree.
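A common way to pick the threshold, sketched below in Python, is to sort the observed values and evaluate the candidate midpoints between consecutive distinct values; the function names here are illustrative.

def candidate_thresholds(values):
    """Midpoints between consecutive distinct sorted values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

def binary_split(rows, attribute, threshold):
    """Partition rows into (<= threshold, > threshold) children."""
    left = [r for r in rows if r[attribute] <= threshold]
    right = [r for r in rows if r[attribute] > threshold]
    return left, right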
Gain Ratio:

IntrinsicInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)

GainRatio(A) = \frac{Gain(A)}{IntrinsicInfo_A(D)}

where attribute A splits D into v partitions D_1, ..., D_v and Gain(A) is the information gain of A.
The gain ratio removes the bias of information gain toward attributes with many outcome values. First, calculate the gain ratio of each attribute; the attribute whose gain ratio is maximal becomes the root node. C4.5 uses pessimistic pruning to remove unnecessary branches from the decision tree and improve the accuracy of classification.
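The calculation is straightforward to express in code. Below is a minimal Python sketch of the gain ratio for a discrete attribute, following the formulas above; the rows-as-dicts representation matches the earlier sketches, and this function can be passed as the score argument to the c45 sketch.

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, attribute, target):
    """GainRatio(A) = Gain(A) / IntrinsicInfo_A(D)."""
    n = len(rows)
    partitions = Counter(r[attribute] for r in rows)
    # Expected entropy of the class after splitting on the attribute.
    remainder = sum(
        (count / n) * entropy([r[target] for r in rows if r[attribute] == v])
        for v, count in partitions.items()
    )
    gain = entropy([r[target] for r in rows]) - remainder
    intrinsic = -sum((c / n) * math.log2(c / n) for c in partitions.values())
    return gain / intrinsic if intrinsic > 0 else 0.0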
One of the questions that arises in a decision tree algorithm is the optimal size of the final tree. A tree that is too large risks overfitting the training data and generalizing poorly to new samples. A small tree might not capture important structural information about the sample space. However, it is hard to tell when a tree algorithm should stop, because it is impossible to know whether the addition of a single extra node will dramatically decrease error. This problem is known as the horizon effect. A common strategy is to grow the tree until each node contains a small number of instances, then use pruning to remove nodes that do not provide additional information. Pruning should reduce the size of the learned tree without reducing its predictive accuracy as measured by a cross-validation set. There are many techniques for tree pruning that differ in the measurement used to optimize performance.
Reduced error pruning: One of the simplest forms of pruning is reduced error pruning. Starting at the leaves, each node is replaced with its most popular class. If the prediction accuracy is not affected, the change is kept. While somewhat naive, reduced error pruning has the advantage of simplicity and speed.
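As an illustration, here is a minimal Python sketch of reduced error pruning over the DecisionNode structure used above, validated on a held-out set. The classify and accuracy helpers are assumptions for this sketch, not part of the original post.

def classify(node, row):
    """Follow edges until a leaf (or an unseen value) is reached."""
    while not node.is_leaf():
        child = node.children.get(row[node.attribute])
        if child is None:
            break
        node = child
    return node.label

def accuracy(tree, rows, target):
    return sum(classify(tree, r) == r[target] for r in rows) / len(rows)

def reduced_error_prune(node, tree, validation, target):
    """Bottom-up: collapse a subtree into a leaf (its majority class)
    whenever that does not hurt validation accuracy."""
    for child in node.children.values():
        if not child.is_leaf():
            reduced_error_prune(child, tree, validation, target)
    before = accuracy(tree, validation, target)
    saved_children, saved_attr = node.children, node.attribute
    node.children, node.attribute = {}, None      # tentatively prune to a leaf
    if accuracy(tree, validation, target) < before:
        node.children, node.attribute = saved_children, saved_attr  # revert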
Cost complexity pruning: Unlike reduced error pruning, this approach is not as straightforward. The idea is to weigh the size of the tree against its estimated error rate.
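One common formulation of this trade-off scores each internal node by how much error its subtree saves per extra leaf it adds; the node with the smallest score is the "weakest link" and is pruned first. The sketch below assumes the DecisionNode structure from above, and the error counts are illustrative inputs, not something the original post specifies.

def num_leaves(node):
    if node.is_leaf():
        return 1
    return sum(num_leaves(c) for c in node.children.values())

def weakest_link_score(errors_as_leaf, errors_as_subtree, leaves):
    """alpha = (error if collapsed to a leaf - subtree error) / (leaves - 1).
    Called on internal nodes only (leaves >= 2). A smaller alpha means
    pruning the subtree costs little accuracy per leaf removed,
    so that subtree is pruned first."""
    return (errors_as_leaf - errors_as_subtree) / (leaves - 1)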