Multiclass Classification and Multiple Binary Classifiers

Placeholder

Should we always use hierarchical softmax instead of softmax?

Pros

Each update only touches the nodes on the path from the root to the target leaf, so for K classes and a balanced tree the cost per example is roughly O(log K) instead of O(K)
It usually does better on infrequent categories when the training set is imbalanced (see https://stats.stackexchange.com/q/180076/95569)
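
To make the efficiency point concrete, here is a minimal sketch of how a class probability could be computed as a product of binary decisions along a root-to-leaf path. It is an illustration under stated assumptions, not a reference implementation: the heap-style tree layout, the random weights, and the names path_to_leaf and class_probability are all hypothetical, and K is assumed to be a power of two so that every path has length log2(K).

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def path_to_leaf(k, num_classes):
    """Return [(internal_node, go_right), ...] from the root to class k's leaf.

    Assumed heap layout: internal nodes are 1..num_classes-1, the leaf for
    class k is num_classes + k, and node n has children 2n and 2n + 1.
    """
    node = num_classes + k
    path = []
    while node > 1:
        parent = node // 2
        path.append((parent, node % 2 == 1))  # odd index = right child
        node = parent
    return path[::-1]

def class_probability(k, h, node_weights, num_classes):
    """P(class k | h) as a product of one sigmoid decision per path node."""
    p = 1.0
    for node, go_right in path_to_leaf(k, num_classes):
        score = sum(w * x for w, x in zip(node_weights[node], h))
        p_right = sigmoid(score)
        p *= p_right if go_right else 1.0 - p_right
    return p

random.seed(0)
NUM_CLASSES, DIM = 8, 4  # illustrative sizes
# One weight vector per internal node (random here, learned in practice).
node_weights = {n: [random.gauss(0, 1) for _ in range(DIM)]
                for n in range(1, NUM_CLASSES)}
h = [random.gauss(0, 1) for _ in range(DIM)]  # hidden representation

probs = [class_probability(k, h, node_weights, NUM_CLASSES)
         for k in range(NUM_CLASSES)]
```

With 8 classes each probability uses only 3 of the 7 internal nodes, and a gradient step for one example would update only those 3 weight vectors. The probabilities still sum to 1 without computing a normalizer over all classes, because each internal node splits its mass between its two children.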

Cons

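XOR problem at the root node: each internal node is a single logistic (linear) classifier, so if the classes grouped under the left and right subtrees are not linearly separable, the root cannot split them correctly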