Should we always use hierarchical softmax instead of softmax?
Pros
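- Each update touches only the nodes on the path from the target leaf to the root (about log2(V) nodes for a balanced tree over V classes, versus all V outputs for a full softmax), so training is more efficient
- It usually does better on infrequent categories when the training set is imbalanced: https://stats.stackexchange.com/q/180076/95569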
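The efficiency point can be illustrated with a minimal sketch. It assumes a balanced binary tree stored in heap order (root is node 1, class k sits at leaf `num_classes + k`) and one logistic unit per internal node; the names `path_to_leaf` and `class_prob` and the random weights are hypothetical, chosen only for this illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def path_to_leaf(k, num_classes):
    """Return [(internal_node, go_right), ...] from the root to class k's leaf.

    Heap indexing: the root is 1, the children of node n are 2n and 2n + 1,
    and class k is the leaf num_classes + k (num_classes is a power of two).
    """
    node = num_classes + k
    path = []
    while node > 1:
        path.append((node // 2, node % 2 == 1))
        node //= 2
    return path[::-1]


def class_prob(k, h, weights, num_classes):
    """P(class k | h) as a product of binary decisions along the path."""
    p = 1.0
    for node, go_right in path_to_leaf(k, num_classes):
        z = sum(w * x for w, x in zip(weights[node], h))
        # sigmoid(z) + sigmoid(-z) == 1, so each node splits its mass in two.
        p *= sigmoid(z) if go_right else sigmoid(-z)
    return p


random.seed(0)
NUM_CLASSES, DIM = 8, 4
# One weight vector per internal node 1..7 (index 0 is unused).
weights = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_CLASSES)]
h = [random.gauss(0, 1) for _ in range(DIM)]

probs = [class_prob(k, h, weights, NUM_CLASSES) for k in range(NUM_CLASSES)]
touched = [node for node, _ in path_to_leaf(5, NUM_CLASSES)]

print(sum(probs))  # the leaf probabilities form a distribution
print(touched)     # only log2(8) = 3 internal nodes are involved for class 5
```

Only the weight vectors of the nodes in `touched` appear in the probability of class 5, so only they would receive a gradient for that training example.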
Cons
- XOR problem for the root node
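One possible reading of the XOR point (an assumption, since the note does not elaborate): each internal node is a single linear classifier, so if the tree puts classes into subtrees that are not linearly separable, the root cannot route inputs correctly. The toy sketch below uses four classes located at the corners of the unit square and a hypothetical tree whose root must separate {(0,0), (1,1)} from {(0,1), (1,0)}. The flat softmax weights are hand-picked for illustration, not learned.

```python
from itertools import product

# Four classes, each identified with one corner of the unit square.
CORNERS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def flat_predict(x):
    """Flat linear softmax: one linear score per class, highest score wins."""
    scores = []
    for c in CORNERS:
        w = [2 * ci - 1 for ci in c]  # hand-picked weights, not trained
        b = -sum(c)
        scores.append(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return CORNERS[scores.index(max(scores))]


flat_correct = sum(flat_predict(x) == x for x in CORNERS)

# Root target for the assumed tree: 1 if the class is in {(0,1), (1,0)}.
root_label = {c: c[0] ^ c[1] for c in CORNERS}

# Brute-force search over linear root classifiers w1*x1 + w2*x2 + b > 0.
grid = [i / 2 for i in range(-6, 7)]  # -3.0 to 3.0 in steps of 0.5
best_root_correct = 0
for w1, w2, b in product(grid, repeat=3):
    correct = sum(
        (w1 * x[0] + w2 * x[1] + b > 0) == bool(root_label[x]) for x in CORNERS
    )
    best_root_correct = max(best_root_correct, correct)

print(flat_correct)       # corners the flat softmax classifies correctly
print(best_root_correct)  # best the linear root achieves on its split
```

Under this reading, the issue comes from how classes are assigned to subtrees: a tree that groups (0,0) with (0,1) instead would give the root a linearly separable split.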