At CybelAngel, we scan the internet looking for data leaks. We bring back billions of candidate alerts, only to send a very small number of truly sensitive leaks to their legitimate owners.
Going from billions of alerts to hundreds, so that curation by analysts becomes feasible, relies on machine learning as an essential step to filter out false alerts and reduce noise.
Because we are looking for a needle in a haystack, one of the challenges we face when training a machine learning model is dealing with highly imbalanced classes. In this talk, I will present methods to tackle this problem and still obtain a model that performs well.