New paper accepted @ PSBD co-located with IEEE BigData 2020
Exploring the Impact of Resampling Methods for Malware Detection
New paper published at PSBD 2020: “Exploring the Impact of Resampling Methods for Malware Detection”
Malware detection is a well-known problem with severe consequences in terms of damages and financial losses. The class imbalance typical of this domain causes serious problems for learning algorithms, which are unable to focus correctly on the scarce malware cases. Although resampling techniques have been shown to be effective, their impact has not yet been studied for the particular domain of malware detection.
This paper focuses on evaluating resampling methods for the malware detection problem under a realistic assumption of malware rarity. Several machine learning-based solutions proposed for this domain use non-public data sets and assume different malware prevalence in the data, which makes the solutions difficult to compare. We use the freely available NSL-KDD data set, in which the rarity of the malware cases resembles a more realistic scenario. Our main goal is to assess the potential advantages of applying resampling techniques in this setting, as well as their impact when used with different standard learning algorithms. We also carry out an extensive analysis of resampling techniques under different parameter settings, which allows us to identify the best configuration of each technique. We explore not only a balanced class scenario but also other class configurations that may be more beneficial for the predictive performance of the algorithms.
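To make the idea of "other class configurations" concrete, here is a minimal sketch of random oversampling toward a chosen minority fraction. It is an illustration only and not the code or the specific resampling methods used in the paper: the function name `random_oversample`, its parameters, and the toy data (10 "malware" rows out of 100) are all assumptions made for this example, and the data is synthetic rather than NSL-KDD.

```python
import random
from collections import Counter


def random_oversample(rows, labels, minority_label, target_fraction, seed=0):
    """Duplicate random minority rows until the minority class makes up
    roughly `target_fraction` of the data (hypothetical helper)."""
    if not 0 < target_fraction < 1:
        raise ValueError("target_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    if not minority_idx:
        raise ValueError("no rows with the minority label")
    n_majority = len(labels) - len(minority_idx)
    # Solve n_min / (n_min + n_maj) = target_fraction for n_min.
    wanted = round(target_fraction * n_majority / (1 - target_fraction))
    extra = max(0, wanted - len(minority_idx))
    # Sample with replacement from the existing minority rows.
    picked = [rng.choice(minority_idx) for _ in range(extra)]
    new_rows = list(rows) + [rows[i] for i in picked]
    new_labels = list(labels) + [labels[i] for i in picked]
    return new_rows, new_labels


# Toy data: 10 minority ("malware") rows and 90 majority ("benign") rows.
rows = [[float(i)] for i in range(100)]
labels = ["malware"] * 10 + ["benign"] * 90

# Compare a partially rebalanced setting (25%) with a fully balanced one (50%).
for fraction in (0.25, 0.5):
    _, resampled = random_oversample(rows, labels, "malware", fraction)
    print(fraction, Counter(resampled))
```

In practice, resampling of this kind is normally applied to the training split only, so that the test data keeps the original class distribution.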
We present a systematic study of the effectiveness of existing resampling approaches for tackling the imbalance problem in malware domains. We show that resampling techniques present an advantage and explore their impact on the predictive accuracy of each class. Finally, we show that different resampling strategies have a different impact on the importance of the features used by the learning algorithms.