Advancing Malware Family Classification with MOTIF

Written by Robert J. Joyce and Edward Raff

Dataset provides expert-derived malware family labels

Zeus. Poison Ivy. Conficker. Stuxnet. WannaCry. Even years after discovery, the names of these malware families are still infamous. But new digital threats are constantly arising. Malware production is booming. And that means network defenders must learn to categorize newly discovered malware in a blink. To succeed, they'll need the right tools. Until now, crucial data has been unavailable. 有料盒子APP's new dataset will help cybersecurity teams accurately analyze malware faster than ever.

The ability to quickly pin down the family of malware used during a cyber attack can be a massive boon to an incident responder. Not only does family classification provide immediate insights about the characteristics and behaviors of a malware sample, but it is also a core part of triage, remediation, and attribution efforts. But figuring all this out quickly under pressure is hard. Organizations need new tools that automate malware family classification and empower defenders to swiftly understand the nature of threats and take action. Building those tools, in turn, requires better data.

The lack of reliably labeled data is a major obstacle to the development of any malware family classification tool. Manual analysis is the only way to be sure which family a particular sample belongs to, and only labels derived this way are said to have "ground truth" confidence. Because such analysis is very time-consuming even for a single file, nearly all datasets label malware with less reliable methods (such as relying on antivirus products).

Using low-quality labels to judge the performance of a malware family classifier can lead to biased or misleading evaluation results, and that's a big problem. A cybersecurity team charged with defending an organization must be able to have confidence in its analysis toolset. To enable high-confidence benchmarking of malware classification tools, 有料盒子APP has created the Malware Open-source Threat Intelligence Family (MOTIF) dataset.

MOTIF: The Largest Public Malware Dataset with Expert-Derived Family Labels

Containing 3,095 malware samples from 454 families, MOTIF is the largest and most diverse public dataset with "ground truth" family labels to date. To build the MOTIF dataset, the authors reviewed all the threat reports published by 14 cybersecurity organizations during a 5-year period.

All these reports include expert analysis about a particular family of malware. For each malware sample, 有料盒子APP is releasing:

A modified version of the original file, disarmed so that it cannot be executed
EMBER 2.0 raw features extracted from the file
A link to an expert-written, open-source threat report about the file
The malware family to which the file belongs

Additional information about the data for each malware sample is shown in Tables 1 and 2.

Table 1: Report information for each malware sample
Name	Description
md5	MD5 hash of malware sample
sha1	SHA-1 hash of malware sample
sha256	SHA-256 hash of malware sample
reported_hash	Hash of malware sample provided in report
reported_family	Normalized family name provided in report
aliases	List of known aliases for family
label	Unique id for malware family (for ML purposes)
report_source	Name of organization that published report
report_date	Date report was published
report_url	URL of report
report_ioc_url	URL to report appendix (if any)
appeared	Year and month malware sample was first seen

Table 2: EMBER raw features for each malware sample
Name	Description
histogram	EMBER byte histogram
byteentropy	EMBER byte-entropy histogram
strings	EMBER strings metadata
general	EMBER general file metadata
header	EMBER PE header metadata
section	EMBER PE section metadata
imports	EMBER imports metadata
exports	EMBER exports metadata
datadirectories	EMBER data directories metadata
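
As a minimal sketch of how the per-sample fields in Tables 1 and 2 might be consumed, the Python below assumes each sample is stored as one JSON object per line, keyed by the field names listed above. The storage format is an assumption, and the two families and three records are invented placeholders, not MOTIF data.

```python
import json
from collections import Counter

# Invented placeholder records that reuse the Table 1 field names.
# The JSON-lines layout is an assumption made for this sketch.
SAMPLE_JSONL = """\
{"md5": "placeholder-hash-1", "reported_family": "examplefam_a", "label": 0, "report_source": "Example Org"}
{"md5": "placeholder-hash-2", "reported_family": "examplefam_a", "label": 0, "report_source": "Example Org"}
{"md5": "placeholder-hash-3", "reported_family": "examplefam_b", "label": 1, "report_source": "Another Org"}
"""


def load_records(lines):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def samples_per_family(records):
    """Count how many samples carry each normalized family name."""
    return Counter(record["reported_family"] for record in records)


records = load_records(SAMPLE_JSONL.splitlines())
counts = samples_per_family(records)

for family, n in counts.most_common():
    print(f"{family}: {n} sample(s)")
```

Counting samples per family this way is a quick check on class balance before training or evaluating a classifier, since a dataset with hundreds of families tends to have many families with only a few samples each.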

Malware family naming is messy and inconsistent. Sometimes, multiple names, called aliases, are used to refer to the same family. To help with this issue, we are releasing the following information about each family in MOTIF (Table 3):

A list of known aliases for the family
A brief description of the family
Attribution of the malware family (if any)

More details about the family information in the MOTIF dataset are shown in Table 3.

Table 3: Information for each malware family
Column	Description
Aliases	List of known aliases for family
Description	Brief sentence describing capabilities of malware family
Attribution (If any)	Name of threat actor malware/campaign is attributed to

Finally, 有料盒子APP is releasing LightGBM and MalConv2 models that serve as baselines for malware family classification. All of this data is available on .

Can This Data Be Abused by Attackers?

All the malware in MOTIF has been disarmed using the same method as the SOREL dataset, by replacing the OPTIONAL_HEADER.Subsystem and FILE_HEADER.Machine fields in each executable with zero. 有料盒子APP provides the same guidance as Sophos about abuse of this data.

According to that guidance:

"It would take knowledge, skill, and time to reconstitute the samples and get them to actually run. That said, we recognize that there is at least some possibility that a skilled attacker could learn techniques from these samples or use samples from the dataset to assemble attack tools to use as part of their malicious activities. However, in reality, there are already many other sources attackers could leverage to gain access to malware information and samples that are easier, faster, and more cost-effective to use. In other words, this disarmed sample set will have much more value to researchers looking to improve and develop their independent defenses than it will have to attackers."

Enabling New Malware Family Classification Research

Results obtained using the MOTIF dataset have already challenged conventional wisdom firmly held by the community, such as the accuracy of techniques that use the collective decisions of a group of antivirus engines as a source of family labels. We envision the MOTIF dataset becoming a valuable asset for evaluating malware family classifiers and for enabling future malware research.

The MOTIF dataset was only made possible by the outstanding threat research published by many different cybersecurity organizations. Collaboration and sharing of open-source threat intelligence are fundamental to building a collective defense against cyber threats. We would especially like to thank , whose large corpus of malware information was invaluable to this research.

For further details about the MOTIF dataset, please refer to our .

This blog series is brought to you by 有料盒子APP DarkLabs. Our DarkLabs is an elite team of security researchers, penetration testers, reverse engineers, network analysts, and data scientists, dedicated to stopping cyber attacks before they occur.

This article is for informational purposes only; its content may be based on employees' independent research and does not represent the position or opinion of 有料盒子APP. Furthermore, 有料盒子APP disclaims all warranties in the article's content, does not recommend/endorse any third-party products referenced therein, and any reliance and use of the article is at the reader's sole discretion and risk.
