McCatch: Scalable Microcluster Detection in Dimensional and Nondimensional Datasets
CoRR (2024)
Abstract
How can we build an outlier detector that works even with nondimensional
data and ranks singleton microclusters ('one-off' outliers) together with
nonsingleton microclusters by their anomaly scores? How can we obtain scores
that are principled in a scalable and 'hands-off' manner? Microclusters of
outliers indicate coalition or repetition in activities such as fraud; their
identification is thus highly desirable. This paper presents McCatch: a new
algorithm that detects microclusters by leveraging our proposed 'Oracle' plot
(1NN Distance versus Group 1NN Distance). We study 31 real and synthetic
datasets with up to 1M data elements to show that McCatch is the only method
that answers both of the questions above; it also outperforms 11 other methods,
especially when the data has nonsingleton microclusters or is nondimensional.
We also showcase McCatch's ability to detect meaningful microclusters in
graphs, fingerprints, logs of network connections, text data, and satellite
imagery. For example, it found a 30-element microcluster of confirmed 'Denial
of Service' attacks in the network logs, taking only 3 minutes for 222K data
elements on a stock desktop.