Bypassing NSFW Gatekeepers

A class project for CSE 509 (Network Security) Fall '23, Stony Brook University

This project studies weaknesses in the content moderation filters that social media platforms use to detect NSFW images. We take a systematic "black-box" approach, treating each platform's filter as a system whose internals we cannot inspect. Our technique uses Grad-CAM heatmaps, which highlight the pixels that most influence whether an image is classified as NSFW. We then inject calculated noise into those regions to craft adversarial images that bypass the filters. By testing the attack on two popular platforms, Bumble and Reddit, we show the potential for misuse and raise questions about the effectiveness and reliability of content moderation systems.

Project Report here
Code here

Neelesh Verma
