prompt-guard

A Streamlit app for testing Prompt Guard, a classifier model by Meta for detecting prompt attacks.

MIT License

Stars
0
Committers
2

prompt-guard

Prompt Guard is a classifier model by Meta, trained on a large corpus of attacks, capable of detecting both explicitly malicious prompts (jailbreaks) as well as data that contains injected inputs (prompt injections). Upon analysis, it returns one or more of the following verdicts, along with a confidence score for each:

  • INJECTION
  • JAILBREAK
  • BENIGN

This repository contains a Streamlit app for testing Prompt Guard. Note that you'll need an HuggingFace access token to access the model.

Here's a sample response by Prompt Guard upon detecting a prompt injection attempt.

Here's a sample response by Prompt Guard upon detecting a jailbreak attempt.

Related Projects