Cybersecurity researchers discover “Bad Likert Judge,” a new AI jailbreaking technique

Hogan Lovells

Hogan Lovells [co-author: Surya Swaroop]

The “Bad Likert Judge” jailbreaking technique achieves a high attack success rate with a three-step approach that uses the target LLM’s own understanding of harmful content to bypass its safety guardrails.

Researchers have identified a new AI jailbreaking technique, referred to as the “Bad Likert Judge.” AI jailbreaking techniques are strategies for circumventing the protections that AI tools have in place to prevent their use for problematic purposes, such as creating hate speech or malware. When tested against six advanced LLMs, this technique was shown to increase the attack success rate by an average of 75%.

The technique works in three steps:

Step 1: The Bad Likert Judge asks the target LLM to act as a judge and evaluate responses that “another” LLM generates. This is a trick: there is no other LLM, and the Bad Likert Judge is simply using the target LLM’s own guardrails as a judgment system.

Step 2: The target LLM is given certain guidelines on how to score the responses based on what is considered “harmful” content. For example, the target LLM may be given instructions on how to score responses based on their potential to promote violence.

Step 3: Rather than directly asking the target LLM to produce harmful content, the Bad Likert Judge will ask it to give examples of responses that would score high according to the guidelines provided.

By using the LLM’s own judgment capabilities, the Bad Likert Judge can convince it to create outputs that the LLM’s creator does not intend it to produce. The same researchers who discovered this technique also found that content filters reduced the attack’s success rate by an average of 89.2%.
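
To illustrate the content-filter mitigation mentioned above, the following is a minimal sketch of a filter that checks both the prompt and the model's reply before anything is returned to the user. Everything in it is an assumption made for illustration: the names `filtered_generate`, `keyword_harm_score`, `BLOCKED_TERMS`, and `REFUSAL` are hypothetical, the keyword scorer is a toy stand-in for a trained classifier, and none of it reflects the filters the researchers actually tested.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical placeholder phrases; a production filter would rely on a
# trained classifier rather than a fixed keyword list.
BLOCKED_TERMS = ("write malware", "build a weapon")
REFUSAL = "This request was blocked by the content filter."


@dataclass
class FilterResult:
    allowed: bool
    text: str


def keyword_harm_score(text: str) -> float:
    """Toy scorer: fraction of blocked phrases that appear in the text."""
    lowered = text.lower()
    hits = sum(1 for term in BLOCKED_TERMS if term in lowered)
    return hits / len(BLOCKED_TERMS)


def filtered_generate(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str], float] = keyword_harm_score,
    threshold: float = 0.5,
) -> FilterResult:
    """Call the model, but screen the prompt and the reply independently."""
    # Input filter: refuse before the model is ever called.
    if score(prompt) >= threshold:
        return FilterResult(False, REFUSAL)
    reply = generate(prompt)
    # Output filter: catches harmful replies to innocuous-looking prompts,
    # such as a request for "an example that would score high."
    if score(reply) >= threshold:
        return FilterResult(False, REFUSAL)
    return FilterResult(True, reply)
```

The design point is that the reply is screened separately from the prompt. A Step 3 request may contain nothing objectionable on its face, so a filter that inspects only the input could miss it.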


DISCLAIMER: Because of the generality of this update, the information provided herein may not be applicable in all situations and should not be acted upon without specific legal advice based on particular situations. Attorney Advertising.

© Hogan Lovells
