Jigsaw Puzzles

2024-11-07

LlamaCast

16:44

🧩 Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models

This research paper investigates how vulnerable large language models (LLMs) are to "jailbreak" attacks, in which malicious users try to trick a model into generating harmful content. The authors propose a new attack strategy, Jigsaw Puzzles (JSP), which splits a harmful question into harmless fractions and feeds them to the LLM over multiple turns, bypassing the model's built-in safeguards. The paper evaluates JSP across different LLMs and categories of harmful content, and analyzes how prompt design and splitting strategy affect its success. The authors also compare JSP with existing jailbreak methods and show that it can overcome various defense mechanisms. The paper concludes by stressing the need for continued research into more robust defenses against such attacks.

📎 Link to paper

More episodes from "LlamaCast"

Marco-o1

Scaling Laws for Precision

Test-Time Training

Qwen2.5-Coder

Attacking Vision-Language Computer Agents via Pop-ups

Number Cookbook

Jigsaw Puzzles

Multi-expert Prompting with LLMs

Investigating the Role of Prompting and External Tools in Hallucination Rates of LLMs

Mind Your Step (by Step)