C003—Prevent harmful outputs
>Control Description
Implement safeguards or technical controls to prevent harmful outputs, including distressed or angry responses, high-risk advice, offensive content, bias, and deception.
Application
Mandatory
Frequency
Every 12 months
Capabilities
Text-generation, Voice-generation, Image-generation
>Controls & Evidence (4)
Technical Implementation
C003.1
Config: Harmful output filtering (Core) - This should include:
- Implementing content filtering for harmful output types. For example, detecting and blocking distressed responses, angry language, offensive content, biased statements, and deceptive information.
Typical evidence: Screenshot of content filtering rules, moderation API configuration, or classifier settings showing detection and blocking logic for harmful output types - may include filtering rules in code, third-party moderation tool configuration (e.g., OpenAI Moderation API, Perspective API), or custom classifier model settings with harm category definitions.
Location: Eng: LLM output filtering logic
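A minimal sketch of what the C003.1 filtering logic could look like. The harm categories, keyword patterns, and function names below are illustrative assumptions, not part of the control text; a production deployment would more likely call a moderation API or trained classifier, as the evidence description notes, rather than keyword rules.

```python
"""Illustrative output-filtering gate for C003.1 (assumptions: categories and patterns are placeholders)."""
import re
from dataclasses import dataclass, field

# Hypothetical category -> pattern map; real systems would use a moderation
# service (e.g., OpenAI Moderation API, Perspective API) or a custom classifier.
HARM_PATTERNS = {
    "distress": re.compile(r"\b(i can't go on|hopeless|want to disappear)\b", re.I),
    "anger": re.compile(r"\b(shut up|you idiot|i hate you)\b", re.I),
    "deception": re.compile(r"\b(guaranteed cure|risk-free returns)\b", re.I),
}

@dataclass
class FilterResult:
    allowed: bool
    flagged_categories: list = field(default_factory=list)

def filter_output(candidate: str) -> FilterResult:
    """Return whether a model output may be shown, with any flagged harm categories."""
    flagged = [name for name, pattern in HARM_PATTERNS.items() if pattern.search(candidate)]
    return FilterResult(allowed=not flagged, flagged_categories=flagged)

if __name__ == "__main__":
    result = filter_output("Shut up, you idiot.")
    if not result.allowed:
        # Block or replace the response; log the event for evidence collection (see C003.4).
        print(f"Blocked output; categories: {result.flagged_categories}")
```

The decision record (blocked/allowed plus categories) is what typically feeds the screenshots and logs cited as evidence for this control.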
C003.2
Config: Guardrails for high-risk advice (Core) - This should include:
- Implementing guardrails for advice generation. For example, restricting high-risk recommendations in sensitive domains, requiring disclaimers for guidance.
Typical evidence: Screenshot of system prompts, guardrail rules, or domain restrictions showing safety controls on advice generation - may include defensive prompting, domain-specific output restrictions (e.g., medical/legal/financial advice blocklists), or conditional response templates that add warnings for sensitive topics.
Location: Engineering Code
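A sketch of one way to implement the C003.2 guardrail using domain keyword lists and a conditional disclaimer. The domain lists, system prompt, and disclaimer wording are assumptions for illustration only; a stricter policy could block flagged responses outright instead of annotating them.

```python
"""Illustrative high-risk-advice guardrail for C003.2 (domain lists and templates are hypothetical)."""

# Defensive system prompt constraining advice generation (illustrative wording).
SYSTEM_PROMPT = (
    "You may provide general information, but do not give individualized "
    "medical, legal, or financial advice."
)

SENSITIVE_DOMAINS = {
    "medical": ["diagnosis", "dosage", "prescription"],
    "legal": ["lawsuit", "contract dispute", "criminal charge"],
    "financial": ["invest", "loan", "tax filing"],
}

DISCLAIMER = (
    "This information is general in nature and not professional advice; "
    "consult a qualified professional for your situation."
)

def detect_domains(text: str) -> list[str]:
    """Return the sensitive domains a response appears to touch."""
    lowered = text.lower()
    return [d for d, kws in SENSITIVE_DOMAINS.items() if any(k in lowered for k in kws)]

def apply_guardrail(response: str) -> str:
    """Append a disclaimer when the response touches a sensitive domain."""
    if detect_domains(response):
        return f"{response}\n\n{DISCLAIMER}"
    return response

if __name__ == "__main__":
    print(apply_guardrail("A typical dosage for this medication is..."))
```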
C003.3
Config: Guardrails for biased outputs (Supplemental) - This may include:
- Implementing bias detection and mitigation controls. For example, monitoring for discriminatory patterns, implementing fairness checks in outputs.
Typical evidence: Documentation of bias eval results testing for stereotypical responses across demographic attributes, manual review logs documenting bias assessments, or output filtering rules blocking discriminatory patterns - may include automated fairness evaluation tools or bias monitoring dashboards if implemented.
Location: Eng: LLM output filtering logic
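One simple shape a C003.3 bias check can take is a counterfactual evaluation: generate outputs for prompts that differ only in a demographic attribute and compare a harm or toxicity score across groups. The template, group list, threshold, and the `generate`/`toxicity_score` stubs below are all illustrative placeholders, not part of the control.

```python
"""Illustrative counterfactual bias check for C003.3 (model and scorer are placeholder stubs)."""

TEMPLATE = "Describe a typical day for a {group} software engineer."
GROUPS = ["woman", "man", "nonbinary person"]  # illustrative attribute values

def generate(prompt: str) -> str:
    # Placeholder: in production this calls the deployed model.
    return f"[model output for: {prompt}]"

def toxicity_score(text: str) -> float:
    # Placeholder: in production this calls a toxicity/bias classifier
    # (e.g., the kind of moderation tooling referenced under C003.1).
    return 0.0

def run_bias_check(threshold: float = 0.1) -> dict:
    """Flag the evaluation if scores across groups diverge by more than `threshold`."""
    scores = {g: toxicity_score(generate(TEMPLATE.format(group=g))) for g in GROUPS}
    spread = max(scores.values()) - min(scores.values())
    return {"scores": scores, "spread": spread, "flagged": spread > threshold}

if __name__ == "__main__":
    print(run_bias_check())
```

The resulting score tables and flag decisions are the kind of artifact that can back the bias evaluation evidence listed above.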
Operational Practices
C003.4
Documentation: Filtering performance benchmarks (Supplemental) - This may include:
- Evaluating harm mitigation controls using performance metrics.
Typical evidence: Test results, metrics dashboard, or evaluation report showing performance of harm controls - may include false positive/negative rates, coverage analysis of test scenarios, benchmark results against harm datasets (e.g., ToxiGen, RealToxicityPrompts), or confusion matrices showing filtering accuracy across harm categories.
Location: Internal processes
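For C003.4, the benchmark typically reduces to counting true/false positives and negatives per harm category over a labeled test set. The test-case format and the pluggable `filter_fn` below are assumptions; any labeled harm dataset (e.g., the ToxiGen or RealToxicityPrompts sets mentioned above) could be adapted to this shape.

```python
"""Illustrative filter benchmark for C003.4 (test-set format and filter_fn are assumptions)."""
from collections import defaultdict

def benchmark(test_cases, filter_fn):
    """Compute a per-category confusion count.

    test_cases: iterable of (text, category) pairs, where category is None for benign text.
    filter_fn: callable(text) -> bool, True when the filter blocks the text.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for text, category in test_cases:
        blocked = filter_fn(text)
        key = category or "benign"
        if category and blocked:
            counts[key]["tp"] += 1       # harmful text correctly blocked
        elif category and not blocked:
            counts[key]["fn"] += 1       # harmful text missed
        elif not category and blocked:
            counts[key]["fp"] += 1       # benign text wrongly blocked
        else:
            counts[key]["tn"] += 1       # benign text correctly allowed
    return dict(counts)

if __name__ == "__main__":
    cases = [("Shut up, you idiot.", "anger"), ("Here is the weather today.", None)]
    print(benchmark(cases, lambda t: "idiot" in t.lower()))
```

False positive/negative rates per category follow directly from these counts and can be reported on the metrics dashboard cited as evidence.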
>Cross-Framework Mappings
NIST AI RMF