B005.1
Config: Input filtering (Core) - This should include:
- Integrating automated moderation tools to filter inputs before they reach the foundation model. For example, integrating third-party moderation APIs, implementing custom filtering rules, configuring blocking or warning actions for flagged content, and establishing confidence thresholds based on risk category and severity.
Typical evidence: Screenshot of moderation tool integration showing API configuration, filtering rules, action settings (block/warn/modify), and confidence thresholds for different violation categories - this could be screenshots of configuration files, admin dashboard settings, or API integration code.
Example moderation tools: OpenAI Moderation API, Claude content filtering, VirtueAI/Hive/Spectrum Labs
Location: Eng: User LLM input filtering logic, Engineering Tooling
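As an illustration of per-category confidence thresholds and block/warn actions, the following is a minimal sketch rather than a reference implementation. It assumes a moderation tool has already returned a score between 0 and 1 for each category; the category names, threshold values, and the `decide` helper are hypothetical and would need to be set from your own risk assessment.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"


@dataclass(frozen=True)
class CategoryPolicy:
    warn_at: float   # score at or above which the input triggers a warning
    block_at: float  # score at or above which the input is blocked


# Hypothetical categories and thresholds (assumptions for illustration only).
# Higher-severity categories get lower, stricter thresholds.
POLICIES = {
    "self_harm": CategoryPolicy(warn_at=0.20, block_at=0.50),
    "violence": CategoryPolicy(warn_at=0.40, block_at=0.80),
    "harassment": CategoryPolicy(warn_at=0.50, block_at=0.90),
}

# Ordering used to pick the most severe action across categories.
_SEVERITY = {Action.ALLOW: 0, Action.WARN: 1, Action.BLOCK: 2}


def decide(scores: dict[str, float]) -> Action:
    """Return the most severe action triggered by any category score."""
    decision = Action.ALLOW
    for category, score in scores.items():
        policy = POLICIES.get(category)
        if policy is None:
            continue  # categories without a policy are ignored in this sketch
        if score >= policy.block_at:
            action = Action.BLOCK
        elif score >= policy.warn_at:
            action = Action.WARN
        else:
            action = Action.ALLOW
        if _SEVERITY[action] > _SEVERITY[decision]:
            decision = action
    return decision


if __name__ == "__main__":
    # Scores would normally come from the moderation tool's response.
    print(decide({"violence": 0.55, "harassment": 0.10}))
```

Keeping thresholds in a single table like `POLICIES` makes the configuration easy to capture as evidence and to adjust later. This sketch skips categories it does not recognize; a stricter deployment might instead treat unknown categories as flagged.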
B005.2
Documentation: Input moderation approach (Supplemental) - This may include:
- Documenting the moderation logic and rationale. For example, explaining chosen moderation tools, threshold justifications, and decision criteria for different risk categories.
Typical evidence: Document explaining moderation approach including tool selection rationale, threshold settings with justifications, action logic for different violation types, and examples of how different input categories are handled.
Location: Internal processes, Engineering Practice
B005.3
Demonstration: Warning for blocked inputs (Supplemental) - This may include:
- Providing feedback to users when inputs are blocked.
Typical evidence: Screenshot of user-facing messages or UI flows showing how blocked inputs are communicated to users - this could be error messages, warning dialogs, or alternative suggestions provided when content is filtered.
Location: Product
B005.4
Logs: Input filtering (Supplemental) - This may include:
- Logging flagged prompts for analysis and refinement of filters, while ensuring compliance with privacy obligations.
Typical evidence: Screenshot of logging system showing how flagged inputs are captured, what metadata is included/excluded for privacy, retention policies, and audit trail - may include privacy documentation explaining logging disclosures to users.
Location: Logs
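One possible way to log flagged inputs while limiting privacy exposure is sketched below. It assumes a policy of storing a keyed hash of the prompt instead of the raw text; the field names, the `build_flag_record` helper, and the key handling are hypothetical. Teams that need raw text for filter refinement may instead rely on access controls and retention limits.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone


def build_flag_record(prompt: str, category: str, score: float,
                      action: str, hash_key: bytes) -> dict:
    """Build a log entry for a flagged input without storing the raw prompt."""
    # A keyed hash lets repeated prompts be correlated without revealing content.
    digest = hmac.new(hash_key, prompt.encode("utf-8"), hashlib.sha256).hexdigest()
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt_hmac_sha256": digest,
        "prompt_length": len(prompt),
        "category": category,
        "score": round(score, 4),
        "action": action,
    }


if __name__ == "__main__":
    # In practice the key would come from a secrets manager, not source code.
    record = build_flag_record(
        prompt="example flagged input",
        category="harassment",
        score=0.93,
        action="block",
        hash_key=b"replace-with-managed-secret",
    )
    print(json.dumps(record))
```

The record deliberately leaves out user identifiers and prompt text. Which metadata to include or exclude, and how long to retain it, should follow your documented privacy obligations.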
B005.5
Documentation: Input filter performance (Supplemental) - This may include:
- Periodically evaluating filter performance and adjusting thresholds accordingly. For example, tracking accuracy, latency, and false positive/negative rates.
Typical evidence: Report or dashboard showing analysis of filter performance metrics (false positives, false negatives, accuracy, latency) and documented threshold adjustments made based on performance data - should include timestamps and rationale for changes.
Location: Engineering Practice
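The classification metrics named above can be computed from a human-labeled review sample, as in this minimal sketch. It assumes each sample is a pair of (flagged by the filter, actually violating); the `evaluate` function and `FilterMetrics` type are hypothetical names. Latency is not covered here and would be measured separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterMetrics:
    accuracy: float
    precision: float
    recall: float
    false_positive_rate: float
    false_negative_rate: float


def _ratio(numerator: int, denominator: int) -> float:
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(samples: list[tuple[bool, bool]]) -> FilterMetrics:
    """Compute metrics from (flagged_by_filter, actually_violating) pairs."""
    tp = sum(1 for flagged, violating in samples if flagged and violating)
    fp = sum(1 for flagged, violating in samples if flagged and not violating)
    fn = sum(1 for flagged, violating in samples if not flagged and violating)
    tn = sum(1 for flagged, violating in samples if not flagged and not violating)
    return FilterMetrics(
        accuracy=_ratio(tp + tn, len(samples)),
        precision=_ratio(tp, tp + fp),
        recall=_ratio(tp, tp + fn),
        false_positive_rate=_ratio(fp, fp + tn),
        false_negative_rate=_ratio(fn, fn + tp),
    )


if __name__ == "__main__":
    # Hypothetical review sample: 8 true positives, 2 misses,
    # 1 false alarm, 9 correctly allowed inputs.
    review = ([(True, True)] * 8 + [(False, True)] * 2
              + [(True, False)] * 1 + [(False, False)] * 9)
    print(evaluate(review))
```

Running this on each review cycle and recording the output alongside any threshold change gives the timestamped rationale that the typical evidence calls for.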