# System Methodology

## How It Works
ModEval combines independent model outputs, threshold adjustment, and policy scoring to make moderation tradeoffs explicit instead of hiding them.
## Section 1: System Architecture
ModEval runs five independent AI moderation models in parallel on every analysis request. Each model was trained on different datasets and detects a different safety dimension. No single model sees the outputs of another -- they evaluate content independently, which is intentional. The value of running multiple models simultaneously is that disagreements between them are signal, not noise. When two models reach different conclusions about the same content, that tells you something important about the content's ambiguity.
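The fan-out described above can be sketched in a few lines of Python. The model callables below are hypothetical stand-ins for the five independent inference calls, not ModEval's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the five moderation models. In the real system
# each entry would be an independent inference call; here they are fixed
# functions so the fan-out pattern itself is visible.
MODELS = {
    "toxic-bert": lambda text: {"toxicity": 0.82},
    "roberta-offensive": lambda text: {"offensive": 0.41},
    "roberta-hate-speech": lambda text: {"hate": 0.12},
    "bert-tiny-spam": lambda text: {"spam": 0.05},
    "distilroberta-bias": lambda text: {"biased": 0.33},
}

def analyze(text: str) -> dict:
    """Run every model on the same input in parallel.

    No model receives another model's output -- each sees only the raw
    text, so disagreements reflect genuine ambiguity in the content.
    """
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        futures = {name: pool.submit(fn, text) for name, fn in MODELS.items()}
        return {name: f.result() for name, f in futures.items()}
```

Because the models never exchange information, the per-model results can be compared side by side downstream.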
## Section 2: Output Normalization
Each model returns data in a different format. toxic-bert returns multi-label probabilities. roberta-offensive returns a binary offensive/not-offensive score. The hate speech model returns hate/not-hate. The spam model returns spam/ham. The bias model returns biased/neutral. Raw outputs are incompatible for direct comparison.
The Normalizer converts every model's output into a unified schema:
```
{
  model: string,
  top_category: string,
  severity: number (1-10),
  confidence: number (0-1),
  action: "Allow" | "Review" | "Remove",
  flagged: boolean
}
```
Severity is derived by scaling the highest raw score linearly from the 0-1 range to 1-10. Confidence is the highest raw score, preserved unchanged. Action is determined by comparing confidence against the context-adjusted threshold.
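A minimal sketch of this normalization step, assuming each model's raw output has already been mapped to a label-to-probability dict. The `normalize` name and the `flagged` rule are assumptions, since the source does not spell them out:

```python
def normalize(model: str, scores: dict[str, float],
              threshold_allow: float = 0.40,
              threshold_remove: float = 0.70) -> dict:
    """Convert a model's raw label->probability map into the unified schema.

    Default thresholds match the Allow/Review/Remove bands; in the real
    system the context-adjusted threshold would be passed in instead.
    """
    top_category, confidence = max(scores.items(), key=lambda kv: kv[1])
    severity = round(1 + confidence * 9)  # linear scale: 0-1 -> 1-10
    if confidence < threshold_allow:
        action = "Allow"
    elif confidence < threshold_remove:
        action = "Review"
    else:
        action = "Remove"
    return {
        "model": model,
        "top_category": top_category,
        "severity": severity,
        "confidence": confidence,  # raw highest score, preserved unchanged
        "action": action,
        "flagged": action != "Allow",  # assumption: flagged mirrors non-Allow actions
    }
```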
## Section 3: Context Engine
The models themselves are frozen -- thresholds adjust, not models. This mirrors how real Trust & Safety pipelines work. A high-confidence toxicity score on a gaming platform may result in Review. The same score on a professional platform results in Remove. The model's output is identical -- the interpretation changes based on context.
### Default Action Thresholds

| Confidence | Action |
|---|---|
| 0.00 - 0.39 | Allow |
| 0.40 - 0.69 | Review |
| 0.70 - 1.00 | Remove |
### Platform Modifiers

| Platform | Modifier | Rationale |
|---|---|---|
| Gaming | -0.10 | Higher tolerance for competitive language |
| Social Media | 0.00 | Baseline |
| Professional | +0.15 | Lower tolerance, reputational risk |
| Forum | -0.05 | Slightly higher tolerance for debate |
| VR/Metaverse | -0.15 | Evolving norms, higher tolerance |
### Content Type Modifiers

| Content Type | Modifier | Rationale |
|---|---|---|
| Original Post | 0.00 | Baseline |
| Comment/Reply | -0.05 | Reactive content, slight tolerance |
| Username | +0.20 | Permanent identity-linked content |
| Bio | +0.15 | Persistent public-facing content |
| UGC | -0.05 | Creative content, some leeway |
### Strictness Modifiers

| Strictness | Modifier |
|---|---|
| Strict | +0.15 |
| Balanced | 0.00 |
| Lenient | -0.15 |
All thresholds are clamped between 0.10 and 0.90 to prevent extreme values.
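Combining the three modifier tables with the clamp might look like this. The additive combination matches the tables above; the `adjusted_threshold` helper name is illustrative:

```python
# Modifier tables transcribed from the documentation above.
PLATFORM = {"Gaming": -0.10, "Social Media": 0.00, "Professional": 0.15,
            "Forum": -0.05, "VR/Metaverse": -0.15}
CONTENT_TYPE = {"Original Post": 0.00, "Comment/Reply": -0.05,
                "Username": 0.20, "Bio": 0.15, "UGC": -0.05}
STRICTNESS = {"Strict": 0.15, "Balanced": 0.00, "Lenient": -0.15}

def adjusted_threshold(base: float, platform: str,
                       content_type: str, strictness: str) -> float:
    """Shift a base action threshold by the three context modifiers,
    then clamp to [0.10, 0.90] to prevent extreme values."""
    t = base + PLATFORM[platform] + CONTENT_TYPE[content_type] + STRICTNESS[strictness]
    return min(0.90, max(0.10, round(t, 2)))
```

For example, the 0.40 Review threshold drops to the 0.10 floor for a lenient gaming-platform comment, and rises to the 0.90 ceiling for a username on a strict professional platform.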
## Section 4: Policy Alignment Scoring
After a model produces a context-adjusted action, ModEval scores how well that output aligns with the selected platform's known policy. Each platform has predefined zero-tolerance categories (always Remove regardless of score) and deprioritized categories (threshold raised by +0.20).
| Platform | Zero-Tolerance (always Remove) | Deprioritized (threshold +0.20) |
|---|---|---|
| | violence, self-harm, sexual/minors | profanity, insult |
| Discord | harassment/threatening, sexual/minors, hate | profanity |
| | hate, violence, sexual, self-harm | none |
| | sexual/minors, harassment, identity_attack | profanity, insult |
```
alignment_score = 1 - abs(model_confidence - policy_expected_threshold)
```

The score is displayed on a 0-1 scale with a binary Aligned / Misaligned label.
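The scoring step is a direct transcription of the formula. The 0.5 cutoff used here for the binary label is an assumption, since the source does not state where the Aligned/Misaligned line is drawn:

```python
def alignment_score(model_confidence: float,
                    policy_expected_threshold: float,
                    aligned_cutoff: float = 0.5) -> tuple[float, str]:
    """Score how closely a model's confidence tracks the platform policy.

    A score of 1.0 means the confidence sits exactly at the policy's
    expected threshold; the cutoff for the binary label is assumed.
    """
    score = 1 - abs(model_confidence - policy_expected_threshold)
    label = "Aligned" if score >= aligned_cutoff else "Misaligned"
    return round(score, 2), label
```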
## Section 5: Disagreement Detection
A disagreement is flagged when two or more models produce different final actions after context adjustment. Three types of disagreement are detected:
| Type | Trigger |
|---|---|
| Action Mismatch | Models recommend different final actions (e.g., Allow vs. Remove) |
| Category Mismatch | Models flag different top violation categories |
| Severity Gap | Severity scores differ by 3 or more points |
> Disagreements are not errors: they are the most analytically interesting output ModEval produces.
Content that triggers a disagreement is content that sits in an ambiguous zone, which is exactly the content that requires human review in a real T&S pipeline.
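The three checks above can be sketched over a list of normalized results (the helper name is illustrative):

```python
def detect_disagreements(results: list[dict]) -> list[str]:
    """Return the disagreement types present in a set of normalized,
    context-adjusted model results (dicts in the unified schema)."""
    flags = []
    if len({r["action"] for r in results}) > 1:
        flags.append("Action Mismatch")      # e.g. Allow vs Remove
    if len({r["top_category"] for r in results}) > 1:
        flags.append("Category Mismatch")    # different top violation categories
    severities = [r["severity"] for r in results]
    if max(severities) - min(severities) >= 3:
        flags.append("Severity Gap")         # 3+ point spread on the 1-10 scale
    return flags
```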
## Section 6: Why These Models
Each model was chosen to cover a distinct safety dimension with a different architecture and training dataset. Using models that all do the same thing would produce correlated outputs and defeat the purpose of comparison.
| Role | Model | Architecture | Training Data | Focus |
|---|---|---|---|---|
| Toxicity Classifier | unitary/toxic-bert | BERT | Jigsaw Toxic Comments dataset | General toxicity baseline |
| Offensive Language Detector | cardiffnlp/roberta-offensive | RoBERTa | Twitter data | Social media offensive language |
| Hate Speech Detector | facebook/roberta-hate-speech | RoBERTa | DynaBench R4 | Adversarially collected hate speech |
| Spam Detector | mrm8488/bert-tiny | BERT-tiny | SMS Spam Collection | Manipulation and unsolicited content |
| Bias Detector | valurank/distilroberta-bias | DistilRoBERTa | Wikipedia revisions | Non-neutral language detection |
All models are accessed via the HuggingFace Inference API using a single read-access token. No model weights are downloaded or run locally.
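A request to the Inference API takes roughly the shape below; `build_request` is an illustrative helper rather than ModEval's actual code, and the network call itself is left to the caller:

```python
import json
import urllib.request

API_BASE = "https://api-inference.huggingface.co/models"

def build_request(model_id: str, text: str, token: str) -> urllib.request.Request:
    """Assemble a POST to the HuggingFace Inference API: a JSON body of
    {"inputs": ...} plus a Bearer token in the Authorization header."""
    return urllib.request.Request(
        url=f"{API_BASE}/{model_id}",
        data=json.dumps({"inputs": text}).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# The actual call (not executed here) would be along the lines of:
#   with urllib.request.urlopen(build_request("unitary/toxic-bert", text, token)) as resp:
#       scores = json.load(resp)
```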
## Section 7: Known Limitations
ModEval is a transparent tool, and transparency includes acknowledging what it cannot do.
### Models Are Frozen
The five models cannot be retrained, and novel slang may score incorrectly. The Context Engine adjusts thresholds rather than changing model knowledge.
### Platform Policies Are Approximations
Guidelines are simplified from official pages. Real enforcement also depends on account history, human judgment, and community rules.
### English Only
All five models were trained primarily on English data. Non-English content produces unreliable scores; multilingual support is planned.
### Text Only
Images, video, audio, and multimodal content are outside the scope of the current system.
### Free Tier Rate Limits
HuggingFace free inference may rate-limit under high traffic. The system surfaces these errors transparently rather than silently failing.
# Inference Pipeline

## Models
Five independent AI models, each trained on different datasets to detect a distinct safety dimension.
### Toxicity Classifier
BERT · Unitary AI
Detects: General toxicity, insults, threats, obscene language and severe toxic content
Trained on Jigsaw Toxic Comment Classification dataset (Wikipedia comments)
**Strengths**
- Multi-label classification covering 6 toxicity dimensions simultaneously
- Best general-purpose toxicity baseline
- Widely used in production T&S pipelines
**Limitations**
- Trained on Wikipedia comments, which skews toward formal English
- May underperform on informal social media slang
### Offensive Language Detector
RoBERTa · Cardiff NLP
Detects: Offensive and harassing language in social media contexts
Trained on Twitter data (SemEval-2019 OffensEval dataset)
**Strengths**
- Specifically trained on real social media content
- Better at detecting informal offensive language and slang than general toxicity models
**Limitations**
- Twitter-specific training may miss platform-specific offensive patterns from other networks
### Hate Speech Detector
RoBERTa · Facebook AI Research
Detects: Hate speech targeting protected groups and identity-based attacks
Trained on DynaBench R4, an adversarially collected dataset designed to be harder to game than previous benchmarks
**Strengths**
- Trained on adversarially collected data, making it more robust against evasion attempts
- Specifically targets identity-based hate rather than general toxicity
**Limitations**
- Binary hate/not-hate output provides less granularity than multi-label models
### Spam Detector
BERT-tiny · Manuel Romero (mrm8488)
Detects: Spam, scam messages, and unsolicited manipulative content
Trained on SMS Spam Collection dataset
**Strengths**
- Extremely lightweight (4.4M parameters), making it the fastest model in the pipeline
- 98% validation accuracy on SMS spam detection
**Limitations**
- Trained on SMS data, so it may miss sophisticated social media spam patterns
- Not designed for long-form content
### Bias Detector
DistilRoBERTa · Valurank
Detects: Biased and non-neutral language in text
Trained on Wikipedia revision history (the WNC corpus): edits where neutral editors removed biased language
**Strengths**
- Unique training methodology using real editorial decisions
- Detects subtle linguistic bias rather than just overt violations
- Highly relevant for misinformation and propaganda detection
**Limitations**
- Trained on Wikipedia-style formal writing
- May flag strongly opinionated but legitimate content as biased