# System Methodology

## How It Works
ModEval combines independent model outputs, threshold adjustment, and policy scoring to make moderation tradeoffs explicit instead of hiding them.
## Section 1: System Architecture
ModEval runs five independent AI moderation models in parallel on every analysis request. Each model was trained on different datasets and detects a different safety dimension. No single model sees the outputs of another -- they evaluate content independently, which is intentional. The value of running multiple models simultaneously is that disagreements between them are signal, not noise. When two models reach different conclusions about the same content, that tells you something important about the content's ambiguity.
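The fan-out described above can be sketched in a few lines of Python. The model callables below are hypothetical stand-ins for the five independent inference calls, not ModEval's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the five moderation models. In the real system
# each entry would be an independent inference call; here they are fixed
# functions so the fan-out pattern itself is visible.
MODELS = {
    "toxic-bert": lambda text: {"toxicity": 0.82},
    "roberta-offensive": lambda text: {"offensive": 0.41},
    "roberta-hate-speech": lambda text: {"hate": 0.12},
    "bert-tiny-spam": lambda text: {"spam": 0.05},
    "distilroberta-bias": lambda text: {"biased": 0.33},
}

def analyze(text: str) -> dict:
    """Run every model on the same input in parallel.

    No model receives another model's output -- each sees only the raw
    text, so disagreements reflect genuine ambiguity in the content.
    """
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        futures = {name: pool.submit(fn, text) for name, fn in MODELS.items()}
        return {name: f.result() for name, f in futures.items()}
```

Because the models never exchange information, the per-model results can be compared side by side downstream.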
## Section 2: Output Normalization
Each model returns data in a different format. toxic-bert returns multi-label probabilities. roberta-offensive returns a binary offensive/not-offensive score. The hate speech model returns hate/not-hate. The spam model returns spam/ham. The bias model returns biased/neutral. Raw outputs are incompatible for direct comparison.
The Normalizer converts every model's output into a unified schema:
```
{
  model: string,
  top_category: string,
  severity: number (1-10),
  confidence: number (0-1),
  action: "Allow" | "Review" | "Remove",
  flagged: boolean
}
```
Severity is derived by scaling the highest raw score linearly from the 0-1 range to 1-10. Confidence is the highest raw score, preserved unchanged. Action is determined by comparing confidence against the context-adjusted threshold.
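A minimal sketch of this normalization step, assuming each model's raw output has already been mapped to a label-to-probability dict. The `normalize` name and the `flagged` rule are assumptions, since the source does not spell them out:

```python
def normalize(model: str, scores: dict[str, float],
              threshold_allow: float = 0.40,
              threshold_remove: float = 0.70) -> dict:
    """Convert a model's raw label->probability map into the unified schema.

    Default thresholds match the Allow/Review/Remove bands; in the real
    system the context-adjusted threshold would be passed in instead.
    """
    top_category, confidence = max(scores.items(), key=lambda kv: kv[1])
    severity = round(1 + confidence * 9)  # linear scale: 0-1 -> 1-10
    if confidence < threshold_allow:
        action = "Allow"
    elif confidence < threshold_remove:
        action = "Review"
    else:
        action = "Remove"
    return {
        "model": model,
        "top_category": top_category,
        "severity": severity,
        "confidence": confidence,  # raw highest score, preserved unchanged
        "action": action,
        "flagged": action != "Allow",  # assumption: flagged mirrors non-Allow actions
    }
```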
## Section 3: Context Engine
The models themselves are frozen -- thresholds adjust, not models. This mirrors how real Trust & Safety pipelines work. A high-confidence toxicity score on a gaming platform may result in Review. The same score on a professional platform results in Remove. The model's output is identical -- the interpretation changes based on context.
### Default Action Thresholds

| Confidence | Action |
|---|---|
| 0.00 - 0.39 | Allow |
| 0.40 - 0.69 | Review |
| 0.70 - 1.00 | Remove |
### Platform Modifiers

| Platform | Modifier | Rationale |
|---|---|---|
| Gaming | -0.10 | Higher tolerance for competitive language |
| Social Media | 0.00 | Baseline |
| Professional | +0.15 | Lower tolerance, reputational risk |
| Forum | -0.05 | Slightly higher tolerance for debate |
| VR/Metaverse | -0.15 | Evolving norms, higher tolerance |
### Content Type Modifiers

| Content Type | Modifier | Rationale |
|---|---|---|
| Original Post | 0.00 | Baseline |
| Comment/Reply | -0.05 | Reactive content, slight tolerance |
| Username | +0.20 | Permanent identity-linked content |
| Bio | +0.15 | Persistent public-facing content |
| UGC | -0.05 | Creative content, some leeway |
### Strictness Modifiers

| Strictness | Modifier |
|---|---|
| Strict | +0.15 |
| Balanced | 0.00 |
| Lenient | -0.15 |
All thresholds are clamped between 0.10 and 0.90 to prevent extreme values.
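Combining the three modifier tables with the clamp might look like this. The additive combination matches the tables above; the `adjusted_threshold` helper name is illustrative:

```python
# Modifier tables transcribed from the documentation above.
PLATFORM = {"Gaming": -0.10, "Social Media": 0.00, "Professional": 0.15,
            "Forum": -0.05, "VR/Metaverse": -0.15}
CONTENT_TYPE = {"Original Post": 0.00, "Comment/Reply": -0.05,
                "Username": 0.20, "Bio": 0.15, "UGC": -0.05}
STRICTNESS = {"Strict": 0.15, "Balanced": 0.00, "Lenient": -0.15}

def adjusted_threshold(base: float, platform: str,
                       content_type: str, strictness: str) -> float:
    """Shift a base action threshold by the three context modifiers,
    then clamp to [0.10, 0.90] to prevent extreme values."""
    t = base + PLATFORM[platform] + CONTENT_TYPE[content_type] + STRICTNESS[strictness]
    return min(0.90, max(0.10, round(t, 2)))
```

For example, the 0.40 Review threshold drops to the 0.10 floor for a lenient gaming-platform comment, and rises to the 0.90 ceiling for a username on a strict professional platform.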
## Section 4: Policy Alignment Scoring
After a model produces a context-adjusted action, ModEval scores how well that output aligns with the selected platform's known policy. Each platform has predefined zero-tolerance categories (always Remove regardless of score) and deprioritized categories (threshold raised by +0.20).
| Platform | Zero-Tolerance (always Remove) | Deprioritized (threshold +0.20) |
|---|---|---|
| | violence, self-harm, sexual/minors | profanity, insult |
| Discord | harassment/threatening, sexual/minors, hate | profanity |
| | hate, violence, sexual, self-harm | none |
| | sexual/minors, harassment, identity_attack | profanity, insult |
```
alignment_score = 1 - abs(model_confidence - policy_expected_threshold)
```

The score is displayed on a 0-1 scale with a binary Aligned / Misaligned label.
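The scoring step is a direct transcription of the formula. The 0.5 cutoff used here for the binary label is an assumption, since the source does not state where the Aligned/Misaligned line is drawn:

```python
def alignment_score(model_confidence: float,
                    policy_expected_threshold: float,
                    aligned_cutoff: float = 0.5) -> tuple[float, str]:
    """Score how closely a model's confidence tracks the platform policy.

    A score of 1.0 means the confidence sits exactly at the policy's
    expected threshold; the cutoff for the binary label is assumed.
    """
    score = 1 - abs(model_confidence - policy_expected_threshold)
    label = "Aligned" if score >= aligned_cutoff else "Misaligned"
    return round(score, 2), label
```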
## Section 5: Disagreement Detection
A disagreement is flagged when two or more models produce different final actions after context adjustment. Three types of disagreement are detected:
| Type | Trigger |
|---|---|
| Action Mismatch | Models recommend different final actions (e.g., Allow vs. Remove) |
| Category Mismatch | Models flag different top violation categories |
| Severity Gap | Severity scores differ by 3 or more points |
> Disagreements are not errors: they are the most analytically interesting output ModEval produces.
Content that triggers a disagreement is content that sits in an ambiguous zone, which is exactly the content that requires human review in a real T&S pipeline.
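The three checks above can be sketched over a list of normalized results (the helper name is illustrative):

```python
def detect_disagreements(results: list[dict]) -> list[str]:
    """Return the disagreement types present in a set of normalized,
    context-adjusted model results (dicts in the unified schema)."""
    flags = []
    if len({r["action"] for r in results}) > 1:
        flags.append("Action Mismatch")      # e.g. Allow vs Remove
    if len({r["top_category"] for r in results}) > 1:
        flags.append("Category Mismatch")    # different top violation categories
    severities = [r["severity"] for r in results]
    if max(severities) - min(severities) >= 3:
        flags.append("Severity Gap")         # 3+ point spread on the 1-10 scale
    return flags
```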
## Section 6: Why These Models
Each model was chosen to cover a distinct safety dimension with a different architecture and training dataset. Using models that all do the same thing would produce correlated outputs and defeat the purpose of comparison.
| Role | Model | Architecture | Training Data | Focus |
|---|---|---|---|---|
| Toxicity Classifier | unitary/toxic-bert | BERT | Jigsaw Toxic Comments dataset | General toxicity baseline |
| Offensive Language Detector | cardiffnlp/roberta-offensive | RoBERTa | Twitter data | Social media offensive language |
| Hate Speech Detector | facebook/roberta-hate-speech | RoBERTa | DynaBench R4 | Adversarially collected hate speech |
| Spam Detector | mrm8488/bert-tiny | BERT-tiny | SMS Spam Collection | Manipulation and unsolicited content |
| Bias Detector | valurank/distilroberta-bias | DistilRoBERTa | Wikipedia revisions | Non-neutral language detection |
All models are accessed via the HuggingFace Inference API using a single read-access token. No model weights are downloaded or run locally.
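A request to the Inference API takes roughly the shape below; `build_request` is an illustrative helper rather than ModEval's actual code, and the network call itself is left to the caller:

```python
import json
import urllib.request

API_BASE = "https://api-inference.huggingface.co/models"

def build_request(model_id: str, text: str, token: str) -> urllib.request.Request:
    """Assemble a POST to the HuggingFace Inference API: a JSON body of
    {"inputs": ...} plus a Bearer token in the Authorization header."""
    return urllib.request.Request(
        url=f"{API_BASE}/{model_id}",
        data=json.dumps({"inputs": text}).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# The actual call (not executed here) would be along the lines of:
#   with urllib.request.urlopen(build_request("unitary/toxic-bert", text, token)) as resp:
#       scores = json.load(resp)
```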
## Section 7: Known Limitations
ModEval is a transparent tool, and transparency includes acknowledging what it cannot do.
### Models Are Frozen
The five models cannot be retrained, and novel slang may score incorrectly. The Context Engine adjusts thresholds rather than changing model knowledge.
### Platform Policies Are Approximations
Guidelines are simplified from official pages. Real enforcement also depends on account history, human judgment, and community rules.
### English Only
All five models were trained primarily on English data. Non-English content produces unreliable scores; multilingual support is planned.
### Text Only
Images, video, audio, and multimodal content are outside the scope of the current system.
### Free Tier Rate Limits
HuggingFace free inference may rate-limit under high traffic. The system surfaces these errors transparently rather than silently failing.
# Inference Pipeline

## Models
Five independent AI models, each trained on different datasets to detect a distinct safety dimension.
### Toxicity Classifier
BERT · Unitary AI
Detects: General toxicity, insults, threats, obscene language and severe toxic content
Trained on Jigsaw Toxic Comment Classification dataset (Wikipedia comments)
**Strengths**
- Multi-label classification covering 6 toxicity dimensions simultaneously
- Best general-purpose toxicity baseline
- Widely used in production T&S pipelines
**Limitations**
- Trained on Wikipedia comments, which skews toward formal English
- May underperform on informal social media slang
### Offensive Language Detector
RoBERTa · Cardiff NLP
Detects: Offensive and harassing language in social media contexts
Trained on Twitter data (SemEval-2019 OffensEval dataset)
**Strengths**
- Specifically trained on real social media content
- Better at detecting informal offensive language and slang than general toxicity models
**Limitations**
- Twitter-specific training may miss platform-specific offensive patterns from other networks
### Hate Speech Detector
RoBERTa · Facebook AI Research
Detects: Hate speech targeting protected groups and identity-based attacks
Trained on DynaBench R4, an adversarially collected dataset designed to be harder to game than previous benchmarks
**Strengths**
- Trained on adversarially collected data, making it more robust against evasion attempts
- Specifically targets identity-based hate rather than general toxicity
**Limitations**
- Binary hate/not-hate output provides less granularity than multi-label models
### Spam Detector
BERT-tiny · Manuel Romero (mrm8488)
Detects: Spam, scam messages, and unsolicited manipulative content
Trained on SMS Spam Collection dataset
**Strengths**
- Extremely lightweight (4.4M parameters), making it the fastest model in the pipeline
- 98% validation accuracy on SMS spam detection
**Limitations**
- Trained on SMS data, so it may miss sophisticated social media spam patterns
- Not designed for long-form content
### Bias Detector
DistilRoBERTa · Valurank
Detects: Biased and non-neutral language in text
Trained on Wikipedia revision history (the WNC corpus): edits where neutral editors removed biased language
**Strengths**
- Unique training methodology using real editorial decisions
- Detects subtle linguistic bias rather than just overt violations
- Highly relevant for misinformation and propaganda detection
**Limitations**
- Trained on Wikipedia-style formal writing
- May flag strongly opinionated but legitimate content as biased