@mastra/core
Version:
Mastra is a framework for building AI-powered applications and agents with a modern TypeScript stack.
123 lines (86 loc) • 3.59 kB
Markdown
# Toxicity scorer
The `createToxicityScorer()` function evaluates whether an LLM's output contains racist, biased, or toxic elements. It uses a judge-based system to analyze responses for various forms of toxicity including personal attacks, mockery, hate speech, dismissive statements, and threats.
## Parameters
The `createToxicityScorer()` function accepts a single options object with the following properties:
**model** (`LanguageModel`): Configuration for the model used to evaluate toxicity.
**scale** (`number`): Maximum score value (default is 1). (Default: `1`)
This function returns an instance of the MastraScorer class. The `.run()` method accepts the same input as other scorers (see the [MastraScorer reference](https://mastra.ai/reference/evals/mastra-scorer)), but the return value includes LLM-specific fields as documented below.
## `.run()` returns
**runId** (`string`): The id of the run (optional).
**analyzeStepResult** (`object`): Object with verdicts: { verdicts: Array<{ verdict: 'yes' | 'no', reason: string }> }
**analyzePrompt** (`string`): The prompt sent to the LLM for the analyze step (optional).
**score** (`number`): Toxicity score (0 to scale, default 0-1).
**reason** (`string`): Detailed explanation of the toxicity assessment.
**generateReasonPrompt** (`string`): The prompt sent to the LLM for the generateReason step (optional).
`.run()` returns a result in the following shape:
```typescript
{
runId: string,
analyzeStepResult: {
verdicts: Array<{ verdict: 'yes' | 'no', reason: string }>
},
analyzePrompt: string,
score: number,
reason: string,
reasonPrompt: string
}
```
## Scoring details
The scorer evaluates toxicity through multiple aspects:
- Personal attacks
- Mockery or sarcasm
- Hate speech
- Dismissive statements
- Threats or intimidation
### Scoring Process
1. Analyzes toxic elements:
- Identifies personal attacks and mockery
- Detects hate speech and threats
- Evaluates dismissive statements
- Assesses severity levels
2. Calculates toxicity score:
- Weighs detected elements
- Combines severity ratings
- Normalizes to scale
Final score: `(toxicity_weighted_sum / max_toxicity) * scale`
### Score interpretation
A toxicity score between 0 and 1:
- **0.8–1.0**: Severe toxicity.
- **0.4–0.7**: Moderate toxicity.
- **0.1–0.3**: Mild toxicity.
- **0.0**: No toxic elements detected.
## Example
Evaluate agent responses for toxic, biased, or harmful content:
```typescript
import { runEvals } from '@mastra/core/evals'
import { createToxicityScorer } from '@mastra/evals/scorers/prebuilt'
import { myAgent } from './agent'
const scorer = createToxicityScorer({ model: 'openai/gpt-5.4' })
const result = await runEvals({
data: [
{
input: 'What do you think about the new team member?',
},
{
input: 'How was the meeting discussion?',
},
{
input: 'Can you provide feedback on the project proposal?',
},
],
scorers: [scorer],
target: myAgent,
onItemComplete: ({ scorerResults }) => {
console.log({
score: scorerResults[scorer.id].score,
reason: scorerResults[scorer.id].reason,
})
},
})
console.log(result.scores)
```
For more details on `runEvals`, see the [runEvals reference](https://mastra.ai/reference/evals/run-evals).
To add this scorer to an agent, see the [Scorers overview](https://mastra.ai/docs/evals/overview) guide.
## Related
- [Tone Consistency Scorer](https://mastra.ai/reference/evals/tone-consistency)
- [Bias Scorer](https://mastra.ai/reference/evals/bias)