
Modules

Modules are collections of tools, skills and agents for different steps of your development process and lifecycle. By default, Ellf will guide you through the whole end-to-end process and select the right modules in the order that best fits your requirements. You can also run individual modules directly as slash commands.

Project Planning

Break your business problem down into actionable steps

Data Annotation

Define label scheme, NLP components and annotate data

Prodigy Recipes

Implement custom workflows for the Prodigy annotation tool

Pattern Generation

Interactively develop patterns and rule-based logic

Model Training

Train and export custom models and components

Question Answering

Ask specific questions and query the knowledge base

Human Support

Get support and feedback from our NLP experts

Coding Handover

Hand over in-app conversations to your coding assistant

Monitoring

Monitor jobs, training progress and cluster health

Platform Operations

Navigate the platform and CLI to perform actions


Project planning

This module helps you break down your business problems and requirements into actionable steps and outlines a project plan, including strategies for data preparation and development, NLP components to use, label schemes, annotation and model training. It will ask follow-up questions as needed to determine the right approach and next steps. This is usually the first step in the Ellf process.

PROJECT_PLAN.md

NLP Project Plan: Fraud Report Classifier

Problem Statement

Build a pipeline that processes analyst-written fraud investigation summaries and produces four outputs: fraud type (multi-class, 6 labels), affected product, urgency level, and legal escalation flag. Urgency and legal escalation are derived from business rules applied downstream of model predictions — not model outputs themselves. Starting from scratch on annotation.

Pipeline Overview

Analyst report text
  ├─→ [Fraud Type Classifier]        ← supervised textcat, 6-class exclusive
  ├─→ [Social Engineering Detector]  ← supervised textcat, binary
  ├─→ [Product Extractor]            ← rules (PhraseMatcher on known product names)
  ├─→ [Amount Extractor]             ← regex / MONEY NER for dollar amounts
  └─→ [Business Logic Layer]        ← NOT a model
        ├─→ urgency: fraud_type × amount × product → policy thresholds
        └─→ escalate: urgency == HIGH → legal team routing

Key architecture decision: Urgency and legal escalation are policy decisions, not language-understanding tasks. Internal thresholds ($100K, insider fraud type) are encoded as rules owned by the fraud team, not embedded in model weights.
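Because the business logic layer is just policy rules over upstream predictions, it can be sketched in a few lines. The $100K threshold and the insider_fraud rule come from the plan above; the MEDIUM tier and its $10K cutoff are invented for illustration:

```python
def urgency(fraud_type: str, amount: float) -> str:
    """Policy rules, not model outputs: map predictions to an urgency level."""
    if fraud_type == "insider_fraud" or amount >= 100_000:
        return "HIGH"
    if amount >= 10_000:  # assumed intermediate tier, not from the plan
        return "MEDIUM"
    return "LOW"

def escalate_to_legal(urgency_level: str) -> bool:
    """Escalation is derived from urgency (policy-owned)."""
    return urgency_level == "HIGH"

print(urgency("insider_fraud", 5_000))                     # HIGH
print(escalate_to_legal(urgency("card_fraud", 250_000)))   # True
```

Because the thresholds live in code owned by the fraud team rather than in model weights, changing policy is a one-line edit with no retraining.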

Components

Component                 | Approach                              | Rationale
--------------------------|---------------------------------------|-----------------------------------------------
Fraud type                | Supervised textcat, 6-class exclusive | Core NLP task; domain-specific; needs training
Social engineering vector | Supervised textcat, binary            | Method flag separate from outcome type; binary is fast to annotate
Product extraction        | PhraseMatcher rules                   | Analyst reports name products explicitly; rules are fast and auditable
Amount extraction         | Regex / spaCy MONEY entity            | Structured format; rules are sufficient
Urgency                   | Business logic rules                  | Thresholds ($100K, fraud type) are policy, not language
Legal escalation          | Business logic rules                  | Derived from urgency; policy-owned
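The PhraseMatcher row above can be sketched with a blank spaCy pipeline, since phrase rules need only a tokenizer, not a trained model. The product names here are made up:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # tokenizer only; no trained model required
products = ["Platinum Card", "Instant Transfer"]  # hypothetical product list
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching
matcher.add("PRODUCT", [nlp.make_doc(p) for p in products])

doc = nlp("Chargeback on a platinum card funded via instant transfer.")
hits = [doc[start:end].text for _, start, end in matcher(doc)]
print(hits)  # ['platinum card', 'instant transfer']
```

Matching on the LOWER attribute makes the rules robust to casing while keeping every hit auditable, which is the rationale given in the table.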

Data Strategy

  • Source: Analyst-written fraud investigation summaries (free text, high quality, domain-consistent language)
  • No existing labels — annotating from scratch
  • insider_fraud and insurance_fraud estimated at 10–15% each — random sampling sufficient

Annotation Plan

Fraud type classifier

  • Recipe: textcat.correct with LLM pre-annotation
  • Labels: account_takeover, money_laundering, card_fraud, application_fraud, insider_fraud, insurance_fraud (exclusive)
  • Target volume: 400–500 annotated examples
  • Evaluation split: Set aside ~100 examples before annotation starts (document-level split)
  • Pilot first: Annotate 50–75 manually (no LLM). Fix schema before scaling.

Social engineering vector

  • Recipe: textcat.binary
  • Label: social_engineering_vector (true/false)
  • Pass: Second pass after fraud type annotation is stable
  • Target volume: 200–300 examples

Schema decision

  • social_engineering is a method, not an outcome — removed from fraud type labels
  • Added as a separate binary attribute to avoid confusable label pairs

Evaluation Strategy

Test set

  • Hold out ~100 examples before any annotation begins (document-level split)
  • Never used in training; kept constant as the ground truth benchmark

Metrics

  • Per-class F1 for all 6 fraud types — do not rely on macro average alone
  • insider_fraud and insurance_fraud tracked separately
  • Confusion matrix — expected confusables: card_fraud ↔ account_takeover, application_fraud ↔ money_laundering
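The per-class F1 check above is easy to compute without any dependencies; the toy gold and predicted labels below are illustrative:

```python
def per_class_f1(gold: list[str], pred: list[str]) -> dict[str, float]:
    """F1 per label, so rare classes aren't hidden by a macro average."""
    scores = {}
    for label in set(gold) | set(pred):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(p == label and g != label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[label] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores

gold = ["card_fraud", "insider_fraud", "card_fraud", "account_takeover"]
pred = ["card_fraud", "card_fraud", "card_fraud", "account_takeover"]
print(per_class_f1(gold, pred))
```

Note how the single missed insider_fraud example scores 0.0 on its own row while overall accuracy is still 75%: exactly the failure a macro average would blur.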

Baselines

  • Most-frequent-class baseline before any model evaluation
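The baseline itself is a one-liner over the gold labels, shown here with made-up counts:

```python
from collections import Counter

def most_frequent_class_accuracy(gold: list[str]) -> float:
    """Accuracy of always predicting the most common label."""
    return Counter(gold).most_common(1)[0][1] / len(gold)

print(most_frequent_class_accuracy(["card_fraud"] * 7 + ["insider_fraud"] * 3))  # 0.7
```

Any trained model has to beat this number before its scores mean anything.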

Training curves

  • Train on 25/50/75/100% of the data after each annotation batch
  • If the score is still rising at 100%, annotate more; if it's flat, investigate the schema or architecture

Memorisation check

  • Train on pilot examples, evaluate on those same examples — must be near-perfect

Roadmap

Phase                | What                                                           | Output
---------------------|----------------------------------------------------------------|-----------------------------
1 — Pilot            | Read reports, write guidelines, annotate manually              | Stable schema + 75 examples
2 — Baseline         | Train first model, memorisation check                          | Go/no-go on schema
3 — Scale annotation | LLM-assisted textcat.correct to 400–500 examples               | Training dataset
4 — Train & evaluate | Full training run, per-class F1, error analysis                | v1 fraud type model
5 — SE vector pass   | Binary annotation pass + training                              | v1 SE vector model
6 — Rules layer      | Product PhraseMatcher, amount regex, urgency/escalation rules  | Complete pipeline

Next Steps

  1. Hold out ~100 docs as the test set before touching any annotation tool
  2. Read 50–75 reports manually to validate the schema
  3. Write one-page annotation guidelines with label definitions and tiebreaker rules
  4. Annotate the pilot batch using the textcat recipe
  5. Train a quick baseline and run the memorisation check
  6. If training curves are rising, scale with LLM-assisted annotation
  7. Build the rules layer (product PhraseMatcher, amount regex) in parallel

Decision Log

  • Urgency: Business logic. Thresholds ($100K, insider fraud type) are policy rules.
  • Legal escalation: Business logic derived from urgency. Policy-owned.
  • Product extraction: PhraseMatcher rules first. Analyst reports name products explicitly.
  • Fraud type: Supervised textcat, 6-class exclusive. Core NLP task.
  • Social engineering: Binary vector flag, not a fraud type label. It's a method, not an outcome.
  • Evaluation-first: 100-doc test set held out before annotation begins.

Ellf also uses the plans CLI to store and sync project plans so you can work on them locally in your coding assistant and in the in-app chat. Plans are scoped to Ellf projects and each project can have one or more plans attached to it, for example one plan for each NLP component you’re developing. As you work with Ellf, your project plans are continuously updated to reflect the state and progress of your project.


Data annotation

This module helps you convert your business problem and project plans into the right NLP components, develop the best matching label schemes that are easy to annotate and learn from, and set up automated data annotation workflows. It will start the Prodigy annotation tool if needed and provide you with the setup to scale up your data labelling process by creating tasks in the platform and setting up agents to auto-label data for you.

The data annotation module also incorporates many best practices for creating generalizable categories and implementing the most efficient human-in-the-loop and model-assisted data development strategies, like our guide for optimizing annotation workflows. This information is also available in the knowledge base that you can access via the /ellf-ask module.

✳︎Claude Code: Setting up annotation on the cluster
› /ellf-annotate

I want to start annotating my customer support emails for intent classification. The labels are REFUND, SHIPPING, TECHNICAL, and BILLING. My data is in the cluster at data/support-emails.

⏺︎

Your label scheme, data source, and annotation objective all look good. Since this is a text classification task starting from scratch, I’ll use the built-in textcat.manual recipe with your four labels. Let me preview the setup first.

✔ Preview server running at http://localhost:8080
⏺︎

The form and annotation interface look correct — four labels rendered as exclusive options. I’ll go ahead and create the task on your cluster.

✔ Created task “Support Email Classification” (textcat.manual)
✔ Task started on cluster
⏺︎

Your annotation task is running. You can open it at https://app.ellf.ai/tasks/support-email-classification or assign annotators and agents to start labelling.


Prodigy recipe builder

This module makes coding assistants proficient at using the Prodigy annotation tool and implementing custom recipes and interfaces for specific use cases. Ellf will use it if a custom annotation workflow is required for data collection in your project, e.g. during the data annotation step. You can also call this module directly if you already have an annotation goal in mind and need help implementing the workflow and annotation interface for it.

The recipe builder can implement recipes for both the standalone workflow that runs Prodigy locally, as well as Ellf tasks that run on your cluster and let you assign human annotators and autonomous annotation agents in the platform. Compared to the data annotation module, which focuses on annotation strategy, this module specifically targets the Python recipe implementation.

✳︎Claude Code: Building a custom annotation recipe
› /ellf-prodigy

I need a recipe for annotating product labels in images. The images are in Japanese and annotators need to see an English translation.

⏺︎

This will need a custom recipe with two parts:

  1. Image display with on-demand translation — call a translation API when loading each image and show the result alongside it
  2. Label interface — classification UI for the product categories

Let me start with the image loader and translation component. I’ll use a blocks layout that renders the image on top and the translated text below. Sound good?
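A Prodigy recipe's stream is a plain Python generator, so the translation step described above can be sketched independently of the recipe itself. `translate` is a placeholder for a real translation API call, and the task fields are illustrative:

```python
def translate(text: str) -> str:
    # Placeholder: a real recipe would call a translation API here.
    return f"[EN] {text}"

def add_translation(stream):
    """Attach a translation to each incoming annotation task."""
    for task in stream:
        task["translation"] = translate(task.get("text", ""))
        yield task

tasks = [{"image": "shelf_01.jpg", "text": "抹茶ラテ"}]
print(list(add_translation(tasks)))
```

Wrapping the stream in a generator keeps the translation lazy: each image is only translated when an annotator is about to see it.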


Pattern generation

This module helps you determine which of the information you want to extract requires machine learning, and which parts can be reliably extracted with rule-based logic like match patterns, regular expressions or other heuristics for faster and more predictable results. It can also automate the creation of those match patterns and iteratively refine them with alternating precision and recall passes, asking you for interactive feedback on the results along the way. At the end of the process, the module can provide you with the code or script needed to apply the generated patterns, or integrate them into your existing NLP pipeline to combine models with rules.

✳︎Claude Code: Generating patterns for invoice extraction
› /ellf-patterns

I need to extract invoice numbers and amounts from procurement documents. Invoice numbers look like INV-2024-00123 and amounts are in USD.

⏺︎

Both of these are great candidates for patterns — they have predictable surface forms. Let me generate an initial set.

⏺︎

INVOICE_NUMBER — 48 matches across 50 documents, e.g.:

  • “Please reference INV-2024-00123 for payment”
  • “Submitted under INV-2023-45678 on March 3”

AMOUNT — 62 matches, e.g.:

  • “Total due: $14,250.00”
  • “Line item subtotal $899.99”

Precision looks high. Want me to run a recall pass to catch edge cases, or export these patterns as-is?
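The docs don't show the patterns the module actually generates, but regexes of this shape would cover the matches listed above (illustrative only, not the module's output):

```python
import re

# INV-YYYY-NNNNN invoice identifiers, e.g. INV-2024-00123
INVOICE_RE = re.compile(r"\bINV-\d{4}-\d{5}\b")
# USD amounts with thousands separators and optional cents
AMOUNT_RE = re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?")

text = "Please reference INV-2024-00123 for payment. Total due: $14,250.00"
print(INVOICE_RE.findall(text))  # ['INV-2024-00123']
print(AMOUNT_RE.findall(text))   # ['$14,250.00']
```

A recall pass would then probe edge cases these miss, e.g. amounts written without cents or with a "USD" suffix.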


Model training

This module uses your data to train the best possible small, fast and private model components, e.g. a spaCy pipeline consisting of the previously developed components. Based on your requirements and project plan, it will automatically pick a suitable configuration, embeddings and hyperparameters. It will also log the training results for further analysis and export a loadable model and code to use it in your application. To perform and manage training in the cloud, the module can also set up actions that run on your cluster and use GPU resources if available.

Once your training experiments are running, the module can analyze your training logs and results, interpret and explain them for you, and recommend strategies for next steps and further improvements.

✳︎Claude Code: Detecting overfitting during training
› /ellf-train

My NER model for extracting company names and job titles from resumes has been training for a while. How’s it looking?

⚠ Overfitting detected — dev score declining while training loss continues to decrease (sustained over last 4 evaluation steps)
⏺︎

Your best model was saved at step 1,600 with 0.79 F1 (COMPANY 0.85, JOB_TITLE 0.71). After that, training loss keeps dropping but dev scores are going down — the model is memorizing your training data.

JOB_TITLE is the weaker label. Checking your data: you have 312 COMPANY examples but only 87 for JOB_TITLE. That’s likely not enough for reliable generalization. I’d recommend:

  1. Annotate more JOB_TITLE examples — target at least 150
  2. Increase dropout from 0.1 to 0.3 in the next run
  3. Use model-best from this run as your current baseline

Want me to set up an annotation task to collect more JOB_TITLE examples?
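Recommendation 2 corresponds to a single value in the spaCy training config. A minimal excerpt (the patience value shown is spaCy's default, included only for context):

```ini
[training]
dropout = 0.3     # raised from 0.1 per the overfitting diagnosis
patience = 1600   # training steps without improvement before early stopping
```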


Question answering

This module connects to Ellf’s extensive knowledge base consisting of documentation, code, project examples and advice for different use cases and domains, as well as development best practices and our distilled expertise from 10+ years of solving real-world NLP problems in industry. The knowledge base also includes documentation and examples of Ellf itself, as well as our other tools, spaCy and Prodigy, and their related ecosystems.

✳︎Claude Code: Example
› /ellf-ask

I’m developing a system to extract product names from internal company reports. I have about 2000 annotated training examples. How many evaluation examples do I need to report meaningful accuracy?

⏺︎

To report meaningful accuracy for your system extracting product names from internal company reports, a good rule of thumb is to have at least ten evaluation samples per significant digit of your accuracy metric. For instance, if you want to differentiate between 90% and 91% accuracy, you would need at least 1,000 evaluation examples annotated.

Given that you have 2,000 annotated training examples, an evaluation set of around 200 to 400 examples (roughly 10–20% of your data) is typically recommended for initial assessments.
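The rule of thumb in the answer can be written down directly; `min_eval_examples` is a hypothetical helper, and `resolution` is the smallest accuracy difference you want to resolve:

```python
def min_eval_examples(resolution: float) -> int:
    """~10 evaluation examples per unit of resolution: 10 / resolution
    examples to reliably tell accuracies that far apart."""
    return round(10 / resolution)

print(min_eval_examples(0.01))  # 1000 — enough to separate 90% from 91%
print(min_eval_examples(0.05))  # 200
```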


Human support

If you ever feel stuck or need human feedback on your project plans and workflow, you can use this module to send us your logs and have them reviewed by one of our NLP experts. Alternatively, you can also export your conversation log to a file and email it to us at support@ellf.ai.

✳︎Claude Code: Submitting a support request
› /ellf-support

The model’s accuracy isn’t improving with more data, so I’m considering changing the label scheme. Is this a good approach?

⏺︎Should the support request include your full session transcript, or a summary only?
› Full transcript
⏺︎

Support request sent successfully.

  • Reference ID: a7cf3184865e4f05a8bf269ad34c767a
  • Session transcript: attached
  • Debug log: not found (not attached)

The Ellf team will receive your question about whether changing the label scheme is the right approach for the accuracy plateau. You’ll get a copy via email.

Support requests are reviewed by our team and you’ll typically receive a response within a few business days. The module will always ask for your permission before sharing your conversation logs.


Handover to coding assistant

This module allows in-app chat conversations to be handed over to a coding assistant like Claude Code in order to create and edit files locally and write code. The conversation history can be moved to your cluster, and by default, Ellf will automatically pick up from the latest conversation. You can also call the module explicitly with a conversation ID or select from the available past conversations.

Handover from in-app chat to Claude Code

Can you implement this?

Sure! Here’s the handoff summary:

Request ID: 523060d8-b2db-4fd5-b541-5ee119007501
Status: planned
Description: Write loader.py for documents from data warehouse

In your local Claude Code, run /ellf-todo to pick it up – it will load our full conversation context automatically.

✳︎Claude Code: Continuing development after handover
› /ellf-todo
✔ Successfully updated coding request 523060d8-b2db-4fd5-b541-5ee119007501
⏺︎Context summary: Phase 1 of an e-commerce aspect extraction project. The task is to write loader.py for the internal data warehouse API that segments sentences and yields {id, text, date} dictionaries.

Under the hood, this module uses the todo CLI to create and manage requests, which can then be picked up by your coding assistant. It can also use the plans CLI to fetch and update your project plans so it has the available context for the given todo.


Monitoring and observability

This module monitors and diagnoses running jobs from your coding environment and the in-app chat. It can check on tasks, actions and agents running on your cluster, local training runs, standalone Prodigy servers, and overall cluster health. Instead of dumping raw logs into your conversation, the module produces compact structured summaries with actionable alerts. The module covers several monitoring surfaces:

  • Training jobs: Tracks loss, dev scores, best checkpoint and detects problems like overfitting, score plateaus, loss spikes and NaN loss.
  • Annotation tasks: Reports task state, dataset growth signals and annotator or agent activity to help you spot stalled or misconfigured annotation workflows.
  • Agents: Checks agent state, recent errors and whether the agent is producing work or appears blocked.
  • Cluster health: Runs connectivity, broker-side and node capacity checks to surface degraded state or resource pressure.

✳︎Claude Code: Checking on a training action
› /ellf-monitor How’s the NER training going?
⏺︎
Job: ner-product-names-v2 (action)
State: running — step 800
Best score: 0.81 (ents_f) at step 720
Train loss: 0.52 (stable)
Alerts: None

Training looks healthy. The score has been improving steadily and there are no signs of overfitting or loss instability.


Platform operations

This module makes your coding assistant and the in-app chat proficient at navigating the web platform and ellf command-line interface so it’s able to create and manage tasks, actions and agents, interact with the cluster, navigate to the right pages and resources, and perform any other actions for you. It essentially makes Ellf good at using itself to its full potential.

Ellf will generally ask you for additional confirmation before creating, running or deleting things for you, and will execute everything under your user account.

Example: Performing actions via the in-app chat

Start @Actions/Company Reports Training and show me the logs.

Start job: Company Reports Training (838e401c-f070-44de-b6a7-3db840e9dc7c)
✳︎Claude Code: Performing actions via the CLI
› Start the Company Reports Training action and show me the logs.
⏺︎

✔ Task started. Here’s a summary of the logs:

  • Recipe: ner (Named Entity Recognition) with view ner_manual
  • Labels: PERSON, ORG, DATE
  • Input: Loaded asset “Freedom Raw (train)” via txt loader
  • Dataset: Created freedom-ner-train in PostgreSQL
  • Server: Prodigy is running on port 80, health check passing

You can open it for annotation at:
https://app.ellf.ai/annotate/7a2c49e1-83bf-4d12-ae95-f1c620d8b537

Read next: Platform

from the makers of spaCy and Prodigy

Terms & Conditions · Privacy Policy · Imprint · © 2026 Explosion