Recipes

Recipes are functions that define workflows for annotation, model training, data analysis, automated actions and more. Ellf comes with a range of built-in workflows for different use cases and also lets you implement your own custom recipes that run on your cluster.

Use Ellf to configure recipes for you

If you’ve connected Ellf to your coding assistant, it will be able to create and start tasks, actions, agents and services for you. You can also use the in-app chat and reference resources via @, for example to start a task using a data source, train from a dataset or assign an agent to a running task.

Tasks: Annotation and review

Tasks are workflows that preprocess and queue up data for annotation or review and start the annotation server. You can view and create them in the UI via Tasks or using the CLI commands under ellf tasks.

Named Entity Recognition

Annotate labeled text spans that refer to real-world objects such as people, countries or products.

Span Categorization

Annotate potentially overlapping and nested spans in the data.

Text Classification

Assign categories to whole documents or sentences.

Relation Extraction

Annotate relations between tokens and spans. Also supports joint span and relation annotation.

Coreference Resolution

Annotate coreference, i.e. links from ambiguous mentions like "her" or "the woman" back to an antecedent that provides more context about the entity in question.

Dependency Parsing

Annotate syntactic dependencies.

Part-of-Speech Tagging

Annotate word types.

Terminology List

Bootstrap a terminology list from word vectors. Terminology lists can be converted into patterns to help pre-select entity spans during annotation.

Image Annotation & Classification

Annotate bounding boxes and segments, or assign categories to images.

Annotate Audio

Annotate regions, assign categories to audio content or transcribe audio files.

Annotate Video

Annotate regions, assign categories to video content or transcribe video files.

Curate and Explore

View what's in your data and accept or reject examples

Review Annotations

Review existing annotations created by multiple annotators and resolve potential conflicts by creating one final annotation.

Secrets Example

Annotate 'hello world'

Sentence Segmentation

Create gold data for sentence boundaries by correcting a model's predictions

Debug Task

Task with tunable delays and errors for debugging.

Pattern Generation MCP Server

spaCy NER pattern generation as an MCP server. Uses Google Gemini to propose and iteratively refine spaCy matcher patterns given example text and a description of the entity types.

PDF Span Annotation

Annotate text spans in PDFs pre-processed into a layout asset with PDF Layout Fetch.

PDF OCR Correction

Apply OCR to annotated bounding boxes from PDF Image Annotation and correct the extracted text.

PDF Image Annotation

Annotate bounding boxes and segments on PDF pages rendered as images.

Actions: Training, evaluation and more

Actions are workflows that execute any logic and exit, similar to jobs running in a CI system. You can view and create them in the UI via Actions or using the CLI commands under ellf actions.

Dataset operations

Merge, copy and export annotated data

Migrate dataset to structured

Convert an unstructured dataset to the structured format

Hello world

Print 'hello world'

Wait and exit

Wait and exit with a given code

Print file length

Print dataset or file length

Call PAM with dummy metrics data

Download spaCy models

Download and install one or more spaCy models to shared storage so they can be loaded with spacy.load()

Generate synthetic data with Gemini Flash

Generate diverse synthetic examples from a plain-English task description using Google Gemini Flash, write them as a JSONL file and register the result as an Input asset on the cluster.

Train a spaCy pipeline

Train a spaCy model with one or more components on annotated data

Textcat LLM fetch

Gather text categorization predictions from an LLM

PDF Ingest

Render the source PDFs of a published 'pdf' asset into page images and a manifest, making the asset ready to annotate. Run via `ellf publish data --kind pdf --render`, or directly against an already-published asset.

PDF Layout Fetch

Extract text and layout from a PDF asset with spacy-layout and save as a layout asset for fast annotation with PDF Span Annotation.

Agents: Auto-annotation and automation

Agents are autonomous workers and annotators that can be assigned to tasks. They’re typically powered by LLMs and can use models running on the cluster or via APIs. You can view and create them in the UI via Agents or using the CLI commands under ellf agents.

Gemini Annotation Agent

Autonomous annotation agent powered by Google Gemini

spaCy Test Agent

Deterministic local annotation agent for tests and development

Services: Apps and APIs

Services are long-running background processes like REST or MCP APIs and apps that are served from your cluster. You can view and create them in the UI via Services or using the CLI commands under ellf services.

Coming soon: This section is still under construction.

Community Recipes: Third-party and other plugins

These recipes can be installed on your cluster separately and are provided by other packages, maintained either by us or by the developer community. If you want to contribute a recipe you’ve built, get in touch! For more details on custom recipes, see the recipe development guide.

Coming soon: This section is still under construction.

Working with PDFs

Ellf ships a set of recipes for annotating PDF documents. They all build on the built-in PDF asset type, so the first step is always to publish and render your documents as a pdf asset. Rendering rasterizes each page once, via the --render flag of publish data or as the pdf_ingest action. From there, the recipes reuse those page images so you never re-render. There are two main workflows, depending on what you want to annotate.

Region annotation and OCR correction

This workflow treats each page as an image and is ideal for scanned documents, forms, or extracting text from specific regions:

PDF Image Annotation (pdf_image) – draw labeled bounding boxes and regions directly on the rendered pages. By default all the pages of a document are grouped into a single task that you page through together, keeping them in context. Set split_pages to instead put each page in the queue as its own separate task, which is handy for distributing a long document across annotators.
PDF OCR Correction (pdf_ocr_correct) – takes the boxes annotated in the previous step, crops each region from the page, runs Tesseract OCR over it, and presents the extracted text for you to correct. Filter which boxes to OCR by label, and OCR in other languages with custom tessdata.
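
The effect of split_pages on the queue can be pictured with a small sketch. It only illustrates the grouping described above and is not Ellf's implementation: the build_tasks helper and the dictionary shapes are hypothetical.

```python
from typing import Dict, List


def build_tasks(doc_id: str, pages: List[str], split_pages: bool = False) -> List[Dict]:
    """Group rendered page images into annotation tasks (hypothetical helper)."""
    if split_pages:
        # One task per page: each page enters the queue on its own.
        return [
            {"doc": doc_id, "pages": [page], "page_index": i}
            for i, page in enumerate(pages)
        ]
    # Default: a single task that holds every page of the document.
    return [{"doc": doc_id, "pages": list(pages)}]


pages = ["page-0.png", "page-1.png", "page-2.png"]
grouped = build_tasks("report", pages)
split = build_tasks("report", pages, split_pages=True)
print(len(grouped), len(split))  # 1 3
```

With the default, one annotator pages through the whole document; with split_pages, the three pages could go to three different annotators.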

Custom OCR language data

Only English (eng) is bundled in the cluster image, so to OCR in any other language you need to upload its Tesseract data files as an asset. Place the .traineddata file(s) you need – for example from the tessdata repository – inside a directory and publish that directory as an input asset, then select it in the Custom tessdata field when starting the task. The asset directory is used as the Tesseract data prefix, so each file name must match the language code you pass (e.g. pol.traineddata for pol, combined as eng+pol).
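
Before publishing the directory, it can help to check that every language code you plan to pass has a matching file. The following is a minimal local sketch of that naming rule. The missing_traineddata helper is hypothetical and not part of Ellf, and it inspects only the directory you give it, so it knows nothing about data bundled elsewhere.

```python
import tempfile
from pathlib import Path
from typing import List


def missing_traineddata(tessdata_dir: Path, languages: str) -> List[str]:
    """Return the codes in a string like "pol+deu" that have no <code>.traineddata file."""
    codes = [code for code in languages.split("+") if code]
    return [
        code for code in codes
        if not (tessdata_dir / f"{code}.traineddata").is_file()
    ]


with tempfile.TemporaryDirectory() as tmp:
    tessdata = Path(tmp)
    # Empty placeholder file standing in for real Tesseract model data.
    (tessdata / "pol.traineddata").write_bytes(b"")
    print(missing_traineddata(tessdata, "pol"))      # []
    print(missing_traineddata(tessdata, "pol+deu"))  # ['deu']
```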

Layout-based span annotation

This workflow extracts the document’s text and layout so you can annotate spans (e.g. for NER) over real tokens, with the page image shown alongside for context. It’s a two-recipe process:

PDF Layout Fetch (pdf_layout_fetch) – an action that runs spacy-layout over the PDFs to extract text, tokens and layout structure (headings, lists, paragraphs) into a separate, lightweight layout asset. Prepare it in focus mode to annotate only specific layout types (e.g. text or list_item) section-by-section, or leave it full-page.
PDF Span Annotation (pdf_spans) – annotate text spans over the extracted layout, with the rendered page shown alongside for reference; it reuses the layout asset’s source PDF automatically, so you only select the layout asset. Pick an annotation mode: whole pages grouped into one task, one task per page, or, for a layout asset prepared with focus, section by section. You can label whatever spans your scheme defines, not only named entities, and can optionally select a spaCy model with an NER component to pre-highlight its entity predictions as suggestions.
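
Conceptually, focus mode narrows the extracted layout to the section types you care about. The sketch below shows that idea on plain dictionaries. The section records, their field names, and the focus_sections helper are all hypothetical, and the real layout asset format may differ.

```python
from typing import Dict, Iterable, List

# Hypothetical layout sections, loosely modeled on labeled layout regions.
sections = [
    {"page": 1, "type": "section_header", "text": "1. Introduction"},
    {"page": 1, "type": "text", "text": "Acme Corp was founded in 1999."},
    {"page": 1, "type": "list_item", "text": "Offices in Berlin and Oslo"},
    {"page": 2, "type": "table", "text": "Revenue by year"},
]


def focus_sections(sections: Iterable[Dict], focus_types: Iterable[str]) -> List[Dict]:
    """Keep only sections whose layout type is in focus, preserving document order."""
    wanted = set(focus_types)
    return [section for section in sections if section["type"] in wanted]


focused = focus_sections(sections, ["text", "list_item"])
print([s["type"] for s in focused])  # ['text', 'list_item']
```

Each remaining section would then be shown as its own unit for annotation, while a full-page layout keeps everything on the page together.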

Pick the workflow that matches your data

Reach for region annotation + OCR when the text isn’t reliably extractable (scans, images, complex forms) and you need to define and read regions yourself. Reach for layout-based spans when the PDFs have a recoverable text layer and you want to label spans over the real text, the way you would with NER or span categorization on plain text.