Assets & Files
Assets are pointers to data resources that are registered with Ellf and can refer to files, directories and machine learning models on your cluster, as well as any other sources like data from an API call or a custom remote database. In addition to assets, you can upload and manage Python packages on your cluster and create datasets used to store data and annotations. In the web application, the ClusterAssets page shows you an overview of assets available on your cluster.
Use Ellf to manage files and resources
If you’ve connected Ellf to your coding assistant, it will be able to create and manage assets for you. You can also use the in-app chat and reference resources via @, for example to start a task using a data source, train from a dataset or assign an agent to a running task.
Asset types
Assets cover all common data types like input files and resources, match patterns, annotated and raw datasets, models, vectors, and Python packages, including built-in and custom recipes. You can also implement your own custom asset types.
Data and data files
Naturally, data is one of the most important parts of many workflows and you often want to start by adding your input data to the app. The hybrid cloud architecture of Ellf means that all of your data stays private and on the data processing cluster hosted by you, and our servers only store a record of the data, called an “asset”.
Assets can point to pretty much anything that’s loadable – commonly, this includes files and directories of files, but it can also include remote resources from a database or an API. See the section on custom assets for details on how to implement your own dataclasses for custom asset types and associated logic like loading.
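As a rough illustration of what such a custom asset type could look like (the class name, fields and loading logic below are invented for this sketch and are not part of the Ellf SDK):

```python
import json
from dataclasses import dataclass


@dataclass
class JSONLAsset:
    """Illustrative custom asset: a pointer to a JSONL file plus loading logic."""

    name: str
    path: str  # location of the file, e.g. "{__bucket__}/example/data.jsonl"

    def load(self) -> list[dict]:
        # Load one JSON object per line (newline-delimited JSON)
        with open(self.path, encoding="utf8") as f:
            return [json.loads(line) for line in f if line.strip()]
```

The same pattern extends to remote resources: the dataclass holds whatever is needed to locate the data, and the load method encapsulates the logic to fetch it.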
Datasets
Datasets are named collections of data and annotations in Prodigy’s JSON format, stored in the database on your cluster. When you start an annotation task, you typically provide the name of a new or existing dataset to save the collected annotations to. Datasets can then be used for training and evaluating models and performing other analysis. The datasets command lets you manage your datasets.
Exporting datasets
The datasets export command lets you export examples from an existing dataset to a file, e.g. to inspect it locally or to integrate the data into a larger automation pipeline. Datasets are stored in Prodigy’s JSONL (newline-delimited JSON) format. Also see the docs on Prodigy’s annotation interfaces for more details on the data format and structure they produce.
{"text":"Uber\u2019s Lesson: Silicon Valley\u2019s Start-Up Machine Needs Fixing","meta":{"source":"The New York Times"},"_input_hash":1886699658,"_task_hash":-1952856502,"tokens":[{"text":"Uber","start":0,"end":4,"id":0},{"text":"\u2019s","start":4,"end":6,"id":1},{"text":"Lesson","start":7,"end":13,"id":2},{"text":":","start":13,"end":14,"id":3},{"text":"Silicon","start":15,"end":22,"id":4},{"text":"Valley","start":23,"end":29,"id":5},{"text":"\u2019s","start":29,"end":31,"id":6},{"text":"Start","start":32,"end":37,"id":7},{"text":"-","start":37,"end":38,"id":8},{"text":"Up","start":38,"end":40,"id":9},{"text":"Machine","start":41,"end":48,"id":10},{"text":"Needs","start":49,"end":54,"id":11},{"text":"Fixing","start":55,"end":61,"id":12}],"_session_id":null,"_view_id":"ner_manual","spans":[{"start":0,"end":4,"token_start":0,"token_end":0,"label":"ORG"},{"start":15,"end":29,"token_start":4,"token_end":5,"label":"LOCATION"}],"answer":"accept"}
{"text":"Pearl Automation, Founded by Apple Veterans, Shuts Down","meta":{"source":"The New York Times"},"_input_hash":1487477437,"_task_hash":-1298236362,"tokens":[{"text":"Pearl","start":0,"end":5,"id":0},{"text":"Automation","start":6,"end":16,"id":1},{"text":",","start":16,"end":17,"id":2},{"text":"Founded","start":18,"end":25,"id":3},{"text":"by","start":26,"end":28,"id":4},{"text":"Apple","start":29,"end":34,"id":5},{"text":"Veterans","start":35,"end":43,"id":6},{"text":",","start":43,"end":44,"id":7},{"text":"Shuts","start":45,"end":50,"id":8},{"text":"Down","start":51,"end":55,"id":9}],"_session_id":null,"_view_id":"ner_manual","spans":[{"start":0,"end":16,"token_start":0,"token_end":1,"label":"ORG"},{"start":29,"end":34,"token_start":5,"token_end":5,"label":"ORG"}],"answer":"accept"}
{"text":"How Silicon Valley Pushed Coding Into American Classrooms","meta":{"source":"The New York Times"},"_input_hash":1842734674,"_task_hash":636683182,"tokens":[{"text":"How","start":0,"end":3,"id":0},{"text":"Silicon","start":4,"end":11,"id":1},{"text":"Valley","start":12,"end":18,"id":2},{"text":"Pushed","start":19,"end":25,"id":3},{"text":"Coding","start":26,"end":32,"id":4},{"text":"Into","start":33,"end":37,"id":5},{"text":"American","start":38,"end":46,"id":6},{"text":"Classrooms","start":47,"end":57,"id":7}],"_session_id":null,"_view_id":"ner_manual","spans":[{"start":4,"end":18,"token_start":1,"token_end":2,"label":"LOCATION"}],"answer":"accept"}
{"text":"Women in Tech Speak Frankly on Culture of Harassment","meta":{"source":"The New York Times"},"_input_hash":-487516519,"_task_hash":62119900,"tokens":[{"text":"Women","start":0,"end":5,"id":0},{"text":"in","start":6,"end":8,"id":1},{"text":"Tech","start":9,"end":13,"id":2},{"text":"Speak","start":14,"end":19,"id":3},{"text":"Frankly","start":20,"end":27,"id":4},{"text":"on","start":28,"end":30,"id":5},{"text":"Culture","start":31,"end":38,"id":6},{"text":"of","start":39,"end":41,"id":7},{"text":"Harassment","start":42,"end":52,"id":8}],"_session_id":null,"_view_id":"ner_manual","answer":"accept"}Keep in mind that the data may contain references to files hosted on your cluster, like image or audio file paths. If you want to export all assets and datasets, e.g. for backup purposes, you can use the ellf export command.
Models
Models are a sub-type of assets, since they’re also just collections of files under the hood. They can be required in recipes, produced by actions (e.g. after training), or used by agents to perform specific tasks.
Models can be uploaded to your cluster using publish data, pointing the destination path to a location on your cluster’s storage bucket. You can also produce models from action recipes – for instance, training a model from a dataset will automatically register the resulting model as a new asset. To use a model in a recipe, annotate the argument with the Model type, which provides a load method that returns the loaded model. For more details and examples, see the recipe development guide.
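To sketch the pattern in plain Python (the Model class below is a simplified stand-in for illustration only, not the real SDK type; see the recipe development guide for the actual API):

```python
from dataclasses import dataclass
from pathlib import Path


# Stand-in for Ellf's Model asset type: under the hood, a model asset is
# a pointer to a collection of files on the cluster.
@dataclass
class Model:
    path: str  # directory containing the model files

    def load(self):
        # A real implementation would deserialize the model here; this
        # sketch just returns the file names the asset points to.
        return sorted(p.name for p in Path(self.path).iterdir())


def evaluate_recipe(*, model: Model):
    # In a recipe, you annotate the argument with the Model type and
    # call .load() to get the loaded model object back.
    return model.load()
```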
Packages
Recipes often depend on various other Python packages, which can include libraries available on PyPI, as well as your own private packages. Ellf manages those centrally alongside assets, to ensure each worker in the cluster has the correct set of package dependencies available when executing a recipe. Recipes are packaged and uploaded as versioned Python packages and you’ll be able to see the recipe version used by a task or action in the app and CLI info.
Publishing data assets
The easiest way to upload and publish data to your cluster is by using the publish data command. This takes care of uploading the files to the storage bucket on your cluster and creates a record of the asset in Ellf so you can use it within the app. The {__bucket__} variable in the destination path is a path alias built in by default, which refers to the cluster’s storage bucket URL.
```shell
$ ellf publish data ./example.jsonl "{__bucket__}/example/data.jsonl" \
    --name "My first asset" --version 1.0.0 --kind input --loader jsonl
```
The newly created asset will now show up under ClusterAssets in the UI and via the assets list command on the CLI. The copied files are also shown when you run files ls on the assets directory.
Assets can also be created dynamically from Python within recipes, for example to save preprocessed data or import from an internal or external resource like a database. If your data source requires authentication, you can use the built-in secrets feature to securely make your API keys or credentials available to the recipe.
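As an illustrative sketch of the first half of that flow (the table and file names are invented, and the call that registers the file as an Ellf asset is omitted, since its exact API depends on the recipe SDK), a recipe might dump rows from an internal database to a JSONL file like this:

```python
import json
import sqlite3


def export_table_to_jsonl(conn: sqlite3.Connection, table: str, out_path: str) -> int:
    """Dump each row of a table as one JSON object per line; returns the row count."""
    conn.row_factory = sqlite3.Row  # make rows behave like mappings
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()
    with open(out_path, "w", encoding="utf8") as f:
        for row in rows:
            f.write(json.dumps(dict(row)) + "\n")
    return len(rows)
```

The resulting file could then be registered as a new asset from within the recipe, so it shows up alongside your other data in the app.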
Path aliases
Assets are pointers to files that can be located anywhere, most commonly on your cluster. Under the hood, those files are added to a cloud storage bucket, e.g. an S3 bucket if you’re using AWS. To make it easier to manage file paths on your cluster, Ellf lets you create named path aliases using the paths command. When managing files and creating assets, you can then use the alias as a variable in the path, e.g. {alias}/, so you don’t have to repeat the full URL. Built-in path aliases are wrapped in double underscores, e.g. {__bucket__}, and can also be used in custom path aliases.
| Built-in alias | Description |
|---|---|
| __nfs__ | The path to the NFS drive. |
| __bucket__ | The data bucket of the cluster. |
For example, you can have an alias {train} that points to {__bucket__}/training. After adding a path, you can then use it in commands like publish and files.
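Conceptually, alias expansion is simple string substitution, with custom aliases allowed to reference the built-ins. A minimal sketch, assuming an invented bucket URL and NFS mount point:

```python
# Invented example values: the real built-in aliases resolve to your
# cluster's actual bucket URL and NFS path.
BUILT_IN = {"__bucket__": "s3://my-cluster-bucket", "__nfs__": "/mnt/nfs"}


def expand(path: str, custom: dict[str, str]) -> str:
    """Expand {alias} variables; custom aliases may contain built-ins."""
    # Resolve custom aliases first, since they may reference built-ins
    for name, value in custom.items():
        path = path.replace("{" + name + "}", value)
    for name, value in BUILT_IN.items():
        path = path.replace("{" + name + "}", value)
    return path
```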
Secrets
Secrets let you securely manage API keys and other credentials across your organization, which you can then select and use in recipes – for example, if you need to connect to a model via an API. Under the hood, secrets are named pointers and you can view and manage them using the secrets command or via ClusterAssetsSecrets in the UI. Secret values are stored only on your cluster and never sent to our servers.
Adding secrets
You can create a secret using secrets create on the CLI or via the UI. If you don’t provide a --value, you’ll be prompted to enter it interactively so it doesn’t appear in your shell history.
Hiding the secret in your shell history
To make sure your secret isn’t saved in the history, omit the --value option and enter the secret when prompted instead. You can also prefix the command with a space to keep it out of the history, provided your shell is configured accordingly, e.g. with HISTCONTROL=ignorespace in bash or setopt HIST_IGNORE_SPACE in zsh.
Using secrets in recipes
Recipes for tasks, actions and agents can require secrets as their arguments, e.g. to authenticate with an API. In the UI, you can then see a dropdown of the available secrets. On the CLI, you can simply provide the secret name.
```shell
$ ellf agents create gemini_agent "Gemini Auto-Labeler Agent" \
    --model gemini-2.0-flash --api-key GEMINI_API_KEY
```

If you’re developing your own recipes, you can require the Secret type for an argument, which will then allow the user to select a secret and pass an instance of it to your recipe function. Calling Secret.value() returns the plain-text string, which you can then pass forward to API calls and other libraries that require it.
```python
from ellf_recipes_sdk import agent_recipe, Secret

@agent_recipe(title="My Recipe")
def recipe(*, api_key: Secret):
    # value() returns the plain-text secret, e.g. to pass to an API client
    gemini_key = api_key.value()
```