Create a dataset loader
A dataset loader is used to generate one or more raw datasets. A raw dataset is processed by the Dataset preprocessing workflow to create a common dataset that can be used by multiple benchmarking tasks.
This guide will show you how to create a new Viash component to fetch or generate datasets.
Make sure you have followed the “Getting started” guide.
OpenProblems datasets
Common datasets are created by generating raw datasets with a data loader and running them through the pre-processing pipeline (Figure 1). Afterwards, further task-specific processing occurs prior to the task-specific benchmarking workflow.
See the reference documentation for more information on how each of these steps works.
Step 1: Create a directory for the dataset loader
To add a dataset to OpenProblems, you will need to create a Viash component for a dataset loader. Start by creating a directory for it:
mkdir src/datasets/loaders/myloader
Take a look at the dataset loaders that already exist in the src/datasets/loaders folder; likely there is already a dataset loader that does something similar to what you need.
Step 2: Create a Viash config
Next, create a config for the dataset loader. The Viash config contains metadata of your dataset, which script is used to run it, and the required dependencies. The simplest dataset loader you can create looks as follows; a Python and an R variant are shown below.
Contents of src/datasets/loaders/myloader/config.vsh.yaml (Python version):
functionality:
  name: "myloader"
  namespace: "datasets/loaders"
  description: "A new dataset loader"
  arguments:
    - name: "--output"
      __merge__: ../../api/file_raw.yaml
      direction: "output"
  resources:
    - type: python_script
      path: script.py
platforms:
  - type: docker
    image: python:3.10
    setup:
      - type: python
        pypi: anndata~=0.8.0
  - type: nextflow
Contents of src/datasets/loaders/myloader/config.vsh.yaml (R version):
functionality:
  name: "myloader"
  namespace: "datasets/loaders"
  description: "A new dataset loader"
  arguments:
    - name: "--output"
      __merge__: ../../api/file_raw.yaml
      direction: "output"
  resources:
    - type: r_script
      path: script.R
platforms:
  - type: docker
    image: eddelbuettel/r2u:22.04
    setup:
      - type: r
        cran: anndata
      - type: apt
        packages: [ libhdf5-dev, libgeos-dev, python3, python3-pip, python3-dev, python-is-python3 ]
  - type: nextflow
For more parameter options, refer to the “Parameters” section below.
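If you want to verify that Viash parses the config before writing the script, you can print the resolved config. This is an optional check, assuming Viash is installed as described in the “Getting started” guide:
viash config view src/datasets/loaders/myloader/config.vsh.yaml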
Step 3: Create a script
Next, create a script that will generate or load the dataset. Here we show an example script that generates a random dataset, but check out src/datasets/loaders for real data examples. The script must ensure that the output AnnData object has the format described in the “Format of a raw dataset object” section.
Contents of src/datasets/loaders/myloader/script.py:
import anndata as ad
import pandas as pd
import scipy.sparse
import numpy as np
import random

## VIASH START
par = {"output": "output.h5ad"}
## VIASH END

# Create obs data for the cells
obs = pd.DataFrame({
    "cell_type": random.choices(["enterocyte", "intestine goblet cell", "stem cell"], k=100),
    "batch": random.choices(["experiment1", "experiment2"], k=100),
    "tissue": random.choices(["colon", "ileum"], k=100)
})

# Create var data for the genes
var = pd.DataFrame(
    index=["APP", "AXL", "ADA", "AMH"]
)

# Create counts data
counts = scipy.sparse.csr_matrix(
    np.random.poisson(0.3, (obs.shape[0], var.shape[0]))
)

# Create an AnnData dataset
adata = ad.AnnData(
    layers={
        "counts": counts
    },
    obs=obs,
    var=var,
    uns={
        "dataset_id": "my_dataset",
        "dataset_name": "My dataset",
        "dataset_url": "https://url.to/dataset/source",
        "dataset_reference": "mydatasetbibtexreference",
        "dataset_summary": "A short description of the dataset.",
        "dataset_description": "Long description of the dataset.",
        "dataset_organism": "homo_sapiens"
    }
)

# Write to file
adata.write_h5ad(par["output"], compression="gzip")
Contents of src/datasets/loaders/myloader/script.R:
library(anndata)
library(Matrix)
library(dplyr)

## VIASH START
par <- list("output" = "output.h5ad")
## VIASH END

# Create obs data for the cells
obs <- data.frame(
  "cell_type" = sample(c("enterocyte", "intestine goblet cell", "stem cell"), 100, replace = TRUE),
  "batch" = sample(c("experiment1", "experiment2"), 100, replace = TRUE),
  "tissue" = sample(c("colon", "ileum"), 100, replace = TRUE)
)

# Create var data for the genes
var <- data.frame(
  row.names = c("APP", "AXL", "ADA", "AMH")
)

# Create counts data
counts <- Matrix::rsparsematrix(
  nrow(obs),
  nrow(var),
  density = 0.3,
  rand.x = function(n) rpois(n, 100)
)

# Create an AnnData dataset
adata <- AnnData(
  layers = list(
    "counts" = counts
  ),
  obs = obs,
  var = var,
  uns = list(
    dataset_id = "my_dataset",
    dataset_name = "My dataset",
    dataset_url = "https://url.to/dataset/source",
    dataset_reference = "mydatasetbibtexreference",
    dataset_summary = "A short description of the dataset.",
    dataset_description = "Long description of the dataset.",
    dataset_organism = "homo_sapiens"
  )
)

# Write to file
adata$write_h5ad(par[["output"]], compression = "gzip")
Step 4: Run the component
Try running your component! You can start off by running your script inside your IDE.
To check whether your component works as a standalone component, run the following commands.
(Re)build the Docker container after changing the platforms section in the Viash config:
viash run src/datasets/loaders/myloader/config.vsh.yaml -- \
---setup cachedbuild
Run the component:
viash run src/datasets/loaders/myloader/config.vsh.yaml -- \
--output mydataset.h5ad
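To verify the output, you can open the resulting file in a Python session. This is only a quick sanity check and assumes the anndata package is installed locally:
import anndata as ad

# Load the generated dataset and print a summary of its slots
adata = ad.read_h5ad("mydataset.h5ad")
print(adata)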
Parameters
It’s possible to add arguments to the dataset loader by adding additional entries to the functionality.arguments section in the config.vsh.yaml. For example:
arguments:
  - name: "--n_obs"
    type: "integer"
    description: "Number of cells to generate."
    default: 100
  - name: "--n_vars"
    type: "integer"
    description: "Number of genes to generate."
    default: 100
You can then use the n_obs and n_vars values in the par object to get access to the runtime parameters:
In Python:
obs = pd.DataFrame({
    "cell_type": random.choices(["enterocyte", "intestine goblet cell", "stem cell"], k=par["n_obs"]),
    "batch": random.choices(["experiment1", "experiment2"], k=par["n_obs"]),
    "tissue": random.choices(["colon", "ileum"], k=par["n_obs"])
})
var = pd.DataFrame(
    index=[f"Gene_{i}" for i in range(par["n_vars"])]
)
In R:
obs <- data.frame(
  "cell_type" = sample(c("enterocyte", "intestine goblet cell", "stem cell"), par$n_obs, replace = TRUE),
  "batch" = sample(c("experiment1", "experiment2"), par$n_obs, replace = TRUE),
  "tissue" = sample(c("colon", "ileum"), par$n_obs, replace = TRUE)
)
var <- data.frame(
  row.names = paste0("Gene_", seq_len(par$n_vars))
)
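After updating the config and script, the new arguments can be passed on the command line in the same way as --output. For example (the values shown are arbitrary):
viash run src/datasets/loaders/myloader/config.vsh.yaml -- \
  --output mydataset.h5ad \
  --n_obs 500 \
  --n_vars 200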
Format of a raw dataset object
Ideally, the AnnData output object should at least contain the following slots:
AnnData object
obs: 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'organism', 'organism_ontology_term_id', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id', 'batch', 'soma_joinid'
var: 'feature_id', 'feature_name', 'soma_joinid'
layers: 'counts'
uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism'
The full file format spec of a raw dataset AnnData is shown in the table below.
Slot | Type | Description |
---|---|---|
obs["dataset_id"] | string | (Optional) Identifier for the dataset from which the cell data is derived, useful for tracking and referencing purposes. |
obs["assay"] | string | (Optional) Type of assay used to generate the cell data, indicating the methodology or technique employed. |
obs["assay_ontology_term_id"] | string | (Optional) Experimental Factor Ontology (EFO) term identifier for the assay, providing a standardized reference to the assay type. |
obs["cell_type"] | string | (Optional) Classification of the cell type based on its characteristics and function within the tissue or organism. |
obs["cell_type_ontology_term_id"] | string | (Optional) Cell Ontology (CL) term identifier for the cell type, offering a standardized reference to the specific cell classification. |
obs["development_stage"] | string | (Optional) Stage of development of the organism or tissue from which the cell is derived, indicating its maturity or developmental phase. |
obs["development_stage_ontology_term_id"] | string | (Optional) Ontology term identifier for the developmental stage, providing a standardized reference to the organism’s developmental phase. If the organism is human (organism_ontology_term_id == 'NCBITaxon:9606'), then the Human Developmental Stages (HsapDv) ontology is used. If the organism is mouse (organism_ontology_term_id == 'NCBITaxon:10090'), then the Mouse Developmental Stages (MmusDv) ontology is used. Otherwise, the Uberon (UBERON) ontology is used. |
obs["disease"] | string | (Optional) Information on any disease or pathological condition associated with the cell or donor. |
obs["disease_ontology_term_id"] | string | (Optional) Ontology term identifier for the disease, enabling standardized disease classification and referencing. Must be a term from the Mondo Disease Ontology (MONDO), or PATO:0000461 from the Phenotype And Trait Ontology (PATO). |
obs["donor_id"] | string | (Optional) Identifier for the donor from whom the cell sample is obtained. |
obs["is_primary_data"] | boolean | (Optional) Indicates whether the data is primary (directly obtained from experiments) or has been computationally derived from other primary data. |
obs["organism"] | string | (Optional) Organism from which the cell sample is obtained. |
obs["organism_ontology_term_id"] | string | (Optional) Ontology term identifier for the organism, providing a standardized reference for the organism. Must be a term from the NCBI Taxonomy Ontology (NCBITaxon) which is a child of NCBITaxon:33208. |
obs["self_reported_ethnicity"] | string | (Optional) Ethnicity of the donor as self-reported, relevant for studies considering genetic diversity and population-specific traits. |
obs["self_reported_ethnicity_ontology_term_id"] | string | (Optional) Ontology term identifier for the self-reported ethnicity, providing a standardized reference for ethnic classifications. If the organism is human (organism_ontology_term_id == 'NCBITaxon:9606'), then the Human Ancestry Ontology (HANCESTRO) is used. |
obs["sex"] | string | (Optional) Biological sex of the donor or source organism, crucial for studies involving sex-specific traits or conditions. |
obs["sex_ontology_term_id"] | string | (Optional) Ontology term identifier for the biological sex, ensuring standardized classification of sex. Only PATO:0000383, PATO:0000384 and PATO:0001340 are allowed. |
obs["suspension_type"] | string | (Optional) Type of suspension or medium in which the cells were stored or processed, important for understanding cell handling and conditions. |
obs["tissue"] | string | (Optional) Specific tissue from which the cells were derived, key for context and specificity in cell studies. |
obs["tissue_ontology_term_id"] | string | (Optional) Ontology term identifier for the tissue, providing a standardized reference for the tissue type. For organoid or tissue samples, the Uber-anatomy ontology (UBERON) is used; the term ids must be a child term of UBERON:0001062 (anatomical entity). For cell cultures, the Cell Ontology (CL) is used; the term ids cannot be CL:0000255, CL:0000257 or CL:0000548. |
obs["tissue_general"] | string | (Optional) General category or classification of the tissue, useful for broader grouping and comparison of cell data. |
obs["tissue_general_ontology_term_id"] | string | (Optional) Ontology term identifier for the general tissue category, aiding in standardizing and grouping tissue types. For organoid or tissue samples, the Uber-anatomy ontology (UBERON) is used; the term ids must be a child term of UBERON:0001062 (anatomical entity). For cell cultures, the Cell Ontology (CL) is used; the term ids cannot be CL:0000255, CL:0000257 or CL:0000548. |
obs["batch"] | string | (Optional) A batch identifier. This label is very context-dependent and may be a combination of the tissue, assay, donor, etc. |
obs["soma_joinid"] | integer | (Optional) If the dataset was retrieved from the CELLxGENE census, this is a unique identifier for the cell. |
var["feature_id"] | string | (Optional) Unique identifier for the feature, usually an Ensembl gene ID. |
var["feature_name"] | string | A human-readable name for the feature, usually a gene symbol. |
var["soma_joinid"] | integer | (Optional) If the dataset was retrieved from the CELLxGENE census, this is a unique identifier for the feature. |
layers["counts"] | integer | Raw counts. |
uns["dataset_id"] | string | A unique identifier for the dataset. This is different from the obs.dataset_id field, which is the identifier for the dataset from which the cell data is derived. |
uns["dataset_name"] | string | A human-readable name for the dataset. |
uns["dataset_url"] | string | (Optional) Link to the original source of the dataset. |
uns["dataset_reference"] | string | (Optional) Bibtex reference of the paper in which the dataset was published. |
uns["dataset_summary"] | string | Short description of the dataset. |
uns["dataset_description"] | string | Long description of the dataset. |
uns["dataset_organism"] | string | (Optional) The organism of the sample in the dataset. |
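If you want to check a freshly generated dataset against this spec, a minimal sanity-check script could look like the sketch below. The slot names are taken from the table above; the script itself is a hypothetical helper, not part of the OpenProblems tooling:
import anndata as ad

adata = ad.read_h5ad("mydataset.h5ad")

# Slots that are not marked as optional in the table above
assert "counts" in adata.layers, "missing layers['counts']"
for key in ["dataset_id", "dataset_name", "dataset_summary", "dataset_description"]:
    assert key in adata.uns, f"missing uns['{key}']"

# var['feature_name'] is the only var column not marked as optional
if "feature_name" not in adata.var.columns:
    print("warning: var['feature_name'] is not set")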