Dataset processor
In this section, we’ll create the Dataset processor component and generate the first test resources. The Dataset processor component uses one of the pre-defined Common datasets to create the AnnData files used in your task. These AnnData object can then serve as the first test resources to be used to test Method and Metric components.
Why?
The Dataset processor component splits up the Common dataset into different AnnData objects such that both Method and Metric components. This is done so the Method and Metric can only observe the information specified by the file format specification files. This not only safe-guards against data leakage, but also ensures that the Method component can’t accidentally make changes to the data structure that the Metric component relies on.
How?
The Dataset processor, like any component in OpenProblems, is a Viash component, which consists of a config and a script.
Step 1: Create the Viash config
Start by creating the Viash config. Luckily, we already created its file format and argument list in the previous section, so we can just import the interface using the __merge__
field:
src/tasks/<task_id>/process_dataset/config.vsh.yaml
1__merge__: ../api/comp_process_dataset.yaml
functionality:
2 name: "process_dataset"
3 resources:
4 - type: python_script
path: script.py
5 - path: /src/common/helper_functions/subset_anndata.py
platforms:
6 - type: docker
image: "ghcr.io/openproblems-bio/base_images/python:1.1.0"
setup:
- type: python
packages:
- pyyaml
7 - type: nextflow
directives:
label: [ highmem, highcpu ]
- 1
- The interface to use for this component.
- 2
-
The name of the component. In this case, this should always be set to
process_dataset
. - 3
- Resources are files that provide the logic of the component.
- 4
- We’ll create the Python script in the next step.
- 5
- Optional helper functions you can use in your script.
- 6
- A Docker platform to make the component reproducible.
- 7
- A Nextflow platform to allow turning this component into a Nextflow module (and also standalone pipeline).
src/tasks/<task_id>/process_dataset/config.vsh.yaml
1__merge__: ../api/comp_process_dataset.yaml
functionality:
2 name: "process_dataset"
3 resources:
4 - type: r_script
path: script.R
5 - path: /src/common/helper_functions/subset_anndata.R
platforms:
6 - type: docker
image: ghcr.io/openproblems-bio/base_images/r:1.1.0
7 - type: nextflow
directives:
label: [ highmem, highcpu ]
- 1
- The interface to use for this component.
- 2
-
The name of the component. In this case, this should always be set to
process_dataset
. - 3
- Resources are files that provide the logic of the component.
- 4
- We’ll create the Python script in the next step.
- 5
- (Optional) Helper functions you can use in your script.
- 6
- A Docker platform to make the component reproducible.
- 7
- A Nextflow platform to allow turning this component into a Nextflow module (and also standalone pipeline).
Step 2: Create the script
Next, create the script which will split the common dataset into a train
src/tasks/<task_id>/process_dataset/script.py
import anndata as ad
## VIASH START
= {
par "input": "resources_test/common/pancreas/dataset.h5ad",
"output_dataset": "dataset.h5ad",
"output_solution": "solution.h5ad",
}## VIASH END
print(">> Load common dataset", flush=True)
= ad.read_h5ad(par["input"])
adata
# <- implement logic for splitting the common dataset
# into a dataset and a solution object here.
print(">> Create dataset for methods", flush=True)
= ad.AnnData(
output_dataset # <- add the data you wish to store in the output_dataset object here
)
print(">> Create solution object for metrics", flush=True)
= ad.AnnData(
output_solution # <- add the data you wish to store in the output_solution object here
)
print(">> Write to disk", flush=True)
"output_dataset"])
output_dataset.write_h5ad(par["output_solution"]) output_solution.write_h5ad(par[
src/tasks/<task_id>/process_dataset/script.R
library(anndata)
## VIASH START
<- list(
par input = "resources_test/common/pancreas/dataset.h5ad",
output_dataset = "dataset.h5ad",
output_solution = "solution.h5ad",
)## VIASH END
cat(">> Load common dataset\n")
<- anndata::read_h5ad(par[["input"]])
adata
# <- implement logic for splitting the common dataset
# into a dataset and a solution object here.
cat(">> Create dataset for methods\n")
<- anndata::AnnData(
output_dataset # <- add the data you wish to store in the output_dataset object here
)
cat(">> Create solution object for metrics\n")
<- anndata::AnnData(
output_solution # <- add the data you wish to store in the output_solution object here
)
print(">> Write to disk\n")
$write_h5ad(par[["output_dataset"]])
output_dataset$write_h5ad(par[["output_solution"]]) output_solution
You can use helper functions read_config_slots_info
and subset_anndata
defined in src/common/helper_functions/subset_anndata.py
to easily subset the AnnData to having only the slots specified by the corresponding API files. Examples of tasks where this is used: Dimensionality reduction, Label projection.
Step 3: Create test resources
Next, we’ll create part of the test resources by creating a test resource script. Create a Bash script which runs the process_dataset
component.
src/tasks/<task_id>/resource_test_script/pancreas.sh
#!/bin/bash
# get the root of the directory
REPO_ROOT=$(git rev-parse --show-toplevel)
# ensure that the command below is run from the root of the repository
cd "$REPO_ROOT"
TASK_ID="<task_id>"
RAW_DATA=resources_test/common/pancreas/dataset.h5ad
DATASET_DIR=resources_test/$TASK_ID/pancreas
if [ ! -f $RAW_DATA ]; then
echo "Error! Could not find raw data"
exit 1
fi
mkdir -p $DATASET_DIR
# Run the process dataset component
viash run "src/tasks/$TASK_ID/process_dataset/config.vsh.yaml" -- \
--input "$RAW_DATA" \
--output_dataset "$DATASET_DIR/dataset.h5ad" \
--output_solution "$DATASET_DIR/solution.h5ad"
# TODO: uncomment and update this code block when a method component has been added to your task
# # Run one method
# viash run "src/tasks/$TASK_ID/methods/<method_id>/config.vsh.yaml" -- \
# --input "$DATASET_DIR/dataset.h5ad" \
# --output "$DATASET_DIR/prediction.h5ad"
# TODO: uncomment and update this code block when a metric component has been added to your task
# # Run one metric
# viash run src/tasks/$TASK_ID/metrics/<metric_id>/config.vsh.yaml -- \
# --input_prediction $DATASET_DIR/prediction.h5ad \
# --input_solution $DATASET_DIR/solution.h5ad \
# --output $DATASET_DIR/score.h5ad
You’ll have to substitute the correct value for <task_id>
, and also update the arguments to the output arguments set in the API file.
Now run the script to generate the first couple of test resources:
chmod +x src/tasks/<task_id>/resource_test_script/pancreas.sh
src/tasks/<task_id>/resource_test_script/pancreas.sh
Next steps
Having created the dataset processor and the first test resources, you can now start creating method and metric components. We recommend creating one method component and then one metric component first, so you can use those components to generate the test resources needed for testing all of the components.