Design the API
Having created the task metadata file (see “Define task”), we will next define the types of components and file formats your task consists of. Concretely, you need to define:
- Which types of components your task consists of, that is, a dataset processor, methods, control methods, and metrics.
- Which interface each of these component types has, that is, one or more input files and one or more output files.
- The file format of each of these files (typically an AnnData).
When put together, a typical task API looks somewhat like the diagram shown in Figure 1.
The dimensionality reduction task is an example of an OpenProblems task with this topology: the output is an embedding of the original dataset, and the solution contains cell annotations, which are used to verify whether the resulting embedding represents the intended biological information.
Why?
Having a formally defined API ensures consistency and interoperability across different components of your task. This makes it easier for others to contribute and build upon your work. Not only that, but creating API files (partially) automates the following steps:
- Creating new methods and metrics using the `create_component` component.
- Automated testing of components.
- Generating reference documentation.
- Implementing a component for processing the common datasets.
How?
We’ll need to create API files for each component and AnnData file separately. However, this is actually quite easy to do, as we will show in the following sections.
Step 1: Create task API diagram
Start by creating a diagram similar to what is shown in Figure 1. We recommend drawing the diagram on paper first.
Here are the most common types of components and file formats:
- Common dataset: OpenProblems offers a standard collection of datasets, which can be used to kickstart a new task.
- Dataset processor: This component ingests a Common dataset and splits it into one or more task-specific dataset objects. We recommend at least having a Dataset and Solution object, such that a Method component never “sees” the ground-truth information needed by the Metric component.
- Dataset: The data used by a method to create an output (i.e. prediction).
- Solution: The ground-truth information needed by a Metric to compare an output against. It’s highly recommended to store the ground-truth information as a separate AnnData object, such that a Method cannot (accidentally) cheat.
- Method: An algorithm that makes predictions for, or otherwise processes, an input dataset.
- Control method: A quality control for methods, metrics and the pipeline as a whole. A control method can either be a positive control (which uses the ground-truth information from the solution to create a perfect output) or a negative control (which uses random distributions to generate outputs in the correct format).
- Output: The output generated by a (control) method.
- Metric: A quantitative measure used to evaluate the performance of a method.
- Score: An AnnData object containing one or more metric values.
Figure 2 and Figure 3 are examples of two OpenProblems tasks with slightly different workflow layouts.
Step 2: Create file formats
Now that you’ve created the topology of the task workflow, the next step is to translate that information into the required file format specification files.
Let’s start by creating one for the solution object:
src/tasks/<task_id>/api/file_solution.yaml
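At this stage, a minimal skeleton of this file might look somewhat like the sketch below, reconstructed from the full specification shown in Step 4 (the placeholder values are illustrative). The numbered comments correspond to the annotations that follow:

```yaml
type: file                                                  # (1)
description: "FILL IN: what this file represents"           # (2)
example: "resources_test/<task_id>/pancreas/solution.h5ad"  # (3)
info:
  label: Solution                                           # (4)
  slots:                                                    # (5) to be filled in during Step 4
```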
1. This YAML file will be used to define an argument of a Viash component. The `type` must always be set to `file`.
2. A description of the file, useful for quickly understanding what type of data such a file represents. Used for generating reference documentation.
3. An example of this file. At this stage, the file does not exist yet, but it will be created later on, as it is used for unit testing components.
4. A short label used to represent the file in diagrams in the reference documentation.
5. The AnnData slots which need to be present in the file. These will be defined in Step 4.
Create a YAML file for each of the other AnnData files in the task workflow, for example `src/tasks/<task_id>/api/file_dataset.yaml`, `src/tasks/<task_id>/api/file_output.yaml`, and so on.
Each file format specification file is actually a Viash file argument. That’s because these YAML files will be used as arguments in the different component types.
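For instance, the dataset file specification could start out as follows. This is a sketch following the same pattern as above; the example path is an assumption:

```yaml
# src/tasks/<task_id>/api/file_dataset.yaml
type: file
description: "FILL IN: what this file represents"
example: "resources_test/<task_id>/pancreas/dataset.h5ad"
info:
  label: Dataset
  slots: # to be filled in during Step 4
```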
Step 3: Create component types
Next, we will create the API specification files for each of the components (i.e. purple rhomboids) in your diagram.
Start by creating the method component type:
src/tasks/<task_id>/api/comp_method.yaml
```yaml
functionality:
  namespace: <task_id>/methods  # (1)
  info:                         # (2)
    type: method                # (3)
    type_info:
      label: "Method"           # (4)
      description: "FILL IN: A description of what this type of component does."  # (5)
  arguments:                    # (6)
    - name: "--input"
      __merge__: file_dataset.yaml
      required: true
    - name: "--output"
      __merge__: file_prediction.yaml
      required: true
      direction: output
```
1. The `namespace` for the component type, in the format `<task_id>/<component_type>`. The namespace is used to group similar components together and ensures that they can be easily found and used within the task.
2. Metadata about the component type.
3. A unique identifier for the type of component.
4. A formatted label for the component type.
5. A description of the component type.
6. The `arguments` that the component accepts. Each argument has a name (e.g. `--input`), a direction (`input` (default) or `output`), and whether or not it is required. Note that this information is partially provided by merging in the file API YAML specified earlier, using the `__merge__` notation (see the illustration below).
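Conceptually, `__merge__` inlines the fields of the referenced file specification into the argument. After merging, the `--input` argument behaves roughly as if it had been written as follows (a simplified illustration, not the exact merged output):

```yaml
- name: "--input"
  type: file
  description: "FILL IN: what this file represents"
  example: "resources_test/<task_id>/pancreas/dataset.h5ad"
  required: true
```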
Create a YAML file for each of the other component types in the task workflow, for example `src/tasks/<task_id>/api/comp_dataset_processor.yaml`, `src/tasks/<task_id>/api/comp_metric.yaml`, and so on.
Again, each component type is formatted as a Viash config file, because these files will be used to create components.
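As an illustration, a metric component type, which needs both the method output and the solution as inputs, might look somewhat like the sketch below. The argument names here are assumptions, not a prescribed interface:

```yaml
# src/tasks/<task_id>/api/comp_metric.yaml
functionality:
  namespace: <task_id>/metrics
  info:
    type: metric
    type_info:
      label: "Metric"
      description: "FILL IN: A description of what this type of component does."
  arguments:
    - name: "--input_prediction"
      __merge__: file_prediction.yaml
      required: true
    - name: "--input_solution"
      __merge__: file_solution.yaml
      required: true
    - name: "--output"
      __merge__: file_score.yaml
      required: true
      direction: output
```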
Step 4: Add slots to file formats
Finally, the last step is to define the actual required and optional slots in each of the file format specifications. Since each of these files is an AnnData HDF5 file, the file format specification is structured analogously to the AnnData data structures: `layers`, `obs`, `obsm`, `obsp`, `var`, `varm`, `varp`, and `uns`.
Below is the slot information of the solution AnnData object:
src/tasks/<task_id>/api/file_solution.yaml
```yaml
type: file
description: "FILL IN: what this file represents"
example: "resources_test/<task_id>/pancreas/solution.h5ad"
info:
  label: Solution
  slots:                # (1)
    layers:             # (2)
      - type: integer
        name: counts
        description: Raw counts
      - type: double
        name: normalized
        description: Normalized counts
    obs:                # (3)
      - type: string
        name: label
        description: Ground truth cell type labels
      - type: string
        name: batch
        description: Batch information
    var:                # (4)
      - type: boolean
        name: hvg
        description: Whether or not the feature is considered to be a 'highly variable gene'
        required: true
      - type: integer
        name: hvg_score
        description: A ranking of the features by hvg.
        required: true
    obsm:               # (5)
      - type: double
        name: X_pca
        description: The resulting PCA embedding.
        required: true
    uns:                # (6)
      - type: string
        name: dataset_id
        description: "A unique identifier for the dataset"
        required: true
      - type: string
        name: normalization_id
        description: "Which normalization was used"
        required: true
```
1. The mandatory and optional slots in the AnnData file.
2. Specification of one or more AnnData layers (matrices).
3. Specification for cell-level metadata (one or more columns).
4. Specification for feature-level metadata (one or more columns).
5. Other AnnData slots.
6. Specification for unstructured data.
Each required or optional slot in the file format should have the following fields:
- `name`: The name of the slot.
- `type`: The data type (`string`, `boolean`, `integer` or `double`).
- `description`: What this data represents.
- `required`: Whether or not this slot is required (default: `true`).
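For example, a slot can be marked as optional by setting `required` to `false` explicitly, as in this hypothetical `obs` column:

```yaml
obs:
  - type: string
    name: cell_cycle_phase
    description: The cell cycle phase of each cell
    required: false
```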
Go through each file format specification file and add the expected slots accordingly.
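For instance, the score file could be specified using only `uns` slots, since it stores metric values rather than a cell-by-gene matrix. The sketch below follows the same pattern as the solution specification; the exact slot names are assumptions:

```yaml
# src/tasks/<task_id>/api/file_score.yaml
type: file
description: "FILL IN: what this file represents"
example: "resources_test/<task_id>/pancreas/score.h5ad"
info:
  label: Score
  slots:
    uns:
      - type: string
        name: dataset_id
        description: "A unique identifier for the dataset"
        required: true
      - type: string
        name: method_id
        description: "A unique identifier for the method"
        required: true
      - type: string
        name: metric_ids
        description: "One or more unique metric identifiers"
        required: true
      - type: double
        name: metric_values
        description: "The metric values obtained, in the same order as metric_ids"
        required: true
```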
Look at the Common dataset reference docs to see which slots the common datasets have. The AnnData file at `resources_test/common/pancreas/dataset.h5ad` is also an example of a Common dataset, though note that this object contains more slots than what is defined by the spec.