Repository
In the OpenProblems codebase, the src directory contains Viash components that manage various aspects of the project, such as common datasets, tasks, and common processing components. The target folder is where artifacts generated from these Viash components are stored, including Dockerized Nextflow modules. The resources_test directory contains the test resources required for running unit tests on the Viash components. Note that these test resources are not stored in the git repository. Instead, they are obtained by running the sync test resources component (see “Getting started”).
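As a quick orientation aid, the presence of the three top-level directories described above can be checked with a few lines of Python. This is a minimal sketch and not part of the OpenProblems tooling: the helper name describe_layout is hypothetical, and only the directory names and their roles come from the text.

```python
from pathlib import Path

# Directory names are taken from the text above; the descriptions paraphrase it.
TOP_LEVEL_DIRS = {
    "src": "Viash components (common, datasets, tasks)",
    "target": "artifacts built from the Viash components",
    "resources_test": "test resources for unit tests (not tracked in git)",
}


def describe_layout(repo_root):
    """Return {directory name: True/False} for each expected top-level directory.

    Hypothetical helper for illustration; it only checks that each directory exists.
    """
    root = Path(repo_root)
    return {name: (root / name).is_dir() for name in TOP_LEVEL_DIRS}


if __name__ == "__main__":
    # Report on the current working directory, assumed to be the repository root.
    for name, present in describe_layout(".").items():
        status = "found" if present else "missing"
        print(f"{name:<15} {status:<8} {TOP_LEVEL_DIRS[name]}")
```

In a fresh clone, resources_test would be expected to show as missing until the sync test resources component has been run.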
The main data flow of the pipeline is shown in Figure 1. The common dataset components create common dataset objects which are used in one or more tasks.
Directory Structure
src/common
: This subdirectory contains helper components that help with creating new components, unit testing other components, or managing task results.
src/datasets
: The dataset processing pipeline uses dataset loaders to create raw dataset files. The raw dataset files are then processed to generate common dataset files. Common dataset files are used in one or more tasks.
src/tasks/<task_id>
: Each task should contain a data processor (to transform common datasets into task-specific datasets), methods, control methods (for quality control), and metrics.
resources_test
: This directory contains the test resources required for running unit tests on the Viash components.
target
: This directory contains the artifacts built from the Viash components in the src directory.
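The flow described in the list above (dataset loader, raw dataset, common dataset, task-specific dataset, methods and control methods, metrics) can be sketched in plain Python. This is an illustrative toy, not the actual pipeline: real components are Viash components that exchange files, whereas here dictionaries stand in for those files. Every function name, field name, and task identifier below is hypothetical, and treating the control method as a trivial baseline is an assumption.

```python
# Toy stand-ins: dictionaries play the role of dataset files.

def load_dataset(dataset_id):
    """Dataset loader: produce a raw dataset (hypothetical fields)."""
    return {"dataset_id": dataset_id, "counts": [[1, 0, 3], [0, 2, 1]], "stage": "raw"}


def process_to_common(raw):
    """Dataset processor: turn a raw dataset into a common dataset."""
    common = dict(raw)
    common["stage"] = "common"
    # Example processing step: total counts per row.
    common["total_counts"] = [sum(row) for row in raw["counts"]]
    return common


def process_for_task(common, task_id):
    """Task data processor: derive a task-specific dataset from a common one."""
    task_dataset = dict(common)
    task_dataset["stage"] = "task"
    task_dataset["task_id"] = task_id
    return task_dataset


def run_method(task_dataset):
    """Method: here it simply returns the known totals, so its error is zero."""
    return list(task_dataset["total_counts"])


def run_control_method(task_dataset):
    """Control method (assumed baseline): always predicts zero."""
    return [0 for _ in task_dataset["total_counts"]]


def mean_absolute_error(task_dataset, prediction):
    """Metric: score a prediction against the task dataset."""
    truth = task_dataset["total_counts"]
    return sum(abs(t - p) for t, p in zip(truth, prediction)) / len(truth)


common = process_to_common(load_dataset("toy_dataset"))

# One common dataset can feed several tasks.
scores = {}
for task_id in ["task_a", "task_b"]:
    task_dataset = process_for_task(common, task_id)
    scores[task_id] = {
        "method": mean_absolute_error(task_dataset, run_method(task_dataset)),
        "control": mean_absolute_error(task_dataset, run_control_method(task_dataset)),
    }

print(scores)
```

The point of the sketch is the shape of the flow: the common dataset is built once and then reused by each task, and within a task the same metric scores both the methods and the control methods so that the results are comparable.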
Technology stack
AnnData: A file format designed for handling annotated, high-dimensional biological data (Virshup et al. 2021). In OpenProblems, AnnData serves as the standard data format for both input and output files of components, ensuring a consistent and seamless exchange of data between different components of the benchmarking pipelines.
AWS: Amazon Web Services provides scalable and cost-effective cloud computing and storage. OpenProblems uses AWS to store datasets and test resources and to run the Nextflow benchmarking pipelines.
CELLxGENE Census: A cloud-based library of single-cell RNA sequencing (scRNA-seq) datasets, developed by the Chan Zuckerberg Initiative. OpenProblems uses the CELLxGENE Census platform to fetch datasets for benchmarking.
Docker: Provides a consistent and reproducible environment for building, packaging, and deploying applications and dependencies across different platforms. Docker images are generated by Viash and stored on ghcr.io.
GitHub Actions: A continuous integration and continuous deployment (CI/CD) platform integrated with GitHub. This project uses GitHub Actions to continuously build and unit test its components.
Nextflow: A workflow management system that simplifies the design, deployment, and execution of complex data processing pipelines, enabling seamless scaling and parallelization. All Nextflow modules are generated by Viash and are stored in the target/nextflow/ folder in the project releases (and on the main_build branch).
Python: A widely used, high-level programming language, offering extensive libraries and packages for data manipulation, analysis, and machine learning. Most of the OpenProblems components are written in Python.
R: A programming language and software environment for statistical computing and graphics, widely used in data analysis and bioinformatics. OpenProblems also offers support for R components.
Viash: A tool that facilitates the creation of modular pipeline components by allowing developers to combine a code block or script with a small amount of metadata (Cannoodt et al. 2021). Viash components are used in OpenProblems for dataset loaders, dataset processors, methods, and metrics, enabling developers to focus on the core functionality of their components without worrying about the chosen pipeline framework.