LEMOrtho

---


# 1. Methods Kuznetsov et al. 2022

The benchmark included in this manuscript is a port of our continuous benchmarking concept https://lemmi.ezlab.org (http://dx.doi.org/10.1101/gr.260398.119), which evaluates taxonomic classifiers for metagenomics. The expected workflow is that we populate the benchmark with publicly available methods, wrapping them in software containers (Docker, https://www.docker.com/, or Singularity, https://apptainer.org/), and run them in our computational environment while measuring computational usage (memory, runtime). Results are presented on https://lemortho.ezlab.org; they can be updated rapidly every time a new software or an updated version is made available as a container, either by our team or by developers submitting their tool to us. This enables the most up-to-date version to be evaluated, so prospective users of orthology delineation methods do not have to rely on obsolete evaluations. We hope that offering an online benchmark that can be updated within days of receiving a new method will encourage developers to be proactive and submit their latest developments.

# The evaluation workflow

The benchmarking pipeline is based on Snakemake 6 (http://dx.doi.org/10.1093/bioinformatics/bty350) and can run each tool in either a Docker or a Singularity container. Any existing container for a method can be used; a single script, LEM_ORTHO_analysis.sh, needs to be added to call the commands specific to the benchmark. The inputs are the protein sets on which the orthology delineation has to be performed, and the expected output is a TSV file matching gene ids to ortholog group ids. For instance:

```
7867 ENSPTRP00000002962
7867 ENSRNOP00000068794
7867 ENSTNIP00000021314
7868 ENSCAFP00000020005
7868 ENSP00000485750
7868 ENSP00000486772
```
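A file in this format can be loaded into clusters with a few lines of Python (a minimal illustrative sketch, not part of LEMOrtho):

```python
from collections import defaultdict

def read_clusters(lines):
    """Parse LEMOrtho-style rows ("group_id<TAB>gene_id") into
    a {group_id: set(gene_ids)} mapping."""
    clusters = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # tolerate blank or malformed lines
        group, gene = fields
        clusters[group].add(gene)
    return dict(clusters)
```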

The resulting set of clusters is evaluated against the ground truth by an in-house program (cmpclus). For each reference cluster, it finds all overlapping clusters in the candidate set. A match may be ignored if the number of matching genes is too small. From the obtained set of matches, it computes recall, precision and F1. Any predicted cluster with no gene in the reference is ignored. The matches are also classified as either split or merge events. In addition, the program computes a distance metric between the two sets of clusters, the "Variation of information" (http://dx.doi.org/10.1007/978-3-540-45167-9_14); the value computed is normalised such that 0.0 represents a perfect match and 1.0 a perfect mismatch. It is a true metric in the sense that it obeys the triangle inequality.
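The per-match scores can be sketched as follows (an illustrative Python sketch of recall, precision and F1 for one reference-candidate match; cmpclus's exact thresholds and aggregation over matches may differ):

```python
def match_scores(reference, candidate):
    """Recall, precision and F1 between a reference cluster and an
    overlapping candidate cluster, both given as sets of gene ids."""
    overlap = len(reference & candidate)
    recall = overlap / len(reference)      # fraction of the reference recovered
    precision = overlap / len(candidate)   # fraction of the candidate that is correct
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return recall, precision, f1
```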

A note on variation of information: it assumes a set K of elements and two partitions of K, X and Y, where both contain all elements of K but partitioned differently. cmpclus uses the total number of genes in X+Y (only genes from clusters that have been matched). Hence, to get a properly correct VI, one should only count genes that are common to X and Y.

So when we compare two different candidate sets of clusters, the distance metric VI is not exactly comparable with the value obtained when comparing the reference with a candidate.
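Restricting VI to the genes common to both partitions, as suggested above, could look like this (an illustrative Python sketch, not cmpclus itself; normalising by the log(n) upper bound is an assumption about the normalisation scheme):

```python
import math
from collections import Counter

def variation_of_information(x, y):
    """Normalised variation of information between two partitions given as
    {gene: cluster_label} dicts, computed only over the genes common to both.
    VI = H(X|Y) + H(Y|X); dividing by log(n) maps a perfect match to 0.0."""
    genes = x.keys() & y.keys()  # restrict to common elements
    n = len(genes)
    if n <= 1:
        return 0.0
    joint = Counter((x[g], y[g]) for g in genes)
    px = Counter(x[g] for g in genes)
    py = Counter(y[g] for g in genes)
    vi = 0.0
    for (cx, cy), nxy in joint.items():
        p = nxy / n
        # -p*log(p/p(x)) - p*log(p/p(y)) summed over cells gives H(Y|X)+H(X|Y)
        vi -= p * (math.log(p / (px[cx] / n)) + math.log(p / (py[cy] / n)))
    return vi / math.log(n)
```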

# The webapp

https://lemortho.ezlab.org is built with Vue 3 (https://vuejs.org/) and the visualizations are based on Apache ECharts (https://echarts.apache.org) and the AG Grid framework (https://www.ag-grid.com/).

# The standalone pipeline

To enable reproducibility and allow developers to prepare and test a software container compatible with LEMOrtho, the pipeline can be obtained from https://gitlab.com/ezlab/lemortho. The documentation can be found in the next section.

# RefOGs

The dataset used as the reference on LEMOrtho is a customized version of the one from the Orthobench revisited study (https://dx.doi.org/10.1093/gbe/evaa211, https://dx.doi.org/10.1002/bies.201100062), from which we removed a few problematic genes:

ENSCAFP00000022898, ENSCINP00000001707, ENSCINP00000002932, ENSCINP00000010295, ENSCINP00000018374, ENSCINP00000025546, ENSCINP00000025552, ENSCINP00000027090, ENSCINP00000035497, ENSDARP00000002330, ENSDARP00000076594, ENSDARP00000131597, ENSDARP00000156542, ENSMODP00000012647, ENSMUSP00000033502, ENSP00000316782, ENSP00000358380, ENSP00000359057, ENSP00000365858, ENSP00000371923, ENSP00000463957, ENSP00000474456, ENSP00000477509, ENSP00000477979, ENSP00000478609, ENSP00000478752, ENSP00000479545, ENSP00000479693, ENSP00000480818, ENSP00000481542, ENSP00000481835, ENSP00000498781, ENSPTRP00000001775, ENSPTRP00000040053, ENSPTRP00000056359, ENSPTRP00000058531, ENSPTRP00000086593, ENSPTRP00000092368, ENSPTRP00000092968, ENSRNOP00000038111, ENSRNOP00000067336, ENSTNIP00000010123, ENSTNIP00000017372, FBpp0073951, FBpp0075520, FBpp0082643, FBpp0082970, FBpp0291496, FBpp0291497, FBpp0309618, WBGene00001184.1, WBGene00001249.1, WBGene00006925.1, WBGene00006926.1, WBGene00006927.1, WBGene00006928.1, WBGene00006929.1, WBGene00006930.1, WBGene00010139.1, WBGene00015237.

A second version is used in which we filtered out all genes from the inputs except the ones belonging to RefOGs.

The datasets ready to be used with LEMOrtho can be found on https://zenodo.org/search?q=ezlab_lemortho.

# Computing environment

All tools were run on a machine with 500 GB of RAM made available to the tools; 64 CPUs were given to each tool. The containers were built with Docker version 20.10 and run with Singularity version 3.8.1-1.el8. The conversions were done with https://quay.io/singularity/docker2singularity. OrthoMCL was run with Docker as it requires the root account to be used. The memory usage reported is based on the peak in resident set size that occurs while the container is loaded.

# Configuration of each tool

Orthologer is based on version 2.6.3.

OrthoFinder is based on the container “docker pull davidemms/orthofinder:2.5.4” and was run with DIAMOND version 2.0.12.

SonicParanoid version 1.3.8 was installed in an ubuntu:20.04 container through pip and was run with MMseqs2 version 45111b641859ed0ddd875b94d6fd1aef1a675b7e.

OMA is based on the container “docker pull dessimozlab/oma_standalone:2.5.0” and was run with the following parameters: DistConfLevel := 2, MinProbContig := 0.4, MaxContigOverlap := 5, MinSeqLenContig := 20, MinBestScore := 250.

OrthoMCL 2.0.9 was manually installed in an Ubuntu 20.04 container with the following parameters: dbVendor=mysql, dbConnectString=dbi:mysql:orthomcl:localhost:mysql_local_infile=1, similarSequencesTable=SimilarSequences, orthologTable=Ortholog, inParalogTable=InParalog, coOrthologTable=CoOrtholog, interTaxonMatchView=InterTaxonMatch, evalueExponentCutoff=-5, percentMatchCutoff=50

Containers are available on https://quay.io/user/ezlab with the suffix _lemortho.



# 2. How to use the LEMOrtho standalone pipeline?

Here you will learn how to set up the standalone version of LEMOrtho, to reproduce or expand what is presented on https://lemortho.ezlab.org.

TIP

Check the compatible containers and compatible datasets sections on the website to recover the content not included with the standalone pipeline.

To successfully run this tutorial, you will need:

  • The Docker or Singularity engine installed and running
  • A working version of Conda

# LEMOrtho code: clone the sources and set ENV variables

```
git clone https://gitlab.com/ezlab/lemortho.git
export LEM_ORTHO_ROOT=/your/path/lemortho
export PATH=${LEM_ORTHO_ROOT}/workflow/scripts:$PATH
```

# Install dependencies in a mamba environment

```
conda install -n base -c conda-forge mamba
mamba env update -n lemortho --file ${LEM_ORTHO_ROOT}/workflow/envs/lem_ortho.yaml
conda activate lemortho

# to deactivate or remove if necessary
conda deactivate
conda remove --name lemortho --all
```

# Container engines

No specific version of Docker is required; LEMOrtho was developed and tested with Docker version 20.10.5, build 55c4c88.

For Singularity, version 3+ is necessary. LEMOrtho was tested in our HPC environment with the following modules: module load GCCcore/8.2.0 Python/3.7.2 Singularity/3.4.0-Go-1.12

LEMOrtho is based on Snakemake and is run by calling:

```
lem_ortho --cores 8                                      # running locally with Docker
lem_ortho --cores 8 --use-singularity                    # running locally with Singularity
lem_ortho --use-singularity --profile cluster --jobs 8   # running with Singularity on a cluster using profiles
```

TIP

You can pass all standard Snakemake parameters, such as --dry-run or --unlock, to the lem_ortho command.

# If using Singularity, export all containers as .sif files

Singularity requires all containers (lem_ortho_master and candidate tools) to be exported as .sif files and placed in the ${LEM_ORTHO_ROOT}/benchmark/sif/ folder.

The name of the file is the name of the equivalent Docker container without the repository and tag. E.g. in the main config file, lemortho_master:quay.io/ezlab/lemortho_master:v1.0 would point to ${LEM_ORTHO_ROOT}/benchmark/sif/lemortho_master.sif. You can keep the full Docker path in the config as long as the .sif file exists.
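The naming rule can be expressed as a small helper (an illustrative Python sketch of the convention described above; LEMOrtho does not ship this function):

```python
def sif_name(docker_path):
    """Derive the expected .sif filename from a Docker image path
    by dropping the repository prefix and the tag."""
    image = docker_path.rsplit("/", 1)[-1]  # drop repository, e.g. quay.io/ezlab/
    name = image.split(":", 1)[0]           # drop tag, e.g. :v1.0
    return name + ".sif"
```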

To convert a Docker image to a Singularity image, you can use the docker2singularity image as follows:

```
docker run -v /var/run/docker.sock:/var/run/docker.sock -v $(pwd):/output --privileged -t --rm quay.io/singularity/docker2singularity lemortho_master
```

You will obtain a .sif file that you can rename according to your needs and place in the ${LEM_ORTHO_ROOT}/benchmark/sif/ folder.

You can do the same for quay.io/ezlab/orthofinder_254_lemortho.

# Defining runs

In ${LEM_ORTHO_ROOT}/benchmark/yaml/, for each combination of tool and dataset you want to evaluate, create a file named tool.dataset.yaml, for instance orthofinder.refogs_pure.yaml:

```yaml
container: quay.io/ezlab/orthofinder_254_lemortho
dataset: refogs_pure
tmp: orthofinder_pure
params_analysis:
```

The tmp folder will be an isolated folder dedicated to the run. It will be provided as the current working directory to the tool container. If the tool writes in the current directory rather than in absolute paths such as /tmp, no conflicting files will exist between runs. If you rerun with the same tmp folder name, existing files will be available.

TIP

Any compatible container on a public Docker repository can be declared, and LEMOrtho will pull it when run with Docker.

# Let's run it

```
lem_ortho --cores 4
```

This will download the containers (if running with Docker; otherwise you need to manually create the .sif files), run the tools on the datasets, and produce the evaluation.

# Explore the results: files

You can see the predictions made by the tools in ${LEM_ORTHO_ROOT}/benchmark/analysis_outputs/

# Explore the results: web

Once the LEMOrtho pipeline has run to the end, all benchmarking results are in the ${LEM_ORTHO_ROOT}/benchmark/final_results/ folder.

To explore them in the webapp, call:

```
lem_ortho_web_docker
# or
lem_ortho_web_singularity
```

This will start a web server running locally on your machine, on port 8080 if available.

Use a web browser to navigate the results at http://127.0.0.1:8080/.

TIP

If you are running LEMOrtho with a Singularity engine, you will likely need a privileged (root, sudo) account to run a web server like this.

# 3. How to make your own container compatible with LEMOrtho?

You need to include a script that can be executed as follows:

```
docker run -u $(id -u) container_name LEM_ORTHO_analysis.sh
# or
singularity exec container_name.sif LEM_ORTHO_analysis.sh
```

You can adapt the following script, used with OrthoFinder:

```bash
#!/usr/bin/env bash
set -o xtrace

dataset=$1
output_file=$2

# the number of CPUs
if [ -z ${cpus+x} ]; then cpus=$(nproc); fi

cp ../../datasets/$dataset/fasta/* .

# run orthofinder - the command is in the PATH
orthofinder -f . -t $cpus -S diamond

# convert the orthofinder output to a cmpclus-compatible format
output=$(find OrthoFinder/ -name Orthogroups.txt)
if [[ -f $output ]]; then
    conv2odb.py -c $output > $output_file
else
    echo "error: missing output file after orthofinder run $dataset"
fi
```

Once built, your container should use the inputs provided by LEMOrtho (dataset name as $1, output path as $2) and use the fasta files in ../../datasets/$1/fasta/ to delineate the orthologous groups.

The current directory is the tmp folder created by LEMOrtho following the yaml file that defines the run (see above).

The script needs to write a final output file to the path provided as $2, and the structure should match:

```
7867 ENSPTRP00000002962
7867 ENSRNOP00000068794
7867 ENSTNIP00000021314
7868 ENSCAFP00000020005
7868 ENSP00000485750
7868 ENSP00000486772
```

Group id, gene name

The Python script conv2odb.py reads the OrthoFinder output and turns it into the expected format.
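As an illustration of that conversion (a Python sketch assuming OrthoFinder's "OG0000000: geneA geneB ..." line format in Orthogroups.txt; the real conv2odb.py may differ):

```python
def orthogroups_to_rows(lines):
    """Convert Orthogroups.txt-style lines ("group: geneA geneB ...")
    into (group_id, gene_id) rows ready to be written as TSV."""
    rows = []
    for line in lines:
        if ":" not in line:
            continue  # skip headers or blank lines
        group, _, members = line.partition(":")
        for gene in members.split():
            rows.append((group.strip(), gene))
    return rows
```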