[![gemma-wrapper gem version](https://badge.fury.io/rb/bio-gemma-wrapper.svg)](https://badge.fury.io/rb/bio-gemma-wrapper) # GEMMA with LOCO, permutations and slurm support (and caching) ![Genetic associations identified in CFW mice using GEMMA (Parker et al, Nat. Genet., 2016)](cfw.gif) ## Introduction Gemma-wrapper allows running GEMMA with LOCO, GEMMA with caching, GEMMA in parallel (now the default with LOCO), and GEMMA on PBS. Gemma-wrapper is used to run GEMMA as part of the https://genenetwork.org/ environment. Note that a version of gemma-wrapper is projected to be integrated into gemma itself. GEMMA is a software toolkit for fast application of linear mixed models (LMMs) and related models to genome-wide association studies (GWAS) and other large-scale data sets. This repository contains gemma-wrapper, essentially a wrapper of GEMMA that provides support for caching the kinship or relatedness matrix (K) and caching LM and LMM computations with the option of full leave-one-chromosome-out genome scans (LOCO). Jobs can also be submitted to HPC PBS, i.e., slurm. gemma-wrapper requires a recent version of GEMMA and essentially does a pass-through of all standard GEMMA invocation switches. On return gemma-wrapper can return a JSON object (--json) which is useful for web-services. ## Performance LOCO runs in parallel by default which is at least a 5x performance improvement on a machine with enough cores. GEMMA without LOCO, however, does not run in parallel by default. Performance improvements with the parallel implementation for LOCO and non-LOCO can be viewed [here](./test/performance/releases.gmi). ## Installation Prerequisites are * A recent version of [GEMMA](https://github.com/genetics-statistics/GEMMA) * Standard [Ruby >2.0 ](https://www.ruby-lang.org/en/) which comes on almost all Linux systems gemma-wrapper comes as a Ruby [gem](https://rubygems.org/gems/bio-gemma-wrapper) and can be installed with gem install bio-gemma-wrapper Invoke the tool with gemma-wrapper --help and it will render something like ``` Usage: gemma-wrapper [options] -- [gemma-options] --permutate n Permutate # times by shuffling phenotypes --permute-phenotypes filen Phenotypes to be shuffled in permutations --loco Run full leave-one-chromosome-out (LOCO) --chromosomes [1,2,3] Run specific chromosomes --input filen JSON input variables (used for LOCO) --cache-dir path Use a cache directory --json Create output file in JSON format --force Force computation (override cache) --parallel Run jobs in parallel --no-parallel Do not run jobs in parallel --slurm[=opts] Use slurm PBS for submitting jobs --q, --quiet Run quietly -v, --verbose Run verbosely -d, --debug Show debug messages and keep intermediate output --dry-run Show commands, but don't execute -- Anything after gets passed to GEMMA -h, --help display this help and exit ``` Alternatively, fetch a [release](https://github.com/genetics-statistics/gemma-wrapper/releases) of [gemma-wrapper](https://github.com/genetics-statistics/gemma-wrapper) Unpack it and run the tool as ./bin/gemma-wrapper --help See below for using a GNU Guix environment. ## Usage gemma-wrapper picks up GEMMA from the PATH. To override that behaviour use the GEMMA_COMMAND environment variable, e.g. env GEMMA_COMMAND=~/opt/gemma/bin/gemma ./bin/gemma-wrapper --help to pass switches to GEMMA put them after '--' e.g. gemma-wrapper -v -- -h prints the GEMMA help ## Caching of K To compute K run the following command from the source directory (so the data files are found): gemma-wrapper -- \ -g test/data/input/BXD_geno.txt.gz \ -p test/data/input/BXD_pheno.txt \ -a test/data/input/BXD_snps.txt \ -gk \ -debug Run it twice to see /tmp/0bdd7add5e8f7d9af36b283d0341c115124273e0.log.txt CACHE HIT! gemma-wrapper computes the unique HASH value over the command line switches passed into GEMMA as well as the contents of the files passed in (here the genotype and phenotype files - actually it ignores the phenotype with K because GEMMA always computes the same K). You can also get JSON output on STDOUT by providing the --json switch gemma-wrapper --json -- \ -g test/data/input/BXD_geno.txt.gz \ -p test/data/input/BXD_pheno.txt \ -a test/data/input/BXD_snps.txt \ -gk \ -debug > K.json K.json is something that can be parsed with a calling program, and is also below as input for the GWA step. Example: ```json {"warnings":[],"errno":0,"debug":[],"type":"K","files":[["/tmp/18ce786ab92064a7ee38a7422e7838abf91f5eb0.log.txt","/tmp/18ce786ab92064a7ee38a7422e7838abf91f5eb0.cXX.txt"]],"cache_hit":true,"gemma_command":"../gemma/bin/gemma -g test/data/input/BXD_geno.txt.gz -p test/data/input/BXD_pheno.txt -gk -debug -outdir /tmp -o 18ce786ab92064a7ee38a7422e7838abf91f5eb0"} ``` Note that GEMMA's -o (output) and --outdir switches should not be used. gemma-wrapper stores the cached matrices in TMPDIR by default. If you want something else provide a --cache-dir, e.g. gemma-wrapper --cache-dir ~/.gemma-cache -- \ -g test/data/input/BXD_geno.txt.gz \ -p test/data/input/BXD_pheno.txt \ -a test/data/input/BXD_snps.txt \ -gk \ -debug will store K in ~/.gemma-cache. ### GWA Run the LMM using the K's captured earlier in K.json using the --input switch gemma-wrapper --json --input K.json -- \ -g test/data/input/BXD_geno.txt.gz \ -p test/data/input/BXD_pheno.txt \ -c test/data/input/BXD_covariates2.txt \ -a test/data/input/BXD_snps.txt \ -lmm 2 -maf 0.1 \ -debug > GWA.json Running it twice should show that GWA is not recomputed. /tmp/9e411810ad341de6456ce0c6efd4f973356d0bad.log.txt CACHE HIT! ### LOCO Recent versions of GEMMA have LOCO support for a single chromosome using the -loco switch (for supported formats check https://github.com/genetics-statistics/GEMMA/issues/46). To loop all chromosomes first create all K's with gemma-wrapper --json \ --loco -- \ -g test/data/input/BXD_geno.txt.gz \ -p test/data/input/BXD_pheno.txt \ -a test/data/input/BXD_snps.txt \ -gk \ -debug > K.json and next run the LMM's using the K's captured in K.json using the --input switch gemma-wrapper --json --loco --input K.json -- \ -g test/data/input/BXD_geno.txt.gz \ -p test/data/input/BXD_pheno.txt \ -c test/data/input/BXD_covariates2.txt \ -a test/data/input/BXD_snps.txt \ -lmm 2 -maf 0.1 \ -debug > GWA.json GWA.json contains the file names of every chromosome ```json {"warnings":[],"errno":0,"debug":[],"type":"GWA","files":[["1","/tmp/9e411810ad341de6456ce0c6efd4f973356d0bad.1.assoc.txt.log.txt","/tmp/9e411810ad341de6456ce0c6efd4f973356d0bad.1.assoc.txt.assoc.txt"],["2","/tmp/9e411810ad341de6456ce0c6efd4f973356d0bad.2.assoc.txt.log.txt","/tmp/9e411810ad341de6456ce0c6efd4f973356d0bad.2.assoc.txt.assoc.txt"]... ``` The -k switch is injected automatically. Again output switches are not allowed (-o, -outdir) ### Permutations Permutations can be run with and without LOCO. First create K gemma-wrapper --json -- \ -g test/data/input/BXD_geno.txt.gz \ -p test/data/input/BXD_pheno.txt \ -gk \ -debug > K.json Next, using K.json, permute the phenotypes with something like gemma-wrapper --json --loco --input K.json \ --permutate 100 --permute-phenotype test/data/input/BXD_pheno.txt -- \ -g test/data/input/BXD_geno.txt.gz \ -p test/data/input/BXD_pheno.txt \ -c test/data/input/BXD_covariates2.txt \ -a test/data/input/BXD_snps.txt \ -lmm 2 -maf 0.1 \ -debug > GWA.json This should get the estimated 95% (significant) and 67% (suggestive) thresholds: ["95 percentile (significant) ", 1.92081e-05, 4.7] ["67 percentile (suggestive) ", 5.227785e-05, 4.3] ### Slurm PBS To run gemma-wrapper on HPC use the '--slurm' switch. ## Development We use GNU Guix for development and deployment. Use the [.guix-deploy](.guix-deploy) script in the checked out git repo: ``` source .guix-deploy ruby bin/gemma-wrapper --help ``` ## Copyright Copyright (c) 2017-2021 Pjotr Prins. See [LICENSE.txt](LICENSE.txt) for further details.