Quickstart

Installation

MixALime is best installed with the pip package manager:
> pip install mixalime
Alternatively, you can fetch the latest version from the git repository:
> git clone https://github.com/autosome-ru/MixALime
> cd MixALime
> python3 setup.py install

Overview

A user communicates with MixALime through the command-line interface. After the package is installed, the mixalime command should be available in your environment. Check that this is indeed the case by invoking
> mixalime --help
You should now be greeted with a list of available commands alongside their short descriptions. You can also get help on any listed command, e.g.
> mixalime create --help
In theory, the help output should provide sufficient information to deduce how to do everything. This can be infeasible in practice, though, and the goal of this tutorial is to explain everything in greater detail and in an easier-to-follow fashion.
When working with MixALime, you should always start by creating a MixALime project from the data of interest. That is done with the create command. For example, if your dataset is a folder full of VCF files, you run
> mixalime create MyProject path/to/folder
Supported data formats and preprocessing parameters are explained in the next section. After MixALime is done preprocessing the data, it should output the MyProject.init.lzma file. MixALime stores intermediate results in special (possibly compressed) files whose names consist of the project name followed by the step name and the compression method, if any:
{ProjectName}.{StepName}.{CompressionMethod}
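As a rough illustration of this naming scheme, the Python sketch below splits such a filename into its parts. The helper parse_step_filename is hypothetical and not part of MixALime, and the set of step names is simply the one that appears in this tutorial; the names MixALime actually uses may be more numerous.

```python
from typing import NamedTuple, Optional

# Step names that appear in this tutorial; MixALime's full set may differ.
KNOWN_STEPS = {"init", "fit", "test", "comb", "difftest"}


class StepFile(NamedTuple):
    project: str
    step: str
    compression: Optional[str]


def parse_step_filename(filename: str) -> StepFile:
    """Split '{ProjectName}.{StepName}[.{CompressionMethod}]' into its parts."""
    parts = filename.split(".")
    # Uncompressed case: the last component is the step name.
    if len(parts) >= 2 and parts[-1] in KNOWN_STEPS:
        return StepFile(".".join(parts[:-1]), parts[-1], None)
    # Compressed case: the step name is followed by the compression method.
    if len(parts) >= 3 and parts[-2] in KNOWN_STEPS:
        return StepFile(".".join(parts[:-2]), parts[-2], parts[-1])
    raise ValueError(f"not a recognised project file name: {filename!r}")


print(parse_step_filename("MyProject.init.lzma"))
```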
The power of this approach is that one can freely share intermediate files (e.g. the init file) with other users, who can then perform all the remaining steps of the pipeline with no hassle and without needing the source files. Usually, after the project is created, we fit model parameters to the data with the fit command:
> mixalime fit MyProject MCNB
Here, MCNB is the name of a distribution (Marginalized Compound Negative Binomial, to be precise). MixALime has a variety of distributions at its disposal, namely NB (plain negative binomial), MCNB and BetaNB; see the Fit section of the tutorial for more. A successful invocation of the command should create the MyProject.fit.lzma file.
Next, we compute p-values and effect sizes (ES) with test:
> mixalime test MyProject
The result is saved to MyProject.test.lzma. Those p-values can now be combined across samples with combine:
> mixalime combine MyProject
This produces the MyProject.comb.lzma file.
Of course, it is possible to define groups across which to combine p-values. combine and test are covered in the Scoring section.
Likewise, we can perform differential tests if necessary:
> mixalime difftest MyProject control_group.txt test_group.txt
Here, control_group.txt and test_group.txt are text files that list the read count filenames making up the control and test groups (those files should be among the ones we supplied to the create command previously). After calling this command, you should see the MyProject.difftest.lzma file in your current working directory.
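The group files can be prepared by hand or with a few lines of Python. The sketch below assumes one filename per line, which this section does not spell out, and uses made-up sample names; both helpers are hypothetical and not part of MixALime.

```python
from pathlib import Path


def group_file_text(filenames):
    """Return group file contents: one filename per line (an assumed layout)."""
    return "".join(f"{name}\n" for name in filenames)


def write_groups(control, test, directory="."):
    """Write control_group.txt and test_group.txt into the given directory."""
    overlap = set(control) & set(test)
    if overlap:
        # Our own sanity check: a file in both groups is most likely a mistake.
        raise ValueError(f"files listed in both groups: {sorted(overlap)}")
    directory = Path(directory)
    (directory / "control_group.txt").write_text(group_file_text(control), encoding="utf-8")
    (directory / "test_group.txt").write_text(group_file_text(test), encoding="utf-8")


# Hypothetical names; use the files you actually passed to `mixalime create`.
control = ["scorefiles/sample_a.vcf", "scorefiles/sample_b.vcf"]
test = ["scorefiles/sample_c.vcf", "scorefiles/sample_d.vcf"]
print(group_file_text(control), end="")
```

Calling write_groups(control, test) would then create both files in the current working directory.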
Finally, we export all our results as tabular data to a folder result_folder:
> mixalime export all MyProject result_folder
Some commands can be used independently, but most of them are applied one after another. A possible workflow is depicted in the diagram below:
[Workflow diagram. Steps and the commands that perform them:
mixalime create: Read & preprocess data
mixalime fit: Estimate model parameters
mixalime test: Compute p-values
mixalime combine: Combine p-values across groups/individuals
mixalime export: Export fit indices and/or p-values
mixalime plot: Plot fit statistics
mixalime difftest: Do differential tests]
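The basic chain of commands from this overview can also be scripted. The following is a minimal sketch that only reuses the invocations shown above; the helper names build_commands and run_pipeline are our own, and the script defaults to a dry run that prints the commands instead of executing them.

```python
import shlex
import subprocess


def build_commands(project, data_dir, model="MCNB", out_dir="result_folder"):
    """Return the overview commands, in order, as argument lists."""
    return [
        ["mixalime", "create", project, data_dir],
        ["mixalime", "fit", project, model],
        ["mixalime", "test", project],
        ["mixalime", "combine", project],
        ["mixalime", "export", "all", project, out_dir],
    ]


def run_pipeline(project, data_dir, dry_run=True, **kwargs):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in build_commands(project, data_dir, **kwargs):
        print(">", shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline as soon as one step fails.
            subprocess.run(cmd, check=True)


run_pipeline("MyProject", "path/to/folder")  # dry run: only prints the commands
```

Passing dry_run=False executes the steps for real, which requires the mixalime command to be installed and the data folder to exist.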

Example

MixALime ships with a demo dataset (namely, DNase data for the K562 cell line). First, let's extract it:
> mixalime export demo
A folder named scorefiles containing a bunch of VCF files should appear in the current working directory. Then we run the create command with the --no-snp-bad-check argument. By default, MixALime throws an error if the same SNV falls into regions of different background allelic dosage (BAD, see the ADASTRA paper and the glossary) within the data, as this is not possible in a single cell type. For datasets of multiple samples, however, this is perfectly fine, so we turn the check off:
> mixalime create MyProject scorefiles --no-snp-bad-check
After it is done, MixALime will provide some dataset statistics:
┏━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ BAD  ┃ SNVs  ┃ Obvservations ┃
┡━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
│ 1.00 │ 10320 │ 24893         │
│ 1.33 │ 7408  │ 13359         │
│ 1.50 │ 12409 │ 37584         │
│ 2.00 │ 41763 │ 199208        │
│ 2.50 │ 2195  │ 3585          │
│ 3.00 │ 6897  │ 26558         │
│ 4.00 │ 1677  │ 2420          │
│ 5.00 │ 712   │ 926           │
│ 6.00 │ 731   │ 1143          │
└──────┴───────┴───────────────┘
Total unique SNVs: 84112, total observations: 309676
✔️ Done!  time: 9.27 s.
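As a quick sanity check, the per-BAD rows of this table add up to the reported totals. The sketch below copies the numbers from the output above and also prints the ratio of observations to SNVs for each BAD; it does not call MixALime.

```python
# (BAD, SNVs, observations) rows copied from the table above.
rows = [
    (1.00, 10320, 24893),
    (1.33, 7408, 13359),
    (1.50, 12409, 37584),
    (2.00, 41763, 199208),
    (2.50, 2195, 3585),
    (3.00, 6897, 26558),
    (4.00, 1677, 2420),
    (5.00, 712, 926),
    (6.00, 731, 1143),
]

total_snvs = sum(snvs for _, snvs, _ in rows)
total_obs = sum(obs for _, _, obs in rows)
print(f"SNVs: {total_snvs}, observations: {total_obs}")

# Ratio of observations to SNVs within each BAD.
for bad, snvs, obs in rows:
    print(f"BAD {bad:.2f}: {obs / snvs:.2f} observations per SNV")
```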
Now, let's obtain parameter estimates for the model. We are going to use the beta negative binomial model here:
> mixalime fit MyProject BetaNB --n-jobs -1
The --n-jobs -1 option asks MixALime to use all available threads to parallelize jobs across different BADs. After some time, we should obtain parameter estimates. You might also notice warnings like this:
BAD 4.0 is less than a window size (2420 < 10000). Number of samples is too small for a sensible fit, a conservative fit will be used.
That's fine and almost self-explanatory. MixALime found that there is little data in the BAD 4.0 regions and decided to use predetermined parameter values (aka a "conservative fit") instead of trying to estimate them. It is possible to alter this behavior if necessary.
It is a good idea now to take a look at the obtained model fit in visual form before proceeding any further:
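The numbers in the warning (2420 < 10000) suggest that the observation count of a BAD is compared against the window size. Under that assumption, the sketch below lists the demo BADs that fall short of the window; MixALime's actual criterion may be more detailed, so the set of warnings seen in a real run can differ.

```python
WINDOW_SIZE = 10000  # value quoted in the warning above

# Observations per BAD, copied from the dataset statistics of the demo project.
observations = {
    1.00: 24893, 1.33: 13359, 1.50: 37584,
    2.00: 199208, 2.50: 3585, 3.00: 26558,
    4.00: 2420, 5.00: 926, 6.00: 1143,
}


def below_window(counts, window_size=WINDOW_SIZE):
    """Return the BADs whose observation count is smaller than the window."""
    return [bad for bad, n in sorted(counts.items()) if n < window_size]


for bad in below_window(observations):
    print(f"BAD {bad}: {observations[bad]} < {WINDOW_SIZE}")
```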
> mixalime plot all MyProject results
You'll find the results folder with subfolders BAD1.00, BAD1.50, BAD2.0 etc. We discuss the interpretation of the plots in the Fit section. The figures below look good, so everything's OK:
results/BAD1.0/slices_20.png
results/BAD1.0/r.png
results/BAD2.0/r.png
results/BAD3.0/r.png
results/BAD1.0/k.png
results/BAD2.0/k.png
results/BAD3.0/k.png
There is no w.png figure for BAD 1.0. That's because $w$ is a mixture parameter applicable only to BADs higher than 1, so let's plot one for BAD 4.0 instead.
results/BAD2.0/w.png
results/BAD3.0/w.png
results/BAD4.0/w.png
The last one, results/BAD4.0/w.png, looks really different, but that's OK. Remember the warning we had at the fit stage? No parameters were estimated for BAD 4.0; instead, MixALime used "conservative" parameter values. Specifically, the "conservative" value for $w$ is $1$.
results/scorefiles_qc.png
Also, one might notice scorefiles_qc.png files in each BAD directory and in the outer results folder. This is a helpful plot for spotting bad data. If some "scorefiles" (e.g. VCF files) happen to be strong outliers, we might consider removing them from the project. Here, only a single file looks like an outlier (the one labeled with the suffix '1426'), but notice that it has very low total coverage (3000; in this particular case there are just a few SNPs in the file), so the deviation may well be due to chance. Usually, pathological cases are both outliers and have reasonably high total coverage.
Let's move on. Now that we are sure that everything is fine with the fits, we can compute raw p-values and effect sizes (ES):
> mixalime test MyProject
We can think of the "raw" statistics obtained at this step as intermediate values on the way to the actual p-values and ES for each SNV/SNP. Technically, "raw" p-values and ES are computed for each unique pair of (reference allele read count, alternative allele read count). Note that there are many more SNVs/SNPs than unique pairs. At the next step, we need to decide how to aggregate (combine) them. Depending on the experiment, we might want to combine SNVs with respect to different conditions or cell types; see the Scoring section for details. Here, we are fine with unconditionally combining all p-values and ES associated with a particular SNV, so no extra options are passed to the combine command:
> mixalime combine MyProject
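To see why computing "raw" statistics per unique pair of read counts pays off, consider the toy sketch below. The read counts are made up, and toy_pvalue is a plain two-sided binomial test with p = 0.5 used purely as a stand-in; it is not the model MixALime fits.

```python
from collections import Counter
from math import comb


def toy_pvalue(ref, alt):
    """Two-sided binomial test with p = 0.5; a stand-in, not MixALime's model."""
    n = ref + alt
    k = min(ref, alt)
    # Probability of a count at most as large as the smaller allele count.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical (ref, alt) read counts, one pair per observation.
observations = [(10, 10), (3, 12), (10, 10), (12, 3), (3, 12), (10, 10)]

# The statistic is computed once per unique pair...
unique_pairs = Counter(observations)
raw = {pair: toy_pvalue(*pair) for pair in unique_pairs}

# ...and then looked up for every observation that shares that pair.
per_observation = [raw[pair] for pair in observations]
print(f"{len(observations)} observations, {len(raw)} unique pairs")
```

In a real dataset the gap is far larger, because many SNVs share the same pair of read counts.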
After MixALime is done combining statistics, it reports the number of SNVs that pass the significance threshold after multiple testing correction:
Number of ignificantly imbalanced SNVs after FDR correction:
┏━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃      ┃      ┃      ┃ Total significant          ┃
┃ Ref  ┃ Alt  ┃ Both ┃ (Percentage of total SNVs) ┃
┡━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 2349 │ 2081 │ 0    │ 4430 (13.15%)              │
└──────┴──────┴──────┴────────────────────────────┘
Total SNVs tested: 33689
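The figures in this summary are related by simple arithmetic, reproduced below with the numbers copied from the output. Since Both is 0 here, the output alone does not show how SNVs significant towards both alleles would enter the total, so the sketch simply adds the Ref and Alt counts.

```python
ref_significant = 2349
alt_significant = 2081
total_tested = 33689

# With "Both" equal to 0, the total is just the sum of the two columns.
total_significant = ref_significant + alt_significant
percentage = 100 * total_significant / total_tested

print(f"{total_significant} ({percentage:.2f}%)")
```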
Finally, we can export results in a tabular form:
> mixalime export all MyProject results
You'll find p-values in a table at results/pvalues/pvals.tsv.
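The exported table can be inspected with any tool that reads tab-separated files. Below is a minimal sketch using Python's standard csv module; it makes no assumption about the column names, which this section does not list, and it only reads the file if it exists at the path mentioned above.

```python
import csv
from pathlib import Path


def parse_tsv(lines):
    """Parse tab-separated lines (header first) into a list of dicts."""
    return list(csv.DictReader(lines, delimiter="\t"))


path = Path("results/pvalues/pvals.tsv")
if path.exists():
    with path.open(newline="", encoding="utf-8") as handle:
        rows = parse_tsv(handle)
    columns = list(rows[0]) if rows else []
    print(f"{len(rows)} rows; columns: {columns}")
else:
    print(f"{path} not found; run the export step first")
```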