Long Read Single-Cell Analysis Tutorial
=======================================

This tutorial guides you through the automated, combined analysis of multiple long read single-cell datasets. The automated workflow applies defaults for GFF merging, reference transcript selection, protein translation, and statistical filtering. Each step can be customized further (alternative options, statistical thresholds) by calling the underlying Python functions separately.

Prerequisites
-------------
**Requirements**:

- Python 3.11 or higher
- AltAnalyze3 installed via pip (``pip install altanalyze3``)
- 10x Genomics long read matrices (.mtx) and associated GFF files

**Sample Metadata**:
Ensure that your samples are properly annotated in a metadata file. A sample metadata file may look like this:

.. code-block:: text

   uid     gff                   matrix              library       reverse    groups
   D001    /Diag1-1/D001.gff     /Diag1-1/sciso      D001-HSC      TRUE       Diagnosis
   D001    /Diag1-2/D001.gff     /Diag1-2/sciso      D001-MPP      TRUE       Diagnosis
   D002    /Diag2/D002.gff       /Diag2/sciso        D002-HSPC     TRUE       Diagnosis
   D003    /Diag3/D003.gff       /Diag3/sciso        D003-HSPC     TRUE       Diagnosis
   D004    /Relapse1/D004.gff    /Relapse1/sciso     D004-HSPC     TRUE       Relapse
   D005    /Relapse2/D005.gff    /Relapse2/sciso     D005-HSPC     TRUE       Relapse
   D006    /Relapse3/D006.gff    /Relapse3/sciso     D006-HSPC     FALSE      Relapse

Note: Rows that share a uid (multiple sequencing runs or libraries) are combined into one sample. If the cell barcodes of a library are reverse complemented relative to those in the barcode-to-cluster file (a two-column file), set reverse to TRUE.
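
Before running the pipeline, it can help to confirm that the metadata file parses as expected. The sketch below is not part of AltAnalyze3 and is independent of ``import_metadata``; ``read_metadata`` is a hypothetical helper, and it assumes the file is tab-delimited with the header shown above. It groups rows by uid and converts the reverse column to a boolean.

.. code-block:: python

   import csv
   import io
   from collections import defaultdict

   # Column names taken from the example table above.
   REQUIRED_COLUMNS = ["uid", "gff", "matrix", "library", "reverse", "groups"]


   def read_metadata(handle):
       """Group metadata rows by uid (hypothetical helper, not an AltAnalyze3 API)."""
       reader = csv.DictReader(handle, delimiter="\t")  # assumption: tab-delimited
       missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
       if missing:
           raise ValueError(f"metadata is missing columns: {missing}")
       samples = defaultdict(list)
       for raw in reader:
           row = {c: (raw[c] or "").strip() for c in REQUIRED_COLUMNS}
           # Accept TRUE/True/true; anything else counts as False.
           row["reverse"] = row["reverse"].upper() == "TRUE"
           samples[row["uid"]].append(row)
       return dict(samples)


   # Three rows from the example table, written with explicit tabs.
   example = (
       "uid\tgff\tmatrix\tlibrary\treverse\tgroups\n"
       "D001\t/Diag1-1/D001.gff\t/Diag1-1/sciso\tD001-HSC\tTRUE\tDiagnosis\n"
       "D001\t/Diag1-2/D001.gff\t/Diag1-2/sciso\tD001-MPP\tTRUE\tDiagnosis\n"
       "D006\t/Relapse3/D006.gff\t/Relapse3/sciso\tD006-HSPC\tFALSE\tRelapse\n"
   )
   samples = read_metadata(io.StringIO(example))
   for uid, rows in samples.items():
       print(uid, [r["library"] for r in rows], rows[0]["reverse"])

To check a real file, pass an open file handle instead of the ``io.StringIO`` object, for example ``read_metadata(open("/path/to/metadata.txt", newline=""))``.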

**Install Dependencies**:
Use the following commands to install AltAnalyze3 and to download and extract the Hs.zip database archive:

.. code-block:: bash

   pip install altanalyze3

   curl -O https://altanalyze.org/isoform/Hs.zip

   unzip Hs.zip


Step-by-Step Preprocessing
--------------------------
**Prepare Metadata and Cluster Files**:
Cluster-guided analyses require a metadata file and one or more barcode-to-cluster files. The remaining database files are extracted from the Hs.zip archive. Example paths:

.. code-block:: text

   /path/to/metadata.txt
   /path/to/barcode_to_clusters.txt

   /path/to/gencode.annotation.gff3
   /path/to/Hs_Ensembl-annotations.txt
   /path/to/Hs_Ensembl_exon.txt
   /path/to/genome.fa
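
Because preprocessing can take a while, a quick check that every input exists may save a failed run. The following is a minimal sketch using only the standard library; ``find_missing`` and the dictionary keys are illustrative names, and the paths are the placeholders from the listing above.

.. code-block:: python

   from pathlib import Path

   # Placeholder paths from the listing above; replace with your own locations.
   inputs = {
       "metadata_file": "/path/to/metadata.txt",
       "barcode_cluster_file": "/path/to/barcode_to_clusters.txt",
       "gencode_gff": "/path/to/gencode.annotation.gff3",
       "gene_symbol_file": "/path/to/Hs_Ensembl-annotations.txt",
       "ensembl_exon_dir": "/path/to/Hs_Ensembl_exon.txt",
       "genome_fasta": "/path/to/genome.fa",
   }


   def find_missing(named_paths):
       """Return {name: path} for every path that does not exist (hypothetical helper)."""
       return {name: p for name, p in named_paths.items() if not Path(p).exists()}


   missing = find_missing(inputs)
   for name, path in missing.items():
       print(f"{name}: not found at {path}")
   if not missing:
       print("All input files found.")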

**Run Preprocessing Script**:
In your Python environment or script, run:
   
.. code-block:: python

   import altanalyze3.components.long_read.isoform_matrix as iso
   import altanalyze3.components.long_read.isoform_automate as isoa

   metadata_file = "/path/to/metadata.txt"
   ensembl_exon_dir = "/path/to/Hs_Ensembl_exon.txt"
   barcode_cluster_dirs = ["/path/to/barcode_to_clusters.txt"]

   sample_dict = isoa.import_metadata(metadata_file)
   isoa.pre_process_samples(metadata_file, barcode_cluster_dirs, ensembl_exon_dir)

Note: ensembl_exon_dir, gene_symbol_file, genome_fasta, and gencode_gff must point to files extracted from the Hs.zip archive downloaded above.

**Combining Processed Samples**:
Once the samples are preprocessed, combine them using:

.. code-block:: python

   import altanalyze3.components.long_read.comparisons as comp
   gencode_gff = "/path/to/gencode.annotation.gff3"
   genome_fasta = "/path/to/genome.fa"

   isoa.combine_processed_samples(
      metadata_file,
      barcode_cluster_dirs,
      ensembl_exon_dir,
      gencode_gff,
      genome_fasta
   )

**Compute and Annotate Differential Splicing Events and Isoforms**:
Once the samples are combined, compute differential splicing events and isoforms between conditions using:

.. code-block:: python

   gene_symbol_file = "/path/to/Hs_Ensembl-annotations.txt"

   # Import all cell clusters in order or replace with a list of select cluster(s)
   cluster_order = iso.return_cluster_order(barcode_cluster_dirs)

   # Differential analyses to perform
   analyses = ['junction', 'isoform', 'isoform-ratio']

   condition1 = 'Diagnosis'
   condition2 = 'Relapse'
   conditions = [(condition1, condition2)]

   comp.compute_differentials(
      sample_dict,
      conditions,
      cluster_order,
      gene_symbol_file,
      analyses=analyses
   )
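
With more than two groups in the metadata, the ``conditions`` list can be built programmatically rather than typed by hand. The sketch below assumes, as in the example above, that ``conditions`` is a list of two-element tuples of group labels; ``pairwise_conditions`` is a hypothetical helper, and whether the order within each tuple sets the direction of the comparison is not covered here, so check it before relying on the sign of the results.

.. code-block:: python

   from itertools import combinations


   def pairwise_conditions(groups):
       """Return every pair of distinct group labels, in first-seen order."""
       unique = list(dict.fromkeys(groups))  # de-duplicate, keep order
       return list(combinations(unique, 2))


   # Group labels as they appear in the example metadata table.
   groups = ["Diagnosis"] * 4 + ["Relapse"] * 3
   conditions = pairwise_conditions(groups)
   print(conditions)  # [('Diagnosis', 'Relapse')]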

**Expected Outputs**:

- *gff_output* - Directory of isoform exon structure and isoform mappings
- *sample.h5ad* - AnnData file for each sample with consensus isoform or junction IDs
- *protein_sequences.fasta*  - protein sequence for consensus isoforms
- *protein_summary.txt*  - isoform NMD prediction
- *isoform_combined_pseudo_cluster_tpm.txt* - Cluster-level pseudobulk TPMs
- *junction_combined_pseudo_cluster_counts.txt* - Junction, intron & 3' end counts
- *psi_combined_pseudo_cluster_counts.txt* - PSI for junctions in >2 cluster pseudobulks
- *dPSI-events.txt* - Pairwise group Mann-Whitney U differential PSI events
- *dPSI-cluster/covariate* - Pairwise group Mann-Whitney U differential PSI events
- *diff-cluster/covariate-isoform* - Pairwise group Mann-Whitney U differential isoform log2 TPM
- *diff-cluster/covariate-ratio* - Pairwise group Mann-Whitney U differential isoform/gene ratios
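
Several of the outputs above are described as pairwise group Mann-Whitney U comparisons. The sketch below illustrates that kind of test on invented PSI values for a single junction. It is not AltAnalyze3's implementation: the values, the function names, and the sign convention for dPSI (second group minus first) are all assumptions made for illustration.

.. code-block:: python

   from statistics import mean


   def mann_whitney_u(x, y):
       """U statistic for x versus y: pairs with x > y count 1, ties count 0.5."""
       u = 0.0
       for a in x:
           for b in y:
               if a > b:
                   u += 1.0
               elif a == b:
                   u += 0.5
       return u


   def delta_psi(group1, group2):
       """Difference in mean PSI, group2 minus group1 (assumed convention)."""
       return mean(group2) - mean(group1)


   # Invented per-sample PSI values for one junction in one cell cluster.
   diagnosis = [0.10, 0.15, 0.20, 0.12]
   relapse = [0.55, 0.60, 0.48]

   u_diag = mann_whitney_u(diagnosis, relapse)
   u_rel = mann_whitney_u(relapse, diagnosis)
   print("U (diagnosis vs relapse):", u_diag)
   print("U (relapse vs diagnosis):", u_rel)
   print("dPSI:", round(delta_psi(diagnosis, relapse), 3))

The two U values always sum to the product of the group sizes, so complete separation of the groups gives a U of 0 on one side. To obtain a p-value as well, ``scipy.stats.mannwhitneyu`` returns both the statistic and the p-value.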

**Verify Output**:
Ensure that the processed outputs include files with differential splicing, isoform, and ratio data in the current working directory.
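
This check can be scripted. The sketch below looks for the exactly named files from the Expected Outputs list and flags any that are absent or empty. ``check_outputs`` is a hypothetical helper, and the assumption that all of these files sit directly in the working directory should be adjusted to match your run.

.. code-block:: python

   from pathlib import Path

   # File names copied from the Expected Outputs list; their location in the
   # working directory is an assumption.
   EXPECTED_FILES = [
       "protein_sequences.fasta",
       "protein_summary.txt",
       "isoform_combined_pseudo_cluster_tpm.txt",
       "junction_combined_pseudo_cluster_counts.txt",
       "psi_combined_pseudo_cluster_counts.txt",
       "dPSI-events.txt",
   ]


   def check_outputs(directory, names):
       """Split names into (present, absent_or_empty) lists (hypothetical helper)."""
       present, problems = [], []
       for name in names:
           path = Path(directory) / name
           if path.is_file() and path.stat().st_size > 0:
               present.append(name)
           else:
               problems.append(name)
       return present, problems


   present, problems = check_outputs(Path.cwd(), EXPECTED_FILES)
   print("found:", present)
   print("missing or empty:", problems)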

Next Steps
----------
After these steps, you can inspect the results in a spreadsheet editor or continue with the **Perform Secondary Analyses** or **Visualize Results** tutorials.

Support
-------
For issues, please refer to our GitHub repository:  
https://github.com/SalomonisLab/altanalyze3