---
title: "Adaptive Gene - Multitrait Association testing with GWAS Summary Statistics (MTaSPUsSet() )"
date: "`r Sys.Date()`"
author: "Il-Youp Kwak"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{gene - multitrait aSPU with GWAS Summary Statistics}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r knitr_options, echo=FALSE, results=FALSE}
library(knitr)
opts_chunk$set(fig.width = 12)
```

This vignette illustrates the use of MTaSPUsSet test, an adaptive gene-multitrait association testing with GWAS summary statistics. 

## Data preparation

We first downloaded GWAS summary statistics of Genetic
Investigation of ANthropometric Traits ([GIANT](http://portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files)) consortium data. We get the genomic coordinates of SNPs using [SnpTracker](https://www.g3journal.org/content/6/1/205) software (hg19 used), and then we mapped SNPs to Genes using [MAGMA](http://ctg.cncr.nl/software/magma) software (build 37 used). We will consider testing a gene named *SAMD11* for example.

Let's load *SAMD11* data first.
```{r loading, include=FALSE}
library(aSPU)
```

```{r loading2}
data(SAMD11)
attach(SAMD11)
```

`ZsM` and `PsM` are Z-scores and P-values for GIANT Man data mapped on gene *SAMD11*. (We used [MAGMA](http://ctg.cncr.nl/software/magma) software to map rs ids to genes. )

```{r ZsPsM}
round(ZsM,3)
PsM
```
Both `ZsM` and `PsM` are `r dim(PsM)[1]` by `r dim(PsM)[2]` matrix. Row represent SNPs and column represent phenotypes. `r dim(PsM)[1]` SNPs are mapped on gene *SAMD11* and `r dim(PsM)[2]` number of phenotypes are considered in testing.

`corSNPM` and `corPheM` are correlation among SNPs and Phenotypes respectively for GIANT Man data. 
```{r corM}
round(corSNPM,2)
round(corPheM,2)
```

`ZsF` and `PsF` are Z-scores and P-values for GIANT Woman data mapped on gene *SAMD11*.
```{r ZsPsF}
round(ZsF,3)
PsF
```

`corSNPF` and `corPheF` are correlation among SNPs and Phenotypes respectively for GIANT Woman data. 
```{r corF}
round(corSNPF,2)
round(corPheF,2)
```

You can see that `corSNPM` and `corSNPF` are same. Yes, they are calculated from the reference population (here, we used 1000 genome CEU panel). They are same under H0 no matter what phenotype is.  

## Data analysis

We can perform MTaSPUsSet using following command.
```{r outFZ}
(outFZ <- MTaSPUsSet(ZsF, corSNP=corSNPF, corPhe = corPheF,
           pow=c(1,2,4,8),  pow2 = c(1,2,4,8), n.perm=100, Ps=FALSE))

# available in R/aSPUc from my github
#(outFZC <- MTaSPUsSetC(ZsF, corSNP=corSNPF, corPhe = corPheF,
#           pow=c(1,2,4,8),  pow2 = c(1,2,4,8), n.perm=100, Ps=FALSE))

```
To use Python version, first save files in `.txt` format.

```{r wrwr}
#write.table(ZsF, quote=FALSE, row.names=FALSE, col.names=FALSE, file="ZsF.txt")
#write.table(corPheF, quote=FALSE, row.names=FALSE, col.names=FALSE, file="corPheF.txt")
#write.table(corSNPF, quote=FALSE, row.names=FALSE, col.names=FALSE, file="corSNPF.txt")

```
`MTaSPUsSet.py` function in [py/aSPU_py](https://github.com/ikwak2/aSPU_py) do the same job with `./MTaSPUsSet.py ZsF.txt corPheF.txt corSNPF.txt 100 outF.txt`.


Next, We can use the P-values for MTaSPUsSet test using `MTaSPUsSet` functions. Put P-value matrix `PsF` and set `Ps=TRUE` in the option.
```{r outFP}
(outFP <- MTaSPUsSet(PsF, corSNP=corSNPF, corPhe = corPheF,
           pow=c(1,2,4,8),  pow2 = c(1,2,4,8), n.perm=100, Ps=TRUE))

# available in R/aSPUc from my github
#(outFPC <- MTaSPUsSetC(PsF, corSNP=corSNPF, corPhe = corPheF,
#           pow=c(1,2,4,8),  pow2 = c(1,2,4,8), n.perm=100, Ps=TRUE))

```
Results are not much different.


Next, let's try MTaSPUsSet test using GIANT Man data with input of P-values and Z-scores.
```{r outMPZ}
(outMPC <- MTaSPUsSet(PsM, corSNP=corSNPM, corPhe = corPheM,
           pow=c(1,2,4,8),  pow2 = c(1,2,4,8), n.perm=100, Ps=TRUE))
(outMZC <- MTaSPUsSet(ZsM, corSNP=corSNPM, corPhe = corPheM,
           pow=c(1,2,4,8),  pow2 = c(1,2,4,8), n.perm=100, Ps=FALSE))
```
This time we got smaller P-value using `ZsM` input than `PsM` input. Why is it so? This can be answered from `corSNPM`, `corPheM` and `ZsM`.
```{r Zsmcors}
round(ZsM,3)
round(corSNPM,2)
round(corPheM,2)
```
In `corSNPM`, correlation between `rs4951864` and others are negative. The second column of `ZsM` looks a bit odd. It could be probable since the p-value is not very much significant and `rs11240777` might be related a bit with `Height`. However It might be related to the coding error as well. The original data set did not provide Z-scores. P-values for each SNP and beta estimate was provided in the download page of GIANT data. I transformed P-values to Z-scores by `absZs = qnorm(1 - (Ps)/2)` and then multiplied the sign of beta estimates to recover the original Z-scores. Maybe I am wrong somewhere or provided beta estimates are not all correct. In any case, it would be safe to analyze GIANT data using P-values rather than using Z-scores.


```{r plots, echo=FALSE}
plotG <- function(Ps, zlim = NULL, main = NULL, yt = NULL, title = "SNPs") {        
    log10P <- -log(Ps,10)  
    pos = 1:nrow(log10P)
    y = 1:ncol(log10P)
    log10P <- log10P
    val <- sqrt(seq(0, 1, len=251))
    col <- rgb(1, rev(val), rev(val))

    if(is.null(yt)) {
        yt = -length(pos)/15
    }

    if(is.null(zlim)) {
        maxP <- max(log10P, na.rm=TRUE)
        zlim <- c(0, maxP)
    }
    image.plot(pos, y, log10P, xaxt="n", yaxt="n", ylab="", xlab="",
                    zlim=zlim, col=col, mgp=c(2.6, 1, 0), bty="n", main = main )
    title(xlab=title, mgp=c(1, 0, 0))
    text(yt,1,"BMI", xpd = TRUE)
    text(yt,2,"Height", xpd = TRUE)
    text(yt,3,"HIP", xpd = TRUE)
    text(yt,4,"WC", xpd = TRUE)
    text(yt,5,"Weight", xpd = TRUE)
    text(yt,6,"WHR", xpd = TRUE)
}

plotLD <- function(ldmatrix, zlim = NULL, main = NULL, yt = NULL, title = "SNPs") {
#    log10P <- -log(Ps,10)
    pos = 1:nrow(ldmatrix)
    y = 1:ncol(ldmatrix)
    val <- sqrt(seq(0, 1, len=251))
    col <- rgb(1, rev(val), rev(val))

    if(is.null(yt)) {
        yt = -length(pos)/15
    }

    if(is.null(zlim)) {
        maxP <- max(ldmatrix, na.rm=TRUE)
        zlim <- c(0, maxP)
    }
    image.plot(pos, y, ldmatrix, xaxt="n", yaxt="n", ylab="", xlab="",
        zlim=zlim, col=col, mgp=c(2.6, 1, 0), bty="n", main = main )
    title(xlab=title, mgp=c(1, 0, 0))
    title(ylab=title, mgp=c(1, 0, 0))
}

```


## Data Visualization of some detected genes
So far, we demostrated how we can use the software. In this section we will visualize some identified genes. 

Gene *LCORL* and *RASA2* are two of identified genes by [MGAS](https://pubmed.ncbi.nlm.nih.gov/25431328/) using their software.  

```{r plot_MGAS, echo=FALSE, fig.width=7, fig.height=7}
data(someGs)
par(mfrow = c(2,2))
plotG(someGs$LCORL[[1]], main = "LCORL (P-values)", zlim = c(0,18))
plotG(someGs$RASA2[[1]], main = "RASA2 (P-values)", zlim = c(0,12))
plotLD(abs(someGs$LCORL[[2]]), main = "LCORL (LDmatrix)")
plotLD(abs(someGs$RASA2[[2]]), main = "RASA2 (LDmatrix)")
```


In the Figure, we draw image of $-\log_{10}$ transformed P-values for each SNP and trait.
Gene *LCORL* was the most significant gene. As we can see from the Figure, there are many very significant SNPs for trait Height in Gene *LCORL*. No doubt MGAS and MTaSPUsSet identified this gene.

MGAS use extended Simes procedure so it considers top few p-values in their algorithm. Thus, MGAS works well with sparse signal such as *RASA2*. Since the signal is sparse, MTaSPUsSet detect this gene with larger $\gamma_1$ and $\gamma_2$. With ($\gamma_1, \gamma_2$) = (1,8), (2,8), (4,8) and (8,8), MTaSPUsSet P-values were less than 1e-7 with $10^7$ permutations.

```{r plot_MT, echo=FALSE, fig.width=7, fig.height=7}
data(someGs)
par(mfrow = c(2,2))
plotG(someGs$STK33[[1]], main = "STK33 (P-values)", zlim = c(0,12))
plotG(someGs$RPGRIP1L[[1]], main = "RPGRIP1L (P-values)", zlim = c(0,12))
plotLD(abs(someGs$STK33[[2]]), main = "STK33 (LDmatrix)")
plotLD(abs(someGs$RPGRIP1L[[2]]), main = "RPGRIP1L (LDmatrix)")
```

MTaSPUsSet test have advantage in detecting genes loke *STK33* and *RPGRIP1L*. We can see that signals are widely spread. MTaSPUsSet test have higher power with smaller $\gamma_1$ and $\gamma_2$. As we can see from the figure, each P-values are not very big. These genes would be hard to detect using SNP based, or gene - single trait association testing. However, MTaSPUsSet test can detect these genes by aggregating signals for all traits and SNPs.