allelefrequencies

📂 HLA allele frequencies in tab-delimited format, downloaded from AFND.

MIT License

output:
  md_document:
    variant: "gfm"
    standalone: true
    toc: false
    toc_depth: 2
  html_document:
    toc: false
    toc_depth: 2
    self_contained: true

HLA allele frequencies in tab-delimited format

Kamil Slowikowski

`r format(Sys.Date())`

options(width=100)
library(data.table)
library(dplyr)
library(glue)
library(readr)
library(magrittr)
library(ggplot2)
library(ggstance)
devtools::source_gist("c83e078bf8c81b035e32c3fc0cf04ee8", filename = 'render_toc.R')
file_size <- function(x) glue("{fs::file_size(x)}B")
d <- fread("afnd.tsv")
n_hla <- d %>% filter(group == "hla") %>% count(gene) %>% nrow
n_kir <- d %>% filter(group == "kir") %>% count(gene) %>% nrow
n_mic <- d %>% filter(group == "mic") %>% count(gene) %>% nrow
n_cyt <- d %>% filter(group == "cyt") %>% count(gene) %>% nrow

Table of Contents

render_toc("README.Rmd", toc_depth = 2)

Introduction

Here, we share a single file afnd.tsv (`r file_size("afnd.tsv")`) in tab-delimited format with all allele frequencies for `r n_hla` HLA genes, `r n_kir` KIR genes, `r n_mic` MIC genes, and `r n_cyt` cytokine genes from the Allele Frequency Net Database (AFND).

The script allelefrequencies.py automatically downloads allele frequencies from the website.

What is the Allele Frequency Net Database?

The Allele Frequency Net Database (AFND) is a public database which contains frequency information of several immune genes such as Human Leukocyte Antigens (HLA), Killer-cell Immunoglobulin-like Receptors (KIR), Major histocompatibility complex class I chain-related (MIC) genes, and a number of cytokine gene polymorphisms.

The afnd.tsv file looks like this:

d <- fread("afnd.tsv")
head(d)

Definitions:

  • alleles_over_2n (Alleles / 2n) Allele frequency: the number of copies of the allele in the population sample divided by the number of chromosomes sampled (2n), reported as a decimal with three decimal places.

  • indivs_over_n (100 * Individuals / n) Percentage of individuals who have the allele or gene.

  • n (Individuals) Number of individuals sampled from the population.
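
As a small worked example of these definitions, the sketch below computes both quantities for a made-up sample. The `genotypes` data frame and the names `example_allele` and `n_indivs` are hypothetical and not taken from AFND; they only illustrate how the two columns relate to the underlying counts.

```r
# Hypothetical sample of 4 individuals, each carrying two alleles
# (made-up data for illustration, not taken from AFND).
genotypes <- data.frame(
  allele_1 = c("A*01:01", "A*01:01", "A*02:01", "A*03:01"),
  allele_2 = c("A*01:01", "A*02:01", "A*02:01", "A*02:01")
)
example_allele <- "A*01:01"
n_indivs <- nrow(genotypes)

# Alleles / 2n: copies of the allele divided by the number of chromosomes.
copies <- sum(genotypes$allele_1 == example_allele) +
  sum(genotypes$allele_2 == example_allele)
alleles_over_2n <- round(copies / (2 * n_indivs), 3)

# 100 * Individuals / n: percentage of individuals with at least one copy.
carriers <- sum(
  genotypes$allele_1 == example_allele | genotypes$allele_2 == example_allele
)
indivs_over_n <- 100 * carriers / n_indivs
```

In this toy sample the allele has 3 copies among 8 chromosomes, so alleles_over_2n is 0.375, and 2 of the 4 individuals carry it, so indivs_over_n is 50.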

Examples

Here are a few examples of how we can use R to analyze these data.

View the largest and smallest populations available in the data:

d %>%
  mutate(n = parse_number(n)) %>%
  select(population, n) %>%
  unique() %>%
  arrange(-n)

Count the number of alleles for each gene:

d %>%
  count(group, gene, allele) %>%
  count(group, gene) %>%
  arrange(-n) %>%
  head(15)

Sum the allele frequencies for each gene in each population. This allows us to see which populations have a set of allele frequencies that adds up to 1 (100 percent):

d %>%
  mutate(alleles_over_2n = parse_number(alleles_over_2n)) %>%
  filter(alleles_over_2n > 0) %>%
  group_by(group, gene, population) %>%
  summarize(sum = sum(alleles_over_2n)) %>%
  # near() compares with a small tolerance to avoid floating-point mismatches
  count(near(sum, 1))

theme_set(
  theme_bw(base_size = 14) +
  theme(
    plot.caption.position = "plot"
  )
)
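
When checking whether a set of frequencies sums to 1, keep in mind that decimal numbers are stored with floating-point rounding, so an exact comparison can fail even when the values look right. The sketch below uses made-up frequencies and an assumed tolerance of 0.001 (a hypothetical choice, not a value from AFND) to show a more forgiving check.

```r
# Exact comparison of decimals can fail because of floating-point rounding.
exact_match <- (0.1 + 0.2) == 0.3  # FALSE in double-precision arithmetic

# Made-up allele frequencies for one gene in one population.
freqs <- c(0.1, 0.2, 0.3, 0.4)
total <- sum(freqs)

# Assumed tolerance; choose a value that suits your analysis.
tolerance <- 0.001
sums_to_one <- abs(total - 1) < tolerance
```

The `near()` function in dplyr performs the same kind of comparison with a much smaller default tolerance.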

Plot the frequency of a specific allele in populations with more than 1000 sampled individuals:


my_allele <- "DQB1*02:01"
my_d <- d %>%
  filter(allele == my_allele) %>%
  mutate(
    n = parse_number(n),
    alleles_over_2n = parse_number(alleles_over_2n)
  ) %>%
  filter(n > 1000) %>%
  arrange(-alleles_over_2n)

ggplot(my_d) +
  aes(x = alleles_over_2n, y = reorder(population, alleles_over_2n)) +
  scale_y_discrete(position = "right") +
  geom_colh() +
  labs(
    x = "Allele Frequency (Alleles / 2N)",
    y = NULL,
    title =  glue("Frequency of {my_allele} across populations"),
    caption = "Data from AFND http://allelefrequencies.net"
  )

Citation

If you use these data, please cite the latest manuscript about the Allele Frequency Net Database:

@ARTICLE{Gonzalez-Galarza2020,
  title    = "{Allele frequency net database (AFND) 2020 update: gold-standard
              data classification, open access genotype data and new query
              tools}",
  author   = "Gonzalez-Galarza, Faviel F and McCabe, Antony and Santos, Eduardo
              J Melo Dos and Jones, James and Takeshita, Louise and
              Ortega-Rivera, Nestor D and Cid-Pavon, Glenda M Del and
              Ramsbottom, Kerry and Ghattaoraya, Gurpreet and Alfirevic, Ana
              and Middleton, Derek and Jones, Andrew R",
  journal  = "Nucleic acids research",
  volume   =  48,
  number   = "D1",
  pages    = "D783--D788",
  month    =  jan,
  year     =  2020,
  language = "en",
  issn     = "0305-1048, 1362-4962",
  pmid     = "31722398",
  doi      = "10.1093/nar/gkz1029",
  pmc      = "PMC7145554"
}

Related work

Here are all of the resources I could find that have information about HLA allele frequencies in different populations.

HLAfreq

https://github.com/Vaccitech/HLAfreq/

CIWD version 3.0.0

The authors provide xlsx files on their website, but the frequency information is binned into categories:

  • C, common
  • I, intermediate
  • WD, well-documented
  • NA, not applicable

There is a tool called HLA-Net that provides a visualization of the CIWD data.

IEDB Tools

http://tools.iedb.org/population/download

At the IEDB Tools page, we can find a tool called Population Coverage. The authors have downloaded the HLA frequency information from AFND and saved it in a Python pickle file.

dbMHC

https://www.ncbi.nlm.nih.gov/gv/mhc

The dbMHC database and website appear to be discontinued, but an archive of old files is still available via FTP.

NMDP

https://bioinformatics.bethematchclinical.org/hla-resources/haplotype-frequencies/high-resolution-hla-alleles-and-haplotypes-in-the-us-population/

Acknowledgments

Thanks to David A. Wells for sharing scrapeAF, which inspired me to work on this project.