📂 HLA allele frequencies in tab-delimited format, downloaded from AFND.
MIT License
Kamil Slowikowski
r format(Sys.Date())
options(width=100)
library(data.table)
library(dplyr)
library(glue)
library(readr)
library(magrittr)
library(ggplot2)
library(ggstance)
devtools::source_gist("c83e078bf8c81b035e32c3fc0cf04ee8", filename = 'render_toc.R')
file_size <- function(x) glue("{fs::file_size(x)}B")
d <- fread("afnd.tsv")
n_hla <- d %>% filter(group == "hla") %>% count(gene) %>% nrow
n_kir <- d %>% filter(group == "kir") %>% count(gene) %>% nrow
n_mic <- d %>% filter(group == "mic") %>% count(gene) %>% nrow
n_cyt <- d %>% filter(group == "cyt") %>% count(gene) %>% nrow
Table of Contents
render_toc("README.Rmd", toc_depth = 2)
Here, we share a single file `afnd.tsv` (`r file_size("afnd.tsv")`) in tab-delimited format with all allele frequencies for `r n_hla` HLA genes, `r n_kir` KIR genes, `r n_mic` MIC genes, and `r n_cyt` cytokine genes from the Allele Frequency Net Database (AFND).
The script allelefrequencies.py automatically downloads allele frequencies from the website.
What is the Allele Frequency Net Database?
The Allele Frequency Net Database (AFND) is a public database that contains frequency information for several immune genes, such as Human Leukocyte Antigens (HLA), Killer-cell Immunoglobulin-like Receptors (KIR), Major histocompatibility complex class I chain-related (MIC) genes, and a number of cytokine gene polymorphisms.
The afnd.tsv file looks like this:
d <- fread("afnd.tsv")
head(d)
Definitions:
alleles_over_2n (Alleles / 2n): Allele frequency. The number of copies of the allele in the population sample divided by 2n, the total number of allele copies sampled, reported to three decimal places.
indivs_over_n (100 * Individuals / n): Percentage of individuals who have the allele or gene.
n (Individuals): Number of individuals sampled from the population.
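To make the two formulas concrete, here is a minimal sketch that computes both quantities by hand. The genotype table, the allele names, and the `target` variable are hypothetical and are not taken from `afnd.tsv`. The sketch assumes each individual carries exactly two alleles at the gene, and it uses base R only so that it runs without extra packages.

```r
# Hypothetical sample of n = 5 individuals, two alleles per individual.
genotypes <- data.frame(
  allele1 = c("A*01:01", "A*01:01", "A*02:01", "A*02:01", "A*03:01"),
  allele2 = c("A*01:01", "A*02:01", "A*02:01", "A*03:01", "A*03:01")
)

n <- nrow(genotypes)   # number of individuals sampled
target <- "A*01:01"    # allele of interest (hypothetical)

# Alleles: total copies of the target allele across both columns.
copies <- sum(genotypes$allele1 == target) + sum(genotypes$allele2 == target)

# Individuals: people who carry at least one copy of the target allele.
carriers <- sum(genotypes$allele1 == target | genotypes$allele2 == target)

alleles_over_2n <- round(copies / (2 * n), 3)  # Alleles / 2n
indivs_over_n <- 100 * carriers / n            # 100 * Individuals / n

print(c(alleles_over_2n = alleles_over_2n, indivs_over_n = indivs_over_n))
```

In this toy sample the allele appears 3 times among 10 sampled copies, and 2 of the 5 individuals carry it. A homozygous individual contributes two copies to the allele count but counts only once as a carrier, which is why the two measures can differ.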
Here are a few examples of how we can use R to analyze these data.
View the largest and smallest populations available in the data:
d %>%
mutate(n = parse_number(n)) %>%
select(population, n) %>%
unique() %>%
arrange(-n)
Count the number of alleles for each gene:
d %>%
count(group, gene, allele) %>%
count(group, gene) %>%
arrange(-n) %>%
head(15)
Sum the allele frequencies for each gene in each population. This allows us to see which populations have a set of allele frequencies that adds up to 100 percent:
d %>%
mutate(alleles_over_2n = parse_number(alleles_over_2n)) %>%
filter(alleles_over_2n > 0) %>%
group_by(group, gene, population) %>%
summarize(sum = sum(alleles_over_2n)) %>%
count(near(sum, 1))  # near() compares with a tolerance, avoiding floating-point equality problems
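Because the frequencies are decimals, a sum that should equal 1 can differ from it by a tiny rounding error, depending on the values and the platform. The sketch below shows one way to check the sums with a tolerance instead of exact equality. The `toy` table, its values, and the `1e-8` tolerance are assumptions chosen for illustration, and the code uses base R only so that it runs without extra packages.

```r
# Hypothetical frequencies for one gene in two populations.
toy <- data.frame(
  population = c("Pop A", "Pop A", "Pop A", "Pop B", "Pop B"),
  allele = c("X*01", "X*02", "X*03", "X*01", "X*02"),
  alleles_over_2n = c(0.1, 0.2, 0.7, 0.5, 0.4)
)

# Sum the frequencies within each population (named numeric vector).
sums <- vapply(split(toy$alleles_over_2n, toy$population), sum, numeric(1))

# Compare against 1 with a small tolerance rather than with ==.
tolerance <- 1e-8
sums_to_one <- abs(sums - 1) < tolerance

print(sums_to_one)
```

Here `Pop A` passes the check and `Pop B` does not, because its listed alleles only account for 0.9 of the total.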
theme_set(
theme_bw(base_size = 14) +
theme(
plot.caption.position = "plot"
)
)
Plot the frequency of a specific allele in populations with more than 1000 sampled individuals:
my_allele <- "DQB1*02:01"
my_d <- d %>% filter(allele == my_allele) %>%
mutate(
n = parse_number(n),
alleles_over_2n = parse_number(alleles_over_2n)
) %>%
filter(n > 1000) %>%
arrange(-alleles_over_2n)
ggplot(my_d) +
aes(x = alleles_over_2n, y = reorder(population, alleles_over_2n)) +
scale_y_discrete(position = "right") +
geom_colh() +
labs(
x = "Allele Frequency (Alleles / 2N)",
y = NULL,
title = glue("Frequency of {my_allele} across populations"),
caption = "Data from AFND http://allelefrequencies.net"
)
If you use these data, please cite the latest manuscript about the Allele Frequency Net Database:
@ARTICLE{Gonzalez-Galarza2020,
title = "{Allele frequency net database (AFND) 2020 update: gold-standard
data classification, open access genotype data and new query
tools}",
author = "Gonzalez-Galarza, Faviel F and McCabe, Antony and Santos, Eduardo
J Melo Dos and Jones, James and Takeshita, Louise and
Ortega-Rivera, Nestor D and Cid-Pavon, Glenda M Del and
Ramsbottom, Kerry and Ghattaoraya, Gurpreet and Alfirevic, Ana
and Middleton, Derek and Jones, Andrew R",
journal = "Nucleic acids research",
volume = 48,
number = "D1",
pages = "D783--D788",
month = jan,
year = 2020,
language = "en",
issn = "0305-1048, 1362-4962",
pmid = "31722398",
doi = "10.1093/nar/gkz1029",
pmc = "PMC7145554"
}
Here are all of the resources I could find that have information about HLA allele frequencies in different populations.
https://github.com/Vaccitech/HLAfreq/
The authors provide xlsx files on this website:
But the frequency information is binned into categories:
There is a tool called HLA-Net that provides a visualization of the CIWD data.
http://tools.iedb.org/population/download
At the IEDB Tools page, we can find a tool called Population Coverage. The authors have downloaded the HLA frequency information from AFND and saved it in a Python pickle file.
https://www.ncbi.nlm.nih.gov/gv/mhc
The dbMHC database and website appear to be discontinued, but an archive of old files is still available via FTP.
Thanks to David A. Wells for sharing scrapeAF, which inspired me to work on this project.