A frictionless, pipeable approach to dealing with summary statistics
library(skimr)
options(tibble.width = Inf)
options(width = 100)
skimr provides a frictionless approach to summary statistics which conforms to the principle of least surprise, displaying summary statistics the user can skim quickly to understand their data. It handles different data types and returns a skim_df object which can be included in a pipeline or displayed nicely for the human reader.
Note: skimr version 2 has major changes when skimr is used programmatically. Upgraders should review this document, the release notes and vignettes carefully.
The current released version of skimr can be installed from CRAN. If you wish to install the current build of the next release you can do so using the following:
# install.packages("devtools")
devtools::install_github("ropensci/skimr")
The APIs for this branch should be considered reasonably stable but still subject to change if an issue is discovered.
To install the version with the most recent changes that have not yet been incorporated in the main branch (and may not be):
devtools::install_github("ropensci/skimr", ref = "develop")
Do not rely on APIs from the develop branch, as they are likely to change.
skimr: ... summary(), including missing, ...
skim(chickwts)
skim(iris)
skim(dplyr::starwars)
skim(iris) %>%
summary()
skim(iris, Sepal.Length, Petal.Length)
skim() can handle data that has been grouped using dplyr::group_by().
iris %>%
dplyr::group_by(Species) %>%
skim()
iris %>%
skim() %>%
dplyr::filter(numeric.sd > 1)
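Because the skimmed result is a data frame underneath, the same kind of filtering can be sketched with base R alone. The snippet below is a minimal illustration, not part of the package documentation; the object names skimmed, numeric_rows and high_sd are made up for this example, and it assumes the default skimmers, which produce the skim_type, skim_variable and numeric.sd columns.

```r
library(skimr)

# Skim the data and drop the skim_df class to get a plain data frame.
skimmed <- as.data.frame(skim(iris))

# Keep only the rows that describe numeric columns.
numeric_rows <- skimmed[skimmed$skim_type == "numeric", ]

# Names of the numeric variables whose standard deviation exceeds 1.
high_sd <- numeric_rows$skim_variable[numeric_rows$numeric.sd > 1]
print(high_sd)
```

Converting with as.data.frame() first is a deliberate choice here: it sidesteps any special printing or subsetting behaviour of the skim_df class when only the raw values are needed.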
Simply skimming a data frame will produce the horizontal print layout shown above. We provide a knit_print method for the types of objects in this package so that similar results are produced in documents. To use this, make sure the skimmed object is the last item in your code chunk.
faithful %>%
skim()
Although skimr provides opinionated defaults, it is highly customizable. Users can specify their own statistics, change the formatting of results, create statistics for new classes and develop skimmers for data structures that are not data frames.
Users can specify their own statistics using a list combined with the skim_with() function factory. skim_with() returns a new skim function that can be called on your data. You can use this factory to produce summaries for any type of column within your data.
Assignment within a call to skim_with() relies on a helper function, sfl or skimr function list. By default, functions in the sfl call are appended to the default skimmers, and names are automatically generated as well.
my_skim <- skim_with(numeric = sfl(mad))
my_skim(iris, Sepal.Length)
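To see the automatically generated names, one option is to inspect the column names of the result. This is a small sketch rather than an official recipe; the names auto_skim, named_skim and median_abs_dev are illustrative, and it assumes result columns are prefixed with the skim type, as in numeric.sd.

```r
library(skimr)

# An unnamed function gets its name from the function itself.
auto_skim <- skim_with(numeric = sfl(mad))
auto_result <- auto_skim(iris, Sepal.Length)

# A named entry in sfl() sets the statistic's name explicitly.
named_skim <- skim_with(numeric = sfl(median_abs_dev = mad))
named_result <- named_skim(iris, Sepal.Length)

# Both results keep the default skimmers because append defaults to TRUE.
print(names(auto_result))
print(names(named_result))
```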
But you can also use helpers from the tidyverse to create new anonymous functions that set particular function arguments. The behavior is the same as in purrr or dplyr, with both . and .x as acceptable pronouns. Setting the append = FALSE argument uses only those functions that you've provided.
my_skim <- skim_with(
numeric = sfl(
iqr = IQR,
p01 = ~ quantile(.x, probs = .01),
p99 = ~ quantile(., probs = .99)
),
append = FALSE
)
my_skim(iris, Sepal.Length)
And you can remove default skimmers by setting them to NULL.
my_skim <- skim_with(numeric = sfl(hist = NULL))
my_skim(iris, Sepal.Length)
skimr has summary functions for the following types of data by default:
numeric (which includes both double and integer)
character
factor
logical
complex
Date
POSIXct
ts
AsIs
skimr also provides a small API for writing packages that provide their own default summary functions for data types not covered above. It relies on R S3 methods for the get_skimmers function. This function should return a sfl, similar to customization within skim_with(), but you should also provide a value for the skim_type argument. Here's an example.
get_skimmers.my_data_type <- function(column) {
  sfl(
    # Label used for this class of column in the skim output.
    skim_type = "my_data_type",
    # A formula, so that . refers to the column being summarized.
    p99 = ~ quantile(., probs = .99)
  )
}
We are aware that there are issues with rendering the inline histograms and line charts in various contexts, some of which are described below.
There are known issues with printing the spark-histogram characters when printing a data frame. For example, "▂▅▇" is printed as "<U+2582><U+2585><U+2587>". This longstanding problem originates in the low-level code for printing data frames.
While some cases have been addressed, there are, for example, reports of this
issue in Emacs ESS. While this is a deep issue, there is ongoing
work to address it in base R.
This means that while skimr can render the histograms to the console and in RMarkdown documents, it cannot in other circumstances. This includes:
Converting a skimr data frame to a vanilla R data frame, although tibbles render correctly.
One workaround for showing these characters in Windows is to set the CTYPE part of your locale to Chinese/Japanese/Korean with Sys.setlocale("LC_CTYPE", "Chinese"). The helper function fix_windows_histograms() does this for you.
And last but not least, we provide skim_without_charts() as a fallback.
This makes it easy to still get summaries of your data, even if unicode issues
continue.
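As a rough illustration of the fallback, the sketch below compares the columns returned by skim() and skim_without_charts() on the same data. The object names with_charts and no_charts are made up for this example, and it assumes the default numeric skimmers, where the inline histogram is stored in a column called numeric.hist.

```r
library(skimr)

# Default skim: numeric columns include an inline histogram.
with_charts <- skim(iris)

# Fallback skim: same summaries, minus the chart columns.
no_charts <- skim_without_charts(iris)

# Columns present in the default result but absent from the fallback.
print(setdiff(names(with_charts), names(no_charts)))
```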
Spark-bar and spark-line work in the console, but may not work when you knit
them to a specific document format. The same session that produces a correctly
rendered HTML document may produce an incorrectly rendered PDF, for example.
This issue can generally be addressed by changing fonts to one with good
building block (for histograms) and Braille support (for line graphs). For
example, the open font "DejaVu Sans" from the extrafont package supports these. You may also want to try wrapping your results in knitr::kable().
Please see the vignette on using fonts for details.
Displays in documents of different types will vary. For example, one user found that the font "Yu Gothic UI Semilight" produced consistent results for Microsoft Word and LibreOffice Writer.
The earliest use of unicode characters to generate sparklines appears to be from 2009.
Exercising these ideas to their fullest requires a font with good support for block drawing characters. PragmataPro is one such font.
We welcome issue reports and pull requests, including potentially adding support for commonly used variable classes. However, in general, we encourage users to take advantage of skimr's flexibility to add their own customized classes. Please see the contributing and conduct documents.