Advanced and Fast Data Transformation in R
OTHER License
Updated 'collapse and sf' vignette to reflect the recent support for units objects, and added a few more examples.
Fixed a bug in join() where a full join silently became a left join if there are no matches between the tables (#574). Thanks @D3SL for reporting.
Added function group_by_vars(): A standard evaluation version of fgroup_by() that is slimmer and safer for programming, e.g. data |> group_by_vars(ind1) |> collapg(custom = list(fmean = ind2, fsum = ind3)). Or, using magrittr:
library(magrittr)
set_collapse(mask = "manip") # for fgroup_vars -> group_vars
data %>%
  group_by_vars(ind1) %>% {
    add_vars(
      group_vars(., "unique"),
      get_vars(., ind2) %>% fmean(keep.g = FALSE) %>% add_stub("mean_"),
      get_vars(., ind3) %>% fsum(keep.g = FALSE) %>% add_stub("sum_")
    )
  }
Added function as_integer_factor() to turn factors/factor columns into integer vectors. as_numeric_factor() already exists, but is memory inefficient for most factors where levels can be integers.
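The difference between the two converters can be illustrated with a small sketch. It assumes a collapse version that already exports as_integer_factor(); the factor f is made up for illustration.

```r
library(collapse)

# A factor whose levels are integer-like strings (illustrative data)
f <- factor(c("10", "20", "10", "30"))

codes <- as.integer(f)          # internal level codes, not the level values
ints  <- as_integer_factor(f)   # level values, stored as integers
nums  <- as_numeric_factor(f)   # same values, stored as doubles

is.integer(ints)
is.double(nums)
```

Integers take 4 bytes per element and doubles 8, which is where the memory saving comes from when levels are whole numbers.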
join() now internally checks if the rows of the joined datasets match exactly. This check, using identical(m, seq_row(y)), is inexpensive, but, if TRUE, saves a full subset and deep copy of y. Thus join() now inherits the intelligence already present in functions like fsubset(), roworder() and funique(): a key to efficient data manipulation is simply doing less.
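The row-alignment check can be reproduced outside of join() as a rough sketch. Here m is computed with fmatch() on a single key column; the data frames are illustrative, and the internals of join() may differ in detail.

```r
library(collapse)

x <- data.frame(id = 1:4, a = letters[1:4])
y <- data.frame(id = 1:4, b = c(10, 20, 30, 40))
y_shuffled <- y[c(2, 1, 4, 3), ]

# Match vector: position in y of each row of x
m_same <- fmatch(x$id, y$id)
m_diff <- fmatch(x$id, y_shuffled$id)

# TRUE means y is already aligned with x, so no subset/copy of y is needed
rows_align      <- identical(m_same, seq_row(y))
rows_misaligned <- identical(m_diff, seq_row(y_shuffled))
```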
In join(), if attr = TRUE, the count option to fmatch() is always invoked, so that the attribute attached always has the same form, regardless of verbose or validate settings.
roworder[v]() has optional setting verbose = 2L to indicate if x is already sorted, making the call to roworder[v]() redundant.
Published by SebKrantz 6 months ago
collapse now explicitly supports xts/zoo and units objects and concurrently removes an additional check in the .default method of statistical functions that called the matrix method if is.matrix(x) && !inherits(x, "matrix"). This was a smart solution to account for the fact that xts objects are matrix-based but don't inherit the "matrix" class, thus wrongly calling the default method. The same is the case for units, but here, my recent more intensive engagement with spatial data convinced me that this should be changed. For one, under the previous heuristic solution, it was not possible to call the default method on a units matrix, e.g., fmean.default(st_distance(points_sf)) called fmean.matrix() and yielded a vector. This should not be the case. Secondly, aggregation, e.g. fmean(st_distance(points_sf)) or fmean(st_distance(points_sf), g = group_vec), yielded a plain numeric object that lost the units class (in line with the general attribute handling principles).
Therefore, I have now decided to remove the heuristic check within the default methods, and explicitly support zoo and units objects. For Fast Statistical Functions, the methods are FUN.zoo <- function(x, ...) if(is.matrix(x)) FUN.matrix(x, ...) else FUN.default(x, ...) and FUN.units <- function(x, ...) if(is.matrix(x)) copyMostAttrib(FUN.matrix(x, ...), x) else FUN.default(x, ...). While the behavior for xts/zoo remains the same, the behavior for units is enhanced, as now the class is preserved in aggregations (the .default method preserves attributes except for ts), and it is possible to manually invoke the .default method on a units matrix and obtain an aggregate statistic.
This change may impact computations on other matrix based classes which don't inherit from "matrix" (mts does inherit from "matrix", and I am not aware of any other affected classes, but user code like m <- matrix(rnorm(25), 5); class(m) <- "bla"; fmean(m) will now yield a scalar instead of a vector. Such code must be adjusted to either class(m) <- c("bla", "matrix") or fmean.matrix(m)). Overall, the change makes collapse behave in a more standard and predictable way, and enhances its support for units objects central in the sf ecosystem.
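The effect on matrix-based classes that do not inherit from "matrix" can be checked with the example from the text. This sketch assumes a collapse version that includes the change; "bla" is a made-up class name.

```r
library(collapse)

set.seed(1)
m <- matrix(rnorm(25), 5)
class(m) <- "bla"   # hypothetical class that does not inherit from "matrix"

overall <- fmean(m)          # default method: a single mean over all elements
by_col  <- fmean.matrix(m)   # explicit matrix method: one mean per column

length(overall)
length(by_col)
```

Calling the matrix method explicitly, or adding "matrix" to the class vector, restores the column-wise result.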
fquantile() now also preserves the attributes of the input, in line with quantile().
Published by SebKrantz 7 months ago
Published by SebKrantz 8 months ago
An article on collapse has been submitted to the Journal of Statistical Software. The preprint is available through arXiv.
Removed magrittr from most documentation examples (using base pipe).
Improved plot.GRP a little bit, on request of JSS editors.
Published by SebKrantz 9 months ago
Fixed a bug in fmatch() when matching integer vectors to factors. This also affected join().
Improved cross-platform compatibility of OpenMP flags. Thanks @kalibera.
Added stub = TRUE argument to the grouped_df methods of Fast Statistical Functions supporting weights, to be able to remove or alter prefixes given to aggregated weights columns if keep.w = TRUE. Globally, users can set set_collapse(stub = FALSE) to disable this prefixing in all statistical functions and operators.
Published by SebKrantz 10 months ago
Added functions na_locf() and na_focb() for fast basic C implementations of these procedures (optionally by reference). replace_na() now also has a type argument which supports options "locf" and "focb" (default "const"), similar to data.table::nafill. The implementation also supports character data and list-columns (NULL/empty elements). Thanks @BenoitLondon for suggesting (#489). I note that na_locf() exists in some other packages (such as imputeTS) where it is implemented in R and has additional options. Users should utilize the flexible namespace, i.e. set_collapse(remove = "na_locf"), to deal with this.
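A minimal sketch of the two fill directions, using an illustrative numeric vector. Missing values with no observation to carry from (leading for locf, trailing for focb) stay missing.

```r
library(collapse)

x <- c(NA, 1, NA, NA, 4, NA)

locf <- na_locf(x)   # last observation carried forward: NA 1 1 1 4 4
focb <- na_focb(x)   # first observation carried back:   1 1 4 4 4 NA

# Same forward fill through the type argument of replace_na()
via_replace <- replace_na(x, type = "locf")
```

By default these calls return a new vector; the by-reference variant mentioned in the text modifies the input instead.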
Fixed a bug in weighted quantile estimation (fquantile()) that could lead to wrong/out-of-range estimates in some cases. Thanks @zander-prinsloo for reporting (#523).
Improved right join such that join column names of x instead of y are preserved. This is more consistent with the other joins when join columns in x and y have different names.
More fluent and safe interplay of 'mask' and 'remove' options in set_collapse(): it is now seamlessly possible to switch from any combination of 'mask' and 'remove' to any other combination without the need of setting them to NULL first.
Published by SebKrantz 10 months ago
In pivot(..., values = [multiple columns], labels = "new_labels_column", how = "wider"), if the columns selected through values already have variable labels, they are concatenated with the new labels provided through "new_labels_column" using " - " as a separator (similar to names where the separator is "_").
whichv() and operators %==%, %!=% now properly account for missing double values, e.g. c(NA_real_, 1) %==% c(NA_real_, 1) yields c(1, 2) rather than 2. Thanks @eutwt for flagging this (#518).
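A short sketch of the corrected behavior, assuming a collapse version that includes this fix. The vectors are illustrative; both operators return indices rather than a logical vector.

```r
library(collapse)

x <- c(NA_real_, 1, 3)
y <- c(NA_real_, 1, 2)

eq  <- x %==% y   # positions where x and y agree, with NA matching NA
neq <- x %!=% y   # positions where they differ
```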
In setv(X, v, R), if the type of R is higher than that of X, e.g. setv(1:10, 1:3, 9.5), then a warning is issued that conversion of R to the lower type (real to integer in this case) may incur loss of information. Thanks @tony-aw for suggesting (#498).
frange() has an option finite = FALSE, like base::range. Thanks @MLopez-Ibanez for suggesting (#511).
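As a small illustration, setting finite = TRUE excludes infinite values from the range, analogous to base::range(). The input vector is made up.

```r
library(collapse)

x <- c(2, Inf, -1, NA, -Inf)

r_all    <- frange(x)                  # infinite values included
r_finite <- frange(x, finite = TRUE)   # infinite (and missing) values ignored
```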
varying.pdata.frame(..., any_group = FALSE) now unindexes the result (as should be the case).
Published by SebKrantz 11 months ago
Fixed a bug in full join if verbose = 0. Thanks @zander-prinsloo for reporting.
Added argument multiple = FALSE to join(). Setting multiple = TRUE performs a multiple-matching join where a row in x is matched to all matching rows in y. The default FALSE just takes the first matching row in y.
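The difference between the two settings shows up when a key occurs more than once in y. The following sketch uses two small illustrative data frames and the default left join.

```r
library(collapse)

x <- data.frame(id = c(1, 2), a = c("p", "q"))
y <- data.frame(id = c(1, 1, 2), b = c(10, 11, 20))

# Default: each row of x is paired with the first matching row of y
first_match <- join(x, y, on = "id", verbose = 0)

# multiple = TRUE: each row of x is paired with every matching row of y
all_matches <- join(x, y, on = "id", multiple = TRUE, verbose = 0)

nrow(first_match)
nrow(all_matches)
```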
Improved recode/replace functions. Notably, replace_outliers() now supports option value = "clip" to replace outliers with the respective upper/lower bounds, and also has option single.limit = "mad" which removes outliers exceeding a certain number of median absolute deviations. Furthermore, all functions now have a set argument which fully applies the transformations by reference.
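A hedged sketch of clipping with a two-sided threshold. It assumes that limits = c(min, max) defines the bounds and that values outside them count as outliers; the data are illustrative.

```r
library(collapse)

x <- c(1, 5, 10, 100)

# Default: outliers outside [2, 50] are replaced with NA
as_na <- replace_outliers(x, limits = c(2, 50))

# value = "clip": outliers are replaced with the nearest bound
clipped <- replace_outliers(x, limits = c(2, 50), value = "clip")
```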
Functions replace_NA and replace_Inf were renamed to replace_na and replace_inf to make the namespace a bit more consistent. The earlier versions remain available.
Published by SebKrantz 12 months ago
Fixed a serious bug in qsu() where higher order weighted statistics were erroneous, i.e. whenever qsu(x, ..., w = weights, higher = TRUE) was invoked, the 'SD', 'Skew' and 'Kurt' columns were wrong (if higher = FALSE the weighted 'SD' is correct). The reason is that there appears to be no straightforward generalization of Welford's Online Algorithm to higher-order weighted statistics. This was not detected earlier because the algorithm was only tested with unit weights. The fix involved replacing Welford's Algorithm for the higher-order weighted case by a 2-pass method that additionally uses long doubles for higher-order terms. Thanks @randrescastaneda for reporting.
Fixed some unexpected behavior in t_list() where names 'V1', 'V2', etc. were assigned to unnamed inner lists. It now preserves the missing names. Thanks @orgadish for flagging this.
Published by SebKrantz 12 months ago
In join(), if y is an expression, e.g. join(x = mtcars, y = subset(mtcars, mpg > 20)), then its name is not extracted but just set to "y". Before, the name of y would be captured as as.character(substitute(y))[1] = "subset" in this case. This is an improvement mainly for display purposes, but could also affect code if there are duplicate columns in both datasets and suffix was not provided in the join() call: before, y-columns would be renamed using a (non-sensible) "_subset" suffix, but now using a "_y" suffix. Note that this only concerns cases where y is an expression rather than a single object.
Small performance improvements to %[!]in% operators: %!in% now uses is.na(fmatch(x, table)) rather than fmatch(x, table, 0L) == 0L, and %in%, if exported using set_collapse(mask = "%in%"|"special"|"all"), is as.logical(fmatch(x, table, 0L)) instead of fmatch(x, table, 0L) > 0L. The new versions are faster because comparison operators >, == with integers additionally need to check for NA's (= the smallest integer in C).
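The two formulations can be compared directly with fmatch(). This is a sketch of the idea on illustrative vectors, not the package's internal code.

```r
library(collapse)

x     <- c("a", "z", "b", NA)
table <- c("a", "b", "c")

# Negated membership: an unmatched element has match position NA
not_in_new <- is.na(fmatch(x, table))
not_in_old <- fmatch(x, table, 0L) == 0L

# Membership: with nomatch = 0L, any positive position coerces to TRUE
in_new <- as.logical(fmatch(x, table, 0L))
in_old <- fmatch(x, table, 0L) > 0L
```

Both pairs give the same logical vectors; the newer forms simply avoid an integer comparison.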
Published by SebKrantz 12 months ago
In fnth()/fquantile(), there has been a slight change to the weighted quantile algorithm. As outlined in the documentation, this algorithm gives weighted versions for all continuous quantile methods (type 7-9) in R by replacing sample quantities with their weighted counterparts. E.g., for the default quantile type 7, the continuous (lower) target element is (n - 1) * p. In the weighted algorithm, this became (sum(w) - mean(w)) * p and was compared to the cumulative sum of ordered (by x) weights, to preserve equivalence of the algorithms in cases where the weights are all equal. However, on second thought, the use of mean(w) does not really reflect a standard interpretation of the weights as frequencies. I have reasoned that using min(w) instead of mean(w) better reflects such an interpretation, as the minimum (non-zero) weight reflects the size of the smallest sampled unit. So the weighted quantile type 7 target is now (sum(w) - min(w)) * p, and the other methods have been adjusted accordingly (note that zero weight observations are ignored in the algorithm).
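The type 7 target formulas can be evaluated in base R to see how the two rules relate. The weights below are made up, and the sketch only computes the target quantities, not the full quantile algorithm.

```r
p <- c(0.25, 0.5, 0.75)
n <- 5

# With unit weights, the weighted target reduces to the unweighted one
w_unit <- rep(1, n)
target_unweighted <- (n - 1) * p
target_unit       <- (sum(w_unit) - min(w_unit)) * p

# With unequal weights, the earlier and current rules differ
w <- c(1, 2, 3, 4, 10)
target_mean <- (sum(w) - mean(w)) * p   # earlier rule
target_min  <- (sum(w) - min(w)) * p    # current rule
```

Here sum(w) is 20, so the earlier rule scales p by 16 and the current rule by 19.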
This is more a note than a change to the package: there is an issue with vctrs that users can encounter using collapse together with the tidyverse (especially ggplot2), which is that collapse internally optimizes computations on factors by giving them an additional "na.included" class if they are known to not contain any missing values. For example pivot(mtcars) gives a "variable" factor which has class c("factor", "na.included"), such that grouping on "variable" in subsequent operations is faster. Unfortunately, pivot(mtcars) |> ggplot(aes(y = value)) + geom_histogram() + facet_wrap( ~ variable) currently gives an error produced by vctrs, because vctrs does not implement a standard S3 method dispatch and thus does not ignore the "na.included" class. It turns out that the only way for me to deal with this would be to swap the order of classes, i.e. c("na.included", "factor"), import vctrs, and implement vec_ptype2 and vec_cast methods for "na.included" objects. This will never happen, as collapse is and will remain independent of the tidyverse. There are two ways you can deal with this: The first way is to remove the "na.included" class for ggplot2, e.g. facet_wrap( ~ set_class(variable, "factor")) or facet_wrap( ~ factor(variable)) will both work. The second option is to define a function vec_ptype2.factor.factor <- function(x, y, ...) x in your global environment, which avoids vctrs performing extra checks on factor objects.
Published by SebKrantz about 1 year ago
Fixed a signed integer overflow inside a hash function detected by CRAN checks (changing to unsigned int).
Updated the cheatsheet (see README.md).
Published by SebKrantz about 1 year ago
Added global option 'stub' (default TRUE) to set_collapse(). It is passed to the stub(s) arguments of the statistical operators B, W, STD, HDB, HDW, L, D, Dlog, G (in .OPERATOR_FUN). By default these operators add a prefix/stub to matrix or data.frame columns transformed by them. Setting set_collapse(stub = FALSE) now allows switching off this behavior such that columns are not prepended with a prefix by default.
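A brief sketch of the global option, using the within-transformation operator W() on a few mtcars columns. It assumes a collapse version that supports the 'stub' option, and restores the default at the end.

```r
library(collapse)

d <- mtcars[1:3]

with_stub <- names(W(d))      # default: columns get a "W." prefix

set_collapse(stub = FALSE)    # switch prefixing off globally
no_stub <- names(W(d))        # original column names are kept

set_collapse(stub = TRUE)     # restore the default
```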
roworder[v]() now also supports grouped data frames, but prints a message indicating that this is inefficient (also for indexed data). An additional argument verbose can be set to 0 to avoid such messages.
Published by SebKrantz about 1 year ago
%in% with set_collapse(mask = "%in%") does not warn about overidentification when used with data frames.
Fixed several typos in the documentation.
Published by SebKrantz about 1 year ago
collapse 2.0, released in mid-October 2023, introduces fast table joins and data reshaping capabilities alongside other convenience functions, and enhances the package's global configurability, including interactive namespace control.
If .data is used inside fsummarise() and fmutate(), and .cols = NULL, .data will contain all columns except for grouping columns (in line with the .SD syntax of data.table). Before, .data contained all columns. The selection in .cols still refers to all columns, thus it is still possible to select all columns using e.g. grouped_data %>% fsummarise(some_expression_involving(.data), .cols = seq_col(.)).
In qsu(), argument vlabels was renamed to labels. But vlabels will continue to work.
Fixed a bug in fsum(), fmean() and fprod() that returned NA if and only if there was a single integer followed by NA's, e.g. fsum(c(1L, NA, NA)) erroneously gave NA. This was caused by a C-level shortcut that returned NA when the first element of the vector had been reached (moving from back to front) without encountering any non-NA values. The bug consisted in the content of the first element not being evaluated in this case. Note that this bug did not occur with real numbers, and also not in grouped execution. Thanks @blset for reporting (#432).
Added join(): class-agnostic, vectorized, and (default) verbose joins for R, modeled after the polars API. Two different join algorithms are implemented: a hash-join (default, if sort = FALSE) and a sort-merge-join (if sort = TRUE).
Added pivot(): fast and easy data reshaping! It supports longer, wider and recast pivoting, including handling of variable labels, through a uniform and parsimonious API. It does not perform data aggregation, and by default does not check if the data is uniquely identified by the supplied ids. Underidentification for 'wide' and 'recast' pivots results in the last value being taken within each group. Users can toggle a duplicates check by setting check.dups = TRUE.
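A minimal round trip between long and wide form, as a sketch on an illustrative data frame. It relies on the default output column names "variable" and "value" for the long format.

```r
library(collapse)

df <- data.frame(id = 1:3, x = c(1, 2, 3), y = c(4, 5, 6))

# Longer: all non-id columns are stacked into variable/value pairs
long <- pivot(df, ids = "id")

# Wider: spread the stacked values back into one column per variable
wide <- pivot(long, ids = "id", names = "variable", values = "value",
              how = "wider")
```

Since id uniquely identifies the rows here, no values are dropped in the wide pivot.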
Added rowbind(): a fast class-agnostic alternative to rbind.data.frame() and data.table::rbindlist().
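For illustration, rowbind() accepts either comma-separated objects or a single list of them. The two data frames below are made up.

```r
library(collapse)

a <- data.frame(id = 1:2, v = c(1.5, 2.5))
b <- data.frame(id = 3:4, v = c(3.5, 4.5))

ab      <- rowbind(a, b)         # comma-separated inputs
ab_list <- rowbind(list(a, b))   # a single list of data frames
```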
Added fmatch(): a fast match() function for vectors and data frames/lists. It is the workhorse function of join(), and also benefits ckmatch(), %!in%, and new operators %iin% and %!iin% (see below). It is also possible to set_collapse(mask = "%in%") to replace base::"%in%" using fmatch(). Thanks to fmatch(), these operators also all support data frames/lists of vectors, which are compared row-wise.
Added operators %iin% and %!iin%: these directly return indices, i.e. %[!]iin% is equivalent to which(x %[!]in% table). This is useful especially for subsetting where directly supplying indices is more efficient, e.g. x[x %[!]iin% table] is faster than x[x %[!]in% table]. Similarly fsubset(wlddev, iso3c %iin% c("DEU", "ITA", "FRA")) is very fast.
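The equivalence with which() can be verified on a small illustrative vector:

```r
library(collapse)

x     <- c("DEU", "USA", "ITA", "JPN", "FRA")
table <- c("DEU", "ITA", "FRA")

idx     <- x %iin% table    # 1 3 5
idx_not <- x %!iin% table   # 2 4

matched <- x[idx]           # subsetting with indices instead of a logical vector
```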
Added vec(): efficiently turn matrices or data frames / lists into a single atomic vector. I am aware of multiple implementations in other packages, which are mostly inefficient. With atomic objects, vec() simply removes the attributes without copying the object, and with lists it directly calls C_pivot_longer.
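A short sketch of vec() on a matrix and on a data frame, using illustrative inputs. In both cases the result is a plain atomic vector in column order.

```r
library(collapse)

m  <- matrix(1:4, nrow = 2)
df <- data.frame(a = c(1, 2), b = c(3, 4))

v_mat <- vec(m)    # attributes dropped, elements in column-major order
v_df  <- vec(df)   # columns concatenated one after another
```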
set_collapse() now supports options 'mask' and 'remove', giving collapse a flexible namespace in the broadest sense that can be changed at any point within the active session:
'mask' supports base R or dplyr functions that can be masked into the faster collapse versions. E.g. library(collapse); set_collapse(mask = "unique") (or, equivalently, set_collapse(mask = "funique")) will create unique <- funique in the collapse namespace, export unique() from the namespace, and detach and attach the namespace again so R can find it. The re-attaching also ensures that collapse comes right after the global environment, implying that all its functions will take priority over other libraries. Users can use fastverse::fastverse_conflicts() to check which functions are masked after using set_collapse(mask = ...). The option can be changed at any time. Using set_collapse(mask = NULL) removes all masked functions from the namespace, and can also be called simply to ensure collapse is at the top of the search path.
'remove' allows removing arbitrary functions from the collapse namespace. E.g. set_collapse(remove = "D") will remove the difference operator D(), which also exists in stats to calculate symbolic and algorithmic derivatives (this is a convenient example but not necessary since collapse::D is S3 generic and will call stats::D() on R calls, expressions or names). This is safe to do as it only modifies which objects are exported from the namespace (it does not truly remove objects from the namespace). This option can also be changed at any time. set_collapse(remove = NULL) will restore the exported namespace.
For both options there exist a number of convenient keywords to bulk-mask / remove functions. For example set_collapse(mask = "manip", remove = "shorthand") will mask all data manipulation functions such as mutate <- fmutate and remove all function shorthands such as mtt (i.e. abbreviations for frequently used functions that collapse supplies for faster coding / prototyping).
set_collapse() also supports options 'digits', 'verbose' and 'stable.algo', enhancing the global configurability of collapse.
qM() now also has a row.names.col argument in the second position allowing generation of rownames when converting data frame-like objects to matrix, e.g. qM(iris, "Species") or qM(GGDC10S, 1:5) (interaction of id's).
as_factor_GRP() and finteraction() now have an argument sep = "." denoting the separator used for compound factor labels.
alloc() now has an additional argument simplify = TRUE. FALSE always returns list output.
frename() supports both old = new (pandas, used so far) and new = old (dplyr) style renaming conventions.
across() supports negative indices, also in grouped settings: these will select all variables apart from grouping variables.
TRA() allows shorthands "NA" for "replace_NA" and "fill" for "replace_fill".
group() experienced a minor speedup with >= 2 vectors as the first two vectors are now hashed jointly.
fquantile() with names = TRUE adds up to 1 digit after the comma in the percent-names, e.g. fquantile(airmiles, probs = 0.001) generates appropriate names (not 0% as in the previous version).
Published by SebKrantz over 1 year ago
New vignette on collapse's handling of R objects.
print.descr() with groups and option perc = TRUE (the default) also shows percentages of the group frequencies for each variable.
funique(mtcars[NULL, ], sort = TRUE) gave an error (for data frame with zero rows). Thanks @NicChr (#406).
Added SIMD vectorization for fsubset().
vlengths() now also works for strings, and is hence a much faster version of both lengths() and nchar(). Also for atomic vectors the behavior is like lengths(), e.g. vlengths(rnorm(10)) gives rep(1L, 10).
In collap[v/g](), the ... argument is now placed after the custom argument instead of after the last argument, in order to better guard against unwanted partial argument matching. In particular, previously the n argument passed to fnth was partially matched to na.last. Thanks @ummel for alerting me to this (#421).
Published by SebKrantz over 1 year ago
Using DATAPTR_RO to point to R lists because of the use of ALTLISTS on R-devel.
Replacing != loop controls for SIMD loops with < to ensure compatibility on all platforms. Thanks @albertus82 (#399).
Published by SebKrantz over 1 year ago
Improvements in get_elem()/has_elem(): Option invert = TRUE is implemented more robustly, and a function passed to get_elem()/has_elem() is now applied to all elements in the list, including elements that are themselves list-like. This enables the use of inherits to find list-like objects inside a broader list structure, e.g. get_elem(l, inherits, what = "lm") fetches all linear model objects inside l.
Fixed a small bug in descr() introduced in v1.9.0, producing an error if a data frame contained no numeric columns, because an internal function was not defined in that case. Also, POSIXct columns are handled better in print, preserving the time zone (thanks @cdignam-chwy #392).
fmean() and fsum() with g = NULL, as well as TRA(), setop(), and related operators %r+%, %+=% etc., setv() and fdist() now utilize Single Instruction Multiple Data (SIMD) vectorization by default (if OpenMP is enabled), enabling potentially very fast computing speeds. Whether these instructions are utilized during compilation depends on your system. In general, if you want to max out collapse on your system, consider compiling from source with CFLAGS += -O3 -march=native -fopenmp and CXXFLAGS += -O3 -march=native in your .R/Makevars.
Published by SebKrantz over 1 year ago
Added functions fduplicated() and any_duplicated(), for vectors and lists / data frames. Thanks @NicChr (#373).
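Both functions can be compared against base::duplicated() on illustrative inputs:

```r
library(collapse)

x  <- c("a", "b", "a", "c", "b")
df <- data.frame(g = c(1, 1, 2), h = c("u", "u", "v"))

dup_x   <- fduplicated(x)      # flags repeated elements after their first occurrence
any_x   <- any_duplicated(x)   # single TRUE/FALSE answer
dup_row <- fduplicated(df)     # rows are compared across all columns
```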
sort option added to set_collapse() to be able to set unordered grouping as a default. E.g. setting set_collapse(sort = FALSE) will affect collap(), BY(), GRP(), fgroup_by(), qF(), qG(), finteraction(), qtab() and internal use of these functions for ad-hoc grouping in fast statistical functions. Other uses of sort, for example in funique() where the default is sort = FALSE, are not affected by the global default setting.
Fixed a small bug in group() / funique() resulting in an unnecessary memory allocation error in rare cases. Thanks @NicChr (#381).