Advanced and Fast Data Transformation in R
OTHER License
Published by SebKrantz over 1 year ago
Further fix to an Address Sanitizer issue as required by CRAN (eliminating an unused out of bounds access at the end of a loop).
qsu() finally has a grouped_df method.
Added global options "collapse_nthreads" and "collapse_na.rm", which allow you to load collapse with different defaults, e.g. through an .Rprofile or .fastverse configuration file. Once collapse is loaded, these options have no effect, and users need to use set_collapse() to change .op[["nthreads"]] and .op[["na.rm"]] interactively.
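A minimal sketch of the two stages described above, assuming the options are set before collapse is loaded (for example in an .Rprofile); the particular values are only illustrative:

```r
# Stage 1: defaults read once, when the package is loaded
# (this could equally live in an .Rprofile)
options(collapse_nthreads = 1L, collapse_na.rm = FALSE)

library(collapse)

# Stage 2: after loading, the options above are no longer consulted,
# so the defaults are changed through set_collapse() instead
set_collapse(nthreads = 1L, na.rm = TRUE)

# Inspect the current settings (returned as a list)
settings <- get_collapse()
settings[["nthreads"]]
settings[["na.rm"]]
```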
Exported method plot.psmat() (can be useful to plot time series matrices).
Published by SebKrantz almost 2 years ago
Fixed minor C/C++ issues flagged by CRAN's detailed checks.
Added functions set_collapse() and get_collapse(), allowing you to globally set defaults for the nthreads and na.rm arguments to all functions in the package. E.g. set_collapse(nthreads = 4, na.rm = FALSE) could be a suitable setting for larger data without missing values. This is implemented using an internal environment by the name of .op, such that these defaults are retrieved using e.g. .op[["nthreads"]], at the computational cost of a few nanoseconds (8-10x faster than getOption("nthreads") which would take about 1 microsecond). .op is not accessible by the user, so function get_collapse() can be used to retrieve settings. Exempt from this are functions .quantile, and a new function .range (alias of frange), which go directly to C for maximum performance in repeated executions, and are not affected by these global settings. Function descr(), which internally calls a bunch of statistical functions, is also not affected by these settings.
Further improvements in thread safety for fsum() and fmean() in grouped computations across data frame columns. All OpenMP enabled functions in collapse can now be considered thread safe i.e. they pass the full battery of tests in multithreaded mode.
Published by SebKrantz almost 2 years ago
collapse 1.9.0, released in mid-January 2023, provides improvements in performance and versatility in many areas, as well as greater statistical capabilities, most notably efficient (grouped, weighted) estimation of sample quantiles.
All functions renamed in collapse 1.6.0 are now deprecated, to be removed at the end of 2023. These functions had already been giving messages since v1.6.0. See help("collapse-renamed").
The lead operator F() is not exported anymore from the package namespace, to avoid clashes with base::F flagged by multiple people. The operator is still part of the package and can be accessed using collapse:::F. I have also added an option "collapse_export_F", such that setting options(collapse_export_F = TRUE) before loading the package exports the operator as before. Thanks @matthewross07 (#100), @edrubin (#194), and @arthurgailes (#347).
Function fnth() has a new default ties = "q7", which gives the same result as quantile(..., type = 7) (R's default). More details below.
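A quick check of the new default can be sketched as follows, assuming collapse 1.9.0 or later is installed; the vector x is made up for illustration:

```r
library(collapse)

x <- c(2, 7, 1, 9, 4, 6)

# fnth() with a probability and the default ties = "q7"
q_collapse <- fnth(x, 0.25)

# Base R equivalent: type 7 is the default of stats::quantile()
q_base <- unname(quantile(x, 0.25, type = 7))

all.equal(q_collapse, q_base)
```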
fmode() gave wrong results for singleton groups (groups of size 1) on unsorted data. I had optimized fmode() for singleton groups to directly return the corresponding element, but it did not access the element through the (internal) ordering vector, so the first element/row of the entire vector/data was taken. The same mistake occurred for fndistinct if singleton groups were NA, which were counted as 1 instead of 0 under the na.rm = TRUE default (provided the first element of the vector/data was not NA). The mistake did not occur with data sorted by the groups, because here the data pointer already pointed to the first element of the group. (My apologies for this bug, it took me more than half a year to discover it, using collapse on a daily basis, and it escaped 700 unit tests as well).
Function groupid(x, na.skip = TRUE) returned uninitialized first elements if the first values in x were NA. Thanks for reporting @Henrik-P (#335).
Fixed a bug in the .names argument to across(). Passing a naming function such as .names = function(c, f) paste0(c, "-", f) now works as intended i.e. the function is applied to all combinations of columns (c) and functions (f) using outer(). Previously this was just internally evaluated as .names(cols, funs), which did not work if there were multiple cols and multiple funs. There is also now a possibility to set .names = "flip", which names columns f_c instead of c_f.
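The difference between the old and the new evaluation of a naming function can be illustrated in base R alone. This is only a sketch of the mechanism; the vectors cols and funs and the helper name namer are hypothetical stand-ins for what across() handles internally:

```r
cols <- c("mpg", "wt")      # selected columns
funs <- c("mean", "sd")     # names of the functions applied

namer <- function(c, f) paste0(c, "-", f)

# Old behaviour: a direct call pairs elements position by position,
# yielding only 2 names for the 4 column/function combinations
direct <- namer(cols, funs)

# New behaviour: outer() applies the function to every combination
combos <- outer(cols, funs, namer)

direct
combos
```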
fnrow() was rewritten in C and also supports data frames with 0 columns. Similarly for seq_row(). Thanks @NicChr (#344).
Added functions fcount() and fcountv(): a versatile and blazing fast alternative to dplyr::count. It also works with vectors, matrices, as well as grouped and indexed data.
Added function fquantile(): Fast (weighted) continuous quantile estimation (methods 5-9 following Hyndman and Fan (1996)), implemented fully in C based on quickselect and radixsort algorithms, and also supports an ordering vector as optional input to speed up the process. It is up to 2x faster than stats::quantile on larger vectors, but also especially fast on smaller data, where the R overhead of stats::quantile becomes burdensome. For maximum performance during repeated executions, a programmer's version .quantile() with different defaults is also provided.
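As a rough consistency check, fquantile() can be compared against stats::quantile() for each of the supported types. This sketch assumes collapse 1.9.0 or later; the data and the probabilities are arbitrary:

```r
library(collapse)

set.seed(42)
x <- rnorm(1000)
probs <- c(0.1, 0.5, 0.9)

# Compare both implementations for the continuous types 5 to 9
agree <- vapply(5:9, function(tp) {
  isTRUE(all.equal(unname(fquantile(x, probs, type = tp)),
                   unname(quantile(x, probs, type = tp))))
}, logical(1))

setNames(agree, paste0("type", 5:9))
```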
Added function fdist(): A fast and versatile replacement for stats::dist. It computes a full Euclidean distance matrix around 4x faster than stats::dist in serial mode, with additional gains possible through multithreading along the distance matrix columns (decreasing thread loads as the matrix is lower triangular). It also supports computing the distance of a matrix with a single row-vector, or simply between two vectors. E.g. fdist(mat, mat[1, ]) is the same as sqrt(colSums((t(mat) - mat[1, ])^2)), but about 20x faster in serial mode, and fdist(x, y) is the same as sqrt(sum((x-y)^2)), about 3x faster in serial mode. In both cases (sub-column level) multithreading is available. Note that fdist does not skip missing values i.e. NA's will result in NA distances. There is also no internal implementation for integers or data frames. Such inputs will be coerced to numeric matrices.
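The two equivalences stated above can be checked with a small sketch, assuming collapse 1.9.0 or later; the matrix and vectors are random illustrative data, and as.numeric() is used only to drop any attributes before comparing:

```r
library(collapse)

set.seed(1)
mat <- matrix(rnorm(50), ncol = 5)
x <- rnorm(10)
y <- rnorm(10)

# Distance of every row of mat to its first row
d_rows <- fdist(mat, mat[1, ])
d_rows_base <- sqrt(colSums((t(mat) - mat[1, ])^2))

# Distance between two plain vectors
d_xy <- fdist(x, y)
d_xy_base <- sqrt(sum((x - y)^2))

all.equal(as.numeric(d_rows), d_rows_base)
all.equal(as.numeric(d_xy), d_xy_base)
```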
Added function GRPid() to easily fetch the group id from a grouping object, especially inside grouped fmutate() calls. This addition was warranted especially by the new improved fnth.default() method which allows orderings to be supplied for performance improvements. See comments on fnth() and the example provided below.
fsummarize() was added as a synonym to fsummarise. Thanks @arthurgailes for the PR.
C API: collapse exports around 40 C functions that provide functionality that is either convenient or rather complicated to implement from scratch. The exported functions can be found at the bottom of src/ExportSymbols.c. The API does not include the Fast Statistical Functions, which I thought are too closely related to how collapse works internally to be of much use to a C programmer (e.g. they expect grouping objects or certain kinds of integer vectors). But you are free to request the export of additional functions, including C++ functions.
fnth() and fmedian() were rewritten in C, with significant gains in performance and versatility. Notably, fnth() now supports (grouped, weighted) continuous quantile estimation like fquantile() (fmedian(), which is a wrapper around fnth(), can also estimate various quantile based weighted medians). The new default for fnth() is ties = "q7", which gives the same result as (f)quantile(..., type = 7) (R's default). OpenMP multithreading across groups is also much more effective in both the weighted and unweighted case. Finally, fnth.default gained an additional argument o to pass an ordering vector, which can dramatically speed up repeated invocations of the function on the same data:
# Estimating multiple weighted-grouped quantiles on mpg: pre-computing an ordering provides extra speed.
mtcars %>% fgroup_by(cyl, vs, am) %>%
  fmutate(o = radixorder(GRPid(), mpg)) %>% # On grouped data, need to account for GRPid()
  fsummarise(mpg_Q1 = fnth(mpg, 0.25, o = o, w = wt),
             mpg_median = fmedian(mpg, o = o, w = wt),
             mpg_Q3 = fnth(mpg, 0.75, o = o, w = wt))
# Note that without weights this is not always faster. Quickselect can be very efficient, so it depends
# on the data, the number of groups, whether they are sorted (which speeds up radixorder), etc...
BY now supports data-length arguments to be passed e.g. BY(mtcars, mtcars$cyl, fquantile, w = mtcars$wt), making it effectively a generic grouped mapply function as well. Furthermore, the grouped_df method now also expands grouping columns for output length > 1.
collap(), which internally uses BY with non-Fast Statistical Functions, now also supports arbitrary further arguments passed down to functions to be split by groups. Thus users can also apply custom weighted functions with collap(). Furthermore, the parsing of the FUN, catFUN and wFUN arguments was improved and brought in-line with the parsing of .fns in across(). The main benefit of this is that Fast Statistical Functions are now also detected and optimizations carried out when passed in a list providing a new name e.g. collap(data, ~ id, list(mean = fmean)) is now optimized! Thanks @ttrodrigz (#358) for requesting this.
descr(), by virtue of fquantile and the improvements to BY, supports full-blown grouped and weighted descriptions of data. This is implemented through additional by and w arguments. The function has also been turned into an S3 generic, with a default and a 'grouped_df' method. The 'descr' methods as.data.frame and print also feature various improvements, and a new compact argument to print.descr, allowing a more compact printout. Users will also notice improved performance, mainly due to fquantile: on the M1 descr(wlddev) is now 2x faster than summary(wlddev), and 41x faster than Hmisc::describe(wlddev). Thanks @statzhero for the request (#355).
radixorder is about 25% faster on characters and doubles. This also benefits grouping performance. Note that group() may still be substantially faster on unsorted data, so if performance is critical try the sort = FALSE argument to functions like fgroup_by and compare.
Most list processing functions are noticeably faster, as checking the data types of elements in a list is now also done in C, and I have made some improvements to collapse's version of rbindlist() (used in unlist2d(), and various other places).
fsummarise and fmutate gained an ability to evaluate arbitrary expressions that result in lists / data frames without the need to use across(). For example: mtcars |> fgroup_by(cyl, vs, am) |> fsummarise(mctl(cor(cbind(mpg, wt, carb)), names = TRUE)) or mtcars |> fgroup_by(cyl) |> fsummarise(mctl(lmtest::coeftest(lm(mpg ~ wt + carb)), names = TRUE)). There is also the possibility to compute expressions using .data e.g. mtcars |> fgroup_by(cyl) |> fsummarise(mctl(lmtest::coeftest(lm(mpg ~ wt + carb, .data)), names = TRUE)) yields the same thing, but is less efficient because the whole dataset (including 'cyl') is split by groups. For greater efficiency and convenience, you can pre-select columns using a global .cols argument, e.g. mtcars |> fgroup_by(cyl, vs, am) |> fsummarise(mctl(cor(.data), names = TRUE), .cols = .c(mpg, wt, carb)) gives the same as above. Three Notes about this:
[...] fmutate, have the same length as the data (in each group).
If .data is used, the entire expression (expr) will be turned into a function of .data (function(.data) expr), which means columns are only available when accessed through .data e.g. .data$col1.
fsummarise supports computations with mixed result lengths e.g. mtcars |> fgroup_by(cyl) |> fsummarise(N = GRPN(), mean_mpg = fmean(mpg), quantile_mpg = fquantile(mpg)), as long as all computations result in either length 1 or length k vectors, where k is the maximum result length (e.g. for fquantile with default settings k = 5).
List extraction function get_elem() now has an option invert = TRUE (default FALSE) to remove matching elements from a (nested) list. Also the functionality of argument keep.class = TRUE is implemented in a better way, such that the default keep.class = FALSE toggles classes from (non-matched) list-like objects inside the list to be removed.
num_vars() has become a bit smarter: columns of class 'ts' and 'units' are now also recognized as numeric. In general, users should be aware that num_vars() does not regard any R methods defined for is.numeric(), it is implemented in C and simply checks whether objects are of type integer or double, and do not have a class. The addition of these two exceptions now guards against two common cases where num_vars() may give undesirable outcomes. Note that num_vars() is also called in collap() to distinguish between numeric (FUN) and non-numeric (catFUN) columns.
Improvements to setv() and copyv(), making them more robust to borderline cases: integer(0) passed to v does nothing (instead of error), and it is also possible to pass a single real index if vind1 = TRUE i.e. passing 1 instead of 1L does not produce an error.
alloc() now works with all types of objects i.e. it can replicate any object. If the input is non-atomic, atomic with length > 1 or NULL, the output is a list of these objects, e.g. alloc(NULL, 10) gives a length 10 list of NULL objects, or alloc(mtcars, 10) gives a list of mtcars datasets. Note that in the latter case the datasets are not deep-copied, so no additional memory is consumed.
missing_cases() and na_omit() have gained an argument prop = 0, indicating the proportion of values missing for the case to be considered missing/to be omitted. The default value of 0 indicates that at least 1 value must be missing. Of course setting prop = 1 indicates that all values must be missing. For data frames/lists the checking is done efficiently in C. For matrices this is currently still implemented using rowSums(is.na(X)) >= max(as.integer(prop * ncol(X)), 1L), so the performance is less than optimal.
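The matrix rule quoted above can be tried out directly in base R. In this sketch the helper name missing_rows is hypothetical and simply wraps the stated expression; it is not the collapse function itself:

```r
# Hypothetical helper wrapping the matrix rule quoted in the text
missing_rows <- function(X, prop = 0) {
  rowSums(is.na(X)) >= max(as.integer(prop * ncol(X)), 1L)
}

X <- matrix(c( 1, NA,  3,  4,   # 1 missing value
              NA, NA,  6,  7,   # 2 missing values
               7,  8,  9, 10,   # complete row
              NA, NA, NA, NA),  # all values missing
            ncol = 4, byrow = TRUE)

any_missing  <- missing_rows(X)              # threshold: at least 1 NA
half_missing <- missing_rows(X, prop = 0.5)  # threshold: at least 2 NAs
all_missing  <- missing_rows(X, prop = 1)    # threshold: all 4 NAs
```

With prop = 0 the threshold is floored at one missing value, which is why the complete row is the only one not flagged.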
missing_cases() has an extra argument count = FALSE. Setting count = TRUE returns the case-wise missing value count (by cols).
Functions frename() and setrename() have an additional argument .nse = TRUE, conforming to the default non-standard evaluation of tagged vector expressions e.g. frename(mtcars, mpg = newname) is the same as frename(mtcars, mpg = "newname"). Setting .nse = FALSE allows newname to be a variable holding a name e.g. newname = "othername"; frename(mtcars, mpg = newname, .nse = FALSE). Another use of the argument is that a (named) character vector can now be passed to the function to rename a (subset of) columns e.g. cvec = letters[1:3]; frename(mtcars, cvec, cols = 4:6, .nse = FALSE) (this works even with .nse = TRUE), and names(cvec) = c("cyl", "vs", "am"); frename(mtcars, cvec, .nse = FALSE). Furthermore, setrename() now also returns the renamed data invisibly, and relabel() and setrelabel() have also gained similar flexibility to allow (named) lists or vectors of variable labels to be passed. Note that these functions have no NSE capabilities, so they work essentially like frename(..., .nse = FALSE).
Function add_vars() became a bit more flexible and also allows single vectors to be added with tags e.g. add_vars(mtcars, log_mpg = log(mtcars$mpg), STD(mtcars)), similar to cbind. However add_vars() continues to not replicate length 1 inputs.
Safer multithreading: OpenMP multithreading over parts of the R API is minimized, reducing errors that occurred especially when multithreading across data frame columns. Also the number of threads supplied by the user to all OpenMP enabled functions is ensured to not exceed any of omp_get_num_procs(), omp_get_thread_limit(), and omp_get_max_threads().
Published by SebKrantz about 2 years ago
Fixed some warnings on rchk and newer C compilers (LLVM clang 10+).
.pseries / .indexed_series methods also change the implicit class of the vector (attached after "pseries"), if the data type changed. Previously, calling a function like fgrowth on an integer pseries changed the data type to double, but the "integer" class was still attached after "pseries".
Fixed bad testing for SE inputs in fgroup_by() and findex_by(). See #320.
Added rsplit.matrix method.
descr() now by default also reports 10% and 90% quantiles for numeric variables (in line with STATA's detailed summary statistics), and can also be applied to 'pseries' / 'indexed_series'. Furthermore, descr() itself now has an argument stepwise such that descr(big_data, stepwise = TRUE) yields computation of summary statistics on a variable-by-variable basis (and the finished 'descr' object is returned invisibly). The printed result is thus identical to print(descr(big_data), stepwise = TRUE), with the difference that the latter first does the entire computation whereas the former computes statistics on demand.
Function ss() has a new argument check = TRUE. Setting check = FALSE allows subsetting data frames / lists with positive integers without checking whether integers are positive or in-range. For programmers.
Function get_vars() has a new argument rename allowing select-renaming of columns in standard evaluation programming, e.g. get_vars(mtcars, c(newname = "cyl", "vs", "am"), rename = TRUE). The default is rename = FALSE, to warrant full backwards compatibility. See #327.
Added helper function setattrib(), to set a new attribute list for an object by reference + invisible return. This is different from the existing function setAttrib() (note the capital A), which takes a shallow copy of list-like objects and returns the result.
Published by SebKrantz about 2 years ago
flm and fFtest are now internal generic with an added formula method e.g. flm(mpg ~ hp + carb, mtcars, weights = wt) or fFtest(mpg ~ hp + carb | vs + am, mtcars, weights = wt) in addition to the programming interface. Thanks to Grant McDermott for suggesting.
Added method as.data.frame.qsu, to efficiently turn the default array outputs from qsu() into tidy data frames.
Major improvements to setv and copyv, generalizing the scope of operations that can be performed to all common cases. This means that even simple base R operations such as X[v] <- R can now be done significantly faster using setv(X, v, R).
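A small sketch of the correspondence between X[v] <- R and setv(X, v, R), assuming a collapse version with the generalized setv(); here v is an integer index vector and R a scalar replacement, and the object names are illustrative:

```r
library(collapse)

v <- c(2L, 4L)   # positions to overwrite
R <- 0           # replacement value

# Base R: copy-on-modify assignment
X_base <- c(1, 5, 3, 8, 2)
X_base[v] <- R

# collapse: modifies X_ref in place (by reference), returning it invisibly
X_ref <- c(1, 5, 3, 8, 2)
setv(X_ref, v, R)

identical(X_base, X_ref)
```

Because setv() works by reference, any other variable bound to the same vector would see the change too; copyv() is the copying counterpart when that is not wanted.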
n and qtab can now be added to options("collapse_mask") e.g. options(collapse_mask = c("manip", "helper", "n", "qtab")). This will export a function n() to get the (group) count in fsummarise and fmutate (which can also always be done using GRPN() but n() is more familiar to dplyr users), and will mask table() with qtab(), which is principally a fast drop-in replacement, but with some different further arguments.
Added C-level helper function all_funs, which fetches all the functions called in an expression, similar to setdiff(all.names(x), all.vars(x)) but better because it takes account of the syntax. For example let x = quote(sum(sum)) i.e. we are summing a column named sum. Then all.names(x) = c("sum", "sum") and all.vars(x) = "sum" so that the difference is character(0), whereas all_funs(x) returns "sum". This function makes collapse smarter when parsing expressions in fsummarise and fmutate and deciding which ones to vectorize.
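The base R part of this example can be reproduced as follows. The final call to all_funs() is guarded so that the sketch also runs where collapse is not installed:

```r
x <- quote(sum(sum))   # summing a column that happens to be named "sum"

nms  <- all.names(x)   # every symbol, functions and variables alike
vars <- all.vars(x)    # variables only

# The set difference loses the function because the names coincide
naive_funs <- setdiff(nms, vars)

nms
vars
naive_funs

# Syntax-aware alternative from collapse, if available
if (requireNamespace("collapse", quietly = TRUE)) {
  print(collapse::all_funs(x))
}
```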
Published by SebKrantz over 2 years ago
Fixed a bug in fscale.pdata.frame where the default C++ method was being called instead of the list method (i.e. the method didn't work at all).
Fixed 2 minor rchk issues (the remaining ones are spurious).
fsum has an additional argument fill = TRUE (default FALSE) that initializes the result vector with 0 instead of NA when na.rm = TRUE, so that fsum(NA, fill = TRUE) gives 0 like base::sum(NA, na.rm = TRUE).
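A minimal sketch of the fill argument, assuming a collapse version that includes it and the default na.rm = TRUE; the all-missing vector is illustrative:

```r
library(collapse)

x <- c(NA_real_, NA_real_)

s_default <- fsum(x)               # result stays NA: nothing to sum
s_filled  <- fsum(x, fill = TRUE)  # result initialized with 0
s_base    <- sum(x, na.rm = TRUE)  # base R behaviour for comparison

c(default = s_default, filled = s_filled, base = s_base)
```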
Slight performance increase in fmean with groups if na.rm = TRUE (the default).
Significant performance improvement when using base R expressions involving multiple functions and one column e.g. mid_col = (min(col) + max(col)) / 2 or lorentz_col = cumsum(sort(col)) / sum(col) etc. inside fsummarise and fmutate. Instead of evaluating such expressions on a data subset of one column for each group, they are now turned into a function e.g. function(x) cumsum(sort(x)) / sum(x) which is applied to a single vector split by groups.
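The idea can be mimicked in base R. The following is only a conceptual sketch of the two evaluation strategies, not collapse's internal code; the helper name mid is hypothetical:

```r
# The expression (min(col) + max(col)) / 2 rewritten as a function of one vector
mid <- function(x) (min(x) + max(x)) / 2

# Strategy A: split only the needed column and apply the function
by_vector <- vapply(split(mtcars$mpg, mtcars$cyl), mid, numeric(1))

# Strategy B: split the whole data frame and evaluate the expression
# inside each subset
by_subset <- vapply(split(mtcars, mtcars$cyl),
                    function(d) with(d, (min(mpg) + max(mpg)) / 2),
                    numeric(1))

all.equal(by_vector, by_subset)
```

Both strategies give the same numbers; strategy A avoids materializing a data subset per group, which is where the speed-up described above comes from.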
Argument return.groups from GRP.default is now also available in fgroup_by, allowing grouped data frames without materializing the unique grouping columns. This allows more efficient mutate-only operations e.g. mtcars |> fgroup_by(cyl, return.groups = FALSE) |> fmutate(across(hp:carb, fscale)). Similarly for aggregation with dropping of grouping columns mtcars |> fgroup_by(cyl, return.groups = FALSE) |> fmean() is equivalent to and faster than mtcars |> fgroup_by(cyl) |> fmean(keep.group_vars = FALSE).
Published by SebKrantz over 2 years ago
Published by SebKrantz over 2 years ago
A few improvements and fixes to make collapse 1.8 acceptable to CRAN. The changes may be summarised as follows:
Significant speed improvement in qF/qG (factor-generation) for character vectors with more than 100,000 obs and many levels if sort = TRUE (the default). For details see the method argument of ?qF.
Optimizations in fmode and fndistinct for singleton groups.
Fixed some rchk issues found by Thomas Kalibera from CRAN.
Faster funique.default method.
group now also internally optimizes on 'qG' objects.
Added function fnunique (yet another alternative to data.table::uniqueN, kit::uniqLen or dplyr::n_distinct, and principally a simple wrapper for attr(group(x), "N.groups")). At present fnunique generally outperforms the others on data frames.
finteraction has an additional argument factor = TRUE. Setting factor = FALSE returns a 'qG' object, which is more efficient if just an integer id but no factor object itself is required.
Operators (see .OPERATOR_FUN) have been improved a bit such that id-variables selected in the .data.frame (by, w or t arguments) or .pdata.frame methods (variables in the index) are not computed upon even if they are numeric (since the default is cols = is.numeric). In general, if cols is a function used to select columns of a certain data type, id variables are excluded from computation even if they are of that data type. It is still possible to compute on id variables by explicitly selecting them using names or indices passed to cols, or including them in the lhs of a formula passed to by.
Further efforts to facilitate adding the group-count in fsummarise and fmutate:
If options(collapse_mask = "all") is set before loading the package, an additional function n() is exported that works just like dplyr:::n(). (Note that internal optimization flags for n are always on, so if you really want the function to be called n() without setting options(collapse_mask = "all"), you could also do n <- GRPN or n <- collapse:::n)
[...] GRPN(). The previous uses of GRPN are unaltered i.e. GRPN can also:
[...] data |> gby(id) |> GRPN() or data %>% gby(id) %>% ftransform(N = GRPN(.)) (note the dot).
[...] fsubset(data, GRPN(id) > 10L) or fsubset(data, GRPN(list(id1, id2)) > 10L) or GRPN(data, by = ~ id1 + id2).
Published by SebKrantz over 2 years ago
collapse 1.8.0, released in mid-May 2022, brings enhanced support for indexed computations on time series and panel data by introducing flexible 'indexed_frame' and 'indexed_series' classes and surrounding infrastructure, sets a modest start to OpenMP multithreading as well as data transformation by reference in statistical functions, and enhances the package's descriptive statistics toolset.
Functions Recode, replace_non_finite, deprecated since collapse v1.1.0 and is.regular, deprecated since collapse v1.5.1 and clashing with a more important function in the zoo package, are now removed.
Fast Statistical Functions operating on numeric data (such as fmean, fmedian, fsum, fmin, fmax, ...) now preserve attributes in more cases. Previously these functions did not preserve attributes for simple computations using the default method, and only preserved attributes in grouped computations if !is.object(x) (see NEWS section for collapse 1.4.0). This meant that fmin and fmax did not preserve the attributes of Date or POSIXct objects, and none of these functions preserved 'units' objects (used a lot by the sf package). Now, attributes are preserved if !inherits(x, "ts"), that is the new default of these functions is to generally keep attributes, except for 'ts' objects where doing so obviously causes an unwanted error (note that 'xts' and others are handled by the matrix or data.frame method where other principles apply, see NEWS for 1.4.0). An exception are the functions fnobs and fndistinct where the previous default is kept.
Time Series Functions flag, fdiff, fgrowth and psacf/pspacf/psccf (and the operators L/F/D/Dlog/G) now internally process time objects passed to the t argument (where is.object(t) && is.numeric(unclass(t))) via a new function called timeid which turns them into integer vectors based on the greatest common divisor (GCD) (see below). Previously such objects were converted to factor. This can change behavior of code e.g. a 'Date' variable representing monthly data may be regular when converted to factor, but is now irregular and regarded as daily data (with a GCD of 1) because of the different day counts of the months. Users should fix such code either by calling qG on the time variable (for grouping / factor-conversion) or using appropriate classes e.g. zoo::yearmon. Note that plain numeric vectors where !is.object(t) are still used directly for indexation without passing them through timeid (which can still be applied manually if desired).
BY now has an argument reorder = TRUE, which casts elements in the original order if NROW(result) == NROW(x) (like fmutate). Previously the result was just in order of the groups, regardless of the length of the output. To obtain the former outcome users need to set reorder = FALSE.
options("collapse_DT_alloccol") was removed, the default is now fixed at 100. The reason is that data.table automatically expands the range of overallocated columns if required (so the option is not really necessary), and calling R options from C slows down C code and can cause problems in parallel code.
Fixed a bug in fcumsum that caused a segfault during grouped operations on larger data, due to flawed internal memory allocation. Thanks @Gulde91 for reporting #237.
Fixed a bug in across caused by two function(x) statements being passed in a list e.g. mtcars |> fsummarise(acr(mpg, list(ssdd = function(x) sd(x), mu = function(x) mean(x)))). Thanks @trang1618 for reporting #233.
Fixed an issue in across() when logical vectors were used to select columns on grouped data e.g. mtcars %>% gby(vs, am) %>% smr(acr(startsWith(names(.), "c"), fmean)) now works without error.
qsu gives proper output for length 1 vectors e.g. qsu(1).
collapse depends on R > 3.3.0, due to the use of newer C-level macros introduced then. The earlier indication of R > 2.1.0 was only based on R-level code and misleading. Thanks @ben-schwen for reporting #236. I will try to maintain this dependency for as long as possible, without being too restrained by development in R's C API and the ALTREP system in particular, which collapse might utilize in the future.
Introduction of 'indexed_frame', 'indexed_series' and 'index_df' classes: fast and flexible indexed time series and panel data classes that inherit from plm's 'pdata.frame', 'pseries' and 'pindex' classes. These classes take full advantage of collapse's computational infrastructure, are class-agnostic i.e. they can be superimposed upon any data frame or vector/matrix like object while maintaining most of the functionality of that object, support both time series and panel data, natively handle irregularity, and support ad-hoc computations inside arbitrary data masking functions and model formulas. This infrastructure comprises additional functions and methods, and modification of some existing functions and 'pdata.frame' / 'pseries' methods.
New functions: findex_by/iby, findex/ix, unindex, reindex, is_irregular, to_plm.
New methods: [.indexed_series, [.indexed_frame, [<-.indexed_frame, $.indexed_frame, $<-.indexed_frame, [[.indexed_frame, [[<-.indexed_frame, [.index_df, fsubset.pseries, fsubset.pdata.frame, funique.pseries, funique.pdata.frame, roworder(v) (internal), na_omit (internal), print.indexed_series, print.indexed_frame, print.index_df, Math.indexed_series, Ops.indexed_series.
Modification of 'pseries' and 'pdata.frame' methods for functions flag/L/F, fdiff/D/Dlog, fgrowth/G, fcumsum, psmat, psacf/pspacf/psccf, fscale/STD, fbetween/B, fwithin/W, fhdbetween/HDB, fhdwithin/HDW, qsu and varying to take advantage of 'indexed_frame' and 'indexed_series' while continuing to work as before with 'pdata.frame' and 'pseries'.
For more information and details see help("indexing").
Added function timeid: Generation of an integer-id/time-factor from time or date sequences represented by integer or double vectors (such as 'Date', 'POSIXct', 'ts', 'yearmon', 'yearquarter' or plain integers / doubles) by a numerically quite robust greatest common divisor method (see below). This function is used internally in findex_by, reindex and also in evaluation of the t argument to functions like flag/fdiff/fgrowth whenever is.object(t) && is.numeric(unclass(t)) (see also note above).
Programming helper function vgcd to efficiently compute the greatest common divisor from a vector of positive integer or double values (which should ideally be unique and sorted as well, timeid uses vgcd(sort(unique(diff(sort(unique(na_rm(x)))))))). Precision for doubles is up to 6 digits.
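The GCD idea behind timeid can be sketched in base R. The helpers gcd2 and vgcd_sketch below are hypothetical illustrations using Euclid's algorithm, not collapse's C implementation, and the final integer id is one plausible construction under that assumption:

```r
# Euclid's algorithm for two non-negative numbers
gcd2 <- function(a, b) {
  while (b > 0) {
    tmp <- b
    b <- a %% b
    a <- tmp
  }
  a
}

# GCD of a whole vector, folding pairwise
vgcd_sketch <- function(x) Reduce(gcd2, x)

# Monthly 'Date' data: the step sizes in days differ by month (28, 30, 31),
# so the GCD is 1 and the series looks like irregular daily data
dates <- seq(as.Date("2022-01-01"), by = "month", length.out = 12)
steps <- sort(unique(diff(sort(unique(as.numeric(dates))))))
gcd_dates <- vgcd_sketch(steps)

# Five-yearly data with one gap: the GCD of the steps is 5
years <- c(2000, 2005, 2010, 2020)
gcd_years <- vgcd_sketch(sort(unique(diff(years))))
id_years <- (years - min(years)) / gcd_years + 1   # integer-valued time id
```

This reproduces the caveat stated earlier: monthly dates stored as 'Date' yield a GCD of 1, whereas an evenly spaced numeric sequence keeps its natural step.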
Programming helper function frange: A significantly faster alternative to base::range, which calls both min and max. Note that frange inherits collapse's global na.rm = TRUE default.
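The difference in defaults can be seen in a short sketch, assuming collapse is installed with its default na.rm = TRUE setting; the vector is illustrative:

```r
library(collapse)

x <- c(3, NA, -1, 10)

r_fast <- frange(x)               # missing values skipped by default
r_base <- range(x, na.rm = TRUE)  # base R needs na.rm = TRUE explicitly

all.equal(as.numeric(r_fast), r_base)
```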
Added function qtab/qtable: A versatile and computationally more efficient alternative to base::table. Notably, it also supports tabulations with frequency weights, and computation of a statistic over combinations of variables. Objects are of class 'qtab' that inherits from 'table'. Thus all 'table' methods apply to it.
TRA was rewritten in C, and now has an additional argument set = TRUE which toggles data transformation by reference. The function setTRA was added as a shortcut which additionally returns the result invisibly. Since TRA is usually accessed internally through the like-named argument to Fast Statistical Functions, passing set = TRUE to those functions yields an internal call to setTRA. For example fmedian(num_vars(iris), g = iris$Species, TRA = "-", set = TRUE) subtracts the species-wise median from the numeric variables in the iris dataset, modifying the data in place and returning the result invisibly. Similarly the argument can be added in other workflows such as iris |> fgroup_by(Species) |> fmutate(across(1:2, fmedian, set = TRUE)) or mtcars |> ftransform(mpg = mpg %+=% hp, wt = fsd(wt, cyl, TRA = "replace_fill", set = TRUE)). Note that such chains must be ended by invisible() if no printout is wanted.
Exported helper function greorder, the companion to gsplit to reorder output in fmutate (and now also in BY): let g be a 'GRP' object (or something coercible such as a vector) and x a vector, then greorder orders data in y = unlist(gsplit(x, g)) such that identical(greorder(y, g), x).
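The round trip stated above can be tried on a toy vector. This is a sketch assuming collapse 1.8.0 or later; the data and grouping are made up:

```r
library(collapse)

x <- c(5, 1, 4, 2, 3, 6)
g <- GRP(c("b", "a", "b", "c", "a", "c"))

# Split by group and flatten: values are now arranged group by group
y <- unlist(gsplit(x, g))

# greorder() puts the grouped values back into the original order
x_back <- greorder(y, g)

identical(x_back, x)
```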
fmean, fprod, fmode and fndistinct were rewritten in C, providing performance improvements, particularly in fmode and fndistinct, and improvements for integers in fmean and fprod.
OpenMP multithreading in fsum
, fmean
, fmedian
, fnth
, fmode
and fndistinct
, implemented via an additional nthreads
argument. The default is to use 1 thread, which internally calls a serial version of the code in fsum
and fmean
(thus no change in the default behavior). The plan is to slowly roll this out over all statistical functions and then introduce options to set alternative global defaults. Multi-threading internally works different for different functions, see the nthreads
argument documentation of a particular function. Unfortunately I currently cannot guarantee thread safety, as parallelization of complex loops entails some tricky bugs and I have limited time to sort these out. So please report bugs, and if you happen to have experience with OpenMP please consider examining the code and making some suggestions.
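As a rough sketch of how the nthreads argument is used, the following compares a serial and a two-thread grouped mean on simulated data. The data sizes and object names are arbitrary assumptions; if collapse was built without OpenMP, both calls simply run serially.

```r
library(collapse)

set.seed(1)
X <- matrix(rnorm(4e5), ncol = 4)
g <- sample.int(100, nrow(X), replace = TRUE)

serial   <- fmean(X, g, nthreads = 1)  # default: serial code path
parallel <- fmean(X, g, nthreads = 2)  # request two OpenMP threads

# Results should agree up to floating point tolerance
all.equal(serial, parallel)
```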
TRA
has an additional option "replace_NA"
, e.g. wlddev |> fgroup_by(iso3c) |> fmutate(across(PCGDP:POP, fmedian, TRA = "replace_NA"))
performs median value imputation of missing values. Similarly for a matrix X <- matrix(na_insert(rnorm(1e7)), ncol = 100)
, fmedian(X, TRA = "replace_NA", set = TRUE)
(column-wise median imputation by reference).
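A small worked case of group-wise median imputation, using a made-up vector rather than wlddev. The names x, g and imputed are illustrative.

```r
library(collapse)

x <- c(1, NA, 3, 10, NA, 30)
g <- c(1, 1, 1, 2, 2, 2)

# Missing values are replaced by the median of the non-missing
# values in their group; observed values are left as they are
imputed <- fmedian(x, g, TRA = "replace_NA")
imputed
```

Here the group medians are 2 and 20, so the two missing entries become 2 and 20.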
All Fast Statistical Functions support zero group sizes (e.g. grouping with a factor that has unused levels will always produce an output of length nlevels(x)
with 0
or NA
elements for the unused levels). Previously this produced an error message with counting/ordinal functions fmode
, fndistinct
, fnth
and fmedian
.
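To illustrate, a factor with an unused level can be passed as grouping variable. This sketch only checks the shape of the output; which filler value (0 or NA) appears for the empty group depends on the function.

```r
library(collapse)

f <- factor(c("a", "a", "b"), levels = c("a", "b", "c"))  # level "c" is unused
x <- c(1, 2, 3)

med <- fmedian(x, f)
nd  <- fndistinct(x, f)

# One output element per level, including the empty group "c"
length(med) == nlevels(f)
names(nd)
```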
'GRP' objects now also contain a 'group.starts' item in the 8th slot that gives the first positions of the unique groups, and is returned alongside the groups whenever return.groups = TRUE
. This now benefits ffirst
when invoked with na.rm = FALSE
, e.g. wlddev %>% fgroup_by(country) %>% ffirst(na.rm = FALSE)
is now just as efficient as funique(wlddev, cols = "country")
. Note that no additional computing cost is incurred by preserving the 'group.starts' information.
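The relationship between 'group.starts' and the stored groups can be sketched as follows, assuming a plain character vector as input (x and g are illustrative names).

```r
library(collapse)

x <- c("b", "a", "b", "c")
g <- GRP(x, return.groups = TRUE)

# Position of the first occurrence of each unique group
g$group.starts

# Indexing the data with these positions recovers the unique groups
identical(x[g$group.starts], g$groups[[1]])
```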
Conversion methods GRP.factor
, GRP.qG
, GRP.pseries
, GRP.pdata.frame
and GRP.grouped_df
now also efficiently check if grouping vectors are sorted (the information is stored in the "ordered" element of 'GRP' objects). This leads to performance improvements in gsplit
/ greorder
and dependent functions such as BY
and rsplit
if factors are sorted.
descr()
received some performance improvements (up to 2x for categorical data), and has an additional argument sort.table
, allowing frequency tables for categorical variables to be sorted by frequency ("freq"
) or by table values ("value"
). The new default is ("freq"
), which presents tables in decreasing order of frequency. A method [.descr
was added allowing 'descr' objects to be subset like a list. The print method was also enhanced, and by default now prints 14 values with the highest frequency and groups the remaining values into a single ... %s Others
category. Furthermore, if there are any missing values in the column, the percentage of values missing is now printed behind Statistics
. Additional arguments reverse
and stepwise
allow printing in reverse order and/or one variable at a time.
whichv
(and operators %==%
, %!=%
) now also support comparisons of equal-length arguments e.g. 1:3 %==% 1:3
. Note that this should not be used to compare 2 factors.
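A brief illustration of both comparison modes, with made-up vectors. The operators return integer positions rather than a logical vector.

```r
library(collapse)

x <- c(1, 5, 3, 5)

idx_eq  <- x %==% 5                     # positions where x equals 5
idx_neq <- whichv(x, 5, invert = TRUE)  # positions where x differs from 5

# Equal-length comparison: positions where the two vectors agree
idx_pair <- c(1, 2, 3) %==% c(1, 0, 3)
```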
Added some code to the .onLoad
function that checks for the existence of a .fastverse
configuration file containing a setting for _opt_collapse_mask
: If found the code makes sure that the option takes effect before the package is loaded. This means that inside projects using the fastverse and options("collapse_mask")
to replace base R / dplyr functions, collapse cannot be loaded without the masking being applied, making it more secure to utilize this feature. For more information about function masking see help("collapse-options")
and for .fastverse
configuration files see the fastverse vignette.
Added hidden .list
methods for fhdwithin/HDW
and fhdbetween/HDB
. As for the other .FAST_FUN
this is just a wrapper for the data frame method and meant to be used on unclassed data frames.
ss()
supports unnamed lists / data frames.
The t
and w
arguments in 'grouped_df' methods (NSE), and wherever formula input is allowed, support ad-hoc transformations. E.g. wlddev %>% gby(iso3c) %>% flag(t = qG(date))
or L(wlddev, 1, ~ iso3c, ~qG(date))
, similarly qsu(wlddev, w = ~ log(POP))
, wlddev %>% gby(iso3c) %>% collapg(w = log(POP))
or wlddev %>% gby(iso3c) %>% nv() %>% fmean(w = log(POP))
.
Small improvements to group()
algorithm, avoiding some cases where the hash function performed badly, particularly with integers.
Function GRPnames
now has a sep
argument to choose a separator other than "."
.
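For instance, group names for a two-variable grouping of mtcars could be built with an underscore instead of a dot. This is a sketch; nm is an illustrative name.

```r
library(collapse)

g  <- GRP(mtcars, ~ cyl + am)
nm <- GRPnames(g, sep = "_")  # e.g. "4_0" instead of "4.0"
nm
```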
Published by SebKrantz over 2 years ago
Corrected a C-level bug in gsplit
that could lead R to crash in some instances (gsplit
is used internally in fsummarise
, fmutate
, BY
and collap
to perform computations with base R (non-optimized) functions).
Ensured that BY.grouped_df
always (by default) returns grouping columns in aggregations i.e. iris |> gby(Species) |> nv() |> BY(sum)
now gives the same as iris |> gby(Species) |> nv() |> fsum()
.
A .
was added to the first argument of functions fselect
, fsubset
, colorder
and fgroup_by
, i.e. fselect(x, ...) -> fselect(.x, ...)
. The reason for this is that over time I added the option to select-rename columns e.g. fselect(mtcars, cylinders = cyl)
, which was not offered when these functions were created. This presents problems if columns should be renamed into x
, e.g. fselect(mtcars, x = cyl)
failed, see #221. Renaming the first argument to .x
somewhat guards against such situations. I think this change is worthwhile to implement, because it makes the package more robust going forward, and usually the first argument of these functions is never invoked explicitly. I really hope this breaks nobody's code.
Added a function GRPN
to make it easy to add a column of group sizes e.g. mtcars %>% fgroup_by(cyl,vs,am) %>% ftransform(Sizes = GRPN(.))
or mtcars %>% ftransform(Sizes = GRPN(list(cyl, vs, am)))
or GRPN(mtcars, by = ~cyl+vs+am)
.
Added [.pwcor
and [.pwcov
, to be able to subset correlation/covariance matrices without losing the print formatting.
Published by SebKrantz over 2 years ago
In the development version on GitHub, a .
was added to the first argument of functions fselect
, fsubset
, colorder
and fgroup_by
, i.e. fselect(x, ...) -> fselect(.x, ...)
. The reason for this is that over time I added the option to select-rename columns e.g. fselect(mtcars, cylinders = cyl)
, which was not offered when these functions were created. This presents problems if columns should be renamed into x
, e.g. fselect(mtcars, x = cyl)
fails, see e.g. #221 . Renaming the first argument to .x
somewhat guards against such situations. I think this API change is worthwhile to implement, because it makes the package more robust going forward, and usually the first argument of these functions is never invoked explicitly. For now it remains in the development version which you can install using remotes::install_github("SebKrantz/collapse")
. If you have strong objections to this change (because it will break your code or you know of people that have a programming style where they explicitly set the first argument of data manipulation functions), please let me know!
Also ensuring tidyverse examples are in \donttest{}
and building without the dplyr testing file to avoid issues with static code analysis on CRAN.
20-50% Speed improvement in gsplit
(and therefore in fsummarise
, fmutate
, collap
and BY
when invoked with base R functions) when grouping with GRP(..., sort = TRUE, return.order = TRUE)
. To enable this by default, the default for argument return.order
in GRP
was set to sort
, which retains the ordering vector (needed for the optimization). Retaining the ordering vector uses up some memory which can possibly adversely affect computations with big data, but with big data sort = FALSE
usually gives faster results anyway, and you can also always set return.order = FALSE
(also in fgroup_by
, collap
), so this default gives the best of both worlds.
sort.row
(replaced by sort
in 2020) is now removed from collap
. Also arguments return.order
and method
were added to collap
providing full control of the grouping that happens internally.
Tests needed to be adjusted for the upcoming release of dplyr 1.0.8 which involves an API change in mutate
. fmutate
will not take over these changes i.e. fmutate(..., .keep = "none")
will continue to work like dplyr::transmute
. Furthermore, no more tests involving dplyr are run on CRAN, and I will also not follow along with any future dplyr API changes.
The C-API macro installTrChar
(used in the new massign
function) was replaced with installChar
to maintain backwards compatibility with R versions prior to 3.6.0. Thanks @tedmoorman #213.
Minor improvements to group()
, providing increased performance for doubles and also increased performance when the second grouping variable is integer, which turned out to be very slow in some instances.
Published by SebKrantz over 2 years ago
Removed tests involving the weights package (which is not available on R-devel CRAN checks).
fgroup_by
is more flexible, supporting computing columns e.g. fgroup_by(GGDC10S, Variable, Decade = floor(Year / 10) * 10)
and various programming options e.g. fgroup_by(GGDC10S, 1:3)
, fgroup_by(GGDC10S, c("Variable", "Country"))
, or fgroup_by(GGDC10S, is.character)
. You can also use column sequences e.g. fgroup_by(GGDC10S, Country:Variable, Year)
, but this should not be mixed with computing columns. Compute expressions may also not include the :
function.
More memory efficient attribute handling in C/C++ (using C-API macro SHALLOW_DUPLICATE_ATTRIB
instead of DUPLICATE_ATTRIB
) in most places.
Published by SebKrantz almost 3 years ago
|>
is not used in tests or examples, to avoid errors on CRAN checks with older versions of R.
Published by SebKrantz almost 3 years ago
Fixed minor C/C++ issues flagged in CRAN checks.
Added option ties = "last"
to fmode
.
Added argument stable.algo
to qsu
. Setting stable.algo = FALSE
toggles a faster calculation of the standard deviation, yielding 2x speedup on large datasets.
Fast Statistical Functions now internally use group
for grouping data if both g
and TRA
arguments are used, yielding efficiency gains on unsorted data.
Ensured that fmutate
and fsummarise
can be called if collapse is not attached.
Published by SebKrantz almost 3 years ago
collapse 1.7.0, released mid January 2022, brings major improvements in the computational backend of the package, its data manipulation capabilities, and a whole set of new functions that enable more flexible and memory efficient R programming - significantly enhancing the language itself. For the vast majority of code, updating to 1.7 should not cause any problems.
num_vars
is now implemented in C, yielding a massive performance increase over checking columns using vapply(x, is.numeric, logical(1))
. It selects columns where (is.double(x) || is.integer(x)) && !is.object(x)
. This provides the same results for most common classes found in data frames (e.g. factors and date columns are not numeric), however it is possible for users to define methods for is.numeric
for other objects, which will not be respected by num_vars
anymore. A prominent example is base R's 'ts' objects, i.e. is.numeric(AirPassengers)
returns TRUE
, but is.object(AirPassengers)
is also TRUE
so the above yields FALSE
, implying - if you happened to work with data frames of 'ts' columns - that num_vars
will now not select those anymore. Please make me aware if there are other important classes that are found in data frames and where is.numeric
returns TRUE
. num_vars
is also used internally in collap
so this might affect your aggregations.
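The selection rule can be reproduced in base R to see which objects are affected. The helper is_plain_numeric below is a hypothetical name written for this illustration; it is not part of collapse.

```r
# Base R re-statement of the rule quoted above (illustrative helper)
is_plain_numeric <- function(x) (is.double(x) || is.integer(x)) && !is.object(x)

is.numeric(AirPassengers)        # TRUE: 'ts' objects are numeric to base R
is_plain_numeric(AirPassengers)  # FALSE: they carry a class attribute
is_plain_numeric(1:3)            # TRUE
is_plain_numeric(Sys.Date())     # FALSE: dates are classed doubles
```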
In flag
, fdiff
and fgrowth
, if a plain numeric vector is passed to the t
argument such that is.double(t) && !is.object(t)
, it is coerced to integer using as.integer(t)
and directly used as time variable, rather than applying ordered grouping first. This is to avoid the inefficiency of grouping, and owes to the fact that in most data imported into R with various packages, the time (year) variables are coded as double although they should be integer (I also don't know of any cases where time needs to be indexed by a non-date variable with decimal places). Note that the algorithm internally handles irregularity in the time variable so this is not a problem. Should this break any code, kindly raise an issue on GitHub.
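A small sketch of what this means in practice, using a made-up series whose time variable is stored as double and has a gap. The expectation stated in the comments follows from the description above (irregularity is handled internally) and is an assumption about this toy case.

```r
library(collapse)

x    <- c(10, 20, 30)
year <- c(2001, 2002, 2004)  # double, with 2003 missing

# Without t, the lag is purely positional
flag(x, 1)

# With t, the lag respects the time variable: 2004 has no
# observation for 2003, so its lag is expected to be missing
lagged <- flag(x, 1, t = year)
lagged
```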
The function setrename
now truly renames objects by reference (without creating a shallow copy). The same is true for vlabels<-
(which was rewritten in C) and a new function setrelabel
. Thus additional care needs to be taken (with use inside functions etc.) as the renaming will take global effects unless a shallow copy of the data was created by some prior operation inside the function. If in doubt, better use frename
or relabel
which do create a shallow copy.
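The contrast between the copying and the by-reference variant can be sketched on a small data frame (df and df2 are illustrative names).

```r
library(collapse)

df <- data.frame(a = 1:3, b = 4:6)

# frename() works on a shallow copy: df keeps its names
df2 <- frename(df, a = x)
names(df)
names(df2)

# setrename() changes df itself and returns invisibly
setrename(df, a = x)
names(df)
```

Inside a function, the second form would also rename the caller's object unless a shallow copy was made beforehand.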
Some improvements to the BY
function, both in terms of performance and security. Performance is enhanced through a new C function gsplit
, providing split-apply-combine computing speeds competitive with dplyr on a much broader range of R objects. Regarding Security: if the result of the computation has the same length as the original data, names / rownames and grouping columns (for grouped data) are only added to the result object if known to be valid, i.e. if the data was originally sorted by the grouping columns (information recorded by GRP.default(..., sort = TRUE)
, which is called internally on non-factor/GRP/qG objects). This is because BY
does not reorder data after the split-apply-combine step (unlike dplyr::mutate
); data are simply recombined in the order of the groups. Because of this, in general, BY
should be used to compute summary statistics (unless data are sorted before grouping). The added security makes this explicit.
Added a method length.GRP
giving the length of a grouping object. This could break code calling length
on a grouping object before (which just returned the length of the list).
Functions renamed in collapse 1.6.0 will now print a message telling you to use the updated names. The functions under the old names will stay around for 1-3 more years.
The passing of argument order
instead of sort
in function GRP
(from a very early version of collapse), is now disabled.
Fixed a bug in some functions (fvar
, fsd
, fscale
and qsu
) to calculate variances, occurring when initial or final zero weights caused the running sum of weights in the algorithm to be zero, yielding a division by zero and NA
as output although a value was expected. These functions now skip zero weights alongside missing weights, which also implies that you can pass a logical vector to the weights argument to very efficiently calculate statistics on a subset of data (e.g. using qsu
).
Function group
was added, providing a low-level interface to a new unordered grouping algorithm based on hashing in C and optimized for R's data structures. The algorithm was heavily inspired by the great kit
package of Morgan Jacob, and now feeds into the package through multiple central functions (including GRP
/ fgroup_by
, funique
and qF
) when invoked with argument sort = FALSE
. It is also used in internal groupings performed in data transformation functions such as fwithin
(when no factor or 'GRP' object is provided to the g
argument). The speed of the algorithm is very promising (often superior to radixorder
), and it could be used in more places still. I welcome any feedback on its performance on different datasets.
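To give a feel for the output of the unordered grouping, here is a minimal sketch on a character vector (x and g are illustrative names). Group ids are assigned in order of first appearance.

```r
library(collapse)

x <- c("b", "a", "b", "c")
g <- group(x)

as.integer(g)        # group ids: "b" is group 1, "a" is 2, "c" is 3
attr(g, "N.groups")  # number of distinct groups
```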
Function gsplit
provides an efficient alternative to split
based on grouping objects. It is used as a new backend to rsplit
(which also supports data frame) as well as BY
, collap
, fsummarise
and fmutate
- for more efficient grouped operations with functions external to the package.
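On a plain vector, gsplit with group names produces the same pieces as base split. A sketch with made-up data:

```r
library(collapse)

x <- c(10, 20, 30, 40, 50)
g <- c("b", "a", "b", "a", "b")

pieces <- gsplit(x, g, use.g.names = TRUE)
pieces

# Same content as the base R equivalent
all.equal(pieces, split(x, g), check.attributes = FALSE)
```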
Added multiple functions to facilitate memory efficient programming (written in C). These include elementary mathematical operations by reference (setop
, %+=%
, %-=%
, %*=%
, %/=%
), supporting computations involving integers and doubles on vectors, matrices and data frames (including row-wise operations via setop
) with no copies at all. Furthermore a set of functions which check a single value against a vector without generating logical vectors: whichv
, whichNA
(operators %==%
and %!=%
which return indices and are significantly faster than ==
, especially inside functions like fsubset
), anyv
and allv
(allNA
was already added before). Finally, functions setv
and copyv
speed up operations involving the replacement of a value (x[x == 5] <- 6
) or of a sequence of values from an equally sized object (x[x == 5] <- y[x == 5]
, or x[ind] <- y[ind]
where ind
could be pre-computed vectors or indices) in vectors and data frames without generating any logical vectors or materializing vector subsets.
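The following sketch strings a few of these functions together on a copy of a column from mtcars. The copy (x) is made so that the dataset itself is not changed by the in-place operations; all object names are illustrative.

```r
library(collapse)

x <- mtcars$cyl * 1          # fresh double vector with values 4, 6 or 8

x %+=% 1                     # add 1 to every element in place: 5, 7 or 9
has_nine <- anyv(x, 9)       # TRUE if any element equals 9

setv(x, 9, 0)                # replace every 9 by 0, by reference
y <- copyv(x, 0, NA_real_)   # same kind of replacement on a copy; x is unchanged
```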
Function vlengths
was added as a more efficient alternative to lengths
(without method dispatch, simply coded in C).
Function massign
provides a multivariate version of assign
(written in C, and supporting all basic vector types). In addition the operator %=%
was added as an efficient multiple assignment operator. (It is called %=%
and not %<-%
to facilitate the translation of Matlab or Python codes into R, and because the zeallot package already provides multiple-assignment operators (%<-%
and %->%
), which are significantly more versatile, but orders of magnitude slower than %=%
)
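A minimal sketch of multiple assignment with both interfaces. The variable names a, b, u and v are made up for the example.

```r
library(collapse)

# Operator form: names on the left, values on the right
c("a", "b") %=% list(1, "two")

# Function form, here with an atomic vector of values
massign(c("u", "v"), c(10, 20))

a; b; u; v
```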
Fully fledged fmutate
function that provides functionality analogous to dplyr::mutate
(sequential evaluation of arguments, including arbitrary tagged expressions and across
statements). fmutate
is optimized to work together with the package's Fast Statistical and Data Transformation Functions, yielding fast, vectorized execution, but also benefits from gsplit
for other operations.
across()
function implemented for use inside fsummarise
and fmutate
. It is also optimized for Fast Statistical and Data Transformation Functions, but performs well with other functions too. It has an additional argument .apply = FALSE
which will apply functions to the entire subset of the data instead of individual columns, and thus allows for nesting tibbles and estimating models or correlation matrices by groups etc. across()
also supports an arbitrary number of additional arguments which are split and evaluated by groups if necessary. Multiple across()
statements can be combined with tagged vector expressions in a single call to fsummarise
or fmutate
. Thus the computational framework is pretty general and similar to data.table, although less efficient with big datasets.
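As an illustration of combining tagged expressions with across(), the sketch below aggregates and transforms mtcars by cyl. Object names (by_cyl, agg, dev) are illustrative, and only Fast Statistical and Data Transformation Functions are used.

```r
library(collapse)

by_cyl <- fgroup_by(mtcars, cyl)

# Aggregation: one tagged expression plus an across() statement
agg <- fsummarise(by_cyl,
                  mpg_mean = fmean(mpg),
                  across(c(hp, wt), fmedian))

# Transformation: adds a group-centered column and keeps all rows
dev <- fungroup(fmutate(by_cyl, mpg_dev = fwithin(mpg)))
```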
Added functions relabel
and setrelabel
to make interactive dealing with variable labels a bit easier. Note that both functions operate by reference. (Through vlabels<-
which is implemented in C. Taking a shallow copy of the data frame is useless in this case because variable labels are attributes of the columns, not of the frame). The only difference between the two is that setrelabel
returns the result invisibly.
Function shortcuts rnm
and mtt
added for frename
and fmutate
. across
can also be abbreviated using acr
.
Added two options that can be invoked before loading of the package to change the namespace: options(collapse_mask = c(...))
can be set to export copies of selected (or all) functions in the package that start with f
removing the leading f
e.g. fsubset
-> subset
(both fsubset
and subset
will be exported). This allows masking base R and dplyr functions (even basic functions such as sum
, mean
, unique
etc. if desired) with collapse's fast functions, facilitating the optimization of existing codes and allowing you to work with collapse using a more natural namespace. The package has been internally insulated against such changes, but of course they might have major effects on existing codes. Also options(collapse_F_to_FALSE = FALSE)
can be invoked to get rid of the lead operator F
, which masks base::F
(an issue raised by some people who like to use T
/F
instead of TRUE
/FALSE
). Read the help page ?collapse-options
for more information.
Package loads faster (because I don't fetch functions from some other C/C++ heavy packages in .onLoad
anymore, which implied unnecessary loading of a lot of DLLs).
fsummarise
is now also fully featured supporting evaluation of arbitrary expressions and across()
statements. Note that mixing Fast Statistical Functions with other functions in a single expression can yield unintended outcomes, read more at ?fsummarise
.
funique
benefits from group
in the default sort = FALSE
configuration, providing extra speed and unique values in first-appearance order in both the default and the data frame method, for all data types.
Function ss
supports both empty i
or j
.
The printout of fgroup_by
also shows minimum and maximum group size for unbalanced groupings.
In ftransformv/settransformv
and fcomputev
, the vars
argument is also evaluated inside the data frame environment, allowing NSE specifications using column names e.g. ftransformv(data, c(col1, col2:coln), FUN)
.
qF
with option sort = FALSE
now generates factors with levels in first-appearance order (instead of a random order assigned by the hash function), and can also be called on an existing factor to recast the levels in first-appearance order. It is also faster with sort = FALSE
(thanks to group
).
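The difference in level order can be seen on a short character vector (illustrative data):

```r
library(collapse)

x <- c("b", "a", "b", "c")

lev_sorted   <- levels(qF(x))                # sorted levels
lev_unsorted <- levels(qF(x, sort = FALSE))  # first-appearance order
```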
finteraction
has argument sort = FALSE
to also take advantage of group
.
rsplit
has improved performance through gsplit
, and an additional argument use.names
, which can be used to return an unnamed list.
Speedup in vtypes
and functions num_vars
, cat_vars
, char_vars
, logi_vars
and fact_vars
. Note that num_vars
behaves slightly differently as discussed above.
vlabels(<-)
/ setLabels
rewritten in C, giving a ~20x speed improvement. Note that they now operate by reference.
vlabels
, vclasses
and vtypes
have a use.names
argument. The default is TRUE
(as before).
colorder
can rename columns on the fly and also has a new mode pos = "after"
to place all selected columns after the first selected one, e.g.: colorder(mtcars, cyl, vs_new = vs, am, pos = "after")
. The pos = "after"
option was also added to roworderv
.
add_stub
and rm_stub
have an additional cols
argument to apply a stub to certain columns only e.g. add_stub(mtcars, "new_", cols = 6:9)
.
namlab
has additional arguments N
and Ndistinct
, which display the number of observations and distinct values next to variable names, labels and classes, giving a quick overview of the variables in a large dataset.
copyMostAttrib
only copies the "row.names"
attribute when known to be valid.
na_rm
can now be used to efficiently remove empty or NULL
elements from a list.
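For example, on a small list containing a NULL and a zero-length element (made-up data):

```r
library(collapse)

l <- list(1, NULL, "a", character(0))

cleaned <- na_rm(l)  # drops the NULL and the empty character vector
length(cleaned)
```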
flag
, fdiff
and fgrowth
produce fewer messages (i.e. no message if you don't use a time variable in grouped operations, and messages about computations on highly irregular panel data only if data length exceeds 10 million obs.).
The print methods of pwcor
and pwcov
now have a return
argument, allowing users to obtain the formatted correlation matrix, for exporting purposes.
replace_NA
, recode_num
and recode_char
have improved performance and an additional argument set
to take advantage of setv
to change (some) data by reference. For replace_NA
, this feature is mature and setting set = TRUE
will modify all selected columns in place and return the data invisibly. For recode_num
and recode_char
only a part of the transformations are done by reference, thus users will still have to assign the data to preserve changes. In the future, this will be improved so that set = TRUE
toggles all transformations to be done by reference.
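A sketch of the by-reference mode of replace_NA, using a copy of mtcars with randomly inserted missing values (the proportion and the replacement value 0 are arbitrary choices for the illustration).

```r
library(collapse)

set.seed(1)
df <- na_insert(mtcars, prop = 0.2)  # new data frame with some NAs
n_missing_before <- sum(is.na(df))

# Replace all missing values by 0 in place; no assignment needed
replace_NA(df, 0, set = TRUE)

n_missing_after <- sum(is.na(df))
```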
Published by SebKrantz about 3 years ago
Use of VECTOR_PTR
in C API now gives an error on R-devel even if USE_RINTERNALS
is defined. Thus this patch gets rid of all remaining usage of this macro to avoid errors on CRAN checks using the development version of R.
The print method for qsu
now uses an apostrophe (') to designate million digits, instead of a comma (,). This is to avoid confusion with the decimal point, and the typical use of (,) for thousands (which I don't like).
Published by SebKrantz over 3 years ago
A patch for 1.6.0 which fixes (minor) issues flagged by CRAN and adds a few handy extras.
Puts examples using the new base pipe |>
inside \donttest{}
so that they don't fail CRAN tests on older R versions.
Fixes an LTO issue caused by a small mistake in a header file (which does not have any implications to the user but was detected by CRAN checks).
Checks on the gcc11 compiler flagged an additional issue with a pointer pointing to element -1 of a C array (which I had done on purpose to index it with an R integer vector).
Fixes a valgrind issue because of comparing an uninitialized value to something.
Added a function fcomputev
, which allows selecting columns and transforming them with a function in one go. The keep
argument can be used to add columns to the selection that are not transformed.
Added a function setLabels
as a wrapper around vlabels<-
to facilitate setting variable labels inside pipes.
Function rm_stub
now has an argument regex = TRUE
which triggers a call to gsub
and allows general removing of character sequences in column names on the fly.
vlabels<-
and setLabels
now support lists of variable labels or other attributes (i.e. the value
is internally subset using [[
, not [
). Thus they are now general functions to attach a vector or list of attributes to columns in a list / data frame.
Published by SebKrantz over 3 years ago
collapse 1.6.0, released end of June 2021, presents some significant improvements in the user-friendliness, compatibility and programmability of the package, as well as a few function additions.
ffirst
, flast
, fnobs
, fsum
, fmin
and fmax
were rewritten in C. The former three now also support list columns (where NULL
or empty list elements are considered missing values when na.rm = TRUE
), and are extremely fast for grouped aggregation if na.rm = FALSE
. The latter three also support and return integers, with significant performance gains, even compared to base R. Code using these functions expecting an error for list-columns or expecting double output even if the input is integer should be adjusted.
collapse now directly supports sf data frames through functions like fselect
, fsubset
, num_vars
, qsu
, descr
, varying
, funique
, roworder
, rsplit
, fcompute
etc., which will take along the geometry column even if it is not explicitly selected (mirroring dplyr methods for sf data frames). This is mostly done internally at C-level, so functions remain simple and fast. Existing code that explicitly selects the geometry column is unaffected by the change, but code of the form sf_data %>% num_vars %>% qDF %>% ...
, where columns excluding geometry were selected and the object later converted to a data frame, needs to be rewritten as sf_data %>% qDF %>% num_vars %>% ...
. A short vignette was added describing the integration of collapse and sf.
I've received several requests for increased namespace consistency. collapse functions were named to be consistent with base R, dplyr and data.table, resulting in names like is.Date
, fgroup_by
or settransformv
. To me this makes sense, but I've been convinced that a bit more consistency is advantageous. Towards that end I have decided to eliminate the '.' notation of base R and to remove some unexpected capitalizations in function names giving some people the impression I was using camel-case. The following functions are renamed:
fNobs
-> fnobs
, fNdistinct
-> fndistinct
, pwNobs
-> pwnobs
, fHDwithin
-> fhdwithin
, fHDbetween
-> fhdbetween
, as.factor_GRP
-> as_factor_GRP
, as.factor_qG
-> as_factor_qG
, is.GRP
-> is_GRP
, is.qG
-> is_qG
, is.unlistable
-> is_unlistable
, is.categorical
-> is_categorical
, is.Date
-> is_date
, as.numeric_factor
-> as_numeric_factor
, as.character_factor
-> as_character_factor
,
Date_vars
-> date_vars
.
This is done in a very careful manner: the old names will stick around for a long while (end of 2022), and the generics of fNobs
, fNdistinct
, fHDbetween
and fHDwithin
will be kept in the package for an indeterminate period, but their core methods will not be exported beyond 2022. I will start warning about these renamed functions in 2022. In the future I will undogmatically stick to a function naming style with lowercase function names and underscores where words need to be split. Other function names will be kept. To say something about this: The quick-conversion functions qDF
, qDT
, qM
, qF
, qG
are consistent and in-line with data.table (setDT
etc.), and similarly the operators L
, F
, D
, Dlog
, G
, B
, W
, HDB
, HDW
. I'll keep GRP
, BY
and TRA
, for lack of better names, parsimony and because they are central to the package. The camel case will be kept in helper functions setDimnames
etc. because they work like stats setNames
and do not modify the argument by reference (like settransform
or setrename
and various data.table functions). Functions copyAttrib
and copyMostAttrib
are exports of like-named functions in the C API and thus kept as they are. Finally, I want to keep fFtest
the way it is because the F-distribution is widely recognized by a capital F.
I've updated the wlddev
dataset with the latest data from the World Bank, and also added a variable giving the total population (which may be useful e.g. for population-weighted aggregations across regions). The extra column could invalidate codes used to demonstrate something (I had to adjust some examples, tests and code in vignettes).
Added a function fcumsum
(written in C), permitting flexible (grouped, ordered) cumulative summations on matrix-like objects (integer or double typed) with extra methods for grouped data frames and panel series and data frames. Apart from the internal grouping, and an ordering argument allowing cumulative sums in a different order than data appear, fcumsum
has 2 options to deal with missing values. The default (na.rm = TRUE
) is to skip (preserve) missing values, whereas setting fill = TRUE
allows missing values to be populated with the previous value of the cumulative sum (starting from 0).
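The two ways of treating missing values, and a grouped cumulative sum, can be compared on a short vector (illustrative data and names):

```r
library(collapse)

x <- c(1, NA, 2, 3)

skip <- fcumsum(x)               # default: the NA is preserved, the sum continues after it
fill <- fcumsum(x, fill = TRUE)  # the NA is filled with the previous cumulative value

# Grouped version: the sum restarts in each group
grouped <- fcumsum(c(1, 2, 3, 4), g = c("a", "a", "b", "b"))
```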
Added a function alloc
to efficiently generate vectors initialized with any value (faster than rep_len
).
Added a function pad
to efficiently pad vectors / matrices / data.frames with a value (default is NA
). This function was mainly created to make it easy to expand results coming from a statistical model fitted on data with missing values to the original length. For example let data <- na_insert(mtcars); mod <- lm(mpg ~ cyl, data)
, then we can do settransform(data, resid = pad(resid(mod), mod$na.action))
, or we could do pad(model.matrix(mod), mod$na.action)
or pad(model.frame(mod), mod$na.action)
to receive matrices and data frames from model data matching the rows of data
. pad
is a general function that will also work with mixed-type data. It is also possible to pass a vector of indices matching the rows of the data to pad
, in which case pad
will fill gaps in those indices with a value/row in the data.
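A deterministic variant of the model example above, where two values are set to missing by hand so the padded positions are known in advance (the choice of rows 2 and 5 is arbitrary).

```r
library(collapse)

data <- mtcars
data$cyl[c(2, 5)] <- NA            # two rows the model will drop

mod <- lm(mpg ~ cyl, data)
length(resid(mod))                 # 30: rows with missing cyl were omitted

# Expand the residuals back to the 32 rows of data
res <- pad(resid(mod), mod$na.action)
length(res)
which(is.na(res))
```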
Full data.table support, including reference semantics (set*
, :=
)!! There is some complex C-level programming behind data.table's operations by reference. Notably, additional (hidden) column pointers are allocated to be able to add columns without taking a shallow copy of the data.table, and an ".internal.selfref"
attribute containing an external pointer is used to check if any shallow copy was made using base R commands like <-
. This is done to avoid even a shallow copy of the data.table in manipulations using :=
(and is in my opinion not worth it as even large tables are shallow copied by base R (>=3.1.0) within microseconds and all of this complicates development immensely). Previously, collapse treated data.table's like any other data frame, using shallow copies in manipulations and preserving the attributes (thus ignoring how data.table works internally). This produced a warning whenever you wanted to use data.table reference semantics (set*
, :=
) after passing the data.table through a collapse function such as collap
, fselect
, fsubset
, fgroup_by
etc. From v1.6.0, I have adopted essential C code from data.table to do the overallocation and generate the ".internal.selfref"
attribute, thus seamless workflows combining collapse and data.table are now possible. This comes at a cost of about 2-3 microseconds per function, as to do this I have to shallow copy the data.table again and add extra column pointers and an ".internal.selfref"
attribute telling data.table that this table was not copied (it seems to be the only way to do it for now). This integration encompasses all data manipulation functions in collapse, but not the Fast Statistical Functions themselves. Thus you can do agDT <- DT %>% fselect(id, col1:coln) %>% collap(~id, fsum); agDT[, newcol := 1]
, but you would need to add a qDT
after a function like fsum
if you want to use reference semantics without incurring a warning: agDT <- DT %>% fselect(id, col1:coln) %>% fgroup_by(id) %>% fsum %>% qDT; agDT[, newcol := 1]
. collapse appears to be the first package that attempts to account for data.table's internal working without importing data.table, and qDT
is now the fastest way to create a fully functional data.table from any R object. A global option "collapse_DT_alloccol"
was added to regulate how many columns collapse overallocates when creating data.tables. The default is 100, which is lower than the data.table default of 1024. This was done to increase the efficiency of the additional shallow copies, and may be changed by the user.
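A short sketch of the two workflows just described, assuming data.table is installed. The column selection and the object names agDT2 and newcol are illustrative choices, and nested calls are used instead of a pipe to keep the example dependency-free.

```r
library(collapse)
library(data.table)

DT <- qDT(mtcars)   # convert a data frame to a data.table

# Data manipulation functions (fselect, collap, ...) return a table that
# data.table accepts for modification by reference
agDT <- collap(fselect(DT, cyl, disp:wt), ~ cyl, fsum)
agDT[, newcol := 1]

# Fast Statistical Functions are not covered by the integration,
# so the result is passed through qDT() before using :=
agDT2 <- qDT(fsum(fgroup_by(fselect(DT, cyl, disp:wt), cyl)))
agDT2[, newcol := 1]
```

In both cases newcol is added by reference, so no assignment back to agDT or agDT2 is needed.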
Programming enabled with fselect
and fgroup_by
(you can now pass vectors containing column names or indices). Note that instead of fselect
you should use get_vars
for standard eval programming.
fselect
and fsubset
support in-place renaming e.g. fselect(data, newname = var1, var3:varN)
,
fsubset(data, vark > varp, newname = var1, var3:varN)
.
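The programming and renaming features from the two items above might be used like this. It is a sketch on mtcars; the vector sel_cols and the result names are hypothetical.

```r
library(collapse)

sel_cols <- c("mpg", "cyl", "hp")        # hypothetical vector of column names

a <- get_vars(mtcars, sel_cols)          # standard-evaluation selection (recommended)
b <- fselect(mtcars, sel_cols)           # vectors are also accepted from v1.6.0
g <- fgroup_by(mtcars, c("cyl", "vs"))   # grouping columns supplied as a vector

# In-place renaming while selecting columns or subsetting rows
r1 <- fselect(mtcars, newname = mpg, hp:wt)
r2 <- fsubset(mtcars, mpg > 25, newname = cyl, hp:wt)
```

Here r1 and r2 both carry the renamed column newname followed by hp, drat and wt; r2 additionally keeps only the rows where mpg exceeds 25.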
collap
supports renaming columns in the custom argument, e.g. collap(data, ~ id, custom = list(fmean = c(newname = "var1", "var2"), fmode = c(newname = 3), flast = is_date))
.
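As a concrete variant of the call above, the following sketch aggregates mtcars by cyl and renames two of the aggregated columns. The new names avg_mpg and max_disp are made up for illustration.

```r
library(collapse)

res <- collap(mtcars, ~ cyl,
              custom = list(
                fmean = c(avg_mpg = "mpg", "hp"),  # rename mpg, keep the name of hp
                fmax  = c(max_disp = 3)            # column 3 of mtcars is disp
              ))
names(res)   # inspect the resulting column names
```

Columns can be referenced by name or by index inside custom, and only the named elements are renamed.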
Performance improvements: fsubset
/ ss
return the data or perform a simple column subset without deep copying the data if all rows are selected through a logical expression. fselect
and get_vars
, num_vars
etc. are slightly faster through data frame subsetting done fully in C. ftransform
/ fcompute
use alloc
instead of base::rep
to replicate a scalar value, which is slightly more efficient.
fcompute
now has a keep
argument, to preserve several existing columns when computing columns on a data frame.
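A minimal sketch of the keep argument, assuming columns are selected by name. The computed column hp_per_wt is a hypothetical example.

```r
library(collapse)

# Compute one new column and carry over two existing ones
out <- fcompute(mtcars, hp_per_wt = hp / wt, keep = c("mpg", "cyl"))
names(out)
```

Without keep, fcompute would return only the newly computed column.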
replace_NA
now has a cols
argument, so we can do replace_NA(data, cols = is.numeric)
, to replace NA
values in numeric columns. I note that for big numeric data, data.table::setnafill
is the most efficient solution.
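The cols argument can be tried on a data set with mixed column types. This sketch uses iris with artificially inserted missing values; the seed and the replacement value 0 are assumptions for the example.

```r
library(collapse)

set.seed(1)
d <- na_insert(iris)                  # missing values in numeric and factor columns

# Replace missing values with 0, but only in the numeric columns
out <- replace_NA(d, value = 0, cols = is.numeric)

anyNA(num_vars(out))                  # check the numeric columns
```

The factor column Species is left untouched, so any missing values inserted there remain.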
fhdbetween
and fhdwithin
have an effect
argument in plm methods, allowing centering on selected identifiers. The default is still to center on all panel identifiers.
The plot method for panel series matrices and arrays plot.psmat
was improved slightly. It now supports custom colours and drawing of a grid.
settransform
and settransformv
can now be called without attaching the package e.g. collapse::settransform(data, ...)
. Previously these raised an error when collapse was not attached, because they are simply wrappers around data <- ftransform(data, ...)
. I'd like to note from a discussion that avoiding shallow copies with <-
(e.g. via :=
) does not appear to yield noticeable performance gains. Where data.table is faster on big data this mostly has to do with parallelism and sometimes with algorithms, generally not memory efficiency.
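The equivalence between settransform and an explicit assignment can be checked without attaching the package. A small sketch; the column kpl and its conversion factor are illustrative.

```r
d <- mtcars

# Namespace-qualified call: modifies 'd' in the calling environment
collapse::settransform(d, kpl = mpg * 0.425)

# The explicit form that settransform wraps
d2 <- collapse::ftransform(mtcars, kpl = mpg * 0.425)
```

Both d and d2 end up with the same additional column kpl.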
Functions setAttrib
, copyAttrib
and copyMostAttrib
now only make a shallow copy of lists, not of atomic vectors (for which a shallow copy amounts to a full copy and is inefficient). Thus atomic objects are now modified in-place.
Small improvements: Calling qF(x, ordered = FALSE)
on an ordered factor now removes the ordered class; the operators L
, F
, D
, Dlog
, G
, B
, W
, HDB
, HDW
and functions like pwcor
now work on unnamed matrices or data frames.
Published by SebKrantz over 3 years ago
The first argument of ftransform
was renamed to .data
from X
. This was done to enable the user to transform columns named "X". For the same reason the first argument of frename
was renamed to .x
from x
(not .data
to make it explicit that .x
can be any R object with a "names" attribute). It is not possible to deprecate X
and x
without at the same time undoing the benefits of the argument renaming, thus this change is immediate and code-breaking in the rare cases where the first argument is explicitly named.
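The benefit of the renaming is that columns called X or x can now be addressed through the dots. A minimal sketch with a made-up data frame:

```r
library(collapse)

d <- data.frame(X = 1:3, x = c(10, 20, 30))

d1 <- ftransform(d, X = X * 2)   # 'X' now refers to the column, not the first argument
d2 <- frename(d, x = y)          # rename column x to y
```

Code that explicitly named the first argument has to be updated to use .data (ftransform) or .x (frename).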
The function is.regular
to check whether an R object is atomic or list-like is deprecated and will be removed before the end of the year. This was done to avoid a namespace clash with the zoo package (#127).
For reasons of efficiency, most statistical and transformation functions used the C macro SHALLOW_DUPLICATE_ATTRIB
to copy column attributes in a data frame. Since this macro does not copy S4 object bits, it caused some problems with S4 object columns such as POSIXct (e.g. computing lags/leads, first and last values on these columns). This is now fixed: all statistical functions (apart from fvar
and fsd
) now use DUPLICATE_ATTRIB
and thus preserve S4 object columns (#91).
unlist2d
produced a subsetting error if an empty list was present in the list-tree. This is now fixed: empty or NULL
elements in the list-tree are simply ignored (#99).
A function fsummarise
was added to facilitate translating dplyr / data.table code to collapse. Like collap
, it is only very fast when used with the Fast Statistical Functions.
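A dplyr-style summary translated to collapse might look like this. The sketch uses mtcars; the output column names are illustrative.

```r
library(collapse)

res <- fsummarise(fgroup_by(mtcars, cyl),
                  mean_mpg = fmean(mpg),   # Fast Statistical Functions keep this fast
                  sum_hp   = fsum(hp))
res
```

The result has one row per value of cyl, with the grouping column first.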
A function t_list
is made available to efficiently transpose lists of lists.
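For illustration, transposing a small list of lists (the list l and its element names are made up):

```r
library(collapse)

l <- list(a = list(x = 1, y = "u"),
          b = list(x = 2, y = "v"))

tl <- t_list(l)   # outer and inner levels are swapped
str(tl)
```

After transposing, tl$x holds the x elements of a and b, and tl$y holds the y elements.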
Published by SebKrantz almost 4 years ago
A small patch for 1.5.0 that:
Fixes a numeric precision issue when grouping doubles (e.g. previously qF(wlddev$LIFEEX)
gave an error; now it works).
Fixes a minor issue with fHDwithin
when applied to pseries and fill = FALSE
.