Microsynthesis using quasirandom sampling and/or IPF
humanleague is a Python and R package for microsynthesising populations from marginal and (optionally) seed data. The package is implemented in C++ for performance.
The package contains algorithms that use a number of different microsynthesis techniques:
The latter provides a bridge between deterministic reweighting and combinatorial optimisation, offering advantages of both techniques:
The algorithms:
The package also contains the following utilities:
Version 1.0.1 reflects the work described in the Quasirandom Integer Sampling (QIS) paper.
Requires Python 3.9 or newer. The package can be installed using pip, e.g.
python -m pip install humanleague --user
Fork or clone the repo, then
pip install -e .[dev]
pytest
Official release:
> install.packages("humanleague")
For a development version
> devtools::install_github("virgesmith/humanleague")
Or, for the legacy version
> devtools::install_github("virgesmith/[email protected]")
Consult the package documentation, e.g.
> library(humanleague)
> ?humanleague
The package now contains type annotations, which your IDE should display automatically.
NB type stubs are generated using the pybind11-stubgen package, with some manual corrections.
Building on the one-dimensional integerise function (which, given a discrete probability distribution and a count, returns the closest integer population to the distribution that sums to the count), a multidimensional equivalent of integerise is introduced. In one dimension, for example, this:
>>> import humanleague
>>> p = [0.1, 0.2, 0.3, 0.4]
>>> result, stats = humanleague.integerise(p, 11)
>>> result
array([1, 2, 3, 5], dtype=int32)
>>> stats
{'rmse': 0.3535533905932736}
produces the optimal (i.e. closest possible) integer population to the discrete distribution.
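To illustrate what "closest integer population" means in one dimension, here is a minimal pure-Python sketch using largest-remainder rounding. This is an assumption made for illustration, not a description of how humanleague implements integerise internally, and the function name `largest_remainder` is hypothetical. It assumes the probabilities sum to 1.

```python
import math

def largest_remainder(probs, total):
    """Round probs * total to integers that sum to total (illustrative only)."""
    expected = [p * total for p in probs]
    counts = [math.floor(e) for e in expected]
    shortfall = total - sum(counts)
    # Give one extra unit to each of the states with the largest fractional parts.
    by_remainder = sorted(range(len(expected)),
                          key=lambda i: expected[i] - counts[i],
                          reverse=True)
    for i in by_remainder[:shortfall]:
        counts[i] += 1
    return counts

counts = largest_remainder([0.1, 0.2, 0.3, 0.4], 11)
print(counts)
```

For the distribution and count used above, this sketch yields the same population, [1, 2, 3, 5]: the expected values are 1.1, 2.2, 3.3 and 4.4, and the single leftover unit goes to the state with the largest fractional part.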
The integerise function generalises this problem and applies it to higher dimensions: given an n-dimensional array of real numbers where the 1-d marginal sums in every dimension are integral (and thus the total population is too), it attempts to find an integral array that also satisfies these constraints.
The QISI algorithm is repurposed to this end. As it is a sampling algorithm, it cannot guarantee that a solution will be found, nor, if one is found, that it is optimal. A failure does not prove that no solution exists for the given input.
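The precondition described above (integral 1-d marginal sums in every dimension) can be checked before calling integerise. The following is a sketch in plain Python operating on rectangular nested lists; the helper names `marginal_sums` and `has_integral_marginals` and the tolerance are hypothetical and not part of humanleague. With NumPy the same sums could be obtained by summing over all axes but one.

```python
def marginal_sums(nested):
    """Return, for each dimension, the list of 1-d marginal sums."""
    # Work out the shape by descending through the first element at each level.
    shape = []
    level = nested
    while isinstance(level, list):
        shape.append(len(level))
        level = level[0]
    sums = [[0.0] * n for n in shape]

    def walk(node, index):
        if isinstance(node, list):
            for i, child in enumerate(node):
                walk(child, index + (i,))
        else:
            # Add the cell value to the marginal entry it belongs to in every dimension.
            for dim, i in enumerate(index):
                sums[dim][i] += node

    walk(nested, ())
    return sums

def has_integral_marginals(nested, tol=1e-9):
    """True if every 1-d marginal sum is within tol of an integer."""
    return all(abs(s - round(s)) < tol
               for dim_sums in marginal_sums(nested)
               for s in dim_sums)

a = [[0.3, 1.2, 2.0, 1.5],
     [0.6, 2.4, 4.0, 3.0],
     [1.5, 6.0, 10.0, 7.5],
     [0.6, 2.4, 4.0, 3.0]]
print(has_integral_marginals(a))
print(has_integral_marginals([[0.5, 0.2], [0.1, 0.2]]))
```

A tolerance is used because sums of floating-point values such as 0.3 + 1.2 + 2.0 + 1.5 are not guaranteed to be exactly integral.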
>>> import numpy as np
>>> a = np.array([[ 0.3,  1.2,  2. ,  1.5],
...               [ 0.6,  2.4,  4. ,  3. ],
...               [ 1.5,  6. , 10. ,  7.5],
...               [ 0.6,  2.4,  4. ,  3. ]])
# marginal sums
>>> a.sum(axis=0)
array([ 3., 12., 20., 15.])
>>> a.sum(axis=1)
array([ 5., 10., 25., 10.])
# perform integerisation
>>> result, stats = humanleague.integerise(a)
>>> stats
{'conv': True, 'rmse': 0.5766281297335398}
>>> result
array([[ 0, 2, 2, 1],
[ 0, 3, 4, 3],
[ 2, 6, 10, 7],
[ 1, 1, 4, 4]])
# check marginals are preserved
>>> (result.sum(axis=0) == a.sum(axis=0)).all()
True
>>> (result.sum(axis=1) == a.sum(axis=1)).all()
True
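The `rmse` values reported in `stats` appear consistent with the root-mean-square difference between the integer result and the real-valued target, taken over all cells. The sketch below recomputes it in plain Python for both examples above; the interpretation of `rmse` and the helper names `flatten` and `rmse` are assumptions, not taken from the humanleague documentation.

```python
import math

def flatten(nested):
    """Yield the scalar cells of an arbitrarily nested list."""
    for item in nested:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item

def rmse(target, integral):
    """Root-mean-square difference over all cells of two same-shaped arrays."""
    diffs = [(i - t) ** 2 for t, i in zip(flatten(target), flatten(integral))]
    return math.sqrt(sum(diffs) / len(diffs))

# One-dimensional example: the target is the distribution scaled by the count.
p = [0.1, 0.2, 0.3, 0.4]
rmse_1d = rmse([x * 11 for x in p], [1, 2, 3, 5])

# Two-dimensional example: the target is the input array itself.
a = [[0.3, 1.2, 2.0, 1.5],
     [0.6, 2.4, 4.0, 3.0],
     [1.5, 6.0, 10.0, 7.5],
     [0.6, 2.4, 4.0, 3.0]]
result = [[0, 2, 2, 1],
          [0, 3, 4, 3],
          [2, 6, 10, 7],
          [1, 1, 4, 4]]
rmse_2d = rmse(a, result)
print(rmse_1d, rmse_2d)
```

Under this assumption the recomputed values agree with the reported ones (about 0.354 and 0.577) to within floating-point rounding.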