Preprocessed data for various popular tabular datasets to go along with imodels.
Includes the following datasets and more (see notebooks for more details on the datasets).
To download, use the "Name" field as the key, e.g. `imodels.get_clean_dataset('compas_two_year_clean', data_source='imodels')`.
Name | Samples | Features | Class 0 | Class 1 | Majority class % |
---|---|---|---|---|---|
heart | 270 | 15 | 150 | 120 | 55.6 |
breast_cancer | 277 | 17 | 196 | 81 | 70.8 |
haberman | 306 | 3 | 81 | 225 | 73.5 |
credit_g | 1000 | 60 | 300 | 700 | 70.0 |
csi_pecarn_prop | 3313 | 97 | 2773 | 540 | 83.7 |
csi_pecarn_pred | 3313 | 39 | 2773 | 540 | 83.7 |
juvenile_clean | 3640 | 286 | 3153 | 487 | 86.6 |
compas_two_year_clean | 6172 | 20 | 3182 | 2990 | 51.6 |
enhancer | 7809 | 80 | 7115 | 694 | 91.1 |
fico | 10459 | 23 | 5000 | 5459 | 52.2 |
iai_pecarn_prop | 12044 | 73 | 11841 | 203 | 98.3 |
iai_pecarn_pred | 12044 | 58 | 11841 | 203 | 98.3 |
credit_card_clean | 30000 | 33 | 23364 | 6636 | 77.9 |
tbi_pecarn_prop | 42428 | 223 | 42052 | 376 | 99.1 |
tbi_pecarn_pred | 42428 | 121 | 42052 | 376 | 99.1 |
readmission_clean | 101763 | 150 | 54861 | 46902 | 53.9 |
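The "Majority class %" column is the share of the larger class among all samples. Below is a minimal sketch of that calculation, which can serve as a sanity check of a downloaded label vector against the table. The helper name `majority_class_pct` is illustrative and not part of imodels, and the labels are a stand-in built from the `heart` row rather than downloaded data.

```python
from collections import Counter


def majority_class_pct(y):
    """Return the percentage of samples in the most common class, rounded to 1 decimal."""
    counts = Counter(y)
    return round(100 * max(counts.values()) / len(y), 1)


# Stand-in labels matching the `heart` row of the table (150 zeros, 120 ones).
# With imodels installed, `y` would instead be the second value returned by
# imodels.get_clean_dataset('heart', data_source='imodels').
y = [0] * 150 + [1] * 120
print(majority_class_pct(y))  # 55.6
```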
First, install the imodels package: `pip install imodels`. Then, use the `imodels.get_clean_dataset` function.
imodels.get_clean_dataset(dataset_name: str, data_source: str = 'imodels', data_path='data') -> Tuple[numpy.ndarray, numpy.ndarray, list]
"""
Fetch clean data (as numpy arrays) from various sources including imodels, pmlb, openml, and sklearn. If the data has not been downloaded yet, it is downloaded and cached; otherwise it is loaded from the local copy.
Parameters
----------
dataset_name: str
    unique dataset identifier
data_source: str
    options: 'imodels', 'pmlb', 'sklearn', 'openml', 'synthetic'
data_path: str
    path to load/save data (default: 'data')
Returns
-------
X: np.ndarray
    features
y: np.ndarray
    outcome
feature_names: list
"""
import imodels

# download compas dataset from imodels
X, y, feature_names = imodels.get_clean_dataset('compas_two_year_clean', data_source='imodels')
# download ionosphere dataset from pmlb
X, y, feature_names = imodels.get_clean_dataset('ionosphere', data_source='pmlb')
# download liver dataset from openml
X, y, feature_names = imodels.get_clean_dataset('8', data_source='openml')
# download ca housing from sklearn
X, y, feature_names = imodels.get_clean_dataset('california_housing', data_source='sklearn')
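The docstring above says that data is downloaded and cached on first use and loaded locally afterwards. The following is a rough sketch of that pattern only, not the actual imodels implementation: the function `load_or_fetch`, the `fetch` callback, and the one-CSV-per-dataset cache layout are all assumptions made for illustration.

```python
import csv
import os
import tempfile


def load_or_fetch(dataset_name, fetch, data_path="data"):
    """Return rows for `dataset_name`, calling `fetch` only when no cached CSV exists."""
    cache_file = os.path.join(data_path, dataset_name + ".csv")
    if not os.path.exists(cache_file):
        os.makedirs(data_path, exist_ok=True)
        rows = fetch(dataset_name)  # hypothetical downloader supplied by the caller
        with open(cache_file, "w", newline="") as f:
            csv.writer(f).writerows(rows)
    # Always read from the cache so both code paths return the same representation.
    with open(cache_file, newline="") as f:
        return list(csv.reader(f))


calls = []


def fake_fetch(name):
    """Stand-in for a network download; records each call and returns toy rows."""
    calls.append(name)
    return [["age", "label"], ["63", "1"], ["41", "0"]]


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_fetch("toy", fake_fetch, data_path=tmp)
    second = load_or_fetch("toy", fake_fetch, data_path=tmp)

print(first == second, len(calls))  # True 1
```

The second call finds the cached file and never invokes the downloader, which is the behaviour the `data_path` parameter exists to support.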
Data comes from various sources - please cite those sources appropriately.
notebooks_fetch_data contains notebooks which download and preprocess the data
data_cleaned contains the cleaned csv file for each dataset
To use any of the clinical decision-rule datasets, you must first accept the research data use agreement here.
There are two versions of each PECARN (TBI, IAI, and CSI) dataset:
prop: missing values have not been imputed
pred: missing values have been imputed
Note on csi_pecarn_pred.csv: unlike the rest of the datasets in this repo, which are fully cleaned, this file contains a variable ("SITE") that should be removed before fitting models.
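One way to remove the "SITE" variable before fitting is to filter the columns by name. The sketch below is a hedged example on toy data: `drop_features` is a hypothetical helper, the other feature names are placeholders, and it assumes the column appears in `feature_names` under exactly the name "SITE".

```python
def drop_features(X, feature_names, to_drop):
    """Remove the named columns from row-major data `X` and from `feature_names`."""
    keep = [i for i, name in enumerate(feature_names) if name not in to_drop]
    X_kept = [[row[i] for i in keep] for row in X]
    names_kept = [feature_names[i] for i in keep]
    return X_kept, names_kept


# Toy stand-in for the values returned by get_clean_dataset; the placeholder
# names "feature_a" and "feature_b" are not real csi_pecarn_pred columns.
feature_names = ["feature_a", "SITE", "feature_b"]
X = [[4, 12, 0], [9, 3, 1]]

X_clean, names_clean = drop_features(X, feature_names, to_drop={"SITE"})
print(names_clean)  # ['feature_a', 'feature_b']
print(X_clean)      # [[4, 0], [9, 1]]
```

When `X` is a NumPy array, the same index list can be applied directly as `X[:, keep]`.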
Dataset | Task | Size | References |
---|---|---|---|
iai_pecarn | Predict intra-abdominal injury requiring acute intervention before CT | 12,044 patients, 203 with IAI-I | 📄, 🔗 |
tbi_pecarn | Predict traumatic brain injuries before CT | 42,412 patients, 376 with ciTBI | 📄, 🔗 |
csi_pecarn | Predict cervical spine injury in children | 3,314 patients, 540 with CSI | 📄, 🔗 |
The `breast_cancer` dataset here is not the extremely common Wisconsin breast-cancer dataset but rather this dataset from OpenML. Preprocessing (e.g. dropping missing values) results in the cleaned data having n=277, p=17, rather than the original n=286, p=9.
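To show how preprocessing can shrink n while growing p, here is a toy sketch. It drops rows with missing values, as mentioned above, and then one-hot encodes a categorical column. The one-hot step is an assumption about why p increases, not a description of the actual notebook, and the rows are made up.

```python
def drop_missing(rows):
    """Keep only rows with no missing (None) entries."""
    return [row for row in rows if None not in row]


def one_hot(rows, col):
    """Replace column `col` with one 0/1 indicator column per observed category."""
    categories = sorted({row[col] for row in rows})
    return [
        row[:col] + [int(row[col] == c) for c in categories] + row[col + 1:]
        for row in rows
    ]


# Made-up toy rows: [numeric feature, categorical feature]
rows = [[1, "left"], [2, None], [3, "right"], [4, "central"]]
clean = drop_missing(rows)        # one row removed, so n goes from 4 to 3
encoded = one_hot(clean, col=1)   # 3 categories, so p goes from 2 to 4
print(len(rows), len(rows[0]), "->", len(encoded), len(encoded[0]))  # 4 2 -> 3 4
```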
Some other cool datasets: