The cross-platform application in this repository is named YouML, which stands for Your (free) Machine Learning (toolkit). It aims to provide the machine learning community with a free, no-code toolkit for preprocessing data and building machine learning models. Several key features will be released once no major bugs can be found in the current version. The ultimate goal is to deliver a platform where users can obtain solutions to tough problems in their machine learning tasks.
Task | Number of algorithms | Extra information |
---|---|---|
Data Cleaning | 3 | N/A |
Feature Selection | 3 | N/A |
Data Preprocessing | 35 | unlimited SQL queries |
Data Splitting | 4 | unlimited SQL queries |
Machine Learning | 50 | 23 for classification, 27 for regression |
The 35 Data Preprocessing algorithms mentioned above are classified into 3 groups according to what they operate on (the ones marked with an asterisk are imported from Scikit-learn).
Features (23)
: Binarizer*, KBinsDiscretizer*, LabelEncoder*, OrdinalEncoder*, MaxAbsScaler*, MinMaxScaler*, StandardScaler*, RobustScaler*, Normalizer*, PolynomialFeatures*, QuantileTransformer*, PowerTransformer*, FeatureRenaming, FeatureRemoving, FeatureAddition, Numeric -> Nominal, Nominal -> Numeric, ValueSorting, ValueReplacement, ValueMerging (specified), ValueMerging (infrequent), Missing -> MostFrequent, Missing -> Percentile, OutlierCleaning
Samples (6)
: Resampling, Randomization, SampleRemoving (range), SampleRemoving (values), SampleRemoving (duplicates), SampleRemoving (infrequent)
Dimension (6)
: OneHotEncoder*, IncrementalPCA*, KernelPCA*, PCA*, FactorAnalysis*, GaussianRandomProjection*
The 3 Feature Selection algorithms mentioned above, all from Scikit-learn, are:
RFE, SelectFromModel, SelectKBest
The 23 Classification algorithms mentioned above, all from Scikit-learn, are:
DecisionTreeClassifier, ExtraTreeClassifier, RandomForestClassifier, GradientBoostingClassifier, HistGradientBoostingClassifier, GaussianProcessClassifier, LogisticRegression, PassiveAggressiveClassifier, Perceptron, RidgeClassifier, SGDClassifier, BernoulliNB, CategoricalNB, ComplementNB, GaussianNB, MultinomialNB, KNeighborsClassifier, RadiusNeighborsClassifier, MLPClassifier, LinearSVC, NuSVC, SVC, DummyClassifier
The 27 Regression algorithms mentioned above, all from Scikit-learn, are:
DecisionTreeRegressor, ExtraTreeRegressor, RandomForestRegressor, GradientBoostingRegressor, HistGradientBoostingRegressor, GaussianProcessRegressor, LinearRegression, Ridge, SGDRegressor, ElasticNet, Lars, Lasso, LassoLars, LassoLarsIC, OrthogonalMatchingPursuit, ARDRegression, BayesianRidge, HuberRegressor, TheilSenRegressor, PassiveAggressiveRegressor, KNeighborsRegressor, RadiusNeighborsRegressor, LinearSVR, NuSVR, SVR, MLPRegressor, DummyRegressor
YouML employs mainstream data science libraries to keep data manipulation efficient.
YouML is designed for entry- to mid-level users (prior knowledge of statistics is desirable). Sophisticated users who need such an automation toolkit can purchase cloud-based machine learning products from tech giants (e.g., Microsoft, Google, Amazon).
A preliminary technical solution for achieving the ultimate goal described above has been determined. I will make every effort, and look for collaborators, to complete it on schedule. My aim is to create an application that is, or comes with, a:
Solution hub
YouML is a personal project that has never been tested by any outside user, so I realize that it likely has numerous bugs. I am therefore eager to hear about your experiences: bug reports, feature requests, questions, and suggestions are all very welcome.
Author: Chongya Song
WeChat: schongy
Email: [email protected]
Profile: https://www.linkedin.com/in/chongya-song/
YouML follows the same installation routine as common Python packages (i.e., pip3 install YouML and python setup.py install), so experienced users may skip this section and install YouML in whatever manner they prefer.
On Windows, YouML runs on top of the built-in Windows Subsystem for Linux (WSL) hosted on Windows 10 and 11, so the following commands should be executed inside WSL rather than in Windows itself. Newcomers can install WSL by following the instructions below:
https://docs.microsoft.com/en-us/windows/wsl/install
To install YouML, the easiest approach is to execute the following command (case-insensitive) in your command prompt/terminal:
pip3 install YouML
Reminder: for stable performance, the YouML release on PyPI (i.e., the official Python package repository: https://pypi.org/project/YouML/) is pinned to Python 3.7.11, which is the version used for development and self-testing. Furthermore, the dozens of dependencies that YouML relies on may disturb your current Python environment and conflict with existing packages. Accordingly, it is highly recommended to create a dedicated virtual environment (e.g., with conda or pyenv) running Python 3.7.11 for YouML.
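Before running pip3 install YouML, it can help to confirm which interpreter is active. The snippet below is a minimal sketch of such a check; the helper name matches_pinned is hypothetical and is not part of YouML.

```python
import sys

# Version the PyPI release of YouML is pinned to, per the reminder above.
PINNED = (3, 7, 11)


def matches_pinned(version_info, pinned=PINNED):
    """Hypothetical helper: True when (major, minor, micro) equals `pinned`."""
    return tuple(version_info[:3]) == pinned


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if matches_pinned(sys.version_info):
        print(f"Python {current}: matches the pinned version.")
    else:
        print(f"Python {current}: differs from 3.7.11; consider a dedicated environment.")
```

Running it inside the environment intended for YouML shows whether that environment was created with the expected interpreter.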
If you prefer a Python interpreter newer than 3.7.11, or simply want to install YouML manually, follow the instructions below.
Download the entire YouML repository from GitHub (i.e., https://github.com/ChongyaSong/YouML) and uncompress it into a folder named YouML-main. YouML runs on top of dozens of dependencies, which can be installed by two package managers:
Although it is easier to install dependencies using pip3, I still recommend that beginners install everything through conda for the following reasons:
pip3 search is currently inaccessible due to security threats (as of Feb 17, 2022). Consequently, you have to search for alternative dependencies manually if something goes wrong during installation (e.g., a version conflict).
Install Anaconda or Miniconda (recommended) by following the instructions below. Beginners can download and use any installer (excluding ARM-based ones) to simplify the installation.
https://docs.conda.io/en/latest/miniconda.html
Open a command prompt/terminal and navigate into the uncompressed folder:
YouML-main/YouML_MacOS (for MacOS Users)
or
YouML-main/YouML_Linux_Windows (for Linux and Windows Users)
(Reminder: installing YouML from outside the folder YouML_xxxxx will result in a ModuleNotFoundError.)
conda config --append channels conda-forge
This command enables conda to download dependencies from the conda-forge repository, because a few dependencies and/or specific versions are not available in the default repository.
conda create --name YouML --file Dependency.txt --yes
This command creates a new conda virtual environment named YouML (case-insensitive on MacOS, but case-sensitive on Linux and Windows Subsystem for Linux) in which all dependencies are installed.
conda activate YouML
This command brings you into the conda virtual environment YouML.
python3 setup_conda.py install
This command installs YouML into the conda virtual environment. YouML can then be launched with the command YouML (case-insensitive on MacOS, but case-sensitive on Linux and Windows Subsystem for Linux). Reminder: launching YouML in this manner requires that you first enter the conda virtual environment YouML (i.e., step No.5, conda activate YouML).
Install a Python 3 interpreter, version 3.7.11 (pip is included by default): https://www.python.org/downloads/
Open a command prompt/terminal and navigate into the uncompressed folder:
YouML-main/YouML_MacOS (for MacOS Users)
or
YouML-main/YouML_Linux_Windows (for Linux and Windows Users)
(Reminder: installing YouML from outside the folder YouML_xxxxx will result in a ModuleNotFoundError.)
python3 setup_pip.py install
This command installs YouML and its dependencies on your machine. YouML can then be launched with the command YouML (case-insensitive on MacOS, but case-sensitive on Linux and Windows Subsystem for Linux).
Creating a shortcut (highly recommended)
For MacOS Users
which YouML | xargs -I {} cp {} /Applications
This command creates a copy of the YouML executable in the /Applications folder. You can now drag and drop it anywhere (e.g., the dock) and/or open it by clicking.
For Linux Users
which YouML | xargs -I {} cp {} ~/
This command creates a copy of the YouML executable in your home folder ~/. You can now drag and drop it anywhere (e.g., the dock) and/or open it by clicking.
For Windows Users (i.e., Windows Subsystem for Linux - WSL)
which YouML | xargs -I {} sudo bash -c 'cat << EOF > /usr/share/applications/YouML.desktop
[Desktop Entry]
Type=Application
Name=YouML
Version=0.6
Exec={}
EOF'
This command creates a shortcut to the YouML executable in the Windows Start menu. You can now pin it anywhere (e.g., the taskbar) and/or open it by clicking.
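The WSL one-liner above packs several steps into one pipeline: it locates the executable, builds a desktop entry, and writes it with root privileges. The Python sketch below shows only the first two steps so the resulting file content is easier to read. The function name render_desktop_entry is hypothetical, and the script prints the entry rather than writing it to /usr/share/applications.

```python
import shutil


def render_desktop_entry(exec_path, name="YouML", version="0.6"):
    """Hypothetical helper: build the text of the YouML.desktop file."""
    lines = [
        "[Desktop Entry]",
        "Type=Application",
        f"Name={name}",
        f"Version={version}",
        f"Exec={exec_path}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # shutil.which mirrors `which YouML`; it returns None if YouML is not on PATH.
    path = shutil.which("YouML")
    if path is None:
        print("YouML executable not found on PATH")
    else:
        print(render_desktop_entry(path), end="")
```

Comparing the printed text with the contents of /usr/share/applications/YouML.desktop is a quick way to check that the shell command resolved the expected executable path.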
Reminder: on ARM-based Macs, starting YouML for the first time may take several minutes (due to the binary translation performed by Rosetta 2), but subsequent launches take only seconds.
YouML automatically tracks and saves your latest progress, so there is no save button, and no data will be lost unless YouML is forcibly quit.
The availability of sidebars on each panel is shown in the table below. To trigger a sidebar, move the pointer to the corresponding edge/border.
Sidebar / Panel | Experiment | Data | Sampling | Processing | Splitting | Model |
---|---|---|---|---|---|---|
Workflow (left) | ||||||
Table (bottom) | ||||||
Feature Selection (right) |
You may run into the following situation during data preprocessing: you are satisfied with the manipulations conducted so far, but you have several different ideas for the next steps and would like to compare the accuracy of the models built from the resulting data. Repeatedly undoing and redoing multiple manipulations is clearly bad practice. Fortunately, YouML allows you to create a branch of an experiment by clicking a button next to the experiment name, which works in the following way:
Assume an experiment has conducted 3 data manipulations: original data ---1---> data_1 ---2---> data_2 ---3---> data_3
After creating a branch, both the experiment and its branch start from data_3 (i.e., data_3 serves as the original data of the branch).
Reminder: if you select different data in the Data panel, the branch will be created from that data instead, because Auto-save always tracks your latest progress (refer to USAGE No.1).
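The branching behavior described above can be pictured with a small data model. This is only an illustrative sketch: the Experiment class and its methods are hypothetical and do not reflect how YouML stores experiments internally.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Experiment:
    """Hypothetical model: an experiment name plus its chain of data states."""
    name: str
    history: List[str] = field(default_factory=lambda: ["original data"])

    def apply(self, result_name: str) -> None:
        """Record the data produced by one preprocessing manipulation."""
        self.history.append(result_name)

    @property
    def current(self) -> str:
        """The most recent data state, i.e. what a new branch attaches to."""
        return self.history[-1]

    def branch(self, name: str) -> "Experiment":
        """Create a branch whose starting data is this experiment's current data."""
        return Experiment(name=name, history=[self.current])


exp = Experiment("exp")
for step in ("data_1", "data_2", "data_3"):
    exp.apply(step)

alt = exp.branch("exp-branch")
exp.apply("data_4a")  # continue the experiment one way
alt.apply("data_4b")  # and the branch another way

print(exp.history)  # ['original data', 'data_1', 'data_2', 'data_3', 'data_4a']
print(alt.history)  # ['data_3', 'data_4b']
```

Both lines of work share data_3 as their common starting point, so the models built from data_4a and data_4b can be compared without undoing any manipulation.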
The plotting library employed, Matplotlib, is designed for generating publication-quality figures rather than for real-time display. By default, a small figure associated with each feature is produced and placed in a table at the bottom after each preprocessing manipulation. As a result, YouML may take more than 10 seconds to draw all the colorful figures (refer to the second paragraph of USAGE No.3) if the numbers of samples and features exceed 50,000 and 50, respectively. In fact, the small figures are not very informative because of their limited size. Therefore, YouML also visualizes each feature in a separate 6X-larger widget and does not generate the redundant (i.e., small) figures when the product of #samples and #features is greater than 2.5e6 (i.e., 50,000 times 50). The 6X-larger figures are publication-quality (i.e., ppi = 300) and are generated within a few seconds even if the number of samples exceeds 1e5.
Furthermore, samples are also loaded into the table after each preprocessing manipulation. In practice, however, users preprocessing big data typically refer to summary and statistical information rather than to specific samples. Consequently, YouML allows users to manually turn off both the data loading and the figure plotting features.
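The size rule for the small figures amounts to a single comparison. The sketch below restates it in code using the threshold quoted above; the function name draws_small_figures is hypothetical and is not taken from YouML.

```python
# Threshold quoted in the text: 50,000 samples times 50 features.
SMALL_FIGURE_LIMIT = 50_000 * 50  # 2.5e6


def draws_small_figures(n_samples: int, n_features: int) -> bool:
    """Hypothetical helper: True when the small per-feature figures are still drawn.

    They are skipped only when the product is greater than the limit.
    """
    return n_samples * n_features <= SMALL_FIGURE_LIMIT


print(draws_small_figures(10_000, 200))  # True: 2.0e6 is within the limit
print(draws_small_figures(100_000, 30))  # False: 3.0e6 exceeds the limit
```

Because the rule uses the product, a dataset with few samples but many features can cross the limit just as a long, narrow one can.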
You may find that some of the small figures differ from the corresponding large ones. This is not a bug; it results from a built-in adaptive data visualization algorithm. In short, the algorithm can mine data patterns from various perspectives by adjusting 6 decoupled parameters (which will be made available in YouML). As a result, users may discover more valuable patterns, generate more informative data, and build more accurate models.
In addition, if the number of unique target values does not exceed 20 (configurable in future versions), the large figures are colorful, with each color representing one value; otherwise, the large figures are plain. This is because users may not gain useful information from overly complicated figures.
On the other hand, the labels on each axis are replaced by unique letters if the number of labels exceeds 26 or the concatenation of all labels is longer than the corresponding axis. The mapping between labels and letters can be found in a list adjacent to the figures.
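The coloring rule for the large figures can be written as a short check on the target column. This is a sketch under the stated limit of 20 unique values; the function name use_color and its limit parameter are hypothetical, not part of YouML.

```python
from typing import Hashable, Iterable

MAX_COLORED_TARGETS = 20  # limit quoted in the text


def use_color(target_values: Iterable[Hashable], limit: int = MAX_COLORED_TARGETS) -> bool:
    """Hypothetical helper: True when each unique target value gets its own color."""
    return len(set(target_values)) <= limit


print(use_color(["cat", "dog", "cat", "bird"]))  # True: 3 unique values
print(use_color(range(21)))                      # False: 21 unique values
```

A classification target with a handful of classes stays colorful, while a continuous regression target, which usually has many unique values, falls back to plain figures.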
Experiment Panel
Data Panel (attach data)
Data Panel (import data)
Sampling Panel with Feature Selection
Processing Panel
Splitting Panel with Workflow
Model Panel (classification)
Model Panel (regression)
Quick View: YouTube, bilibili
Tutorial: YouTube, bilibili
Installation (from GitHub via Setup.py): YouTube, bilibili
Installation (from PyPI via Pip3): YouTube, bilibili
Linux and Windows Versions: YouTube, bilibili