koalas

Koalas: pandas API on Apache Spark

APACHE-2.0 License

koalas - Version 0.25.0

Published by HyukjinKwon almost 5 years ago

loc and iloc indexers improvement

We improved loc and iloc indexers. Now, loc can support scalar values as indexers (#1172).

>>> import databricks.koalas as ks
>>>
>>> df = ks.DataFrame([[1, 2], [4, 5], [7, 8]],
...                   index=['cobra', 'viper', 'sidewinder'],
...                   columns=['max_speed', 'shield'])
>>> df.loc['sidewinder']
max_speed    7
shield       8
Name: sidewinder, dtype: int64
>>> df.loc['sidewinder', 'max_speed']
7

In addition, Series derived from a different Frame can be used as indexers (#1155).

>>> import databricks.koalas as ks
>>>
>>> ks.options.compute.ops_on_diff_frames = True
>>> 
>>> df1 = ks.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [100, 200, 300, 400, 500]},
...                    index=[20, 10, 30, 0, 50])
>>> df2 = ks.DataFrame({'A': [0, -1, -2, -3, -4], 'B': [-100, -200, -300, -400, -500]},
...                    index=[20, 10, 30, 0, 50])
>>> df1.A.loc[df2.A > -3].sort_index()
10    1
20    0
30    2

Lastly, when a slice is used, loc now follows the natural order of the index, matching pandas' behavior (#1159, #1174, #1179). See the example below.

>>> df = ks.DataFrame([[1, 2], [4, 5], [7, 8]],
...                   index=['cobra', 'viper', 'sidewinder'],
...                   columns=['max_speed', 'shield'])
>>> df.loc['cobra':'viper', 'max_speed']
cobra    1
viper    4
Name: max_speed, dtype: int64
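
The slice semantics above can be pictured with a plain-Python sketch. It is only an illustration, not how Koalas implements loc: the helper name loc_slice is hypothetical, labels are assumed to be unique, and a Python list stands in for the index. As in pandas, both endpoints are included and labels come back in the index's own order rather than sorted.

```python
def loc_slice(index, start, stop):
    """Return the labels from start to stop, both inclusive, in index order.

    Hypothetical helper for illustration only; assumes unique labels.
    """
    first = index.index(start)
    last = index.index(stop)
    return index[first:last + 1]


index = ['cobra', 'viper', 'sidewinder']
selected = loc_slice(index, 'cobra', 'viper')
print(selected)  # ['cobra', 'viper']
```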

Other new features and improvements

We added the following new features:

koalas.Series:

  • get (#1153)

koalas.Index

  • drop (#1117)
  • len (#1161)
  • set_names (#1134)
  • argmin (#1162)
  • argmax (#1162)

koalas.MultiIndex:

  • from_product (#1144)
  • drop (#1117)
  • len (#1161)
  • set_names (#1134)

Other improvements

  • Add support from_pandas for Index/MultiIndex. (#1170)
  • Add a hidden column __natural_order__. (#1146)
  • Introduce _LocIndexerLike and consolidate some logic. (#1149)
  • Refactor LocIndexerLike.__getitem__. (#1152)
  • Remove sort in GroupBy._reduce_for_stat_function. (#1147)
  • Randomize index in tests and fix some window-like functions. (#1151)
  • Explicitly don't support Index.duplicated (#1131)
  • Fix DataFrame._repr_html_(). (#1177)
koalas - Version 0.24.0

Published by HyukjinKwon almost 5 years ago

NumPy's universal function (ufunc) compatibility

We added NumPy ufunc compatibility for Koalas DataFrame (#1127). Virtually all ufuncs are now supported. See the example below:

>>> import databricks.koalas as ks
>>> import numpy as np
>>> kdf = ks.range(10)
>>> np.log(kdf)
         id
0       NaN
1  0.000000
2  0.693147
3  1.098612
4  1.386294
5  1.609438
6  1.791759
7  1.945910
8  2.079442
9  2.197225

Other new features and improvements

We added the following new features:

koalas:

  • to_numeric (#1060)

koalas.DataFrame:

  • idxmax (#1054)
  • idxmin (#1054)
  • pct_change (#1051)
  • info (#1124)

koalas.Index

  • fillna (#1102)
  • min (#1114)
  • max (#1114)
  • drop_duplicates (#1121)
  • nunique (#1132)
  • sort_values (#1120)

koalas.MultiIndex:

  • levshape (#1086)
  • min (#1114)
  • max (#1114)
  • sort_values (#1120)

koalas.SeriesGroupBy

  • head (#1050)

koalas.DataFrameGroupBy

  • head (#1050)

Other improvements

  • Setting index name / names for Series (#1079)
  • disable 'str' for 'SeriesGroupBy', disable 'DataFrame' for 'GroupBy' (#1097)
  • Support 'compute.ops_on_diff_frames' for NumPy ufunc compat in Series (#1128)
  • Support arithmetic and comparison APIs on same DataFrames (#1129)
  • Fix rename() for Index to support MultiIndex also (#1125)
  • Set the upper-bound for pandas. (#1137)
  • Fix _cum() for Series to work properly (#1113)
  • Fix value_counts() to work properly when dropna is True (#1116, #1142)
koalas - Version 0.23.0

Published by HyukjinKwon almost 5 years ago

NumPy's universal function (ufunc) compatibility

We added NumPy ufunc compatibility for Koalas Series (#1096, #1106). Virtually all ufuncs are now supported. See the example below:

>>> import databricks.koalas as ks
>>> import numpy as np
>>> kdf = ks.range(10)
>>> kser = np.sqrt(kdf.id)
>>> type(kser)
<class 'databricks.koalas.series.Series'>
>>> kser
0    0.000000
1    1.000000
2    1.414214
3    1.732051
4    2.000000
5    2.236068
6    2.449490
7    2.645751
8    2.828427
9    3.000000

Other new features and improvements

We added the following new features:

koalas:

  • option_context (#1077)

koalas.DataFrame:

  • where (#1018)
  • mask (#1018)
  • iterrows (#1070)

koalas.Series:

  • pop (#866)
  • first_valid_index (#1092)
  • pct_change (#1071)

koalas.Index

  • symmetric_difference (#953, #1059)
  • to_numpy (#1058)
  • transpose (#1056)
  • T (#1056)
  • dropna (#938)
  • shape (#1085)
  • value_counts (#949)

koalas.MultiIndex:

  • symmetric_difference (#953, #1059)
  • to_numpy (#1058)
  • transpose (#1056)
  • T (#1056)
  • dropna (#938)
  • shape (#1085)
  • value_counts (#949)

Other improvements

  • Fix comparison operators to treat NULL as False (#1029)
  • Make corr return koalas.DataFrame (#1069)
  • Include link to Help Thirsty Koalas Fund (#1082)
  • Add Null handling for different frames (#1083)
  • Allow Series.__getitem__ to take boolean Series (#1075)
  • Produce correct output against multiIndex when 'compute.ops_on_diff_frames' is enabled (#1089)
  • Fix idxmax() / idxmin() for Series to work properly (#1078)
koalas - Version 0.22.0

Published by HyukjinKwon almost 5 years ago

Enable Arrow 0.15.1+

Apache Arrow 0.15.0 did not work well with PySpark 2.4, so it was disabled in an earlier release.
Arrow 0.15.1 and later now work with Koalas (#902).

Expanding and Rolling

We also added expanding() and rolling() APIs to groupby(), Series and DataFrame (#985, #991, #990, #1015, #996, #1034, #1037). They support the following functions:

  • min
  • max
  • sum
  • mean
  • std
  • var
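
As a rough illustration of the difference between the two window types, here is a plain-Python sketch of a rolling sum and an expanding sum. It does not use Koalas: the helper names are hypothetical, and it assumes pandas' defaults (a rolling window yields NaN until it is full, while an expanding window needs only one observation).

```python
import math


def rolling_sum(values, window):
    """Sum over a fixed-size trailing window; NaN until the window is full."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(math.nan)
        else:
            result.append(float(sum(values[i + 1 - window:i + 1])))
    return result


def expanding_sum(values):
    """Sum over every value seen so far."""
    result = []
    total = 0.0
    for value in values:
        total += value
        result.append(total)
    return result


data = [1, 2, 3, 4, 5]
rolled = rolling_sum(data, window=3)
expanded = expanding_sum(data)
print(rolled)    # [nan, nan, 6.0, 9.0, 12.0]
print(expanded)  # [1.0, 3.0, 6.0, 10.0, 15.0]
```

The other listed functions (min, max, mean, std, var) differ only in the aggregation applied to each window.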

Multi-index columns support

We continue improving multi-index columns support. We made the following APIs support multi-index columns:

  • median (#995)
  • at (#1049)

Documentation

We added a "Best Practices" section to the documentation (#1041) for Koalas users to read and follow. Please see https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html

Other new features and improvements

We added the following new features:

koalas.DataFrame:

  • quantile (#984)
  • explain (#1042)

koalas.Series:

  • between (#997)
  • update (#923)
  • mask (#1017)

koalas.MultiIndex:

  • from_tuples (#970)
  • from_arrays (#1001)

Along with the following improvements:

  • Introduce column_scols in InternalFrame as a substitute for data_columns. (#956)
  • Fix different index level assignment when 'compute.ops_on_diff_frames' is enabled (#1045)
  • Fix DataFrame.melt function & Add doctest case for melt function (#987)
  • Enable creating Index from list like 'Index([1, 2, 3])' (#986)
  • Fix combine_frames to handle where the right hand side arguments are modified Series (#1020)
  • setup.py should support Python 2 to show a proper error message. (#1027)
  • Remove Series.schema. (#993)
koalas - Version 0.21.0

Published by HyukjinKwon almost 5 years ago

Multi-index columns support

We continue improving multi-index columns support. We made the following APIs support multi-index columns:

  • nunique (#980)
  • to_csv (#983)

Documentation

We now have an installation guide, design principles and an FAQ in our public documentation (#914, #944, #963, #964).

Other new features and improvements

We added the following new features:

koalas

  • merge (#969)

koalas.DataFrame:

  • keys (#937)
  • ndim (#947)

koalas.Series:

  • keys (#935)
  • mode (#899)
  • truncate (#928)
  • xs (#921)
  • where (#922)
  • first_valid_index (#936)

koalas.Index:

  • copy (#939)
  • unique (#912)
  • ndim (#947)
  • has_duplicates (#946)
  • nlevels (#945)

koalas.MultiIndex:

  • copy (#939)
  • ndim (#947)
  • has_duplicates (#946)
  • nlevels (#945)

koalas.Expanding

  • count (#978)

Along with the following improvements:

  • Fix passing options as keyword arguments (#968)
  • Make is_monotonic~ work properly for index (#930)
  • Fix Series.__getitem__ to work properly (#934)
  • Fix reindex when all the given columns are included in the existing columns (#975)
  • Add datetime as the equivalent python type to TimestampType (#957)
  • Fix is_unique to respect the current Spark column (#981)
  • Fix a bug when assigning None as the name of an Index (#974)
  • Use name_like_string instead of str directly. (#942, #950)
koalas - Version 0.20.0

Published by ueshin about 5 years ago

Disable Arrow 0.15

Apache Arrow 0.15.0, which Koalas depends on to execute pandas UDFs, was released on October 5, 2019, but the Spark community has reported an issue with PyArrow 0.15.

We decided to set an upper bound on the pyarrow version to avoid such issues until we are sure that Koalas works fine with it.

  • Set an upper bound for pyarrow version. (#918)

Multi-index columns support

We continue improving multi-index columns support. We made the following APIs support multi-index columns:

  • pivot_table (#908)
  • melt (#920)

Other new features and improvements

We added the following new features:

koalas.DataFrame:

  • xs (#892)

koalas.Series:

  • drop_duplicates (#896)
  • replace (#903)

koalas.GroupBy:

  • shift (#910)

Along with the following improvements:

  • Implement nested renaming for groupby agg (#904)
  • Add 'index_col' parameter to DataFrame.to_spark (#906)
  • Add more options to read_csv (#916)
  • Add NamedAgg (#911)
  • Enable DataFrame setting value as list of labels (#905)
koalas - Version 0.19.0

Published by ueshin about 5 years ago

Koalas Logo

We now have an official logo!

You can see the cute logo in our documentation as well.

Documentation

We also improved the documentation: https://koalas.readthedocs.io/en/latest/

  • Added the logo (#831)
  • Added a Jupyter notebook for 10 min tutorial (#843)
  • Added the tutorial to the documentation (#853)
  • Added some examples for plot implementations in their docstrings (#847)
  • Move contribution guide to the official documentation site (#841)

Binder integration for the 10 min tutorial

You can run a live Jupyter notebook for the 10 min tutorial from Binder.

Multi-index columns support

We continue improving multi-index columns support. We made the following APIs support multi-index columns:

  • transform (#800)
  • round (#802)
  • unique (#809)
  • duplicated (#803)
  • assign (#811)
  • merge (#825)
  • plot (#830)
  • groupby and its functions (#833)
  • update (#848)
  • join (#848)
  • drop_duplicates (#856)
  • dtype (#858)
  • filter (#859)
  • dropna (#857)
  • replace (#860)

Plots

We also continue adding plot APIs as follows:

For DataFrame:

  • plot.kde() (#784)

Other new features and improvements

We added the following new features:

koalas.DataFrame:

  • pop (#791)
  • __iter__ (#836)
  • rename (#806)
  • expanding (#840)
  • rolling (#840)

koalas.Series:

  • aggregate (#816)
  • agg (#816)
  • expanding (#840)
  • rolling (#840)
  • drop (#829)
  • copy (#869)

koalas.DataFrameGroupBy:

  • expanding (#840)
  • rolling (#840)

koalas.SeriesGroupBy:

  • expanding (#840)
  • rolling (#840)

Along with the following improvements:

  • Add squeeze argument to read_csv (#812)
  • Raise a more helpful error for duplicated columns in Join (#820)
  • Issue with ks.merge to Series (#818)
  • Fix MultiIndex.to_pandas() and __repr__(). (#832)
  • Add unit and origin options for to_datetime (#839)
  • Fix on wrong error raise in DataFrame.fillna (#844)
  • Allow str and list in aggfunc in DataFrameGroupby.agg (#828)
  • Add index_col argument to to_koalas(). (#863)
koalas - Version 0.18.0

Published by HyukjinKwon about 5 years ago

Multi-index columns support

We continue improving multi-index columns support (#793, #776). We made the following APIs support multi-index columns:

  • applymap (#793)
  • shift (#793)
  • diff (#793)
  • fillna (#793)
  • rank (#793)

Also, a tuple or None can now be set as the name of a Series or Index. (#776)

>>> import databricks.koalas as ks
>>> kser = ks.Series([1, 2, 3])
>>> kser.name = ('a', 'b')
>>> kser
0    1
1    2
2    3
Name: (a, b), dtype: int64

Plots

We also continue adding plot APIs as follows:

For Series:

  • plot.kde() (#767)

For DataFrame:

  • plot.hist() (#780)

Options

In addition, we added support for namespace-style access to options (#785).

>>> import databricks.koalas as ks
>>> ks.options.display.max_rows
1000
>>> ks.options.display.max_rows = 10
>>> ks.options.display.max_rows
10

See also the User Guide in our project docs.

Other new features and improvements

We added the following new features:

koalas.DataFrame:

  • aggregate (#796)
  • agg (#796)
  • items (#787)

koalas.indexes.Index/MultiIndex

  • is_boolean (#795)
  • is_categorical (#795)
  • is_floating (#795)
  • is_integer (#795)
  • is_interval (#795)
  • is_numeric (#795)
  • is_object (#795)

Along with the following improvements:

  • Add index_col for read_json (#797)
  • Add index_col for spark IO reads (#769, #775)
  • Add "sep" parameter for read_csv (#777)
  • Add axis parameter to dataframe.diff (#774)
  • Add read_json and let to_json use spark.write.json (#753)
  • Use spark.write.csv in to_csv of Series and DataFrame (#749)
  • Handle TimestampType separately when convert to pandas' dtype. (#798)
  • Fix spark_df when set_index(.., drop=False). (#792)

Backward compatibility

  • We removed some parameters in DataFrame.to_csv and DataFrame.to_json to allow distributed writing (#749, #753)
koalas - Version 0.17.0

Published by ueshin about 5 years ago

Options

We started using options to configure Koalas' behavior. Now we have the following options:

  • display.max_rows (#714, #742)
  • compute.max_rows (#721, #736)
  • compute.shortcut_limit (#717)
  • compute.ops_on_diff_frames (#725)
  • compute.default_index_type (#723)
  • plotting.max_rows (#728)
  • plotting.sample_ratio (#737)

The full list of options and their descriptions is also available in the User Guide of our project docs.

Plots

We continue adding plot APIs as follows:

For Series:

  • plot.area() (#704)

For DataFrame:

  • plot.line() (#686)
  • plot.bar() (#695)
  • plot.barh() (#698)
  • plot.pie() (#703)
  • plot.area() (#696)
  • plot.scatter() (#719)

Multi-index columns support

We also continue improving multi-index columns support. We made the following APIs support multi-index columns:

  • koalas.concat() (#680)
  • koalas.get_dummies() (#695)
  • DataFrame.pivot_table() (#635)

Other new features and improvements

We added the following new features:

koalas:

  • read_sql_table() (#741)
  • read_sql_query() (#741)
  • read_sql() (#741)

koalas.DataFrame:

  • style (#712)

Along with the following improvements:

  • GroupBy.apply should return Koalas DataFrame instead of pandas DataFrame (#731)
  • Fix rpow and rfloordiv to use proper operators in Series (#735)
  • Fix rpow and rfloordiv to use proper operators in DataFrame (#740)
  • Add schema inference support at DataFrame.transform (#732)
  • Add Option class to support type check and value check in options (#739)
  • Added missing tests (#687, #692, #694, #709, #711, #730, #729, #733, #734)

Backward compatibility

  • We renamed two of the default index names from one-by-one and distributed-one-by-one to sequence and distributed-sequence respectively. (#679)
  • We moved the configuration for enabling operations on different DataFrames from the environment variable to the option. (#725)
  • We moved the configuration for the default index from the environment variable to the option. (#723)
koalas - Version 0.16.0

Published by HyukjinKwon about 5 years ago

Firstly, we introduced a new mode that enables operations on different DataFrames (#633). This mode can be enabled by setting the OPS_ON_DIFF_FRAMES environment variable to true. With it enabled, operations such as the following work:

>>> import databricks.koalas as ks
>>>
>>> kdf1 = ks.range(5)
>>> kdf2 = ks.DataFrame({'id': [5, 4, 3]})
>>> (kdf1 - kdf2).sort_index()
    id
0 -5.0
1 -3.0
2 -1.0
3  NaN
4  NaN

>>> import databricks.koalas as ks
>>>
>>> kdf = ks.range(5)
>>> kdf['new_col'] = ks.Series([1, 2, 3, 4])
>>> kdf
   id  new_col
0   0      1.0
1   1      2.0
3   3      4.0
2   2      3.0
4   4      NaN
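
The subtraction example above can be pictured as label-by-label alignment: values are combined where an index label exists on both sides, and the result is missing (NaN) elsewhere. The following plain-Python sketch is only a conceptual illustration under that assumption, not Koalas' implementation; the helper name subtract_aligned is hypothetical and dictionaries stand in for the indexed columns.

```python
import math


def subtract_aligned(left, right):
    """Subtract two {index label: value} mappings label by label.

    Labels present on only one side produce NaN in the result.
    """
    result = {}
    for label in sorted(set(left) | set(right)):
        if label in left and label in right:
            result[label] = float(left[label] - right[label])
        else:
            result[label] = math.nan
    return result


kdf1_id = {i: i for i in range(5)}      # like ks.range(5)
kdf2_id = dict(enumerate([5, 4, 3]))    # like ks.DataFrame({'id': [5, 4, 3]})
diff = subtract_aligned(kdf1_id, kdf2_id)
print(diff)  # {0: -5.0, 1: -3.0, 2: -1.0, 3: nan, 4: nan}
```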

Secondly, we introduced a default index and no longer allow a Koalas DataFrame without an index internally (#639, #655). For example, if you create a Koalas DataFrame from a Spark DataFrame, the default index is used. The default index implementation can be configured by setting DEFAULT_INDEX to one of three types:

  • (default) one-by-one: It implements a one-by-one sequence using a Window function
    without specifying a partition. This index type should be avoided when the data is large.

    >>> ks.range(3)
       id
    0   0
    1   1
    2   2
    
  • distributed-one-by-one: It implements a one-by-one sequence using a group-by and
    group-map approach. It still generates a globally sequential one-by-one index.
    If the default index must be a one-by-one sequence on a large dataset, this
    index can be used.

    >>> ks.range(3)
       id
    0   0
    1   1
    2   2
    
  • distributed: It implements a monotonically increasing sequence simply by using
    Spark's monotonically_increasing_id function. If the index does not have to be
    a one-by-one sequence, this index can be used. Performance-wise, this index
    has almost no penalty compared to the other index types.

    >>> ks.range(3)
                 id
    25769803776   0
    60129542144   1
    94489280512   2
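
The large values in the distributed example come from the bit layout that Spark documents for monotonically_increasing_id: the partition ID goes in the upper 31 bits and the record number within each partition in the lower 33 bits. The sketch below decodes the three values shown above under that assumption. The helper names are hypothetical, and the partition numbers you see will vary from run to run.

```python
RECORD_BITS = 33  # lower 33 bits hold the record number within a partition


def distributed_index(partition_id, record_number):
    """Combine a partition ID and a record number into one 64-bit ID."""
    return (partition_id << RECORD_BITS) | record_number


def split_index(value):
    """Split an ID back into (partition_id, record_number)."""
    return value >> RECORD_BITS, value & ((1 << RECORD_BITS) - 1)


ids = [25769803776, 60129542144, 94489280512]
decoded = [split_index(v) for v in ids]
print(decoded)  # [(3, 0), (7, 0), (11, 0)]
```

Each value is the first record of a different partition, which is why the index is increasing but not consecutive.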
    

Thirdly, we implemented many plot APIs in Series as follows:

  • plot.pie() (#669)
  • plot.area() (#670)
  • plot.line() (#671)
  • plot.barh() (#673)

See the example below:

import databricks.koalas as ks

ks.range(10).to_pandas().id.plot.pie()

(Image: the resulting pie chart.)

Fourthly, we continued to rapidly improve multi-index columns support. Multi-index columns are now supported in the following APIs:

  • DataFrame.sort_index() (#637)
  • GroupBy.diff() (#653)
  • GroupBy.rank() (#653)
  • Series.any() (#652)
  • Series.all() (#652)
  • DataFrame.any() (#652)
  • DataFrame.all() (#652)
  • DataFrame.assign() (#657)
  • DataFrame.drop() (#658)
  • DataFrame.reindex() (#659)
  • Series.quantile() (#663)
  • Series.transform() (#663)
  • DataFrame.select_dtypes() (#662)
  • DataFrame.transpose() (#664)

Lastly, we added new functionality, especially groupby-related, in the past weeks. We added the following features:

koalas.DataFrame

  • duplicated() (#569)
  • fillna() (#640)
  • bfill() (#640)
  • pad() (#640)
  • ffill() (#640)

koalas.groupby.GroupBy:

  • diff() (#622)
  • nunique() (#617)
  • nlargest() (#654)
  • nsmallest() (#654)
  • idxmax() (#649)
  • idxmin() (#649)

Along with the following improvements:

  • Add a basic infrastructure for configurations. (#645)
  • Always use column_index. (#648)
  • Allow omitting the type hint in GroupBy.transform, filter and apply (#646)
koalas - Version 0.15.0

Published by ueshin about 5 years ago

We rapidly improved Koalas and added new functionality, especially groupby-related, in the past weeks. We also added the following features:

koalas.groupby.GroupBy:

  • size() (#593)
  • filter() (#614)
  • cummax() (#610)
  • cummin() (#610)
  • cumsum() (#610)
  • cumprod() (#610)
  • rank() (#619)

koalas.groupby.SeriesGroupBy:

  • apply() (#609)
  • value_counts() (#613)

koalas.indexes.Index:

  • size() (#623)

Along with the following improvements:

  • Add multiple aggregations on a single column (#602)
  • Add axis=columns to count, var, std, max, sum, min, kurtosis, skew and mean in DataFrame (#605)
  • Add Spark DDL formatted string support in read_csv(names=...) (#604)
  • Support names of index levels (#621, #629)
  • Add as_index argument to groupby. (#627)
  • Fix issues related to multi-index column access (#594, #597, #606, #611, #612, #620)
koalas - Version 0.14.0

Published by HyukjinKwon about 5 years ago

We added basic multi-index support in columns (#590), as shown below. A pandas multi-index can also be mapped.

>>> import databricks.koalas as ks
>>> import numpy as np
>>>
>>> arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
...           np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
>>> kdf = ks.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=arrays)
>>> kdf
        bar                 baz                 foo                 qux
        one       two       one       two       one       two       one       two
A -1.574777  0.805108  0.139748  1.287946 -1.782297 -0.152292  0.680594  1.419407
B  0.076886 -1.560807  0.403807 -0.715029  1.236899 -0.364483 -1.548554  0.076003
C -0.575168  0.061539 -2.083615 -0.816090 -1.267440  0.745949 -1.194421  0.468818
>>> kdf['bar']
        one       two
A -1.574777  0.805108
B  0.076886 -1.560807
C -0.575168  0.061539
>>> kdf['bar']['two']
A    0.805108
B   -1.560807
C    0.061539
Name: two, dtype: float64
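
To make the selection rule in the example above concrete, here is a plain-Python sketch with column labels represented as tuples. It illustrates the behavior only and is not Koalas' internal representation; the helper name select_top_level is hypothetical. Selecting a top-level key keeps the matching columns and drops that level from their labels.

```python
def select_top_level(columns, key):
    """Return (position, remaining label) pairs for columns under key.

    Hypothetical helper for illustration only.
    """
    selected = []
    for position, label in enumerate(columns):
        if label[0] == key:
            rest = label[1:]
            # A single remaining level is shown as a plain label.
            selected.append((position, rest[0] if len(rest) == 1 else rest))
    return selected


columns = [(top, sub)
           for top in ('bar', 'baz', 'foo', 'qux')
           for sub in ('one', 'two')]
bar_columns = select_top_level(columns, 'bar')
print(bar_columns)  # [(0, 'one'), (1, 'two')]
```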

In addition, we are triaging which APIs to explicitly support and which to explicitly leave unsupported (#574, #580). Some pandas APIs will be explicitly unsupported, following "Guardrails to prevent users from shooting themselves in the foot" and other justifications such as the cost of their operations.

We also added the following features:

koalas.DataFrame:

  • ffill() (#571)
  • bfill() (#570)
  • filter() (#589)

koalas.Series:

  • idxmax() (#587)
  • idxmin() (#587)

koalas.indexes.Index:

  • Index.rename() (#581)

koalas.groupby.GroupBy:

  • apply() (#584)
  • transform() (#585)

Along with the following improvements:

  • pandas 0.25 support (#579)
  • method and limit parameter support in DataFrame.fillna() (#565)
  • Dots (.) in columns names are allowed (#490)
  • Add support of level argument for DataFrame/Series.sort_index() (#583)
koalas - Version 0.13.0

Published by HyukjinKwon over 5 years ago

We rapidly improved Koalas and added new functionality in the past week. We also added the following features:

koalas.DataFrame:

  • diff (#562)
  • shift (#562)
  • round (#537)
  • rank (#546)
  • any (#568)
  • all (#568)

koalas.Series:

  • diff (#564)
  • quantile (#566)
  • shift (#563)
  • is_monotonic (#560)
  • is_monotonic_increasing (#560)
  • is_monotonic_decreasing (#560)
  • round (#537)
  • rank (#546)
koalas - Version 0.12.0

Published by HyukjinKwon over 5 years ago

We rapidly improved Koalas and added new functionality in the past week. We also added the following features:

koalas:

  • isna (#548)
  • isnull (#548)
  • notna (#548)
  • notnull (#548)

koalas.DataFrame:

  • bool (#533)
  • reindex (#493)
  • pivot (#532)
  • transform (#541)
  • median (#544)
  • cumprod (#545)

koalas.Series:

  • cummax (#534)
  • cummin (#534)
  • cumsum (#534)
  • bool (#533)
  • median (#540)
  • transpose (#543)
  • T (#543)
  • cumprod (#545)
  • hasnans (#547)

Along with the following improvements:

  • Fix DataFrame.replace to take kdf.replace({0: 10, 1: 100}) (#527)
koalas - Version 0.11.0

Published by HyukjinKwon over 5 years ago

We fixed a critical regression in pandas 0.23.x compatibility (#528, #529).
pandas 0.23.x support is now back.

koalas - Version 0.10.0

Published by HyukjinKwon over 5 years ago

We added infrastructure for usage logging (#494). It allows a custom logger to handle the success or failure of each API call. Koalas also has a built-in logger, databricks.koalas.usage_logging.usage_logger, which uses Python logging.

In addition, Koalas experimentally introduced type hints for both Series and DataFrame (#453). The new type hints are used as below:

def func(...) -> ks.Series[np.float]:
    ...
def func(...) -> ks.DataFrame[np.float, int, str]:
    ...

We also added the following features:

koalas.DataFrame:

  • update (#498)
  • pivot_table (#386)
  • pow (#503)
  • rpow (#503)
  • mod (#503)
  • rmod (#503)
  • floordiv (#503)
  • rfloordiv (#503)
  • T (#469)
  • transpose (#469)
  • select_dtypes (#510)
  • replace (#495)
  • cummin (#521)
  • cummax (#521)
  • cumsum (#521)

koalas.Series:

  • rank (#516)

Along with the following improvements:

  • Remaining Koalas Series.str functions (#496)
  • nunique in koalas.groupby.GroupBy.agg (#512)
koalas - Version 0.9.0

Published by ueshin over 5 years ago

We bumped the supported MLflow version to 1.0, and a URI pointing to the model can now be used. Please see the MLflow documentation for more details. Note that older versions are no longer supported. (#477)

We also added the following features:

koalas:

  • melt (#474)

koalas.DataFrame:

  • eq (#476)
  • ne (#476)
  • gt (#476)
  • ge (#476)
  • lt (#476)
  • le (#476)
  • join (#473)
  • melt (#474)
  • get_dtype_counts (#480)

koalas.Series:

  • eq (#476)
  • ne (#476)
  • gt (#476)
  • ge (#476)
  • lt (#476)
  • le (#476)
  • get_dtype_counts (#480)
  • to_frame (#483)

koalas.groupby.GroupBy:

  • all (#485)
  • any (#485)

Along with the following improvements:

  • The Koalas DataFrame constructor can now take Koalas Series. (#470)
  • Many missing properties and functions were added to the Series.dt property (#478)
koalas - Version 0.8.0

Published by HyukjinKwon over 5 years ago

We added new functionality, improved the documentation and fixed some bugs in the past week. koalas.sql was also improved (#448): Koalas DataFrames and some regular Python types can now be used directly in SQL, for instance:

>>> mydf = ks.range(10)
>>> x = range(4)
>>> ks.sql("SELECT * from {mydf} WHERE id IN {x}")
   id
0   0
1   1
2   2
3   3

We also added the following features:

koalas

  • read_spark_io (#447)
  • read_table (#449)
  • read_delta (#456)

koalas.DataFrame:

  • append (#388)
  • from_records (#436)
  • to_parquet (#443)
  • to_spark_io (#447)
  • to_table (#449)
  • cache (#397)
  • to_delta (#456)
  • drop_duplicates (#458)

koalas.Series:

  • append (#388)
  • str (#429)
  • plot (#294)
  • hist (#294)

Along with the following improvements:

  • mean, sum, skew, kurtosis, min, max, std and var in DataFrame and Series now support the numeric_only argument (#422)
koalas - Version 0.7.0

Published by HyukjinKwon over 5 years ago

We refined the internal structure, improved the documentation and added new functionalities in the past week.

We also added the following features:

koalas:

  • read_clipboard (#430)
  • read_excel (#430)
  • read_html (#430)

koalas.DataFrame:

  • at (#384)
  • nunique (#346)
  • add_prefix (#414)
  • add_suffix (#414)
  • add (#427)
  • radd (#427)
  • div (#427)
  • divide (#427)
  • rdiv (#427)
  • truediv (#427)
  • rtruediv (#427)
  • mul (#427)
  • multiply (#427)
  • rmul (#427)
  • sub (#427)
  • subtract (#427)
  • rsub (#427)

koalas.Series:

  • at (#384)
  • nunique (#346)
  • add_prefix (#414)
  • add_suffix (#414)
  • transform (#428)
koalas - Version 0.6.0

Published by ueshin over 5 years ago

We added basic integration with MLflow, so that models that have the pyfunc flavor (which is most of them) can be loaded as predictors. These predictors then work on both pandas and Koalas DataFrames with no code change. See the documentation example for details. (#353)

We also added the following features:

koalas.DataFrame:

  • sort_index (#380)
  • applymap (#390)
  • empty (#391)

koalas.Series:

  • sort_values (#366)
  • to_list (#379)
  • sort_index (#380)
  • pipe (#392)
  • map (#389)
  • empty (#391)
  • add (#401)
  • radd (#401)
  • div (#401)
  • divide (#401)
  • rdiv (#401)
  • truediv (#401)
  • rtruediv (#401)
  • mul (#401)
  • multiply (#401)
  • rmul (#401)
  • sub (#401)
  • subtract (#401)
  • rsub (#401)

Along with the following improvements:

  • DataFrame.merge function now supports left_on and right_on arguments. (#381)
  • DataFrame.describe function now supports percentiles argument. (#378)