Koalas: pandas API on Apache Spark
APACHE-2.0 License
Published by HyukjinKwon almost 5 years ago
loc and iloc indexers improvement
We improved the loc and iloc indexers. Now, loc can support scalar values as indexers (#1172).
>>> import databricks.koalas as ks
>>>
>>> df = ks.DataFrame([[1, 2], [4, 5], [7, 8]],
... index=['cobra', 'viper', 'sidewinder'],
... columns=['max_speed', 'shield'])
>>> df.loc['sidewinder']
max_speed 7
shield 8
Name: sidewinder, dtype: int64
>>> df.loc['sidewinder', 'max_speed']
7
In addition, Series derived from a different Frame can be used as indexers (#1155).
>>> import databricks.koalas as ks
>>>
>>> ks.options.compute.ops_on_diff_frames = True
>>>
>>> df1 = ks.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [100, 200, 300, 400, 500]},
... index=[20, 10, 30, 0, 50])
>>> df2 = ks.DataFrame({'A': [0, -1, -2, -3, -4], 'B': [-100, -200, -300, -400, -500]},
... index=[20, 10, 30, 0, 50])
>>> df1.A.loc[df2.A > -3].sort_index()
10 1
20 0
30 2
Lastly, when a slice is used, loc now follows the natural order of the index, identically to pandas (#1159, #1174, #1179). See the example below.
>>> df = ks.DataFrame([[1, 2], [4, 5], [7, 8]],
... index=['cobra', 'viper', 'sidewinder'],
... columns=['max_speed', 'shield'])
>>> df.loc['cobra':'viper', 'max_speed']
cobra 1
viper 4
Name: max_speed, dtype: int64
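The slice semantics shown above can be pictured with a small pure-Python sketch. This is not Koalas code: `label_slice` is a hypothetical helper, and it assumes the labels are unique and both endpoints exist. It only illustrates that a label slice follows the row order of the index and includes both endpoints, as in pandas.

```python
def label_slice(index, values, start, stop):
    """Return (label, value) pairs from `start` to `stop`, both inclusive.

    Hypothetical helper for illustration only; assumes unique labels.
    """
    lo = index.index(start)
    hi = index.index(stop)
    # Unlike positional slicing, the stop label itself is included.
    return list(zip(index[lo:hi + 1], values[lo:hi + 1]))


index = ['cobra', 'viper', 'sidewinder']
max_speed = [1, 4, 7]
result = label_slice(index, max_speed, 'cobra', 'viper')
print(result)  # [('cobra', 1), ('viper', 4)]
```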
We added the following new features:
koalas.Series:
  get (#1153)
koalas.Index:
  drop (#1117)
  len (#1161)
  set_names (#1134)
  argmin (#1162)
  argmax (#1162)
koalas.MultiIndex:
  from_product (#1144)
  drop (#1117)
  len (#1161)
  set_names (#1134)
from_pandas for Index/MultiIndex. (#1170)
__natural_order__. (#1146)
_LocIndexerLike and consolidate some logic. (#1149)
LocIndexerLike.__getitem__. (#1152)
GroupBy._reduce_for_stat_function. (#1147)
Index.duplicated (#1131)
DataFrame._repr_html_(). (#1177)
Published by HyukjinKwon almost 5 years ago
We added compatibility with NumPy ufuncs (#1127). Virtually all ufuncs are now supported for Koalas DataFrame. See the example below:
>>> import databricks.koalas as ks
>>> import numpy as np
>>> kdf = ks.range(10)
>>> np.log(kdf)
id
0 NaN
1 0.000000
2 0.693147
3 1.098612
4 1.386294
5 1.609438
6 1.791759
7 1.945910
8 2.079442
9 2.197225
We added the following new features:
koalas:
  to_numeric (#1060)
koalas.DataFrame:
  idxmax (#1054)
  idxmin (#1054)
  pct_change (#1051)
  info (#1124)
koalas.Index:
  fillna (#1102)
  min (#1114)
  max (#1114)
  drop_duplicates (#1121)
  nunique (#1132)
  sort_values (#1120)
koalas.MultiIndex:
  levshape (#1086)
  min (#1114)
  max (#1114)
  sort_values (#1120)
koalas.SeriesGroupBy:
  head (#1050)
koalas.DataFrameGroupBy:
  head (#1050)
Published by HyukjinKwon almost 5 years ago
We added compatibility with NumPy ufuncs (#1096, #1106). Virtually all ufuncs are now supported for Koalas Series. See the example below:
>>> import databricks.koalas as ks
>>> import numpy as np
>>> kdf = ks.range(10)
>>> kser = np.sqrt(kdf.id)
>>> type(kser)
<class 'databricks.koalas.series.Series'>
>>> kser
0 0.000000
1 1.000000
2 1.414214
3 1.732051
4 2.000000
5 2.236068
6 2.449490
7 2.645751
8 2.828427
9 3.000000
We added the following new features:
koalas:
  option_context (#1077)
koalas.DataFrame:
  where (#1018)
  mask (#1018)
  iterrows (#1070)
koalas.Series:
  pop (#866)
  first_valid_index (#1092)
  pct_change (#1071)
koalas.Index:
  symmetric_difference (#953, #1059)
  to_numpy (#1058)
  transpose (#1056)
  T (#1056)
  dropna (#938)
  shape (#1085)
  value_counts (#949)
koalas.MultiIndex:
  symmetric_difference (#953, #1059)
  to_numpy (#1058)
  transpose (#1056)
  T (#1056)
  dropna (#938)
  shape (#1085)
  value_counts (#949)
Series.__getitem__ to take boolean Series (#1075)
Published by HyukjinKwon almost 5 years ago
Apache Arrow 0.15.0 did not work well with PySpark 2.4, so it was disabled in the previous version.
With Arrow 0.15.1, it now works in Koalas (#902).
We also added expanding() and rolling() APIs in all groupby(), Series and Frame (#985, #991, #990, #1015, #996, #1034, #1037):
  min
  max
  sum
  mean
  std
  var
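As a rough illustration of the difference between the two window types, here is a minimal pure-Python sketch for `sum`. The helper names are hypothetical, and the sketch assumes the pandas defaults (a rolling window yields NaN until it is full; an expanding window starts producing values from the first row). It does not show how Koalas computes these on Spark.

```python
import math


def rolling_sum(values, window):
    """Sum over a fixed-size trailing window; NaN until the window is full."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(math.nan)
        else:
            out.append(sum(values[i + 1 - window:i + 1]))
    return out


def expanding_sum(values):
    """Sum over everything seen so far (a window that keeps growing)."""
    out, total = [], 0
    for v in values:
        total += v
        out.append(total)
    return out


data = [1, 2, 3, 4]
print(rolling_sum(data, 2))  # [nan, 3, 5, 7]
print(expanding_sum(data))   # [1, 3, 6, 10]
```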
We continue improving multi-index columns support. We made the following APIs support multi-index columns:
  median (#995)
  at (#1049)
We added a "Best Practices" section to the documentation (#1041) for Koalas users to read and follow. Please see https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html
We added the following new features:
koalas.DataFrame:
  quantile (#984)
  explain (#1042)
koalas.Series:
  between (#997)
  update (#923)
  mask (#1017)
koalas.MultiIndex:
  from_tuples (#970)
  from_arrays (#1001)
Along with the following improvements:
  setup.py should support Python 2 to show a proper error message. (#1027)
  Series.schema. (#993)
Published by HyukjinKwon almost 5 years ago
We continue improving multi-index columns support. We made the following APIs support multi-index columns:
  nunique (#980)
  to_csv (#983)
Now we have an installation guide, design principles and an FAQ in our public documentation (#914, #944, #963, #964).
We added the following new features:
koalas:
  merge (#969)
koalas.DataFrame:
  keys (#937)
  ndim (#947)
koalas.Series:
  keys (#935)
  mode (#899)
  truncate (#928)
  xs (#921)
  where (#922)
  first_valid_index (#936)
koalas.Index:
  copy (#939)
  unique (#912)
  ndim (#947)
  has_duplicates (#946)
  nlevels (#945)
koalas.MultiIndex:
  copy (#939)
  ndim (#947)
  has_duplicates (#946)
  nlevels (#945)
koalas.Expanding:
  count (#978)
Along with the following improvements:
Published by ueshin about 5 years ago
Apache Arrow 0.15.0, which Koalas depends on to execute Pandas UDFs, was released on the 5th of October, 2019, but the Spark community has reported an issue with PyArrow 0.15.
We decided to set an upper bound on the pyarrow version to avoid such issues until we are sure that Koalas works fine with it.
We continue improving multi-index columns support. We made the following APIs support multi-index columns:
  pivot_table (#908)
  melt (#920)
We added the following new features:
koalas.DataFrame:
  xs (#892)
koalas.Series:
  drop_duplicates (#896)
  replace (#903)
koalas.GroupBy:
  shift (#910)
Along with the following improvements:
  read_csv (#916)
Published by ueshin about 5 years ago
Now we have an official logo!
You can see the cute logo in our documentation as well.
We also improved the documentation: https://koalas.readthedocs.io/en/latest/
You can run a live Jupyter notebook for 10 min tutorial from .
We continue improving multi-index columns support. We made the following APIs support multi-index columns:
  transform (#800)
  round (#802)
  unique (#809)
  duplicated (#803)
  assign (#811)
  merge (#825)
  plot (#830)
  groupby and its functions (#833)
  update (#848)
  join (#848)
  drop_duplicates (#856)
  dtype (#858)
  filter (#859)
  dropna (#857)
  replace (#860)
We also continue adding plot APIs as follows:
For DataFrame:
  plot.kde() (#784)
We added the following new features:
koalas.DataFrame:
  pop (#791)
  __iter__ (#836)
  rename (#806)
  expanding (#840)
  rolling (#840)
koalas.Series:
  aggregate (#816)
  agg (#816)
  expanding (#840)
  rolling (#840)
  drop (#829)
  copy (#869)
koalas.DataFrameGroupBy:
  expanding (#840)
  rolling (#840)
koalas.SeriesGroupBy:
  expanding (#840)
  rolling (#840)
Along with the following improvements:
  MultiIndex.to_pandas() and __repr__(). (#832)
  index_col argument to to_koalas(). (#863)
Published by HyukjinKwon about 5 years ago
We continue improving multi-index columns support (#793, #776). We made the following APIs support multi-index columns:
  applymap (#793)
  shift (#793)
  diff (#793)
  fillna (#793)
  rank (#793)
Also, we can now set a tuple or None as the name of a Series or Index. (#776)
>>> import databricks.koalas as ks
>>> kser = ks.Series([1, 2, 3])
>>> kser.name = ('a', 'b')
>>> kser
0 1
1 2
2 3
Name: (a, b), dtype: int64
We also continue adding plot APIs as follows:
For Series:
  plot.kde() (#767)
For DataFrame:
  plot.hist() (#780)
In addition, we added support for namespace access in options (#785).
>>> import databricks.koalas as ks
>>> ks.options.display.max_rows
1000
>>> ks.options.display.max_rows = 10
>>> ks.options.display.max_rows
10
See also the User Guide in our project docs.
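To give a feel for how namespace-style option access can be layered on top of dotted option keys, here is a minimal pure-Python sketch. `OptionNamespace`, the `store` dictionary, and this `option_context` signature are hypothetical and are not Koalas' implementation; the option values are only illustrative.

```python
from contextlib import contextmanager


class OptionNamespace:
    """Hypothetical sketch: attribute access mapped onto dotted option keys."""

    def __init__(self, store, prefix=""):
        # Bypass our own __setattr__ while initialising.
        object.__setattr__(self, "_store", store)
        object.__setattr__(self, "_prefix", prefix)

    def __getattr__(self, name):
        key = self._prefix + name
        if key in self._store:
            return self._store[key]
        # A partial key such as "display" returns a nested namespace.
        if any(k.startswith(key + ".") for k in self._store):
            return OptionNamespace(self._store, key + ".")
        raise AttributeError("No such option: " + key)

    def __setattr__(self, name, value):
        key = self._prefix + name
        if key not in self._store:
            raise AttributeError("No such option: " + key)
        self._store[key] = value


@contextmanager
def option_context(store, key, value):
    """Temporarily override one option and restore it afterwards."""
    old = store[key]
    store[key] = value
    try:
        yield
    finally:
        store[key] = old


store = {"display.max_rows": 1000, "compute.max_rows": 1000}
options = OptionNamespace(store)
options.display.max_rows = 10
with option_context(store, "display.max_rows", 5):
    inside = options.display.max_rows   # temporarily 5
after = options.display.max_rows        # restored to 10
print(inside, after)
```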
We added the following new features:
koalas.DataFrame:
  aggregate (#796)
  agg (#796)
  items (#787)
koalas.indexes.Index/MultiIndex:
  is_boolean (#795)
  is_categorical (#795)
  is_floating (#795)
  is_integer (#795)
  is_interval (#795)
  is_numeric (#795)
  is_object (#795)
Along with the following improvements:
  index_col for read_json (#797)
  spark_df when set_index(.., drop=False). (#792)
  DataFrame.to_csv and DataFrame.to_json to allow distributed writing (#749, #753)
Published by ueshin about 5 years ago
We started using options to configure Koalas' behavior. Now we have the following options:
  display.max_rows (#714, #742)
  compute.max_rows (#721, #736)
  compute.shortcut_limit (#717)
  compute.ops_on_diff_frames (#725)
  compute.default_index_type (#723)
  plotting.max_rows (#728)
  plotting.sample_ratio (#737)
The full list and the description of each option are also available in the User Guide of our project docs.
We continue adding plot APIs as follows:
For Series:
  plot.area() (#704)
For DataFrame:
  plot.line() (#686)
  plot.bar() (#695)
  plot.barh() (#698)
  plot.pie() (#703)
  plot.area() (#696)
  plot.scatter() (#719)
We also continue improving multi-index columns support. We made the following APIs support multi-index columns:
  koalas.concat() (#680)
  koalas.get_dummies() (#695)
  DataFrame.pivot_table() (#635)
We added the following new features:
koalas:
  read_sql_table() (#741)
  read_sql_query() (#741)
  read_sql() (#741)
koalas.DataFrame:
  style (#712)
Along with the following improvements:
  GroupBy.apply should return Koalas DataFrame instead of pandas DataFrame (#731)
  rpow and rfloordiv to use proper operators in Series (#735)
  rpow and rfloordiv to use proper operators in DataFrame (#740)
  Option class to support type check and value check in options (#739)
  one-by-one and distributed-one-by-one to sequence and distributed-sequence respectively. (#679)
Published by HyukjinKwon about 5 years ago
Firstly, we introduced a new mode to enable operations on different DataFrames (#633). This mode can be enabled by setting the OPS_ON_DIFF_FRAMES environment variable to true, as below:
>>> import databricks.koalas as ks
>>>
>>> kdf1 = ks.range(5)
>>> kdf2 = ks.DataFrame({'id': [5, 4, 3]})
>>> (kdf1 - kdf2).sort_index()
id
0 -5.0
1 -3.0
2 -1.0
3 NaN
4 NaN
>>> import databricks.koalas as ks
>>>
>>> kdf = ks.range(5)
>>> kdf['new_col'] = ks.Series([1, 2, 3, 4])
>>> kdf
id new_col
0 0 1.0
1 1 2.0
3 3 4.0
2 2 3.0
4 4 NaN
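The results above follow from aligning the two operands on their index before applying the operation. A minimal pure-Python sketch of that alignment is shown here; `subtract_aligned` is a hypothetical helper that models each column as a plain `{index_label: value}` mapping, and it says nothing about how Koalas performs the join on Spark.

```python
import math


def subtract_aligned(left, right):
    """Subtract two {index_label: value} mappings after aligning on the index."""
    result = {}
    for label in sorted(set(left) | set(right)):
        if label in left and label in right:
            result[label] = float(left[label] - right[label])
        else:
            # A label missing on either side yields a missing value.
            result[label] = math.nan
    return result


kdf1_id = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}  # stands in for ks.range(5)
kdf2_id = {0: 5, 1: 4, 2: 3}              # stands in for ks.DataFrame({'id': [5, 4, 3]})
diff = subtract_aligned(kdf1_id, kdf2_id)
print(diff)  # {0: -5.0, 1: -3.0, 2: -1.0, 3: nan, 4: nan}
```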
Secondly, we also introduced a default index and disallowed Koalas DataFrames with no index internally (#639)(#655). For example, if you create a Koalas DataFrame from a Spark DataFrame, the default index is used. The default index implementation can be configured by setting DEFAULT_INDEX to one of three types:
(default) one-by-one: It implements a one-by-one sequence by a Window function without specifying a partition. This index type should be avoided when the data is large.
>>> ks.range(3)
id
0 0
1 1
2 2
distributed-one-by-one: It implements a one-by-one sequence by a group-by and group-map approach. It still generates a one-by-one sequential index globally. If the default index must be a one-by-one sequence in a large dataset, this index can be used.
>>> ks.range(3)
id
0 0
1 1
2 2
distributed: It implements a monotonically increasing sequence simply by using Spark's monotonically_increasing_id function. If the index does not have to be a one-by-one sequence, this index can be used. Performance-wise, this index has almost no penalty compared to the other index types.
>>> ks.range(3)
id
25769803776 0
60129542144 1
94489280512 2
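The large index values in the distributed example come from the way Spark documents monotonically_increasing_id: the partition ID goes in the upper 31 bits and the record number within the partition in the lower 33 bits. The sketch below mimics that layout in pure Python. The function name and the partition layout (12 partitions, with one row each in partitions 3, 7 and 11) are assumptions chosen to reproduce the values shown; the actual layout depends on the Spark session.

```python
def monotonically_increasing_ids(rows_per_partition):
    """Mimic the documented ID layout: partition ID in the upper 31 bits,
    record number within the partition in the lower 33 bits."""
    ids = []
    for partition_id, n_rows in enumerate(rows_per_partition):
        for record_number in range(n_rows):
            ids.append((partition_id << 33) + record_number)
    return ids


# Assumed layout: 12 partitions, one row each in partitions 3, 7 and 11.
rows = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1]
ids = monotonically_increasing_ids(rows)
print(ids)  # [25769803776, 60129542144, 94489280512]
```

The IDs are unique and increasing, but not consecutive, which is why this index type avoids any global coordination.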
Thirdly, we implemented many plot APIs in Series as follows:
See the example below:
import databricks.koalas as ks
ks.range(10).to_pandas().id.plot.pie()
Fourthly, we continued to rapidly improve multi-index columns support. Now multi-index columns are supported in multiple APIs:
  DataFrame.sort_index() (#637)
  GroupBy.diff() (#653)
  GroupBy.rank() (#653)
  Series.any() (#652)
  Series.all() (#652)
  DataFrame.any() (#652)
  DataFrame.all() (#652)
  DataFrame.assign() (#657)
  DataFrame.drop() (#658)
  DataFrame.reindex() (#659)
  Series.quantile() (#663)
  Series.transform() (#663)
  DataFrame.select_dtypes() (#662)
  DataFrame.transpose() (#664)
Lastly, we added new functionality, especially groupby-related functionality, in the past weeks. We added the following features:
koalas.DataFrame:
koalas.groupby.GroupBy:
Along with the following improvements:
  column_index. (#648)
Published by ueshin about 5 years ago
We rapidly improved and added new functionalities, especially for groupby-related functionalities, in the past weeks. We also added the following features:
koalas.groupby.GroupBy:
koalas.groupby.SeriesGroupBy:
koalas.indexes.Index:
Along with the following improvements:
Published by HyukjinKwon about 5 years ago
We added basic multi-index support in columns (#590), as shown below. A pandas multi-index can also be mapped.
>>> import databricks.koalas as ks
>>> import numpy as np
>>>
>>> arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
... np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
>>> kdf = ks.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=arrays)
>>> kdf
bar baz foo qux
one two one two one two one two
A -1.574777 0.805108 0.139748 1.287946 -1.782297 -0.152292 0.680594 1.419407
B 0.076886 -1.560807 0.403807 -0.715029 1.236899 -0.364483 -1.548554 0.076003
C -0.575168 0.061539 -2.083615 -0.816090 -1.267440 0.745949 -1.194421 0.468818
>>> kdf['bar']
one two
A -1.574777 0.805108
B 0.076886 -1.560807
C -0.575168 0.061539
>>> kdf['bar']['two']
A 0.805108
B -1.560807
C 0.061539
Name: two, dtype: float64
In addition, we are triaging which APIs to explicitly support and which to explicitly leave unsupported (#574)(#580). Some pandas APIs will be explicitly unsupported according to the Guardrails, to prevent users from shooting themselves in the foot, and based on other justifications such as the cost of their operations.
We also added the following features:
koalas.DataFrame:
koalas.Series:
koalas.indexes.Index:
koalas.groupby.GroupBy:
Along with the following improvements:
  method and limit parameter support in DataFrame.fillna() (#565)
  .) in columns names are allowed (#490)
Published by HyukjinKwon over 5 years ago
We rapidly improved and added new functionalities in the past week. We also added the following features:
koalas.DataFrame:
koalas.Series:
Published by HyukjinKwon over 5 years ago
We rapidly improved and added new functionalities in the past week. We also added the following features:
koalas:
koalas.DataFrame:
koalas.Series:
Along with the following improvements:
  kdf.replace({0: 10, 1: 100}) (#527)
Published by HyukjinKwon over 5 years ago
We fixed a critical regression for pandas 0.23.x compatibility (#528, #529)
Now, pandas 0.23.x support is back.
Published by HyukjinKwon over 5 years ago
We added infrastructure for usage logging (#494). It allows a custom logger to handle the success and failure of each API call. Koalas also has a built-in logger, databricks.koalas.usage_logging.usage_logger, which uses Python logging.
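The general idea of routing every API call through a logger can be sketched with a plain decorator. This is only an illustration built on the standard logging module: `log_usage`, the logger name, and the wrapped `head` function are hypothetical, and the sketch does not reproduce the interface that Koalas expects from a custom usage logger.

```python
import functools
import logging
import time

logger = logging.getLogger("usage_logging_sketch")  # hypothetical logger name


def log_usage(func):
    """Log the success or failure of each call, together with its duration."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception as exc:
            logger.warning("%s failed after %.6f s: %r",
                           func.__name__, time.perf_counter() - start, exc)
            raise
        logger.info("%s succeeded in %.6f s",
                    func.__name__, time.perf_counter() - start)
        return result
    return wrapper


@log_usage
def head(rows, n=5):
    """Stand-in for an API method whose usage should be recorded."""
    return rows[:n]


first_two = head([1, 2, 3], n=2)
```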
In addition, Koalas experimentally introduced type hints for both Series and DataFrame (#453). The new type hints are used as below:
def func(...) -> ks.Series[np.float]:
...
def func(...) -> ks.DataFrame[np.float, int, str]:
...
We also added the following features:
koalas.DataFrame:
koalas.Series:
Along with the following improvements:
  nunique in koalas.groupby.GroupBy.agg (#512)
Published by ueshin over 5 years ago
We bumped the supported MLflow version to 1.0, and now we can use a URI pointing to the model. Please see the MLflow documentation for more details. Note that we no longer support older versions. (#477)
We also added the following features:
koalas:
koalas.DataFrame:
koalas.Series:
koalas.groupby.GroupBy:
Along with the following improvements:
  DataFrame constructor can now take Koalas Series. (#470)
  Series.dt property (#478)
Published by HyukjinKwon over 5 years ago
We added new functionality, improved the documentation and fixed some bugs in the past week. Also, koalas.sql has been improved (#448). Now Koalas DataFrames and some regular Python types can be used directly in SQL, for instance, as below:
>>> mydf = ks.range(10)
>>> x = range(4)
>>> ks.sql("SELECT * from {mydf} WHERE id IN {x}")
id
0 0
1 1
2 2
3 3
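One way to picture the placeholder substitution above is the following pure-Python sketch. It is a guess at the general idea, not Koalas' implementation: `render_sql` and `FakeView` are hypothetical, a DataFrame is represented only by the name of a temporary view, and only a few Python types are converted to SQL literals.

```python
import string


class FakeView:
    """Stand-in for a DataFrame that has been registered under a view name."""

    def __init__(self, view_name):
        self.view_name = view_name


def to_sql(value):
    """Convert a few regular Python types to SQL literal text."""
    if isinstance(value, (range, list, tuple)):
        return "(" + ", ".join(to_sql(v) for v in value) + ")"
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    if isinstance(value, (int, float)):
        return str(value)
    raise TypeError("Unsupported type: " + type(value).__name__)


def render_sql(query, variables):
    """Replace each {name} placeholder with a view name or a SQL literal."""
    parts = []
    for literal, name, _spec, _conversion in string.Formatter().parse(query):
        parts.append(literal)
        if name is not None:
            value = variables[name]
            if isinstance(value, FakeView):
                parts.append(value.view_name)
            else:
                parts.append(to_sql(value))
    return "".join(parts)


mydf = FakeView("tmp_view_0")  # hypothetical temporary view name
x = range(4)
sql_text = render_sql("SELECT * from {mydf} WHERE id IN {x}", {"mydf": mydf, "x": x})
print(sql_text)  # SELECT * from tmp_view_0 WHERE id IN (0, 1, 2, 3)
```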
We also added the following features:
koalas
koalas.DataFrame:
koalas.Series:
Along with the following improvements:
  numeric_only argument (#422)
Published by HyukjinKwon over 5 years ago
We refined the internal structure, improved the documentation and added new functionalities in the past week.
We also added the following features:
koalas:
koalas.DataFrame:
koalas.Series:
Published by ueshin over 5 years ago
We added basic integration with MLflow, so that models that have the pyfunc flavor (which is most of them) can be loaded as predictors. These predictors then work on both pandas and Koalas DataFrames with no code change. See the documentation example for details. (#353)
We also added the following features:
koalas.DataFrame:
koalas.Series:
Along with the following improvements:
  DataFrame.merge function now supports left_on and right_on arguments. (#381)
  DataFrame.describe function now supports percentiles argument. (#378)