Koalas: pandas API on Apache Spark
APACHE-2.0 License
Koalas 1.8.2 is a maintenance release.
Koalas is officially included in PySpark as pandas API on Spark in Apache Spark 3.2. In Apache Spark 3.2+, please use Apache Spark directly.
Although moving to pandas API on Spark is recommended, Koalas 1.8.2 still works with Spark 3.2 (#2203).
Published by xinrong-meng over 3 years ago
Koalas 1.8.1 is a maintenance release. Koalas will be officially included in PySpark in the upcoming Apache Spark 3.2. In Apache Spark 3.2+, please use Apache Spark directly.
Along with the following fixes:
Published by HyukjinKwon over 3 years ago
Koalas 1.8.0 is the last minor release because Koalas will be officially included in PySpark in the upcoming Apache Spark 3.2. In Apache Spark 3.2+, please use Apache Spark directly.
Categorical type and ExtensionDtype
We added support for pandas' categorical type (#2064, #2106).
>>> s = ks.Series(list("abbccc"), dtype="category")
>>> s
0 a
1 b
2 b
3 c
4 c
5 c
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> s.cat.categories
Index(['a', 'b', 'c'], dtype='object')
>>> s.cat.codes
0 0
1 1
2 1
3 2
4 2
5 2
dtype: int8
>>> idx = ks.CategoricalIndex(list("abbccc"))
>>> idx
CategoricalIndex(['a', 'b', 'b', 'c', 'c', 'c'],
categories=['a', 'b', 'c'], ordered=False, dtype='category')
>>> idx.codes
Int64Index([0, 1, 1, 2, 2, 2], dtype='int64')
>>> idx.categories
Index(['a', 'b', 'c'], dtype='object')
We also added support for using ExtensionDtype as a type argument to annotate return types (#2120, #2123, #2132, #2127, #2126, #2125, #2124):
def func() -> ks.Series[pd.Int32Dtype()]:
    ...
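The `ks.Series[pd.Int32Dtype()]` annotation refers to pandas' nullable Int32 extension dtype. The dtype itself is plain pandas and needs no Spark; a minimal sketch:

```python
import pandas as pd

# pd.Int32Dtype() is a pandas extension dtype: 32-bit integers with pd.NA support
s = pd.Series([1, None, 3], dtype=pd.Int32Dtype())
print(s.dtype)             # Int32
print(s.isna().tolist())   # [False, True, False]
```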
We added the following new features:
DataFrame:
- first (#2128)
- at_time (#2116)
Series:
- at_time (#2130)
- first (#2128)
- between_time (#2129)
DatetimeIndex:
- indexer_between_time (#2104)
- indexer_at_time (#2109)
- between_time (#2111)
Along with the following fixes:
Published by itholic over 3 years ago
We switched the default plotting backend from Matplotlib to Plotly (#2029, #2033). In addition, we added more Plotly methods such as DataFrame.plot.kde and Series.plot.kde (#2028).
import databricks.koalas as ks
kdf = ks.DataFrame({
'a': [1, 2, 2.5, 3, 3.5, 4, 5],
'b': [1, 2, 3, 4, 5, 6, 7],
'c': [0.5, 1, 1.5, 2, 2.5, 3, 3.5]})
kdf.plot.hist()
The plotting backend can be switched to matplotlib by setting ks.options.plotting.backend to "matplotlib".
ks.options.plotting.backend = "matplotlib"
We added more types of Index such as Int64Index, Float64Index and DatetimeIndex (#2025, #2066).
Previously, an Index instance was always returned regardless of the data type when creating an index. Now Int64Index, Float64Index or DatetimeIndex is returned depending on the data type of the index.
>>> type(ks.Index([1, 2, 3]))
<class 'databricks.koalas.indexes.numeric.Int64Index'>
>>> type(ks.Index([1.1, 2.5, 3.0]))
<class 'databricks.koalas.indexes.numeric.Float64Index'>
>>> type(ks.Index([datetime.datetime(2021, 3, 9)]))
<class 'databricks.koalas.indexes.datetimes.DatetimeIndex'>
In addition, we added many properties for DatetimeIndex such as year, month, day, hour, minute, second, etc. (#2074), and added APIs for DatetimeIndex such as round(), floor(), ceil(), normalize(), strftime(), month_name() and day_name() (#2082, #2086, #2089).
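Koalas mirrors pandas semantics for these DatetimeIndex properties and methods, so their behavior can be previewed with plain pandas (no Spark needed):

```python
import pandas as pd

idx = pd.DatetimeIndex(["2021-03-09 10:31:45", "2021-09-10 23:59:59"])
print(idx.year.tolist())                   # [2021, 2021]
print(idx.month_name().tolist())           # ['March', 'September']
print(idx.normalize())                     # times truncated to midnight
print(idx.strftime("%Y-%m-%d").tolist())   # ['2021-03-09', '2021-09-10']
```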
An Index can now be created from Series or Index objects (#2071).
>>> kser = ks.Series([1, 2, 3], name="a", index=[10, 20, 30])
>>> ks.Index(kser)
Int64Index([1, 2, 3], dtype='int64', name='a')
>>> ks.Int64Index(kser)
Int64Index([1, 2, 3], dtype='int64', name='a')
>>> ks.Float64Index(kser)
Float64Index([1.0, 2.0, 3.0], dtype='float64', name='a')
>>> kser = ks.Series([datetime(2021, 3, 1), datetime(2021, 3, 2)], index=[10, 20])
>>> ks.Index(kser)
DatetimeIndex(['2021-03-01', '2021-03-02'], dtype='datetime64[ns]', freq=None)
>>> ks.DatetimeIndex(kser)
DatetimeIndex(['2021-03-01', '2021-03-02'], dtype='datetime64[ns]', freq=None)
We added basic extension dtypes support (#2039).
>>> kdf = ks.DataFrame(
... {
... "a": [1, 2, None, 3],
... "b": [4.5, 5.2, 6.1, None],
... "c": ["A", "B", "C", None],
... "d": [False, None, True, False],
... }
... ).astype({"a": "Int32", "b": "Float64", "c": "string", "d": "boolean"})
>>> kdf
a b c d
0 1 4.5 A False
1 2 5.2 B <NA>
2 <NA> 6.1 C True
3 3 NaN <NA> False
>>> kdf.dtypes
a Int32
b float64
c string
d boolean
dtype: object
The following types are supported, depending on the installed pandas version:
Int8Dtype
Int16Dtype
Int32Dtype
Int64Dtype
BooleanDtype
StringDtype
Float32Dtype
Float64Dtype
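These are pandas' nullable ("extension") dtypes; unlike NumPy-backed columns, missing values are represented as pd.NA without changing the column's dtype. A plain-pandas illustration:

```python
import pandas as pd

b = pd.Series([True, None, False], dtype="boolean")
s = pd.Series(["A", "B", None], dtype="string")
print(b.dtype, s.dtype)        # boolean string
print(b.isna().tolist())       # [False, True, False]
# logical ops propagate pd.NA instead of coercing to object/float
print((b & True).tolist())     # [True, <NA>, False]
```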
Binary operations and type casting are supported:
>>> kdf.a + kdf.b
0 5
1 7
2 <NA>
3 <NA>
dtype: Int64
>>> kdf + kdf
a b
0 2 8
1 4 10
2 <NA> 12
3 6 <NA>
>>> kdf.a.astype('Float64')
0 1.0
1 2.0
2 <NA>
3 3.0
Name: a, dtype: Float64
We added the following new features:
koalas:
- date_range (#2081)
- read_orc (#2017)
Series:
- align (#2019)
DataFrame:
- align (#2019)
- to_orc (#2024)
Along with the following fixes:
Published by HyukjinKwon over 3 years ago
We improved plotting support by implementing pie, histogram and box plots with the Plotly plotting backend. Koalas can now plot data with Plotly via:
- DataFrame.plot.pie and Series.plot.pie (#1971)
- DataFrame.plot.hist and Series.plot.hist (#1999)
- Series.plot.box (#2007)
In addition, we optimized histogram calculation as a single pass over the DataFrame (#1997) instead of launching a separate job for each Series in the DataFrame.
The operations between Series and Index are now supported as below (#1996):
>>> kser = ks.Series([1, 2, 3, 4, 5, 6, 7])
>>> kidx = ks.Index([0, 1, 2, 3, 4, 5, 6])
>>> (kser + 1 + 10 * kidx).sort_index()
0 2
1 13
2 24
3 35
4 46
5 57
6 68
dtype: int64
>>> (kidx + 1 + 10 * kser).sort_index()
0 11
1 22
2 33
3 44
4 55
5 66
6 77
dtype: int64
Setting a Series via attribute access
We added support for setting a column via attribute assignment on a DataFrame (#1989).
>>> kdf = ks.DataFrame({'A': [1, 2, 3, None]})
>>> kdf.A = kdf.A.fillna(kdf.A.median())
>>> kdf
A
0 1.0
1 2.0
2 3.0
3 2.0
We added the following new features:
Series:
- factorize (#1972)
- sem (#1993)
DataFrame:
- insert (#1983)
- sem (#1993)
In addition, we also implemented new parameters.
Along with the following fixes:
Published by xinrong-meng almost 4 years ago
We improved Index operations support (#1944, #1955).
Here are some examples:
Before
>>> kidx = ks.Index([1, 2, 3, 4, 5])
>>> kidx + kidx
Int64Index([2, 4, 6, 8, 10], dtype='int64')
>>> kidx + kidx + kidx
Traceback (most recent call last):
...
AssertionError: args should be single DataFrame or single/multiple Series
>>> ks.Index([1, 2, 3, 4, 5]) + ks.Index([6, 7, 8, 9, 10])
Traceback (most recent call last):
...
AssertionError: args should be single DataFrame or single/multiple Series
After
>>> kidx = ks.Index([1, 2, 3, 4, 5])
>>> kidx + kidx + kidx
Int64Index([3, 6, 9, 12, 15], dtype='int64')
>>> ks.options.compute.ops_on_diff_frames = True
>>> ks.Index([1, 2, 3, 4, 5]) + ks.Index([6, 7, 8, 9, 10])
Int64Index([7, 9, 13, 11, 15], dtype='int64')
We added the following new features:
DataFrame:
- swaplevel (#1928)
- swapaxes (#1946)
- dot (#1945)
- itertuples (#1960)
Series:
- swaplevel (#1919)
- swapaxes (#1954)
Index:
- to_list (#1948)
MultiIndex:
- to_list (#1948)
GroupBy:
- tail (#1949)
- median (#1957)
Published by ueshin almost 4 years ago
We improved the type mapping between pandas and Koalas (#1870, #1903). We added more types and string expressions for specifying data types, and fixed mismatches between pandas and Koalas.
Here are some examples:
Added np.float32 and "float32" (matched to FloatType):
>>> ks.Series([10]).astype(np.float32)
0 10.0
dtype: float32
>>> ks.Series([10]).astype("float32")
0 10.0
dtype: float32
Added np.datetime64 and "datetime64[ns]" (matched to TimestampType):
>>> ks.Series(["2020-10-26"]).astype(np.datetime64)
0 2020-10-26
dtype: datetime64[ns]
>>> ks.Series(["2020-10-26"]).astype("datetime64[ns]")
0 2020-10-26
dtype: datetime64[ns]
Fixed np.int to match LongType, not IntegerType:
>>> pd.Series([100]).astype(np.int)
0    100
dtype: int64
>>> ks.Series([100]).astype(np.int)
0    100
dtype: int32  # This is fixed to `int64` now.
Fixed np.float to match DoubleType, not FloatType:
>>> pd.Series([100]).astype(np.float)
0 100.0
dtype: float64
>>> ks.Series([100]).astype(np.float)
0 100.0
dtype: float32  # This is fixed to `float64` now.
We also added a document which describes supported/unsupported pandas data types or data type mapping between pandas data types and PySpark data types. See: Type Support In Koalas.
To improve Koalas' auto-completion in various editors and avoid misuse of APIs, we added return type annotations to major Koalas objects, including DataFrame, Series, Index, GroupBy, Window objects, etc. (#1852, #1857, #1859, #1863, #1871, #1882, #1884, #1889, #1892, #1894, #1898, #1899, #1900, #1902).
The return type annotations help auto-completion libraries, such as Jedi, to infer the actual data type and provide proper suggestions:
It also helps mypy enable static analysis over the method body.
We verified the behaviors of pandas 1.1.4 in Koalas.
As pandas 1.1.4 introduced a behavior change related to MultiIndex.is_monotonic (MultiIndex.is_monotonic_increasing) and MultiIndex.is_monotonic_decreasing (pandas-dev/pandas#37220), Koalas also changed the behavior (#1881).
We added the following new features:
DataFrame:
- __neg__ (#1847)
- rename_axis (#1843)
- spark.repartition (#1864)
- spark.coalesce (#1873)
- spark.checkpoint (#1877)
- spark.local_checkpoint (#1878)
- reindex_like (#1880)
Series:
- rename_axis (#1843)
- compare (#1802)
- reindex_like (#1880)
Index:
- intersection (#1747)
MultiIndex:
- intersection (#1747)
Published by itholic about 4 years ago
We verified the behaviors of pandas 1.1 in Koalas. Koalas now supports pandas 1.1 officially (#1688, #1822, #1829).
We now support non-string names (#1784). Previously, names in Koalas, e.g., df.columns, df.columns.names, df.index.names, needed to be a string or a tuple of strings, but now any data type supported by Spark is allowed.
Before:
>>> kdf = ks.DataFrame([[1, 'x'], [2, 'y'], [3, 'z']])
>>> kdf.columns
Index(['0', '1'], dtype='object')
After:
>>> kdf = ks.DataFrame([[1, 'x'], [2, 'y'], [3, 'z']])
>>> kdf.columns
Int64Index([0, 1], dtype='int64')
distributed-sequence default index
Performance is improved when creating a distributed-sequence default index by avoiding the interaction between Python and the JVM (#1699).
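Conceptually, a distributed-sequence index numbers rows by giving each partition an offset equal to the total size of the partitions before it, so only the per-partition sizes need to be exchanged, not the data. A toy pure-Python sketch of that idea (not the actual Koalas internals):

```python
from itertools import accumulate

partition_sizes = [3, 4, 2]                        # rows per partition (hypothetical)
offsets = [0, *accumulate(partition_sizes)][:-1]   # [0, 3, 7]

# each partition can then number its rows independently, in parallel
index = [off + i for off, n in zip(offsets, partition_sizes) for i in range(n)]
print(index)  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```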
We made the behaviors of binary operations (+, -, *, /, //, %) between int and str columns consistent with the respective pandas behaviors (#1828).
It standardizes binary operations as follows:
- +: raise TypeError between an int column and a str column (or string literal)
- *: act as Spark SQL repeat between an int column (or int literal) and a str column; raise TypeError if a string literal is involved
- -, /, //, % (modulo): raise TypeError if a str column (or string literal) is involved
We added the following new features:
DataFrame:
- product (#1739)
- from_dict (#1778)
- pad (#1786)
- backfill (#1798)
Series:
- reindex (#1737)
- explode (#1777)
- pad (#1786)
- argmin (#1790)
- argmax (#1790)
- argsort (#1793)
- backfill (#1798)
Index:
- inferred_type (#1745)
- item (#1744)
- is_unique (#1766)
- asi8 (#1764)
- is_type_compatible (#1765)
- view (#1788)
- insert (#1804)
MultiIndex:
- inferred_type (#1745)
- item (#1744)
- is_unique (#1766)
- asi8 (#1764)
- is_type_compatible (#1765)
- from_frame (#1762)
- view (#1788)
- insert (#1804)
GroupBy:
- get_group (#1783)
Published by ueshin about 4 years ago
We added support for non-named Series (#1712). Previously, Koalas automatically named a Series "0" if no name was specified or None was set as the name, whereas pandas allows a Series without a name.
For example:
>>> ks.__version__
'1.1.0'
>>> kser = ks.Series([1, 2, 3])
>>> kser
0 1
1 2
2 3
Name: 0, dtype: int64
>>> kser.name = None
>>> kser
0 1
1 2
2 3
Name: 0, dtype: int64
Now the Series will be non-named.
>>> ks.__version__
'1.2.0'
>>> ks.Series([1, 2, 3])
0 1
1 2
2 3
dtype: int64
>>> kser = ks.Series([1, 2, 3], name="a")
>>> kser.name = None
>>> kser
0 1
1 2
2 3
dtype: int64
Previously, the "distributed-sequence" default index sometimes produced wrong values or even raised an exception. For example, the code below:
>>> from databricks import koalas as ks
>>> ks.options.compute.default_index_type = 'distributed-sequence'
>>> ks.range(10).reset_index()
failed as below:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
...
pyspark.sql.utils.PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
...
File "/.../koalas/databricks/koalas/internal.py", line 620, in offset
current_partition_offset = sums[id.iloc[0]]
KeyError: 103
We investigated and made the default index type more stable (#1701). It is now stable enough and unlikely to cause such situations.
We changed the testing infrastructure to use pandas' testing utils for exact checks (#1722). It now compares even index/column types and names, so that we can follow pandas more strictly.
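The utilities in question live in pandas.testing; a rough sketch of how an exact comparison differs from a dtype-insensitive one:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

left = pd.DataFrame({"a": [1, 2]}, index=pd.Index([10, 20], name="idx"))
right = pd.DataFrame({"a": [1.0, 2.0]}, index=pd.Index([10, 20], name="idx"))

# values match, so a dtype-insensitive check passes
assert_frame_equal(left, right, check_dtype=False)

# the exact check also compares dtypes (int64 vs float64) and fails
try:
    assert_frame_equal(left, right)
except AssertionError:
    print("dtypes differ")
```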
We added the following new features:
DataFrame:
- last_valid_index (#1705)
Series:
- product (#1677)
- last_valid_index (#1705)
GroupBy:
- cumcount (#1702)
Along with the following improvements:
- partitionBy explicitly in to_parquet
- mode and partition_cols to to_csv and to_json
- Optional
- PlotAccessor for DataFrame and Series (#1662)
Published by ueshin over 4 years ago
We added support for API extensions (#1617).
You can register your custom accessors to DataFrame, Series, and Index.
For example, in your library code:
from databricks.koalas.extensions import register_dataframe_accessor

@register_dataframe_accessor("geo")
class GeoAccessor:

    def __init__(self, koalas_obj):
        self._obj = koalas_obj
        # other constructor logic

    @property
    def center(self):
        # return the geographic center point of this DataFrame
        lat = self._obj.latitude
        lon = self._obj.longitude
        return (float(lon.mean()), float(lat.mean()))

    def plot(self):
        # plot this array's data on a map
        pass
    ...
Then, in a session:
>>> from my_ext_lib import GeoAccessor
>>> kdf = ks.DataFrame({"longitude": np.linspace(0,10),
... "latitude": np.linspace(0, 20)})
>>> kdf.geo.center
(5.0, 10.0)
>>> kdf.geo.plot()
...
See also: https://koalas.readthedocs.io/en/latest/reference/extensions.html
We introduced the plotting.backend configuration (#1639).
Plotly (>=4.8) or other libraries that pandas supports can be used as a plotting backend if they are installed in the environment.
>>> kdf = ks.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=["A", "B", "C", "D"])
>>> kdf.plot(title="Example Figure") # defaults to backend="matplotlib"
>>> fig = kdf.plot(backend="plotly", title="Example Figure", height=500, width=500)
>>> ## same as:
>>> # ks.options.plotting.backend = "plotly"
>>> # fig = kdf.plot(title="Example Figure", height=500, width=500)
>>> fig.show()
Each backend returns the figure in its own format, allowing further editing or customization if required.
>>> fig.update_layout(template="plotly_dark")
>>> fig.show()
We introduced the koalas accessor and some methods specific to Koalas (#1613, #1628).
DataFrame.apply_batch, DataFrame.transform_batch, and Series.transform_batch are deprecated and moved to the koalas accessor.
>>> kdf = ks.DataFrame({'a': [1,2,3], 'b':[4,5,6]})
>>> def pandas_plus(pdf):
... return pdf + 1 # should always return the same length as input.
...
>>> kdf.koalas.transform_batch(pandas_plus)
a b
0 2 5
1 3 6
2 4 7
>>> kdf = ks.DataFrame({'a': [1,2,3], 'b':[4,5,6]})
>>> def pandas_filter(pdf):
... return pdf[pdf.a > 1] # allow arbitrary length
...
>>> kdf.koalas.apply_batch(pandas_filter)
a b
1 2 5
2 3 6
or
>>> kdf = ks.DataFrame({'a': [1,2,3], 'b':[4,5,6]})
>>> def pandas_plus(pser):
... return pser + 1 # should always return the same length as input.
...
>>> kdf.a.koalas.transform_batch(pandas_plus)
0 2
1 3
2 4
Name: a, dtype: int64
See also: https://koalas.readthedocs.io/en/latest/user_guide/transform_apply.html
We added the following new features:
DataFrame:
- tail (#1632)
- droplevel (#1622)
Series:
- iteritems (#1603)
- items (#1603)
- tail (#1632)
- droplevel (#1630)
Published by ueshin over 4 years ago
We fixed a critical bug introduced in Koalas 1.0.0 (#1609).
If we call DataFrame.rename with the columns parameter after some operations on the DataFrame, the operations will be lost:
>>> kdf = ks.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=["A", "B", "C", "D"])
>>> kdf1 = kdf + 1
>>> kdf1
A B C D
0 2 3 4 5
1 6 7 8 9
>>> kdf1.rename(columns={"A": "aa", "B": "bb"})
aa bb C D
0 1 2 3 4
1 5 6 7 8
This should be:
>>> pdf1.rename(columns={"A": "aa", "B": "bb"})
aa bb C D
0 2 3 4 5
1 6 7 8 9
Published by HyukjinKwon over 4 years ago
We implemented many APIs and features equivalent to pandas, such as plotting, grouping, windowing, I/O, and transformation, and Koalas 1.0.0 now reaches close to 80% coverage of the pandas API.
Apache Spark 3.0 is now supported in Koalas 1.0 (#1586, #1558). Koalas does not require any change to use Spark 3.0. More than 3400 fixes landed in Spark 3.0, and Koalas shares most of those fixes in many components.
It also brings performance improvements to Koalas APIs that execute Python native functions internally via pandas UDFs, for example DataFrame.apply and DataFrame.apply_batch (#1508).
With Apache Spark 3.0, Koalas supports the latest Python 3.8 which has many significant improvements (#1587), see also Python 3.8.0 release notes.
spark accessor
The spark accessor was introduced in Koalas 1.0.0 so that Koalas users can leverage existing PySpark APIs more easily (#1530). For example, you can apply PySpark functions as below:
import databricks.koalas as ks
import pyspark.sql.functions as F
kss = ks.Series([1, 2, 3, 4])
kss.spark.apply(lambda s: F.collect_list(s))
In earlier versions, it was required to use Koalas instances as the return type hints for functions that return a pandas instance, which looks slightly awkward.
def pandas_div(pdf) -> koalas.DataFrame[float, float]:
    # pdf is a pandas DataFrame.
    return pdf[['B', 'C']] / pdf[['B', 'C']]
df = ks.DataFrame({'A': ['a', 'a', 'b'], 'B': [1, 2, 3], 'C': [4, 6, 5]})
df.groupby('A').apply(pandas_div)
In Koalas 1.0.0 with Python 3.7+, you can also use pandas instances in the return type as below:
def pandas_div(pdf) -> pandas.DataFrame[float, float]:
    return pdf[['B', 'C']] / pdf[['B', 'C']]
In addition, the new type hinting is experimentally introduced in order to allow users to specify column names in the type hints as below (#1577):
def pandas_div(pdf) -> pandas.DataFrame['B': float, 'C': float]:
    return pdf[['B', 'C']] / pdf[['B', 'C']]
See also the guide in Koalas documentation (#1584) for more details.
Previously, in-place updates happened only within each DataFrame or Series, but now the behavior follows pandas in-place updates, and the update of one side also updates the other side (#1592).
For example, the following updates kdf as well.
kdf = ks.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6]})
kser = kdf.x
kser.fillna(0, inplace=True)
kdf = ks.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6]})
kser = kdf.x
kser.loc[2] = 30
kdf = ks.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6]})
kser = kdf.x
kdf.loc[2, 'x'] = 30
If the DataFrame and Series are connected, the in-place updates update each other.
compute.ops_on_diff_frames
In Koalas 1.0.0, the restriction of compute.ops_on_diff_frames was loosened considerably (#1522, #1554). For example, the operations below can be performed without enabling compute.ops_on_diff_frames, which can be expensive due to the shuffle under the hood.
df + df + df
df['foo'] = df['bar']['baz']
df[['x', 'y']] = df[['x', 'y']].fillna(0)
We added the following new features:
DataFrame:
- __bool__ (#1526)
- explode (#1507)
- spark.apply (#1536)
- spark.schema (#1530)
- spark.print_schema (#1530)
- spark.frame (#1530)
- spark.cache (#1530)
- spark.persist (#1530)
- spark.hint (#1530)
- spark.to_table (#1530)
- spark.to_spark_io (#1530)
- spark.explain (#1530)
- spark.apply (#1530)
- mad (#1538)
- __abs__ (#1561)
Series:
- item (#1502, #1518)
- divmod (#1397)
- rdivmod (#1397)
- unstack (#1501)
- mad (#1503)
- __bool__ (#1526)
- to_markdown (#1510)
- spark.apply (#1536)
- spark.data_type (#1530)
- spark.nullable (#1530)
- spark.column (#1530)
- spark.transform (#1530)
- filter (#1511)
- __abs__ (#1561)
- bfill (#1580)
- ffill (#1580)
Index:
- __bool__ (#1526)
- spark.data_type (#1530)
- spark.column (#1530)
- spark.transform (#1530)
- get_level_values (#1517)
- delete (#1165)
- __abs__ (#1561)
- holds_integer (#1547)
MultiIndex:
- __bool__ (#1526)
- spark.data_type (#1530)
- spark.column (#1530)
- spark.transform (#1530)
- get_level_values (#1517)
- delete (#1165)
- __abs__ (#1561)
- holds_integer (#1547)
Along with the following improvements:
Published by ueshin over 4 years ago
apply and transform improvements
We added support for positional/keyword arguments for apply, apply_batch, transform, and transform_batch in DataFrame, Series, and GroupBy (#1484, #1485, #1486).
>>> ks.range(10).apply(lambda a, b, c: a + b + c, args=(1,), c=3)
id
0 4
1 5
2 6
3 7
4 8
5 9
6 10
7 11
8 12
9 13
>>> ks.range(10).transform_batch(lambda pdf, a, b, c: pdf.id + a + b + c, 1, 2, c=3)
0 6
1 7
2 8
3 9
4 10
5 11
6 12
7 13
8 14
9 15
Name: id, dtype: int64
>>> kdf = ks.DataFrame(
... {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
... columns=["a", "b", "c"])
>>> kdf.groupby(["a", "b"]).apply(lambda x, y, z: x + x.min() + y + z, 1, z=2)
a b c
0 5 5 5
1 7 5 11
2 9 7 21
3 11 9 35
4 13 13 53
5 15 19 75
We added spark_schema and print_schema to inspect the underlying Spark schema (#1446).
>>> kdf = ks.DataFrame({'a': list('abc'),
... 'b': list(range(1, 4)),
... 'c': np.arange(3, 6).astype('i1'),
... 'd': np.arange(4.0, 7.0, dtype='float64'),
... 'e': [True, False, True],
... 'f': pd.date_range('20130101', periods=3)},
... columns=['a', 'b', 'c', 'd', 'e', 'f'])
>>> # Print the schema out in Spark’s DDL formatted string
>>> kdf.spark_schema().simpleString()
'struct<a:string,b:bigint,c:tinyint,d:double,e:boolean,f:timestamp>'
>>> kdf.spark_schema(index_col='index').simpleString()
'struct<index:bigint,a:string,b:bigint,c:tinyint,d:double,e:boolean,f:timestamp>'
>>> # Print out the schema, the same as DataFrame.printSchema()
>>> kdf.print_schema()
root
|-- a: string (nullable = false)
|-- b: long (nullable = false)
|-- c: byte (nullable = false)
|-- d: double (nullable = false)
|-- e: boolean (nullable = false)
|-- f: timestamp (nullable = false)
>>> kdf.print_schema(index_col='index')
root
|-- index: long (nullable = false)
|-- a: string (nullable = false)
|-- b: long (nullable = false)
|-- c: byte (nullable = false)
|-- d: double (nullable = false)
|-- e: boolean (nullable = false)
|-- f: timestamp (nullable = false)
We fixed many bugs in GroupBy, as listed below.
We added the following new feature:
SeriesGroupBy:
- filter (#1483)
Published by HyukjinKwon over 4 years ago
Koalas documentation was redesigned with a better theme, pydata-sphinx-theme. Please check the new Koalas documentation site out.
transform_batch and apply_batch
We added APIs that enable you to directly transform and apply a function against a Koalas Series or DataFrame. map_in_pandas is deprecated and renamed to apply_batch.
import databricks.koalas as ks

kdf = ks.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

def pandas_plus(pdf):
    return pdf + 1  # should always return the same length as input.

kdf.transform_batch(pandas_plus)

import databricks.koalas as ks

kdf = ks.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

def pandas_filter(pdf):
    return pdf[pdf.a > 1]  # allow arbitrary length

kdf.apply_batch(pandas_filter)
Please also check Transform and apply a function in Koalas documentation.
We added the following new features:
DataFrame:
- truncate (#1408)
- hint (#1415)
SeriesGroupBy:
- unique (#1426)
Index:
- spark_column (#1438)
Series:
- spark_column (#1438)
MultiIndex:
- spark_column (#1438)
Published by ueshin over 4 years ago
We added PyArrow>=0.15 support back (#1110).
Note that, when working with pyarrow>=0.15 and pyspark<3.0, Koalas will set the environment variable ARROW_PRE_0_15_IPC_FORMAT=1 if it does not exist, as per the instructions in SPARK-29367. However, this will NOT work if a Spark context has already been launched; in that case, you have to manage the environment variable yourself.
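Managing the variable yourself (before any Spark context is launched) can look like the following; the setdefault call mirrors the "only if it does not exist" behavior described above:

```python
import os

# set the flag only if the user has not already set it, and do it
# before the JVM / Spark context starts so workers inherit it
os.environ.setdefault("ARROW_PRE_0_15_IPC_FORMAT", "1")
print(os.environ["ARROW_PRE_0_15_IPC_FORMAT"])
```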
We added a broadcast function in namespace.py (#1360).
We can use it with merge, join, and update, which invoke a join operation in Spark. When you know one of the DataFrames is small enough to fit in memory, broadcast joins can be much more performant than shuffle-based joins.
For example,
>>> merged = df1.merge(ks.broadcast(df2), left_on='lkey', right_on='rkey')
>>> merged.explain()
== Physical Plan ==
...
...BroadcastHashJoin...
...
We added a persist function to specify the storage level when caching (#1381), and a storage_level property to check the current storage level (#1385).
>>> with df.cache() as cached_df:
... print(cached_df.storage_level)
...
Disk Memory Deserialized 1x Replicated
>>> with df.persist(pyspark.StorageLevel.MEMORY_ONLY) as cached_df:
... print(cached_df.storage_level)
...
Memory Serialized 1x Replicated
We added the following new features:
DataFrame:
- to_markdown (#1377)
- squeeze (#1389)
Series:
- squeeze (#1389)
- asof (#1366)
Along with the following fixes:
- iloc.__setitem__ with another Series from the same DataFrame (#1388)
- loc/iloc.__setitem__ (#1391)
- __setitem__ for loc/iloc with a DataFrame (#1394)
Published by HyukjinKwon over 4 years ago
We continued to improve the loc indexer and added support for slice-based column selection (#1351).
>>> from databricks import koalas as ks
>>> df = ks.DataFrame({'a':list('abcdefghij'), 'b':list('abcdefghij'), 'c': range(10)})
>>> df.loc[:, "b":"c"]
b c
0 a 0
1 b 1
2 c 2
3 d 3
4 e 4
5 f 5
6 g 6
7 h 7
8 i 8
9 j 9
We also added support for slices as row selection in the loc indexer for multi-indexes (#1344).
>>> from databricks import koalas as ks
>>> import pandas as pd
>>> df = ks.DataFrame({'a': range(3)}, index=pd.MultiIndex.from_tuples([("a", "b"), ("a", "c"), ("b", "d")]))
>>> df.loc[("a", "c"): "b"]
a
a c 1
b d 2
We continued to improve the iloc indexer to support iterable indexes as row selection (#1338).
>>> from databricks import koalas as ks
>>> df = ks.DataFrame({'a':list('abcdefghij'), 'b':list('abcdefghij')})
>>> df.iloc[[-1, 1, 2, 3]]
a b
1 b b
2 c c
3 d d
9 j j
We also added basic support for setting values via loc and iloc on a Series (#1367).
>>> from databricks import koalas as ks
>>> kser = ks.Series([1, 2, 3], index=["cobra", "viper", "sidewinder"])
>>> kser.loc[kser % 2 == 1] = -kser
>>> kser
cobra -1
viper 2
sidewinder -3
We added the following new features:
DataFrame:
- take (#1292)
- eval (#1359)
Series:
- dot (#1136)
- take (#1357)
- combine_first (#1290)
Index:
- droplevel (#1340)
- union (#1348)
- take (#1357)
- asof (#1350)
MultiIndex:
- droplevel (#1340)
- unique (#1342)
- union (#1348)
- take (#1357)
Published by ueshin over 4 years ago
iloc
We improved the iloc indexer to support slices as row selection (#1335).
For example,
>>> kdf = ks.DataFrame({'a':list('abcdefghij')})
>>> kdf
a
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j
>>> kdf.iloc[2:5]
a
2 c
3 d
4 e
>>> kdf.iloc[2:-3:2]
a
2 c
4 e
6 g
>>> kdf.iloc[5:]
a
5 f
6 g
7 h
8 i
9 j
>>> kdf.iloc[5:2]
Empty DataFrame
Columns: [a]
Index: []
We added links to the previous talks in our documentation (#1319).
You can see a lot of useful talks from previous events, and we will keep the list updated.
https://koalas.readthedocs.io/en/latest/getting_started/videos.html
We added the following new features:
DataFrame:
- stack (#1329)
Series:
- repeat (#1328)
Index:
- difference (#1325)
- repeat (#1328)
MultiIndex:
- difference (#1325)
- repeat (#1328)
Published by HyukjinKwon over 4 years ago
We added pandas 1.0 support (#1197, #1299), and Koalas now can work with pandas 1.0.
We implemented the DataFrame.map_in_pandas API (#1276), so Koalas can run any arbitrary function that takes and returns a pandas DataFrame against a Koalas DataFrame. See the example below:
>>> import databricks.koalas as ks
>>> df = ks.DataFrame({'A': range(2000), 'B': range(2000)})
>>> def query_func(pdf):
... num = 1995
... return pdf.query('A > @num')
...
>>> df.map_in_pandas(query_func)
A B
1996 1996 1996
1997 1997 1997
1998 1998 1998
1999 1999 1999
As a development-only change, we added Black integration (#1301). Now all code style is standardized automatically by running ./dev/reformat, and the style is checked as part of ./dev/lint-python.
We added the following new features:
DataFrame:
- query (#1273)
- unstack (#1295)
Along with the following fixes:
- DataFrame.describe() to support multi-index columns (#1279)
- drop_duplicates (#1303)
Published by ueshin over 4 years ago
head ordering
Since Koalas doesn't guarantee row ordering, head could return rows from any distributed partition, and the result is not deterministic, which might confuse users.
We added a configuration compute.ordered_head (#1231); if it is set to True, Koalas performs natural ordering beforehand and the result will be the same as pandas'. The default value is False because the ordering causes a performance overhead.
>>> kdf = ks.DataFrame({'a': range(10)})
>>> pdf = kdf.to_pandas()
>>> pdf.head(3)
a
0 0
1 1
2 2
>>> kdf.head(3)
a
5 5
6 6
7 7
>>> kdf.head(3)
a
0 0
1 1
2 2
>>> ks.options.compute.ordered_head = True
>>> kdf.head(3)
a
0 0
1 1
2 2
>>> kdf.head(3)
a
0 0
1 1
2 2
We started trying to use GitHub Actions for CI. (#1254, #1265, #1264, #1267, #1269)
Along with the following fixes:
- DataFrame/Series.clip to preserve its index (#1232)
- DataFrame.sort_values when a multi-index column is used (#1238)
- fillna not to change index values (#1241)
- DataFrame.__setitem__ with tuple-named Series (#1245)
- corr to support multi-index columns (#1246)
- print() of Series to match pandas (#1250)
Published by HyukjinKwon over 4 years ago
iat indexer
We continued to improve indexers. Now, the iat indexer is supported too (#1062).
>>> df = ks.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
... columns=['A', 'B', 'C'])
>>> df
A B C
0 0 2 3
1 0 4 1
2 10 20 30
>>> df.iat[1, 2]
1
We added the following new features:
koalas.Index:
- equals (#1216)
- identical (#1215)
- is_all_dates (#1205)
- append (#1163)
- to_frame (#1187)
koalas.MultiIndex:
- equals (#1216)
- identical (#1215)
- swaplevel (#1105)
- is_all_dates (#1205)
- is_monotonic_increasing (#1183)
- is_monotonic_decreasing (#1183)
- append (#1163)
- to_frame (#1187)
koalas.DataFrameGroupBy:
- describe (#1168)
Along with the following fixes:
- DataFrame.idxmin/idxmax (#1198)