cudf | Python Ecosystem Directory

cudf - v24.04.00 Latest Release

Published by raydouglass 6 months ago

🚨 Breaking Changes

Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
Change exceptions thrown by copying APIs (#15319) @vyasr
Change strings_column_view::char_size to return int64 (#15197) @davidwendt
Upgrade to arrow-14.0.2 (#15108) @galipremsagar
Add support for pandas-2.2 in cudf (#15100) @galipremsagar
Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
Raise an error on import for unsupported GPUs. (#15053) @bdice
Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
Add future_stack to DataFrame.stack (#15015) @galipremsagar
Deprecate groupby fillna (#15000) @mroeschke
Deprecate replace with categorical columns (#14988) @mroeschke
Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
Add pandas-2.x support in cudf (#14916) @galipremsagar
Use cuco::static_set in the hash-based groupby (#14813) @PointKernel

🐛 Bug Fixes

Fix an issue with creating a series from scalar when dtype='category' (#15476) @galipremsagar
Update pre-commit-hooks to v0.0.3 (#15355) @KyleFromNVIDIA
[BUG][JNI] Trigger MemoryBuffer.onClosed after memory is freed (#15351) @abellina
Fix an issue with multiple short list rowgroups using the Parquet chunked reader. (#15342) @nvdbaranec
Avoid importing dask-expr if "query-planning" config is False (#15340) @rjzamora
Fix gtests/ERROR_TEST errors when run in Debug (#15317) @davidwendt
Fix OOB read in inflate_kernel (#15309) @vuule
Work around a cuFile error when running CSV tests with memcheck (#15293) @vuule
Fix Doxygen upload directory (#15291) @KyleFromNVIDIA
Fix Doxygen check (#15289) @KyleFromNVIDIA
Reintroduce PANDAS_GE_220 import (#15287) @wence-
Fix mean computation for the geometric distribution in the data generator (#15282) @vuule
Fix Parquet decimal64 stats (#15281) @etseidl
Make linking of nvtx3-cpp BUILD_LOCAL_INTERFACE (#15271) @KyleFromNVIDIA
Workaround compute-sanitizer memcheck bug (#15259) @davidwendt
Cleanup hostdevice_vector and add more APIs (#15252) @ttnghia
Fix number of rows in randomly generated lists columns (#15248) @vuule
Fix wrong output for collect_list/collect_set of lists column (#15243) @ttnghia
Fix testChunkedPackTwoPasses to copy from the bounce buffer (#15220) @abellina
Fix accessing .columns by an external API (#15212) @galipremsagar
[JNI] Disable testChunkedPackTwoPasses for now (#15210) @abellina
Update labeler and codeowner configs for CMake files (#15208) @PointKernel
Avoid dict normalization in __dask_tokenize__ (#15187) @rjzamora
Fix memcheck error in distinct inner join (#15164) @PointKernel
Remove unneeded script parameters in test_cpp_memcheck.sh (#15158) @davidwendt
Fix ListColumn.to_pandas() to retain list type (#15155) @galipremsagar
Avoid factorization in MultiIndex.to_pandas (#15150) @mroeschke
Fix GroupBy.get_group and GroupBy.indices (#15143) @wence-
Remove const from range_window_bounds::_extent. (#15138) @mythrocks
DataFrame.columns = ... retains RangeIndex & set dtype (#15129) @mroeschke
Correctly handle output for GroupBy.apply when chunk results are reindexed series (#15109) @brandon-b-miller
Fix Series.groupby.shift with a MultiIndex (#15098) @mroeschke
Fix reductions when DataFrame has MultiIndex columns (#15097) @mroeschke
Fix deprecation warnings for deprecated hash() calls (#15095) @davidwendt
Add support for arrow large_string in cudf (#15093) @galipremsagar
Fix sort_values pytest failure with pandas-2.x regression (#15092) @galipremsagar
Resolve path parsing issues in get_json_object (#15082) @SurajAralihalli
Fix bugs in handling of delta encodings (#15075) @etseidl
Fix is_device_write_preferred in void_sink and user_sink_wrapper (#15064) @vuule
Eliminate duplicate allocation of nested string columns (#15061) @vuule
Raise an error on import for unsupported GPUs. (#15053) @bdice
Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
Fix Index.difference to handle duplicate values when one of the inputs is empty (#15016) @galipremsagar
Add future_stack to DataFrame.stack (#15015) @galipremsagar
Fix handling of values=None in pylibcudf GroupBy.get_groups (#14998) @shwina
Fix DataFrame.sort_index to respect ignore_index on all axes (#14995) @galipremsagar
Raise for pyarrow array that is tz-aware (#14980) @mroeschke
Direct SeriesGroupBy.aggregate to SeriesGroupBy.agg (#14971) @rjzamora
Respect IntervalDtype and CategoricalDtype objects passed by users (#14961) @mroeschke
unset CUDF_SPILL after a pytest (#14958) @galipremsagar
Fix null literals being parsed as strings when mixed_types_as_string is enabled in the JSON reader (#14939) @karthikeyann
Fix chunked reads of Parquet delta encoded pages (#14921) @etseidl
Fix reading offset for data stream in ORC reader (#14911) @ttnghia
Enable sanitizer check for a test case testORCReadAndWriteForDecimal128 (#14897) @res-life
Fix dask token normalization (#14829) @rjzamora
Fix 24.04 versions (#14825) @raydouglass
Ensure slow private attrs are maybe proxies (#14380) @mroeschke

📖 Documentation

Ignore DLManagedTensor in the docs build (#15392) @davidwendt
Revert "Temporarily disable docs errors. (#15265)" (#15269) @bdice
Temporarily disable docs errors. (#15265) @bdice
Update developer_guide.md with new guidance on quoted internal includes (#15238) @harrism
Fix broken link for developer guide (#15025) @sanjana098
[DOC] Update typo in docs example of structs_column_wrapper (#14949) @karthikeyann
Update cudf.pandas FAQ. (#14940) @bdice
Optimize doc builds (#14856) @vyasr
Add developer guideline to use east const. (#14836) @bdice
Document how cuDF is pronounced (#14753) @pentschev
Notes convert to Pandas-compat (#12641) @Touutae-lab

🚀 New Features

Address inconsistency in single quote normalization in JSON reader (#15324) @shrshi
Use JNI pinned pool resource with cuIO (#15255) @abellina
Add DELTA_BYTE_ARRAY encoder for Parquet (#15239) @etseidl
Migrate filling operations to pylibcudf (#15225) @brandon-b-miller
[JNI] rmm based pinned pool (#15219) @abellina
Implement zero-copy host buffer source instead of using an arrow implementation (#15189) @vuule
Enable creation of columns from scalar (#15181) @vyasr
Use NVTX from GitHub. (#15178) @bdice
Implement segmented_row_bit_count for computing row sizes by segments of rows (#15169) @ttnghia
Implement search using pylibcudf (#15166) @vyasr
Add distinct left join (#15149) @PointKernel
Add cardinality control for groupby benchs with flat types (#15134) @PointKernel
Add ability to request Parquet encodings on a per-column basis (#15081) @etseidl
Automate include grouping order in .clang-format (#15063) @harrism
Requesting a clean build directory also clears Jitify cache (#15052) @robertmaynard
API for JSON unquoted whitespace normalization (#15033) @shrshi
Implement concatenate, lists.explode, merge, sorting, and stream compaction in pylibcudf (#15011) @vyasr
Implement replace in pylibcudf (#15005) @vyasr
Add distinct key inner join (#14990) @PointKernel
Implement rolling in pylibcudf (#14982) @vyasr
Implement joins in pylibcudf (#14972) @vyasr
Implement scans and reductions in pylibcudf (#14970) @vyasr
Rewrite cudf internals using pylibcudf groupby (#14946) @vyasr
Implement groupby in pylibcudf (#14945) @vyasr
Support casting of Map type to string in JSON reader (#14936) @karthikeyann
POC for whitespace removal in input JSON data using FST (#14931) @shrshi
Support for LZ4 compression in ORC and Parquet (#14906) @vuule
Remove supports_streams from cuDF custom memory resources. (#14857) @harrism
Migrate unary operations to pylibcudf (#14850) @vyasr
Migrate binary operations to pylibcudf (#14821) @vyasr
Add row index and stripe size options to Python ORC chunked writer (#14785) @vuule
Support CUDA 12.2 (#14712) @jameslamb

🛠️ Improvements

Use conda env create --yes instead of --force (#15403) @bdice
Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
Change exceptions thrown by copying APIs (#15319) @vyasr
Enable branch testing for cudf.pandas (#15316) @galipremsagar
Replace black with ruff-format (#15312) @mroeschke
Fix an NPE when reading empty JSON data by adding a new API for missing information (#15307) @revans2
Address poor performance of Parquet string decoding (#15304) @etseidl
Update script input name (#15301) @AyodeAwe
Make test_read_parquet_partitioned_filtered data deterministic (#15296) @mroeschke
Add timeout for cudf.pandas pandas tests (#15284) @galipremsagar
Add upper bound to prevent usage of NumPy 2 (#15283) @bdice
Fix cudf::test::to_host return of host_vector (#15263) @davidwendt
Implement grouped product scan (#15254) @wence-
Add CUDA 12.4 to supported PTX versions (#15247) @brandon-b-miller
Implement DataFrame|Series.squeeze (#15244) @mroeschke
Roll back ipow changes due to register pressure. (#15242) @pmattione-nvidia
Remove create_chars_child_column utility (#15241) @davidwendt
Update dlpack to version 0.8 (#15237) @dantegd
Improve performance in JSON reader when mixed_types_as_string option is enabled (#15236) @shrshi
Remove row conversion code from libcudf (#15234) @ttnghia
Use variable substitution for RAPIDS version in Doxyfile (#15231) @KyleFromNVIDIA
Add ListColumns.to_pandas(arrow_type=) (#15228) @mroeschke
Treat dask-cudf CI artifacts as pure wheels (#15223) @bdice
Clean up usage of __CUDA_ARCH__ and other macros. (#15218) @bdice
DOC: use constants in performance-comparisons.ipynb (#15215) @raybellwaves
Rewrite conversion in terms of column (#15213) @vyasr
Switch pytest-xdist algo to worksteal (#15207) @galipremsagar
Deprecate strings_column_view::offsets_begin() (#15205) @davidwendt
Add get_upstream_resource method to stream_checking_resource_adaptor (#15203) @miscco
Tune up row size estimation in the data generator (#15202) @vuule
Fix offset value for generating test data in parquet_chunked_reader_test.cu (#15200) @ttnghia
Change strings_column_view::char_size to return int64 (#15197) @davidwendt
Fix includes for row_operators.cuh (#15194) @davidwendt
Generalize GHA selectors for pure Python testing (#15191) @bdice
Improvements for __cuda_array_interface__ tests (#15188) @bdice
Allow to_pandas to return pandas.ArrowDtype (#15182) @mroeschke
Ignore byte_range in read_json when the size is not smaller than the input data (#15180) @vuule
Expose new stable_sort and finish stream_compaction in pylibcudf (#15175) @wence-
[ci] update matrix filters for dask-cudf builds (#15174) @jameslamb
Change make_strings_children to return uvector (#15171) @davidwendt
Don't override to_pandas for Datelike columns (#15167) @mroeschke
Drop python-snappy from dependencies. (#15161) @bdice
Add microkernels for fixed-width and fixed-width dictionary in Parquet decode (#15159) @abellina
Make HostColumnVector.DataType accessor methods public (#15157) @jbrennan333
Java bindings for left outer distinct join (#15154) @jlowe
Forward-merge branch-24.02 to branch-24.04 (#15153) @bdice
Enable pandas pytests for cudf.pandas (#15147) @galipremsagar
Add java option to keep quotes for JSON reads (#15146) @revans2
Change cross-pandas-version testing in cudf (#15145) @galipremsagar
Use hostdevice_vector in kernel_error to avoid the pageable copy (#15140) @vuule
Clean up Columns.astype & cudf.dtype (#15125) @mroeschke
Simplify some to_pandas implementations (#15123) @mroeschke
Java: Add leak tracking for Scalar instances (#15121) @jlowe
Remove calls to strings_column_view::offsets_begin() (#15112) @davidwendt
Add support for Python 3.11, require NumPy 1.23+ (#15111) @jameslamb
Compile-time ipow computation with array lookup (#15110) @pmattione-nvidia
Upgrade to arrow-14.0.2 (#15108) @galipremsagar
Dynamically set version in RAPIDS doc builds (#15101) @jakirkham
Add support for pandas-2.2 in cudf (#15100) @galipremsagar
Update devcontainers to CUDA Toolkit 12.2 (#15099) @trxcllnt
Fix datetime binop pytest failures in pandas-2.2 (#15090) @galipremsagar
Validate types in pylibcudf Column/Table constructors (#15088) @wence-
xfail test_join_ordering_pandas_compat for pandas 2.2 (#15080) @mroeschke
Add general purpose host memory allocator reference to cuIO with a demo of pooled-pinned allocation. (#15079) @nvdbaranec
Adjust test_binops for pandas 2.2 (#15078) @mroeschke
Remove offsets_begin() call from nvtext::generate_ngrams (#15077) @davidwendt
Use offsetalator in cudf::detail::has_nonempty_null_rows (#15076) @davidwendt
Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
Fix cudf::test::to_host to handle both offset types for strings columns (#15073) @davidwendt
Add condition for test_groupby_nulls_basic in pandas 2.2 (#15072) @mroeschke
xfail tests in test_udf_masked_ops due to pandas 2.2 bug (#15071) @mroeschke
target branch-24.04 for GitHub Actions workflows (#15069) @jameslamb
Implement stable version of cudf::sort (#15066) @wence-
Fix ORC and JSON tests failures for pandas 2.2 (#15062) @mroeschke
Adjust test_joining for pandas 2.2 (#15060) @mroeschke
Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
Fix test_resample index dtype checking for pandas 2.2 (#15058) @mroeschke
Split out strings/replace.cu and rework its gtests (#15054) @davidwendt
Avoid incompatible value type setting in test_rolling for pandas 2.2 (#15050) @mroeschke
Change chained replace inplace test to COW test for pandas 2.2 (#15049) @mroeschke
Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
Avoid chained indexing in test_indexing for pandas 2.2 (#15045) @mroeschke
Avoid pandas 2.2 DeprecationWarning in test_hdf (#15044) @mroeschke
Use appropriate make_offsets_child_column for building lists columns (#15043) @davidwendt
Factor out position-offsets logic from strings split_helper utility (#15040) @davidwendt
Forward-merge branch-24.02 to branch-24.04 (#15039) @bdice
Clean up nvtx macros (#15038) @PointKernel
Add xfailures for test_applymap for pandas 2.2 (#15034) @mroeschke
Expose libcudf filter expression in read_parquet (#15028) @wence-
Adjust tests in test_dataframe.py for pandas 2.2 (#15023) @mroeschke
Adjust test_datetime_infer_format for pandas 2.2 (#15021) @mroeschke
Performance optimizations for parquet sub-rowgroup reader. (#15020) @nvdbaranec
JNI bindings for distinct_hash_join (#15019) @jlowe
Change copy_if_safe to call thrust instead of the overload function (#15018) @davidwendt
Improve performance of copy_if_else for long strings (#15017) @davidwendt
Fix is_string_dtype test for pandas 2.2 (#15012) @mroeschke
Rework cudf::strings::detail::copy_range for offsetalator (#15010) @davidwendt
Use offsetalator in cudf::get_json_object() (#15009) @davidwendt
Align integral types in ORC to specs (#15008) @vuule
Clean up detail sequence header inclusion (#15007) @PointKernel
Add groupby.apply(include_groups=) to match pandas 2.2 deprecation (#15006) @mroeschke
Use offsetalator in cudf::interleave_columns() (#15004) @davidwendt
Use offsetalator in cudf::row_bit_count() (#15003) @davidwendt
Use offsetalator in cudf::strings::wrap() (#15002) @davidwendt
Use offsetalator in cudf::strings::reverse (#15001) @davidwendt
Deprecate groupby fillna (#15000) @mroeschke
Ensure to_* IO methods respect pandas 2.2 keyword only deprecation (#14999) @mroeschke
Remove unneeded calls to create_chars_child_column utility (#14997) @davidwendt
Add environment-agnostic scripts for running ctests and pytests (#14992) @trxcllnt
Filter all DeprecationWarnings raised by ArrowTable.to_pandas() (#14989) @galipremsagar
Deprecate replace with categorical columns (#14988) @mroeschke
Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
Ensure that ctest is called with --no-tests=error. (#14983) @bdice
Deprecate non-integer periods in date_range and interval_range (#14976) @galipremsagar
Update ops-bot.yaml (#14974) @AyodeAwe
Use page statistics in Parquet reader (#14973) @etseidl
Use fused types for overloaded function signatures (#14969) @vyasr
Deprecate certain frequency strings (#14967) @galipremsagar
Update copyrights for 24.04. (#14964) @bdice
Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
Introduce GetJsonObjectOptions in getJSONObject Java API (#14956) @SurajAralihalli
JNI JSON read with DataSource and inferred schema, along with basic Java nested Schema JSON reads (#14954) @revans2
Make codecov only informational (always pass). (#14952) @bdice
Replace legacy cudf and dask_cudf imports as (d)gd (#14944) @mroeschke
Replace _is_datetime64tz/interval_dtype with isinstance (#14943) @mroeschke
Update tests for pandas 2. (#14941) @bdice
Use more public pandas APIs (#14929) @mroeschke
Replace local copyright check with pre-commit-hooks verify-copyright (#14917) @KyleFromNVIDIA
Add pandas-2.x support in cudf (#14916) @galipremsagar
Use offsetalator in nvtext::byte_pair_encoding (#14888) @davidwendt
De-DOS line-endings (#14880) @wence-
Add detail cuco_allocator (#14877) @PointKernel
Move all core types to using enum class in Cython (#14876) @vyasr
Read cudf.__version__ in Sphinx build (#14872) @KyleFromNVIDIA
Use int64 offset types for accessing code-points in nvtext::normalize (#14868) @davidwendt
Read version from VERSION file in CMake (#14867) @KyleFromNVIDIA
Update conda-cpp-post-build-checks to branch-24.04. (#14854) @bdice
Update cudf for compatibility with the latest cuco (#14849) @PointKernel
Remove deprecated strings functions (#14848) @davidwendt
Fix CI workflows for pandas-tests and add test summary. (#14847) @bdice
Use offsetalator in cudf::strings::copy_slice (#14844) @davidwendt
Fix V2 Parquet page alignment for use with zStandard compression (#14841) @etseidl
Fix calls to deprecated strings factory API in examples. (#14838) @bdice
Update pre-commit hooks (#14837) @bdice
Use rapids_cuda_set_runtime to determine cuda runtime usage by target (#14833) @vyasr
Remove get_mem_info functions from custom memory resources (#14832) @harrism
Fix debug build by splitting row_operator_tests_utilities.cu (#14826) @davidwendt
Remove -DNVBench_ENABLE_CUPTI=OFF. (#14820) @bdice
Use cuco::static_set in the hash-based groupby (#14813) @PointKernel
Branch 24.04 merge branch 24.02 (#14809) @vyasr
Branch 24.04 merge branch 24.02 (#14806) @vyasr
Introduce basic "cudf" backend for Dask Expressions (#14805) @rjzamora
Remove build_struct|list_column (#14786) @mroeschke
Use offsetalator in nvtext tokenize functions (#14783) @davidwendt
Reduce execution time of Python ORC tests (#14776) @vuule
Use offsetalator in cudf::strings::split functions (#14757) @davidwendt
Use offsetalator in cudf::strings::findall (#14745) @davidwendt
Use offsetalator in cudf::strings::url_decode (#14744) @davidwendt
Use get_offset_value utility in strings shift function (#14743) @davidwendt
Use as_column instead of full (#14698) @mroeschke
List all notable breaking changes (#13535) @galipremsagar

cudf - [NIGHTLY] v24.06.00

Published by rapids-bot[bot] 7 months ago

🚨 Breaking Changes

Remove deprecated strings offsets_begin (#15454) @davidwendt
Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
[FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
Align date_range defaults with pandas, support tz (#15139) @mroeschke

🐛 Bug Fixes

nanoarrow uses package override for proper pinned versions generation (#15515) @robertmaynard
Make improvements in pandas-test reporting (#15485) @galipremsagar
Fixed page data truncation in parquet writer under certain conditions. (#15474) @nvdbaranec
Only use data_type constructor with scale for decimal types (#15472) @wence-
Avoid "p2p" shuffle as a default when dask_cudf is imported (#15469) @rjzamora
Fix debug build errors from to_arrow_device_test.cpp (#15463) @davidwendt
Fix base_normalator::integer_sizeof_fn integer dispatch (#15457) @davidwendt
Allow consumers of static builds to find nanoarrow (#15456) @robertmaynard
Allow jit compilation when using a splayed CUDA toolkit (#15451) @robertmaynard
Test static builds in CI and fix nanoarrow configure (#15437) @vyasr
Fixes potential race in JSON parser when parsing JSON lines format and when recovering from invalid lines (#15419) @elstehle
Fix errors in chunked ORC writer when no tables were (successfully) written (#15393) @vuule
Support implicit array conversion with query-planning enabled (#15378) @rjzamora
Fix arrow-based round trip of empty dataframes (#15373) @wence-
Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
Remove boundscheck=False setting in cython files (#15362) @wence-
Patch dask-expr var logic in dask-cudf (#15347) @rjzamora
Fix logical and syntactical errors in libcudf C++ examples (#15346) @mhaseeb123
Disable dask-expr in docs builds. (#15343) @bdice
Apply the cuFile error work around to data_sink as well (#15335) @vuule

📖 Documentation

Add debug tips section to libcudf developer guide (#15329) @davidwendt

🚀 New Features

Introduce benchmark suite for JSON reader options (#15124) @shrshi
Add to_arrow_device function to cudf interop using nanoarrow (#15047) @zeroshade

🛠️ Improvements

Enable tests/scalar and test/series in cudf.pandas tests (#15486) @mroeschke
Avoid .ordered and .categories from being settable in CategoricalColumn and CategoricalDtype (#15475) @mroeschke
Ignore pandas tests for cudf.pandas that need motoserver (#15468) @mroeschke
Use cached_property for NumericColumn.nan_count instead of ._nan_count variable (#15466) @mroeschke
Add custom status check workflow (#15464) @galipremsagar
Enable tests/strings/test_api.py and tests/io/pytables in cudf.pandas tests (#15461) @mroeschke
Remove deprecated strings offsets_begin (#15454) @davidwendt
Enable tests/windows/ in cudf.pandas tests (#15444) @mroeschke
Enable tests/interchange/test_impl.py in cudf.pandas tests (#15443) @mroeschke
Enable tests/io/test_user_agent.py in cudf pandas tests (#15442) @mroeschke
Remove prior test skipping in run-pandas-tests with testing 2.2.1 (#15440) @mroeschke
Support orc and text IO with dask-expr using legacy conversion (#15439) @rjzamora
Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
Enable dask_cudf json and s3 tests with query-planning on (#15408) @rjzamora
Bump ruff and codespell pre-commit checks (#15407) @mroeschke
Enable all tests for arm arch (#15402) @galipremsagar
Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
Enable test-reporting for pandas pytests in CI (#15369) @galipremsagar
Use logical types in Parquet reader (#15365) @etseidl
Forward-merge branch-24.04 to branch-24.06 (#15349) @bdice
Use ruff pydocstyle over pydocstyle pre-commit hook (#15345) @mroeschke
Refactor stream mode setup for gtests (#15337) @davidwendt
Avoid duplicate dask-cudf testing (#15333) @rjzamora
Update udf_cpp to use rapids_cpm_cccl. (#15331) @bdice
Forward-merge branch-24.04 into branch-24.06 [skip ci] (#15330) @rapids-bot[bot]
Allow numeric_only=True for simple groupby reductions (#15326) @rjzamora
Drop CentOS 7 support. (#15323) @bdice
Rework cudf::find_and_replace_all to use gather-based make_strings_column (#15305) @davidwendt
First pass at adding testing for pylibcudf (#15300) @vyasr
[FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
Rework cudf::replace_nulls to use strings::detail::copy_if_else (#15286) @davidwendt
Large strings support in cudf::concatenate (#15195) @davidwendt
Use less _is_categorical_dtype (#15148) @mroeschke
Align date_range defaults with pandas, support tz (#15139) @mroeschke
ModuleAccelerator performance: cache the result of checking if a caller is in the denylist (#15056) @shwina
Use offsetalator in cudf::strings::replace functions (#14824) @davidwendt
Cleanup some timedelta/datetime column logic (#14715) @mroeschke
Refactor numpy array input in as_column (#14651) @mroeschke

cudf - v24.02.02

Published by raydouglass 8 months ago

🚨 Breaking Changes

Remove **kwargs from astype (#14765) @mroeschke
Remove mimesis as a testing dependency (#14723) @mroeschke
Update to Dask's shuffle_method kwarg (#14708) @pentschev
Drop Pascal GPU support. (#14630) @bdice
Update to CCCL 2.2.0. (#14576) @bdice
Expunge as_frame conversions in Column algorithms (#14491) @wence-
Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
Include writer code and writerVersion in ORC files (#14458) @vuule
Remove null mask for zero nulls in json readers (#14451) @karthikeyann
REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
Move chars column to parent data buffer in strings column (#14202) @karthikeyann
Switch to scikit-build-core (#13531) @vyasr

🐛 Bug Fixes

Bump to nvcomp 3.0.6. (#15128) @bdice
[HOTFIX] Unpin numba<0.58 (#15031) @raydouglass
Exclude tests from builds (#14981) @vyasr
Fix the bounce buffer size in ORC writer (#14947) @vuule
Revert sum/product aggregation to always produce int64_t type (#14907) @SurajAralihalli
Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
Fix index difference to follow the pandas format (#14789) @amiralimi
Fix shared-workflows repo name (#14784) @raydouglass
Remove unparseable attributes from all nodes (#14780) @vyasr
Refactor and add validation to IntervalIndex.__init__ (#14778) @mroeschke
Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
Fix calls to deprecated strings factory API (#14771) @davidwendt
Fix ptx file discovery in editable installs (#14767) @vyasr
Revise shuffle deprecation to align with dask/dask (#14762) @rjzamora
Enable intermediate proxies to be picklable (#14752) @shwina
Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
Fix CMake args (#14746) @vyasr
Fix logic bug introduced in #14730 (#14742) @wence-
[Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
Fix Groupby.get_group (#14728) @rjzamora
Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
Split cuda versions for notebook testing (#14722) @raydouglass
Fix to_numeric not preserving Series index and name (#14718) @mroeschke
Update dask-cudf wheel name (#14713) @raydouglass
Fix strings::contains matching end of string target (#14711) @davidwendt
Update to Dask's shuffle_method kwarg (#14708) @pentschev
Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
Potential fix for performance regression in #14415 (#14706) @etseidl
Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
Skip numba test that fails on ARM (#14702) @brandon-b-miller
Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
Add BaseOffset as a final proxy type to pass instancechecks for offsets against BaseOffset (#14678) @shwina
Add row conversion code from spark-rapids-jni (#14664) @ttnghia
Unconditionally export the CCCL path (#14656) @vyasr
Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
Fix invalid memory access in Parquet reader (#14637) @etseidl
Use column_empty over as_column([]) (#14632) @mroeschke
Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
Remove cuda::proclaim_return_type from nested lambda (#14607) @ttnghia
Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
Address potential race conditions in Parquet reader (#14602) @etseidl
Fix DataFrame.reindex removing column name (#14601) @mroeschke
Remove unsanitized input test data from copy gtests (#14600) @davidwendt
Fix race detected in Parquet writer (#14598) @etseidl
Correct invalid or missing return types (#14587) @robertmaynard
Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
Fix unsanitized nulls produced by cudf::clamp APIs (#14580) @davidwendt
Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
Fixes a symbol group lookup table issue (#14561) @elstehle
Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
Improve memory footprint of isin by using contains (#14478) @wence-
Move creation of env.yaml outside the current directory (#14476) @davidwendt
Enable pd.Timestamp objects to be picklable when cudf.pandas is active (#14474) @shwina
Correct dtype of count aggregations on empty dataframes (#14473) @wence-
Avoid DataFrame conversion in MultiIndex.from_pandas (#14470) @mroeschke
JSON writer: avoid default stream use in string_scalar constructors (#14444) @vuule
Fix default stream use in the CSV reader (#14443) @vuule
Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
Defer PTX file load to runtime (#13690) @brandon-b-miller

📖 Documentation

Disable parallel build (#14796) @vyasr
Add pylibcudf to the docs (#14791) @vyasr
Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
More doxygen fixes (#14639) @vyasr
Enable doxygen XML generation and fix issues (#14477) @vyasr
Some doxygen improvements (#14469) @vyasr
Remove warning in dask-cudf docs (#14454) @wence-
Update README links with redirects. (#14378) @bdice
Add pip install instructions to README (#13677) @shwina

🚀 New Features

Add ci check for external kernels (#14768) @robertmaynard
JSON single quote normalization API (#14729) @shrshi
Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
Don't constrain numba<0.58 (#14616) @brandon-b-miller
Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
JSON quote normalization (#14545) @shrshi
Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
Implement more copying APIs in pylibcudf (#14508) @vyasr
Include writer code and writerVersion in ORC files (#14458) @vuule
Parquet sub-rowgroup reading. (#14360) @nvdbaranec
Move chars column to parent data buffer in strings column (#14202) @karthikeyann
PARQUET-2261 Size Statistics (#14000) @etseidl
Improve GroupBy JIT error handling (#13854) @brandon-b-miller
Generate unified Python/C++ docs (#13846) @vyasr
Expand JIT groupby test suite (#13813) @brandon-b-miller

🛠️ Improvements

Pin pytest<8 (#14920) @galipremsagar
Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
Clean up TimedeltaIndex.__init__ constructor (#14775) @mroeschke
Clean up DatetimeIndex.__init__ constructor (#14774) @mroeschke
Some frame.py typing, move seldom used methods in frame.py (#14766) @mroeschke
Remove **kwargs from astype (#14765) @mroeschke
fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
Add pynvjitlink as a dependency (#14763) @brandon-b-miller
Resolve degenerate performance in create_structs_data (#14761) @SurajAralihalli
Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
Pin pytest-cases<3.8.2 (#14756) @mroeschke
Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
Consolidate cudf object handling in as_column (#14754) @mroeschke
Reduce execution time of Parquet C++ tests (#14750) @vuule
Implement to_datetime(..., utc=True) (#14749) @mroeschke
Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
Implement cudf.MultiIndex.from_arrays (#14740) @mroeschke
Remove unused/single use methods (#14739) @mroeschke
refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
Remove unneeded methods in Column (#14730) @mroeschke
Clean up base column methods (#14725) @mroeschke
Ensure column.fillna signatures are consistent (#14724) @mroeschke
Remove mimesis as a testing dependency (#14723) @mroeschke
Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
Use offsetalator in gather_chars (#14700) @davidwendt
Use make_strings_children for fill() specialization logic (#14697) @davidwendt
Change io::detail::orc namespace into io::orc::detail (#14696) @ttnghia
Fix call to deprecated factory function (#14695) @davidwendt
Use as_column instead of arange for range like inputs (#14689) @mroeschke
Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
Split parquet test into multiple files (#14663) @etseidl
Custom error messages for IO with nonexistent files (#14662) @vuule
Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
Basic validation in reader benchmarks (#14647) @vuule
Update dependencies.yaml to support CUDA 12.*. (#14644) @bdice
Consolidate memoryview handling in as_column (#14643) @mroeschke
Convert FieldType to scoped enum (#14642) @vuule
Use isinstance over is_foo_dtype (#14641) @mroeschke
Use isinstance over is_foo_dtype internally (#14638) @mroeschke
Remove unnecessary **kwargs in function signatures (#14635) @mroeschke
Drop nvbench patch for nvml. (#14631) @bdice
Drop Pascal GPU support. (#14630) @bdice
Add cpp/doxygen/xml to .gitignore (#14613) @davidwendt
Create strings-specific make_offsets_child_column for multiple offset types (#14612) @davidwendt
Use the offsetalator in cudf::concatenate for strings (#14611) @davidwendt
Make Parquet ColumnIndex null_counts optional (#14596) @etseidl
Support freq in DatetimeIndex (#14593) @shwina
Remove legacy benchmarks for cuDF-python (#14591) @osidekyle
Remove WORKSPACE env var from cudf_test temp_directory class (#14588) @davidwendt
Use exceptions instead of return values to handle errors in CompactProtocolReader (#14582) @vuule
Use cuda::proclaim_return_type on device lambdas. (#14577) @bdice
Update to CCCL 2.2.0. (#14576) @bdice
Update dependencies.yaml to new pip index (#14575) @vyasr
Simplify Python CMake (#14565) @vyasr
Java expose parquet pass_read_limit (#14564) @revans2
Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_* macros (#14559) @SurajAralihalli
Use cudf_test temp_directory class for nvtext::subword_tokenize gbenchmark (#14558) @davidwendt
Fix return type of prefix increment overloads (#14544) @vuule
Make bpe_merge_pairs_impl member private (#14543) @davidwendt
Small clean up in io::statistics (#14542) @vuule
Change json gtest environment variable to compile-time definition (#14541) @davidwendt
Remove extra total chars size calculation from cudf::concatenate (#14540) @davidwendt
Refactor IndexedFrame.hash_values to use cudf::hashing functions, add xxhash64 to cudf Python. (#14538) @bdice
Move non-templated inline function definitions from table_view.hpp to table_view.cpp (#14535) @davidwendt
Add JNI for strings::code_points (#14533) @thirtiseven
Add a test for issue 12773 (#14529) @vyasr
Split libarrow build dependencies. (#14506) @bdice
Implement IndexedFrame.duplicated with distinct_indices + scatter (#14493) @wence-
Expunge as_frame conversions in Column algorithms (#14491) @wence-
Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
Refactor Parquet kernel_error (#14464) @etseidl
Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
Expose stream parameter in public nvtext APIs (#14456) @davidwendt
Include encode type in the error message when unsupported Parquet encoding is detected (#14453) @ZelboK
Remove null mask for zero nulls in json readers (#14451) @karthikeyann
Refactor cudf.Series.__init__ (#14450) @mroeschke
Remove the use of volatile in Parquet (#14448) @vuule
REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
Testing stream pool implementation (#14437) @shrshi
Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
Use isinstance(..., cudf.IntervalDtype) instead of is_interval_dtype (#14424) @mroeschke
Use isinstance(..., cudf.CategoricalDtype) instead of is_categorical_dtype (#14423) @mroeschke
Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
REF: Remove instances of pd.core (#14421) @mroeschke
Expose streams in public filling APIs for label_bins (#14401) @ZelboK
Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded (#14392) @etseidl
Add SHA-1 and SHA-2 hash functions. (#14391) @bdice
Expose streams in Parquet reader and writer APIs (#14359) @shrshi
Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
Replace default stream for scalars and column factories usages (because of defaulted arguments) (#14354) @karthikeyann
Expose streams in ORC reader and writer APIs (#14350) @shrshi
Convert compression and io to string axis type in IO benchmarks (#14347) @SurajAralihalli
Add cuDF devcontainers (#14015) @trxcllnt
Refactoring of Buffers (last step towards unifying COW and Spilling) (#13801) @madsbk
Switch to scikit-build-core (#13531) @vyasr
Simplify null count checking in column equality comparator (#13312) @vyasr

cudf - v24.02.01

Published by raydouglass 8 months ago

🚨 Breaking Changes

Remove **kwargs from astype (#14765) @mroeschke
Remove mimesis as a testing dependency (#14723) @mroeschke
Update to Dask's shuffle_method kwarg (#14708) @pentschev
Drop Pascal GPU support. (#14630) @bdice
Update to CCCL 2.2.0. (#14576) @bdice
Expunge as_frame conversions in Column algorithms (#14491) @wence-
Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
Include writer code and writerVersion in ORC files (#14458) @vuule
Remove null mask for zero nulls in json readers (#14451) @karthikeyann
REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
Move chars column to parent data buffer in strings column (#14202) @karthikeyann
Switch to scikit-build-core (#13531) @vyasr

🐛 Bug Fixes

[HOTFIX] Unpin numba<0.58 (#15031) @raydouglass
Exclude tests from builds (#14981) @vyasr
Fix the bounce buffer size in ORC writer (#14947) @vuule
Revert sum/product aggregation to always produce int64_t type (#14907) @SurajAralihalli
Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
Fix index difference to follow the pandas format (#14789) @amiralimi
Fix shared-workflows repo name (#14784) @raydouglass
Remove unparseable attributes from all nodes (#14780) @vyasr
Refactor and add validation to IntervalIndex.__init__ (#14778) @mroeschke
Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
Fix calls to deprecated strings factory API (#14771) @davidwendt
Fix ptx file discovery in editable installs (#14767) @vyasr
Revise shuffle deprecation to align with dask/dask (#14762) @rjzamora
Enable intermediate proxies to be picklable (#14752) @shwina
Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
Fix CMake args (#14746) @vyasr
Fix logic bug introduced in #14730 (#14742) @wence-
[Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
Fix Groupby.get_group (#14728) @rjzamora
Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
Split cuda versions for notebook testing (#14722) @raydouglass
Fix to_numeric not preserving Series index and name (#14718) @mroeschke
Update dask-cudf wheel name (#14713) @raydouglass
Fix strings::contains matching end of string target (#14711) @davidwendt
Update to Dask's shuffle_method kwarg (#14708) @pentschev
Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
Potential fix for performance regression in #14415 (#14706) @etseidl
Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
Skip numba test that fails on ARM (#14702) @brandon-b-miller
Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
Add BaseOffset as a final proxy type to pass instancechecks for offsets against BaseOffset (#14678) @shwina
Add row conversion code from spark-rapids-jni (#14664) @ttnghia
Unconditionally export the CCCL path (#14656) @vyasr
Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
Fix invalid memory access in Parquet reader (#14637) @etseidl
Use column_empty over as_column([]) (#14632) @mroeschke
Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
Remove cuda::proclaim_return_type from nested lambda (#14607) @ttnghia
Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
Address potential race conditions in Parquet reader (#14602) @etseidl
Fix DataFrame.reindex removing column name (#14601) @mroeschke
Remove unsanitized input test data from copy gtests (#14600) @davidwendt
Fix race detected in Parquet writer (#14598) @etseidl
Correct invalid or missing return types (#14587) @robertmaynard
Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
Fix unsanitized nulls produced by cudf::clamp APIs (#14580) @davidwendt
Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
Fixes a symbol group lookup table issue (#14561) @elstehle
Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
Improve memory footprint of isin by using contains (#14478) @wence-
Move creation of env.yaml outside the current directory (#14476) @davidwendt
Enable pd.Timestamp objects to be picklable when cudf.pandas is active (#14474) @shwina
Correct dtype of count aggregations on empty dataframes (#14473) @wence-
Avoid DataFrame conversion in MultiIndex.from_pandas (#14470) @mroeschke
JSON writer: avoid default stream use in string_scalar constructors (#14444) @vuule
Fix default stream use in the CSV reader (#14443) @vuule
Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
Defer PTX file load to runtime (#13690) @brandon-b-miller

📖 Documentation

Disable parallel build (#14796) @vyasr
Add pylibcudf to the docs (#14791) @vyasr
Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
More doxygen fixes (#14639) @vyasr
Enable doxygen XML generation and fix issues (#14477) @vyasr
Some doxygen improvements (#14469) @vyasr
Remove warning in dask-cudf docs (#14454) @wence-
Update README links with redirects. (#14378) @bdice
Add pip install instructions to README (#13677) @shwina

🚀 New Features

Add ci check for external kernels (#14768) @robertmaynard
JSON single quote normalization API (#14729) @shrshi
Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
Don't constrain numba<0.58 (#14616) @brandon-b-miller
Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
JSON quote normalization (#14545) @shrshi
Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
Implement more copying APIs in pylibcudf (#14508) @vyasr
Include writer code and writerVersion in ORC files (#14458) @vuule
Parquet sub-rowgroup reading. (#14360) @nvdbaranec
Move chars column to parent data buffer in strings column (#14202) @karthikeyann
PARQUET-2261 Size Statistics (#14000) @etseidl
Improve GroupBy JIT error handling (#13854) @brandon-b-miller
Generate unified Python/C++ docs (#13846) @vyasr
Expand JIT groupby test suite (#13813) @brandon-b-miller

🛠️ Improvements

Pin pytest<8 (#14920) @galipremsagar
Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
Clean up TimedeltaIndex.__init__ constructor (#14775) @mroeschke
Clean up DatetimeIndex.__init__ constructor (#14774) @mroeschke
Some frame.py typing, move seldom used methods in frame.py (#14766) @mroeschke
Remove **kwargs from astype (#14765) @mroeschke
fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
Add pynvjitlink as a dependency (#14763) @brandon-b-miller
Resolve degenerate performance in create_structs_data (#14761) @SurajAralihalli
Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
Pin pytest-cases<3.8.2 (#14756) @mroeschke
Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
Consolidate cudf object handling in as_column (#14754) @mroeschke
Reduce execution time of Parquet C++ tests (#14750) @vuule
Implement to_datetime(..., utc=True) (#14749) @mroeschke
Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
Implement cudf.MultiIndex.from_arrays (#14740) @mroeschke
Remove unused/single use methods (#14739) @mroeschke
refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
Remove unneeded methods in Column (#14730) @mroeschke
Clean up base column methods (#14725) @mroeschke
Ensure column.fillna signatures are consistent (#14724) @mroeschke
Remove mimesis as a testing dependency (#14723) @mroeschke
Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
Use offsetalator in gather_chars (#14700) @davidwendt
Use make_strings_children for fill() specialization logic (#14697) @davidwendt
Change io::detail::orc namespace into io::orc::detail (#14696) @ttnghia
Fix call to deprecated factory function (#14695) @davidwendt
Use as_column instead of arange for range like inputs (#14689) @mroeschke
Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
Split parquet test into multiple files (#14663) @etseidl
Custom error messages for IO with nonexistent files (#14662) @vuule
Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
Basic validation in reader benchmarks (#14647) @vuule
Update dependencies.yaml to support CUDA 12.*. (#14644) @bdice
Consolidate memoryview handling in as_column (#14643) @mroeschke
Convert FieldType to scoped enum (#14642) @vuule
Use isinstance over is_foo_dtype (#14641) @mroeschke
Use isinstance over is_foo_dtype internally (#14638) @mroeschke
Remove unnecessary **kwargs in function signatures (#14635) @mroeschke
Drop nvbench patch for nvml. (#14631) @bdice
Drop Pascal GPU support. (#14630) @bdice
Add cpp/doxygen/xml to .gitignore (#14613) @davidwendt
Create strings-specific make_offsets_child_column for multiple offset types (#14612) @davidwendt
Use the offsetalator in cudf::concatenate for strings (#14611) @davidwendt
Make Parquet ColumnIndex null_counts optional (#14596) @etseidl
Support freq in DatetimeIndex (#14593) @shwina
Remove legacy benchmarks for cuDF-python (#14591) @osidekyle
Remove WORKSPACE env var from cudf_test temp_directory class (#14588) @davidwendt
Use exceptions instead of return values to handle errors in CompactProtocolReader (#14582) @vuule
Use cuda::proclaim_return_type on device lambdas. (#14577) @bdice
Update to CCCL 2.2.0. (#14576) @bdice
Update dependencies.yaml to new pip index (#14575) @vyasr
Simplify Python CMake (#14565) @vyasr
Java expose parquet pass_read_limit (#14564) @revans2
Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_* macros (#14559) @SurajAralihalli
Use cudf_test temp_directory class for nvtext::subword_tokenize gbenchmark (#14558) @davidwendt
Fix return type of prefix increment overloads (#14544) @vuule
Make bpe_merge_pairs_impl member private (#14543) @davidwendt
Small clean up in io::statistics (#14542) @vuule
Change json gtest environment variable to compile-time definition (#14541) @davidwendt
Remove extra total chars size calculation from cudf::concatenate (#14540) @davidwendt
Refactor IndexedFrame.hash_values to use cudf::hashing functions, add xxhash64 to cudf Python. (#14538) @bdice
Move non-templated inline function definitions from table_view.hpp to table_view.cpp (#14535) @davidwendt
Add JNI for strings::code_points (#14533) @thirtiseven
Add a test for issue 12773 (#14529) @vyasr
Split libarrow build dependencies. (#14506) @bdice
Implement IndexedFrame.duplicated with distinct_indices + scatter (#14493) @wence-
Expunge as_frame conversions in Column algorithms (#14491) @wence-
Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
Refactor Parquet kernel_error (#14464) @etseidl
Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
Expose stream parameter in public nvtext APIs (#14456) @davidwendt
Include encode type in the error message when unsupported Parquet encoding is detected (#14453) @ZelboK
Remove null mask for zero nulls in json readers (#14451) @karthikeyann
Refactor cudf.Series.__init__ (#14450) @mroeschke
Remove the use of volatile in Parquet (#14448) @vuule
REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
Testing stream pool implementation (#14437) @shrshi
Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
Use isinstance(..., cudf.IntervalDtype) instead of is_interval_dtype (#14424) @mroeschke
Use isinstance(..., cudf.CategoricalDtype) instead of is_categorical_dtype (#14423) @mroeschke
Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
REF: Remove instances of pd.core (#14421) @mroeschke
Expose streams in public filling APIs for label_bins (#14401) @ZelboK
Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded (#14392) @etseidl
Add SHA-1 and SHA-2 hash functions. (#14391) @bdice
Expose streams in Parquet reader and writer APIs (#14359) @shrshi
Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
Replace default stream for scalars and column factories usages (because of defaulted arguments) (#14354) @karthikeyann
Expose streams in ORC reader and writer APIs (#14350) @shrshi
Convert compression and io to string axis type in IO benchmarks (#14347) @SurajAralihalli
Add cuDF devcontainers (#14015) @trxcllnt
Refactoring of Buffers (last step towards unifying COW and Spilling) (#13801) @madsbk
Switch to scikit-build-core (#13531) @vyasr
Simplify null count checking in column equality comparator (#13312) @vyasr

cudf - v24.02.00

Published by raydouglass 8 months ago

🚨 Breaking Changes

Remove **kwargs from astype (#14765) @mroeschke
Remove mimesis as a testing dependency (#14723) @mroeschke
Update to Dask's shuffle_method kwarg (#14708) @pentschev
Drop Pascal GPU support. (#14630) @bdice
Update to CCCL 2.2.0. (#14576) @bdice
Expunge as_frame conversions in Column algorithms (#14491) @wence-
Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
Include writer code and writerVersion in ORC files (#14458) @vuule
Remove null mask for zero nulls in json readers (#14451) @karthikeyann
REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
Move chars column to parent data buffer in strings column (#14202) @karthikeyann
Switch to scikit-build-core (#13531) @vyasr

🐛 Bug Fixes

Exclude tests from builds (#14981) @vyasr
Fix the bounce buffer size in ORC writer (#14947) @vuule
Revert sum/product aggregation to always produce int64_t type (#14907) @SurajAralihalli
Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
Fix index difference to follow the pandas format (#14789) @amiralimi
Fix shared-workflows repo name (#14784) @raydouglass
Remove unparseable attributes from all nodes (#14780) @vyasr
Refactor and add validation to IntervalIndex.__init__ (#14778) @mroeschke
Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
Fix calls to deprecated strings factory API (#14771) @davidwendt
Fix ptx file discovery in editable installs (#14767) @vyasr
Revise shuffle deprecation to align with dask/dask (#14762) @rjzamora
Enable intermediate proxies to be picklable (#14752) @shwina
Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
Fix CMake args (#14746) @vyasr
Fix logic bug introduced in #14730 (#14742) @wence-
[Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
Fix Groupby.get_group (#14728) @rjzamora
Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
Split cuda versions for notebook testing (#14722) @raydouglass
Fix to_numeric not preserving Series index and name (#14718) @mroeschke
Update dask-cudf wheel name (#14713) @raydouglass
Fix strings::contains matching end of string target (#14711) @davidwendt
Update to Dask's shuffle_method kwarg (#14708) @pentschev
Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
Potential fix for performance regression in #14415 (#14706) @etseidl
Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
Skip numba test that fails on ARM (#14702) @brandon-b-miller
Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
Add BaseOffset as a final proxy type to pass instancechecks for offsets against BaseOffset (#14678) @shwina
Add row conversion code from spark-rapids-jni (#14664) @ttnghia
Unconditionally export the CCCL path (#14656) @vyasr
Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
Fix invalid memory access in Parquet reader (#14637) @etseidl
Use column_empty over as_column([]) (#14632) @mroeschke
Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
Remove cuda::proclaim_return_type from nested lambda (#14607) @ttnghia
Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
Address potential race conditions in Parquet reader (#14602) @etseidl
Fix DataFrame.reindex removing column name (#14601) @mroeschke
Remove unsanitized input test data from copy gtests (#14600) @davidwendt
Fix race detected in Parquet writer (#14598) @etseidl
Correct invalid or missing return types (#14587) @robertmaynard
Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
Fix unsanitized nulls produced by cudf::clamp APIs (#14580) @davidwendt
Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
Fixes a symbol group lookup table issue (#14561) @elstehle
Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
Improve memory footprint of isin by using contains (#14478) @wence-
Move creation of env.yaml outside the current directory (#14476) @davidwendt
Enable pd.Timestamp objects to be picklable when cudf.pandas is active (#14474) @shwina
Correct dtype of count aggregations on empty dataframes (#14473) @wence-
Avoid DataFrame conversion in MultiIndex.from_pandas (#14470) @mroeschke
JSON writer: avoid default stream use in string_scalar constructors (#14444) @vuule
Fix default stream use in the CSV reader (#14443) @vuule
Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
Defer PTX file load to runtime (#13690) @brandon-b-miller

📖 Documentation

Disable parallel build (#14796) @vyasr
Add pylibcudf to the docs (#14791) @vyasr
Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
More doxygen fixes (#14639) @vyasr
Enable doxygen XML generation and fix issues (#14477) @vyasr
Some doxygen improvements (#14469) @vyasr
Remove warning in dask-cudf docs (#14454) @wence-
Update README links with redirects. (#14378) @bdice
Add pip install instructions to README (#13677) @shwina

🚀 New Features

Add ci check for external kernels (#14768) @robertmaynard
JSON single quote normalization API (#14729) @shrshi
Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
Don't constrain numba<0.58 (#14616) @brandon-b-miller
Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
JSON quote normalization (#14545) @shrshi
Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
Implement more copying APIs in pylibcudf (#14508) @vyasr
Include writer code and writerVersion in ORC files (#14458) @vuule
Parquet sub-rowgroup reading. (#14360) @nvdbaranec
Move chars column to parent data buffer in strings column (#14202) @karthikeyann
PARQUET-2261 Size Statistics (#14000) @etseidl
Improve GroupBy JIT error handling (#13854) @brandon-b-miller
Generate unified Python/C++ docs (#13846) @vyasr
Expand JIT groupby test suite (#13813) @brandon-b-miller

🛠️ Improvements

Pin pytest<8 (#14920) @galipremsagar
Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
Clean up TimedeltaIndex.__init__ constructor (#14775) @mroeschke
Clean up DatetimeIndex.__init__ constructor (#14774) @mroeschke
Some frame.py typing, move seldom used methods in frame.py (#14766) @mroeschke
Remove **kwargs from astype (#14765) @mroeschke
fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
Add pynvjitlink as a dependency (#14763) @brandon-b-miller
Resolve degenerate performance in create_structs_data (#14761) @SurajAralihalli
Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
Pin pytest-cases<3.8.2 (#14756) @mroeschke
Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
Consolidate cudf object handling in as_column (#14754) @mroeschke
Reduce execution time of Parquet C++ tests (#14750) @vuule
Implement to_datetime(..., utc=True) (#14749) @mroeschke
Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
Implement cudf.MultiIndex.from_arrays (#14740) @mroeschke
Remove unused/single use methods (#14739) @mroeschke
refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
Remove unneeded methods in Column (#14730) @mroeschke
Clean up base column methods (#14725) @mroeschke
Ensure column.fillna signatures are consistent (#14724) @mroeschke
Remove mimesis as a testing dependency (#14723) @mroeschke
Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
Use offsetalator in gather_chars (#14700) @davidwendt
Use make_strings_children for fill() specialization logic (#14697) @davidwendt
Change io::detail::orc namespace into io::orc::detail (#14696) @ttnghia
Fix call to deprecated factory function (#14695) @davidwendt
Use as_column instead of arange for range like inputs (#14689) @mroeschke
Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
Split parquet test into multiple files (#14663) @etseidl
Custom error messages for IO with nonexistent files (#14662) @vuule
Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
Basic validation in reader benchmarks (#14647) @vuule
Update dependencies.yaml to support CUDA 12.*. (#14644) @bdice
Consolidate memoryview handling in as_column (#14643) @mroeschke
Convert FieldType to scoped enum (#14642) @vuule
Use isinstance over is_foo_dtype (#14641) @mroeschke
Use isinstance over is_foo_dtype internally (#14638) @mroeschke
Remove unnecessary **kwargs in function signatures (#14635) @mroeschke
Drop nvbench patch for nvml. (#14631) @bdice
Drop Pascal GPU support. (#14630) @bdice
Add cpp/doxygen/xml to .gitignore (#14613) @davidwendt
Create strings-specific make_offsets_child_column for multiple offset types (#14612) @davidwendt
Use the offsetalator in cudf::concatenate for strings (#14611) @davidwendt
Make Parquet ColumnIndex null_counts optional (#14596) @etseidl
Support freq in DatetimeIndex (#14593) @shwina
Remove legacy benchmarks for cuDF-python (#14591) @osidekyle
Remove WORKSPACE env var from cudf_test temp_directory class (#14588) @davidwendt
Use exceptions instead of return values to handle errors in CompactProtocolReader (#14582) @vuule
Use cuda::proclaim_return_type on device lambdas. (#14577) @bdice
Update to CCCL 2.2.0. (#14576) @bdice
Update dependencies.yaml to new pip index (#14575) @vyasr
Simplify Python CMake (#14565) @vyasr
Java expose parquet pass_read_limit (#14564) @revans2
Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_* macros (#14559) @SurajAralihalli
Use cudf_test temp_directory class for nvtext::subword_tokenize gbenchmark (#14558) @davidwendt
Fix return type of prefix increment overloads (#14544) @vuule
Make bpe_merge_pairs_impl member private (#14543) @davidwendt
Small clean up in io::statistics (#14542) @vuule
Change json gtest environment variable to compile-time definition (#14541) @davidwendt
Remove extra total chars size calculation from cudf::concatenate (#14540) @davidwendt
Refactor IndexedFrame.hash_values to use cudf::hashing functions, add xxhash64 to cudf Python. (#14538) @bdice
Move non-templated inline function definitions from table_view.hpp to table_view.cpp (#14535) @davidwendt
Add JNI for strings::code_points (#14533) @thirtiseven
Add a test for issue 12773 (#14529) @vyasr
Split libarrow build dependencies. (#14506) @bdice
Implement IndexedFrame.duplicated with distinct_indices + scatter (#14493) @wence-
Expunge as_frame conversions in Column algorithms (#14491) @wence-
Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
Refactor Parquet kernel_error (#14464) @etseidl
Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
Expose stream parameter in public nvtext APIs (#14456) @davidwendt
Include encode type in the error message when unsupported Parquet encoding is detected (#14453) @ZelboK
Remove null mask for zero nulls in json readers (#14451) @karthikeyann
Refactor cudf.Series.__init__ (#14450) @mroeschke
Remove the use of volatile in Parquet (#14448) @vuule
REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
Testing stream pool implementation (#14437) @shrshi
Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
Use isinstance(..., cudf.IntervalDtype) instead of is_interval_dtype (#14424) @mroeschke
Use isinstance(..., cudf.CategoricalDtype) instead of is_categorical_dtype (#14423) @mroeschke
Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
REF: Remove instances of pd.core (#14421) @mroeschke
Expose streams in public filling APIs for label_bins (#14401) @ZelboK
Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded (#14392) @etseidl
Add SHA-1 and SHA-2 hash functions. (#14391) @bdice
Expose streams in Parquet reader and writer APIs (#14359) @shrshi
Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
Replace default stream for scalars and column factories usages (because of defaulted arguments) (#14354) @karthikeyann
Expose streams in ORC reader and writer APIs (#14350) @shrshi
Convert compression and io to string axis type in IO benchmarks (#14347) @SurajAralihalli
Add cuDF devcontainers (#14015) @trxcllnt
Refactoring of Buffers (last step towards unifying COW and Spilling) (#13801) @madsbk
Switch to scikit-build-core (#13531) @vyasr
Simplify null count checking in column equality comparator (#13312) @vyasr

cudf - [NIGHTLY] v24.04.00

Published by rapids-bot[bot] 9 months ago

🚨 Breaking Changes

Add future_stack to DataFrame.stack (#15015) @galipremsagar
Deprecate groupby fillna (#15000) @mroeschke
Deprecate replace with categorical columns (#14988) @mroeschke
Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
Add pandas-2.x support in cudf (#14916) @galipremsagar

🐛 Bug Fixes

Fix Index.difference to handle duplicate values when one of the inputs is empty (#15016) @galipremsagar
Add future_stack to DataFrame.stack (#15015) @galipremsagar
Fix handling of values=None in pylibcudf GroupBy.get_groups (#14998) @shwina
Fix DataFrame.sort_index to respect ignore_index on all axis (#14995) @galipremsagar
Raise for pyarrow array that is tz-aware (#14980) @mroeschke
Direct SeriesGroupBy.aggregate to SeriesGroupBy.agg (#14971) @rjzamora
unset CUDF_SPILL after a pytest (#14958) @galipremsagar
Fix dask token normalization (#14829) @rjzamora
Fix 24.04 versions (#14825) @raydouglass

📖 Documentation

[DOC] Update typo in docs example of structs_column_wrapper (#14949) @karthikeyann
Update cudf.pandas FAQ. (#14940) @bdice
Optimize doc builds (#14856) @vyasr
Add developer guideline to use east const. (#14836) @bdice
Notes convert to Pandas-compat (#12641) @Touutae-lab

🚀 New Features

Implement replace in pylibcudf (#15005) @vyasr
Implement rolling in pylibcudf (#14982) @vyasr
Implement joins in pylibcudf (#14972) @vyasr
Implement scans and reductions in pylibcudf (#14970) @vyasr
Rewrite cudf internals using pylibcudf groupby (#14946) @vyasr
Implement groupby in pylibcudf (#14945) @vyasr
POC for whitespace removal in input JSON data using FST (#14931) @shrshi
Remove supports_streams from cuDF custom memory resources. (#14857) @harrism
Migrate unary operations to pylibcudf (#14850) @vyasr
Migrate binary operations to pylibcudf (#14821) @vyasr
Add row index and stripe size options to Python ORC chunked writer (#14785) @vuule
Support CUDA 12.2 (#14712) @jameslamb

🛠️ Improvements

Change copy_if_safe to call thrust instead of the overload function (#15018) @davidwendt
Fix is_string_dtype test for pandas 2.2 (#15012) @mroeschke
Clean up detail sequence header inclusion (#15007) @PointKernel
Use offsetalator in cudf::strings::wrap() (#15002) @davidwendt
Use offsetalator in cudf::strings::reverse (#15001) @davidwendt
Deprecate groupby fillna (#15000) @mroeschke
Ensure to_* IO methods respect pandas 2.2 keyword only deprecation (#14999) @mroeschke
Filter all DeprecationWarnings by ArrowTable.to_pandas() (#14989) @galipremsagar
Deprecate replace with categorical columns (#14988) @mroeschke
Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
Ensure that ctest is called with --no-tests=error. (#14983) @bdice
Deprecate non-integer periods in date_range and interval_range (#14976) @galipremsagar
Use fused types for overloaded function signatures (#14969) @vyasr
Deprecate certain frequency strings (#14967) @galipremsagar
Update copyrights for 24.04. (#14964) @bdice
Introduce GetJsonObjectOptions in getJSONObject Java API (#14956) @SurajAralihalli
JNI JSON read with DataSource and inferred schema, along with basic Java nested Schema JSON reads (#14954) @revans2
Make codecov only informational (always pass). (#14952) @bdice
Replace legacy cudf and dask_cudf imports as (d)gd (#14944) @mroeschke
Replace _is_datetime64tz/interval_dtype with isinstance (#14943) @mroeschke
Update tests for pandas 2. (#14941) @bdice
Use more public pandas APIs (#14929) @mroeschke
Add pandas-2.x support in cudf (#14916) @galipremsagar
Use offsetalator in nvtext::byte_pair_encoding (#14888) @davidwendt
De-DOS line-endings (#14880) @wence-
Add detail cuco_allocator (#14877) @PointKernel
Move all core types to using enum class in Cython (#14876) @vyasr
Use int64 offset types for accessing code-points in nvtext::normalize (#14868) @davidwendt
Update conda-cpp-post-build-checks to branch-24.04. (#14854) @bdice
Remove deprecated strings functions (#14848) @davidwendt
Fix CI workflows for pandas-tests and add test summary. (#14847) @bdice
Use offsetalator in cudf::strings::copy_slice (#14844) @davidwendt
Fix V2 Parquet page alignment for use with zStandard compression (#14841) @etseidl
Fix calls to deprecated strings factory API in examples. (#14838) @bdice
Update pre-commit hooks (#14837) @bdice
Use rapids_cuda_set_runtime to determine cuda runtime usage by target (#14833) @vyasr
Remove get_mem_info functions from custom memory resources (#14832) @harrism
Fix debug build by splitting row_operator_tests_utilities.cu (#14826) @davidwendt
Remove -DNVBench_ENABLE_CUPTI=OFF. (#14820) @bdice
Branch 24.04 merge branch 24.02 (#14809) @vyasr
Branch 24.04 merge branch 24.02 (#14806) @vyasr
Use offsetalator in nvtext tokenize functions (#14783) @davidwendt
Reduce execution time of Python ORC tests (#14776) @vuule
Use offsetalator in cudf::strings::split functions (#14757) @davidwendt
Use offsetalator in cudf::strings::findall (#14745) @davidwendt
Use offsetalator in cudf::strings::url_decode (#14744) @davidwendt
Use get_offset_value utility in strings shift function (#14743) @davidwendt

cudf - v23.12.01

Published by raydouglass 10 months ago

🚨 Breaking Changes

Raise error in reindex when index is not unique (#14400) @galipremsagar
Expose stream parameter to get_json_object API (#14297) @davidwendt
Refactor cudf_kafka to use skbuild (#14292) @jdye64
Expose stream parameter in public strings convert APIs (#14255) @davidwendt
Upgrade to nvCOMP 3.0.4 (#13815) @vuule

🐛 Bug Fixes

Fix synchronization issue when writing string columns with dictionary to ORC (#14595) @vuule
Update actions/labeler to v4 (#14562) @raydouglass
Fix data corruption when skipping rows (#14557) @etseidl
Fix function name typo in cudf.pandas profiler (#14514) @galipremsagar
Fix intermediate type checking in expression parsing (#14445) @vyasr
Forward merge branch-23.10 into branch-23.12 (#14435) @raydouglass
Remove needs: wheel-build-cudf. (#14427) @bdice
Fix dask dependency in custreamz (#14420) @vyasr
Ensure nvbench initializes nvml context when built statically (#14411) @robertmaynard
Support java AST String literal with desired encoding (#14402) @winningsix
Raise error in reindex when index is not unique (#14400) @galipremsagar
Always build nvbench statically so we don't need to package it (#14399) @robertmaynard
Fix token-count logic in nvtext::tokenize_with_vocabulary (#14393) @davidwendt
Fix as_column(pd.Timestamp/Timedelta, length=) not respecting length (#14390) @mroeschke
cudf.pandas: cuDF subpath checking in module __getattr__ (#14388) @shwina
Fix and disable encoding for nanosecond statistics in ORC writer (#14367) @vuule
Add the new manylinux builds to the build job (#14351) @vyasr
cudf jit parser now supports .pragma instructions with quotes (#14348) @robertmaynard
Fix overflow check in cudf::merge (#14345) @divyegala
Add cramjam (#14344) @vyasr
Enable dask_cudf/io pytests in CI (#14338) @galipremsagar
Temporarily avoid the current build of pydata-sphinx-theme (#14332) @vyasr
Fix host buffer access from device function in the Parquet reader (#14328) @vuule
Run IO tests for Dask-cuDF (#14327) @rjzamora
Fix logical type issues in the Parquet writer (#14322) @vuule
Remove aws-sdk-pinning and revert to arrow 12.0.1 (#14319) @vyasr
test is_valid before reading column data (#14318) @etseidl
Fix gtest validity setting for TextTokenizeTest.Vocabulary (#14312) @davidwendt
Fixes stack context for json lines format that recovers from invalid JSON lines (#14309) @elstehle
Downgrade to Arrow 12.0.0 for aws-sdk-cpp and fix cudf_kafka builds for new CI containers (#14296) @vyasr
fixing thread index overflow issue (#14290) @hyperbolic2346
Fix memset error in nvtext::edit_distance_matrix (#14283) @davidwendt
Changes JSON reader's recovery option's behaviour to ignore all characters after a valid JSON record (#14279) @elstehle
Handle empty string correctly in Parquet statistics (#14257) @etseidl
Fixes behaviour for incomplete lines when recover_with_nulls is enabled (#14252) @elstehle
cudf::detail::pinned_allocator doesn't throw from deallocate (#14251) @robertmaynard
Fix strings replace for adjacent, identical multi-byte UTF-8 character targets (#14235) @davidwendt
Fix the precision when converting a decimal128 column to an arrow array (#14230) @jihoonson
Fixing parquet list of struct interpretation (#13715) @hyperbolic2346

📖 Documentation

Fix io reference in docs. (#14452) @bdice
Update README (#14374) @shwina
Example code for blog on new row comparators (#13795) @divyegala

🚀 New Features

Expose streams in public unary APIs (#14342) @vyasr
Add python tests for Parquet DELTA_BINARY_PACKED encoder (#14316) @etseidl
Update rapids-cmake functions to non-deprecated signatures (#14265) @robertmaynard
Expose streams in public null mask APIs (#14263) @vyasr
Expose streams in binaryop APIs (#14187) @vyasr
Add pylibcudf.Scalar that interoperates with Arrow scalars (#14133) @vyasr
Add decoder for DELTA_BYTE_ARRAY to Parquet reader (#14101) @etseidl
Add DELTA_BINARY_PACKED encoder for Parquet writer (#14100) @etseidl
Add BytePairEncoder class to cuDF (#13891) @davidwendt
Upgrade to nvCOMP 3.0.4 (#13815) @vuule
Use pynvjitlink for CUDA 12+ MVC (#13650) @brandon-b-miller

🛠️ Improvements

Build concurrency for nightly and merge triggers (#14441) @bdice
Cleanup remaining usages of dask dependencies (#14407) @galipremsagar
Update to Arrow 14.0.1. (#14387) @bdice
Remove Cython libcpp wrappers (#14382) @vyasr
Forward-merge branch-23.10 to branch-23.12 (#14372) @bdice
Upgrade to arrow 14 (#14371) @galipremsagar
Fix a pytest typo in test_kurt_skew_error (#14368) @galipremsagar
Use new rapids-dask-dependency metapackage for managing dask versions (#14364) @vyasr
Change nullable() to has_nulls() in cudf::detail::gather (#14363) @divyegala
Split up scan_inclusive.cu to improve its compile time (#14358) @davidwendt
Implement user_datasource_wrapper is_empty() and is_device_read_preferred(). (#14357) @tpn
Added streams to CSV reader and writer api (#14340) @shrshi
Upgrade wheels to use arrow 13 (#14339) @vyasr
Rework nvtext::byte_pair_encoding API (#14337) @davidwendt
Improve performance of nvtext::tokenize_with_vocabulary for long strings (#14336) @davidwendt
Upgrade arrow to 13 (#14330) @galipremsagar
Expose stream parameter in public nvtext replace APIs (#14329) @davidwendt
Drop pyorc dependency and use pandas/pyarrow instead (#14323) @galipremsagar
Avoid pyarrow.fs import for local storage (#14321) @rjzamora
Unpin dask and distributed for 23.12 development (#14320) @galipremsagar
Expose stream parameter in public nvtext tokenize APIs (#14317) @davidwendt
Added streams to JSON reader and writer api (#14313) @shrshi
Minor improvements in source_info (#14308) @vuule
Forward-merge branch-23.10 to branch-23.12 (#14307) @bdice
Add stream parameter to Set Operations (Public List APIs) (#14305) @SurajAralihalli
Expose stream parameter to get_json_object API (#14297) @davidwendt
Sort dictionary data alphabetically in the ORC writer (#14295) @vuule
Expose stream parameter in public strings filter APIs (#14293) @davidwendt
Refactor cudf_kafka to use skbuild (#14292) @jdye64
Update shared-action-workflows references (#14289) @AyodeAwe
Register partd encode dispatch in dask_cudf (#14287) @rjzamora
Update versioning strategy (#14285) @vyasr
Move and rename byte-pair-encoding source files (#14284) @davidwendt
Expose stream parameter in public strings combine APIs (#14281) @davidwendt
Expose stream parameter in public strings contains APIs (#14280) @davidwendt
Add stream parameter to List Sort and Filter APIs (#14272) @SurajAralihalli
Use branch-23.12 workflows. (#14271) @bdice
Refactor LogicalType for Parquet (#14264) @etseidl
Centralize chunked reading code in the parquet reader to reader_impl_chunking.cu (#14262) @nvdbaranec
Expose stream parameter in public strings replace APIs (#14261) @davidwendt
Expose stream parameter in public strings APIs (#14260) @davidwendt
Cleanup of namespaces in parquet code. (#14259) @nvdbaranec
Make parquet schema index type consistent (#14256) @hyperbolic2346
Expose stream parameter in public strings convert APIs (#14255) @davidwendt
Add in java bindings for DataSource (#14254) @revans2
Reimplement cudf::merge for nested types without using comparators (#14250) @divyegala
Add stream parameter to List Manipulation and Operations APIs (#14248) @SurajAralihalli
Expose stream parameter in public strings split/partition APIs (#14247) @davidwendt
Improve contains_column by invoking contains_table (#14238) @PointKernel
Detect and report errors in Parquet header parsing (#14237) @etseidl
Normalizing offsets iterator (#14234) @davidwendt
Forward merge 23.10 into 23.12 (#14231) @galipremsagar
Return error if BOOL8 column-type is used with integers-to-hex (#14208) @davidwendt
Enable indexalator for device code (#14206) @davidwendt
Marginally reduce memory footprint of joins (#14197) @wence-
Add nvtx annotations to spilling-based data movement (#14196) @wence-
Optimize ORC writer for decimal columns (#14190) @vuule
Remove the use of volatile in ORC (#14175) @vuule
Add bytes_per_second to distinct_count of stream_compaction nvbench. (#14172) @Blonck
Add bytes_per_second to transpose benchmark (#14170) @Blonck
cuDF: Build CUDA 12.0 ARM conda packages. (#14112) @bdice
Add bytes_per_second to shift benchmark (#13950) @Blonck
Extract debug_utilities.hpp/cu from column_utilities.hpp/cu (#13720) @ttnghia

cudf - v23.12.00

Published by raydouglass 11 months ago

🚨 Breaking Changes

Raise error in reindex when index is not unique (#14400) @galipremsagar
Expose stream parameter to get_json_object API (#14297) @davidwendt
Refactor cudf_kafka to use skbuild (#14292) @jdye64
Expose stream parameter in public strings convert APIs (#14255) @davidwendt
Upgrade to nvCOMP 3.0.4 (#13815) @vuule

🐛 Bug Fixes

Update actions/labeler to v4 (#14562) @raydouglass
Fix data corruption when skipping rows (#14557) @etseidl
Fix function name typo in cudf.pandas profiler (#14514) @galipremsagar
Fix intermediate type checking in expression parsing (#14445) @vyasr
Forward merge branch-23.10 into branch-23.12 (#14435) @raydouglass
Remove needs: wheel-build-cudf. (#14427) @bdice
Fix dask dependency in custreamz (#14420) @vyasr
Ensure nvbench initializes nvml context when built statically (#14411) @robertmaynard
Support java AST String literal with desired encoding (#14402) @winningsix
Raise error in reindex when index is not unique (#14400) @galipremsagar
Always build nvbench statically so we don't need to package it (#14399) @robertmaynard
Fix token-count logic in nvtext::tokenize_with_vocabulary (#14393) @davidwendt
Fix as_column(pd.Timestamp/Timedelta, length=) not respecting length (#14390) @mroeschke
cudf.pandas: cuDF subpath checking in module __getattr__ (#14388) @shwina
Fix and disable encoding for nanosecond statistics in ORC writer (#14367) @vuule
Add the new manylinux builds to the build job (#14351) @vyasr
cudf jit parser now supports .pragma instructions with quotes (#14348) @robertmaynard
Fix overflow check in cudf::merge (#14345) @divyegala
Add cramjam (#14344) @vyasr
Enable dask_cudf/io pytests in CI (#14338) @galipremsagar
Temporarily avoid the current build of pydata-sphinx-theme (#14332) @vyasr
Fix host buffer access from device function in the Parquet reader (#14328) @vuule
Run IO tests for Dask-cuDF (#14327) @rjzamora
Fix logical type issues in the Parquet writer (#14322) @vuule
Remove aws-sdk-pinning and revert to arrow 12.0.1 (#14319) @vyasr
test is_valid before reading column data (#14318) @etseidl
Fix gtest validity setting for TextTokenizeTest.Vocabulary (#14312) @davidwendt
Fixes stack context for json lines format that recovers from invalid JSON lines (#14309) @elstehle
Downgrade to Arrow 12.0.0 for aws-sdk-cpp and fix cudf_kafka builds for new CI containers (#14296) @vyasr
fixing thread index overflow issue (#14290) @hyperbolic2346
Fix memset error in nvtext::edit_distance_matrix (#14283) @davidwendt
Changes JSON reader's recovery option's behaviour to ignore all characters after a valid JSON record (#14279) @elstehle
Handle empty string correctly in Parquet statistics (#14257) @etseidl
Fixes behaviour for incomplete lines when recover_with_nulls is enabled (#14252) @elstehle
cudf::detail::pinned_allocator doesn't throw from deallocate (#14251) @robertmaynard
Fix strings replace for adjacent, identical multi-byte UTF-8 character targets (#14235) @davidwendt
Fix the precision when converting a decimal128 column to an arrow array (#14230) @jihoonson
Fixing parquet list of struct interpretation (#13715) @hyperbolic2346

📖 Documentation

Fix io reference in docs. (#14452) @bdice
Update README (#14374) @shwina
Example code for blog on new row comparators (#13795) @divyegala

🚀 New Features

Expose streams in public unary APIs (#14342) @vyasr
Add python tests for Parquet DELTA_BINARY_PACKED encoder (#14316) @etseidl
Update rapids-cmake functions to non-deprecated signatures (#14265) @robertmaynard
Expose streams in public null mask APIs (#14263) @vyasr
Expose streams in binaryop APIs (#14187) @vyasr
Add pylibcudf.Scalar that interoperates with Arrow scalars (#14133) @vyasr
Add decoder for DELTA_BYTE_ARRAY to Parquet reader (#14101) @etseidl
Add DELTA_BINARY_PACKED encoder for Parquet writer (#14100) @etseidl
Add BytePairEncoder class to cuDF (#13891) @davidwendt
Upgrade to nvCOMP 3.0.4 (#13815) @vuule
Use pynvjitlink for CUDA 12+ MVC (#13650) @brandon-b-miller

🛠️ Improvements

Build concurrency for nightly and merge triggers (#14441) @bdice
Cleanup remaining usages of dask dependencies (#14407) @galipremsagar
Update to Arrow 14.0.1. (#14387) @bdice
Remove Cython libcpp wrappers (#14382) @vyasr
Forward-merge branch-23.10 to branch-23.12 (#14372) @bdice
Upgrade to arrow 14 (#14371) @galipremsagar
Fix a pytest typo in test_kurt_skew_error (#14368) @galipremsagar
Use new rapids-dask-dependency metapackage for managing dask versions (#14364) @vyasr
Change nullable() to has_nulls() in cudf::detail::gather (#14363) @divyegala
Split up scan_inclusive.cu to improve its compile time (#14358) @davidwendt
Implement user_datasource_wrapper is_empty() and is_device_read_preferred(). (#14357) @tpn
Added streams to CSV reader and writer api (#14340) @shrshi
Upgrade wheels to use arrow 13 (#14339) @vyasr
Rework nvtext::byte_pair_encoding API (#14337) @davidwendt
Improve performance of nvtext::tokenize_with_vocabulary for long strings (#14336) @davidwendt
Upgrade arrow to 13 (#14330) @galipremsagar
Expose stream parameter in public nvtext replace APIs (#14329) @davidwendt
Drop pyorc dependency and use pandas/pyarrow instead (#14323) @galipremsagar
Avoid pyarrow.fs import for local storage (#14321) @rjzamora
Unpin dask and distributed for 23.12 development (#14320) @galipremsagar
Expose stream parameter in public nvtext tokenize APIs (#14317) @davidwendt
Added streams to JSON reader and writer api (#14313) @shrshi
Minor improvements in source_info (#14308) @vuule
Forward-merge branch-23.10 to branch-23.12 (#14307) @bdice
Add stream parameter to Set Operations (Public List APIs) (#14305) @SurajAralihalli
Expose stream parameter to get_json_object API (#14297) @davidwendt
Sort dictionary data alphabetically in the ORC writer (#14295) @vuule
Expose stream parameter in public strings filter APIs (#14293) @davidwendt
Refactor cudf_kafka to use skbuild (#14292) @jdye64
Update shared-action-workflows references (#14289) @AyodeAwe
Register partd encode dispatch in dask_cudf (#14287) @rjzamora
Update versioning strategy (#14285) @vyasr
Move and rename byte-pair-encoding source files (#14284) @davidwendt
Expose stream parameter in public strings combine APIs (#14281) @davidwendt
Expose stream parameter in public strings contains APIs (#14280) @davidwendt
Add stream parameter to List Sort and Filter APIs (#14272) @SurajAralihalli
Use branch-23.12 workflows. (#14271) @bdice
Refactor LogicalType for Parquet (#14264) @etseidl
Centralize chunked reading code in the parquet reader to reader_impl_chunking.cu (#14262) @nvdbaranec
Expose stream parameter in public strings replace APIs (#14261) @davidwendt
Expose stream parameter in public strings APIs (#14260) @davidwendt
Cleanup of namespaces in parquet code. (#14259) @nvdbaranec
Make parquet schema index type consistent (#14256) @hyperbolic2346
Expose stream parameter in public strings convert APIs (#14255) @davidwendt
Add in java bindings for DataSource (#14254) @revans2
Reimplement cudf::merge for nested types without using comparators (#14250) @divyegala
Add stream parameter to List Manipulation and Operations APIs (#14248) @SurajAralihalli
Expose stream parameter in public strings split/partition APIs (#14247) @davidwendt
Improve contains_column by invoking contains_table (#14238) @PointKernel
Detect and report errors in Parquet header parsing (#14237) @etseidl
Normalizing offsets iterator (#14234) @davidwendt
Forward merge 23.10 into 23.12 (#14231) @galipremsagar
Return error if BOOL8 column-type is used with integers-to-hex (#14208) @davidwendt
Enable indexalator for device code (#14206) @davidwendt
Marginally reduce memory footprint of joins (#14197) @wence-
Add nvtx annotations to spilling-based data movement (#14196) @wence-
Optimize ORC writer for decimal columns (#14190) @vuule
Remove the use of volatile in ORC (#14175) @vuule
Add bytes_per_second to distinct_count of stream_compaction nvbench. (#14172) @Blonck
Add bytes_per_second to transpose benchmark (#14170) @Blonck
cuDF: Build CUDA 12.0 ARM conda packages. (#14112) @bdice
Add bytes_per_second to shift benchmark (#13950) @Blonck
Extract debug_utilities.hpp/cu from column_utilities.hpp/cu (#13720) @ttnghia

cudf - [NIGHTLY] v24.02.00

Published by rapids-bot[bot] 11 months ago

🚨 Breaking Changes

Expunge as_frame conversions in Column algorithms (#14491) @wence-
Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
Consolidate 1D pandas object handling in as_column (#14394) @mroeschke

🐛 Bug Fixes

Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
Improve memory footprint of isin by using contains (#14478) @wence-
Enable pd.Timestamp objects to be picklable when cudf.pandas is active (#14474) @shwina
Correct dtype of count aggregations on empty dataframes (#14473) @wence-
Avoid DataFrame conversion in MultiIndex.from_pandas (#14470) @mroeschke
JSON writer: avoid default stream use in string_scalar constructors (#14444) @vuule
Fix default stream use in the CSV reader (#14443) @vuule
Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke

📖 Documentation

Some doxygen improvements (#14469) @vyasr
Remove warning in dask-cudf docs (#14454) @wence-
Update README links with redirects. (#14378) @bdice

🚀 New Features

Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov

🛠️ Improvements

Split libarrow build dependencies. (#14506) @bdice
Expunge as_frame conversions in Column algorithms (#14491) @wence-
Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
Refactor Parquet kernel_error (#14464) @etseidl
Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
Expose stream parameter in public nvtext APIs (#14456) @davidwendt
Remove the use of volatile in Parquet (#14448) @vuule
REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
REF: Remove instances of pd.core (#14421) @mroeschke
Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
Add cuDF devcontainers (#14015) @trxcllnt

cudf - v23.10.02

Published by raydouglass 11 months ago

🚨 Breaking Changes

Raise error in reindex when index is not unique (#14429) @galipremsagar
Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Create table_input_metadata from a table_metadata (#13920) @etseidl
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Update to Cython 3.0.0 (#13777) @vyasr
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

🐛 Bug Fixes

Raise error in reindex when index is not unique (#14429) @galipremsagar
Fix inaccurate ceil/floor and inaccurate rescaling casts of fixed-point values. (#14242) @bdice
Fix inaccuracy in decimal128 rounding. (#14233) @bdice
Workaround for illegal instruction error in sm90 for warp intrinsics with mask (#14201) @karthikeyann
Fix pytorch related pytest (#14198) @galipremsagar
Pin to aws-sdk-cpp<1.11 (#14173) @pentschev
Fix assert failure for range window functions (#14168) @mythrocks
Fix Memcheck error found in JSON_TEST JsonReaderTest.ErrorStrings (#14164) @karthikeyann
Fix calls to copy_bitmask to pass stream parameter (#14158) @davidwendt
Fix DataFrame from Series with different CategoricalIndexes (#14157) @mroeschke
Pin to numpy<1.25 and numba<0.58 to avoid errors and deprecation warnings-as-errors. (#14156) @bdice
Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
Fix DataFrame.values with no columns but index (#14134) @mroeschke
Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
Add support for nested dict in DataFrame constructor (#14119) @galipremsagar
Restrict iterables of DataFrames as input to DataFrame constructor (#14118) @galipremsagar
Allow numeric_only=True for reduction operations on numeric types (#14111) @galipremsagar
Preserve name of the column while initializing a DataFrame (#14110) @galipremsagar
Correct numerous 20054-D: dynamic initialization errors found on arm+12.2 (#14108) @robertmaynard
Drop kwargs from Series.count (#14106) @galipremsagar
Fix naming issues with Index.to_frame and MultiIndex.to_frame APIs (#14105) @galipremsagar
Only use memory resources that haven't been freed (#14103) @robertmaynard
Add support for __round__ in Series and DataFrame (#14099) @galipremsagar
Validate ignore_index type in drop_duplicates (#14098) @mroeschke
Fix renaming Series and Index (#14080) @galipremsagar
Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
Use conda mambabuild rather than mamba mambabuild (#14067) @wence-
Raise NotImplementedError in to_datetime with dayfirst without infer_format (#14058) @mroeschke
Fix various issues in Index.intersection (#14054) @galipremsagar
Fix Index.difference to match with pandas (#14053) @galipremsagar
Fix empty string column construction (#14052) @galipremsagar
Fix IntervalIndex.union to preserve type-metadata (#14051) @galipremsagar
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Ignore compile_commands.json (#14048) @harrism
Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
Implement sort_remaining for sort_index (#14033) @wence-
Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
Temporary fix Parquet metadata with empty value string being ignored from writing (#14026) @ttnghia
Preserve types of scalar being returned when possible in quantile (#14014) @galipremsagar
Fix return type of MultiIndex.difference (#14009) @galipremsagar
Raise an error when timezone subtypes are encountered in pd.IntervalDtype (#14006) @galipremsagar
Fix map column can not be non-nullable for java (#14003) @res-life
Fix name selection in Index.difference and Index.intersection (#13986) @galipremsagar
Restore column type metadata with dropna to fix factorize API (#13980) @galipremsagar
Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
Fix MultiIndex.to_numpy to return numpy array with tuples (#13966) @galipremsagar
Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
Fix an issue with IntervalIndex.repr when null values are present (#13958) @galipremsagar
Fix type metadata issue preservation with Column.unique (#13957) @galipremsagar
Handle Interval scalars when passed in list-like inputs to cudf.Index (#13956) @galipremsagar
Fix setting of categories order when dtype is passed to a CategoricalColumn (#13955) @galipremsagar
Handle as_index in GroupBy.apply (#13951) @brandon-b-miller
Raise error for string types in nsmallest and nlargest (#13946) @galipremsagar
Fix index of Groupby.apply results when it is performed on empty objects (#13944) @galipremsagar
Fix integer overflow in shim device_sum functions (#13943) @brandon-b-miller
Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
Fix construction of Grouping objects (#13932) @galipremsagar
Fix an issue with loc when column names is MultiIndex (#13929) @galipremsagar
Fix handling of typecasting in searchsorted (#13925) @galipremsagar
Preserve index name in reindex (#13917) @galipremsagar
Use cudf::thread_index_type in cuIO to prevent overflow in row indexing (#13910) @vuule
Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
Use cudf::thread_index_type in replace.cu. (#13905) @bdice
Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
Use thread_index_type to avoid index overflow in grid-stride loops (#13895) @PointKernel
Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
Raise error when trying to construct a DataFrame with mixed types (#13889) @galipremsagar
Return nan when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
Correctly detect the BOM mark in read_csv with compressed input (#13881) @vuule
Check for the presence of all values in MultiIndex.isin (#13879) @galipremsagar
Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
Fix return type of MultiIndex.levels (#13870) @galipremsagar
Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
Disable construction of Index when freq is set in pandas-compatibility mode (#13857) @galipremsagar
Fix an issue with fetching NA from a TimedeltaColumn (#13853) @galipremsagar
Simplify implementation of interval_range() and fix behaviour for floating freq (#13844) @shwina
Fix binary operations between Series and Index (#13842) @galipremsagar
Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
Fix read out of bounds in string concatenate (#13838) @pentschev
Raise error for more cases when timezone-aware data is passed to as_column (#13835) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
Fix cuFile I/O factories (#13829) @vuule
DataFrame with namedtuples uses ._field as column names (#13824) @mroeschke
Branch 23.10 merge 23.08 (#13822) @vyasr
Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
Raise error when mixed types are being constructed (#13816) @galipremsagar
Fix unbounded sequence issue in DataFrame constructor (#13811) @galipremsagar
Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
Update get_arrow to Arrow 12's CMake target name of arrow::xsimd (#13790) @robertmaynard
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Fix negative unary operation for boolean type (#13780) @galipremsagar
Fix contains(in) method for Series (#13779) @galipremsagar
Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
Cast only time of day to nanos to avoid an overflow in Parquet INT96 write (#13776) @gerashegalov
Preserve names of column object in various APIs (#13772) @galipremsagar
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Fix construction of DataFrames from dict when columns are provided (#13766) @wence-
Provide our own Cython declaration for make_unique (#13746) @wence-

📖 Documentation

Fix benchmark image. (#14376) @bdice
Fix typo in docstring: metadata. (#14025) @bdice
Fix typo in parquet/page_decode.cuh (#13849) @XinyuZeng
Simplify Python doc configuration (#13826) @vyasr
Update documentation to reflect recent changes in JSON reader and writer (#13791) @vuule
Fix all warnings in Python docs (#13789) @vyasr

🚀 New Features

[Java] Add JNI bindings for integers_to_hex (#14205) @razajafri
Propagate errors from Parquet reader kernels back to host (#14167) @vuule
JNI for HISTOGRAM and MERGE_HISTOGRAM aggregations (#14154) @ttnghia
Expose streams in all public sorting APIs (#14146) @vyasr
Enable direct ingestion and production of Arrow scalars (#14121) @vyasr
Implement GroupBy.value_counts to match pandas API (#14114) @stmio
Refactor parquet thrift reader (#14097) @etseidl
Refactor hash_reduce_by_row (#14095) @ttnghia
Support negative preceding/following for ROW window functions (#14093) @mythrocks
Support for progressive parquet chunked reading. (#14079) @nvdbaranec
Implement HISTOGRAM and MERGE_HISTOGRAM aggregations (#14045) @ttnghia
Expose streams in public search APIs (#14034) @vyasr
Expose streams in public replace APIs (#14010) @vyasr
Add stream parameter to public cudf::strings::split APIs (#13997) @davidwendt
Expose streams in public filling APIs (#13990) @vyasr
Expose streams in public concatenate APIs (#13987) @vyasr
Use HostMemoryAllocator in jni::allocate_host_buffer (#13975) @gerashegalov
Enable fractional null probability for hashing benchmark (#13967) @Blonck
Switch pylibcudf-enabled types to use enum class in Cython (#13931) @vyasr
Add nvtext::tokenize_with_vocabulary API (#13930) @davidwendt
Rewrite DataFrame.stack to support multi level column names (#13927) @isVoid
Add HostMemoryAllocator interface (#13924) @gerashegalov
Global stream pool (#13922) @etseidl
Create table_input_metadata from a table_metadata (#13920) @etseidl
Translate column size overflow exception to JNI (#13911) @mythrocks
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Exclude some tests from running with the compute sanitizer (#13872) @firestarman
Expand statistics support in ORC writer (#13848) @vuule
Register the memory mapped buffer in datasource to improve H2D throughput (#13814) @vuule
Add cudf::strings::find function with target per row (#13808) @davidwendt
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove unnecessary pointer copying in JIT GroupBy Apply (#13792) @brandon-b-miller
Add 'poll' function to custreamz kafka consumer (#13782) @jdye64
Support corr in GroupBy.apply through the jit engine (#13767) @shwina
Optionally write version 2 page headers in Parquet writer (#13751) @etseidl
Support more numeric types in Groupby.apply with engine='jit' (#13729) @brandon-b-miller
[FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader (#13637) @etseidl
Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader (#13437) @PointKernel

🛠️ Improvements

Update shared-action-workflows references (backport from 23.12 to 23.10) (#14300) @AyodeAwe
Pin dask and distributed for 23.10 release (#14225) @galipremsagar
Update rmm tag path (#14195) @AyodeAwe
Disable Recently Updated Check (#14193) @ajschmidt8
Move cpp/src/hash/hash_allocator.cuh to include/cudf/hashing/detail (#14163) @davidwendt
Add Parquet reader benchmarks for row selection (#14147) @vuule
Update image names (#14145) @AyodeAwe
Support callables in DataFrame.assign (#14142) @wence-
Reduce memory usage of as_categorical_column (#14138) @wence-
Replace Python scalar conversions with libcudf (#14124) @vyasr
Update to clang 16.0.6. (#14120) @bdice
Fix type of empty Index and raise warning in Series constructor (#14116) @galipremsagar
Add stream parameter to external dict APIs (#14115) @SurajAralihalli
Add fallback matrix for nvcomp. (#14082) @bdice
[Java] Add recoverWithNull to JSONOptions and pass to Table.readJSON (#14078) @andygrove
Remove header tests (#14072) @ajschmidt8
Refactor contains_table with cuco::static_set (#14064) @PointKernel
Remove debug print in a Parquet test (#14063) @vuule
Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
Expose stream parameter in public strings find APIs (#14060) @davidwendt
Update doxygen to 1.9.1 (#14059) @vyasr
Remove the mr from the base fixture (#14057) @vyasr
Expose streams in public strings case APIs (#14056) @davidwendt
Refactor libcudf indexalator to typed normalator (#14043) @davidwendt
Use cudf::make_empty_column instead of column_view constructor (#14030) @davidwendt
Remove quadratic runtime due to accessing Frame._dtypes in loop (#14028) @wence-
Explicitly depend on zlib in conda recipes (#14018) @wence-
Use grid_stride for stride computations. (#13996) @bdice
Fix an issue where casting null-array to object dtype will result in a failure (#13994) @galipremsagar
Add tab as literal to cudf::test::to_string output (#13993) @davidwendt
Enable codes dtype parity in pandas-compatibility mode for factorize API (#13982) @galipremsagar
Fix CategoricalIndex ordering in Groupby.agg when pandas-compatibility mode is enabled (#13978) @galipremsagar
Produce a fatal error if cudf is unable to find pyarrow include directory (#13976) @cwharris
Use thread_index_type in partitioning.cu (#13973) @divyegala
Use cudf::thread_index_type in merge.cu (#13972) @divyegala
Use copy-pr-bot (#13970) @ajschmidt8
Use cudf::thread_index_type in strings custom kernels (#13968) @davidwendt
Add bytes_per_second to hash_partition benchmark (#13965) @Blonck
Added pinned pool reservation API for java (#13964) @revans2
Simplify wheel build scripts and allow alphas of RAPIDS dependencies (#13963) @vyasr
Add bytes_per_second to copy_if_else benchmark (#13960) @Blonck
Add pandas compatible output to Series.unique (#13959) @galipremsagar
Add bytes_per_second to compiled binaryop benchmark (#13938) @Blonck
Unpin dask and distributed for 23.10 development (#13935) @galipremsagar
Make HostColumnVector.getRefCount public (#13934) @abellina
Use cuco::static_set in JSON tree algorithm (#13928) @karthikeyann
Add java API to get size of host memory needed to copy column view (#13919) @revans2
Use cudf::size_type instead of int32 where appropriate in nvtext functions (#13915) @davidwendt
Enable hugepage for arrow host allocations (#13914) @madsbk
Improve performance of nvtext::edit_distance (#13912) @davidwendt
Ensure cudf internals use pylibcudf in pure Python mode (#13909) @vyasr
Use empty() instead of size() where possible (#13908) @vuule
[JNI] Adds HostColumnVector.EventHandler for spillability checks (#13898) @abellina
Return Timestamp & Timedelta for fetching scalars in DatetimeIndex & TimedeltaIndex (#13896) @galipremsagar
Allow explicit shuffle="p2p" within dask-cudf API (#13893) @rjzamora
Disable creation of DatetimeIndex when freq is passed to cudf.date_range (#13890) @galipremsagar
Bring parity with pandas for datetime & timedelta comparison operations (#13877) @galipremsagar
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Raise error when astype(object) is called in pandas compatibility mode (#13862) @galipremsagar
Fixes a performance regression in FST (#13850) @elstehle
Set native handles to null on close in Java wrapper classes (#13818) @jlowe
Avoid use of CUDF_EXPECTS in libcudf unit tests outside of helper functions with return values (#13812) @vuule
Update lists::contains to experimental row comparator (#13810) @divyegala
Reduce lists::contains dispatches for scalars (#13805) @divyegala
Long string optimization for string column parsing in JSON reader (#13803) @karthikeyann
Raise NotImplementedError for pd.SparseDtype (#13798) @mroeschke
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Move Spark-independent Table debug to cudf Java (#13783) @gerashegalov
Update to Cython 3.0.0 (#13777) @vyasr
Refactor Parquet reader handling of V2 page header info (#13775) @etseidl
Branch 23.10 merge 23.08 (#13773) @vyasr
Restructure JSON code to correctly reflect legacy/experimental status (#13757) @vuule
Branch 23.10 merge 23.08 (#13753) @vyasr
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Refactors JSON reader's pushdown automaton (#13716) @elstehle
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

cudf - v23.04.01

Published by raydouglass 12 months ago

🚨 Breaking Changes

Pin dask and distributed for release (#13070) @galipremsagar
Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
Update minimum pandas and numpy pinnings (#12887) @galipremsagar
Deprecate names & dtype in Index.copy (#12825) @galipremsagar
Deprecate Index.is_* methods (#12820) @galipremsagar
Deprecate datetime_is_numeric from describe (#12818) @galipremsagar
Deprecate na_sentinel in factorize (#12817) @galipremsagar
Make string methods return a Series with a useful Index (#12814) @shwina
Produce useful guidance on overflow error in to_csv (#12705) @wence-
Move strings_udf code into cuDF (#12669) @brandon-b-miller
Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
Replace message parsing with throwing more specific exceptions (#12426) @vyasr

🐛 Bug Fixes

Pin curand version (#13127) @vyasr
Fix memcheck script to execute only _TEST files found in bin/gtests/libcudf (#13006) @davidwendt
Fix DataFrame constructor to broadcast scalar inputs properly (#12997) @galipremsagar
Drop force_nullable_schema from chunked parquet writer (#12996) @galipremsagar
Fix gtest column utility comparator diff reporting (#12995) @davidwendt
Handle index names while performing groupby (#12992) @galipremsagar
Fix __setitem__ on string columns when the scalar value ends in a null byte (#12991) @wence-
Fix sort_values when column is all empty strings (#12988) @eriknw
Remove unused variable and fix memory issue in ORC writer (#12984) @ttnghia
Pre-emptive fix for upstream dask.dataframe.read_parquet changes (#12983) @rjzamora
Remove MANIFEST.in use auto-generated one for sdists and package_data for wheels (#12960) @vyasr
Update to use rapids-export(COMPONENTS) feature. (#12959) @robertmaynard
cudftestutil supports static gtest dependencies (#12957) @robertmaynard
Include gtest in build environment. (#12956) @vyasr
Correctly handle scalar indices in Index.__getitem__ (#12955) @wence-
Avoid building cython twice (#12945) @galipremsagar
Fix set index error for Series rolling window operations (#12942) @galipremsagar
Fix calculation of null counts for Parquet statistics (#12938) @etseidl
Preserve integer dtype of hive-partitioned column containing nulls (#12930) @rjzamora
Use get_current_device_resource for intermediate allocations in COLLECT_LIST window code (#12927) @karthikeyann
Mark dlpack tensor deleter as noexcept to match PyCapsule_Destructor signature. (#12921) @bdice
Fix conda recipe post-link.sh typo (#12916) @pentschev
min_rows and num_rows are swapped in ComputePageSizes declaration in Parquet reader (#12886) @etseidl
Expect cupy to now support bool arrays for dlpack. (#12883) @vyasr
Use python -m pytest for nightly wheel tests (#12871) @bdice
Parquet writer column_size() should return a size_t (#12870) @etseidl
Fix cudf::hash_partition kernel launch error with decimal128 types (#12863) @davidwendt
Fix an issue with parquet chunked reader undercounting string lengths. (#12859) @nvdbaranec
Remove tokenizers pre-install pinning. (#12854) @vyasr
Fix parquet RangeIndex bug (#12838) @rjzamora
Remove KAFKA_HOST_TEST from compute-sanitizer check (#12831) @davidwendt
Make string methods return a Series with a useful Index (#12814) @shwina
Tell cudf_kafka to use header-only fmt (#12796) @vyasr
Add GroupBy.dtypes (#12783) @galipremsagar
Fix a leak in a test and clarify some test names (#12781) @revans2
Fix bug in all-null list due to join_list_elements special handling (#12767) @karthikeyann
Add try/except for expected null-schema error in read_parquet (#12756) @rjzamora
Throw an exception if an unsupported page encoding is detected in Parquet reader (#12754) @etseidl
Fix a bug with num_keys in _scatter_by_slice (#12749) @thomcom
Bump pinned rapids wheel deps to 23.4 (#12735) @sevagh
Rework logic in cudf::strings::split_record to improve performance (#12729) @davidwendt
Add always_nullable flag to Dremel encoding (#12727) @divyegala
Fix memcheck read error in compound segmented reduce (#12722) @davidwendt
Fix faulty conditional logic in JIT GroupBy.apply (#12706) @brandon-b-miller
Produce useful guidance on overflow error in to_csv (#12705) @wence-
Handle parquet list data corner case (#12698) @nvdbaranec
Fix missing trailing comma in json writer (#12688) @karthikeyann
Remove child from newCudaAsyncMemoryResource (#12681) @abellina
Handle bool types in round API (#12670) @galipremsagar
Ensure all of device bitmask is initialized in from_arrow (#12668) @wence-
Fix from_arrow to load a sliced arrow table (#12665) @galipremsagar
Fix dask-cudf read_parquet bug for multi-file aggregation (#12663) @rjzamora
Fix AllocateLikeTest gtests reading uninitialized null-mask (#12643) @davidwendt
Fix find_common_dtype and values to handle complex dtypes (#12537) @galipremsagar
Fix fetching of MultiIndex values when a label is passed (#12521) @galipremsagar
Fix Series comparison vs scalars (#12519) @brandon-b-miller
Allow casting from UDFString back to StringView to call methods in strings_udf (#12363) @brandon-b-miller

📖 Documentation

Fix GroupBy.apply doc examples rendering (#12994) @brandon-b-miller
Add Sphinx building and S3 uploading for dask-cudf docs (#12982) @quasiben
Add developer documentation forbidding default parameters in detail APIs (#12978) @vyasr
Add README symlink for dask-cudf. (#12946) @bdice
Remove return type from @return doxygen tags (#12908) @davidwendt
Fix docs build to be pydata-sphinx-theme=0.13.0 compatible (#12874) @galipremsagar
Add skeleton API and prose documentation for dask-cudf (#12725) @wence-
Enable doctests for GroupBy methods (#12658) @brandon-b-miller
Add comment about CUB patch for SegmentedSortInt.Bool gtest (#12611) @davidwendt

🚀 New Features

Add JNI method for strings::replace multi variety (#12979) @NVnavkumar
Add nunique aggregation support for cudf::segmented_reduce (#12972) @davidwendt
Refactor orc chunked writer (#12949) @ttnghia
Make Parquet writer nullable option applicable to single table writes (#12933) @vuule
Refactor io::orc::ProtobufWriter (#12877) @ttnghia
Make timezone table independent from ORC (#12805) @vuule
Cache JIT GroupBy.apply functions (#12802) @brandon-b-miller
Implement initial support for avro logical types (#6482) (#12788) @tpn
Update tests/column_utilities to use experimental::equality row comparator (#12777) @divyegala
Update distinct/unique_count to experimental::row hasher/comparator (#12776) @divyegala
Update hash_partition to use experimental::row::row_hasher (#12761) @divyegala
Update is_sorted to use experimental::row::lexicographic (#12752) @divyegala
Update default data source in cuio reader benchmarks (#12740) @PointKernel
Reenable stream identification library in CI (#12714) @vyasr
Add regex_program strings splitting java APIs and tests (#12713) @cindyyuanjiang
Add regex_program strings replacing java APIs and tests (#12701) @cindyyuanjiang
Add regex_program strings extract java APIs and tests (#12699) @cindyyuanjiang
Variable fragment sizes for Parquet writer (#12685) @etseidl
Add segmented reduction support for fixed-point types (#12680) @davidwendt
Move strings_udf code into cuDF (#12669) @brandon-b-miller
Add regex_program searching APIs and related java classes (#12666) @cindyyuanjiang
Add logging to libcudf (#12637) @vuule
Add compound aggregations to cudf::segmented_reduce (#12573) @davidwendt
Convert rank to use experimental row comparators (#12481) @divyegala
Use rapids-cmake parallel testing feature (#12451) @robertmaynard
Enable detection of undesired stream usage (#12089) @vyasr

🛠️ Improvements

Pin dask and distributed for release (#13070) @galipremsagar
Pin cupy in wheel tests to supported versions (#13041) @vyasr
Pin numba version (#13001) @vyasr
Rework gtests SequenceTest to remove using namespace cudf (#12985) @davidwendt
Stop setting package version attribute in wheels (#12977) @vyasr
Move detail reduction functions to cudf::reduction::detail namespace (#12971) @davidwendt
Remove default detail mrs: part7 (#12970) @vyasr
Remove default detail mrs: part6 (#12969) @vyasr
Remove default detail mrs: part5 (#12968) @vyasr
Remove default detail mrs: part4 (#12967) @vyasr
Remove default detail mrs: part3 (#12966) @vyasr
Remove default detail mrs: part2 (#12965) @vyasr
Remove default detail mrs: part1 (#12964) @vyasr
Add force_nullable_schema parameter to Parquet writer. (#12952) @galipremsagar
Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
Remove remaining default stream parameters (#12943) @vyasr
Fix cudf::segmented_reduce gtest for ANY aggregation (#12940) @davidwendt
Implement groupby.head and groupby.tail (#12939) @wence-
Fix libcudf gtests to pass null-count=0 for empty validity masks (#12923) @davidwendt
Migrate parquet encoding to use experimental row operators (#12918) @PointKernel
Fix benchmarks coded in namespace cudf and using namespace cudf (#12915) @karthikeyann
Fix io/text gtests coded in namespace cudf::test (#12914) @karthikeyann
Pass SCCACHE_S3_USE_SSL to conda builds (#12910) @ajschmidt8
Fix FST, JSON gtests & benchmarks coded in namespace cudf::test (#12907) @karthikeyann
Generate pyproject dependencies using dfg (#12906) @vyasr
Update libcudf counting functions to specify cudf::size_type (#12904) @davidwendt
Fix moto env vars & pass AWS_SESSION_TOKEN to conda builds (#12902) @ajschmidt8
Rewrite CSV writer benchmark with nvbench (#12901) @PointKernel
Rework some code logic to reduce iterator and comparator inlining to improve compile time (#12900) @davidwendt
Deprecate line_terminator in favor of lineterminator in to_csv (#12896) @wence-
Add stream and mr parameters for structs::detail::flatten_nested_columns (#12892) @ttnghia
Deprecate libcudf regex APIs accepting pattern strings directly (#12891) @davidwendt
Remove default parameters from detail headers in include (#12888) @vyasr
Update minimum pandas and numpy pinnings (#12887) @galipremsagar
Implement groupby.sample (#12882) @wence-
Update JNI build ENV default to gcc 11 (#12881) @pxLi
Change return type of cudf::structs::detail::flatten_nested_columns to smart pointer (#12878) @ttnghia
Fix passing seed parameter to MurmurHash3_32 in cudf::hash() function (#12875) @davidwendt
Remove manual artifact upload step in CI (#12869) @ajschmidt8
Update to GCC 11 (#12868) @bdice
Fix null hive-partition behavior in dask-cudf parquet (#12866) @rjzamora
Update to protobuf>=4.21.6,<4.22. (#12864) @bdice
Update RMM allocators (#12861) @pentschev
Improve performance for replace-multi for long strings (#12858) @davidwendt
Drop Python 3.7 handling for pickle protocol 4 (#12857) @jakirkham
Migrate as much as possible to pyproject.toml (#12850) @vyasr
Enable nbqa pre-commit hooks for isort and black. (#12848) @bdice
Setting a threshold for KvikIO IO (#12841) @madsbk
Update datasets download URL (#12840) @jjacobelli
Make docs builds less verbose (#12836) @AyodeAwe
Consolidate linter configs into pyproject.toml (#12834) @vyasr
Deprecate names & dtype in Index.copy (#12825) @galipremsagar
Deprecate inplace parameters in categorical methods (#12824) @galipremsagar
Add optional text file support to ninja-log utility (#12823) @davidwendt
Deprecate Index.is_* methods (#12820) @galipremsagar
Add dfg as a pre-commit hook (#12819) @vyasr
Deprecate datetime_is_numeric from describe (#12818) @galipremsagar
Deprecate na_sentinel in factorize (#12817) @galipremsagar
Shuffling read into a sub function in parquet read (#12809) @hyperbolic2346
Fixing parquet coalescing of reads (#12808) @hyperbolic2346
CI: Remove specification of manual stage for check_style.sh script. (#12803) @csadorf
Add compute-sanitizer github workflow action to nightly tests (#12800) @davidwendt
Enable groupby std and variance aggregation types in libcudf Debug build (#12799) @davidwendt
Expose seed argument to hash_values (#12795) @ayushdg
Fix groupby gtests coded in namespace cudf::test (#12784) @davidwendt
Improve performance for cudf::strings::count_characters for long strings (#12779) @davidwendt
Deallocate encoded data in ORC writer immediately after compression (#12770) @vuule
Stop force pulling fmt in nvbench. (#12768) @vyasr
Remove now redundant cuda initialization (#12758) @vyasr
Adds JSON reader, writer io benchmark (#12753) @karthikeyann
Use test paths relative to package directory. (#12751) @bdice
Add build metrics report as artifact to cpp-build workflow (#12750) @davidwendt
Add JNI methods for detecting and purging non-empty nulls from LIST and STRUCT (#12742) @razajafri
Stop using versioneer to manage versions (#12741) @vyasr
Reduce error handling verbosity in CI tests scripts (#12738) @AjayThorve
Reduce the number of test cases in multibyte_split benchmark (#12737) @PointKernel
Update shared workflow branches (#12733) @ajschmidt8
JNI switches to nested JSON reader (#12732) @res-life
Changing cudf::io::source_info to use cudf::host_span<std::byte> in a non-breaking form (#12730) @hyperbolic2346
Add nvbench environment class for initializing RMM in benchmarks (#12728) @davidwendt
Split C++ and Python build dependencies into separate lists. (#12724) @bdice
Add build dependencies to Java tests. (#12723) @bdice
Allow setting the seed argument for hash partition (#12715) @firestarman
Remove gpuCI scripts. (#12712) @bdice
Unpin dask and distributed for development (#12710) @galipremsagar
partition_by_hash(): use _split() (#12704) @madsbk
Remove DataFrame.quantiles from docs. (#12684) @bdice
Fast path for experimental::row::equality (#12676) @divyegala
Move date to build string in conda recipe (#12661) @ajschmidt8
Refactor reduction logic for fixed-point types (#12652) @davidwendt
Pay off some JNI RMM API tech debt (#12632) @revans2
Merge copy-on-write feature branch into branch-23.04 (#12619) @galipremsagar
Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
Pin cuda-nvrtc. (#12606) @bdice
Remove cudf::test::print calls from libcudf gtests (#12604) @davidwendt
Init JNI version 23.04.0-SNAPSHOT (#12599) @pxLi
Add performance benchmarks to user facing docs (#12595) @galipremsagar
Add docs build job (#12592) @AyodeAwe
Replace message parsing with throwing more specific exceptions (#12426) @vyasr
Support conversion to/from cudf in dask.dataframe.core.to_backend (#12380) @rjzamora

cudf - [NIGHTLY] v23.10.00

Published by rapids-bot[bot] almost 1 year ago

🚨 Breaking Changes

Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Create table_input_metadata from a table_metadata (#13920) @etseidl
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Update to Cython 3.0.0 (#13777) @vyasr
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

🐛 Bug Fixes

Fix inaccurate ceil/floor and inaccurate rescaling casts of fixed-point values. (#14242) @bdice
Fix inaccuracy in decimal128 rounding. (#14233) @bdice
Workaround for illegal instruction error in sm90 for warp intrinsics with mask (#14201) @karthikeyann
Fix pytorch related pytest (#14198) @galipremsagar
Pin to aws-sdk-cpp<1.11 (#14173) @pentschev
Fix assert failure for range window functions (#14168) @mythrocks
Fix Memcheck error found in JSON_TEST JsonReaderTest.ErrorStrings (#14164) @karthikeyann
Fix calls to copy_bitmask to pass stream parameter (#14158) @davidwendt
Fix DataFrame from Series with different CategoricalIndexes (#14157) @mroeschke
Pin to numpy<1.25 and numba<0.58 to avoid errors and deprecation warnings-as-errors. (#14156) @bdice
Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
Fix DataFrame.values with no columns but index (#14134) @mroeschke
Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
Add support for nested dict in DataFrame constructor (#14119) @galipremsagar
Restrict iterables of DataFrame's as input to DataFrame constructor (#14118) @galipremsagar
Allow numeric_only=True for reduction operations on numeric types (#14111) @galipremsagar
Preserve name of the column while initializing a DataFrame (#14110) @galipremsagar
Correct numerous 20054-D: dynamic initialization errors found on arm+12.2 (#14108) @robertmaynard
Drop kwargs from Series.count (#14106) @galipremsagar
Fix naming issues with Index.to_frame and MultiIndex.to_frame APIs (#14105) @galipremsagar
Only use memory resources that haven't been freed (#14103) @robertmaynard
Add support for __round__ in Series and DataFrame (#14099) @galipremsagar
Validate ignore_index type in drop_duplicates (#14098) @mroeschke
Fix renaming Series and Index (#14080) @galipremsagar
Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
Use conda mambabuild rather than mamba mambabuild (#14067) @wence-
Raise NotImplementedError in to_datetime with dayfirst without infer_format (#14058) @mroeschke
Fix various issues in Index.intersection (#14054) @galipremsagar
Fix Index.difference to match with pandas (#14053) @galipremsagar
Fix empty string column construction (#14052) @galipremsagar
Fix IntervalIndex.union to preserve type-metadata (#14051) @galipremsagar
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Ignore compile_commands.json (#14048) @harrism
Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
Implement sort_remaining for sort_index (#14033) @wence-
Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
Temporary fix Parquet metadata with empty value string being ignored from writing (#14026) @ttnghia
Preserve types of scalar being returned when possible in quantile (#14014) @galipremsagar
Fix return type of MultiIndex.difference (#14009) @galipremsagar
Raise an error when timezone subtypes are encountered in pd.IntervalDtype (#14006) @galipremsagar
Fix map column can not be non-nullable for java (#14003) @res-life
Fix name selection in Index.difference and Index.intersection (#13986) @galipremsagar
Restore column type metadata with dropna to fix factorize API (#13980) @galipremsagar
Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
Fix MultiIndex.to_numpy to return numpy array with tuples (#13966) @galipremsagar
Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
Fix an issue with IntervalIndex.repr when null values are present (#13958) @galipremsagar
Fix type metadata issue preservation with Column.unique (#13957) @galipremsagar
Handle Interval scalars when passed in list-like inputs to cudf.Index (#13956) @galipremsagar
Fix setting of categories order when dtype is passed to a CategoricalColumn (#13955) @galipremsagar
Handle as_index in GroupBy.apply (#13951) @brandon-b-miller
Raise error for string types in nsmallest and nlargest (#13946) @galipremsagar
Fix index of Groupby.apply results when it is performed on empty objects (#13944) @galipremsagar
Fix integer overflow in shim device_sum functions (#13943) @brandon-b-miller
Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
Fix construction of Grouping objects (#13932) @galipremsagar
Fix an issue with loc when column names is MultiIndex (#13929) @galipremsagar
Fix handling of typecasting in searchsorted (#13925) @galipremsagar
Preserve index name in reindex (#13917) @galipremsagar
Use cudf::thread_index_type in cuIO to prevent overflow in row indexing (#13910) @vuule
Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
Use cudf::thread_index_type in replace.cu. (#13905) @bdice
Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
Use thread_index_type to avoid index overflow in grid-stride loops (#13895) @PointKernel
Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
Raise error when trying to construct a DataFrame with mixed types (#13889) @galipremsagar
Return nan when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
Correctly detect the BOM mark in read_csv with compressed input (#13881) @vuule
Check for the presence of all values in MultiIndex.isin (#13879) @galipremsagar
Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
Fix return type of MultiIndex.levels (#13870) @galipremsagar
Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
Disable construction of Index when freq is set in pandas-compatibility mode (#13857) @galipremsagar
Fix an issue with fetching NA from a TimedeltaColumn (#13853) @galipremsagar
Simplify implementation of interval_range() and fix behaviour for floating freq (#13844) @shwina
Fix binary operations between Series and Index (#13842) @galipremsagar
Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
Fix read out of bounds in string concatenate (#13838) @pentschev
Raise error for more cases when timezone-aware data is passed to as_column (#13835) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
Fix cuFile I/O factories (#13829) @vuule
DataFrame with namedtuples uses ._fields as column names (#13824) @mroeschke
Branch 23.10 merge 23.08 (#13822) @vyasr
Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
Raise error when mixed types are being constructed (#13816) @galipremsagar
Fix unbounded sequence issue in DataFrame constructor (#13811) @galipremsagar
Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
Update get_arrow to Arrow 12's CMake target name of arrow::xsimd (#13790) @robertmaynard
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Fix negative unary operation for boolean type (#13780) @galipremsagar
Fix contains(in) method for Series (#13779) @galipremsagar
Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
Cast only time of day to nanos to avoid an overflow in Parquet INT96 write (#13776) @gerashegalov
Preserve names of column object in various APIs (#13772) @galipremsagar
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Fix construction of DataFrames from dict when columns are provided (#13766) @wence-
Provide our own Cython declaration for make_unique (#13746) @wence-

📖 Documentation

Fix typo in docstring: metadata. (#14025) @bdice
Fix typo in parquet/page_decode.cuh (#13849) @XinyuZeng
Simplify Python doc configuration (#13826) @vyasr
Update documentation to reflect recent changes in JSON reader and writer (#13791) @vuule
Fix all warnings in Python docs (#13789) @vyasr

🚀 New Features

[Java] Add JNI bindings for integers_to_hex (#14205) @razajafri
Propagate errors from Parquet reader kernels back to host (#14167) @vuule
JNI for HISTOGRAM and MERGE_HISTOGRAM aggregations (#14154) @ttnghia
Expose streams in all public sorting APIs (#14146) @vyasr
Enable direct ingestion and production of Arrow scalars (#14121) @vyasr
Implement GroupBy.value_counts to match pandas API (#14114) @stmio
Refactor parquet thrift reader (#14097) @etseidl
Refactor hash_reduce_by_row (#14095) @ttnghia
Support negative preceding/following for ROW window functions (#14093) @mythrocks
Support for progressive parquet chunked reading. (#14079) @nvdbaranec
Implement HISTOGRAM and MERGE_HISTOGRAM aggregations (#14045) @ttnghia
Expose streams in public search APIs (#14034) @vyasr
Expose streams in public replace APIs (#14010) @vyasr
Add stream parameter to public cudf::strings::split APIs (#13997) @davidwendt
Expose streams in public filling APIs (#13990) @vyasr
Expose streams in public concatenate APIs (#13987) @vyasr
Use HostMemoryAllocator in jni::allocate_host_buffer (#13975) @gerashegalov
Enable fractional null probability for hashing benchmark (#13967) @Blonck
Switch pylibcudf-enabled types to use enum class in Cython (#13931) @vyasr
Add nvtext::tokenize_with_vocabulary API (#13930) @davidwendt
Rewrite DataFrame.stack to support multi level column names (#13927) @isVoid
Add HostMemoryAllocator interface (#13924) @gerashegalov
Global stream pool (#13922) @etseidl
Create table_input_metadata from a table_metadata (#13920) @etseidl
Translate column size overflow exception to JNI (#13911) @mythrocks
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Exclude some tests from running with the compute sanitizer (#13872) @firestarman
Expand statistics support in ORC writer (#13848) @vuule
Register the memory mapped buffer in datasource to improve H2D throughput (#13814) @vuule
Add cudf::strings::find function with target per row (#13808) @davidwendt
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove unnecessary pointer copying in JIT GroupBy Apply (#13792) @brandon-b-miller
Add 'poll' function to custreamz kafka consumer (#13782) @jdye64
Support corr in GroupBy.apply through the jit engine (#13767) @shwina
Optionally write version 2 page headers in Parquet writer (#13751) @etseidl
Support more numeric types in Groupby.apply with engine='jit' (#13729) @brandon-b-miller
[FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader (#13637) @etseidl
Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader (#13437) @PointKernel

🛠️ Improvements

Update shared-action-workflows references (backport from 23.12 to 23.10) (#14300) @AyodeAwe
Pin dask and distributed for 23.10 release (#14225) @galipremsagar
update rmm tag path (#14195) @AyodeAwe
Disable Recently Updated Check (#14193) @ajschmidt8
Move cpp/src/hash/hash_allocator.cuh to include/cudf/hashing/detail (#14163) @davidwendt
Add Parquet reader benchmarks for row selection (#14147) @vuule
Update image names (#14145) @AyodeAwe
Support callables in DataFrame.assign (#14142) @wence-
Reduce memory usage of as_categorical_column (#14138) @wence-
Replace Python scalar conversions with libcudf (#14124) @vyasr
Update to clang 16.0.6. (#14120) @bdice
Fix type of empty Index and raise warning in Series constructor (#14116) @galipremsagar
Add stream parameter to external dict APIs (#14115) @SurajAralihalli
Add fallback matrix for nvcomp. (#14082) @bdice
[Java] Add recoverWithNull to JSONOptions and pass to Table.readJSON (#14078) @andygrove
Remove header tests (#14072) @ajschmidt8
Refactor contains_table with cuco::static_set (#14064) @PointKernel
Remove debug print in a Parquet test (#14063) @vuule
Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
Expose stream parameter in public strings find APIs (#14060) @davidwendt
Update doxygen to 1.9.1 (#14059) @vyasr
Remove the mr from the base fixture (#14057) @vyasr
Expose streams in public strings case APIs (#14056) @davidwendt
Refactor libcudf indexalator to typed normalator (#14043) @davidwendt
Use cudf::make_empty_column instead of column_view constructor (#14030) @davidwendt
Remove quadratic runtime due to accessing Frame._dtypes in loop (#14028) @wence-
Explicitly depend on zlib in conda recipes (#14018) @wence-
Use grid_stride for stride computations. (#13996) @bdice
Fix an issue where casting null-array to object dtype will result in a failure (#13994) @galipremsagar
Add tab as literal to cudf::test::to_string output (#13993) @davidwendt
Enable codes dtype parity in pandas-compatibility mode for factorize API (#13982) @galipremsagar
Fix CategoricalIndex ordering in Groupby.agg when pandas-compatibility mode is enabled (#13978) @galipremsagar
Produce a fatal error if cudf is unable to find pyarrow include directory (#13976) @cwharris
Use thread_index_type in partitioning.cu (#13973) @divyegala
Use cudf::thread_index_type in merge.cu (#13972) @divyegala
Use copy-pr-bot (#13970) @ajschmidt8
Use cudf::thread_index_type in strings custom kernels (#13968) @davidwendt
Add bytes_per_second to hash_partition benchmark (#13965) @Blonck
Added pinned pool reservation API for java (#13964) @revans2
Simplify wheel build scripts and allow alphas of RAPIDS dependencies (#13963) @vyasr
Add bytes_per_second to copy_if_else benchmark (#13960) @Blonck
Add pandas compatible output to Series.unique (#13959) @galipremsagar
Add bytes_per_second to compiled binaryop benchmark (#13938) @Blonck
Unpin dask and distributed for 23.10 development (#13935) @galipremsagar
Make HostColumnVector.getRefCount public (#13934) @abellina
Use cuco::static_set in JSON tree algorithm (#13928) @karthikeyann
Add java API to get size of host memory needed to copy column view (#13919) @revans2
Use cudf::size_type instead of int32 where appropriate in nvtext functions (#13915) @davidwendt
Enable hugepage for arrow host allocations (#13914) @madsbk
Improve performance of nvtext::edit_distance (#13912) @davidwendt
Ensure cudf internals use pylibcudf in pure Python mode (#13909) @vyasr
Use empty() instead of size() where possible (#13908) @vuule
[JNI] Adds HostColumnVector.EventHandler for spillability checks (#13898) @abellina
Return Timestamp & Timedelta for fetching scalars in DatetimeIndex & TimedeltaIndex (#13896) @galipremsagar
Allow explicit shuffle="p2p" within dask-cudf API (#13893) @rjzamora
Disable creation of DatetimeIndex when freq is passed to cudf.date_range (#13890) @galipremsagar
Bring parity with pandas for datetime & timedelta comparison operations (#13877) @galipremsagar
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Raise error when astype(object) is called in pandas compatibility mode (#13862) @galipremsagar
Fixes a performance regression in FST (#13850) @elstehle
Set native handles to null on close in Java wrapper classes (#13818) @jlowe
Avoid use of CUDF_EXPECTS in libcudf unit tests outside of helper functions with return values (#13812) @vuule
Update lists::contains to experimental row comparator (#13810) @divyegala
Reduce lists::contains dispatches for scalars (#13805) @divyegala
Long string optimization for string column parsing in JSON reader (#13803) @karthikeyann
Raise NotImplementedError for pd.SparseDtype (#13798) @mroeschke
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Move Spark-independent Table debug to cudf Java (#13783) @gerashegalov
Update to Cython 3.0.0 (#13777) @vyasr
Refactor Parquet reader handling of V2 page header info (#13775) @etseidl
Branch 23.10 merge 23.08 (#13773) @vyasr
Restructure JSON code to correctly reflect legacy/experimental status (#13757) @vuule
Branch 23.10 merge 23.08 (#13753) @vyasr
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Refactors JSON reader's pushdown automaton (#13716) @elstehle
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

cudf - v23.10.00

Published by raydouglass about 1 year ago

🚨 Breaking Changes

Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Create table_input_metadata from a table_metadata (#13920) @etseidl
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Update to Cython 3.0.0 (#13777) @vyasr
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

🐛 Bug Fixes

Fix inaccurate ceil/floor and inaccurate rescaling casts of fixed-point values. (#14242) @bdice
Fix inaccuracy in decimal128 rounding. (#14233) @bdice
Workaround for illegal instruction error in sm90 for warp intrinsics with mask (#14201) @karthikeyann
Fix pytorch related pytest (#14198) @galipremsagar
Pin to aws-sdk-cpp<1.11 (#14173) @pentschev
Fix assert failure for range window functions (#14168) @mythrocks
Fix Memcheck error found in JSON_TEST JsonReaderTest.ErrorStrings (#14164) @karthikeyann
Fix calls to copy_bitmask to pass stream parameter (#14158) @davidwendt
Fix DataFrame from Series with different CategoricalIndexes (#14157) @mroeschke
Pin to numpy<1.25 and numba<0.58 to avoid errors and deprecation warnings-as-errors. (#14156) @bdice
Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
Fix DataFrame.values with no columns but index (#14134) @mroeschke
Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
Add support for nested dict in DataFrame constructor (#14119) @galipremsagar
Restrict iterables of DataFrame's as input to DataFrame constructor (#14118) @galipremsagar
Allow numeric_only=True for reduction operations on numeric types (#14111) @galipremsagar
Preserve name of the column while initializing a DataFrame (#14110) @galipremsagar
Correct numerous 20054-D: dynamic initialization errors found on arm+12.2 (#14108) @robertmaynard
Drop kwargs from Series.count (#14106) @galipremsagar
Fix naming issues with Index.to_frame and MultiIndex.to_frame APIs (#14105) @galipremsagar
Only use memory resources that haven't been freed (#14103) @robertmaynard
Add support for __round__ in Series and DataFrame (#14099) @galipremsagar
Validate ignore_index type in drop_duplicates (#14098) @mroeschke
Fix renaming Series and Index (#14080) @galipremsagar
Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
Use conda mambabuild rather than mamba mambabuild (#14067) @wence-
Raise NotImplementedError in to_datetime with dayfirst without infer_format (#14058) @mroeschke
Fix various issues in Index.intersection (#14054) @galipremsagar
Fix Index.difference to match with pandas (#14053) @galipremsagar
Fix empty string column construction (#14052) @galipremsagar
Fix IntervalIndex.union to preserve type-metadata (#14051) @galipremsagar
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Ignore compile_commands.json (#14048) @harrism
Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
Implement sort_remaining for sort_index (#14033) @wence-
Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
Temporary fix Parquet metadata with empty value string being ignored from writing (#14026) @ttnghia
Preserve types of scalar being returned when possible in quantile (#14014) @galipremsagar
Fix return type of MultiIndex.difference (#14009) @galipremsagar
Raise an error when timezone subtypes are encountered in pd.IntervalDtype (#14006) @galipremsagar
Fix map column can not be non-nullable for java (#14003) @res-life
Fix name selection in Index.difference and Index.intersection (#13986) @galipremsagar
Restore column type metadata with dropna to fix factorize API (#13980) @galipremsagar
Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
Fix MultiIndex.to_numpy to return numpy array with tuples (#13966) @galipremsagar
Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
Fix an issue with IntervalIndex.repr when null values are present (#13958) @galipremsagar
Fix type metadata issue preservation with Column.unique (#13957) @galipremsagar
Handle Interval scalars when passed in list-like inputs to cudf.Index (#13956) @galipremsagar
Fix setting of categories order when dtype is passed to a CategoricalColumn (#13955) @galipremsagar
Handle as_index in GroupBy.apply (#13951) @brandon-b-miller
Raise error for string types in nsmallest and nlargest (#13946) @galipremsagar
Fix index of Groupby.apply results when it is performed on empty objects (#13944) @galipremsagar
Fix integer overflow in shim device_sum functions (#13943) @brandon-b-miller
Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
Fix construction of Grouping objects (#13932) @galipremsagar
Fix an issue with loc when column names is MultiIndex (#13929) @galipremsagar
Fix handling of typecasting in searchsorted (#13925) @galipremsagar
Preserve index name in reindex (#13917) @galipremsagar
Use cudf::thread_index_type in cuIO to prevent overflow in row indexing (#13910) @vuule
Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
Use cudf::thread_index_type in replace.cu. (#13905) @bdice
Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
Use thread_index_type to avoid index overflow in grid-stride loops (#13895) @PointKernel
Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
Raise error when trying to construct a DataFrame with mixed types (#13889) @galipremsagar
Return nan when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
Correctly detect the BOM mark in read_csv with compressed input (#13881) @vuule
Check for the presence of all values in MultiIndex.isin (#13879) @galipremsagar
Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
Fix return type of MultiIndex.levels (#13870) @galipremsagar
Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
Disable construction of Index when freq is set in pandas-compatibility mode (#13857) @galipremsagar
Fix an issue with fetching NA from a TimedeltaColumn (#13853) @galipremsagar
Simplify implementation of interval_range() and fix behaviour for floating freq (#13844) @shwina
Fix binary operations between Series and Index (#13842) @galipremsagar
Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
Fix read out of bounds in string concatenate (#13838) @pentschev
Raise error for more cases when timezone-aware data is passed to as_column (#13835) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
Fix cuFile I/O factories (#13829) @vuule
DataFrame with namedtuples uses ._fields as column names (#13824) @mroeschke
Branch 23.10 merge 23.08 (#13822) @vyasr
Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
Raise error when mixed types are being constructed (#13816) @galipremsagar
Fix unbounded sequence issue in DataFrame constructor (#13811) @galipremsagar
Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
Update get_arrow to Arrow 12's CMake target name of arrow::xsimd (#13790) @robertmaynard
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Fix negative unary operation for boolean type (#13780) @galipremsagar
Fix contains(in) method for Series (#13779) @galipremsagar
Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
Cast only time of day to nanos to avoid an overflow in Parquet INT96 write (#13776) @gerashegalov
Preserve names of column object in various APIs (#13772) @galipremsagar
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Fix construction of DataFrames from dict when columns are provided (#13766) @wence-
Provide our own Cython declaration for make_unique (#13746) @wence-

📖 Documentation

Fix typo in docstring: metadata. (#14025) @bdice
Fix typo in parquet/page_decode.cuh (#13849) @XinyuZeng
Simplify Python doc configuration (#13826) @vyasr
Update documentation to reflect recent changes in JSON reader and writer (#13791) @vuule
Fix all warnings in Python docs (#13789) @vyasr

🚀 New Features

[Java] Add JNI bindings for integers_to_hex (#14205) @razajafri
Propagate errors from Parquet reader kernels back to host (#14167) @vuule
JNI for HISTOGRAM and MERGE_HISTOGRAM aggregations (#14154) @ttnghia
Expose streams in all public sorting APIs (#14146) @vyasr
Enable direct ingestion and production of Arrow scalars (#14121) @vyasr
Implement GroupBy.value_counts to match pandas API (#14114) @stmio
Refactor parquet thrift reader (#14097) @etseidl
Refactor hash_reduce_by_row (#14095) @ttnghia
Support negative preceding/following for ROW window functions (#14093) @mythrocks
Support for progressive parquet chunked reading. (#14079) @nvdbaranec
Implement HISTOGRAM and MERGE_HISTOGRAM aggregations (#14045) @ttnghia
Expose streams in public search APIs (#14034) @vyasr
Expose streams in public replace APIs (#14010) @vyasr
Add stream parameter to public cudf::strings::split APIs (#13997) @davidwendt
Expose streams in public filling APIs (#13990) @vyasr
Expose streams in public concatenate APIs (#13987) @vyasr
Use HostMemoryAllocator in jni::allocate_host_buffer (#13975) @gerashegalov
Enable fractional null probability for hashing benchmark (#13967) @Blonck
Switch pylibcudf-enabled types to use enum class in Cython (#13931) @vyasr
Add nvtext::tokenize_with_vocabulary API (#13930) @davidwendt
Rewrite DataFrame.stack to support multi level column names (#13927) @isVoid
Add HostMemoryAllocator interface (#13924) @gerashegalov
Global stream pool (#13922) @etseidl
Create table_input_metadata from a table_metadata (#13920) @etseidl
Translate column size overflow exception to JNI (#13911) @mythrocks
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Exclude some tests from running with the compute sanitizer (#13872) @firestarman
Expand statistics support in ORC writer (#13848) @vuule
Register the memory mapped buffer in datasource to improve H2D throughput (#13814) @vuule
Add cudf::strings::find function with target per row (#13808) @davidwendt
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove unnecessary pointer copying in JIT GroupBy Apply (#13792) @brandon-b-miller
Add 'poll' function to custreamz kafka consumer (#13782) @jdye64
Support corr in GroupBy.apply through the jit engine (#13767) @shwina
Optionally write version 2 page headers in Parquet writer (#13751) @etseidl
Support more numeric types in Groupby.apply with engine='jit' (#13729) @brandon-b-miller
[FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader (#13637) @etseidl
Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader (#13437) @PointKernel

🛠️ Improvements

Pin dask and distributed for 23.10 release (#14225) @galipremsagar
update rmm tag path (#14195) @AyodeAwe
Disable Recently Updated Check (#14193) @ajschmidt8
Move cpp/src/hash/hash_allocator.cuh to include/cudf/hashing/detail (#14163) @davidwendt
Add Parquet reader benchmarks for row selection (#14147) @vuule
Update image names (#14145) @AyodeAwe
Support callables in DataFrame.assign (#14142) @wence-
Reduce memory usage of as_categorical_column (#14138) @wence-
Replace Python scalar conversions with libcudf (#14124) @vyasr
Update to clang 16.0.6. (#14120) @bdice
Fix type of empty Index and raise warning in Series constructor (#14116) @galipremsagar
Add stream parameter to external dict APIs (#14115) @SurajAralihalli
Add fallback matrix for nvcomp. (#14082) @bdice
[Java] Add recoverWithNull to JSONOptions and pass to Table.readJSON (#14078) @andygrove
Remove header tests (#14072) @ajschmidt8
Refactor contains_table with cuco::static_set (#14064) @PointKernel
Remove debug print in a Parquet test (#14063) @vuule
Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
Expose stream parameter in public strings find APIs (#14060) @davidwendt
Update doxygen to 1.9.1 (#14059) @vyasr
Remove the mr from the base fixture (#14057) @vyasr
Expose streams in public strings case APIs (#14056) @davidwendt
Refactor libcudf indexalator to typed normalator (#14043) @davidwendt
Use cudf::make_empty_column instead of column_view constructor (#14030) @davidwendt
Remove quadratic runtime due to accessing Frame._dtypes in loop (#14028) @wence-
Explicitly depend on zlib in conda recipes (#14018) @wence-
Use grid_stride for stride computations. (#13996) @bdice
Fix an issue where casting null-array to object dtype will result in a failure (#13994) @galipremsagar
Add tab as literal to cudf::test::to_string output (#13993) @davidwendt
Enable codes dtype parity in pandas-compatibility mode for factorize API (#13982) @galipremsagar
Fix CategoricalIndex ordering in Groupby.agg when pandas-compatibility mode is enabled (#13978) @galipremsagar
Produce a fatal error if cudf is unable to find pyarrow include directory (#13976) @cwharris
Use thread_index_type in partitioning.cu (#13973) @divyegala
Use cudf::thread_index_type in merge.cu (#13972) @divyegala
Use copy-pr-bot (#13970) @ajschmidt8
Use cudf::thread_index_type in strings custom kernels (#13968) @davidwendt
Add bytes_per_second to hash_partition benchmark (#13965) @Blonck
Added pinned pool reservation API for java (#13964) @revans2
Simplify wheel build scripts and allow alphas of RAPIDS dependencies (#13963) @vyasr
Add bytes_per_second to copy_if_else benchmark (#13960) @Blonck
Add pandas compatible output to Series.unique (#13959) @galipremsagar
Add bytes_per_second to compiled binaryop benchmark (#13938) @Blonck
Unpin dask and distributed for 23.10 development (#13935) @galipremsagar
Make HostColumnVector.getRefCount public (#13934) @abellina
Use cuco::static_set in JSON tree algorithm (#13928) @karthikeyann
Add java API to get size of host memory needed to copy column view (#13919) @revans2
Use cudf::size_type instead of int32 where appropriate in nvtext functions (#13915) @davidwendt
Enable hugepage for arrow host allocations (#13914) @madsbk
Improve performance of nvtext::edit_distance (#13912) @davidwendt
Ensure cudf internals use pylibcudf in pure Python mode (#13909) @vyasr
Use empty() instead of size() where possible (#13908) @vuule
[JNI] Adds HostColumnVector.EventHandler for spillability checks (#13898) @abellina
Return Timestamp & Timedelta for fetching scalars in DatetimeIndex & TimedeltaIndex (#13896) @galipremsagar
Allow explicit shuffle="p2p" within dask-cudf API (#13893) @rjzamora
Disable creation of DatetimeIndex when freq is passed to cudf.date_range (#13890) @galipremsagar
Bring parity with pandas for datetime & timedelta comparison operations (#13877) @galipremsagar
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Raise error when astype(object) is called in pandas compatibility mode (#13862) @galipremsagar
Fixes a performance regression in FST (#13850) @elstehle
Set native handles to null on close in Java wrapper classes (#13818) @jlowe
Avoid use of CUDF_EXPECTS in libcudf unit tests outside of helper functions with return values (#13812) @vuule
Update lists::contains to experimental row comparator (#13810) @divyegala
Reduce lists::contains dispatches for scalars (#13805) @divyegala
Long string optimization for string column parsing in JSON reader (#13803) @karthikeyann
Raise NotImplementedError for pd.SparseDtype (#13798) @mroeschke
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Move Spark-independent Table debug to cudf Java (#13783) @gerashegalov
Update to Cython 3.0.0 (#13777) @vyasr
Refactor Parquet reader handling of V2 page header info (#13775) @etseidl
Branch 23.10 merge 23.08 (#13773) @vyasr
Restructure JSON code to correctly reflect legacy/experimental status (#13757) @vuule
Branch 23.10 merge 23.08 (#13753) @vyasr
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Refactors JSON reader's pushdown automaton (#13716) @elstehle
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

cudf - [NIGHTLY] v23.12.00

Published by rapids-bot[bot] about 1 year ago

🚨 Breaking Changes

Expose stream parameter in public strings convert APIs (#14255) @davidwendt

🐛 Bug Fixes

Downgrade to Arrow 12.0.0 for aws-sdk-cpp and fix cudf_kafka builds for new CI containers (#14296) @vyasr
Fix memset error in nvtext::edit_distance_matrix (#14283) @davidwendt
Changes JSON reader's recovery option's behaviour to ignore all characters after a valid JSON record (#14279) @elstehle
Handle empty string correctly in Parquet statistics (#14257) @etseidl
Fixes behaviour for incomplete lines when recover_with_nulls is enabled (#14252) @elstehle
cudf::detail::pinned_allocator doesn't throw from deallocate (#14251) @robertmaynard
Fix strings replace for adjacent, identical multi-byte UTF-8 character targets (#14235) @davidwendt
Fixing parquet list of struct interpretation (#13715) @hyperbolic2346

🚀 New Features

Update rapids-cmake functions to non-deprecated signatures (#14265) @robertmaynard
Expose streams in public null mask APIs (#14263) @vyasr
Expose streams in binaryop APIs (#14187) @vyasr
Add pylibcudf.Scalar that interoperates with Arrow scalars (#14133) @vyasr
Add DELTA_BINARY_PACKED encoder for Parquet writer (#14100) @etseidl

🛠️ Improvements

Forward-merge branch-23.10 to branch-23.12 (#14307) @bdice
Update shared-action-workflows references (#14289) @AyodeAwe
Add stream parameter to List Sort and Filter APIs (#14272) @SurajAralihalli
Use branch-23.12 workflows. (#14271) @bdice
Refactor LogicalType for Parquet (#14264) @etseidl
Centralize chunked reading code in the parquet reader to reader_impl_chunking.cu (#14262) @nvdbaranec
Expose stream parameter in public strings replace APIs (#14261) @davidwendt
Expose stream parameter in public strings APIs (#14260) @davidwendt
Cleanup of namespaces in parquet code. (#14259) @nvdbaranec
Make parquet schema index type consistent (#14256) @hyperbolic2346
Expose stream parameter in public strings convert APIs (#14255) @davidwendt
Add in java bindings for DataSource (#14254) @revans2
Add stream parameter to List Manipulation and Operations APIs (#14248) @SurajAralihalli
Expose stream parameter in public strings split/partition APIs (#14247) @davidwendt
Improve contains_column by invoking contains_table (#14238) @PointKernel
Detect and report errors in Parquet header parsing (#14237) @etseidl
Forward merge 23.10 into 23.12 (#14231) @galipremsagar
Return error if BOOL8 column-type is used with integers-to-hex (#14208) @davidwendt
Enable indexalator for device code (#14206) @davidwendt
Marginally reduce memory footprint of joins (#14197) @wence-
Add nvtx annotations to spilling-based data movement (#14196) @wence-
Remove the use of volatile in ORC (#14175) @vuule
Add bytes_per_second to distinct_count of stream_compaction nvbench. (#14172) @Blonck
Add bytes_per_second to transpose benchmark (#14170) @Blonck
cuDF: Build CUDA 12.0 ARM conda packages. (#14112) @bdice
Add bytes_per_second to shift benchmark (#13950) @Blonck

cudf - v23.08.00

Published by raydouglass about 1 year ago

🚨 Breaking Changes

Enforce deprecations and add clarifications around existing deprecations (#13710) @galipremsagar
Separate MurmurHash32 from hash_functions.cuh (#13681) @davidwendt
Avoid storing metadata in pointers in ORC and Parquet writers (#13648) @vuule
Expose streams in all public copying APIs (#13629) @vyasr
Remove deprecated cudf::strings::slice_strings (by delimiter) functions (#13628) @davidwendt
Remove deprecated cudf.set_allocator. (#13591) @bdice
Change build.sh to use pip install instead of setup.py (#13507) @vyasr
Remove unused max_rows_tensor parameter from subword tokenizer (#13463) @davidwendt
Fix decimal scale reductions in _get_decimal_type (#13224) @charlesbluca

🐛 Bug Fixes

Add CUDA version to cudf_kafka and libcudf-example build strings. (#13769) @bdice
Fix typo in wheels-test.yaml. (#13763) @bdice
Don't test strings shorter than the requested ngram size (#13758) @vyasr
Add CUDA version to custreamz build string. (#13754) @bdice
Fix writing of ORC files with empty child string columns (#13745) @vuule
Remove the erroneous "empty level" short-circuit from ORC reader (#13722) @vuule
Fix character counting when writing sliced tables into ORC (#13721) @vuule
Parquet uses row group row count if missing from header (#13712) @hyperbolic2346
Fix reading of RLE encoded boolean data from parquet files with V2 page headers (#13707) @etseidl
Fix a corner case of list lexicographic comparator (#13701) @ttnghia
Fix combined filtering and column projection in dask_cudf.read_parquet (#13697) @rjzamora
Revert fetch-rapids changes (#13696) @vyasr
Data generator - include offsets in the size estimate of list elements (#13688) @vuule
Add cuda-nvcc-impl to cudf for numba CUDA 12 (#13673) @jakirkham
Fix combined filtering and column projection in read_parquet (#13666) @rjzamora
Use thrust::identity as hash functions for byte pair encoding (#13665) @PointKernel
Fix loc-getitem ordering when index contains duplicate labels (#13659) @wence-
[REVIEW] Introduce parity with pandas for MultiIndex.loc ordering & fix a bug in Groupby with as_index (#13657) @galipremsagar
Fix memcheck error found in nvtext tokenize functions (#13649) @davidwendt
Fix has_nonempty_nulls ignoring column offset (#13647) @ttnghia
[Java] Avoid double-free corruption in case of an Exception while creating a ColumnView (#13645) @razajafri
Fix memcheck error in ORC reader call to cudf::io::copy_uncompressed_kernel (#13643) @davidwendt
Fix CUDA 12 conda environment to remove cubinlinker and ptxcompiler. (#13636) @bdice
Fix inf/NaN comparisons for FLOAT orderby in window functions (#13635) @mythrocks
Refactor Index search to simplify code and increase correctness (#13625) @wence-
Fix compile warning for unused variable in split_re.cu (#13621) @davidwendt
Fix tz_localize for dask_cudf Series (#13610) @shwina
Fix issue with no decompressed data in ORC reader (#13609) @vuule
Fix floating point window range extents. (#13606) @mythrocks
Fix localize(None) for timezone-naive columns (#13603) @shwina
Fixed a memory leak caused by Exception thrown while constructing a ColumnView (#13597) @razajafri
Handle nullptr return value from bitmask_or in distinct_count (#13590) @wence-
Bring parity with pandas in Index.join (#13589) @galipremsagar
Fix cudf.melt when there are more than 255 columns (#13588) @hcho3
Fix memory issues in cuIO due to removal of memory padding (#13586) @ttnghia
Fix Parquet multi-file reading (#13584) @etseidl
Fix memcheck error found in LISTS_TEST (#13579) @davidwendt
Fix memcheck error found in STRINGS_TEST (#13578) @davidwendt
Fix memcheck error found in INTEROP_TEST (#13577) @davidwendt
Fix memcheck errors found in REDUCTION_TEST (#13574) @davidwendt
Preemptive fix for hive-partitioning change in dask (#13564) @rjzamora
Fix an issue with dask_cudf.read_csv when lines are needed to be skipped (#13555) @galipremsagar
Fix out-of-bounds memory write in cudf::dictionary::detail::concatenate (#13554) @davidwendt
Fix the null mask size in json reader (#13537) @karthikeyann
Fix cudf::strings::strip for all-empty input column (#13533) @davidwendt
Make sure to build without isolation or installing dependencies (#13524) @vyasr
Remove preload lib from CMake for now (#13519) @vyasr
Fix missing separator after null values in JSON writer (#13503) @karthikeyann
Ensure single_lane_block_sum_reduce is safe to call in a loop (#13488) @wence-
Update all versions in pyproject.toml files. (#13486) @bdice
Remove applying nvbench that doesn't exist in 23.08 (#13484) @robertmaynard
Fix chunked Parquet reader benchmark (#13482) @vuule
Update JNI JSON reader column compatibility for Spark (#13477) @revans2
Fix unsanitized output of scan with strings (#13455) @davidwendt
Reject functions without bytecode from _can_be_jitted in GroupBy Apply (#13429) @brandon-b-miller
Fix decimal scale reductions in _get_decimal_type (#13224) @charlesbluca

📖 Documentation

Fix doxygen groups for io data sources and sinks (#13718) @davidwendt
Add pandas compatibility note to DataFrame.query docstring (#13693) @beckernick
Add pylibcudf to developer guide (#13639) @vyasr
Fix repeated words in doxygen text (#13598) @karthikeyann
Update docs for top-level API. (#13592) @bdice
Fix the doxygen text for cudf::concatenate and other places (#13561) @davidwendt
Document stream validation approach used in testing (#13556) @vyasr
Cleanup doc repetitions in libcudf (#13470) @karthikeyann

🚀 New Features

Support min and max aggregations for list type in groupby and reduction (#13676) @ttnghia
Add nvtext::jaccard_index API for strings columns (#13669) @davidwendt
Add read_parquet_metadata libcudf API (#13663) @karthikeyann
Expose streams in all public copying APIs (#13629) @vyasr
Add XXHash_64 hash function to cudf (#13612) @davidwendt
Java support: Floating point order-by columns for RANGE window functions (#13595) @mythrocks
Use cuco::static_map to build string dictionaries in ORC writer (#13580) @vuule
Add pylibcudf subpackage with gather implementation (#13562) @vyasr
Add JNI for lists::concatenate_list_elements (#13547) @ttnghia
Enable nested types for lists::concatenate_list_elements (#13545) @ttnghia
Add unicode encoding for string columns in JSON writer (#13539) @karthikeyann
Remove numba kernels from find_index_of_val (#13517) @brandon-b-miller
Floating point order-by columns for RANGE window functions (#13512) @mythrocks
Parse column chunk metadata statistics in parquet reader (#13472) @karthikeyann
Add abs function to apply (#13408) @brandon-b-miller
[FEA] AST filtering in parquet reader (#13348) @karthikeyann
[FEA] Adds option to recover from invalid JSON lines in JSON tokenizer (#13344) @elstehle
Ensure cccl packages don't clash with upstream version (#13235) @robertmaynard
Update struct_minmax_util to experimental row comparator (#13069) @divyegala
Add stream parameter to hashing APIs (#12090) @vyasr

🛠️ Improvements

Pin dask and distributed for 23.08 release (#13802) @galipremsagar
Relax protobuf pinnings. (#13770) @bdice
Switch fully unbounded window functions to use aggregations (#13727) @mythrocks
Switch to new wheel building pipeline (#13723) @vyasr
Revert CUDA 12.0 CI workflows to branch-23.08. (#13719) @bdice
Adding identify minimum version requirement (#13713) @hyperbolic2346
Enforce deprecations and add clarifications around existing deprecations (#13710) @galipremsagar
Optimize ORC reader performance for list data (#13708) @vyasr
fix limit overflow message in a docstring (#13703) @ahmet-uyar
Alleviates JSON parser's need for multi-file sources to end with a newline (#13702) @elstehle
Update cython-lint and replace flake8 with ruff (#13699) @vyasr
Add __dask_tokenize__ definitions to cudf classes (#13695) @rjzamora
Convert libcudf hashing benchmarks to nvbench (#13694) @davidwendt
Separate MurmurHash32 from hash_functions.cuh (#13681) @davidwendt
Improve performance of cudf::strings::split on whitespace (#13680) @davidwendt
Allow ORC and Parquet writers to write nullable columns without nulls as non-nullable (#13675) @vuule
Raise a NotImplementedError in to_datetime when utc is passed (#13670) @shwina
Add rmm_mode parameter to nvbench base fixture (#13668) @davidwendt
Fix multiindex loc ordering in pandas-compat mode (#13660) @wence-
Add nvtext hash_character_ngrams function (#13654) @davidwendt
Avoid storing metadata in pointers in ORC and Parquet writers (#13648) @vuule
Acquire spill lock in to/from_arrow (#13646) @shwina
Expose stable versions of libcudf sort routines (#13634) @wence-
Separate out hash_test.cpp source for each hash API (#13633) @davidwendt
Remove deprecated cudf::strings::slice_strings (by delimiter) functions (#13628) @davidwendt
Create separate libcudf hash APIs for each supported hash function (#13626) @davidwendt
Add convert_dtypes API (#13623) @shwina
Clean up cupy in dependencies.yaml. (#13617) @bdice
Use cuda-version to constrain cudatoolkit. (#13615) @bdice
Add murmurhash3_x64_128 function to libcudf (#13604) @davidwendt
Performance improvement for cudf::strings::like (#13594) @davidwendt
Remove deprecated cudf.set_allocator. (#13591) @bdice
Clean up cudf device atomic with cuda::atomic_ref (#13583) @PointKernel
Add java bindings for distinct count (#13573) @revans2
Use nvcomp conda package. (#13566) @bdice
Add exception to string_scalar if input string exceeds size_type (#13560) @davidwendt
Add dispatch for cudf.Dataframe to/from pyarrow.Table conversion (#13558) @rjzamora
Get rid of cuco::pair_type aliases (#13553) @PointKernel
Introduce parity with pandas when sort=False in Groupby (#13551) @galipremsagar
Update CMake in docker to 3.26.4 (#13550) @NvTimLiu
Clarify source of error message in stream testing. (#13541) @bdice
Deprecate strings_to_categorical in cudf.read_parquet (#13540) @galipremsagar
Update to CMake 3.26.4 (#13538) @vyasr
s3 folder naming fix (#13536) @AyodeAwe
Implement iloc-getitem using parse-don't-validate approach (#13534) @wence-
Make synchronization explicit in the names of hostdevice_* copying APIs (#13530) @ttnghia
Add benchmark (Google Benchmark) dependency to conda packages. (#13528) @bdice
Add libcufile to dependencies.yaml. (#13523) @bdice
Fix some memoization logic in groupby/sort/sort_helper.cu (#13521) @davidwendt
Use sizes_to_offsets_iterator in cudf::gather for strings (#13520) @davidwendt
use rapids-upload-docs script (#13518) @AyodeAwe
Support UTF-8 BOM in CSV reader (#13516) @davidwendt
Move stream-related test configuration to CMake (#13513) @vyasr
Implement cudf.option_context (#13511) @galipremsagar
Unpin dask and distributed for development (#13508) @galipremsagar
Change build.sh to use pip install instead of setup.py (#13507) @vyasr
Use test default stream (#13506) @vyasr
Remove documentation build scripts for Jenkins (#13495) @ajschmidt8
Use east const in include files (#13494) @karthikeyann
Use east const in src files (#13493) @karthikeyann
Use east const in tests files (#13492) @karthikeyann
Use east const in benchmarks files (#13491) @karthikeyann
Performance improvement for nvtext tokenize/token functions (#13480) @davidwendt
Add pd.Float*Dtype to Avro and ORC mappings (#13475) @mroeschke
Use pandas public APIs where available (#13467) @mroeschke
Allow pd.ArrowDtype in cudf.from_pandas (#13465) @mroeschke
Rework libcudf regex benchmarks with nvbench (#13464) @davidwendt
Remove unused max_rows_tensor parameter from subword tokenizer (#13463) @davidwendt
Separate io-text and nvtext pytests into different files (#13435) @davidwendt
Add a move_to function to cudf::string_view::const_iterator (#13428) @davidwendt
Allow newer scikit-build (#13424) @vyasr
Refactor sort_by_values to sort_values, drop indices from return values. (#13419) @bdice
Inline Cython exception handler (#13411) @vyasr
Init JNI version 23.08.0-SNAPSHOT (#13401) @pxLi
Refactor ORC reader (#13396) @ttnghia
JNI: Remove cleaned objects in memory cleaner (#13378) @res-life
Add tests of currently unsupported indexing (#13338) @wence-
Performance improvement for some libcudf regex functions for long strings (#13322) @davidwendt
Exposure Tracked Buffer (first step towards unifying copy-on-write and spilling) (#13307) @madsbk
Write string data directly to column_buffer in Parquet reader (#13302) @etseidl
Add stacktrace into cudf exception types (#13298) @ttnghia
cuDF: Build CUDA 12 packages (#12922) @bdice

cudf - [NIGHTLY] v23.10.00

Published by rapids-bot[bot] about 1 year ago

🚨 Breaking Changes

Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Create table_input_metadata from a table_metadata (#13920) @etseidl
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Update to Cython 3.0.0 (#13777) @vyasr
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

🐛 Bug Fixes

Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
Fix DataFrame.values with no columns but index (#14134) @mroeschke
Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
Add support for nested dict in DataFrame constructor (#14119) @galipremsagar
Restrict iterables of DataFrame's as input to DataFrame constructor (#14118) @galipremsagar
Allow numeric_only=True for reduction operations on numeric types (#14111) @galipremsagar
Drop kwargs from Series.count (#14106) @galipremsagar
Fix naming issues with Index.to_frame and MultiIndex.to_frame APIs (#14105) @galipremsagar
Only use memory resources that haven't been freed (#14103) @robertmaynard
Add support for __round__ in Series and DataFrame (#14099) @galipremsagar
Validate ignore_index type in drop_duplicates (#14098) @mroeschke
Fix renaming Series and Index (#14080) @galipremsagar
Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
Use conda mambabuild rather than mamba mambabuild (#14067) @wence-
Raise NotImplementedError in to_datetime with dayfirst without infer_format (#14058) @mroeschke
Fix various issues in Index.intersection (#14054) @galipremsagar
Fix Index.difference to match with pandas (#14053) @galipremsagar
Fix empty string column construction (#14052) @galipremsagar
Fix IntervalIndex.union to preserve type-metadata (#14051) @galipremsagar
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Ignore compile_commands.json (#14048) @harrism
Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
Implement sort_remaining for sort_index (#14033) @wence-
Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
Temporary fix Parquet metadata with empty value string being ignored from writing (#14026) @ttnghia
Preserve types of scalar being returned when possible in quantile (#14014) @galipremsagar
Fix return type of MultiIndex.difference (#14009) @galipremsagar
Raise an error when timezone subtypes are encountered in pd.IntervalDtype (#14006) @galipremsagar
Fix map column can not be non-nullable for java (#14003) @res-life
Fix name selection in Index.difference and Index.intersection (#13986) @galipremsagar
Restore column type metadata with dropna to fix factorize API (#13980) @galipremsagar
Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
Fix MultiIndex.to_numpy to return numpy array with tuples (#13966) @galipremsagar
Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
Fix an issue with IntervalIndex.repr when null values are present (#13958) @galipremsagar
Fix type metadata issue preservation with Column.unique (#13957) @galipremsagar
Handle Interval scalars when passed in list-like inputs to cudf.Index (#13956) @galipremsagar
Fix setting of categories order when dtype is passed to a CategoricalColumn (#13955) @galipremsagar
Handle as_index in GroupBy.apply (#13951) @brandon-b-miller
Raise error for string types in nsmallest and nlargest (#13946) @galipremsagar
Fix index of Groupby.apply results when it is performed on empty objects (#13944) @galipremsagar
Fix integer overflow in shim device_sum functions (#13943) @brandon-b-miller
Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
Fix construction of Grouping objects (#13932) @galipremsagar
Fix an issue with loc when column names is MultiIndex (#13929) @galipremsagar
Fix handling of typecasting in searchsorted (#13925) @galipremsagar
Preserve index name in reindex (#13917) @galipremsagar
Use cudf::thread_index_type in cuIO to prevent overflow in row indexing (#13910) @vuule
Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
Use cudf::thread_index_type in replace.cu. (#13905) @bdice
Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
Use thread_index_type to avoid index overflow in grid-stride loops (#13895) @PointKernel
Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
Raise error when trying to construct a DataFrame with mixed types (#13889) @galipremsagar
Return nan when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
Correctly detect the BOM mark in read_csv with compressed input (#13881) @vuule
Check for the presence of all values in MultiIndex.isin (#13879) @galipremsagar
Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
Fix return type of MultiIndex.levels (#13870) @galipremsagar
Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
Disable construction of Index when freq is set in pandas-compatibility mode (#13857) @galipremsagar
Fix an issue with fetching NA from a TimedeltaColumn (#13853) @galipremsagar
Simplify implementation of interval_range() and fix behaviour for floating freq (#13844) @shwina
Fix binary operations between Series and Index (#13842) @galipremsagar
Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
Fix read out of bounds in string concatenate (#13838) @pentschev
Raise error for more cases when timezone-aware data is passed to as_column (#13835) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
Fix cuFile I/O factories (#13829) @vuule
DataFrame with namedtuples uses ._field as column names (#13824) @mroeschke
Branch 23.10 merge 23.08 (#13822) @vyasr
Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
Raise error when mixed types are being constructed (#13816) @galipremsagar
Fix unbounded sequence issue in DataFrame constructor (#13811) @galipremsagar
Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
Update get_arrow to arrows 12 CMake target name of arrow::xsimd (#13790) @robertmaynard
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Fix negative unary operation for boolean type (#13780) @galipremsagar
Fix contains(in) method for Series (#13779) @galipremsagar
Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
Cast only time of day to nanos to avoid an overflow in Parquet INT96 write (#13776) @gerashegalov
Preserve names of column object in various APIs (#13772) @galipremsagar
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Fix construction of DataFrames from dict when columns are provided (#13766) @wence-
Provide our own Cython declaration for make_unique (#13746) @wence-

📖 Documentation

Fix typo in docstring: metadata. (#14025) @bdice
Fix typo in parquet/page_decode.cuh (#13849) @XinyuZeng
Simplify Python doc configuration (#13826) @vyasr
Update documentation to reflect recent changes in JSON reader and writer (#13791) @vuule
Fix all warnings in Python docs (#13789) @vyasr

🚀 New Features

Implement GroupBy.value_counts to match pandas API (#14114) @stmio
Refactor parquet thrift reader (#14097) @etseidl
Refactor hash_reduce_by_row (#14095) @ttnghia
Support negative preceding/following for ROW window functions (#14093) @mythrocks
Expose streams in public search APIs (#14034) @vyasr
Expose streams in public replace APIs (#14010) @vyasr
Add stream parameter to public cudf::strings::split APIs (#13997) @davidwendt
Expose streams in public filling APIs (#13990) @vyasr
Expose streams in public concatenate APIs (#13987) @vyasr
Use HostMemoryAllocator in jni::allocate_host_buffer (#13975) @gerashegalov
Enable fractional null probability for hashing benchmark (#13967) @Blonck
Switch pylibcudf-enabled types to use enum class in Cython (#13931) @vyasr
Rewrite DataFrame.stack to support multi level column names (#13927) @isVoid
Add HostMemoryAllocator interface (#13924) @gerashegalov
Global stream pool (#13922) @etseidl
Create table_input_metadata from a table_metadata (#13920) @etseidl
Translate column size overflow exception to JNI (#13911) @mythrocks
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Exclude some tests from running with the compute sanitizer (#13872) @firestarman
Expand statistics support in ORC writer (#13848) @vuule
Register the memory mapped buffer in datasource to improve H2D throughput (#13814) @vuule
Add cudf::strings::find function with target per row (#13808) @davidwendt
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove unnecessary pointer copying in JIT GroupBy Apply (#13792) @brandon-b-miller
Add 'poll' function to custreamz kafka consumer (#13782) @jdye64
Support corr in GroupBy.apply through the jit engine (#13767) @shwina
Optionally write version 2 page headers in Parquet writer (#13751) @etseidl
Support more numeric types in Groupby.apply with engine='jit' (#13729) @brandon-b-miller
[FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader (#13637) @etseidl
Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader (#13437) @PointKernel

🛠️ Improvements

Reduce memory usage of as_categorical_column (#14138) @wence-
Update to clang 16.0.6. (#14120) @bdice
Fix type of empty Index and raise warning in Series constructor (#14116) @galipremsagar
Add fallback matrix for nvcomp. (#14082) @bdice
[Java] Add recoverWithNull to JSONOptions and pass to Table.readJSON (#14078) @andygrove
Remove header tests (#14072) @ajschmidt8
Remove debug print in a Parquet test (#14063) @vuule
Expose stream parameter in public strings find APIs (#14060) @davidwendt
Update doxygen to 1.9.1 (#14059) @vyasr
Remove the mr from the base fixture (#14057) @vyasr
Expose streams in public strings case APIs (#14056) @davidwendt
Use cudf::make_empty_column instead of column_view constructor (#14030) @davidwendt
Remove quadratic runtime due to accessing Frame._dtypes in loop (#14028) @wence-
Explicitly depend on zlib in conda recipes (#14018) @wence-
Use grid_stride for stride computations. (#13996) @bdice
Fix an issue where casting null-array to object dtype will result in a failure (#13994) @galipremsagar
Add tab as literal to cudf::test::to_string output (#13993) @davidwendt
Enable codes dtype parity in pandas-compatibility mode for factorize API (#13982) @galipremsagar
Fix CategoricalIndex ordering in Groupby.agg when pandas-compatibility mode is enabled (#13978) @galipremsagar
Produce a fatal error if cudf is unable to find pyarrow include directory (#13976) @cwharris
Use thread_index_type in partitioning.cu (#13973) @divyegala
Use cudf::thread_index_type in merge.cu (#13972) @divyegala
Use copy-pr-bot (#13970) @ajschmidt8
Use cudf::thread_index_type in strings custom kernels (#13968) @davidwendt
Add bytes_per_second to hash_partition benchmark (#13965) @Blonck
Added pinned pool reservation API for java (#13964) @revans2
Simplify wheel build scripts and allow alphas of RAPIDS dependencies (#13963) @vyasr
Add bytes_per_second to copy_if_else benchmark (#13960) @Blonck
Add pandas compatible output to Series.unique (#13959) @galipremsagar
Add bytes_per_second to compiled binaryop benchmark (#13938) @Blonck
Unpin dask and distributed for 23.10 development (#13935) @galipremsagar
Make HostColumnVector.getRefCount public (#13934) @abellina
Use cuco::static_set in JSON tree algorithm (#13928) @karthikeyann
Add java API to get size of host memory needed to copy column view (#13919) @revans2
Use cudf::size_type instead of int32 where appropriate in nvtext functions (#13915) @davidwendt
Enable hugepage for arrow host allocations (#13914) @madsbk
Improve performance of nvtext::edit_distance (#13912) @davidwendt
Ensure cudf internals use pylibcudf in pure Python mode (#13909) @vyasr
Use empty() instead of size() where possible (#13908) @vuule
[JNI] Adds HostColumnVector.EventHandler for spillability checks (#13898) @abellina
Return Timestamp & Timedelta for fetching scalars in DatetimeIndex & TimedeltaIndex (#13896) @galipremsagar
Disable creation of DatetimeIndex when freq is passed to cudf.date_range (#13890) @galipremsagar
Bring parity with pandas for datetime & timedelta comparison operations (#13877) @galipremsagar
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Raise error when astype(object) is called in pandas compatibility mode (#13862) @galipremsagar
Fixes a performance regression in FST (#13850) @elstehle
Set native handles to null on close in Java wrapper classes (#13818) @jlowe
Avoid use of CUDF_EXPECTS in libcudf unit tests outside of helper functions with return values (#13812) @vuule
Update lists::contains to experimental row comparator (#13810) @divyegala
Reduce lists::contains dispatches for scalars (#13805) @divyegala
Long string optimization for string column parsing in JSON reader (#13803) @karthikeyann
Raise NotImplementedError for pd.SparseDtype (#13798) @mroeschke
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Move Spark-independent Table debug to cudf Java (#13783) @gerashegalov
Update to Cython 3.0.0 (#13777) @vyasr
Refactor Parquet reader handling of V2 page header info (#13775) @etseidl
Branch 23.10 merge 23.08 (#13773) @vyasr
Restructure JSON code to correctly reflect legacy/experimental status (#13757) @vuule
Branch 23.10 merge 23.08 (#13753) @vyasr
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Refactors JSON reader's pushdown automaton (#13716) @elstehle
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

cudf - [NIGHTLY] v23.06.00

Published by rapids-bot[bot] over 1 year ago

🚨 Breaking Changes

Fix batch processing for parquet writer (#13438) @ttnghia
Use <NA> instead of null to match pandas. (#13415) @bdice
Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
Remove null mask and null count from column_view constructors (#13311) @vyasr
Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
Update minimum Python version to Python 3.9 (#13196) @shwina
Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
Cleanup Parquet chunked writer (#13094) @ttnghia
Cleanup ORC chunked writer (#13091) @ttnghia
Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
Remove deprecated regex functions from libcudf (#13067) @davidwendt
[REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🐛 Bug Fixes

Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
Fix writing of ORC files with empty rowgroups (#13466) @vuule
Fix cudf::repeat logic when count is zero (#13459) @davidwendt
Fix batch processing for parquet writer (#13438) @ttnghia
Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
Use <NA> instead of null to match pandas. (#13415) @bdice
Fix tokenize with non-space delimiter (#13403) @shwina
Fix groupby head/tail for empty dataframe (#13398) @shwina
Default to closed="right" in IntervalIndex constructor (#13394) @shwina
Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
Fix unused argument errors in nvcc 11.5 (#13387) @abellina
Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
Fix page size estimation in Parquet writer (#13364) @etseidl
Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
Support gcc 12 as the C++ compiler (#13316) @robertmaynard
Correctly set bitmask size in from_column_view (#13315) @wence-
Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
Fix parquet schema interpretation issue (#13277) @hyperbolic2346
Fix 64bit shift bug in avro reader (#13276) @karthikeyann
Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
Clean up buffers in case of AssertionError (#13262) @razajafri
Allow empty input table in ast compute_column (#13245) @wence-
Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
Fix the row index stream order in ORC reader (#13242) @vuule
Make is_decompression_disabled and is_compression_disabled thread-safe (#13240) @vuule
Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
Fix race in ORC string dictionary creation (#13214) @revans2
Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
Fix hostdevice_vector::subspan (#13187) @ttnghia
Use custom nvbench entry point to ensure cudf::nvbench_base_fixture usage (#13183) @robertmaynard
Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
Fix a few clang-format style check errors (#13146) @davidwendt
[REVIEW] Fix Series and DataFrame constructors to validate index lengths (#13122) @galipremsagar
Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
Adds checks to make sure json reader won't overflow (#13115) @elstehle
Fix null_count of columns returned by chunked_parquet_reader (#13111) @vuule
Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
[REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
Fix column selection read_parquet benchmarks (#13082) @vuule
Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
Add algorithm include in data_sink.hpp (#13068) @ahendriksen
Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
Prevent overflow with skip_rows in ORC and Parquet readers (#13063) @vuule
Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
[REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
Fix read_avro() skip_rows and num_rows. (#12912) @tpn
Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina

🚀 New Features

Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
Use _compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
cuDF numba cuda 12 updates (#13337) @brandon-b-miller
Add tz_convert method to convert between timestamps (#13328) @shwina
Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
Support the case=False argument to str.contains (#13290) @shwina
Add an event handler for ColumnVector.close (#13279) @abellina
JNI api for cudf::chunked_pack (#13278) @abellina
Implement a chunked_pack API (#13260) @abellina
Update cudf recipes to use GTest version >=1.13 (#13207) @robertmaynard
JNI changes for range-extents in window functions. (#13199) @mythrocks
Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
Add IS_NULL operator to AST (#13145) @karthikeyann
STRING order-by column for RANGE window functions (#13143) @mythrocks
Update contains_table to experimental row hasher and equality comparator (#13119) @divyegala
Automatically select GroupBy.apply algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
Refactor Parquet chunked writer (#13076) @ttnghia
Add Python bindings for string literal support in AST (#13073) @karthikeyann
Add Java bindings for string literal support in AST (#13072) @karthikeyann
Add string scalar support in AST (#13061) @karthikeyann
Log cuIO warnings using the libcudf logger (#13043) @vuule
Update mixed_join to use experimental row hasher and comparator (#13028) @divyegala
Support structs of lists in row lexicographic comparator (#13005) @ttnghia
Adding hostdevice_span that is a span creatable from hostdevice_vector (#12981) @hyperbolic2346
Add nvtext::minhash function (#12961) @davidwendt
Support lists of structs in row lexicographic comparator (#12953) @ttnghia
Update join to use experimental row hasher and comparator (#12787) @divyegala
Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🛠️ Improvements

Bump typing_extensions minimum version to 4.0.0 (#13618) @shwina
Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
Handle some corner-cases in indexing with boolean masks (#13402) @wence-
Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
[JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
Fix JNI method with mismatched parameter list (#13384) @ttnghia
Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
Move some nvtext benchmarks to nvbench (#13368) @davidwendt
run docs nightly too (#13366) @AyodeAwe
Add warning for default dtype parameter in get_dummies (#13365) @galipremsagar
Add log messages about kvikIO compatibility mode (#13363) @vuule
Switch back to using primary shared-action-workflows branch (#13362) @vyasr
Deprecate StringIndex and use Index instead (#13361) @galipremsagar
Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
Expunge most uses of TypeVar(bound="Foo") (#13346) @wence-
Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
Improve distinct_count with cuco::static_set (#13343) @PointKernel
Fix contiguous_split performance (#13342) @ttnghia
Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
Update mypy to 1.3 (#13340) @wence-
[Java] Purge non-empty nulls when setting validity (#13335) @razajafri
Add row-wise filtering step to read_parquet (#13334) @rjzamora
Performance improvement for nvtext::minhash (#13333) @davidwendt
Fix some libcudf functions to set the null count on returning columns (#13331) @davidwendt
Change cudf::detail::concatenate_masks to return null-count (#13330) @davidwendt
Move meta calculation in dask_cudf.read_parquet (#13327) @rjzamora
Changes to support Numpy >= 1.24 (#13325) @shwina
Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
Clean up distinct_count benchmark (#13321) @PointKernel
Fix gtest pinning to 1.13.0. (#13319) @bdice
Remove null mask and null count from column_view constructors (#13311) @vyasr
Address feedback from 13289 (#13306) @vyasr
Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
First check for BaseDtype when inferring the data type of an arbitrary object (#13295) @shwina
Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
Support CUDA 12.0 for pip wheels (#13289) @divyegala
Refactor transform_lists_of_structs in row_operators.cu (#13288) @ttnghia
Branch 23.06 merge 23.04 (#13286) @vyasr
Update cupy dependency (#13284) @vyasr
Performance improvement in cudf::strings::join_strings for long strings (#13283) @davidwendt
Fix unused variables and functions (#13275) @karthikeyann
Fix integer overflow in partition scatter_map construction (#13272) @wence-
Numba 0.57 compatibility fixes (#13271) @gmarkall
Performance improvement in cudf::strings::all_characters_of_type (#13259) @davidwendt
Remove default null-count parameter from some libcudf factory functions (#13258) @davidwendt
Roll our own generate_string() because mimesis' has gone away (#13257) @shwina
Build wheels using new single image workflow (#13249) @vyasr
Enable sccache hits from local builds (#13248) @AyodeAwe
Revert to branch-23.06 for shared-action-workflows (#13247) @shwina
Introduce pandas_compatible option in cudf (#13241) @galipremsagar
Add metadata_builder helper class (#13232) @abellina
Use libkvikio conda packages in libcudf, add explicit libcufile dependency. (#13231) @bdice
Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
Performance improvement in cudf::strings::find/rfind for long strings (#13226) @davidwendt
Add chunked reader benchmark (#13223) @SrikarVanavasam
Set the null count in output columns in the CSV reader (#13221) @vuule
Skip Non-Empty nulls tests for the nightly build just like we skip CuFileTest and CudaFatalTest (#13213) @razajafri
Fix string_scalar stream usage in write_json.cu (#13212) @davidwendt
Use canonicalized name for dlopen'd libraries (libcufile) (#13210) @shwina
Refactor pinned memory vector and ORC+Parquet writers (#13206) @ttnghia
Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
Optimization to decoding of parquet level streams (#13203) @nvdbaranec
Clean up and simplify gpuDecideCompression (#13202) @vuule
Use std::array for a statically sized vector in create_serialized_trie (#13201) @vuule
Update minimum Python version to Python 3.9 (#13196) @shwina
Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
Remove usage of rapids-get-rapids-version-from-git (#13184) @jjacobelli
Enable mixed-dtype decimal/scalar binary operations (#13171) @shwina
Split up unique_count.cu to improve build time (#13169) @davidwendt
Use nvtx3 includes in string examples. (#13165) @bdice
Change some .cu gtest files to .cpp (#13155) @davidwendt
Remove wheel pytest verbosity (#13151) @sevagh
Fix libcudf to always pass null-count to set_null_mask (#13149) @davidwendt
Fix gtests to always pass null-count to set_null_mask calls (#13148) @davidwendt
Optimize JSON writer (#13144) @karthikeyann
Performance improvement for libcudf upper/lower conversion for long strings (#13142) @davidwendt
[REVIEW] Deprecate pad and backfill methods (#13140) @galipremsagar
Use CTAD instead of functions in ProtobufReader (#13135) @vuule
Remove more instances of UNKNOWN_NULL_COUNT (#13134) @vyasr
Update clang-format to 16.0.1. (#13133) @bdice
Add log messages about cuIO's nvCOMP and cuFile use (#13132) @vuule
Branch 23.06 merge 23.04 (#13131) @vyasr
Compute null-count in cudf::detail::slice (#13124) @davidwendt
Use ARC V2 self-hosted runners for GPU jobs (#13123) @jjacobelli
Set null-count in linked_column_view conversion operator (#13121) @davidwendt
Adding ifdefs around nvcc-specific pragmas (#13110) @hyperbolic2346
Add null-count parameter to json experimental parse_data utility (#13107) @davidwendt
Remove uses-setup-env-vars (#13105) @vyasr
Explicitly compute null count in concatenate APIs (#13104) @vyasr
Replace unnecessary uses of UNKNOWN_NULL_COUNT (#13102) @vyasr
Performance improvement for cudf::string_view::find functions (#13100) @davidwendt
Use .element() instead of .data() for window range calculations (#13095) @mythrocks
Cleanup Parquet chunked writer (#13094) @ttnghia
Fix unused variable error/warning in page_data.cu (#13093) @davidwendt
Cleanup ORC chunked writer (#13091) @ttnghia
Remove using namespace cudf; from libcudf gtests source (#13089) @davidwendt
Change cudf::test::make_null_mask to also return null-count (#13081) @davidwendt
Resolved automerger from branch-23.04 to branch-23.06 (#13080) @galipremsagar
Assert for non-empty nulls (#13071) @razajafri
Remove deprecated regex functions from libcudf (#13067) @davidwendt
Refactor cudf::detail::sorted_order (#13062) @ttnghia
Improve performance of slice_strings for long strings (#13057) @davidwendt
Reduce shared memory usage in gpuComputePageSizes by 50% (#13047) @nvdbaranec
[REVIEW] Add notes to performance comparisons notebook (#13044) @galipremsagar
Enable binary operations between scalars and columns of differing decimal types (#13034) @shwina
Remove console output from some libcudf gtests (#13027) @davidwendt
Remove underscore in build string. (#13025) @bdice
Bump up JNI version 23.06.0-SNAPSHOT (#13021) @pxLi
Fix auto merger from branch-23.04 to branch-23.06 (#13009) @galipremsagar
Reduce peak memory use when writing compressed ORC files. (#12963) @vuule
Add nvtx annotations to groupby methods (#12941) @wence-
Compute column sizes in Parquet preprocess with single kernel (#12931) @SrikarVanavasam
Add Python bindings for time zone data (TZiF) reader (#12826) @shwina
Optimize set-like operations (#12769) @ttnghia
[REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
Add empty test files for test reorganization (#12288) @shwina

cudf - v23.06.01

Published by raydouglass over 1 year ago

🚨 Breaking Changes

Fix batch processing for parquet writer (#13438) @ttnghia
Use <NA> instead of null to match pandas. (#13415) @bdice
Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
Remove null mask and null count from column_view constructors (#13311) @vyasr
Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
Update minimum Python version to Python 3.9 (#13196) @shwina
Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
Cleanup Parquet chunked writer (#13094) @ttnghia
Cleanup ORC chunked writer (#13091) @ttnghia
Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
Remove deprecated regex functions from libcudf (#13067) @davidwendt
[REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🐛 Bug Fixes

Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
Fix writing of ORC files with empty rowgroups (#13466) @vuule
Fix cudf::repeat logic when count is zero (#13459) @davidwendt
Fix batch processing for parquet writer (#13438) @ttnghia
Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
Use <NA> instead of null to match pandas. (#13415) @bdice
Fix tokenize with non-space delimiter (#13403) @shwina
Fix groupby head/tail for empty dataframe (#13398) @shwina
Default to closed="right" in IntervalIndex constructor (#13394) @shwina
Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
Fix unused argument errors in nvcc 11.5 (#13387) @abellina
Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
Fix page size estimation in Parquet writer (#13364) @etseidl
Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
Support gcc 12 as the C++ compiler (#13316) @robertmaynard
Correctly set bitmask size in from_column_view (#13315) @wence-
Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
Fix parquet schema interpretation issue (#13277) @hyperbolic2346
Fix 64bit shift bug in avro reader (#13276) @karthikeyann
Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
Clean up buffers in case of AssertionError (#13262) @razajafri
Allow empty input table in ast compute_column (#13245) @wence-
Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
Fix the row index stream order in ORC reader (#13242) @vuule
Make is_decompression_disabled and is_compression_disabled thread-safe (#13240) @vuule
Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
Fix race in ORC string dictionary creation (#13214) @revans2
Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
Fix hostdevice_vector::subspan (#13187) @ttnghia
Use custom nvbench entry point to ensure cudf::nvbench_base_fixture usage (#13183) @robertmaynard
Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
Fix a few clang-format style check errors (#13146) @davidwendt
[REVIEW] Fix Series and DataFrame constructors to validate index lengths (#13122) @galipremsagar
Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
Adds checks to make sure json reader won't overflow (#13115) @elstehle
Fix null_count of columns returned by chunked_parquet_reader (#13111) @vuule
Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
[REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
Fix column selection read_parquet benchmarks (#13082) @vuule
Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
Add algorithm include in data_sink.hpp (#13068) @ahendriksen
Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
Prevent overflow with skip_rows in ORC and Parquet readers (#13063) @vuule
Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
[REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
Fix read_avro() skip_rows and num_rows. (#12912) @tpn
Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina

🚀 New Features

Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
Use _compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
cuDF numba cuda 12 updates (#13337) @brandon-b-miller
Add tz_convert method to convert between timestamps (#13328) @shwina
Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
Support the case=False argument to str.contains (#13290) @shwina
Add an event handler for ColumnVector.close (#13279) @abellina
JNI api for cudf::chunked_pack (#13278) @abellina
Implement a chunked_pack API (#13260) @abellina
Update cudf recipes to use GTest version >=1.13 (#13207) @robertmaynard
JNI changes for range-extents in window functions. (#13199) @mythrocks
Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
Add IS_NULL operator to AST (#13145) @karthikeyann
STRING order-by column for RANGE window functions (#13143) @mythrocks
Update contains_table to experimental row hasher and equality comparator (#13119) @divyegala
Automatically select GroupBy.apply algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
Refactor Parquet chunked writer (#13076) @ttnghia
Add Python bindings for string literal support in AST (#13073) @karthikeyann
Add Java bindings for string literal support in AST (#13072) @karthikeyann
Add string scalar support in AST (#13061) @karthikeyann
Log cuIO warnings using the libcudf logger (#13043) @vuule
Update mixed_join to use experimental row hasher and comparator (#13028) @divyegala
Support structs of lists in row lexicographic comparator (#13005) @ttnghia
Adding hostdevice_span that is a span creatable from hostdevice_vector (#12981) @hyperbolic2346
Add nvtext::minhash function (#12961) @davidwendt
Support lists of structs in row lexicographic comparator (#12953) @ttnghia
Update join to use experimental row hasher and comparator (#12787) @divyegala
Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🛠️ Improvements

Bump typing_extensions minimum version to 4.0.0 (#13618) @shwina
Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
Handle some corner-cases in indexing with boolean masks (#13402) @wence-
Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
[JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
Fix JNI method with mismatched parameter list (#13384) @ttnghia
Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
Move some nvtext benchmarks to nvbench (#13368) @davidwendt
run docs nightly too (#13366) @AyodeAwe
Add warning for default dtype parameter in get_dummies (#13365) @galipremsagar
Add log messages about kvikIO compatibility mode (#13363) @vuule
Switch back to using primary shared-action-workflows branch (#13362) @vyasr
Deprecate StringIndex and use Index instead (#13361) @galipremsagar
Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
Expunge most uses of TypeVar(bound="Foo") (#13346) @wence-
Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
Improve distinct_count with cuco::static_set (#13343) @PointKernel
Fix contiguous_split performance (#13342) @ttnghia
Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
Update mypy to 1.3 (#13340) @wence-
[Java] Purge non-empty nulls when setting validity (#13335) @razajafri
Add row-wise filtering step to read_parquet (#13334) @rjzamora
Performance improvement for nvtext::minhash (#13333) @davidwendt
Fix some libcudf functions to set the null count on returning columns (#13331) @davidwendt
Change cudf::detail::concatenate_masks to return null-count (#13330) @davidwendt
Move meta calculation in dask_cudf.read_parquet (#13327) @rjzamora
Changes to support Numpy >= 1.24 (#13325) @shwina
Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
Clean up distinct_count benchmark (#13321) @PointKernel
Fix gtest pinning to 1.13.0. (#13319) @bdice
Remove null mask and null count from column_view constructors (#13311) @vyasr
Address feedback from 13289 (#13306) @vyasr
Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
First check for BaseDtype when inferring the data type of an arbitrary object (#13295) @shwina
Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
Support CUDA 12.0 for pip wheels (#13289) @divyegala
Refactor transform_lists_of_structs in row_operators.cu (#13288) @ttnghia
Branch 23.06 merge 23.04 (#13286) @vyasr
Update cupy dependency (#13284) @vyasr
Performance improvement in cudf::strings::join_strings for long strings (#13283) @davidwendt
Fix unused variables and functions (#13275) @karthikeyann
Fix integer overflow in partition scatter_map construction (#13272) @wence-
Numba 0.57 compatibility fixes (#13271) @gmarkall
Performance improvement in cudf::strings::all_characters_of_type (#13259) @davidwendt
Remove default null-count parameter from some libcudf factory functions (#13258) @davidwendt
Roll our own generate_string() because mimesis' has gone away (#13257) @shwina
Build wheels using new single image workflow (#13249) @vyasr
Enable sccache hits from local builds (#13248) @AyodeAwe
Revert to branch-23.06 for shared-action-workflows (#13247) @shwina
Introduce pandas_compatible option in cudf (#13241) @galipremsagar
Add metadata_builder helper class (#13232) @abellina
Use libkvikio conda packages in libcudf, add explicit libcufile dependency. (#13231) @bdice
Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
Performance improvement in cudf::strings::find/rfind for long strings (#13226) @davidwendt
Add chunked reader benchmark (#13223) @SrikarVanavasam
Set the null count in output columns in the CSV reader (#13221) @vuule
Skip Non-Empty nulls tests for the nightly build just like we skip CuFileTest and CudaFatalTest (#13213) @razajafri
Fix string_scalar stream usage in write_json.cu (#13212) @davidwendt
Use canonicalized name for dlopen'd libraries (libcufile) (#13210) @shwina
Refactor pinned memory vector and ORC+Parquet writers (#13206) @ttnghia
Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
Optimization to decoding of parquet level streams (#13203) @nvdbaranec
Clean up and simplify gpuDecideCompression (#13202) @vuule
Use std::array for a statically sized vector in create_serialized_trie (#13201) @vuule
Update minimum Python version to Python 3.9 (#13196) @shwina
Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
Remove usage of rapids-get-rapids-version-from-git (#13184) @jjacobelli
Enable mixed-dtype decimal/scalar binary operations (#13171) @shwina
Split up unique_count.cu to improve build time (#13169) @davidwendt
Use nvtx3 includes in string examples. (#13165) @bdice
Change some .cu gtest files to .cpp (#13155) @davidwendt
Remove wheel pytest verbosity (#13151) @sevagh
Fix libcudf to always pass null-count to set_null_mask (#13149) @davidwendt
Fix gtests to always pass null-count to set_null_mask calls (#13148) @davidwendt
Optimize JSON writer (#13144) @karthikeyann
Performance improvement for libcudf upper/lower conversion for long strings (#13142) @davidwendt
[REVIEW] Deprecate pad and backfill methods (#13140) @galipremsagar
Use CTAD instead of functions in ProtobufReader (#13135) @vuule
Remove more instances of UNKNOWN_NULL_COUNT (#13134) @vyasr
Update clang-format to 16.0.1. (#13133) @bdice
Add log messages about cuIO's nvCOMP and cuFile use (#13132) @vuule
Branch 23.06 merge 23.04 (#13131) @vyasr
Compute null-count in cudf::detail::slice (#13124) @davidwendt
Use ARC V2 self-hosted runners for GPU jobs (#13123) @jjacobelli
Set null-count in linked_column_view conversion operator (#13121) @davidwendt
Adding ifdefs around nvcc-specific pragmas (#13110) @hyperbolic2346
Add null-count parameter to json experimental parse_data utility (#13107) @davidwendt
Remove uses-setup-env-vars (#13105) @vyasr
Explicitly compute null count in concatenate APIs (#13104) @vyasr
Replace unnecessary uses of UNKNOWN_NULL_COUNT (#13102) @vyasr
Performance improvement for cudf::string_view::find functions (#13100) @davidwendt
Use .element() instead of .data() for window range calculations (#13095) @mythrocks
Cleanup Parquet chunked writer (#13094) @ttnghia
Fix unused variable error/warning in page_data.cu (#13093) @davidwendt
Cleanup ORC chunked writer (#13091) @ttnghia
Remove using namespace cudf; from libcudf gtests source (#13089) @davidwendt
Change cudf::test::make_null_mask to also return null-count (#13081) @davidwendt
Resolved automerger from branch-23.04 to branch-23.06 (#13080) @galipremsagar
Assert for non-empty nulls (#13071) @razajafri
Remove deprecated regex functions from libcudf (#13067) @davidwendt
Refactor cudf::detail::sorted_order (#13062) @ttnghia
Improve performance of slice_strings for long strings (#13057) @davidwendt
Reduce shared memory usage in gpuComputePageSizes by 50% (#13047) @nvdbaranec
[REVIEW] Add notes to performance comparisons notebook (#13044) @galipremsagar
Enable binary operations between scalars and columns of differing decimal types (#13034) @shwina
Remove console output from some libcudf gtests (#13027) @davidwendt
Remove underscore in build string. (#13025) @bdice
Bump up JNI version 23.06.0-SNAPSHOT (#13021) @pxLi
Fix auto merger from branch-23.04 to branch-23.06 (#13009) @galipremsagar
Reduce peak memory use when writing compressed ORC files. (#12963) @vuule
Add nvtx annotations to groupby methods (#12941) @wence-
Compute column sizes in Parquet preprocess with single kernel (#12931) @SrikarVanavasam
Add Python bindings for time zone data (TZiF) reader (#12826) @shwina
Optimize set-like operations (#12769) @ttnghia
[REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
Add empty test files for test reorganization (#12288) @shwina

cudf - v23.06.00

Published by raydouglass over 1 year ago

🚨 Breaking Changes

Fix batch processing for parquet writer (#13438) @ttnghia
Use <NA> instead of null to match pandas. (#13415) @bdice
Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
Remove null mask and null count from column_view constructors (#13311) @vyasr
Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
Update minimum Python version to Python 3.9 (#13196) @shwina
Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
Cleanup Parquet chunked writer (#13094) @ttnghia
Cleanup ORC chunked writer (#13091) @ttnghia
Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
Remove deprecated regex functions from libcudf (#13067) @davidwendt
[REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🐛 Bug Fixes

Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
Fix writing of ORC files with empty rowgroups (#13466) @vuule
Fix cudf::repeat logic when count is zero (#13459) @davidwendt
Fix batch processing for parquet writer (#13438) @ttnghia
Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
Use <NA> instead of null to match pandas. (#13415) @bdice
Fix tokenize with non-space delimiter (#13403) @shwina
Fix groupby head/tail for empty dataframe (#13398) @shwina
Default to closed="right" in IntervalIndex constructor (#13394) @shwina
Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
Fix unused argument errors in nvcc 11.5 (#13387) @abellina
Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
Fix page size estimation in Parquet writer (#13364) @etseidl
Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
Support gcc 12 as the C++ compiler (#13316) @robertmaynard
Correctly set bitmask size in from_column_view (#13315) @wence-
Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
Fix parquet schema interpretation issue (#13277) @hyperbolic2346
Fix 64bit shift bug in avro reader (#13276) @karthikeyann
Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
Clean up buffers in case AssertionError (#13262) @razajafri
Allow empty input table in ast compute_column (#13245) @wence-
Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
Fix the row index stream order in ORC reader (#13242) @vuule
Make is_decompression_disabled and is_compression_disabled thread-safe (#13240) @vuule
Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
Fix race in ORC string dictionary creation (#13214) @revans2
Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
Fix hostdevice_vector::subspan (#13187) @ttnghia
Use custom nvbench entry point to ensure cudf::nvbench_base_fixture usage (#13183) @robertmaynard
Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
Fix a few clang-format style check errors (#13146) @davidwendt
[REVIEW] Fix Series and DataFrame constructors to validate index lengths (#13122) @galipremsagar
Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
Adds checks to make sure json reader won't overflow (#13115) @elstehle
Fix null_count of columns returned by chunked_parquet_reader (#13111) @vuule
Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
[REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
Fix column selection read_parquet benchmarks (#13082) @vuule
Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
Add algorithm include in data_sink.hpp (#13068) @ahendriksen
Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
Prevent overflow with skip_rows in ORC and Parquet readers (#13063) @vuule
Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
[REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
Fix read_avro() skip_rows and num_rows. (#12912) @tpn
Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina

🚀 New Features

Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
Use _compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
cuDF numba cuda 12 updates (#13337) @brandon-b-miller
Add tz_convert method to convert between timestamps (#13328) @shwina
Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
Support the case=False argument to str.contains (#13290) @shwina
Add an event handler for ColumnVector.close (#13279) @abellina
JNI api for cudf::chunked_pack (#13278) @abellina
Implement a chunked_pack API (#13260) @abellina
Update cudf recipes to use GTest version >=1.13 (#13207) @robertmaynard
JNI changes for range-extents in window functions. (#13199) @mythrocks
Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
Add IS_NULL operator to AST (#13145) @karthikeyann
STRING order-by column for RANGE window functions (#13143) @mythrocks
Update contains_table to experimental row hasher and equality comparator (#13119) @divyegala
Automatically select GroupBy.apply algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
Refactor Parquet chunked writer (#13076) @ttnghia
Add Python bindings for string literal support in AST (#13073) @karthikeyann
Add Java bindings for string literal support in AST (#13072) @karthikeyann
Add string scalar support in AST (#13061) @karthikeyann
Log cuIO warnings using the libcudf logger (#13043) @vuule
Update mixed_join to use experimental row hasher and comparator (#13028) @divyegala
Support structs of lists in row lexicographic comparator (#13005) @ttnghia
Adding hostdevice_span that is a span createable from hostdevice_vector (#12981) @hyperbolic2346
Add nvtext::minhash function (#12961) @davidwendt
Support lists of structs in row lexicographic comparator (#12953) @ttnghia
Update join to use experimental row hasher and comparator (#12787) @divyegala
Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🛠️ Improvements

Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
Handle some corner-cases in indexing with boolean masks (#13402) @wence-
Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
[JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
Fix JNI method with mismatched parameter list (#13384) @ttnghia
Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
Move some nvtext benchmarks to nvbench (#13368) @davidwendt
Run docs nightly too (#13366) @AyodeAwe
Add warning for default dtype parameter in get_dummies (#13365) @galipremsagar
Add log messages about kvikIO compatibility mode (#13363) @vuule
Switch back to using primary shared-action-workflows branch (#13362) @vyasr
Deprecate StringIndex and use Index instead (#13361) @galipremsagar
Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
Expunge most uses of TypeVar(bound="Foo") (#13346) @wence-
Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
Improve distinct_count with cuco::static_set (#13343) @PointKernel
Fix contiguous_split performance (#13342) @ttnghia
Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
Update mypy to 1.3 (#13340) @wence-
[Java] Purge non-empty nulls when setting validity (#13335) @razajafri
Add row-wise filtering step to read_parquet (#13334) @rjzamora
Performance improvement for nvtext::minhash (#13333) @davidwendt
Fix some libcudf functions to set the null count on returning columns (#13331) @davidwendt
Change cudf::detail::concatenate_masks to return null-count (#13330) @davidwendt
Move meta calculation in dask_cudf.read_parquet (#13327) @rjzamora
Changes to support Numpy >= 1.24 (#13325) @shwina
Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
Clean up distinct_count benchmark (#13321) @PointKernel
Fix gtest pinning to 1.13.0. (#13319) @bdice
Remove null mask and null count from column_view constructors (#13311) @vyasr
Address feedback from 13289 (#13306) @vyasr
Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
First check for BaseDtype when inferring the data type of an arbitrary object (#13295) @shwina
Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
Support CUDA 12.0 for pip wheels (#13289) @divyegala
Refactor transform_lists_of_structs in row_operators.cu (#13288) @ttnghia
Branch 23.06 merge 23.04 (#13286) @vyasr
Update cupy dependency (#13284) @vyasr
Performance improvement in cudf::strings::join_strings for long strings (#13283) @davidwendt
Fix unused variables and functions (#13275) @karthikeyann
Fix integer overflow in partition scatter_map construction (#13272) @wence-
Numba 0.57 compatibility fixes (#13271) @gmarkall
Performance improvement in cudf::strings::all_characters_of_type (#13259) @davidwendt
Remove default null-count parameter from some libcudf factory functions (#13258) @davidwendt
Roll our own generate_string() because mimesis' has gone away (#13257) @shwina
Build wheels using new single image workflow (#13249) @vyasr
Enable sccache hits from local builds (#13248) @AyodeAwe
Revert to branch-23.06 for shared-action-workflows (#13247) @shwina
Introduce pandas_compatible option in cudf (#13241) @galipremsagar
Add metadata_builder helper class (#13232) @abellina
Use libkvikio conda packages in libcudf, add explicit libcufile dependency. (#13231) @bdice
Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
Performance improvement in cudf::strings::find/rfind for long strings (#13226) @davidwendt
Add chunked reader benchmark (#13223) @SrikarVanavasam
Set the null count in output columns in the CSV reader (#13221) @vuule
Skip Non-Empty nulls tests for the nightly build just like we skip CuFileTest and CudaFatalTest (#13213) @razajafri
Fix string_scalar stream usage in write_json.cu (#13212) @davidwendt
Use canonicalized name for dlopen'd libraries (libcufile) (#13210) @shwina
Refactor pinned memory vector and ORC+Parquet writers (#13206) @ttnghia
Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
Optimization to decoding of parquet level streams (#13203) @nvdbaranec
Clean up and simplify gpuDecideCompression (#13202) @vuule
Use std::array for a statically sized vector in create_serialized_trie (#13201) @vuule
Update minimum Python version to Python 3.9 (#13196) @shwina
Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
Remove usage of rapids-get-rapids-version-from-git (#13184) @jjacobelli
Enable mixed-dtype decimal/scalar binary operations (#13171) @shwina
Split up unique_count.cu to improve build time (#13169) @davidwendt
Use nvtx3 includes in string examples. (#13165) @bdice
Change some .cu gtest files to .cpp (#13155) @davidwendt
Remove wheel pytest verbosity (#13151) @sevagh
Fix libcudf to always pass null-count to set_null_mask (#13149) @davidwendt
Fix gtests to always pass null-count to set_null_mask calls (#13148) @davidwendt
Optimize JSON writer (#13144) @karthikeyann
Performance improvement for libcudf upper/lower conversion for long strings (#13142) @davidwendt
[REVIEW] Deprecate pad and backfill methods (#13140) @galipremsagar
Use CTAD instead of functions in ProtobufReader (#13135) @vuule
Remove more instances of UNKNOWN_NULL_COUNT (#13134) @vyasr
Update clang-format to 16.0.1. (#13133) @bdice
Add log messages about cuIO's nvCOMP and cuFile use (#13132) @vuule
Branch 23.06 merge 23.04 (#13131) @vyasr
Compute null-count in cudf::detail::slice (#13124) @davidwendt
Use ARC V2 self-hosted runners for GPU jobs (#13123) @jjacobelli
Set null-count in linked_column_view conversion operator (#13121) @davidwendt
Adding ifdefs around nvcc-specific pragmas (#13110) @hyperbolic2346
Add null-count parameter to json experimental parse_data utility (#13107) @davidwendt
Remove uses-setup-env-vars (#13105) @vyasr
Explicitly compute null count in concatenate APIs (#13104) @vyasr
Replace unnecessary uses of UNKNOWN_NULL_COUNT (#13102) @vyasr
Performance improvement for cudf::string_view::find functions (#13100) @davidwendt
Use .element() instead of .data() for window range calculations (#13095) @mythrocks
Cleanup Parquet chunked writer (#13094) @ttnghia
Fix unused variable error/warning in page_data.cu (#13093) @davidwendt
Cleanup ORC chunked writer (#13091) @ttnghia
Remove using namespace cudf; from libcudf gtests source (#13089) @davidwendt
Change cudf::test::make_null_mask to also return null-count (#13081) @davidwendt
Resolved automerger from branch-23.04 to branch-23.06 (#13080) @galipremsagar
Assert for non-empty nulls (#13071) @razajafri
Remove deprecated regex functions from libcudf (#13067) @davidwendt
Refactor cudf::detail::sorted_order (#13062) @ttnghia
Improve performance of slice_strings for long strings (#13057) @davidwendt
Reduce shared memory usage in gpuComputePageSizes by 50% (#13047) @nvdbaranec
[REVIEW] Add notes to performance comparisons notebook (#13044) @galipremsagar
Enable binary operations between scalars and columns of differing decimal types (#13034) @shwina
Remove console output from some libcudf gtests (#13027) @davidwendt
Remove underscore in build string. (#13025) @bdice
Bump up JNI version 23.06.0-SNAPSHOT (#13021) @pxLi
Fix auto merger from branch-23.04 to branch-23.06 (#13009) @galipremsagar
Reduce peak memory use when writing compressed ORC files. (#12963) @vuule
Add nvtx annotations to groupby methods (#12941) @wence-
Compute column sizes in Parquet preprocess with single kernel (#12931) @SrikarVanavasam
Add Python bindings for time zone data (TZiF) reader (#12826) @shwina
Optimize set-like operations (#12769) @ttnghia
[REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
Add empty test files for test reorganization (#12288) @shwina

cudf - v23.04.00

Published by raydouglass over 1 year ago

🚨 Breaking Changes

Pin dask and distributed for release (#13070) @galipremsagar
Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
Update minimum pandas and numpy pinnings (#12887) @galipremsagar
Deprecate names & dtype in Index.copy (#12825) @galipremsagar
Deprecate Index.is_* methods (#12820) @galipremsagar
Deprecate datetime_is_numeric from describe (#12818) @galipremsagar
Deprecate na_sentinel in factorize (#12817) @galipremsagar
Make string methods return a Series with a useful Index (#12814) @shwina
Produce useful guidance on overflow error in to_csv (#12705) @wence-
Move strings_udf code into cuDF (#12669) @brandon-b-miller
Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
Replace message parsing with throwing more specific exceptions (#12426) @vyasr

🐛 Bug Fixes

Fix memcheck script to execute only _TEST files found in bin/gtests/libcudf (#13006) @davidwendt
Fix DataFrame constructor to broadcast scalar inputs properly (#12997) @galipremsagar
Drop force_nullable_schema from chunked parquet writer (#12996) @galipremsagar
Fix gtest column utility comparator diff reporting (#12995) @davidwendt
Handle index names while performing groupby (#12992) @galipremsagar
Fix __setitem__ on string columns when the scalar value ends in a null byte (#12991) @wence-
Fix sort_values when column is all empty strings (#12988) @eriknw
Remove unused variable and fix memory issue in ORC writer (#12984) @ttnghia
Pre-emptive fix for upstream dask.dataframe.read_parquet changes (#12983) @rjzamora
Remove MANIFEST.in; use auto-generated one for sdists and package_data for wheels (#12960) @vyasr
Update to use rapids-export(COMPONENTS) feature. (#12959) @robertmaynard
cudftestutil supports static gtest dependencies (#12957) @robertmaynard
Include gtest in build environment. (#12956) @vyasr
Correctly handle scalar indices in Index.__getitem__ (#12955) @wence-
Avoid building cython twice (#12945) @galipremsagar
Fix set index error for Series rolling window operations (#12942) @galipremsagar
Fix calculation of null counts for Parquet statistics (#12938) @etseidl
Preserve integer dtype of hive-partitioned column containing nulls (#12930) @rjzamora
Use get_current_device_resource for intermediate allocations in COLLECT_LIST window code (#12927) @karthikeyann
Mark dlpack tensor deleter as noexcept to match PyCapsule_Destructor signature. (#12921) @bdice
Fix conda recipe post-link.sh typo (#12916) @pentschev
min_rows and num_rows are swapped in ComputePageSizes declaration in Parquet reader (#12886) @etseidl
Expect cupy to now support bool arrays for dlpack. (#12883) @vyasr
Use python -m pytest for nightly wheel tests (#12871) @bdice
Parquet writer column_size() should return a size_t (#12870) @etseidl
Fix cudf::hash_partition kernel launch error with decimal128 types (#12863) @davidwendt
Fix an issue with parquet chunked reader undercounting string lengths. (#12859) @nvdbaranec
Remove tokenizers pre-install pinning. (#12854) @vyasr
Fix parquet RangeIndex bug (#12838) @rjzamora
Remove KAFKA_HOST_TEST from compute-sanitizer check (#12831) @davidwendt
Make string methods return a Series with a useful Index (#12814) @shwina
Tell cudf_kafka to use header-only fmt (#12796) @vyasr
Add GroupBy.dtypes (#12783) @galipremsagar
Fix a leak in a test and clarify some test names (#12781) @revans2
Fix bug in all-null list due to join_list_elements special handling (#12767) @karthikeyann
Add try/except for expected null-schema error in read_parquet (#12756) @rjzamora
Throw an exception if an unsupported page encoding is detected in Parquet reader (#12754) @etseidl
Fix a bug with num_keys in _scatter_by_slice (#12749) @thomcom
Bump pinned rapids wheel deps to 23.4 (#12735) @sevagh
Rework logic in cudf::strings::split_record to improve performance (#12729) @davidwendt
Add always_nullable flag to Dremel encoding (#12727) @divyegala
Fix memcheck read error in compound segmented reduce (#12722) @davidwendt
Fix faulty conditional logic in JIT GroupBy.apply (#12706) @brandon-b-miller
Produce useful guidance on overflow error in to_csv (#12705) @wence-
Handle parquet list data corner case (#12698) @nvdbaranec
Fix missing trailing comma in json writer (#12688) @karthikeyann
Remove child from newCudaAsyncMemoryResource (#12681) @abellina
Handle bool types in round API (#12670) @galipremsagar
Ensure all of device bitmask is initialized in from_arrow (#12668) @wence-
Fix from_arrow to load a sliced arrow table (#12665) @galipremsagar
Fix dask-cudf read_parquet bug for multi-file aggregation (#12663) @rjzamora
Fix AllocateLikeTest gtests reading uninitialized null-mask (#12643) @davidwendt
Fix find_common_dtype and values to handle complex dtypes (#12537) @galipremsagar
Fix fetching of MultiIndex values when a label is passed (#12521) @galipremsagar
Fix Series comparison vs scalars (#12519) @brandon-b-miller
Allow casting from UDFString back to StringView to call methods in strings_udf (#12363) @brandon-b-miller

📖 Documentation

Fix GroupBy.apply doc examples rendering (#12994) @brandon-b-miller
Add Sphinx building and S3 uploading for dask-cudf docs (#12982) @quasiben
Add developer documentation forbidding default parameters in detail APIs (#12978) @vyasr
Add README symlink for dask-cudf. (#12946) @bdice
Remove return type from @return doxygen tags (#12908) @davidwendt
Fix docs build to be pydata-sphinx-theme=0.13.0 compatible (#12874) @galipremsagar
Add skeleton API and prose documentation for dask-cudf (#12725) @wence-
Enable doctests for GroupBy methods (#12658) @brandon-b-miller
Add comment about CUB patch for SegmentedSortInt.Bool gtest (#12611) @davidwendt

🚀 New Features

Add JNI method for strings::replace multi variety (#12979) @NVnavkumar
Add nunique aggregation support for cudf::segmented_reduce (#12972) @davidwendt
Refactor orc chunked writer (#12949) @ttnghia
Make Parquet writer nullable option applicable to single table writes (#12933) @vuule
Refactor io::orc::ProtobufWriter (#12877) @ttnghia
Make timezone table independent from ORC (#12805) @vuule
Cache JIT GroupBy.apply functions (#12802) @brandon-b-miller
Implement initial support for avro logical types (#6482) (#12788) @tpn
Update tests/column_utilities to use experimental::equality row comparator (#12777) @divyegala
Update distinct/unique_count to experimental::row hasher/comparator (#12776) @divyegala
Update hash_partition to use experimental::row::row_hasher (#12761) @divyegala
Update is_sorted to use experimental::row::lexicographic (#12752) @divyegala
Update default data source in cuio reader benchmarks (#12740) @PointKernel
Reenable stream identification library in CI (#12714) @vyasr
Add regex_program strings splitting java APIs and tests (#12713) @cindyyuanjiang
Add regex_program strings replacing java APIs and tests (#12701) @cindyyuanjiang
Add regex_program strings extract java APIs and tests (#12699) @cindyyuanjiang
Variable fragment sizes for Parquet writer (#12685) @etseidl
Add segmented reduction support for fixed-point types (#12680) @davidwendt
Move strings_udf code into cuDF (#12669) @brandon-b-miller
Add regex_program searching APIs and related java classes (#12666) @cindyyuanjiang
Add logging to libcudf (#12637) @vuule
Add compound aggregations to cudf::segmented_reduce (#12573) @davidwendt
Convert rank to use experimental row comparators (#12481) @divyegala
Use rapids-cmake parallel testing feature (#12451) @robertmaynard
Enable detection of undesired stream usage (#12089) @vyasr

🛠️ Improvements

Pin dask and distributed for release (#13070) @galipremsagar
Pin cupy in wheel tests to supported versions (#13041) @vyasr
Pin numba version (#13001) @vyasr
Rework gtests SequenceTest to remove using namespace cudf (#12985) @davidwendt
Stop setting package version attribute in wheels (#12977) @vyasr
Move detail reduction functions to cudf::reduction::detail namespace (#12971) @davidwendt
Remove default detail mrs: part7 (#12970) @vyasr
Remove default detail mrs: part6 (#12969) @vyasr
Remove default detail mrs: part5 (#12968) @vyasr
Remove default detail mrs: part4 (#12967) @vyasr
Remove default detail mrs: part3 (#12966) @vyasr
Remove default detail mrs: part2 (#12965) @vyasr
Remove default detail mrs: part1 (#12964) @vyasr
Add force_nullable_schema parameter to Parquet writer. (#12952) @galipremsagar
Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
Remove remaining default stream parameters (#12943) @vyasr
Fix cudf::segmented_reduce gtest for ANY aggregation (#12940) @davidwendt
Implement groupby.head and groupby.tail (#12939) @wence-
Fix libcudf gtests to pass null-count=0 for empty validity masks (#12923) @davidwendt
Migrate parquet encoding to use experimental row operators (#12918) @PointKernel
Fix benchmarks coded in namespace cudf and using namespace cudf (#12915) @karthikeyann
Fix io/text gtests coded in namespace cudf::test (#12914) @karthikeyann
Pass SCCACHE_S3_USE_SSL to conda builds (#12910) @ajschmidt8
Fix FST, JSON gtests & benchmarks coded in namespace cudf::test (#12907) @karthikeyann
Generate pyproject dependencies using dfg (#12906) @vyasr
Update libcudf counting functions to specify cudf::size_type (#12904) @davidwendt
Fix moto env vars & pass AWS_SESSION_TOKEN to conda builds (#12902) @ajschmidt8
Rewrite CSV writer benchmark with nvbench (#12901) @PointKernel
Rework some code logic to reduce iterator and comparator inlining to improve compile time (#12900) @davidwendt
Deprecate line_terminator in favor of lineterminator in to_csv (#12896) @wence-
Add stream and mr parameters for structs::detail::flatten_nested_columns (#12892) @ttnghia
Deprecate libcudf regex APIs accepting pattern strings directly (#12891) @davidwendt
Remove default parameters from detail headers in include (#12888) @vyasr
Update minimum pandas and numpy pinnings (#12887) @galipremsagar
Implement groupby.sample (#12882) @wence-
Update JNI build ENV default to gcc 11 (#12881) @pxLi
Change return type of cudf::structs::detail::flatten_nested_columns to smart pointer (#12878) @ttnghia
Fix passing seed parameter to MurmurHash3_32 in cudf::hash() function (#12875) @davidwendt
Remove manual artifact upload step in CI (#12869) @ajschmidt8
Update to GCC 11 (#12868) @bdice
Fix null hive-partition behavior in dask-cudf parquet (#12866) @rjzamora
Update to protobuf>=4.21.6,<4.22. (#12864) @bdice
Update RMM allocators (#12861) @pentschev
Improve performance for replace-multi for long strings (#12858) @davidwendt
Drop Python 3.7 handling for pickle protocol 4 (#12857) @jakirkham
Migrate as much as possible to pyproject.toml (#12850) @vyasr
Enable nbqa pre-commit hooks for isort and black. (#12848) @bdice
Setting a threshold for KvikIO IO (#12841) @madsbk
Update datasets download URL (#12840) @jjacobelli
Make docs builds less verbose (#12836) @AyodeAwe
Consolidate linter configs into pyproject.toml (#12834) @vyasr
Deprecate names & dtype in Index.copy (#12825) @galipremsagar
Deprecate inplace parameters in categorical methods (#12824) @galipremsagar
Add optional text file support to ninja-log utility (#12823) @davidwendt
Deprecate Index.is_* methods (#12820) @galipremsagar
Add dfg as a pre-commit hook (#12819) @vyasr
Deprecate datetime_is_numeric from describe (#12818) @galipremsagar
Deprecate na_sentinel in factorize (#12817) @galipremsagar
Shuffling read into a sub function in parquet read (#12809) @hyperbolic2346
Fixing parquet coalescing of reads (#12808) @hyperbolic2346
CI: Remove specification of manual stage for check_style.sh script. (#12803) @csadorf
Add compute-sanitizer github workflow action to nightly tests (#12800) @davidwendt
Enable groupby std and variance aggregation types in libcudf Debug build (#12799) @davidwendt
Expose seed argument to hash_values (#12795) @ayushdg
Fix groupby gtests coded in namespace cudf::test (#12784) @davidwendt
Improve performance for cudf::strings::count_characters for long strings (#12779) @davidwendt
Deallocate encoded data in ORC writer immediately after compression (#12770) @vuule
Stop force pulling fmt in nvbench. (#12768) @vyasr
Remove now redundant cuda initialization (#12758) @vyasr
Adds JSON reader, writer io benchmark (#12753) @karthikeyann
Use test paths relative to package directory. (#12751) @bdice
Add build metrics report as artifact to cpp-build workflow (#12750) @davidwendt
Add JNI methods for detecting and purging non-empty nulls from LIST and STRUCT (#12742) @razajafri
Stop using versioneer to manage versions (#12741) @vyasr
Reduce error handling verbosity in CI tests scripts (#12738) @AjayThorve
Reduce the number of test cases in multibyte_split benchmark (#12737) @PointKernel
Update shared workflow branches (#12733) @ajschmidt8
JNI switches to nested JSON reader (#12732) @res-life
Changing cudf::io::source_info to use cudf::host_span<std::byte> in a non-breaking form (#12730) @hyperbolic2346
Add nvbench environment class for initializing RMM in benchmarks (#12728) @davidwendt
Split C++ and Python build dependencies into separate lists. (#12724) @bdice
Add build dependencies to Java tests. (#12723) @bdice
Allow setting the seed argument for hash partition (#12715) @firestarman
Remove gpuCI scripts. (#12712) @bdice
Unpin dask and distributed for development (#12710) @galipremsagar
partition_by_hash(): use _split() (#12704) @madsbk
Remove DataFrame.quantiles from docs. (#12684) @bdice
Fast path for experimental::row::equality (#12676) @divyegala
Move date to build string in conda recipe (#12661) @ajschmidt8
Refactor reduction logic for fixed-point types (#12652) @davidwendt
Pay off some JNI RMM API tech debt (#12632) @revans2
Merge copy-on-write feature branch into branch-23.04 (#12619) @galipremsagar
Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
Pin cuda-nvrtc. (#12606) @bdice
Remove cudf::test::print calls from libcudf gtests (#12604) @davidwendt
Init JNI version 23.04.0-SNAPSHOT (#12599) @pxLi
Add performance benchmarks to user facing docs (#12595) @galipremsagar
Add docs build job (#12592) @AyodeAwe
Replace message parsing with throwing more specific exceptions (#12426) @vyasr
Support conversion to/from cudf in dask.dataframe.core.to_backend (#12380) @rjzamora

Package Rankings

Top 5.32% on Pypi.org

Top 8.17% on Proxy.golang.org

Top 4.8% on Repo1.maven.org

Related Projects

cumm

CUda Matrix Multiply library.

08 Oct 2021 67

localGPT

Chat with your documents on your local device using GPT models. No data leaves your device and 10...

24 May 2023 19,925

cupy

NumPy & SciPy for GPU

01 Nov 2016 7,739

panda3d

Powerful, mature open-source cross-platform game engine for Python and C++, developed by Disney a...

30 Sep 2013 4,258

CV-CUDA

CV-CUDA™ is an open-source, GPU accelerated library for cloud-scale image processing and computer...

23 Aug 2022 2,338

annotated-s4

Implementation of https://srush.github.io/annotated-s4

08 Dec 2021 450

blazingsql

BlazingSQL is a lightweight, GPU accelerated, SQL engine for Python. Built on RAPIDS cuDF.

24 Sep 2018 1,896

sit4onnx

Tools for simple inference testing using TensorRT, CUDA and OpenVINO CPU/GPU and CPU providers. S...

12 May 2022 18

librapid

A highly optimised C++ library for mathematical applications and neural networks.

25 May 2021 163

sqaod

Solvers/annealers for simulated quantum annealing on CPU and CUDA(NVIDIA GPU).

24 Oct 2017 81

CuVec

Unifying Python/C++/CUDA memory: Python buffered array ↔️ `std::vector` ↔️ CUDA managed memory

16 Jan 2021 80

vqa-outliers

Code and Experiments for ACL-IJCNLP 2021 Paper "Mind Your Outliers! Investigating the Negative Im...

25 May 2021 55

spconv

Spatial Sparse Convolution Library

19 Jan 2019 1,847

DeepRec

DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is h...

24 Dec 2021 1,029