cudf

cuDF - GPU DataFrame Library

APACHE-2.0 License

Downloads
13.3K
Stars
7.2K
Committers
246


cudf - v24.04.00 Latest Release

Published by raydouglass 6 months ago

🚨 Breaking Changes

  • Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
  • Change exceptions thrown by copying APIs (#15319) @vyasr
  • Change strings_column_view::char_size to return int64 (#15197) @davidwendt
  • Upgrade to arrow-14.0.2 (#15108) @galipremsagar
  • Add support for pandas-2.2 in cudf (#15100) @galipremsagar
  • Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
  • Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
  • Raise an error on import for unsupported GPUs. (#15053) @bdice
  • Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
  • Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
  • Add future_stack to DataFrame.stack (#15015) @galipremsagar
  • Deprecate groupby fillna (#15000) @mroeschke
  • Deprecate replace with categorical columns (#14988) @mroeschke
  • Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
  • Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
  • Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
  • Add pandas-2.x support in cudf (#14916) @galipremsagar
  • Use cuco::static_set in the hash-based groupby (#14813) @PointKernel
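Entries in these notes follow the shape `title (#PR) @author`, and each breaking change is listed again under its own category further down. The following is a minimal sketch, not part of cuDF, for parsing entries and finding PRs that appear in more than one section. The names `Entry`, `parse_entry`, and `repeated_prs` are hypothetical helpers introduced only for illustration, and the entry shape is an assumption drawn from the lines above.

```python
import re
from typing import Dict, List, NamedTuple, Optional


class Entry(NamedTuple):
    title: str
    pr: int
    author: str


# Assumed entry shape: "<title> (#<PR number>) @<author>", optionally led by a bullet.
# The title is matched lazily so that the LAST "(#NNNN)" before the author wins,
# which matters for titles that themselves quote another PR number.
_ENTRY_RE = re.compile(
    r"^\s*(?:•\s*)?(?P<title>.+?)\s*\(#(?P<pr>\d+)\)\s*@(?P<author>\S+)\s*$"
)


def parse_entry(line: str) -> Optional[Entry]:
    """Return the parsed entry, or None for headers and blank lines."""
    m = _ENTRY_RE.match(line)
    if m is None:
        return None
    return Entry(m["title"], int(m["pr"]), m["author"])


def repeated_prs(sections: Dict[str, List[str]]) -> Dict[int, List[str]]:
    """Map each PR number listed in more than one section to those section names."""
    seen: Dict[int, List[str]] = {}
    for name, lines in sections.items():
        for line in lines:
            entry = parse_entry(line)
            if entry is not None:
                seen.setdefault(entry.pr, []).append(name)
    return {pr: names for pr, names in seen.items() if len(names) > 1}
```

Calling `repeated_prs` with a mapping from section name to that section's lines gives the PRs that are both breaking changes and, for example, improvements, so they can be counted once when summarizing a release.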

🐛 Bug Fixes

  • Fix an issue with creating a series from scalar when dtype='category' (#15476) @galipremsagar
  • Update pre-commit-hooks to v0.0.3 (#15355) @KyleFromNVIDIA
  • [BUG][JNI] Trigger MemoryBuffer.onClosed after memory is freed (#15351) @abellina
  • Fix an issue with multiple short list rowgroups using the Parquet chunked reader. (#15342) @nvdbaranec
  • Avoid importing dask-expr if "query-planning" config is False (#15340) @rjzamora
  • Fix gtests/ERROR_TEST errors when run in Debug (#15317) @davidwendt
  • Fix OOB read in inflate_kernel (#15309) @vuule
  • Work around a cuFile error when running CSV tests with memcheck (#15293) @vuule
  • Fix Doxygen upload directory (#15291) @KyleFromNVIDIA
  • Fix Doxygen check (#15289) @KyleFromNVIDIA
  • Reintroduce PANDAS_GE_220 import (#15287) @wence-
  • Fix mean computation for the geometric distribution in the data generator (#15282) @vuule
  • Fix Parquet decimal64 stats (#15281) @etseidl
  • Make linking of nvtx3-cpp BUILD_LOCAL_INTERFACE (#15271) @KyleFromNVIDIA
  • Workaround compute-sanitizer memcheck bug (#15259) @davidwendt
  • Cleanup hostdevice_vector and add more APIs (#15252) @ttnghia
  • Fix number of rows in randomly generated lists columns (#15248) @vuule
  • Fix wrong output for collect_list/collect_set of lists column (#15243) @ttnghia
  • Fix testChunkedPackTwoPasses to copy from the bounce buffer (#15220) @abellina
  • Fix accessing .columns by an external API (#15212) @galipremsagar
  • [JNI] Disable testChunkedPackTwoPasses for now (#15210) @abellina
  • Update labeler and codeowner configs for CMake files (#15208) @PointKernel
  • Avoid dict normalization in __dask_tokenize__ (#15187) @rjzamora
  • Fix memcheck error in distinct inner join (#15164) @PointKernel
  • Remove unneeded script parameters in test_cpp_memcheck.sh (#15158) @davidwendt
  • Fix ListColumn.to_pandas() to retain list type (#15155) @galipremsagar
  • Avoid factorization in MultiIndex.to_pandas (#15150) @mroeschke
  • Fix GroupBy.get_group and GroupBy.indices (#15143) @wence-
  • Remove const from range_window_bounds::_extent. (#15138) @mythrocks
  • DataFrame.columns = ... retains RangeIndex & set dtype (#15129) @mroeschke
  • Correctly handle output for GroupBy.apply when chunk results are reindexed series (#15109) @brandon-b-miller
  • Fix Series.groupby.shift with a MultiIndex (#15098) @mroeschke
  • Fix reductions when DataFrame has MultiIndex columns (#15097) @mroeschke
  • Fix deprecation warnings for deprecated hash() calls (#15095) @davidwendt
  • Add support for arrow large_string in cudf (#15093) @galipremsagar
  • Fix sort_values pytest failure with pandas-2.x regression (#15092) @galipremsagar
  • Resolve path parsing issues in get_json_object (#15082) @SurajAralihalli
  • Fix bugs in handling of delta encodings (#15075) @etseidl
  • Fix is_device_write_preferred in void_sink and user_sink_wrapper (#15064) @vuule
  • Eliminate duplicate allocation of nested string columns (#15061) @vuule
  • Raise an error on import for unsupported GPUs. (#15053) @bdice
  • Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
  • Fix Index.difference to handle duplicate values when one of the inputs is empty (#15016) @galipremsagar
  • Add future_stack to DataFrame.stack (#15015) @galipremsagar
  • Fix handling of values=None in pylibcudf GroupBy.get_groups (#14998) @shwina
  • Fix DataFrame.sort_index to respect ignore_index on all axis (#14995) @galipremsagar
  • Raise for pyarrow array that is tz-aware (#14980) @mroeschke
  • Direct SeriesGroupBy.aggregate to SeriesGroupBy.agg (#14971) @rjzamora
  • Respect IntervalDtype and CategoricalDtype objects passed by users (#14961) @mroeschke
  • unset CUDF_SPILL after a pytest (#14958) @galipremsagar
  • Fix null literals being parsed as strings when mixed types as string is enabled in JSON reader (#14939) @karthikeyann
  • Fix chunked reads of Parquet delta encoded pages (#14921) @etseidl
  • Fix reading offset for data stream in ORC reader (#14911) @ttnghia
  • Enable sanitizer check for a test case testORCReadAndWriteForDecimal128 (#14897) @res-life
  • Fix dask token normalization (#14829) @rjzamora
  • Fix 24.04 versions (#14825) @raydouglass
  • Ensure slow private attrs are maybe proxies (#14380) @mroeschke

📖 Documentation

  • Ignore DLManagedTensor in the docs build (#15392) @davidwendt
  • Revert "Temporarily disable docs errors. (#15265)" (#15269) @bdice
  • Temporarily disable docs errors. (#15265) @bdice
  • Update developer_guide.md with new guidance on quoted internal includes (#15238) @harrism
  • Fix broken link for developer guide (#15025) @sanjana098
  • [DOC] Update typo in docs example of structs_column_wrapper (#14949) @karthikeyann
  • Update cudf.pandas FAQ. (#14940) @bdice
  • Optimize doc builds (#14856) @vyasr
  • Add developer guideline to use east const. (#14836) @bdice
  • Document how cuDF is pronounced (#14753) @pentschev
  • Notes convert to Pandas-compat (#12641) @Touutae-lab

🚀 New Features

  • Address inconsistency in single quote normalization in JSON reader (#15324) @shrshi
  • Use JNI pinned pool resource with cuIO (#15255) @abellina
  • Add DELTA_BYTE_ARRAY encoder for Parquet (#15239) @etseidl
  • Migrate filling operations to pylibcudf (#15225) @brandon-b-miller
  • [JNI] rmm based pinned pool (#15219) @abellina
  • Implement zero-copy host buffer source instead of using an arrow implementation (#15189) @vuule
  • Enable creation of columns from scalar (#15181) @vyasr
  • Use NVTX from GitHub. (#15178) @bdice
  • Implement segmented_row_bit_count for computing row sizes by segments of rows (#15169) @ttnghia
  • Implement search using pylibcudf (#15166) @vyasr
  • Add distinct left join (#15149) @PointKernel
  • Add cardinality control for groupby benchs with flat types (#15134) @PointKernel
  • Add ability to request Parquet encodings on a per-column basis (#15081) @etseidl
  • Automate include grouping order in .clang-format (#15063) @harrism
  • Requesting a clean build directory also clears Jitify cache (#15052) @robertmaynard
  • API for JSON unquoted whitespace normalization (#15033) @shrshi
  • Implement concatenate, lists.explode, merge, sorting, and stream compaction in pylibcudf (#15011) @vyasr
  • Implement replace in pylibcudf (#15005) @vyasr
  • Add distinct key inner join (#14990) @PointKernel
  • Implement rolling in pylibcudf (#14982) @vyasr
  • Implement joins in pylibcudf (#14972) @vyasr
  • Implement scans and reductions in pylibcudf (#14970) @vyasr
  • Rewrite cudf internals using pylibcudf groupby (#14946) @vyasr
  • Implement groupby in pylibcudf (#14945) @vyasr
  • Support casting of Map type to string in JSON reader (#14936) @karthikeyann
  • POC for whitespace removal in input JSON data using FST (#14931) @shrshi
  • Support for LZ4 compression in ORC and Parquet (#14906) @vuule
  • Remove supports_streams from cuDF custom memory resources. (#14857) @harrism
  • Migrate unary operations to pylibcudf (#14850) @vyasr
  • Migrate binary operations to pylibcudf (#14821) @vyasr
  • Add row index and stripe size options to Python ORC chunked writer (#14785) @vuule
  • Support CUDA 12.2 (#14712) @jameslamb

🛠️ Improvements

  • Use conda env create --yes instead of --force (#15403) @bdice
  • Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
  • Change exceptions thrown by copying APIs (#15319) @vyasr
  • Enable branch testing for cudf.pandas (#15316) @galipremsagar
  • Replace black with ruff-format (#15312) @mroeschke
  • Fix an NPE when reading empty JSON data by adding a new API for missing information (#15307) @revans2
  • Address poor performance of Parquet string decoding (#15304) @etseidl
  • Update script input name (#15301) @AyodeAwe
  • Make test_read_parquet_partitioned_filtered data deterministic (#15296) @mroeschke
  • Add timeout for cudf.pandas pandas tests (#15284) @galipremsagar
  • Add upper bound to prevent usage of NumPy 2 (#15283) @bdice
  • Fix cudf::test::to_host return of host_vector (#15263) @davidwendt
  • Implement grouped product scan (#15254) @wence-
  • Add CUDA 12.4 to supported PTX versions (#15247) @brandon-b-miller
  • Implement DataFrame|Series.squeeze (#15244) @mroeschke
  • Roll back ipow changes due to register pressure. (#15242) @pmattione-nvidia
  • Remove create_chars_child_column utility (#15241) @davidwendt
  • Update dlpack to version 0.8 (#15237) @dantegd
  • Improve performance in JSON reader when mixed_types_as_string option is enabled (#15236) @shrshi
  • Remove row conversion code from libcudf (#15234) @ttnghia
  • Use variable substitution for RAPIDS version in Doxyfile (#15231) @KyleFromNVIDIA
  • Add ListColumns.to_pandas(arrow_type=) (#15228) @mroeschke
  • Treat dask-cudf CI artifacts as pure wheels (#15223) @bdice
  • Clean up usage of __CUDA_ARCH__ and other macros. (#15218) @bdice
  • DOC: use constants in performance-comparisons.ipynb (#15215) @raybellwaves
  • Rewrite conversion in terms of column (#15213) @vyasr
  • Switch pytest-xdist algo to worksteal (#15207) @galipremsagar
  • Deprecate strings_column_view::offsets_begin() (#15205) @davidwendt
  • Add get_upstream_resource method to stream_checking_resource_adaptor (#15203) @miscco
  • Tune up row size estimation in the data generator (#15202) @vuule
  • Fix offset value for generating test data in parquet_chunked_reader_test.cu (#15200) @ttnghia
  • Change strings_column_view::char_size to return int64 (#15197) @davidwendt
  • Fix includes for row_operators.cuh (#15194) @davidwendt
  • Generalize GHA selectors for pure Python testing (#15191) @bdice
  • Improvements for __cuda_array_interface__ tests (#15188) @bdice
  • Allow to_pandas to return pandas.ArrowDtype (#15182) @mroeschke
  • Ignore byte_range in read_json when the size is not smaller than the input data (#15180) @vuule
  • Expose new stable_sort and finish stream_compaction in pylibcudf (#15175) @wence-
  • [ci] update matrix filters for dask-cudf builds (#15174) @jameslamb
  • Change make_strings_children to return uvector (#15171) @davidwendt
  • Don't override to_pandas for Datelike columns (#15167) @mroeschke
  • Drop python-snappy from dependencies. (#15161) @bdice
  • Add microkernels for fixed-width and fixed-width dictionary in Parquet decode (#15159) @abellina
  • Make HostColumnVector.DataType accessor methods public (#15157) @jbrennan333
  • Java bindings for left outer distinct join (#15154) @jlowe
  • Forward-merge branch-24.02 to branch-24.04 (#15153) @bdice
  • Enable pandas pytests for cudf.pandas (#15147) @galipremsagar
  • Add java option to keep quotes for JSON reads (#15146) @revans2
  • Change cross-pandas-version testing in cudf (#15145) @galipremsagar
  • Use hostdevice_vector in kernel_error to avoid the pageable copy (#15140) @vuule
  • Clean up Columns.astype & cudf.dtype (#15125) @mroeschke
  • Simplify some to_pandas implementations (#15123) @mroeschke
  • Java: Add leak tracking for Scalar instances (#15121) @jlowe
  • Remove calls to strings_column_view::offsets_begin() (#15112) @davidwendt
  • Add support for Python 3.11, require NumPy 1.23+ (#15111) @jameslamb
  • Compile-time ipow computation with array lookup (#15110) @pmattione-nvidia
  • Upgrade to arrow-14.0.2 (#15108) @galipremsagar
  • Dynamically set version in RAPIDS doc builds (#15101) @jakirkham
  • Add support for pandas-2.2 in cudf (#15100) @galipremsagar
  • Update devcontainers to CUDA Toolkit 12.2 (#15099) @trxcllnt
  • Fix datetime binop pytest failures in pandas-2.2 (#15090) @galipremsagar
  • Validate types in pylibcudf Column/Table constructors (#15088) @wence-
  • xfail test_join_ordering_pandas_compat for pandas 2.2 (#15080) @mroeschke
  • Add general purpose host memory allocator reference to cuIO with a demo of pooled-pinned allocation. (#15079) @nvdbaranec
  • Adjust test_binops for pandas 2.2 (#15078) @mroeschke
  • Remove offsets_begin() call from nvtext::generate_ngrams (#15077) @davidwendt
  • Use offsetalator in cudf::detail::has_nonempty_null_rows (#15076) @davidwendt
  • Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
  • Fix cudf::test::to_host to handle both offset types for strings columns (#15073) @davidwendt
  • Add condition for test_groupby_nulls_basic in pandas 2.2 (#15072) @mroeschke
  • xfail tests in test_udf_masked_ops due to pandas 2.2 bug (#15071) @mroeschke
  • target branch-24.04 for GitHub Actions workflows (#15069) @jameslamb
  • Implement stable version of cudf::sort (#15066) @wence-
  • Fix ORC and JSON tests failures for pandas 2.2 (#15062) @mroeschke
  • Adjust test_joining for pandas 2.2 (#15060) @mroeschke
  • Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
  • Fix test_resample index dtype checking for pandas 2.2 (#15058) @mroeschke
  • Split out strings/replace.cu and rework its gtests (#15054) @davidwendt
  • Avoid incompatible value type setting in test_rolling for pandas 2.2 (#15050) @mroeschke
  • Change chained replace inplace test to COW test for pandas 2.2 (#15049) @mroeschke
  • Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
  • Avoid chained indexing in test_indexing for pandas 2.2 (#15045) @mroeschke
  • Avoid pandas 2.2 DeprecationWarning in test_hdf (#15044) @mroeschke
  • Use appropriate make_offsets_child_column for building lists columns (#15043) @davidwendt
  • Factor out position-offsets logic from strings split_helper utility (#15040) @davidwendt
  • Forward-merge branch-24.02 to branch-24.04 (#15039) @bdice
  • Clean up nvtx macros (#15038) @PointKernel
  • Add xfailures for test_applymap for pandas 2.2 (#15034) @mroeschke
  • Expose libcudf filter expression in read_parquet (#15028) @wence-
  • Adjust tests in test_dataframe.py for pandas 2.2 (#15023) @mroeschke
  • Adjust test_datetime_infer_format for pandas 2.2 (#15021) @mroeschke
  • Performance optimizations for parquet sub-rowgroup reader. (#15020) @nvdbaranec
  • JNI bindings for distinct_hash_join (#15019) @jlowe
  • Change copy_if_safe to call thrust instead of the overload function (#15018) @davidwendt
  • Improve performance of copy_if_else for long strings (#15017) @davidwendt
  • Fix is_string_dtype test for pandas 2.2 (#15012) @mroeschke
  • Rework cudf::strings::detail::copy_range for offsetalator (#15010) @davidwendt
  • Use offsetalator in cudf::get_json_object() (#15009) @davidwendt
  • Align integral types in ORC to specs (#15008) @vuule
  • Clean up detail sequence header inclusion (#15007) @PointKernel
  • Add groupby.apply(include_groups=) to match pandas 2.2 deprecation (#15006) @mroeschke
  • Use offsetalator in cudf::interleave_columns() (#15004) @davidwendt
  • Use offsetalator in cudf::row_bit_count() (#15003) @davidwendt
  • Use offsetalator in cudf::strings::wrap() (#15002) @davidwendt
  • Use offsetalator in cudf::strings::reverse (#15001) @davidwendt
  • Deprecate groupby fillna (#15000) @mroeschke
  • Ensure to_* IO methods respect pandas 2.2 keyword only deprecation (#14999) @mroeschke
  • Remove unneeded calls to create_chars_child_column utility (#14997) @davidwendt
  • Add environment-agnostic scripts for running ctests and pytests (#14992) @trxcllnt
  • Filter all DeprecationWarning's by ArrowTable.to_pandas() (#14989) @galipremsagar
  • Deprecate replace with categorical columns (#14988) @mroeschke
  • Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
  • Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
  • Ensure that ctest is called with --no-tests=error. (#14983) @bdice
  • Deprecate non-integer periods in date_range and interval_range (#14976) @galipremsagar
  • Update ops-bot.yaml (#14974) @AyodeAwe
  • Use page statistics in Parquet reader (#14973) @etseidl
  • Use fused types for overloaded function signatures (#14969) @vyasr
  • Deprecate certain frequency strings (#14967) @galipremsagar
  • Update copyrights for 24.04. (#14964) @bdice
  • Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
  • Introduce GetJsonObjectOptions in getJSONObject Java API (#14956) @SurajAralihalli
  • JNI JSON read with DataSource and inferred schema, along with basic Java nested Schema JSON reads (#14954) @revans2
  • Make codecov only informational (always pass). (#14952) @bdice
  • Replace legacy cudf and dask_cudf imports as (d)gd (#14944) @mroeschke
  • Replace _is_datetime64tz/interval_dtype with isinstance (#14943) @mroeschke
  • Update tests for pandas 2. (#14941) @bdice
  • Use more public pandas APIs (#14929) @mroeschke
  • Replace local copyright check with pre-commit-hooks verify-copyright (#14917) @KyleFromNVIDIA
  • Add pandas-2.x support in cudf (#14916) @galipremsagar
  • Use offsetalator in nvtext::byte_pair_encoding (#14888) @davidwendt
  • De-DOS line-endings (#14880) @wence-
  • Add detail cuco_allocator (#14877) @PointKernel
  • Move all core types to using enum class in Cython (#14876) @vyasr
  • Read cudf.__version__ in Sphinx build (#14872) @KyleFromNVIDIA
  • Use int64 offset types for accessing code-points in nvtext::normalize (#14868) @davidwendt
  • Read version from VERSION file in CMake (#14867) @KyleFromNVIDIA
  • Update conda-cpp-post-build-checks to branch-24.04. (#14854) @bdice
  • Update cudf for compatibility with the latest cuco (#14849) @PointKernel
  • Remove deprecated strings functions (#14848) @davidwendt
  • Fix CI workflows for pandas-tests and add test summary. (#14847) @bdice
  • Use offsetalator in cudf::strings::copy_slice (#14844) @davidwendt
  • Fix V2 Parquet page alignment for use with zStandard compression (#14841) @etseidl
  • Fix calls to deprecated strings factory API in examples. (#14838) @bdice
  • Update pre-commit hooks (#14837) @bdice
  • Use rapids_cuda_set_runtime to determine cuda runtime usage by target (#14833) @vyasr
  • Remove get_mem_info functions from custom memory resources (#14832) @harrism
  • Fix debug build by splitting row_operator_tests_utilities.cu (#14826) @davidwendt
  • Remove -DNVBench_ENABLE_CUPTI=OFF. (#14820) @bdice
  • Use cuco::static_set in the hash-based groupby (#14813) @PointKernel
  • Branch 24.04 merge branch 24.02 (#14809) @vyasr
  • Branch 24.04 merge branch 24.02 (#14806) @vyasr
  • Introduce basic "cudf" backend for Dask Expressions (#14805) @rjzamora
  • Remove build_struct|list_column (#14786) @mroeschke
  • Use offsetalator in nvtext tokenize functions (#14783) @davidwendt
  • Reduce execution time of Python ORC tests (#14776) @vuule
  • Use offsetalator in cudf::strings::split functions (#14757) @davidwendt
  • Use offsetalator in cudf::strings::findall (#14745) @davidwendt
  • Use offsetalator in cudf::strings::url_decode (#14744) @davidwendt
  • Use get_offset_value utility in strings shift function (#14743) @davidwendt
  • Use as_column instead of full (#14698) @mroeschke
  • List all notable breaking changes (#13535) @galipremsagar

cudf - v24.02.02

Published by raydouglass 8 months ago

🚨 Breaking Changes

  • Remove **kwargs from astype (#14765) @mroeschke
  • Remove mimesis as a testing dependency (#14723) @mroeschke
  • Update to Dask's shuffle_method kwarg (#14708) @pentschev
  • Drop Pascal GPU support. (#14630) @bdice
  • Update to CCCL 2.2.0. (#14576) @bdice
  • Expunge as_frame conversions in Column algorithms (#14491) @wence-
  • Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
  • Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
  • Include writer code and writerVersion in ORC files (#14458) @vuule
  • Remove null mask for zero nulls in json readers (#14451) @karthikeyann
  • REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
  • Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
  • Move chars column to parent data buffer in strings column (#14202) @karthikeyann
  • Switch to scikit-build-core (#13531) @vyasr

🐛 Bug Fixes

  • Bump to nvcomp 3.0.6. (#15128) @bdice
  • [HOTFIX] Unpin numba<0.58 (#15031) @raydouglass
  • Exclude tests from builds (#14981) @vyasr
  • Fix the bounce buffer size in ORC writer (#14947) @vuule
  • Revert sum/product aggregation to always produce int64_t type (#14907) @SurajAralihalli
  • Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
  • Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
  • Fix index difference to follow the pandas format (#14789) @amiralimi
  • Fix shared-workflows repo name (#14784) @raydouglass
  • Remove unparseable attributes from all nodes (#14780) @vyasr
  • Refactor and add validation to IntervalIndex.__init__ (#14778) @mroeschke
  • Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
  • Fix calls to deprecated strings factory API (#14771) @davidwendt
  • Fix ptx file discovery in editable installs (#14767) @vyasr
  • Revise shuffle deprecation to align with dask/dask (#14762) @rjzamora
  • Enable intermediate proxies to be picklable (#14752) @shwina
  • Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
  • Fix CMake args (#14746) @vyasr
  • Fix logic bug introduced in #14730 (#14742) @wence-
  • [Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
  • Fix Groupby.get_group (#14728) @rjzamora
  • Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
  • Split cuda versions for notebook testing (#14722) @raydouglass
  • Fix to_numeric not preserving Series index and name (#14718) @mroeschke
  • Update dask-cudf wheel name (#14713) @raydouglass
  • Fix strings::contains matching end of string target (#14711) @davidwendt
  • Update to Dask's shuffle_method kwarg (#14708) @pentschev
  • Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
  • Potential fix for performance regression in #14415 (#14706) @etseidl
  • Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
  • Skip numba test that fails on ARM (#14702) @brandon-b-miller
  • Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
  • Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
  • Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
  • Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
  • Add BaseOffset as a final proxy type to pass instancechecks for offsets against BaseOffset (#14678) @shwina
  • Add row conversion code from spark-rapids-jni (#14664) @ttnghia
  • Unconditionally export the CCCL path (#14656) @vyasr
  • Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
  • Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
  • Fix invalid memory access in Parquet reader (#14637) @etseidl
  • Use column_empty over as_column([]) (#14632) @mroeschke
  • Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
  • Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
  • Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
  • Remove cuda::proclaim_return_type from nested lambda (#14607) @ttnghia
  • Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
  • Address potential race conditions in Parquet reader (#14602) @etseidl
  • Fix DataFrame.reindex removing column name (#14601) @mroeschke
  • Remove unsanitized input test data from copy gtests (#14600) @davidwendt
  • Fix race detected in Parquet writer (#14598) @etseidl
  • Correct invalid or missing return types (#14587) @robertmaynard
  • Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
  • Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
  • Fix unsanitized nulls produced by cudf::clamp APIs (#14580) @davidwendt
  • Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
  • Fixes a symbol group lookup table issue (#14561) @elstehle
  • Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
  • REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
  • Improve memory footprint of isin by using contains (#14478) @wence-
  • Move creation of env.yaml outside the current directory (#14476) @davidwendt
  • Enable pd.Timestamp objects to be picklable when cudf.pandas is active (#14474) @shwina
  • Correct dtype of count aggregations on empty dataframes (#14473) @wence-
  • Avoid DataFrame conversion in MultiIndex.from_pandas (#14470) @mroeschke
  • JSON writer: avoid default stream use in string_scalar constructors (#14444) @vuule
  • Fix default stream use in the CSV reader (#14443) @vuule
  • Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
  • Defer PTX file load to runtime (#13690) @brandon-b-miller

📖 Documentation

  • Disable parallel build (#14796) @vyasr
  • Add pylibcudf to the docs (#14791) @vyasr
  • Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
  • Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
  • More doxygen fixes (#14639) @vyasr
  • Enable doxygen XML generation and fix issues (#14477) @vyasr
  • Some doxygen improvements (#14469) @vyasr
  • Remove warning in dask-cudf docs (#14454) @wence-
  • Update README links with redirects. (#14378) @bdice
  • Add pip install instructions to README (#13677) @shwina

🚀 New Features

  • Add ci check for external kernels (#14768) @robertmaynard
  • JSON single quote normalization API (#14729) @shrshi
  • Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
  • Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
  • Don't constrain numba<0.58 (#14616) @brandon-b-miller
  • Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
  • JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
  • JSON quote normalization (#14545) @shrshi
  • Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
  • Implement more copying APIs in pylibcudf (#14508) @vyasr
  • Include writer code and writerVersion in ORC files (#14458) @vuule
  • Parquet sub-rowgroup reading. (#14360) @nvdbaranec
  • Move chars column to parent data buffer in strings column (#14202) @karthikeyann
  • PARQUET-2261 Size Statistics (#14000) @etseidl
  • Improve GroupBy JIT error handling (#13854) @brandon-b-miller
  • Generate unified Python/C++ docs (#13846) @vyasr
  • Expand JIT groupby test suite (#13813) @brandon-b-miller

🛠️ Improvements

  • Pin pytest<8 (#14920) @galipremsagar
  • Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
  • Clean up TimedeltaIndex.__init__ constructor (#14775) @mroeschke
  • Clean up DatetimeIndex.__init__ constructor (#14774) @mroeschke
  • Some frame.py typing, move seldom used methods in frame.py (#14766) @mroeschke
  • Remove **kwargs from astype (#14765) @mroeschke
  • fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
  • Add pynvjitlink as a dependency (#14763) @brandon-b-miller
  • Resolve degenerate performance in create_structs_data (#14761) @SurajAralihalli
  • Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
  • Pin pytest-cases<3.8.2 (#14756) @mroeschke
  • Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
  • Consolidate cudf object handling in as_column (#14754) @mroeschke
  • Reduce execution time of Parquet C++ tests (#14750) @vuule
  • Implement to_datetime(..., utc=True) (#14749) @mroeschke
  • Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
  • Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
  • Implement cudf.MultiIndex.from_arrays (#14740) @mroeschke
  • Remove unused/single use methods (#14739) @mroeschke
  • refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
  • Remove unneeded methods in Column (#14730) @mroeschke
  • Clean up base column methods (#14725) @mroeschke
  • Ensure column.fillna signatures are consistent (#14724) @mroeschke
  • Remove mimesis as a testing dependency (#14723) @mroeschke
  • Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
  • Use offsetalator in gather_chars (#14700) @davidwendt
  • Use make_strings_children for fill() specialization logic (#14697) @davidwendt
  • Change io::detail::orc namespace into io::orc::detail (#14696) @ttnghia
  • Fix call to deprecated factory function (#14695) @davidwendt
  • Use as_column instead of arange for range like inputs (#14689) @mroeschke
  • Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
  • Split parquet test into multiple files (#14663) @etseidl
  • Custom error messages for IO with nonexistent files (#14662) @vuule
  • Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
  • Basic validation in reader benchmarks (#14647) @vuule
  • Update dependencies.yaml to support CUDA 12.*. (#14644) @bdice
  • Consolidate memoryview handling in as_column (#14643) @mroeschke
  • Convert FieldType to scoped enum (#14642) @vuule
  • Use isinstance over is_foo_dtype (#14641) @mroeschke
  • Use isinstance over is_foo_dtype internally (#14638) @mroeschke
  • Remove unnecessary **kwargs in function signatures (#14635) @mroeschke
  • Drop nvbench patch for nvml. (#14631) @bdice
  • Drop Pascal GPU support. (#14630) @bdice
  • Add cpp/doxygen/xml to .gitignore (#14613) @davidwendt
  • Create strings-specific make_offsets_child_column for multiple offset types (#14612) @davidwendt
  • Use the offsetalator in cudf::concatenate for strings (#14611) @davidwendt
  • Make Parquet ColumnIndex null_counts optional (#14596) @etseidl
  • Support freq in DatetimeIndex (#14593) @shwina
  • Remove legacy benchmarks for cuDF-python (#14591) @osidekyle
  • Remove WORKSPACE env var from cudf_test temp_directory class (#14588) @davidwendt
  • Use exceptions instead of return values to handle errors in CompactProtocolReader (#14582) @vuule
  • Use cuda::proclaim_return_type on device lambdas. (#14577) @bdice
  • Update to CCCL 2.2.0. (#14576) @bdice
  • Update dependencies.yaml to new pip index (#14575) @vyasr
  • Simplify Python CMake (#14565) @vyasr
  • Java expose parquet pass_read_limit (#14564) @revans2
  • Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_* macros (#14559) @SurajAralihalli
  • Use cudf_test temp_directory class for nvtext::subword_tokenize gbenchmark (#14558) @davidwendt
  • Fix return type of prefix increment overloads (#14544) @vuule
  • Make bpe_merge_pairs_impl member private (#14543) @davidwendt
  • Small clean up in io::statistics (#14542) @vuule
  • Change json gtest environment variable to compile-time definition (#14541) @davidwendt
  • Remove extra total chars size calculation from cudf::concatenate (#14540) @davidwendt
  • Refactor IndexedFrame.hash_values to use cudf::hashing functions, add xxhash64 to cudf Python. (#14538) @bdice
  • Move non-templated inline function definitions from table_view.hpp to table_view.cpp (#14535) @davidwendt
  • Add JNI for strings::code_points (#14533) @thirtiseven
  • Add a test for issue 12773 (#14529) @vyasr
  • Split libarrow build dependencies. (#14506) @bdice
  • Implement IndexedFrame.duplicated with distinct_indices + scatter (#14493) @wence-
  • Expunge as_frame conversions in Column algorithms (#14491) @wence-
  • Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
  • Refactor Parquet kernel_error (#14464) @etseidl
  • Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
  • Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
  • Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
  • Expose stream parameter in public nvtext APIs (#14456) @davidwendt
  • Include encode type in the error message when unsupported Parquet encoding is detected (#14453) @ZelboK
  • Remove null mask for zero nulls in json readers (#14451) @karthikeyann
  • Refactor cudf.Series.__init__ (#14450) @mroeschke
  • Remove the use of volatile in Parquet (#14448) @vuule
  • REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
  • Testing stream pool implementation (#14437) @shrshi
  • Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
  • Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
  • Use isinstance(..., cudf.IntervalDtype) instead of is_interval_dtype (#14424) @mroeschke
  • Use isinstance(..., cudf.CategoricalDtype) instead of is_categorical_dtype (#14423) @mroeschke
  • Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
  • REF: Remove instances of pd.core (#14421) @mroeschke
  • Expose streams in public filling APIs for label_bins (#14401) @ZelboK
  • Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
  • Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded (#14392) @etseidl
  • Add SHA-1 and SHA-2 hash functions. (#14391) @bdice
  • Expose streams in Parquet reader and writer APIs (#14359) @shrshi
  • Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
  • Replace default stream for scalars and column factories usages (because of defaulted arguments) (#14354) @karthikeyann
  • Expose streams in ORC reader and writer APIs (#14350) @shrshi
  • Convert compression and io to string axis type in IO benchmarks (#14347) @SurajAralihalli
  • Add cuDF devcontainers (#14015) @trxcllnt
  • Refactoring of Buffers (last step towards unifying COW and Spilling) (#13801) @madsbk
  • Switch to scikit-build-core (#13531) @vyasr
  • Simplify null count checking in column equality comparator (#13312) @vyasr

cudf - v24.02.01

Published by raydouglass 8 months ago

🚨 Breaking Changes

  • Remove **kwargs from astype (#14765) @mroeschke
  • Remove mimesis as a testing dependency (#14723) @mroeschke
  • Update to Dask's shuffle_method kwarg (#14708) @pentschev
  • Drop Pascal GPU support. (#14630) @bdice
  • Update to CCCL 2.2.0. (#14576) @bdice
  • Expunge as_frame conversions in Column algorithms (#14491) @wence-
  • Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
  • Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
  • Include writer code and writerVersion in ORC files (#14458) @vuule
  • Remove null mask for zero nulls in json readers (#14451) @karthikeyann
  • REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
  • Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
  • Move chars column to parent data buffer in strings column (#14202) @karthikeyann
  • Switch to scikit-build-core (#13531) @vyasr

🐛 Bug Fixes

  • [HOTFIX] Unpin numba<0.58 (#15031) @raydouglass
  • Exclude tests from builds (#14981) @vyasr
  • Fix the bounce buffer size in ORC writer (#14947) @vuule
  • Revert sum/product aggregation to always produce int64_t type (#14907) @SurajAralihalli
  • Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
  • Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
  • Fix index difference to follow the pandas format (#14789) @amiralimi
  • Fix shared-workflows repo name (#14784) @raydouglass
  • Remove unparseable attributes from all nodes (#14780) @vyasr
  • Refactor and add validation to IntervalIndex.__init__ (#14778) @mroeschke
  • Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
  • Fix calls to deprecated strings factory API (#14771) @davidwendt
  • Fix ptx file discovery in editable installs (#14767) @vyasr
  • Revise shuffle deprecation to align with dask/dask (#14762) @rjzamora
  • Enable intermediate proxies to be picklable (#14752) @shwina
  • Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
  • Fix CMake args (#14746) @vyasr
  • Fix logic bug introduced in #14730 (#14742) @wence-
  • [Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
  • Fix Groupby.get_group (#14728) @rjzamora
  • Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
  • Split cuda versions for notebook testing (#14722) @raydouglass
  • Fix to_numeric not preserving Series index and name (#14718) @mroeschke
  • Update dask-cudf wheel name (#14713) @raydouglass
  • Fix strings::contains matching end of string target (#14711) @davidwendt
  • Update to Dask's shuffle_method kwarg (#14708) @pentschev
  • Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
  • Potential fix for performance regression in #14415 (#14706) @etseidl
  • Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
  • Skip numba test that fails on ARM (#14702) @brandon-b-miller
  • Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
  • Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
  • Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
  • Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
  • Add BaseOffset as a final proxy type to pass instancechecks for offsets against BaseOffset (#14678) @shwina
  • Add row conversion code from spark-rapids-jni (#14664) @ttnghia
  • Unconditionally export the CCCL path (#14656) @vyasr
  • Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
  • Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
  • Fix invalid memory access in Parquet reader (#14637) @etseidl
  • Use column_empty over as_column([]) (#14632) @mroeschke
  • Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
  • Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
  • Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
  • Remove cuda::proclaim_return_type from nested lambda (#14607) @ttnghia
  • Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
  • Address potential race conditions in Parquet reader (#14602) @etseidl
  • Fix DataFrame.reindex removing column name (#14601) @mroeschke
  • Remove unsanitized input test data from copy gtests (#14600) @davidwendt
  • Fix race detected in Parquet writer (#14598) @etseidl
  • Correct invalid or missing return types (#14587) @robertmaynard
  • Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
  • Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
  • Fix unsanitized nulls produced by cudf::clamp APIs (#14580) @davidwendt
  • Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
  • Fixes a symbol group lookup table issue (#14561) @elstehle
  • Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
  • REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
  • Improve memory footprint of isin by using contains (#14478) @wence-
  • Move creation of env.yaml outside the current directory (#14476) @davidwendt
  • Enable pd.Timestamp objects to be picklable when cudf.pandas is active (#14474) @shwina
  • Correct dtype of count aggregations on empty dataframes (#14473) @wence-
  • Avoid DataFrame conversion in MultiIndex.from_pandas (#14470) @mroeschke
  • JSON writer: avoid default stream use in string_scalar constructors (#14444) @vuule
  • Fix default stream use in the CSV reader (#14443) @vuule
  • Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
  • Defer PTX file load to runtime (#13690) @brandon-b-miller

📖 Documentation

  • Disable parallel build (#14796) @vyasr
  • Add pylibcudf to the docs (#14791) @vyasr
  • Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
  • Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
  • More doxygen fixes (#14639) @vyasr
  • Enable doxygen XML generation and fix issues (#14477) @vyasr
  • Some doxygen improvements (#14469) @vyasr
  • Remove warning in dask-cudf docs (#14454) @wence-
  • Update README links with redirects. (#14378) @bdice
  • Add pip install instructions to README (#13677) @shwina

🚀 New Features

  • Add ci check for external kernels (#14768) @robertmaynard
  • JSON single quote normalization API (#14729) @shrshi
  • Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
  • Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
  • Don't constrain numba<0.58 (#14616) @brandon-b-miller
  • Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
  • JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
  • JSON quote normalization (#14545) @shrshi
  • Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
  • Implement more copying APIs in pylibcudf (#14508) @vyasr
  • Include writer code and writerVersion in ORC files (#14458) @vuule
  • Parquet sub-rowgroup reading. (#14360) @nvdbaranec
  • Move chars column to parent data buffer in strings column (#14202) @karthikeyann
  • PARQUET-2261 Size Statistics (#14000) @etseidl
  • Improve GroupBy JIT error handling (#13854) @brandon-b-miller
  • Generate unified Python/C++ docs (#13846) @vyasr
  • Expand JIT groupby test suite (#13813) @brandon-b-miller

🛠️ Improvements

  • Pin pytest<8 (#14920) @galipremsagar
  • Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
  • Clean up TimedeltaIndex.__init__ constructor (#14775) @mroeschke
  • Clean up DatetimeIndex.__init__ constructor (#14774) @mroeschke
  • Some frame.py typing, move seldom used methods in frame.py (#14766) @mroeschke
  • Remove **kwargs from astype (#14765) @mroeschke
  • fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
  • Add pynvjitlink as a dependency (#14763) @brandon-b-miller
  • Resolve degenerate performance in create_structs_data (#14761) @SurajAralihalli
  • Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
  • Pin pytest-cases<3.8.2 (#14756) @mroeschke
  • Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
  • Consolidate cudf object handling in as_column (#14754) @mroeschke
  • Reduce execution time of Parquet C++ tests (#14750) @vuule
  • Implement to_datetime(..., utc=True) (#14749) @mroeschke
  • Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
  • Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
  • Implement cudf.MultiIndex.from_arrays (#14740) @mroeschke
  • Remove unused/single use methods (#14739) @mroeschke
  • refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
  • Remove unneeded methods in Column (#14730) @mroeschke
  • Clean up base column methods (#14725) @mroeschke
  • Ensure column.fillna signatures are consistent (#14724) @mroeschke
  • Remove mimesis as a testing dependency (#14723) @mroeschke
  • Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
  • Use offsetalator in gather_chars (#14700) @davidwendt
  • Use make_strings_children for fill() specialization logic (#14697) @davidwendt
  • Change io::detail::orc namespace into io::orc::detail (#14696) @ttnghia
  • Fix call to deprecated factory function (#14695) @davidwendt
  • Use as_column instead of arange for range like inputs (#14689) @mroeschke
  • Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
  • Split parquet test into multiple files (#14663) @etseidl
  • Custom error messages for IO with nonexistent files (#14662) @vuule
  • Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
  • Basic validation in reader benchmarks (#14647) @vuule
  • Update dependencies.yaml to support CUDA 12.*. (#14644) @bdice
  • Consolidate memoryview handling in as_column (#14643) @mroeschke
  • Convert FieldType to scoped enum (#14642) @vuule
  • Use instance over is_foo_dtype (#14641) @mroeschke
  • Use isinstance over is_foo_dtype internally (#14638) @mroeschke
  • Remove unnecessary **kwargs in function signatures (#14635) @mroeschke
  • Drop nvbench patch for nvml. (#14631) @bdice
  • Drop Pascal GPU support. (#14630) @bdice
  • Add cpp/doxygen/xml to .gitignore (#14613) @davidwendt
  • Create strings-specific make_offsets_child_column for multiple offset types (#14612) @davidwendt
  • Use the offsetalator in cudf::concatenate for strings (#14611) @davidwendt
  • Make Parquet ColumnIndex null_counts optional (#14596) @etseidl
  • Support freq in DatetimeIndex (#14593) @shwina
  • Remove legacy benchmarks for cuDF-python (#14591) @osidekyle
  • Remove WORKSPACE env var from cudf_test temp_directory class (#14588) @davidwendt
  • Use exceptions instead of return values to handle errors in CompactProtocolReader (#14582) @vuule
  • Use cuda::proclaim_return_type on device lambdas. (#14577) @bdice
  • Update to CCCL 2.2.0. (#14576) @bdice
  • Update dependencies.yaml to new pip index (#14575) @vyasr
  • Simplify Python CMake (#14565) @vyasr
  • Java expose parquet pass_read_limit (#14564) @revans2
  • Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_* macros (#14559) @SurajAralihalli
  • Use cudf_test temp_directory class for nvtext::subword_tokenize gbenchmark (#14558) @davidwendt
  • Fix return type of prefix increment overloads (#14544) @vuule
  • Make bpe_merge_pairs_impl member private (#14543) @davidwendt
  • Small clean up in io::statistics (#14542) @vuule
  • Change json gtest environment variable to compile-time definition (#14541) @davidwendt
  • Remove extra total chars size calculation from cudf::concatenate (#14540) @davidwendt
  • Refactor IndexedFrame.hash_values to use cudf::hashing functions, add xxhash64 to cudf Python. (#14538) @bdice
  • Move non-templated inline function definitions from table_view.hpp to table_view.cpp (#14535) @davidwendt
  • Add JNI for strings::code_points (#14533) @thirtiseven
  • Add a test for issue 12773 (#14529) @vyasr
  • Split libarrow build dependencies. (#14506) @bdice
  • Implement IndexedFrame.duplicated with distinct_indices + scatter (#14493) @wence-
  • Expunge as_frame conversions in Column algorithms (#14491) @wence-
  • Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
  • Refactor Parquet kernel_error (#14464) @etseidl
  • Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
  • Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
  • Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
  • Expose stream parameter in public nvtext APIs (#14456) @davidwendt
  • Include encode type in the error message when unsupported Parquet encoding is detected (#14453) @ZelboK
  • Remove null mask for zero nulls in json readers (#14451) @karthikeyann
  • Refactor cudf.Series.__init__ (#14450) @mroeschke
  • Remove the use of volatile in Parquet (#14448) @vuule
  • REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
  • Testing stream pool implementation (#14437) @shrshi
  • Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
  • Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
  • Use isinstance(..., cudf.IntervalDtype) instead of is_interval_dtype (#14424) @mroeschke
  • Use isinstance(..., cudf.CategoricalDtype) instead of is_categorical_dtype (#14423) @mroeschke
  • Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
  • REF: Remove instances of pd.core (#14421) @mroeschke
  • Expose streams in public filling APIs for label_bins (#14401) @ZelboK
  • Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
  • Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded (#14392) @etseidl
  • Add SHA-1 and SHA-2 hash functions. (#14391) @bdice
  • Expose streams in Parquet reader and writer APIs (#14359) @shrshi
  • Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
  • Replace default stream for scalars and column factories usages (because of defaulted arguments) (#14354) @karthikeyann
  • Expose streams in ORC reader and writer APIs (#14350) @shrshi
  • Convert compression and io to string axis type in IO benchmarks (#14347) @SurajAralihalli
  • Add cuDF devcontainers (#14015) @trxcllnt
  • Refactoring of Buffers (last step towards unifying COW and Spilling) (#13801) @madsbk
  • Switch to scikit-build-core (#13531) @vyasr
  • Simplify null count checking in column equality comparator (#13312) @vyasr

cudf - v24.02.00

Published by raydouglass 8 months ago

🚨 Breaking Changes

  • Remove **kwargs from astype (#14765) @mroeschke
  • Remove mimesis as a testing dependency (#14723) @mroeschke
  • Update to Dask's shuffle_method kwarg (#14708) @pentschev
  • Drop Pascal GPU support. (#14630) @bdice
  • Update to CCCL 2.2.0. (#14576) @bdice
  • Expunge as_frame conversions in Column algorithms (#14491) @wence-
  • Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
  • Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
  • Include writer code and writerVersion in ORC files (#14458) @vuule
  • Remove null mask for zero nulls in json readers (#14451) @karthikeyann
  • REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
  • Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
  • Move chars column to parent data buffer in strings column (#14202) @karthikeyann
  • Switch to scikit-build-core (#13531) @vyasr

🐛 Bug Fixes

  • Exclude tests from builds (#14981) @vyasr
  • Fix the bounce buffer size in ORC writer (#14947) @vuule
  • Revert sum/product aggregation to always produce int64_t type (#14907) @SurajAralihalli
  • Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
  • Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
  • Fix index difference to follow the pandas format (#14789) @amiralimi
  • Fix shared-workflows repo name (#14784) @raydouglass
  • Remove unparseable attributes from all nodes (#14780) @vyasr
  • Refactor and add validation to IntervalIndex.__init__ (#14778) @mroeschke
  • Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
  • Fix calls to deprecated strings factory API (#14771) @davidwendt
  • Fix ptx file discovery in editable installs (#14767) @vyasr
  • Revise shuffle deprecation to align with dask/dask (#14762) @rjzamora
  • Enable intermediate proxies to be picklable (#14752) @shwina
  • Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
  • Fix CMake args (#14746) @vyasr
  • Fix logic bug introduced in #14730 (#14742) @wence-
  • [Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
  • Fix Groupby.get_group (#14728) @rjzamora
  • Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
  • Split cuda versions for notebook testing (#14722) @raydouglass
  • Fix to_numeric not preserving Series index and name (#14718) @mroeschke
  • Update dask-cudf wheel name (#14713) @raydouglass
  • Fix strings::contains matching end of string target (#14711) @davidwendt
  • Update to Dask's shuffle_method kwarg (#14708) @pentschev
  • Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
  • Potential fix for performance regression in #14415 (#14706) @etseidl
  • Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
  • Skip numba test that fails on ARM (#14702) @brandon-b-miller
  • Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
  • Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
  • Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
  • Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
  • Add BaseOffset as a final proxy type to pass instancechecks for offsets against BaseOffset (#14678) @shwina
  • Add row conversion code from spark-rapids-jni (#14664) @ttnghia
  • Unconditionally export the CCCL path (#14656) @vyasr
  • Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
  • Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
  • Fix invalid memory access in Parquet reader (#14637) @etseidl
  • Use column_empty over as_column([]) (#14632) @mroeschke
  • Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
  • Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
  • Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
  • Remove cuda::proclaim_return_type from nested lambda (#14607) @ttnghia
  • Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
  • Address potential race conditions in Parquet reader (#14602) @etseidl
  • Fix DataFrame.reindex removing column name (#14601) @mroeschke
  • Remove unsanitized input test data from copy gtests (#14600) @davidwendt
  • Fix race detected in Parquet writer (#14598) @etseidl
  • Correct invalid or missing return types (#14587) @robertmaynard
  • Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
  • Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
  • Fix unsanitized nulls produced by cudf::clamp APIs (#14580) @davidwendt
  • Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
  • Fixes a symbol group lookup table issue (#14561) @elstehle
  • Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
  • REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
  • Improve memory footprint of isin by using contains (#14478) @wence-
  • Move creation of env.yaml outside the current directory (#14476) @davidwendt
  • Enable pd.Timestamp objects to be picklable when cudf.pandas is active (#14474) @shwina
  • Correct dtype of count aggregations on empty dataframes (#14473) @wence-
  • Avoid DataFrame conversion in MultiIndex.from_pandas (#14470) @mroeschke
  • JSON writer: avoid default stream use in string_scalar constructors (#14444) @vuule
  • Fix default stream use in the CSV reader (#14443) @vuule
  • Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
  • Defer PTX file load to runtime (#13690) @brandon-b-miller

📖 Documentation

  • Disable parallel build (#14796) @vyasr
  • Add pylibcudf to the docs (#14791) @vyasr
  • Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
  • Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
  • More doxygen fixes (#14639) @vyasr
  • Enable doxygen XML generation and fix issues (#14477) @vyasr
  • Some doxygen improvements (#14469) @vyasr
  • Remove warning in dask-cudf docs (#14454) @wence-
  • Update README links with redirects. (#14378) @bdice
  • Add pip install instructions to README (#13677) @shwina

🚀 New Features

  • Add ci check for external kernels (#14768) @robertmaynard
  • JSON single quote normalization API (#14729) @shrshi
  • Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
  • Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
  • Don't constrain numba<0.58 (#14616) @brandon-b-miller
  • Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
  • JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
  • JSON quote normalization (#14545) @shrshi
  • Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
  • Implement more copying APIs in pylibcudf (#14508) @vyasr
  • Include writer code and writerVersion in ORC files (#14458) @vuule
  • Parquet sub-rowgroup reading. (#14360) @nvdbaranec
  • Move chars column to parent data buffer in strings column (#14202) @karthikeyann
  • PARQUET-2261 Size Statistics (#14000) @etseidl
  • Improve GroupBy JIT error handling (#13854) @brandon-b-miller
  • Generate unified Python/C++ docs (#13846) @vyasr
  • Expand JIT groupby test suite (#13813) @brandon-b-miller

🛠️ Improvements

  • Pin pytest<8 (#14920) @galipremsagar
  • Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
  • Clean up TimedeltaIndex.__init__ constructor (#14775) @mroeschke
  • Clean up DatetimeIndex.__init__ constructor (#14774) @mroeschke
  • Some frame.py typing, move seldom used methods in frame.py (#14766) @mroeschke
  • Remove **kwargs from astype (#14765) @mroeschke
  • fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
  • Add pynvjitlink as a dependency (#14763) @brandon-b-miller
  • Resolve degenerate performance in create_structs_data (#14761) @SurajAralihalli
  • Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
  • Pin pytest-cases<3.8.2 (#14756) @mroeschke
  • Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
  • Consolidate cudf object handling in as_column (#14754) @mroeschke
  • Reduce execution time of Parquet C++ tests (#14750) @vuule
  • Implement to_datetime(..., utc=True) (#14749) @mroeschke
  • Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
  • Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
  • Implement cudf.MultiIndex.from_arrays (#14740) @mroeschke
  • Remove unused/single use methods (#14739) @mroeschke
  • refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
  • Remove unneeded methods in Column (#14730) @mroeschke
  • Clean up base column methods (#14725) @mroeschke
  • Ensure column.fillna signatures are consistent (#14724) @mroeschke
  • Remove mimesis as a testing dependency (#14723) @mroeschke
  • Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
  • Use offsetalator in gather_chars (#14700) @davidwendt
  • Use make_strings_children for fill() specialization logic (#14697) @davidwendt
  • Change io::detail::orc namespace into io::orc::detail (#14696) @ttnghia
  • Fix call to deprecated factory function (#14695) @davidwendt
  • Use as_column instead of arange for range like inputs (#14689) @mroeschke
  • Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
  • Split parquet test into multiple files (#14663) @etseidl
  • Custom error messages for IO with nonexistent files (#14662) @vuule
  • Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
  • Basic validation in reader benchmarks (#14647) @vuule
  • Update dependencies.yaml to support CUDA 12.*. (#14644) @bdice
  • Consolidate memoryview handling in as_column (#14643) @mroeschke
  • Convert FieldType to scoped enum (#14642) @vuule
  • Use instance over is_foo_dtype (#14641) @mroeschke
  • Use isinstance over is_foo_dtype internally (#14638) @mroeschke
  • Remove unnecessary **kwargs in function signatures (#14635) @mroeschke
  • Drop nvbench patch for nvml. (#14631) @bdice
  • Drop Pascal GPU support. (#14630) @bdice
  • Add cpp/doxygen/xml to .gitignore (#14613) @davidwendt
  • Create strings-specific make_offsets_child_column for multiple offset types (#14612) @davidwendt
  • Use the offsetalator in cudf::concatenate for strings (#14611) @davidwendt
  • Make Parquet ColumnIndex null_counts optional (#14596) @etseidl
  • Support freq in DatetimeIndex (#14593) @shwina
  • Remove legacy benchmarks for cuDF-python (#14591) @osidekyle
  • Remove WORKSPACE env var from cudf_test temp_directory class (#14588) @davidwendt
  • Use exceptions instead of return values to handle errors in CompactProtocolReader (#14582) @vuule
  • Use cuda::proclaim_return_type on device lambdas. (#14577) @bdice
  • Update to CCCL 2.2.0. (#14576) @bdice
  • Update dependencies.yaml to new pip index (#14575) @vyasr
  • Simplify Python CMake (#14565) @vyasr
  • Java expose parquet pass_read_limit (#14564) @revans2
  • Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_* macros (#14559) @SurajAralihalli
  • Use cudf_test temp_directory class for nvtext::subword_tokenize gbenchmark (#14558) @davidwendt
  • Fix return type of prefix increment overloads (#14544) @vuule
  • Make bpe_merge_pairs_impl member private (#14543) @davidwendt
  • Small clean up in io::statistics (#14542) @vuule
  • Change json gtest environment variable to compile-time definition (#14541) @davidwendt
  • Remove extra total chars size calculation from cudf::concatenate (#14540) @davidwendt
  • Refactor IndexedFrame.hash_values to use cudf::hashing functions, add xxhash64 to cudf Python. (#14538) @bdice
  • Move non-templated inline function definitions from table_view.hpp to table_view.cpp (#14535) @davidwendt
  • Add JNI for strings::code_points (#14533) @thirtiseven
  • Add a test for issue 12773 (#14529) @vyasr
  • Split libarrow build dependencies. (#14506) @bdice
  • Implement IndexedFrame.duplicated with distinct_indices + scatter (#14493) @wence-
  • Expunge as_frame conversions in Column algorithms (#14491) @wence-
  • Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
  • Refactor Parquet kernel_error (#14464) @etseidl
  • Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
  • Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
  • Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
  • Expose stream parameter in public nvtext APIs (#14456) @davidwendt
  • Include encode type in the error message when unsupported Parquet encoding is detected (#14453) @ZelboK
  • Remove null mask for zero nulls in json readers (#14451) @karthikeyann
  • Refactor cudf.Series.__init__ (#14450) @mroeschke
  • Remove the use of volatile in Parquet (#14448) @vuule
  • REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
  • Testing stream pool implementation (#14437) @shrshi
  • Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
  • Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
  • Use isinstance(..., cudf.IntervalDtype) instead of is_interval_dtype (#14424) @mroeschke
  • Use isinstance(..., cudf.CategoricalDtype) instead of is_categorical_dtype (#14423) @mroeschke
  • Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
  • REF: Remove instances of pd.core (#14421) @mroeschke
  • Expose streams in public filling APIs for label_bins (#14401) @ZelboK
  • Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
  • Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded (#14392) @etseidl
  • Add SHA-1 and SHA-2 hash functions. (#14391) @bdice
  • Expose streams in Parquet reader and writer APIs (#14359) @shrshi
  • Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
  • Replace default stream for scalars and column factories usages (because of defaulted arguments) (#14354) @karthikeyann
  • Expose streams in ORC reader and writer APIs (#14350) @shrshi
  • Convert compression and io to string axis type in IO benchmarks (#14347) @SurajAralihalli
  • Add cuDF devcontainers (#14015) @trxcllnt
  • Refactoring of Buffers (last step towards unifying COW and Spilling) (#13801) @madsbk
  • Switch to scikit-build-core (#13531) @vyasr
  • Simplify null count checking in column equality comparator (#13312) @vyasr

cudf - v23.12.01

Published by raydouglass 11 months ago

🚨 Breaking Changes

  • Raise error in reindex when index is not unique (#14400) @galipremsagar
  • Expose stream parameter to get_json_object API (#14297) @davidwendt
  • Refactor cudf_kafka to use skbuild (#14292) @jdye64
  • Expose stream parameter in public strings convert APIs (#14255) @davidwendt
  • Upgrade to nvCOMP 3.0.4 (#13815) @vuule

🐛 Bug Fixes

  • Fix synchronization issue when writing string columns with dictionary to ORC (#14595) @vuule
  • Update actions/labeler to v4 (#14562) @raydouglass
  • Fix data corruption when skipping rows (#14557) @etseidl
  • Fix function name typo in cudf.pandas profiler (#14514) @galipremsagar
  • Fix intermediate type checking in expression parsing (#14445) @vyasr
  • Forward merge branch-23.10 into branch-23.12 (#14435) @raydouglass
  • Remove needs: wheel-build-cudf. (#14427) @bdice
  • Fix dask dependency in custreamz (#14420) @vyasr
  • Ensure nvbench initializes nvml context when built statically (#14411) @robertmaynard
  • Support java AST String literal with desired encoding (#14402) @winningsix
  • Raise error in reindex when index is not unique (#14400) @galipremsagar
  • Always build nvbench statically so we don't need to package it (#14399) @robertmaynard
  • Fix token-count logic in nvtext::tokenize_with_vocabulary (#14393) @davidwendt
  • Fix as_column(pd.Timestamp/Timedelta, length=) not respecting length (#14390) @mroeschke
  • cudf.pandas: cuDF subpath checking in module __getattr__ (#14388) @shwina
  • Fix and disable encoding for nanosecond statistics in ORC writer (#14367) @vuule
  • Add the new manylinux builds to the build job (#14351) @vyasr
  • cudf jit parser now supports .pragma instructions with quotes (#14348) @robertmaynard
  • Fix overflow check in cudf::merge (#14345) @divyegala
  • Add cramjam (#14344) @vyasr
  • Enable dask_cudf/io pytests in CI (#14338) @galipremsagar
  • Temporarily avoid the current build of pydata-sphinx-theme (#14332) @vyasr
  • Fix host buffer access from device function in the Parquet reader (#14328) @vuule
  • Run IO tests for Dask-cuDF (#14327) @rjzamora
  • Fix logical type issues in the Parquet writer (#14322) @vuule
  • Remove aws-sdk-pinning and revert to arrow 12.0.1 (#14319) @vyasr
  • Test is_valid before reading column data (#14318) @etseidl
  • Fix gtest validity setting for TextTokenizeTest.Vocabulary (#14312) @davidwendt
  • Fixes stack context for json lines format that recovers from invalid JSON lines (#14309) @elstehle
  • Downgrade to Arrow 12.0.0 for aws-sdk-cpp and fix cudf_kafka builds for new CI containers (#14296) @vyasr
  • Fix thread index overflow issue (#14290) @hyperbolic2346
  • Fix memset error in nvtext::edit_distance_matrix (#14283) @davidwendt
  • Changes JSON reader's recovery option's behaviour to ignore all characters after a valid JSON record (#14279) @elstehle
  • Handle empty string correctly in Parquet statistics (#14257) @etseidl
  • Fixes behaviour for incomplete lines when recover_with_nulls is enabled (#14252) @elstehle
  • cudf::detail::pinned_allocator doesn't throw from deallocate (#14251) @robertmaynard
  • Fix strings replace for adjacent, identical multi-byte UTF-8 character targets (#14235) @davidwendt
  • Fix the precision when converting a decimal128 column to an arrow array (#14230) @jihoonson
  • Fixing parquet list of struct interpretation (#13715) @hyperbolic2346

📖 Documentation

  • Fix io reference in docs. (#14452) @bdice
  • Update README (#14374) @shwina
  • Example code for blog on new row comparators (#13795) @divyegala

🚀 New Features

  • Expose streams in public unary APIs (#14342) @vyasr
  • Add python tests for Parquet DELTA_BINARY_PACKED encoder (#14316) @etseidl
  • Update rapids-cmake functions to non-deprecated signatures (#14265) @robertmaynard
  • Expose streams in public null mask APIs (#14263) @vyasr
  • Expose streams in binaryop APIs (#14187) @vyasr
  • Add pylibcudf.Scalar that interoperates with Arrow scalars (#14133) @vyasr
  • Add decoder for DELTA_BYTE_ARRAY to Parquet reader (#14101) @etseidl
  • Add DELTA_BINARY_PACKED encoder for Parquet writer (#14100) @etseidl
  • Add BytePairEncoder class to cuDF (#13891) @davidwendt
  • Upgrade to nvCOMP 3.0.4 (#13815) @vuule
  • Use pynvjitlink for CUDA 12+ MVC (#13650) @brandon-b-miller

🛠️ Improvements

  • Build concurrency for nightly and merge triggers (#14441) @bdice
  • Cleanup remaining usages of dask dependencies (#14407) @galipremsagar
  • Update to Arrow 14.0.1. (#14387) @bdice
  • Remove Cython libcpp wrappers (#14382) @vyasr
  • Forward-merge branch-23.10 to branch-23.12 (#14372) @bdice
  • Upgrade to arrow 14 (#14371) @galipremsagar
  • Fix a pytest typo in test_kurt_skew_error (#14368) @galipremsagar
  • Use new rapids-dask-dependency metapackage for managing dask versions (#14364) @vyasr
  • Change nullable() to has_nulls() in cudf::detail::gather (#14363) @divyegala
  • Split up scan_inclusive.cu to improve its compile time (#14358) @davidwendt
  • Implement user_datasource_wrapper is_empty() and is_device_read_preferred(). (#14357) @tpn
  • Added streams to CSV reader and writer api (#14340) @shrshi
  • Upgrade wheels to use arrow 13 (#14339) @vyasr
  • Rework nvtext::byte_pair_encoding API (#14337) @davidwendt
  • Improve performance of nvtext::tokenize_with_vocabulary for long strings (#14336) @davidwendt
  • Upgrade arrow to 13 (#14330) @galipremsagar
  • Expose stream parameter in public nvtext replace APIs (#14329) @davidwendt
  • Drop pyorc dependency and use pandas/pyarrow instead (#14323) @galipremsagar
  • Avoid pyarrow.fs import for local storage (#14321) @rjzamora
  • Unpin dask and distributed for 23.12 development (#14320) @galipremsagar
  • Expose stream parameter in public nvtext tokenize APIs (#14317) @davidwendt
  • Added streams to JSON reader and writer api (#14313) @shrshi
  • Minor improvements in source_info (#14308) @vuule
  • Forward-merge branch-23.10 to branch-23.12 (#14307) @bdice
  • Add stream parameter to Set Operations (Public List APIs) (#14305) @SurajAralihalli
  • Expose stream parameter to get_json_object API (#14297) @davidwendt
  • Sort dictionary data alphabetically in the ORC writer (#14295) @vuule
  • Expose stream parameter in public strings filter APIs (#14293) @davidwendt
  • Refactor cudf_kafka to use skbuild (#14292) @jdye64
  • Update shared-action-workflows references (#14289) @AyodeAwe
  • Register partd encode dispatch in dask_cudf (#14287) @rjzamora
  • Update versioning strategy (#14285) @vyasr
  • Move and rename byte-pair-encoding source files (#14284) @davidwendt
  • Expose stream parameter in public strings combine APIs (#14281) @davidwendt
  • Expose stream parameter in public strings contains APIs (#14280) @davidwendt
  • Add stream parameter to List Sort and Filter APIs (#14272) @SurajAralihalli
  • Use branch-23.12 workflows. (#14271) @bdice
  • Refactor LogicalType for Parquet (#14264) @etseidl
  • Centralize chunked reading code in the parquet reader to reader_impl_chunking.cu (#14262) @nvdbaranec
  • Expose stream parameter in public strings replace APIs (#14261) @davidwendt
  • Expose stream parameter in public strings APIs (#14260) @davidwendt
  • Cleanup of namespaces in parquet code. (#14259) @nvdbaranec
  • Make parquet schema index type consistent (#14256) @hyperbolic2346
  • Expose stream parameter in public strings convert APIs (#14255) @davidwendt
  • Add Java bindings for DataSource (#14254) @revans2
  • Reimplement cudf::merge for nested types without using comparators (#14250) @divyegala
  • Add stream parameter to List Manipulation and Operations APIs (#14248) @SurajAralihalli
  • Expose stream parameter in public strings split/partition APIs (#14247) @davidwendt
  • Improve contains_column by invoking contains_table (#14238) @PointKernel
  • Detect and report errors in Parquet header parsing (#14237) @etseidl
  • Normalizing offsets iterator (#14234) @davidwendt
  • Forward merge 23.10 into 23.12 (#14231) @galipremsagar
  • Return error if BOOL8 column-type is used with integers-to-hex (#14208) @davidwendt
  • Enable indexalator for device code (#14206) @davidwendt
  • Marginally reduce memory footprint of joins (#14197) @wence-
  • Add nvtx annotations to spilling-based data movement (#14196) @wence-
  • Optimize ORC writer for decimal columns (#14190) @vuule
  • Remove the use of volatile in ORC (#14175) @vuule
  • Add bytes_per_second to distinct_count of stream_compaction nvbench. (#14172) @Blonck
  • Add bytes_per_second to transpose benchmark (#14170) @Blonck
  • cuDF: Build CUDA 12.0 ARM conda packages. (#14112) @bdice
  • Add bytes_per_second to shift benchmark (#13950) @Blonck
  • Extract debug_utilities.hpp/cu from column_utilities.hpp/cu (#13720) @ttnghia

cudf - v23.12.00

Published by raydouglass 11 months ago

🚨 Breaking Changes

  • Raise error in reindex when index is not unique (#14400) @galipremsagar
  • Expose stream parameter to get_json_object API (#14297) @davidwendt
  • Refactor cudf_kafka to use skbuild (#14292) @jdye64
  • Expose stream parameter in public strings convert APIs (#14255) @davidwendt
  • Upgrade to nvCOMP 3.0.4 (#13815) @vuule

🐛 Bug Fixes

  • Update actions/labeler to v4 (#14562) @raydouglass
  • Fix data corruption when skipping rows (#14557) @etseidl
  • Fix function name typo in cudf.pandas profiler (#14514) @galipremsagar
  • Fix intermediate type checking in expression parsing (#14445) @vyasr
  • Forward merge branch-23.10 into branch-23.12 (#14435) @raydouglass
  • Remove needs: wheel-build-cudf. (#14427) @bdice
  • Fix dask dependency in custreamz (#14420) @vyasr
  • Ensure nvbench initializes nvml context when built statically (#14411) @robertmaynard
  • Support java AST String literal with desired encoding (#14402) @winningsix
  • Raise error in reindex when index is not unique (#14400) @galipremsagar
  • Always build nvbench statically so we don't need to package it (#14399) @robertmaynard
  • Fix token-count logic in nvtext::tokenize_with_vocabulary (#14393) @davidwendt
  • Fix as_column(pd.Timestamp/Timedelta, length=) not respecting length (#14390) @mroeschke
  • cudf.pandas: cuDF subpath checking in module __getattr__ (#14388) @shwina
  • Fix and disable encoding for nanosecond statistics in ORC writer (#14367) @vuule
  • Add the new manylinux builds to the build job (#14351) @vyasr
  • cudf jit parser now supports .pragma instructions with quotes (#14348) @robertmaynard
  • Fix overflow check in cudf::merge (#14345) @divyegala
  • Add cramjam (#14344) @vyasr
  • Enable dask_cudf/io pytests in CI (#14338) @galipremsagar
  • Temporarily avoid the current build of pydata-sphinx-theme (#14332) @vyasr
  • Fix host buffer access from device function in the Parquet reader (#14328) @vuule
  • Run IO tests for Dask-cuDF (#14327) @rjzamora
  • Fix logical type issues in the Parquet writer (#14322) @vuule
  • Remove aws-sdk-pinning and revert to arrow 12.0.1 (#14319) @vyasr
  • Test is_valid before reading column data (#14318) @etseidl
  • Fix gtest validity setting for TextTokenizeTest.Vocabulary (#14312) @davidwendt
  • Fixes stack context for json lines format that recovers from invalid JSON lines (#14309) @elstehle
  • Downgrade to Arrow 12.0.0 for aws-sdk-cpp and fix cudf_kafka builds for new CI containers (#14296) @vyasr
  • Fix thread index overflow issue (#14290) @hyperbolic2346
  • Fix memset error in nvtext::edit_distance_matrix (#14283) @davidwendt
  • Changes JSON reader's recovery option's behaviour to ignore all characters after a valid JSON record (#14279) @elstehle
  • Handle empty string correctly in Parquet statistics (#14257) @etseidl
  • Fixes behaviour for incomplete lines when recover_with_nulls is enabled (#14252) @elstehle
  • cudf::detail::pinned_allocator doesn't throw from deallocate (#14251) @robertmaynard
  • Fix strings replace for adjacent, identical multi-byte UTF-8 character targets (#14235) @davidwendt
  • Fix the precision when converting a decimal128 column to an arrow array (#14230) @jihoonson
  • Fixing parquet list of struct interpretation (#13715) @hyperbolic2346

📖 Documentation

  • Fix io reference in docs. (#14452) @bdice
  • Update README (#14374) @shwina
  • Example code for blog on new row comparators (#13795) @divyegala

🚀 New Features

  • Expose streams in public unary APIs (#14342) @vyasr
  • Add python tests for Parquet DELTA_BINARY_PACKED encoder (#14316) @etseidl
  • Update rapids-cmake functions to non-deprecated signatures (#14265) @robertmaynard
  • Expose streams in public null mask APIs (#14263) @vyasr
  • Expose streams in binaryop APIs (#14187) @vyasr
  • Add pylibcudf.Scalar that interoperates with Arrow scalars (#14133) @vyasr
  • Add decoder for DELTA_BYTE_ARRAY to Parquet reader (#14101) @etseidl
  • Add DELTA_BINARY_PACKED encoder for Parquet writer (#14100) @etseidl
  • Add BytePairEncoder class to cuDF (#13891) @davidwendt
  • Upgrade to nvCOMP 3.0.4 (#13815) @vuule
  • Use pynvjitlink for CUDA 12+ MVC (#13650) @brandon-b-miller

🛠️ Improvements

  • Build concurrency for nightly and merge triggers (#14441) @bdice
  • Cleanup remaining usages of dask dependencies (#14407) @galipremsagar
  • Update to Arrow 14.0.1. (#14387) @bdice
  • Remove Cython libcpp wrappers (#14382) @vyasr
  • Forward-merge branch-23.10 to branch-23.12 (#14372) @bdice
  • Upgrade to arrow 14 (#14371) @galipremsagar
  • Fix a pytest typo in test_kurt_skew_error (#14368) @galipremsagar
  • Use new rapids-dask-dependency metapackage for managing dask versions (#14364) @vyasr
  • Change nullable() to has_nulls() in cudf::detail::gather (#14363) @divyegala
  • Split up scan_inclusive.cu to improve its compile time (#14358) @davidwendt
  • Implement user_datasource_wrapper is_empty() and is_device_read_preferred(). (#14357) @tpn
  • Added streams to CSV reader and writer api (#14340) @shrshi
  • Upgrade wheels to use arrow 13 (#14339) @vyasr
  • Rework nvtext::byte_pair_encoding API (#14337) @davidwendt
  • Improve performance of nvtext::tokenize_with_vocabulary for long strings (#14336) @davidwendt
  • Upgrade arrow to 13 (#14330) @galipremsagar
  • Expose stream parameter in public nvtext replace APIs (#14329) @davidwendt
  • Drop pyorc dependency and use pandas/pyarrow instead (#14323) @galipremsagar
  • Avoid pyarrow.fs import for local storage (#14321) @rjzamora
  • Unpin dask and distributed for 23.12 development (#14320) @galipremsagar
  • Expose stream parameter in public nvtext tokenize APIs (#14317) @davidwendt
  • Added streams to JSON reader and writer api (#14313) @shrshi
  • Minor improvements in source_info (#14308) @vuule
  • Forward-merge branch-23.10 to branch-23.12 (#14307) @bdice
  • Add stream parameter to Set Operations (Public List APIs) (#14305) @SurajAralihalli
  • Expose stream parameter to get_json_object API (#14297) @davidwendt
  • Sort dictionary data alphabetically in the ORC writer (#14295) @vuule
  • Expose stream parameter in public strings filter APIs (#14293) @davidwendt
  • Refactor cudf_kafka to use skbuild (#14292) @jdye64
  • Update shared-action-workflows references (#14289) @AyodeAwe
  • Register partd encode dispatch in dask_cudf (#14287) @rjzamora
  • Update versioning strategy (#14285) @vyasr
  • Move and rename byte-pair-encoding source files (#14284) @davidwendt
  • Expose stream parameter in public strings combine APIs (#14281) @davidwendt
  • Expose stream parameter in public strings contains APIs (#14280) @davidwendt
  • Add stream parameter to List Sort and Filter APIs (#14272) @SurajAralihalli
  • Use branch-23.12 workflows. (#14271) @bdice
  • Refactor LogicalType for Parquet (#14264) @etseidl
  • Centralize chunked reading code in the parquet reader to reader_impl_chunking.cu (#14262) @nvdbaranec
  • Expose stream parameter in public strings replace APIs (#14261) @davidwendt
  • Expose stream parameter in public strings APIs (#14260) @davidwendt
  • Cleanup of namespaces in parquet code. (#14259) @nvdbaranec
  • Make parquet schema index type consistent (#14256) @hyperbolic2346
  • Expose stream parameter in public strings convert APIs (#14255) @davidwendt
  • Add Java bindings for DataSource (#14254) @revans2
  • Reimplement cudf::merge for nested types without using comparators (#14250) @divyegala
  • Add stream parameter to List Manipulation and Operations APIs (#14248) @SurajAralihalli
  • Expose stream parameter in public strings split/partition APIs (#14247) @davidwendt
  • Improve contains_column by invoking contains_table (#14238) @PointKernel
  • Detect and report errors in Parquet header parsing (#14237) @etseidl
  • Normalizing offsets iterator (#14234) @davidwendt
  • Forward merge 23.10 into 23.12 (#14231) @galipremsagar
  • Return error if BOOL8 column-type is used with integers-to-hex (#14208) @davidwendt
  • Enable indexalator for device code (#14206) @davidwendt
  • Marginally reduce memory footprint of joins (#14197) @wence-
  • Add nvtx annotations to spilling-based data movement (#14196) @wence-
  • Optimize ORC writer for decimal columns (#14190) @vuule
  • Remove the use of volatile in ORC (#14175) @vuule
  • Add bytes_per_second to distinct_count of stream_compaction nvbench. (#14172) @Blonck
  • Add bytes_per_second to transpose benchmark (#14170) @Blonck
  • cuDF: Build CUDA 12.0 ARM conda packages. (#14112) @bdice
  • Add bytes_per_second to shift benchmark (#13950) @Blonck
  • Extract debug_utilities.hpp/cu from column_utilities.hpp/cu (#13720) @ttnghia

cudf - v23.10.02

Published by raydouglass 11 months ago

🚨 Breaking Changes

  • Raise error in reindex when index is not unique (#14429) @galipremsagar
  • Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
  • Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
  • Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
  • Create table_input_metadata from a table_metadata (#13920) @etseidl
  • Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
  • Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
  • Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
  • Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
  • Remove the libcudf cudf::offset_type type (#13788) @davidwendt
  • Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
  • Update to Cython 3.0.0 (#13777) @vyasr
  • Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
  • Enforce deprecations in 23.10 (#13732) @galipremsagar
  • Upgrade to arrow 12 (#13728) @galipremsagar
  • Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

🐛 Bug Fixes

  • Raise error in reindex when index is not unique (#14429) @galipremsagar
  • Fix inaccurate ceil/floor and inaccurate rescaling casts of fixed-point values. (#14242) @bdice
  • Fix inaccuracy in decimal128 rounding. (#14233) @bdice
  • Workaround for illegal instruction error in sm90 for warp intrinsics with mask (#14201) @karthikeyann
  • Fix pytorch related pytest (#14198) @galipremsagar
  • Pin to aws-sdk-cpp<1.11 (#14173) @pentschev
  • Fix assert failure for range window functions (#14168) @mythrocks
  • Fix Memcheck error found in JSON_TEST JsonReaderTest.ErrorStrings (#14164) @karthikeyann
  • Fix calls to copy_bitmask to pass stream parameter (#14158) @davidwendt
  • Fix DataFrame from Series with different CategoricalIndexes (#14157) @mroeschke
  • Pin to numpy<1.25 and numba<0.58 to avoid errors and deprecation warnings-as-errors. (#14156) @bdice
  • Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
  • Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
  • Fix DataFrame.values with no columns but index (#14134) @mroeschke
  • Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
  • Add support for nested dict in DataFrame constructor (#14119) @galipremsagar
  • Restrict iterables of DataFrame's as input to DataFrame constructor (#14118) @galipremsagar
  • Allow numeric_only=True for reduction operations on numeric types (#14111) @galipremsagar
  • Preserve name of the column while initializing a DataFrame (#14110) @galipremsagar
  • Correct numerous 20054-D: dynamic initialization errors found on arm+12.2 (#14108) @robertmaynard
  • Drop kwargs from Series.count (#14106) @galipremsagar
  • Fix naming issues with Index.to_frame and MultiIndex.to_frame APIs (#14105) @galipremsagar
  • Only use memory resources that haven't been freed (#14103) @robertmaynard
  • Add support for __round__ in Series and DataFrame (#14099) @galipremsagar
  • Validate ignore_index type in drop_duplicates (#14098) @mroeschke
  • Fix renaming Series and Index (#14080) @galipremsagar
  • Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
  • Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
  • Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
  • Use conda mambabuild rather than mamba mambabuild (#14067) @wence-
  • Raise NotImplementedError in to_datetime with dayfirst without infer_format (#14058) @mroeschke
  • Fix various issues in Index.intersection (#14054) @galipremsagar
  • Fix Index.difference to match with pandas (#14053) @galipremsagar
  • Fix empty string column construction (#14052) @galipremsagar
  • Fix IntervalIndex.union to preserve type-metadata (#14051) @galipremsagar
  • Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
  • Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
  • Ignore compile_commands.json (#14048) @harrism
  • Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
  • Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
  • Implement sort_remaining for sort_index (#14033) @wence-
  • Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
  • Temporary fix for Parquet metadata with empty value string being ignored when writing (#14026) @ttnghia
  • Preserve types of scalar being returned when possible in quantile (#14014) @galipremsagar
  • Fix return type of MultiIndex.difference (#14009) @galipremsagar
  • Raise an error when timezone subtypes are encountered in pd.IntervalDtype (#14006) @galipremsagar
  • Fix map column not being allowed to be non-nullable in Java (#14003) @res-life
  • Fix name selection in Index.difference and Index.intersection (#13986) @galipremsagar
  • Restore column type metadata with dropna to fix factorize API (#13980) @galipremsagar
  • Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
  • Fix MultiIndex.to_numpy to return numpy array with tuples (#13966) @galipremsagar
  • Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
  • Fix an issue with IntervalIndex.repr when null values are present (#13958) @galipremsagar
  • Fix type metadata preservation issue with Column.unique (#13957) @galipremsagar
  • Handle Interval scalars when passed in list-like inputs to cudf.Index (#13956) @galipremsagar
  • Fix setting of categories order when dtype is passed to a CategoricalColumn (#13955) @galipremsagar
  • Handle as_index in GroupBy.apply (#13951) @brandon-b-miller
  • Raise error for string types in nsmallest and nlargest (#13946) @galipremsagar
  • Fix index of Groupby.apply results when it is performed on empty objects (#13944) @galipremsagar
  • Fix integer overflow in shim device_sum functions (#13943) @brandon-b-miller
  • Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
  • Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
  • Fix construction of Grouping objects (#13932) @galipremsagar
  • Fix an issue with loc when column names are a MultiIndex (#13929) @galipremsagar
  • Fix handling of typecasting in searchsorted (#13925) @galipremsagar
  • Preserve index name in reindex (#13917) @galipremsagar
  • Use cudf::thread_index_type in cuIO to prevent overflow in row indexing (#13910) @vuule
  • Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
  • Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
  • Use cudf::thread_index_type in replace.cu. (#13905) @bdice
  • Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
  • Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
  • Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
  • Use thread_index_type to avoid index overflow in grid-stride loops (#13895) @PointKernel
  • Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
  • Raise error when trying to construct a DataFrame with mixed types (#13889) @galipremsagar
  • Return nan when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
  • Correctly detect the BOM mark in read_csv with compressed input (#13881) @vuule
  • Check for the presence of all values in MultiIndex.isin (#13879) @galipremsagar
  • Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
  • Fix return type of MultiIndex.levels (#13870) @galipremsagar
  • Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
  • Disable construction of Index when freq is set in pandas-compatibility mode (#13857) @galipremsagar
  • Fix an issue with fetching NA from a TimedeltaColumn (#13853) @galipremsagar
  • Simplify implementation of interval_range() and fix behaviour for floating freq (#13844) @shwina
  • Fix binary operations between Series and Index (#13842) @galipremsagar
  • Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
  • Fix read out of bounds in string concatenate (#13838) @pentschev
  • Raise error for more cases when timezone-aware data is passed to as_column (#13835) @galipremsagar
  • Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
  • Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
  • Fix cuFile I/O factories (#13829) @vuule
  • DataFrame with namedtuples uses ._fields as column names (#13824) @mroeschke
  • Branch 23.10 merge 23.08 (#13822) @vyasr
  • Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
  • No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
  • Raise error when mixed types are being constructed (#13816) @galipremsagar
  • Fix unbounded sequence issue in DataFrame constructor (#13811) @galipremsagar
  • Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
  • Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
  • Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
  • Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
  • Update get_arrow to Arrow 12's CMake target name arrow::xsimd (#13790) @robertmaynard
  • Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
  • Fix negative unary operation for boolean type (#13780) @galipremsagar
  • Fix contains(in) method for Series (#13779) @galipremsagar
  • Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
  • Cast only time of day to nanos to avoid an overflow in Parquet INT96 write (#13776) @gerashegalov
  • Preserve names of column object in various APIs (#13772) @galipremsagar
  • Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
  • Fix construction of DataFrames from dict when columns are provided (#13766) @wence-
  • Provide our own Cython declaration for make_unique (#13746) @wence-

📖 Documentation

  • Fix benchmark image. (#14376) @bdice
  • Fix typo in docstring: metadata. (#14025) @bdice
  • Fix typo in parquet/page_decode.cuh (#13849) @XinyuZeng
  • Simplify Python doc configuration (#13826) @vyasr
  • Update documentation to reflect recent changes in JSON reader and writer (#13791) @vuule
  • Fix all warnings in Python docs (#13789) @vyasr

🚀 New Features

  • [Java] Add JNI bindings for integers_to_hex (#14205) @razajafri
  • Propagate errors from Parquet reader kernels back to host (#14167) @vuule
  • JNI for HISTOGRAM and MERGE_HISTOGRAM aggregations (#14154) @ttnghia
  • Expose streams in all public sorting APIs (#14146) @vyasr
  • Enable direct ingestion and production of Arrow scalars (#14121) @vyasr
  • Implement GroupBy.value_counts to match pandas API (#14114) @stmio
  • Refactor parquet thrift reader (#14097) @etseidl
  • Refactor hash_reduce_by_row (#14095) @ttnghia
  • Support negative preceding/following for ROW window functions (#14093) @mythrocks
  • Support for progressive parquet chunked reading. (#14079) @nvdbaranec
  • Implement HISTOGRAM and MERGE_HISTOGRAM aggregations (#14045) @ttnghia
  • Expose streams in public search APIs (#14034) @vyasr
  • Expose streams in public replace APIs (#14010) @vyasr
  • Add stream parameter to public cudf::strings::split APIs (#13997) @davidwendt
  • Expose streams in public filling APIs (#13990) @vyasr
  • Expose streams in public concatenate APIs (#13987) @vyasr
  • Use HostMemoryAllocator in jni::allocate_host_buffer (#13975) @gerashegalov
  • Enable fractional null probability for hashing benchmark (#13967) @Blonck
  • Switch pylibcudf-enabled types to use enum class in Cython (#13931) @vyasr
  • Add nvtext::tokenize_with_vocabulary API (#13930) @davidwendt
  • Rewrite DataFrame.stack to support multi level column names (#13927) @isVoid
  • Add HostMemoryAllocator interface (#13924) @gerashegalov
  • Global stream pool (#13922) @etseidl
  • Create table_input_metadata from a table_metadata (#13920) @etseidl
  • Translate column size overflow exception to JNI (#13911) @mythrocks
  • Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
  • Exclude some tests from running with the compute sanitizer (#13872) @firestarman
  • Expand statistics support in ORC writer (#13848) @vuule
  • Register the memory mapped buffer in datasource to improve H2D throughput (#13814) @vuule
  • Add cudf::strings::find function with target per row (#13808) @davidwendt
  • Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
  • Remove unnecessary pointer copying in JIT GroupBy Apply (#13792) @brandon-b-miller
  • Add 'poll' function to custreamz kafka consumer (#13782) @jdye64
  • Support corr in GroupBy.apply through the jit engine (#13767) @shwina
  • Optionally write version 2 page headers in Parquet writer (#13751) @etseidl
  • Support more numeric types in Groupby.apply with engine='jit' (#13729) @brandon-b-miller
  • [FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader (#13637) @etseidl
  • Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader (#13437) @PointKernel

🛠️ Improvements

  • Update shared-action-workflows references (backport from 23.12 to 23.10) (#14300) @AyodeAwe
  • Pin dask and distributed for 23.10 release (#14225) @galipremsagar
  • update rmm tag path (#14195) @AyodeAwe
  • Disable Recently Updated Check (#14193) @ajschmidt8
  • Move cpp/src/hash/hash_allocator.cuh to include/cudf/hashing/detail (#14163) @davidwendt
  • Add Parquet reader benchmarks for row selection (#14147) @vuule
  • Update image names (#14145) @AyodeAwe
  • Support callables in DataFrame.assign (#14142) @wence-
  • Reduce memory usage of as_categorical_column (#14138) @wence-
  • Replace Python scalar conversions with libcudf (#14124) @vyasr
  • Update to clang 16.0.6. (#14120) @bdice
  • Fix type of empty Index and raise warning in Series constructor (#14116) @galipremsagar
  • Add stream parameter to external dict APIs (#14115) @SurajAralihalli
  • Add fallback matrix for nvcomp. (#14082) @bdice
  • [Java] Add recoverWithNull to JSONOptions and pass to Table.readJSON (#14078) @andygrove
  • Remove header tests (#14072) @ajschmidt8
  • Refactor contains_table with cuco::static_set (#14064) @PointKernel
  • Remove debug print in a Parquet test (#14063) @vuule
  • Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
  • Expose stream parameter in public strings find APIs (#14060) @davidwendt
  • Update doxygen to 1.9.1 (#14059) @vyasr
  • Remove the mr from the base fixture (#14057) @vyasr
  • Expose streams in public strings case APIs (#14056) @davidwendt
  • Refactor libcudf indexalator to typed normalator (#14043) @davidwendt
  • Use cudf::make_empty_column instead of column_view constructor (#14030) @davidwendt
  • Remove quadratic runtime due to accessing Frame._dtypes in loop (#14028) @wence-
  • Explicitly depend on zlib in conda recipes (#14018) @wence-
  • Use grid_stride for stride computations. (#13996) @bdice
  • Fix an issue where casting null-array to object dtype will result in a failure (#13994) @galipremsagar
  • Add tab as literal to cudf::test::to_string output (#13993) @davidwendt
  • Enable codes dtype parity in pandas-compatibility mode for factorize API (#13982) @galipremsagar
  • Fix CategoricalIndex ordering in Groupby.agg when pandas-compatibility mode is enabled (#13978) @galipremsagar
  • Produce a fatal error if cudf is unable to find pyarrow include directory (#13976) @cwharris
  • Use thread_index_type in partitioning.cu (#13973) @divyegala
  • Use cudf::thread_index_type in merge.cu (#13972) @divyegala
  • Use copy-pr-bot (#13970) @ajschmidt8
  • Use cudf::thread_index_type in strings custom kernels (#13968) @davidwendt
  • Add bytes_per_second to hash_partition benchmark (#13965) @Blonck
  • Added pinned pool reservation API for java (#13964) @revans2
  • Simplify wheel build scripts and allow alphas of RAPIDS dependencies (#13963) @vyasr
  • Add bytes_per_second to copy_if_else benchmark (#13960) @Blonck
  • Add pandas compatible output to Series.unique (#13959) @galipremsagar
  • Add bytes_per_second to compiled binaryop benchmark (#13938) @Blonck
  • Unpin dask and distributed for 23.10 development (#13935) @galipremsagar
  • Make HostColumnVector.getRefCount public (#13934) @abellina
  • Use cuco::static_set in JSON tree algorithm (#13928) @karthikeyann
  • Add java API to get size of host memory needed to copy column view (#13919) @revans2
  • Use cudf::size_type instead of int32 where appropriate in nvtext functions (#13915) @davidwendt
  • Enable hugepage for arrow host allocations (#13914) @madsbk
  • Improve performance of nvtext::edit_distance (#13912) @davidwendt
  • Ensure cudf internals use pylibcudf in pure Python mode (#13909) @vyasr
  • Use empty() instead of size() where possible (#13908) @vuule
  • [JNI] Adds HostColumnVector.EventHandler for spillability checks (#13898) @abellina
  • Return Timestamp & Timedelta for fetching scalars in DatetimeIndex & TimedeltaIndex (#13896) @galipremsagar
  • Allow explicit shuffle="p2p" within dask-cudf API (#13893) @rjzamora
  • Disable creation of DatetimeIndex when freq is passed to cudf.date_range (#13890) @galipremsagar
  • Bring parity with pandas for datetime & timedelta comparison operations (#13877) @galipremsagar
  • Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
  • Raise error when astype(object) is called in pandas compatibility mode (#13862) @galipremsagar
  • Fixes a performance regression in FST (#13850) @elstehle
  • Set native handles to null on close in Java wrapper classes (#13818) @jlowe
  • Avoid use of CUDF_EXPECTS in libcudf unit tests outside of helper functions with return values (#13812) @vuule
  • Update lists::contains to experimental row comparator (#13810) @divyegala
  • Reduce lists::contains dispatches for scalars (#13805) @divyegala
  • Long string optimization for string column parsing in JSON reader (#13803) @karthikeyann
  • Raise NotImplementedError for pd.SparseDtype (#13798) @mroeschke
  • Remove the libcudf cudf::offset_type type (#13788) @davidwendt
  • Move Spark-independent Table debug to cudf Java (#13783) @gerashegalov
  • Update to Cython 3.0.0 (#13777) @vyasr
  • Refactor Parquet reader handling of V2 page header info (#13775) @etseidl
  • Branch 23.10 merge 23.08 (#13773) @vyasr
  • Restructure JSON code to correctly reflect legacy/experimental status (#13757) @vuule
  • Branch 23.10 merge 23.08 (#13753) @vyasr
  • Enforce deprecations in 23.10 (#13732) @galipremsagar
  • Upgrade to arrow 12 (#13728) @galipremsagar
  • Refactors JSON reader's pushdown automaton (#13716) @elstehle
  • Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

cudf - v23.04.01

Published by raydouglass 12 months ago

🚨 Breaking Changes

  • Pin dask and distributed for release (#13070) @galipremsagar
  • Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
  • Update minimum pandas and numpy pinnings (#12887) @galipremsagar
  • Deprecate names & dtype in Index.copy (#12825) @galipremsagar
  • Deprecate Index.is_* methods (#12820) @galipremsagar
  • Deprecate datetime_is_numeric from describe (#12818) @galipremsagar
  • Deprecate na_sentinel in factorize (#12817) @galipremsagar
  • Make string methods return a Series with a useful Index (#12814) @shwina
  • Produce useful guidance on overflow error in to_csv (#12705) @wence-
  • Move strings_udf code into cuDF (#12669) @brandon-b-miller
  • Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
  • Replace message parsing with throwing more specific exceptions (#12426) @vyasr

🐛 Bug Fixes

  • Pin curand version (#13127) @vyasr
  • Fix memcheck script to execute only _TEST files found in bin/gtests/libcudf (#13006) @davidwendt
  • Fix DataFrame constructor to broadcast scalar inputs properly (#12997) @galipremsagar
  • Drop force_nullable_schema from chunked parquet writer (#12996) @galipremsagar
  • Fix gtest column utility comparator diff reporting (#12995) @davidwendt
  • Handle index names while performing groupby (#12992) @galipremsagar
  • Fix __setitem__ on string columns when the scalar value ends in a null byte (#12991) @wence-
  • Fix sort_values when column is all empty strings (#12988) @eriknw
  • Remove unused variable and fix memory issue in ORC writer (#12984) @ttnghia
  • Pre-emptive fix for upstream dask.dataframe.read_parquet changes (#12983) @rjzamora
  • Remove MANIFEST.in use auto-generated one for sdists and package_data for wheels (#12960) @vyasr
  • Update to use rapids-export(COMPONENTS) feature. (#12959) @robertmaynard
  • cudftestutil supports static gtest dependencies (#12957) @robertmaynard
  • Include gtest in build environment. (#12956) @vyasr
  • Correctly handle scalar indices in Index.__getitem__ (#12955) @wence-
  • Avoid building cython twice (#12945) @galipremsagar
  • Fix set index error for Series rolling window operations (#12942) @galipremsagar
  • Fix calculation of null counts for Parquet statistics (#12938) @etseidl
  • Preserve integer dtype of hive-partitioned column containing nulls (#12930) @rjzamora
  • Use get_current_device_resource for intermediate allocations in COLLECT_LIST window code (#12927) @karthikeyann
  • Mark dlpack tensor deleter as noexcept to match PyCapsule_Destructor signature. (#12921) @bdice
  • Fix conda recipe post-link.sh typo (#12916) @pentschev
  • min_rows and num_rows are swapped in ComputePageSizes declaration in Parquet reader (#12886) @etseidl
  • Expect cupy to now support bool arrays for dlpack. (#12883) @vyasr
  • Use python -m pytest for nightly wheel tests (#12871) @bdice
  • Parquet writer column_size() should return a size_t (#12870) @etseidl
  • Fix cudf::hash_partition kernel launch error with decimal128 types (#12863) @davidwendt
  • Fix an issue with parquet chunked reader undercounting string lengths. (#12859) @nvdbaranec
  • Remove tokenizers pre-install pinning. (#12854) @vyasr
  • Fix parquet RangeIndex bug (#12838) @rjzamora
  • Remove KAFKA_HOST_TEST from compute-sanitizer check (#12831) @davidwendt
  • Make string methods return a Series with a useful Index (#12814) @shwina
  • Tell cudf_kafka to use header-only fmt (#12796) @vyasr
  • Add GroupBy.dtypes (#12783) @galipremsagar
  • Fix a leak in a test and clarify some test names (#12781) @revans2
  • Fix bug in all-null list due to join_list_elements special handling (#12767) @karthikeyann
  • Add try/except for expected null-schema error in read_parquet (#12756) @rjzamora
  • Throw an exception if an unsupported page encoding is detected in Parquet reader (#12754) @etseidl
  • Fix a bug with num_keys in _scatter_by_slice (#12749) @thomcom
  • Bump pinned rapids wheel deps to 23.4 (#12735) @sevagh
  • Rework logic in cudf::strings::split_record to improve performance (#12729) @davidwendt
  • Add always_nullable flag to Dremel encoding (#12727) @divyegala
  • Fix memcheck read error in compound segmented reduce (#12722) @davidwendt
  • Fix faulty conditional logic in JIT GroupBy.apply (#12706) @brandon-b-miller
  • Produce useful guidance on overflow error in to_csv (#12705) @wence-
  • Handle parquet list data corner case (#12698) @nvdbaranec
  • Fix missing trailing comma in json writer (#12688) @karthikeyann
  • Remove child from newCudaAsyncMemoryResource (#12681) @abellina
  • Handle bool types in round API (#12670) @galipremsagar
  • Ensure all of device bitmask is initialized in from_arrow (#12668) @wence-
  • Fix from_arrow to load a sliced arrow table (#12665) @galipremsagar
  • Fix dask-cudf read_parquet bug for multi-file aggregation (#12663) @rjzamora
  • Fix AllocateLikeTest gtests reading uninitialized null-mask (#12643) @davidwendt
  • Fix find_common_dtype and values to handle complex dtypes (#12537) @galipremsagar
  • Fix fetching of MultiIndex values when a label is passed (#12521) @galipremsagar
  • Fix Series comparison vs scalars (#12519) @brandon-b-miller
  • Allow casting from UDFString back to StringView to call methods in strings_udf (#12363) @brandon-b-miller

📖 Documentation

  • Fix GroupBy.apply doc examples rendering (#12994) @brandon-b-miller
  • add sphinx building and s3 uploading for dask-cudf docs (#12982) @quasiben
  • Add developer documentation forbidding default parameters in detail APIs (#12978) @vyasr
  • Add README symlink for dask-cudf. (#12946) @bdice
  • Remove return type from @return doxygen tags (#12908) @davidwendt
  • Fix docs build to be pydata-sphinx-theme=0.13.0 compatible (#12874) @galipremsagar
  • Add skeleton API and prose documentation for dask-cudf (#12725) @wence-
  • Enable doctests for GroupBy methods (#12658) @brandon-b-miller
  • Add comment about CUB patch for SegmentedSortInt.Bool gtest (#12611) @davidwendt

🚀 New Features

  • Add JNI method for strings::replace multi variety (#12979) @NVnavkumar
  • Add nunique aggregation support for cudf::segmented_reduce (#12972) @davidwendt
  • Refactor orc chunked writer (#12949) @ttnghia
  • Make Parquet writer nullable option application to single table writes (#12933) @vuule
  • Refactor io::orc::ProtobufWriter (#12877) @ttnghia
  • Make timezone table independent from ORC (#12805) @vuule
  • Cache JIT GroupBy.apply functions (#12802) @brandon-b-miller
  • Implement initial support for avro logical types (#6482) (#12788) @tpn
  • Update tests/column_utilities to use experimental::equality row comparator (#12777) @divyegala
  • Update distinct/unique_count to experimental::row hasher/comparator (#12776) @divyegala
  • Update hash_partition to use experimental::row::row_hasher (#12761) @divyegala
  • Update is_sorted to use experimental::row::lexicographic (#12752) @divyegala
  • Update default data source in cuio reader benchmarks (#12740) @PointKernel
  • Reenable stream identification library in CI (#12714) @vyasr
  • Add regex_program strings splitting java APIs and tests (#12713) @cindyyuanjiang
  • Add regex_program strings replacing java APIs and tests (#12701) @cindyyuanjiang
  • Add regex_program strings extract java APIs and tests (#12699) @cindyyuanjiang
  • Variable fragment sizes for Parquet writer (#12685) @etseidl
  • Add segmented reduction support for fixed-point types (#12680) @davidwendt
  • Move strings_udf code into cuDF (#12669) @brandon-b-miller
  • Add regex_program searching APIs and related java classes (#12666) @cindyyuanjiang
  • Add logging to libcudf (#12637) @vuule
  • Add compound aggregations to cudf::segmented_reduce (#12573) @davidwendt
  • Convert rank to use experimental row comparators (#12481) @divyegala
  • Use rapids-cmake parallel testing feature (#12451) @robertmaynard
  • Enable detection of undesired stream usage (#12089) @vyasr

🛠️ Improvements

  • Pin dask and distributed for release (#13070) @galipremsagar
  • Pin cupy in wheel tests to supported versions (#13041) @vyasr
  • Pin numba version (#13001) @vyasr
  • Rework gtests SequenceTest to remove using namespace cudf (#12985) @davidwendt
  • Stop setting package version attribute in wheels (#12977) @vyasr
  • Move detail reduction functions to cudf::reduction::detail namespace (#12971) @davidwendt
  • Remove default detail mrs: part7 (#12970) @vyasr
  • Remove default detail mrs: part6 (#12969) @vyasr
  • Remove default detail mrs: part5 (#12968) @vyasr
  • Remove default detail mrs: part4 (#12967) @vyasr
  • Remove default detail mrs: part3 (#12966) @vyasr
  • Remove default detail mrs: part2 (#12965) @vyasr
  • Remove default detail mrs: part1 (#12964) @vyasr
  • Add force_nullable_schema parameter to Parquet writer. (#12952) @galipremsagar
  • Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
  • Remove remaining default stream parameters (#12943) @vyasr
  • Fix cudf::segmented_reduce gtest for ANY aggregation (#12940) @davidwendt
  • Implement groupby.head and groupby.tail (#12939) @wence-
  • Fix libcudf gtests to pass null-count=0 for empty validity masks (#12923) @davidwendt
  • Migrate parquet encoding to use experimental row operators (#12918) @PointKernel
  • Fix benchmarks coded in namespace cudf and using namespace cudf (#12915) @karthikeyann
  • Fix io/text gtests coded in namespace cudf::test (#12914) @karthikeyann
  • Pass SCCACHE_S3_USE_SSL to conda builds (#12910) @ajschmidt8
  • Fix FST, JSON gtests & benchmarks coded in namespace cudf::test (#12907) @karthikeyann
  • Generate pyproject dependencies using dfg (#12906) @vyasr
  • Update libcudf counting functions to specify cudf::size_type (#12904) @davidwendt
  • Fix moto env vars & pass AWS_SESSION_TOKEN to conda builds (#12902) @ajschmidt8
  • Rewrite CSV writer benchmark with nvbench (#12901) @PointKernel
  • Rework some code logic to reduce iterator and comparator inlining to improve compile time (#12900) @davidwendt
  • Deprecate line_terminator in favor of lineterminator in to_csv (#12896) @wence-
  • Add stream and mr parameters for structs::detail::flatten_nested_columns (#12892) @ttnghia
  • Deprecate libcudf regex APIs accepting pattern strings directly (#12891) @davidwendt
  • Remove default parameters from detail headers in include (#12888) @vyasr
  • Update minimum pandas and numpy pinnings (#12887) @galipremsagar
  • Implement groupby.sample (#12882) @wence-
  • Update JNI build ENV default to gcc 11 (#12881) @pxLi
  • Change return type of cudf::structs::detail::flatten_nested_columns to smart pointer (#12878) @ttnghia
  • Fix passing seed parameter to MurmurHash3_32 in cudf::hash() function (#12875) @davidwendt
  • Remove manual artifact upload step in CI (#12869) @ajschmidt8
  • Update to GCC 11 (#12868) @bdice
  • Fix null hive-partition behavior in dask-cudf parquet (#12866) @rjzamora
  • Update to protobuf>=4.21.6,<4.22. (#12864) @bdice
  • Update RMM allocators (#12861) @pentschev
  • Improve performance for replace-multi for long strings (#12858) @davidwendt
  • Drop Python 3.7 handling for pickle protocol 4 (#12857) @jakirkham
  • Migrate as much as possible to pyproject.toml (#12850) @vyasr
  • Enable nbqa pre-commit hooks for isort and black. (#12848) @bdice
  • Setting a threshold for KvikIO IO (#12841) @madsbk
  • Update datasets download URL (#12840) @jjacobelli
  • Make docs builds less verbose (#12836) @AyodeAwe
  • Consolidate linter configs into pyproject.toml (#12834) @vyasr
  • Deprecate names & dtype in Index.copy (#12825) @galipremsagar
  • Deprecate inplace parameters in categorical methods (#12824) @galipremsagar
  • Add optional text file support to ninja-log utility (#12823) @davidwendt
  • Deprecate Index.is_* methods (#12820) @galipremsagar
  • Add dfg as a pre-commit hook (#12819) @vyasr
  • Deprecate datetime_is_numeric from describe (#12818) @galipremsagar
  • Deprecate na_sentinel in factorize (#12817) @galipremsagar
  • Shuffling read into a sub function in parquet read (#12809) @hyperbolic2346
  • Fixing parquet coalescing of reads (#12808) @hyperbolic2346
  • CI: Remove specification of manual stage for check_style.sh script. (#12803) @csadorf
  • Add compute-sanitizer github workflow action to nightly tests (#12800) @davidwendt
  • Enable groupby std and variance aggregation types in libcudf Debug build (#12799) @davidwendt
  • Expose seed argument to hash_values (#12795) @ayushdg
  • Fix groupby gtests coded in namespace cudf::test (#12784) @davidwendt
  • Improve performance for cudf::strings::count_characters for long strings (#12779) @davidwendt
  • Deallocate encoded data in ORC writer immediately after compression (#12770) @vuule
  • Stop force pulling fmt in nvbench. (#12768) @vyasr
  • Remove now redundant cuda initialization (#12758) @vyasr
  • Adds JSON reader, writer io benchmark (#12753) @karthikeyann
  • Use test paths relative to package directory. (#12751) @bdice
  • Add build metrics report as artifact to cpp-build workflow (#12750) @davidwendt
  • Add JNI methods for detecting and purging non-empty nulls from LIST and STRUCT (#12742) @razajafri
  • Stop using versioneer to manage versions (#12741) @vyasr
  • Reduce error handling verbosity in CI tests scripts (#12738) @AjayThorve
  • Reduce the number of test cases in multibyte_split benchmark (#12737) @PointKernel
  • Update shared workflow branches (#12733) @ajschmidt8
  • JNI switches to nested JSON reader (#12732) @res-life
  • Changing cudf::io::source_info to use cudf::host_span<std::byte> in a non-breaking form (#12730) @hyperbolic2346
  • Add nvbench environment class for initializing RMM in benchmarks (#12728) @davidwendt
  • Split C++ and Python build dependencies into separate lists. (#12724) @bdice
  • Add build dependencies to Java tests. (#12723) @bdice
  • Allow setting the seed argument for hash partition (#12715) @firestarman
  • Remove gpuCI scripts. (#12712) @bdice
  • Unpin dask and distributed for development (#12710) @galipremsagar
  • partition_by_hash(): use _split() (#12704) @madsbk
  • Remove DataFrame.quantiles from docs. (#12684) @bdice
  • Fast path for experimental::row::equality (#12676) @divyegala
  • Move date to build string in conda recipe (#12661) @ajschmidt8
  • Refactor reduction logic for fixed-point types (#12652) @davidwendt
  • Pay off some JNI RMM API tech debt (#12632) @revans2
  • Merge copy-on-write feature branch into branch-23.04 (#12619) @galipremsagar
  • Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
  • Pin cuda-nvrtc. (#12606) @bdice
  • Remove cudf::test::print calls from libcudf gtests (#12604) @davidwendt
  • Init JNI version 23.04.0-SNAPSHOT (#12599) @pxLi
  • Add performance benchmarks to user facing docs (#12595) @galipremsagar
  • Add docs build job (#12592) @AyodeAwe
  • Replace message parsing with throwing more specific exceptions (#12426) @vyasr
  • Support conversion to/from cudf in dask.dataframe.core.to_backend (#12380) @rjzamora

cudf - v23.10.00

Published by raydouglass about 1 year ago

🚨 Breaking Changes

  • Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
  • Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
  • Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
  • Create table_input_metadata from a table_metadata (#13920) @etseidl
  • Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
  • Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
  • Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
  • Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
  • Remove the libcudf cudf::offset_type type (#13788) @davidwendt
  • Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
  • Update to Cython 3.0.0 (#13777) @vyasr
  • Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
  • Enforce deprecations in 23.10 (#13732) @galipremsagar
  • Upgrade to arrow 12 (#13728) @galipremsagar
  • Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

🐛 Bug Fixes

  • Fix inaccurate ceil/floor and inaccurate rescaling casts of fixed-point values. (#14242) @bdice
  • Fix inaccuracy in decimal128 rounding. (#14233) @bdice
  • Workaround for illegal instruction error in sm90 for warp intrinsics with mask (#14201) @karthikeyann
  • Fix pytorch related pytest (#14198) @galipremsagar
  • Pin to aws-sdk-cpp<1.11 (#14173) @pentschev
  • Fix assert failure for range window functions (#14168) @mythrocks
  • Fix Memcheck error found in JSON_TEST JsonReaderTest.ErrorStrings (#14164) @karthikeyann
  • Fix calls to copy_bitmask to pass stream parameter (#14158) @davidwendt
  • Fix DataFrame from Series with different CategoricalIndexes (#14157) @mroeschke
  • Pin to numpy<1.25 and numba<0.58 to avoid errors and deprecation warnings-as-errors. (#14156) @bdice
  • Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
  • Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
  • Fix DataFrame.values with no columns but index (#14134) @mroeschke
  • Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
  • Add support for nested dict in DataFrame constructor (#14119) @galipremsagar
  • Restrict iterables of DataFrame's as input to DataFrame constructor (#14118) @galipremsagar
  • Allow numeric_only=True for reduction operations on numeric types (#14111) @galipremsagar
  • Preserve name of the column while initializing a DataFrame (#14110) @galipremsagar
  • Correct numerous 20054-D: dynamic initialization errors found on arm+12.2 (#14108) @robertmaynard
  • Drop kwargs from Series.count (#14106) @galipremsagar
  • Fix naming issues with Index.to_frame and MultiIndex.to_frame APIs (#14105) @galipremsagar
  • Only use memory resources that haven't been freed (#14103) @robertmaynard
  • Add support for __round__ in Series and DataFrame (#14099) @galipremsagar
  • Validate ignore_index type in drop_duplicates (#14098) @mroeschke
  • Fix renaming Series and Index (#14080) @galipremsagar
  • Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
  • Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
  • Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
  • Use conda mambabuild rather than mamba mambabuild (#14067) @wence-
  • Raise NotImplementedError in to_datetime with dayfirst without infer_format (#14058) @mroeschke
  • Fix various issues in Index.intersection (#14054) @galipremsagar
  • Fix Index.difference to match with pandas (#14053) @galipremsagar
  • Fix empty string column construction (#14052) @galipremsagar
  • Fix IntervalIndex.union to preserve type-metadata (#14051) @galipremsagar
  • Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
  • Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
  • Ignore compile_commands.json (#14048) @harrism
  • Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
  • Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
  • Implement sort_remaining for sort_index (#14033) @wence-
  • Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
  • Temporary fix Parquet metadata with empty value string being ignored from writing (#14026) @ttnghia
  • Preserve types of scalar being returned when possible in quantile (#14014) @galipremsagar
  • Fix return type of MultiIndex.difference (#14009) @galipremsagar
  • Raise an error when timezone subtypes are encountered in pd.IntervalDtype (#14006) @galipremsagar
  • Fix map column can not be non-nullable for java (#14003) @res-life
  • Fix name selection in Index.difference and Index.intersection (#13986) @galipremsagar
  • Restore column type metadata with dropna to fix factorize API (#13980) @galipremsagar
  • Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
  • Fix MultiIndex.to_numpy to return numpy array with tuples (#13966) @galipremsagar
  • Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
  • Fix an issue with IntervalIndex.repr when null values are present (#13958) @galipremsagar
  • Fix type metadata issue preservation with Column.unique (#13957) @galipremsagar
  • Handle Interval scalars when passed in list-like inputs to cudf.Index (#13956) @galipremsagar
  • Fix setting of categories order when dtype is passed to a CategoricalColumn (#13955) @galipremsagar
  • Handle as_index in GroupBy.apply (#13951) @brandon-b-miller
  • Raise error for string types in nsmallest and nlargest (#13946) @galipremsagar
  • Fix index of Groupby.apply results when it is performed on empty objects (#13944) @galipremsagar
  • Fix integer overflow in shim device_sum functions (#13943) @brandon-b-miller
  • Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
  • Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
  • Fix construction of Grouping objects (#13932) @galipremsagar
  • Fix an issue with loc when column names is MultiIndex (#13929) @galipremsagar
  • Fix handling of typecasting in searchsorted (#13925) @galipremsagar
  • Preserve index name in reindex (#13917) @galipremsagar
  • Use cudf::thread_index_type in cuIO to prevent overflow in row indexing (#13910) @vuule
  • Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
  • Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
  • Use cudf::thread_index_type in replace.cu. (#13905) @bdice
  • Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
  • Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
  • Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
  • Use thread_index_type to avoid index overflow in grid-stride loops (#13895) @PointKernel
  • Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
  • Raise error when trying to construct a DataFrame with mixed types (#13889) @galipremsagar
  • Return nan when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
  • Correctly detect the BOM mark in read_csv with compressed input (#13881) @vuule
  • Check for the presence of all values in MultiIndex.isin (#13879) @galipremsagar
  • Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
  • Fix return type of MultiIndex.levels (#13870) @galipremsagar
  • Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
  • Disable construction of Index when freq is set in pandas-compatibility mode (#13857) @galipremsagar
  • Fix an issue with fetching NA from a TimedeltaColumn (#13853) @galipremsagar
  • Simplify implementation of interval_range() and fix behaviour for floating freq (#13844) @shwina
  • Fix binary operations between Series and Index (#13842) @galipremsagar
  • Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
  • Fix read out of bounds in string concatenate (#13838) @pentschev
  • Raise error for more cases when timezone-aware data is passed to as_column (#13835) @galipremsagar
  • Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
  • Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
  • Fix cuFile I/O factories (#13829) @vuule
  • DataFrame with namedtuples uses ._field as column names (#13824) @mroeschke
  • Branch 23.10 merge 23.08 (#13822) @vyasr
  • Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
  • No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
  • Raise error when mixed types are being constructed (#13816) @galipremsagar
  • Fix unbounded sequence issue in DataFrame constructor (#13811) @galipremsagar
  • Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
  • Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
  • Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
  • Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
  • Update get_arrow to arrows 12 CMake target name of arrow::xsimd (#13790) @robertmaynard
  • Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
  • Fix negative unary operation for boolean type (#13780) @galipremsagar
  • Fix contains(in) method for Series (#13779) @galipremsagar
  • Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
  • Cast only time of day to nanos to avoid an overflow in Parquet INT96 write (#13776) @gerashegalov
  • Preserve names of column object in various APIs (#13772) @galipremsagar
  • Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
  • Fix construction of DataFrames from dict when columns are provided (#13766) @wence-
  • Provide our own Cython declaration for make_unique (#13746) @wence-

📖 Documentation

  • Fix typo in docstring: metadata. (#14025) @bdice
  • Fix typo in parquet/page_decode.cuh (#13849) @XinyuZeng
  • Simplify Python doc configuration (#13826) @vyasr
  • Update documentation to reflect recent changes in JSON reader and writer (#13791) @vuule
  • Fix all warnings in Python docs (#13789) @vyasr

🚀 New Features

  • [Java] Add JNI bindings for integers_to_hex (#14205) @razajafri
  • Propagate errors from Parquet reader kernels back to host (#14167) @vuule
  • JNI for HISTOGRAM and MERGE_HISTOGRAM aggregations (#14154) @ttnghia
  • Expose streams in all public sorting APIs (#14146) @vyasr
  • Enable direct ingestion and production of Arrow scalars (#14121) @vyasr
  • Implement GroupBy.value_counts to match pandas API (#14114) @stmio
  • Refactor parquet thrift reader (#14097) @etseidl
  • Refactor hash_reduce_by_row (#14095) @ttnghia
  • Support negative preceding/following for ROW window functions (#14093) @mythrocks
  • Support for progressive parquet chunked reading. (#14079) @nvdbaranec
  • Implement HISTOGRAM and MERGE_HISTOGRAM aggregations (#14045) @ttnghia
  • Expose streams in public search APIs (#14034) @vyasr
  • Expose streams in public replace APIs (#14010) @vyasr
  • Add stream parameter to public cudf::strings::split APIs (#13997) @davidwendt
  • Expose streams in public filling APIs (#13990) @vyasr
  • Expose streams in public concatenate APIs (#13987) @vyasr
  • Use HostMemoryAllocator in jni::allocate_host_buffer (#13975) @gerashegalov
  • Enable fractional null probability for hashing benchmark (#13967) @Blonck
  • Switch pylibcudf-enabled types to use enum class in Cython (#13931) @vyasr
  • Add nvtext::tokenize_with_vocabulary API (#13930) @davidwendt
  • Rewrite DataFrame.stack to support multi level column names (#13927) @isVoid
  • Add HostMemoryAllocator interface (#13924) @gerashegalov
  • Global stream pool (#13922) @etseidl
  • Create table_input_metadata from a table_metadata (#13920) @etseidl
  • Translate column size overflow exception to JNI (#13911) @mythrocks
  • Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
  • Exclude some tests from running with the compute sanitizer (#13872) @firestarman
  • Expand statistics support in ORC writer (#13848) @vuule
  • Register the memory mapped buffer in datasource to improve H2D throughput (#13814) @vuule
  • Add cudf::strings::find function with target per row (#13808) @davidwendt
  • Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
  • Remove unnecessary pointer copying in JIT GroupBy Apply (#13792) @brandon-b-miller
  • Add 'poll' function to custreamz kafka consumer (#13782) @jdye64
  • Support corr in GroupBy.apply through the jit engine (#13767) @shwina
  • Optionally write version 2 page headers in Parquet writer (#13751) @etseidl
  • Support more numeric types in Groupby.apply with engine='jit' (#13729) @brandon-b-miller
  • [FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader (#13637) @etseidl
  • Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader (#13437) @PointKernel

🛠️ Improvements

  • Pin dask and distributed for 23.10 release (#14225) @galipremsagar
  • update rmm tag path (#14195) @AyodeAwe
  • Disable Recently Updated Check (#14193) @ajschmidt8
  • Move cpp/src/hash/hash_allocator.cuh to include/cudf/hashing/detail (#14163) @davidwendt
  • Add Parquet reader benchmarks for row selection (#14147) @vuule
  • Update image names (#14145) @AyodeAwe
  • Support callables in DataFrame.assign (#14142) @wence-
  • Reduce memory usage of as_categorical_column (#14138) @wence-
  • Replace Python scalar conversions with libcudf (#14124) @vyasr
  • Update to clang 16.0.6. (#14120) @bdice
  • Fix type of empty Index and raise warning in Series constructor (#14116) @galipremsagar
  • Add stream parameter to external dict APIs (#14115) @SurajAralihalli
  • Add fallback matrix for nvcomp. (#14082) @bdice
  • [Java] Add recoverWithNull to JSONOptions and pass to Table.readJSON (#14078) @andygrove
  • Remove header tests (#14072) @ajschmidt8
  • Refactor contains_table with cuco::static_set (#14064) @PointKernel
  • Remove debug print in a Parquet test (#14063) @vuule
  • Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
  • Expose stream parameter in public strings find APIs (#14060) @davidwendt
  • Update doxygen to 1.9.1 (#14059) @vyasr
  • Remove the mr from the base fixture (#14057) @vyasr
  • Expose streams in public strings case APIs (#14056) @davidwendt
  • Refactor libcudf indexalator to typed normalator (#14043) @davidwendt
  • Use cudf::make_empty_column instead of column_view constructor (#14030) @davidwendt
  • Remove quadratic runtime due to accessing Frame._dtypes in loop (#14028) @wence-
  • Explicitly depend on zlib in conda recipes (#14018) @wence-
  • Use grid_stride for stride computations. (#13996) @bdice
  • Fix an issue where casting null-array to object dtype will result in a failure (#13994) @galipremsagar
  • Add tab as literal to cudf::test::to_string output (#13993) @davidwendt
  • Enable codes dtype parity in pandas-compatibility mode for factorize API (#13982) @galipremsagar
  • Fix CategoricalIndex ordering in Groupby.agg when pandas-compatibility mode is enabled (#13978) @galipremsagar
  • Produce a fatal error if cudf is unable to find pyarrow include directory (#13976) @cwharris
  • Use thread_index_type in partitioning.cu (#13973) @divyegala
  • Use cudf::thread_index_type in merge.cu (#13972) @divyegala
  • Use copy-pr-bot (#13970) @ajschmidt8
  • Use cudf::thread_index_type in strings custom kernels (#13968) @davidwendt
  • Add bytes_per_second to hash_partition benchmark (#13965) @Blonck
  • Added pinned pool reservation API for java (#13964) @revans2
  • Simplify wheel build scripts and allow alphas of RAPIDS dependencies (#13963) @vyasr
  • Add bytes_per_second to copy_if_else benchmark (#13960) @Blonck
  • Add pandas compatible output to Series.unique (#13959) @galipremsagar
  • Add bytes_per_second to compiled binaryop benchmark (#13938) @Blonck
  • Unpin dask and distributed for 23.10 development (#13935) @galipremsagar
  • Make HostColumnVector.getRefCount public (#13934) @abellina
  • Use cuco::static_set in JSON tree algorithm (#13928) @karthikeyann
  • Add java API to get size of host memory needed to copy column view (#13919) @revans2
  • Use cudf::size_type instead of int32 where appropriate in nvtext functions (#13915) @davidwendt
  • Enable hugepage for arrow host allocations (#13914) @madsbk
  • Improve performance of nvtext::edit_distance (#13912) @davidwendt
  • Ensure cudf internals use pylibcudf in pure Python mode (#13909) @vyasr
  • Use empty() instead of size() where possible (#13908) @vuule
  • [JNI] Adds HostColumnVector.EventHandler for spillability checks (#13898) @abellina
  • Return Timestamp & Timedelta for fetching scalars in DatetimeIndex & TimedeltaIndex (#13896) @galipremsagar
  • Allow explicit shuffle="p2p" within dask-cudf API (#13893) @rjzamora
  • Disable creation of DatetimeIndex when freq is passed to cudf.date_range (#13890) @galipremsagar
  • Bring parity with pandas for datetime & timedelta comparison operations (#13877) @galipremsagar
  • Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
  • Raise error when astype(object) is called in pandas compatibility mode (#13862) @galipremsagar
  • Fixes a performance regression in FST (#13850) @elstehle
  • Set native handles to null on close in Java wrapper classes (#13818) @jlowe
  • Avoid use of CUDF_EXPECTS in libcudf unit tests outside of helper functions with return values (#13812) @vuule
  • Update lists::contains to experimental row comparator (#13810) @divyegala
  • Reduce lists::contains dispatches for scalars (#13805) @divyegala
  • Long string optimization for string column parsing in JSON reader (#13803) @karthikeyann
  • Raise NotImplementedError for pd.SparseDtype (#13798) @mroeschke
  • Remove the libcudf cudf::offset_type type (#13788) @davidwendt
  • Move Spark-independent Table debug to cudf Java (#13783) @gerashegalov
  • Update to Cython 3.0.0 (#13777) @vyasr
  • Refactor Parquet reader handling of V2 page header info (#13775) @etseidl
  • Branch 23.10 merge 23.08 (#13773) @vyasr
  • Restructure JSON code to correctly reflect legacy/experimental status (#13757) @vuule
  • Branch 23.10 merge 23.08 (#13753) @vyasr
  • Enforce deprecations in 23.10 (#13732) @galipremsagar
  • Upgrade to arrow 12 (#13728) @galipremsagar
  • Refactors JSON reader's pushdown automaton (#13716) @elstehle
  • Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

cudf - v23.08.00

Published by raydouglass about 1 year ago

🚨 Breaking Changes

  • Enforce deprecations and add clarifications around existing deprecations (#13710) @galipremsagar
  • Separate MurmurHash32 from hash_functions.cuh (#13681) @davidwendt
  • Avoid storing metadata in pointers in ORC and Parquet writers (#13648) @vuule
  • Expose streams in all public copying APIs (#13629) @vyasr
  • Remove deprecated cudf::strings::slice_strings (by delimiter) functions (#13628) @davidwendt
  • Remove deprecated cudf.set_allocator. (#13591) @bdice
  • Change build.sh to use pip install instead of setup.py (#13507) @vyasr
  • Remove unused max_rows_tensor parameter from subword tokenizer (#13463) @davidwendt
  • Fix decimal scale reductions in _get_decimal_type (#13224) @charlesbluca

🐛 Bug Fixes

  • Add CUDA version to cudf_kafka and libcudf-example build strings. (#13769) @bdice
  • Fix typo in wheels-test.yaml. (#13763) @bdice
  • Don't test strings shorter than the requested ngram size (#13758) @vyasr
  • Add CUDA version to custreamz build string. (#13754) @bdice
  • Fix writing of ORC files with empty child string columns (#13745) @vuule
  • Remove the erroneous "empty level" short-circuit from ORC reader (#13722) @vuule
  • Fix character counting when writing sliced tables into ORC (#13721) @vuule
  • Parquet uses row group row count if missing from header (#13712) @hyperbolic2346
  • Fix reading of RLE encoded boolean data from parquet files with V2 page headers (#13707) @etseidl
  • Fix a corner case of list lexicographic comparator (#13701) @ttnghia
  • Fix combined filtering and column projection in dask_cudf.read_parquet (#13697) @rjzamora
  • Revert fetch-rapids changes (#13696) @vyasr
  • Data generator - include offsets in the size estimate of list elements (#13688) @vuule
  • Add cuda-nvcc-impl to cudf for numba CUDA 12 (#13673) @jakirkham
  • Fix combined filtering and column projection in read_parquet (#13666) @rjzamora
  • Use thrust::identity as hash functions for byte pair encoding (#13665) @PointKernel
  • Fix loc-getitem ordering when index contains duplicate labels (#13659) @wence-
  • [REVIEW] Introduce parity with pandas for MultiIndex.loc ordering & fix a bug in Groupby with as_index (#13657) @galipremsagar
  • Fix memcheck error found in nvtext tokenize functions (#13649) @davidwendt
  • Fix has_nonempty_nulls ignoring column offset (#13647) @ttnghia
  • [Java] Avoid double-free corruption in case of an Exception while creating a ColumnView (#13645) @razajafri
  • Fix memcheck error in ORC reader call to cudf::io::copy_uncompressed_kernel (#13643) @davidwendt
  • Fix CUDA 12 conda environment to remove cubinlinker and ptxcompiler. (#13636) @bdice
  • Fix inf/NaN comparisons for FLOAT orderby in window functions (#13635) @mythrocks
  • Refactor Index search to simplify code and increase correctness (#13625) @wence-
  • Fix compile warning for unused variable in split_re.cu (#13621) @davidwendt
  • Fix tz_localize for dask_cudf Series (#13610) @shwina
  • Fix issue with no decompressed data in ORC reader (#13609) @vuule
  • Fix floating point window range extents. (#13606) @mythrocks
  • Fix localize(None) for timezone-naive columns (#13603) @shwina
  • Fixed a memory leak caused by Exception thrown while constructing a ColumnView (#13597) @razajafri
  • Handle nullptr return value from bitmask_or in distinct_count (#13590) @wence-
  • Bring parity with pandas in Index.join (#13589) @galipremsagar
  • Fix cudf.melt when there are more than 255 columns (#13588) @hcho3
  • Fix memory issues in cuIO due to removal of memory padding (#13586) @ttnghia
  • Fix Parquet multi-file reading (#13584) @etseidl
  • Fix memcheck error found in LISTS_TEST (#13579) @davidwendt
  • Fix memcheck error found in STRINGS_TEST (#13578) @davidwendt
  • Fix memcheck error found in INTEROP_TEST (#13577) @davidwendt
  • Fix memcheck errors found in REDUCTION_TEST (#13574) @davidwendt
  • Preemptive fix for hive-partitioning change in dask (#13564) @rjzamora
  • Fix an issue with dask_cudf.read_csv when lines are needed to be skipped (#13555) @galipremsagar
  • Fix out-of-bounds memory write in cudf::dictionary::detail::concatenate (#13554) @davidwendt
  • Fix the null mask size in json reader (#13537) @karthikeyann
  • Fix cudf::strings::strip for all-empty input column (#13533) @davidwendt
  • Make sure to build without isolation or installing dependencies (#13524) @vyasr
  • Remove preload lib from CMake for now (#13519) @vyasr
  • Fix missing separator after null values in JSON writer (#13503) @karthikeyann
  • Ensure single_lane_block_sum_reduce is safe to call in a loop (#13488) @wence-
  • Update all versions in pyproject.toml files. (#13486) @bdice
  • Remove applying nvbench that doesn't exist in 23.08 (#13484) @robertmaynard
  • Fix chunked Parquet reader benchmark (#13482) @vuule
  • Update JNI JSON reader column compatibility for Spark (#13477) @revans2
  • Fix unsanitized output of scan with strings (#13455) @davidwendt
  • Reject functions without bytecode from _can_be_jitted in GroupBy Apply (#13429) @brandon-b-miller
  • Fix decimal scale reductions in _get_decimal_type (#13224) @charlesbluca

📖 Documentation

  • Fix doxygen groups for io data sources and sinks (#13718) @davidwendt
  • Add pandas compatibility note to DataFrame.query docstring (#13693) @beckernick
  • Add pylibcudf to developer guide (#13639) @vyasr
  • Fix repeated words in doxygen text (#13598) @karthikeyann
  • Update docs for top-level API. (#13592) @bdice
  • Fix the doxygen text for cudf::concatenate and other places (#13561) @davidwendt
  • Document stream validation approach used in testing (#13556) @vyasr
  • Cleanup doc repetitions in libcudf (#13470) @karthikeyann

🚀 New Features

  • Support min and max aggregations for list type in groupby and reduction (#13676) @ttnghia
  • Add nvtext::jaccard_index API for strings columns (#13669) @davidwendt
  • Add read_parquet_metadata libcudf API (#13663) @karthikeyann
  • Expose streams in all public copying APIs (#13629) @vyasr
  • Add XXHash_64 hash function to cudf (#13612) @davidwendt
  • Java support: Floating point order-by columns for RANGE window functions (#13595) @mythrocks
  • Use cuco::static_map to build string dictionaries in ORC writer (#13580) @vuule
  • Add pylibcudf subpackage with gather implementation (#13562) @vyasr
  • Add JNI for lists::concatenate_list_elements (#13547) @ttnghia
  • Enable nested types for lists::concatenate_list_elements (#13545) @ttnghia
  • Add unicode encoding for string columns in JSON writer (#13539) @karthikeyann
  • Remove numba kernels from find_index_of_val (#13517) @brandon-b-miller
  • Floating point order-by columns for RANGE window functions (#13512) @mythrocks
  • Parse column chunk metadata statistics in parquet reader (#13472) @karthikeyann
  • Add abs function to apply (#13408) @brandon-b-miller
  • [FEA] AST filtering in parquet reader (#13348) @karthikeyann
  • [FEA] Adds option to recover from invalid JSON lines in JSON tokenizer (#13344) @elstehle
  • Ensure cccl packages don't clash with upstream version (#13235) @robertmaynard
  • Update struct_minmax_util to experimental row comparator (#13069) @divyegala
  • Add stream parameter to hashing APIs (#12090) @vyasr

🛠️ Improvements

  • Pin dask and distributed for 23.08 release (#13802) @galipremsagar
  • Relax protobuf pinnings. (#13770) @bdice
  • Switch fully unbounded window functions to use aggregations (#13727) @mythrocks
  • Switch to new wheel building pipeline (#13723) @vyasr
  • Revert CUDA 12.0 CI workflows to branch-23.08. (#13719) @bdice
  • Adding identify minimum version requirement (#13713) @hyperbolic2346
  • Enforce deprecations and add clarifications around existing deprecations (#13710) @galipremsagar
  • Optimize ORC reader performance for list data (#13708) @vyasr
  • fix limit overflow message in a docstring (#13703) @ahmet-uyar
  • Alleviates JSON parser's need for multi-file sources to end with a newline (#13702) @elstehle
  • Update cython-lint and replace flake8 with ruff (#13699) @vyasr
  • Add __dask_tokenize__ definitions to cudf classes (#13695) @rjzamora
  • Convert libcudf hashing benchmarks to nvbench (#13694) @davidwendt
  • Separate MurmurHash32 from hash_functions.cuh (#13681) @davidwendt
  • Improve performance of cudf::strings::split on whitespace (#13680) @davidwendt
  • Allow ORC and Parquet writers to write nullable columns without nulls as non-nullable (#13675) @vuule
  • Raise a NotImplementedError in to_datetime when utc is passed (#13670) @shwina
  • Add rmm_mode parameter to nvbench base fixture (#13668) @davidwendt
  • Fix multiindex loc ordering in pandas-compat mode (#13660) @wence-
  • Add nvtext hash_character_ngrams function (#13654) @davidwendt
  • Avoid storing metadata in pointers in ORC and Parquet writers (#13648) @vuule
  • Acquire spill lock in to/from_arrow (#13646) @shwina
  • Expose stable versions of libcudf sort routines (#13634) @wence-
  • Separate out hash_test.cpp source for each hash API (#13633) @davidwendt
  • Remove deprecated cudf::strings::slice_strings (by delimiter) functions (#13628) @davidwendt
  • Create separate libcudf hash APIs for each supported hash function (#13626) @davidwendt
  • Add convert_dtypes API (#13623) @shwina
  • Clean up cupy in dependencies.yaml. (#13617) @bdice
  • Use cuda-version to constrain cudatoolkit. (#13615) @bdice
  • Add murmurhash3_x64_128 function to libcudf (#13604) @davidwendt
  • Performance improvement for cudf::strings::like (#13594) @davidwendt
  • Remove deprecated cudf.set_allocator. (#13591) @bdice
  • Clean up cudf device atomic with cuda::atomic_ref (#13583) @PointKernel
  • Add java bindings for distinct count (#13573) @revans2
  • Use nvcomp conda package. (#13566) @bdice
  • Add exception to string_scalar if input string exceeds size_type (#13560) @davidwendt
  • Add dispatch for cudf.Dataframe to/from pyarrow.Table conversion (#13558) @rjzamora
  • Get rid of cuco::pair_type aliases (#13553) @PointKernel
  • Introduce parity with pandas when sort=False in Groupby (#13551) @galipremsagar
  • Update CMake in docker to 3.26.4 (#13550) @NvTimLiu
  • Clarify source of error message in stream testing. (#13541) @bdice
  • Deprecate strings_to_categorical in cudf.read_parquet (#13540) @galipremsagar
  • Update to CMake 3.26.4 (#13538) @vyasr
  • s3 folder naming fix (#13536) @AyodeAwe
  • Implement iloc-getitem using parse-don't-validate approach (#13534) @wence-
  • Make synchronization explicit in the names of hostdevice_* copying APIs (#13530) @ttnghia
  • Add benchmark (Google Benchmark) dependency to conda packages. (#13528) @bdice
  • Add libcufile to dependencies.yaml. (#13523) @bdice
  • Fix some memoization logic in groupby/sort/sort_helper.cu (#13521) @davidwendt
  • Use sizes_to_offsets_iterator in cudf::gather for strings (#13520) @davidwendt
  • use rapids-upload-docs script (#13518) @AyodeAwe
  • Support UTF-8 BOM in CSV reader (#13516) @davidwendt
  • Move stream-related test configuration to CMake (#13513) @vyasr
  • Implement cudf.option_context (#13511) @galipremsagar
  • Unpin dask and distributed for development (#13508) @galipremsagar
  • Change build.sh to use pip install instead of setup.py (#13507) @vyasr
  • Use test default stream (#13506) @vyasr
  • Remove documentation build scripts for Jenkins (#13495) @ajschmidt8
  • Use east const in include files (#13494) @karthikeyann
  • Use east const in src files (#13493) @karthikeyann
  • Use east const in tests files (#13492) @karthikeyann
  • Use east const in benchmarks files (#13491) @karthikeyann
  • Performance improvement for nvtext tokenize/token functions (#13480) @davidwendt
  • Add pd.Float*Dtype to Avro and ORC mappings (#13475) @mroeschke
  • Use pandas public APIs where available (#13467) @mroeschke
  • Allow pd.ArrowDtype in cudf.from_pandas (#13465) @mroeschke
  • Rework libcudf regex benchmarks with nvbench (#13464) @davidwendt
  • Remove unused max_rows_tensor parameter from subword tokenizer (#13463) @davidwendt
  • Separate io-text and nvtext pytests into different files (#13435) @davidwendt
  • Add a move_to function to cudf::string_view::const_iterator (#13428) @davidwendt
  • Allow newer scikit-build (#13424) @vyasr
  • Refactor sort_by_values to sort_values, drop indices from return values. (#13419) @bdice
  • Inline Cython exception handler (#13411) @vyasr
  • Init JNI version 23.08.0-SNAPSHOT (#13401) @pxLi
  • Refactor ORC reader (#13396) @ttnghia
  • JNI: Remove cleaned objects in memory cleaner (#13378) @res-life
  • Add tests of currently unsupported indexing (#13338) @wence-
  • Performance improvement for some libcudf regex functions for long strings (#13322) @davidwendt
  • Exposure Tracked Buffer (first step towards unifying copy-on-write and spilling) (#13307) @madsbk
  • Write string data directly to column_buffer in Parquet reader (#13302) @etseidl
  • Add stacktrace into cudf exception types (#13298) @ttnghia
  • cuDF: Build CUDA 12 packages (#12922) @bdice

cudf - v23.06.01

Published by raydouglass over 1 year ago

🚨 Breaking Changes

  • Fix batch processing for parquet writer (#13438) @ttnghia
  • Use <NA> instead of null to match pandas. (#13415) @bdice
  • Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
  • Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
  • Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
  • Remove null mask and null count from column_view constructors (#13311) @vyasr
  • Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
  • Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
  • Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
  • Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
  • Update minimum Python version to Python 3.9 (#13196) @shwina
  • Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
  • Cleanup Parquet chunked writer (#13094) @ttnghia
  • Cleanup ORC chunked writer (#13091) @ttnghia
  • Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
  • Remove deprecated regex functions from libcudf (#13067) @davidwendt
  • [REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
  • Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🐛 Bug Fixes

  • Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
  • Fix writing of ORC files with empty rowgroups (#13466) @vuule
  • Fix cudf::repeat logic when count is zero (#13459) @davidwendt
  • Fix batch processing for parquet writer (#13438) @ttnghia
  • Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
  • Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
  • Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
  • Use <NA> instead of null to match pandas. (#13415) @bdice
  • Fix tokenize with non-space delimiter (#13403) @shwina
  • Fix groupby head/tail for empty dataframe (#13398) @shwina
  • Default to closed="right" in IntervalIndex constructor (#13394) @shwina
  • Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
  • Fix unused argument errors in nvcc 11.5 (#13387) @abellina
  • Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
  • Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
  • Fix page size estimation in Parquet writer (#13364) @etseidl
  • Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
  • Support gcc 12 as the C++ compiler (#13316) @robertmaynard
  • Correctly set bitmask size in from_column_view (#13315) @wence-
  • Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
  • Fix parquet schema interpretation issue (#13277) @hyperbolic2346
  • Fix 64bit shift bug in avro reader (#13276) @karthikeyann
  • Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
  • Clean up buffers in case AssertionError (#13262) @razajafri
  • Allow empty input table in ast compute_column (#13245) @wence-
  • Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
  • Fix the row index stream order in ORC reader (#13242) @vuule
  • Make is_decompression_disabled and is_compression_disabled thread-safe (#13240) @vuule
  • Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
  • Fix race in ORC string dictionary creation (#13214) @revans2
  • Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
  • Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
  • Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
  • Fix hostdevice_vector::subspan (#13187) @ttnghia
  • Use custom nvbench entry point to ensure cudf::nvbench_base_fixture usage (#13183) @robertmaynard
  • Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
  • Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
  • Fix a few clang-format style check errors (#13146) @davidwendt
  • [REVIEW] Fix Series and DataFrame constructors to validate index lengths (#13122) @galipremsagar
  • Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
  • Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
  • Adds checks to make sure json reader won't overflow (#13115) @elstehle
  • Fix null_count of columns returned by chunked_parquet_reader (#13111) @vuule
  • Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
  • [REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
  • Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
  • Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
  • Fix column selection read_parquet benchmarks (#13082) @vuule
  • Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
  • Add algorithm include in data_sink.hpp (#13068) @ahendriksen
  • Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
  • Prevent overflow with skip_rows in ORC and Parquet readers (#13063) @vuule
  • Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
  • [REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
  • Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
  • Fix read_avro() skip_rows and num_rows. (#12912) @tpn
  • Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
  • Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina

🚀 New Features

  • Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
  • Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
  • Use _compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
  • cuDF numba cuda 12 updates (#13337) @brandon-b-miller
  • Add tz_convert method to convert between timestamps (#13328) @shwina
  • Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
  • Support the case=False argument to str.contains (#13290) @shwina
  • Add an event handler for ColumnVector.close (#13279) @abellina
  • JNI api for cudf::chunked_pack (#13278) @abellina
  • Implement a chunked_pack API (#13260) @abellina
  • Update cudf recipes to use GTest version to >=1.13 (#13207) @robertmaynard
  • JNI changes for range-extents in window functions. (#13199) @mythrocks
  • Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
  • Add IS_NULL operator to AST (#13145) @karthikeyann
  • STRING order-by column for RANGE window functions (#13143) @mythrocks
  • Update contains_table to experimental row hasher and equality comparator (#13119) @divyegala
  • Automatically select GroupBy.apply algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
  • Refactor Parquet chunked writer (#13076) @ttnghia
  • Add Python bindings for string literal support in AST (#13073) @karthikeyann
  • Add Java bindings for string literal support in AST (#13072) @karthikeyann
  • Add string scalar support in AST (#13061) @karthikeyann
  • Log cuIO warnings using the libcudf logger (#13043) @vuule
  • Update mixed_join to use experimental row hasher and comparator (#13028) @divyegala
  • Support structs of lists in row lexicographic comparator (#13005) @ttnghia
  • Adding hostdevice_span that is a span createable from hostdevice_vector (#12981) @hyperbolic2346
  • Add nvtext::minhash function (#12961) @davidwendt
  • Support lists of structs in row lexicographic comparator (#12953) @ttnghia
  • Update join to use experimental row hasher and comparator (#12787) @divyegala
  • Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🛠️ Improvements

  • Bump typing_extensions minimum version to 4.0.0 (#13618) @shwina
  • Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
  • Handle some corner-cases in indexing with boolean masks (#13402) @wence-
  • Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
  • [JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
  • Fix JNI method with mismatched parameter list (#13384) @ttnghia
  • Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
  • Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
  • Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
  • Move some nvtext benchmarks to nvbench (#13368) @davidwendt
  • run docs nightly too (#13366) @AyodeAwe
  • Add warning for default dtype parameter in get_dummies (#13365) @galipremsagar
  • Add log messages about kvikIO compatibility mode (#13363) @vuule
  • Switch back to using primary shared-action-workflows branch (#13362) @vyasr
  • Deprecate StringIndex and use Index instead (#13361) @galipremsagar
  • Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
  • Expunge most uses of TypeVar(bound="Foo") (#13346) @wence-
  • Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
  • Improve distinct_count with cuco::static_set (#13343) @PointKernel
  • Fix contiguous_split performance (#13342) @ttnghia
  • Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
  • Update mypy to 1.3 (#13340) @wence-
  • [Java] Purge non-empty nulls when setting validity (#13335) @razajafri
  • Add row-wise filtering step to read_parquet (#13334) @rjzamora
  • Performance improvement for nvtext::minhash (#13333) @davidwendt
  • Fix some libcudf functions to set the null count on returning columns (#13331) @davidwendt
  • Change cudf::detail::concatenate_masks to return null-count (#13330) @davidwendt
  • Move meta calculation in dask_cudf.read_parquet (#13327) @rjzamora
  • Changes to support Numpy >= 1.24 (#13325) @shwina
  • Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
  • Clean up distinct_count benchmark (#13321) @PointKernel
  • Fix gtest pinning to 1.13.0. (#13319) @bdice
  • Remove null mask and null count from column_view constructors (#13311) @vyasr
  • Address feedback from 13289 (#13306) @vyasr
  • Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
  • First check for BaseDtype when inferring the data type of an arbitrary object (#13295) @shwina
  • Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
  • Support CUDA 12.0 for pip wheels (#13289) @divyegala
  • Refactor transform_lists_of_structs in row_operators.cu (#13288) @ttnghia
  • Branch 23.06 merge 23.04 (#13286) @vyasr
  • Update cupy dependency (#13284) @vyasr
  • Performance improvement in cudf::strings::join_strings for long strings (#13283) @davidwendt
  • Fix unused variables and functions (#13275) @karthikeyann
  • Fix integer overflow in partition scatter_map construction (#13272) @wence-
  • Numba 0.57 compatibility fixes (#13271) @gmarkall
  • Performance improvement in cudf::strings::all_characters_of_type (#13259) @davidwendt
  • Remove default null-count parameter from some libcudf factory functions (#13258) @davidwendt
  • Roll our own generate_string() because mimesis' has gone away (#13257) @shwina
  • Build wheels using new single image workflow (#13249) @vyasr
  • Enable sccache hits from local builds (#13248) @AyodeAwe
  • Revert to branch-23.06 for shared-action-workflows (#13247) @shwina
  • Introduce pandas_compatible option in cudf (#13241) @galipremsagar
  • Add metadata_builder helper class (#13232) @abellina
  • Use libkvikio conda packages in libcudf, add explicit libcufile dependency. (#13231) @bdice
  • Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
  • Performance improvement in cudf::strings::find/rfind for long strings (#13226) @davidwendt
  • Add chunked reader benchmark (#13223) @SrikarVanavasam
  • Set the null count in output columns in the CSV reader (#13221) @vuule
  • Skip Non-Empty nulls tests for the nightly build just like we skip CuFileTest and CudaFatalTest (#13213) @razajafri
  • Fix string_scalar stream usage in write_json.cu (#13212) @davidwendt
  • Use canonicalized name for dlopen'd libraries (libcufile) (#13210) @shwina
  • Refactor pinned memory vector and ORC+Parquet writers (#13206) @ttnghia
  • Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
  • Optimization to decoding of parquet level streams (#13203) @nvdbaranec
  • Clean up and simplify gpuDecideCompression (#13202) @vuule
  • Use std::array for a statically sized vector in create_serialized_trie (#13201) @vuule
  • Update minimum Python version to Python 3.9 (#13196) @shwina
  • Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
  • Remove usage of rapids-get-rapids-version-from-git (#13184) @jjacobelli
  • Enable mixed-dtype decimal/scalar binary operations (#13171) @shwina
  • Split up unique_count.cu to improve build time (#13169) @davidwendt
  • Use nvtx3 includes in string examples. (#13165) @bdice
  • Change some .cu gtest files to .cpp (#13155) @davidwendt
  • Remove wheel pytest verbosity (#13151) @sevagh
  • Fix libcudf to always pass null-count to set_null_mask (#13149) @davidwendt
  • Fix gtests to always pass null-count to set_null_mask calls (#13148) @davidwendt
  • Optimize JSON writer (#13144) @karthikeyann
  • Performance improvement for libcudf upper/lower conversion for long strings (#13142) @davidwendt
  • [REVIEW] Deprecate pad and backfill methods (#13140) @galipremsagar
  • Use CTAD instead of functions in ProtobufReader (#13135) @vuule
  • Remove more instances of UNKNOWN_NULL_COUNT (#13134) @vyasr
  • Update clang-format to 16.0.1. (#13133) @bdice
  • Add log messages about cuIO's nvCOMP and cuFile use (#13132) @vuule
  • Branch 23.06 merge 23.04 (#13131) @vyasr
  • Compute null-count in cudf::detail::slice (#13124) @davidwendt
  • Use ARC V2 self-hosted runners for GPU jobs (#13123) @jjacobelli
  • Set null-count in linked_column_view conversion operator (#13121) @davidwendt
  • Adding ifdefs around nvcc-specific pragmas (#13110) @hyperbolic2346
  • Add null-count parameter to json experimental parse_data utility (#13107) @davidwendt
  • Remove uses-setup-env-vars (#13105) @vyasr
  • Explicitly compute null count in concatenate APIs (#13104) @vyasr
  • Replace unnecessary uses of UNKNOWN_NULL_COUNT (#13102) @vyasr
  • Performance improvement for cudf::string_view::find functions (#13100) @davidwendt
  • Use .element() instead of .data() for window range calculations (#13095) @mythrocks
  • Cleanup Parquet chunked writer (#13094) @ttnghia
  • Fix unused variable error/warning in page_data.cu (#13093) @davidwendt
  • Cleanup ORC chunked writer (#13091) @ttnghia
  • Remove using namespace cudf; from libcudf gtests source (#13089) @davidwendt
  • Change cudf::test::make_null_mask to also return null-count (#13081) @davidwendt
  • Resolved automerger from branch-23.04 to branch-23.06 (#13080) @galipremsagar
  • Assert for non-empty nulls (#13071) @razajafri
  • Remove deprecated regex functions from libcudf (#13067) @davidwendt
  • Refactor cudf::detail::sorted_order (#13062) @ttnghia
  • Improve performance of slice_strings for long strings (#13057) @davidwendt
  • Reduce shared memory usage in gpuComputePageSizes by 50% (#13047) @nvdbaranec
  • [REVIEW] Add notes to performance comparisons notebook (#13044) @galipremsagar
  • Enable binary operations between scalars and columns of differing decimal types (#13034) @shwina
  • Remove console output from some libcudf gtests (#13027) @davidwendt
  • Remove underscore in build string. (#13025) @bdice
  • Bump up JNI version 23.06.0-SNAPSHOT (#13021) @pxLi
  • Fix auto merger from branch-23.04 to branch-23.06 (#13009) @galipremsagar
  • Reduce peak memory use when writing compressed ORC files. (#12963) @vuule
  • Add nvtx annotations to groupby methods (#12941) @wence-
  • Compute column sizes in Parquet preprocess with single kernel (#12931) @SrikarVanavasam
  • Add Python bindings for time zone data (TZiF) reader (#12826) @shwina
  • Optimize set-like operations (#12769) @ttnghia
  • [REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
  • Add empty test files for test reorganization (#12288) @shwina

cudf - v23.06.00

Published by raydouglass over 1 year ago

🚨 Breaking Changes

  • Fix batch processing for parquet writer (#13438) @ttnghia
  • Use <NA> instead of null to match pandas. (#13415) @bdice
  • Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
  • Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
  • Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
  • Remove null mask and null count from column_view constructors (#13311) @vyasr
  • Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
  • Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
  • Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
  • Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
  • Update minimum Python version to Python 3.9 (#13196) @shwina
  • Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
  • Cleanup Parquet chunked writer (#13094) @ttnghia
  • Cleanup ORC chunked writer (#13091) @ttnghia
  • Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
  • Remove deprecated regex functions from libcudf (#13067) @davidwendt
  • [REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
  • Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller
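
As a rough illustration of the "Use std::overflow_error when output would exceed column size limit" entry above, the sketch below mimics the idea in plain Python. It assumes the column size limit is the maximum of a signed 32-bit row index; the helper names `checked_total_size` and `try_concatenate` are hypothetical and are not part of cuDF or libcudf.

```python
# Assumption: the column size limit is the maximum of a signed 32-bit index.
SIZE_TYPE_MAX = 2**31 - 1


def checked_total_size(part_sizes):
    """Return the combined size, or raise OverflowError if it exceeds the limit."""
    total = sum(part_sizes)
    if total > SIZE_TYPE_MAX:
        raise OverflowError(
            f"output size {total} exceeds the column size limit {SIZE_TYPE_MAX}"
        )
    return total


def try_concatenate(part_sizes):
    """Hypothetical caller that reacts to the specific exception type."""
    try:
        return checked_total_size(part_sizes)
    except OverflowError:
        # A real caller might fall back to processing smaller batches here.
        return None
```

The point of a dedicated exception type is that callers can catch the overflow case specifically instead of treating every failure the same way.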

🐛 Bug Fixes

  • Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
  • Fix writing of ORC files with empty rowgroups (#13466) @vuule
  • Fix cudf::repeat logic when count is zero (#13459) @davidwendt
  • Fix batch processing for parquet writer (#13438) @ttnghia
  • Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
  • Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
  • Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
  • Use <NA> instead of null to match pandas. (#13415) @bdice
  • Fix tokenize with non-space delimiter (#13403) @shwina
  • Fix groupby head/tail for empty dataframe (#13398) @shwina
  • Default to closed="right" in IntervalIndex constructor (#13394) @shwina
  • Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
  • Fix unused argument errors in nvcc 11.5 (#13387) @abellina
  • Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
  • Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
  • Fix page size estimation in Parquet writer (#13364) @etseidl
  • Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
  • Support gcc 12 as the C++ compiler (#13316) @robertmaynard
  • Correctly set bitmask size in from_column_view (#13315) @wence-
  • Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
  • Fix parquet schema interpretation issue (#13277) @hyperbolic2346
  • Fix 64bit shift bug in avro reader (#13276) @karthikeyann
  • Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
  • Clean up buffers in case AssertionError (#13262) @razajafri
  • Allow empty input table in ast compute_column (#13245) @wence-
  • Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
  • Fix the row index stream order in ORC reader (#13242) @vuule
  • Make is_decompression_disabled and is_compression_disabled thread-safe (#13240) @vuule
  • Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
  • Fix race in ORC string dictionary creation (#13214) @revans2
  • Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
  • Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
  • Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
  • Fix hostdevice_vector::subspan (#13187) @ttnghia
  • Use custom nvbench entry point to ensure cudf::nvbench_base_fixture usage (#13183) @robertmaynard
  • Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
  • Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
  • Fix a few clang-format style check errors (#13146) @davidwendt
  • [REVIEW] Fix Series and DataFrame constructors to validate index lengths (#13122) @galipremsagar
  • Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
  • Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
  • Adds checks to make sure json reader won't overflow (#13115) @elstehle
  • Fix null_count of columns returned by chunked_parquet_reader (#13111) @vuule
  • Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
  • [REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
  • Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
  • Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
  • Fix column selection read_parquet benchmarks (#13082) @vuule
  • Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
  • Add algorithm include in data_sink.hpp (#13068) @ahendriksen
  • Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
  • Prevent overflow with skip_rows in ORC and Parquet readers (#13063) @vuule
  • Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
  • [REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
  • Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
  • Fix read_avro() skip_rows and num_rows. (#12912) @tpn
  • Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
  • Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina
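
The "Fix slice_strings to return empty strings for stop < start indices" entry above describes a small edge case that is easy to state in code. The following plain-Python sketch shows the intended semantics only, assuming non-negative indices; `slice_string` is a hypothetical helper, not the libcudf function.

```python
def slice_string(s, start, stop):
    """Return s[start:stop], or an empty string when stop precedes start.

    Assumes start and stop are non-negative character positions.
    """
    if stop < start:
        return ""
    return s[start:stop]
```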

🚀 New Features

  • Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
  • Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
  • Use _compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
  • cuDF numba cuda 12 updates (#13337) @brandon-b-miller
  • Add tz_convert method to convert between timestamps (#13328) @shwina
  • Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
  • Support the case=False argument to str.contains (#13290) @shwina
  • Add an event handler for ColumnVector.close (#13279) @abellina
  • JNI api for cudf::chunked_pack (#13278) @abellina
  • Implement a chunked_pack API (#13260) @abellina
  • Update cudf recipes to use GTest version to >=1.13 (#13207) @robertmaynard
  • JNI changes for range-extents in window functions. (#13199) @mythrocks
  • Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
  • Add IS_NULL operator to AST (#13145) @karthikeyann
  • STRING order-by column for RANGE window functions (#13143) @mythrocks
  • Update contains_table to experimental row hasher and equality comparator (#13119) @divyegala
  • Automatically select GroupBy.apply algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
  • Refactor Parquet chunked writer (#13076) @ttnghia
  • Add Python bindings for string literal support in AST (#13073) @karthikeyann
  • Add Java bindings for string literal support in AST (#13072) @karthikeyann
  • Add string scalar support in AST (#13061) @karthikeyann
  • Log cuIO warnings using the libcudf logger (#13043) @vuule
  • Update mixed_join to use experimental row hasher and comparator (#13028) @divyegala
  • Support structs of lists in row lexicographic comparator (#13005) @ttnghia
  • Adding hostdevice_span that is a span createable from hostdevice_vector (#12981) @hyperbolic2346
  • Add nvtext::minhash function (#12961) @davidwendt
  • Support lists of structs in row lexicographic comparator (#12953) @ttnghia
  • Update join to use experimental row hasher and comparator (#12787) @divyegala
  • Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller
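
The last entry above switches Python `drop_duplicates` to `cudf::stable_distinct`. A "stable" distinct keeps surviving rows in their original input order. The sketch below illustrates that semantic in plain Python for the keep-first case only; it is a conceptual model, not the libcudf implementation, and the function name is reused here purely for illustration.

```python
def stable_distinct(rows):
    """Keep the first occurrence of each row, preserving input order."""
    seen = set()
    kept = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            kept.append(row)
    return kept
```

Rows must be hashable in this sketch, so multi-column rows can be modelled as tuples, for example `stable_distinct([(1, "a"), (2, "b"), (1, "a")])`.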

🛠️ Improvements

  • Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
  • Handle some corner-cases in indexing with boolean masks (#13402) @wence-
  • Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
  • [JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
  • Fix JNI method with mismatched parameter list (#13384) @ttnghia
  • Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
  • Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
  • Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
  • Move some nvtext benchmarks to nvbench (#13368) @davidwendt
  • run docs nightly too (#13366) @AyodeAwe
  • Add warning for default dtype parameter in get_dummies (#13365) @galipremsagar
  • Add log messages about kvikIO compatibility mode (#13363) @vuule
  • Switch back to using primary shared-action-workflows branch (#13362) @vyasr
  • Deprecate StringIndex and use Index instead (#13361) @galipremsagar
  • Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
  • Expunge most uses of TypeVar(bound="Foo") (#13346) @wence-
  • Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
  • Improve distinct_count with cuco::static_set (#13343) @PointKernel
  • Fix contiguous_split performance (#13342) @ttnghia
  • Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
  • Update mypy to 1.3 (#13340) @wence-
  • [Java] Purge non-empty nulls when setting validity (#13335) @razajafri
  • Add row-wise filtering step to read_parquet (#13334) @rjzamora
  • Performance improvement for nvtext::minhash (#13333) @davidwendt
  • Fix some libcudf functions to set the null count on returning columns (#13331) @davidwendt
  • Change cudf::detail::concatenate_masks to return null-count (#13330) @davidwendt
  • Move meta calculation in dask_cudf.read_parquet (#13327) @rjzamora
  • Changes to support Numpy >= 1.24 (#13325) @shwina
  • Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
  • Clean up distinct_count benchmark (#13321) @PointKernel
  • Fix gtest pinning to 1.13.0. (#13319) @bdice
  • Remove null mask and null count from column_view constructors (#13311) @vyasr
  • Address feedback from 13289 (#13306) @vyasr
  • Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
  • First check for BaseDtype when inferring the data type of an arbitrary object (#13295) @shwina
  • Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
  • Support CUDA 12.0 for pip wheels (#13289) @divyegala
  • Refactor transform_lists_of_structs in row_operators.cu (#13288) @ttnghia
  • Branch 23.06 merge 23.04 (#13286) @vyasr
  • Update cupy dependency (#13284) @vyasr
  • Performance improvement in cudf::strings::join_strings for long strings (#13283) @davidwendt
  • Fix unused variables and functions (#13275) @karthikeyann
  • Fix integer overflow in partition scatter_map construction (#13272) @wence-
  • Numba 0.57 compatibility fixes (#13271) @gmarkall
  • Performance improvement in cudf::strings::all_characters_of_type (#13259) @davidwendt
  • Remove default null-count parameter from some libcudf factory functions (#13258) @davidwendt
  • Roll our own generate_string() because mimesis' has gone away (#13257) @shwina
  • Build wheels using new single image workflow (#13249) @vyasr
  • Enable sccache hits from local builds (#13248) @AyodeAwe
  • Revert to branch-23.06 for shared-action-workflows (#13247) @shwina
  • Introduce pandas_compatible option in cudf (#13241) @galipremsagar
  • Add metadata_builder helper class (#13232) @abellina
  • Use libkvikio conda packages in libcudf, add explicit libcufile dependency. (#13231) @bdice
  • Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
  • Performance improvement in cudf::strings::find/rfind for long strings (#13226) @davidwendt
  • Add chunked reader benchmark (#13223) @SrikarVanavasam
  • Set the null count in output columns in the CSV reader (#13221) @vuule
  • Skip Non-Empty nulls tests for the nightly build just like we skip CuFileTest and CudaFatalTest (#13213) @razajafri
  • Fix string_scalar stream usage in write_json.cu (#13212) @davidwendt
  • Use canonicalized name for dlopen'd libraries (libcufile) (#13210) @shwina
  • Refactor pinned memory vector and ORC+Parquet writers (#13206) @ttnghia
  • Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
  • Optimization to decoding of parquet level streams (#13203) @nvdbaranec
  • Clean up and simplify gpuDecideCompression (#13202) @vuule
  • Use std::array for a statically sized vector in create_serialized_trie (#13201) @vuule
  • Update minimum Python version to Python 3.9 (#13196) @shwina
  • Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
  • Remove usage of rapids-get-rapids-version-from-git (#13184) @jjacobelli
  • Enable mixed-dtype decimal/scalar binary operations (#13171) @shwina
  • Split up unique_count.cu to improve build time (#13169) @davidwendt
  • Use nvtx3 includes in string examples. (#13165) @bdice
  • Change some .cu gtest files to .cpp (#13155) @davidwendt
  • Remove wheel pytest verbosity (#13151) @sevagh
  • Fix libcudf to always pass null-count to set_null_mask (#13149) @davidwendt
  • Fix gtests to always pass null-count to set_null_mask calls (#13148) @davidwendt
  • Optimize JSON writer (#13144) @karthikeyann
  • Performance improvement for libcudf upper/lower conversion for long strings (#13142) @davidwendt
  • [REVIEW] Deprecate pad and backfill methods (#13140) @galipremsagar
  • Use CTAD instead of functions in ProtobufReader (#13135) @vuule
  • Remove more instances of UNKNOWN_NULL_COUNT (#13134) @vyasr
  • Update clang-format to 16.0.1. (#13133) @bdice
  • Add log messages about cuIO's nvCOMP and cuFile use (#13132) @vuule
  • Branch 23.06 merge 23.04 (#13131) @vyasr
  • Compute null-count in cudf::detail::slice (#13124) @davidwendt
  • Use ARC V2 self-hosted runners for GPU jobs (#13123) @jjacobelli
  • Set null-count in linked_column_view conversion operator (#13121) @davidwendt
  • Adding ifdefs around nvcc-specific pragmas (#13110) @hyperbolic2346
  • Add null-count parameter to json experimental parse_data utility (#13107) @davidwendt
  • Remove uses-setup-env-vars (#13105) @vyasr
  • Explicitly compute null count in concatenate APIs (#13104) @vyasr
  • Replace unnecessary uses of UNKNOWN_NULL_COUNT (#13102) @vyasr
  • Performance improvement for cudf::string_view::find functions (#13100) @davidwendt
  • Use .element() instead of .data() for window range calculations (#13095) @mythrocks
  • Cleanup Parquet chunked writer (#13094) @ttnghia
  • Fix unused variable error/warning in page_data.cu (#13093) @davidwendt
  • Cleanup ORC chunked writer (#13091) @ttnghia
  • Remove using namespace cudf; from libcudf gtests source (#13089) @davidwendt
  • Change cudf::test::make_null_mask to also return null-count (#13081) @davidwendt
  • Resolved automerger from branch-23.04 to branch-23.06 (#13080) @galipremsagar
  • Assert for non-empty nulls (#13071) @razajafri
  • Remove deprecated regex functions from libcudf (#13067) @davidwendt
  • Refactor cudf::detail::sorted_order (#13062) @ttnghia
  • Improve performance of slice_strings for long strings (#13057) @davidwendt
  • Reduce shared memory usage in gpuComputePageSizes by 50% (#13047) @nvdbaranec
  • [REVIEW] Add notes to performance comparisons notebook (#13044) @galipremsagar
  • Enable binary operations between scalars and columns of differing decimal types (#13034) @shwina
  • Remove console output from some libcudf gtests (#13027) @davidwendt
  • Remove underscore in build string. (#13025) @bdice
  • Bump up JNI version 23.06.0-SNAPSHOT (#13021) @pxLi
  • Fix auto merger from branch-23.04 to branch-23.06 (#13009) @galipremsagar
  • Reduce peak memory use when writing compressed ORC files. (#12963) @vuule
  • Add nvtx annotations to groupby methods (#12941) @wence-
  • Compute column sizes in Parquet preprocess with single kernel (#12931) @SrikarVanavasam
  • Add Python bindings for time zone data (TZiF) reader (#12826) @shwina
  • Optimize set-like operations (#12769) @ttnghia
  • [REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
  • Add empty test files for test reorganization (#12288) @shwina

cudf - v23.04.00

Published by raydouglass over 1 year ago

🚨 Breaking Changes

  • Pin dask and distributed for release (#13070) @galipremsagar
  • Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
  • Update minimum pandas and numpy pinnings (#12887) @galipremsagar
  • Deprecate names & dtype in Index.copy (#12825) @galipremsagar
  • Deprecate Index.is_* methods (#12820) @galipremsagar
  • Deprecate datetime_is_numeric from describe (#12818) @galipremsagar
  • Deprecate na_sentinel in factorize (#12817) @galipremsagar
  • Make string methods return a Series with a useful Index (#12814) @shwina
  • Produce useful guidance on overflow error in to_csv (#12705) @wence-
  • Move strings_udf code into cuDF (#12669) @brandon-b-miller
  • Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
  • Replace message parsing with throwing more specific exceptions (#12426) @vyasr
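
To make the "Replace message parsing with throwing more specific exceptions" entry above more concrete, here is a hedged plain-Python sketch of the general pattern. All class and function names are hypothetical; they show why a specific exception type is more robust than inspecting an error message, and do not reproduce cuDF's actual exception hierarchy.

```python
class LibraryError(RuntimeError):
    """Hypothetical generic error raised by a lower-level library."""


class UnsupportedTypeError(LibraryError, TypeError):
    """Hypothetical specific error that is also a TypeError."""


def translate_by_message(exc):
    """Fragile approach: pick the Python exception by parsing the message."""
    message = str(exc)
    if message.startswith("TypeError:"):
        return TypeError(message)
    return exc


def raise_specific(dtype_name):
    """Sturdier approach: raise the specific type at the source."""
    raise UnsupportedTypeError(f"unsupported dtype: {dtype_name}")
```

With the second approach a caller can write `except TypeError:` and no longer depends on the exact wording of the message.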

🐛 Bug Fixes

  • Fix memcheck script to execute only _TEST files found in bin/gtests/libcudf (#13006) @davidwendt
  • Fix DataFrame constructor to broadcast scalar inputs properly (#12997) @galipremsagar
  • Drop force_nullable_schema from chunked parquet writer (#12996) @galipremsagar
  • Fix gtest column utility comparator diff reporting (#12995) @davidwendt
  • Handle index names while performing groupby (#12992) @galipremsagar
  • Fix __setitem__ on string columns when the scalar value ends in a null byte (#12991) @wence-
  • Fix sort_values when column is all empty strings (#12988) @eriknw
  • Remove unused variable and fix memory issue in ORC writer (#12984) @ttnghia
  • Pre-emptive fix for upstream dask.dataframe.read_parquet changes (#12983) @rjzamora
  • Remove MANIFEST.in use auto-generated one for sdists and package_data for wheels (#12960) @vyasr
  • Update to use rapids-export(COMPONENTS) feature. (#12959) @robertmaynard
  • cudftestutil supports static gtest dependencies (#12957) @robertmaynard
  • Include gtest in build environment. (#12956) @vyasr
  • Correctly handle scalar indices in Index.__getitem__ (#12955) @wence-
  • Avoid building cython twice (#12945) @galipremsagar
  • Fix set index error for Series rolling window operations (#12942) @galipremsagar
  • Fix calculation of null counts for Parquet statistics (#12938) @etseidl
  • Preserve integer dtype of hive-partitioned column containing nulls (#12930) @rjzamora
  • Use get_current_device_resource for intermediate allocations in COLLECT_LIST window code (#12927) @karthikeyann
  • Mark dlpack tensor deleter as noexcept to match PyCapsule_Destructor signature. (#12921) @bdice
  • Fix conda recipe post-link.sh typo (#12916) @pentschev
  • min_rows and num_rows are swapped in ComputePageSizes declaration in Parquet reader (#12886) @etseidl
  • Expect cupy to now support bool arrays for dlpack. (#12883) @vyasr
  • Use python -m pytest for nightly wheel tests (#12871) @bdice
  • Parquet writer column_size() should return a size_t (#12870) @etseidl
  • Fix cudf::hash_partition kernel launch error with decimal128 types (#12863) @davidwendt
  • Fix an issue with parquet chunked reader undercounting string lengths. (#12859) @nvdbaranec
  • Remove tokenizers pre-install pinning. (#12854) @vyasr
  • Fix parquet RangeIndex bug (#12838) @rjzamora
  • Remove KAFKA_HOST_TEST from compute-sanitizer check (#12831) @davidwendt
  • Make string methods return a Series with a useful Index (#12814) @shwina
  • Tell cudf_kafka to use header-only fmt (#12796) @vyasr
  • Add GroupBy.dtypes (#12783) @galipremsagar
  • Fix a leak in a test and clarify some test names (#12781) @revans2
  • Fix bug in all-null list due to join_list_elements special handling (#12767) @karthikeyann
  • Add try/except for expected null-schema error in read_parquet (#12756) @rjzamora
  • Throw an exception if an unsupported page encoding is detected in Parquet reader (#12754) @etseidl
  • Fix a bug with num_keys in _scatter_by_slice (#12749) @thomcom
  • Bump pinned rapids wheel deps to 23.4 (#12735) @sevagh
  • Rework logic in cudf::strings::split_record to improve performance (#12729) @davidwendt
  • Add always_nullable flag to Dremel encoding (#12727) @divyegala
  • Fix memcheck read error in compound segmented reduce (#12722) @davidwendt
  • Fix faulty conditional logic in JIT GroupBy.apply (#12706) @brandon-b-miller
  • Produce useful guidance on overflow error in to_csv (#12705) @wence-
  • Handle parquet list data corner case (#12698) @nvdbaranec
  • Fix missing trailing comma in json writer (#12688) @karthikeyann
  • Remove child from newCudaAsyncMemoryResource (#12681) @abellina
  • Handle bool types in round API (#12670) @galipremsagar
  • Ensure all of device bitmask is initialized in from_arrow (#12668) @wence-
  • Fix from_arrow to load a sliced arrow table (#12665) @galipremsagar
  • Fix dask-cudf read_parquet bug for multi-file aggregation (#12663) @rjzamora
  • Fix AllocateLikeTest gtests reading uninitialized null-mask (#12643) @davidwendt
  • Fix find_common_dtype and values to handle complex dtypes (#12537) @galipremsagar
  • Fix fetching of MultiIndex values when a label is passed (#12521) @galipremsagar
  • Fix Series comparison vs scalars (#12519) @brandon-b-miller
  • Allow casting from UDFString back to StringView to call methods in strings_udf (#12363) @brandon-b-miller

📖 Documentation

  • Fix GroupBy.apply doc examples rendering (#12994) @brandon-b-miller
  • add sphinx building and s3 uploading for dask-cudf docs (#12982) @quasiben
  • Add developer documentation forbidding default parameters in detail APIs (#12978) @vyasr
  • Add README symlink for dask-cudf. (#12946) @bdice
  • Remove return type from @return doxygen tags (#12908) @davidwendt
  • Fix docs build to be pydata-sphinx-theme=0.13.0 compatible (#12874) @galipremsagar
  • Add skeleton API and prose documentation for dask-cudf (#12725) @wence-
  • Enable doctests for GroupBy methods (#12658) @brandon-b-miller
  • Add comment about CUB patch for SegmentedSortInt.Bool gtest (#12611) @davidwendt

🚀 New Features

  • Add JNI method for strings::replace multi variety (#12979) @NVnavkumar
  • Add nunique aggregation support for cudf::segmented_reduce (#12972) @davidwendt
  • Refactor orc chunked writer (#12949) @ttnghia
  • Make Parquet writer nullable option application to single table writes (#12933) @vuule
  • Refactor io::orc::ProtobufWriter (#12877) @ttnghia
  • Make timezone table independent from ORC (#12805) @vuule
  • Cache JIT GroupBy.apply functions (#12802) @brandon-b-miller
  • Implement initial support for avro logical types (#6482) (#12788) @tpn
  • Update tests/column_utilities to use experimental::equality row comparator (#12777) @divyegala
  • Update distinct/unique_count to experimental::row hasher/comparator (#12776) @divyegala
  • Update hash_partition to use experimental::row::row_hasher (#12761) @divyegala
  • Update is_sorted to use experimental::row::lexicographic (#12752) @divyegala
  • Update default data source in cuio reader benchmarks (#12740) @PointKernel
  • Reenable stream identification library in CI (#12714) @vyasr
  • Add regex_program strings splitting java APIs and tests (#12713) @cindyyuanjiang
  • Add regex_program strings replacing java APIs and tests (#12701) @cindyyuanjiang
  • Add regex_program strings extract java APIs and tests (#12699) @cindyyuanjiang
  • Variable fragment sizes for Parquet writer (#12685) @etseidl
  • Add segmented reduction support for fixed-point types (#12680) @davidwendt
  • Move strings_udf code into cuDF (#12669) @brandon-b-miller
  • Add regex_program searching APIs and related java classes (#12666) @cindyyuanjiang
  • Add logging to libcudf (#12637) @vuule
  • Add compound aggregations to cudf::segmented_reduce (#12573) @davidwendt
  • Convert rank to use experimental row comparators (#12481) @divyegala
  • Use rapids-cmake parallel testing feature (#12451) @robertmaynard
  • Enable detection of undesired stream usage (#12089) @vyasr

🛠️ Improvements

  • Pin dask and distributed for release (#13070) @galipremsagar
  • Pin cupy in wheel tests to supported versions (#13041) @vyasr
  • Pin numba version (#13001) @vyasr
  • Rework gtests SequenceTest to remove using namespace cudf (#12985) @davidwendt
  • Stop setting package version attribute in wheels (#12977) @vyasr
  • Move detail reduction functions to cudf::reduction::detail namespace (#12971) @davidwendt
  • Remove default detail mrs: part7 (#12970) @vyasr
  • Remove default detail mrs: part6 (#12969) @vyasr
  • Remove default detail mrs: part5 (#12968) @vyasr
  • Remove default detail mrs: part4 (#12967) @vyasr
  • Remove default detail mrs: part3 (#12966) @vyasr
  • Remove default detail mrs: part2 (#12965) @vyasr
  • Remove default detail mrs: part1 (#12964) @vyasr
  • Add force_nullable_schema parameter to Parquet writer. (#12952) @galipremsagar
  • Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
  • Remove remaining default stream parameters (#12943) @vyasr
  • Fix cudf::segmented_reduce gtest for ANY aggregation (#12940) @davidwendt
  • Implement groupby.head and groupby.tail (#12939) @wence-
  • Fix libcudf gtests to pass null-count=0 for empty validity masks (#12923) @davidwendt
  • Migrate parquet encoding to use experimental row operators (#12918) @PointKernel
  • Fix benchmarks coded in namespace cudf and using namespace cudf (#12915) @karthikeyann
  • Fix io/text gtests coded in namespace cudf::test (#12914) @karthikeyann
  • Pass SCCACHE_S3_USE_SSL to conda builds (#12910) @ajschmidt8
  • Fix FST, JSON gtests & benchmarks coded in namespace cudf::test (#12907) @karthikeyann
  • Generate pyproject dependencies using dfg (#12906) @vyasr
  • Update libcudf counting functions to specify cudf::size_type (#12904) @davidwendt
  • Fix moto env vars & pass AWS_SESSION_TOKEN to conda builds (#12902) @ajschmidt8
  • Rewrite CSV writer benchmark with nvbench (#12901) @PointKernel
  • Rework some code logic to reduce iterator and comparator inlining to improve compile time (#12900) @davidwendt
  • Deprecate line_terminator in favor of lineterminator in to_csv (#12896) @wence-
  • Add stream and mr parameters for structs::detail::flatten_nested_columns (#12892) @ttnghia
  • Deprecate libcudf regex APIs accepting pattern strings directly (#12891) @davidwendt
  • Remove default parameters from detail headers in include (#12888) @vyasr
  • Update minimum pandas and numpy pinnings (#12887) @galipremsagar
  • Implement groupby.sample (#12882) @wence-
  • Update JNI build ENV default to gcc 11 (#12881) @pxLi
  • Change return type of cudf::structs::detail::flatten_nested_columns to smart pointer (#12878) @ttnghia
  • Fix passing seed parameter to MurmurHash3_32 in cudf::hash() function (#12875) @davidwendt
  • Remove manual artifact upload step in CI (#12869) @ajschmidt8
  • Update to GCC 11 (#12868) @bdice
  • Fix null hive-partition behavior in dask-cudf parquet (#12866) @rjzamora
  • Update to protobuf>=4.21.6,<4.22. (#12864) @bdice
  • Update RMM allocators (#12861) @pentschev
  • Improve performance for replace-multi for long strings (#12858) @davidwendt
  • Drop Python 3.7 handling for pickle protocol 4 (#12857) @jakirkham
  • Migrate as much as possible to pyproject.toml (#12850) @vyasr
  • Enable nbqa pre-commit hooks for isort and black. (#12848) @bdice
  • Setting a threshold for KvikIO IO (#12841) @madsbk
  • Update datasets download URL (#12840) @jjacobelli
  • Make docs builds less verbose (#12836) @AyodeAwe
  • Consolidate linter configs into pyproject.toml (#12834) @vyasr
  • Deprecate names & dtype in Index.copy (#12825) @galipremsagar
  • Deprecate inplace parameters in categorical methods (#12824) @galipremsagar
  • Add optional text file support to ninja-log utility (#12823) @davidwendt
  • Deprecate Index.is_* methods (#12820) @galipremsagar
  • Add dfg as a pre-commit hook (#12819) @vyasr
  • Deprecate datetime_is_numeric from describe (#12818) @galipremsagar
  • Deprecate na_sentinel in factorize (#12817) @galipremsagar
  • Shuffling read into a sub function in parquet read (#12809) @hyperbolic2346
  • Fixing parquet coalescing of reads (#12808) @hyperbolic2346
  • CI: Remove specification of manual stage for check_style.sh script. (#12803) @csadorf
  • Add compute-sanitizer github workflow action to nightly tests (#12800) @davidwendt
  • Enable groupby std and variance aggregation types in libcudf Debug build (#12799) @davidwendt
  • Expose seed argument to hash_values (#12795) @ayushdg
  • Fix groupby gtests coded in namespace cudf::test (#12784) @davidwendt
  • Improve performance for cudf::strings::count_characters for long strings (#12779) @davidwendt
  • Deallocate encoded data in ORC writer immediately after compression (#12770) @vuule
  • Stop force pulling fmt in nvbench. (#12768) @vyasr
  • Remove now redundant cuda initialization (#12758) @vyasr
  • Adds JSON reader, writer io benchmark (#12753) @karthikeyann
  • Use test paths relative to package directory. (#12751) @bdice
  • Add build metrics report as artifact to cpp-build workflow (#12750) @davidwendt
  • Add JNI methods for detecting and purging non-empty nulls from LIST and STRUCT (#12742) @razajafri
  • Stop using versioneer to manage versions (#12741) @vyasr
  • Reduce error handling verbosity in CI tests scripts (#12738) @AjayThorve
  • Reduce the number of test cases in multibyte_split benchmark (#12737) @PointKernel
  • Update shared workflow branches (#12733) @ajschmidt8
  • JNI switches to nested JSON reader (#12732) @res-life
  • Changing cudf::io::source_info to use cudf::host_span<std::byte> in a non-breaking form (#12730) @hyperbolic2346
  • Add nvbench environment class for initializing RMM in benchmarks (#12728) @davidwendt
  • Split C++ and Python build dependencies into separate lists. (#12724) @bdice
  • Add build dependencies to Java tests. (#12723) @bdice
  • Allow setting the seed argument for hash partition (#12715) @firestarman
  • Remove gpuCI scripts. (#12712) @bdice
  • Unpin dask and distributed for development (#12710) @galipremsagar
  • partition_by_hash(): use _split() (#12704) @madsbk
  • Remove DataFrame.quantiles from docs. (#12684) @bdice
  • Fast path for experimental::row::equality (#12676) @divyegala
  • Move date to build string in conda recipe (#12661) @ajschmidt8
  • Refactor reduction logic for fixed-point types (#12652) @davidwendt
  • Pay off some JNI RMM API tech debt (#12632) @revans2
  • Merge copy-on-write feature branch into branch-23.04 (#12619) @galipremsagar
  • Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
  • Pin cuda-nvrtc. (#12606) @bdice
  • Remove cudf::test::print calls from libcudf gtests (#12604) @davidwendt
  • Init JNI version 23.04.0-SNAPSHOT (#12599) @pxLi
  • Add performance benchmarks to user facing docs (#12595) @galipremsagar
  • Add docs build job (#12592) @AyodeAwe
  • Replace message parsing with throwing more specific exceptions (#12426) @vyasr
  • Support conversion to/from cudf in dask.dataframe.core.to_backend (#12380) @rjzamora

cudf - v23.02.00

Published by raydouglass over 1 year ago

🚨 Breaking Changes

  • Pin dask and distributed for release (#12695) @galipremsagar
  • Change ways to access ptr in Buffer (#12587) @galipremsagar
  • Remove column names (#12578) @vuule
  • Default cudf::io::read_json to nested JSON parser (#12544) @vuule
  • Switch engine=cudf to the new JSON reader (#12509) @galipremsagar
  • Add trailing comma support for nested JSON reader (#12448) @karthikeyann
  • Upgrade to arrow-10.0.1 (#12327) @galipremsagar
  • Fail loudly to avoid data corruption with unsupported input in read_orc (#12325) @vuule
  • CSV, JSON reader to infer integer column with nulls as int64 instead of float64 (#12309) @karthikeyann
  • Remove deprecated code for 23.02 (#12281) @vyasr
  • Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
  • Purge non-empty nulls for superimpose_nulls and push_down_nulls (#12239) @ttnghia
  • Rename cudf::structs::detail::superimpose_parent_nulls APIs (#12230) @ttnghia
  • Remove JIT type names, refactor id_to_type. (#12158) @bdice
  • Floor division uses integer division for integral arguments (#12131) @wence-
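
The last entry above (#12131) makes floor division use integer division when both arguments are integral. As a rough illustration of why routing integers through floating point is risky, the plain-Python sketch below compares the two approaches for a value above 2**53, where float64 can no longer represent every integer. It does not call cuDF, and the helper name `floordiv_via_float` is hypothetical.

```python
import math


def floordiv_via_float(a: int, b: int) -> int:
    """Hypothetical helper: floor-divide by way of float64 true division."""
    return math.floor(a / b)


big = 2**53 + 1  # not exactly representable as a float64

exact = big // 1                     # integer floor division keeps every bit
lossy = floordiv_via_float(big, 1)   # rounds to a neighbouring float first

print(exact)
print(lossy)
print(exact == lossy)
```

The two results differ, which is the kind of silent precision loss that an integer-only code path avoids.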

🐛 Bug Fixes

  • Fix a mask data corruption in UDF (#12647) @galipremsagar
  • pre-commit: Update isort version to 5.12.0 (#12645) @wence-
  • tests: Skip cuInit tests if cuda-gdb is not found or not working (#12644) @wence-
  • Revert regex program java APIs and tests (#12639) @cindyyuanjiang
  • Fix leaks in ColumnVectorTest (#12625) @jlowe
  • Handle when spillable buffers own each other (#12607) @madsbk
  • Fix incorrect null counts for sliced columns in JCudfSerialization (#12589) @jlowe
  • lists: Transfer dtypes correctly through list.get (#12586) @wence-
  • timedelta: Don't go via float intermediates for floordiv (#12585) @wence-
  • Fixing BUG, get_next_chunk() should use the blocking function device_read() (#12584) @madsbk
  • Make JNI QuoteStyle accessible outside ai.rapids.cudf (#12572) @mythrocks
  • partition_by_hash(): support index (#12554) @madsbk
  • Mixed Join benchmark bug due to wrong conditional column (#12553) @divyegala
  • Update List Lexicographical Comparator (#12538) @divyegala
  • Dynamically read PTX version (#12534) @brandon-b-miller
  • build.sh switch to use RAPIDS magic value (#12525) @robertmaynard
  • Loosen runtime arrow pinning (#12522) @vyasr
  • Enable metadata transfer for complex types in transpose (#12491) @galipremsagar
  • Fix issues with parquet chunked reader (#12488) @nvdbaranec
  • Fix missing metadata transfer in concat for ListColumn (#12487) @galipremsagar
  • Rename libcudf substring source files to slice (#12484) @davidwendt
  • Fix compile issue with arrow 10 (#12465) @ttnghia
  • Fix List offsets bug in mixed type list column in nested JSON reader (#12447) @karthikeyann
  • Fix xfail incompatibilities (#12423) @vyasr
  • Fix bug in Parquet column index encoding (#12404) @etseidl
  • When building Arrow shared look for a shared OpenSSL (#12396) @robertmaynard
  • Fix get_json_object to return empty column on empty input (#12384) @davidwendt
  • Pin arrow 9 in testing dependencies to prevent conda solve issues (#12377) @vyasr
  • Fix reductions any/all return value for empty input (#12374) @davidwendt
  • Fix debug compile errors in parquet.hpp (#12372) @davidwendt
  • Purge non-empty nulls in cudf::make_lists_column (#12370) @ttnghia
  • Use correct memory resource in io::make_column (#12364) @vyasr
  • Add code to detect possible malformed page data in parquet files. (#12360) @nvdbaranec
  • Fail loudly to avoid data corruption with unsupported input in read_orc (#12325) @vuule
  • Fix NumericPairIteratorTest for float values (#12306) @davidwendt
  • Fixes memory allocation in nested JSON tokenizer (#12300) @elstehle
  • Reconstruct dtypes correctly for list aggs of struct columns (#12290) @wence-
  • Fix regex \A and \Z to strictly match string begin/end (#12282) @davidwendt
  • Fix compile issue in json_chunked_reader.cpp (#12280) @ttnghia
  • Change reductions any/all to return valid values for empty input (#12279) @davidwendt
  • Only exclude join keys that are indices from key columns (#12271) @wence-
  • Fix spill to device limit (#12252) @madsbk
  • Correct behaviour of sort in concat for singleton concatenations (#12247) @wence-
  • Purge non-empty nulls for superimpose_nulls and push_down_nulls (#12239) @ttnghia
  • Patch CUB DeviceSegmentedSort and remove workaround (#12234) @davidwendt
  • Fix memory leak in udf_string::assign(&&) function (#12206) @davidwendt
  • Workaround thrust-copy-if limit in json get_tree_representation (#12190) @davidwendt
  • Fix page size calculation in Parquet writer (#12182) @etseidl
  • Add cudf::detail::sizes_to_offsets_iterator to allow checking overflow in offsets (#12180) @davidwendt
  • Workaround thrust-copy-if limit in wordpiece-tokenizer (#12168) @davidwendt
  • Floor division uses integer division for integral arguments (#12131) @wence-
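
Entry #12180 above adds an iterator that allows checking for overflow while converting sizes to offsets. The general idea can be sketched in plain Python as an exclusive prefix sum over per-row sizes, followed by a check against the 32-bit limit of cudf::size_type. This is an illustrative sketch only, not the libcudf implementation; the function name and the exception type are assumptions.

```python
from itertools import accumulate

INT32_MAX = 2**31 - 1  # cudf::size_type is a 32-bit signed integer


def sizes_to_offsets(sizes):
    """Return len(sizes) + 1 offsets so that row i spans
    offsets[i]:offsets[i + 1]; raise if the total exceeds int32."""
    offsets = [0, *accumulate(sizes)]
    if offsets[-1] > INT32_MAX:
        raise OverflowError("total size exceeds the 32-bit offset limit")
    return offsets


print(sizes_to_offsets([3, 0, 2]))
```

The final offset is the total size, so checking that single value is enough to detect an overflow of the whole column.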

📖 Documentation

  • Fix link to NVTX (#12598) @sameerz
  • Include missing groupby functions in documentation (#12580) @quasiben
  • Fix documentation author (#12527) @bdice
  • Update libcudf reduction docs for casting output types (#12526) @davidwendt
  • Add JSON reader page in user guide (#12499) @GregoryKimball
  • Link unsupported iteration API docstrings (#12482) @galipremsagar
  • strings_udf doc update (#12469) @brandon-b-miller
  • Update cudf_assert docs with correct NDEBUG behavior (#12464) @robertmaynard
  • Update pre-commit hooks guide (#12395) @bdice
  • Update test docs to not use detail comparison utilities (#12332) @PointKernel
  • Fix doxygen description for regex_program::compute_working_memory_size (#12329) @davidwendt
  • Add eval to docs. (#12322) @vyasr
  • Turn on xfail_strict=true (#12244) @wence-
  • Update 10 minutes to cuDF (#12114) @wence-

🚀 New Features

  • Use kvikIO as the default IO backend (#12574) @vuule
  • Use has_nonempty_nulls instead of may_contain_non_empty_nulls in superimpose_nulls and push_down_nulls (#12560) @ttnghia
  • Add strings methods removeprefix and removesuffix (#12557) @davidwendt
  • Add regex_program java APIs and unit tests (#12548) @cindyyuanjiang
  • Default cudf::io::read_json to nested JSON parser (#12544) @vuule
  • Make string quoting optional on CSV write (#12539) @mythrocks
  • Use new nvCOMP API to optimize the compression temp memory size (#12533) @vuule
  • Support "values" orient (array of arrays) in Nested JSON reader (#12498) @karthikeyann
  • one_hot_encode to use experimental row comparators (#12478) @divyegala
  • Support %W and %w format specifiers in cudf::strings::to_timestamps (#12475) @davidwendt
  • Add JSON Writer (#12474) @karthikeyann
  • Refactor thrust_copy_if into cudf::detail::copy_if_safe (#12455) @ttnghia
  • Add trailing comma support for nested JSON reader (#12448) @karthikeyann
  • Extract tokenize_json.hpp detail header from src/io/json/nested_json.hpp (#12432) @ttnghia
  • JNI bindings to write CSV (#12425) @mythrocks
  • Nested JSON depth benchmark (#12371) @karthikeyann
  • Implement lists::reverse (#12336) @ttnghia
  • Use device_read in experimental read_json (#12314) @vuule
  • Implement JNI for strings::reverse (#12283) @ttnghia
  • Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
  • Add cudf::strings::like function with multiple patterns (#12269) @davidwendt
  • Add environment variable to control host memory allocation in hostdevice_vector (#12251) @vuule
  • Add cudf::strings::reverse function (#12227) @davidwendt
  • Selectively use dictionary encoding in Parquet writer (#12211) @etseidl
  • Support replace in strings_udf (#12207) @brandon-b-miller
  • Add support to read binary encoded decimals in parquet (#12205) @PointKernel
  • Support regex EOL where the string ends with a new-line character (#12181) @davidwendt
  • Updating stream_compaction/unique to use new row comparators (#12159) @divyegala
  • Add device buffer datasource (#12024) @PointKernel
  • Implement groupby apply with JIT (#11452) @bwyogatama
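
The removeprefix and removesuffix strings methods listed above (#12557) share their names with the Python built-ins `str.removeprefix` and `str.removesuffix`. The sketch below shows the element-wise behaviour one would expect if those semantics carry over, which is an assumption here. It uses plain Python (3.9 or later) and does not call cuDF.

```python
rows = ["cudf_io", "cudf_strings", "io"]

# Strip the prefix only where it is present; other rows are left unchanged.
without_prefix = [s.removeprefix("cudf_") for s in rows]

# Same idea for suffixes; the row "io" is consumed entirely.
without_suffix = [s.removesuffix("io") for s in rows]

print(without_prefix)
print(without_suffix)
```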

🛠️ Improvements

  • Update shared workflow branches (#12696) @ajschmidt8
  • Pin dask and distributed for release (#12695) @galipremsagar
  • Don't upload libcudf-example to Anaconda.org (#12671) @ajschmidt8
  • Pin wheel dependencies to same RAPIDS release (#12659) @sevagh
  • Use CTK 118/cp310 branch of wheel workflows (#12602) @sevagh
  • Change ways to access ptr in Buffer (#12587) @galipremsagar
  • Version a parquet writer xfail (#12579) @galipremsagar
  • Remove column names (#12578) @vuule
  • Parquet reader optimization to address V100 regression. (#12577) @nvdbaranec
  • Add support for category dtypes in CSV reader (#12571) @galipremsagar
  • Remove spill_lock parameter from SpillableBuffer.get_ptr() (#12564) @madsbk
  • Optimize cudf::make_lists_column (#12547) @ttnghia
  • Remove cudf::strings::repeat_strings_output_sizes from Java and JNI (#12546) @ttnghia
  • Test that cuInit is not called when RAPIDS_NO_INITIALIZE is set (#12545) @wence-
  • Rework repeat_strings to use sizes-to-offsets utility (#12543) @davidwendt
  • Replace exclusive_scan with sizes_to_offsets in cudf::lists::sequences (#12541) @davidwendt
  • Rework nvtext::ngrams_tokenize to use sizes-to-offsets utility (#12540) @davidwendt
  • Fix binary-ops gtests coded in namespace cudf::test (#12536) @davidwendt
  • More @acquire_spill_lock() and as_buffer(..., exposed=False) (#12535) @madsbk
  • Guard CUDA runtime APIs with error checking (#12531) @PointKernel
  • Update TODOs from issue 10432. (#12528) @bdice
  • Update rapids-cmake definitions version in GitHub Actions style checks. (#12511) @bdice
  • Switch engine=cudf to the new JSON reader (#12509) @galipremsagar
  • Fix SUM/MEAN aggregation type support. (#12503) @bdice
  • Stop using pandas._testing (#12492) @vyasr
  • Fix ROLLING_TEST gtests coded in namespace cudf::test (#12490) @davidwendt
  • Fix erroneously skipped ORC ZSTD test (#12486) @vuule
  • Rework nvtext::generate_character_ngrams to use make_strings_children (#12480) @davidwendt
  • Raise warnings as errors in the test suite (#12468) @vyasr
  • Remove int32 hard-coding in python (#12467) @galipremsagar
  • Use cudaMemcpyDefault. (#12466) @bdice
  • Update workflows for nightly tests (#12462) @ajschmidt8
  • Build CUDA 11.8 and Python 3.10 Packages (#12457) @ajschmidt8
  • JNI build image default as cuda11.8 (#12441) @pxLi
  • Re-enable Recently Updated Check (#12435) @ajschmidt8
  • Rework remaining cudf::strings::from_xyz functions to use make_strings_children (#12434) @vuule
  • Build wheels alongside conda CI (#12427) @sevagh
  • Remove arguments for checking exception messages in Python (#12424) @vyasr
  • Clean up cuco usage (#12421) @PointKernel
  • Fix warnings in remaining modules (#12406) @vyasr
  • Update ops-bot.yaml (#12402) @ajschmidt8
  • Rework cudf::strings::integers_to_ipv4 to use make_strings_children utility (#12401) @davidwendt
  • Use numpy.empty() instead of bytearray to allocate host memory for spilling (#12399) @madsbk
  • Deprecate chunksize from dask_cudf.read_csv (#12394) @rjzamora
  • Expose the RMM pool size in JNI (#12390) @revans2
  • Fix COPYING_TEST: gtests coded in namespace cudf::test (#12387) @davidwendt
  • Rework cudf::strings::url_encode to use make_strings_children utility (#12385) @davidwendt
  • Use make_strings_children in parse_data nested json reader (#12382) @karthikeyann
  • Fix warnings in test_datetime.py (#12381) @vyasr
  • Mixed Join Benchmarks (#12375) @divyegala
  • Fix warnings in dataframe.py (#12369) @vyasr
  • Update conda recipes. (#12368) @bdice
  • Use gpu-latest-1 runner tag (#12366) @bdice
  • Rework cudf::strings::from_booleans to use make_strings_children (#12365) @vuule
  • Fix warnings in test modules up to test_dataframe.py (#12355) @vyasr
  • JSON column performance optimization - struct column nulls (#12354) @karthikeyann
  • Accelerate stable-segmented-sort with CUB segmented sort (#12347) @davidwendt
  • Add size check to make_offsets_child_column utility (#12345) @davidwendt
  • Enable max compression ratio small block optimization for ZSTD (#12338) @vuule
  • Fix warnings in test_monotonic.py (#12334) @vyasr
  • Improve JSON column creation performance (list offsets) (#12330) @karthikeyann
  • Upgrade to arrow-10.0.1 (#12327) @galipremsagar
  • Fix warnings in test_orc.py (#12326) @vyasr
  • Fix warnings in test_groupby.py (#12324) @vyasr
  • Fix test_notebooks.sh (#12323) @ajschmidt8
  • Fix transform gtests coded in namespace cudf::test (#12321) @davidwendt
  • Fix check_style.sh script (#12320) @ajschmidt8
  • Rework cudf::strings::from_timestamps to use make_strings_children (#12317) @davidwendt
  • Fix warnings in test_index.py (#12313) @vyasr
  • Fix warnings in test_multiindex.py (#12310) @vyasr
  • CSV, JSON reader to infer integer column with nulls as int64 instead of float64 (#12309) @karthikeyann
  • Fix warnings in test_indexing.py (#12305) @vyasr
  • Fix warnings in test_joining.py (#12304) @vyasr
  • Unpin dask and distributed for development (#12302) @galipremsagar
  • Re-enable sccache for Jenkins builds (#12297) @ajschmidt8
  • Define needs for pr-builder workflow. (#12296) @bdice
  • Forward merge 22.12 into 23.02 (#12294) @vyasr
  • Fix warnings in test_stats.py (#12293) @vyasr
  • Fix table gtests coded in namespace cudf::test (#12292) @davidwendt
  • Change cython for regex calls to use cudf::strings::regex_program (#12289) @davidwendt
  • Improved error reporting when reading multiple JSON files (#12285) @vuule
  • Deprecate Frame.sum_of_squares (#12284) @vyasr
  • Remove deprecated code for 23.02 (#12281) @vyasr
  • Clean up handling of max_page_size_bytes in Parquet writer (#12277) @etseidl
  • Fix replace gtests coded in namespace cudf::test (#12270) @davidwendt
  • Add pandas nullable type support in Index.to_pandas (#12268) @galipremsagar
  • Rework nvtext::detokenize to use indexalator for row indices (#12267) @davidwendt
  • Fix reduction gtests coded in namespace cudf::test (#12257) @davidwendt
  • Remove default parameters from cudf::detail::sort function declarations (#12254) @davidwendt
  • Add duplicated support for Series, DataFrame and Index (#12246) @galipremsagar
  • Replace column/table test utilities with macros (#12242) @PointKernel
  • Rework cudf::strings::pad and zfill to use make_strings_children (#12238) @davidwendt
  • Fix sort gtests coded in namespace cudf::test (#12237) @davidwendt
  • Wrapping concat and file writes in @acquire_spill_lock() (#12232) @madsbk
  • Rename cudf::structs::detail::superimpose_parent_nulls APIs (#12230) @ttnghia
  • Cover parsing to decimal types in read_json tests (#12229) @vuule
  • Spill Statistics (#12223) @madsbk
  • Use CUDF_JNI_ENABLE_PROFILING to conditionally enable profiling support. (#12221) @bdice
  • Clean up of test_spilling.py (#12220) @madsbk
  • Simplify repetitive boolean logic (#12218) @vuule
  • Add Series.hasnans and Index.hasnans (#12214) @galipremsagar
  • Add cudf::strings::udf::replace function (#12210) @davidwendt
  • Adds in new java APIs for appending byte arrays to host columnar data (#12208) @revans2
  • Remove Python dependencies from Java CI. (#12193) @bdice
  • Fix null order in sort-based groupby and improve groupby tests (#12191) @divyegala
  • Move strings children functions from cudf/strings/detail/utilities.cuh to new header (#12185) @davidwendt
  • Clean up existing JNI scalar to column code (#12173) @revans2
  • Remove JIT type names, refactor id_to_type. (#12158) @bdice
  • Update JNI version to 23.02.0-SNAPSHOT (#12129) @pxLi
  • Minor refactor of cpp/src/io/parquet/page_data.cu (#12126) @etseidl
  • Add codespell as a linter (#12097) @benfred
  • Enable specifying exceptions in error macros (#12078) @vyasr
  • Move _label_encoding from Series to Column (#12040) @shwina
  • Add GitHub Actions Workflows (#12002) @ajschmidt8
  • Consolidate dask-cudf groupby_agg calls in one place (#10835) @charlesbluca

cudf - v22.12.01

Published by GPUtester almost 2 years ago

🚨 Breaking Changes

  • Add JNI for substring without 'end' parameter. (#12113) @firestarman
  • Refactor purge_nonempty_nulls (#12111) @ttnghia
  • Create an int8 column in read_csv when all elements are missing (#12110) @vuule
  • Throw an error when libcudf is built without cuFile and LIBCUDF_CUFILE_POLICY is set to "ALWAYS" (#12080) @vuule
  • Fix type promotion edge cases in numerical binops (#12074) @wence-
  • Reduce/Remove reliance on **kwargs and *args in IO readers & writers (#12025) @galipremsagar
  • Rollback of DeviceBufferLike (#12009) @madsbk
  • Remove unused managed_allocator (#12005) @vyasr
  • Pass column names to write_csv instead of table_metadata pointer (#11972) @vuule
  • Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
  • Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
  • Remove validation that requires introspection (#11938) @vyasr
  • Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
  • Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
  • Support nested types as groupby keys in libcudf (#11792) @PointKernel
  • Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
  • Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
  • part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source
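
Two entries above (#11952, #11621) change set-style aggregations to treat NaNs as equal by default. Under IEEE 754 rules NaN never compares equal to itself, so the choice changes how many NaN entries survive deduplication. The plain-Python sketch below illustrates the difference; `collect_set` and its `nans_equal` flag are hypothetical names for this example, not the libcudf API.

```python
import math


def collect_set(values, nans_equal=True):
    """Hypothetical helper: keep the first occurrence of each distinct value,
    optionally treating every NaN as equal to every other NaN."""
    out = []
    seen_nan = False
    for v in values:
        if isinstance(v, float) and math.isnan(v):
            if nans_equal and seen_nan:
                continue  # a NaN is already in the set
            seen_nan = True
            out.append(v)
        elif v not in out:
            out.append(v)
    return out


data = [1.0, float("nan"), 2.0, float("nan"), 1.0]

print(len(collect_set(data, nans_equal=True)))
print(len(collect_set(data, nans_equal=False)))
```

With NaNs treated as equal the result holds a single NaN; with NaNs treated as unequal every NaN is kept as its own element.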

🐛 Bug Fixes

  • strings_udf: use libcudf caching of character tables (#12343) @wence-
  • Fix include line for IO Cython modules (#12250) @vyasr
  • Make dask pinning looser (#12231) @vyasr
  • Workaround for CUB segmented-sort bug with boolean keys (#12217) @davidwendt
  • Fix from_dict backend dispatch to match upstream dask (#12203) @galipremsagar
  • Merge branch-22.10 into branch-22.12 (#12198) @davidwendt
  • Fix compression in ORC writer (#12194) @vuule
  • Don't use CMake 3.25.0 as it has a show stopping FindCUDAToolkit bug (#12188) @robertmaynard
  • Fix data corruption when reading ORC files with empty stripes (#12160) @vuule
  • Fix decimal binary operations (#12142) @galipremsagar
  • Ensure dlpack include is provided to cudf interop lib (#12139) @robertmaynard
  • Safely allocate udf_string pointers in strings_udf (#12138) @brandon-b-miller
  • Fix/disable jitify lto (#12122) @robertmaynard
  • Fix conditional_full_join benchmark (#12121) @GregoryKimball
  • Fix regex working-memory-size refactor error (#12119) @davidwendt
  • Add in negative size checks for columns (#12118) @revans2
  • Add JNI for substring without 'end' parameter. (#12113) @firestarman
  • Fix reading of CSV files with blank second row (#12098) @vuule
  • Fix an error in IO with GzipFile type (#12085) @galipremsagar
  • Workaround groupby aggregate thrust::copy_if overflow (#12079) @davidwendt
  • Fix alignment of compressed blocks in ORC writer (#12077) @vuule
  • Fix singleton-range __setitem__ edge case (#12075) @wence-
  • Fix type promotion edge cases in numerical binops (#12074) @wence-
  • Force using old fmt in nvbench. (#12067) @vyasr
  • Fixes List offset bug in Nested JSON reader (#12060) @karthikeyann
  • Allow falling back to shim_60.ptx by default in strings_udf (#12056) @brandon-b-miller
  • Force black exclusions for pre-commit. (#12036) @bdice
  • Add memory_usage & items implementation for Struct column & dtype (#12033) @galipremsagar
  • Reduce/Remove reliance on **kwargs and *args in IO readers & writers (#12025) @galipremsagar
  • Fixes bug in csv_reader_options construction in cython (#12021) @karthikeyann
  • Fix issues when both usecols and names options are used in read_csv (#12018) @vuule
  • Port thrust's pinned_allocator to cudf, since Thrust 1.17 removes the type (#12004) @robertmaynard
  • Revert "Replace most of preprocessor usage in nvcomp adapter with constexpr" (#11999) @vuule
  • Fix bug where df.loc resulting in single row could give wrong index (#11998) @eriknw
  • Switch to DISABLE_DEPRECATION_WARNINGS to match other RAPIDS projects (#11989) @robertmaynard
  • Fix maximum page size estimate in Parquet writer (#11962) @vuule
  • Fix local offset handling in bgzip reader (#11918) @upsj
  • Fix an issue reading struct-of-list types in Parquet. (#11910) @nvdbaranec
  • Fix memcheck error in TypeInference.Timestamp gtest (#11905) @davidwendt
  • Fix type casting in Series.setitem (#11904) @wence-
  • Fix memcheck error in get_dremel_data (#11903) @davidwendt
  • Fixes Unsupported column type error due to empty list columns in Nested JSON reader (#11897) @karthikeyann
  • Fix segmented-sort to ignore indices outside the offsets (#11888) @davidwendt
  • Fix cudf::stable_sorted_order for NaN and -NaN in FLOAT64 columns (#11874) @davidwendt
  • Fix writing of Parquet files with many fragments (#11869) @etseidl
  • Fix RangeIndex unary operators. (#11868) @vyasr
  • JNI Avoid NPE for reading host binary data (#11865) @revans2
  • Fix decimal benchmark input data generation (#11863) @karthikeyann
  • Fix pre-commit copyright check (#11860) @galipremsagar
  • Fix Parquet support for seconds and milliseconds duration types (#11854) @vuule
  • Ensure better compiler cache results between cudf cal-ver branches (#11835) @robertmaynard
  • Fix make_column_from_scalar for all-null strings column (#11807) @davidwendt
  • Tell jitify_preprocess where to search for libnvrtc (#11787) @robertmaynard
  • Add V2 page header support to parquet reader (#11778) @etseidl
  • Parquet reader: bug fix for a num_rows/skip_rows corner case, w/optimization for nested preprocessing (#11752) @nvdbaranec
  • Determine if Arrow has S3 support at runtime in unit test. (#11560) @bdice

📖 Documentation

  • Use rapidsai CODE_OF_CONDUCT.md (#12166) @bdice
  • Add symlinks to notebooks. (#12128) @bdice
  • Add truncate API to python doc pages (#12109) @galipremsagar
  • Update Numba docs links. (#12107) @bdice
  • Remove "Multi-GPU with Dask-cuDF" notebook. (#12095) @bdice
  • Fix link to c++ developer guide from CONTRIBUTING.md (#12084) @brandon-b-miller
  • Add pivot_table and crosstab to docs. (#12014) @bdice
  • Fix doxygen text for cudf::dictionary::encode (#11991) @davidwendt
  • Replace default_stream_value with get_default_stream in docs. (#11985) @vyasr
  • Add dtype docs pages and docstrings for cudf specific dtypes (#11974) @galipremsagar
  • Update Unit Testing in libcudf guidelines to code tests outside the cudf::test namespace (#11959) @davidwendt
  • Rename libcudf++ to libcudf. (#11953) @bdice
  • Fix documentation referring to removed as_gpu_matrix method. (#11937) @bdice
  • Remove "experimental" warning for struct columns in ORC reader and writer (#11880) @vuule
  • Initial draft of policies and guidelines for libcudf usage. (#11853) @vyasr
  • Add clear indication of non-GPU accelerated parameters in read_json docstring (#11825) @GregoryKimball
  • Add developer docs for writing tests (#11199) @vyasr

🚀 New Features

  • Adds an EventHandler to Java MemoryBuffer to be invoked on close (#12125) @abellina
  • Support + in strings_udf (#12117) @brandon-b-miller
  • Support upper and lower in strings_udf (#12099) @brandon-b-miller
  • Add wheel builds (#12096) @vyasr
  • Allow setting malloc heap size in string udfs (#12094) @brandon-b-miller
  • Support strip, lstrip, and rstrip in strings_udf (#12091) @brandon-b-miller
  • Mark nvcomp zstd compression stable (#12059) @jbrennan333
  • Add debug-only onAllocated/onDeallocated to RmmEventHandler (#12054) @abellina
  • Enable building against the libarrow contained in pyarrow (#12034) @vyasr
  • Add strings like jni and native method (#12032) @cindyyuanjiang
  • Cleanup common parsing code in JSON, CSV reader (#12022) @karthikeyann
  • byte_range support for JSON Lines format (#12017) @karthikeyann
  • Minor cleanup of root CMakeLists.txt for better organization (#11988) @robertmaynard
  • Add inplace arithmetic operators to MaskedType (#11987) @brandon-b-miller
  • Implement JNI for chunked Parquet reader (#11961) @ttnghia
  • Add method argument to DataFrame.quantile (#11957) @rjzamora
  • Add gpu memory watermark apis to JNI (#11950) @abellina
  • Adds retryCount to RmmEventHandler.onAllocFailure (#11940) @abellina
  • Enable returning string data from UDFs used through apply (#11933) @brandon-b-miller
  • Switch over to rapids-cmake patches for thrust (#11921) @robertmaynard
  • Add strings udf C++ classes and functions for phase II (#11912) @davidwendt
  • Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
  • Enable CEC for strings_udf (#11884) @brandon-b-miller
  • ArrowIPCTableWriter writes an empty batch in the case of an empty table. (#11883) @firestarman
  • Implement chunked Parquet reader (#11867) @ttnghia
  • Add read_orc_metadata to libcudf (#11815) @vuule
  • Support nested types as groupby keys in libcudf (#11792) @PointKernel
  • Adding feature Truncate to DataFrame and Series (#11435) @VamsiTallam95

🛠️ Improvements

  • Reduce number of tests marked spilling (#12197) @madsbk
  • Pin dask and distributed for release (#12165) @galipremsagar
  • Don't rely on GNU find in headers_test.sh (#12164) @wence-
  • Update cp.clip call (#12148) @quasiben
  • Enable automatic column projection in groupby().agg (#12124) @rjzamora
  • Refactor purge_nonempty_nulls (#12111) @ttnghia
  • Create an int8 column in read_csv when all elements are missing (#12110) @vuule
  • Spilling to host memory (#12106) @madsbk
  • First pass of pd.read_orc changes in tests (#12103) @galipremsagar
  • Expose engine argument in dask_cudf.read_json (#12101) @rjzamora
  • Remove CUDA 10 compatibility code. (#12088) @bdice
  • Move and update dask nightly install in CI (#12082) @galipremsagar
  • Throw an error when libcudf is built without cuFile and LIBCUDF_CUFILE_POLICY is set to "ALWAYS" (#12080) @vuule
  • Remove macros that inspect the contents of exceptions (#12076) @vyasr
  • Fix ingest_raw_data performance issue in Nested JSON reader due to RVO (#12070) @karthikeyann
  • Remove overflow error during decimal binops (#12063) @galipremsagar
  • Change cudf::detail::tdigest to cudf::tdigest::detail (#12050) @davidwendt
  • Fix quantile gtests coded in namespace cudf::test (#12049) @davidwendt
  • Add support for DataFrame.from_dict, to_dict, and Series.to_dict (#12048) @galipremsagar
  • Refactor Parquet reader (#12046) @ttnghia
  • Forward merge 22.10 into 22.12 (#12045) @vyasr
  • Standardize newlines at ends of files. (#12042) @bdice
  • Trim trailing whitespace from all files. (#12041) @bdice
  • Use nosync policy in gather and scatter implementations. (#12038) @bdice
  • Remove smart quotes from all docstrings. (#12035) @bdice
  • Update cuda-python dependency to 11.7.1 (#12030) @galipremsagar
  • Add cython-lint to pre-commit checks. (#12020) @bdice
  • Use pragma once (#12019) @bdice
  • New GHA to add issues/prs to project board (#12016) @jarmak-nv
  • Add DataFrame.pivot_table. (#12015) @bdice
  • Rollback of DeviceBufferLike (#12009) @madsbk
  • Remove default parameters for nvtext::detail functions (#12007) @davidwendt
  • Remove default parameters for cudf::dictionary::detail functions (#12006) @davidwendt
  • Remove unused managed_allocator (#12005) @vyasr
  • Remove default parameters for cudf::strings::detail functions (#12003) @davidwendt
  • Remove unnecessary code from dask-cudf _Frame (#12001) @rjzamora
  • Ignore python docs build artifacts (#12000) @galipremsagar
  • Use rapids-cmake for google benchmark. (#11997) @vyasr
  • Leverage rapids_cython for more automated RPATH handling (#11996) @vyasr
  • Remove stale labeler (#11995) @raydouglass
  • Move protobuf compilation to CMake (#11986) @vyasr
  • Replace most of preprocessor usage in nvcomp adapter with constexpr (#11980) @vuule
  • Add missing noexcepts to column_in_metadata methods (#11973) @vyasr
  • Pass column names to write_csv instead of table_metadata pointer (#11972) @vuule
  • Accelerate libcudf segmented sort with CUB segmented sort (#11969) @davidwendt
  • Feature/remove default streams (#11967) @vyasr
  • Add pool memory resource to libcudf basic example (#11966) @davidwendt
  • Fix some libcudf calls to cudf::detail::gather (#11963) @davidwendt
  • Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
  • Add deprecation warning for set_allocator. (#11958) @vyasr
  • Fix lists and structs gtests coded in namespace cudf::test (#11956) @davidwendt
  • Add full page indexes to Parquet writer benchmarks (#11955) @etseidl
  • Use gather-based strings factory in cudf::strings::strip (#11954) @davidwendt
  • Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
  • Add strip_delimiters option to read_text (#11946) @upsj
  • Refactor multibyte_split output_builder (#11945) @upsj
  • Remove validation that requires introspection (#11938) @vyasr
  • Add .str.find_multiple API (#11928) @galipremsagar
  • Add regex_program class for use with all regex APIs (#11927) @davidwendt
  • Enable backend dispatching for Dask-DataFrame creation (#11920) @rjzamora
  • Performance improvement in JSON Tree traversal (#11919) @karthikeyann
  • Fix some gtests incorrectly coded in namespace cudf::test (part I) (#11917) @davidwendt
  • Refactor pad/zfill functions for reuse with strings udf (#11914) @davidwendt
  • Add nanosecond & microsecond to DatetimeProperties (#11911) @galipremsagar
  • Pin mimesis version in setup.py. (#11906) @bdice
  • Error on ListColumn or any new unsupported column in cudf.Index (#11902) @galipremsagar
  • Add thrust output iterator fix (1805) to thrust.patch (#11900) @davidwendt
  • Relax codecov threshold diff (#11899) @galipremsagar
  • Use public APIs in STREAM_COMPACTION_NVBENCH (#11892) @GregoryKimball
  • Add coverage for string UDF tests. (#11891) @vyasr
  • Provide data_chunk_source wrapper for datasource (#11886) @upsj
  • Handle multibyte_split byte_range out-of-bounds offsets on host (#11885) @upsj
  • Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
  • Change expect_strings_empty into expect_column_empty libcudf test utility (#11873) @davidwendt
  • Add ngroup (#11871) @shwina
  • Reduce memory usage in nested JSON parser - tree generation (#11864) @karthikeyann
  • Unpin dask and distributed for development (#11859) @galipremsagar
  • Remove unused includes for table/row_operators (#11857) @GregoryKimball
  • Use conda-forge's pyorc (#11855) @jakirkham
  • Add libcudf strings examples (#11849) @davidwendt
  • Remove cudf_io namespace alias (#11827) @vuule
  • Test/remove thrust vector usage (#11813) @vyasr
  • Add BGZIP reader to python read_text (#11802) @upsj
  • Merge branch-22.10 into branch-22.12 (#11801) @davidwendt
  • Fix compile warning from CUDF_FUNC_RANGE in a member function (#11798) @davidwendt
  • Update cudf JNI version to 22.12.0-SNAPSHOT (#11764) @pxLi
  • Update flake8 to 5.0.4 and use flake8-force to check Cython. (#11736) @bdice
  • Add BGZIP multibyte_split benchmark (#11723) @upsj
  • Bifurcate Dependency Lists (#11674) @bdice
  • Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
  • Conform "bench_isin" to match generator column names (#11549) @GregoryKimball
  • Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
  • Add checks for HLG layers in dask-cudf groupby tests (#10853) @charlesbluca
  • part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source
  • Make all nvcc warnings into errors (#8916) @trxcllnt

cudf - v22.12.00

Published by GPUtester almost 2 years ago

🚨 Breaking Changes

  • Add JNI for substring without 'end' parameter. (#12113) @firestarman
  • Refactor purge_nonempty_nulls (#12111) @ttnghia
  • Create an int8 column in read_csv when all elements are missing (#12110) @vuule
  • Throw an error when libcudf is built without cuFile and LIBCUDF_CUFILE_POLICY is set to "ALWAYS" (#12080) @vuule
  • Fix type promotion edge cases in numerical binops (#12074) @wence-
  • Reduce/Remove reliance on **kwargs and *args in IO readers & writers (#12025) @galipremsagar
  • Rollback of DeviceBufferLike (#12009) @madsbk
  • Remove unused managed_allocator (#12005) @vyasr
  • Pass column names to write_csv instead of table_metadata pointer (#11972) @vuule
  • Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
  • Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
  • Remove validation that requires introspection (#11938) @vyasr
  • Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
  • Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
  • Support nested types as groupby keys in libcudf (#11792) @PointKernel
  • Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
  • Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
  • part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source
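
Two entries above change make_collect_set_aggregation and make_merge_sets_aggregation to treat NaNs as equal by default. As a rough illustration of what that distinction means, and not of how libcudf implements it, here is a plain-Python sketch. The collect_set helper and its nans_equal flag are hypothetical names introduced only for this example.

```python
import math


def collect_set(values, nans_equal=True):
    """Return the distinct values in first-seen order.

    Hypothetical helper: with nans_equal=True all NaNs collapse into a
    single entry; with nans_equal=False every NaN is kept as its own entry.
    """
    out = []
    seen = set()
    seen_nan = False
    for v in values:
        if isinstance(v, float) and math.isnan(v):
            # NaN != NaN in IEEE arithmetic, so it needs explicit handling.
            if nans_equal and seen_nan:
                continue
            seen_nan = True
            out.append(v)
        elif v not in seen:
            seen.add(v)
            out.append(v)
    return out
```

With the input [1.0, nan, 1.0, nan], the sketch keeps two entries when NaNs compare equal and three when they do not.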

🐛 Bug Fixes

  • Fix include line for IO Cython modules (#12250) @vyasr
  • Make dask pinning looser (#12231) @vyasr
  • Workaround for CUB segmented-sort bug with boolean keys (#12217) @davidwendt
  • Fix from_dict backend dispatch to match upstream dask (#12203) @galipremsagar
  • Merge branch-22.10 into branch-22.12 (#12198) @davidwendt
  • Fix compression in ORC writer (#12194) @vuule
  • Don't use CMake 3.25.0 as it has a show stopping FindCUDAToolkit bug (#12188) @robertmaynard
  • Fix data corruption when reading ORC files with empty stripes (#12160) @vuule
  • Fix decimal binary operations (#12142) @galipremsagar
  • Ensure dlpack include is provided to cudf interop lib (#12139) @robertmaynard
  • Safely allocate udf_string pointers in strings_udf (#12138) @brandon-b-miller
  • Fix/disable jitify lto (#12122) @robertmaynard
  • Fix conditional_full_join benchmark (#12121) @GregoryKimball
  • Fix regex working-memory-size refactor error (#12119) @davidwendt
  • Add in negative size checks for columns (#12118) @revans2
  • Add JNI for substring without 'end' parameter. (#12113) @firestarman
  • Fix reading of CSV files with blank second row (#12098) @vuule
  • Fix an error in IO with GzipFile type (#12085) @galipremsagar
  • Workaround groupby aggregate thrust::copy_if overflow (#12079) @davidwendt
  • Fix alignment of compressed blocks in ORC writer (#12077) @vuule
  • Fix singleton-range __setitem__ edge case (#12075) @wence-
  • Fix type promotion edge cases in numerical binops (#12074) @wence-
  • Force using old fmt in nvbench. (#12067) @vyasr
  • Fixes List offset bug in Nested JSON reader (#12060) @karthikeyann
  • Allow falling back to shim_60.ptx by default in strings_udf (#12056) @brandon-b-miller
  • Force black exclusions for pre-commit. (#12036) @bdice
  • Add memory_usage & items implementation for Struct column & dtype (#12033) @galipremsagar
  • Reduce/Remove reliance on **kwargs and *args in IO readers & writers (#12025) @galipremsagar
  • Fixes bug in csv_reader_options construction in cython (#12021) @karthikeyann
  • Fix issues when both usecols and names options are used in read_csv (#12018) @vuule
  • Port thrust's pinned_allocator to cudf, since Thrust 1.17 removes the type (#12004) @robertmaynard
  • Revert "Replace most of preprocessor usage in nvcomp adapter with constexpr" (#11999) @vuule
  • Fix bug where df.loc resulting in single row could give wrong index (#11998) @eriknw
  • Switch to DISABLE_DEPRECATION_WARNINGS to match other RAPIDS projects (#11989) @robertmaynard
  • Fix maximum page size estimate in Parquet writer (#11962) @vuule
  • Fix local offset handling in bgzip reader (#11918) @upsj
  • Fix an issue reading struct-of-list types in Parquet. (#11910) @nvdbaranec
  • Fix memcheck error in TypeInference.Timestamp gtest (#11905) @davidwendt
  • Fix type casting in Series.setitem (#11904) @wence-
  • Fix memcheck error in get_dremel_data (#11903) @davidwendt
  • Fixes Unsupported column type error due to empty list columns in Nested JSON reader (#11897) @karthikeyann
  • Fix segmented-sort to ignore indices outside the offsets (#11888) @davidwendt
  • Fix cudf::stable_sorted_order for NaN and -NaN in FLOAT64 columns (#11874) @davidwendt
  • Fix writing of Parquet files with many fragments (#11869) @etseidl
  • Fix RangeIndex unary operators. (#11868) @vyasr
  • JNI Avoid NPE for reading host binary data (#11865) @revans2
  • Fix decimal benchmark input data generation (#11863) @karthikeyann
  • Fix pre-commit copyright check (#11860) @galipremsagar
  • Fix Parquet support for seconds and milliseconds duration types (#11854) @vuule
  • Ensure better compiler cache results between cudf cal-ver branches (#11835) @robertmaynard
  • Fix make_column_from_scalar for all-null strings column (#11807) @davidwendt
  • Tell jitify_preprocess where to search for libnvrtc (#11787) @robertmaynard
  • add V2 page header support to parquet reader (#11778) @etseidl
  • Parquet reader: bug fix for a num_rows/skip_rows corner case, w/optimization for nested preprocessing (#11752) @nvdbaranec
  • Determine if Arrow has S3 support at runtime in unit test. (#11560) @bdice

📖 Documentation

  • Use rapidsai CODE_OF_CONDUCT.md (#12166) @bdice
  • Add symlinks to notebooks. (#12128) @bdice
  • Add truncate API to python doc pages (#12109) @galipremsagar
  • Update Numba docs links. (#12107) @bdice
  • Remove "Multi-GPU with Dask-cuDF" notebook. (#12095) @bdice
  • Fix link to c++ developer guide from CONTRIBUTING.md (#12084) @brandon-b-miller
  • Add pivot_table and crosstab to docs. (#12014) @bdice
  • Fix doxygen text for cudf::dictionary::encode (#11991) @davidwendt
  • Replace default_stream_value with get_default_stream in docs. (#11985) @vyasr
  • Add dtype docs pages and docstrings for cudf specific dtypes (#11974) @galipremsagar
  • Update Unit Testing in libcudf guidelines to code tests outside the cudf::test namespace (#11959) @davidwendt
  • Rename libcudf++ to libcudf. (#11953) @bdice
  • Fix documentation referring to removed as_gpu_matrix method. (#11937) @bdice
  • Remove "experimental" warning for struct columns in ORC reader and writer (#11880) @vuule
  • Initial draft of policies and guidelines for libcudf usage. (#11853) @vyasr
  • Add clear indication of non-GPU accelerated parameters in read_json docstring (#11825) @GregoryKimball
  • Add developer docs for writing tests (#11199) @vyasr

🚀 New Features

  • Adds an EventHandler to Java MemoryBuffer to be invoked on close (#12125) @abellina
  • Support + in strings_udf (#12117) @brandon-b-miller
  • Support upper and lower in strings_udf (#12099) @brandon-b-miller
  • Add wheel builds (#12096) @vyasr
  • Allow setting malloc heap size in string udfs (#12094) @brandon-b-miller
  • Support strip, lstrip, and rstrip in strings_udf (#12091) @brandon-b-miller
  • Mark nvcomp zstd compression stable (#12059) @jbrennan333
  • Add debug-only onAllocated/onDeallocated to RmmEventHandler (#12054) @abellina
  • Enable building against the libarrow contained in pyarrow (#12034) @vyasr
  • Add strings like jni and native method (#12032) @cindyyuanjiang
  • Cleanup common parsing code in JSON, CSV reader (#12022) @karthikeyann
  • byte_range support for JSON Lines format (#12017) @karthikeyann
  • Minor cleanup of root CMakeLists.txt for better organization (#11988) @robertmaynard
  • Add inplace arithmetic operators to MaskedType (#11987) @brandon-b-miller
  • Implement JNI for chunked Parquet reader (#11961) @ttnghia
  • Add method argument to DataFrame.quantile (#11957) @rjzamora
  • Add gpu memory watermark apis to JNI (#11950) @abellina
  • Adds retryCount to RmmEventHandler.onAllocFailure (#11940) @abellina
  • Enable returning string data from UDFs used through apply (#11933) @brandon-b-miller
  • Switch over to rapids-cmake patches for thrust (#11921) @robertmaynard
  • Add strings udf C++ classes and functions for phase II (#11912) @davidwendt
  • Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
  • Enable CEC for strings_udf (#11884) @brandon-b-miller
  • ArrowIPCTableWriter writes an empty batch in the case of an empty table. (#11883) @firestarman
  • Implement chunked Parquet reader (#11867) @ttnghia
  • Add read_orc_metadata to libcudf (#11815) @vuule
  • Support nested types as groupby keys in libcudf (#11792) @PointKernel
  • Adding feature Truncate to DataFrame and Series (#11435) @VamsiTallam95
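
The "byte_range support for JSON Lines format" entry above concerns reading only part of a line-delimited input. The sketch below shows one common way such a range can be interpreted, using only the Python standard library. It is an assumption-laden illustration, not cuDF's reader: the read_json_lines function is hypothetical, and it assumes that a record belongs to the range in which its first byte falls and that a size of 0 means "read to the end".

```python
import io
import json


def read_json_lines(buf, offset=0, size=0):
    """Parse the JSON Lines records that start inside [offset, offset + size).

    Assumptions for this sketch: a record is assigned to the range that
    contains its first byte, and size == 0 means "until end of input".
    """
    data = buf.getvalue()
    end = len(data) if size == 0 else min(offset + size, len(data))
    records = []
    pos = 0
    for line in data.splitlines(keepends=True):
        start = pos
        pos += len(line)
        # Keep the whole record even if it extends past the end of the range.
        if offset <= start < end and line.strip():
            records.append(json.loads(line))
    return records


buf = io.BytesIO(b'{"a": 1}\n{"a": 2}\n{"a": 3}\n')
first = read_json_lines(buf, offset=0, size=10)
rest = read_json_lines(buf, offset=10, size=0)
```

Under these assumptions, adjacent ranges that partition the input return every record exactly once, which is what makes the option useful for splitting one file across workers.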

🛠️ Improvements

  • Reduce number of tests marked spilling (#12197) @madsbk
  • Pin dask and distributed for release (#12165) @galipremsagar
  • Don't rely on GNU find in headers_test.sh (#12164) @wence-
  • Update cp.clip call (#12148) @quasiben
  • Enable automatic column projection in groupby().agg (#12124) @rjzamora
  • Refactor purge_nonempty_nulls (#12111) @ttnghia
  • Create an int8 column in read_csv when all elements are missing (#12110) @vuule
  • Spilling to host memory (#12106) @madsbk
  • First pass of pd.read_orc changes in tests (#12103) @galipremsagar
  • Expose engine argument in dask_cudf.read_json (#12101) @rjzamora
  • Remove CUDA 10 compatibility code. (#12088) @bdice
  • Move and update dask nightly install in CI (#12082) @galipremsagar
  • Throw an error when libcudf is built without cuFile and LIBCUDF_CUFILE_POLICY is set to "ALWAYS" (#12080) @vuule
  • Remove macros that inspect the contents of exceptions (#12076) @vyasr
  • Fix ingest_raw_data performance issue in Nested JSON reader due to RVO (#12070) @karthikeyann
  • Remove overflow error during decimal binops (#12063) @galipremsagar
  • Change cudf::detail::tdigest to cudf::tdigest::detail (#12050) @davidwendt
  • Fix quantile gtests coded in namespace cudf::test (#12049) @davidwendt
  • Add support for DataFrame.from_dict, to_dict, and Series.to_dict (#12048) @galipremsagar
  • Refactor Parquet reader (#12046) @ttnghia
  • Forward merge 22.10 into 22.12 (#12045) @vyasr
  • Standardize newlines at ends of files. (#12042) @bdice
  • Trim trailing whitespace from all files. (#12041) @bdice
  • Use nosync policy in gather and scatter implementations. (#12038) @bdice
  • Remove smart quotes from all docstrings. (#12035) @bdice
  • Update cuda-python dependency to 11.7.1 (#12030) @galipremsagar
  • Add cython-lint to pre-commit checks. (#12020) @bdice
  • Use pragma once (#12019) @bdice
  • New GHA to add issues/prs to project board (#12016) @jarmak-nv
  • Add DataFrame.pivot_table. (#12015) @bdice
  • Rollback of DeviceBufferLike (#12009) @madsbk
  • Remove default parameters for nvtext::detail functions (#12007) @davidwendt
  • Remove default parameters for cudf::dictionary::detail functions (#12006) @davidwendt
  • Remove unused managed_allocator (#12005) @vyasr
  • Remove default parameters for cudf::strings::detail functions (#12003) @davidwendt
  • Remove unnecessary code from dask-cudf _Frame (#12001) @rjzamora
  • Ignore python docs build artifacts (#12000) @galipremsagar
  • Use rapids-cmake for google benchmark. (#11997) @vyasr
  • Leverage rapids_cython for more automated RPATH handling (#11996) @vyasr
  • Remove stale labeler (#11995) @raydouglass
  • Move protobuf compilation to CMake (#11986) @vyasr
  • Replace most of preprocessor usage in nvcomp adapter with constexpr (#11980) @vuule
  • Add missing noexcepts to column_in_metadata methods (#11973) @vyasr
  • Pass column names to write_csv instead of table_metadata pointer (#11972) @vuule
  • Accelerate libcudf segmented sort with CUB segmented sort (#11969) @davidwendt
  • Feature/remove default streams (#11967) @vyasr
  • Add pool memory resource to libcudf basic example (#11966) @davidwendt
  • Fix some libcudf calls to cudf::detail::gather (#11963) @davidwendt
  • Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
  • Add deprecation warning for set_allocator. (#11958) @vyasr
  • Fix lists and structs gtests coded in namespace cudf::test (#11956) @davidwendt
  • Add full page indexes to Parquet writer benchmarks (#11955) @etseidl
  • Use gather-based strings factory in cudf::strings::strip (#11954) @davidwendt
  • Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
  • Add strip_delimiters option to read_text (#11946) @upsj
  • Refactor multibyte_split output_builder (#11945) @upsj
  • Remove validation that requires introspection (#11938) @vyasr
  • Add .str.find_multiple API (#11928) @galipremsagar
  • Add regex_program class for use with all regex APIs (#11927) @davidwendt
  • Enable backend dispatching for Dask-DataFrame creation (#11920) @rjzamora
  • Performance improvement in JSON Tree traversal (#11919) @karthikeyann
  • Fix some gtests incorrectly coded in namespace cudf::test (part I) (#11917) @davidwendt
  • Refactor pad/zfill functions for reuse with strings udf (#11914) @davidwendt
  • Add nanosecond & microsecond to DatetimeProperties (#11911) @galipremsagar
  • Pin mimesis version in setup.py. (#11906) @bdice
  • Error on ListColumn or any new unsupported column in cudf.Index (#11902) @galipremsagar
  • Add thrust output iterator fix (1805) to thrust.patch (#11900) @davidwendt
  • Relax codecov threshold diff (#11899) @galipremsagar
  • Use public APIs in STREAM_COMPACTION_NVBENCH (#11892) @GregoryKimball
  • Add coverage for string UDF tests. (#11891) @vyasr
  • Provide data_chunk_source wrapper for datasource (#11886) @upsj
  • Handle multibyte_split byte_range out-of-bounds offsets on host (#11885) @upsj
  • Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
  • Change expect_strings_empty into expect_column_empty libcudf test utility (#11873) @davidwendt
  • Add ngroup (#11871) @shwina
  • Reduce memory usage in nested JSON parser - tree generation (#11864) @karthikeyann
  • Unpin dask and distributed for development (#11859) @galipremsagar
  • Remove unused includes for table/row_operators (#11857) @GregoryKimball
  • Use conda-forge's pyorc (#11855) @jakirkham
  • Add libcudf strings examples (#11849) @davidwendt
  • Remove cudf_io namespace alias (#11827) @vuule
  • Test/remove thrust vector usage (#11813) @vyasr
  • Add BGZIP reader to python read_text (#11802) @upsj
  • Merge branch-22.10 into branch-22.12 (#11801) @davidwendt
  • Fix compile warning from CUDF_FUNC_RANGE in a member function (#11798) @davidwendt
  • Update cudf JNI version to 22.12.0-SNAPSHOT (#11764) @pxLi
  • Update flake8 to 5.0.4 and use flake8-force to check Cython. (#11736) @bdice
  • Add BGZIP multibyte_split benchmark (#11723) @upsj
  • Bifurcate Dependency Lists (#11674) @bdice
  • Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
  • Conform "bench_isin" to match generator column names (#11549) @GregoryKimball
  • Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
  • Add checks for HLG layers in dask-cudf groupby tests (#10853) @charlesbluca
  • part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source
  • Make all nvcc warnings into errors (#8916) @trxcllnt

cudf - v22.10.01

Published by GPUtester almost 2 years ago

🚨 Breaking Changes

  • Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
  • Disable nvCOMP DEFLATE integration (#11811) @vuule
  • Fix return type of Index.isna & Index.notna (#11769) @galipremsagar
  • Remove kwargs in read_csv & to_csv (#11762) @galipremsagar
  • Fix cudf::partition* APIs that do not return offsets for empty output table (#11709) @ttnghia
  • Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
  • Update zfill to match Python output (#11634) @davidwendt
  • Upgrade pandas to 1.5 (#11617) @galipremsagar
  • Change default value of ordered to False in CategoricalDtype (#11604) @galipremsagar
  • Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
  • Adding optional parquet reader schema (#11524) @hyperbolic2346
  • Deprecate skiprows and num_rows in read_orc (#11522) @galipremsagar
  • Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
  • Drop support for skiprows and num_rows in cudf.read_parquet (#11480) @galipremsagar
  • Disable Arrow S3 support by default. (#11470) @bdice
  • Convert thrust::optional usages to std::optional (#11455) @robertmaynard
  • Remove unused is_struct trait. (#11450) @bdice
  • Refactor the Buffer class (#11447) @madsbk
  • Return empty dataframe when reading an ORC file using empty columns option (#11446) @vuule
  • Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
  • Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
  • Use the new JSON parser when the experimental reader is selected (#11364) @vuule
  • Remove deprecated Series.applymap. (#11031) @bdice
  • Remove deprecated expand parameter from str.findall. (#11030) @bdice
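
Two of the entries above concern string semantics: "Update zfill to match Python output" and the change to how negated regex classes treat new-lines. For reference, this is what plain Python does in both cases, using only the standard library. The snippet shows the Python side only and makes no claim about cuDF internals.

```python
import re

# str.zfill pads with zeros after a leading sign character
# and never truncates a string that is already long enough.
padded_plain = "42".zfill(5)      # '00042'
padded_minus = "-42".zfill(5)     # '-0042'
padded_plus = "+42".zfill(5)      # '+0042'
not_truncated = "12345".zfill(3)  # '12345'

# In Python's re module, a negated character class matches a newline like
# any other character that is not listed in the class.
negated_matches = re.findall(r"[^a]", "a\nb")  # ['\n', 'b']
```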

🐛 Bug Fixes

  • Update cuda-python dependency to 11.7.1 (#11994) @shwina
  • Fixes bug in temporary decompression space estimation before calling nvcomp (#11879) @abellina
  • Handle ptx file paths during strings_udf import (#11862) @galipremsagar
  • Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
  • Reset strings_udf CEC and solve several related issues (#11846) @brandon-b-miller
  • Fix bug in new shuffle-based groupby implementation (#11836) @rjzamora
  • Fix is_valid checks in Scalar._binaryop (#11818) @wence-
  • Fix operator NotImplemented issue with numpy (#11816) @galipremsagar
  • Disable nvCOMP DEFLATE integration (#11811) @vuule
  • Build strings_udf package with other python packages in nightlies (#11808) @brandon-b-miller
  • Revert problematic shuffle=explicit-comms changes (#11803) @rjzamora
  • Fix regex out-of-bounds write in strided rows logic (#11797) @davidwendt
  • Build cudf locally before building strings_udf conda packages in CI (#11785) @brandon-b-miller
  • Fix an issue in cudf::row_bit_count involving structs and lists at multiple levels. (#11779) @nvdbaranec
  • Fix return type of Index.isna & Index.notna (#11769) @galipremsagar
  • Fix issue with set-item in case of list and struct types (#11760) @galipremsagar
  • Ensure all libcudf APIs run on cudf's default stream (#11759) @vyasr
  • Resolve dask_cudf failures caused by upstream groupby changes (#11755) @rjzamora
  • Fix ORC string sum statistics (#11740) @vuule
  • Add strings_udf package for python 3.9 (#11730) @brandon-b-miller
  • Ensure that all tests launch kernels on cudf's default stream (#11726) @vyasr
  • Don't assume stream is a compile-time constant expression (#11725) @vyasr
  • Fix get_thrust.cmake format at patch command (#11715) @davidwendt
  • Fix cudf::partition* APIs that do not return offsets for empty output table (#11709) @ttnghia
  • Fix cudf::lists::sort_lists for NaN and Infinity values (#11703) @davidwendt
  • Modify ORC reader timestamp parsing to match the apache reader behavior (#11699) @vuule
  • Fix DataFrame.from_arrow to preserve type metadata (#11698) @galipremsagar
  • Fix compile error due to missing header (#11697) @ttnghia
  • Default to Snappy compression in to_orc when using cuDF or Dask (#11690) @vuule
  • Fix an issue related to MultiIndex when group_keys=True (#11689) @galipremsagar
  • Transfer correct dtype to exploded column (#11687) @wence-
  • Ignore protobuf generated files in mypy checks (#11685) @galipremsagar
  • Maintain the index name after .loc (#11677) @shwina
  • Fix issue with extracting nested column data & dtype preservation (#11671) @galipremsagar
  • Ensure that all cudf tests and benchmarks are conda env aware (#11666) @robertmaynard
  • Update to Thrust 1.17.2 to fix cub ODR issues (#11665) @robertmaynard
  • Fix multi-file remote datasource bug (#11655) @rjzamora
  • Fix invalid regex quantifier check to not include alternation (#11654) @davidwendt
  • Fix bug in device_write(): it uses an incorrect size (#11651) @madsbk
  • fixes overflows in benchmarks (#11649) @elstehle
  • Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
  • Fix compile error in benchmark nested_json.cpp (#11637) @davidwendt
  • Update zfill to match Python output (#11634) @davidwendt
  • Removed converted type for INT32 and INT64 since they do not convert (#11627) @hyperbolic2346
  • Fix host scalars construction of nested types (#11612) @galipremsagar
  • Fix compile warning in nested_json_gpu.cu (#11607) @davidwendt
  • Change default value of ordered to False in CategoricalDtype (#11604) @galipremsagar
  • Preserve order if necessary when deduping categoricals internally (#11597) @brandon-b-miller
  • Add is_timestamp test for leap second (60) (#11594) @davidwendt
  • Fix an issue with to_arrow when column name type is not a string (#11590) @galipremsagar
  • Fix exception in segmented-reduce benchmark (#11588) @davidwendt
  • Fix encode/decode of negative timestamps in ORC reader/writer (#11586) @vuule
  • Correct distribution data type in quantiles benchmark (#11584) @vuule
  • Fix multibyte_split benchmark for host buffers (#11583) @upsj
  • xfail custreamz display test for now (#11567) @shwina
  • Fix JNI for TableWithMeta to use schema_info instead of column_names (#11566) @jlowe
  • Reduce code duplication for dask & distributed nightly/stable installs (#11565) @galipremsagar
  • Fix groupby failures in dask_cudf CI (#11561) @rjzamora
  • Fix for pivot: error when 'values' is a multicharacter string (#11538) @shaswat-indian
  • find_package(cudf) + arrow9 usable with cudf build directory (#11535) @robertmaynard
  • Fixing crash when writing binary nested data in parquet (#11526) @hyperbolic2346
  • Fix for: error when assigning a value to an empty series (#11523) @shaswat-indian
  • Fix invalid results from conditional-left-anti-join in debug build (#11517) @davidwendt
  • Fix cmake error after upgrading to Arrow 9 (#11513) @ttnghia
  • Fix reverse binary operators acting on a host value and cudf.Scalar (#11512) @bdice
  • Update parquet fuzz tests to drop support for skiprows & num_rows (#11505) @galipremsagar
  • Use rapids-cmake 22.10 best practice for RAPIDS.cmake location (#11493) @robertmaynard
  • Handle some zero-sized corner cases in dlpack interop (#11449) @wence-
  • Return empty dataframe when reading an ORC file using empty columns option (#11446) @vuule
  • libcudf c++ example updated to CPM version 0.35.3 (#11417) @robertmaynard
  • Fix regex quantifier check to include capture groups (#11373) @davidwendt
  • Fix read_text when byte_range is aligned with field (#11371) @upsj
  • Fix to_timestamps truncated subsecond calculation (#11367) @davidwendt
  • column: calculate null_count before release()ing the cudf::column (#11365) @wence-

📖 Documentation

  • Update guide-to-udfs notebook (#11861) @brandon-b-miller
  • Update docstring for cudf.read_text (#11799) @GregoryKimball
  • Add doc section for list & struct handling (#11770) @galipremsagar
  • Document that minimum required CMake version is now 3.23.1 (#11751) @robertmaynard
  • Update libcudf documentation build command in DOCUMENTATION.md (#11735) @davidwendt
  • Add docs for use of string data to DataFrame.apply and Series.apply and update guide to UDFs notebook (#11733) @brandon-b-miller
  • Enable more Pydocstyle rules (#11582) @bdice
  • Remove unused cpp/img folder (#11554) @davidwendt
  • Publish C++ developer docs (#11475) @vyasr
  • Fix a misalignment in cudf.get_dummies docstring (#11443) @galipremsagar
  • Update contributing doc to include links to the developer guides (#11390) @davidwendt
  • Fix table_view_base doxygen format (#11340) @davidwendt
  • Create main developer guide for Python (#11235) @vyasr
  • Add developer documentation for benchmarking (#11122) @vyasr
  • cuDF error handling document (#7917) @isVoid

🚀 New Features

  • Add hasNull statistic reading ability to ORC (#11747) @devavret
  • Add istitle to string UDFs (#11738) @brandon-b-miller
  • JSON Column creation in GPU (#11714) @karthikeyann
  • Adds option to take explicit nested schema for nested JSON reader (#11682) @elstehle
  • Add BGZIP data_chunk_reader (#11652) @upsj
  • Support DECIMAL order-by for RANGE window functions (#11645) @mythrocks
  • changing version of cmake to 3.23.3 (#11619) @hyperbolic2346
  • Generate unique keys table in java JNI contiguousSplitGroups (#11614) @res-life
  • Generic type casting to support the new nested JSON reader (#11613) @elstehle
  • JSON tree traversal (#11610) @karthikeyann
  • Add casting operators to masked UDFs (#11578) @brandon-b-miller
  • Adds type inference and type conversion for leaf-columns to the nested JSON parser (#11574) @elstehle
  • Add strings 'like' function (#11558) @davidwendt
  • Handle hyphen as literal for regex cclass when incomplete range (#11557) @davidwendt
  • Enable ZSTD compression in ORC and Parquet writers (#11551) @vuule
  • Adds support for json lines format to the nested JSON reader (#11534) @elstehle
  • Adding optional parquet reader schema (#11524) @hyperbolic2346
  • Adds GPU implementation of JSON-token-stream to JSON-tree (#11518) @karthikeyann
  • Add gdb pretty-printers for simple types (#11499) @upsj
  • Add create_random_column function to the data generator (#11490) @vuule
  • Add fluent API builder to data_profile (#11479) @vuule
  • Adds Nested Json benchmark (#11466) @karthikeyann
  • Convert thrust::optional usages to std::optional (#11455) @robertmaynard
  • Python API for the future experimental JSON reader (#11426) @vuule
  • Return schema info from JSON reader (#11419) @vuule
  • Add regex ASCII flag support for matching builtin character classes (#11404) @davidwendt
  • Truncate parquet column indexes (#11403) @etseidl
  • Adds the end-to-end JSON parser implementation (#11388) @elstehle
  • Use the new JSON parser when the experimental reader is selected (#11364) @vuule
  • Add placeholder for the experimental JSON reader (#11334) @vuule
  • Add read-only functions on string dtypes to DataFrame.apply and Series.apply (#11319) @brandon-b-miller
  • Added 'crosstab' and 'pivot_table' features (#11314) @shaswat-indian
  • Quickly error out when trying to build with unsupported nvcc versions (#11297) @robertmaynard
  • Adds JSON tokenizer (#11264) @elstehle
  • List lexicographic comparator (#11129) @devavret
  • Add generic type inference for cuIO (#11121) @PointKernel
  • Fully support nested types in cudf::contains (#10656) @ttnghia
  • Support nested types in lists::contains (#10548) @ttnghia

🛠️ Improvements

  • Pin dask and distributed for release (#11822) @galipremsagar
  • Add examples for Nested JSON reader (#11814) @GregoryKimball
  • Support shuffle-based groupby aggregations in dask_cudf (#11800) @rjzamora
  • Update strings udf version updater script (#11772) @galipremsagar
  • Remove kwargs in read_csv & to_csv (#11762) @galipremsagar
  • Pass dtype param to avoid pd.Series warnings (#11761) @galipremsagar
  • Enable schema_element & keep_quotes support in json reader (#11746) @galipremsagar
  • Add ability to construct ListColumn when size is None (#11745) @galipremsagar
  • Reduces memory requirements in JSON parser and adds bytes/s and peak memory usage to benchmarks (#11732) @elstehle
  • Add missing copyright headers. (#11712) @bdice
  • Fix copyright check issues in pre-commit (#11711) @bdice
  • Include decimal in supported types for range window order-by columns (#11710) @mythrocks
  • Disable very large column gtest for contiguous-split (#11706) @davidwendt
  • Drop split_out=None test from groupby.agg (#11704) @wence-
  • Use CubinLinker for CUDA Minor Version Compatibility (#11701) @gmarkall
  • Add regex capture-group parameter to auto convert to non-capture groups (#11695) @davidwendt
  • Add a __dataframe__ method to the protocol dataframe object (#11692) @rgommers
  • Special-case multibyte_split for single-byte delimiter (#11681) @upsj
  • Remove isort exclusions (#11680) @bdice
  • Refactor CSV reader benchmarks with nvbench (#11678) @PointKernel
  • Check conda recipe headers with pre-commit (#11669) @bdice
  • Remove redundant style check for clang-format. (#11668) @bdice
  • Add support for group_keys in groupby (#11659) @galipremsagar
  • Fix pandoc pinning. (#11658) @bdice
  • Revert removal of skip_rows / num_rows options from the Parquet reader. (#11657) @nvdbaranec
  • Update git metadata (#11647) @bdice
  • Call set_null_count on a returning column if null-count is known (#11646) @davidwendt
  • Fix some libcudf detail calls not passing the stream variable (#11642) @davidwendt
  • Update to mypy 0.971 (#11640) @wence-
  • Refactor strings strip functor to details header (#11635) @davidwendt
  • Fix incorrect nullCount in get_json_object (#11633) @trxcllnt
  • Simplify hostdevice_vector (#11631) @upsj
  • Refactor parquet writer benchmarks with nvbench (#11623) @PointKernel
  • Rework contains_scalar to check nulls at runtime (#11622) @davidwendt
  • Fix incorrect memory resource used in rolling temp columns (#11618) @mythrocks
  • Upgrade pandas to 1.5 (#11617) @galipremsagar
  • Move type-dispatcher calls from traits.hpp to traits.cpp (#11616) @davidwendt
  • Refactor parquet reader benchmarks with nvbench (#11611) @PointKernel
  • Forward-merge branch-22.08 to branch-22.10 (#11608) @bdice
  • Use stream in Java API. (#11601) @bdice
  • Refactors of public/detail APIs, CUDF_FUNC_RANGE, stream handling. (#11600) @bdice
  • Improve ORC writer benchmark with nvbench (#11598) @PointKernel
  • Tune multibyte_split kernel (#11587) @upsj
  • Move split_utils.cuh to strings/detail (#11585) @davidwendt
  • Fix warnings due to compiler regression with if constexpr (#11581) @ttnghia
  • Add full 24-bit dictionary support to Parquet writer (#11580) @etseidl
  • Expose "explicit-comms" option in shuffle-based dask_cudf functions (#11576) @rjzamora
  • Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
  • Refactor dask_cudf groupby to use apply_concat_apply (#11571) @rjzamora
  • Add ability to write list(struct) columns as map type in orc writer (#11568) @galipremsagar
  • Add byte_range to multibyte_split benchmark + NVBench refactor (#11562) @upsj
  • JNI support for writing binary columns in parquet (#11556) @revans2
  • Support additional dictionary bit widths in Parquet writer (#11547) @etseidl
  • Refactor string/numeric conversion utilities (#11545) @davidwendt
  • Removing unnecessary asserts in parquet tests (#11544) @hyperbolic2346
  • Clean up ORC reader benchmarks with NVBench (#11543) @PointKernel
  • Reuse MurmurHash3_32 in Parquet page data. (#11528) @bdice
  • Add hexadecimal value separators (#11527) @bdice
  • Deprecate skiprows and num_rows in read_orc (#11522) @galipremsagar
  • Struct support for NULL_EQUALS binary operation (#11520) @rwlee
  • Bump hadoop-common from 3.2.3 to 3.2.4 in /java (#11516) @dependabot[bot]
  • Fix Feather test warning. (#11511) @bdice
  • copy_range ballot_syncs to have no execution dependency (#11508) @robertmaynard
  • Upgrade to arrow-9.x (#11507) @galipremsagar
  • Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
  • Single-pass multibyte_split (#11500) @upsj
  • Sanitize percentile_approx() output for empty input (#11498) @SrikarVanavasam
  • Unpin dask and distributed for development (#11492) @galipremsagar
  • Move SparkMurmurHash3_32 functor. (#11489) @bdice
  • Refactor group_nunique.cu to use nullate::DYNAMIC for reduce-by-key functor (#11482) @davidwendt
  • Drop support for skiprows and num_rows in cudf.read_parquet (#11480) @galipremsagar
  • Add reduction distinct_count benchmark (#11473) @ttnghia
  • Add groupby nunique aggregation benchmark (#11472) @ttnghia
  • Disable Arrow S3 support by default. (#11470) @bdice
  • Add groupby max aggregation benchmark (#11464) @ttnghia
  • Extract Dremel encoding code from Parquet (#11461) @vyasr
  • Add missing Thrust #includes. (#11457) @bdice
  • Make CMake hooks verbose (#11456) @vyasr
  • Control Parquet page size through Python API (#11454) @etseidl
  • Add control of Parquet column index creation to python (#11453) @etseidl
  • Remove unused is_struct trait. (#11450) @bdice
  • Refactor the Buffer class (#11447) @madsbk
  • Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
  • Update to Thrust 1.17.0 (#11437) @bdice
  • Add in JNI for parsing JSON data and getting the metadata back too. (#11431) @revans2
  • Convert byte_array_view to use std::byte (#11424) @hyperbolic2346
  • Deprecate unflatten_nested_columns (#11421) @SrikarVanavasam
  • Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
  • Add Spark list hashing Java tests (#11379) @bdice
  • Move cmake to the build section. (#11376) @vyasr
  • Remove use of CUDA driver API calls from libcudf (#11370) @shwina
  • Add column constructor from device_uvector&& (#11356) @SrikarVanavasam
  • Remove unused custreamz thirdparty directory (#11343) @vyasr
  • Update jni version to 22.10.0-SNAPSHOT (#11338) @pxLi
  • Enable using upstream jitify2 (#11287) @shwina
  • Cache cudf.Scalar (#11246) @shwina
  • Remove deprecated Series.applymap. (#11031) @bdice
  • Remove deprecated expand parameter from str.findall. (#11030) @bdice

cudf - v22.10.00

Published by GPUtester about 2 years ago

🚨 Breaking Changes

  • Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
  • Disable nvCOMP DEFLATE integration (#11811) @vuule
  • Fix return type of Index.isna & Index.notna (#11769) @galipremsagar
  • Remove kwargs in read_csv & to_csv (#11762) @galipremsagar
  • Fix cudf::partition* APIs that do not return offsets for empty output table (#11709) @ttnghia
  • Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
  • Update zfill to match Python output (#11634) @davidwendt
  • Upgrade pandas to 1.5 (#11617) @galipremsagar
  • Change default value of ordered to False in CategoricalDtype (#11604) @galipremsagar
  • Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
  • Adding optional parquet reader schema (#11524) @hyperbolic2346
  • Deprecate skiprows and num_rows in read_orc (#11522) @galipremsagar
  • Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
  • Drop support for skiprows and num_rows in cudf.read_parquet (#11480) @galipremsagar
  • Disable Arrow S3 support by default. (#11470) @bdice
  • Convert thrust::optional usages to std::optional (#11455) @robertmaynard
  • Remove unused is_struct trait. (#11450) @bdice
  • Refactor the Buffer class (#11447) @madsbk
  • Return empty dataframe when reading an ORC file using empty columns option (#11446) @vuule
  • Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
  • Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
  • Use the new JSON parser when the experimental reader is selected (#11364) @vuule
  • Remove deprecated Series.applymap. (#11031) @bdice
  • Remove deprecated expand parameter from str.findall. (#11030) @bdice

🐛 Bug Fixes

  • Fixes bug in temporary decompression space estimation before calling nvcomp (#11879) @abellina
  • Handle ptx file paths during strings_udf import (#11862) @galipremsagar
  • Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
  • Reset strings_udf CEC and solve several related issues (#11846) @brandon-b-miller
  • Fix bug in new shuffle-based groupby implementation (#11836) @rjzamora
  • Fix is_valid checks in Scalar._binaryop (#11818) @wence-
  • Fix operator NotImplemented issue with numpy (#11816) @galipremsagar
  • Disable nvCOMP DEFLATE integration (#11811) @vuule
  • Build strings_udf package with other python packages in nightlies (#11808) @brandon-b-miller
  • Revert problematic shuffle=explicit-comms changes (#11803) @rjzamora
  • Fix regex out-of-bounds write in strided rows logic (#11797) @davidwendt
  • Build cudf locally before building strings_udf conda packages in CI (#11785) @brandon-b-miller
  • Fix an issue in cudf::row_bit_count involving structs and lists at multiple levels. (#11779) @nvdbaranec
  • Fix return type of Index.isna & Index.notna (#11769) @galipremsagar
  • Fix issue with set-item in case of list and struct types (#11760) @galipremsagar
  • Ensure all libcudf APIs run on cudf's default stream (#11759) @vyasr
  • Resolve dask_cudf failures caused by upstream groupby changes (#11755) @rjzamora
  • Fix ORC string sum statistics (#11740) @vuule
  • Add strings_udf package for python 3.9 (#11730) @brandon-b-miller
  • Ensure that all tests launch kernels on cudf's default stream (#11726) @vyasr
  • Don't assume stream is a compile-time constant expression (#11725) @vyasr
  • Fix get_thrust.cmake format at patch command (#11715) @davidwendt
  • Fix cudf::partition* APIs that do not return offsets for empty output table (#11709) @ttnghia
  • Fix cudf::lists::sort_lists for NaN and Infinity values (#11703) @davidwendt
  • Modify ORC reader timestamp parsing to match the apache reader behavior (#11699) @vuule
  • Fix DataFrame.from_arrow to preserve type metadata (#11698) @galipremsagar
  • Fix compile error due to missing header (#11697) @ttnghia
  • Default to Snappy compression in to_orc when using cuDF or Dask (#11690) @vuule
  • Fix an issue related to MultiIndex when group_keys=True (#11689) @galipremsagar
  • Transfer correct dtype to exploded column (#11687) @wence-
  • Ignore protobuf generated files in mypy checks (#11685) @galipremsagar
  • Maintain the index name after .loc (#11677) @shwina
  • Fix issue with extracting nested column data & dtype preservation (#11671) @galipremsagar
  • Ensure that all cudf tests and benchmarks are conda env aware (#11666) @robertmaynard
  • Update to Thrust 1.17.2 to fix cub ODR issues (#11665) @robertmaynard
  • Fix multi-file remote datasource bug (#11655) @rjzamora
  • Fix invalid regex quantifier check to not include alternation (#11654) @davidwendt
  • Fix bug in device_write(): it uses an incorrect size (#11651) @madsbk
  • fixes overflows in benchmarks (#11649) @elstehle
  • Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
  • Fix compile error in benchmark nested_json.cpp (#11637) @davidwendt
  • Update zfill to match Python output (#11634) @davidwendt
  • Removed converted type for INT32 and INT64 since they do not convert (#11627) @hyperbolic2346
  • Fix host scalars construction of nested types (#11612) @galipremsagar
  • Fix compile warning in nested_json_gpu.cu (#11607) @davidwendt
  • Change default value of ordered to False in CategoricalDtype (#11604) @galipremsagar
  • Preserve order if necessary when deduping categoricals internally (#11597) @brandon-b-miller
  • Add is_timestamp test for leap second (60) (#11594) @davidwendt
  • Fix an issue with to_arrow when column name type is not a string (#11590) @galipremsagar
  • Fix exception in segmented-reduce benchmark (#11588) @davidwendt
  • Fix encode/decode of negative timestamps in ORC reader/writer (#11586) @vuule
  • Correct distribution data type in quantiles benchmark (#11584) @vuule
  • Fix multibyte_split benchmark for host buffers (#11583) @upsj
  • xfail custreamz display test for now (#11567) @shwina
  • Fix JNI for TableWithMeta to use schema_info instead of column_names (#11566) @jlowe
  • Reduce code duplication for dask & distributed nightly/stable installs (#11565) @galipremsagar
  • Fix groupby failures in dask_cudf CI (#11561) @rjzamora
  • Fix for pivot: error when 'values' is a multicharacter string (#11538) @shaswat-indian
  • find_package(cudf) + arrow9 usable with cudf build directory (#11535) @robertmaynard
  • Fixing crash when writing binary nested data in parquet (#11526) @hyperbolic2346
  • Fix for: error when assigning a value to an empty series (#11523) @shaswat-indian
  • Fix invalid results from conditional-left-anti-join in debug build (#11517) @davidwendt
  • Fix cmake error after upgrading to Arrow 9 (#11513) @ttnghia
  • Fix reverse binary operators acting on a host value and cudf.Scalar (#11512) @bdice
  • Update parquet fuzz tests to drop support for skiprows & num_rows (#11505) @galipremsagar
  • Use rapids-cmake 22.10 best practice for RAPIDS.cmake location (#11493) @robertmaynard
  • Handle some zero-sized corner cases in dlpack interop (#11449) @wence-
  • Return empty dataframe when reading an ORC file using empty columns option (#11446) @vuule
  • libcudf c++ example updated to CPM version 0.35.3 (#11417) @robertmaynard
  • Fix regex quantifier check to include capture groups (#11373) @davidwendt
  • Fix read_text when byte_range is aligned with field (#11371) @upsj
  • Fix to_timestamps truncated subsecond calculation (#11367) @davidwendt
  • column: calculate null_count before release()ing the cudf::column (#11365) @wence-

📖 Documentation

  • Update guide-to-udfs notebook (#11861) @brandon-b-miller
  • Update docstring for cudf.read_text (#11799) @GregoryKimball
  • Add doc section for list & struct handling (#11770) @galipremsagar
  • Document that minimum required CMake version is now 3.23.1 (#11751) @robertmaynard
  • Update libcudf documentation build command in DOCUMENTATION.md (#11735) @davidwendt
  • Add docs for use of string data to DataFrame.apply and Series.apply and update guide to UDFs notebook (#11733) @brandon-b-miller
  • Enable more Pydocstyle rules (#11582) @bdice
  • Remove unused cpp/img folder (#11554) @davidwendt
  • Publish C++ developer docs (#11475) @vyasr
  • Fix a misalignment in cudf.get_dummies docstring (#11443) @galipremsagar
  • Update contributing doc to include links to the developer guides (#11390) @davidwendt
  • Fix table_view_base doxygen format (#11340) @davidwendt
  • Create main developer guide for Python (#11235) @vyasr
  • Add developer documentation for benchmarking (#11122) @vyasr
  • cuDF error handling document (#7917) @isVoid

🚀 New Features

  • Add hasNull statistic reading ability to ORC (#11747) @devavret
  • Add istitle to string UDFs (#11738) @brandon-b-miller
  • JSON Column creation in GPU (#11714) @karthikeyann
  • Adds option to take explicit nested schema for nested JSON reader (#11682) @elstehle
  • Add BGZIP data_chunk_reader (#11652) @upsj
  • Support DECIMAL order-by for RANGE window functions (#11645) @mythrocks
  • changing version of cmake to 3.23.3 (#11619) @hyperbolic2346
  • Generate unique keys table in java JNI contiguousSplitGroups (#11614) @res-life
  • Generic type casting to support the new nested JSON reader (#11613) @elstehle
  • JSON tree traversal (#11610) @karthikeyann
  • Add casting operators to masked UDFs (#11578) @brandon-b-miller
  • Adds type inference and type conversion for leaf-columns to the nested JSON parser (#11574) @elstehle
  • Add strings 'like' function (#11558) @davidwendt
  • Handle hyphen as literal for regex cclass when incomplete range (#11557) @davidwendt
  • Enable ZSTD compression in ORC and Parquet writers (#11551) @vuule
  • Adds support for json lines format to the nested JSON reader (#11534) @elstehle
  • Adding optional parquet reader schema (#11524) @hyperbolic2346
  • Adds GPU implementation of JSON-token-stream to JSON-tree (#11518) @karthikeyann
  • Add gdb pretty-printers for simple types (#11499) @upsj
  • Add create_random_column function to the data generator (#11490) @vuule
  • Add fluent API builder to data_profile (#11479) @vuule
  • Adds Nested Json benchmark (#11466) @karthikeyann
  • Convert thrust::optional usages to std::optional (#11455) @robertmaynard
  • Python API for the future experimental JSON reader (#11426) @vuule
  • Return schema info from JSON reader (#11419) @vuule
  • Add regex ASCII flag support for matching builtin character classes (#11404) @davidwendt
  • Truncate parquet column indexes (#11403) @etseidl
  • Adds the end-to-end JSON parser implementation (#11388) @elstehle
  • Use the new JSON parser when the experimental reader is selected (#11364) @vuule
  • Add placeholder for the experimental JSON reader (#11334) @vuule
  • Add read-only functions on string dtypes to DataFrame.apply and Series.apply (#11319) @brandon-b-miller
  • Added 'crosstab' and 'pivot_table' features (#11314) @shaswat-indian
  • Quickly error out when trying to build with unsupported nvcc versions (#11297) @robertmaynard
  • Adds JSON tokenizer (#11264) @elstehle
  • List lexicographic comparator (#11129) @devavret
  • Add generic type inference for cuIO (#11121) @PointKernel
  • Fully support nested types in cudf::contains (#10656) @ttnghia
  • Support nested types in lists::contains (#10548) @ttnghia

🛠️ Improvements

  • Pin dask and distributed for release (#11822) @galipremsagar
  • Add examples for Nested JSON reader (#11814) @GregoryKimball
  • Support shuffle-based groupby aggregations in dask_cudf (#11800) @rjzamora
  • Update strings udf version updater script (#11772) @galipremsagar
  • Remove kwargs in read_csv & to_csv (#11762) @galipremsagar
  • Pass dtype param to avoid pd.Series warnings (#11761) @galipremsagar
  • Enable schema_element & keep_quotes support in json reader (#11746) @galipremsagar
  • Add ability to construct ListColumn when size is None (#11745) @galipremsagar
  • Reduces memory requirements in JSON parser and adds bytes/s and peak memory usage to benchmarks (#11732) @elstehle
  • Add missing copyright headers. (#11712) @bdice
  • Fix copyright check issues in pre-commit (#11711) @bdice
  • Include decimal in supported types for range window order-by columns (#11710) @mythrocks
  • Disable very large column gtest for contiguous-split (#11706) @davidwendt
  • Drop split_out=None test from groupby.agg (#11704) @wence-
  • Use CubinLinker for CUDA Minor Version Compatibility (#11701) @gmarkall
  • Add regex capture-group parameter to auto convert to non-capture groups (#11695) @davidwendt
  • Add a __dataframe__ method to the protocol dataframe object (#11692) @rgommers
  • Special-case multibyte_split for single-byte delimiter (#11681) @upsj
  • Remove isort exclusions (#11680) @bdice
  • Refactor CSV reader benchmarks with nvbench (#11678) @PointKernel
  • Check conda recipe headers with pre-commit (#11669) @bdice
  • Remove redundant style check for clang-format. (#11668) @bdice
  • Add support for group_keys in groupby (#11659) @galipremsagar
  • Fix pandoc pinning. (#11658) @bdice
  • Revert removal of skip_rows / num_rows options from the Parquet reader. (#11657) @nvdbaranec
  • Update git metadata (#11647) @bdice
  • Call set_null_count on a returning column if null-count is known (#11646) @davidwendt
  • Fix some libcudf detail calls not passing the stream variable (#11642) @davidwendt
  • Update to mypy 0.971 (#11640) @wence-
  • Refactor strings strip functor to details header (#11635) @davidwendt
  • Fix incorrect nullCount in get_json_object (#11633) @trxcllnt
  • Simplify hostdevice_vector (#11631) @upsj
  • Refactor parquet writer benchmarks with nvbench (#11623) @PointKernel
  • Rework contains_scalar to check nulls at runtime (#11622) @davidwendt
  • Fix incorrect memory resource used in rolling temp columns (#11618) @mythrocks
  • Upgrade pandas to 1.5 (#11617) @galipremsagar
  • Move type-dispatcher calls from traits.hpp to traits.cpp (#11616) @davidwendt
  • Refactor parquet reader benchmarks with nvbench (#11611) @PointKernel
  • Forward-merge branch-22.08 to branch-22.10 (#11608) @bdice
  • Use stream in Java API. (#11601) @bdice
  • Refactors of public/detail APIs, CUDF_FUNC_RANGE, stream handling. (#11600) @bdice
  • Improve ORC writer benchmark with nvbench (#11598) @PointKernel
  • Tune multibyte_split kernel (#11587) @upsj
  • Move split_utils.cuh to strings/detail (#11585) @davidwendt
  • Fix warnings due to compiler regression with if constexpr (#11581) @ttnghia
  • Add full 24-bit dictionary support to Parquet writer (#11580) @etseidl
  • Expose "explicit-comms" option in shuffle-based dask_cudf functions (#11576) @rjzamora
  • Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
  • Refactor dask_cudf groupby to use apply_concat_apply (#11571) @rjzamora
  • Add ability to write list(struct) columns as map type in orc writer (#11568) @galipremsagar
  • Add byte_range to multibyte_split benchmark + NVBench refactor (#11562) @upsj
  • JNI support for writing binary columns in parquet (#11556) @revans2
  • Support additional dictionary bit widths in Parquet writer (#11547) @etseidl
  • Refactor string/numeric conversion utilities (#11545) @davidwendt
  • Removing unnecessary asserts in parquet tests (#11544) @hyperbolic2346
  • Clean up ORC reader benchmarks with NVBench (#11543) @PointKernel
  • Reuse MurmurHash3_32 in Parquet page data. (#11528) @bdice
  • Add hexadecimal value separators (#11527) @bdice
  • Deprecate skiprows and num_rows in read_orc (#11522) @galipremsagar
  • Struct support for NULL_EQUALS binary operation (#11520) @rwlee
  • Bump hadoop-common from 3.2.3 to 3.2.4 in /java (#11516) @dependabot[bot]
  • Fix Feather test warning. (#11511) @bdice
  • copy_range ballot_syncs to have no execution dependency (#11508) @robertmaynard
  • Upgrade to arrow-9.x (#11507) @galipremsagar
  • Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
  • Single-pass multibyte_split (#11500) @upsj
  • Sanitize percentile_approx() output for empty input (#11498) @SrikarVanavasam
  • Unpin dask and distributed for development (#11492) @galipremsagar
  • Move SparkMurmurHash3_32 functor. (#11489) @bdice
  • Refactor group_nunique.cu to use nullate::DYNAMIC for reduce-by-key functor (#11482) @davidwendt
  • Drop support for skiprows and num_rows in cudf.read_parquet (#11480) @galipremsagar
  • Add reduction distinct_count benchmark (#11473) @ttnghia
  • Add groupby nunique aggregation benchmark (#11472) @ttnghia
  • Disable Arrow S3 support by default. (#11470) @bdice
  • Add groupby max aggregation benchmark (#11464) @ttnghia
  • Extract Dremel encoding code from Parquet (#11461) @vyasr
  • Add missing Thrust #includes. (#11457) @bdice
  • Make CMake hooks verbose (#11456) @vyasr
  • Control Parquet page size through Python API (#11454) @etseidl
  • Add control of Parquet column index creation to python (#11453) @etseidl
  • Remove unused is_struct trait. (#11450) @bdice
  • Refactor the Buffer class (#11447) @madsbk
  • Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
  • Update to Thrust 1.17.0 (#11437) @bdice
  • Add in JNI for parsing JSON data and getting the metadata back too. (#11431) @revans2
  • Convert byte_array_view to use std::byte (#11424) @hyperbolic2346
  • Deprecate unflatten_nested_columns (#11421) @SrikarVanavasam
  • Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
  • Add Spark list hashing Java tests (#11379) @bdice
  • Move cmake to the build section. (#11376) @vyasr
  • Remove use of CUDA driver API calls from libcudf (#11370) @shwina
  • Add column constructor from device_uvector&& (#11356) @SrikarVanavasam
  • Remove unused custreamz thirdparty directory (#11343) @vyasr
  • Update jni version to 22.10.0-SNAPSHOT (#11338) @pxLi
  • Enable using upstream jitify2 (#11287) @shwina
  • Cache cudf.Scalar (#11246) @shwina
  • Remove deprecated Series.applymap. (#11031) @bdice
  • Remove deprecated expand parameter from str.findall. (#11030) @bdice

cudf - v22.08.01

Published by GPUtester about 2 years ago

🚨 Breaking Changes

  • Pin numpy to <1.23 (#11824) @galipremsagar
  • Remove legacy join APIs (#11274) @vyasr
  • Remove lists::drop_list_duplicates (#11236) @ttnghia
  • Remove Index.replace API (#11131) @vyasr
  • Remove deprecated Index methods from Frame (#11073) @vyasr
  • Remove public API of cudf.merge_sorted. (#11032) @bdice
  • Drop python 3.7 in code-base (#11029) @galipremsagar
  • Return empty dataframe when reading a Parquet file using empty columns option (#11018) @vuule
  • Remove Arrow CUDA IPC code (#10995) @shwina
  • Buffer: make .ptr read-only (#10872) @madsbk

🐛 Bug Fixes

  • Fix out-of-bound access in cudf::detail::label_segments (#11497) @ttnghia
  • Fix distributed error related to loop_in_thread (#11428) @galipremsagar
  • Fix atomic operations on NaN values (#11420) @ttnghia
  • Relax arrow pinning to just 8.x and remove cuda build dependency from cudf recipe (#11412) @kkraus14
  • Revert "Allow CuPy 11" (#11409) @jakirkham
  • Fix moto timeouts (#11369) @galipremsagar
  • Set +/-infinity as the identity values for floating-point numbers in device operators min and max (#11357) @ttnghia
  • Fix memory_usage() for ListSeries (#11355) @thomcom
  • Fix constructing Column from column_view with expired mask (#11354) @shwina
  • Handle parquet corner case: Columns with more rows than are in the row group. (#11353) @nvdbaranec
  • Fix DatetimeIndex & TimedeltaIndex constructors (#11342) @galipremsagar
  • Fix unsigned-compare compile warning in IntPow binops (#11339) @davidwendt
  • Fix performance issue and add a new code path to cudf::detail::contains (#11330) @ttnghia
  • Pin pytorch to temporarily unblock from libcupti errors (#11289) @galipremsagar
  • Workaround for nvcomp zstd overwriting blocks for orc due to underestimate of sizes (#11288) @jbrennan333
  • Fix inconsistency when hashing two tables in cudf::detail::contains (#11284) @ttnghia
  • Fix issue related to numpy array and category dtype (#11282) @galipremsagar
  • Add NotImplementedError when on is specified in DataFrame.join. (#11275) @vyasr
  • Fix invalid allocate_like() and empty_like() tests. (#11268) @nvdbaranec
  • Returns DataFrame When Concatenating Along Axis 1 (#11263) @isVoid
  • Fix compile error due to missing header (#11257) @ttnghia
  • Fix a memory aliasing/crash issue in scatter for lists. (#11254) @nvdbaranec
  • Fix tests/rolling/empty_input_test (#11238) @ttnghia
  • Fix const qualifier when using host_span<bitmask_type const*> (#11220) @ttnghia
  • Avoid using nvcompBatchedDeflateDecompressGetTempSizeEx in cuIO (#11213) @vuule
  • Generate benchmark data with correct run length regardless of cardinality (#11205) @vuule
  • Fix cumulative count index behavior (#11188) @brandon-b-miller
  • Fix assertion in dask_cudf test_struct_explode (#11170) @rjzamora
  • Provides a method for the user to remove the hook and re-register the hook in a custom shutdown hook manager (#11161) @res-life
  • Fix compatibility issues with pandas 1.4.3 (#11152) @vyasr
  • Ensure cuco export set is installed in cmake build (#11147) @jlowe
  • Avoid redundant deepcopy in cudf.from_pandas (#11142) @galipremsagar
  • Fix compile error due to missing header (#11126) @ttnghia
  • Fix __cuda_array_interface__ failures (#11113) @galipremsagar
  • Support octal and hex within regex character class pattern (#11112) @davidwendt
  • Fix split_re matching logic for word boundaries (#11106) @davidwendt
  • Handle multiple files metadata in read_parquet (#11105) @galipremsagar
  • Fix index alignment for Series objects with repeated index (#11103) @shwina
  • FindcuFile now searches in the current CUDA Toolkit location (#11101) @robertmaynard
  • Fix regex word boundary logic to include underline (#11099) @davidwendt
  • Exclude CudaFatalTest when selecting all Java tests (#11083) @jlowe
  • Fix duplicate cudatoolkit pinning issue (#11070) @galipremsagar
  • Maintain the input index in the result of a groupby-transform (#11068) @shwina
  • Fix bug with row count comparison for expect_columns_equivalent(). (#11059) @nvdbaranec
  • Fix BPE uninitialized size value for null and empty input strings (#11054) @davidwendt
  • Include missing header for usage of get_current_device_resource() (#11047) @AtlantaPepsi
  • Fix warn_unused_result error in parquet test (#11026) @karthikeyann
  • Return empty dataframe when reading a Parquet file using empty columns option (#11018) @vuule
  • Fix small error in page row count limiting (#10991) @etseidl
  • Fix a row index entry error in ORC writer issue (#10989) @vuule
  • Fix grouped covariance to require both values to be convertible to double. (#10891) @bdice

📖 Documentation

  • Defer loading of custom.js (#11465) @galipremsagar
  • Fix issues with day & night modes in python docs (#11400) @galipremsagar
  • Update missing data handling APIs in docs (#11345) @galipremsagar
  • Add lists filtering APIs to doxygen group. (#11336) @bdice
  • Remove unused import in README sample (#11318) @vyasr
  • Note null behavior in where docs (#11276) @brandon-b-miller
  • Update docstring for spans in get_row_data_range (#11271) @vyasr
  • Update nvCOMP integration table (#11231) @vuule
  • Add dev docs for documentation writing (#11217) @vyasr
  • Documentation fix for concatenate (#11187) @dagardner-nv
  • Fix unresolved links in markdown (#11173) @karthikeyann
  • Fix cudf version in README.md install commands (#11164) @jvanstraten
  • Switch language from None to "en" in docs build (#11133) @galipremsagar
  • Remove docs mentioning scalar_view since no such class exists. (#11132) @bdice
  • Add docstring entry for DataFrame.value_counts (#11039) @galipremsagar
  • Add docs to rolling var, std, count. (#11035) @bdice
  • Fix docs for Numba UDFs. (#11020) @bdice
  • Replace column comparison utilities functions with macros (#11007) @karthikeyann
  • Fix Doxygen warnings in multiple headers files (#11003) @karthikeyann
  • Fix doxygen warnings in utilities/ headers (#10974) @karthikeyann
  • Fix Doxygen warnings in table header files (#10964) @karthikeyann
  • Fix Doxygen warnings in column header files (#10963) @karthikeyann
  • Fix Doxygen warnings in strings / header files (#10937) @karthikeyann
  • Generate Doxygen Tag File for Libcudf (#10932) @isVoid
  • Fix doxygen warnings in structs, lists headers (#10923) @karthikeyann
  • Fix doxygen warnings in fixed_point.hpp (#10922) @karthikeyann
  • Fix doxygen warnings in ast/, rolling, tdigest/, wrappers/, dictionary/ headers (#10921) @karthikeyann
  • fix doxygen warnings in cudf/io/types.hpp, other header files (#10913) @karthikeyann
  • fix doxygen warnings in cudf/io/ avro, csv, json, orc, parquet header files (#10912) @karthikeyann
  • Fix doxygen warnings in cudf/*.hpp (#10896) @karthikeyann
  • Add missing documentation in aggregation.hpp (#10887) @karthikeyann
  • Revise PR template. (#10774) @bdice

🚀 New Features

  • Change cmake to allow controlling Arrow version via cmake variable (#11429) @kkraus14
  • Adding support for list<int8> columns to be written as byte arrays in parquet (#11328) @hyperbolic2346
  • Adding byte array view structure (#11322) @hyperbolic2346
  • Adding byte_array statistics (#11303) @hyperbolic2346
  • Add column indexes to Parquet writer (#11302) @etseidl
  • Provide an Option for Default Integer and Floating Bitwidth (#11272) @isVoid
  • FST benchmark (#11243) @karthikeyann
  • Adds the Finite-State Transducer algorithm (#11242) @elstehle
  • Refactor collect_set to use cudf::distinct and cudf::lists::distinct (#11228) @ttnghia
  • Treat zstd as stable in nvcomp releases 2.3.2 and later (#11226) @jbrennan333
  • Add 24 bit dictionary support to Parquet writer (#11216) @devavret
  • Enable positive group indices for extractAllRecord on JNI (#11215) @anthony-chang
  • JNI bindings for NTH_ELEMENT window aggregation (#11201) @mythrocks
  • Add JNI bindings for extractAllRecord (#11196) @anthony-chang
  • Add cudf.options (#11193) @isVoid
  • Add thrift support for parquet column and offset indexes (#11178) @etseidl
  • Adding binary read/write as options for parquet (#11160) @hyperbolic2346
  • Support nth_element for window functions (#11158) @mythrocks
  • Implement lists::distinct and cudf::detail::stable_distinct (#11149) @ttnghia
  • Implement Groupby pct_change (#11144) @skirui-source
  • Add JNI for set operations (#11143) @ttnghia
  • Remove deprecated PER_THREAD_DEFAULT_STREAM (#11134) @jbrennan333
  • Added a Java method to check the existence of a list of keys in a map (#11128) @razajafri
  • Feature/python benchmarking (#11125) @vyasr
  • Support nan_equality in cudf::distinct (#11118) @ttnghia
  • Added JNI for getMapValueForKeys (#11104) @razajafri
  • Refactor semi_anti_join (#11100) @ttnghia
  • Replace remaining instances of rmm::cuda_stream_default with cudf::default_stream_value (#11082) @jbrennan333
  • Adds the Logical Stack algorithm (#11078) @elstehle
  • Add doxygen-check pre-commit hook (#11076) @karthikeyann
  • Use new nvCOMP API to optimize the decompression temp memory size (#11064) @vuule
  • Add Doxygen CI check (#11057) @karthikeyann
  • Support duplicate_keep_option in cudf::distinct (#11052) @ttnghia
  • Support set operations (#11043) @ttnghia
  • Support for ZLIB compression in ORC writer (#11036) @vuule
  • Adding feature swaplevels (#11027) @VamsiTallam95
  • Use nvCOMP for ZLIB decompression in ORC reader (#11024) @vuule
  • Function for bfill, ffill #9591 (#11022) @Sreekiran096
  • Generate group offsets from element labels (#11017) @ttnghia
  • Feature axes (#10979) @VamsiTallam95
  • Generate group labels from offsets (#10945) @ttnghia
  • Add missing cuIO benchmark coverage for duration types (#10933) @vuule
  • Dask-cuDF cumulative groupby ops (#10889) @brandon-b-miller
  • Reindex Improvements (#10815) @brandon-b-miller
  • Implement value_counts for DataFrame (#10813) @martinfalisse

🛠️ Improvements

  • Pin numpy to <1.23 (#11824) @galipremsagar
  • Make Index Join Tests on Default Precisions Deterministic (#11451) @isVoid
  • Pin dask & distributed for release (#11433) @galipremsagar
  • Use documented header template for doxygen (#11430) @galipremsagar
  • Relax arrow version in dev env (#11418) @galipremsagar
  • Added Java bindings for Parquet options for binary read (#11410) @razajafri
  • Allow CuPy 11 (#11393) @jakirkham
  • Improve multibyte_split performance (#11347) @cwharris
  • Switch death test to use explicit trap. (#11326) @vyasr
  • Add --output-on-failure to ctest args. (#11321) @vyasr
  • Consolidate remaining DataFrame/Series APIs (#11315) @vyasr
  • Add JNI support for the join_strings API (#11309) @revans2
  • Add cupy version to setup.py install_requires (#11306) @vyasr
  • removing some unused code (#11305) @hyperbolic2346
  • Add test of wildcard selection (#11300) @vyasr
  • Update parquet reader to take stream parameter (#11294) @PointKernel
  • Spark list hashing (#11292) @bdice
  • Remove legacy join APIs (#11274) @vyasr
  • Fix cudf recipes syntax (#11273) @ajschmidt8
  • Fix cudf recipe (#11267) @ajschmidt8
  • Cleanup config files (#11266) @vyasr
  • Run mypy on all packages (#11265) @vyasr
  • Update to isort 5.10.1. (#11262) @vyasr
  • Consolidate flake8 and pydocstyle configuration (#11260) @vyasr
  • Remove redundant black config specifications. (#11258) @vyasr
  • Ensure DeprecationWarnings are not introduced via pre-commit (#11255) @wence-
  • Optimization to gpu::PreprocessColumnData in parquet reader. (#11252) @nvdbaranec
  • Move rolling impl details to detail/ directory. (#11250) @mythrocks
  • Remove lists::drop_list_duplicates (#11236) @ttnghia
  • Use cudf::lists::distinct in Python binding (#11234) @ttnghia
  • Use cudf::lists::distinct in Java binding (#11233) @ttnghia
  • Use cudf::distinct in Java binding (#11232) @ttnghia
  • Pin dask-cuda in dev environment (#11229) @galipremsagar
  • Remove cruft in map_lookup (#11221) @mythrocks
  • Deprecate skiprows & num_rows in parquet reader (#11218) @galipremsagar
  • Remove Frame._index (#11210) @vyasr
  • Improve performance for cudf::contains when searching for a scalar (#11202) @ttnghia
  • Document why Development component is needing for CMake. (#11200) @vyasr
  • cleanup unused code in rolling_test.hpp (#11195) @karthikeyann
  • Standardize join internals around DataFrame (#11184) @vyasr
  • Move character case table declarations from src to detail (#11183) @davidwendt
  • Remove usage of Frame in StringMethods (#11181) @vyasr
  • Expose get_json_object_options to Python (#11180) @SrikarVanavasam
  • Fix decimal128 stats in parquet writer (#11179) @etseidl
  • Modify CheckPageRows in parquet_test to use datasources (#11177) @etseidl
  • Pin max version of cuda-python to 11.7.0 (#11174) @Ethyling
  • Refactor and optimize Frame.where (#11168) @vyasr
  • Add npos const static member to cudf::string_view (#11166) @davidwendt
  • Move _drop_rows_by_label from Frame to IndexedFrame (#11157) @vyasr
  • Clean up _copy_type_metadata (#11156) @vyasr
  • Add nvcc conda package in dev environment (#11154) @galipremsagar
  • Struct binary comparison op functionality for spark rapids (#11153) @rwlee
  • Refactor inline conditionals. (#11151) @bdice
  • Refactor Spark hashing tests (#11145) @bdice
  • Add new _from_data_like_self factory (#11140) @vyasr
  • Update get_cucollections to use rapids-cmake (#11139) @vyasr
  • Remove unnecessary extra function for libcudacxx detection (#11138) @vyasr
  • Allow initial value for cudf::reduce and cudf::segmented_reduce. (#11137) @SrikarVanavasam
  • Remove Index.replace API (#11131) @vyasr
  • Move char-type table function declarations from src to detail (#11127) @davidwendt
  • Clean up repo root (#11124) @bdice
  • Improve print formatting of strings containing newline characters. (#11108) @nvdbaranec
  • Fix cudf::string_view::find() to return pos for empty string argument (#11107) @davidwendt
  • Forward-merge branch-22.06 to branch-22.08 (#11086) @bdice
  • Take iterators by value in clamp.cu. (#11084) @bdice
  • Performance improvements for row to column conversions (#11075) @hyperbolic2346
  • Remove deprecated Index methods from Frame (#11073) @vyasr
  • Use per-page max compressed size estimate for compression (#11066) @devavret
  • column to row refactor for performance (#11063) @hyperbolic2346
  • Include skbuild directory into build.sh clean operation (#11060) @galipremsagar
  • Unpin dask & distributed for development (#11058) @galipremsagar
  • Add support for Series.between (#11051) @galipremsagar
  • Fix groupby include (#11046) @bwyogatama
  • Regex cleanup internal reclass and reclass_device classes (#11045) @davidwendt
  • Remove public API of cudf.merge_sorted. (#11032) @bdice
  • Drop python 3.7 in code-base (#11029) @galipremsagar
  • Addition & integration of the integer power operator (#11025) @AtlantaPepsi
  • Refactor lists::contains (#11019) @ttnghia
  • Change build.sh to find C++ library by default and avoid shadowing CMAKE_ARGS (#11013) @vyasr
  • Clean up parquet unit test (#11005) @PointKernel
  • Add missing #pragma once to header files (#11004) @karthikeyann
  • Cleanup iterator.cuh and add fixed point support for scalar_optional_accessor (#10999) @ttnghia
  • Refactor cudf::contains (#10997) @ttnghia
  • Remove Arrow CUDA IPC code (#10995) @shwina
  • Change file extension for groupby benchmark (#10985) @ttnghia
  • Sort recipe include checks. (#10984) @bdice
  • Update cuCollections for thrust upgrade (#10983) @PointKernel
  • Expose row-group size options in cudf ParquetWriter (#10980) @rjzamora
  • Cleanup cudf::strings::detail::regex_parser class source (#10975) @davidwendt
  • Handle missing fields as nulls in get_json_object() (#10970) @SrikarVanavasam
  • Fix license families to match all-caps expected by conda-verify. (#10931) @bdice
  • Include <optional> for GCC 11 compatibility. (#10927) @bdice
  • Enable builds with scikit-build (#10919) @vyasr
  • Improve distinct by using cuco::static_map::retrieve_all (#10916) @PointKernel
  • update cudfjni to 22.08.0-SNAPSHOT (#10910) @pxLi
  • Improve the capture of fatal cuda error (#10884) @sperlingxx
  • Cleanup regex compiler operators and operands source (#10879) @davidwendt
  • Buffer: make .ptr read-only (#10872) @madsbk
  • Configurable NaN handling in device_row_comparators (#10870) @rwlee
  • Register cudf.core.groupby.Grouper objects to dask grouper_dispatch (#10838) @brandon-b-miller
  • Upgrade to arrow-8 (#10816) @galipremsagar
  • Remove getattr method in RangeIndex class (#10538) @skirui-source
  • Adding bins to value counts (#8247) @marlenezw

cudf - v22.08.00

Published by GPUtester about 2 years ago

🚨 Breaking Changes

  • Remove legacy join APIs (#11274) @vyasr
  • Remove lists::drop_list_duplicates (#11236) @ttnghia
  • Remove Index.replace API (#11131) @vyasr
  • Remove deprecated Index methods from Frame (#11073) @vyasr
  • Remove public API of cudf.merge_sorted. (#11032) @bdice
  • Drop python 3.7 in code-base (#11029) @galipremsagar
  • Return empty dataframe when reading a Parquet file using empty columns option (#11018) @vuule
  • Remove Arrow CUDA IPC code (#10995) @shwina
  • Buffer: make .ptr read-only (#10872) @madsbk

🐛 Bug Fixes

  • Fix distributed error related to loop_in_thread (#11428) @galipremsagar
  • Relax arrow pinning to just 8.x and remove cuda build dependency from cudf recipe (#11412) @kkraus14
  • Revert "Allow CuPy 11" (#11409) @jakirkham
  • Fix moto timeouts (#11369) @galipremsagar
  • Set +/-infinity as the identity values for floating-point numbers in device operators min and max (#11357) @ttnghia
  • Fix memory_usage() for ListSeries (#11355) @thomcom
  • Fix constructing Column from column_view with expired mask (#11354) @shwina
  • Handle parquet corner case: Columns with more rows than are in the row group. (#11353) @nvdbaranec
  • Fix DatetimeIndex & TimedeltaIndex constructors (#11342) @galipremsagar
  • Fix unsigned-compare compile warning in IntPow binops (#11339) @davidwendt
  • Fix performance issue and add a new code path to cudf::detail::contains (#11330) @ttnghia
  • Pin pytorch to temporarily unblock from libcupti errors (#11289) @galipremsagar
  • Workaround for nvcomp zstd overwriting blocks for orc due to underestimate of sizes (#11288) @jbrennan333
  • Fix inconsistency when hashing two tables in cudf::detail::contains (#11284) @ttnghia
  • Fix issue related to numpy array and category dtype (#11282) @galipremsagar
  • Add NotImplementedError when on is specified in DataFrame.join. (#11275) @vyasr
  • Fix invalid allocate_like() and empty_like() tests. (#11268) @nvdbaranec
  • Returns DataFrame When Concating Along Axis 1 (#11263) @isVoid
  • Fix compile error due to missing header (#11257) @ttnghia
  • Fix a memory aliasing/crash issue in scatter for lists. (#11254) @nvdbaranec
  • Fix tests/rolling/empty_input_test (#11238) @ttnghia
  • Fix const qualifier when using host_span<bitmask_type const*> (#11220) @ttnghia
  • Avoid using nvcompBatchedDeflateDecompressGetTempSizeEx in cuIO (#11213) @vuule
  • Generate benchmark data with correct run length regardless of cardinality (#11205) @vuule
  • Fix cumulative count index behavior (#11188) @brandon-b-miller
  • Fix assertion in dask_cudf test_struct_explode (#11170) @rjzamora
  • Provides a method for the user to remove the hook and re-register the hook in a custom shutdown hook manager (#11161) @res-life
  • Fix compatibility issues with pandas 1.4.3 (#11152) @vyasr
  • Ensure cuco export set is installed in cmake build (#11147) @jlowe
  • Avoid redundant deepcopy in cudf.from_pandas (#11142) @galipremsagar
  • Fix compile error due to missing header (#11126) @ttnghia
  • Fix __cuda_array_interface__ failures (#11113) @galipremsagar
  • Support octal and hex within regex character class pattern (#11112) @davidwendt
  • Fix split_re matching logic for word boundaries (#11106) @davidwendt
  • Handle multiple files metadata in read_parquet (#11105) @galipremsagar
  • Fix index alignment for Series objects with repeated index (#11103) @shwina
  • FindcuFile now searches in the current CUDA Toolkit location (#11101) @robertmaynard
  • Fix regex word boundary logic to include underline (#11099) @davidwendt
  • Exclude CudaFatalTest when selecting all Java tests (#11083) @jlowe
  • Fix duplicate cudatoolkit pinning issue (#11070) @galipremsagar
  • Maintain the input index in the result of a groupby-transform (#11068) @shwina
  • Fix bug with row count comparison for expect_columns_equivalent(). (#11059) @nvdbaranec
  • Fix BPE uninitialized size value for null and empty input strings (#11054) @davidwendt
  • Include missing header for usage of get_current_device_resource() (#11047) @AtlantaPepsi
  • Fix warn_unused_result error in parquet test (#11026) @karthikeyann
  • Return empty dataframe when reading a Parquet file using empty columns option (#11018) @vuule
  • Fix small error in page row count limiting (#10991) @etseidl
  • Fix a row index entry error in ORC writer issue (#10989) @vuule
  • Fix grouped covariance to require both values to be convertible to double. (#10891) @bdice

📖 Documentation

  • Fix issues with day & night modes in python docs (#11400) @galipremsagar
  • Update missing data handling APIs in docs (#11345) @galipremsagar
  • Add lists filtering APIs to doxygen group. (#11336) @bdice
  • Remove unused import in README sample (#11318) @vyasr
  • Note null behavior in where docs (#11276) @brandon-b-miller
  • Update docstring for spans in get_row_data_range (#11271) @vyasr
  • Update nvCOMP integration table (#11231) @vuule
  • Add dev docs for documentation writing (#11217) @vyasr
  • Documentation fix for concatenate (#11187) @dagardner-nv
  • Fix unresolved links in markdown (#11173) @karthikeyann
  • Fix cudf version in README.md install commands (#11164) @jvanstraten
  • Switch language from None to "en" in docs build (#11133) @galipremsagar
  • Remove docs mentioning scalar_view since no such class exists. (#11132) @bdice
  • Add docstring entry for DataFrame.value_counts (#11039) @galipremsagar
  • Add docs to rolling var, std, count. (#11035) @bdice
  • Fix docs for Numba UDFs. (#11020) @bdice
  • Replace column comparison utilities functions with macros (#11007) @karthikeyann
  • Fix Doxygen warnings in multiple headers files (#11003) @karthikeyann
  • Fix doxygen warnings in utilities/ headers (#10974) @karthikeyann
  • Fix Doxygen warnings in table header files (#10964) @karthikeyann
  • Fix Doxygen warnings in column header files (#10963) @karthikeyann
  • Fix Doxygen warnings in strings / header files (#10937) @karthikeyann
  • Generate Doxygen Tag File for Libcudf (#10932) @isVoid
  • Fix doxygen warnings in structs, lists headers (#10923) @karthikeyann
  • Fix doxygen warnings in fixed_point.hpp (#10922) @karthikeyann
  • Fix doxygen warnings in ast/, rolling, tdigest/, wrappers/, dictionary/ headers (#10921) @karthikeyann
  • fix doxygen warnings in cudf/io/types.hpp, other header files (#10913) @karthikeyann
  • fix doxygen warnings in cudf/io/ avro, csv, json, orc, parquet header files (#10912) @karthikeyann
  • Fix doxygen warnings in cudf/*.hpp (#10896) @karthikeyann
  • Add missing documentation in aggregation.hpp (#10887) @karthikeyann
  • Revise PR template. (#10774) @bdice

🚀 New Features

  • Change cmake to allow controlling Arrow version via cmake variable (#11429) @kkraus14
  • Adding support for list<int8> columns to be written as byte arrays in parquet (#11328) @hyperbolic2346
  • Adding byte array view structure (#11322) @hyperbolic2346
  • Adding byte_array statistics (#11303) @hyperbolic2346
  • Add column indexes to Parquet writer (#11302) @etseidl
  • Provide an Option for Default Integer and Floating Bitwidth (#11272) @isVoid
  • FST benchmark (#11243) @karthikeyann
  • Adds the Finite-State Transducer algorithm (#11242) @elstehle
  • Refactor collect_set to use cudf::distinct and cudf::lists::distinct (#11228) @ttnghia
  • Treat zstd as stable in nvcomp releases 2.3.2 and later (#11226) @jbrennan333
  • Add 24 bit dictionary support to Parquet writer (#11216) @devavret
  • Enable positive group indices for extractAllRecord on JNI (#11215) @anthony-chang
  • JNI bindings for NTH_ELEMENT window aggregation (#11201) @mythrocks
  • Add JNI bindings for extractAllRecord (#11196) @anthony-chang
  • Add cudf.options (#11193) @isVoid
  • Add thrift support for parquet column and offset indexes (#11178) @etseidl
  • Adding binary read/write as options for parquet (#11160) @hyperbolic2346
  • Support nth_element for window functions (#11158) @mythrocks
  • Implement lists::distinct and cudf::detail::stable_distinct (#11149) @ttnghia
  • Implement Groupby pct_change (#11144) @skirui-source
  • Add JNI for set operations (#11143) @ttnghia
  • Remove deprecated PER_THREAD_DEFAULT_STREAM (#11134) @jbrennan333
  • Added a Java method to check the existence of a list of keys in a map (#11128) @razajafri
  • Feature/python benchmarking (#11125) @vyasr
  • Support nan_equality in cudf::distinct (#11118) @ttnghia
  • Added JNI for getMapValueForKeys (#11104) @razajafri
  • Refactor semi_anti_join (#11100) @ttnghia
  • Replace remaining instances of rmm::cuda_stream_default with cudf::default_stream_value (#11082) @jbrennan333
  • Adds the Logical Stack algorithm (#11078) @elstehle
  • Add doxygen-check pre-commit hook (#11076) @karthikeyann
  • Use new nvCOMP API to optimize the decompression temp memory size (#11064) @vuule
  • Add Doxygen CI check (#11057) @karthikeyann
  • Support duplicate_keep_option in cudf::distinct (#11052) @ttnghia
  • Support set operations (#11043) @ttnghia
  • Support for ZLIB compression in ORC writer (#11036) @vuule
  • Adding feature swaplevels (#11027) @VamsiTallam95
  • Use nvCOMP for ZLIB decompression in ORC reader (#11024) @vuule
  • Function for bfill, ffill #9591 (#11022) @Sreekiran096
  • Generate group offsets from element labels (#11017) @ttnghia
  • Feature axes (#10979) @VamsiTallam95
  • Generate group labels from offsets (#10945) @ttnghia
  • Add missing cuIO benchmark coverage for duration types (#10933) @vuule
  • Dask-cuDF cumulative groupby ops (#10889) @brandon-b-miller
  • Reindex Improvements (#10815) @brandon-b-miller
  • Implement value_counts for DataFrame (#10813) @martinfalisse

🛠️ Improvements

  • Pin dask & distributed for release (#11433) @galipremsagar
  • Use documented header template for doxygen (#11430) @galipremsagar
  • Relax arrow version in dev env (#11418) @galipremsagar
  • Allow CuPy 11 (#11393) @jakirkham
  • Improve multibyte_split performance (#11347) @cwharris
  • Switch death test to use explicit trap. (#11326) @vyasr
  • Add --output-on-failure to ctest args. (#11321) @vyasr
  • Consolidate remaining DataFrame/Series APIs (#11315) @vyasr
  • Add JNI support for the join_strings API (#11309) @revans2
  • Add cupy version to setup.py install_requires (#11306) @vyasr
  • removing some unused code (#11305) @hyperbolic2346
  • Add test of wildcard selection (#11300) @vyasr
  • Update parquet reader to take stream parameter (#11294) @PointKernel
  • Spark list hashing (#11292) @bdice
  • Remove legacy join APIs (#11274) @vyasr
  • Fix cudf recipes syntax (#11273) @ajschmidt8
  • Fix cudf recipe (#11267) @ajschmidt8
  • Cleanup config files (#11266) @vyasr
  • Run mypy on all packages (#11265) @vyasr
  • Update to isort 5.10.1. (#11262) @vyasr
  • Consolidate flake8 and pydocstyle configuration (#11260) @vyasr
  • Remove redundant black config specifications. (#11258) @vyasr
  • Ensure DeprecationWarnings are not introduced via pre-commit (#11255) @wence-
  • Optimization to gpu::PreprocessColumnData in parquet reader. (#11252) @nvdbaranec
  • Move rolling impl details to detail/ directory. (#11250) @mythrocks
  • Remove lists::drop_list_duplicates (#11236) @ttnghia
  • Use cudf::lists::distinct in Python binding (#11234) @ttnghia
  • Use cudf::lists::distinct in Java binding (#11233) @ttnghia
  • Use cudf::distinct in Java binding (#11232) @ttnghia
  • Pin dask-cuda in dev environment (#11229) @galipremsagar
  • Remove cruft in map_lookup (#11221) @mythrocks
  • Deprecate skiprows & num_rows in parquet reader (#11218) @galipremsagar
  • Remove Frame._index (#11210) @vyasr
  • Improve performance for cudf::contains when searching for a scalar (#11202) @ttnghia
  • Document why Development component is needing for CMake. (#11200) @vyasr
  • cleanup unused code in rolling_test.hpp (#11195) @karthikeyann
  • Standardize join internals around DataFrame (#11184) @vyasr
  • Move character case table declarations from src to detail (#11183) @davidwendt
  • Remove usage of Frame in StringMethods (#11181) @vyasr
  • Expose get_json_object_options to Python (#11180) @SrikarVanavasam
  • Fix decimal128 stats in parquet writer (#11179) @etseidl
  • Modify CheckPageRows in parquet_test to use datasources (#11177) @etseidl
  • Pin max version of cuda-python to 11.7.0 (#11174) @Ethyling
  • Refactor and optimize Frame.where (#11168) @vyasr
  • Add npos const static member to cudf::string_view (#11166) @davidwendt
  • Move _drop_rows_by_label from Frame to IndexedFrame (#11157) @vyasr
  • Clean up _copy_type_metadata (#11156) @vyasr
  • Add nvcc conda package in dev environment (#11154) @galipremsagar
  • Struct binary comparison op functionality for spark rapids (#11153) @rwlee
  • Refactor inline conditionals. (#11151) @bdice
  • Refactor Spark hashing tests (#11145) @bdice
  • Add new _from_data_like_self factory (#11140) @vyasr
  • Update get_cucollections to use rapids-cmake (#11139) @vyasr
  • Remove unnecessary extra function for libcudacxx detection (#11138) @vyasr
  • Allow initial value for cudf::reduce and cudf::segmented_reduce. (#11137) @SrikarVanavasam
  • Remove Index.replace API (#11131) @vyasr
  • Move char-type table function declarations from src to detail (#11127) @davidwendt
  • Clean up repo root (#11124) @bdice
  • Improve print formatting of strings containing newline characters. (#11108) @nvdbaranec
  • Fix cudf::string_view::find() to return pos for empty string argument (#11107) @davidwendt
  • Forward-merge branch-22.06 to branch-22.08 (#11086) @bdice
  • Take iterators by value in clamp.cu. (#11084) @bdice
  • Performance improvements for row to column conversions (#11075) @hyperbolic2346
  • Remove deprecated Index methods from Frame (#11073) @vyasr
  • Use per-page max compressed size estimate for compression (#11066) @devavret
  • column to row refactor for performance (#11063) @hyperbolic2346
  • Include skbuild directory into build.sh clean operation (#11060) @galipremsagar
  • Unpin dask & distributed for development (#11058) @galipremsagar
  • Add support for Series.between (#11051) @galipremsagar
  • Fix groupby include (#11046) @bwyogatama
  • Regex cleanup internal reclass and reclass_device classes (#11045) @davidwendt
  • Remove public API of cudf.merge_sorted. (#11032) @bdice
  • Drop python 3.7 in code-base (#11029) @galipremsagar
  • Addition & integration of the integer power operator (#11025) @AtlantaPepsi
  • Refactor lists::contains (#11019) @ttnghia
  • Change build.sh to find C++ library by default and avoid shadowing CMAKE_ARGS (#11013) @vyasr
  • Clean up parquet unit test (#11005) @PointKernel
  • Add missing #pragma once to header files (#11004) @karthikeyann
  • Cleanup iterator.cuh and add fixed point support for scalar_optional_accessor (#10999) @ttnghia
  • Refactor cudf::contains (#10997) @ttnghia
  • Remove Arrow CUDA IPC code (#10995) @shwina
  • Change file extension for groupby benchmark (#10985) @ttnghia
  • Sort recipe include checks. (#10984) @bdice
  • Update cuCollections for thrust upgrade (#10983) @PointKernel
  • Expose row-group size options in cudf ParquetWriter (#10980) @rjzamora
  • Cleanup cudf::strings::detail::regex_parser class source (#10975) @davidwendt
  • Handle missing fields as nulls in get_json_object() (#10970) @SrikarVanavasam
  • Fix license families to match all-caps expected by conda-verify. (#10931) @bdice
  • Include <optional> for GCC 11 compatibility. (#10927) @bdice
  • Enable builds with scikit-build (#10919) @vyasr
  • Improve distinct by using cuco::static_map::retrieve_all (#10916) @PointKernel
  • update cudfjni to 22.08.0-SNAPSHOT (#10910) @pxLi
  • Improve the capture of fatal cuda error (#10884) @sperlingxx
  • Cleanup regex compiler operators and operands source (#10879) @davidwendt
  • Buffer: make .ptr read-only (#10872) @madsbk
  • Configurable NaN handling in device_row_comparators (#10870) @rwlee
  • Register cudf.core.groupby.Grouper objects to dask grouper_dispatch (#10838) @brandon-b-miller
  • Upgrade to arrow-8 (#10816) @galipremsagar
  • Remove getattr method in RangeIndex class (#10538) @skirui-source
  • Adding bins to value counts (#8247) @marlenezw

Package Rankings
Top 5.32% on Pypi.org
Top 8.17% on Proxy.golang.org
Top 4.8% on Repo1.maven.org