cudf | Python Ecosystem Directory

cudf - v24.04.00 Latest Release

Published by raydouglass 6 months ago

🚨 Breaking Changes

Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
Change exceptions thrown by copying APIs (#15319) @vyasr
Change strings_column_view::char_size to return int64 (#15197) @davidwendt
Upgrade to arrow-14.0.2 (#15108) @galipremsagar
Add support for pandas-2.2 in cudf (#15100) @galipremsagar
Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
Raise an error on import for unsupported GPUs. (#15053) @bdice
Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
Add future_stack to DataFrame.stack (#15015) @galipremsagar
Deprecate groupby fillna (#15000) @mroeschke
Deprecate replace with categorical columns (#14988) @mroeschke
Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
Add pandas-2.x support in cudf (#14916) @galipremsagar
Use cuco::static_set in the hash-based groupby (#14813) @PointKernel
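
Several of the deprecations above track pandas 2.2. As a hedged illustration of the groupby fillna deprecation (#15000), the sketch below shows a per-group forward fill with GroupBy.ffill as one possible replacement. It uses plain pandas as a stand-in so it runs without a GPU, and the sample data is hypothetical. These notes do not say which replacement cuDF recommends, so treat this as an assumption, not the project's official migration path.

```python
import pandas as pd  # stand-in for cudf so the sketch runs without a GPU (assumption)

# Hypothetical data: two groups, each containing one missing value.
df = pd.DataFrame({"k": ["a", "a", "b", "b"], "v": [1.0, None, None, 4.0]})

# Per-group forward fill, used here instead of the deprecated GroupBy.fillna.
# The missing value in group "b" stays missing because nothing precedes it
# within that group.
filled = df.groupby("k")["v"].ffill()

print(filled.tolist())
```

The result keeps the original row order and index, so it can be assigned straight back to a column of the source frame.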

🐛 Bug Fixes

Fix an issue with creating a series from scalar when dtype='category' (#15476) @galipremsagar
Update pre-commit-hooks to v0.0.3 (#15355) @KyleFromNVIDIA
[BUG][JNI] Trigger MemoryBuffer.onClosed after memory is freed (#15351) @abellina
Fix an issue with multiple short list rowgroups using the Parquet chunked reader. (#15342) @nvdbaranec
Avoid importing dask-expr if "query-planning" config is False (#15340) @rjzamora
Fix gtests/ERROR_TEST errors when run in Debug (#15317) @davidwendt
Fix OOB read in inflate_kernel (#15309) @vuule
Work around a cuFile error when running CSV tests with memcheck (#15293) @vuule
Fix Doxygen upload directory (#15291) @KyleFromNVIDIA
Fix Doxygen check (#15289) @KyleFromNVIDIA
Reintroduce PANDAS_GE_220 import (#15287) @wence-
Fix mean computation for the geometric distribution in the data generator (#15282) @vuule
Fix Parquet decimal64 stats (#15281) @etseidl
Make linking of nvtx3-cpp BUILD_LOCAL_INTERFACE (#15271) @KyleFromNVIDIA
Workaround compute-sanitizer memcheck bug (#15259) @davidwendt
Cleanup hostdevice_vector and add more APIs (#15252) @ttnghia
Fix number of rows in randomly generated lists columns (#15248) @vuule
Fix wrong output for collect_list/collect_set of lists column (#15243) @ttnghia
Fix testChunkedPackTwoPasses to copy from the bounce buffer (#15220) @abellina
Fix accessing .columns by an external API (#15212) @galipremsagar
[JNI] Disable testChunkedPackTwoPasses for now (#15210) @abellina
Update labeler and codeowner configs for CMake files (#15208) @PointKernel
Avoid dict normalization in __dask_tokenize__ (#15187) @rjzamora
Fix memcheck error in distinct inner join (#15164) @PointKernel
Remove unneeded script parameters in test_cpp_memcheck.sh (#15158) @davidwendt
Fix ListColumn.to_pandas() to retain list type (#15155) @galipremsagar
Avoid factorization in MultiIndex.to_pandas (#15150) @mroeschke
Fix GroupBy.get_group and GroupBy.indices (#15143) @wence-
Remove const from range_window_bounds::_extent. (#15138) @mythrocks
DataFrame.columns = ... retains RangeIndex & set dtype (#15129) @mroeschke
Correctly handle output for GroupBy.apply when chunk results are reindexed series (#15109) @brandon-b-miller
Fix Series.groupby.shift with a MultiIndex (#15098) @mroeschke
Fix reductions when DataFrame has MultiIndex columns (#15097) @mroeschke
Fix deprecation warnings for deprecated hash() calls (#15095) @davidwendt
Add support for arrow large_string in cudf (#15093) @galipremsagar
Fix sort_values pytest failure with pandas-2.x regression (#15092) @galipremsagar
Resolve path parsing issues in get_json_object (#15082) @SurajAralihalli
Fix bugs in handling of delta encodings (#15075) @etseidl
Fix is_device_write_preferred in void_sink and user_sink_wrapper (#15064) @vuule
Eliminate duplicate allocation of nested string columns (#15061) @vuule
Raise an error on import for unsupported GPUs. (#15053) @bdice
Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
Fix Index.difference to handle duplicate values when one of the inputs is empty (#15016) @galipremsagar
Add future_stack to DataFrame.stack (#15015) @galipremsagar
Fix handling of values=None in pylibcudf GroupBy.get_groups (#14998) @shwina
Fix DataFrame.sort_index to respect ignore_index on all axis (#14995) @galipremsagar
Raise for pyarrow array that is tz-aware (#14980) @mroeschke
Direct SeriesGroupBy.aggregate to SeriesGroupBy.agg (#14971) @rjzamora
Respect IntervalDtype and CategoricalDtype objects passed by users (#14961) @mroeschke
unset CUDF_SPILL after a pytest (#14958) @galipremsagar
Fix Null literals to be not parsed as string when mixed types as string is enabled in JSON reader (#14939) @karthikeyann
Fix chunked reads of Parquet delta encoded pages (#14921) @etseidl
Fix reading offset for data stream in ORC reader (#14911) @ttnghia
Enable sanitizer check for a test case testORCReadAndWriteForDecimal128 (#14897) @res-life
Fix dask token normalization (#14829) @rjzamora
Fix 24.04 versions (#14825) @raydouglass
Ensure slow private attrs are maybe proxies (#14380) @mroeschke

📖 Documentation

Ignore DLManagedTensor in the docs build (#15392) @davidwendt
Revert "Temporarily disable docs errors. (#15265)" (#15269) @bdice
Temporarily disable docs errors. (#15265) @bdice
Update developer_guide.md with new guidance on quoted internal includes (#15238) @harrism
Fix broken link for developer guide (#15025) @sanjana098
[DOC] Update typo in docs example of structs_column_wrapper (#14949) @karthikeyann
Update cudf.pandas FAQ. (#14940) @bdice
Optimize doc builds (#14856) @vyasr
Add developer guideline to use east const. (#14836) @bdice
Document how cuDF is pronounced (#14753) @pentschev
Notes convert to Pandas-compat (#12641) @Touutae-lab

🚀 New Features

Address inconsistency in single quote normalization in JSON reader (#15324) @shrshi
Use JNI pinned pool resource with cuIO (#15255) @abellina
Add DELTA_BYTE_ARRAY encoder for Parquet (#15239) @etseidl
Migrate filling operations to pylibcudf (#15225) @brandon-b-miller
[JNI] rmm based pinned pool (#15219) @abellina
Implement zero-copy host buffer source instead of using an arrow implementation (#15189) @vuule
Enable creation of columns from scalar (#15181) @vyasr
Use NVTX from GitHub. (#15178) @bdice
Implement segmented_row_bit_count for computing row sizes by segments of rows (#15169) @ttnghia
Implement search using pylibcudf (#15166) @vyasr
Add distinct left join (#15149) @PointKernel
Add cardinality control for groupby benchs with flat types (#15134) @PointKernel
Add ability to request Parquet encodings on a per-column basis (#15081) @etseidl
Automate include grouping order in .clang-format (#15063) @harrism
Requesting a clean build directory also clears Jitify cache (#15052) @robertmaynard
API for JSON unquoted whitespace normalization (#15033) @shrshi
Implement concatenate, lists.explode, merge, sorting, and stream compaction in pylibcudf (#15011) @vyasr
Implement replace in pylibcudf (#15005) @vyasr
Add distinct key inner join (#14990) @PointKernel
Implement rolling in pylibcudf (#14982) @vyasr
Implement joins in pylibcudf (#14972) @vyasr
Implement scans and reductions in pylibcudf (#14970) @vyasr
Rewrite cudf internals using pylibcudf groupby (#14946) @vyasr
Implement groupby in pylibcudf (#14945) @vyasr
Support casting of Map type to string in JSON reader (#14936) @karthikeyann
POC for whitespace removal in input JSON data using FST (#14931) @shrshi
Support for LZ4 compression in ORC and Parquet (#14906) @vuule
Remove supports_streams from cuDF custom memory resources. (#14857) @harrism
Migrate unary operations to pylibcudf (#14850) @vyasr
Migrate binary operations to pylibcudf (#14821) @vyasr
Add row index and stripe size options to Python ORC chunked writer (#14785) @vuule
Support CUDA 12.2 (#14712) @jameslamb

🛠️ Improvements

Use conda env create --yes instead of --force (#15403) @bdice
Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
Change exceptions thrown by copying APIs (#15319) @vyasr
Enable branch testing for cudf.pandas (#15316) @galipremsagar
Replace black with ruff-format (#15312) @mroeschke
Fix an NPE when reading empty JSON data by adding a new API for missing information (#15307) @revans2
Address poor performance of Parquet string decoding (#15304) @etseidl
Update script input name (#15301) @AyodeAwe
Make test_read_parquet_partitioned_filtered data deterministic (#15296) @mroeschke
Add timeout for cudf.pandas pandas tests (#15284) @galipremsagar
Add upper bound to prevent usage of NumPy 2 (#15283) @bdice
Fix cudf::test::to_host return of host_vector (#15263) @davidwendt
Implement grouped product scan (#15254) @wence-
Add CUDA 12.4 to supported PTX versions (#15247) @brandon-b-miller
Implement DataFrame|Series.squeeze (#15244) @mroeschke
Roll back ipow changes due to register pressure. (#15242) @pmattione-nvidia
Remove create_chars_child_column utility (#15241) @davidwendt
Update dlpack to version 0.8 (#15237) @dantegd
Improve performance in JSON reader when mixed_types_as_string option is enabled (#15236) @shrshi
Remove row conversion code from libcudf (#15234) @ttnghia
Use variable substitution for RAPIDS version in Doxyfile (#15231) @KyleFromNVIDIA
Add ListColumns.to_pandas(arrow_type=) (#15228) @mroeschke
Treat dask-cudf CI artifacts as pure wheels (#15223) @bdice
Clean up usage of CUDA_ARCH and other macros. (#15218) @bdice
DOC: use constants in performance-comparisons.ipynb (#15215) @raybellwaves
Rewrite conversion in terms of column (#15213) @vyasr
Switch pytest-xdist algo to worksteal (#15207) @galipremsagar
Deprecate strings_column_view::offsets_begin() (#15205) @davidwendt
Add get_upstream_resource method to stream_checking_resource_adaptor (#15203) @miscco
Tune up row size estimation in the data generator (#15202) @vuule
Fix offset value for generating test data in parquet_chunked_reader_test.cu (#15200) @ttnghia
Change strings_column_view::char_size to return int64 (#15197) @davidwendt
Fix includes for row_operators.cuh (#15194) @davidwendt
Generalize GHA selectors for pure Python testing (#15191) @bdice
Improvements for __cuda_array_interface__ tests (#15188) @bdice
Allow to_pandas to return pandas.ArrowDtype (#15182) @mroeschke
Ignore byte_range in read_json when the size is not smaller than the input data (#15180) @vuule
Expose new stable_sort and finish stream_compaction in pylibcudf (#15175) @wence-
[ci] update matrix filters for dask-cudf builds (#15174) @jameslamb
Change make_strings_children to return uvector (#15171) @davidwendt
Don't override to_pandas for Datelike columns (#15167) @mroeschke
Drop python-snappy from dependencies. (#15161) @bdice
Add microkernels for fixed-width and fixed-width dictionary in Parquet decode (#15159) @abellina
Make HostColumnVector.DataType accessor methods public (#15157) @jbrennan333
Java bindings for left outer distinct join (#15154) @jlowe
Forward-merge branch-24.02 to branch-24.04 (#15153) @bdice
Enable pandas pytests for cudf.pandas (#15147) @galipremsagar
Add Java option to keep quotes for JSON reads (#15146) @revans2
Change cross-pandas-version testing in cudf (#15145) @galipremsagar
Use hostdevice_vector in kernel_error to avoid the pageable copy (#15140) @vuule
Clean up Columns.astype & cudf.dtype (#15125) @mroeschke
Simplify some to_pandas implementations (#15123) @mroeschke
Java: Add leak tracking for Scalar instances (#15121) @jlowe
Remove calls to strings_column_view::offsets_begin() (#15112) @davidwendt
Add support for Python 3.11, require NumPy 1.23+ (#15111) @jameslamb
Compile-time ipow computation with array lookup (#15110) @pmattione-nvidia
Upgrade to arrow-14.0.2 (#15108) @galipremsagar
Dynamically set version in RAPIDS doc builds (#15101) @jakirkham
Add support for pandas-2.2 in cudf (#15100) @galipremsagar
Update devcontainers to CUDA Toolkit 12.2 (#15099) @trxcllnt
Fix datetime binop pytest failures in pandas-2.2 (#15090) @galipremsagar
Validate types in pylibcudf Column/Table constructors (#15088) @wence-
xfail test_join_ordering_pandas_compat for pandas 2.2 (#15080) @mroeschke
Add general purpose host memory allocator reference to cuIO with a demo of pooled-pinned allocation. (#15079) @nvdbaranec
Adjust test_binops for pandas 2.2 (#15078) @mroeschke
Remove offsets_begin() call from nvtext::generate_ngrams (#15077) @davidwendt
Use offsetalator in cudf::detail::has_nonempty_null_rows (#15076) @davidwendt
Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
Fix cudf::test::to_host to handle both offset types for strings columns (#15073) @davidwendt
Add condition for test_groupby_nulls_basic in pandas 2.2 (#15072) @mroeschke
xfail tests in test_udf_masked_ops due to pandas 2.2 bug (#15071) @mroeschke
target branch-24.04 for GitHub Actions workflows (#15069) @jameslamb
Implement stable version of cudf::sort (#15066) @wence-
Fix ORC and JSON tests failures for pandas 2.2 (#15062) @mroeschke
Adjust test_joining for pandas 2.2 (#15060) @mroeschke
Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
Fix test_resample index dtype checking for pandas 2.2 (#15058) @mroeschke
Split out strings/replace.cu and rework its gtests (#15054) @davidwendt
Avoid incompatible value type setting in test_rolling for pandas 2.2 (#15050) @mroeschke
Change chained replace inplace test to COW test for pandas 2.2 (#15049) @mroeschke
Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
Avoid chained indexing in test_indexing for pandas 2.2 (#15045) @mroeschke
Avoid pandas 2.2 DeprecationWarning in test_hdf (#15044) @mroeschke
Use appropriate make_offsets_child_column for building lists columns (#15043) @davidwendt
Factor out position-offsets logic from strings split_helper utility (#15040) @davidwendt
Forward-merge branch-24.02 to branch-24.04 (#15039) @bdice
Clean up nvtx macros (#15038) @PointKernel
Add xfailures for test_applymap for pandas 2.2 (#15034) @mroeschke
Expose libcudf filter expression in read_parquet (#15028) @wence-
Adjust tests in test_dataframe.py for pandas 2.2 (#15023) @mroeschke
Adjust test_datetime_infer_format for pandas 2.2 (#15021) @mroeschke
Performance optimizations for parquet sub-rowgroup reader. (#15020) @nvdbaranec
JNI bindings for distinct_hash_join (#15019) @jlowe
Change copy_if_safe to call thrust instead of the overload function (#15018) @davidwendt
Improve performance of copy_if_else for long strings (#15017) @davidwendt
Fix is_string_dtype test for pandas 2.2 (#15012) @mroeschke
Rework cudf::strings::detail::copy_range for offsetalator (#15010) @davidwendt
Use offsetalator in cudf::get_json_object() (#15009) @davidwendt
Align integral types in ORC to specs (#15008) @vuule
Clean up detail sequence header inclusion (#15007) @PointKernel
Add groupby.apply(include_groups=) to match pandas 2.2 deprecation (#15006) @mroeschke
Use offsetalator in cudf::interleave_columns() (#15004) @davidwendt
Use offsetalator in cudf::row_bit_count() (#15003) @davidwendt
Use offsetalator in cudf::strings::wrap() (#15002) @davidwendt
Use offsetalator in cudf::strings::reverse (#15001) @davidwendt
Deprecate groupby fillna (#15000) @mroeschke
Ensure to_* IO methods respect pandas 2.2 keyword only deprecation (#14999) @mroeschke
Remove unneeded calls to create_chars_child_column utility (#14997) @davidwendt
Add environment-agnostic scripts for running ctests and pytests (#14992) @trxcllnt
Filter all DeprecationWarning's by ArrowTable.to_pandas() (#14989) @galipremsagar
Deprecate replace with categorical columns (#14988) @mroeschke
Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
Ensure that ctest is called with --no-tests=error. (#14983) @bdice
Deprecate non-integer periods in date_range and interval_range (#14976) @galipremsagar
Update ops-bot.yaml (#14974) @AyodeAwe
Use page statistics in Parquet reader (#14973) @etseidl
Use fused types for overloaded function signatures (#14969) @vyasr
Deprecate certain frequency strings (#14967) @galipremsagar
Update copyrights for 24.04. (#14964) @bdice
Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
Introduce GetJsonObjectOptions in getJSONObject Java API (#14956) @SurajAralihalli
JNI JSON read with DataSource and inferred schema, along with basic Java nested Schema JSON reads (#14954) @revans2
Make codecov only informational (always pass). (#14952) @bdice
Replace legacy cudf and dask_cudf imports as (d)gd (#14944) @mroeschke
Replace _is_datetime64tz/interval_dtype with isinstance (#14943) @mroeschke
Update tests for pandas 2. (#14941) @bdice
Use more public pandas APIs (#14929) @mroeschke
Replace local copyright check with pre-commit-hooks verify-copyright (#14917) @KyleFromNVIDIA
Add pandas-2.x support in cudf (#14916) @galipremsagar
Use offsetalator in nvtext::byte_pair_encoding (#14888) @davidwendt
De-DOS line-endings (#14880) @wence-
Add detail cuco_allocator (#14877) @PointKernel
Move all core types to using enum class in Cython (#14876) @vyasr
Read cudf.__version__ in Sphinx build (#14872) @KyleFromNVIDIA
Use int64 offset types for accessing code-points in nvtext::normalize (#14868) @davidwendt
Read version from VERSION file in CMake (#14867) @KyleFromNVIDIA
Update conda-cpp-post-build-checks to branch-24.04. (#14854) @bdice
Update cudf for compatibility with the latest cuco (#14849) @PointKernel
Remove deprecated strings functions (#14848) @davidwendt
Fix CI workflows for pandas-tests and add test summary. (#14847) @bdice
Use offsetalator in cudf::strings::copy_slice (#14844) @davidwendt
Fix V2 Parquet page alignment for use with zStandard compression (#14841) @etseidl
Fix calls to deprecated strings factory API in examples. (#14838) @bdice
Update pre-commit hooks (#14837) @bdice
Use rapids_cuda_set_runtime to determine cuda runtime usage by target (#14833) @vyasr
Remove get_mem_info functions from custom memory resources (#14832) @harrism
Fix debug build by splitting row_operator_tests_utilities.cu (#14826) @davidwendt
Remove -DNVBench_ENABLE_CUPTI=OFF. (#14820) @bdice
Use cuco::static_set in the hash-based groupby (#14813) @PointKernel
Branch 24.04 merge branch 24.02 (#14809) @vyasr
Branch 24.04 merge branch 24.02 (#14806) @vyasr
Introduce basic "cudf" backend for Dask Expressions (#14805) @rjzamora
Remove build_struct|list_column (#14786) @mroeschke
Use offsetalator in nvtext tokenize functions (#14783) @davidwendt
Reduce execution time of Python ORC tests (#14776) @vuule
Use offsetalator in cudf::strings::split functions (#14757) @davidwendt
Use offsetalator in cudf::strings::findall (#14745) @davidwendt
Use offsetalator in cudf::strings::url_decode (#14744) @davidwendt
Use get_offset_value utility in strings shift function (#14743) @davidwendt
Use as_column instead of full (#14698) @mroeschke
List all notable breaking changes (#13535) @galipremsagar

cudf - v24.02.02

Published by raydouglass 8 months ago

🚨 Breaking Changes

Remove **kwargs from astype (#14765) @mroeschke
Remove mimesis as a testing dependency (#14723) @mroeschke
Update to Dask's shuffle_method kwarg (#14708) @pentschev
Drop Pascal GPU support. (#14630) @bdice
Update to CCCL 2.2.0. (#14576) @bdice
Expunge as_frame conversions in Column algorithms (#14491) @wence-
Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
Include writer code and writerVersion in ORC files (#14458) @vuule
Remove null mask for zero nulls in json readers (#14451) @karthikeyann
REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
Move chars column to parent data buffer in strings column (#14202) @karthikeyann
Switch to scikit-build-core (#13531) @vyasr

🐛 Bug Fixes

Bump to nvcomp 3.0.6. (#15128) @bdice
[HOTFIX] Unpin numba<0.58 (#15031) @raydouglass
Exclude tests from builds (#14981) @vyasr
Fix the bounce buffer size in ORC writer (#14947) @vuule
Revert sum/product aggregation to always produce int64_t type (#14907) @SurajAralihalli
Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
Fix index difference to follow the pandas format (#14789) @amiralimi
Fix shared-workflows repo name (#14784) @raydouglass
Remove unparseable attributes from all nodes (#14780) @vyasr
Refactor and add validation to IntervalIndex.init (#14778) @mroeschke
Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
Fix calls to deprecated strings factory API (#14771) @davidwendt
Fix ptx file discovery in editable installs (#14767) @vyasr
Revise shuffle deprecation to align with dask/dask (#14762) @rjzamora
Enable intermediate proxies to be picklable (#14752) @shwina
Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
Fix CMake args (#14746) @vyasr
Fix logic bug introduced in #14730 (#14742) @wence-
[Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
Fix Groupby.get_group (#14728) @rjzamora
Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
Split cuda versions for notebook testing (#14722) @raydouglass
Fix to_numeric not preserving Series index and name (#14718) @mroeschke
Update dask-cudf wheel name (#14713) @raydouglass
Fix strings::contains matching end of string target (#14711) @davidwendt
Update to Dask's shuffle_method kwarg (#14708) @pentschev
Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
Potential fix for performance regression in #14415 (#14706) @etseidl
Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
Skip numba test that fails on ARM (#14702) @brandon-b-miller
Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
Add BaseOffset as a final proxy type to pass instancechecks for offsets against BaseOffset (#14678) @shwina
Add row conversion code from spark-rapids-jni (#14664) @ttnghia
Unconditionally export the CCCL path (#14656) @vyasr
Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
Fix invalid memory access in Parquet reader (#14637) @etseidl
Use column_empty over as_column([]) (#14632) @mroeschke
Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
Remove cuda::proclaim_return_type from nested lambda (#14607) @ttnghia
Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
Address potential race conditions in Parquet reader (#14602) @etseidl
Fix DataFrame.reindex removing column name (#14601) @mroeschke
Remove unsanitized input test data from copy gtests (#14600) @davidwendt
Fix race detected in Parquet writer (#14598) @etseidl
Correct invalid or missing return types (#14587) @robertmaynard
Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
Fix unsanitized nulls produced by cudf::clamp APIs (#14580) @davidwendt
Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
Fixes a symbol group lookup table issue (#14561) @elstehle
Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
Improve memory footprint of isin by using contains (#14478) @wence-
Move creation of env.yaml outside the current directory (#14476) @davidwendt
Enable pd.Timestamp objects to be picklable when cudf.pandas is active (#14474) @shwina
Correct dtype of count aggregations on empty dataframes (#14473) @wence-
Avoid DataFrame conversion in MultiIndex.from_pandas (#14470) @mroeschke
JSON writer: avoid default stream use in string_scalar constructors (#14444) @vuule
Fix default stream use in the CSV reader (#14443) @vuule
Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
Defer PTX file load to runtime (#13690) @brandon-b-miller

📖 Documentation

Disable parallel build (#14796) @vyasr
Add pylibcudf to the docs (#14791) @vyasr
Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
More doxygen fixes (#14639) @vyasr
Enable doxygen XML generation and fix issues (#14477) @vyasr
Some doxygen improvements (#14469) @vyasr
Remove warning in dask-cudf docs (#14454) @wence-
Update README links with redirects. (#14378) @bdice
Add pip install instructions to README (#13677) @shwina

🚀 New Features

Add ci check for external kernels (#14768) @robertmaynard
JSON single quote normalization API (#14729) @shrshi
Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
Don't constrain numba<0.58 (#14616) @brandon-b-miller
Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
JSON quote normalization (#14545) @shrshi
Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
Implement more copying APIs in pylibcudf (#14508) @vyasr
Include writer code and writerVersion in ORC files (#14458) @vuule
Parquet sub-rowgroup reading. (#14360) @nvdbaranec
Move chars column to parent data buffer in strings column (#14202) @karthikeyann
PARQUET-2261 Size Statistics (#14000) @etseidl
Improve GroupBy JIT error handling (#13854) @brandon-b-miller
Generate unified Python/C++ docs (#13846) @vyasr
Expand JIT groupby test suite (#13813) @brandon-b-miller

🛠️ Improvements

Pin pytest<8 (#14920) @galipremsagar
Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
Clean up TimedeltaIndex.__init__ constructor (#14775) @mroeschke
Clean up DatetimeIndex.__init__ constructor (#14774) @mroeschke
Some frame.py typing, move seldom used methods in frame.py (#14766) @mroeschke
Remove **kwargs from astype (#14765) @mroeschke
fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
Add pynvjitlink as a dependency (#14763) @brandon-b-miller
Resolve degenerate performance in create_structs_data (#14761) @SurajAralihalli
Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
Pin pytest-cases<3.8.2 (#14756) @mroeschke
Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
Consolidate cudf object handling in as_column (#14754) @mroeschke
Reduce execution time of Parquet C++ tests (#14750) @vuule
Implement to_datetime(..., utc=True) (#14749) @mroeschke
Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
Implement cudf.MultiIndex.from_arrays (#14740) @mroeschke
Remove unused/single use methods (#14739) @mroeschke
refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
Remove unneeded methods in Column (#14730) @mroeschke
Clean up base column methods (#14725) @mroeschke
Ensure column.fillna signatures are consistent (#14724) @mroeschke
Remove mimesis as a testing dependency (#14723) @mroeschke
Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
Use offsetalator in gather_chars (#14700) @davidwendt
Use make_strings_children for fill() specialization logic (#14697) @davidwendt
Change io::detail::orc namespace into io::orc::detail (#14696) @ttnghia
Fix call to deprecated factory function (#14695) @davidwendt
Use as_column instead of arange for range like inputs (#14689) @mroeschke
Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
Split parquet test into multiple files (#14663) @etseidl
Custom error messages for IO with nonexistent files (#14662) @vuule
Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
Basic validation in reader benchmarks (#14647) @vuule
Update dependencies.yaml to support CUDA 12.*. (#14644) @bdice
Consolidate memoryview handling in as_column (#14643) @mroeschke
Convert FieldType to scoped enum (#14642) @vuule
Use isinstance over is_foo_dtype (#14641) @mroeschke
Use isinstance over is_foo_dtype internally (#14638) @mroeschke
Remove unnecessary **kwargs in function signatures (#14635) @mroeschke
Drop nvbench patch for nvml. (#14631) @bdice
Drop Pascal GPU support. (#14630) @bdice
Add cpp/doxygen/xml to .gitignore (#14613) @davidwendt
Create strings-specific make_offsets_child_column for multiple offset types (#14612) @davidwendt
Use the offsetalator in cudf::concatenate for strings (#14611) @davidwendt
Make Parquet ColumnIndex null_counts optional (#14596) @etseidl
Support freq in DatetimeIndex (#14593) @shwina
Remove legacy benchmarks for cuDF-python (#14591) @osidekyle
Remove WORKSPACE env var from cudf_test temp_directory class (#14588) @davidwendt
Use exceptions instead of return values to handle errors in CompactProtocolReader (#14582) @vuule
Use cuda::proclaim_return_type on device lambdas. (#14577) @bdice
Update to CCCL 2.2.0. (#14576) @bdice
Update dependencies.yaml to new pip index (#14575) @vyasr
Simplify Python CMake (#14565) @vyasr
Java expose parquet pass_read_limit (#14564) @revans2
Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_* macros (#14559) @SurajAralihalli
Use cudf_test temp_directory class for nvtext::subword_tokenize gbenchmark (#14558) @davidwendt
Fix return type of prefix increment overloads (#14544) @vuule
Make bpe_merge_pairs_impl member private (#14543) @davidwendt
Small clean up in io::statistics (#14542) @vuule
Change json gtest environment variable to compile-time definition (#14541) @davidwendt
Remove extra total chars size calculation from cudf::concatenate (#14540) @davidwendt
Refactor IndexedFrame.hash_values to use cudf::hashing functions, add xxhash64 to cudf Python. (#14538) @bdice
Move non-templated inline function definitions from table_view.hpp to table_view.cpp (#14535) @davidwendt
Add JNI for strings::code_points (#14533) @thirtiseven
Add a test for issue 12773 (#14529) @vyasr
Split libarrow build dependencies. (#14506) @bdice
Implement IndexedFrame.duplicated with distinct_indices + scatter (#14493) @wence-
Expunge as_frame conversions in Column algorithms (#14491) @wence-
Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
Refactor Parquet kernel_error (#14464) @etseidl
Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
Expose stream parameter in public nvtext APIs (#14456) @davidwendt
Include encode type in the error message when unsupported Parquet encoding is detected (#14453) @ZelboK
Remove null mask for zero nulls in json readers (#14451) @karthikeyann
Refactor cudf.Series.init (#14450) @mroeschke
Remove the use of volatile in Parquet (#14448) @vuule
REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
Testing stream pool implementation (#14437) @shrshi
Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
Use isinstance(..., cudf.IntervalDtype) instead of is_interval_dtype (#14424) @mroeschke
Use isinstance(..., cudf.CategoricalDtype) instead of is_categorical_dtype (#14423) @mroeschke
Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
REF: Remove instances of pd.core (#14421) @mroeschke
Expose streams in public filling APIs for label_bins (#14401) @ZelboK
Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded (#14392) @etseidl
Add SHA-1 and SHA-2 hash functions. (#14391) @bdice
Expose streams in Parquet reader and writer APIs (#14359) @shrshi
Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
Replace default stream for scalars and column factories usages (because of defaulted arguments) (#14354) @karthikeyann
Expose streams in ORC reader and writer APIs (#14350) @shrshi
Convert compression and io to string axis type in IO benchmarks (#14347) @SurajAralihalli
Add cuDF devcontainers (#14015) @trxcllnt
Refactoring of Buffers (last step towards unifying COW and Spilling) (#13801) @madsbk
Switch to scikit-build-core (#13531) @vyasr
Simplify null count checking in column equality comparator (#13312) @vyasr

cudf - v24.02.01

Published by raydouglass 8 months ago

🚨 Breaking Changes

Remove **kwargs from astype (#14765) @mroeschke
Remove mimesis as a testing dependency (#14723) @mroeschke
Update to Dask's shuffle_method kwarg (#14708) @pentschev
Drop Pascal GPU support. (#14630) @bdice
Update to CCCL 2.2.0. (#14576) @bdice
Expunge as_frame conversions in Column algorithms (#14491) @wence-
Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
Include writer code and writerVersion in ORC files (#14458) @vuule
Remove null mask for zero nulls in json readers (#14451) @karthikeyann
REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
Move chars column to parent data buffer in strings column (#14202) @karthikeyann
Switch to scikit-build-core (#13531) @vyasr

🐛 Bug Fixes

[HOTFIX] Unpin numba<0.58 (#15031) @raydouglass
Exclude tests from builds (#14981) @vyasr
Fix the bounce buffer size in ORC writer (#14947) @vuule
Revert sum/product aggregation to always produce int64_t type (#14907) @SurajAralihalli
Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
Fix index difference to follow the pandas format (#14789) @amiralimi
Fix shared-workflows repo name (#14784) @raydouglass
Remove unparseable attributes from all nodes (#14780) @vyasr
Refactor and add validation to IntervalIndex.__init__ (#14778) @mroeschke
Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
Fix calls to deprecated strings factory API (#14771) @davidwendt
Fix ptx file discovery in editable installs (#14767) @vyasr
Revise shuffle deprecation to align with dask/dask (#14762) @rjzamora
Enable intermediate proxies to be picklable (#14752) @shwina
Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
Fix CMake args (#14746) @vyasr
Fix logic bug introduced in #14730 (#14742) @wence-
[Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
Fix Groupby.get_group (#14728) @rjzamora
Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
Split cuda versions for notebook testing (#14722) @raydouglass
Fix to_numeric not preserving Series index and name (#14718) @mroeschke
Update dask-cudf wheel name (#14713) @raydouglass
Fix strings::contains matching end of string target (#14711) @davidwendt
Update to Dask's shuffle_method kwarg (#14708) @pentschev
Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
Potential fix for performance regression in #14415 (#14706) @etseidl
Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
Skip numba test that fails on ARM (#14702) @brandon-b-miller
Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
Add BaseOffset as a final proxy type to pass instancechecks for offsets against BaseOffset (#14678) @shwina
Add row conversion code from spark-rapids-jni (#14664) @ttnghia
Unconditionally export the CCCL path (#14656) @vyasr
Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
Fix invalid memory access in Parquet reader (#14637) @etseidl
Use column_empty over as_column([]) (#14632) @mroeschke
Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
Remove cuda::proclaim_return_type from nested lambda (#14607) @ttnghia
Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
Address potential race conditions in Parquet reader (#14602) @etseidl
Fix DataFrame.reindex removing column name (#14601) @mroeschke
Remove unsanitized input test data from copy gtests (#14600) @davidwendt
Fix race detected in Parquet writer (#14598) @etseidl
Correct invalid or missing return types (#14587) @robertmaynard
Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
Fix unsanitized nulls produced by cudf::clamp APIs (#14580) @davidwendt
Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
Fixes a symbol group lookup table issue (#14561) @elstehle
Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
Improve memory footprint of isin by using contains (#14478) @wence-
Move creation of env.yaml outside the current directory (#14476) @davidwendt
Enable pd.Timestamp objects to be picklable when cudf.pandas is active (#14474) @shwina
Correct dtype of count aggregations on empty dataframes (#14473) @wence-
Avoid DataFrame conversion in MultiIndex.from_pandas (#14470) @mroeschke
JSON writer: avoid default stream use in string_scalar constructors (#14444) @vuule
Fix default stream use in the CSV reader (#14443) @vuule
Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
Defer PTX file load to runtime (#13690) @brandon-b-miller

📖 Documentation

Disable parallel build (#14796) @vyasr
Add pylibcudf to the docs (#14791) @vyasr
Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
More doxygen fixes (#14639) @vyasr
Enable doxygen XML generation and fix issues (#14477) @vyasr
Some doxygen improvements (#14469) @vyasr
Remove warning in dask-cudf docs (#14454) @wence-
Update README links with redirects. (#14378) @bdice
Add pip install instructions to README (#13677) @shwina

🚀 New Features

Add ci check for external kernels (#14768) @robertmaynard
JSON single quote normalization API (#14729) @shrshi
Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
Don't constrain numba<0.58 (#14616) @brandon-b-miller
Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
JSON quote normalization (#14545) @shrshi
Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
Implement more copying APIs in pylibcudf (#14508) @vyasr
Include writer code and writerVersion in ORC files (#14458) @vuule
Parquet sub-rowgroup reading. (#14360) @nvdbaranec
Move chars column to parent data buffer in strings column (#14202) @karthikeyann
PARQUET-2261 Size Statistics (#14000) @etseidl
Improve GroupBy JIT error handling (#13854) @brandon-b-miller
Generate unified Python/C++ docs (#13846) @vyasr
Expand JIT groupby test suite (#13813) @brandon-b-miller

🛠️ Improvements

Pin pytest<8 (#14920) @galipremsagar
Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
Clean up TimedeltaIndex.__init__ constructor (#14775) @mroeschke
Clean up DatetimeIndex.__init__ constructor (#14774) @mroeschke
Some frame.py typing, move seldom used methods in frame.py (#14766) @mroeschke
Remove **kwargs from astype (#14765) @mroeschke
fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
Add pynvjitlink as a dependency (#14763) @brandon-b-miller
Resolve degenerate performance in create_structs_data (#14761) @SurajAralihalli
Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
Pin pytest-cases<3.8.2 (#14756) @mroeschke
Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
Consolidate cudf object handling in as_column (#14754) @mroeschke
Reduce execution time of Parquet C++ tests (#14750) @vuule
Implement to_datetime(..., utc=True) (#14749) @mroeschke
Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
Implement cudf.MultiIndex.from_arrays (#14740) @mroeschke
Remove unused/single use methods (#14739) @mroeschke
refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
Remove unneeded methods in Column (#14730) @mroeschke
Clean up base column methods (#14725) @mroeschke
Ensure column.fillna signatures are consistent (#14724) @mroeschke
Remove mimesis as a testing dependency (#14723) @mroeschke
Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
Use offsetalator in gather_chars (#14700) @davidwendt
Use make_strings_children for fill() specialization logic (#14697) @davidwendt
Change io::detail::orc namespace into io::orc::detail (#14696) @ttnghia
Fix call to deprecated factory function (#14695) @davidwendt
Use as_column instead of arange for range like inputs (#14689) @mroeschke
Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
Split parquet test into multiple files (#14663) @etseidl
Custom error messages for IO with nonexistent files (#14662) @vuule
Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
Basic validation in reader benchmarks (#14647) @vuule
Update dependencies.yaml to support CUDA 12.*. (#14644) @bdice
Consolidate memoryview handling in as_column (#14643) @mroeschke
Convert FieldType to scoped enum (#14642) @vuule
Use instance over is_foo_dtype (#14641) @mroeschke
Use isinstance over is_foo_dtype internally (#14638) @mroeschke
Remove unnecessary **kwargs in function signatures (#14635) @mroeschke
Drop nvbench patch for nvml. (#14631) @bdice
Drop Pascal GPU support. (#14630) @bdice
Add cpp/doxygen/xml to .gitignore (#14613) @davidwendt
Create strings-specific make_offsets_child_column for multiple offset types (#14612) @davidwendt
Use the offsetalator in cudf::concatenate for strings (#14611) @davidwendt
Make Parquet ColumnIndex null_counts optional (#14596) @etseidl
Support freq in DatetimeIndex (#14593) @shwina
Remove legacy benchmarks for cuDF-python (#14591) @osidekyle
Remove WORKSPACE env var from cudf_test temp_directory class (#14588) @davidwendt
Use exceptions instead of return values to handle errors in CompactProtocolReader (#14582) @vuule
Use cuda::proclaim_return_type on device lambdas. (#14577) @bdice
Update to CCCL 2.2.0. (#14576) @bdice
Update dependencies.yaml to new pip index (#14575) @vyasr
Simplify Python CMake (#14565) @vyasr
Java expose parquet pass_read_limit (#14564) @revans2
Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_* macros (#14559) @SurajAralihalli
Use cudf_test temp_directory class for nvtext::subword_tokenize gbenchmark (#14558) @davidwendt
Fix return type of prefix increment overloads (#14544) @vuule
Make bpe_merge_pairs_impl member private (#14543) @davidwendt
Small clean up in io::statistics (#14542) @vuule
Change json gtest environment variable to compile-time definition (#14541) @davidwendt
Remove extra total chars size calculation from cudf::concatenate (#14540) @davidwendt
Refactor IndexedFrame.hash_values to use cudf::hashing functions, add xxhash64 to cudf Python. (#14538) @bdice
Move non-templated inline function definitions from table_view.hpp to table_view.cpp (#14535) @davidwendt
Add JNI for strings::code_points (#14533) @thirtiseven
Add a test for issue 12773 (#14529) @vyasr
Split libarrow build dependencies. (#14506) @bdice
Implement IndexedFrame.duplicated with distinct_indices + scatter (#14493) @wence-
Expunge as_frame conversions in Column algorithms (#14491) @wence-
Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
Refactor Parquet kernel_error (#14464) @etseidl
Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
Expose stream parameter in public nvtext APIs (#14456) @davidwendt
Include encode type in the error message when unsupported Parquet encoding is detected (#14453) @ZelboK
Remove null mask for zero nulls in json readers (#14451) @karthikeyann
Refactor cudf.Series.__init__ (#14450) @mroeschke
Remove the use of volatile in Parquet (#14448) @vuule
REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
Testing stream pool implementation (#14437) @shrshi
Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
Use isinstance(..., cudf.IntervalDtype) instead of is_interval_dtype (#14424) @mroeschke
Use isinstance(..., cudf.CategoricalDtype) instead of is_categorical_dtype (#14423) @mroeschke
Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
REF: Remove instances of pd.core (#14421) @mroeschke
Expose streams in public filling APIs for label_bins (#14401) @ZelboK
Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded (#14392) @etseidl
Add SHA-1 and SHA-2 hash functions. (#14391) @bdice
Expose streams in Parquet reader and writer APIs (#14359) @shrshi
Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
Replace default stream for scalars and column factories usages (because of defaulted arguments) (#14354) @karthikeyann
Expose streams in ORC reader and writer APIs (#14350) @shrshi
Convert compression and io to string axis type in IO benchmarks (#14347) @SurajAralihalli
Add cuDF devcontainers (#14015) @trxcllnt
Refactoring of Buffers (last step towards unifying COW and Spilling) (#13801) @madsbk
Switch to scikit-build-core (#13531) @vyasr
Simplify null count checking in column equality comparator (#13312) @vyasr

cudf - v24.02.00

Published by raydouglass 8 months ago

🚨 Breaking Changes

Remove **kwargs from astype (#14765) @mroeschke
Remove mimesis as a testing dependency (#14723) @mroeschke
Update to Dask's shuffle_method kwarg (#14708) @pentschev
Drop Pascal GPU support. (#14630) @bdice
Update to CCCL 2.2.0. (#14576) @bdice
Expunge as_frame conversions in Column algorithms (#14491) @wence-
Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
Include writer code and writerVersion in ORC files (#14458) @vuule
Remove null mask for zero nulls in json readers (#14451) @karthikeyann
REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
Move chars column to parent data buffer in strings column (#14202) @karthikeyann
Switch to scikit-build-core (#13531) @vyasr

🐛 Bug Fixes

Exclude tests from builds (#14981) @vyasr
Fix the bounce buffer size in ORC writer (#14947) @vuule
Revert sum/product aggregation to always produce int64_t type (#14907) @SurajAralihalli
Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
Fix index difference to follow the pandas format (#14789) @amiralimi
Fix shared-workflows repo name (#14784) @raydouglass
Remove unparseable attributes from all nodes (#14780) @vyasr
Refactor and add validation to IntervalIndex.__init__ (#14778) @mroeschke
Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
Fix calls to deprecated strings factory API (#14771) @davidwendt
Fix ptx file discovery in editable installs (#14767) @vyasr
Revise shuffle deprecation to align with dask/dask (#14762) @rjzamora
Enable intermediate proxies to be picklable (#14752) @shwina
Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
Fix CMake args (#14746) @vyasr
Fix logic bug introduced in #14730 (#14742) @wence-
[Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
Fix Groupby.get_group (#14728) @rjzamora
Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
Split cuda versions for notebook testing (#14722) @raydouglass
Fix to_numeric not preserving Series index and name (#14718) @mroeschke
Update dask-cudf wheel name (#14713) @raydouglass
Fix strings::contains matching end of string target (#14711) @davidwendt
Update to Dask's shuffle_method kwarg (#14708) @pentschev
Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
Potential fix for performance regression in #14415 (#14706) @etseidl
Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
Skip numba test that fails on ARM (#14702) @brandon-b-miller
Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
Add BaseOffset as a final proxy type to pass instancechecks for offsets against BaseOffset (#14678) @shwina
Add row conversion code from spark-rapids-jni (#14664) @ttnghia
Unconditionally export the CCCL path (#14656) @vyasr
Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
Fix invalid memory access in Parquet reader (#14637) @etseidl
Use column_empty over as_column([]) (#14632) @mroeschke
Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
Remove cuda::proclaim_return_type from nested lambda (#14607) @ttnghia
Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
Address potential race conditions in Parquet reader (#14602) @etseidl
Fix DataFrame.reindex removing column name (#14601) @mroeschke
Remove unsanitized input test data from copy gtests (#14600) @davidwendt
Fix race detected in Parquet writer (#14598) @etseidl
Correct invalid or missing return types (#14587) @robertmaynard
Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
Fix unsanitized nulls produced by cudf::clamp APIs (#14580) @davidwendt
Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
Fixes a symbol group lookup table issue (#14561) @elstehle
Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
Improve memory footprint of isin by using contains (#14478) @wence-
Move creation of env.yaml outside the current directory (#14476) @davidwendt
Enable pd.Timestamp objects to be picklable when cudf.pandas is active (#14474) @shwina
Correct dtype of count aggregations on empty dataframes (#14473) @wence-
Avoid DataFrame conversion in MultiIndex.from_pandas (#14470) @mroeschke
JSON writer: avoid default stream use in string_scalar constructors (#14444) @vuule
Fix default stream use in the CSV reader (#14443) @vuule
Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
Defer PTX file load to runtime (#13690) @brandon-b-miller

📖 Documentation

Disable parallel build (#14796) @vyasr
Add pylibcudf to the docs (#14791) @vyasr
Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
More doxygen fixes (#14639) @vyasr
Enable doxygen XML generation and fix issues (#14477) @vyasr
Some doxygen improvements (#14469) @vyasr
Remove warning in dask-cudf docs (#14454) @wence-
Update README links with redirects. (#14378) @bdice
Add pip install instructions to README (#13677) @shwina

🚀 New Features

Add ci check for external kernels (#14768) @robertmaynard
JSON single quote normalization API (#14729) @shrshi
Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
Don't constrain numba<0.58 (#14616) @brandon-b-miller
Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
JSON quote normalization (#14545) @shrshi
Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
Implement more copying APIs in pylibcudf (#14508) @vyasr
Include writer code and writerVersion in ORC files (#14458) @vuule
Parquet sub-rowgroup reading. (#14360) @nvdbaranec
Move chars column to parent data buffer in strings column (#14202) @karthikeyann
PARQUET-2261 Size Statistics (#14000) @etseidl
Improve GroupBy JIT error handling (#13854) @brandon-b-miller
Generate unified Python/C++ docs (#13846) @vyasr
Expand JIT groupby test suite (#13813) @brandon-b-miller

🛠️ Improvements

Pin pytest<8 (#14920) @galipremsagar
Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
Clean up TimedeltaIndex.__init__ constructor (#14775) @mroeschke
Clean up DatetimeIndex.__init__ constructor (#14774) @mroeschke
Some frame.py typing, move seldom used methods in frame.py (#14766) @mroeschke
Remove **kwargs from astype (#14765) @mroeschke
fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
Add pynvjitlink as a dependency (#14763) @brandon-b-miller
Resolve degenerate performance in create_structs_data (#14761) @SurajAralihalli
Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
Pin pytest-cases<3.8.2 (#14756) @mroeschke
Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
Consolidate cudf object handling in as_column (#14754) @mroeschke
Reduce execution time of Parquet C++ tests (#14750) @vuule
Implement to_datetime(..., utc=True) (#14749) @mroeschke
Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
Implement cudf.MultiIndex.from_arrays (#14740) @mroeschke
Remove unused/single use methods (#14739) @mroeschke
refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
Remove unneeded methods in Column (#14730) @mroeschke
Clean up base column methods (#14725) @mroeschke
Ensure column.fillna signatures are consistent (#14724) @mroeschke
Remove mimesis as a testing dependency (#14723) @mroeschke
Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
Use offsetalator in gather_chars (#14700) @davidwendt
Use make_strings_children for fill() specialization logic (#14697) @davidwendt
Change io::detail::orc namespace into io::orc::detail (#14696) @ttnghia
Fix call to deprecated factory function (#14695) @davidwendt
Use as_column instead of arange for range like inputs (#14689) @mroeschke
Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
Split parquet test into multiple files (#14663) @etseidl
Custom error messages for IO with nonexistent files (#14662) @vuule
Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
Basic validation in reader benchmarks (#14647) @vuule
Update dependencies.yaml to support CUDA 12.*. (#14644) @bdice
Consolidate memoryview handling in as_column (#14643) @mroeschke
Convert FieldType to scoped enum (#14642) @vuule
Use instance over is_foo_dtype (#14641) @mroeschke
Use isinstance over is_foo_dtype internally (#14638) @mroeschke
Remove unnecessary **kwargs in function signatures (#14635) @mroeschke
Drop nvbench patch for nvml. (#14631) @bdice
Drop Pascal GPU support. (#14630) @bdice
Add cpp/doxygen/xml to .gitignore (#14613) @davidwendt
Create strings-specific make_offsets_child_column for multiple offset types (#14612) @davidwendt
Use the offsetalator in cudf::concatenate for strings (#14611) @davidwendt
Make Parquet ColumnIndex null_counts optional (#14596) @etseidl
Support freq in DatetimeIndex (#14593) @shwina
Remove legacy benchmarks for cuDF-python (#14591) @osidekyle
Remove WORKSPACE env var from cudf_test temp_directory class (#14588) @davidwendt
Use exceptions instead of return values to handle errors in CompactProtocolReader (#14582) @vuule
Use cuda::proclaim_return_type on device lambdas. (#14577) @bdice
Update to CCCL 2.2.0. (#14576) @bdice
Update dependencies.yaml to new pip index (#14575) @vyasr
Simplify Python CMake (#14565) @vyasr
Java expose parquet pass_read_limit (#14564) @revans2
Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_* macros (#14559) @SurajAralihalli
Use cudf_test temp_directory class for nvtext::subword_tokenize gbenchmark (#14558) @davidwendt
Fix return type of prefix increment overloads (#14544) @vuule
Make bpe_merge_pairs_impl member private (#14543) @davidwendt
Small clean up in io::statistics (#14542) @vuule
Change json gtest environment variable to compile-time definition (#14541) @davidwendt
Remove extra total chars size calculation from cudf::concatenate (#14540) @davidwendt
Refactor IndexedFrame.hash_values to use cudf::hashing functions, add xxhash64 to cudf Python. (#14538) @bdice
Move non-templated inline function definitions from table_view.hpp to table_view.cpp (#14535) @davidwendt
Add JNI for strings::code_points (#14533) @thirtiseven
Add a test for issue 12773 (#14529) @vyasr
Split libarrow build dependencies. (#14506) @bdice
Implement IndexedFrame.duplicated with distinct_indices + scatter (#14493) @wence-
Expunge as_frame conversions in Column algorithms (#14491) @wence-
Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
Refactor Parquet kernel_error (#14464) @etseidl
Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
Expose stream parameter in public nvtext APIs (#14456) @davidwendt
Include encode type in the error message when unsupported Parquet encoding is detected (#14453) @ZelboK
Remove null mask for zero nulls in json readers (#14451) @karthikeyann
Refactor cudf.Series.__init__ (#14450) @mroeschke
Remove the use of volatile in Parquet (#14448) @vuule
REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
Testing stream pool implementation (#14437) @shrshi
Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
Use isinstance(..., cudf.IntervalDtype) instead of is_interval_dtype (#14424) @mroeschke
Use isinstance(..., cudf.CategoricalDtype) instead of is_categorical_dtype (#14423) @mroeschke
Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
REF: Remove instances of pd.core (#14421) @mroeschke
Expose streams in public filling APIs for label_bins (#14401) @ZelboK
Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded (#14392) @etseidl
Add SHA-1 and SHA-2 hash functions. (#14391) @bdice
Expose streams in Parquet reader and writer APIs (#14359) @shrshi
Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
Replace default stream for scalars and column factories usages (because of defaulted arguments) (#14354) @karthikeyann
Expose streams in ORC reader and writer APIs (#14350) @shrshi
Convert compression and io to string axis type in IO benchmarks (#14347) @SurajAralihalli
Add cuDF devcontainers (#14015) @trxcllnt
Refactoring of Buffers (last step towards unifying COW and Spilling) (#13801) @madsbk
Switch to scikit-build-core (#13531) @vyasr
Simplify null count checking in column equality comparator (#13312) @vyasr

cudf - v23.12.01

Published by raydouglass 11 months ago

🚨 Breaking Changes

Raise error in reindex when index is not unique (#14400) @galipremsagar
Expose stream parameter to get_json_object API (#14297) @davidwendt
Refactor cudf_kafka to use skbuild (#14292) @jdye64
Expose stream parameter in public strings convert APIs (#14255) @davidwendt
Upgrade to nvCOMP 3.0.4 (#13815) @vuule

🐛 Bug Fixes

Fix synchronization issue when writing string columns with dictionary to ORC (#14595) @vuule
Update actions/labeler to v4 (#14562) @raydouglass
Fix data corruption when skipping rows (#14557) @etseidl
Fix function name typo in cudf.pandas profiler (#14514) @galipremsagar
Fix intermediate type checking in expression parsing (#14445) @vyasr
Forward merge branch-23.10 into branch-23.12 (#14435) @raydouglass
Remove needs: wheel-build-cudf. (#14427) @bdice
Fix dask dependency in custreamz (#14420) @vyasr
Ensure nvbench initializes nvml context when built statically (#14411) @robertmaynard
Support java AST String literal with desired encoding (#14402) @winningsix
Raise error in reindex when index is not unique (#14400) @galipremsagar
Always build nvbench statically so we don't need to package it (#14399) @robertmaynard
Fix token-count logic in nvtext::tokenize_with_vocabulary (#14393) @davidwendt
Fix as_column(pd.Timestamp/Timedelta, length=) not respecting length (#14390) @mroeschke
cudf.pandas: cuDF subpath checking in module __getattr__ (#14388) @shwina
Fix and disable encoding for nanosecond statistics in ORC writer (#14367) @vuule
Add the new manylinux builds to the build job (#14351) @vyasr
cudf jit parser now supports .pragma instructions with quotes (#14348) @robertmaynard
Fix overflow check in cudf::merge (#14345) @divyegala
Add cramjam (#14344) @vyasr
Enable dask_cudf/io pytests in CI (#14338) @galipremsagar
Temporarily avoid the current build of pydata-sphinx-theme (#14332) @vyasr
Fix host buffer access from device function in the Parquet reader (#14328) @vuule
Run IO tests for Dask-cuDF (#14327) @rjzamora
Fix logical type issues in the Parquet writer (#14322) @vuule
Remove aws-sdk-pinning and revert to arrow 12.0.1 (#14319) @vyasr
Test is_valid before reading column data (#14318) @etseidl
Fix gtest validity setting for TextTokenizeTest.Vocabulary (#14312) @davidwendt
Fixes stack context for json lines format that recovers from invalid JSON lines (#14309) @elstehle
Downgrade to Arrow 12.0.0 for aws-sdk-cpp and fix cudf_kafka builds for new CI containers (#14296) @vyasr
Fix thread index overflow issue (#14290) @hyperbolic2346
Fix memset error in nvtext::edit_distance_matrix (#14283) @davidwendt
Changes JSON reader's recovery option's behaviour to ignore all characters after a valid JSON record (#14279) @elstehle
Handle empty string correctly in Parquet statistics (#14257) @etseidl
Fixes behaviour for incomplete lines when recover_with_nulls is enabled (#14252) @elstehle
cudf::detail::pinned_allocator doesn't throw from deallocate (#14251) @robertmaynard
Fix strings replace for adjacent, identical multi-byte UTF-8 character targets (#14235) @davidwendt
Fix the precision when converting a decimal128 column to an arrow array (#14230) @jihoonson
Fixing parquet list of struct interpretation (#13715) @hyperbolic2346

📖 Documentation

Fix io reference in docs. (#14452) @bdice
Update README (#14374) @shwina
Example code for blog on new row comparators (#13795) @divyegala

🚀 New Features

Expose streams in public unary APIs (#14342) @vyasr
Add python tests for Parquet DELTA_BINARY_PACKED encoder (#14316) @etseidl
Update rapids-cmake functions to non-deprecated signatures (#14265) @robertmaynard
Expose streams in public null mask APIs (#14263) @vyasr
Expose streams in binaryop APIs (#14187) @vyasr
Add pylibcudf.Scalar that interoperates with Arrow scalars (#14133) @vyasr
Add decoder for DELTA_BYTE_ARRAY to Parquet reader (#14101) @etseidl
Add DELTA_BINARY_PACKED encoder for Parquet writer (#14100) @etseidl
Add BytePairEncoder class to cuDF (#13891) @davidwendt
Upgrade to nvCOMP 3.0.4 (#13815) @vuule
Use pynvjitlink for CUDA 12+ MVC (#13650) @brandon-b-miller

🛠️ Improvements

Build concurrency for nightly and merge triggers (#14441) @bdice
Cleanup remaining usages of dask dependencies (#14407) @galipremsagar
Update to Arrow 14.0.1. (#14387) @bdice
Remove Cython libcpp wrappers (#14382) @vyasr
Forward-merge branch-23.10 to branch-23.12 (#14372) @bdice
Upgrade to arrow 14 (#14371) @galipremsagar
Fix a pytest typo in test_kurt_skew_error (#14368) @galipremsagar
Use new rapids-dask-dependency metapackage for managing dask versions (#14364) @vyasr
Change nullable() to has_nulls() in cudf::detail::gather (#14363) @divyegala
Split up scan_inclusive.cu to improve its compile time (#14358) @davidwendt
Implement user_datasource_wrapper is_empty() and is_device_read_preferred(). (#14357) @tpn
Added streams to CSV reader and writer api (#14340) @shrshi
Upgrade wheels to use arrow 13 (#14339) @vyasr
Rework nvtext::byte_pair_encoding API (#14337) @davidwendt
Improve performance of nvtext::tokenize_with_vocabulary for long strings (#14336) @davidwendt
Upgrade arrow to 13 (#14330) @galipremsagar
Expose stream parameter in public nvtext replace APIs (#14329) @davidwendt
Drop pyorc dependency and use pandas/pyarrow instead (#14323) @galipremsagar
Avoid pyarrow.fs import for local storage (#14321) @rjzamora
Unpin dask and distributed for 23.12 development (#14320) @galipremsagar
Expose stream parameter in public nvtext tokenize APIs (#14317) @davidwendt
Added streams to JSON reader and writer api (#14313) @shrshi
Minor improvements in source_info (#14308) @vuule
Forward-merge branch-23.10 to branch-23.12 (#14307) @bdice
Add stream parameter to Set Operations (Public List APIs) (#14305) @SurajAralihalli
Expose stream parameter to get_json_object API (#14297) @davidwendt
Sort dictionary data alphabetically in the ORC writer (#14295) @vuule
Expose stream parameter in public strings filter APIs (#14293) @davidwendt
Refactor cudf_kafka to use skbuild (#14292) @jdye64
Update shared-action-workflows references (#14289) @AyodeAwe
Register partd encode dispatch in dask_cudf (#14287) @rjzamora
Update versioning strategy (#14285) @vyasr
Move and rename byte-pair-encoding source files (#14284) @davidwendt
Expose stream parameter in public strings combine APIs (#14281) @davidwendt
Expose stream parameter in public strings contains APIs (#14280) @davidwendt
Add stream parameter to List Sort and Filter APIs (#14272) @SurajAralihalli
Use branch-23.12 workflows. (#14271) @bdice
Refactor LogicalType for Parquet (#14264) @etseidl
Centralize chunked reading code in the parquet reader to reader_impl_chunking.cu (#14262) @nvdbaranec
Expose stream parameter in public strings replace APIs (#14261) @davidwendt
Expose stream parameter in public strings APIs (#14260) @davidwendt
Cleanup of namespaces in parquet code. (#14259) @nvdbaranec
Make parquet schema index type consistent (#14256) @hyperbolic2346
Expose stream parameter in public strings convert APIs (#14255) @davidwendt
Add in java bindings for DataSource (#14254) @revans2
Reimplement cudf::merge for nested types without using comparators (#14250) @divyegala
Add stream parameter to List Manipulation and Operations APIs (#14248) @SurajAralihalli
Expose stream parameter in public strings split/partition APIs (#14247) @davidwendt
Improve contains_column by invoking contains_table (#14238) @PointKernel
Detect and report errors in Parquet header parsing (#14237) @etseidl
Normalizing offsets iterator (#14234) @davidwendt
Forward merge 23.10 into 23.12 (#14231) @galipremsagar
Return error if BOOL8 column-type is used with integers-to-hex (#14208) @davidwendt
Enable indexalator for device code (#14206) @davidwendt
Marginally reduce memory footprint of joins (#14197) @wence-
Add nvtx annotations to spilling-based data movement (#14196) @wence-
Optimize ORC writer for decimal columns (#14190) @vuule
Remove the use of volatile in ORC (#14175) @vuule
Add bytes_per_second to distinct_count of stream_compaction nvbench. (#14172) @Blonck
Add bytes_per_second to transpose benchmark (#14170) @Blonck
cuDF: Build CUDA 12.0 ARM conda packages. (#14112) @bdice
Add bytes_per_second to shift benchmark (#13950) @Blonck
Extract debug_utilities.hpp/cu from column_utilities.hpp/cu (#13720) @ttnghia

cudf - v23.12.00

Published by raydouglass 11 months ago

🚨 Breaking Changes

Raise error in reindex when index is not unique (#14400) @galipremsagar
Expose stream parameter to get_json_object API (#14297) @davidwendt
Refactor cudf_kafka to use skbuild (#14292) @jdye64
Expose stream parameter in public strings convert APIs (#14255) @davidwendt
Upgrade to nvCOMP 3.0.4 (#13815) @vuule

🐛 Bug Fixes

Update actions/labeler to v4 (#14562) @raydouglass
Fix data corruption when skipping rows (#14557) @etseidl
Fix function name typo in cudf.pandas profiler (#14514) @galipremsagar
Fix intermediate type checking in expression parsing (#14445) @vyasr
Forward merge branch-23.10 into branch-23.12 (#14435) @raydouglass
Remove needs: wheel-build-cudf. (#14427) @bdice
Fix dask dependency in custreamz (#14420) @vyasr
Ensure nvbench initializes nvml context when built statically (#14411) @robertmaynard
Support java AST String literal with desired encoding (#14402) @winningsix
Raise error in reindex when index is not unique (#14400) @galipremsagar
Always build nvbench statically so we don't need to package it (#14399) @robertmaynard
Fix token-count logic in nvtext::tokenize_with_vocabulary (#14393) @davidwendt
Fix as_column(pd.Timestamp/Timedelta, length=) not respecting length (#14390) @mroeschke
cudf.pandas: cuDF subpath checking in module __getattr__ (#14388) @shwina
Fix and disable encoding for nanosecond statistics in ORC writer (#14367) @vuule
Add the new manylinux builds to the build job (#14351) @vyasr
cudf jit parser now supports .pragma instructions with quotes (#14348) @robertmaynard
Fix overflow check in cudf::merge (#14345) @divyegala
Add cramjam (#14344) @vyasr
Enable dask_cudf/io pytests in CI (#14338) @galipremsagar
Temporarily avoid the current build of pydata-sphinx-theme (#14332) @vyasr
Fix host buffer access from device function in the Parquet reader (#14328) @vuule
Run IO tests for Dask-cuDF (#14327) @rjzamora
Fix logical type issues in the Parquet writer (#14322) @vuule
Remove aws-sdk-pinning and revert to arrow 12.0.1 (#14319) @vyasr
Test is_valid before reading column data (#14318) @etseidl
Fix gtest validity setting for TextTokenizeTest.Vocabulary (#14312) @davidwendt
Fixes stack context for json lines format that recovers from invalid JSON lines (#14309) @elstehle
Downgrade to Arrow 12.0.0 for aws-sdk-cpp and fix cudf_kafka builds for new CI containers (#14296) @vyasr
Fix thread index overflow issue (#14290) @hyperbolic2346
Fix memset error in nvtext::edit_distance_matrix (#14283) @davidwendt
Changes JSON reader's recovery option's behaviour to ignore all characters after a valid JSON record (#14279) @elstehle
Handle empty string correctly in Parquet statistics (#14257) @etseidl
Fixes behaviour for incomplete lines when recover_with_nulls is enabled (#14252) @elstehle
cudf::detail::pinned_allocator doesn't throw from deallocate (#14251) @robertmaynard
Fix strings replace for adjacent, identical multi-byte UTF-8 character targets (#14235) @davidwendt
Fix the precision when converting a decimal128 column to an arrow array (#14230) @jihoonson
Fixing parquet list of struct interpretation (#13715) @hyperbolic2346

📖 Documentation

Fix io reference in docs. (#14452) @bdice
Update README (#14374) @shwina
Example code for blog on new row comparators (#13795) @divyegala

🚀 New Features

Expose streams in public unary APIs (#14342) @vyasr
Add python tests for Parquet DELTA_BINARY_PACKED encoder (#14316) @etseidl
Update rapids-cmake functions to non-deprecated signatures (#14265) @robertmaynard
Expose streams in public null mask APIs (#14263) @vyasr
Expose streams in binaryop APIs (#14187) @vyasr
Add pylibcudf.Scalar that interoperates with Arrow scalars (#14133) @vyasr
Add decoder for DELTA_BYTE_ARRAY to Parquet reader (#14101) @etseidl
Add DELTA_BINARY_PACKED encoder for Parquet writer (#14100) @etseidl
Add BytePairEncoder class to cuDF (#13891) @davidwendt
Upgrade to nvCOMP 3.0.4 (#13815) @vuule
Use pynvjitlink for CUDA 12+ MVC (#13650) @brandon-b-miller

🛠️ Improvements

Build concurrency for nightly and merge triggers (#14441) @bdice
Cleanup remaining usages of dask dependencies (#14407) @galipremsagar
Update to Arrow 14.0.1. (#14387) @bdice
Remove Cython libcpp wrappers (#14382) @vyasr
Forward-merge branch-23.10 to branch-23.12 (#14372) @bdice
Upgrade to arrow 14 (#14371) @galipremsagar
Fix a pytest typo in test_kurt_skew_error (#14368) @galipremsagar
Use new rapids-dask-dependency metapackage for managing dask versions (#14364) @vyasr
Change nullable() to has_nulls() in cudf::detail::gather (#14363) @divyegala
Split up scan_inclusive.cu to improve its compile time (#14358) @davidwendt
Implement user_datasource_wrapper is_empty() and is_device_read_preferred(). (#14357) @tpn
Added streams to CSV reader and writer api (#14340) @shrshi
Upgrade wheels to use arrow 13 (#14339) @vyasr
Rework nvtext::byte_pair_encoding API (#14337) @davidwendt
Improve performance of nvtext::tokenize_with_vocabulary for long strings (#14336) @davidwendt
Upgrade arrow to 13 (#14330) @galipremsagar
Expose stream parameter in public nvtext replace APIs (#14329) @davidwendt
Drop pyorc dependency and use pandas/pyarrow instead (#14323) @galipremsagar
Avoid pyarrow.fs import for local storage (#14321) @rjzamora
Unpin dask and distributed for 23.12 development (#14320) @galipremsagar
Expose stream parameter in public nvtext tokenize APIs (#14317) @davidwendt
Added streams to JSON reader and writer api (#14313) @shrshi
Minor improvements in source_info (#14308) @vuule
Forward-merge branch-23.10 to branch-23.12 (#14307) @bdice
Add stream parameter to Set Operations (Public List APIs) (#14305) @SurajAralihalli
Expose stream parameter to get_json_object API (#14297) @davidwendt
Sort dictionary data alphabetically in the ORC writer (#14295) @vuule
Expose stream parameter in public strings filter APIs (#14293) @davidwendt
Refactor cudf_kafka to use skbuild (#14292) @jdye64
Update shared-action-workflows references (#14289) @AyodeAwe
Register partd encode dispatch in dask_cudf (#14287) @rjzamora
Update versioning strategy (#14285) @vyasr
Move and rename byte-pair-encoding source files (#14284) @davidwendt
Expose stream parameter in public strings combine APIs (#14281) @davidwendt
Expose stream parameter in public strings contains APIs (#14280) @davidwendt
Add stream parameter to List Sort and Filter APIs (#14272) @SurajAralihalli
Use branch-23.12 workflows. (#14271) @bdice
Refactor LogicalType for Parquet (#14264) @etseidl
Centralize chunked reading code in the parquet reader to reader_impl_chunking.cu (#14262) @nvdbaranec
Expose stream parameter in public strings replace APIs (#14261) @davidwendt
Expose stream parameter in public strings APIs (#14260) @davidwendt
Cleanup of namespaces in parquet code. (#14259) @nvdbaranec
Make parquet schema index type consistent (#14256) @hyperbolic2346
Expose stream parameter in public strings convert APIs (#14255) @davidwendt
Add in java bindings for DataSource (#14254) @revans2
Reimplement cudf::merge for nested types without using comparators (#14250) @divyegala
Add stream parameter to List Manipulation and Operations APIs (#14248) @SurajAralihalli
Expose stream parameter in public strings split/partition APIs (#14247) @davidwendt
Improve contains_column by invoking contains_table (#14238) @PointKernel
Detect and report errors in Parquet header parsing (#14237) @etseidl
Normalizing offsets iterator (#14234) @davidwendt
Forward merge 23.10 into 23.12 (#14231) @galipremsagar
Return error if BOOL8 column-type is used with integers-to-hex (#14208) @davidwendt
Enable indexalator for device code (#14206) @davidwendt
Marginally reduce memory footprint of joins (#14197) @wence-
Add nvtx annotations to spilling-based data movement (#14196) @wence-
Optimize ORC writer for decimal columns (#14190) @vuule
Remove the use of volatile in ORC (#14175) @vuule
Add bytes_per_second to distinct_count of stream_compaction nvbench. (#14172) @Blonck
Add bytes_per_second to transpose benchmark (#14170) @Blonck
cuDF: Build CUDA 12.0 ARM conda packages. (#14112) @bdice
Add bytes_per_second to shift benchmark (#13950) @Blonck
Extract debug_utilities.hpp/cu from column_utilities.hpp/cu (#13720) @ttnghia

cudf - v23.10.02

Published by raydouglass 11 months ago

🚨 Breaking Changes

Raise error in reindex when index is not unique (#14429) @galipremsagar
Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Create table_input_metadata from a table_metadata (#13920) @etseidl
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Update to Cython 3.0.0 (#13777) @vyasr
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

🐛 Bug Fixes

Raise error in reindex when index is not unique (#14429) @galipremsagar
Fix inaccurate ceil/floor and inaccurate rescaling casts of fixed-point values. (#14242) @bdice
Fix inaccuracy in decimal128 rounding. (#14233) @bdice
Workaround for illegal instruction error in sm90 for warp intrinsics with mask (#14201) @karthikeyann
Fix pytorch related pytest (#14198) @galipremsagar
Pin to aws-sdk-cpp<1.11 (#14173) @pentschev
Fix assert failure for range window functions (#14168) @mythrocks
Fix Memcheck error found in JSON_TEST JsonReaderTest.ErrorStrings (#14164) @karthikeyann
Fix calls to copy_bitmask to pass stream parameter (#14158) @davidwendt
Fix DataFrame from Series with different CategoricalIndexes (#14157) @mroeschke
Pin to numpy<1.25 and numba<0.58 to avoid errors and deprecation warnings-as-errors. (#14156) @bdice
Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
Fix DataFrame.values with no columns but index (#14134) @mroeschke
Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
Add support for nested dict in DataFrame constructor (#14119) @galipremsagar
Restrict iterables of DataFrame's as input to DataFrame constructor (#14118) @galipremsagar
Allow numeric_only=True for reduction operations on numeric types (#14111) @galipremsagar
Preserve name of the column while initializing a DataFrame (#14110) @galipremsagar
Correct numerous 20054-D: dynamic initialization errors found on arm+12.2 (#14108) @robertmaynard
Drop kwargs from Series.count (#14106) @galipremsagar
Fix naming issues with Index.to_frame and MultiIndex.to_frame APIs (#14105) @galipremsagar
Only use memory resources that haven't been freed (#14103) @robertmaynard
Add support for __round__ in Series and DataFrame (#14099) @galipremsagar
Validate ignore_index type in drop_duplicates (#14098) @mroeschke
Fix renaming Series and Index (#14080) @galipremsagar
Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
Use conda mambabuild rather than mamba mambabuild (#14067) @wence-
Raise NotImplementedError in to_datetime with dayfirst without infer_format (#14058) @mroeschke
Fix various issues in Index.intersection (#14054) @galipremsagar
Fix Index.difference to match with pandas (#14053) @galipremsagar
Fix empty string column construction (#14052) @galipremsagar
Fix IntervalIndex.union to preserve type-metadata (#14051) @galipremsagar
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Ignore compile_commands.json (#14048) @harrism
Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
Implement sort_remaining for sort_index (#14033) @wence-
Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
Temporary fix Parquet metadata with empty value string being ignored from writing (#14026) @ttnghia
Preserve types of scalar being returned when possible in quantile (#14014) @galipremsagar
Fix return type of MultiIndex.difference (#14009) @galipremsagar
Raise an error when timezone subtypes are encountered in pd.IntervalDtype (#14006) @galipremsagar
Fix map column can not be non-nullable for java (#14003) @res-life
Fix name selection in Index.difference and Index.intersection (#13986) @galipremsagar
Restore column type metadata with dropna to fix factorize API (#13980) @galipremsagar
Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
Fix MultiIndex.to_numpy to return numpy array with tuples (#13966) @galipremsagar
Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
Fix an issue with IntervalIndex.repr when null values are present (#13958) @galipremsagar
Fix type metadata issue preservation with Column.unique (#13957) @galipremsagar
Handle Interval scalars when passed in list-like inputs to cudf.Index (#13956) @galipremsagar
Fix setting of categories order when dtype is passed to a CategoricalColumn (#13955) @galipremsagar
Handle as_index in GroupBy.apply (#13951) @brandon-b-miller
Raise error for string types in nsmallest and nlargest (#13946) @galipremsagar
Fix index of Groupby.apply results when it is performed on empty objects (#13944) @galipremsagar
Fix integer overflow in shim device_sum functions (#13943) @brandon-b-miller
Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
Fix construction of Grouping objects (#13932) @galipremsagar
Fix an issue with loc when column names is MultiIndex (#13929) @galipremsagar
Fix handling of typecasting in searchsorted (#13925) @galipremsagar
Preserve index name in reindex (#13917) @galipremsagar
Use cudf::thread_index_type in cuIO to prevent overflow in row indexing (#13910) @vuule
Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
Use cudf::thread_index_type in replace.cu. (#13905) @bdice
Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
Use thread_index_type to avoid index overflow in grid-stride loops (#13895) @PointKernel
Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
Raise error when trying to construct a DataFrame with mixed types (#13889) @galipremsagar
Return nan when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
Correctly detect the BOM mark in read_csv with compressed input (#13881) @vuule
Check for the presence of all values in MultiIndex.isin (#13879) @galipremsagar
Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
Fix return type of MultiIndex.levels (#13870) @galipremsagar
Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
Disable construction of Index when freq is set in pandas-compatibility mode (#13857) @galipremsagar
Fix an issue with fetching NA from a TimedeltaColumn (#13853) @galipremsagar
Simplify implementation of interval_range() and fix behaviour for floating freq (#13844) @shwina
Fix binary operations between Series and Index (#13842) @galipremsagar
Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
Fix read out of bounds in string concatenate (#13838) @pentschev
Raise error for more cases when timezone-aware data is passed to as_column (#13835) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
Fix cuFile I/O factories (#13829) @vuule
DataFrame with namedtuples uses ._field as column names (#13824) @mroeschke
Branch 23.10 merge 23.08 (#13822) @vyasr
Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
Raise error when mixed types are being constructed (#13816) @galipremsagar
Fix unbounded sequence issue in DataFrame constructor (#13811) @galipremsagar
Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
Update get_arrow to Arrow 12's CMake target name of arrow::xsimd (#13790) @robertmaynard
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Fix negative unary operation for boolean type (#13780) @galipremsagar
Fix contains(in) method for Series (#13779) @galipremsagar
Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
Cast only time of day to nanos to avoid an overflow in Parquet INT96 write (#13776) @gerashegalov
Preserve names of column object in various APIs (#13772) @galipremsagar
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Fix construction of DataFrames from dict when columns are provided (#13766) @wence-
Provide our own Cython declaration for make_unique (#13746) @wence-

📖 Documentation

Fix benchmark image. (#14376) @bdice
Fix typo in docstring: metadata. (#14025) @bdice
Fix typo in parquet/page_decode.cuh (#13849) @XinyuZeng
Simplify Python doc configuration (#13826) @vyasr
Update documentation to reflect recent changes in JSON reader and writer (#13791) @vuule
Fix all warnings in Python docs (#13789) @vyasr

🚀 New Features

[Java] Add JNI bindings for integers_to_hex (#14205) @razajafri
Propagate errors from Parquet reader kernels back to host (#14167) @vuule
JNI for HISTOGRAM and MERGE_HISTOGRAM aggregations (#14154) @ttnghia
Expose streams in all public sorting APIs (#14146) @vyasr
Enable direct ingestion and production of Arrow scalars (#14121) @vyasr
Implement GroupBy.value_counts to match pandas API (#14114) @stmio
Refactor parquet thrift reader (#14097) @etseidl
Refactor hash_reduce_by_row (#14095) @ttnghia
Support negative preceding/following for ROW window functions (#14093) @mythrocks
Support for progressive parquet chunked reading. (#14079) @nvdbaranec
Implement HISTOGRAM and MERGE_HISTOGRAM aggregations (#14045) @ttnghia
Expose streams in public search APIs (#14034) @vyasr
Expose streams in public replace APIs (#14010) @vyasr
Add stream parameter to public cudf::strings::split APIs (#13997) @davidwendt
Expose streams in public filling APIs (#13990) @vyasr
Expose streams in public concatenate APIs (#13987) @vyasr
Use HostMemoryAllocator in jni::allocate_host_buffer (#13975) @gerashegalov
Enable fractional null probability for hashing benchmark (#13967) @Blonck
Switch pylibcudf-enabled types to use enum class in Cython (#13931) @vyasr
Add nvtext::tokenize_with_vocabulary API (#13930) @davidwendt
Rewrite DataFrame.stack to support multi level column names (#13927) @isVoid
Add HostMemoryAllocator interface (#13924) @gerashegalov
Global stream pool (#13922) @etseidl
Create table_input_metadata from a table_metadata (#13920) @etseidl
Translate column size overflow exception to JNI (#13911) @mythrocks
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Exclude some tests from running with the compute sanitizer (#13872) @firestarman
Expand statistics support in ORC writer (#13848) @vuule
Register the memory mapped buffer in datasource to improve H2D throughput (#13814) @vuule
Add cudf::strings::find function with target per row (#13808) @davidwendt
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove unnecessary pointer copying in JIT GroupBy Apply (#13792) @brandon-b-miller
Add 'poll' function to custreamz kafka consumer (#13782) @jdye64
Support corr in GroupBy.apply through the jit engine (#13767) @shwina
Optionally write version 2 page headers in Parquet writer (#13751) @etseidl
Support more numeric types in Groupby.apply with engine='jit' (#13729) @brandon-b-miller
[FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader (#13637) @etseidl
Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader (#13437) @PointKernel

🛠️ Improvements

Update shared-action-workflows references (backport from 23.12 to 23.10) (#14300) @AyodeAwe
Pin dask and distributed for 23.10 release (#14225) @galipremsagar
Update rmm tag path (#14195) @AyodeAwe
Disable Recently Updated Check (#14193) @ajschmidt8
Move cpp/src/hash/hash_allocator.cuh to include/cudf/hashing/detail (#14163) @davidwendt
Add Parquet reader benchmarks for row selection (#14147) @vuule
Update image names (#14145) @AyodeAwe
Support callables in DataFrame.assign (#14142) @wence-
Reduce memory usage of as_categorical_column (#14138) @wence-
Replace Python scalar conversions with libcudf (#14124) @vyasr
Update to clang 16.0.6. (#14120) @bdice
Fix type of empty Index and raise warning in Series constructor (#14116) @galipremsagar
Add stream parameter to external dict APIs (#14115) @SurajAralihalli
Add fallback matrix for nvcomp. (#14082) @bdice
[Java] Add recoverWithNull to JSONOptions and pass to Table.readJSON (#14078) @andygrove
Remove header tests (#14072) @ajschmidt8
Refactor contains_table with cuco::static_set (#14064) @PointKernel
Remove debug print in a Parquet test (#14063) @vuule
Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
Expose stream parameter in public strings find APIs (#14060) @davidwendt
Update doxygen to 1.9.1 (#14059) @vyasr
Remove the mr from the base fixture (#14057) @vyasr
Expose streams in public strings case APIs (#14056) @davidwendt
Refactor libcudf indexalator to typed normalator (#14043) @davidwendt
Use cudf::make_empty_column instead of column_view constructor (#14030) @davidwendt
Remove quadratic runtime due to accessing Frame._dtypes in loop (#14028) @wence-
Explicitly depend on zlib in conda recipes (#14018) @wence-
Use grid_stride for stride computations. (#13996) @bdice
Fix an issue where casting null-array to object dtype will result in a failure (#13994) @galipremsagar
Add tab as literal to cudf::test::to_string output (#13993) @davidwendt
Enable codes dtype parity in pandas-compatibility mode for factorize API (#13982) @galipremsagar
Fix CategoricalIndex ordering in Groupby.agg when pandas-compatibility mode is enabled (#13978) @galipremsagar
Produce a fatal error if cudf is unable to find pyarrow include directory (#13976) @cwharris
Use thread_index_type in partitioning.cu (#13973) @divyegala
Use cudf::thread_index_type in merge.cu (#13972) @divyegala
Use copy-pr-bot (#13970) @ajschmidt8
Use cudf::thread_index_type in strings custom kernels (#13968) @davidwendt
Add bytes_per_second to hash_partition benchmark (#13965) @Blonck
Added pinned pool reservation API for java (#13964) @revans2
Simplify wheel build scripts and allow alphas of RAPIDS dependencies (#13963) @vyasr
Add bytes_per_second to copy_if_else benchmark (#13960) @Blonck
Add pandas compatible output to Series.unique (#13959) @galipremsagar
Add bytes_per_second to compiled binaryop benchmark (#13938) @Blonck
Unpin dask and distributed for 23.10 development (#13935) @galipremsagar
Make HostColumnVector.getRefCount public (#13934) @abellina
Use cuco::static_set in JSON tree algorithm (#13928) @karthikeyann
Add java API to get size of host memory needed to copy column view (#13919) @revans2
Use cudf::size_type instead of int32 where appropriate in nvtext functions (#13915) @davidwendt
Enable hugepage for arrow host allocations (#13914) @madsbk
Improve performance of nvtext::edit_distance (#13912) @davidwendt
Ensure cudf internals use pylibcudf in pure Python mode (#13909) @vyasr
Use empty() instead of size() where possible (#13908) @vuule
[JNI] Adds HostColumnVector.EventHandler for spillability checks (#13898) @abellina
Return Timestamp & Timedelta for fetching scalars in DatetimeIndex & TimedeltaIndex (#13896) @galipremsagar
Allow explicit shuffle="p2p" within dask-cudf API (#13893) @rjzamora
Disable creation of DatetimeIndex when freq is passed to cudf.date_range (#13890) @galipremsagar
Bring parity with pandas for datetime & timedelta comparison operations (#13877) @galipremsagar
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Raise error when astype(object) is called in pandas compatibility mode (#13862) @galipremsagar
Fixes a performance regression in FST (#13850) @elstehle
Set native handles to null on close in Java wrapper classes (#13818) @jlowe
Avoid use of CUDF_EXPECTS in libcudf unit tests outside of helper functions with return values (#13812) @vuule
Update lists::contains to experimental row comparator (#13810) @divyegala
Reduce lists::contains dispatches for scalars (#13805) @divyegala
Long string optimization for string column parsing in JSON reader (#13803) @karthikeyann
Raise NotImplementedError for pd.SparseDtype (#13798) @mroeschke
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Move Spark-independent Table debug to cudf Java (#13783) @gerashegalov
Update to Cython 3.0.0 (#13777) @vyasr
Refactor Parquet reader handling of V2 page header info (#13775) @etseidl
Branch 23.10 merge 23.08 (#13773) @vyasr
Restructure JSON code to correctly reflect legacy/experimental status (#13757) @vuule
Branch 23.10 merge 23.08 (#13753) @vyasr
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Refactors JSON reader's pushdown automaton (#13716) @elstehle
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

cudf - v23.04.01

Published by raydouglass 12 months ago

🚨 Breaking Changes

Pin dask and distributed for release (#13070) @galipremsagar
Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
Update minimum pandas and numpy pinnings (#12887) @galipremsagar
Deprecate names & dtype in Index.copy (#12825) @galipremsagar
Deprecate Index.is_* methods (#12820) @galipremsagar
Deprecate datetime_is_numeric from describe (#12818) @galipremsagar
Deprecate na_sentinel in factorize (#12817) @galipremsagar
Make string methods return a Series with a useful Index (#12814) @shwina
Produce useful guidance on overflow error in to_csv (#12705) @wence-
Move strings_udf code into cuDF (#12669) @brandon-b-miller
Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
Replace message parsing with throwing more specific exceptions (#12426) @vyasr

🐛 Bug Fixes

Pin curand version (#13127) @vyasr
Fix memcheck script to execute only _TEST files found in bin/gtests/libcudf (#13006) @davidwendt
Fix DataFrame constructor to broadcast scalar inputs properly (#12997) @galipremsagar
Drop force_nullable_schema from chunked parquet writer (#12996) @galipremsagar
Fix gtest column utility comparator diff reporting (#12995) @davidwendt
Handle index names while performing groupby (#12992) @galipremsagar
Fix __setitem__ on string columns when the scalar value ends in a null byte (#12991) @wence-
Fix sort_values when column is all empty strings (#12988) @eriknw
Remove unused variable and fix memory issue in ORC writer (#12984) @ttnghia
Pre-emptive fix for upstream dask.dataframe.read_parquet changes (#12983) @rjzamora
Remove MANIFEST.in use auto-generated one for sdists and package_data for wheels (#12960) @vyasr
Update to use rapids-export(COMPONENTS) feature. (#12959) @robertmaynard
cudftestutil supports static gtest dependencies (#12957) @robertmaynard
Include gtest in build environment. (#12956) @vyasr
Correctly handle scalar indices in Index.__getitem__ (#12955) @wence-
Avoid building cython twice (#12945) @galipremsagar
Fix set index error for Series rolling window operations (#12942) @galipremsagar
Fix calculation of null counts for Parquet statistics (#12938) @etseidl
Preserve integer dtype of hive-partitioned column containing nulls (#12930) @rjzamora
Use get_current_device_resource for intermediate allocations in COLLECT_LIST window code (#12927) @karthikeyann
Mark dlpack tensor deleter as noexcept to match PyCapsule_Destructor signature. (#12921) @bdice
Fix conda recipe post-link.sh typo (#12916) @pentschev
min_rows and num_rows are swapped in ComputePageSizes declaration in Parquet reader (#12886) @etseidl
Expect cupy to now support bool arrays for dlpack. (#12883) @vyasr
Use python -m pytest for nightly wheel tests (#12871) @bdice
Parquet writer column_size() should return a size_t (#12870) @etseidl
Fix cudf::hash_partition kernel launch error with decimal128 types (#12863) @davidwendt
Fix an issue with parquet chunked reader undercounting string lengths. (#12859) @nvdbaranec
Remove tokenizers pre-install pinning. (#12854) @vyasr
Fix parquet RangeIndex bug (#12838) @rjzamora
Remove KAFKA_HOST_TEST from compute-sanitizer check (#12831) @davidwendt
Make string methods return a Series with a useful Index (#12814) @shwina
Tell cudf_kafka to use header-only fmt (#12796) @vyasr
Add GroupBy.dtypes (#12783) @galipremsagar
Fix a leak in a test and clarify some test names (#12781) @revans2
Fix bug in all-null list due to join_list_elements special handling (#12767) @karthikeyann
Add try/except for expected null-schema error in read_parquet (#12756) @rjzamora
Throw an exception if an unsupported page encoding is detected in Parquet reader (#12754) @etseidl
Fix a bug with num_keys in _scatter_by_slice (#12749) @thomcom
Bump pinned rapids wheel deps to 23.4 (#12735) @sevagh
Rework logic in cudf::strings::split_record to improve performance (#12729) @davidwendt
Add always_nullable flag to Dremel encoding (#12727) @divyegala
Fix memcheck read error in compound segmented reduce (#12722) @davidwendt
Fix faulty conditional logic in JIT GroupBy.apply (#12706) @brandon-b-miller
Produce useful guidance on overflow error in to_csv (#12705) @wence-
Handle parquet list data corner case (#12698) @nvdbaranec
Fix missing trailing comma in json writer (#12688) @karthikeyann
Remove child from newCudaAsyncMemoryResource (#12681) @abellina
Handle bool types in round API (#12670) @galipremsagar
Ensure all of device bitmask is initialized in from_arrow (#12668) @wence-
Fix from_arrow to load a sliced arrow table (#12665) @galipremsagar
Fix dask-cudf read_parquet bug for multi-file aggregation (#12663) @rjzamora
Fix AllocateLikeTest gtests reading uninitialized null-mask (#12643) @davidwendt
Fix find_common_dtype and values to handle complex dtypes (#12537) @galipremsagar
Fix fetching of MultiIndex values when a label is passed (#12521) @galipremsagar
Fix Series comparison vs scalars (#12519) @brandon-b-miller
Allow casting from UDFString back to StringView to call methods in strings_udf (#12363) @brandon-b-miller

📖 Documentation

Fix GroupBy.apply doc examples rendering (#12994) @brandon-b-miller
Add sphinx building and s3 uploading for dask-cudf docs (#12982) @quasiben
Add developer documentation forbidding default parameters in detail APIs (#12978) @vyasr
Add README symlink for dask-cudf. (#12946) @bdice
Remove return type from @return doxygen tags (#12908) @davidwendt
Fix docs build to be pydata-sphinx-theme=0.13.0 compatible (#12874) @galipremsagar
Add skeleton API and prose documentation for dask-cudf (#12725) @wence-
Enable doctests for GroupBy methods (#12658) @brandon-b-miller
Add comment about CUB patch for SegmentedSortInt.Bool gtest (#12611) @davidwendt

🚀 New Features

Add JNI method for strings::replace multi variety (#12979) @NVnavkumar
Add nunique aggregation support for cudf::segmented_reduce (#12972) @davidwendt
Refactor orc chunked writer (#12949) @ttnghia
Make Parquet writer nullable option applicable to single table writes (#12933) @vuule
Refactor io::orc::ProtobufWriter (#12877) @ttnghia
Make timezone table independent from ORC (#12805) @vuule
Cache JIT GroupBy.apply functions (#12802) @brandon-b-miller
Implement initial support for avro logical types (#6482) (#12788) @tpn
Update tests/column_utilities to use experimental::equality row comparator (#12777) @divyegala
Update distinct/unique_count to experimental::row hasher/comparator (#12776) @divyegala
Update hash_partition to use experimental::row::row_hasher (#12761) @divyegala
Update is_sorted to use experimental::row::lexicographic (#12752) @divyegala
Update default data source in cuio reader benchmarks (#12740) @PointKernel
Reenable stream identification library in CI (#12714) @vyasr
Add regex_program strings splitting java APIs and tests (#12713) @cindyyuanjiang
Add regex_program strings replacing java APIs and tests (#12701) @cindyyuanjiang
Add regex_program strings extract java APIs and tests (#12699) @cindyyuanjiang
Variable fragment sizes for Parquet writer (#12685) @etseidl
Add segmented reduction support for fixed-point types (#12680) @davidwendt
Move strings_udf code into cuDF (#12669) @brandon-b-miller
Add regex_program searching APIs and related java classes (#12666) @cindyyuanjiang
Add logging to libcudf (#12637) @vuule
Add compound aggregations to cudf::segmented_reduce (#12573) @davidwendt
Convert rank to use experimental row comparators (#12481) @divyegala
Use rapids-cmake parallel testing feature (#12451) @robertmaynard
Enable detection of undesired stream usage (#12089) @vyasr

🛠️ Improvements

Pin dask and distributed for release (#13070) @galipremsagar
Pin cupy in wheel tests to supported versions (#13041) @vyasr
Pin numba version (#13001) @vyasr
Rework gtests SequenceTest to remove using namespace cudf (#12985) @davidwendt
Stop setting package version attribute in wheels (#12977) @vyasr
Move detail reduction functions to cudf::reduction::detail namespace (#12971) @davidwendt
Remove default detail mrs: part7 (#12970) @vyasr
Remove default detail mrs: part6 (#12969) @vyasr
Remove default detail mrs: part5 (#12968) @vyasr
Remove default detail mrs: part4 (#12967) @vyasr
Remove default detail mrs: part3 (#12966) @vyasr
Remove default detail mrs: part2 (#12965) @vyasr
Remove default detail mrs: part1 (#12964) @vyasr
Add force_nullable_schema parameter to Parquet writer. (#12952) @galipremsagar
Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
Remove remaining default stream parameters (#12943) @vyasr
Fix cudf::segmented_reduce gtest for ANY aggregation (#12940) @davidwendt
Implement groupby.head and groupby.tail (#12939) @wence-
Fix libcudf gtests to pass null-count=0 for empty validity masks (#12923) @davidwendt
Migrate parquet encoding to use experimental row operators (#12918) @PointKernel
Fix benchmarks coded in namespace cudf and using namespace cudf (#12915) @karthikeyann
Fix io/text gtests coded in namespace cudf::test (#12914) @karthikeyann
Pass SCCACHE_S3_USE_SSL to conda builds (#12910) @ajschmidt8
Fix FST, JSON gtests & benchmarks coded in namespace cudf::test (#12907) @karthikeyann
Generate pyproject dependencies using dfg (#12906) @vyasr
Update libcudf counting functions to specify cudf::size_type (#12904) @davidwendt
Fix moto env vars & pass AWS_SESSION_TOKEN to conda builds (#12902) @ajschmidt8
Rewrite CSV writer benchmark with nvbench (#12901) @PointKernel
Rework some code logic to reduce iterator and comparator inlining to improve compile time (#12900) @davidwendt
Deprecate line_terminator in favor of lineterminator in to_csv (#12896) @wence-
Add stream and mr parameters for structs::detail::flatten_nested_columns (#12892) @ttnghia
Deprecate libcudf regex APIs accepting pattern strings directly (#12891) @davidwendt
Remove default parameters from detail headers in include (#12888) @vyasr
Update minimum pandas and numpy pinnings (#12887) @galipremsagar
Implement groupby.sample (#12882) @wence-
Update JNI build ENV default to gcc 11 (#12881) @pxLi
Change return type of cudf::structs::detail::flatten_nested_columns to smart pointer (#12878) @ttnghia
Fix passing seed parameter to MurmurHash3_32 in cudf::hash() function (#12875) @davidwendt
Remove manual artifact upload step in CI (#12869) @ajschmidt8
Update to GCC 11 (#12868) @bdice
Fix null hive-partition behavior in dask-cudf parquet (#12866) @rjzamora
Update to protobuf>=4.21.6,<4.22. (#12864) @bdice
Update RMM allocators (#12861) @pentschev
Improve performance for replace-multi for long strings (#12858) @davidwendt
Drop Python 3.7 handling for pickle protocol 4 (#12857) @jakirkham
Migrate as much as possible to pyproject.toml (#12850) @vyasr
Enable nbqa pre-commit hooks for isort and black. (#12848) @bdice
Setting a threshold for KvikIO IO (#12841) @madsbk
Update datasets download URL (#12840) @jjacobelli
Make docs builds less verbose (#12836) @AyodeAwe
Consolidate linter configs into pyproject.toml (#12834) @vyasr
Deprecate names & dtype in Index.copy (#12825) @galipremsagar
Deprecate inplace parameters in categorical methods (#12824) @galipremsagar
Add optional text file support to ninja-log utility (#12823) @davidwendt
Deprecate Index.is_* methods (#12820) @galipremsagar
Add dfg as a pre-commit hook (#12819) @vyasr
Deprecate datetime_is_numeric from describe (#12818) @galipremsagar
Deprecate na_sentinel in factorize (#12817) @galipremsagar
Shuffling read into a sub function in parquet read (#12809) @hyperbolic2346
Fixing parquet coalescing of reads (#12808) @hyperbolic2346
CI: Remove specification of manual stage for check_style.sh script. (#12803) @csadorf
Add compute-sanitizer github workflow action to nightly tests (#12800) @davidwendt
Enable groupby std and variance aggregation types in libcudf Debug build (#12799) @davidwendt
Expose seed argument to hash_values (#12795) @ayushdg
Fix groupby gtests coded in namespace cudf::test (#12784) @davidwendt
Improve performance for cudf::strings::count_characters for long strings (#12779) @davidwendt
Deallocate encoded data in ORC writer immediately after compression (#12770) @vuule
Stop force pulling fmt in nvbench. (#12768) @vyasr
Remove now redundant cuda initialization (#12758) @vyasr
Adds JSON reader, writer io benchmark (#12753) @karthikeyann
Use test paths relative to package directory. (#12751) @bdice
Add build metrics report as artifact to cpp-build workflow (#12750) @davidwendt
Add JNI methods for detecting and purging non-empty nulls from LIST and STRUCT (#12742) @razajafri
Stop using versioneer to manage versions (#12741) @vyasr
Reduce error handling verbosity in CI tests scripts (#12738) @AjayThorve
Reduce the number of test cases in multibyte_split benchmark (#12737) @PointKernel
Update shared workflow branches (#12733) @ajschmidt8
JNI switches to nested JSON reader (#12732) @res-life
Changing cudf::io::source_info to use cudf::host_span<std::byte> in a non-breaking form (#12730) @hyperbolic2346
Add nvbench environment class for initializing RMM in benchmarks (#12728) @davidwendt
Split C++ and Python build dependencies into separate lists. (#12724) @bdice
Add build dependencies to Java tests. (#12723) @bdice
Allow setting the seed argument for hash partition (#12715) @firestarman
Remove gpuCI scripts. (#12712) @bdice
Unpin dask and distributed for development (#12710) @galipremsagar
partition_by_hash(): use _split() (#12704) @madsbk
Remove DataFrame.quantiles from docs. (#12684) @bdice
Fast path for experimental::row::equality (#12676) @divyegala
Move date to build string in conda recipe (#12661) @ajschmidt8
Refactor reduction logic for fixed-point types (#12652) @davidwendt
Pay off some JNI RMM API tech debt (#12632) @revans2
Merge copy-on-write feature branch into branch-23.04 (#12619) @galipremsagar
Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
Pin cuda-nvrtc. (#12606) @bdice
Remove cudf::test::print calls from libcudf gtests (#12604) @davidwendt
Init JNI version 23.04.0-SNAPSHOT (#12599) @pxLi
Add performance benchmarks to user facing docs (#12595) @galipremsagar
Add docs build job (#12592) @AyodeAwe
Replace message parsing with throwing more specific exceptions (#12426) @vyasr
Support conversion to/from cudf in dask.dataframe.core.to_backend (#12380) @rjzamora
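
Several entries in this release pass a seed through to hashing APIs (#12875, #12795, #12715). As a rough illustration of what a seed does, the sketch below is a plain-Python reference version of MurmurHash3_x86_32 over raw bytes. It is not libcudf's implementation, which hashes typed column elements on the GPU; the function name `murmur3_x86_32` is only a placeholder for this example.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """Reference MurmurHash3_x86_32 of a byte string (unsigned 32-bit result)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32  # the seed is simply the initial hash state

    # Body: mix each complete 4-byte little-endian block into the state.
    nblocks = len(data) // 4
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: mix the remaining 1 to 3 bytes, if any.
    tail = data[4 * nblocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


# The same input hashed with two different seeds gives unrelated values.
print(hex(murmur3_x86_32(b"hello", seed=0)))
print(hex(murmur3_x86_32(b"hello", seed=42)))
```

Because the seed only changes the initial state, callers who partition or hash the same data in two stages can pick different seeds so that the two stages do not produce correlated buckets.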

cudf - v23.10.00

Published by raydouglass about 1 year ago

🚨 Breaking Changes

Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Create table_input_metadata from a table_metadata (#13920) @etseidl
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Update to Cython 3.0.0 (#13777) @vyasr
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

🐛 Bug Fixes

Fix inaccurate ceil/floor and inaccurate rescaling casts of fixed-point values. (#14242) @bdice
Fix inaccuracy in decimal128 rounding. (#14233) @bdice
Workaround for illegal instruction error in sm90 for warp intrinsics with mask (#14201) @karthikeyann
Fix pytorch related pytest (#14198) @galipremsagar
Pin to aws-sdk-cpp<1.11 (#14173) @pentschev
Fix assert failure for range window functions (#14168) @mythrocks
Fix Memcheck error found in JSON_TEST JsonReaderTest.ErrorStrings (#14164) @karthikeyann
Fix calls to copy_bitmask to pass stream parameter (#14158) @davidwendt
Fix DataFrame from Series with different CategoricalIndexes (#14157) @mroeschke
Pin to numpy<1.25 and numba<0.58 to avoid errors and deprecation warnings-as-errors. (#14156) @bdice
Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
Fix DataFrame.values with no columns but index (#14134) @mroeschke
Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
Add support for nested dict in DataFrame constructor (#14119) @galipremsagar
Restrict iterables of DataFrame's as input to DataFrame constructor (#14118) @galipremsagar
Allow numeric_only=True for reduction operations on numeric types (#14111) @galipremsagar
Preserve name of the column while initializing a DataFrame (#14110) @galipremsagar
Correct numerous 20054-D: dynamic initialization errors found on arm+12.2 (#14108) @robertmaynard
Drop kwargs from Series.count (#14106) @galipremsagar
Fix naming issues with Index.to_frame and MultiIndex.to_frame APIs (#14105) @galipremsagar
Only use memory resources that haven't been freed (#14103) @robertmaynard
Add support for __round__ in Series and DataFrame (#14099) @galipremsagar
Validate ignore_index type in drop_duplicates (#14098) @mroeschke
Fix renaming Series and Index (#14080) @galipremsagar
Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
Use conda mambabuild rather than mamba mambabuild (#14067) @wence-
Raise NotImplementedError in to_datetime with dayfirst without infer_format (#14058) @mroeschke
Fix various issues in Index.intersection (#14054) @galipremsagar
Fix Index.difference to match with pandas (#14053) @galipremsagar
Fix empty string column construction (#14052) @galipremsagar
Fix IntervalIndex.union to preserve type-metadata (#14051) @galipremsagar
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Ignore compile_commands.json (#14048) @harrism
Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
Implement sort_remaining for sort_index (#14033) @wence-
Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
Temporary fix Parquet metadata with empty value string being ignored from writing (#14026) @ttnghia
Preserve types of scalar being returned when possible in quantile (#14014) @galipremsagar
Fix return type of MultiIndex.difference (#14009) @galipremsagar
Raise an error when timezone subtypes are encountered in pd.IntervalDtype (#14006) @galipremsagar
Fix map column can not be non-nullable for java (#14003) @res-life
Fix name selection in Index.difference and Index.intersection (#13986) @galipremsagar
Restore column type metadata with dropna to fix factorize API (#13980) @galipremsagar
Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
Fix MultiIndex.to_numpy to return numpy array with tuples (#13966) @galipremsagar
Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
Fix an issue with IntervalIndex.repr when null values are present (#13958) @galipremsagar
Fix type metadata issue preservation with Column.unique (#13957) @galipremsagar
Handle Interval scalars when passed in list-like inputs to cudf.Index (#13956) @galipremsagar
Fix setting of categories order when dtype is passed to a CategoricalColumn (#13955) @galipremsagar
Handle as_index in GroupBy.apply (#13951) @brandon-b-miller
Raise error for string types in nsmallest and nlargest (#13946) @galipremsagar
Fix index of Groupby.apply results when it is performed on empty objects (#13944) @galipremsagar
Fix integer overflow in shim device_sum functions (#13943) @brandon-b-miller
Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
Fix construction of Grouping objects (#13932) @galipremsagar
Fix an issue with loc when column names is MultiIndex (#13929) @galipremsagar
Fix handling of typecasting in searchsorted (#13925) @galipremsagar
Preserve index name in reindex (#13917) @galipremsagar
Use cudf::thread_index_type in cuIO to prevent overflow in row indexing (#13910) @vuule
Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
Use cudf::thread_index_type in replace.cu. (#13905) @bdice
Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
Use thread_index_type to avoid index overflow in grid-stride loops (#13895) @PointKernel
Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
Raise error when trying to construct a DataFrame with mixed types (#13889) @galipremsagar
Return nan when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
Correctly detect the BOM mark in read_csv with compressed input (#13881) @vuule
Check for the presence of all values in MultiIndex.isin (#13879) @galipremsagar
Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
Fix return type of MultiIndex.levels (#13870) @galipremsagar
Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
Disable construction of Index when freq is set in pandas-compatibility mode (#13857) @galipremsagar
Fix an issue with fetching NA from a TimedeltaColumn (#13853) @galipremsagar
Simplify implementation of interval_range() and fix behaviour for floating freq (#13844) @shwina
Fix binary operations between Series and Index (#13842) @galipremsagar
Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
Fix read out of bounds in string concatenate (#13838) @pentschev
Raise error for more cases when timezone-aware data is passed to as_column (#13835) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
Fix cuFile I/O factories (#13829) @vuule
DataFrame with namedtuples uses ._fields as column names (#13824) @mroeschke
Branch 23.10 merge 23.08 (#13822) @vyasr
Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
Raise error when mixed types are being constructed (#13816) @galipremsagar
Fix unbounded sequence issue in DataFrame constructor (#13811) @galipremsagar
Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
Update get_arrow to Arrow 12's CMake target name of arrow::xsimd (#13790) @robertmaynard
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Fix negative unary operation for boolean type (#13780) @galipremsagar
Fix contains(in) method for Series (#13779) @galipremsagar
Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
Cast only time of day to nanos to avoid an overflow in Parquet INT96 write (#13776) @gerashegalov
Preserve names of column object in various APIs (#13772) @galipremsagar
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Fix construction of DataFrames from dict when columns are provided (#13766) @wence-
Provide our own Cython declaration for make_unique (#13746) @wence-
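
The INT96 fix above (#13776) avoids an overflow by converting only the time-of-day part of a timestamp to nanoseconds. A minimal sketch of that idea follows. It assumes the input is microseconds since the Unix epoch and uses the commonly described INT96 layout of 8 bytes of nanoseconds within the day followed by a 4-byte Julian day number. The helper names are hypothetical and this is not the libcudf code from the PR.

```python
from datetime import datetime, timedelta, timezone

INT64_MAX = 2**63 - 1
JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588  # Julian day number of 1970-01-01
MICROS_PER_DAY = 86_400 * 1_000_000


def to_int96_parts(micros_since_epoch: int) -> tuple:
    """Split a microsecond timestamp into (julian_day, nanos_of_day)."""
    # Split first, so only the time of day is scaled up to nanoseconds.
    days, micros_of_day = divmod(micros_since_epoch, MICROS_PER_DAY)
    nanos_of_day = micros_of_day * 1_000  # always below 86_400 * 10**9
    return JULIAN_DAY_OF_UNIX_EPOCH + days, nanos_of_day


def encode_int96(micros_since_epoch: int) -> bytes:
    """Pack the two parts as 8 + 4 little-endian bytes (12 bytes total)."""
    julian_day, nanos_of_day = to_int96_parts(micros_since_epoch)
    return nanos_of_day.to_bytes(8, "little") + julian_day.to_bytes(4, "little")


# A timestamp in the year 3000 is far outside the int64-nanosecond range.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
far_future = datetime(3000, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
micros = (far_future - epoch) // timedelta(microseconds=1)

naive_total_nanos = micros * 1_000
print("whole timestamp in nanos fits int64:", naive_total_nanos <= INT64_MAX)

julian_day, nanos_of_day = to_int96_parts(micros)
print("nanos of day fits int64:", nanos_of_day <= INT64_MAX)
print("encoded length:", len(encode_int96(micros)))
```

Scaling the whole timestamp to nanoseconds overflows a signed 64-bit integer for dates after roughly the year 2262, whereas the nanoseconds within a single day always fit comfortably.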

📖 Documentation

Fix typo in docstring: metadata. (#14025) @bdice
Fix typo in parquet/page_decode.cuh (#13849) @XinyuZeng
Simplify Python doc configuration (#13826) @vyasr
Update documentation to reflect recent changes in JSON reader and writer (#13791) @vuule
Fix all warnings in Python docs (#13789) @vyasr

🚀 New Features

[Java] Add JNI bindings for integers_to_hex (#14205) @razajafri
Propagate errors from Parquet reader kernels back to host (#14167) @vuule
JNI for HISTOGRAM and MERGE_HISTOGRAM aggregations (#14154) @ttnghia
Expose streams in all public sorting APIs (#14146) @vyasr
Enable direct ingestion and production of Arrow scalars (#14121) @vyasr
Implement GroupBy.value_counts to match pandas API (#14114) @stmio
Refactor parquet thrift reader (#14097) @etseidl
Refactor hash_reduce_by_row (#14095) @ttnghia
Support negative preceding/following for ROW window functions (#14093) @mythrocks
Support for progressive parquet chunked reading. (#14079) @nvdbaranec
Implement HISTOGRAM and MERGE_HISTOGRAM aggregations (#14045) @ttnghia
Expose streams in public search APIs (#14034) @vyasr
Expose streams in public replace APIs (#14010) @vyasr
Add stream parameter to public cudf::strings::split APIs (#13997) @davidwendt
Expose streams in public filling APIs (#13990) @vyasr
Expose streams in public concatenate APIs (#13987) @vyasr
Use HostMemoryAllocator in jni::allocate_host_buffer (#13975) @gerashegalov
Enable fractional null probability for hashing benchmark (#13967) @Blonck
Switch pylibcudf-enabled types to use enum class in Cython (#13931) @vyasr
Add nvtext::tokenize_with_vocabulary API (#13930) @davidwendt
Rewrite DataFrame.stack to support multi level column names (#13927) @isVoid
Add HostMemoryAllocator interface (#13924) @gerashegalov
Global stream pool (#13922) @etseidl
Create table_input_metadata from a table_metadata (#13920) @etseidl
Translate column size overflow exception to JNI (#13911) @mythrocks
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Exclude some tests from running with the compute sanitizer (#13872) @firestarman
Expand statistics support in ORC writer (#13848) @vuule
Register the memory mapped buffer in datasource to improve H2D throughput (#13814) @vuule
Add cudf::strings::find function with target per row (#13808) @davidwendt
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove unnecessary pointer copying in JIT GroupBy Apply (#13792) @brandon-b-miller
Add 'poll' function to custreamz kafka consumer (#13782) @jdye64
Support corr in GroupBy.apply through the jit engine (#13767) @shwina
Optionally write version 2 page headers in Parquet writer (#13751) @etseidl
Support more numeric types in Groupby.apply with engine='jit' (#13729) @brandon-b-miller
[FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader (#13637) @etseidl
Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader (#13437) @PointKernel

🛠️ Improvements

Pin dask and distributed for 23.10 release (#14225) @galipremsagar
Update rmm tag path (#14195) @AyodeAwe
Disable Recently Updated Check (#14193) @ajschmidt8
Move cpp/src/hash/hash_allocator.cuh to include/cudf/hashing/detail (#14163) @davidwendt
Add Parquet reader benchmarks for row selection (#14147) @vuule
Update image names (#14145) @AyodeAwe
Support callables in DataFrame.assign (#14142) @wence-
Reduce memory usage of as_categorical_column (#14138) @wence-
Replace Python scalar conversions with libcudf (#14124) @vyasr
Update to clang 16.0.6. (#14120) @bdice
Fix type of empty Index and raise warning in Series constructor (#14116) @galipremsagar
Add stream parameter to external dict APIs (#14115) @SurajAralihalli
Add fallback matrix for nvcomp. (#14082) @bdice
[Java] Add recoverWithNull to JSONOptions and pass to Table.readJSON (#14078) @andygrove
Remove header tests (#14072) @ajschmidt8
Refactor contains_table with cuco::static_set (#14064) @PointKernel
Remove debug print in a Parquet test (#14063) @vuule
Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
Expose stream parameter in public strings find APIs (#14060) @davidwendt
Update doxygen to 1.9.1 (#14059) @vyasr
Remove the mr from the base fixture (#14057) @vyasr
Expose streams in public strings case APIs (#14056) @davidwendt
Refactor libcudf indexalator to typed normalator (#14043) @davidwendt
Use cudf::make_empty_column instead of column_view constructor (#14030) @davidwendt
Remove quadratic runtime due to accessing Frame._dtypes in loop (#14028) @wence-
Explicitly depend on zlib in conda recipes (#14018) @wence-
Use grid_stride for stride computations. (#13996) @bdice
Fix an issue where casting null-array to object dtype will result in a failure (#13994) @galipremsagar
Add tab as literal to cudf::test::to_string output (#13993) @davidwendt
Enable codes dtype parity in pandas-compatibility mode for factorize API (#13982) @galipremsagar
Fix CategoricalIndex ordering in Groupby.agg when pandas-compatibility mode is enabled (#13978) @galipremsagar
Produce a fatal error if cudf is unable to find pyarrow include directory (#13976) @cwharris
Use thread_index_type in partitioning.cu (#13973) @divyegala
Use cudf::thread_index_type in merge.cu (#13972) @divyegala
Use copy-pr-bot (#13970) @ajschmidt8
Use cudf::thread_index_type in strings custom kernels (#13968) @davidwendt
Add bytes_per_second to hash_partition benchmark (#13965) @Blonck
Added pinned pool reservation API for java (#13964) @revans2
Simplify wheel build scripts and allow alphas of RAPIDS dependencies (#13963) @vyasr
Add bytes_per_second to copy_if_else benchmark (#13960) @Blonck
Add pandas compatible output to Series.unique (#13959) @galipremsagar
Add bytes_per_second to compiled binaryop benchmark (#13938) @Blonck
Unpin dask and distributed for 23.10 development (#13935) @galipremsagar
Make HostColumnVector.getRefCount public (#13934) @abellina
Use cuco::static_set in JSON tree algorithm (#13928) @karthikeyann
Add java API to get size of host memory needed to copy column view (#13919) @revans2
Use cudf::size_type instead of int32 where appropriate in nvtext functions (#13915) @davidwendt
Enable hugepage for arrow host allocations (#13914) @madsbk
Improve performance of nvtext::edit_distance (#13912) @davidwendt
Ensure cudf internals use pylibcudf in pure Python mode (#13909) @vyasr
Use empty() instead of size() where possible (#13908) @vuule
[JNI] Adds HostColumnVector.EventHandler for spillability checks (#13898) @abellina
Return Timestamp & Timedelta for fetching scalars in DatetimeIndex & TimedeltaIndex (#13896) @galipremsagar
Allow explicit shuffle="p2p" within dask-cudf API (#13893) @rjzamora
Disable creation of DatetimeIndex when freq is passed to cudf.date_range (#13890) @galipremsagar
Bring parity with pandas for datetime & timedelta comparison operations (#13877) @galipremsagar
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Raise error when astype(object) is called in pandas compatibility mode (#13862) @galipremsagar
Fixes a performance regression in FST (#13850) @elstehle
Set native handles to null on close in Java wrapper classes (#13818) @jlowe
Avoid use of CUDF_EXPECTS in libcudf unit tests outside of helper functions with return values (#13812) @vuule
Update lists::contains to experimental row comparator (#13810) @divyegala
Reduce lists::contains dispatches for scalars (#13805) @divyegala
Long string optimization for string column parsing in JSON reader (#13803) @karthikeyann
Raise NotImplementedError for pd.SparseDtype (#13798) @mroeschke
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Move Spark-independent Table debug to cudf Java (#13783) @gerashegalov
Update to Cython 3.0.0 (#13777) @vyasr
Refactor Parquet reader handling of V2 page header info (#13775) @etseidl
Branch 23.10 merge 23.08 (#13773) @vyasr
Restructure JSON code to correctly reflect legacy/experimental status (#13757) @vuule
Branch 23.10 merge 23.08 (#13753) @vyasr
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Refactors JSON reader's pushdown automaton (#13716) @elstehle
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

cudf - v23.08.00

Published by raydouglass about 1 year ago

🚨 Breaking Changes

Enforce deprecations and add clarifications around existing deprecations (#13710) @galipremsagar
Separate MurmurHash32 from hash_functions.cuh (#13681) @davidwendt
Avoid storing metadata in pointers in ORC and Parquet writers (#13648) @vuule
Expose streams in all public copying APIs (#13629) @vyasr
Remove deprecated cudf::strings::slice_strings (by delimiter) functions (#13628) @davidwendt
Remove deprecated cudf.set_allocator. (#13591) @bdice
Change build.sh to use pip install instead of setup.py (#13507) @vyasr
Remove unused max_rows_tensor parameter from subword tokenizer (#13463) @davidwendt
Fix decimal scale reductions in _get_decimal_type (#13224) @charlesbluca

🐛 Bug Fixes

Add CUDA version to cudf_kafka and libcudf-example build strings. (#13769) @bdice
Fix typo in wheels-test.yaml. (#13763) @bdice
Don't test strings shorter than the requested ngram size (#13758) @vyasr
Add CUDA version to custreamz build string. (#13754) @bdice
Fix writing of ORC files with empty child string columns (#13745) @vuule
Remove the erroneous "empty level" short-circuit from ORC reader (#13722) @vuule
Fix character counting when writing sliced tables into ORC (#13721) @vuule
Parquet uses row group row count if missing from header (#13712) @hyperbolic2346
Fix reading of RLE encoded boolean data from parquet files with V2 page headers (#13707) @etseidl
Fix a corner case of list lexicographic comparator (#13701) @ttnghia
Fix combined filtering and column projection in dask_cudf.read_parquet (#13697) @rjzamora
Revert fetch-rapids changes (#13696) @vyasr
Data generator - include offsets in the size estimate of list elements (#13688) @vuule
Add cuda-nvcc-impl to cudf for numba CUDA 12 (#13673) @jakirkham
Fix combined filtering and column projection in read_parquet (#13666) @rjzamora
Use thrust::identity as hash functions for byte pair encoding (#13665) @PointKernel
Fix loc-getitem ordering when index contains duplicate labels (#13659) @wence-
[REVIEW] Introduce parity with pandas for MultiIndex.loc ordering & fix a bug in Groupby with as_index (#13657) @galipremsagar
Fix memcheck error found in nvtext tokenize functions (#13649) @davidwendt
Fix has_nonempty_nulls ignoring column offset (#13647) @ttnghia
[Java] Avoid double-free corruption in case of an Exception while creating a ColumnView (#13645) @razajafri
Fix memcheck error in ORC reader call to cudf::io::copy_uncompressed_kernel (#13643) @davidwendt
Fix CUDA 12 conda environment to remove cubinlinker and ptxcompiler. (#13636) @bdice
Fix inf/NaN comparisons for FLOAT orderby in window functions (#13635) @mythrocks
Refactor Index search to simplify code and increase correctness (#13625) @wence-
Fix compile warning for unused variable in split_re.cu (#13621) @davidwendt
Fix tz_localize for dask_cudf Series (#13610) @shwina
Fix issue with no decompressed data in ORC reader (#13609) @vuule
Fix floating point window range extents. (#13606) @mythrocks
Fix localize(None) for timezone-naive columns (#13603) @shwina
Fixed a memory leak caused by Exception thrown while constructing a ColumnView (#13597) @razajafri
Handle nullptr return value from bitmask_or in distinct_count (#13590) @wence-
Bring parity with pandas in Index.join (#13589) @galipremsagar
Fix cudf.melt when there are more than 255 columns (#13588) @hcho3
Fix memory issues in cuIO due to removal of memory padding (#13586) @ttnghia
Fix Parquet multi-file reading (#13584) @etseidl
Fix memcheck error found in LISTS_TEST (#13579) @davidwendt
Fix memcheck error found in STRINGS_TEST (#13578) @davidwendt
Fix memcheck error found in INTEROP_TEST (#13577) @davidwendt
Fix memcheck errors found in REDUCTION_TEST (#13574) @davidwendt
Preemptive fix for hive-partitioning change in dask (#13564) @rjzamora
Fix an issue with dask_cudf.read_csv when lines are needed to be skipped (#13555) @galipremsagar
Fix out-of-bounds memory write in cudf::dictionary::detail::concatenate (#13554) @davidwendt
Fix the null mask size in json reader (#13537) @karthikeyann
Fix cudf::strings::strip for all-empty input column (#13533) @davidwendt
Make sure to build without isolation or installing dependencies (#13524) @vyasr
Remove preload lib from CMake for now (#13519) @vyasr
Fix missing separator after null values in JSON writer (#13503) @karthikeyann
Ensure single_lane_block_sum_reduce is safe to call in a loop (#13488) @wence-
Update all versions in pyproject.toml files. (#13486) @bdice
Remove applying nvbench that doesn't exist in 23.08 (#13484) @robertmaynard
Fix chunked Parquet reader benchmark (#13482) @vuule
Update JNI JSON reader column compatibility for Spark (#13477) @revans2
Fix unsanitized output of scan with strings (#13455) @davidwendt
Reject functions without bytecode from _can_be_jitted in GroupBy Apply (#13429) @brandon-b-miller
Fix decimal scale reductions in _get_decimal_type (#13224) @charlesbluca

📖 Documentation

Fix doxygen groups for io data sources and sinks (#13718) @davidwendt
Add pandas compatibility note to DataFrame.query docstring (#13693) @beckernick
Add pylibcudf to developer guide (#13639) @vyasr
Fix repeated words in doxygen text (#13598) @karthikeyann
Update docs for top-level API. (#13592) @bdice
Fix the the doxygen text for cudf::concatenate and other places (#13561) @davidwendt
Document stream validation approach used in testing (#13556) @vyasr
Cleanup doc repetitions in libcudf (#13470) @karthikeyann

🚀 New Features

Support min and max aggregations for list type in groupby and reduction (#13676) @ttnghia
Add nvtext::jaccard_index API for strings columns (#13669) @davidwendt
Add read_parquet_metadata libcudf API (#13663) @karthikeyann
Expose streams in all public copying APIs (#13629) @vyasr
Add XXHash_64 hash function to cudf (#13612) @davidwendt
Java support: Floating point order-by columns for RANGE window functions (#13595) @mythrocks
Use cuco::static_map to build string dictionaries in ORC writer (#13580) @vuule
Add pylibcudf subpackage with gather implementation (#13562) @vyasr
Add JNI for lists::concatenate_list_elements (#13547) @ttnghia
Enable nested types for lists::concatenate_list_elements (#13545) @ttnghia
Add unicode encoding for string columns in JSON writer (#13539) @karthikeyann
Remove numba kernels from find_index_of_val (#13517) @brandon-b-miller
Floating point order-by columns for RANGE window functions (#13512) @mythrocks
Parse column chunk metadata statistics in parquet reader (#13472) @karthikeyann
Add abs function to apply (#13408) @brandon-b-miller
[FEA] AST filtering in parquet reader (#13348) @karthikeyann
[FEA] Adds option to recover from invalid JSON lines in JSON tokenizer (#13344) @elstehle
Ensure cccl packages don't clash with upstream version (#13235) @robertmaynard
Update struct_minmax_util to experimental row comparator (#13069) @divyegala
Add stream parameter to hashing APIs (#12090) @vyasr

🛠️ Improvements

Pin dask and distributed for 23.08 release (#13802) @galipremsagar
Relax protobuf pinnings. (#13770) @bdice
Switch fully unbounded window functions to use aggregations (#13727) @mythrocks
Switch to new wheel building pipeline (#13723) @vyasr
Revert CUDA 12.0 CI workflows to branch-23.08. (#13719) @bdice
Adding identify minimum version requirement (#13713) @hyperbolic2346
Enforce deprecations and add clarifications around existing deprecations (#13710) @galipremsagar
Optimize ORC reader performance for list data (#13708) @vyasr
Fix limit overflow message in a docstring (#13703) @ahmet-uyar
Alleviates JSON parser's need for multi-file sources to end with a newline (#13702) @elstehle
Update cython-lint and replace flake8 with ruff (#13699) @vyasr
Add __dask_tokenize__ definitions to cudf classes (#13695) @rjzamora
Convert libcudf hashing benchmarks to nvbench (#13694) @davidwendt
Separate MurmurHash32 from hash_functions.cuh (#13681) @davidwendt
Improve performance of cudf::strings::split on whitespace (#13680) @davidwendt
Allow ORC and Parquet writers to write nullable columns without nulls as non-nullable (#13675) @vuule
Raise a NotImplementedError in to_datetime when utc is passed (#13670) @shwina
Add rmm_mode parameter to nvbench base fixture (#13668) @davidwendt
Fix multiindex loc ordering in pandas-compat mode (#13660) @wence-
Add nvtext hash_character_ngrams function (#13654) @davidwendt
Avoid storing metadata in pointers in ORC and Parquet writers (#13648) @vuule
Acquire spill lock in to/from_arrow (#13646) @shwina
Expose stable versions of libcudf sort routines (#13634) @wence-
Separate out hash_test.cpp source for each hash API (#13633) @davidwendt
Remove deprecated cudf::strings::slice_strings (by delimiter) functions (#13628) @davidwendt
Create separate libcudf hash APIs for each supported hash function (#13626) @davidwendt
Add convert_dtypes API (#13623) @shwina
Clean up cupy in dependencies.yaml. (#13617) @bdice
Use cuda-version to constrain cudatoolkit. (#13615) @bdice
Add murmurhash3_x64_128 function to libcudf (#13604) @davidwendt
Performance improvement for cudf::strings::like (#13594) @davidwendt
Remove deprecated cudf.set_allocator. (#13591) @bdice
Clean up cudf device atomic with cuda::atomic_ref (#13583) @PointKernel
Add java bindings for distinct count (#13573) @revans2
Use nvcomp conda package. (#13566) @bdice
Add exception to string_scalar if input string exceeds size_type (#13560) @davidwendt
Add dispatch for cudf.Dataframe to/from pyarrow.Table conversion (#13558) @rjzamora
Get rid of cuco::pair_type aliases (#13553) @PointKernel
Introduce parity with pandas when sort=False in Groupby (#13551) @galipremsagar
Update CMake in docker to 3.26.4 (#13550) @NvTimLiu
Clarify source of error message in stream testing. (#13541) @bdice
Deprecate strings_to_categorical in cudf.read_parquet (#13540) @galipremsagar
Update to CMake 3.26.4 (#13538) @vyasr
S3 folder naming fix (#13536) @AyodeAwe
Implement iloc-getitem using parse-don't-validate approach (#13534) @wence-
Make synchronization explicit in the names of hostdevice_* copying APIs (#13530) @ttnghia
Add benchmark (Google Benchmark) dependency to conda packages. (#13528) @bdice
Add libcufile to dependencies.yaml. (#13523) @bdice
Fix some memoization logic in groupby/sort/sort_helper.cu (#13521) @davidwendt
Use sizes_to_offsets_iterator in cudf::gather for strings (#13520) @davidwendt
Use rapids-upload-docs script (#13518) @AyodeAwe
Support UTF-8 BOM in CSV reader (#13516) @davidwendt
Move stream-related test configuration to CMake (#13513) @vyasr
Implement cudf.option_context (#13511) @galipremsagar
Unpin dask and distributed for development (#13508) @galipremsagar
Change build.sh to use pip install instead of setup.py (#13507) @vyasr
Use test default stream (#13506) @vyasr
Remove documentation build scripts for Jenkins (#13495) @ajschmidt8
Use east const in include files (#13494) @karthikeyann
Use east const in src files (#13493) @karthikeyann
Use east const in tests files (#13492) @karthikeyann
Use east const in benchmarks files (#13491) @karthikeyann
Performance improvement for nvtext tokenize/token functions (#13480) @davidwendt
Add pd.Float*Dtype to Avro and ORC mappings (#13475) @mroeschke
Use pandas public APIs where available (#13467) @mroeschke
Allow pd.ArrowDtype in cudf.from_pandas (#13465) @mroeschke
Rework libcudf regex benchmarks with nvbench (#13464) @davidwendt
Remove unused max_rows_tensor parameter from subword tokenizer (#13463) @davidwendt
Separate io-text and nvtext pytests into different files (#13435) @davidwendt
Add a move_to function to cudf::string_view::const_iterator (#13428) @davidwendt
Allow newer scikit-build (#13424) @vyasr
Refactor sort_by_values to sort_values, drop indices from return values. (#13419) @bdice
Inline Cython exception handler (#13411) @vyasr
Init JNI version 23.08.0-SNAPSHOT (#13401) @pxLi
Refactor ORC reader (#13396) @ttnghia
JNI: Remove cleaned objects in memory cleaner (#13378) @res-life
Add tests of currently unsupported indexing (#13338) @wence-
Performance improvement for some libcudf regex functions for long strings (#13322) @davidwendt
Exposure Tracked Buffer (first step towards unifying copy-on-write and spilling) (#13307) @madsbk
Write string data directly to column_buffer in Parquet reader (#13302) @etseidl
Add stacktrace into cudf exception types (#13298) @ttnghia
cuDF: Build CUDA 12 packages (#12922) @bdice

cudf - v23.06.01

Published by raydouglass over 1 year ago

🚨 Breaking Changes

Fix batch processing for parquet writer (#13438) @ttnghia
Use <NA> instead of null to match pandas. (#13415) @bdice
Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
Remove null mask and null count from column_view constructors (#13311) @vyasr
Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
Update minimum Python version to Python 3.9 (#13196) @shwina
Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
Cleanup Parquet chunked writer (#13094) @ttnghia
Cleanup ORC chunked writer (#13091) @ttnghia
Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
Remove deprecated regex functions from libcudf (#13067) @davidwendt
[REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🐛 Bug Fixes

Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
Fix writing of ORC files with empty rowgroups (#13466) @vuule
Fix cudf::repeat logic when count is zero (#13459) @davidwendt
Fix batch processing for parquet writer (#13438) @ttnghia
Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
Use <NA> instead of null to match pandas. (#13415) @bdice
Fix tokenize with non-space delimiter (#13403) @shwina
Fix groupby head/tail for empty dataframe (#13398) @shwina
Default to closed="right" in IntervalIndex constructor (#13394) @shwina
Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
Fix unused argument errors in nvcc 11.5 (#13387) @abellina
Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
Fix page size estimation in Parquet writer (#13364) @etseidl
Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
Support gcc 12 as the C++ compiler (#13316) @robertmaynard
Correctly set bitmask size in from_column_view (#13315) @wence-
Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
Fix parquet schema interpretation issue (#13277) @hyperbolic2346
Fix 64bit shift bug in avro reader (#13276) @karthikeyann
Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
Clean up buffers in case of AssertionError (#13262) @razajafri
Allow empty input table in ast compute_column (#13245) @wence-
Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
Fix the row index stream order in ORC reader (#13242) @vuule
Make is_decompression_disabled and is_compression_disabled thread-safe (#13240) @vuule
Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
Fix race in ORC string dictionary creation (#13214) @revans2
Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
Fix hostdevice_vector::subspan (#13187) @ttnghia
Use custom nvbench entry point to ensure cudf::nvbench_base_fixture usage (#13183) @robertmaynard
Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
Fix a few clang-format style check errors (#13146) @davidwendt
[REVIEW] Fix Series and DataFrame constructors to validate index lengths (#13122) @galipremsagar
Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
Adds checks to make sure json reader won't overflow (#13115) @elstehle
Fix null_count of columns returned by chunked_parquet_reader (#13111) @vuule
Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
[REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
Fix column selection read_parquet benchmarks (#13082) @vuule
Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
Add algorithm include in data_sink.hpp (#13068) @ahendriksen
Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
Prevent overflow with skip_rows in ORC and Parquet readers (#13063) @vuule
Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
[REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
Fix read_avro() skip_rows and num_rows. (#12912) @tpn
Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina

🚀 New Features

Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
Use _compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
cuDF numba cuda 12 updates (#13337) @brandon-b-miller
Add tz_convert method to convert between timestamps (#13328) @shwina
Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
Support the case=False argument to str.contains (#13290) @shwina
Add an event handler for ColumnVector.close (#13279) @abellina
JNI api for cudf::chunked_pack (#13278) @abellina
Implement a chunked_pack API (#13260) @abellina
Update cudf recipes to use GTest version to >=1.13 (#13207) @robertmaynard
JNI changes for range-extents in window functions. (#13199) @mythrocks
Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
Add IS_NULL operator to AST (#13145) @karthikeyann
STRING order-by column for RANGE window functions (#13143) @mythrocks
Update contains_table to experimental row hasher and equality comparator (#13119) @divyegala
Automatically select GroupBy.apply algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
Refactor Parquet chunked writer (#13076) @ttnghia
Add Python bindings for string literal support in AST (#13073) @karthikeyann
Add Java bindings for string literal support in AST (#13072) @karthikeyann
Add string scalar support in AST (#13061) @karthikeyann
Log cuIO warnings using the libcudf logger (#13043) @vuule
Update mixed_join to use experimental row hasher and comparator (#13028) @divyegala
Support structs of lists in row lexicographic comparator (#13005) @ttnghia
Adding hostdevice_span that is a span creatable from hostdevice_vector (#12981) @hyperbolic2346
Add nvtext::minhash function (#12961) @davidwendt
Support lists of structs in row lexicographic comparator (#12953) @ttnghia
Update join to use experimental row hasher and comparator (#12787) @divyegala
Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🛠️ Improvements

Bump typing_extensions minimum version to 4.0.0 (#13618) @shwina
Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
Handle some corner-cases in indexing with boolean masks (#13402) @wence-
Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
[JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
Fix JNI method with mismatched parameter list (#13384) @ttnghia
Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
Move some nvtext benchmarks to nvbench (#13368) @davidwendt
Run docs nightly too (#13366) @AyodeAwe
Add warning for default dtype parameter in get_dummies (#13365) @galipremsagar
Add log messages about kvikIO compatibility mode (#13363) @vuule
Switch back to using primary shared-action-workflows branch (#13362) @vyasr
Deprecate StringIndex and use Index instead (#13361) @galipremsagar
Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
Expunge most uses of TypeVar(bound="Foo") (#13346) @wence-
Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
Improve distinct_count with cuco::static_set (#13343) @PointKernel
Fix contiguous_split performance (#13342) @ttnghia
Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
Update mypy to 1.3 (#13340) @wence-
[Java] Purge non-empty nulls when setting validity (#13335) @razajafri
Add row-wise filtering step to read_parquet (#13334) @rjzamora
Performance improvement for nvtext::minhash (#13333) @davidwendt
Fix some libcudf functions to set the null count on returning columns (#13331) @davidwendt
Change cudf::detail::concatenate_masks to return null-count (#13330) @davidwendt
Move meta calculation in dask_cudf.read_parquet (#13327) @rjzamora
Changes to support Numpy >= 1.24 (#13325) @shwina
Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
Clean up distinct_count benchmark (#13321) @PointKernel
Fix gtest pinning to 1.13.0. (#13319) @bdice
Remove null mask and null count from column_view constructors (#13311) @vyasr
Address feedback from 13289 (#13306) @vyasr
Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
First check for BaseDtype when inferring the data type of an arbitrary object (#13295) @shwina
Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
Support CUDA 12.0 for pip wheels (#13289) @divyegala
Refactor transform_lists_of_structs in row_operators.cu (#13288) @ttnghia
Branch 23.06 merge 23.04 (#13286) @vyasr
Update cupy dependency (#13284) @vyasr
Performance improvement in cudf::strings::join_strings for long strings (#13283) @davidwendt
Fix unused variables and functions (#13275) @karthikeyann
Fix integer overflow in partition scatter_map construction (#13272) @wence-
Numba 0.57 compatibility fixes (#13271) @gmarkall
Performance improvement in cudf::strings::all_characters_of_type (#13259) @davidwendt
Remove default null-count parameter from some libcudf factory functions (#13258) @davidwendt
Roll our own generate_string() because mimesis' has gone away (#13257) @shwina
Build wheels using new single image workflow (#13249) @vyasr
Enable sccache hits from local builds (#13248) @AyodeAwe
Revert to branch-23.06 for shared-action-workflows (#13247) @shwina
Introduce pandas_compatible option in cudf (#13241) @galipremsagar
Add metadata_builder helper class (#13232) @abellina
Use libkvikio conda packages in libcudf, add explicit libcufile dependency. (#13231) @bdice
Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
Performance improvement in cudf::strings::find/rfind for long strings (#13226) @davidwendt
Add chunked reader benchmark (#13223) @SrikarVanavasam
Set the null count in output columns in the CSV reader (#13221) @vuule
Skip Non-Empty nulls tests for the nightly build just like we skip CuFileTest and CudaFatalTest (#13213) @razajafri
Fix string_scalar stream usage in write_json.cu (#13212) @davidwendt
Use canonicalized name for dlopen'd libraries (libcufile) (#13210) @shwina
Refactor pinned memory vector and ORC+Parquet writers (#13206) @ttnghia
Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
Optimization to decoding of parquet level streams (#13203) @nvdbaranec
Clean up and simplify gpuDecideCompression (#13202) @vuule
Use std::array for a statically sized vector in create_serialized_trie (#13201) @vuule
Update minimum Python version to Python 3.9 (#13196) @shwina
Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
Remove usage of rapids-get-rapids-version-from-git (#13184) @jjacobelli
Enable mixed-dtype decimal/scalar binary operations (#13171) @shwina
Split up unique_count.cu to improve build time (#13169) @davidwendt
Use nvtx3 includes in string examples. (#13165) @bdice
Change some .cu gtest files to .cpp (#13155) @davidwendt
Remove wheel pytest verbosity (#13151) @sevagh
Fix libcudf to always pass null-count to set_null_mask (#13149) @davidwendt
Fix gtests to always pass null-count to set_null_mask calls (#13148) @davidwendt
Optimize JSON writer (#13144) @karthikeyann
Performance improvement for libcudf upper/lower conversion for long strings (#13142) @davidwendt
[REVIEW] Deprecate pad and backfill methods (#13140) @galipremsagar
Use CTAD instead of functions in ProtobufReader (#13135) @vuule
Remove more instances of UNKNOWN_NULL_COUNT (#13134) @vyasr
Update clang-format to 16.0.1. (#13133) @bdice
Add log messages about cuIO's nvCOMP and cuFile use (#13132) @vuule
Branch 23.06 merge 23.04 (#13131) @vyasr
Compute null-count in cudf::detail::slice (#13124) @davidwendt
Use ARC V2 self-hosted runners for GPU jobs (#13123) @jjacobelli
Set null-count in linked_column_view conversion operator (#13121) @davidwendt
Adding ifdefs around nvcc-specific pragmas (#13110) @hyperbolic2346
Add null-count parameter to json experimental parse_data utility (#13107) @davidwendt
Remove uses-setup-env-vars (#13105) @vyasr
Explicitly compute null count in concatenate APIs (#13104) @vyasr
Replace unnecessary uses of UNKNOWN_NULL_COUNT (#13102) @vyasr
Performance improvement for cudf::string_view::find functions (#13100) @davidwendt
Use .element() instead of .data() for window range calculations (#13095) @mythrocks
Cleanup Parquet chunked writer (#13094) @ttnghia
Fix unused variable error/warning in page_data.cu (#13093) @davidwendt
Cleanup ORC chunked writer (#13091) @ttnghia
Remove using namespace cudf; from libcudf gtests source (#13089) @davidwendt
Change cudf::test::make_null_mask to also return null-count (#13081) @davidwendt
Resolved automerger from branch-23.04 to branch-23.06 (#13080) @galipremsagar
Assert for non-empty nulls (#13071) @razajafri
Remove deprecated regex functions from libcudf (#13067) @davidwendt
Refactor cudf::detail::sorted_order (#13062) @ttnghia
Improve performance of slice_strings for long strings (#13057) @davidwendt
Reduce shared memory usage in gpuComputePageSizes by 50% (#13047) @nvdbaranec
[REVIEW] Add notes to performance comparisons notebook (#13044) @galipremsagar
Enable binary operations between scalars and columns of differing decimal types (#13034) @shwina
Remove console output from some libcudf gtests (#13027) @davidwendt
Remove underscore in build string. (#13025) @bdice
Bump up JNI version 23.06.0-SNAPSHOT (#13021) @pxLi
Fix auto merger from branch-23.04 to branch-23.06 (#13009) @galipremsagar
Reduce peak memory use when writing compressed ORC files. (#12963) @vuule
Add nvtx annotations to groupby methods (#12941) @wence-
Compute column sizes in Parquet preprocess with single kernel (#12931) @SrikarVanavasam
Add Python bindings for time zone data (TZiF) reader (#12826) @shwina
Optimize set-like operations (#12769) @ttnghia
[REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
Add empty test files for test reorganization (#12288) @shwina

cudf - v23.06.00

Published by raydouglass over 1 year ago

🚨 Breaking Changes

Fix batch processing for parquet writer (#13438) @ttnghia
Use <NA> instead of null to match pandas. (#13415) @bdice
Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
Remove null mask and null count from column_view constructors (#13311) @vyasr
Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
Update minimum Python version to Python 3.9 (#13196) @shwina
Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
Cleanup Parquet chunked writer (#13094) @ttnghia
Cleanup ORC chunked writer (#13091) @ttnghia
Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
Remove deprecated regex functions from libcudf (#13067) @davidwendt
[REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🐛 Bug Fixes

Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
Fix writing of ORC files with empty rowgroups (#13466) @vuule
Fix cudf::repeat logic when count is zero (#13459) @davidwendt
Fix batch processing for parquet writer (#13438) @ttnghia
Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
Use <NA> instead of null to match pandas. (#13415) @bdice
Fix tokenize with non-space delimiter (#13403) @shwina
Fix groupby head/tail for empty dataframe (#13398) @shwina
Default to closed="right" in IntervalIndex constructor (#13394) @shwina
Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
Fix unused argument errors in nvcc 11.5 (#13387) @abellina
Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
Fix page size estimation in Parquet writer (#13364) @etseidl
Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
Support gcc 12 as the C++ compiler (#13316) @robertmaynard
Correctly set bitmask size in from_column_view (#13315) @wence-
Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
Fix parquet schema interpretation issue (#13277) @hyperbolic2346
Fix 64bit shift bug in avro reader (#13276) @karthikeyann
Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
Clean up buffers in case AssertionError (#13262) @razajafri
Allow empty input table in ast compute_column (#13245) @wence-
Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
Fix the row index stream order in ORC reader (#13242) @vuule
Make is_decompression_disabled and is_compression_disabled thread-safe (#13240) @vuule
Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
Fix race in ORC string dictionary creation (#13214) @revans2
Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
Fix hostdevice_vector::subspan (#13187) @ttnghia
Use custom nvbench entry point to ensure cudf::nvbench_base_fixture usage (#13183) @robertmaynard
Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
Fix a few clang-format style check errors (#13146) @davidwendt
[REVIEW] Fix Series and DataFrame constructors to validate index lengths (#13122) @galipremsagar
Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
Adds checks to make sure json reader won't overflow (#13115) @elstehle
Fix null_count of columns returned by chunked_parquet_reader (#13111) @vuule
Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
[REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
Fix column selection read_parquet benchmarks (#13082) @vuule
Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
Add algorithm include in data_sink.hpp (#13068) @ahendriksen
Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
Prevent overflow with skip_rows in ORC and Parquet readers (#13063) @vuule
Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
[REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
Fix read_avro() skip_rows and num_rows. (#12912) @tpn
Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina

🚀 New Features

Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
Use _compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
cuDF numba cuda 12 updates (#13337) @brandon-b-miller
Add tz_convert method to convert between timestamps (#13328) @shwina
Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
Support the case=False argument to str.contains (#13290) @shwina
Add an event handler for ColumnVector.close (#13279) @abellina
JNI api for cudf::chunked_pack (#13278) @abellina
Implement a chunked_pack API (#13260) @abellina
Update cudf recipes to use GTest version >=1.13 (#13207) @robertmaynard
JNI changes for range-extents in window functions. (#13199) @mythrocks
Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
Add IS_NULL operator to AST (#13145) @karthikeyann
STRING order-by column for RANGE window functions (#13143) @mythrocks
Update contains_table to experimental row hasher and equality comparator (#13119) @divyegala
Automatically select GroupBy.apply algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
Refactor Parquet chunked writer (#13076) @ttnghia
Add Python bindings for string literal support in AST (#13073) @karthikeyann
Add Java bindings for string literal support in AST (#13072) @karthikeyann
Add string scalar support in AST (#13061) @karthikeyann
Log cuIO warnings using the libcudf logger (#13043) @vuule
Update mixed_join to use experimental row hasher and comparator (#13028) @divyegala
Support structs of lists in row lexicographic comparator (#13005) @ttnghia
Adding hostdevice_span that is a span createable from hostdevice_vector (#12981) @hyperbolic2346
Add nvtext::minhash function (#12961) @davidwendt
Support lists of structs in row lexicographic comparator (#12953) @ttnghia
Update join to use experimental row hasher and comparator (#12787) @divyegala
Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🛠️ Improvements

Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
Handle some corner-cases in indexing with boolean masks (#13402) @wence-
Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
[JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
Fix JNI method with mismatched parameter list (#13384) @ttnghia
Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
Move some nvtext benchmarks to nvbench (#13368) @davidwendt
run docs nightly too (#13366) @AyodeAwe
Add warning for default dtype parameter in get_dummies (#13365) @galipremsagar
Add log messages about kvikIO compatibility mode (#13363) @vuule
Switch back to using primary shared-action-workflows branch (#13362) @vyasr
Deprecate StringIndex and use Index instead (#13361) @galipremsagar
Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
Expunge most uses of TypeVar(bound="Foo") (#13346) @wence-
Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
Improve distinct_count with cuco::static_set (#13343) @PointKernel
Fix contiguous_split performance (#13342) @ttnghia
Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
Update mypy to 1.3 (#13340) @wence-
[Java] Purge non-empty nulls when setting validity (#13335) @razajafri
Add row-wise filtering step to read_parquet (#13334) @rjzamora
Performance improvement for nvtext::minhash (#13333) @davidwendt
Fix some libcudf functions to set the null count on returning columns (#13331) @davidwendt
Change cudf::detail::concatenate_masks to return null-count (#13330) @davidwendt
Move meta calculation in dask_cudf.read_parquet (#13327) @rjzamora
Changes to support Numpy >= 1.24 (#13325) @shwina
Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
Clean up distinct_count benchmark (#13321) @PointKernel
Fix gtest pinning to 1.13.0. (#13319) @bdice
Remove null mask and null count from column_view constructors (#13311) @vyasr
Address feedback from 13289 (#13306) @vyasr
Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
First check for BaseDtype when inferring the data type of an arbitrary object (#13295) @shwina
Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
Support CUDA 12.0 for pip wheels (#13289) @divyegala
Refactor transform_lists_of_structs in row_operators.cu (#13288) @ttnghia
Branch 23.06 merge 23.04 (#13286) @vyasr
Update cupy dependency (#13284) @vyasr
Performance improvement in cudf::strings::join_strings for long strings (#13283) @davidwendt
Fix unused variables and functions (#13275) @karthikeyann
Fix integer overflow in partition scatter_map construction (#13272) @wence-
Numba 0.57 compatibility fixes (#13271) @gmarkall
Performance improvement in cudf::strings::all_characters_of_type (#13259) @davidwendt
Remove default null-count parameter from some libcudf factory functions (#13258) @davidwendt
Roll our own generate_string() because mimesis' has gone away (#13257) @shwina
Build wheels using new single image workflow (#13249) @vyasr
Enable sccache hits from local builds (#13248) @AyodeAwe
Revert to branch-23.06 for shared-action-workflows (#13247) @shwina
Introduce pandas_compatible option in cudf (#13241) @galipremsagar
Add metadata_builder helper class (#13232) @abellina
Use libkvikio conda packages in libcudf, add explicit libcufile dependency. (#13231) @bdice
Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
Performance improvement in cudf::strings::find/rfind for long strings (#13226) @davidwendt
Add chunked reader benchmark (#13223) @SrikarVanavasam
Set the null count in output columns in the CSV reader (#13221) @vuule
Skip Non-Empty nulls tests for the nightly build just like we skip CuFileTest and CudaFatalTest (#13213) @razajafri
Fix string_scalar stream usage in write_json.cu (#13212) @davidwendt
Use canonicalized name for dlopen'd libraries (libcufile) (#13210) @shwina
Refactor pinned memory vector and ORC+Parquet writers (#13206) @ttnghia
Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
Optimization to decoding of parquet level streams (#13203) @nvdbaranec
Clean up and simplify gpuDecideCompression (#13202) @vuule
Use std::array for a statically sized vector in create_serialized_trie (#13201) @vuule
Update minimum Python version to Python 3.9 (#13196) @shwina
Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
Remove usage of rapids-get-rapids-version-from-git (#13184) @jjacobelli
Enable mixed-dtype decimal/scalar binary operations (#13171) @shwina
Split up unique_count.cu to improve build time (#13169) @davidwendt
Use nvtx3 includes in string examples. (#13165) @bdice
Change some .cu gtest files to .cpp (#13155) @davidwendt
Remove wheel pytest verbosity (#13151) @sevagh
Fix libcudf to always pass null-count to set_null_mask (#13149) @davidwendt
Fix gtests to always pass null-count to set_null_mask calls (#13148) @davidwendt
Optimize JSON writer (#13144) @karthikeyann
Performance improvement for libcudf upper/lower conversion for long strings (#13142) @davidwendt
[REVIEW] Deprecate pad and backfill methods (#13140) @galipremsagar
Use CTAD instead of functions in ProtobufReader (#13135) @vuule
Remove more instances of UNKNOWN_NULL_COUNT (#13134) @vyasr
Update clang-format to 16.0.1. (#13133) @bdice
Add log messages about cuIO's nvCOMP and cuFile use (#13132) @vuule
Branch 23.06 merge 23.04 (#13131) @vyasr
Compute null-count in cudf::detail::slice (#13124) @davidwendt
Use ARC V2 self-hosted runners for GPU jobs (#13123) @jjacobelli
Set null-count in linked_column_view conversion operator (#13121) @davidwendt
Adding ifdefs around nvcc-specific pragmas (#13110) @hyperbolic2346
Add null-count parameter to json experimental parse_data utility (#13107) @davidwendt
Remove uses-setup-env-vars (#13105) @vyasr
Explicitly compute null count in concatenate APIs (#13104) @vyasr
Replace unnecessary uses of UNKNOWN_NULL_COUNT (#13102) @vyasr
Performance improvement for cudf::string_view::find functions (#13100) @davidwendt
Use .element() instead of .data() for window range calculations (#13095) @mythrocks
Cleanup Parquet chunked writer (#13094) @ttnghia
Fix unused variable error/warning in page_data.cu (#13093) @davidwendt
Cleanup ORC chunked writer (#13091) @ttnghia
Remove using namespace cudf; from libcudf gtests source (#13089) @davidwendt
Change cudf::test::make_null_mask to also return null-count (#13081) @davidwendt
Resolved automerger from branch-23.04 to branch-23.06 (#13080) @galipremsagar
Assert for non-empty nulls (#13071) @razajafri
Remove deprecated regex functions from libcudf (#13067) @davidwendt
Refactor cudf::detail::sorted_order (#13062) @ttnghia
Improve performance of slice_strings for long strings (#13057) @davidwendt
Reduce shared memory usage in gpuComputePageSizes by 50% (#13047) @nvdbaranec
[REVIEW] Add notes to performance comparisons notebook (#13044) @galipremsagar
Enable binary operations between scalars and columns of differing decimal types (#13034) @shwina
Remove console output from some libcudf gtests (#13027) @davidwendt
Remove underscore in build string. (#13025) @bdice
Bump up JNI version 23.06.0-SNAPSHOT (#13021) @pxLi
Fix auto merger from branch-23.04 to branch-23.06 (#13009) @galipremsagar
Reduce peak memory use when writing compressed ORC files. (#12963) @vuule
Add nvtx annotations to groupby methods (#12941) @wence-
Compute column sizes in Parquet preprocess with single kernel (#12931) @SrikarVanavasam
Add Python bindings for time zone data (TZiF) reader (#12826) @shwina
Optimize set-like operations (#12769) @ttnghia
[REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
Add empty test files for test reorganization (#12288) @shwina

cudf - v23.04.00

Published by raydouglass over 1 year ago

🚨 Breaking Changes

Pin dask and distributed for release (#13070) @galipremsagar
Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
Update minimum pandas and numpy pinnings (#12887) @galipremsagar
Deprecate names & dtype in Index.copy (#12825) @galipremsagar
Deprecate Index.is_* methods (#12820) @galipremsagar
Deprecate datetime_is_numeric from describe (#12818) @galipremsagar
Deprecate na_sentinel in factorize (#12817) @galipremsagar
Make string methods return a Series with a useful Index (#12814) @shwina
Produce useful guidance on overflow error in to_csv (#12705) @wence-
Move strings_udf code into cuDF (#12669) @brandon-b-miller
Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
Replace message parsing with throwing more specific exceptions (#12426) @vyasr

🐛 Bug Fixes

Fix memcheck script to execute only _TEST files found in bin/gtests/libcudf (#13006) @davidwendt
Fix DataFrame constructor to broadcast scalar inputs properly (#12997) @galipremsagar
Drop force_nullable_schema from chunked parquet writer (#12996) @galipremsagar
Fix gtest column utility comparator diff reporting (#12995) @davidwendt
Handle index names while performing groupby (#12992) @galipremsagar
Fix __setitem__ on string columns when the scalar value ends in a null byte (#12991) @wence-
Fix sort_values when column is all empty strings (#12988) @eriknw
Remove unused variable and fix memory issue in ORC writer (#12984) @ttnghia
Pre-emptive fix for upstream dask.dataframe.read_parquet changes (#12983) @rjzamora
Remove MANIFEST.in; use auto-generated one for sdists and package_data for wheels (#12960) @vyasr
Update to use rapids-export(COMPONENTS) feature. (#12959) @robertmaynard
cudftestutil supports static gtest dependencies (#12957) @robertmaynard
Include gtest in build environment. (#12956) @vyasr
Correctly handle scalar indices in Index.__getitem__ (#12955) @wence-
Avoid building cython twice (#12945) @galipremsagar
Fix set index error for Series rolling window operations (#12942) @galipremsagar
Fix calculation of null counts for Parquet statistics (#12938) @etseidl
Preserve integer dtype of hive-partitioned column containing nulls (#12930) @rjzamora
Use get_current_device_resource for intermediate allocations in COLLECT_LIST window code (#12927) @karthikeyann
Mark dlpack tensor deleter as noexcept to match PyCapsule_Destructor signature. (#12921) @bdice
Fix conda recipe post-link.sh typo (#12916) @pentschev
min_rows and num_rows are swapped in ComputePageSizes declaration in Parquet reader (#12886) @etseidl
Expect cupy to now support bool arrays for dlpack. (#12883) @vyasr
Use python -m pytest for nightly wheel tests (#12871) @bdice
Parquet writer column_size() should return a size_t (#12870) @etseidl
Fix cudf::hash_partition kernel launch error with decimal128 types (#12863) @davidwendt
Fix an issue with parquet chunked reader undercounting string lengths. (#12859) @nvdbaranec
Remove tokenizers pre-install pinning. (#12854) @vyasr
Fix parquet RangeIndex bug (#12838) @rjzamora
Remove KAFKA_HOST_TEST from compute-sanitizer check (#12831) @davidwendt
Make string methods return a Series with a useful Index (#12814) @shwina
Tell cudf_kafka to use header-only fmt (#12796) @vyasr
Add GroupBy.dtypes (#12783) @galipremsagar
Fix a leak in a test and clarify some test names (#12781) @revans2
Fix bug in all-null list due to join_list_elements special handling (#12767) @karthikeyann
Add try/except for expected null-schema error in read_parquet (#12756) @rjzamora
Throw an exception if an unsupported page encoding is detected in Parquet reader (#12754) @etseidl
Fix a bug with num_keys in _scatter_by_slice (#12749) @thomcom
Bump pinned rapids wheel deps to 23.4 (#12735) @sevagh
Rework logic in cudf::strings::split_record to improve performance (#12729) @davidwendt
Add always_nullable flag to Dremel encoding (#12727) @divyegala
Fix memcheck read error in compound segmented reduce (#12722) @davidwendt
Fix faulty conditional logic in JIT GroupBy.apply (#12706) @brandon-b-miller
Produce useful guidance on overflow error in to_csv (#12705) @wence-
Handle parquet list data corner case (#12698) @nvdbaranec
Fix missing trailing comma in json writer (#12688) @karthikeyann
Remove child from newCudaAsyncMemoryResource (#12681) @abellina
Handle bool types in round API (#12670) @galipremsagar
Ensure all of device bitmask is initialized in from_arrow (#12668) @wence-
Fix from_arrow to load a sliced arrow table (#12665) @galipremsagar
Fix dask-cudf read_parquet bug for multi-file aggregation (#12663) @rjzamora
Fix AllocateLikeTest gtests reading uninitialized null-mask (#12643) @davidwendt
Fix find_common_dtype and values to handle complex dtypes (#12537) @galipremsagar
Fix fetching of MultiIndex values when a label is passed (#12521) @galipremsagar
Fix Series comparison vs scalars (#12519) @brandon-b-miller
Allow casting from UDFString back to StringView to call methods in strings_udf (#12363) @brandon-b-miller

📖 Documentation

Fix GroupBy.apply doc examples rendering (#12994) @brandon-b-miller
add sphinx building and s3 uploading for dask-cudf docs (#12982) @quasiben
Add developer documentation forbidding default parameters in detail APIs (#12978) @vyasr
Add README symlink for dask-cudf. (#12946) @bdice
Remove return type from @return doxygen tags (#12908) @davidwendt
Fix docs build to be pydata-sphinx-theme=0.13.0 compatible (#12874) @galipremsagar
Add skeleton API and prose documentation for dask-cudf (#12725) @wence-
Enable doctests for GroupBy methods (#12658) @brandon-b-miller
Add comment about CUB patch for SegmentedSortInt.Bool gtest (#12611) @davidwendt

🚀 New Features

Add JNI method for strings::replace multi variety (#12979) @NVnavkumar
Add nunique aggregation support for cudf::segmented_reduce (#12972) @davidwendt
Refactor orc chunked writer (#12949) @ttnghia
Make Parquet writer nullable option applicable to single table writes (#12933) @vuule
Refactor io::orc::ProtobufWriter (#12877) @ttnghia
Make timezone table independent from ORC (#12805) @vuule
Cache JIT GroupBy.apply functions (#12802) @brandon-b-miller
Implement initial support for avro logical types (#6482) (#12788) @tpn
Update tests/column_utilities to use experimental::equality row comparator (#12777) @divyegala
Update distinct/unique_count to experimental::row hasher/comparator (#12776) @divyegala
Update hash_partition to use experimental::row::row_hasher (#12761) @divyegala
Update is_sorted to use experimental::row::lexicographic (#12752) @divyegala
Update default data source in cuio reader benchmarks (#12740) @PointKernel
Reenable stream identification library in CI (#12714) @vyasr
Add regex_program strings splitting java APIs and tests (#12713) @cindyyuanjiang
Add regex_program strings replacing java APIs and tests (#12701) @cindyyuanjiang
Add regex_program strings extract java APIs and tests (#12699) @cindyyuanjiang
Variable fragment sizes for Parquet writer (#12685) @etseidl
Add segmented reduction support for fixed-point types (#12680) @davidwendt
Move strings_udf code into cuDF (#12669) @brandon-b-miller
Add regex_program searching APIs and related java classes (#12666) @cindyyuanjiang
Add logging to libcudf (#12637) @vuule
Add compound aggregations to cudf::segmented_reduce (#12573) @davidwendt
Convert rank to use experimental row comparators (#12481) @divyegala
Use rapids-cmake parallel testing feature (#12451) @robertmaynard
Enable detection of undesired stream usage (#12089) @vyasr

🛠️ Improvements

Pin dask and distributed for release (#13070) @galipremsagar
Pin cupy in wheel tests to supported versions (#13041) @vyasr
Pin numba version (#13001) @vyasr
Rework gtests SequenceTest to remove using namespace cudf (#12985) @davidwendt
Stop setting package version attribute in wheels (#12977) @vyasr
Move detail reduction functions to cudf::reduction::detail namespace (#12971) @davidwendt
Remove default detail mrs: part7 (#12970) @vyasr
Remove default detail mrs: part6 (#12969) @vyasr
Remove default detail mrs: part5 (#12968) @vyasr
Remove default detail mrs: part4 (#12967) @vyasr
Remove default detail mrs: part3 (#12966) @vyasr
Remove default detail mrs: part2 (#12965) @vyasr
Remove default detail mrs: part1 (#12964) @vyasr
Add force_nullable_schema parameter to Parquet writer. (#12952) @galipremsagar
Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
Remove remaining default stream parameters (#12943) @vyasr
Fix cudf::segmented_reduce gtest for ANY aggregation (#12940) @davidwendt
Implement groupby.head and groupby.tail (#12939) @wence-
Fix libcudf gtests to pass null-count=0 for empty validity masks (#12923) @davidwendt
Migrate parquet encoding to use experimental row operators (#12918) @PointKernel
Fix benchmarks coded in namespace cudf and using namespace cudf (#12915) @karthikeyann
Fix io/text gtests coded in namespace cudf::test (#12914) @karthikeyann
Pass SCCACHE_S3_USE_SSL to conda builds (#12910) @ajschmidt8
Fix FST, JSON gtests & benchmarks coded in namespace cudf::test (#12907) @karthikeyann
Generate pyproject dependencies using dfg (#12906) @vyasr
Update libcudf counting functions to specify cudf::size_type (#12904) @davidwendt
Fix moto env vars & pass AWS_SESSION_TOKEN to conda builds (#12902) @ajschmidt8
Rewrite CSV writer benchmark with nvbench (#12901) @PointKernel
Rework some code logic to reduce iterator and comparator inlining to improve compile time (#12900) @davidwendt
Deprecate line_terminator in favor of lineterminator in to_csv (#12896) @wence-
Add stream and mr parameters for structs::detail::flatten_nested_columns (#12892) @ttnghia
Deprecate libcudf regex APIs accepting pattern strings directly (#12891) @davidwendt
Remove default parameters from detail headers in include (#12888) @vyasr
Update minimum pandas and numpy pinnings (#12887) @galipremsagar
Implement groupby.sample (#12882) @wence-
Update JNI build ENV default to gcc 11 (#12881) @pxLi
Change return type of cudf::structs::detail::flatten_nested_columns to smart pointer (#12878) @ttnghia
Fix passing seed parameter to MurmurHash3_32 in cudf::hash() function (#12875) @davidwendt
Remove manual artifact upload step in CI (#12869) @ajschmidt8
Update to GCC 11 (#12868) @bdice
Fix null hive-partition behavior in dask-cudf parquet (#12866) @rjzamora
Update to protobuf>=4.21.6,<4.22. (#12864) @bdice
Update RMM allocators (#12861) @pentschev
Improve performance for replace-multi for long strings (#12858) @davidwendt
Drop Python 3.7 handling for pickle protocol 4 (#12857) @jakirkham
Migrate as much as possible to pyproject.toml (#12850) @vyasr
Enable nbqa pre-commit hooks for isort and black. (#12848) @bdice
Setting a threshold for KvikIO IO (#12841) @madsbk
Update datasets download URL (#12840) @jjacobelli
Make docs builds less verbose (#12836) @AyodeAwe
Consolidate linter configs into pyproject.toml (#12834) @vyasr
Deprecate names & dtype in Index.copy (#12825) @galipremsagar
Deprecate inplace parameters in categorical methods (#12824) @galipremsagar
Add optional text file support to ninja-log utility (#12823) @davidwendt
Deprecate Index.is_* methods (#12820) @galipremsagar
Add dfg as a pre-commit hook (#12819) @vyasr
Deprecate datetime_is_numeric from describe (#12818) @galipremsagar
Deprecate na_sentinel in factorize (#12817) @galipremsagar
Shuffling read into a sub function in parquet read (#12809) @hyperbolic2346
Fixing parquet coalescing of reads (#12808) @hyperbolic2346
CI: Remove specification of manual stage for check_style.sh script. (#12803) @csadorf
Add compute-sanitizer github workflow action to nightly tests (#12800) @davidwendt
Enable groupby std and variance aggregation types in libcudf Debug build (#12799) @davidwendt
Expose seed argument to hash_values (#12795) @ayushdg
Fix groupby gtests coded in namespace cudf::test (#12784) @davidwendt
Improve performance for cudf::strings::count_characters for long strings (#12779) @davidwendt
Deallocate encoded data in ORC writer immediately after compression (#12770) @vuule
Stop force pulling fmt in nvbench. (#12768) @vyasr
Remove now redundant cuda initialization (#12758) @vyasr
Adds JSON reader, writer io benchmark (#12753) @karthikeyann
Use test paths relative to package directory. (#12751) @bdice
Add build metrics report as artifact to cpp-build workflow (#12750) @davidwendt
Add JNI methods for detecting and purging non-empty nulls from LIST and STRUCT (#12742) @razajafri
Stop using versioneer to manage versions (#12741) @vyasr
Reduce error handling verbosity in CI tests scripts (#12738) @AjayThorve
Reduce the number of test cases in multibyte_split benchmark (#12737) @PointKernel
Update shared workflow branches (#12733) @ajschmidt8
JNI switches to nested JSON reader (#12732) @res-life
Changing cudf::io::source_info to use cudf::host_span<std::byte> in a non-breaking form (#12730) @hyperbolic2346
Add nvbench environment class for initializing RMM in benchmarks (#12728) @davidwendt
Split C++ and Python build dependencies into separate lists. (#12724) @bdice
Add build dependencies to Java tests. (#12723) @bdice
Allow setting the seed argument for hash partition (#12715) @firestarman
Remove gpuCI scripts. (#12712) @bdice
Unpin dask and distributed for development (#12710) @galipremsagar
partition_by_hash(): use _split() (#12704) @madsbk
Remove DataFrame.quantiles from docs. (#12684) @bdice
Fast path for experimental::row::equality (#12676) @divyegala
Move date to build string in conda recipe (#12661) @ajschmidt8
Refactor reduction logic for fixed-point types (#12652) @davidwendt
Pay off some JNI RMM API tech debt (#12632) @revans2
Merge copy-on-write feature branch into branch-23.04 (#12619) @galipremsagar
Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
Pin cuda-nvrtc. (#12606) @bdice
Remove cudf::test::print calls from libcudf gtests (#12604) @davidwendt
Init JNI version 23.04.0-SNAPSHOT (#12599) @pxLi
Add performance benchmarks to user facing docs (#12595) @galipremsagar
Add docs build job (#12592) @AyodeAwe
Replace message parsing with throwing more specific exceptions (#12426) @vyasr
Support conversion to/from cudf in dask.dataframe.core.to_backend (#12380) @rjzamora

cudf - v23.02.00

Published by raydouglass over 1 year ago

🚨 Breaking Changes

Pin dask and distributed for release (#12695) @galipremsagar
Change ways to access ptr in Buffer (#12587) @galipremsagar
Remove column names (#12578) @vuule
Default cudf::io::read_json to nested JSON parser (#12544) @vuule
Switch engine=cudf to the new JSON reader (#12509) @galipremsagar
Add trailing comma support for nested JSON reader (#12448) @karthikeyann
Upgrade to arrow-10.0.1 (#12327) @galipremsagar
Fail loudly to avoid data corruption with unsupported input in read_orc (#12325) @vuule
CSV, JSON reader to infer integer column with nulls as int64 instead of float64 (#12309) @karthikeyann
Remove deprecated code for 23.02 (#12281) @vyasr
Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
Purge non-empty nulls for superimpose_nulls and push_down_nulls (#12239) @ttnghia
Rename cudf::structs::detail::superimpose_parent_nulls APIs (#12230) @ttnghia
Remove JIT type names, refactor id_to_type. (#12158) @bdice
Floor division uses integer division for integral arguments (#12131) @wence-

🐛 Bug Fixes

Fix a mask data corruption in UDF (#12647) @galipremsagar
pre-commit: Update isort version to 5.12.0 (#12645) @wence-
tests: Skip cuInit tests if cuda-gdb is not found or not working (#12644) @wence-
Revert regex program java APIs and tests (#12639) @cindyyuanjiang
Fix leaks in ColumnVectorTest (#12625) @jlowe
Handle when spillable buffers own each other (#12607) @madsbk
Fix incorrect null counts for sliced columns in JCudfSerialization (#12589) @jlowe
lists: Transfer dtypes correctly through list.get (#12586) @wence-
timedelta: Don't go via float intermediates for floordiv (#12585) @wence-
Fixing BUG, get_next_chunk() should use the blocking function device_read() (#12584) @madsbk
Make JNI QuoteStyle accessible outside ai.rapids.cudf (#12572) @mythrocks
partition_by_hash(): support index (#12554) @madsbk
Mixed Join benchmark bug due to wrong conditional column (#12553) @divyegala
Update List Lexicographical Comparator (#12538) @divyegala
Dynamically read PTX version (#12534) @brandon-b-miller
build.sh switch to use RAPIDS magic value (#12525) @robertmaynard
Loosen runtime arrow pinning (#12522) @vyasr
Enable metadata transfer for complex types in transpose (#12491) @galipremsagar
Fix issues with parquet chunked reader (#12488) @nvdbaranec
Fix missing metadata transfer in concat for ListColumn (#12487) @galipremsagar
Rename libcudf substring source files to slice (#12484) @davidwendt
Fix compile issue with arrow 10 (#12465) @ttnghia
Fix List offsets bug in mixed type list column in nested JSON reader (#12447) @karthikeyann
Fix xfail incompatibilities (#12423) @vyasr
Fix bug in Parquet column index encoding (#12404) @etseidl
When building Arrow shared look for a shared OpenSSL (#12396) @robertmaynard
Fix get_json_object to return empty column on empty input (#12384) @davidwendt
Pin arrow 9 in testing dependencies to prevent conda solve issues (#12377) @vyasr
Fix reductions any/all return value for empty input (#12374) @davidwendt
Fix debug compile errors in parquet.hpp (#12372) @davidwendt
Purge non-empty nulls in cudf::make_lists_column (#12370) @ttnghia
Use correct memory resource in io::make_column (#12364) @vyasr
Add code to detect possible malformed page data in parquet files. (#12360) @nvdbaranec
Fail loudly to avoid data corruption with unsupported input in read_orc (#12325) @vuule
Fix NumericPairIteratorTest for float values (#12306) @davidwendt
Fixes memory allocation in nested JSON tokenizer (#12300) @elstehle
Reconstruct dtypes correctly for list aggs of struct columns (#12290) @wence-
Fix regex \A and \Z to strictly match string begin/end (#12282) @davidwendt
Fix compile issue in json_chunked_reader.cpp (#12280) @ttnghia
Change reductions any/all to return valid values for empty input (#12279) @davidwendt
Only exclude join keys that are indices from key columns (#12271) @wence-
Fix spill to device limit (#12252) @madsbk
Correct behaviour of sort in concat for singleton concatenations (#12247) @wence-
Purge non-empty nulls for superimpose_nulls and push_down_nulls (#12239) @ttnghia
Patch CUB DeviceSegmentedSort and remove workaround (#12234) @davidwendt
Fix memory leak in udf_string::assign(&&) function (#12206) @davidwendt
Workaround thrust-copy-if limit in json get_tree_representation (#12190) @davidwendt
Fix page size calculation in Parquet writer (#12182) @etseidl
Add cudf::detail::sizes_to_offsets_iterator to allow checking overflow in offsets (#12180) @davidwendt
Workaround thrust-copy-if limit in wordpiece-tokenizer (#12168) @davidwendt
Floor division uses integer division for integral arguments (#12131) @wence-
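
The floor-division entry (#12131) says integral arguments now use integer division but does not say why. One plausible motivation, stated here as an assumption rather than taken from the release note, is that routing floor division through floating point loses precision for large integers. The pure-Python sketch below illustrates the difference; the helper names are hypothetical and this is not cuDF code.

```python
import math


def floor_div_via_float(a: int, b: int) -> int:
    """Floor division routed through float64 true division (inexact for large ints)."""
    return math.floor(a / b)


def floor_div_integral(a: int, b: int) -> int:
    """Floor division done entirely in integer arithmetic (exact)."""
    return a // b


# 2**53 + 1 is the smallest positive integer that float64 cannot represent exactly.
big = 2**53 + 1
via_float = floor_div_via_float(big, 1)
exact = floor_div_integral(big, 1)

print(via_float == big)  # False: the float path rounded the operand
print(exact == big)      # True: the integer path is exact
```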

📖 Documentation

Fix link to NVTX (#12598) @sameerz
Include missing groupby functions in documentation (#12580) @quasiben
Fix documentation author (#12527) @bdice
Update libcudf reduction docs for casting output types (#12526) @davidwendt
Add JSON reader page in user guide (#12499) @GregoryKimball
Link unsupported iteration API docstrings (#12482) @galipremsagar
strings_udf doc update (#12469) @brandon-b-miller
Update cudf_assert docs with correct NDEBUG behavior (#12464) @robertmaynard
Update pre-commit hooks guide (#12395) @bdice
Update test docs to not use detail comparison utilities (#12332) @PointKernel
Fix doxygen description for regex_program::compute_working_memory_size (#12329) @davidwendt
Add eval to docs. (#12322) @vyasr
Turn on xfail_strict=true (#12244) @wence-
Update 10 minutes to cuDF (#12114) @wence-

🚀 New Features

Use kvikIO as the default IO backend (#12574) @vuule
Use has_nonempty_nulls instead of may_contain_non_empty_nulls in superimpose_nulls and push_down_nulls (#12560) @ttnghia
Add strings methods removeprefix and removesuffix (#12557) @davidwendt
Add regex_program java APIs and unit tests (#12548) @cindyyuanjiang
Default cudf::io::read_json to nested JSON parser (#12544) @vuule
Make string quoting optional on CSV write (#12539) @mythrocks
Use new nvCOMP API to optimize the compression temp memory size (#12533) @vuule
Support "values" orient (array of arrays) in Nested JSON reader (#12498) @karthikeyann
one_hot_encode to use experimental row comparators (#12478) @divyegala
Support %W and %w format specifiers in cudf::strings::to_timestamps (#12475) @davidwendt
Add JSON Writer (#12474) @karthikeyann
Refactor thrust_copy_if into cudf::detail::copy_if_safe (#12455) @ttnghia
Add trailing comma support for nested JSON reader (#12448) @karthikeyann
Extract tokenize_json.hpp detail header from src/io/json/nested_json.hpp (#12432) @ttnghia
JNI bindings to write CSV (#12425) @mythrocks
Nested JSON depth benchmark (#12371) @karthikeyann
Implement lists::reverse (#12336) @ttnghia
Use device_read in experimental read_json (#12314) @vuule
Implement JNI for strings::reverse (#12283) @ttnghia
Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
Add cudf::strings::like function with multiple patterns (#12269) @davidwendt
Add environment variable to control host memory allocation in hostdevice_vector (#12251) @vuule
Add cudf::strings::reverse function (#12227) @davidwendt
Selectively use dictionary encoding in Parquet writer (#12211) @etseidl
Support replace in strings_udf (#12207) @brandon-b-miller
Add support to read binary encoded decimals in parquet (#12205) @PointKernel
Support regex EOL where the string ends with a new-line character (#12181) @davidwendt
Updating stream_compaction/unique to use new row comparators (#12159) @divyegala
Add device buffer datasource (#12024) @PointKernel
Implement groupby apply with JIT (#11452) @bwyogatama
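
The removeprefix/removesuffix entry (#12557) names the new strings methods without describing their behavior. A minimal pure-Python sketch of the semantics follows, assuming they mirror Python's `str.removeprefix` and `str.removesuffix` (strip the affix once, and only when it is present). The helper names and the pass-through of `None` for null entries are assumptions for illustration, not the cuDF implementation.

```python
def remove_prefix(values, prefix):
    """Return a new list with `prefix` stripped from strings that start with it."""
    out = []
    for v in values:
        if v is not None and prefix and v.startswith(prefix):
            v = v[len(prefix):]
        out.append(v)
    return out


def remove_suffix(values, suffix):
    """Return a new list with `suffix` stripped from strings that end with it."""
    out = []
    for v in values:
        # Guard against an empty suffix: v[:-0] would return an empty string.
        if v is not None and suffix and v.endswith(suffix):
            v = v[:-len(suffix)]
        out.append(v)
    return out


names = ["tmp_a.csv", "b.csv", None, "tmp_c"]
print(remove_prefix(names, "tmp_"))
print(remove_suffix(names, ".csv"))
```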

🛠️ Improvements

Update shared workflow branches (#12696) @ajschmidt8
Pin dask and distributed for release (#12695) @galipremsagar
Don't upload libcudf-example to Anaconda.org (#12671) @ajschmidt8
Pin wheel dependencies to same RAPIDS release (#12659) @sevagh
Use CTK 118/cp310 branch of wheel workflows (#12602) @sevagh
Change ways to access ptr in Buffer (#12587) @galipremsagar
Version a parquet writer xfail (#12579) @galipremsagar
Remove column names (#12578) @vuule
Parquet reader optimization to address V100 regression. (#12577) @nvdbaranec
Add support for category dtypes in CSV reader (#12571) @galipremsagar
Remove spill_lock parameter from SpillableBuffer.get_ptr() (#12564) @madsbk
Optimize cudf::make_lists_column (#12547) @ttnghia
Remove cudf::strings::repeat_strings_output_sizes from Java and JNI (#12546) @ttnghia
Test that cuInit is not called when RAPIDS_NO_INITIALIZE is set (#12545) @wence-
Rework repeat_strings to use sizes-to-offsets utility (#12543) @davidwendt
Replace exclusive_scan with sizes_to_offsets in cudf::lists::sequences (#12541) @davidwendt
Rework nvtext::ngrams_tokenize to use sizes-to-offsets utility (#12540) @davidwendt
Fix binary-ops gtests coded in namespace cudf::test (#12536) @davidwendt
More @acquire_spill_lock() and as_buffer(..., exposed=False) (#12535) @madsbk
Guard CUDA runtime APIs with error checking (#12531) @PointKernel
Update TODOs from issue 10432. (#12528) @bdice
Update rapids-cmake definitions version in GitHub Actions style checks. (#12511) @bdice
Switch engine=cudf to the new JSON reader (#12509) @galipremsagar
Fix SUM/MEAN aggregation type support. (#12503) @bdice
Stop using pandas._testing (#12492) @vyasr
Fix ROLLING_TEST gtests coded in namespace cudf::test (#12490) @davidwendt
Fix erroneously skipped ORC ZSTD test (#12486) @vuule
Rework nvtext::generate_character_ngrams to use make_strings_children (#12480) @davidwendt
Raise warnings as errors in the test suite (#12468) @vyasr
Remove int32 hard-coding in python (#12467) @galipremsagar
Use cudaMemcpyDefault. (#12466) @bdice
Update workflows for nightly tests (#12462) @ajschmidt8
Build CUDA 11.8 and Python 3.10 Packages (#12457) @ajschmidt8
JNI build image default as cuda11.8 (#12441) @pxLi
Re-enable Recently Updated Check (#12435) @ajschmidt8
Rework remaining cudf::strings::from_xyz functions to use make_strings_children (#12434) @vuule
Build wheels alongside conda CI (#12427) @sevagh
Remove arguments for checking exception messages in Python (#12424) @vyasr
Clean up cuco usage (#12421) @PointKernel
Fix warnings in remaining modules (#12406) @vyasr
Update ops-bot.yaml (#12402) @ajschmidt8
Rework cudf::strings::integers_to_ipv4 to use make_strings_children utility (#12401) @davidwendt
Use numpy.empty() instead of bytearray to allocate host memory for spilling (#12399) @madsbk
Deprecate chunksize from dask_cudf.read_csv (#12394) @rjzamora
Expose the RMM pool size in JNI (#12390) @revans2
Fix COPYING_TEST: gtests coded in namespace cudf::test (#12387) @davidwendt
Rework cudf::strings::url_encode to use make_strings_children utility (#12385) @davidwendt
Use make_strings_children in parse_data nested json reader (#12382) @karthikeyann
Fix warnings in test_datetime.py (#12381) @vyasr
Mixed Join Benchmarks (#12375) @divyegala
Fix warnings in dataframe.py (#12369) @vyasr
Update conda recipes. (#12368) @bdice
Use gpu-latest-1 runner tag (#12366) @bdice
Rework cudf::strings::from_booleans to use make_strings_children (#12365) @vuule
Fix warnings in test modules up to test_dataframe.py (#12355) @vyasr
JSON column performance optimization - struct column nulls (#12354) @karthikeyann
Accelerate stable-segmented-sort with CUB segmented sort (#12347) @davidwendt
Add size check to make_offsets_child_column utility (#12345) @davidwendt
Enable max compression ratio small block optimization for ZSTD (#12338) @vuule
Fix warnings in test_monotonic.py (#12334) @vyasr
Improve JSON column creation performance (list offsets) (#12330) @karthikeyann
Upgrade to arrow-10.0.1 (#12327) @galipremsagar
Fix warnings in test_orc.py (#12326) @vyasr
Fix warnings in test_groupby.py (#12324) @vyasr
Fix test_notebooks.sh (#12323) @ajschmidt8
Fix transform gtests coded in namespace cudf::test (#12321) @davidwendt
Fix check_style.sh script (#12320) @ajschmidt8
Rework cudf::strings::from_timestamps to use make_strings_children (#12317) @davidwendt
Fix warnings in test_index.py (#12313) @vyasr
Fix warnings in test_multiindex.py (#12310) @vyasr
CSV, JSON reader to infer integer column with nulls as int64 instead of float64 (#12309) @karthikeyann
Fix warnings in test_indexing.py (#12305) @vyasr
Fix warnings in test_joining.py (#12304) @vyasr
Unpin dask and distributed for development (#12302) @galipremsagar
Re-enable sccache for Jenkins builds (#12297) @ajschmidt8
Define needs for pr-builder workflow. (#12296) @bdice
Forward merge 22.12 into 23.02 (#12294) @vyasr
Fix warnings in test_stats.py (#12293) @vyasr
Fix table gtests coded in namespace cudf::test (#12292) @davidwendt
Change cython for regex calls to use cudf::strings::regex_program (#12289) @davidwendt
Improved error reporting when reading multiple JSON files (#12285) @vuule
Deprecate Frame.sum_of_squares (#12284) @vyasr
Remove deprecated code for 23.02 (#12281) @vyasr
Clean up handling of max_page_size_bytes in Parquet writer (#12277) @etseidl
Fix replace gtests coded in namespace cudf::test (#12270) @davidwendt
Add pandas nullable type support in Index.to_pandas (#12268) @galipremsagar
Rework nvtext::detokenize to use indexalator for row indices (#12267) @davidwendt
Fix reduction gtests coded in namespace cudf::test (#12257) @davidwendt
Remove default parameters from cudf::detail::sort function declarations (#12254) @davidwendt
Add duplicated support for Series, DataFrame and Index (#12246) @galipremsagar
Replace column/table test utilities with macros (#12242) @PointKernel
Rework cudf::strings::pad and zfill to use make_strings_children (#12238) @davidwendt
Fix sort gtests coded in namespace cudf::test (#12237) @davidwendt
Wrapping concat and file writes in @acquire_spill_lock() (#12232) @madsbk
Rename cudf::structs::detail::superimpose_parent_nulls APIs (#12230) @ttnghia
Cover parsing to decimal types in read_json tests (#12229) @vuule
Spill Statistics (#12223) @madsbk
Use CUDF_JNI_ENABLE_PROFILING to conditionally enable profiling support. (#12221) @bdice
Clean up of test_spilling.py (#12220) @madsbk
Simplify repetitive boolean logic (#12218) @vuule
Add Series.hasnans and Index.hasnans (#12214) @galipremsagar
Add cudf::strings::udf::replace function (#12210) @davidwendt
Adds in new java APIs for appending byte arrays to host columnar data (#12208) @revans2
Remove Python dependencies from Java CI. (#12193) @bdice
Fix null order in sort-based groupby and improve groupby tests (#12191) @divyegala
Move strings children functions from cudf/strings/detail/utilities.cuh to new header (#12185) @davidwendt
Clean up existing JNI scalar to column code (#12173) @revans2
Remove JIT type names, refactor id_to_type. (#12158) @bdice
Update JNI version to 23.02.0-SNAPSHOT (#12129) @pxLi
Minor refactor of cpp/src/io/parquet/page_data.cu (#12126) @etseidl
Add codespell as a linter (#12097) @benfred
Enable specifying exceptions in error macros (#12078) @vyasr
Move _label_encoding from Series to Column (#12040) @shwina
Add GitHub Actions Workflows (#12002) @ajschmidt8
Consolidate dask-cudf groupby_agg calls in one place (#10835) @charlesbluca
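
Several entries in this release rework functions to use a "sizes-to-offsets" utility, and #12180 mentions checking overflow in offsets. The idea can be sketched in pure Python as an exclusive prefix sum over per-row sizes followed by a bounds check on the final offset. This is an illustration only: the function name, the 32-bit limit, and the exception type are assumptions and do not reproduce the libcudf code.

```python
from itertools import accumulate

INT32_MAX = 2**31 - 1  # assumed maximum offset value for this sketch


def sizes_to_offsets(sizes, max_offset=INT32_MAX):
    """Turn per-row sizes into row offsets, failing if the total does not fit."""
    # offsets[i] is where row i starts; offsets[-1] is the total size.
    offsets = [0, *accumulate(sizes)]
    if offsets[-1] > max_offset:
        raise OverflowError(
            f"total size {offsets[-1]} exceeds maximum offset {max_offset}"
        )
    return offsets


print(sizes_to_offsets([2, 0, 3]))
```

Row i then occupies the half-open range from offsets[i] to offsets[i + 1], so an empty row simply repeats the previous offset.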

cudf - v22.12.01

Published by GPUtester almost 2 years ago

🚨 Breaking Changes

Add JNI for substring without 'end' parameter. (#12113) @firestarman
Refactor purge_nonempty_nulls (#12111) @ttnghia
Create an int8 column in read_csv when all elements are missing (#12110) @vuule
Throw an error when libcudf is built without cuFile and LIBCUDF_CUFILE_POLICY is set to "ALWAYS" (#12080) @vuule
Fix type promotion edge cases in numerical binops (#12074) @wence-
Reduce/Remove reliance on **kwargs and *args in IO readers & writers (#12025) @galipremsagar
Rollback of DeviceBufferLike (#12009) @madsbk
Remove unused managed_allocator (#12005) @vyasr
Pass column names to write_csv instead of table_metadata pointer (#11972) @vuule
Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
Remove validation that requires introspection (#11938) @vyasr
Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
Support nested types as groupby keys in libcudf (#11792) @PointKernel
Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source
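
Two of the breaking changes above (#11952 and #11621) switch the merge-sets and collect-set aggregations to treat NaNs as equal by default. The pure-Python sketch below shows what that choice means for a collect-set style operation. The function and its `nans_equal` parameter are hypothetical names for illustration; this is not the libcudf API.

```python
import math


def collect_set(values, nans_equal=True):
    """Collect distinct values in first-seen order, optionally merging NaNs."""
    result = []
    seen = set()
    nan_kept = False
    for v in values:
        if isinstance(v, float) and math.isnan(v):
            # NaN != NaN under IEEE 754, so NaNs need explicit handling.
            if nans_equal and nan_kept:
                continue
            nan_kept = True
            result.append(v)
        elif v not in seen:
            seen.add(v)
            result.append(v)
    return result


data = [1.0, float("nan"), 2.0, float("nan"), 1.0]
merged = collect_set(data, nans_equal=True)     # a single NaN survives
distinct = collect_set(data, nans_equal=False)  # every NaN is kept
print(len(merged), len(distinct))
```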

🐛 Bug Fixes

strings_udf: use libcudf caching of character tables (#12343) @wence-
Fix include line for IO Cython modules (#12250) @vyasr
Make dask pinning looser (#12231) @vyasr
Workaround for CUB segmented-sort bug with boolean keys (#12217) @davidwendt
Fix from_dict backend dispatch to match upstream dask (#12203) @galipremsagar
Merge branch-22.10 into branch-22.12 (#12198) @davidwendt
Fix compression in ORC writer (#12194) @vuule
Don't use CMake 3.25.0 as it has a show stopping FindCUDAToolkit bug (#12188) @robertmaynard
Fix data corruption when reading ORC files with empty stripes (#12160) @vuule
Fix decimal binary operations (#12142) @galipremsagar
Ensure dlpack include is provided to cudf interop lib (#12139) @robertmaynard
Safely allocate udf_string pointers in strings_udf (#12138) @brandon-b-miller
Fix/disable jitify lto (#12122) @robertmaynard
Fix conditional_full_join benchmark (#12121) @GregoryKimball
Fix regex working-memory-size refactor error (#12119) @davidwendt
Add in negative size checks for columns (#12118) @revans2
Add JNI for substring without 'end' parameter. (#12113) @firestarman
Fix reading of CSV files with blank second row (#12098) @vuule
Fix an error in IO with GzipFile type (#12085) @galipremsagar
Workaround groupby aggregate thrust::copy_if overflow (#12079) @davidwendt
Fix alignment of compressed blocks in ORC writer (#12077) @vuule
Fix singleton-range __setitem__ edge case (#12075) @wence-
Fix type promotion edge cases in numerical binops (#12074) @wence-
Force using old fmt in nvbench. (#12067) @vyasr
Fixes List offset bug in Nested JSON reader (#12060) @karthikeyann
Allow falling back to shim_60.ptx by default in strings_udf (#12056) @brandon-b-miller
Force black exclusions for pre-commit. (#12036) @bdice
Add memory_usage & items implementation for Struct column & dtype (#12033) @galipremsagar
Reduce/Remove reliance on **kwargs and *args in IO readers & writers (#12025) @galipremsagar
Fixes bug in csv_reader_options construction in cython (#12021) @karthikeyann
Fix issues when both usecols and names options are used in read_csv (#12018) @vuule
Port thrust's pinned_allocator to cudf, since Thrust 1.17 removes the type (#12004) @robertmaynard
Revert "Replace most of preprocessor usage in nvcomp adapter with constexpr" (#11999) @vuule
Fix bug where df.loc resulting in single row could give wrong index (#11998) @eriknw
Switch to DISABLE_DEPRECATION_WARNINGS to match other RAPIDS projects (#11989) @robertmaynard
Fix maximum page size estimate in Parquet writer (#11962) @vuule
Fix local offset handling in bgzip reader (#11918) @upsj
Fix an issue reading struct-of-list types in Parquet. (#11910) @nvdbaranec
Fix memcheck error in TypeInference.Timestamp gtest (#11905) @davidwendt
Fix type casting in Series.setitem (#11904) @wence-
Fix memcheck error in get_dremel_data (#11903) @davidwendt
Fixes Unsupported column type error due to empty list columns in Nested JSON reader (#11897) @karthikeyann
Fix segmented-sort to ignore indices outside the offsets (#11888) @davidwendt
Fix cudf::stable_sorted_order for NaN and -NaN in FLOAT64 columns (#11874) @davidwendt
Fix writing of Parquet files with many fragments (#11869) @etseidl
Fix RangeIndex unary operators. (#11868) @vyasr
JNI Avoid NPE for reading host binary data (#11865) @revans2
Fix decimal benchmark input data generation (#11863) @karthikeyann
Fix pre-commit copyright check (#11860) @galipremsagar
Fix Parquet support for seconds and milliseconds duration types (#11854) @vuule
Ensure better compiler cache results between cudf cal-ver branches (#11835) @robertmaynard
Fix make_column_from_scalar for all-null strings column (#11807) @davidwendt
Tell jitify_preprocess where to search for libnvrtc (#11787) @robertmaynard
Add V2 page header support to parquet reader (#11778) @etseidl
Parquet reader: bug fix for a num_rows/skip_rows corner case, w/optimization for nested preprocessing (#11752) @nvdbaranec
Determine if Arrow has S3 support at runtime in unit test. (#11560) @bdice

📖 Documentation

Use rapidsai CODE_OF_CONDUCT.md (#12166) @bdice
Add symlinks to notebooks. (#12128) @bdice
Add truncate API to python doc pages (#12109) @galipremsagar
Update Numba docs links. (#12107) @bdice
Remove "Multi-GPU with Dask-cuDF" notebook. (#12095) @bdice
Fix link to c++ developer guide from CONTRIBUTING.md (#12084) @brandon-b-miller
Add pivot_table and crosstab to docs. (#12014) @bdice
Fix doxygen text for cudf::dictionary::encode (#11991) @davidwendt
Replace default_stream_value with get_default_stream in docs. (#11985) @vyasr
Add dtype docs pages and docstrings for cudf specific dtypes (#11974) @galipremsagar
Update Unit Testing in libcudf guidelines to code tests outside the cudf::test namespace (#11959) @davidwendt
Rename libcudf++ to libcudf. (#11953) @bdice
Fix documentation referring to removed as_gpu_matrix method. (#11937) @bdice
Remove "experimental" warning for struct columns in ORC reader and writer (#11880) @vuule
Initial draft of policies and guidelines for libcudf usage. (#11853) @vyasr
Add clear indication of non-GPU accelerated parameters in read_json docstring (#11825) @GregoryKimball
Add developer docs for writing tests (#11199) @vyasr

🚀 New Features

Adds an EventHandler to Java MemoryBuffer to be invoked on close (#12125) @abellina
Support + in strings_udf (#12117) @brandon-b-miller
Support upper and lower in strings_udf (#12099) @brandon-b-miller
Add wheel builds (#12096) @vyasr
Allow setting malloc heap size in string udfs (#12094) @brandon-b-miller
Support strip, lstrip, and rstrip in strings_udf (#12091) @brandon-b-miller
Mark nvcomp zstd compression stable (#12059) @jbrennan333
Add debug-only onAllocated/onDeallocated to RmmEventHandler (#12054) @abellina
Enable building against the libarrow contained in pyarrow (#12034) @vyasr
Add strings like jni and native method (#12032) @cindyyuanjiang
Cleanup common parsing code in JSON, CSV reader (#12022) @karthikeyann
byte_range support for JSON Lines format (#12017) @karthikeyann
Minor cleanup of root CMakeLists.txt for better organization (#11988) @robertmaynard
Add inplace arithmetic operators to MaskedType (#11987) @brandon-b-miller
Implement JNI for chunked Parquet reader (#11961) @ttnghia
Add method argument to DataFrame.quantile (#11957) @rjzamora
Add gpu memory watermark apis to JNI (#11950) @abellina
Adds retryCount to RmmEventHandler.onAllocFailure (#11940) @abellina
Enable returning string data from UDFs used through apply (#11933) @brandon-b-miller
Switch over to rapids-cmake patches for thrust (#11921) @robertmaynard
Add strings udf C++ classes and functions for phase II (#11912) @davidwendt
Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
Enable CEC for strings_udf (#11884) @brandon-b-miller
ArrowIPCTableWriter writes an empty batch in the case of an empty table. (#11883) @firestarman
Implement chunked Parquet reader (#11867) @ttnghia
Add read_orc_metadata to libcudf (#11815) @vuule
Support nested types as groupby keys in libcudf (#11792) @PointKernel
Adding feature Truncate to DataFrame and Series (#11435) @VamsiTallam95

🛠️ Improvements

Reduce number of tests marked spilling (#12197) @madsbk
Pin dask and distributed for release (#12165) @galipremsagar
Don't rely on GNU find in headers_test.sh (#12164) @wence-
Update cp.clip call (#12148) @quasiben
Enable automatic column projection in groupby().agg (#12124) @rjzamora
Refactor purge_nonempty_nulls (#12111) @ttnghia
Create an int8 column in read_csv when all elements are missing (#12110) @vuule
Spilling to host memory (#12106) @madsbk
First pass of pd.read_orc changes in tests (#12103) @galipremsagar
Expose engine argument in dask_cudf.read_json (#12101) @rjzamora
Remove CUDA 10 compatibility code. (#12088) @bdice
Move and update dask nightly install in CI (#12082) @galipremsagar
Throw an error when libcudf is built without cuFile and LIBCUDF_CUFILE_POLICY is set to "ALWAYS" (#12080) @vuule
Remove macros that inspect the contents of exceptions (#12076) @vyasr
Fix ingest_raw_data performance issue in Nested JSON reader due to RVO (#12070) @karthikeyann
Remove overflow error during decimal binops (#12063) @galipremsagar
Change cudf::detail::tdigest to cudf::tdigest::detail (#12050) @davidwendt
Fix quantile gtests coded in namespace cudf::test (#12049) @davidwendt
Add support for DataFrame.from_dict, DataFrame.to_dict and Series.to_dict (#12048) @galipremsagar
Refactor Parquet reader (#12046) @ttnghia
Forward merge 22.10 into 22.12 (#12045) @vyasr
Standardize newlines at ends of files. (#12042) @bdice
Trim trailing whitespace from all files. (#12041) @bdice
Use nosync policy in gather and scatter implementations. (#12038) @bdice
Remove smart quotes from all docstrings. (#12035) @bdice
Update cuda-python dependency to 11.7.1 (#12030) @galipremsagar
Add cython-lint to pre-commit checks. (#12020) @bdice
Use pragma once (#12019) @bdice
New GHA to add issues/prs to project board (#12016) @jarmak-nv
Add DataFrame.pivot_table. (#12015) @bdice
Rollback of DeviceBufferLike (#12009) @madsbk
Remove default parameters for nvtext::detail functions (#12007) @davidwendt
Remove default parameters for cudf::dictionary::detail functions (#12006) @davidwendt
Remove unused managed_allocator (#12005) @vyasr
Remove default parameters for cudf::strings::detail functions (#12003) @davidwendt
Remove unnecessary code from dask-cudf _Frame (#12001) @rjzamora
Ignore python docs build artifacts (#12000) @galipremsagar
Use rapids-cmake for google benchmark. (#11997) @vyasr
Leverage rapids_cython for more automated RPATH handling (#11996) @vyasr
Remove stale labeler (#11995) @raydouglass
Move protobuf compilation to CMake (#11986) @vyasr
Replace most of preprocessor usage in nvcomp adapter with constexpr (#11980) @vuule
Add missing noexcepts to column_in_metadata methods (#11973) @vyasr
Pass column names to write_csv instead of table_metadata pointer (#11972) @vuule
Accelerate libcudf segmented sort with CUB segmented sort (#11969) @davidwendt
Feature/remove default streams (#11967) @vyasr
Add pool memory resource to libcudf basic example (#11966) @davidwendt
Fix some libcudf calls to cudf::detail::gather (#11963) @davidwendt
Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
Add deprecation warning for set_allocator. (#11958) @vyasr
Fix lists and structs gtests coded in namespace cudf::test (#11956) @davidwendt
Add full page indexes to Parquet writer benchmarks (#11955) @etseidl
Use gather-based strings factory in cudf::strings::strip (#11954) @davidwendt
Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
Add strip_delimiters option to read_text (#11946) @upsj
Refactor multibyte_split output_builder (#11945) @upsj
Remove validation that requires introspection (#11938) @vyasr
Add .str.find_multiple API (#11928) @galipremsagar
Add regex_program class for use with all regex APIs (#11927) @davidwendt
Enable backend dispatching for Dask-DataFrame creation (#11920) @rjzamora
Performance improvement in JSON Tree traversal (#11919) @karthikeyann
Fix some gtests incorrectly coded in namespace cudf::test (part I) (#11917) @davidwendt
Refactor pad/zfill functions for reuse with strings udf (#11914) @davidwendt
Add nanosecond & microsecond to DatetimeProperties (#11911) @galipremsagar
Pin mimesis version in setup.py. (#11906) @bdice
Error on ListColumn or any new unsupported column in cudf.Index (#11902) @galipremsagar
Add thrust output iterator fix (1805) to thrust.patch (#11900) @davidwendt
Relax codecov threshold diff (#11899) @galipremsagar
Use public APIs in STREAM_COMPACTION_NVBENCH (#11892) @GregoryKimball
Add coverage for string UDF tests. (#11891) @vyasr
Provide data_chunk_source wrapper for datasource (#11886) @upsj
Handle multibyte_split byte_range out-of-bounds offsets on host (#11885) @upsj
Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
Change expect_strings_empty into expect_column_empty libcudf test utility (#11873) @davidwendt
Add ngroup (#11871) @shwina
Reduce memory usage in nested JSON parser - tree generation (#11864) @karthikeyann
Unpin dask and distributed for development (#11859) @galipremsagar
Remove unused includes for table/row_operators (#11857) @GregoryKimball
Use conda-forge's pyorc (#11855) @jakirkham
Add libcudf strings examples (#11849) @davidwendt
Remove cudf_io namespace alias (#11827) @vuule
Test/remove thrust vector usage (#11813) @vyasr
Add BGZIP reader to python read_text (#11802) @upsj
Merge branch-22.10 into branch-22.12 (#11801) @davidwendt
Fix compile warning from CUDF_FUNC_RANGE in a member function (#11798) @davidwendt
Update cudf JNI version to 22.12.0-SNAPSHOT (#11764) @pxLi
Update flake8 to 5.0.4 and use flake8-force to check Cython. (#11736) @bdice
Add BGZIP multibyte_split benchmark (#11723) @upsj
Bifurcate Dependency Lists (#11674) @bdice
Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
Conform "bench_isin" to match generator column names (#11549) @GregoryKimball
Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
Add checks for HLG layers in dask-cudf groupby tests (#10853) @charlesbluca
part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source
Make all nvcc warnings into errors (#8916) @trxcllnt

cudf - v22.12.00

Published by GPUtester almost 2 years ago

🚨 Breaking Changes

Add JNI for substring without 'end' parameter. (#12113) @firestarman
Refactor purge_nonempty_nulls (#12111) @ttnghia
Create an int8 column in read_csv when all elements are missing (#12110) @vuule
Throw an error when libcudf is built without cuFile and LIBCUDF_CUFILE_POLICY is set to "ALWAYS" (#12080) @vuule
Fix type promotion edge cases in numerical binops (#12074) @wence-
Reduce/Remove reliance on **kwargs and *args in IO readers & writers (#12025) @galipremsagar
Rollback of DeviceBufferLike (#12009) @madsbk
Remove unused managed_allocator (#12005) @vyasr
Pass column names to write_csv instead of table_metadata pointer (#11972) @vuule
Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
Remove validation that requires introspection (#11938) @vyasr
Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
Support nested types as groupby keys in libcudf (#11792) @PointKernel
Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source

🐛 Bug Fixes

Fix include line for IO Cython modules (#12250) @vyasr
Make dask pinning looser (#12231) @vyasr
Workaround for CUB segmented-sort bug with boolean keys (#12217) @davidwendt
Fix from_dict backend dispatch to match upstream dask (#12203) @galipremsagar
Merge branch-22.10 into branch-22.12 (#12198) @davidwendt
Fix compression in ORC writer (#12194) @vuule
Don't use CMake 3.25.0 as it has a show stopping FindCUDAToolkit bug (#12188) @robertmaynard
Fix data corruption when reading ORC files with empty stripes (#12160) @vuule
Fix decimal binary operations (#12142) @galipremsagar
Ensure dlpack include is provided to cudf interop lib (#12139) @robertmaynard
Safely allocate udf_string pointers in strings_udf (#12138) @brandon-b-miller
Fix/disable jitify lto (#12122) @robertmaynard
Fix conditional_full_join benchmark (#12121) @GregoryKimball
Fix regex working-memory-size refactor error (#12119) @davidwendt
Add in negative size checks for columns (#12118) @revans2
Add JNI for substring without 'end' parameter. (#12113) @firestarman
Fix reading of CSV files with blank second row (#12098) @vuule
Fix an error in IO with GzipFile type (#12085) @galipremsagar
Workaround groupby aggregate thrust::copy_if overflow (#12079) @davidwendt
Fix alignment of compressed blocks in ORC writer (#12077) @vuule
Fix singleton-range __setitem__ edge case (#12075) @wence-
Fix type promotion edge cases in numerical binops (#12074) @wence-
Force using old fmt in nvbench. (#12067) @vyasr
Fixes List offset bug in Nested JSON reader (#12060) @karthikeyann
Allow falling back to shim_60.ptx by default in strings_udf (#12056) @brandon-b-miller
Force black exclusions for pre-commit. (#12036) @bdice
Add memory_usage & items implementation for Struct column & dtype (#12033) @galipremsagar
Reduce/Remove reliance on **kwargs and *args in IO readers & writers (#12025) @galipremsagar
Fixes bug in csv_reader_options construction in cython (#12021) @karthikeyann
Fix issues when both usecols and names options are used in read_csv (#12018) @vuule
Port thrust's pinned_allocator to cudf, since Thrust 1.17 removes the type (#12004) @robertmaynard
Revert "Replace most of preprocessor usage in nvcomp adapter with constexpr" (#11999) @vuule
Fix bug where df.loc resulting in single row could give wrong index (#11998) @eriknw
Switch to DISABLE_DEPRECATION_WARNINGS to match other RAPIDS projects (#11989) @robertmaynard
Fix maximum page size estimate in Parquet writer (#11962) @vuule
Fix local offset handling in bgzip reader (#11918) @upsj
Fix an issue reading struct-of-list types in Parquet. (#11910) @nvdbaranec
Fix memcheck error in TypeInference.Timestamp gtest (#11905) @davidwendt
Fix type casting in Series.setitem (#11904) @wence-
Fix memcheck error in get_dremel_data (#11903) @davidwendt
Fixes Unsupported column type error due to empty list columns in Nested JSON reader (#11897) @karthikeyann
Fix segmented-sort to ignore indices outside the offsets (#11888) @davidwendt
Fix cudf::stable_sorted_order for NaN and -NaN in FLOAT64 columns (#11874) @davidwendt
Fix writing of Parquet files with many fragments (#11869) @etseidl
Fix RangeIndex unary operators. (#11868) @vyasr
JNI Avoid NPE for reading host binary data (#11865) @revans2
Fix decimal benchmark input data generation (#11863) @karthikeyann
Fix pre-commit copyright check (#11860) @galipremsagar
Fix Parquet support for seconds and milliseconds duration types (#11854) @vuule
Ensure better compiler cache results between cudf cal-ver branches (#11835) @robertmaynard
Fix make_column_from_scalar for all-null strings column (#11807) @davidwendt
Tell jitify_preprocess where to search for libnvrtc (#11787) @robertmaynard
Add V2 page header support to parquet reader (#11778) @etseidl
Parquet reader: bug fix for a num_rows/skip_rows corner case, w/optimization for nested preprocessing (#11752) @nvdbaranec
Determine if Arrow has S3 support at runtime in unit test. (#11560) @bdice

📖 Documentation

Use rapidsai CODE_OF_CONDUCT.md (#12166) @bdice
Add symlinks to notebooks. (#12128) @bdice
Add truncate API to python doc pages (#12109) @galipremsagar
Update Numba docs links. (#12107) @bdice
Remove "Multi-GPU with Dask-cuDF" notebook. (#12095) @bdice
Fix link to c++ developer guide from CONTRIBUTING.md (#12084) @brandon-b-miller
Add pivot_table and crosstab to docs. (#12014) @bdice
Fix doxygen text for cudf::dictionary::encode (#11991) @davidwendt
Replace default_stream_value with get_default_stream in docs. (#11985) @vyasr
Add dtype docs pages and docstrings for cudf specific dtypes (#11974) @galipremsagar
Update Unit Testing in libcudf guidelines to code tests outside the cudf::test namespace (#11959) @davidwendt
Rename libcudf++ to libcudf. (#11953) @bdice
Fix documentation referring to removed as_gpu_matrix method. (#11937) @bdice
Remove "experimental" warning for struct columns in ORC reader and writer (#11880) @vuule
Initial draft of policies and guidelines for libcudf usage. (#11853) @vyasr
Add clear indication of non-GPU accelerated parameters in read_json docstring (#11825) @GregoryKimball
Add developer docs for writing tests (#11199) @vyasr

🚀 New Features

Adds an EventHandler to Java MemoryBuffer to be invoked on close (#12125) @abellina
Support + in strings_udf (#12117) @brandon-b-miller
Support upper and lower in strings_udf (#12099) @brandon-b-miller
Add wheel builds (#12096) @vyasr
Allow setting malloc heap size in string udfs (#12094) @brandon-b-miller
Support strip, lstrip, and rstrip in strings_udf (#12091) @brandon-b-miller
Mark nvcomp zstd compression stable (#12059) @jbrennan333
Add debug-only onAllocated/onDeallocated to RmmEventHandler (#12054) @abellina
Enable building against the libarrow contained in pyarrow (#12034) @vyasr
Add strings like jni and native method (#12032) @cindyyuanjiang
Cleanup common parsing code in JSON, CSV reader (#12022) @karthikeyann
byte_range support for JSON Lines format (#12017) @karthikeyann
Minor cleanup of root CMakeLists.txt for better organization (#11988) @robertmaynard
Add inplace arithmetic operators to MaskedType (#11987) @brandon-b-miller
Implement JNI for chunked Parquet reader (#11961) @ttnghia
Add method argument to DataFrame.quantile (#11957) @rjzamora
Add gpu memory watermark apis to JNI (#11950) @abellina
Adds retryCount to RmmEventHandler.onAllocFailure (#11940) @abellina
Enable returning string data from UDFs used through apply (#11933) @brandon-b-miller
Switch over to rapids-cmake patches for thrust (#11921) @robertmaynard
Add strings udf C++ classes and functions for phase II (#11912) @davidwendt
Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
Enable CEC for strings_udf (#11884) @brandon-b-miller
ArrowIPCTableWriter writes an empty batch in the case of an empty table. (#11883) @firestarman
Implement chunked Parquet reader (#11867) @ttnghia
Add read_orc_metadata to libcudf (#11815) @vuule
Support nested types as groupby keys in libcudf (#11792) @PointKernel
Adding feature Truncate to DataFrame and Series (#11435) @VamsiTallam95

🛠️ Improvements

Reduce number of tests marked spilling (#12197) @madsbk
Pin dask and distributed for release (#12165) @galipremsagar
Don't rely on GNU find in headers_test.sh (#12164) @wence-
Update cp.clip call (#12148) @quasiben
Enable automatic column projection in groupby().agg (#12124) @rjzamora
Refactor purge_nonempty_nulls (#12111) @ttnghia
Create an int8 column in read_csv when all elements are missing (#12110) @vuule
Spilling to host memory (#12106) @madsbk
First pass of pd.read_orc changes in tests (#12103) @galipremsagar
Expose engine argument in dask_cudf.read_json (#12101) @rjzamora
Remove CUDA 10 compatibility code. (#12088) @bdice
Move and update dask nightly install in CI (#12082) @galipremsagar
Throw an error when libcudf is built without cuFile and LIBCUDF_CUFILE_POLICY is set to "ALWAYS" (#12080) @vuule
Remove macros that inspect the contents of exceptions (#12076) @vyasr
Fix ingest_raw_data performance issue in Nested JSON reader due to RVO (#12070) @karthikeyann
Remove overflow error during decimal binops (#12063) @galipremsagar
Change cudf::detail::tdigest to cudf::tdigest::detail (#12050) @davidwendt
Fix quantile gtests coded in namespace cudf::test (#12049) @davidwendt
Add support for DataFrame.from_dict, DataFrame.to_dict and Series.to_dict (#12048) @galipremsagar
Refactor Parquet reader (#12046) @ttnghia
Forward merge 22.10 into 22.12 (#12045) @vyasr
Standardize newlines at ends of files. (#12042) @bdice
Trim trailing whitespace from all files. (#12041) @bdice
Use nosync policy in gather and scatter implementations. (#12038) @bdice
Remove smart quotes from all docstrings. (#12035) @bdice
Update cuda-python dependency to 11.7.1 (#12030) @galipremsagar
Add cython-lint to pre-commit checks. (#12020) @bdice
Use pragma once (#12019) @bdice
New GHA to add issues/prs to project board (#12016) @jarmak-nv
Add DataFrame.pivot_table. (#12015) @bdice
Rollback of DeviceBufferLike (#12009) @madsbk
Remove default parameters for nvtext::detail functions (#12007) @davidwendt
Remove default parameters for cudf::dictionary::detail functions (#12006) @davidwendt
Remove unused managed_allocator (#12005) @vyasr
Remove default parameters for cudf::strings::detail functions (#12003) @davidwendt
Remove unnecessary code from dask-cudf _Frame (#12001) @rjzamora
Ignore python docs build artifacts (#12000) @galipremsagar
Use rapids-cmake for google benchmark. (#11997) @vyasr
Leverage rapids_cython for more automated RPATH handling (#11996) @vyasr
Remove stale labeler (#11995) @raydouglass
Move protobuf compilation to CMake (#11986) @vyasr
Replace most of preprocessor usage in nvcomp adapter with constexpr (#11980) @vuule
Add missing noexcepts to column_in_metadata methods (#11973) @vyasr
Pass column names to write_csv instead of table_metadata pointer (#11972) @vuule
Accelerate libcudf segmented sort with CUB segmented sort (#11969) @davidwendt
Feature/remove default streams (#11967) @vyasr
Add pool memory resource to libcudf basic example (#11966) @davidwendt
Fix some libcudf calls to cudf::detail::gather (#11963) @davidwendt
Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
Add deprecation warning for set_allocator. (#11958) @vyasr
Fix lists and structs gtests coded in namespace cudf::test (#11956) @davidwendt
Add full page indexes to Parquet writer benchmarks (#11955) @etseidl
Use gather-based strings factory in cudf::strings::strip (#11954) @davidwendt
Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
Add strip_delimiters option to read_text (#11946) @upsj
Refactor multibyte_split output_builder (#11945) @upsj
Remove validation that requires introspection (#11938) @vyasr
Add .str.find_multiple API (#11928) @galipremsagar
Add regex_program class for use with all regex APIs (#11927) @davidwendt
Enable backend dispatching for Dask-DataFrame creation (#11920) @rjzamora
Performance improvement in JSON Tree traversal (#11919) @karthikeyann
Fix some gtests incorrectly coded in namespace cudf::test (part I) (#11917) @davidwendt
Refactor pad/zfill functions for reuse with strings udf (#11914) @davidwendt
Add nanosecond & microsecond to DatetimeProperties (#11911) @galipremsagar
Pin mimesis version in setup.py. (#11906) @bdice
Error on ListColumn or any new unsupported column in cudf.Index (#11902) @galipremsagar
Add thrust output iterator fix (1805) to thrust.patch (#11900) @davidwendt
Relax codecov threshold diff (#11899) @galipremsagar
Use public APIs in STREAM_COMPACTION_NVBENCH (#11892) @GregoryKimball
Add coverage for string UDF tests. (#11891) @vyasr
Provide data_chunk_source wrapper for datasource (#11886) @upsj
Handle multibyte_split byte_range out-of-bounds offsets on host (#11885) @upsj
Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
Change expect_strings_empty into expect_column_empty libcudf test utility (#11873) @davidwendt
Add ngroup (#11871) @shwina
Reduce memory usage in nested JSON parser - tree generation (#11864) @karthikeyann
Unpin dask and distributed for development (#11859) @galipremsagar
Remove unused includes for table/row_operators (#11857) @GregoryKimball
Use conda-forge's pyorc (#11855) @jakirkham
Add libcudf strings examples (#11849) @davidwendt
Remove cudf_io namespace alias (#11827) @vuule
Test/remove thrust vector usage (#11813) @vyasr
Add BGZIP reader to python read_text (#11802) @upsj
Merge branch-22.10 into branch-22.12 (#11801) @davidwendt
Fix compile warning from CUDF_FUNC_RANGE in a member function (#11798) @davidwendt
Update cudf JNI version to 22.12.0-SNAPSHOT (#11764) @pxLi
Update flake8 to 5.0.4 and use flake8-force to check Cython. (#11736) @bdice
Add BGZIP multibyte_split benchmark (#11723) @upsj
Bifurcate Dependency Lists (#11674) @bdice
Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
Conform "bench_isin" to match generator column names (#11549) @GregoryKimball
Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
Add checks for HLG layers in dask-cudf groupby tests (#10853) @charlesbluca
part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source
Make all nvcc warnings into errors (#8916) @trxcllnt

cudf - v22.10.01

Published by GPUtester almost 2 years ago

🚨 Breaking Changes

Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
Disable nvCOMP DEFLATE integration (#11811) @vuule
Fix return type of Index.isna & Index.notna (#11769) @galipremsagar
Remove kwargs in read_csv & to_csv (#11762) @galipremsagar
Fix cudf::partition* APIs that do not return offsets for empty output table (#11709) @ttnghia
Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
Update zfill to match Python output (#11634) @davidwendt
Upgrade pandas to 1.5 (#11617) @galipremsagar
Change default value of ordered to False in CategoricalDtype (#11604) @galipremsagar
Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
Adding optional parquet reader schema (#11524) @hyperbolic2346
Deprecate skiprows and num_rows in read_orc (#11522) @galipremsagar
Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
Drop support for skiprows and num_rows in cudf.read_parquet (#11480) @galipremsagar
Disable Arrow S3 support by default. (#11470) @bdice
Convert thrust::optional usages to std::optional (#11455) @robertmaynard
Remove unused is_struct trait. (#11450) @bdice
Refactor the Buffer class (#11447) @madsbk
Return empty dataframe when reading an ORC file using empty columns option (#11446) @vuule
Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
Use the new JSON parser when the experimental reader is selected (#11364) @vuule
Remove deprecated Series.applymap. (#11031) @bdice
Remove deprecated expand parameter from str.findall. (#11030) @bdice

🐛 Bug Fixes

Update cuda-python dependency to 11.7.1 (#11994) @shwina
Fixes bug in temporary decompression space estimation before calling nvcomp (#11879) @abellina
Handle ptx file paths during strings_udf import (#11862) @galipremsagar
Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
Reset strings_udf CEC and solve several related issues (#11846) @brandon-b-miller
Fix bug in new shuffle-based groupby implementation (#11836) @rjzamora
Fix is_valid checks in Scalar._binaryop (#11818) @wence-
Fix operator NotImplemented issue with numpy (#11816) @galipremsagar
Disable nvCOMP DEFLATE integration (#11811) @vuule
Build strings_udf package with other python packages in nightlies (#11808) @brandon-b-miller
Revert problematic shuffle=explicit-comms changes (#11803) @rjzamora
Fix regex out-of-bounds write in strided rows logic (#11797) @davidwendt
Build cudf locally before building strings_udf conda packages in CI (#11785) @brandon-b-miller
Fix an issue in cudf::row_bit_count involving structs and lists at multiple levels. (#11779) @nvdbaranec
Fix return type of Index.isna & Index.notna (#11769) @galipremsagar
Fix issue with set-item in case of list and struct types (#11760) @galipremsagar
Ensure all libcudf APIs run on cudf's default stream (#11759) @vyasr
Resolve dask_cudf failures caused by upstream groupby changes (#11755) @rjzamora
Fix ORC string sum statistics (#11740) @vuule
Add strings_udf package for python 3.9 (#11730) @brandon-b-miller
Ensure that all tests launch kernels on cudf's default stream (#11726) @vyasr
Don't assume stream is a compile-time constant expression (#11725) @vyasr
Fix get_thrust.cmake format at patch command (#11715) @davidwendt
Fix cudf::partition* APIs that do not return offsets for empty output table (#11709) @ttnghia
Fix cudf::lists::sort_lists for NaN and Infinity values (#11703) @davidwendt
Modify ORC reader timestamp parsing to match the apache reader behavior (#11699) @vuule
Fix DataFrame.from_arrow to preserve type metadata (#11698) @galipremsagar
Fix compile error due to missing header (#11697) @ttnghia
Default to Snappy compression in to_orc when using cuDF or Dask (#11690) @vuule
Fix an issue related to MultiIndex when group_keys=True (#11689) @galipremsagar
Transfer correct dtype to exploded column (#11687) @wence-
Ignore protobuf generated files in mypy checks (#11685) @galipremsagar
Maintain the index name after .loc (#11677) @shwina
Fix issue with extracting nested column data & dtype preservation (#11671) @galipremsagar
Ensure that all cudf tests and benchmarks are conda env aware (#11666) @robertmaynard
Update to Thrust 1.17.2 to fix cub ODR issues (#11665) @robertmaynard
Fix multi-file remote datasource bug (#11655) @rjzamora
Fix invalid regex quantifier check to not include alternation (#11654) @davidwendt
Fix bug in device_write(): it uses an incorrect size (#11651) @madsbk
Fix overflows in benchmarks (#11649) @elstehle
Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
Fix compile error in benchmark nested_json.cpp (#11637) @davidwendt
Update zfill to match Python output (#11634) @davidwendt
Removed converted type for INT32 and INT64 since they do not convert (#11627) @hyperbolic2346
Fix host scalars construction of nested types (#11612) @galipremsagar
Fix compile warning in nested_json_gpu.cu (#11607) @davidwendt
Change default value of ordered to False in CategoricalDtype (#11604) @galipremsagar
Preserve order if necessary when deduping categoricals internally (#11597) @brandon-b-miller
Add is_timestamp test for leap second (60) (#11594) @davidwendt
Fix an issue with to_arrow when column name type is not a string (#11590) @galipremsagar
Fix exception in segmented-reduce benchmark (#11588) @davidwendt
Fix encode/decode of negative timestamps in ORC reader/writer (#11586) @vuule
Correct distribution data type in quantiles benchmark (#11584) @vuule
Fix multibyte_split benchmark for host buffers (#11583) @upsj
xfail custreamz display test for now (#11567) @shwina
Fix JNI for TableWithMeta to use schema_info instead of column_names (#11566) @jlowe
Reduce code duplication for dask & distributed nightly/stable installs (#11565) @galipremsagar
Fix groupby failures in dask_cudf CI (#11561) @rjzamora
Fix for pivot: error when 'values' is a multicharacter string (#11538) @shaswat-indian
find_package(cudf) + arrow9 usable with cudf build directory (#11535) @robertmaynard
Fixing crash when writing binary nested data in parquet (#11526) @hyperbolic2346
Fix for: error when assigning a value to an empty series (#11523) @shaswat-indian
Fix invalid results from conditional-left-anti-join in debug build (#11517) @davidwendt
Fix cmake error after upgrading to Arrow 9 (#11513) @ttnghia
Fix reverse binary operators acting on a host value and cudf.Scalar (#11512) @bdice
Update parquet fuzz tests to drop support for skiprows & num_rows (#11505) @galipremsagar
Use rapids-cmake 22.10 best practice for RAPIDS.cmake location (#11493) @robertmaynard
Handle some zero-sized corner cases in dlpack interop (#11449) @wence-
Return empty dataframe when reading an ORC file using empty columns option (#11446) @vuule
libcudf c++ example updated to CPM version 0.35.3 (#11417) @robertmaynard
Fix regex quantifier check to include capture groups (#11373) @davidwendt
Fix read_text when byte_range is aligned with field (#11371) @upsj
Fix to_timestamps truncated subsecond calculation (#11367) @davidwendt
column: calculate null_count before release()ing the cudf::column (#11365) @wence-

📖 Documentation

Update guide-to-udfs notebook (#11861) @brandon-b-miller
Update docstring for cudf.read_text (#11799) @GregoryKimball
Add doc section for list & struct handling (#11770) @galipremsagar
Document that minimum required CMake version is now 3.23.1 (#11751) @robertmaynard
Update libcudf documentation build command in DOCUMENTATION.md (#11735) @davidwendt
Add docs for use of string data to DataFrame.apply and Series.apply and update guide to UDFs notebook (#11733) @brandon-b-miller
Enable more Pydocstyle rules (#11582) @bdice
Remove unused cpp/img folder (#11554) @davidwendt
Publish C++ developer docs (#11475) @vyasr
Fix a misalignment in cudf.get_dummies docstring (#11443) @galipremsagar
Update contributing doc to include links to the developer guides (#11390) @davidwendt
Fix table_view_base doxygen format (#11340) @davidwendt
Create main developer guide for Python (#11235) @vyasr
Add developer documentation for benchmarking (#11122) @vyasr
cuDF error handling document (#7917) @isVoid

🚀 New Features

Add hasNull statistic reading ability to ORC (#11747) @devavret
Add istitle to string UDFs (#11738) @brandon-b-miller
JSON Column creation in GPU (#11714) @karthikeyann
Adds option to take explicit nested schema for nested JSON reader (#11682) @elstehle
Add BGZIP data_chunk_reader (#11652) @upsj
Support DECIMAL order-by for RANGE window functions (#11645) @mythrocks
changing version of cmake to 3.23.3 (#11619) @hyperbolic2346
Generate unique keys table in java JNI contiguousSplitGroups (#11614) @res-life
Generic type casting to support the new nested JSON reader (#11613) @elstehle
JSON tree traversal (#11610) @karthikeyann
Add casting operators to masked UDFs (#11578) @brandon-b-miller
Adds type inference and type conversion for leaf-columns to the nested JSON parser (#11574) @elstehle
Add strings 'like' function (#11558) @davidwendt
Handle hyphen as literal for regex cclass when incomplete range (#11557) @davidwendt
Enable ZSTD compression in ORC and Parquet writers (#11551) @vuule
Adds support for json lines format to the nested JSON reader (#11534) @elstehle
Adding optional parquet reader schema (#11524) @hyperbolic2346
Adds GPU implementation of JSON-token-stream to JSON-tree (#11518) @karthikeyann
Add gdb pretty-printers for simple types (#11499) @upsj
Add create_random_column function to the data generator (#11490) @vuule
Add fluent API builder to data_profile (#11479) @vuule
Adds Nested Json benchmark (#11466) @karthikeyann
Convert thrust::optional usages to std::optional (#11455) @robertmaynard
Python API for the future experimental JSON reader (#11426) @vuule
Return schema info from JSON reader (#11419) @vuule
Add regex ASCII flag support for matching builtin character classes (#11404) @davidwendt
Truncate parquet column indexes (#11403) @etseidl
Adds the end-to-end JSON parser implementation (#11388) @elstehle
Use the new JSON parser when the experimental reader is selected (#11364) @vuule
Add placeholder for the experimental JSON reader (#11334) @vuule
Add read-only functions on string dtypes to DataFrame.apply and Series.apply (#11319) @brandon-b-miller
Added 'crosstab' and 'pivot_table' features (#11314) @shaswat-indian
Quickly error out when trying to build with unsupported nvcc versions (#11297) @robertmaynard
Adds JSON tokenizer (#11264) @elstehle
List lexicographic comparator (#11129) @devavret
Add generic type inference for cuIO (#11121) @PointKernel
Fully support nested types in cudf::contains (#10656) @ttnghia
Support nested types in lists::contains (#10548) @ttnghia

🛠️ Improvements

Pin dask and distributed for release (#11822) @galipremsagar
Add examples for Nested JSON reader (#11814) @GregoryKimball
Support shuffle-based groupby aggregations in dask_cudf (#11800) @rjzamora
Update strings udf version updater script (#11772) @galipremsagar
Remove kwargs in read_csv & to_csv (#11762) @galipremsagar
Pass dtype param to avoid pd.Series warnings (#11761) @galipremsagar
Enable schema_element & keep_quotes support in json reader (#11746) @galipremsagar
Add ability to construct ListColumn when size is None (#11745) @galipremsagar
Reduces memory requirements in JSON parser and adds bytes/s and peak memory usage to benchmarks (#11732) @elstehle
Add missing copyright headers. (#11712) @bdice
Fix copyright check issues in pre-commit (#11711) @bdice
Include decimal in supported types for range window order-by columns (#11710) @mythrocks
Disable very large column gtest for contiguous-split (#11706) @davidwendt
Drop split_out=None test from groupby.agg (#11704) @wence-
Use CubinLinker for CUDA Minor Version Compatibility (#11701) @gmarkall
Add regex capture-group parameter to auto convert to non-capture groups (#11695) @davidwendt
Add a __dataframe__ method to the protocol dataframe object (#11692) @rgommers
Special-case multibyte_split for single-byte delimiter (#11681) @upsj
Remove isort exclusions (#11680) @bdice
Refactor CSV reader benchmarks with nvbench (#11678) @PointKernel
Check conda recipe headers with pre-commit (#11669) @bdice
Remove redundant style check for clang-format. (#11668) @bdice
Add support for group_keys in groupby (#11659) @galipremsagar
Fix pandoc pinning. (#11658) @bdice
Revert removal of skip_rows / num_rows options from the Parquet reader. (#11657) @nvdbaranec
Update git metadata (#11647) @bdice
Call set_null_count on a returning column if null-count is known (#11646) @davidwendt
Fix some libcudf detail calls not passing the stream variable (#11642) @davidwendt
Update to mypy 0.971 (#11640) @wence-
Refactor strings strip functor to details header (#11635) @davidwendt
Fix incorrect nullCount in get_json_object (#11633) @trxcllnt
Simplify hostdevice_vector (#11631) @upsj
Refactor parquet writer benchmarks with nvbench (#11623) @PointKernel
Rework contains_scalar to check nulls at runtime (#11622) @davidwendt
Fix incorrect memory resource used in rolling temp columns (#11618) @mythrocks
Upgrade pandas to 1.5 (#11617) @galipremsagar
Move type-dispatcher calls from traits.hpp to traits.cpp (#11616) @davidwendt
Refactor parquet reader benchmarks with nvbench (#11611) @PointKernel
Forward-merge branch-22.08 to branch-22.10 (#11608) @bdice
Use stream in Java API. (#11601) @bdice
Refactors of public/detail APIs, CUDF_FUNC_RANGE, stream handling. (#11600) @bdice
Improve ORC writer benchmark with nvbench (#11598) @PointKernel
Tune multibyte_split kernel (#11587) @upsj
Move split_utils.cuh to strings/detail (#11585) @davidwendt
Fix warnings due to compiler regression with if constexpr (#11581) @ttnghia
Add full 24-bit dictionary support to Parquet writer (#11580) @etseidl
Expose "explicit-comms" option in shuffle-based dask_cudf functions (#11576) @rjzamora
Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
Refactor dask_cudf groupby to use apply_concat_apply (#11571) @rjzamora
Add ability to write list(struct) columns as map type in orc writer (#11568) @galipremsagar
Add byte_range to multibyte_split benchmark + NVBench refactor (#11562) @upsj
JNI support for writing binary columns in parquet (#11556) @revans2
Support additional dictionary bit widths in Parquet writer (#11547) @etseidl
Refactor string/numeric conversion utilities (#11545) @davidwendt
Removing unnecessary asserts in parquet tests (#11544) @hyperbolic2346
Clean up ORC reader benchmarks with NVBench (#11543) @PointKernel
Reuse MurmurHash3_32 in Parquet page data. (#11528) @bdice
Add hexadecimal value separators (#11527) @bdice
Deprecate skiprows and num_rows in read_orc (#11522) @galipremsagar
Struct support for NULL_EQUALS binary operation (#11520) @rwlee
Bump hadoop-common from 3.2.3 to 3.2.4 in /java (#11516) @dependabot[bot]
Fix Feather test warning. (#11511) @bdice
copy_range ballot_syncs to have no execution dependency (#11508) @robertmaynard
Upgrade to arrow-9.x (#11507) @galipremsagar
Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
Single-pass multibyte_split (#11500) @upsj
Sanitize percentile_approx() output for empty input (#11498) @SrikarVanavasam
Unpin dask and distributed for development (#11492) @galipremsagar
Move SparkMurmurHash3_32 functor. (#11489) @bdice
Refactor group_nunique.cu to use nullate::DYNAMIC for reduce-by-key functor (#11482) @davidwendt
Drop support for skiprows and num_rows in cudf.read_parquet (#11480) @galipremsagar
Add reduction distinct_count benchmark (#11473) @ttnghia
Add groupby nunique aggregation benchmark (#11472) @ttnghia
Disable Arrow S3 support by default. (#11470) @bdice
Add groupby max aggregation benchmark (#11464) @ttnghia
Extract Dremel encoding code from Parquet (#11461) @vyasr
Add missing Thrust #includes. (#11457) @bdice
Make CMake hooks verbose (#11456) @vyasr
Control Parquet page size through Python API (#11454) @etseidl
Add control of Parquet column index creation to python (#11453) @etseidl
Remove unused is_struct trait. (#11450) @bdice
Refactor the Buffer class (#11447) @madsbk
Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
Update to Thrust 1.17.0 (#11437) @bdice
Add in JNI for parsing JSON data and getting the metadata back too. (#11431) @revans2
Convert byte_array_view to use std::byte (#11424) @hyperbolic2346
Deprecate unflatten_nested_columns (#11421) @SrikarVanavasam
Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
Add Spark list hashing Java tests (#11379) @bdice
Move cmake to the build section. (#11376) @vyasr
Remove use of CUDA driver API calls from libcudf (#11370) @shwina
Add column constructor from device_uvector&& (#11356) @SrikarVanavasam
Remove unused custreamz thirdparty directory (#11343) @vyasr
Update jni version to 22.10.0-SNAPSHOT (#11338) @pxLi
Enable using upstream jitify2 (#11287) @shwina
Cache cudf.Scalar (#11246) @shwina
Remove deprecated Series.applymap. (#11031) @bdice
Remove deprecated expand parameter from str.findall. (#11030) @bdice

cudf - v22.10.00

Published by GPUtester about 2 years ago

🚨 Breaking Changes

Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
Disable nvCOMP DEFLATE integration (#11811) @vuule
Fix return type of Index.isna & Index.notna (#11769) @galipremsagar
Remove kwargs in read_csv & to_csv (#11762) @galipremsagar
Fix cudf::partition* APIs that do not return offsets for empty output table (#11709) @ttnghia
Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
Update zfill to match Python output (#11634) @davidwendt
Upgrade pandas to 1.5 (#11617) @galipremsagar
Change default value of ordered to False in CategoricalDtype (#11604) @galipremsagar
Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
Adding optional parquet reader schema (#11524) @hyperbolic2346
Deprecate skiprows and num_rows in read_orc (#11522) @galipremsagar
Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
Drop support for skiprows and num_rows in cudf.read_parquet (#11480) @galipremsagar
Disable Arrow S3 support by default. (#11470) @bdice
Convert thrust::optional usages to std::optional (#11455) @robertmaynard
Remove unused is_struct trait. (#11450) @bdice
Refactor the Buffer class (#11447) @madsbk
Return empty dataframe when reading an ORC file using empty columns option (#11446) @vuule
Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
Use the new JSON parser when the experimental reader is selected (#11364) @vuule
Remove deprecated Series.applymap. (#11031) @bdice
Remove deprecated expand parameter from str.findall. (#11030) @bdice

🐛 Bug Fixes

Fixes bug in temporary decompression space estimation before calling nvcomp (#11879) @abellina
Handle ptx file paths during strings_udf import (#11862) @galipremsagar
Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
Reset strings_udf CEC and solve several related issues (#11846) @brandon-b-miller
Fix bug in new shuffle-based groupby implementation (#11836) @rjzamora
Fix is_valid checks in Scalar._binaryop (#11818) @wence-
Fix operator NotImplemented issue with numpy (#11816) @galipremsagar
Disable nvCOMP DEFLATE integration (#11811) @vuule
Build strings_udf package with other python packages in nightlies (#11808) @brandon-b-miller
Revert problematic shuffle=explicit-comms changes (#11803) @rjzamora
Fix regex out-of-bounds write in strided rows logic (#11797) @davidwendt
Build cudf locally before building strings_udf conda packages in CI (#11785) @brandon-b-miller
Fix an issue in cudf::row_bit_count involving structs and lists at multiple levels. (#11779) @nvdbaranec
Fix return type of Index.isna & Index.notna (#11769) @galipremsagar
Fix issue with set-item in case of list and struct types (#11760) @galipremsagar
Ensure all libcudf APIs run on cudf's default stream (#11759) @vyasr
Resolve dask_cudf failures caused by upstream groupby changes (#11755) @rjzamora
Fix ORC string sum statistics (#11740) @vuule
Add strings_udf package for python 3.9 (#11730) @brandon-b-miller
Ensure that all tests launch kernels on cudf's default stream (#11726) @vyasr
Don't assume stream is a compile-time constant expression (#11725) @vyasr
Fix get_thrust.cmake format at patch command (#11715) @davidwendt
Fix cudf::partition* APIs that do not return offsets for empty output table (#11709) @ttnghia
Fix cudf::lists::sort_lists for NaN and Infinity values (#11703) @davidwendt
Modify ORC reader timestamp parsing to match the apache reader behavior (#11699) @vuule
Fix DataFrame.from_arrow to preserve type metadata (#11698) @galipremsagar
Fix compile error due to missing header (#11697) @ttnghia
Default to Snappy compression in to_orc when using cuDF or Dask (#11690) @vuule
Fix an issue related to MultiIndex when group_keys=True (#11689) @galipremsagar
Transfer correct dtype to exploded column (#11687) @wence-
Ignore protobuf generated files in mypy checks (#11685) @galipremsagar
Maintain the index name after .loc (#11677) @shwina
Fix issue with extracting nested column data & dtype preservation (#11671) @galipremsagar
Ensure that all cudf tests and benchmarks are conda env aware (#11666) @robertmaynard
Update to Thrust 1.17.2 to fix cub ODR issues (#11665) @robertmaynard
Fix multi-file remote datasource bug (#11655) @rjzamora
Fix invalid regex quantifier check to not include alternation (#11654) @davidwendt
Fix bug in device_write(): it uses an incorrect size (#11651) @madsbk
Fix overflows in benchmarks (#11649) @elstehle
Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
Fix compile error in benchmark nested_json.cpp (#11637) @davidwendt
Update zfill to match Python output (#11634) @davidwendt
Removed converted type for INT32 and INT64 since they do not convert (#11627) @hyperbolic2346
Fix host scalars construction of nested types (#11612) @galipremsagar
Fix compile warning in nested_json_gpu.cu (#11607) @davidwendt
Change default value of ordered to False in CategoricalDtype (#11604) @galipremsagar
Preserve order if necessary when deduping categoricals internally (#11597) @brandon-b-miller
Add is_timestamp test for leap second (60) (#11594) @davidwendt
Fix an issue with to_arrow when column name type is not a string (#11590) @galipremsagar
Fix exception in segmented-reduce benchmark (#11588) @davidwendt
Fix encode/decode of negative timestamps in ORC reader/writer (#11586) @vuule
Correct distribution data type in quantiles benchmark (#11584) @vuule
Fix multibyte_split benchmark for host buffers (#11583) @upsj
xfail custreamz display test for now (#11567) @shwina
Fix JNI for TableWithMeta to use schema_info instead of column_names (#11566) @jlowe
Reduce code duplication for dask & distributed nightly/stable installs (#11565) @galipremsagar
Fix groupby failures in dask_cudf CI (#11561) @rjzamora
Fix for pivot: error when 'values' is a multicharacter string (#11538) @shaswat-indian
find_package(cudf) + arrow9 usable with cudf build directory (#11535) @robertmaynard
Fixing crash when writing binary nested data in parquet (#11526) @hyperbolic2346
Fix for: error when assigning a value to an empty series (#11523) @shaswat-indian
Fix invalid results from conditional-left-anti-join in debug build (#11517) @davidwendt
Fix cmake error after upgrading to Arrow 9 (#11513) @ttnghia
Fix reverse binary operators acting on a host value and cudf.Scalar (#11512) @bdice
Update parquet fuzz tests to drop support for skiprows & num_rows (#11505) @galipremsagar
Use rapids-cmake 22.10 best practice for RAPIDS.cmake location (#11493) @robertmaynard
Handle some zero-sized corner cases in dlpack interop (#11449) @wence-
Return empty dataframe when reading an ORC file using empty columns option (#11446) @vuule
libcudf c++ example updated to CPM version 0.35.3 (#11417) @robertmaynard
Fix regex quantifier check to include capture groups (#11373) @davidwendt
Fix read_text when byte_range is aligned with field (#11371) @upsj
Fix to_timestamps truncated subsecond calculation (#11367) @davidwendt
column: calculate null_count before release()ing the cudf::column (#11365) @wence-
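
As context for the zfill entry above (#11634): Python's built-in `str.zfill` pads with zeros after a leading sign and leaves strings that are already long enough unchanged. The sketch below uses plain Python strings only and does not call cudf. That cudf's string zfill now matches this behaviour in every case is an assumption based on the PR title, not something the release notes confirm.

```python
# Reference behaviour of Python's built-in str.zfill (no cudf involved).
# The sample values are illustrative and are not taken from the PR.
samples = ["42", "-42", "+7", "abc", "12345"]

# Pad every sample to a width of 5 characters.
padded = [s.zfill(5) for s in samples]

for original, result in zip(samples, padded):
    print(f"{original!r:>9} -> {result!r}")
```

The sign stays in front of the padding ("-42" becomes "-0042"), which is the detail most likely to differ between implementations.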

📖 Documentation

Update guide-to-udfs notebook (#11861) @brandon-b-miller
Update docstring for cudf.read_text (#11799) @GregoryKimball
Add doc section for list & struct handling (#11770) @galipremsagar
Document that minimum required CMake version is now 3.23.1 (#11751) @robertmaynard
Update libcudf documentation build command in DOCUMENTATION.md (#11735) @davidwendt
Add docs for use of string data to DataFrame.apply and Series.apply and update guide to UDFs notebook (#11733) @brandon-b-miller
Enable more Pydocstyle rules (#11582) @bdice
Remove unused cpp/img folder (#11554) @davidwendt
Publish C++ developer docs (#11475) @vyasr
Fix a misalignment in cudf.get_dummies docstring (#11443) @galipremsagar
Update contributing doc to include links to the developer guides (#11390) @davidwendt
Fix table_view_base doxygen format (#11340) @davidwendt
Create main developer guide for Python (#11235) @vyasr
Add developer documentation for benchmarking (#11122) @vyasr
cuDF error handling document (#7917) @isVoid

🚀 New Features

Add hasNull statistic reading ability to ORC (#11747) @devavret
Add istitle to string UDFs (#11738) @brandon-b-miller
JSON Column creation in GPU (#11714) @karthikeyann
Adds option to take explicit nested schema for nested JSON reader (#11682) @elstehle
Add BGZIP data_chunk_reader (#11652) @upsj
Support DECIMAL order-by for RANGE window functions (#11645) @mythrocks
changing version of cmake to 3.23.3 (#11619) @hyperbolic2346
Generate unique keys table in java JNI contiguousSplitGroups (#11614) @res-life
Generic type casting to support the new nested JSON reader (#11613) @elstehle
JSON tree traversal (#11610) @karthikeyann
Add casting operators to masked UDFs (#11578) @brandon-b-miller
Adds type inference and type conversion for leaf-columns to the nested JSON parser (#11574) @elstehle
Add strings 'like' function (#11558) @davidwendt
Handle hyphen as literal for regex cclass when incomplete range (#11557) @davidwendt
Enable ZSTD compression in ORC and Parquet writers (#11551) @vuule
Adds support for json lines format to the nested JSON reader (#11534) @elstehle
Adding optional parquet reader schema (#11524) @hyperbolic2346
Adds GPU implementation of JSON-token-stream to JSON-tree (#11518) @karthikeyann
Add gdb pretty-printers for simple types (#11499) @upsj
Add create_random_column function to the data generator (#11490) @vuule
Add fluent API builder to data_profile (#11479) @vuule
Adds Nested Json benchmark (#11466) @karthikeyann
Convert thrust::optional usages to std::optional (#11455) @robertmaynard
Python API for the future experimental JSON reader (#11426) @vuule
Return schema info from JSON reader (#11419) @vuule
Add regex ASCII flag support for matching builtin character classes (#11404) @davidwendt
Truncate parquet column indexes (#11403) @etseidl
Adds the end-to-end JSON parser implementation (#11388) @elstehle
Use the new JSON parser when the experimental reader is selected (#11364) @vuule
Add placeholder for the experimental JSON reader (#11334) @vuule
Add read-only functions on string dtypes to DataFrame.apply and Series.apply (#11319) @brandon-b-miller
Added 'crosstab' and 'pivot_table' features (#11314) @shaswat-indian
Quickly error out when trying to build with unsupported nvcc versions (#11297) @robertmaynard
Adds JSON tokenizer (#11264) @elstehle
List lexicographic comparator (#11129) @devavret
Add generic type inference for cuIO (#11121) @PointKernel
Fully support nested types in cudf::contains (#10656) @ttnghia
Support nested types in lists::contains (#10548) @ttnghia

🛠️ Improvements

Pin dask and distributed for release (#11822) @galipremsagar
Add examples for Nested JSON reader (#11814) @GregoryKimball
Support shuffle-based groupby aggregations in dask_cudf (#11800) @rjzamora
Update strings udf version updater script (#11772) @galipremsagar
Remove kwargs in read_csv & to_csv (#11762) @galipremsagar
Pass dtype param to avoid pd.Series warnings (#11761) @galipremsagar
Enable schema_element & keep_quotes support in json reader (#11746) @galipremsagar
Add ability to construct ListColumn when size is None (#11745) @galipremsagar
Reduces memory requirements in JSON parser and adds bytes/s and peak memory usage to benchmarks (#11732) @elstehle
Add missing copyright headers. (#11712) @bdice
Fix copyright check issues in pre-commit (#11711) @bdice
Include decimal in supported types for range window order-by columns (#11710) @mythrocks
Disable very large column gtest for contiguous-split (#11706) @davidwendt
Drop split_out=None test from groupby.agg (#11704) @wence-
Use CubinLinker for CUDA Minor Version Compatibility (#11701) @gmarkall
Add regex capture-group parameter to auto convert to non-capture groups (#11695) @davidwendt
Add a __dataframe__ method to the protocol dataframe object (#11692) @rgommers
Special-case multibyte_split for single-byte delimiter (#11681) @upsj
Remove isort exclusions (#11680) @bdice
Refactor CSV reader benchmarks with nvbench (#11678) @PointKernel
Check conda recipe headers with pre-commit (#11669) @bdice
Remove redundant style check for clang-format. (#11668) @bdice
Add support for group_keys in groupby (#11659) @galipremsagar
Fix pandoc pinning. (#11658) @bdice
Revert removal of skip_rows / num_rows options from the Parquet reader. (#11657) @nvdbaranec
Update git metadata (#11647) @bdice
Call set_null_count on a returning column if null-count is known (#11646) @davidwendt
Fix some libcudf detail calls not passing the stream variable (#11642) @davidwendt
Update to mypy 0.971 (#11640) @wence-
Refactor strings strip functor to details header (#11635) @davidwendt
Fix incorrect nullCount in get_json_object (#11633) @trxcllnt
Simplify hostdevice_vector (#11631) @upsj
Refactor parquet writer benchmarks with nvbench (#11623) @PointKernel
Rework contains_scalar to check nulls at runtime (#11622) @davidwendt
Fix incorrect memory resource used in rolling temp columns (#11618) @mythrocks
Upgrade pandas to 1.5 (#11617) @galipremsagar
Move type-dispatcher calls from traits.hpp to traits.cpp (#11616) @davidwendt
Refactor parquet reader benchmarks with nvbench (#11611) @PointKernel
Forward-merge branch-22.08 to branch-22.10 (#11608) @bdice
Use stream in Java API. (#11601) @bdice
Refactors of public/detail APIs, CUDF_FUNC_RANGE, stream handling. (#11600) @bdice
Improve ORC writer benchmark with nvbench (#11598) @PointKernel
Tune multibyte_split kernel (#11587) @upsj
Move split_utils.cuh to strings/detail (#11585) @davidwendt
Fix warnings due to compiler regression with if constexpr (#11581) @ttnghia
Add full 24-bit dictionary support to Parquet writer (#11580) @etseidl
Expose "explicit-comms" option in shuffle-based dask_cudf functions (#11576) @rjzamora
Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
Refactor dask_cudf groupby to use apply_concat_apply (#11571) @rjzamora
Add ability to write list(struct) columns as map type in orc writer (#11568) @galipremsagar
Add byte_range to multibyte_split benchmark + NVBench refactor (#11562) @upsj
JNI support for writing binary columns in parquet (#11556) @revans2
Support additional dictionary bit widths in Parquet writer (#11547) @etseidl
Refactor string/numeric conversion utilities (#11545) @davidwendt
Removing unnecessary asserts in parquet tests (#11544) @hyperbolic2346
Clean up ORC reader benchmarks with NVBench (#11543) @PointKernel
Reuse MurmurHash3_32 in Parquet page data. (#11528) @bdice
Add hexadecimal value separators (#11527) @bdice
Deprecate skiprows and num_rows in read_orc (#11522) @galipremsagar
Struct support for NULL_EQUALS binary operation (#11520) @rwlee
Bump hadoop-common from 3.2.3 to 3.2.4 in /java (#11516) @dependabot[bot]
Fix Feather test warning. (#11511) @bdice
copy_range ballot_syncs to have no execution dependency (#11508) @robertmaynard
Upgrade to arrow-9.x (#11507) @galipremsagar
Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
Single-pass multibyte_split (#11500) @upsj
Sanitize percentile_approx() output for empty input (#11498) @SrikarVanavasam
Unpin dask and distributed for development (#11492) @galipremsagar
Move SparkMurmurHash3_32 functor. (#11489) @bdice
Refactor group_nunique.cu to use nullate::DYNAMIC for reduce-by-key functor (#11482) @davidwendt
Drop support for skiprows and num_rows in cudf.read_parquet (#11480) @galipremsagar
Add reduction distinct_count benchmark (#11473) @ttnghia
Add groupby nunique aggregation benchmark (#11472) @ttnghia
Disable Arrow S3 support by default. (#11470) @bdice
Add groupby max aggregation benchmark (#11464) @ttnghia
Extract Dremel encoding code from Parquet (#11461) @vyasr
Add missing Thrust #includes. (#11457) @bdice
Make CMake hooks verbose (#11456) @vyasr
Control Parquet page size through Python API (#11454) @etseidl
Add control of Parquet column index creation to python (#11453) @etseidl
Remove unused is_struct trait. (#11450) @bdice
Refactor the Buffer class (#11447) @madsbk
Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
Update to Thrust 1.17.0 (#11437) @bdice
Add in JNI for parsing JSON data and getting the metadata back too. (#11431) @revans2
Convert byte_array_view to use std::byte (#11424) @hyperbolic2346
Deprecate unflatten_nested_columns (#11421) @SrikarVanavasam
Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
Add Spark list hashing Java tests (#11379) @bdice
Move cmake to the build section. (#11376) @vyasr
Remove use of CUDA driver API calls from libcudf (#11370) @shwina
Add column constructor from device_uvector&& (#11356) @SrikarVanavasam
Remove unused custreamz thirdparty directory (#11343) @vyasr
Update jni version to 22.10.0-SNAPSHOT (#11338) @pxLi
Enable using upstream jitify2 (#11287) @shwina
Cache cudf.Scalar (#11246) @shwina
Remove deprecated Series.applymap. (#11031) @bdice
Remove deprecated expand parameter from str.findall. (#11030) @bdice

cudf - v22.08.01

Published by GPUtester about 2 years ago

🚨 Breaking Changes

Pin numpy to <1.23 (#11824) @galipremsagar
Remove legacy join APIs (#11274) @vyasr
Remove lists::drop_list_duplicates (#11236) @ttnghia
Remove Index.replace API (#11131) @vyasr
Remove deprecated Index methods from Frame (#11073) @vyasr
Remove public API of cudf.merge_sorted. (#11032) @bdice
Drop python 3.7 in code-base (#11029) @galipremsagar
Return empty dataframe when reading a Parquet file using empty columns option (#11018) @vuule
Remove Arrow CUDA IPC code (#10995) @shwina
Buffer: make .ptr read-only (#10872) @madsbk

🐛 Bug Fixes

Fix out-of-bound access in cudf::detail::label_segments (#11497) @ttnghia
Fix distributed error related to loop_in_thread (#11428) @galipremsagar
Fix atomic operations on NaN values (#11420) @ttnghia
Relax arrow pinning to just 8.x and remove cuda build dependency from cudf recipe (#11412) @kkraus14
Revert "Allow CuPy 11" (#11409) @jakirkham
Fix moto timeouts (#11369) @galipremsagar
Set +/-infinity as the identity values for floating-point numbers in device operators min and max (#11357) @ttnghia
Fix memory_usage() for ListSeries (#11355) @thomcom
Fix constructing Column from column_view with expired mask (#11354) @shwina
Handle parquet corner case: Columns with more rows than are in the row group. (#11353) @nvdbaranec
Fix DatetimeIndex & TimedeltaIndex constructors (#11342) @galipremsagar
Fix unsigned-compare compile warning in IntPow binops (#11339) @davidwendt
Fix performance issue and add a new code path to cudf::detail::contains (#11330) @ttnghia
Pin pytorch to temporarily unblock from libcupti errors (#11289) @galipremsagar
Workaround for nvcomp zstd overwriting blocks for orc due to underestimate of sizes (#11288) @jbrennan333
Fix inconsistency when hashing two tables in cudf::detail::contains (#11284) @ttnghia
Fix issue related to numpy array and category dtype (#11282) @galipremsagar
Add NotImplementedError when on is specified in DataFrame.join. (#11275) @vyasr
Fix invalid allocate_like() and empty_like() tests. (#11268) @nvdbaranec
Returns DataFrame When Concating Along Axis 1 (#11263) @isVoid
Fix compile error due to missing header (#11257) @ttnghia
Fix a memory aliasing/crash issue in scatter for lists. (#11254) @nvdbaranec
Fix tests/rolling/empty_input_test (#11238) @ttnghia
Fix const qualifier when using host_span<bitmask_type const*> (#11220) @ttnghia
Avoid using nvcompBatchedDeflateDecompressGetTempSizeEx in cuIO (#11213) @vuule
Generate benchmark data with correct run length regardless of cardinality (#11205) @vuule
Fix cumulative count index behavior (#11188) @brandon-b-miller
Fix assertion in dask_cudf test_struct_explode (#11170) @rjzamora
Provides a method for the user to remove the hook and re-register the hook in a custom shutdown hook manager (#11161) @res-life
Fix compatibility issues with pandas 1.4.3 (#11152) @vyasr
Ensure cuco export set is installed in cmake build (#11147) @jlowe
Avoid redundant deepcopy in cudf.from_pandas (#11142) @galipremsagar
Fix compile error due to missing header (#11126) @ttnghia
Fix __cuda_array_interface__ failures (#11113) @galipremsagar
Support octal and hex within regex character class pattern (#11112) @davidwendt
Fix split_re matching logic for word boundaries (#11106) @davidwendt
Handle multiple files metadata in read_parquet (#11105) @galipremsagar
Fix index alignment for Series objects with repeated index (#11103) @shwina
FindcuFile now searches in the current CUDA Toolkit location (#11101) @robertmaynard
Fix regex word boundary logic to include underline (#11099) @davidwendt
Exclude CudaFatalTest when selecting all Java tests (#11083) @jlowe
Fix duplicate cudatoolkit pinning issue (#11070) @galipremsagar
Maintain the input index in the result of a groupby-transform (#11068) @shwina
Fix bug with row count comparison for expect_columns_equivalent(). (#11059) @nvdbaranec
Fix BPE uninitialized size value for null and empty input strings (#11054) @davidwendt
Include missing header for usage of get_current_device_resource() (#11047) @AtlantaPepsi
Fix warn_unused_result error in parquet test (#11026) @karthikeyann
Return empty dataframe when reading a Parquet file using empty columns option (#11018) @vuule
Fix small error in page row count limiting (#10991) @etseidl
Fix a row index entry error in ORC writer issue (#10989) @vuule
Fix grouped covariance to require both values to be convertible to double. (#10891) @bdice
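
As context for the regex word-boundary entry above (#11099): in Python's `re` module the underscore counts as a word character, so `\b` does not match between a letter and an underscore. The sketch below shows that convention with the standard library only. That the cudf fix targets exactly this behaviour is an assumption based on the PR title, and the sample text is made up for illustration.

```python
import re

# Hypothetical sample text: "foo" appears next to an underscore twice
# and once as a standalone word.
text = "foo_bar foo bar_foo"

# \b requires a transition between a word and a non-word character.
# Because "_" is a word character, only the standalone "foo" matches.
starts = [m.start() for m in re.finditer(r"\bfoo\b", text)]
print(starts)
```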

📖 Documentation

Defer loading of custom.js (#11465) @galipremsagar
Fix issues with day & night modes in python docs (#11400) @galipremsagar
Update missing data handling APIs in docs (#11345) @galipremsagar
Add lists filtering APIs to doxygen group. (#11336) @bdice
Remove unused import in README sample (#11318) @vyasr
Note null behavior in where docs (#11276) @brandon-b-miller
Update docstring for spans in get_row_data_range (#11271) @vyasr
Update nvCOMP integration table (#11231) @vuule
Add dev docs for documentation writing (#11217) @vyasr
Documentation fix for concatenate (#11187) @dagardner-nv
Fix unresolved links in markdown (#11173) @karthikeyann
Fix cudf version in README.md install commands (#11164) @jvanstraten
Switch language from None to "en" in docs build (#11133) @galipremsagar
Remove docs mentioning scalar_view since no such class exists. (#11132) @bdice
Add docstring entry for DataFrame.value_counts (#11039) @galipremsagar
Add docs to rolling var, std, count. (#11035) @bdice
Fix docs for Numba UDFs. (#11020) @bdice
Replace column comparison utilities functions with macros (#11007) @karthikeyann
Fix Doxygen warnings in multiple headers files (#11003) @karthikeyann
Fix doxygen warnings in utilities/ headers (#10974) @karthikeyann
Fix Doxygen warnings in table header files (#10964) @karthikeyann
Fix Doxygen warnings in column header files (#10963) @karthikeyann
Fix Doxygen warnings in strings / header files (#10937) @karthikeyann
Generate Doxygen Tag File for Libcudf (#10932) @isVoid
Fix doxygen warnings in structs, lists headers (#10923) @karthikeyann
Fix doxygen warnings in fixed_point.hpp (#10922) @karthikeyann
Fix doxygen warnings in ast/, rolling, tdigest/, wrappers/, dictionary/ headers (#10921) @karthikeyann
fix doxygen warnings in cudf/io/types.hpp, other header files (#10913) @karthikeyann
fix doxygen warnings in cudf/io/ avro, csv, json, orc, parquet header files (#10912) @karthikeyann
Fix doxygen warnings in cudf/*.hpp (#10896) @karthikeyann
Add missing documentation in aggregation.hpp (#10887) @karthikeyann
Revise PR template. (#10774) @bdice

🚀 New Features

Change cmake to allow controlling Arrow version via cmake variable (#11429) @kkraus14
Adding support for list<int8> columns to be written as byte arrays in parquet (#11328) @hyperbolic2346
Adding byte array view structure (#11322) @hyperbolic2346
Adding byte_array statistics (#11303) @hyperbolic2346
Add column indexes to Parquet writer (#11302) @etseidl
Provide an Option for Default Integer and Floating Bitwidth (#11272) @isVoid
FST benchmark (#11243) @karthikeyann
Adds the Finite-State Transducer algorithm (#11242) @elstehle
Refactor collect_set to use cudf::distinct and cudf::lists::distinct (#11228) @ttnghia
Treat zstd as stable in nvcomp releases 2.3.2 and later (#11226) @jbrennan333
Add 24 bit dictionary support to Parquet writer (#11216) @devavret
Enable positive group indices for extractAllRecord on JNI (#11215) @anthony-chang
JNI bindings for NTH_ELEMENT window aggregation (#11201) @mythrocks
Add JNI bindings for extractAllRecord (#11196) @anthony-chang
Add cudf.options (#11193) @isVoid
Add thrift support for parquet column and offset indexes (#11178) @etseidl
Adding binary read/write as options for parquet (#11160) @hyperbolic2346
Support nth_element for window functions (#11158) @mythrocks
Implement lists::distinct and cudf::detail::stable_distinct (#11149) @ttnghia
Implement Groupby pct_change (#11144) @skirui-source
Add JNI for set operations (#11143) @ttnghia
Remove deprecated PER_THREAD_DEFAULT_STREAM (#11134) @jbrennan333
Added a Java method to check the existence of a list of keys in a map (#11128) @razajafri
Feature/python benchmarking (#11125) @vyasr
Support nan_equality in cudf::distinct (#11118) @ttnghia
Added JNI for getMapValueForKeys (#11104) @razajafri
Refactor semi_anti_join (#11100) @ttnghia
Replace remaining instances of rmm::cuda_stream_default with cudf::default_stream_value (#11082) @jbrennan333
Adds the Logical Stack algorithm (#11078) @elstehle
Add doxygen-check pre-commit hook (#11076) @karthikeyann
Use new nvCOMP API to optimize the decompression temp memory size (#11064) @vuule
Add Doxygen CI check (#11057) @karthikeyann
Support duplicate_keep_option in cudf::distinct (#11052) @ttnghia
Support set operations (#11043) @ttnghia
Support for ZLIB compression in ORC writer (#11036) @vuule
Adding feature swaplevels (#11027) @VamsiTallam95
Use nvCOMP for ZLIB decompression in ORC reader (#11024) @vuule
Function for bfill, ffill #9591 (#11022) @Sreekiran096
Generate group offsets from element labels (#11017) @ttnghia
Feature axes (#10979) @VamsiTallam95
Generate group labels from offsets (#10945) @ttnghia
Add missing cuIO benchmark coverage for duration types (#10933) @vuule
Dask-cuDF cumulative groupby ops (#10889) @brandon-b-miller
Reindex Improvements (#10815) @brandon-b-miller
Implement value_counts for DataFrame (#10813) @martinfalisse

🛠️ Improvements

Pin numpy to <1.23 (#11824) @galipremsagar
Make Index Join Tests on Default Precisions Deterministic (#11451) @isVoid
Pin dask & distributed for release (#11433) @galipremsagar
Use documented header template for doxygen (#11430) @galipremsagar
Relax arrow version in dev env (#11418) @galipremsagar
Added Java bindings for Parquet options for binary read (#11410) @razajafri
Allow CuPy 11 (#11393) @jakirkham
Improve multibyte_split performance (#11347) @cwharris
Switch death test to use explicit trap. (#11326) @vyasr
Add --output-on-failure to ctest args. (#11321) @vyasr
Consolidate remaining DataFrame/Series APIs (#11315) @vyasr
Add JNI support for the join_strings API (#11309) @revans2
Add cupy version to setup.py install_requires (#11306) @vyasr
removing some unused code (#11305) @hyperbolic2346
Add test of wildcard selection (#11300) @vyasr
Update parquet reader to take stream parameter (#11294) @PointKernel
Spark list hashing (#11292) @bdice
Remove legacy join APIs (#11274) @vyasr
Fix cudf recipes syntax (#11273) @ajschmidt8
Fix cudf recipe (#11267) @ajschmidt8
Cleanup config files (#11266) @vyasr
Run mypy on all packages (#11265) @vyasr
Update to isort 5.10.1. (#11262) @vyasr
Consolidate flake8 and pydocstyle configuration (#11260) @vyasr
Remove redundant black config specifications. (#11258) @vyasr
Ensure DeprecationWarnings are not introduced via pre-commit (#11255) @wence-
Optimization to gpu::PreprocessColumnData in parquet reader. (#11252) @nvdbaranec
Move rolling impl details to detail/ directory. (#11250) @mythrocks
Remove lists::drop_list_duplicates (#11236) @ttnghia
Use cudf::lists::distinct in Python binding (#11234) @ttnghia
Use cudf::lists::distinct in Java binding (#11233) @ttnghia
Use cudf::distinct in Java binding (#11232) @ttnghia
Pin dask-cuda in dev environment (#11229) @galipremsagar
Remove cruft in map_lookup (#11221) @mythrocks
Deprecate skiprows & num_rows in parquet reader (#11218) @galipremsagar
Remove Frame._index (#11210) @vyasr
Improve performance for cudf::contains when searching for a scalar (#11202) @ttnghia
Document why Development component is needed for CMake. (#11200) @vyasr
cleanup unused code in rolling_test.hpp (#11195) @karthikeyann
Standardize join internals around DataFrame (#11184) @vyasr
Move character case table declarations from src to detail (#11183) @davidwendt
Remove usage of Frame in StringMethods (#11181) @vyasr
Expose get_json_object_options to Python (#11180) @SrikarVanavasam
Fix decimal128 stats in parquet writer (#11179) @etseidl
Modify CheckPageRows in parquet_test to use datasources (#11177) @etseidl
Pin max version of cuda-python to 11.7.0 (#11174) @Ethyling
Refactor and optimize Frame.where (#11168) @vyasr
Add npos const static member to cudf::string_view (#11166) @davidwendt
Move _drop_rows_by_label from Frame to IndexedFrame (#11157) @vyasr
Clean up _copy_type_metadata (#11156) @vyasr
Add nvcc conda package in dev environment (#11154) @galipremsagar
Struct binary comparison op functionality for spark rapids (#11153) @rwlee
Refactor inline conditionals. (#11151) @bdice
Refactor Spark hashing tests (#11145) @bdice
Add new _from_data_like_self factory (#11140) @vyasr
Update get_cucollections to use rapids-cmake (#11139) @vyasr
Remove unnecessary extra function for libcudacxx detection (#11138) @vyasr
Allow initial value for cudf::reduce and cudf::segmented_reduce. (#11137) @SrikarVanavasam
Remove Index.replace API (#11131) @vyasr
Move char-type table function declarations from src to detail (#11127) @davidwendt
Clean up repo root (#11124) @bdice
Improve print formatting of strings containing newline characters. (#11108) @nvdbaranec
Fix cudf::string_view::find() to return pos for empty string argument (#11107) @davidwendt
Forward-merge branch-22.06 to branch-22.08 (#11086) @bdice
Take iterators by value in clamp.cu. (#11084) @bdice
Performance improvements for row to column conversions (#11075) @hyperbolic2346
Remove deprecated Index methods from Frame (#11073) @vyasr
Use per-page max compressed size estimate for compression (#11066) @devavret
column to row refactor for performance (#11063) @hyperbolic2346
Include skbuild directory into build.sh clean operation (#11060) @galipremsagar
Unpin dask & distributed for development (#11058) @galipremsagar
Add support for Series.between (#11051) @galipremsagar
Fix groupby include (#11046) @bwyogatama
Regex cleanup internal reclass and reclass_device classes (#11045) @davidwendt
Remove public API of cudf.merge_sorted. (#11032) @bdice
Drop python 3.7 in code-base (#11029) @galipremsagar
Addition & integration of the integer power operator (#11025) @AtlantaPepsi
Refactor lists::contains (#11019) @ttnghia
Change build.sh to find C++ library by default and avoid shadowing CMAKE_ARGS (#11013) @vyasr
Clean up parquet unit test (#11005) @PointKernel
Add missing #pragma once to header files (#11004) @karthikeyann
Cleanup iterator.cuh and add fixed point support for scalar_optional_accessor (#10999) @ttnghia
Refactor cudf::contains (#10997) @ttnghia
Remove Arrow CUDA IPC code (#10995) @shwina
Change file extension for groupby benchmark (#10985) @ttnghia
Sort recipe include checks. (#10984) @bdice
Update cuCollections for thrust upgrade (#10983) @PointKernel
Expose row-group size options in cudf ParquetWriter (#10980) @rjzamora
Cleanup cudf::strings::detail::regex_parser class source (#10975) @davidwendt
Handle missing fields as nulls in get_json_object() (#10970) @SrikarVanavasam
Fix license families to match all-caps expected by conda-verify. (#10931) @bdice
Include <optional> for GCC 11 compatibility. (#10927) @bdice
Enable builds with scikit-build (#10919) @vyasr
Improve distinct by using cuco::static_map::retrieve_all (#10916) @PointKernel
update cudfjni to 22.08.0-SNAPSHOT (#10910) @pxLi
Improve the capture of fatal cuda error (#10884) @sperlingxx
Cleanup regex compiler operators and operands source (#10879) @davidwendt
Buffer: make .ptr read-only (#10872) @madsbk
Configurable NaN handling in device_row_comparators (#10870) @rwlee
Register cudf.core.groupby.Grouper objects to dask grouper_dispatch (#10838) @brandon-b-miller
Upgrade to arrow-8 (#10816) @galipremsagar
Remove getattr method in RangeIndex class (#10538) @skirui-source
Adding bins to value counts (#8247) @marlenezw

cudf - v22.08.00

Published by GPUtester about 2 years ago

🚨 Breaking Changes

Remove legacy join APIs (#11274) @vyasr
Remove lists::drop_list_duplicates (#11236) @ttnghia
Remove Index.replace API (#11131) @vyasr
Remove deprecated Index methods from Frame (#11073) @vyasr
Remove public API of cudf.merge_sorted. (#11032) @bdice
Drop python 3.7 in code-base (#11029) @galipremsagar
Return empty dataframe when reading a Parquet file using empty columns option (#11018) @vuule
Remove Arrow CUDA IPC code (#10995) @shwina
Buffer: make .ptr read-only (#10872) @madsbk

🐛 Bug Fixes

Fix distributed error related to loop_in_thread (#11428) @galipremsagar
Relax arrow pinning to just 8.x and remove cuda build dependency from cudf recipe (#11412) @kkraus14
Revert "Allow CuPy 11" (#11409) @jakirkham
Fix moto timeouts (#11369) @galipremsagar
Set +/-infinity as the identity values for floating-point numbers in device operators min and max (#11357) @ttnghia
Fix memory_usage() for ListSeries (#11355) @thomcom
Fix constructing Column from column_view with expired mask (#11354) @shwina
Handle parquet corner case: Columns with more rows than are in the row group. (#11353) @nvdbaranec
Fix DatetimeIndex & TimedeltaIndex constructors (#11342) @galipremsagar
Fix unsigned-compare compile warning in IntPow binops (#11339) @davidwendt
Fix performance issue and add a new code path to cudf::detail::contains (#11330) @ttnghia
Pin pytorch to temporarily unblock from libcupti errors (#11289) @galipremsagar
Workaround for nvcomp zstd overwriting blocks for orc due to underestimate of sizes (#11288) @jbrennan333
Fix inconsistency when hashing two tables in cudf::detail::contains (#11284) @ttnghia
Fix issue related to numpy array and category dtype (#11282) @galipremsagar
Add NotImplementedError when on is specified in DataFrame.join. (#11275) @vyasr
Fix invalid allocate_like() and empty_like() tests. (#11268) @nvdbaranec
Returns DataFrame When Concating Along Axis 1 (#11263) @isVoid
Fix compile error due to missing header (#11257) @ttnghia
Fix a memory aliasing/crash issue in scatter for lists. (#11254) @nvdbaranec
Fix tests/rolling/empty_input_test (#11238) @ttnghia
Fix const qualifier when using host_span<bitmask_type const*> (#11220) @ttnghia
Avoid using nvcompBatchedDeflateDecompressGetTempSizeEx in cuIO (#11213) @vuule
Generate benchmark data with correct run length regardless of cardinality (#11205) @vuule
Fix cumulative count index behavior (#11188) @brandon-b-miller
Fix assertion in dask_cudf test_struct_explode (#11170) @rjzamora
Provides a method for the user to remove the hook and re-register the hook in a custom shutdown hook manager (#11161) @res-life
Fix compatibility issues with pandas 1.4.3 (#11152) @vyasr
Ensure cuco export set is installed in cmake build (#11147) @jlowe
Avoid redundant deepcopy in cudf.from_pandas (#11142) @galipremsagar
Fix compile error due to missing header (#11126) @ttnghia
Fix __cuda_array_interface__ failures (#11113) @galipremsagar
Support octal and hex within regex character class pattern (#11112) @davidwendt
Fix split_re matching logic for word boundaries (#11106) @davidwendt
Handle multiple files metadata in read_parquet (#11105) @galipremsagar
Fix index alignment for Series objects with repeated index (#11103) @shwina
FindcuFile now searches in the current CUDA Toolkit location (#11101) @robertmaynard
Fix regex word boundary logic to include underline (#11099) @davidwendt
Exclude CudaFatalTest when selecting all Java tests (#11083) @jlowe
Fix duplicate cudatoolkit pinning issue (#11070) @galipremsagar
Maintain the input index in the result of a groupby-transform (#11068) @shwina
Fix bug with row count comparison for expect_columns_equivalent(). (#11059) @nvdbaranec
Fix BPE uninitialized size value for null and empty input strings (#11054) @davidwendt
Include missing header for usage of get_current_device_resource() (#11047) @AtlantaPepsi
Fix warn_unused_result error in parquet test (#11026) @karthikeyann
Return empty dataframe when reading a Parquet file using empty columns option (#11018) @vuule
Fix small error in page row count limiting (#10991) @etseidl
Fix a row index entry error in ORC writer issue (#10989) @vuule
Fix grouped covariance to require both values to be convertible to double. (#10891) @bdice

📖 Documentation

Fix issues with day & night modes in python docs (#11400) @galipremsagar
Update missing data handling APIs in docs (#11345) @galipremsagar
Add lists filtering APIs to doxygen group. (#11336) @bdice
Remove unused import in README sample (#11318) @vyasr
Note null behavior in where docs (#11276) @brandon-b-miller
Update docstring for spans in get_row_data_range (#11271) @vyasr
Update nvCOMP integration table (#11231) @vuule
Add dev docs for documentation writing (#11217) @vyasr
Documentation fix for concatenate (#11187) @dagardner-nv
Fix unresolved links in markdown (#11173) @karthikeyann
Fix cudf version in README.md install commands (#11164) @jvanstraten
Switch language from None to "en" in docs build (#11133) @galipremsagar
Remove docs mentioning scalar_view since no such class exists. (#11132) @bdice
Add docstring entry for DataFrame.value_counts (#11039) @galipremsagar
Add docs to rolling var, std, count. (#11035) @bdice
Fix docs for Numba UDFs. (#11020) @bdice
Replace column comparison utilities functions with macros (#11007) @karthikeyann
Fix Doxygen warnings in multiple headers files (#11003) @karthikeyann
Fix doxygen warnings in utilities/ headers (#10974) @karthikeyann
Fix Doxygen warnings in table header files (#10964) @karthikeyann
Fix Doxygen warnings in column header files (#10963) @karthikeyann
Fix Doxygen warnings in strings / header files (#10937) @karthikeyann
Generate Doxygen Tag File for Libcudf (#10932) @isVoid
Fix doxygen warnings in structs, lists headers (#10923) @karthikeyann
Fix doxygen warnings in fixed_point.hpp (#10922) @karthikeyann
Fix doxygen warnings in ast/, rolling, tdigest/, wrappers/, dictionary/ headers (#10921) @karthikeyann
Fix doxygen warnings in cudf/io/types.hpp, other header files (#10913) @karthikeyann
Fix doxygen warnings in cudf/io/ avro, csv, json, orc, parquet header files (#10912) @karthikeyann
Fix doxygen warnings in cudf/*.hpp (#10896) @karthikeyann
Add missing documentation in aggregation.hpp (#10887) @karthikeyann
Revise PR template. (#10774) @bdice

🚀 New Features

Change cmake to allow controlling Arrow version via cmake variable (#11429) @kkraus14
Adding support for list<int8> columns to be written as byte arrays in parquet (#11328) @hyperbolic2346
Adding byte array view structure (#11322) @hyperbolic2346
Adding byte_array statistics (#11303) @hyperbolic2346
Add column indexes to Parquet writer (#11302) @etseidl
Provide an Option for Default Integer and Floating Bitwidth (#11272) @isVoid
FST benchmark (#11243) @karthikeyann
Adds the Finite-State Transducer algorithm (#11242) @elstehle
Refactor collect_set to use cudf::distinct and cudf::lists::distinct (#11228) @ttnghia
Treat zstd as stable in nvcomp releases 2.3.2 and later (#11226) @jbrennan333
Add 24 bit dictionary support to Parquet writer (#11216) @devavret
Enable positive group indices for extractAllRecord on JNI (#11215) @anthony-chang
JNI bindings for NTH_ELEMENT window aggregation (#11201) @mythrocks
Add JNI bindings for extractAllRecord (#11196) @anthony-chang
Add cudf.options (#11193) @isVoid
Add thrift support for parquet column and offset indexes (#11178) @etseidl
Adding binary read/write as options for parquet (#11160) @hyperbolic2346
Support nth_element for window functions (#11158) @mythrocks
Implement lists::distinct and cudf::detail::stable_distinct (#11149) @ttnghia
Implement Groupby pct_change (#11144) @skirui-source
Add JNI for set operations (#11143) @ttnghia
Remove deprecated PER_THREAD_DEFAULT_STREAM (#11134) @jbrennan333
Added a Java method to check the existence of a list of keys in a map (#11128) @razajafri
Feature/python benchmarking (#11125) @vyasr
Support nan_equality in cudf::distinct (#11118) @ttnghia
Added JNI for getMapValueForKeys (#11104) @razajafri
Refactor semi_anti_join (#11100) @ttnghia
Replace remaining instances of rmm::cuda_stream_default with cudf::default_stream_value (#11082) @jbrennan333
Adds the Logical Stack algorithm (#11078) @elstehle
Add doxygen-check pre-commit hook (#11076) @karthikeyann
Use new nvCOMP API to optimize the decompression temp memory size (#11064) @vuule
Add Doxygen CI check (#11057) @karthikeyann
Support duplicate_keep_option in cudf::distinct (#11052) @ttnghia
Support set operations (#11043) @ttnghia
Support for ZLIB compression in ORC writer (#11036) @vuule
Adding feature swaplevels (#11027) @VamsiTallam95
Use nvCOMP for ZLIB decompression in ORC reader (#11024) @vuule
Function for bfill, ffill #9591 (#11022) @Sreekiran096
Generate group offsets from element labels (#11017) @ttnghia
Feature axes (#10979) @VamsiTallam95
Generate group labels from offsets (#10945) @ttnghia
Add missing cuIO benchmark coverage for duration types (#10933) @vuule
Dask-cuDF cumulative groupby ops (#10889) @brandon-b-miller
Reindex Improvements (#10815) @brandon-b-miller
Implement value_counts for DataFrame (#10813) @martinfalisse
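
Two of the entries above (#11017 and #10945) concern converting between per-element group labels and group offsets. As a rough illustration of what that conversion means, here is a minimal pure-Python sketch. The function names and signatures are hypothetical, chosen for this example only; they are not the libcudf API, and the sketch assumes labels are non-negative integers below `num_groups`.

```python
from itertools import accumulate


def labels_from_offsets(offsets):
    """Expand group offsets into one label per element.

    Hypothetical helper: offsets [0, 2, 5, 5, 6] describe four groups of
    sizes 2, 3, 0 and 1.
    """
    labels = []
    for group, (start, stop) in enumerate(zip(offsets, offsets[1:])):
        labels.extend([group] * (stop - start))
    return labels


def offsets_from_labels(labels, num_groups):
    """Count elements per label, then prefix-sum the counts into offsets."""
    counts = [0] * num_groups
    for label in labels:
        counts[label] += 1
    return [0, *accumulate(counts)]


offsets = [0, 2, 5, 5, 6]
labels = labels_from_offsets(offsets)
round_trip = offsets_from_labels(labels, num_groups=4)
```

Empty groups (group 2 here) contribute no labels, which is why the number of groups has to be passed explicitly when going back from labels to offsets.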

🛠️ Improvements

Pin dask & distributed for release (#11433) @galipremsagar
Use documented header template for doxygen (#11430) @galipremsagar
Relax arrow version in dev env (#11418) @galipremsagar
Allow CuPy 11 (#11393) @jakirkham
Improve multibyte_split performance (#11347) @cwharris
Switch death test to use explicit trap. (#11326) @vyasr
Add --output-on-failure to ctest args. (#11321) @vyasr
Consolidate remaining DataFrame/Series APIs (#11315) @vyasr
Add JNI support for the join_strings API (#11309) @revans2
Add cupy version to setup.py install_requires (#11306) @vyasr
Removing some unused code (#11305) @hyperbolic2346
Add test of wildcard selection (#11300) @vyasr
Update parquet reader to take stream parameter (#11294) @PointKernel
Spark list hashing (#11292) @bdice
Remove legacy join APIs (#11274) @vyasr
Fix cudf recipes syntax (#11273) @ajschmidt8
Fix cudf recipe (#11267) @ajschmidt8
Cleanup config files (#11266) @vyasr
Run mypy on all packages (#11265) @vyasr
Update to isort 5.10.1. (#11262) @vyasr
Consolidate flake8 and pydocstyle configuration (#11260) @vyasr
Remove redundant black config specifications. (#11258) @vyasr
Ensure DeprecationWarnings are not introduced via pre-commit (#11255) @wence-
Optimization to gpu::PreprocessColumnData in parquet reader. (#11252) @nvdbaranec
Move rolling impl details to detail/ directory. (#11250) @mythrocks
Remove lists::drop_list_duplicates (#11236) @ttnghia
Use cudf::lists::distinct in Python binding (#11234) @ttnghia
Use cudf::lists::distinct in Java binding (#11233) @ttnghia
Use cudf::distinct in Java binding (#11232) @ttnghia
Pin dask-cuda in dev environment (#11229) @galipremsagar
Remove cruft in map_lookup (#11221) @mythrocks
Deprecate skiprows & num_rows in parquet reader (#11218) @galipremsagar
Remove Frame._index (#11210) @vyasr
Improve performance for cudf::contains when searching for a scalar (#11202) @ttnghia
Document why Development component is needed for CMake. (#11200) @vyasr
Clean up unused code in rolling_test.hpp (#11195) @karthikeyann
Standardize join internals around DataFrame (#11184) @vyasr
Move character case table declarations from src to detail (#11183) @davidwendt
Remove usage of Frame in StringMethods (#11181) @vyasr
Expose get_json_object_options to Python (#11180) @SrikarVanavasam
Fix decimal128 stats in parquet writer (#11179) @etseidl
Modify CheckPageRows in parquet_test to use datasources (#11177) @etseidl
Pin max version of cuda-python to 11.7.0 (#11174) @Ethyling
Refactor and optimize Frame.where (#11168) @vyasr
Add npos const static member to cudf::string_view (#11166) @davidwendt
Move _drop_rows_by_label from Frame to IndexedFrame (#11157) @vyasr
Clean up _copy_type_metadata (#11156) @vyasr
Add nvcc conda package in dev environment (#11154) @galipremsagar
Struct binary comparison op functionality for spark rapids (#11153) @rwlee
Refactor inline conditionals. (#11151) @bdice
Refactor Spark hashing tests (#11145) @bdice
Add new _from_data_like_self factory (#11140) @vyasr
Update get_cucollections to use rapids-cmake (#11139) @vyasr
Remove unnecessary extra function for libcudacxx detection (#11138) @vyasr
Allow initial value for cudf::reduce and cudf::segmented_reduce. (#11137) @SrikarVanavasam
Remove Index.replace API (#11131) @vyasr
Move char-type table function declarations from src to detail (#11127) @davidwendt
Clean up repo root (#11124) @bdice
Improve print formatting of strings containing newline characters. (#11108) @nvdbaranec
Fix cudf::string_view::find() to return pos for empty string argument (#11107) @davidwendt
Forward-merge branch-22.06 to branch-22.08 (#11086) @bdice
Take iterators by value in clamp.cu. (#11084) @bdice
Performance improvements for row to column conversions (#11075) @hyperbolic2346
Remove deprecated Index methods from Frame (#11073) @vyasr
Use per-page max compressed size estimate for compression (#11066) @devavret
column to row refactor for performance (#11063) @hyperbolic2346
Include skbuild directory into build.sh clean operation (#11060) @galipremsagar
Unpin dask & distributed for development (#11058) @galipremsagar
Add support for Series.between (#11051) @galipremsagar
Fix groupby include (#11046) @bwyogatama
Regex cleanup internal reclass and reclass_device classes (#11045) @davidwendt
Remove public API of cudf.merge_sorted. (#11032) @bdice
Drop python 3.7 in code-base (#11029) @galipremsagar
Addition & integration of the integer power operator (#11025) @AtlantaPepsi
Refactor lists::contains (#11019) @ttnghia
Change build.sh to find C++ library by default and avoid shadowing CMAKE_ARGS (#11013) @vyasr
Clean up parquet unit test (#11005) @PointKernel
Add missing #pragma once to header files (#11004) @karthikeyann
Cleanup iterator.cuh and add fixed point support for scalar_optional_accessor (#10999) @ttnghia
Refactor cudf::contains (#10997) @ttnghia
Remove Arrow CUDA IPC code (#10995) @shwina
Change file extension for groupby benchmark (#10985) @ttnghia
Sort recipe include checks. (#10984) @bdice
Update cuCollections for thrust upgrade (#10983) @PointKernel
Expose row-group size options in cudf ParquetWriter (#10980) @rjzamora
Cleanup cudf::strings::detail::regex_parser class source (#10975) @davidwendt
Handle missing fields as nulls in get_json_object() (#10970) @SrikarVanavasam
Fix license families to match all-caps expected by conda-verify. (#10931) @bdice
Include <optional> for GCC 11 compatibility. (#10927) @bdice
Enable builds with scikit-build (#10919) @vyasr
Improve distinct by using cuco::static_map::retrieve_all (#10916) @PointKernel
update cudfjni to 22.08.0-SNAPSHOT (#10910) @pxLi
Improve the capture of fatal cuda error (#10884) @sperlingxx
Cleanup regex compiler operators and operands source (#10879) @davidwendt
Buffer: make .ptr read-only (#10872) @madsbk
Configurable NaN handling in device_row_comparators (#10870) @rwlee
Register cudf.core.groupby.Grouper objects to dask grouper_dispatch (#10838) @brandon-b-miller
Upgrade to arrow-8 (#10816) @galipremsagar
Remove getattr method in RangeIndex class (#10538) @skirui-source
Adding bins to value counts (#8247) @marlenezw
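
The list above includes support for Series.between (#11051). As a hedged sketch of the range check such a method performs, the plain-Python function below follows the pandas convention for the `inclusive` argument ("both", "neither", "left", "right"). It assumes cuDF mirrors pandas here, it is not the cuDF implementation, and it ignores null handling.

```python
def between(values, left, right, inclusive="both"):
    """Return one boolean per value: True where left <= value <= right.

    Hypothetical stand-in for a Series.between-style check; `inclusive`
    selects which of the two bounds are closed.
    """
    if inclusive not in {"both", "neither", "left", "right"}:
        raise ValueError(f"unsupported inclusive option: {inclusive!r}")
    closed_left = inclusive in ("both", "left")
    closed_right = inclusive in ("both", "right")
    result = []
    for value in values:
        above = value >= left if closed_left else value > left
        below = value <= right if closed_right else value < right
        result.append(above and below)
    return result


mask = between([1, 2, 3, 4], 2, 3)
```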

Package Rankings

Top 5.32% on Pypi.org

Top 8.17% on Proxy.golang.org

Top 4.8% on Repo1.maven.org
