Addresses issues with R 4.4.0. The root cause was that version checking functions changed how they work:
package_version() no longer accepts numeric_version() output. The package_version() call was wrapped in a function that coerces the argument if it is of class numeric_version.
Comparison operators (<, >=, etc.) for packageVersion() no longer accept numeric values.
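A minimal sketch of that coercion, with a hypothetical wrapper name:

```r
# Hypothetical wrapper illustrating the fix described above: coerce
# numeric_version input to character before calling package_version().
safe_package_version <- function(x) {
  if (inherits(x, "numeric_version")) {
    x <- as.character(x)
  }
  package_version(x)
}

# Accepts output of packageVersion(), which is a numeric_version subclass:
safe_package_version(packageVersion("stats"))
```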
Adds support for the Databricks "autoloader" (format: cloudFiles) for streaming ingestion of files via stream_read_cloudfiles() (@zacdav-db #3432). Also adds stream_write_table() and stream_read_table(); see the sketch after this entry.
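A hypothetical usage sketch; the argument names and paths are assumptions based on sparklyr's other stream_read_*() functions, so check the package documentation for the exact signatures:

```r
# Assumed usage -- not verified against the documented signatures; the
# Databricks connection method and volume path are illustrative only.
library(sparklyr)

sc <- spark_connect(method = "databricks")
stream <- stream_read_cloudfiles(sc, path = "/Volumes/landing/events/")
stream_write_table(stream, "events_bronze")
```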
Made changes to stream_write_generic (@zacdav-db #3432): the toTable method doesn't allow calling start, so a to_table parameter that adjusts the logic was added; also fixes the path option not being propagated when to_table is TRUE.
Upgrades to Roxygen version 7.3.1
Published by edgararuiz 7 months ago
Fixes quoting issue with dbplyr 2.5.0 (#3429)
Fixes Windows OS identification (#3426)
Removes dependency on tibble; all calls are now redirected to dplyr (#3399)
Removes dependency on rappdirs (#3401); support for sparklyr 0.5 is no longer needed
Converts spark_apply() to a method (#3418)
Spark 2.3 is no longer considered maintained as of September 2019
Updates Delta-to-Spark version matching when using delta as one of the packages when connecting (#3414)
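For example, a connection using that shortcut looks roughly like this (the Spark version shown is illustrative):

```r
# Listing "delta" in packages lets sparklyr resolve the Delta Lake release
# that matches the requested Spark version.
library(sparklyr)

sc <- spark_connect(master = "local", version = "3.4", packages = "delta")
```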
Published by edgararuiz 12 months ago
Fixes related to the new dbplyr version:
Fixes db_connection_describe() S3 consistency error (@t-kalinowski)
Addresses new error from dbplyr that fails when you try to access components from a remote tbl using $
Bumps the version of dbplyr to switch between the two methods to create temporary tables
Addresses new translate_sql() hard requirement to pass a con object. Done by passing the current connection or simulate_hive(); see the sketch after this entry.
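A small sketch of that requirement:

```r
# translate_sql() requires a connection object; simulate_hive() supplies a
# simulated one, so SQL can be generated without a live Spark session.
library(dbplyr)

translate_sql(mean(cyl, na.rm = TRUE), con = simulate_hive())
```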
Small fix to spark_connect_method() arguments. Removes 'hadoop_version'
Improvements to handling pysparklyr load (@t-kalinowski)
Fixes 'subscript out of bounds' issue found by pysparklyr (@t-kalinowski)
Updates available Spark download links
Removes dependency on the following packages:
digest
base64enc
ellipsis
Converts ml_fit() into an S3 method for pysparklyr compatibility
Improvements and fixes to tests (@t-kalinowski)
Fixes test jobs that should have included Arrow but did not
Updates to the Spark versions to be tested
Re-adds tests for development dbplyr
Published by edgararuiz about 1 year ago
Spark error messages are now cached instead of being displayed in full as an R error. The volume of lines returned by Spark used to overwhelm the interactive session's console or notebook. By default, sparklyr now returns the top of the Spark error message, which is typically the most relevant part. The full error can still be accessed using a new function called spark_last_error()
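A sketch of retrieving the full message, assuming an open connection sc (the query is illustrative):

```r
# The R error shows only the top of the Spark message; spark_last_error()
# prints the full text of the most recent failure.
result <- try(sdf_sql(sc, "SELECT * FROM table_that_does_not_exist"))
spark_last_error()
```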
Reduces redundancy on several tests
Handles SQL quoting when the table reference contains multiple levels. The most common case is when a table name is passed using in_catalog() or in_schema().
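For instance (catalog, schema, and table names are illustrative):

```r
# Multi-level table references such as catalog.schema.table are now
# quoted correctly when building the SQL.
library(dplyr)
library(dbplyr)

trips <- tbl(sc, in_catalog("main", "nyc", "trips"))
```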
Prevents an error when na.rm = TRUE is explicitly set within pmax() and pmin(). These functions will now also purposely fail if na.rm is set to FALSE. The default in base R is na.rm = FALSE, but ever since these functions were released there has been no warning or error. For now, we will keep that behavior until a better approach can be figured out. (#3353)
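Illustratively, assuming an open connection sc:

```r
# na.rm = TRUE is accepted inside pmax()/pmin() on Spark tbls; an explicit
# na.rm = FALSE now fails on purpose.
library(dplyr)

mtcars_tbl <- copy_to(sc, mtcars)
mtcars_tbl %>% mutate(highest = pmax(cyl, gear, na.rm = TRUE))
```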
spark_install() will now properly match when a partial version is passed to the function. The issue was that passing '2.3' would match to '3.2.3', instead of '2.3.x' (#3370)
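For example:

```r
# A partial version now resolves within the intended release line:
spark_install(version = "2.3")  # installs a 2.3.x build, not 3.2.3
```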
Adds functionality to allow other packages to provide sparklyr additional back-ends. This effort is mainly focused on adding the ability to integrate with Spark Connect and Databricks Connect through a new package.
New exported functions to integrate with the RStudio IDE. They all have the same spark_ide_ prefix
Modifies several read functions to become exported methods, such as sdf_read_column().
Adds spark_integ_test_skip() function. This is to allow other packages to use sparklyr's test suite; it gives the external package a way to indicate whether a given test should run or be skipped.
If installed, sparklyr will load the pysparklyr package
Published by edgararuiz over 1 year ago
Adds Azure Synapse Analytics connectivity (@Bob-Chou, #3336)
Adds support for "parameterized" queries now available in Spark 3.4 (@gregleleu #3335)
Adds new DBI methods: dbValid and dbDisconnect (@alibell, #3296)
Adds overwrite parameter to dbWriteTable() (@alibell, #3296)
Adds database parameter to dbListTables() (@alibell, #3296)
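A hedged sketch of the new DBI surface, assuming an open sparklyr connection sc (table and database names are illustrative):

```r
# Exercises the parameters added above through the standard DBI generics.
library(DBI)

dbWriteTable(sc, "mtcars_spark", mtcars, overwrite = TRUE)
dbListTables(sc, database = "default")
dbDisconnect(sc)
```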
Adds ability to turn off predicate support (where(), across()) using options("sparklyr.support.predicates" = FALSE). Defaults to TRUE. This should accelerate dplyr commands because it won't need to process column types for every single piped command
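For example (option name taken from the entry above):

```r
# Disable predicate support; the default is TRUE.
options("sparklyr.support.predicates" = FALSE)
```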
Fixes Spark download locations (#3331)
Fix various rlang deprecation warnings (@mgirlich, #3333).
Published by edgararuiz over 1 year ago
Addresses warning from CRAN checks
Addresses option(stringsAsFactors) usage
Fixes root cause of issue processing pivot_wider() and distinct() (#3317 & #3320)
Updates local Spark download sources
Published by edgararuiz about 2 years ago
Adds new metric extraction functions: ml_metrics_binary(), ml_metrics_regression() and ml_metrics_multiclass(). They work closer to how yardstick metric extraction functions work: they expect a table with the predictions and actual values, and return a concise tibble with the metrics. (#3281)
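A hedged sketch, assuming yardstick-style truth/estimate arguments; check the function documentation for the exact interface:

```r
# Hypothetical workflow: score a test set with a fitted model, then
# summarize metrics from the predictions table (argument names assumed).
predictions <- ml_predict(fitted_model, testing_tbl)
ml_metrics_regression(predictions, truth = label, estimate = prediction)
```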
Adds new spark_insert_table() function. This allows one to insert data into an existing table definition without redefining the table, even when overwriting the existing data. (#3272 @jimhester)
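For instance (table and object names are made up):

```r
# Insert rows into an already-defined table without re-issuing its
# definition.
spark_insert_table(new_rows_tbl, "sales_history")
```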
Adds support in ml_cross_validator() for regression models. (#3273)
Adds support for Spark 3.3 local installation. This includes the ability to enable and set up log4j version 2. (#3269)
Updates the JSON file that sparklyr uses to find and download Spark for local use. It is worth mentioning that starting with Spark 3.3, the Hadoop version number no longer uses a minor version in its download link. So, instead of requesting 3.2, the version to request is 3.
Removes workaround for older versions of arrow. Bumps arrow version dependency from 0.14.0 to 0.17.0 (#3283 @nealrichardson)
Removes code related to backwards compatibility with dbplyr. sparklyr requires dbplyr version 2.2.1 or above, so the code is no longer needed. (#3277)
Begins centralizing ML parameter validation into a single function that will run the proper cast function for each Spark parameter. It also starts using S3 methods, instead of searching for a concatenated function name, to find the proper parameter validator. Regression models are the first ones to use this new method. (#3279)
sparklyr compilation routines have been improved and simplified. spark_compile() now provides more informative output when used. It also adds tests to make sure compilation works, and adds a step to install Scala in the corresponding GHAs so that the new JAR build tests are able to run. (#3275)
Stops using package environment variables directly. Any package-level variable will be handled by a genv-prefixed function to set and retrieve values. This avoids the risk of having the exact same variable initialized in more than one R script. (#3274)
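A sketch of the pattern being described, with illustrative names:

```r
# Package-level state lives in one environment and is accessed only through
# genv-prefixed setter/getter functions.
genv_state <- new.env(parent = emptyenv())

genv_set_spark_home <- function(value) {
  assign("spark_home", value, envir = genv_state)
}

genv_get_spark_home <- function() {
  get0("spark_home", envir = genv_state, ifnotfound = NULL)
}
```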
Adds more tests to improve coverage.
Published by edgararuiz over 2 years ago
Supports new dbplyr version (@mgirlich)
Removes stringr dependency
Fixes augment() when the model was fitted via parsnip (#3233)
Published by edgararuiz over 2 years ago
rlang::is_env()
Fixes pivot_wider() S3 consistency issue
Published by yitao-li almost 3 years ago
Implemented support for the .groups parameter for dplyr::summarize() operations on Spark dataframes
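For example, assuming an open connection sc:

```r
# .groups on a Spark dataframe now behaves as it does on local data frames.
library(dplyr)

mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
mtcars_tbl %>%
  group_by(cyl, gear) %>%
  summarize(avg_mpg = mean(mpg), .groups = "drop")
```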
Fixed the incorrect handling of the remove = TRUE option for separate.tbl_spark()
Optimized away an extra count query when collecting Spark dataframes from
Spark to R.
By default, use links from the https://dlcdn.apache.org site for downloading
Apache Spark when possible.
Attempt to continue the spark_install() process even if the Spark version specified is not present in the inst/extdata/versions*.json files (in which case sparklyr will guess the URL of the tarball based on the existing and well-known naming convention used by https://archive.apache.org, i.e., https://archive.apache.org/dist/spark/spark-${spark version}/spark-${spark version}-bin-hadoop${hadoop version}.tgz)
Revised inst/extdata/versions*.json files to reflect recent releases of Apache Spark.
Implemented sparklyr_get_backend_port() for querying the port number used by the sparklyr backend.