cdap

An open source framework for building data analytic applications.


CDAP 6.10.1 (Latest Release)

Published by samdgupi 5 months ago

Changes

CDAP-21006: You can add a provider using OAuthHandler while reusing credentials stored in Google Cloud Secret Manager.
CDAP-20934: Added support for an optional string field (keep-strings) in the parse-xml-to-json Wrangler directive.
PLUGIN-900: The BigQuery sink plugin no longer provides the Dedupe By option in Insert mode.
PLUGIN-1563: The BigQuery plugin supports the JSON type.
PLUGIN-1715: Improved retries in the BigQuery plugin.
PLUGIN-1748: Improved error messages in the Spanner source.
PLUGIN-1769: Improved retries in the Pub/Sub plugin.

Fixes

PLUGIN-1736: Fixed an issue in Wrangler causing the send-to-error-and-continue directive to not initialize dq_failure when the condition is false.
CDAP-20951: Fixed an issue that occurred when running a replication pipeline with task workers enabled.
PLUGIN-788, PLUGIN-781, PLUGIN-1318, PLUGIN-782: Improved error reporting in the BigQuery Sink. Fixed an issue in BigQuery Argument Setter where validation error wasn’t displayed correctly.
PLUGIN-1617: Fixed an issue with the Python plugin, where running in native mode didn't work as intended.
PLUGIN-1728: Fixed an issue causing certain connection parameters to not propagate in a MySQL connection.
PLUGIN-1735: Fixed an issue causing the Cloud Storage Copy action to timeout while working with large files.
PLUGIN-1738: Fixed an issue causing Copy and Move plugins to not create buckets at the destination path as expected, resulting in a runtime error.
PLUGIN-1742: Fixed an issue causing empty source input to fail in multiple plugins.
PLUGIN-1778: Fixed an issue with remote execution of Wrangler directives causing type information to not be emitted.
PLUGIN-1771: Streaming pipelines in CDAP now support the Excel source. Batch pipelines with an Excel source can consume high memory and fail in large pipelines.
CDAP-21024: Fixed an issue causing a No record field provided error.
CDAP-20890: Fixed an issue with using the Conditional plugin as a source for Wrangler, causing CDAP not to fetch the necessary schema.
CDAP-20999: Fixed an issue with instance upgrades causing existing schedule names to be improperly encoded in the URL, resulting in pre-upgrade failure.
CDAP-20988: Fixed an issue with schedules causing the maximum concurrent run property to not work as intended.
CDAP-20932: Fixed an issue causing the commit ID to propagate incorrectly when pushing pipeline configurations to Git.

Breaking

CDAP version 6.10.1 has a known issue in the Cloud Storage plugin causing pipelines to intermittently fail if the plugin contains a * regex pattern and uses Dataproc 2.0. To mitigate this issue, you can:

  • Change the Dataproc image to version 2.1, or
  • Use an older plugin version, or
  • Increase memory for the executor.
CDAP 6.10.0

Published by samdgupi 9 months ago

Improvements

CDAP-15361: Wrangler is schema-aware.
CDAP-20799: CDAP supports multi-pipeline pull and push as part of source control management with GitHub.
CDAP-20831: If a task is stuck, task workers are forcefully restarted.
CDAP-20868: Added the capability to run concurrent tasks in task workers.
PLUGIN-1694: Added validation for incorrect credentials in the Amazon S3 source.

Changes

CDAP-20904 and CDAP-20581: In Source Control Management, the GitHub personal access token (PAT) was removed from the CDAP web interface for repository configurations.
CDAP-20846: Improved latency when BigQuery pushdown is enabled by fetching artifacts from a local cache.
PLUGIN-1718: The BigQuery sink supports flexible table names and column names.
PLUGIN-1692: BigQuery sinks support ingesting data to JSON data type fields.
PLUGIN-1705: In BigQuery sink jobs, you can add labels in the form of key-value pairs.
PLUGIN-1729: In BigQuery execute jobs, you can add labels in the form of key-value pairs.
PLUGIN-1293: The Cloud Storage Java Client was upgraded to version 2.3 or later.

Fixes

CDAP-20521: Fixed an issue causing columns that have all null values to be dropped in Wrangler.
CDAP-20587: Fixed an issue causing slowness in API while fetching runs of all applications in a namespace.
CDAP-20815: Fixed an issue causing pipeline upgrades to not have the intended description.
CDAP-20839: Made the following fixes to Wrangler grammar:

  • The NUMERIC token type supports negative numbers.
  • The PROPERTIES token type supports one or more properties.

PLUGIN-1681: Fixed an issue in the Postgres DB plugin causing macros to be unsupported for database configuration.

Deprecated

The Spark compute engine running on Scala 2.11 is no longer supported.

CDAP 6.9.2

Published by dli357 about 1 year ago

Improvements

CDAP-19428: CDAP supports setting custom scopes when creating a Dataproc cluster.

CDAP-20698: You can set common metadata labels for Dataproc clusters and jobs using the Common Labels property in the Ephemeral Dataproc compute profile.

You can set labels for the Dataproc jobs using the Common Labels property in the Existing Dataproc compute profile.

You can set a pipeline runtime argument with the key system.profile.properties.labels and a value representing the labels in the following format: key1|value1;key2|value2. This setting overrides the common labels set in the compute profile for pipeline runs.
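As an illustration, the key1|value1;key2|value2 label format described above can be built from an ordinary mapping like this (a hypothetical helper, not a CDAP API):

```python
def format_labels(labels: dict) -> str:
    """Build the labels runtime-argument value in key1|value1;key2|value2 form.

    Illustrative helper only; not part of CDAP.
    """
    return ";".join(f"{key}|{value}" for key, value in labels.items())

# Value for the system.profile.properties.labels runtime argument:
labels_value = format_labels({"team": "analytics", "env": "prod"})
# labels_value == "team|analytics;env|prod"
```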

CDAP-20712: CDAP supports using Dataproc temp buckets in compute profiles.

Fixes

PLUGIN-1660: Added retry for Pub/Sub snapshot creation and deletion in real-time pipeline with a Pub/Sub source when a retryable internal error is thrown.

CDAP-20674: Fixed a bug causing the Dynamic Spark plugins to fail when running on Dataproc 1.5.

CDAP-20680: Fixed a discrepancy in warning and error counts reported between the pipeline summary tab and system logs.

CDAP-20759: Fixed a problem when, in rare cases, a cluster couldn't be found with Cluster Reuse.

CDAP-20778: Fixed a bug causing the JavaScript transform to fail on Dataproc 2.1.

CDAP 6.9.1

Published by CuriousVini over 1 year ago

Improvements

CDAP-20436: Added the ability to aggregate pipeline metrics in the RuntimeClientService by setting app.program.runtime.monitor.metrics.aggregation.enabled to true in cdap-site.xml. This slightly increases the resource usage of the RuntimeClientService but decreases the load on the CDAP metrics service. The scalability of the metrics service increases with the number of spark executors per pipeline.
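Assuming the standard Hadoop-style property format used by cdap-site.xml, the setting described above might look like this (a sketch, not taken verbatim from the release):

```xml
<property>
  <name>app.program.runtime.monitor.metrics.aggregation.enabled</name>
  <value>true</value>
  <description>Aggregate pipeline metrics in the RuntimeClientService.</description>
</property>
```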

CDAP-20228: CDAP supports source control management with GitHub. Cloud Data Fusion supports using Source Control Management to manage pipeline versions through GitHub repositories. Source Control Management is available in Preview.

CDAP-20543: CDAP version 6.9.1 supports the Dataproc image 2.1 compute engine, which runs on Java 11. If you change the Dataproc image to 2.1, the JDBC drivers that the database plugins use in those instances must be compatible with Java 11.

CDAP-20455: Streaming pipelines that use Spark checkpointing can use macros if the cdap.streaming.allow.source.macros runtime argument is set to true. Note that macro evaluation will only be performed for the first run in this case, then stored in the checkpoint. It won't be reevaluated in later runs.

CDAP-20466: Added a Lifecycle microservices endpoint to delete streaming application state for the Kafka Consumer Streaming and Google Cloud Pub/Sub Streaming sources.

CDAP-20488: Improved performance of replication pipelines by caching schema objects for data events.

CDAP-20500: Added a launch mode setting to the Dataproc provisioners. When set to Client mode, the program launcher runs in the Dataproc job itself, instead of as a separate YARN application. This reduces the start-up time and cluster resources required, but may cause failures if the launcher needs more memory, such as if there is an action plugin that loads data into memory.

CDAP-20504: Removed duplicate backend calls when a program reads from the secure store.

CDAP-20567: Added support to upgrade Pipeline Post-run Action (Pipeline Alerts) plugins during the pipeline upgrade process.

PLUGIN-1537: CDAP supports the following improvements and changes for real-time pipelines with a single Pub/Sub streaming source and no Windower plugins:

  • The Pub/Sub streaming source has built-in support, so data is processed at least once. Enabling Spark checkpointing isn't required.
  • The Pub/Sub streaming source creates a Pub/Sub snapshot at the beginning of each batch and removes it at the end of each batch.
  • Pub/Sub snapshot creation has a cost associated with it. For more information, see Pub/Sub pricing.
  • Snapshot creation can be monitored using Cloud Audit logs.

Fixes

CDAP-18394: Fixed an issue where CDAP checked GET permission on a namespace that didn't exist yet during the namespace creation flow.

CDAP-20216: Fixed an issue where Dataproc continued running a job when it couldn't communicate with the CDAP instance, if the replication job or pipeline was deleted in CDAP.

CDAP-20568: Fixed an issue that caused pipelines with triggers with runtime arguments to fail after the instance was upgraded to CDAP 6.8+ and 6.9.0.

CDAP-20597: Fixed an issue where arguments set by actions didn't overwrite runtime arguments. Users must add the following runtime argument: system.skip.normal.macro.evaluation=true.

PLUGIN-1594: Fixed an issue where initial offset was not considered in the Kafka batch source.

CDAP-20655: Fixed an issue that caused the Pipeline Studio page to show an incorrect count of triggers.

CDAP-20660: Fixed an issue that caused the Trigger's Payload Config to be missing in the UI for an upgraded instance.

Deprecated

CDAP-20667: All datasets except FileSet and ExternalDataset are deprecated and will be removed in a future release. All the deprecated datasets use the Table dataset in some form, which only works for programs running with the native provisioner on very old Hadoop releases.

CDAP 6.8.3

Published by rmstar over 1 year ago

Feature

CDAP-20381: Added the ability to configure Java options for a pipeline run by setting the system.program.jvm.opts runtime argument.
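For example, the runtime argument could carry ordinary JVM flags (the flag values below are illustrative only, not a recommended configuration):

```
system.program.jvm.opts=-XX:+UseG1GC -Xmx2048m
```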

Improvement

CDAP-20567: CDAP supports upgrades in the Pipeline Post-run Action (Pipeline Alerts) plugins during the pipeline upgrade process.

Fixes

CDAP-20549: Fixed an issue where executor resource settings were not honored when app.pipeline.overwriteConfig was set.

CDAP-20568: Fixed an issue that caused pipelines with triggers with runtime arguments to fail after the instance was upgraded to CDAP 6.8+ and 6.9.0.

CDAP-20597: Fixed an issue where arguments set by actions didn't overwrite runtime arguments. To fix the issue, add the following runtime argument: system.skip.normal.macro.evaluation=true.

CDAP-20643: Fixed security vulnerabilities by ensuring that software updates are applied regularly to the CDAP operator images.

CDAP-20655: Fixed an issue that caused the Pipeline Studio page to show an incorrect count of triggers.

CDAP-20660: Fixed an issue that caused the Trigger's Payload Config to be missing in the UI for an upgraded instance.

PLUGIN-1582: Fixed an issue in the BigQuery Sink where the absence of an ordering key caused an exception.

PLUGIN-1594: Fixed an issue where initial offset was not considered in the Kafka Batch Source.

CDAP 6.8.2

Published by rmstar over 1 year ago

Bug Fixes

CDAP-20431: Fixed an issue that sometimes caused pipelines to fail when running pipelines on Dataproc with the following error: Unsupported program type: Spark. The first time a pipeline that only contained actions ran on a newly created or upgraded instance, it succeeded. However, the next pipeline runs, which included sources or sinks, might have failed with this error.

CDAP 6.9.0

Published by CuriousVini over 1 year ago

Features

CDAP-20454: In the Wrangler transformation, added support for specifying filters and preconditions in SQL, and added support for pushing down SQL preconditions to BigQuery with Transformation Pushdown.

CDAP-20288: Added support for Dataproc driver node groups. To use Dataproc driver node groups, when you create the Dataproc cluster, configure the following properties:
yarn:yarn.nodemanager.resource.memory.enforced=false
yarn:yarn.nodemanager.admin-env.SPARK_HOME=$SPARK_HOME

Note: The single quotation marks are important in the property when using gcloud CLI to create the cluster ('yarn:yarn.nodemanager.admin-env.SPARK_HOME=$SPARK_HOME') so that the shell doesn't try to resolve the $ locally before submitting.

CDAP-19628: Added support for Window Aggregation operations in Transformation Pushdown to reduce the pipeline execution time by performing SQL operations in BigQuery instead of Spark.

CDAP-19425: Added support for editing deployed pipelines.

CDAP-20228: Added support for pipeline version control with GitHub.

Improvements

CDAP-20381: Added the ability to configure Java options for a pipeline run by setting the system.program.jvm.opts runtime argument.

CDAP-20140: Replication pipelines generate logs for stats of events processed by source and target plugins at a fixed interval.

Changes

CDAP-20430: Fixed the pipeline stage validation API to return unevaluated macro values to prevent secure macros from being returned.

CDAP-20373: When you duplicate a pipeline, CDAP appends _copy to the pipeline name when it opens in the Pipeline Studio. In previous releases, CDAP appended _<v1, v2, v3> to the name.

Bug Fixes

CDAP-20458: Fixed an issue where the flow control running count metric (system.flowcontrol.running.count) might be stale if no new pipelines or replication jobs were started.

CDAP-20431: Fixed an issue that sometimes caused pipelines to fail when running pipelines on Dataproc with the following error: Unsupported program type: Spark. The first time a pipeline that only contained actions ran on a newly created or upgraded instance, it succeeded. However, the next pipeline runs, which included sources or sinks, might have failed with this error.

CDAP-20301: Fixed an issue where a replication job got stuck in an infinite retry when it failed to process a DDL operation.

CDAP-20276: For replication jobs, fixed an issue where retries for transient errors from BigQuery might have resulted in data inconsistency.

CDAP-19389: For SQL Server replication sources, fixed an issue on the Review assessment page, where SQL Server DATETIME and DATETIME2 columns were shown as mapped to TIMESTAMP columns in BigQuery. This was a UI bug. The replication job mapped the data types to the BigQuery DATETIME type.

PLUGIN-1516: Updated the Window Aggregation Analytics plugin to support Spark 3 and removed the dependency on Scala 2.11.

PLUGIN-1514: For the Database sink, fixed an issue where the pipeline didn’t fail if there was an error writing data to the database. Now, if there is an error writing data to the database, the pipeline fails and no data is written to the database.

PLUGIN-1513: For BigQuery Pushdown, fixed an issue when BigQuery Pushdown was enabled for an existing dataset, the Location where the BigQuery Sink executed jobs was the location specified in the Pushdown configuration, not the BigQuery Dataset location. The configured Location should have only been used when creating resources. Now, if the dataset already exists, the Location for the existing dataset is used.

PLUGIN-1512: Fixed an issue where pipelines failed when the output schema was overridden in certain source plugins. This was because the output schema didn’t match the order of the fields from the query. This happened when the pipeline included any of the following batch sources:

  • Database
  • Oracle
  • MySQL
  • SQL Server
  • PostgreSQL
  • DB2
  • MariaDB
  • Netezza
  • CloudSQL PostgreSQL
  • CloudSQL MySQL
  • Teradata

Pipelines no longer fail when you override the output schema in these source plugins. CDAP uses the name of the field to match the schema of the field in the result set and the field in the output schema.

PLUGIN-1503: Fixed an issue where pipelines that had a Database batch source and an Oracle sink that used a connection object (using SYSDBA) to connect to an Oracle database failed to establish a connection to the Oracle database. This was due to a package conflict between the Database batch source and the Oracle sink plugins.

PLUGIN-1494: For Oracle batch sources, fixed an issue that caused the pipeline to fail when there was a TIMESTAMP WITH LOCAL TIME ZONE column set to NULLABLE and the source had values that were NULL.

PLUGIN-1481: In the Oracle batch source, the Oracle NUMBER data type defined without precision and scale by default was mapped to CDAP string data type. If these fields were used by an Oracle Sink to insert into a NUMBER data type field in the Oracle table, the pipeline failed due to incompatibility between string and NUMBER type. Now, the Oracle Sink inserts these string types into NUMBER fields in the Oracle table.

CDAP 6.7.3

Published by sumitjnn over 1 year ago

Bug Fixes

CDAP-19599: Fixed an issue in the BigQuery Replication Target plugin that caused replication jobs to fail when the BigQuery target table already existed. The new version of the plugin will automatically be used in new replication jobs.

CDAP-19622: Fixed upgrade for MySQL and SQL Server replication jobs. You can now upgrade MySQL and SQL Server replication jobs from CDAP 6.7.1 and 6.7.2 to CDAP 6.7.3.

CDAP-20013: Fixed upgrade for Oracle by Datastream replication jobs. You can now upgrade Oracle by Datastream replication jobs from CDAP 6.6.0 and 6.7.x to CDAP 6.7.3 or higher.

CDAP-20235: For Database plugins, fixed a security issue where the database username and password were exposed in App Fabric logs.

CDAP-20271: Fixed an issue that caused pipelines to fail when they used a connection that included a secure macro and the secure macro had JSON as the value (for example, the Service Account property).

CDAP-20392: Fixed an issue that occurred in certain upgrade scenarios, where, for pipelines that didn’t have the Use Connection property set, the plugin connection properties (such as Project ID and Service account information) were not displayed in the plugin UI.

CDAP-20394: Fixed an issue where the replication source plugin's event reader was not stopped by the Delta worker in case of errors, leading to leakage of the plugin's resources.

CDAP-20146: Fixed an issue in security-enabled instances that caused pipeline launches to fail and return a token expired error when evaluating secure macros in provisioner properties.

PLUGIN-1433: In the Oracle Batch Source, when the source data included fields with the Numeric data type (undefined precision and scale), CDAP set the precision to 38 and the scale to 0. If any values in the field had scale other than 0, CDAP truncated these values, which could have resulted in data loss. If the scale for a field was overridden in the plugin output schema, the pipeline failed.

Now, if an Oracle source has Numeric data type fields with undefined precision and scale, you must manually set the scale for these fields in the plugin output schema. When you run the pipeline, the pipeline will not fail and the new scale will be used for the field instead. However, there might be truncation if there are any Numbers present in the fields with the scale greater than the scale defined in the plugin. CDAP writes warning messages in the pipeline log indicating the presence of Numbers with undefined precision and scale in the pipeline. For more information about setting precision and scale in a plugin, see Changing the precision and scale for decimal fields in the output schema.
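To illustrate the truncation behavior described above (a sketch using Python's decimal module, not CDAP code), a value whose scale exceeds the scale defined in the plugin output schema loses its extra digits:

```python
from decimal import Decimal, ROUND_DOWN

# Source value with scale 4; assume the plugin output schema defines scale 2.
value = Decimal("123.4567")
schema_scale = Decimal("0.01")  # scale 2

# Truncate (round toward zero) to the schema-defined scale.
truncated = value.quantize(schema_scale, rounding=ROUND_DOWN)
# truncated == Decimal("123.45"); the trailing digits are lost,
# which is the kind of truncation the warning messages flag.
```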

PLUGIN-1374: Improved performance for batch pipelines with MySQL sinks.

CDAP 6.8.1

Published by vsethi09 over 1 year ago

Features

CDAP-19729: Added support to upgrade realtime pipelines created in CDAP 6.8.0 with a Kafka Consumer Streaming source to CDAP 6.8.1. After the CDAP platform is upgraded to 6.8.1, you can use the Lifecycle microservices to upgrade these pipelines.

Changes

CDAP-20110: When running CDAP on Kubernetes, Spark program types are now run as Kubernetes jobs instead of deployments.

CDAP-20201: CDAP now sets Spark Kubernetes connect/read timeouts based on the CDAP Kubernetes timeout settings. Previously, CDAP did not set Spark Kubernetes connection/read timeouts. Spark used its default timeout setting.

Bug Fixes

CDAP-20394: Fixed an issue where the replication source plugin's event reader was not stopped by the Delta worker in case of errors, leading to leakage of the plugin's resources.

CDAP-20392: Fixed an issue that occurred in certain upgrade scenarios, where, for pipelines that didn’t have the Use Connection property set, the plugin connection properties (such as Project ID and Service account information) were not displayed in the plugin UI.

CDAP-20271: Fixed an issue that caused pipelines to fail when they used a connection that included a secure macro that had JSON as the value (for example, the Service Account property).

CDAP-20257: For Oracle Datastream replication sources, fixed an issue where the Review Assessment page would freeze for a long time when the selected or manually entered table did not exist in the source database.

CDAP-20199: For Oracle Datastream replication sources, fixed an issue where the Select tables and transformations page timed out, failed to load the list of tables, and displayed the error deadline exceeded when the source database contained a large number of tables.

CDAP-20146: Fixed an error in security-enabled instances that caused pipeline launches to fail and return a token expired error when evaluating secure macros in provisioner properties.

CDAP-20121: For MySQL Replication sources, fixed an issue that caused replication jobs to fail during initial snapshotting when the job included a runtime argument with the Debezium property binary-handling-mode.

CDAP-20028: For Replication jobs, increased retry duration for API calls to update state/offsets in Replication jobs.

CDAP-20013: Fixed upgrade for Oracle by Datastream replication jobs. You can now upgrade Oracle by Datastream replication jobs from CDAP 6.6.0 and 6.7.x to CDAP 6.8.1.

CDAP-19622: Fixed upgrade for MySQL and SQL Server replication jobs. You can now upgrade MySQL and SQL Server replication jobs from CDAP 6.7.x to CDAP 6.8.1.

CDAP 6.8.0

Published by vsethi09 almost 2 years ago

New Features

The Dataplex Batch Source and Dataplex Sink plugins are generally available (GA).

CDAP-19592: For Oracle (by Datastream) replication sources, added a purge policy for a GCS (Google Cloud Storage) bucket created by the plugin that Datastream will write its output to.

CDAP-19584: Added support for monitoring CDAP pipelines using an external tool.

CDAP-18450: Added support for AND triggers. Now, you can create OR and AND triggers. Previously, all triggers were OR triggers.

PLUGIN-871: Added support for BigQuery batch source pushdown.

Enhancements

CDAP-19678: Added the ability to specify Kubernetes affinity for CDAP services in the CDAP custom resource.

CDAP-19605: Logs from the Twill application master are now visible in pipeline logs.

CDAP-19591: In the Datastream replication source, added the property GCS Bucket Location, which Datastream will write its output to.

CDAP-19590: In the Datastream replication source, added the list of Datastream regions to the Region property. You no longer need to manually enter the Datastream region.

CDAP-19589: For replication jobs with an Oracle (by Datastream) source, ensured data consistency when multiple CDC events are generated at the same timestamp, by ordering events reliably.

CDAP-19568: Significantly improved the time it takes to start a pipeline (after provisioning).

CDAP-19555, CDAP-19554: Made the following improvements and changes for streaming pipelines with a single Kafka Consumer Streaming source and no Windower plugins:

The Kafka Consumer Streaming source has native support, so data is guaranteed to be processed at least once.

CDAP-19501: For Replication jobs, improved performance for Review Assessment.

CDAP-19475: Modified /app endpoints (GET and POST) in AppLifecycleHttpHandler to include the following information in the response:

  "change": {
      "author": "joe",
      "creationTimeMillis": 1668540944833,
      "latest": true
}

The new information is included in the response for the following endpoints:

CDAP-19365: Changed the Datastream replication source to identify each row by the Primary key of the table. Previously, the plugin identified each row by the ROWID.

CDAP-19328: Splitter Transformation based plugins now have access to the prepareRun and onRunFinish methods.

CDAP-18430: The Lineage page has a new look-and-feel.

Bug Fixes

CDAP-20002: Removed the CDAP Tour from the Welcome page.

CDAP-19939: Fixed an issue in the BigQuery target replication plugin that caused replication jobs to fail when replicating datetime columns from sources with precision finer than microsecond, for example, the datetime2 data type in SQL Server.

CDAP-19970: Google Cloud Data Loss Prevention plugins (version 1.4.0) are available in the CDAP Hub version 6.8.0 with the following changes:

  • For the Google Cloud Data Loss Prevention (DLP) PII Filter Transformation, fixed an issue where pipelines failed because the DLP client was not initialized.
  • For all of the Google Cloud Data Loss Prevention (DLP) transformations, added relevant exception details when validation of DLP Inspection template fails, rather than throwing a generic IllegalArgumentException.

CDAP-19630: For custom Dataproc compute profiles, fixed an issue where the wrong GCS bucket was used to stage data. Now, CDAP uses the GCS bucket specified in the custom compute profile.

CDAP-19599: Fixed an issue in the BigQuery Replication Target plugin that caused replication jobs to fail when the BigQuery target table already existed. The new version of the plugin will automatically be used in new replication jobs. Due to CDAP-19622, if you want to use the new plugin version in existing jobs, recreate each replication job.

CDAP-19486: In the Wrangler transformation, fixed an issue where the pipeline didn’t fail when the Error Handling property was set to Fail Pipeline. This happened when an error was returned, but no exception was thrown and there were 0 records in output. For example, this happened when one of the directives (such as parse-as-simple-date) failed because the input data was not in the correct format. This fix is under a feature flag and not available by default. If this feature flag is enabled, existing pipelines might fail if there are data issues, because the default Error Handling property is set to Fail Pipeline.

CDAP-19481: Fixed an issue that caused Replication Assessment to hang when the Oracle (by Datastream) GCS Bucket property was empty or had an invalid bucket name. Now, CDAP returns a 400 error code during assessment when the property is empty or has an invalid bucket name.

CDAP-19455: Added user error tags to Dataproc errors returned during cluster creation and job submission. Added the ability to set a troubleshooting docs URL in the CDAP site configuration for Dataproc API errors.

CDAP-19442: Fixed an issue that caused Replication jobs to fail when the source column name didn’t comply with BigQuery naming conventions. Now, if a source column name doesn’t comply with BigQuery naming conventions, CDAP replaces invalid characters with an underscore, prepends an underscore if the first character is a number, and truncates the name if it exceeds the maximum length.
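The sanitization rules described above can be sketched as follows (a hypothetical illustration, not CDAP's actual implementation; the maximum length used here is an assumption for demonstration only):

```python
import re

# Assumed maximum column-name length, for illustration only.
MAX_LEN = 300

def sanitize_column_name(name: str) -> str:
    """Sketch of the sanitization described in CDAP-19442 (not CDAP code)."""
    # Replace characters invalid in BigQuery column names with underscores.
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", name)
    # Prepend an underscore if the first character is a number.
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    # Truncate if the name exceeds the maximum length.
    return cleaned[:MAX_LEN]

# sanitize_column_name("order-id") -> "order_id"
# sanitize_column_name("1col")     -> "_1col"
```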

CDAP-19266: In the File batch source, fixed an issue where Get Schema appeared only when Format was set to delimited. Now, Get Schema appears for all formats.

CDAP-18846: Fixed issue with the output schema when connecting a Splitter transformation with a Joiner transformation.

CDAP-18302: Fixed an issue where Compute Profile creation failed without showing an error message in the CDAP UI. Now, CDAP shows an error message when a Compute Profile is missing required properties.

CDAP-17619: Fixed an issue that caused imports in the CDAP UI to fail for pipelines exported through the Pipeline Microservices.

CDAP-13130: Fixed an issue where you couldn’t keep an earlier version of a plugin when you exported a pipeline and then imported it into the same version of CDAP, even though the earlier version of the plugin is deployed in CDAP. Now, if you export a pipeline with an earlier version of a plugin, when you import the pipeline, you can choose to keep the earlier version or upgrade it to the current version. For example, if you export a pipeline with a BigQuery source (version 0.21.0) and then import it into the same CDAP instance, you can choose to keep version 0.20.0 or upgrade to version 0.21.0.

PLUGIN-1433: In the Oracle Batch Source, when the source data included fields with the Numeric data type (undefined precision and scale), CDAP set the precision to 38 and the scale to 0. If any values in the field had scale other than 0, CDAP truncated these values, which could have resulted in data loss. If the scale for a field was overridden in the plugin output schema, the pipeline failed.

Now, if an Oracle source has Numeric data type fields with undefined precision and scale, you must manually set the scale for these fields in the plugin output schema. When you run the pipeline, the pipeline will not fail and the new scale will be used for the field instead. However, there might be truncation if there are any Numbers present in the fields with the scale greater than the scale defined in the plugin. CDAP writes warning messages in the pipeline log indicating the presence of Numbers with undefined precision and scale in the pipeline. For more information about setting precision and scale in a plugin, see Changing the precision and scale for decimal fields in the output schema.

PLUGIN-1325: In Wrangler, fixed an issue that caused the Wrangler UI to hang when a BigQuery table name contained characters besides alphanumeric characters and underscores (such as a dash). Now, Wrangler successfully imports BigQuery tables that comply with BigQuery naming conventions.

PLUGIN-826: In the HTTP batch source plugin, fixed an issue where validation failed when the URL property contained a macro and Pagination Type was set to Increment an index.

PLUGIN-1378: In the Dataplex Sink plugin, added a new property, Update Dataplex Metadata, which adds support for updating metadata in Dataplex for newly generated data.

PLUGIN-1374: Improved performance for batch pipelines with MySQL sinks.

PLUGIN-1333: Improved Kafka Producer Sink performance.

PLUGIN-664: In the Google Cloud Storage Delete Action plugin, added support for bulk deletion of files and folders. You can now use the (*) wildcard character to represent any character.

PLUGIN-641: In Wrangler, added the Average arithmetic function, which calculates the average of the selected columns.

In Wrangler, Numeric functions support 3 or more columns.

Security Fixes

The following vulnerabilities were found in open source libraries:

  • Arbitrary Code Execution
  • Deserialization of Untrusted Data
  • SQL Injection
  • Information Exposure
  • Hash Collision
  • Remote Code Execution (RCE)

To address these vulnerabilities, the following libraries have security fixes:

  • commons-collections:commons-collections (Deserialization of Untrusted Data). Upgraded to apply security fixes.
  • commons-fileupload:commons-fileupload (Arbitrary Code Execution). Upgraded to apply security fixes.
  • ch.qos.logback:logback-core (Arbitrary Code Execution). Upgraded to apply security fixes.
  • org.apache.hive:hive-jdbc (SQL Injection). Excluded the org.apache.hive:hive-jdbc dependency.
  • org.bouncycastle:bcprov-jdk16 (Hash Collision).
  • com.fasterxml.jackson.core:jackson-databind (Deserialization of Untrusted Data). Upgraded to apply security fixes.

Deprecations

CDAP-19559: For streaming pipelines, the Pipeline configuration properties Checkpointing and Checkpoint directory are deprecated. Setting these properties will no longer have any effect.

CDAP automatically decides whether checkpointing or CDAP internal state tracking is enabled. To disable at-least-once processing in streaming pipelines, set the runtime argument cdap.streaming.atleastonce.enabled to false; this disables both Spark checkpointing and state tracking.
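As a sketch, the runtime argument named above could be passed as the following key/value pair (shown here as a JSON map of runtime arguments; whether you set it through the UI, REST API, or CLI depends on your setup):

```json
{
  "cdap.streaming.atleastonce.enabled": "false"
}
```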

cdap - CDAP 6.7.2

Published by vsethi09 about 2 years ago

Enhancements
CDAP-19601: For new Dataproc compute profiles, changed the default value of Master Machine Type and Worker Machine Type from n2 to e2.

Bug Fixes
CDAP-19532: Fixed an issue in the Database Batch Source plugin that caused pipelines to fail during runtime when there was a column with precision of 0 in the source returned by JDBC. Now, if a column has a precision of 0, the pipeline no longer fails. This affected CDAP 6.7.1 only. Note: In the Database Batch Source, if a column has precision 0, you must change the data type to Double in the Output Schema to ensure the pipeline runs successfully.

PLUGIN-1373: In the BigQuery Sink plugin (version 0.20.3), fixed an issue that sometimes caused a NullPointerException error when trying to update table metrics.

PLUGIN-1367: In the BigQuery Sink plugin (version 0.20.3), fixed an issue that caused a NullPointerException error when the output schema was not defined.

PLUGIN-1361: In the Send Email batch pipeline alert, fixed an issue where emails failed to send when the Protocol was set to TLS.

cdap - CDAP 6.7.0

Published by sechegaray about 2 years ago

New Features
General
Added support for mounting arbitrary volumes to CDAP system services in the CDAP operator.

Performance and Scalability
CDAP-19016: Increased pipeline run scalability.

CDAP-18837: Use system pods to enable horizontal scaling of pipeline launching. For more information, see System Workers.

Plugins
Google Dataplex Batch Source and Google Dataplex Sink system plugins are available in Preview.

Transformation Pushdown
Transformation Pushdown for joins is generally available (GA).

In Transformation Pushdown, Group By aggregation and Deduplicate aggregation are available in Preview.

CDAP-18437: Transformation Pushdown supports the BigQuery Storage Read API to improve performance when extracting data from BigQuery.

PLUGIN-1001: Added support for connections to Transformation Pushdown.

Wrangler
Added support to parse files before loading data into the Wrangler workspace. This means the recipe does not include parse directives. Now, when you create a pipeline from Wrangler, the source has the correct Format property.

Added support for importing the schema for formats such as JSON and some Avro files, where schema inference is not possible, before loading data into the Wrangler workspace.

Enhancements
PLUGIN-1245: In the Joiner transformation, renamed the Distribution Skewed Input Stage property to Skewed Input Stage. Changed UI label only.

PLUGIN-1118: In Google Cloud File Reader batch source and Amazon S3 batch source plugins, added the Enable Quoted Values property, which lets you treat content between quotes as a value.

PLUGIN-1107: In the Google Cloud Data Loss Prevention (DLP) Decrypt Transformation and Google Cloud Data Loss Prevention (DLP) Redact Transformation, added the Resource Location property, which lets you specify the resource location for the DLP Service. For more information, see Specifying processing locations | Data Loss Prevention Documentation | Google Cloud.

PLUGIN-1004, CDAP-18386: Improved connection management to allow users to edit connections. Removed option to view connections.

PLUGIN-984: Added support for connections to the following plugins:

CloudSQL MySQL batch source

CloudSQL MySQL sink

CloudSQL PostgreSQL batch source

CloudSQL PostgreSQL sink

PLUGIN-968: Added support for connections in the following sinks:

BigQuery Table

Database

BigQuery Multi Table

PostgreSQL

GCS Multi Files

GCS

SQL Server

Kafka Producer

MySQL

Oracle

S3

Spanner

PLUGIN-965: In the GCS Done File Marker post-action plugin, added the Location property, which lets you use buckets and customer-managed encryption keys in locations other than the US.

PLUGIN-926, PLUGIN-939: In the BigQuery Execution Action plugin and the BigQuery Argument Setter action plugin, added support for the Dataset Project ID property, which is the Project ID of the dataset that stores the query results. It's required if the dataset is in a different project than the BigQuery job.

PLUGIN-731: In BigQuery sinks, added support for BigNumeric data type.

PLUGIN-670: In the BigQuery Table Batch Source, added the ability to query any temporary table in any project when you set the Enable querying views property to Yes. Previously, you could only query views.

PLUGIN-650: In Google Data Loss Prevention plugins, added support for templates from other projects.

CDAP-18982: Added a new pipeline state for when you manually stop a pipeline run: Stopping.

CDAP-18778: In the BigQuery Execute action plugin, added the ability to look up the Drive scope for the service account to read from external tables created from Google Drive.

CDAP-18713: Added support for setting up workload identity in separate k8s namespaces.

CDAP-18655: Improved generic Database source plugin to correctly read decimal data.

CDAP-18556: Improved Google Cloud Platform plugins to validate the Encryption Key Name property.

CDAP-18456: In the replication configurations, added the ability to enable soft deletes from a BigQuery target.

CDAP-18405: Improved connection management to allow users to browse partial hierarchies like BigQuery datasets and Dataplex zones.

CDAP-18318: Permission checks are now required for updating/viewing system service information.

CDAP-17955: Replication assessment warnings no longer block draft deployment.

CDAP-16035: In Wrangler, added support for nested arrays, such as the BigQuery STRUCT data type.

In the Amazon S3 connection and Amazon S3 batch source plugins, added Session Token property.

In the Google Cloud Storage File Reader batch source plugin, added the Allow Empty Input property.

In the Joiner transformation, added the Input with Larger Data Skew property.

In the Google Cloud Storage File Reader batch source, Amazon S3 batch source, and File batch source plugins, changed the Skip Header property name to Use First Row as Header.

Behavior Changes
CDAP-18990: In the Pipeline Studio, if you click Stop on a running pipeline and the pipeline does not stop within 6 hours, the pipeline is forcefully terminated.

CDAP-18918: In the Deduplicate Analytics plugin, limited the Filter Operation property to one record. If this property is not set, one random record is chosen from the group of duplicate records.

PLUGIN-795: The BigQuery sink supports Nullable Arrays. A NULL array gets converted to empty arrays at insertion time.

Wrangler no longer infers all values in CSV files as Strings. Instead, it maps the columns to a corresponding data type.

Bug Fixes
PLUGIN-1210: Fixed an issue in the Group By transformation where Longest String and Shortest String aggregators returned an empty string "", even when all records contained null values in the specified field. The Group By transformation now returns null for empty input.

PLUGIN-1183: Fixed an issue in the Group By transformation that caused the Concat and Concat Distinct aggregate functions to produce incorrect results in some cases.

PLUGIN-1177: Fixed an issue in the Group By transformation that caused the Variance, Variance If, and Standard Deviation aggregate functions to produce incorrect results in some cases.

PLUGIN-1126: In the Oracle and MySQL Batch Source plugins, fixed an issue so that all timestamps from the database, specifically those older than the Gregorian cutover date (October 15, 1582), are treated in Gregorian calendar format.

PLUGIN-1074: Improved the generic Database source plugin to correctly read data when the data type is NUMBER, scale is set, and the data contains integer values.

PLUGIN-1024: Fixed an issue in the Router transformation that resulted in an error when the Default handling property was set to Skip.

PLUGIN-1022: Fixed an issue that caused pipelines with a Conditional plugin and running on MapReduce to fail.

PLUGIN-972: Fixed an issue in sources (such as File and Cloud Storage) that resulted in an error if you clicked Get Schema when the source file contained delimiters used in regular expressions, such as "|" or ".". You no longer need to escape delimiters for sources.

PLUGIN-733: Fixed an issue where Google Cloud Datastore sources read a maximum of 300 records. Datastore sources now read all records.

PLUGIN-704: Fixed an issue in BigQuery sinks where the output table was not partitioned correctly under the following circumstances:

The output table doesn’t exist.

Partitioning type is set to Time.

Operation is set to Upsert.

PLUGIN-694: Fixed an issue that caused pipelines with BigQuery sinks that have input schemas with nested array fields to fail.

PLUGIN-682: Fixed an access-level issue that caused pipelines with Elasticsearch and MongoDB plugins to fail.

CDAP-18994: Fixed issues that caused failures when reading maps and named enums from Avro files.

CDAP-18992: Fixed an issue in the Replication BigQuery target plugin where French characters in the source table were transferred incorrectly, being replaced with '?'.

CDAP-18974: Fixed an issue in the MySQL replication source where timestamp mapped to string. Now, timestamp correctly maps to timestamp.

CDAP-18878: Fixed an issue where Amazon S3 source and sink failed on Spark3 when using s3n as the Path property.

CDAP-18806: Fixed an issue in the GCS connection so it can read files with spaces in the name.

CDAP-18900: Fixed an issue where FileSecureStoreService did not properly store keys in case-sensitive namespaces.

CDAP-18860: Removed whitespace trimming from the runtime arguments UI. Whitespace can now properly be set as an argument value.

CDAP-18786: Fixed an issue in plugin templates where the Lock change option did not work.

CDAP-18692: Fixed an issue that caused Null Pointer Exceptions when dealing with Array of Records in BigQuery.

CDAP-18396: Fixed an issue where connection names allowed special characters. Now, connection names can only include letters, numbers, hyphens, and underscores.

CDAP-18009: Fixed an issue where the CDAP Pipeline Studio UI automatically checks the Null box if a schema record has the array data type.

CDAP-17955: Replication assessment warnings no longer block draft job deployment.

Deprecations
MapReduce Compute Engine
CDAP-18913: The MapReduce compute engine is deprecated and will be removed in a future release. Recommended: Use Spark as the compute engine for data pipelines.

Spark Compute Engine running on Scala 2.11
CDAP-19063: Spark running on Scala 2.11 is no longer supported. CDAP supports Spark 2.4+ running on Scala 2.12 only.

CDAP-19016: Spark-specific metrics are no longer served through the CDAP metrics API.

Wrangler
CDAP-18897: Deprecated the Set first row as header option for the parse-as-csv Wrangler directive. Parsing should be configured at the connection or source layer, not at the transformation layer. For more information, see Parsing Files in Wrangler.

Plugins
The following plugins are deprecated and will be removed in a future release:

Avro Snapshot Dataset batch source

Parquet Snapshot Dataset batch source

CDAP Table Dataset batch source

Avro Time Partitioned Dataset batch source

Parquet Time Partitioned Dataset batch source

Key Value Dataset batch source

Key Value Dataset sink

Avro Snapshot Dataset sink

Parquet Snapshot Dataset sink

Snapshot text sink

CDAP Table Dataset sink

Avro Time Partitioned Dataset sink

ORC Time Partitioned Dataset sink

Parquet Time Partitioned Dataset sink

MD5/SHA Field Dataset transformation

Value Mapper transformation

Amazon S3 Batch Source and Sink
CDAP-18878: s3n is deprecated as a scheme in the Path property in Amazon S3 Batch Source and Amazon S3 Sink plugins. If the Path property includes s3n, it is converted to s3a during runtime.

Known Issues
PostgreSQL batch source and sink plugins
PLUGIN-1126: Any timestamps older than the Gregorian cutover date (October 15, 1582) will not be represented correctly in the pipeline.

SQL Server Replication Source
CDAP-19354: The default setting for the snapshot transaction isolation level (snapshot.isolation.mode) is repeatable_read, which locks the source table until the initial snapshot completes. If the initial snapshot takes a long time, this can block other queries.

If the snapshot transaction isolation level doesn't work or is not enabled on the SQL Server instance, follow these steps:

  1. Configure SQL Server with one of the following transaction isolation levels:

     In most cases, set snapshot.isolation.mode to snapshot.

     If schema modifications will not happen during the initial snapshot, set snapshot.isolation.mode to read_committed.

     For more information, see Enable the snapshot transaction isolation level in SQL Server 2005 Analysis Services.

  2. After SQL Server is configured, pass a Debezium argument to the Replication job. To pass a Debezium argument to a Replication job in CDAP, specify a runtime argument prefixed with source.connector. For example, set the Key to source.connector.snapshot.isolation.mode and the Value to snapshot.

For more information about setting a Debezium property, see Pass a Debezium argument to a Replication job.
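As a sketch, the runtime argument described above, expressed as a JSON key/value pair, would look like this:

```json
{
  "source.connector.snapshot.isolation.mode": "snapshot"
}
```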

Dataproc
CDAP version 6.7.0 does not support Dataproc version 1.3. For more information, see the compatible versions of Dataproc.

cdap - CDAP 6.7.1

Published by sechegaray about 2 years ago

Enhancements
CDAP-19050: Enhanced the Dataproc provisioner to avoid making unneeded Compute Engine calls depending on the configuration settings.

CDAP-18336: For new Dataproc compute profiles, changed the default value of Master Machine Type from n1 to n2.

Bug Fixes
CDAP-19381: Fixed an issue in CDAP that created duplicate entries in file cache map, which resulted in multiple attempts to delete the same cache file.

CDAP-19379:

Fixed an issue where the Log service left empty folders, which made the mounting of Persistent Disk slow. This caused the Log service to fail to start in a timely manner.

Fixed an issue that caused pipelines to take a long time to launch or get stuck. This was linked to I/O throttling that occurred on the underlying Persistent Disk.

CDAP-19366: Fixed an issue that caused pipelines to fail when two or more pipelines were scheduled to start simultaneously on a static Dataproc cluster. This was due to a file upload race condition.

CDAP-19353: Fixed an issue in flow control that caused Appfabric to return a 5xx error code in rare scenarios, instead of 429 (Too Many Requests), when the number of concurrently launching or running pipelines was above certain thresholds.

CDAP-19276: Fixed an issue that resulted in an error when a compute profile was exported from the default namespace after switching from a custom namespace.

CDAP-19216: Fixed an issue where starting a pipeline multiple times and then stopping it before it completed resulted in the following UI error: Program is not running.

CDAP-19211: Removed verbose logs from the BigQuery client libraries in pipeline logs.

PLUGIN-1256: Fixed an issue that caused the BigQuery Execute action plugin configured with an Encryption Key Name (CMEK) to fail when the SQL query contained DDL Statements.

PLUGIN-954: In the BigQuery Execute action plugin, added a property Store Results in a BigQuery Table in the UI, which hides the destination table related properties by default.

cdap - CDAP 6.6.0

Published by seanzhougoogle over 2 years ago

New Features
CDAP-18653: Added one-click autoscaling for Dataproc compute profiles.

Enhancements
PLUGIN-994: Added support for Fetch Size to the following plugins:

CloudSQL MySQL batch source

CloudSQL PostgreSQL batch source

IBM DB2 batch source

MariaDB batch source

MySQL batch source

Netezza batch source

Oracle batch source

PostgreSQL batch source

SQL Server batch source

Teradata batch source

CDAP-18738: Dataproc Cluster Reuse. The runtime property system.profile.properties.clusterReuseEnabled is no longer required to enable cluster reuse. The default Max Idle Time is set to 30 minutes to prevent accidental cluster leaks.

CDAP-18725: Added more details for pipeline success and failure metrics.

CDAP-18712: Added ability to limit published lineage messages to a configurable size to avoid out of memory errors due to large lineages.

CDAP-18651: Preview runners no longer perform any kind of access enforcement.

CDAP-18647: Added new limit of 5000 records for Previewing data in the Pipeline Studio.

CDAP-18621: Added new default value of 30 minutes for the Dataproc profile Max Idle Time property. Previously, Max Idle Time had no default value.

CDAP-18836: Added temporary namespace UPDATE enforcement for pipeline connections.

CDAP-18798: Added system.program.starting.delay.seconds metric to measure time taken by program to transition from provisioning to running state.

CDAP-18714: Added metrics for API call latency.

CDAP-18725: Added new tags (Provisioner, Cluster Status, Existing Status) to existing program failure/success metric.

CDAP-17772: Added authentication and authorization between internal system services via token verification.

Instance Stability and Memory Usage
CDAP-18696: Added a new Applications parameter (app.max.concurrent.launching) to cdap-default.xml to control back pressure on pipeline start requests. Requests exceeding the limit fail with a 429 (Too Many Requests) status.

CDAP-18712: Added new Metadata parameter (metadata.messaging.publish.size.limit) to cdap-default.xml to limit the size of published lineage messages to avoid out of memory errors due to large lineages.

CDAP-18672: Added new Dataset parameter (data.storage.sql.scan.size.rows) to cdap-default.xml to set the number of rows fetched for database reads from PostgreSQL.
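As a sketch, the limits described above could be overridden in cdap-site.xml using the standard Hadoop-style property format (the values below are illustrative, not recommended settings):

```xml
<!-- Illustrative overrides for the cdap-default.xml parameters above. -->
<property>
  <name>app.max.concurrent.launching</name>
  <value>50</value>
</property>
<property>
  <name>metadata.messaging.publish.size.limit</name>
  <value>131072</value>
</property>
<property>
  <name>data.storage.sql.scan.size.rows</name>
  <value>1000</value>
</property>
```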

CDAP-18559, CDAP-17986: Added retries to Dataproc API calls to ensure transient errors don’t affect cluster provisioning.

CDAP-18594, CDAP-18810: Fixed a problem where a pipeline could not be deleted because its program state was not updated after retries.

CDAP-18857: Added new Applications parameter (app.artifact.parallelism.max) to cdap-default.xml that limits artifact repository initialization parallelism to prevent Out of Memory errors on App Fabric startup.

CDAP-18848: Reduced the Metrics parameter (metrics.processor.queue.size) default from 20000 to 1000 to prevent Out of Memory errors during metric processing.

CDAP-18791, CDAP-18627, CDAP-18553: Improved LevelDB performance and memory usage.

CDAP-18748, CDAP-18737, CDAP-18685, CDAP-18680: Improved running pipelines handling during App Fabric restarts.

CDAP-18656: Prevented an App Fabric Out of Memory error when retrieving a long list of pipelines within a namespace.

CDAP-18603: Added pagination to application list API.

CDAP-18586: Prevented App Fabric Out of Memory errors when the system argument list is too long.

Bug Fixes
PLUGIN-1035: Fixed an issue that caused pipelines to fail when a Database batch source included a decimal column with precision greater than 19.

PLUGIN-1022: Fixed an issue that caused pipelines with a Conditional plugin and running on MapReduce to fail.

PLUGIN-1015: Fixed an issue that caused pipelines with a Conditional plugin and running on Spark to fail.

PLUGIN-974: Fixed an issue that caused validation to fail for GCS Multi File sinks.

Behavior Changes
CDAP-18586: The getApplicationSpecification() method in the interface io.cdap.cdap.api.schedule.ProgramStatusTriggerInfo has been removed in CDAP 6.6.0, which can break CDAP builds if you use this method.

cdap - CDAP 6.5.1

Published by greeshmaswaminathan almost 3 years ago

Enhancements

PLUGIN-883, PLUGIN-897: Added Encryption Key Name property to the following plugins so users can encrypt any new resources created by these plugins with Customer Managed Encryption Keys (CMEK):

  • BigQuery Execute action

  • GCS Copy action

  • GCS Create action

  • GCS Move action

  • GCS Done File Marker Pipeline Alert

  • BigQuery Batch source

  • BigQuery Multi Table sink

  • BigQuery Table Sink

  • Google Cloud Storage sink

  • Google Cloud Storage Multi File sink

  • Google Cloud PubSub sink

  • Google Cloud Spanner sink

  • Transformation Pushdown to BigQuery

PLUGIN-898: Added Location property to GCS Copy and GCS Move action plugins to auto-create destination buckets if they do not exist before running the pipeline. Previously, the bucket had to exist before running the pipeline.

CDAP-18566: The File connection now browses the file system. For example, on a Hadoop cluster, the File connection now browses the HDFS file system. For CDAP Sandbox, the File connection still browses the local file system.

CDAP-18532: Added the following optional cdap-site.xml configs:

If router.block.request.enabled is set to true in the configuration, the request router responds to every user request with a specific response (provided through configuration), blocking all user requests.

If a status code is provided using router.block.request.status.code, the server responds with that status code. The default is 503.

If a response message is provided using router.block.request.message, the server responds with that message as the response body; otherwise, the response body is empty.
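Taken together, a cdap-site.xml fragment enabling request blocking might look like the following sketch (the status code and message values are illustrative):

```xml
<!-- Block all user requests with an illustrative status code and message. -->
<property>
  <name>router.block.request.enabled</name>
  <value>true</value>
</property>
<property>
  <name>router.block.request.status.code</name>
  <value>503</value>
</property>
<property>
  <name>router.block.request.message</name>
  <value>CDAP is temporarily unavailable for maintenance.</value>
</property>
```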

CDAP-18384: Added metrics for authorization in CDAP.

Bug Fixes

CDAP-18571: Fixed an issue where messages couldn’t be retrieved for Kafka topics. This broke in 6.5.0 and is now fixed in 6.5.1.

CDAP-18538, CDAP-184254: Fixed an issue where you couldn’t create a profile for an existing Dataproc cluster.

CDAP-18529: Fixed an issue that caused pipelines to fail when Transformation Pushdown was enabled and used macros as properties.

CDAP-18446: Fixed an issue that caused long running programs, like Replication, to fail within the default Hadoop delegation token timeout. Now, these tokens get renewed so that the job keeps running.

CDAP-18439: Fixed an issue in Replication that caused the Configure button to result in an error when you clicked it.

CDAP-18428: Fixed an issue that caused pipelines to fail with an Access Denied error when the pipeline had BigQuery plugins or Transformation Pushdown configuration that included a Dataset Project ID that was in a different project than the specified Project ID:

  • BigQuery sources

  • BigQuery sinks

  • BigQuery Multi Table sinks

  • Transformation Pushdown

The Access Denied error was due to missing permissions on the service account.

To ensure pipelines with BigQuery or BigQuery Multi Table sinks and pipelines with Transformation Pushdown enabled run successfully, assign the following roles to the Project ID service account:

  • BigQuery Job User role to run jobs

  • GCE Storage Bucket Admin role to create a temporary bucket

If the dataset is not in the same project that the BigQuery job will run in, the Dataset Project ID service account must be granted the following role to write data to a BigQuery dataset or table:

  • BigQuery Data Editor role

CDAP-18423: Fixed an issue in the GCS connection that prevented browsing and parsing files stored in folders under buckets.

CDAP-18335: Fixed an issue where the UI was unusable until an error displayed in the UI was closed by clicking the x icon.

CDAP-18318: Fixed an issue where users did not need permission to restart system services, reset system service log levels, get system service statuses, etc. Now, if authorization is enabled on the cluster, users will need to have the corresponding permissions for these system services in order to access them.

CDAP-18249: Fixed an issue where the Upload window didn’t close after uploading a user-defined directive due to missing properties in the user-defined directive json.

PLUGIN-899: Fixed an issue that caused custom formats to be unusable in the GCS source and sink.

cdap - CDAP 6.5.0

Published by greeshmaswaminathan about 3 years ago

New Features

Connections

CDAP-17870: Added global connections for sources in Wrangler and data pipelines. For more information, see Managing Connections. Also added new endpoints for connections to the Pipeline Microservices.

CDAP-17924: Redesigned the Namespace Admin page.

Dataproc

CDAP-17999: Added support for labels in the Dataproc provisioner.

CDAP-17862: Added Shielded VMs as configuration settings for the Dataproc provisioner. For more information, see Google Dataproc.

CDAP-18004: Added support for running worker pods using different Kubernetes service accounts.

Namespaces

CDAP-17731: Added support to show current namespace name in the footer.

CDAP-17877, CDAP-17876: Added Connections and Drivers to Namespace Admin page for centralized management of all connections and Drivers. For more information, see JDBC Drivers and Managing Connections.

Spark 3

CDAP-17693: Added Spark 3 support for Standalone CDAP, CDAP Sandbox, and Previewing data.

CDAP-17930: Changed the default Dataproc version to 2.0 for new and upgraded pipelines. For more information, see “Upgrade Notes for Spark 3” below.

Transformation Pushdown

CDAP-17863: Added support for Transformation pushdown into BigQuery for Joiner transformations. For more information, see Using Transformation pushdown.

Improvements

CDAP-17730: Added authorization checks for preferences, logging, compute profiles, and metadata endpoints.

CDAP-17915: Added support to search for tables based on schema name when you select tables for a Replication job.

CDAP-17946: Improved error messages on the Pipeline List page.

CDAP-17973: Improved Wrangler error messages.

CDAP-18024: Added support for running CDAP as a non-root user.

CDAP-18039: Added additional trace logging in the authorization flow for debugging.

CDAP-18146: Pods created by CDAP now inherit their ImagePullPolicy from the pod which created them.

CDAP-18194: Added support for BIGNUMERIC data type for BigQuery target in replication.

PLUGIN-764: Added support for Datetime data type for SQL Server batch source plugins.

PLUGIN-645: Added support for Datetime data type for Replication jobs.

Behavior Changes

CDAP-18114: MySQL, Oracle, PostgreSQL, and SQL Server batch sources, sinks, actions, and pipeline alerts are now installed by default as system plugins. Previously, these plugins were available in the Hub as user plugins.

CDAP-17898: When you use a connection in Wrangler and create a data pipeline, CDAP now creates a pipeline with the source plugin and then Wrangler transformation. In previous releases, CDAP created the pipeline with just the Wrangler transformation. You had to manually add the source plugin to the pipeline and configure it.

Bug Fixes

CDAP-17895: Fixed an issue in Replication that caused jobs to fail if more than 1000 tables were selected for replication.

CDAP-17919: Fixed an issue that caused replication jobs to hang when there were too many Delete or DDL events.

CDAP-17939: Improved the Messaging Service cleanup strategy so that it uses far fewer resources and cannot go out of memory.

CDAP-17942: Fixed an issue that caused plugin validation to fail when a macro is used within a macro function. For example: ${logicalStartTime(${date_format})}

CDAP-17943: Fixed an issue that caused pipelines with aggregations and Decimal fields to fail with an exception.

CDAP-17959: Fixed an issue that caused Wrangler to ignore all the other columns other than the given column when parsing Excel files.

CDAP-17965: For CDAP instances running on Kubernetes, fixed an issue that prevented new previews from being scheduled after the preview manager had been stopped 10 times.

CDAP-17995: Fixed Wrangler to fail pipelines upon error. In Wrangler 6.2 and above, there was a backwards incompatible change where pipelines did not fail if there was an error and instead were marked as completed.

CDAP-18002: Fixed an issue in Replication that caused jobs to fail when restarted during snapshotting.

CDAP-18012, CDAP-18003, CDAP-17853: Improved resilience of TMS.

CDAP-18060: Fixed an issue in CDAP Sandbox that caused Get Schema to fail when the source includes the Format field.

CDAP-18131: Fixed an issue where replication to BigQuery failed because the source table had column names that are reserved keywords in BigQuery.

PLUGIN-178: Fixed an issue while writing non-null values to a nullable field in BigQuery.

PLUGIN-635: Fixed an issue in the BigQuery plugins to correctly delete temporary GCS storage buckets.

PLUGIN-654: Fixed an issue that caused pipelines to fail when Pub/Sub source Subscription field was a macro.

PLUGIN-655: Fixed an issue in the BigQuery sink that caused failures when the input schema was not provided.

PLUGIN-669: Fixed an issue where the Join Condition Type was not displayed in the Joiner for pipelines upgraded from versions earlier than 6.4.0.

PLUGIN-678: Fixed an issue in the BigQuery sink that caused pipelines to fail or give incorrect results.

PLUGIN-697: Fixed an issue that caused File Source Plugin validation to fail when there was a macro in the Format field.

Upgrade Notes for Spark 3

In CDAP 6.5.0, Spark 3 is the new default engine used for Preview and for running pipelines on Dataproc. Spark 1 support has also been removed from CDAP.

After an instance is upgraded to version 6.5.0, any new or upgraded pipeline that uses a Dataproc profile without an explicit image version will use the latest Dataproc image 2.0 that has Spark 3.1 bundled.

Any pipeline that was not upgraded will still use the original 1.3 Dataproc image that has Spark 2.3 bundled.

What does it mean for pipeline developers / operations?

Spark 3.1 provides many improvements in different areas. See the release notes for Spark 3.0 and Spark 3.1. The main changes that affect backwards compatibility are:

Python 2 support is removed; any PySpark code must be Python 3 compatible.

Spark 3.1 uses Scala 2.12, which is binary incompatible with Scala 2.11. Most code is source compatible, so recompile your Scala code with Scala 2.12 if you run into issues.

What does it mean for plugin developers?

If you use any Scala code, make sure it’s binary compatible with the corresponding Scala version: 2.12 for Spark 3 and 2.11 for Spark 2 execution environment.

This can be achieved by referencing the proper spark2_2.11 or spark3_2.12 version of the CDAP artifact; for example, see [CDAP-17693] Introduce spark 3 for tests, drop spark 1, create spark … by tivv · Pull Request #1364 · cdapio/hydrator-plugins. Note that you must explicitly choose the version because artifacts without a version suffix, which previously used Spark 1, are no longer available.

If you have any dependencies on Scala-specific artifacts (e.g. Kafka), change those as well.

The new Hadoop version used in dependencies is 2.6.0 instead of 2.3.0.
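As a sketch of the artifact naming described above, a Maven dependency on the Spark 3 variant of a CDAP artifact might look like the following (the artifact ID and version shown are illustrative; check the actual artifact names published for your CDAP version):

```xml
<!-- Hypothetical dependency on the Spark 3 / Scala 2.12 artifact variant. -->
<dependency>
  <groupId>io.cdap.cdap</groupId>
  <artifactId>cdap-data-pipeline3_2.12</artifactId>
  <version>6.5.0</version>
</dependency>
```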

What to do in case of any problems?

Spark 2 is still fully supported by CDAP. If you use Dataproc, enter image version “1.3” in your provisioning profile and it will use exactly the same image CDAP 6.4 uses.

It’s highly recommended that you solve any problems you find and migrate to the Spark 3 execution environment, as it brings a number of enhancements, including significant performance improvements.

Known Issues

Database connections

Although you can create connections for Database, MySQL, Oracle, PostgreSQL, and SQL Server sources, the plugin properties do not include Use Connection. This means that you cannot reference a connection in a database source plugin. However, from the Properties page in a database source plugin, you can select a connection to have CDAP populate the plugin properties with the connection properties.

To use the properties set in these connections in the corresponding batch source plugin, follow these steps:

In Pipeline Studio, add the source plugin to the canvas.

Click Properties.

Click Browse Database.
The Browse Database page appears with the available connections listed in the left panel.

Click the connection you want to use.

Locate the table you want to add to the source plugin and click it.
The source properties now include all of the properties from the connection.

cdap - CDAP 6.4.1

Published by rmstar over 3 years ago

New Features

Replication

PLUGIN-645: BigQuery targets now support the Datetime data type.

Bug Fixes

CDAP-17943: Fixed an issue that caused pipelines with aggregations and Decimal fields to fail with an exception.

CDAP-17939: Improved the Messaging Service cleanup strategy so that it uses far fewer resources and cannot go out of memory.

CDAP-18012, CDAP-18003, CDAP-17853: Improved resilience of TMS.

PLUGIN-669: Fixed Join Condition Type to be displayed in the Joiner for pipelines upgraded from versions prior to 6.4.0.

CDAP-17995: Fixed Wrangler to fail pipelines upon error. In Wrangler 6.2 and above, there was a backward-incompatible change where pipelines did not fail on error and were instead marked as completed.

CDAP-17965: For CDAP instances running on Kubernetes, fixed an issue that prevented Previews from starting once a user has stopped a Preview run 10 times.

PLUGIN-178: Fixed an issue while writing non-null values to a nullable field in BigQuery.

PLUGIN-635: Fixed an issue in the BigQuery plugins to correctly delete temporary GCS buckets.

PLUGIN-655: Fixed an issue in the BigQuery sink that caused failures when the input schema was not provided.

PLUGIN-678: Fixed an issue in the BigQuery sink that caused pipelines to fail or give incorrect results.

PLUGIN-654: Fixed an issue that caused pipelines to fail when Pub/Sub source Subscription field was a macro.

cdap - CDAP 6.4.0

Published by rmstar over 3 years ago

New Features

Datetime Data Type

PLUGIN-615, PLUGIN-614: Added Datetime data type support to the following plugins:

  • BigQuery batch source
  • BigQuery sink
  • BigQuery Multi Table sink
  • Bigtable batch source
  • Bigtable sink
  • Datastore batch source
  • Datastore sink
  • GCS File batch source
  • GCS File sink
  • GCS Multi File sink
  • Spanner batch source
  • Spanner sink
  • File source
  • File sink
  • Wrangler
  • Amazon S3 batch source
  • Amazon S3 sink
  • Database source

Also, the BigQuery datetime type can be mapped directly to the CDAP datetime data type.

CDAP-17684, CDAP-17636: Added support for DateTime data type in Wrangler. You can now select Parse > Datetime to transform columns of strings to datetime values and Format > Datetime to change the date and time pattern of a column of datetime values.

Added new Wrangler directives that you can use in Power Mode to transform columns of strings to datetime values: Parse as Datetime, Current Datetime, Datetime to Timestamp, Format Datetime, and Timestamp to Datetime.

CDAP-17620: Added support for the Datetime logical data type in the CDAP schema.

Dataproc

CDAP-17622: Added machine type, cluster properties, and idle TTL as configuration settings for the Dataproc provisioner. For more information, see Google Dataproc.

Security

CDAP-17709: Added support for PROXY authentication mode to nodejs proxy. CDAP UI now supports both MANAGED and PROXY modes of authentication. For more information, see Configuring Proxy Authentication Mode.

Pipeline Studio

CDAP-17549: Added support for data pipeline comments. For more information, see Adding comments to a data pipeline.

Plugin OAUTH Support

CDAP-17611: Updated Salesforce plugins to use the new OAuth macro function

CDAP-17610: Implemented a new macro function for OAuth token exchange

CDAP-17609: Implemented new HTTP endpoints for OAuth management

Replication

CDAP-17674: Added support to allow users to specify a runtime argument, retain.staging.table, to retain the BigQuery staging table to help debug issues

CDAP-17595: Added upgrade support for replication jobs

CDAP-17471: Added the ability to duplicate, export, and import replication jobs

CDAP-17337: Added property to configure dataset name in the BigQuery replication target. By default, the dataset name is the same as the Replication source database name. For more information, see Google BigQuery Target.

CDAP-16755: Added the ability to set the runtime argument "event.queue.capacity" to specify the capacity, in bytes, of the event queue for Replication jobs. If the target plugin consumes events more slowly than the source plugin emits them, events may remain in the queue and occupy memory. With this setting, the user can control the maximum amount of memory used for the event queue.
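For example, to cap the event queue at roughly 32 MB, the runtime argument is a single key/value pair like the following (the value is illustrative and is specified in bytes):

```json
{
  "event.queue.capacity": "33554432"
}
```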

Kubernetes

CDAP-17618: Replaced ZooKeeper in the Kubernetes CDAP setup with Kubernetes secrets. For more information, see Prepare the secret token for authentication service.

CDAP-17466: Added Authentication functionality for CDAP on Kubernetes setup. For more information, see Installation on Kubernetes.

Joiner Analytics Plugin

CDAP-17607: Added advanced join conditions to the Joiner plugin. This allows users to specify an arbitrary SQL condition to join on. These types of joins are typically much more costly to perform than basic equality joins. For more information, see Join Condition Type.

New System Plugins for Data Pipelines

PLUGIN-558: Added a new post-action plugin, GCS Done File Marker. This post-action plugin marks the end of a pipeline run by creating and storing an empty DONE (or SUCCESS) file in the given GCS bucket upon pipeline completion, success, or failure, so that you can use it to orchestrate downstream or dependent processes.

Improvements

PLUGIN-601: Added a metric for bytes read from database source

PLUGIN-571: Added support to filter tables in the Multiple Database Tables Batch Source

PLUGIN-570: Improved error handling for Multiple Database Batch Sources and BigQuery multi-table sink that enables the pipelines to continue if one or more tables fail

CDAP-17724: Renamed replication pipelines to jobs

CDAP-17721: Added support for Kerberos login in K8s environment

CDAP-17675: Renamed Delete button to Remove in Replication Assessment report

CDAP-17670: Improved plugin initialization performance

CDAP-17650: Added tag with parent artifact detail to Dataproc cluster created by CDAP

CDAP-17645: Set a timeout on the SSH connection so that pipeline runs fail when the cluster becomes unreachable

CDAP-17642: Added namespace count to Dataplane metrics

CDAP-17621: Added the Customer-Managed Encryption Key (CMEK) configuration property for the replication BigQuery target. For more information, see Google BigQuery Replication Target.

CDAP-17613: Improved Replication Assessment page to highlight SQL Server tables with Schema issues in red

CDAP-17603: Added ability to jump to any step when modifying the Replication draft

CDAP-17601: Improved performance by loading data directly into the target table during replication snapshot process

CDAP-17597: Added poll metrics in Overview and Monitoring in Replication detail view

CDAP-17583: Improved Performance for Replication

CDAP-17582: Added ability to pass additional properties for Debezium and jdbc drivers for replication sources

CDAP-17482: Added ability to start Replication app from a last known checkpoint.

CDAP-17474: Added support for configuring Elasticsearch TLS connections to trust all certs. For more information, see Elasticsearch.

CDAP-17414: Improved Replication Table selection user experience

CDAP-17289: Improved reliability of Pub/Sub Source plugin

CDAP-17248: Added File Encoding property to Amazon S3, File and GCS File Reader batch source plugins

CDAP-17114: Removed the record view in pipeline preview for the Joiner node because it was misleading

CDAP-16548: Renamed the Staging Bucket Location property to Location in the BigQuery Target properties page. For more information, see Google BigQuery Target.

CDAP-16623: Removed multiple ways to collapse/expand the Connection menu

CDAP-16008: Added support for Kerberos Hadoop cluster in the Remote Hadoop Provisioner

CDAP-15552: Fixed Wrangler to highlight new column generated by a directive

Behavior Changes

CDAP-16180: Resolved macro to preferences during pipeline validation

In previous releases, when you validated a plugin, macros were not being resolved with preferences.

In CDAP 6.4.0, when you validate a plugin, macros now get resolved with preferences.

PLUGIN-470: Removed Multi sink runtime argument requirements, allowing users to add simple transformations in multi-source/multi-sink pipelines.

In previous releases, multi-sink plugins required the pipeline to set a runtime argument for each table, containing the schema of each table.

In CDAP 6.4.0, CDAP determines the schema dynamically at runtime instead of requiring arguments to be set.

Bug Fixes

PLUGIN-610: Fixed Bigtable Batch Source plugin

PLUGIN-606: FTP batch source now works with empty File System Properties. See “Deprecations” below.

PLUGIN-545: Added support for strings in Min/Max aggregate functions (used in both Group By and Pivot plugins)

PLUGIN-539: Fixed Salesforce plugin to correctly parse the schema as Avro schema to make sure all the field names are accepted by Avro

PLUGIN-517: Fixed data pipeline with BigQuery sink that failed with INVALID_ARGUMENT exception if the range specified was a macro

PLUGIN-222: Fixed Kinesis Spark Streaming source

CDAP-17746: Fixed an issue in field validation logic in pipelines with BigQuery sink that caused a NullPointerException

CDAP-17744: Fixed Schema editor to show UI validations

CDAP-17737: Fixed Conditions plugins to work with Spark 3

CDAP-17732: Fixed the Wrangler Generate UUID directive to correctly generate a universally unique identifier (UUID) of the record

CDAP-17718: Fixed advanced joins to recognize auto broadcast setting

CDAP-17717: Fixed upgraded CDAP instances to include the arrow connecting to the Error Collector

CDAP-17713: Fixed Pipeline Studio UI to send null instead of string for blank plugin properties

CDAP-17703: Fixed Pipeline Studio to use current namespace when it fetches data pipeline drafts

CDAP-17691: Fixed SecureStore API to support SYSTEM namespace

CDAP-17683: Fixed million indicator on Replication Monitoring page

CDAP-17680: Fixed Replication statistics to display on the dashboard for SQL Server

CDAP-17678: Fixed an issue where clicking the Delete button on Replication Assessment page resulted in an error for the replication job

CDAP-17653: Removed the usage of authorization token while generating session token in nodejs proxy.

CDAP-17641: Schema name is now shown when selecting tables to replicate

CDAP-17635: Fixed Replication to correctly insert rows that were previously deleted by a replication job

CDAP-17630: Data pipelines running on Spark 3-enabled Dataproc clusters no longer fail with a class-not-found exception

CDAP-17617: Fixed Replication Overview page to display the label of the table status when you hover over the table status

CDAP-17598: Added ability to hover over metrics in the Pipeline Summary page

CDAP-17591: Fixed Wrangler completion percentage

CDAP-17584: Fixed Replication with a SQL Server source to generate rows correctly in BigQuery target table if snapshot failed and restarted

CDAP-17570: Fixed an issue where SQL Server replication job stopped processing data when the connection was reset by the SQL Server

CDAP-17568: Fixed the Replication wizard to close without error when you click the X icon to exit

CDAP-17495: Fixed an error in Replication wizard Step 3 "Select tables, columns and events to replicate" where selecting no columns for a table caused the wizard to fetch all columns in a table

CDAP-17491: Using a macro for a password in a replication job no longer results in an error

CDAP-17483: Fixed logical type display for data pipeline preview runs

CDAP-17476: Fixed the Dashboard API to return programs that are running but were started before the startTime

CDAP-17450: Fixed Replication job (when deployed) to show advanced configurations in UI

CDAP-17347: Fixed data pipeline with Python Evaluator transformation to run without stack trace errors

CDAP-17331: Suppressed verbose info logs from Debezium in Replication jobs

CDAP-17189: Added loading indicator while fetching logs in Log Viewer

CDAP-17028: Fixed Pipeline preview so logical start time function doesn’t display as a macro

CDAP-16804: Fixed fields with a drop-down list in the Replication wizard to default to “Select one”

CDAP-16726: Added message in Replication Assessment when there are tables that CDAP cannot access

CDAP-16609: Improved the error message shown when an invalid expression is added in Wrangler

CDAP-16316: Fixed the RENAME directive in Wrangler so it’s case-sensitive

CDAP-16233: Fixed the Pipeline Operations UI to stop showing the loading icon indefinitely when it receives an error from the backend

CDAP-15979: Fixed Wrangler to no longer generate invalid reference names

CDAP-15509: Fixed Wrangler to display logical types instead of java types

CDAP-15465: Fixed pipelines created from Wrangler to no longer generate incorrect output for XML files

CDAP-13907: Fixed an issue where a connection added in Wrangler hard-coded the name of the JDBC driver

CDAP-13281: Batch data pipelines with Spark 2.2 engine and HDFS sinks no longer fail with delegation token issue error

Known Issues

CDAP-17720: When you run a Replication job, if a source table has a column name that does not conform to BigQuery naming conventions, the job fails with an error similar to the following:

com.google.cloud.bigquery.BigQueryException: Invalid field name "SYS_NC00012$". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.

Note: In BigQuery, column names must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.

Workaround: Remove columns from the Replication job that do not conform to the BigQuery naming conventions.
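The naming convention quoted in the note above can be checked ahead of time. This small sketch is not an official BigQuery validator; the regular expression and the helper function name are illustrative, encoding only the rule as stated in the note:

```python
import re

# Rule as stated above: letters, numbers, and underscores only,
# starting with a letter or underscore, at most 128 characters total.
_BQ_COLUMN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]{0,127}$")

def invalid_bq_columns(columns):
    """Return the column names that do not conform to the convention."""
    return [c for c in columns if not _BQ_COLUMN.match(c)]

# The Oracle pseudo-column from the error message above fails the check.
print(invalid_bq_columns(["id", "updated_at", "SYS_NC00012$"]))  # → ['SYS_NC00012$']
```

Running a check like this against the source schema makes it easy to spot which columns must be removed from the Replication job before it is started.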

Deprecations

FTP Batch Source (System Plugin for Data Pipelines)

The FTP Batch Source plugin installed with CDAP is deprecated and will be removed in CDAP 7.0.0. This deprecation includes all versions of the FTP Batch Source prior to version 3.0.0. The supported version of the FTP Batch Source is version 3.0.0, which is available for download in the Hub.

FTP Batch Source version 3.0.0 is completely backward compatible, except that it uses a different artifact. This was done to ensure that updates to the plugin can be delivered out of band of CDAP releases, through the Hub.

It’s recommended that you use version 3.0.0 in your data pipelines.

cdap - CDAP 6.3.0

Published by yaojiefeng over 3 years ago

Summary

This release introduces a number of new features, improvements, and bug fixes to CDAP. The main highlights of the release are:

Replication

  • Added metrics for the amount of data processed and the error count from the replicator app.
  • Improved the Replication UI page for a better user experience.

New Features

CDAP-16835 - Added support for upgrading applications via REST API. Example usage is to upgrade all pipelines in a namespace to use the latest available artifacts.

CDAP-16836 - Added new options in CDAP CLI to take URI instead of host and port combination.

CDAP-16980 - New Log Viewer feature which enables users to see the most recent logs.

CDAP-17355 - Added Draft count metric and created Drafts API to manage drafts in the backend.

CDAP-17418 - Added support for replicating databases that have a "schema" concept, where a schema is a collection of database objects.

CDAP-17460 - Redesigned the Replication Detail page.

CDAP-17461 - Redesigned the Dashboard page into the Operations page.

Improvements

CDAP-16812 - Updated labels and descriptions for Service Account properties in the Dataproc provisioner.

CDAP-16815 - Added a records.updated metric in the BigQuery sink. This gives the total of all inserts, updates, and upserts into the sink.

CDAP-16918 - Introduced a new REST API for getting all application details across all namespaces.

CDAP-16929 - Added the ability to select a Custom Dataproc Image. The complete URI for the custom image should be specified.

CDAP-17015 - Updated Preview to show the number of Preview runs pending before current run (if there are any runs pending). The number of pending runs is shown under the timer in the UI.

CDAP-17065 - Disabled Spark YARN app retries since Spark already performs retries at a task level.

CDAP-17077 - Changed the auto-caching strategy in Spark pipelines to default to disk-only caching instead of in-memory caching, due to common out-of-memory failures. Also changed the caching strategy to cache only at places that prevent sources from being recomputed, instead of the more aggressive caching done previously.

CDAP-17078 - Added a setting to consolidate multiple pipeline branches into single operations in Spark pipelines. This can improve pipeline performance by avoiding recomputation. It can be turned on by setting a preference or runtime argument 'spark.cdap.pipeline.consolidate.stages' to 'true'.
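As a preference or runtime argument, this is a single key/value pair:

```
spark.cdap.pipeline.consolidate.stages=true
```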

CDAP-17095 - Added Distribution to AutoJoiner API to increase performance for skewed joins.

CDAP-17123 - Made "records.updated" metric available for GCS Batch Sink plugin.

CDAP-17130 - Added Joiner Distribution support to MapReduce and streaming pipelines.

CDAP-17179 - Added new properties Filesystem properties and Output File Prefix for GCS Sink.

CDAP-17182 - Enabled traffic compression in the Runtime service.

CDAP-17198 - Added the Runtime service to the system service statuses.

CDAP-17202 - Improved commit performance for sinks.

CDAP-17249 - Added documentation about Regex Path Filter property to File and GCS sources.

CDAP-17389 - Added options for master and worker disk type and fixed the Dataproc provisioner to use the configured disk settings for secondary workers on autoscale clusters.

CDAP-17425 - Exposed the number of preview records requested to source plugins.

CDAP-17428 - Changed pipeline stage consolidation to be enabled by default. This improves the performance of certain types of pipelines.

CDAP-17439 - Added support for Hadoop 3 and Spark 3 for program execution.

CDAP-17462 - Delta source developers no longer need to populate previous rows in an update event if the delta source supports row_id, a unique identifier for a row.

CDAP-17484 - The Replication Assessment page now displays an error when a user selects two source tables with the same name to replicate, which is not supported.

PLUGIN-282 - Added new Data Cacher plugin to allow users to manually cache data at certain points in a pipeline.

PLUGIN-303 - Added Distribution settings to Joiner plugin for increased performance in skewed joins.

Bug Fixes

CDAP-16797 - The CDAP UI now validates Pipeline Alerts before they are added in Pipeline Studio.

CDAP-16816 - Fixed schedule properties to overwrite preferences set on the application instead of the other way around. This most visibly fixed a bug where the compute profile set on a pipeline schedule or trigger would get overwritten by the profile for the pipeline.

CDAP-16824 - Fixed UI to show plugin properties for plugins that don't have a plugin widget.

CDAP-16845 - Fixed a bug that started running Preview for pipelines with post-run actions even if the user chose the option to not run Preview.

CDAP-16870 - Fixed PySpark support to work with Spark 2.1.3+.

CDAP-16879 - For BigQuery sinks, if both Truncate Table and Update Table Schema are set to True, when you run the pipeline, only Truncate Table will be applied. Update Table Schema will be ignored.

CDAP-16880 - Removed schema validation from BQ sink when 'Truncate Table' option is set to True.

CDAP-16891 - Unsupported pipelines in drafts are now upgraded when users open them.

CDAP-16898 - Fixed a bug that did not fetch Preview data when the plugin label had spaces in it.

CDAP-16950 - All ERROR-level logs logged under the application logging context are now included.

CDAP-16959 - Fixed an issue in Preview with runtime arguments re-rendering and losing focus when containing macros.

CDAP-16972 - Fixed an issue where Preview config would open when trying to stop a Preview.

CDAP-16975 - If there are multiple versions of a plugin, the latest version is now the default and is the version that gets added to pipelines. If the user has already chosen a specific version (older version), it defaults to that instead of the latest.

CDAP-16976 - UI resets the default version of plugins for specific users during upgrade. When users upgrade from 6.1.2 to 6.1.3 or later, UI will reset the default version of the plugin the user has already chosen. Post upgrade, if the user uses the same plugin, UI will choose the latest version of the same plugin.

CDAP-16993 - Fixed a bug in Preview for fields that have non-string types such as bytes.

CDAP-17000 - Changed default value of spark.network.timeout to 10 minutes to make pipeline execution more stable for shuffle heavy pipelines.

CDAP-17029 - Fixed an issue that caused an extra empty row to appear when sampling GCS text files in Wrangler.

CDAP-17043 - Fixed the dropdown menu for Wrangler tabs; previously, the dropdown overlapped with other UI elements, hindering use of the UI.

CDAP-17044 - Column names are now validated for the BigQuery sink.

CDAP-17045 - Fixed a bug so that large pipelines with "-" in the name properly overflow in the UI.

CDAP-17057 - Fixed a bug that did not allow a user to make further changes to preferences when saving preferences returned an error.

CDAP-17059 - Added a check to fail pipeline deployment if there is an action in the middle of the pipeline.

CDAP-17074 - Improved state transitions for starting pipelines in app fabric to increase stability if app fabric unexpectedly restarts.

CDAP-17097 - Fixed a bug that caused Splitter transforms to be unable to fetch their output ports and schemas.

CDAP-17117 - Fixed styling bug so header of Preview tab does not scroll with table.

CDAP-17121 - Fixed a bug where a Preview run fails on null values due to a JSON encoder NullPointerException.

CDAP-17133 - Fixed tab styles for users on Mac with system preferences set to show scrollbars always in Chrome.

CDAP-17135 - Fixed a race condition in stopping Spark program in Standalone CDAP that can cause stop to hang.

CDAP-17137 - Fixed a bug that showed the preview pipeline stopping in the UI even when the call to stop the pipeline returned an error.

CDAP-17138 - Fixed a bug that caused an empty error banner to appear when the user stops Preview.

CDAP-17139 - Fixed styling of Preview tab so that side by side tables and record tables are aligned.

CDAP-17140 - Fixed a bug so error banner for deploy failure shows failure details from backend status message, if they exist.

CDAP-17141 - Fixed a bug that allowed a user to make unsaved config changes, by disabling the Pipeline Config button in Preview mode while a run is in progress.

CDAP-17145 - Modified preview timer logic to use submitTime instead of pipeline run startTime, to take into account time spent in INIT and WAITING states.

CDAP-17161 - Reduced the memory footprint of program execution monitoring.

CDAP-17166 - Fixed a bug that caused the setting for the number of executors in streaming pipelines to be ignored.

CDAP-17171 - Fixed horizontal tab styling to handle mac system setting "scrolling always on" in chrome.

CDAP-17172 - Fixed a bug that showed banner about stopping pipeline when a pipeline was deployed after running Preview.

CDAP-17174 - Fixed a bug that prevented the user from stopping Preview if the pipeline run had already completed.

CDAP-17213 - Spark configuration is now picked up correctly from the remote Hadoop cluster for program execution.

CDAP-17217 - Fixed overflow styling for long text in preview tables.

CDAP-17224 - Fixed an issue where the Dashboard page showed the graph as full when there were no runs during the selected time period.

CDAP-17225 - Fixed a bug that caused pipeline deployment to fail if the pipeline contained comments.

CDAP-17233 - Improved Wrangler error messages for incorrect syntax and errors in Wrangler command line.

CDAP-17237 - Fixed a bug where the cluster's default Hadoop settings were not being used in pipelines.

CDAP-17239 - Fixed a bug in StandaloneMain which prematurely deletes the Authorizer classpath directories.

CDAP-17243 - Analytics and Rules Engine are now hidden by default in the UI.

CDAP-17246 - Pipelines exported from CDAP 6.1.x can now be imported without changing plugin names in the pipeline. This prevents pipelines from failing during preview or deployment when imported from a 6.1.x version of CDAP into a 6.2.x+ version.

CDAP-17268 - Fixed a bug in the schema editor to handle the default key type (string) in a map type when the existing schema doesn't specify a key type.

CDAP-17323 - Fixed a bug in the Existing Dataproc provisioner that caused it to check for network unnecessarily.

CDAP-17379 - MySQL Sources for Replication now require MySQL JDBC driver 8 and above.

CDAP-17386 - Fixed a bug in Replication where MySQL source failed with NullPointerException when the data is null and a logical type.

CDAP-17408 - Fixed a bug that caused the number of partitions set by aggregator plugins to be ignored.

CDAP-17473 - Fixed a bug preventing macro values for Project ID and Path in GCS source plugin.

CDAP-17557 - Fixed an issue where the SQL Server Replicator generated duplicate events when it restarted.

PLUGIN-202 - Improved validations on GCS plugins to check for permissions on buckets, and improved error messages for users unable to access a GCS bucket.

PLUGIN-206 - BigQuery service API fixed a region error message discrepancy on their end.

PLUGIN-245 - Fixed BigQuery sink with macro table key validation.

PLUGIN-367 - Fixed a bug where blob file input formats were being split up in Hadoop jobs.

PLUGIN-370 - Improved some Cassandra validations.

PLUGIN-372 - Fixed user experience issue where Bigtable sink and source plugins may fail deployment if they are unable to connect to the Bigtable service.

PLUGIN-386 - Added support for BigQuery Views and Materialized Views to Wrangler.

PLUGIN-388 - Fixed output schema validation for GCS sinks with Format set to parquet.

cdap - CDAP 6.2.3

Published by CuriousVini almost 4 years ago

Summary

This release contains critical bugfixes to the Dataproc provisioner in CDAP.

Bug Fixes

  • Fixed a bug in the Existing Dataproc provisioner that caused it to check for network unnecessarily (CDAP-17323)
  • The Dataproc provisioner now accepts the default value of the property gcp-dataproc.serviceAccount from cdap-site.xml. This property configures which service account a Dataproc cluster should use when running the pipeline. (CDAP-17326)