smart-data-lake

Smart Automation Tool for building modern Data Lakes and Data Pipelines

smart-data-lake - Release 2.6.0

Published by zzeekk 7 months ago

Version updates and dependencies

  • update to spark 3.4 (#659)
  • update delta, iceberg and snowpark for Spark 3.4
  • removed critical/high/medium vulnerabilities and added missing dependencies
  • update plugin versions
  • upgrade github actions to node16 and jdk17 (#721, #759)

Features

Improve metrics per DataObject (#529)

  • SubFeed is enriched with metrics
  • spark metrics are collected synchronously
  • read specific metrics for jdbc, delta, iceberg and snowflake
  • standardize naming for inserted/updated/deleted rows
  • add metrics to SubFeedsExpressionData so it can be used in executionCondition

new DataObjectSchemaExporter

  • export schema and statistics (table, column) for UI

improve ConfigJsonExporter

  • export scalaDoc of transformers to display in UI (#614)

improve SparkDfsTransformer (used by CustomDataFrameAction)

  • implement dynamic custom transform function (#711). This allows defining custom transform methods with dynamic parameter mapping at evaluation time, e.g. transform(session: SparkSession, dfDataObject1: DataFrame, dfDataObject2: DataFrame, isExec: Boolean, nbOfDts: Option[Int]): DataFrame; see the sketch below
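
A minimal sketch of such a dynamic transform method (the class name, join column and wiring are made up for illustration; in SDL the method lives in a custom transformer class referenced by a CustomDataFrameAction):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical transformer: SDL binds DataFrame parameters to input DataObjects
// by name and fills the remaining parameters (e.g. isExec, nbOfDts) dynamically
// at evaluation time.
class DynamicJoinTransformer {
  def transform(session: SparkSession, dfDataObject1: DataFrame, dfDataObject2: DataFrame,
                isExec: Boolean, nbOfDts: Option[Int]): DataFrame = {
    val joined = dfDataObject1.join(dfDataObject2, Seq("id")) // assumes a common "id" column
    nbOfDts.map(n => joined.limit(n)).getOrElse(joined)
  }
}
```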

new state migrator (#764)

  • Allow reading old state file formats seamlessly
  • Command line tool to migrate all state files to the current format.

Scala 2.13

  • Support and create additional artifacts for Scala 2.13 (#707)

Improvements

  • add appVersion to state file
  • state file format changed: result is now a list of SubFeeds (instead of SubFeedResult), and mainMetrics is now part of the SubFeed entry, named metrics.
  • ActionPipelineContext.currentAction added
  • improve documentation for WebserviceFileDataObject and FileTransferAction
  • allow implementing custom HousekeepingModes
  • implement hadoopFileStateStoreIndexAppend (#724)
  • add option to replace non-standard SQL characters with _ in column names (#728)
  • ConvertNullValuesTransformer: a generic transformer for null handling (#755)
  • EncryptColumnsTransformer: change algorithm to GCM/NoPadding, as padding is unnecessary with GCM (#778); see the sketch after this list
  • KafkaTopicDataObject: look for groupIdPrefix in dataObject and connection options
  • ConfigJsonExporter: export column description from markdown files
  • ConfigJsonExporter: use hadoop to read description files
  • SnowflakeTableDataObject: forceGenericObservation = true
  • SnowflakeTableDataObject: add configuration for Spark options (#762)
  • AccessTableDataObject: get rid of ucanaccess JDBC Driver and use jackcess Library directly
  • JdbcTableDataObject: implement directTableOverwrite (#747)
  • DeltaLakeTableDataObject: allow dynamic partitioning (#799)
  • HadoopFileActionDAGRunStateStore: use json lines as index file format, instead of a special separator line
  • HadoopFileActionDAGRunStateStore index format: replace actionsState summary with detailed list of actions and dataObjects
  • DfTransformerWrapperDfsTransformer: apply to all subfeeds by default
  • implement generic DebugTransformer
  • implement SparkFlattenDfTransformer (#789)
  • reset skipped Input-SubFeeds when it is decided to execute an Action (#770)
  • improve detecting Databricks environment
  • Include results of failed tasks in state file (#615)
  • CI Tests: avoid running tests twice (regular + scoverage plugin)
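
For the EncryptColumnsTransformer item above, a minimal sketch of AES in GCM/NoPadding mode using the standard JCA API; key handling and IV transport are simplified here and do not reflect SDL's actual implementation:

```scala
import java.security.SecureRandom
import javax.crypto.Cipher
import javax.crypto.spec.{GCMParameterSpec, SecretKeySpec}

object GcmSketch {
  // GCM is an authenticated, stream-like mode, so no block padding is needed.
  def encrypt(plaintext: Array[Byte], key: Array[Byte]): (Array[Byte], Array[Byte]) = {
    val iv = new Array[Byte](12) // 96-bit IV, the recommended size for GCM
    new SecureRandom().nextBytes(iv)
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, iv))
    (iv, cipher.doFinal(plaintext)) // the IV must be kept alongside the ciphertext
  }
}
```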

Bugfixes

  • fix timing issue in ActionDAGKafkaTest
  • fix setting Spark case sensitive resolution only locally
  • fix "key not found" error in PartitionValues.sort
  • fix sporadic stackoverflow in schema export
  • ExcelFileDataObject: update documentation with all schema types (#727)
  • IcebergTableDataObject: fix semantics of GenericSchema.equalsSchema method
  • JdbcTableDataObject: use same method for getting existing table schema
  • JdbcTableConnection: rollback transaction on exception in execWithJdbcStatement (#748)
  • DeltaLakeTableDataObject: fix using catalog when checking if table exists (#802)
  • ConfigJsonExporter: fix handling multi-line column descriptions
  • exclude ProxyAction from schema export as it should not be instantiated by configuration
  • don't set DeltaLake extension and catalog properties when on Databricks UC
  • make consistent use of Environment.fileSystemFactory

smart-data-lake - Release 2.5.2

Published by zzeekk 12 months ago

Hotfix Release with the following changes:

Improvements

  • Allowed custom HousekeepingModes (#739)

smart-data-lake - Release 2.5.1

Published by pgruetter about 1 year ago

Features

  • #510 Documentation: search bar
  • #656 Modernize Schema Viewer
  • #700 Upload final state to API

Improvements

  • #671 State file: add startTstmp and duration for Init and Prepare phases
  • #672 State file: add inputIds and outputIds
  • #680 Include SDLB version information in state file
  • improvements in LabSparkDataObject
  • improvements in PartitionDiffMode
  • improvements in RawFileDataObject
  • various internal improvements (naming, wording, log messages)

Bugfixes

  • #644 SchemaViewer is freezing when loading definitions
  • #687 Skipped not handled correctly if multiple Actions write to the same DataObject
  • #689 DeduplicateAction: problems with schema evolution if mergeModeEnable=true and updateCapturedColumnOnlyWhenChanged=true
  • #708 Recovery not triggered

Dependencies

dbutils-api: Update from 0.0.5 to 0.0.6

smart-data-lake - Release 2.5.0

Published by pgruetter over 1 year ago

Major features

  • Upgrade to Spark 3.3
  • SDL Agents
  • Support for Apache Iceberg
  • Integration with Unity Catalog

Features

  • #541
  • #549
  • #571
  • #582
  • #619
  • #621
  • #625
  • #635
  • #652
  • SmartDataLakeBuilderLab to use DataObjects more interactively in Notebooks
  • many-to-many transformations in Python

Improvements

  • Switch to log4j2 yaml format
  • New variable failSimulationOnMissingInputSubFeeds to configure if runs should fail when input subfeeds are missing
  • Expectation improvements (SQLQueryExpectation)
  • Improvements on JDBC transaction handling
  • Improvements on Schema Viewer
  • Proxy Support for SftpFileRefConnections
  • FileTransferAction: Support for multiple file transfers in parallel
  • Global Config: allowAsRecursiveInput - allow exceptions on specific DataObjects
  • Improved Xsd and JsonSchema support
  • Improved Metric writing to Azure LogAnalytics
  • Improved support for AWS Glue

Bugfixes

  • #599
  • #627
  • #633
  • #653
  • Various smaller bugfixes and improved error handling

Dependencies

Spark: Update from Spark 3.2 to 3.3
Delta Lake: Update from 2.0 to 2.2

smart-data-lake - Release 2.4.2

Published by ddeuber over 1 year ago

Bugfixes and improvements:

  • Fix writing to Oracle databases when temporary tables are involved (#633)
  • When saveMode=Overwrite for JdbcTableDataObject, allow writing to the database table even if the column order in the dataframe is different (#633)
  • Add parameters to JdbcTableConnection in order to configure the commit behaviour in JDBC connections (#633)

Note: this release was created as a hotfix release on top of version 2.4.1, as the develop-spark3 branch is already on 2.5.0-SNAPSHOT.

smart-data-lake - Release 2.4.1

Published by zzeekk over 1 year ago

Bugfixes and improvements:

  • Increase spark-extensions version to 3.2.5 (#627): Remove restrictive avro schema equality test
  • Do not write schema file in simulations (#627)
  • Do not throw exception when there is no path for sample file in CustomFileAction (#627)

Note: this release was created as a hotfix release on top of version 2.4.0, as the develop-spark3 branch is already on 2.5.0-SNAPSHOT.

smart-data-lake - Release 2.4.0

Published by pgruetter almost 2 years ago

Bugfixes and improvements

#518 Schema Viewer shows wrong information
#580 Can't use same ExcelFileDataObject for write and read
#600 Schema viewer does not indicate whether a field is required
#601 Loading Schema from file should be done lazily
Leading underscores are preserved when normalizing column names
ExecutionMode and executionCondition are only applied in exec phase

Features

#591 Column encryption
#610 Support DataObjectStateIncrementalMode for KafkaTopicDataObject

Dependencies

Bump commons-net from 3.1 to 3.9.0

smart-data-lake - Release 2.3.2

Published by pgruetter almost 2 years ago

Bugfixes

  • #593
  • NotSerializableException with RelaxedCsvFileDataObject

Improvements

  • #577

Dependency Updates

  • commons-text

smart-data-lake - Release 2.3.1

Published by pgruetter almost 2 years ago

This is mainly a bugfix release, see:
#583
#584
#578
#579

One new Feature:
#575

smart-data-lake - Release 2.3.0

Published by zzeekk about 2 years ago

Version upgrades

  • Spark 3.2.1 -> 3.2.2
  • Delta Lake 1.1.0 -> 2.0.0

New Features

  • GenericDataFrame implementation to create transformations that run with Spark and Snowpark/Snowflake (#376)
  • Constraints and Expectations (#43, #377, #388), see also http://smartdatalake.ch/docs/reference/dataQuality#constraints
  • Historize with incremental CDC mode (#407), see also http://smartdatalake.ch/blog/sdl-hist
  • Spark file dataobject incremental mode (#517)
  • Spark Dataset transformations using ScalaClassSparkDsTransformer (#489)
  • DataObject schemas from caseClass, jsonSchema, xsdFile and avroSchemaFile (#512)
  • Methods to provide schema in init-phase (#522)
  • Support for json-schema with confluent schema registry (#538)
  • JDBC overall transaction (#254)
  • FinalStateWriter to store state once a job is finished

Minor Bugfixes and improvements

  • Improve parsing xsd schema
  • Improve Housekeeping
  • Implement ColNamesLowercaseTransformer and remove converting columns to lowercase
  • HiveConnection pathPrefix optional, FileIncrementalMoveMode absolute archivePath
  • Cleanup partition directories after failure in SparkFileDataObject
  • Fix schema versioning
  • Fix: treat Airbyte supportsIncremental as optional
  • Fix naming of input views when chaining SQL transformations
  • Fix transformer dataframe output mapping and input partitionvalues
  • Fix calling move/compactPartition only if list is not empty

Full Changelog: https://github.com/smart-data-lake/smart-data-lake/compare/2.2.1...2.3.0

smart-data-lake - Release 2.2.1

Published by zzeekk over 2 years ago

Version upgrades

  • update Spark version 3.2.0 -> 3.2.1

New Features

  • StatusInfo REST-Server (#450)
  • Websocket for live status (#450)
  • DagExporter command line tool to export a basic DAG selected by a feed selector

Minor Bugfixes and improvements

  • add maven profile to create fat-jar for Spark 3.1 (#465)
  • fix spark 3.1 json4s compatibility
  • fix reading state file from previous versions
  • update spark-extensions: fix execution on Databricks
  • fix and refine validatePartitionValuesExisting
  • move sparkSession from object Environment to GlobalConfig to support running multiple SDLB jobs on the same JVM (e.g. Databricks cluster)
  • fix Airbyte parser issue (#483)
  • update spark-excel and poi dependency because of vulnerability (#485)

smart-data-lake - Release 2.2.0

Published by pgruetter over 2 years ago

Version upgrades

Update to Spark 3.2 (#406)
Update delta lake to version 1.1 (#406)

  • don't use the DeltaLake Table API because of strange errors
  • Delta Lake version 1.1 needs Spark 3.2
  • Update scala-maven-plugin to support Scala 2.12.14+

New Features

Implement CustomSnowparkAction (rudimentary Snowpark support, #376)
Implement script support and CustomScriptAction (#422)
Implement AirbyteDataObject (#365)
Implement basic ScalaNotebookDfTransformer (#401)
Implement SDL json schema creator (#440)
Add Atlas metadata exporter implementation

Minor Bugfixes and improvements

Extend StateListener.notifyState with a parameter indicating the changed Action
Adapted StateChangeLogger to log only for the action for which the notification was emitted
Refactor Actions SubFeed handling
Refactor integrating SparkSession into ActionPipelineContext and usage of implicit parameters
Add SASL Authentication for Kafka
Avoid losing the full error response text from webservice calls
Improve build stability by using linesIterator; on some environments Java's String.lines takes precedence over Scala's StringLike.lines, which causes compile problems (see the sketch after this list)
Use json4s instead of hocon/configs to write json-state-files
Allow using custom class loader in order to find classes defined or loaded from notebooks (polynote) when parsing configuration
Extend ScalaJWebserviceClient so it can be re-used in getting-started
Force SaveMode.Overwrite for DeduplicateAction and HistorizeAction if mergeModeEnable=false
Make runtime info public (#454)
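
A small sketch of the linesIterator issue mentioned above (assuming Scala 2.12 on JDK 11+):

```scala
object LinesExample extends App {
  val text = "a\nb\nc"
  // On JDK 11+, text.lines resolves to java.lang.String.lines(), which returns a
  // java.util.stream.Stream[String] instead of Scala's Iterator[String].
  val it: Iterator[String] = text.linesIterator // unambiguous: always the Scala iterator
  it.foreach(println)
}
```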

smart-data-lake - Release 1.3.1

Published by zzeekk almost 3 years ago

Improved Delta Lake support

  • improve comparing schema ignoring nullability
  • added support for evolving schema when working with DeltaLakeTableDataObject with SDLSaveMode.Append
  • handle missing delta table, _delta_log and missing hadoop path

Data Objects extensions

  • implement DataObjects with state (#365)
  • implement reading partitioned xml-data
  • implement Jdbc table creation and schema evolution

Streaming improvements

  • don't increment runId when all actions are skipped in streaming mode
  • fix ActionDAGRunState.isSkipped for mixed scenarios (async and sync actions)
  • make execActionDAG tail-recursive to avoid stack overflow for long-running streaming jobs (see the sketch below)
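
A simplified illustration of the tail-recursion fix (not SDL's actual code; execActionDAG's real signature differs):

```scala
import scala.annotation.tailrec

object StreamingLoopSketch {
  // One recursive call per streaming micro-batch: @tailrec makes the compiler
  // verify the call is in tail position, so it compiles to a loop and the stack
  // stays constant no matter how long the streaming job runs.
  @tailrec
  def execLoop(runId: Int, stop: Int => Boolean): Int = {
    // ... execute one DAG iteration for runId here ...
    if (stop(runId)) runId
    else execLoop(runId + 1, stop)
  }
}
```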

New SDLSaveMode.merge to do upsert statements (see the sketch after this list)

  • implement save mode merge for JdbcTableDataObject and DeltaLakeTableDataObject
  • implement merge mode for CopyAction
  • implement merge mode for DeduplicateAction (#235)
  • implement merge mode for HistorizeAction (#235)
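
Conceptually, save mode merge issues an upsert; a simplified Delta Lake sketch (SDL derives the join condition from the table's primary key, hard-coded here as "id"):

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.{DataFrame, SparkSession}

object MergeSketch {
  def upsert(spark: SparkSession, dfNew: DataFrame): Unit = {
    DeltaTable.forName(spark, "mydb.mytable").as("t")
      .merge(dfNew.as("s"), "t.id = s.id")  // match on the primary key
      .whenMatched().updateAll()            // update rows for existing keys
      .whenNotMatched().insertAll()         // insert rows for new keys
      .execute()
  }
}
```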

New sdl-azure module

  • add Azure libraries, AzureADClientGrantAuthMode
  • introduced state change logger, which submits save events to azure log monitoring
  • support for azure key vault secret provider

Small bugfixes & improvements

  • support more type conversions in schema evolution
  • if possible use schemaMin to create an empty DataFrame if the table for recursive input doesn't exist yet (see the sketch after this list).
  • Prevent file names starting with . in WebserviceFileDataObject (crc files still have original name though)
  • Remove special chars from fileRefs generated by WebserviceFileDataObject (#395)
  • throw exception if config entry for connections, dataObjects or actions is not of type object (#396)
  • fix evaluating to_date and other ReplaceableExpressions with ExpressionEvaluator
  • cleanup kafka dependency from deltalake pom.xml
  • remove wrong error message about missing executionId in SparkStageMetricsListener
  • fix reading data frame from skipped SubFeed if filters are ignored
  • fix parsing event info if appName contains special characters
  • add a transformer to repartition dataframe
  • made SmartDataLakeLogger public
  • Simplify final exception for better usability of log: truncate stacktrace starting from "monix.*" entries, limit logical plan in AnalysisException to 5 lines
  • Simplify logging of TaskFailedException
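
The empty-DataFrame fallback from the list above, sketched with plain Spark (the schemaMin value here is a made-up example standing in for a DataObject's configured minimal schema):

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object EmptyFromSchemaMin {
  // When the recursive input table does not exist yet, an empty DataFrame with
  // the configured minimal schema is used instead of failing.
  def apply(spark: SparkSession): DataFrame = {
    val schemaMin = StructType(Seq(StructField("id", StringType), StructField("value", StringType)))
    spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schemaMin)
  }
}
```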

Cleanup

  • Cleanup deprecated PartitionDiffMode.stopIfNoData

smart-data-lake - Release 2.1.1

Published by zzeekk almost 3 years ago

Improved Delta Lake support

  • update DeltaLakeTableDataObject to use the table API (#375)
  • improve comparing schema ignoring nullability
  • added support for evolving schema when working with DeltaLakeTableDataObject with SDLSaveMode.Append
  • handle missing delta table, _delta_log and missing hadoop path

Data Objects extensions

  • implement DataObjects with state (#365)
  • implement reading partitioned xml-data
  • implement Jdbc table creation and schema evolution
  • new RelaxedCsvFileDataObject and ZipCsvCodec for compression

Streaming improvements

  • don't increment runId when all actions are skipped in streaming mode
  • fix ActionDAGRunState.isSkipped for mixed scenarios (async and sync actions)
  • make execActionDAG tail-recursive to avoid stack overflow for long-running streaming jobs

New SDLSaveMode.merge to do upsert statements

  • implement save mode merge for JdbcTableDataObject and DeltaLakeTableDataObject
  • implement merge mode for CopyAction
  • implement merge mode for DeduplicateAction (#235)
  • implement merge mode for HistorizeAction (#235)

New sdl-azure module

  • add Azure libraries, AzureADClientGrantAuthMode
  • introduced state change logger, which submits save events to azure log monitoring
  • support for azure key vault secret provider

Small bugfixes & improvements

  • support more type conversions in schema evolution
  • if possible use schemaMin to create empty DataFrame if table for recursive input doesn't exist yet.
  • Prevent file names starting with . in WebserviceFileDataObject (crc files still have original name though)
  • Remove special chars from fileRefs generated by WebserviceFileDataObject (#395)
  • throw exception if config entry for connections, dataObjects or actions is not of type object (#396)
  • fix evaluating to_date and other ReplaceableExpressions with ExpressionEvaluator
  • cleanup kafka dependency from deltalake pom.xml
  • remove wrong error message about missing executionId in SparkStageMetricsListener
  • fix reading data frame from skipped SubFeed if filters are ignored
  • fix parsing event info if appName contains special characters
  • add a transformer to repartition dataframe
  • made SmartDataLakeLogger public
  • Simplify final exception for better usability of log: truncate stacktrace starting from "monix.*" entries, limit logical plan in AnalysisException to 5 lines
  • Simplify logging of TaskFailedException

Cleanup

  • Cleanup deprecated PartitionDiffMode.stopIfNoData

smart-data-lake - 1.3.0

Published by pgruetter about 3 years ago

New features

  • New Transformer API #344
  • Define retention period for date partitioned data object #211
  • Extending Syntax to define actions to execute #366
  • Databricks 8.X compatibility (only for version 2.1.0 really as Databricks Runtime 8 uses Spark 3.x) #355

Bugfixes

  • Recovery not working for skipped and failed predecessors #356
  • JmsDataObject already processing data in init phase #357
  • spark-tags declared in sdl-deltalake to correctly resolve dependencies
  • version bump of libraries (dependabot)

smart-data-lake - 2.1.0

Published by pgruetter about 3 years ago

New features

  • New Transformer API #344
  • Define retention period for date partitioned data object #211
  • Extending Syntax to define actions to execute #366
  • Databricks 8.X compatibility #355

Bugfixes

  • Recovery not working for skipped and failed predecessors #356
  • JmsDataObject already processing data in init phase #357
  • spark-tags declared in sdl-deltalake to correctly resolve dependencies
  • version bump of libraries (dependabot)

smart-data-lake - 1.2.5

Published by pgruetter over 3 years ago

New features

  • enable fail-on-superfluous-config-keys (#27)
  • implement support to register python UDFs for Spark SQL transformations
  • add org.apache.spark:hadoop-cloud s3a optimizations (#208)

Minor improvements

  • optimize reading multiple partitions in SparkFileDataObject
  • validate partition columns existing on write and read
  • implement including current date partition in KafkaTopicDataObject.listPartitions
  • use options to customize kafka consumer creation

Bugfixes

  • fix recovery re-executing skipped actions (#349)
  • fix SparkUDFCreator not serializable
  • avoid PythonAccumulatorV2 "java.net.ConnectException: Connection refused: connect"
  • fix: allow overwriting a Hive table with a different schema
  • partly fix the schema/database naming mess in the Snowflake module
  • performance fix for ConfluentAvroDataToCatalyst conversion
  • fix listing partitions on relative path
  • JmsDataObject should not process data in init phase (#357)

smart-data-lake - 2.0.4

Published by pgruetter over 3 years ago

New features

  • enable fail-on-superfluous-config-keys (#27)
  • implement support to register python UDFs for Spark SQL transformations
  • add org.apache.spark:hadoop-cloud s3a optimizations (#208)

Minor improvements

  • optimize reading multiple partitions in SparkFileDataObject
  • validate partition columns existing on write and read
  • update to spark 3.1.1
  • update delta lake version
  • implement including current date partition in KafkaTopicDataObject.listPartitions
  • use options to customize kafka consumer creation

Bugfixes

  • fix recovery re-executing skipped actions (#349)
  • fix SparkUDFCreator not serializable
  • avoid PythonAccumulatorV2 "java.net.ConnectException: Connection refused: connect"
  • fix: allow overwriting a Hive table with a different schema
  • partly fix the schema/database naming mess in the Snowflake module
  • performance fix for ConfluentAvroDataToCatalyst conversion
  • fix listing partitions on relative path

smart-data-lake - 1.2.4

Published by pgruetter over 3 years ago

New Features

  • Add Snowflake Module
  • add ProcessAllMode
  • implement SDLSaveMode.OverwriteOptimized (#313)
  • implement SDLSaveMode.OverwritePreserveDirectories (#292)
  • implement SDLPlugin (#293)
  • implement FileIncrementalMoveMode (#314)
  • implement Action.executionCondition (#314)
  • refactor WebserviceFileDataObject AuthModes
  • dynamically add run_id partition value if needed (#289)
  • enhance RawFileDataObject to read/write custom Spark data formats

Minor improvements

  • ignore hidden config files and directories (#286)
  • implement SparkCustomAction.inputIdsToIgnoreFilter (#129)
  • improve configuration checks (#213)
  • refactor evaluating execution mode only in init-phase, and reusing results in exec-phase (#213)
  • implement SparkIncrementalMode.stopIfNoData (#203)
  • Added Put method to webservice
  • move deltalake connectivity to separate module sdl-deltalake (#288)
  • dynamically add run_id partition value if needed (#289)
  • implemented SDLSaveMode.OverwritePreserveDirectories (#292)
  • change partition column _run_id to run_id
  • Rename ActionObjectId as ActionId (#283)
  • allow PartitionDiffMode to add values for additional partition columns through transformPartitionValues (#303)
  • refine exclusion of log4j files when searching configuration files (#286)
  • extend renaming of files (#316)
  • add sorting direction to sortWithinPartition
  • ignore numInitialHdfsPartitions if Spark 3.0 AQE is enabled
  • log schema in schema evolution only if changed or severity=debug
  • implement more flexible main input selection (#314)
  • refactor SecretsUtil to allow for custom SecretProvider configured by GlobalConfig
  • validate schema on write for SparkFileDataObject, JdbcTableDataObject & HiveTableDataObject (#325)
  • refactor case sensitivity for JDBC db/table names (#221)
  • implement jdbc connection pool (#329)
  • rename thread for logging (#200)
  • improve output of TaskFailedException (#200)

Bugfixes

  • remove permission check as it doesn't work on all hadoop filesystems
  • allow appName in state filename to contain underscore
  • avoid repo.spring.io/plugins-release
  • update spark-extensions version (fix kafka schema evolution on read)
  • fix validate metricsFailCondition
  • cleanup usage of breakLineage when changing partitionValues or filter
  • don't skip actions in dry-run (#226)

smart-data-lake - 2.0.3

Published by pgruetter over 3 years ago

New Features

  • Add Snowflake Module
  • add ProcessAllMode
  • implement SDLSaveMode.OverwriteOptimized (#313)
  • implement SDLSaveMode.OverwritePreserveDirectories (#292)
  • implement SDLPlugin (#293)
  • implement FileIncrementalMoveMode (#314)
  • implement Action.executionCondition (#314)
  • refactor WebserviceFileDataObject AuthModes
  • dynamically add run_id partition value if needed (#289)
  • enhance RawFileDataObject to read/write custom Spark data formats

Minor improvements

  • ignore hidden config files and directories (#286)
  • implement SparkCustomAction.inputIdsToIgnoreFilter (#129)
  • improve configuration checks (#213)
  • refactor evaluating execution mode only in init-phase, and reusing results in exec-phase (#213)
  • implement SparkIncrementalMode.stopIfNoData (#203)
  • Added Put method to webservice
  • move deltalake connectivity to separate module sdl-deltalake (#288)
  • dynamically add run_id partition value if needed (#289)
  • implemented SDLSaveMode.OverwritePreserveDirectories (#292)
  • change partition column _run_id to run_id
  • Rename ActionObjectId as ActionId (#283)
  • allow PartitionDiffMode to add values for additional partition columns through transformPartitionValues (#303)
  • refine exclusion of log4j files when searching configuration files (#286)
  • extend renaming of files (#316)
  • add sorting direction to sortWithinPartition
  • ignore numInitialHdfsPartitions if Spark 3.0 AQE is enabled
  • log schema in schema evolution only if changed or severity=debug
  • implement more flexible main input selection (#314)
  • refactor SecretsUtil to allow for custom SecretProvider configured by GlobalConfig
  • validate schema on write for SparkFileDataObject, JdbcTableDataObject & HiveTableDataObject (#325)
  • refactor case sensitivity for JDBC db/table names (#221)
  • implement jdbc connection pool (#329)
  • rename thread for logging (#200)
  • improve output of TaskFailedException (#200)

Bugfixes

  • remove permission check as it doesn't work on all hadoop filesystems
  • allow appName in state filename to contain underscore
  • avoid repo.spring.io/plugins-release
  • update spark-extensions version (fix kafka schema evolution on read)
  • fix validate metricsFailCondition
  • cleanup usage of breakLineage when changing partitionValues or filter
  • don't skip actions in dry-run (#226)