smart-data-lake

Smart Automation Tool for building modern Data Lakes and Data Pipelines

smart-data-lake - Release 2.6.0

Published by zzeekk 7 months ago

Version updates and dependencies

  • update to spark 3.4 (#659)
  • update delta, iceberg and snowpark for Spark 3.4
  • removed critical/high/medium vulnerabilities and added missing dependencies
  • update plugin versions
  • upgrade github actions to node16 and jdk17 (#721, #759)

Features

Improve metrics per DataObject (#529)

  • SubFeed is enriched with metrics
  • spark metrics are collected synchronously
  • read specific metrics for jdbc, delta, iceberg and snowflake
  • standardize naming for inserted/updated/deleted rows
  • add metrics to SubFeedsExpressionData so it can be used in executionCondition

new DataObjectSchemaExporter

  • export schema and statistics (table, column) for UI

improve ConfigJsonExporter

  • export scalaDoc of transformers to display in UI (#614)

improve SparkDfsTransformer (used by CustomDataFrameAction)

  • implement dynamic custom transform function (#711). This allows defining custom transform methods with dynamic parameter mapping at evaluation time, e.g. transform(session: SparkSession, dfDataObject1: DataFrame, dfDataObject2: DataFrame, isExec: Boolean, nbOfDts: Option[Int]): DataFrame; see the sketch below
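
A minimal sketch of such a dynamic transform method (the class name, join column and wiring are made up for illustration; in SDL the method lives in a custom transformer class referenced by a CustomDataFrameAction):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical transformer: SDL binds DataFrame parameters to input DataObjects
// by name and fills the remaining parameters (e.g. isExec, nbOfDts) dynamically
// at evaluation time.
class DynamicJoinTransformer {
  def transform(session: SparkSession, dfDataObject1: DataFrame, dfDataObject2: DataFrame,
                isExec: Boolean, nbOfDts: Option[Int]): DataFrame = {
    val joined = dfDataObject1.join(dfDataObject2, Seq("id")) // assumes a common "id" column
    nbOfDts.map(n => joined.limit(n)).getOrElse(joined)
  }
}
```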

new state migrator (#764)

  • Allow reading old state file formats seamlessly
  • Command line tool to migrate all state files to the current format.

Scala 2.13

  • Support and create additional artifacts for Scala 2.13 (#707)

Improvements

  • add appVersion to state file
  • state file format changed: result is now a list of SubFeeds (instead of SubFeedResult), and mainMetrics is now part of the SubFeed entry, named metrics.
  • ActionPipelineContext.currentAction added
  • improve documentation for WebserviceFileDataObject and FileTransferAction
  • allow implementing custom HousekeepingModes
  • implement hadoopFileStateStoreIndexAppend (#724)
  • add option to replace non-standard SQL characters with _ in column names (#728)
  • ConvertNullValuesTransformer: a generic transformer for null handling (#755)
  • EncryptColumnsTransformer: change algorithm to GCM/NoPadding, as padding is unnecessary with GCM (#778); see the sketch after this list
  • KafkaTopicDataObject: look for groupIdPrefix in dataObject and connection options
  • ConfigJsonExporter: export column description from markdown files
  • ConfigJsonExporter: use hadoop to read description files
  • SnowflakeTableDataObject: forceGenericObservation = true
  • SnowflakeTableDataObject: add configuration for Spark options (#762)
  • AccessTableDataObject: get rid of ucanaccess JDBC Driver and use jackcess Library directly
  • JdbcTableDataObject: implement directTableOverwrite (#747)
  • DeltaLakeTableDataObject: allow dynamic partitioning (#799)
  • HadoopFileActionDAGRunStateStore: use json lines as index file format, instead of a special separator line
  • HadoopFileActionDAGRunStateStore index format: replace actionsState summary with detailed list of actions and dataObjects
  • DfTransformerWrapperDfsTransformer: apply to all subfeeds by default
  • implement generic DebugTransformer
  • implement SparkFlattenDfTransformer (#789)
  • reset skipped Input-SubFeeds when it is decided to execute an Action (#770)
  • improve detecting Databricks environment
  • Include results of failed tasks in state file (#615)
  • CI Tests: avoid running tests twice (regular + scoverage plugin)
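
For the EncryptColumnsTransformer item above, a minimal sketch of AES in GCM/NoPadding mode using the standard JCA API; key handling and IV transport are simplified here and do not reflect SDL's actual implementation:

```scala
import java.security.SecureRandom
import javax.crypto.Cipher
import javax.crypto.spec.{GCMParameterSpec, SecretKeySpec}

object GcmSketch {
  // GCM is an authenticated, stream-like mode, so no block padding is needed.
  def encrypt(plaintext: Array[Byte], key: Array[Byte]): (Array[Byte], Array[Byte]) = {
    val iv = new Array[Byte](12) // 96-bit IV, the recommended size for GCM
    new SecureRandom().nextBytes(iv)
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, iv))
    (iv, cipher.doFinal(plaintext)) // the IV must be kept alongside the ciphertext
  }
}
```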

Bugfixes

  • fix timing issue in ActionDAGKafkaTest
  • fix setting Spark case sensitive resolution only locally
  • fix "key not found" error in PartitionValues.sort
  • fix sporadic stackoverflow in schema export
  • ExcelFileDataObject: update documentation with all schema types (#727)
  • IcebergTableDataObject: fix semantics of GenericSchema.equalsSchema method
  • JdbcTableDataObject: use same method for getting existing table schema
  • JdbcTableConnection: rollback transaction on exception in execWithJdbcStatement (#748)
  • DeltaLakeTableDataObject: fix using catalog when checking if table exists (#802)
  • ConfigJsonExporter: fix handling multi-line column descriptions
  • exclude ProxyAction from schema export as it should not be instantiated by configuration
  • don't set DeltaLake extension and catalog properties when on Databricks UC
  • make consistent use of Environment.fileSystemFactory

smart-data-lake - Release 2.5.2

Published by zzeekk 12 months ago

Hotfix Release with the following changes:

Improvements

  • Allowed custom HousekeepingModes (#739)

smart-data-lake - Release 2.5.1

Published by pgruetter about 1 year ago

Features

  • #510 Documentation: search bar
  • #656 Modernize Schema Viewer
  • #700 Upload final state to API

Improvements

  • #671 State file: add startTstmp and duration for Init and Prepare phases
  • #672 State file: add inputIds and outputIds
  • #680 Include SDLB version information in state file
  • improvements in LabSparkDataObject
  • improvements in PartitionDiffMode
  • improvements in RawFileDataObject
  • various internal improvements (naming, wording, log messages)

Bugfixes

  • #644 SchemaViewer is freezing when loading definitions
  • #687 Skipped not handled correctly if multiple Actions write to the same DataObject
  • #689 DeduplicateAction: problems with schema evolution if mergeModeEnable=true and updateCapturedColumnOnlyWhenChanged=true
  • #708 Recovery not triggered

Dependencies

dbutils-api: Update from 0.0.5 to 0.0.6

smart-data-lake - Release 2.5.0

Published by pgruetter over 1 year ago

Major features

  • Upgrade to Spark 3.3
  • SDL Agents
  • Support for Apache Iceberg
  • Integration with Unity Catalog

Features

  • #541
  • #549
  • #571
  • #582
  • #619
  • #621
  • #625
  • #635
  • #652
  • SmartDataLakeBuilderLab to use DataObjects more interactively in Notebooks
  • many-to-many transformations in Python

Improvements

  • Switch to log4j2 yaml format
  • New variable failSimulationOnMissingInputSubFeeds to configure if runs should fail when input subfeeds are missing
  • Expectation improvements (SQLQueryExpectation)
  • Improvements on JDBC transaction handling
  • Improvements on Schema Viewer
  • Proxy Support for SftpFileRefConnections
  • FileTransferAction: Support for multiple file transfers in parallel
  • Global Config: allowAsRecursiveInput - allow exceptions on specific DataObjects
  • Improved Xsd and JsonSchema support
  • Improved Metric writing to Azure LogAnalytics
  • Improved support for AWS Glue

Bugfixes

  • #599
  • #627
  • #633
  • #653
  • Various smaller bugfixes and improved error handling

Dependencies

Spark: Update from Spark 3.2 to 3.3
Delta Lake: Update from 2.0 to 2.2

smart-data-lake - Release 2.4.2

Published by ddeuber over 1 year ago

Bugfixes and improvements:

  • Fix writing to Oracle databases when temporary tables are involved (#633)
  • When saveMode=Overwrite for JdbcTableDataObject, allow writing to the database table even if the column order in the dataframe is different (#633)
  • Add parameters to JdbcTableConnection in order to configure the commit behaviour in JDBC connections (#633)

Note: this release was created as a hotfix release on top of version 2.4.1, as the develop-spark3 branch is already on 2.5.0-SNAPSHOT.

smart-data-lake - Release 2.4.1

Published by zzeekk over 1 year ago

Bugfixes and improvements:

  • Increase spark-extensions version to 3.2.5 (#627): Remove restrictive avro schema equality test
  • Do not write schema file in simulations (#627)
  • Do not throw exception when there is no path for sample file in CustomFileAction (#627)

Note: this release was created as a hotfix release on top of version 2.4.0, as the develop-spark3 branch is already on 2.5.0-SNAPSHOT.

smart-data-lake - Release 2.4.0

Published by pgruetter almost 2 years ago

Bugfixes and improvements

#518 Schema Viewer shows wrong information
#580 Can't use same ExcelFileDataObject for write and read
#600 Schema viewer does not indicate whether a field is required
#601 Loading Schema from file should be done lazily
Leading underscores are preserved when normalizing column names
ExecutionMode and executionCondition are only applied in exec phase

Features

#591 Column encryption
#610 Support DataObjectStateIncrementalMode for KafkaTopicDataObject

Dependencies

Bump commons-net from 3.1 to 3.9.0

smart-data-lake - Release 2.3.2

Published by pgruetter almost 2 years ago

Bugfixes

  • #593
  • NotSerializableException with RelaxedCsvFileDataObject

Improvements

  • #577

Dependency Updates

  • commons-text

smart-data-lake - Release 2.3.1

Published by pgruetter almost 2 years ago

This is mainly a bugfix release, see:
#583
#584
#578
#579

One new Feature:
#575

smart-data-lake - Release 2.3.0

Published by zzeekk about 2 years ago

Version upgrades

  • Spark 3.2.1 -> 3.2.2
  • Delta Lake 1.1.0 -> 2.0.0

New Features

  • GenericDataFrame implementation to create transformations that run with Spark and Snowpark/Snowflake (#376)
  • Constraints and Expectations (#43, #377, #388), see also http://smartdatalake.ch/docs/reference/dataQuality#constraints
  • Historize with incremental CDC mode (#407), see also http://smartdatalake.ch/blog/sdl-hist
  • Spark file dataobject incremental mode (#517)
  • Spark Dataset transformations using ScalaClassSparkDsTransformer (#489)
  • DataObject schemas from caseClass, jsonSchema, xsdFile and avroSchemaFile (#512)
  • Methods to provide schema in init-phase (#522)
  • Support for json-schema with confluent schema registry (#538)
  • JDBC overall transaction (#254)
  • FinalStateWriter to store state once a job is finished

Minor Bugfixes and improvements

  • Improve parsing xsd schema
  • Improve Housekeeping
  • Implement ColNamesLowercaseTransformer and remove converting columns to lowercase
  • HiveConnection pathPrefix optional, FileIncrementalMoveMode absolute archivePath
  • Cleanup partition directories after failure in SparkFileDataObject
  • Fix schema versioning
  • Fix: treat Airbyte supportsIncremental as optional
  • Fix naming of input views when chaining SQL transformations
  • Fix transformer dataframe output mapping and input partitionvalues
  • Fix calling move/compactPartition only if list is not empty

Full Changelog: https://github.com/smart-data-lake/smart-data-lake/compare/2.2.1...2.3.0

smart-data-lake - Release 2.2.1

Published by zzeekk over 2 years ago

Version upgrades

  • update Spark version 3.2.0 -> 3.2.1

New Features

  • StatusInfo REST-Server (#450)
  • Websocket for live status (#450)
  • DagExporter command line tool to export a basic DAG selected by a feed selector

Minor Bugfixes and improvements

  • add maven profile to create fat-jar for Spark 3.1 (#465)
  • fix spark 3.1 json4s compatibility
  • fix reading state file from previous versions
  • update spark-extensions: fix execution on Databricks
  • fix and refine validatePartitionValuesExisting
  • move sparkSession from object Environment to GlobalConfig to support running multiple SDLB jobs on the same JVM (e.g. Databricks cluster)
  • fix Airbyte parser issue (#483)
  • update spark-excel and poi dependency because of vulnerability (#485)

smart-data-lake - Release 2.2.0

Published by pgruetter over 2 years ago

Version upgrades

Update to Spark 3.2 (#406)
Update delta lake to version 1.1 (#406)

  • don't use the DeltaLake Table API because of strange errors
  • Delta Lake version 1.1 needs Spark 3.2
  • Update scala-maven-plugin to support Scala 2.12.14+

New Features

Implement CustomSnowparkAction (rudimentary Snowpark support, #376)
Implement script support and CustomScriptAction (#422)
Implement AirbyteDataObject (#365)
Implement basic ScalaNotebookDfTransformer (#401)
Implement SDL json schema creator (#440)
Add Atlas metadata exporter implementation

Minor Bugfixes and improvements

Extend StateListener.notifyState with a parameter indicating the changed Action
Adapted StateChangeLogger to log only for the action for which the notification was emitted
Refactor Actions SubFeed handling
Refactor integrating SparkSession into ActionPipelineContext and usage of implicit parameters
Add SASL Authentication for Kafka
Avoid losing the full error response text from webservice calls
Improve build stability by using linesIterator; on some environments Java's String.lines takes precedence over Scala's StringLike.lines, which causes compile problems (see the sketch after this list)
Use json4s instead of hocon/configs to write json-state-files
Allow using custom class loader in order to find classes defined or loaded from notebooks (polynote) when parsing configuration
Extend ScalaJWebserviceClient so it can be re-used in getting-started
Force SaveMode.Overwrite for DeduplicateAction and HistorizeAction if mergeModeEnable=false
Make runtime info public (#454)
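
A small sketch of the linesIterator issue mentioned above (assuming Scala 2.12 on JDK 11+):

```scala
object LinesExample extends App {
  val text = "a\nb\nc"
  // On JDK 11+, text.lines resolves to java.lang.String.lines(), which returns a
  // java.util.stream.Stream[String] instead of Scala's Iterator[String].
  val it: Iterator[String] = text.linesIterator // unambiguous: always the Scala iterator
  it.foreach(println)
}
```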

smart-data-lake - Release 1.3.1

Published by zzeekk almost 3 years ago

Improved Delta Lake support

  • improve comparing schema ignoring nullability
  • added support for evolving schema when working with DeltaLakeTableDataObject with SDLSaveMode.Append
  • handle missing delta table, _delta_log and missing hadoop path

Data Objects extensions

  • implement DataObjects with state (#365)
  • implement reading partitioned xml-data
  • implement Jdbc table creation and schema evolution

Streaming improvements

  • don't increment runId when all actions are skipped in streaming mode
  • fix ActionDAGRunState.isSkipped for mixed scenarios (async and sync actions)
  • make execActionDAG tail-recursive to avoid stack overflow for long-running streaming jobs (see the sketch below)
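
A simplified illustration of the tail-recursion fix (not SDL's actual code; execActionDAG's real signature differs):

```scala
import scala.annotation.tailrec

object StreamingLoopSketch {
  // One recursive call per streaming micro-batch: @tailrec makes the compiler
  // verify the call is in tail position, so it compiles to a loop and the stack
  // stays constant no matter how long the streaming job runs.
  @tailrec
  def execLoop(runId: Int, stop: Int => Boolean): Int = {
    // ... execute one DAG iteration for runId here ...
    if (stop(runId)) runId
    else execLoop(runId + 1, stop)
  }
}
```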

New SDLSaveMode.merge to do upsert statements (see the sketch after this list)

  • implement save mode merge for JdbcTableDataObject and DeltaLakeTableDataObject
  • implement merge mode for CopyAction
  • implement merge mode for DeduplicateAction (#235)
  • implement merge mode for HistorizeAction (#235)
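
Conceptually, save mode merge issues an upsert; a simplified Delta Lake sketch (SDL derives the join condition from the table's primary key, hard-coded here as "id"):

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.{DataFrame, SparkSession}

object MergeSketch {
  def upsert(spark: SparkSession, dfNew: DataFrame): Unit = {
    DeltaTable.forName(spark, "mydb.mytable").as("t")
      .merge(dfNew.as("s"), "t.id = s.id")  // match on the primary key
      .whenMatched().updateAll()            // update rows for existing keys
      .whenNotMatched().insertAll()         // insert rows for new keys
      .execute()
  }
}
```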

New sdl-azure module

  • add Azure libraries, AzureADClientGrantAuthMode
  • introduced state change logger, which submits save events to azure log monitoring
  • support for azure key vault secret provider

Small bugfixes & improvements

  • support more type conversions in schema evolution
  • if possible use schemaMin to create an empty DataFrame if the table for recursive input doesn't exist yet (see the sketch after this list).
  • Prevent file names starting with . in WebserviceFileDataObject (crc files still have original name though)
  • Remove special chars from fileRefs generated by WebserviceFileDataObject (#395)
  • throw exception if config entry for connections, dataObjects or actions is not of type object (#396)
  • fix evaluating to_date and other ReplaceableExpressions with ExpressionEvaluator
  • cleanup kafka dependency from deltalake pom.xml
  • remove wrong error message about missing executionId in SparkStageMetricsListener
  • fix reading data frame from skipped SubFeed if filters are ignored
  • fix parsing event info if appName contains special characters
  • add a transformer to repartition dataframe
  • made SmartDataLakeLogger public
  • Simplify final exception for better usability of log: truncate stacktrace starting from "monix.*" entries, limit logical plan in AnalysisException to 5 lines
  • Simplify logging of TaskFailedException
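
The empty-DataFrame fallback from the list above, sketched with plain Spark (the schemaMin value here is a made-up example standing in for a DataObject's configured minimal schema):

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object EmptyFromSchemaMin {
  // When the recursive input table does not exist yet, an empty DataFrame with
  // the configured minimal schema is used instead of failing.
  def apply(spark: SparkSession): DataFrame = {
    val schemaMin = StructType(Seq(StructField("id", StringType), StructField("value", StringType)))
    spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schemaMin)
  }
}
```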

Cleanup

  • Cleanup deprecated PartitionDiffMode.stopIfNoData

smart-data-lake - Release 2.1.1

Published by zzeekk almost 3 years ago

Improved Delta Lake support

  • update DeltaLakeTableDataObject to use the table API (#375)
  • improve comparing schema ignoring nullability
  • added support for evolving schema when working with DeltaLakeTableDataObject with SDLSaveMode.Append
  • handle missing delta table, _delta_log and missing hadoop path

Data Objects extensions

  • implement DataObjects with state (#365)
  • implement reading partitioned xml-data
  • implement Jdbc table creation and schema evolution
  • new RelaxedCsvFileDataObject and ZipCsvCodec for compression

Streaming improvements

  • don't increment runId when all actions are skipped in streaming mode
  • fix ActionDAGRunState.isSkipped for mixed scenarios (async and sync actions)
  • make execActionDAG tail-recursive to avoid stack overflow for long-running streaming jobs

New SDLSaveMode.merge to do upsert statements

  • implement save mode merge for JdbcTableDataObject and DeltaLakeTableDataObject
  • implement merge mode for CopyAction
  • implement merge mode for DeduplicateAction (#235)
  • implement merge mode for HistorizeAction (#235)

New sdl-azure module

  • add Azure libraries, AzureADClientGrantAuthMode
  • introduced state change logger, which submits save events to azure log monitoring
  • support for azure key vault secret provider

Small bugfixes & improvements

  • support more type conversions in schema evolution
  • if possible use schemaMin to create empty DataFrame if table for recursive input doesn't exist yet.
  • Prevent file names starting with . in WebserviceFileDataObject (crc files still have original name though)
  • Remove special chars from fileRefs generated by WebserviceFileDataObject (#395)
  • throw exception if config entry for connections, dataObjects or actions is not of type object (#396)
  • fix evaluating to_date and other ReplaceableExpressions with ExpressionEvaluator
  • cleanup kafka dependency from deltalake pom.xml
  • remove wrong error message about missing executionId in SparkStageMetricsListener
  • fix reading data frame from skipped SubFeed if filters are ignored
  • fix parsing event info if appName contains special characters
  • add a transformer to repartition dataframe
  • made SmartDataLakeLogger public
  • Simplify final exception for better usability of log: truncate stacktrace starting from "monix.*" entries, limit logical plan in AnalysisException to 5 lines
  • Simplify logging of TaskFailedException

Cleanup

  • Cleanup deprecated PartitionDiffMode.stopIfNoData

smart-data-lake - 1.3.0

Published by pgruetter about 3 years ago

New features

  • New Transformer API #344
  • Define retention period for date partitioned data object #211
  • Extending Syntax to define actions to execute #366
  • Databricks 8.X compatibility (only for version 2.1.0 really as Databricks Runtime 8 uses Spark 3.x) #355

Bugfixes

  • Recovery not working for skipped and failed predecessors #356
  • JmsDataObject already processing data in init phase #357
  • spark-tags declared in sdl-deltalake to correctly resolve dependencies
  • version bump of libraries (dependabot)

smart-data-lake - 2.1.0

Published by pgruetter about 3 years ago

New features

  • New Transformer API #344
  • Define retention period for date partitioned data object #211
  • Extending Syntax to define actions to execute #366
  • Databricks 8.X compatibility #355

Bugfixes

  • Recovery not working for skipped and failed predecessors #356
  • JmsDataObject already processing data in init phase #357
  • spark-tags declared in sdl-deltalake to correctly resolve dependencies
  • version bump of libraries (dependabot)

smart-data-lake - 1.2.5

Published by pgruetter over 3 years ago

New features

  • enable fail-on-superfluous-config-keys (#27)
  • implement support to register python UDFs for Spark SQL transformations
  • add org.apache.spark:hadoop-cloud s3a optimizations (#208)

Minor improvements

  • optimize reading multiple partitions in SparkFileDataObject
  • validate partition columns existing on write and read
  • implement including current date partition in KafkaTopicDataObject.listPartitions
  • use options to customize kafka consumer creation

Bugfixes

  • fix recovery re-executing skipped actions (#349)
  • fix SparkUDFCreator not serializable
  • avoid PythonAccumulatorV2 "java.net.ConnectException: Connection refused: connect"
  • fix: allow overwriting a Hive table with a different schema
  • partly fix the schema/database naming mess in the Snowflake module
  • performance fix for ConfluentAvroDataToCatalyst conversion
  • fix listing partitions on relative path
  • JmsDataObject should not process data in init phase (#357)

smart-data-lake - 2.0.4

Published by pgruetter over 3 years ago

New features

  • enable fail-on-superfluous-config-keys (#27)
  • implement support to register python UDFs for Spark SQL transformations
  • add org.apache.spark:hadoop-cloud s3a optimizations (#208)

Minor improvements

  • optimize reading multiple partitions in SparkFileDataObject
  • validate partition columns existing on write and read
  • update to spark 3.1.1
  • update delta lake version
  • implement including current date partition in KafkaTopicDataObject.listPartitions
  • use options to customize kafka consumer creation

Bugfixes

  • fix recovery re-executing skipped actions (#349)
  • fix SparkUDFCreator not serializable
  • avoid PythonAccumulatorV2 "java.net.ConnectException: Connection refused: connect"
  • fix: allow overwriting a Hive table with a different schema
  • partly fix the schema/database naming mess in the Snowflake module
  • performance fix for ConfluentAvroDataToCatalyst conversion
  • fix listing partitions on relative path

smart-data-lake - 1.2.4

Published by pgruetter over 3 years ago

New Features

  • Add Snowflake Module
  • add ProcessAllMode
  • implement SDLSaveMode.OverwriteOptimized (#313)
  • implement SDLSaveMode.OverwritePreserveDirectories (#292)
  • implement SDLPlugin (#293)
  • implement FileIncrementalMoveMode (#314)
  • implement Action.executionCondition (#314)
  • refactor WebserviceFileDataObject AuthModes
  • dynamically add run_id partition value if needed (#289)
  • enhance RawFileDataObject to read/write custom Spark data formats

Minor improvements

  • ignore hidden config files and directories (#286)
  • implement SparkCustomAction.inputIdsToIgnoreFilter (#129)
  • improve configuration checks (#213)
  • refactor evaluating execution mode only in init-phase, and reusing results in exec-phase (#213)
  • implement SparkIncrementalMode.stopIfNoData (#203)
  • Added Put method to webservice
  • move deltalake connectivity to separate module sdl-deltalake (#288)
  • dynamically add run_id partition value if needed (#289)
  • implemented SDLSaveMode.OverwritePreserveDirectories (#292)
  • change partition column _run_id to run_id
  • Rename ActionObjectId as ActionId (#283)
  • allow PartitionDiffMode to add values for additional partition columns through transformPartitionValues (#303)
  • refine exclusion of log4j files when searching configuration files (#286)
  • extend renaming of files (#316)
  • add sorting direction to sortWithinPartition
  • ignore numInitialHdfsPartitions if Spark 3.0 AQE is enabled
  • log schema in schema evolution only if changed or severity=debug
  • implement more flexible main input selection (#314)
  • refactor SecretsUtil to allow for custom SecretProvider configured by GlobalConfig
  • validate schema on write for SparkFileDataObject, JdbcTableDataObject & HiveTableDataObject (#325)
  • refactor case sensitivity for JDBC db/table names (#221)
  • implement jdbc connection pool (#329)
  • rename thread for logging (#200)
  • improve output of TaskFailedException (#200)

Bugfixes

  • remove permission check as it doesn't work on all hadoop filesystems
  • allow appName in state filename to contain underscore
  • avoid repo.spring.io/plugins-release
  • update spark-extensions version (fix kafka schema evolution on read)
  • fix validate metricsFailCondition
  • cleanup usage of breakLineage when changing partitionValues or filter
  • don't skip actions in dry-run (#226)

smart-data-lake - 2.0.3

Published by pgruetter over 3 years ago

New Features

  • Add Snowflake Module
  • add ProcessAllMode
  • implement SDLSaveMode.OverwriteOptimized (#313)
  • implement SDLSaveMode.OverwritePreserveDirectories (#292)
  • implement SDLPlugin (#293)
  • implement FileIncrementalMoveMode (#314)
  • implement Action.executionCondition (#314)
  • refactor WebserviceFileDataObject AuthModes
  • dynamically add run_id partition value if needed (#289)
  • enhance RawFileDataObject to read/write custom Spark data formats

Minor improvements

  • ignore hidden config files and directories (#286)
  • implement SparkCustomAction.inputIdsToIgnoreFilter (#129)
  • improve configuration checks (#213)
  • refactor evaluating execution mode only in init-phase, and reusing results in exec-phase (#213)
  • implement SparkIncrementalMode.stopIfNoData (#203)
  • Added Put method to webservice
  • move deltalake connectivity to separate module sdl-deltalake (#288)
  • dynamically add run_id partition value if needed (#289)
  • implemented SDLSaveMode.OverwritePreserveDirectories (#292)
  • change partition column _run_id to run_id
  • Rename ActionObjectId as ActionId (#283)
  • allow PartitionDiffMode to add values for additional partition columns through transformPartitionValues (#303)
  • refine exclusion of log4j files when searching configuration files (#286)
  • extend renaming of files (#316)
  • add sorting direction to sortWithinPartition
  • ignore numInitialHdfsPartitions if Spark 3.0 AQE is enabled
  • log schema in schema evolution only if changed or severity=debug
  • implement more flexible main input selection (#314)
  • refactor SecretsUtil to allow for custom SecretProvider configured by GlobalConfig
  • validate schema on write for SparkFileDataObject, JdbcTableDataObject & HiveTableDataObject (#325)
  • refactor case sensitivity for JDBC db/table names (#221)
  • implement jdbc connection pool (#329)
  • rename thread for logging (#200)
  • improve output of TaskFailedException (#200)

Bugfixes

  • remove permission check as it doesn't work on all hadoop filesystems
  • allow appName in state filename to contain underscore
  • avoid repo.spring.io/plugins-release
  • update spark-extensions version (fix kafka schema evolution on read)
  • fix validate metricsFailCondition
  • cleanup usage of breakLineage when changing partitionValues or filter
  • don't skip actions in dry-run (#226)