beam

Apache Beam is a unified programming model for Batch and Streaming data processing.

APACHE-2.0 License

beam - Beam 2.56.0 release Latest Release

Published by damccorm 6 months ago

We are happy to present the new 2.56.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.

For more information on changes in 2.56.0, check out the detailed release notes.

Highlights

  • Added FlinkRunner for Flink 1.17, removed support for Flink 1.12 and 1.13. Pipelines running on Flink 1.16 or below can be upgraded to Flink 1.17 if the pipeline is first updated to Beam 2.56.0 on the same Flink version. Once the pipeline runs with Beam 2.56.0, it should be possible to upgrade to the FlinkRunner with Flink 1.17. (#29939)
  • New Managed I/O Java API (#30830).
  • New Ordered Processing PTransform added for processing order-sensitive stateful data (#30735).

I/Os

  • Upgraded Avro version to 1.11.3, kafka-avro-serializer and kafka-schema-registry-client versions to 7.6.0 (Java) (#30638).
    The newer Avro package is known to have breaking changes. If you are affected, you can keep pinned to older Avro versions which are also tested with Beam.
  • Iceberg read/write support is available through the new Managed I/O Java API (#30830).

New Features / Improvements

  • Profiling of Cythonized code has been disabled by default. This might improve performance for some Python pipelines (#30938).
  • Bigtable enrichment handler now accepts a custom function to build a composite row key. (Python) (#30974).

Breaking Changes

  • Default consumer polling timeout for KafkaIO.Read was increased from 1 second to 2 seconds. Use KafkaIO.read().withConsumerPollingTimeout(Duration duration) to configure this timeout value when necessary (#30870).
  • Python Dataflow users no longer need to manually specify --streaming for pipelines using unbounded sources such as ReadFromPubSub.

Bugfixes

  • Fixed locking issue when shutting down inactive bundle processors. Symptoms of this issue include slowness or stuckness in long-running jobs (Python) (#30679).
  • Fixed logging issue that silenced the pip output when installing dependencies provided in --requirements_file (Python).

List of Contributors

According to git shortlog, the following people contributed to the 2.56.0 release. Thank you to all contributors!

Abacn

Ahmed Abualsaud

Andrei Gurau

Andrey Devyatkin

Aravind Pedapudi

Arun Pandian

Arvind Ram

Bartosz Zablocki

Brachi Packter

Byron Ellis

Chamikara Jayalath

Clement DAL PALU

Damon

Danny McCormick

Daria Bezkorovaina

Dip Patel

Evan Burrell

Hai Joey Tran

Jack McCluskey

Jan Lukavský

JayajP

Jeff Kinard

Julien Tournay

Kenneth Knowles

Luís Bianchin

Maciej Szwaja

Melody Shen

Oleh Borysevych

Pablo Estrada

Rebecca Szper

Ritesh Ghorse

Robert Bradshaw

Sam Whittle

Sergei Lilichenko

Shahar Epstein

Shunping Huang

Svetak Sundhar

Timothy Itodo

Veronica Wasson

Vitaly Terentyev

Vlado Djerek

Yi Hu

akashorabek

bzablocki

clmccart

damccorm

dependabot[bot]

dmitryor

github-actions[bot]

liferoad

martin trieu

tvalentyn

xianhualiu

beam - Beam 2.55.1 release

Published by damccorm 6 months ago

Bugfixes

  • Fixed issue that broke WriteToJson in languages other than Java (X-lang) (#30776).

beam - Beam 2.55.0 release

Published by Abacn 7 months ago

We are happy to present the new 2.55.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.

For more information on changes in 2.55.0, check out the detailed release notes.

Highlights

  • The Python SDK will now include automatically generated wrappers for external Java transforms! (#29834)

I/Os

  • Added support for handling bad records to BigQueryIO (#30081).
    • Full Support for Storage Read and Write APIs
    • Partial Support for File Loads (Failures writing to files supported, failures loading files to BQ unsupported)
    • No Support for Extract or Streaming Inserts
  • Added support for handling bad records to PubSubIO (#30372).
    • Support is not available for handling schema mismatches, and enabling error handling for writing to Pub/Sub topics with schemas is not recommended
  • --enableBundling pipeline option for BigQueryIO DIRECT_READ is replaced by --enableStorageReadApiV2. Both were considered experimental and subject to change (Java) (#26354).

New Features / Improvements

  • Allow writing clustered and not time-partitioned BigQuery tables (Java) (#30094).
  • Redis cache support added to RequestResponseIO and Enrichment transform (Python) (#30307)
  • Merged sdks/java/fn-execution and runners/core-construction-java into the main SDK. These artifacts were never meant for users, but note
    that they no longer exist. These are steps toward bringing portability into the core SDK alongside all other core functionality.
  • Added Vertex AI Feature Store handler for Enrichment transform (Python) (#30388)

Breaking Changes

  • Arrow version was bumped to 15.0.0 from 5.0.0 (#30181).
  • Go SDK users who build custom worker containers may run into issues with the move to distroless containers as a base (see Security Fixes).
  • Python SDK has changed the default value for the --max_cache_memory_usage_mb pipeline option from 100 to 0. This option was first introduced in the 2.52.0 SDK version. This change restores the behavior of the 2.51.0 SDK, which does not use the state cache. If your pipeline uses iterable side inputs views, consider increasing the cache size by setting the option manually. (#30360).
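
As a rough illustration of setting the option manually, the sketch below adds an explicit --max_cache_memory_usage_mb value to a list of pipeline arguments using only the standard library. The helper name with_state_cache and the 100 MB value are assumptions chosen for illustration, and the helper only understands the "--flag=value" spelling.

```python
FLAG = "--max_cache_memory_usage_mb"


def with_state_cache(pipeline_args, cache_mb):
    """Return a copy of pipeline_args with the cache size set explicitly.

    Hypothetical helper, not part of Beam: it only handles "--flag=value".
    """
    if cache_mb < 0:
        raise ValueError("cache size must be non-negative")
    # Drop any earlier setting so the new value is the only one present.
    kept = [arg for arg in pipeline_args if not arg.startswith(FLAG + "=")]
    return kept + [f"{FLAG}={cache_mb}"]


if __name__ == "__main__":
    # The resulting list is the kind of flag list usually passed to the
    # Python SDK's PipelineOptions; that step is left out to stay stdlib-only.
    print(with_state_cache(["--runner=DirectRunner"], 100))
```

Whether a larger cache helps depends on the pipeline; the note above only suggests it for pipelines that use iterable side input views.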

Deprecations

  • N/A

Bug fixes

  • Fixed SpannerIO.readChangeStream to support propagating credentials from pipeline options
    to the getDialect calls for authenticating with Spanner (Java) (#30361).
  • Reduced the number of HTTP requests in GCSIO function calls (Python) (#30205)

Security Fixes

  • Go SDK base container image moved to distroless/base-nossl-debian12, reducing vulnerable container surface to kernel and glibc (#30011).

Known Issues

  • In Python pipelines, when shutting down inactive bundle processors, shutdown logic can overaggressively hold the lock, blocking acceptance of new work. Symptoms of this issue include slowness or stuckness in long-running jobs. Fixed in 2.56.0 (#30679).

List of Contributors

According to git shortlog, the following people contributed to the 2.55.0 release. Thank you to all contributors!

Ahmed Abualsaud

Anand Inguva

Andrew Crites

Andrey Devyatkin

Arun Pandian

Arvind Ram

Chamikara Jayalath

Chris Gray

Claire McGinty

Damon Douglas

Dan Ellis

Danny McCormick

Daria Bezkorovaina

Dima I

Edward Cui

Ferran Fernández Garrido

GStravinsky

Jan Lukavský

Jason Mitchell

JayajP

Jeff Kinard

Jeffrey Kinard

Kenneth Knowles

Mattie Fu

Michel Davit

Oleh Borysevych

Ritesh Ghorse

Ritesh Tarway

Robert Bradshaw

Robert Burke

Sam Whittle

Scott Strong

Shunping Huang

Steven van Rossum

Svetak Sundhar

Talat UYARER

Ukjae Jeong (Jay)

Vitaly Terentyev

Vlado Djerek

Yi Hu

akashorabek

case-k

clmccart

dengwe1

dhruvdua

hardshah

johnjcasey

liferoad

martin trieu

tvalentyn

beam - Beam 2.54.0 release

Published by lostluck 8 months ago

We are happy to present the new 2.54.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.

For more information on changes in 2.54.0, check out the detailed release notes.

Highlights

  • Enrichment Transform along with GCP BigTable handler added to Python SDK (#30001).
  • Beam Java Batch pipelines run on Google Cloud Dataflow will default to the Portable Runner V2 (https://cloud.google.com/dataflow/docs/runner-v2) starting with this version. (All other languages are already on Runner V2.)
    • This change is still rolling out to the Dataflow service; see the Runner V2 documentation (https://cloud.google.com/dataflow/docs/runner-v2) for how to enable or disable it intentionally.

I/Os

  • Added support for writing to BigQuery dynamic destinations with Python's Storage Write API (#30045)
  • Adding support for Tuples DataType in ClickHouse (Java) (#29715).
  • Added support for handling bad records to FileIO, TextIO, AvroIO (#29670).
  • Added support for handling bad records to BigtableIO (#29885).

New Features / Improvements

Breaking Changes

  • N/A

Deprecations

  • N/A

Bugfixes

  • Fixed a memory leak affecting some Go SDK since 2.46.0. (#28142)

Security Fixes

  • N/A

Known Issues

  • N/A

List of Contributors

According to git shortlog, the following people contributed to the 2.54.0 release. Thank you to all contributors!

Ahmed Abualsaud

Alexey Romanenko

Anand Inguva

Andrew Crites

Arun Pandian

Bruno Volpato

caneff

Chamikara Jayalath

Changyu Li

Cheskel Twersky

Claire McGinty

clmccart

Damon

Danny McCormick

dependabot[bot]

Edward Cheng

Ferran Fernández Garrido

Hai Joey Tran

hugo-syn

Issac

Jack McCluskey

Jan Lukavský

JayajP

Jeffrey Kinard

Jerry Wang

Jing

Joey Tran

johnjcasey

Kenneth Knowles

Knut Olav Løite

liferoad

Marc

Mark Zitnik

martin trieu

Mattie Fu

Naireen Hussain

Neeraj Bansal

Niel Markwick

Oleh Borysevych

pablo rodriguez defino

Rebecca Szper

Ritesh Ghorse

Robert Bradshaw

Robert Burke

Sam Whittle

Shunping Huang

Svetak Sundhar

S. Veyrié

Talat UYARER

tvalentyn

Vlado Djerek

Yi Hu

Zechen Jian

beam - Beam 2.53.0 release

Published by jrmccluskey 10 months ago

We are happy to present the new 2.53.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.

For more information on changes in 2.53.0, check out the detailed release notes.

Highlights

  • Python streaming users that use 2.47.0 and newer versions of Beam should update to version 2.53.0, which fixes a known issue: (#27330).

I/Os

  • TextIO now supports skipping multiple header lines (Java) (#17990).
  • Python GCSIO is now implemented with GCP GCS Client instead of apitools (#25676)
  • Adding support for LowCardinality DataType in ClickHouse (Java) (#29533).
  • Added support for handling bad records to KafkaIO (Java) (#29546)
  • Add support for generating text embeddings in MLTransform for Vertex AI and Hugging Face Hub models (#29564).
  • NATS IO connector added (Go) (#29000).

New Features / Improvements

  • The Python SDK now type checks collections.abc.Collections types properly. Some type hints that were erroneously allowed by the SDK may now fail. (#29272)
  • Running multi-language pipelines locally no longer requires Docker.
    Instead, the same (generally auto-started) subprocess used to perform the
    expansion can also be used as the cross-language worker.
  • Framework for adding Error Handlers to composite transforms added in Java (#29164).
  • Python 3.11 images now include google-cloud-profiler (#29561).

Breaking Changes

Deprecations

  • Euphoria DSL is deprecated and will be removed in a future release (not before 2.56.0) (#29451)

Bugfixes

  • (Python) Fixed sporadic crashes in streaming pipelines that affected some users of 2.47.0 and newer SDKs (#27330).
  • (Python) Fixed a bug that caused MLTransform to drop identical elements in the output PCollection (#29600).

List of Contributors

According to git shortlog, the following people contributed to the 2.53.0 release. Thank you to all contributors!

Ahmed Abualsaud

Ahmet Altay

Alexey Romanenko

Anand Inguva

Arun Pandian

Balázs Németh

Bruno Volpato

Byron Ellis

Calvin Swenson Jr

Chamikara Jayalath

Clay Johnson

Damon

Danny McCormick

Ferran Fernández Garrido

Georgii Zemlianyi

Israel Herraiz

Jack McCluskey

Jacob Tomlinson

Jan Lukavský

JayajP

Jeffrey Kinard

Johanna Öjeling

Julian Braha

Julien Tournay

Kenneth Knowles

Lawrence Qiu

Mark Zitnik

Mattie Fu

Michel Davit

Mike Williamson

Naireen

Naireen Hussain

Niel Markwick

Pablo Estrada

Radosław Stankiewicz

Rebecca Szper

Reuven Lax

Ritesh Ghorse

Robert Bradshaw

Robert Burke

Sam Rohde

Sam Whittle

Shunping Huang

Svetak Sundhar

Talat UYARER

Tom Stepp

Tony Tang

Vlado Djerek

Yi Hu

Zechen Jiang

clmccart

damccorm

darshan-sj

gabry.wu

johnjcasey

liferoad

lrakla

martin trieu

tvalentyn

beam - Beam 2.52.0 release

Published by damccorm 11 months ago

We are happy to present the new 2.52.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.

For more information on changes in 2.52.0, check out the detailed release notes.

Highlights

  • Avro-dependent code, deprecated since Beam 2.46.0, has finally been removed from the Java SDK "core" package.
    Please use beam-sdks-java-extensions-avro instead. This makes it easy to update the Avro version in user code without
    potential breaking changes in Beam "core", since the Beam Avro extension already supports the latest Avro versions and
    should handle this (#25252).
  • Publishing Java 21 SDK container images now supported as part of Apache Beam release process. (#28120)
    • Direct Runner and Dataflow Runner support running pipelines on Java 21 (experimental until tests are fully set up). For other runners (Flink, Spark, Samza, etc.), support status depends on the runner projects.

New Features / Improvements

  • Add UseDataStreamForBatch pipeline option to the Flink runner. When it is set to true, Flink runner will run batch
    jobs using the DataStream API. By default the option is set to false, so the batch jobs are still executed
    using the DataSet API.
  • upload_graph as one of the Experiments options for DataflowRunner is no longer required when the graph is larger than 10MB for Java SDK (PR #28621).
  • State and side input cache has been enabled with a default of 100 MB. Use --max_cache_memory_usage_mb=X to provide cache size for the user state API and side inputs. (Python) (#28770).
  • Beam YAML stable release. Beam pipelines can now be written using YAML and leverage the Beam YAML framework which includes a preliminary set of IO's and turnkey transforms. More information can be found in the YAML root folder and in the README.

Breaking Changes

  • org.apache.beam.sdk.io.CountingSource.CounterMark uses custom CounterMarkCoder as a default coder since all Avro-dependent
    classes finally moved to extensions/avro. If AvroCoder is still required for CounterMark, then,
    as a workaround, a copy of the "old" CountingSource class should be placed in the project code and used directly
    (#25252).
  • Renamed host to firestoreHost in FirestoreOptions to avoid potential conflict of command line arguments (Java) (#29201).

Bugfixes

  • Fixed "Desired bundle size 0 bytes must be greater than 0" in Java SDK's BigtableIO.BigtableSource when you have more cores than bytes to read (Java) #28793.
  • The watch_file_pattern arg of RunInference had no effect prior to 2.52.0. To use the behavior of arg watch_file_pattern prior to 2.52.0, follow the documentation at https://beam.apache.org/documentation/ml/side-input-updates/ and use WatchFilePattern PTransform as a SideInput. (#28948)
  • MLTransform doesn't output artifacts such as min, max and quantiles. Instead, MLTransform will add a feature to output these artifacts in a human-readable format - #29017. For now, to use the artifacts such as min and max that were produced by the earlier MLTransform, use read_artifact_location of MLTransform, which reads artifacts that were produced earlier in a different MLTransform (#29016)
  • Fixed a memory leak, which affected some long-running Python pipelines: #28246.

Security Fixes

List of Contributors

According to git shortlog, the following people contributed to the 2.52.0 release. Thank you to all contributors!

Ahmed Abualsaud
Ahmet Altay
Aleksandr Dudko
Alexey Romanenko
Anand Inguva
Andrei Gurau
Andrey Devyatkin
BjornPrime
Bruno Volpato
Bulat
Chamikara Jayalath
Damon
Danny McCormick
Devansh Modi
Dominik Dębowczyk
Ferran Fernández Garrido
Hai Joey Tran
Israel Herraiz
Jack McCluskey
Jan Lukavský
JayajP
Jeff Kinard
Jeffrey Kinard
Jiangjie Qin
Jing
Joar Wandborg
Johanna Öjeling
Julien Tournay
Kanishk Karanawat
Kenneth Knowles
Kerry Donny-Clark
Luís Bianchin
Minbo Bae
Pranav Bhandari
Rebecca Szper
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
RyuSA
Shunping Huang
Steven van Rossum
Svetak Sundhar
Tony Tang
Vitaly Terentyev
Vivek Sumanth
Vlado Djerek
Yi Hu
aku019
brucearctor
caneff
damccorm
ddebowczyk92
dependabot[bot]
dpcollins-google
edman124
gabry.wu
illoise
johnjcasey
jonathan-lemos
kennknowles
liferoad
magicgoody
martin trieu
nancyxu123
pablo rodriguez defino
tvalentyn

beam - Beam 2.51.0 release

Published by kennknowles about 1 year ago

We are happy to present the new 2.51.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.

For more information on changes in 2.51.0, check out the detailed release notes.

New Features / Improvements

Breaking Changes

  • Removed fastjson library dependency for Beam SQL. Table property is changed to be based on jackson ObjectNode (Java) (#24154).
  • Removed TensorFlow from Beam Python container images PR. If you have been negatively affected by this change, please comment on #20605.
  • Removed the parameter t reflect.Type from parquetio.Write. The element type is derived from the input PCollection (Go) (#28490)
  • Refactor BeamSqlSeekableTable.setUp adding a parameter joinSubsetType. #28283

Bugfixes

  • Fixed exception chaining issue in GCS connector (Python) (#26769).
  • Fixed streaming inserts exception handling, GoogleAPICallErrors are now retried according to retry strategy and routed to failed rows where appropriate rather than causing a pipeline error (Python) (#21080).
  • Fixed a bug in Python SDK's cross-language Bigtable sink that mishandled records that don't have an explicit timestamp set: #28632.

Security Fixes

Known Issues

  • Python pipelines using BigQuery Storage Read API must pin fastavro dependency to 1.8.3
    or earlier: #28811
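
One way to guard against an incompatible fastavro is a small version check at startup. The sketch below is an assumption-laden illustration, not something Beam provides: the names parse_release, fastavro_is_pinned, and installed_fastavro_ok are hypothetical, and only the first three numeric components of the version are compared.

```python
from importlib import metadata

MAX_FASTAVRO = (1, 8, 3)  # upper bound taken from the known issue above


def parse_release(version):
    """Parse leading numeric components, e.g. "1.8.3" -> (1, 8, 3)."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as "post1" or "rc1"
        parts.append(int(piece))
    if not parts:
        raise ValueError(f"unrecognized version: {version!r}")
    return tuple(parts)


def fastavro_is_pinned(version):
    """True if the version is 1.8.3 or earlier (first three components only)."""
    release = parse_release(version)
    # Pad so that "1.8" compares as (1, 8, 0).
    release = release + (0,) * (3 - len(release))
    return release[:3] <= MAX_FASTAVRO


def installed_fastavro_ok():
    """Check the installed fastavro; returns None when it is not installed."""
    try:
        return fastavro_is_pinned(metadata.version("fastavro"))
    except metadata.PackageNotFoundError:
        return None
```

In practice, pinning the dependency in the requirements file is the simpler route; a check like this is only useful as a runtime safeguard.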

List of Contributors

According to git shortlog, the following people contributed to the 2.51.0 release. Thank you to all contributors!

Adam Whitmore

Ahmed Abualsaud

Ahmet Altay

Aleksandr Dudko

Alexey Romanenko

Anand Inguva

Andrey Devyatkin

Arvind Ram

Arwin Tio

BjornPrime

Bruno Volpato

Bulat

Celeste Zeng

Chamikara Jayalath

Clay Johnson

Damon

Danny McCormick

David Cavazos

Dip Patel

Hai Joey Tran

Hao Xu

Haruka Abe

Jack Dingilian

Jack McCluskey

Jeff Kinard

Jeffrey Kinard

Joey Tran

Johanna Öjeling

Julien Tournay

Kenneth Knowles

Kerry Donny-Clark

Mattie Fu

Melissa Pashniak

Michel Davit

Moritz Mack

Pranav Bhandari

Rebecca Szper

Reeba Qureshi

Reuven Lax

Ritesh Ghorse

Robert Bradshaw

Robert Burke

Ruwann

Ryan Tam

Sam Rohde

Sereana Seim

Svetak Sundhar

Tim Grein

Udi Meiri

Valentyn Tymofieiev

Vitaly Terentyev

Vlado Djerek

Xinyu Liu

Yi Hu

Zbynek Konecny

Zechen Jiang

bzablocki

caneff

dependabot[bot]

gDuperran

gabry.wu

johnjcasey

kberezin-nshl

kennknowles

liferoad

lostluck

magicgoody

martin trieu

mosche

olalamichelle

tvalentyn

xqhu

Łukasz Spyra

beam - Beam 2.50.0 release

Published by lostluck about 1 year ago

We are happy to present the new 2.50.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.

For more information on changes in 2.50.0, check out the detailed release notes.

Highlights

  • Spark 3.2.2 is used as default version for Spark runner (#23804).
  • The Go SDK has a new default local runner, called Prism (#24789).
  • All Beam released container images are now multi-arch images that support both x86 and ARM CPU architectures.

I/Os

  • Java KafkaIO now supports picking up topics via topicPattern (#26948)
  • Support for read from Cosmos DB Core SQL API (#23604)
  • Upgraded to HBase 2.5.5 for HBaseIO. (Java) (#27711)
  • Added support for GoogleAdsIO source (Java) (#27681).

New Features / Improvements

  • The Go SDK now requires Go 1.20 to build. (#27558)
  • The Go SDK has a new default local runner, Prism. (#24789).
  • Hugging Face Model Handler for RunInference added to Python SDK. (#26632)
  • Hugging Face Pipelines support for RunInference added to Python SDK. (#27399)
  • Vertex AI Model Handler for RunInference now supports private endpoints (#27696)
  • MLTransform transform added with support for common ML pre/postprocessing operations (#26795)
  • Upgraded the Kryo extension for the Java SDK to Kryo 5.5.0. This brings in bug fixes, performance improvements, and serialization of Java 14 records. (#27635)
  • All Beam released container images are now multi-arch images that support both x86 and ARM CPU architectures. (#27674). The multi-arch container images include:
    • All versions of Go, Python, Java and Typescript SDK containers.
    • All versions of Flink job server containers.
    • Java and Python expansion service containers.
    • Transform service controller container.
    • Spark3 job server container.
  • Added support for batched writes to AWS SQS for improved throughput (Java, AWS 2) (#21429).

Breaking Changes

  • Python SDK: Legacy runner support removed from Dataflow, all pipelines must use runner v2.
  • Python SDK: Dataflow Runner will no longer stage Beam SDK from PyPI in the --staging_location at pipeline submission. Custom container images that are not based on Beam's default image must include Apache Beam installation (#26996).

Deprecations

  • The Go Direct Runner is now Deprecated. It remains available to reduce migration churn.
    • Tests can be set back to the direct runner by overriding TestMain: func TestMain(m *testing.M) { ptest.MainWithDefault(m, "direct") }
    • It's recommended to fix issues seen in tests using Prism, as they can also happen on any portable runner.
    • Use the generic register package for your pipeline DoFns to ensure pipelines function on portable runners, like prism.
    • Do not rely on closures or using package globals for DoFn configuration. They don't function on portable runners.

Bugfixes

  • Fixed DirectRunner bug in Python SDK where GroupByKey gets empty PCollection and fails when pipeline option direct_num_workers!=1 (#27373).
  • Fixed BigQuery I/O bug when estimating size on queries that utilize row-level security (#27474)

List of Contributors

According to git shortlog, the following people contributed to the 2.50.0 release. Thank you to all contributors!

Abacn

acejune

AdalbertMemSQL

ahmedabu98

Ahmed Abualsaud

al97

Aleksandr Dudko

Alexey Romanenko

Anand Inguva

Andrey Devyatkin

Anton Shalkovich

ArjunGHUB

Bjorn Pedersen

BjornPrime

Brett Morgan

Bruno Volpato

Buqian Zheng

Burke Davison

Byron Ellis

bzablocki

case-k

Celeste Zeng

Chamikara Jayalath

Clay Johnson

Connor Brett

Damon

Damon Douglas

Dan Hansen

Danny McCormick

Darkhan Nausharipov

Dip Patel

Dmytro Sadovnychyi

Florent Biville

Gabriel Lacroix

Hai Joey Tran

Hong Liang Teoh

Jack McCluskey

James Fricker

Jeff Kinard

Jeff Zhang

Jing

johnjcasey

jon esperanza

Josef Šimánek

Kenneth Knowles

Laksh

Liam Miller-Cushon

liferoad

magicgoody

Mahmud Ridwan

Manav Garg

Marco Vela

martin trieu

Mattie Fu

Michel Davit

Moritz Mack

mosche

Peter Sobot

Pranav Bhandari

Reeba Qureshi

Reuven Lax

Ritesh Ghorse

Robert Bradshaw

Robert Burke

RyuSA

Saba Sathya

Sam Whittle

Steven Niemitz

Steven van Rossum

Svetak Sundhar

Tony Tang

Valentyn Tymofieiev

Vitaly Terentyev

Vlado Djerek

Yichi Zhang

Yi Hu

Zechen Jiang

beam - Beam 2.49.0 release

Published by Abacn over 1 year ago

We are happy to present the new 2.49.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.

For more information on changes in 2.49.0, check out the detailed release notes.

I/Os

  • Support for Bigtable Change Streams added in Java BigtableIO.ReadChangeStream (#27183).
  • Added Bigtable Read and Write cross-language transforms to Python SDK (#26593, #27146).

New Features / Improvements

  • Allow prebuilding large images when using --prebuild_sdk_container_engine=cloud_build, like images depending on tensorflow or torch (#27023).
  • Disabled pip cache when installing packages on the workers. This reduces the size of prebuilt Python container images (#27035).
  • Select dedicated avro datum reader and writer (Java) (#18874).
  • Timer API for the Go SDK (Go) (#22737).

Deprecations

  • Remove Python 3.7 support. (#26447)

Bugfixes

  • Fixed KinesisIO NullPointerException when a progress check is made before the reader is started (IO) (#23868)

Known Issues

List of Contributors

According to git shortlog, the following people contributed to the 2.49.0 release. Thank you to all contributors!

Abzal Tuganbay

AdalbertMemSQL

Ahmed Abualsaud

Ahmet Altay

Alan Zhang

Alexey Romanenko

Anand Inguva

Andrei Gurau

Arwin Tio

Bartosz Zablocki

Bruno Volpato

Burke Davison

Byron Ellis

Chamikara Jayalath

Charles Rothrock

Chris Gavin

Claire McGinty

Clay Johnson

Damon

Daniel Dopierała

Danny McCormick

Darkhan Nausharipov

David Cavazos

Dip Patel

Dmitry Repin

Gavin McDonald

Jack Dingilian

Jack McCluskey

James Fricker

Jan Lukavský

Jasper Van den Bossche

John Casey

John Gill

Joseph Crowley

Kanishk Karanawat

Katie Liu

Kenneth Knowles

Kyle Galloway

Liam Miller-Cushon

MakarkinSAkvelon

Masato Nakamura

Mattie Fu

Michel Davit

Naireen Hussain

Nathaniel Young

Nelson Osacky

Nick Li

Oleh Borysevych

Pablo Estrada

Reeba Qureshi

Reuven Lax

Ritesh Ghorse

Robert Bradshaw

Robert Burke

Rouslan

Saadat Su

Sam Rohde

Sam Whittle

Sanil Jain

Shunping Huang

Smeet nagda

Svetak Sundhar

Timur Sultanov

Udi Meiri

Valentyn Tymofieiev

Vlado Djerek

WuA

XQ Hu

Xianhua Liu

Xinyu Liu

Yi Hu

Zachary Houfek

alexeyinkin

bigduu

bullet03

bzablocki

jonathan-lemos

jubebo

magicgoody

ruslan-ikhsan

sultanalieva-s

vitaly.terentyev

beam - Beam 2.48.0 release

Published by riteshghorse over 1 year ago

We are happy to present the new 2.48.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.

For more information on changes in 2.48.0, check out the detailed release notes.

Note: The release tag for the Go SDK for this release is sdks/v2.48.2 instead of sdks/v2.48.0 because an incorrect commit was attached to the release tag sdks/v2.48.0.

Highlights

  • "Experimental" annotation cleanup: the annotation and concept have been removed from Beam to avoid
    the misperception of code as "not ready". Any proposed breaking changes will be subject to
    case-by-case pro/con decision making (and generally avoided) rather than using the "Experimental"
    annotation to allow them.

I/Os

  • Added rename for GCS and copy for local filesystem (Go) (#25779).
  • Added support for enhanced fan-out in KinesisIO.Read (Java) (#19967).
    • This change is not compatible with Flink savepoints created by Beam 2.46.0 applications which had KinesisIO sources.
  • Added textio.ReadWithFilename transform (Go) (#25812).
  • Added fileio.MatchContinuously transform (Go) (#26186).

New Features / Improvements

  • Allow passing service name for google-cloud-profiler (Python) (#26280).
  • Dead letter queue support added to RunInference in Python (#24209).
  • Support added for defining pre/postprocessing operations on the RunInference transform (#26308)
  • Adds a Docker Compose based transform service that can be used to discover and use portable Beam transforms (#26023).

Breaking Changes

  • Passing a tag into MultiProcessShared is now required in the Python SDK (#26168).
  • CloudDebuggerOptions is removed (deprecated in Beam v2.47.0) for Dataflow runner as the Google Cloud Debugger service is shutting down. (Java) (#25959).
  • AWS 2 client providers (deprecated in Beam v2.38.0) are finally removed (#26681).
  • AWS 2 SnsIO.writeAsync (deprecated in Beam v2.37.0 due to risk of data loss) was finally removed (#26710).
  • AWS 2 coders (deprecated in Beam v2.43.0 when adding Schema support for AWS Sdk Pojos) are finally removed (#23315).

Bugfixes

  • Fixed Java bootloader failing with Too Long Args due to long classpaths, with a pathing jar. (Java) (#25582).

List of Contributors

According to git shortlog, the following people contributed to the 2.48.0 release. Thank you to all contributors!

Abzal Tuganbay

Ahmed Abualsaud

Alexey Romanenko

Anand Inguva

Andrei Gurau

Andrey Devyatkin

Balázs Németh

Bazyli Polednia

Bruno Volpato

Chamikara Jayalath

Clay Johnson

Damon

Daniel Arn

Danny McCormick

Darkhan Nausharipov

Dip Patel

Dmitry Repin

George Novitskiy

Israel Herraiz

Jack Dingilian

Jack McCluskey

Jan Lukavský

Jasper Van den Bossche

Jeff Zhang

Jeremy Edwards

Johanna Öjeling

John Casey

Katie Liu

Kenneth Knowles

Kerry Donny-Clark

Kuba Rauch

Liam Miller-Cushon

MakarkinSAkvelon

Mattie Fu

Michel Davit

Moritz Mack

Nick Li

Oleh Borysevych

Pablo Estrada

Pranav Bhandari

Pranjal Joshi

Rebecca Szper

Reuven Lax

Ritesh Ghorse

Robert Bradshaw

Robert Burke

Rouslan

RuiLong J

RyujiTamaki

Sam Whittle

Sanil Jain

Svetak Sundhar

Timur Sultanov

Tony Tang

Udi Meiri

Valentyn Tymofieiev

Vishal Bhise

Vitaly Terentyev

Xinyu Liu

Yi Hu

bullet03

darshan-sj

kellen

liferoad

mokamoka03210120

psolomin

beam - Beam 2.47.0 release

Published by jrmccluskey over 1 year ago

We are happy to present the new 2.47.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.

For more information on changes in 2.47.0, check out the detailed release notes.

Highlights

  • Apache Beam adds Python 3.11 support (#23848).

I/Os

  • BigQuery Storage Write API is now available in Python SDK via cross-language (#21961).
  • Added HbaseIO support for writing RowMutations (ordered by rowkey) to Hbase (Java) (#25830).
  • Added fileio transforms MatchFiles, MatchAll and ReadMatches (Go) (#25779).
  • Add integration test for JmsIO + fix issue with multiple connections (Java) (#25887).

New Features / Improvements

  • The Flink runner now supports Flink 1.16.x (#25046).
  • Schema'd PTransforms can now be directly applied to Beam dataframes just like PCollections.
    (Note that when doing multiple operations, it may be more efficient to explicitly chain the operations
    like df | (Transform1 | Transform2 | ...) to avoid excessive conversions.)
  • The Go SDK adds new transforms periodic.Impulse and periodic.Sequence that extends support
    for slowly updating side input patterns. (#23106)
  • Python SDK now supports protobuf <4.23.0 (#24599)
  • Several Google client libraries in Python SDK dependency chain were updated to latest available major versions. (#24599)

Breaking Changes

  • If a main session fails to load, the pipeline will now fail at worker startup. (#25401).
  • Python pipeline options will now ignore unparsed command line flags prefixed with a single dash. (#25943).
  • The SmallestPerKey combiner now requires keyword-only arguments for specifying optional parameters, such as key and reverse. (#25888).
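
For readers unfamiliar with keyword-only arguments, the sketch below shows the calling convention in plain Python. The function smallest is a hypothetical stand-in built on heapq, not Beam's SmallestPerKey, and the meaning given to reverse here (return the largest values instead) is an assumption of this example only.

```python
import heapq


def smallest(values, n, *, key=None, reverse=False):
    """Return the n smallest values, or the n largest when reverse=True.

    Hypothetical stand-in for illustration. The bare "*" in the signature
    makes key and reverse keyword-only.
    """
    pick = heapq.nlargest if reverse else heapq.nsmallest
    return pick(n, values, key=key)


# Optional parameters omitted.
print(smallest([5, 1, 4, 2], 2))

# Optional parameter passed by name: accepted.
print(smallest(["bb", "a", "ccc"], 1, key=len))

# Optional parameters passed positionally: rejected with a TypeError.
try:
    smallest([5, 1, 4, 2], 2, None, True)
except TypeError as err:
    print("rejected:", err)
```

Code that previously passed such optional parameters positionally would need to name them after this change.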

Deprecations

  • Cloud Debugger support and its pipeline options are deprecated and will be removed in the next Beam version,
    in response to the Google Cloud Debugger service turning down.
    (Java) (#25959).

Bugfixes

  • BigQuery sink in STORAGE_WRITE_API mode in batch pipelines might result in data consistency issues during the handling of other unrelated transient errors for Beam SDKs 2.35.0 - 2.46.0 (inclusive). For more details see: https://github.com/apache/beam/issues/26521

List of Contributors

According to git shortlog, the following people contributed to the 2.47.0 release. Thank you to all contributors!

Ahmed Abualsaud

Ahmet Altay

Alexey Romanenko

Amir Fayazi

Amrane Ait Zeouay

Anand Inguva

Andrew Pilloud

Andrey Kot

Bjorn Pedersen

Bruno Volpato

Buqian Zheng

Chamikara Jayalath

ChangyuLi28

Damon

Danny McCormick

Dmitry Repin

George Ma

Jack Dingilian

Jack McCluskey

Jasper Van den Bossche

Jeremy Edwards

Jiangjie (Becket) Qin

Johanna Öjeling

Juta Staes

Kenneth Knowles

Kyle Weaver

Mattie Fu

Moritz Mack

Nick Li

Oleh Borysevych

Pablo Estrada

Rebecca Szper

Reuven Lax

Reza Rokni

Ritesh Ghorse

Robert Bradshaw

Robert Burke

Saadat Su

Saifuddin53

Sam Rohde

Shubham Krishna

Svetak Sundhar

Theodore Ni

Thomas Gaddy

Timur Sultanov

Udi Meiri

Valentyn Tymofieiev

Xinyu Liu

Yanan Hao

Yi Hu

Yuvi Panda

andres-vv

bochap

dannikay

darshan-sj

dependabot[bot]

harrisonlimh

hnnsgstfssn

jrmccluskey

liferoad

tvalentyn

xianhualiu

zhangskz

beam - Beam 2.46.0 release

Published by damccorm over 1 year ago

We are happy to present the new 2.46.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.

For more information on changes in 2.46.0, check out the detailed release notes.

Highlights

  • Java SDK containers migrated to Eclipse Temurin
    as a base. This change migrates away from the deprecated OpenJDK
    container. Eclipse Temurin is currently based upon Ubuntu 22.04 while the OpenJDK
    container was based upon Debian 11.
  • RunInference PTransform will accept model paths as SideInputs in Python SDK. (#24042)
  • RunInference supports ONNX runtime in Python SDK (#22972)
  • Tensorflow Model Handler for RunInference in Python SDK (#25366)
  • Java SDK modules migrated to use :sdks:java:extensions:avro (#24748)

I/Os

  • Added in JmsIO a retry policy for failed publications (Java) (#24971).
  • Support for LZMA compression/decompression of text files added to the Python SDK (#25316)
  • Added ReadFrom/WriteTo Csv/Json as top-level transforms to the Python SDK.

New Features / Improvements

  • Add UDF metrics support for Samza portable mode.
  • Option for SparkRunner to avoid the need of SDF output to fit in memory (#23852).
    This helps e.g. with ParquetIO reads. Turn the feature on by adding experiment use_bounded_concurrent_output_for_sdf.
  • Add WatchFilePattern transform, which can be used as a side input to the RunInference PTransform to watch for model updates using a file pattern. (#24042)
  • Add support for loading TorchScript models with PytorchModelHandler. The TorchScript model path can be
    passed to PytorchModelHandler using torch_script_model_path=<path_to_model>. (#25321)
  • The Go SDK now requires Go 1.19 to build. (#25545)
  • The Go SDK now has an initial native Go implementation of a portable Beam Runner called Prism. (#24789)

Breaking Changes

  • The deprecated SparkRunner for Spark 2 (see 2.41.0) was removed (#25263).
  • Python's BatchElements performs more aggressive batching in some cases,
    capping at 10 second rather than 1 second batches by default and excluding
    fixed cost in this computation to better handle cases where the fixed cost
    is larger than a single second. To get the old behavior, one can pass
    target_batch_duration_secs_including_fixed_cost=1 to BatchElements.
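
To see why excluding the fixed cost matters, consider a simplified linear cost model in which a batch takes a fixed overhead plus a per-element cost. This model, the function name, and the numbers below are assumptions for illustration and are not Beam's actual estimator.

```python
def max_batch_size(fixed_cost_secs, per_element_secs, target_secs, include_fixed_cost):
    """Largest batch size whose estimated duration stays within target_secs.

    Assumes duration = fixed_cost_secs + per_element_secs * batch_size.
    """
    budget = target_secs - fixed_cost_secs if include_fixed_cost else target_secs
    # Never go below one element per batch, even if the budget is negative.
    return max(1, int(budget / per_element_secs))


# Hypothetical workload: 2 s of fixed overhead per batch, 0.25 s per element.
old_style = max_batch_size(2.0, 0.25, target_secs=1, include_fixed_cost=True)
new_style = max_batch_size(2.0, 0.25, target_secs=10, include_fixed_cost=False)
```

Under these assumptions, a 1-second cap that includes the fixed cost can never be met, so batches collapse to a single element, while a 10-second cap on the per-element portion allows batches of 40.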

Deprecations

  • Avro-related classes in module beam-sdks-java-core are deprecated and will eventually be removed. Please migrate to the new module beam-sdks-java-extensions-avro by importing the classes from the org.apache.beam.sdk.extensions.avro package.
    To simplify migration, the relative package path and the whole class hierarchy of the Avro-related classes are preserved in the new module.
    For example, import org.apache.beam.sdk.extensions.avro.coders.AvroCoder instead of org.apache.beam.sdk.coders.AvroCoder. (#24749).
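
Because the relative package path is preserved, the new fully qualified name can be derived mechanically from the old one. The helper below is a minimal sketch of that mapping; the function name is hypothetical, and it must only be applied to Avro-related classes, since other core classes have not moved.

```python
OLD_ROOT = "org.apache.beam.sdk."
NEW_ROOT = "org.apache.beam.sdk.extensions.avro."


def migrated_avro_class(old_name):
    """Map an Avro-related class name from beam-sdks-java-core to the new module.

    Only pass Avro-related classes; this helper does not check that itself.
    """
    if old_name.startswith(NEW_ROOT):
        return old_name  # already migrated
    if not old_name.startswith(OLD_ROOT):
        raise ValueError(f"not a Beam SDK class: {old_name}")
    # Keep the relative path (e.g. "coders.AvroCoder") and swap the root.
    return NEW_ROOT + old_name[len(OLD_ROOT):]


new_name = migrated_avro_class("org.apache.beam.sdk.coders.AvroCoder")
```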

List of Contributors

According to git shortlog, the following people contributed to the 2.46.0 release. Thank you to all contributors!

Ahmet Altay

Alan Zhang

Alexey Romanenko

Amrane Ait Zeouay

Anand Inguva

Andrew Pilloud

Brian Hulette

Bruno Volpato

Byron Ellis

Chamikara Jayalath

Damon

Danny McCormick

Darkhan Nausharipov

David Katz

Dmitry Repin

Doug Judd

Egbert van der Wal

Elizaveta Lomteva

Evan Galpin

Herman Mak

Jack McCluskey

Jan Lukavský

Johanna Öjeling

John Casey

Jozef Vilcek

Junhao Liu

Juta Staes

Katie Liu

Kiley Sok

Liam Miller-Cushon

Luke Cwik

Moritz Mack

Ning Kang

Oleh Borysevych

Pablo E

Pablo Estrada

Reuven Lax

Ritesh Ghorse

Robert Bradshaw

Robert Burke

Ruslan Altynnikov

Ryan Zhang

Sam Rohde

Sam Whittle

Sam sam

Sergei Lilichenko

Shivam

Shubham Krishna

Theodore Ni

Timur Sultanov

Tony Tang

Vachan

Veronica Wasson

Vincent Devillers

Vitaly Terentyev

William Ross Morrow

Xinyu Liu

Yi Hu

ZhengLin Li

Ziqi Ma

ahmedabu98

alexeyinkin

aliftadvantage

bullet03

dannikay

darshan-sj

dependabot[bot]

johnjcasey

kamrankoupayi

kileys

liferoad

nancyxu123

nickuncaged1201

pablo rodriguez defino

tvalentyn

xqhu

beam - Beam 2.45.0 release

Published by johnjcasey over 1 year ago

We are happy to present the new 2.45.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.

For more information on changes in 2.45.0, check out the detailed release notes.

I/Os

  • MongoDB IO connector added (Go) (#24575).

New Features / Improvements

  • RunInference Wrapper with Sklearn Model Handler support added in Go SDK (#24497).
  • Adding override of allowed TLS algorithms (Java), now maintaining the disabled/legacy algorithms
    present in 2.43.0 (up to 1.8.0_342, 11.0.16, 17.0.2 for respective Java versions). This is accompanied
    by an explicit re-enabling of TLSv1 and TLSv1.1 for Java 8 and Java 11.
  • Add UDF metrics support for Samza portable mode.

Breaking Changes

  • Portable Java pipelines, Go pipelines, Python streaming pipelines, and portable Python batch
    pipelines on Dataflow are required to use Runner V2. The disable_runner_v2,
    disable_runner_v2_until_2023, disable_prime_runner_v2 experiments will raise an error during
    pipeline construction. You can no longer specify the Dataflow worker jar override. Note that
    non-portable Java jobs and non-portable Python batch jobs are not impacted. (#24515).

Bugfixes

  • Avoids Cassandra syntax error when user-defined query has no where clause in it (Java) (#24829).
  • Fixed JDBC connection failures (Java) during handshake due to deprecated TLSv1(.1) protocol for the JDK. (#24623)
  • Fixed an issue where Python BigQuery Batch Load writes could truncate valid data when the disposition is set to WRITE_TRUNCATE and the incoming data is large (Python) (#24623).
  • Fixed Kafka watermark issue with sparse data on many partitions (#24205)

List of Contributors

According to git shortlog, the following people contributed to the 2.45.0 release. Thank you to all contributors!

AdalbertMemSQL

Ahmed Abualsaud

Ahmet Altay

Alexey Romanenko

Anand Inguva

Andrea Nardelli

Andrei Gurau

Andrew Pilloud

Benjamin Gonzalez

BjornPrime

Brian Hulette

Bulat

Byron Ellis

Chamikara Jayalath

Charles Rothrock

Damon

Daniela Martín

Danny McCormick

Darkhan Nausharipov

Dejan Spasic

Diego Gomez

Dmitry Repin

Doug Judd

Elias Segundo Antonio

Evan Galpin

Evgeny Antyshev

Fernando Morales

Jack McCluskey

Johanna Öjeling

John Casey

Junhao Liu

Kanishk Karanawat

Kenneth Knowles

Kiley Sok

Liam Miller-Cushon

Lucas Marques

Luke Cwik

MakarkinSAkvelon

Marco Robles

Mark Zitnik

Melanie

Moritz Mack

Ning Kang

Oleh Borysevych

Pablo Estrada

Philippe Moussalli

Piyush Sagar

Rebecca Szper

Reuven Lax

Rick Viscomi

Ritesh Ghorse

Robert Bradshaw

Robert Burke

Sam Whittle

Sergei Lilichenko

Seung Jin An

Shane Hansen

Sho Nakatani

Shunya Ueta

Siddharth Agrawal

Timur Sultanov

Veronica Wasson

Vitaly Terentyev

Xinbin Huang

Xinyu Liu

Xinyue Zhang

Yi Hu

ZhengLin Li

alexeyinkin

andoni-guzman

andthezhang

bullet03

camphillips22

gabihodoroaga

harrisonlimh

pablo rodriguez defino

ruslan-ikhsan

tvalentyn

yyy1000

zhengbuqian

beam -

Published by kennknowles over 1 year ago

We are happy to present the new 2.44.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.

For more information on changes in 2.44.0, check out the detailed release notes.

I/Os

  • Support for Bigtable sink (Write and WriteBatch) added (Go) (#23324).
  • S3 implementation of the Beam filesystem (Go) (#23991).
  • Support for SingleStoreDB source and sink added (Java) (#22617).
  • Added support for DefaultAzureCredential authentication in Azure Filesystem (Python) (#24210).
  • Added new CdapIO for CDAP Batch and Streaming Source/Sinks (Java) (#24961).
  • Added new SparkReceiverIO for Spark Receivers 2.4.* (Java) (#24960).

New Features / Improvements

  • Beam now provides a portable "runner" that can render pipeline graphs with
    graphviz. See python -m apache_beam.runners.render --help for more details.
  • Local packages can now be used as dependencies in the requirements.txt file, rather
    than requiring them to be passed separately via the --extra_package option
    (Python) (#23684).
  • Pipeline Resource Hints now supported via --resource_hints flag (Go) (#23990).
  • Make Python SDK containers reusable on portable runners by installing dependencies to temporary venvs (BEAM-12792).
  • RunInference model handlers now support the specification of a custom inference function in Python (#22572)
  • Support for map_windows urn added to Go SDK (#24307).

Breaking Changes

  • ParquetIO.withSplit was removed since splittable reading has been the default behavior since 2.35.0. The effect of
    this change is to drop support for non-splittable reading (Java) (#23832).
  • beam-sdks-java-extensions-google-cloud-platform-core is no longer a
    dependency of the Java SDK Harness. Some users of a portable runner (such as Dataflow Runner v2)
    may have an undeclared dependency on this package (for example using GCS with
    TextIO) and will now need to declare the dependency.
  • beam-sdks-java-core is no longer a dependency of the Java SDK Harness. Users of a portable
    runner (such as Dataflow Runner v2) will need to provide this package and its dependencies.
  • Slices now use the Beam Iterable Coder. This enables cross language use, but breaks pipeline updates
    if a Slice type is used as a PCollection element or State API element. (Go) (#24339)

Bugfixes

  • Fixed JmsIO acknowledgment issue (Java) (#20814)
  • Fixed an issue where Beam SQL CalciteUtils (Java) and cross-language JdbcIO (Python) did not support JDBC CHAR/VARCHAR and BINARY/VARBINARY logical types (#23747, #23526).
  • Ensure iterated and emitted types used with the generic register package are registered with the type and schema registries. (Go) (#23889)

List of Contributors

According to git shortlog, the following people contributed to the 2.44.0 release. Thank you to all contributors!

Ahmed Abualsaud
Ahmet Altay
Alex Merose
Alexey Inkin
Alexey Romanenko
Anand Inguva
Andrei Gurau
Andrej Galad
Andrew Pilloud
Ayush Sharma
Benjamin Gonzalez
Bjorn Pedersen
Brian Hulette
Bruno Volpato
Bulat Safiullin
Chamikara Jayalath
Chris Gavin
Damon Douglas
Danielle Syse
Danny McCormick
Darkhan Nausharipov
David Cavazos
Dmitry Repin
Doug Judd
Elias Segundo Antonio
Evan Galpin
Evgeny Antyshev
Heejong Lee
Henrik Heggelund-Berg
Israel Herraiz
Jack McCluskey
Jan Lukavský
Janek Bevendorff
Johanna Öjeling
John J. Casey
Jozef Vilcek
Kanishk Karanawat
Kenneth Knowles
Kiley Sok
Laksh
Liam Miller-Cushon
Luke Cwik
MakarkinSAkvelon
Minbo Bae
Moritz Mack
Nancy Xu
Ning Kang
Nivaldo Tokuda
Oleh Borysevych
Pablo Estrada
Philippe Moussalli
Pranav Bhandari
Rebecca Szper
Reuven Lax
Rick Smit
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Ryan Thompson
Sam Whittle
Sanil Jain
Scott Strong
Shubham Krishna
Steven van Rossum
Svetak Sundhar
Thiago Nunes
Tianyang Hu
Trevor Gevers
Valentyn Tymofieiev
Vitaly Terentyev
Vladislav Chunikhin
Xinyu Liu
Yi Hu
Yichi Zhang

AdalbertMemSQL
agvdndor
andremissaglia
arne-alex
bullet03
camphillips22
capthiron
creste
fab-jul
illoise
kn1kn1
nancyxu123
peridotml
shinannegans
smeet07

beam - Beam 2.43.0 release

Published by chamikaramj almost 2 years ago

We are happy to present the new 2.43.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.

For more information on changes in 2.43.0, check out the detailed release notes.

Highlights

  • Python 3.10 support in Apache Beam (#21458).
  • An initial implementation of a runner that allows us to run Beam pipelines on Dask. Try it out and give us feedback! (Python) (#18962).

I/Os

  • Decreased TextSource CPU utilization by 2.3x (Java) (#23193).
  • Fixed bug when using SpannerIO with RuntimeValueProvider options (Java) (#22146).
  • Fixed issue for unicode rendering on WriteToBigQuery (#10785)
  • Remove obsolete variants of BigQuery Read and Write, always using Beam-native variant
    (#23564 and #23559).
  • Bumped google-cloud-spanner dependency version to 3.x for Python SDK (#21198).

New Features / Improvements

  • Dataframe wrapper added in Go SDK via Cross-Language (with automatic expansion service). (Go) (#23384).
  • Name all Java threads to aid in debugging (#23049).
  • An initial implementation of a runner that allows us to run Beam pipelines on Dask. (Python) (#18962).
  • Allow configuring GCP OAuth scopes via pipeline options. This unblocks usages of Beam IOs that require additional scopes.
    For example, this feature makes it possible to access Google Drive backed tables in BigQuery (#23290).
  • An example for using Python RunInference from Java (#23290).

Breaking Changes

  • The CoGroupByKey transform in the Python SDK has changed its output typehint. The typehint component representing grouped values changed from List to Iterable,
    which more accurately reflects the nature of the arbitrarily large output collection (#21556). Beam users may see an error on transforms downstream from CoGroupByKey. Users must change methods expecting a List to expect an Iterable going forward. See document for information and fixes.
  • The PortableRunner for Spark assumes Spark 3 as default Spark major version unless configured otherwise using --spark_version.
    Spark 2 support is deprecated and will be removed soon (#23728).
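
The CoGroupByKey change concerns the declared typehint, but the practical difference between List and Iterable can be shown in plain Python. The sketch below uses hypothetical functions and no Beam code: logic that relies on list-only operations such as len() fails on a lazy iterable, while logic that only iterates (or materializes the values first) works for both.

```python
from typing import Iterable, List


def count_with_len(values: List[int]) -> int:
    # Written against the old List typehint: relies on len().
    return len(values)


def count_by_iterating(values: Iterable[int]) -> int:
    # Works for any Iterable, including lazy ones.
    return sum(1 for _ in values)


def lazy_values() -> Iterable[int]:
    # Stand-in for a grouped-values iterable that is not a list.
    return (v for v in (10, 20, 30))


try:
    count_with_len(lazy_values())
    len_worked = True
except TypeError:
    len_worked = False

iterated = count_by_iterating(lazy_values())
# Materializing is an option when the values are known to fit in memory.
materialized = count_with_len(list(lazy_values()))
```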

Bugfixes

  • Fixed an issue where the Python cross-language JDBC IO connector could not read or write rows containing Numeric/Decimal type values (#19817).

List of Contributors

According to git shortlog, the following people contributed to the 2.43.0 release. Thank you to all contributors!

Ahmed Abualsaud
AlexZMLyu
Alexey Romanenko
Anand Inguva
Andrew Pilloud
Andy Ye
Arnout Engelen
Benjamin Gonzalez
Bharath Kumarasubramanian
BjornPrime
Brian Hulette
Bruno Volpato
Chamikara Jayalath
Colin Versteeg
Damon
Daniel Smilkov
Daniela Martín
Danny McCormick
Darkhan Nausharipov
David Huntsperger
Denis Pyshev
Dmitry Repin
Evan Galpin
Evgeny Antyshev
Fernando Morales
Geddy05
Harshit Mehrotra
Iñigo San Jose Visiers
Ismaël Mejía
Israel Herraiz
Jan Lukavský
Juta Staes
Kanishk Karanawat
Kenneth Knowles
KevinGG
Kiley Sok
Liam Miller-Cushon
Luke Cwik
Mc
Melissa Pashniak
Moritz Mack
Ning Kang
Pablo Estrada
Philippe Moussalli
Pranav Bhandari
Rebecca Szper
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Ryan Thompson
Ryohei Nagao
Sam Rohde
Sam Whittle
Sanil Jain
Seunghwan Hong
Shane Hansen
Shubham Krishna
Shunsuke Otani
Steve Niemitz
Steven van Rossum
Svetak Sundhar
Thiago Nunes
Toran Sahu
Veronica Wasson
Vitaly Terentyev
Vladislav Chunikhin
Xinyu Liu
Yi Hu
Yixiao Shen
alexeyinkin
arne-alex
azhurkevich
bulat safiullin
bullet03
coldWater
dpcollins-google
egalpin
johnjcasey
liferoad
rvballada
shaojwu
tvalentyn

Full Changelog: https://github.com/apache/beam/compare/v2.42.0...v2.43.0

beam - Beam 2.42.0 release

Published by lostluck about 2 years ago

We are happy to present the new 2.42.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.

For more information on changes in 2.42.0, check out the detailed release notes.

Highlights

  • Added support for stateful DoFns to the Go SDK.

New Features / Improvements

  • Added support for Zstd compression to the Python SDK.
  • Added support for Google Cloud Profiler to the Go SDK.
  • Added support for stateful DoFns to the Go SDK.

Breaking Changes

  • The Go SDK's Row Coder now uses a different single-precision float encoding for float32 types to match Java's behavior (#22629).

Bugfixes

  • Fixed an issue where the Python cross-language JDBC IO connector could not read or write rows containing Timestamp type values (#19817).

Known Issues

  • Go SDK doesn't yet support Slowly Changing Side Input pattern (#23106)
  • See a full list of open issues that affect this version.

Full Changelog: https://github.com/apache/beam/compare/v2.41.0...v2.42.0

beam - Beam 2.41.0 release

Published by kileys about 2 years ago

We are happy to present the new 2.41.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.

For more information on changes in 2.41.0, check out the detailed release notes.

I/Os

  • Projection Pushdown optimizer is now on by default for streaming, matching the behavior of batch pipelines since 2.38.0. If you encounter a bug with the optimizer, please file an issue and disable the optimizer using pipeline option --experiments=disable_projection_pushdown.

New Features / Improvements

  • The Python SDK now supports logging level overrides per module, a feature previously available only in the Java SDK (#18222).
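
The release note does not spell out the pipeline option syntax, so the following sketch only illustrates the underlying idea with the standard library logging module: a default level applies everywhere, and named modules get their own level. The module names and levels are hypothetical.

```python
import logging

# Hypothetical module names and levels, chosen for illustration.
overrides = {"my_pipeline.io": "DEBUG", "noisy_dependency": "ERROR"}

logging.getLogger().setLevel(logging.INFO)  # default level for everything else
for module_name, level_name in overrides.items():
    # Each named logger (and its children) uses the overridden level.
    logging.getLogger(module_name).setLevel(getattr(logging, level_name))
```

Loggers without an override, and children of overridden loggers, inherit the nearest level set above them in the logger hierarchy.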

Breaking Changes

  • Projection Pushdown optimizer may break Dataflow upgrade compatibility for optimized pipelines when it removes unused fields. If you need to upgrade and encounter a compatibility issue, disable the optimizer using pipeline option --experiments=disable_projection_pushdown.

Deprecations

  • Support for Spark 2.4.x is deprecated and will be dropped with the release of Beam 2.44.0 or soon after (Spark runner) (#22094).
  • The modules amazon-web-services and
    kinesis for AWS Java SDK v1 are deprecated
    in favor of amazon-web-services2
    and will be eventually removed after a few Beam releases (Java) (#21249).

Bugfixes

  • Fixed a condition where retrying queries would yield an incorrect cursor in the Java SDK Firestore Connector (#22089).
  • Fixed plumbing of allowed lateness in the Go SDK. Previously the user-set value was ignored and allowed lateness was always set to 0. (#22474).

Known Issues

List of Contributors

According to git shortlog, the following people contributed to the 2.41.0 release. Thank you to all contributors!

Ahmed Abualsaud
Ahmet Altay
akashorabek
Alexey Inkin
Alexey Romanenko
Anand Inguva
andoni-guzman
Andrew Pilloud
Andrey
Andy Ye
Balázs Németh
Benjamin Gonzalez
BjornPrime
Brian Hulette
bulat safiullin
bullet03
Byron Ellis
Chamikara Jayalath
Damon Douglas
Daniel Oliveira
Daniel Thevessen
Danny McCormick
David Huntsperger
Dheeraj Gharde
Etienne Chauchot
Evan Galpin
Fernando Morales
Heejong Lee
Jack McCluskey
johnjcasey
Kenneth Knowles
Ke Wu
Kiley Sok
Liam Miller-Cushon
Lucas Nogueira
Luke Cwik
MakarkinSAkvelon
Manu Zhang
Minbo Bae
Moritz Mack
Naireen Hussain
Ning Kang
Oleh Borysevych
Pablo Estrada
pablo rodriguez defino
Pranav Bhandari
Rebecca Szper
Red Daly
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Ryan Thompson
Sam Whittle
Steven Niemitz
Valentyn Tymofieiev
Vincent Marquez
Vitaly Terentyev
Vlad
Vladislav Chunikhin
Yichi Zhang
Yi Hu
yirutang
Yixiao Shen
Yu Feng

beam - v2.40.0

Published by pabloem over 2 years ago

We are happy to present the new 2.40.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this
release.

For more information on changes in 2.40.0 check out the detailed release notes.

Highlights

  • Added RunInference API, a framework agnostic transform for inference. With this release, PyTorch and Scikit-learn are supported by the transform.
    See also example at apache_beam/examples/inference/pytorch_image_classification.py

I/Os

  • Upgraded to Hive 3.1.3 for HCatalogIO. Users can still provide their own version of Hive. (Java) (Issue-19554).

New Features / Improvements

  • Go SDK users can now use generic registration functions to optimize their DoFn execution. (BEAM-14347)
  • Go SDK users may now write self-checkpointing Splittable DoFns to read from streaming sources. (BEAM-11104)
  • Go SDK textio Reads have been moved to Splittable DoFns exclusively. (BEAM-14489)
  • Pipeline drain support added for Go SDK has now been tested. (BEAM-11106)
  • Go SDK users can now see heap usage, sideinput cache stats, and active process bundle stats in Worker Status. (BEAM-13829)
  • The serialization (pickling) library for Python is dill==0.3.1.1 (BEAM-11167)

Breaking Changes

  • The Go SDK now requires a minimum version of 1.18 in order to support generics (BEAM-14347).
  • synthetic.SourceConfig field types have changed to int64 from int for better compatibility with Flink's use of Logical types in Schemas (Go) (BEAM-14173)
  • Default coder updated to compress sources used with BoundedSourceAsSDFWrapperFn and UnboundedSourceAsSDFWrapper.

Bugfixes

  • Fixed Java expansion service to allow specific files to stage (BEAM-14160).
  • Fixed Elasticsearch connection when using both ssl and username/password (Java) (BEAM-14000)

Full Changelog: https://github.com/apache/beam/compare/v2.39.0...v2.40.0-RC1

Full Changelog: https://github.com/apache/beam/compare/v2.39.0...v2.40.0-RC2

beam - Beam 2.39.0 release

Published by y1chi over 2 years ago

We are happy to present the new 2.39.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this
release.

For more information on changes in 2.39.0 check out the detailed release notes.

I/Os

  • JmsIO gains the ability to map any kind of input to any subclass of javax.jms.Message (Java) (BEAM-16308).
  • JmsIO introduces the ability to write to dynamic topics (Java) (BEAM-16308).
    • A topicNameMapper must be set to extract the topic name from the input value.
    • A valueMapper must be set to convert the input value to JMS message.
  • Reduce number of threads spawned by BigqueryIO StreamingInserts (BEAM-14283).
  • Implemented Apache PulsarIO (BEAM-8218).

New Features / Improvements

  • Added support for Flink with Scala 2.12, because most libraries support Scala 2.12 onwards (BEAM-14386).
  • 'Manage Clusters' JupyterLab extension added for users to configure usage of Dataproc clusters managed by Interactive Beam (Python) (BEAM-14130).
  • Pipeline drain support added for Go SDK (BEAM-11106). Note: this feature is not yet fully validated and should be treated as experimental in this release.
  • DataFrame.unstack(), DataFrame.pivot() and Series.unstack()
    implemented for DataFrame API (BEAM-13948, BEAM-13966).
  • Support for impersonation credentials added to dataflow runner in the Java and Python SDK (BEAM-14014).
  • Implemented Jupyterlab extension for managing Dataproc clusters (BEAM-14130).
  • ExternalPythonTransform API added for easily invoking Python transforms from
    Java (BEAM-14143).
  • Added support for Elasticsearch 8.x (BEAM-14003).
  • Shard aware Kinesis record aggregation (AWS Sdk v2), (BEAM-14104).
  • Upgrade to ZetaSQL 2022.04.1 (BEAM-14348).
  • Fixed an issue where ReadFromBigQuery could not be used with the interactive runner (BEAM-14112).

Breaking Changes

  • Unused functions ShallowCloneParDoPayload(), ShallowCloneSideInput(), and ShallowCloneFunctionSpec() have been removed from the Go SDK's pipelinex package (BEAM-13739).
  • JmsIO requires an explicit valueMapper to be set (BEAM-16308). You can use the TextMessageMapper to convert String inputs to JMS TextMessages:
  JmsIO.<String>write()
        .withConnectionFactory(jmsConnectionFactory)
        .withValueMapper(new TextMessageMapper());
  • Coders in Python are expected to inherit from Coder. (BEAM-14351).
  • New abstract method metadata() added to io.filesystem.FileSystem in the
    Python SDK. (BEAM-14314)
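
Adding an abstract method is a breaking change for custom subclasses because Python refuses to instantiate a class that leaves any abstract method unimplemented. The sketch below demonstrates this with hypothetical stand-in classes built on the standard abc module; it does not use Beam's FileSystem, and the return value of metadata() here is invented for illustration.

```python
from abc import ABC, abstractmethod


class BaseFileSystem(ABC):
    """Hypothetical stand-in for a filesystem base class (not Beam's)."""

    @abstractmethod
    def size(self, path): ...

    @abstractmethod
    def metadata(self, path): ...  # newly added abstract method


class OldCustomFileSystem(BaseFileSystem):
    # Written before metadata() existed: no longer instantiable.
    def size(self, path):
        return 0


class UpdatedCustomFileSystem(OldCustomFileSystem):
    # Implementing the new method makes the class instantiable again.
    def metadata(self, path):
        return {"path": path, "size_in_bytes": self.size(path)}


try:
    OldCustomFileSystem()
    old_instantiable = True
except TypeError:
    old_instantiable = False

info = UpdatedCustomFileSystem().metadata("/tmp/example.txt")
```

Custom filesystem implementations therefore need to implement the new method before upgrading.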

Deprecations

Bugfixes

  • Fixed Java Spanner IO NPE when ProjectID not specified in template executions (Java) (BEAM-14405).
  • Fixed potential NPE in BigQueryServicesImpl.getErrorInfo (Java) (BEAM-14133).

Known Issues

List of Contributors

According to git shortlog, the following people contributed to the 2.39.0 release. Thank you to all contributors!

Ahmed Abualsaud,
Ahmet Altay,
Aizhamal Nurmamat kyzy,
Alexander Zhuravlev,
Alexey Romanenko,
Anand Inguva,
Andrei Gurau,
Andrew Pilloud,
Andy Ye,
Arun Pandian,
Arwin Tio,
Aydar Farrakhov,
Aydar Zainutdinov,
AydarZaynutdinov,
Balázs Németh,
Benjamin Gonzalez,
Brian Hulette,
Buqian Zheng,
Chamikara Jayalath,
Chun Yang,
Daniel Oliveira,
Daniela Martín,
Danny McCormick,
David Huntsperger,
Deepak Nagaraj,
Denise Case,
Esun Kim,
Etienne Chauchot,
Evan Galpin,
Hector Miuler Malpica Gallegos,
Heejong Lee,
Hengfeng Li,
Ilango Rajagopal,
Ilion Beyst,
Israel Herraiz,
Jack McCluskey,
Kamil Bregula,
Kamil Breguła,
Ke Wu,
Kenneth Knowles,
KevinGG,
Kiley,
Kiley Sok,
Kyle Weaver,
Liam Miller-Cushon,
Luke Cwik,
Marco Robles,
Matt Casters,
Michael Li,
MiguelAnzoWizeline,
Milan Patel,
Minbo Bae,
Moritz Mack,
Nick Caballero,
Niel Markwick,
Ning Kang,
Oskar Firlej,
Pablo Estrada,
Pavel Avilov,
Reuven Lax,
Reza Rokni,
Ritesh Ghorse,
Robert Bradshaw,
Robert Burke,
Ryan Thompson,
Sam Whittle,
Steven Niemitz,
Thiago Nunes,
Tomo Suzuki,
Valentyn Tymofieiev,
Victor,
Yi Hu,
Yichi Zhang,
Yiru Tang,
ahmedabu98,
andoni-guzman,
brachipa,
bulat safiullin,
bullet03,
dannymartinm,
daria.malkova,
dpcollins-google,
egalpin,
emily,
fbeevikm,
johnjcasey,
kileys,
[email protected],
nguyennk92,
pablo rodriguez defino,
rszper,
rvballada,
sachinag,
tvalentyn,
vachan-shetty,
yirutang

beam - Beam 2.38.0 release

Published by youngoli over 2 years ago

We are happy to present the new 2.38.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.

For more information on changes in 2.38.0 check out the detailed release notes.

I/Os

  • Introduce projection pushdown optimizer to the Java SDK (BEAM-12976). The optimizer currently only works on the BigQuery Storage API, but more I/Os will be added in future releases. If you encounter a bug with the optimizer, please file a JIRA and disable the optimizer using pipeline option --experiments=disable_projection_pushdown.
  • A new IO for Neo4j graph databases was added. (BEAM-1857) It has the ability to update nodes and relationships using UNWIND statements and to read data using cypher statements with parameters.
  • amazon-web-services2 has reached feature parity and is finally recommended over the earlier amazon-web-services and kinesis modules (Java). These will be deprecated in one of the next releases (BEAM-13174).

New Features / Improvements

  • Pipeline dependencies supplied through --requirements_file will now be staged to the runner using binary distributions (wheels) of the PyPI packages for linux_x86_64 platform (BEAM-4032). To restore the behavior to use source distributions, set pipeline option --requirements_cache_only_sources. To skip staging the packages at submission time, set pipeline option --requirements_cache=skip (Python).
  • The Flink runner now supports Flink 1.14.x (BEAM-13106).
  • Interactive Beam now supports remotely executing Flink pipelines on Dataproc (Python) (BEAM-14071).

Breaking Changes

  • (Python) Previously DoFn.infer_output_types was expected to return Iterable[element_type] where element_type is the PCollection element type. It is now expected to return element_type. Take care if you have overridden infer_output_type in a DoFn (this is not common). See BEAM-13860.
  • (amazon-web-services2) The types of awsRegion / endpoint in AwsOptions changed from String to Region / URI (BEAM-13563).

Deprecations

  • Beam 2.38.0 will be the last minor release to support Flink 1.11.
  • (amazon-web-services2) Client providers (withXYZClientProvider()) as well as IO specific RetryConfigurations are deprecated; use withClientConfiguration() or AwsOptions instead to configure AWS IOs / clients.
    Custom implementations of client providers should be replaced with a respective ClientBuilderFactory and configured through AwsOptions (BEAM-13563).

Bugfixes

  • Fix S3 copy for large objects (Java) (BEAM-14011)
  • Fix quadratic behavior of pipeline canonicalization (Go) (BEAM-14128)
    • This caused unnecessarily long pre-processing times before job submission for large complex pipelines.
  • Fix pyarrow version parsing (Python)(BEAM-14235)

Known Issues

List of Contributors

According to git shortlog, the following people contributed to the 2.38.0 release. Thank you to all contributors!

abhijeet-lele
Ahmet Altay
akustov
Alexander
Alexander Zhuravlev
Alexey Romanenko
AlikRodriguez
Anand Inguva
andoni-guzman
andreukus
Andy Ye
Ankur Goenka
ansh0l
Artur Khanin
Aydar Farrakhov
Aydar Zainutdinov
Benjamin Gonzalez
Brian Hulette
brucearctor
bulat safiullin
bullet03
Carl Mastrangelo
Chamikara Jayalath
Chun Yang
Daniela Martín
Daniel Oliveira
Danny McCormick
daria.malkova
David Cavazos
David Huntsperger
dmitryor
Dmytro Sadovnychyi
dpcollins-google
egalpin
Elias Segundo Antonio
emily
Etienne Chauchot
Hengfeng Li
Ismaël Mejía
Israel Herraiz
Jack McCluskey
Jakub Kukul
Janek Bevendorff
Jeff Klukas
Johan Sternby
Kamil Breguła
Kenneth Knowles
Ke Wu
Kiley
Kyle Weaver
laraschmidt
Lara Schmidt
LE QUELLEC Olivier
Luka Kalinovcic
Luke Cwik
Marcin Kuthan
masahitojp
Masato Nakamura
Matt Casters
Melissa Pashniak
Michael Li
Miguel Hernandez
Moritz Mack
mosche
nancyxu123
Nathan J Mehl
Niel Markwick
Ning Kang
Pablo Estrada
paul-tlh
Pavel Avilov
Rahul Iyer
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Ryan Skraba
Ryan Thompson
Sam Whittle
Seth Vargo
sp029619
Steven Niemitz
Thiago Nunes
Udi Meiri
Valentyn Tymofieiev
Victor
vitaly.terentyev
Yichi Zhang
Yi Hu
yirutang
Zachary Houfek
Zoe