bytewax

Python Stream Processing

APACHE-2.0 License

Downloads
15.3K
Stars
1.3K
Committers
23

Bot releases are visible (Hide)

bytewax - v0.20.1 Latest Release

Published by davidselassie 5 months ago

Overview

  • Fixes a bug when using
    {py:obj}~bytewax.operators.windowing.EventClock where in-order but
    "slow" data results in watermark assertion errors.

What's Changed

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.20.0...v0.21.1

bytewax - v0.20.0

Published by whoahbot 5 months ago

Overview

  • Adds a dataflow structure visualizer. Run python -m bytewax.visualize.

  • Breaking change The internal format of recovery databases has been
    changed from using JsonPickle to Python's built-in {py:obj}pickle.
    Recovery stores that used the old format will not be usable after
    upgrading.

  • Breaking change The unary operator and UnaryLogic have been
    renamed to stateful and StatefulLogic respectively.

  • Adds a stateful_batch operator to allow for lower-level batch
    control while managing state.

  • StatefulLogic.on_notify, StatefulLogic.on_eof, and
    StatefulLogic.notify_at are now optional overrides. The defaults
    retain the state and emit nothing.

  • Breaking change Windowing operators have been moved from
    bytewax.operators.window into bytewax.operators.windowing.

  • Breaking change ClockConfigs have had Config dropped from
    their name and are just Clocks. E.g. If you previously from bytewax.operators.window import SystemClockConfig now from bytewax.operators.windowing import SystemClock.

  • Breaking change WindowConfigs have been renamed to Windowers.
    E.g. If you previously from bytewax.operators.window import SessionWindow now from bytewax.operators.windowing import SessionWindower.

  • Breaking change All windowing operators now return a set of
    streams {py:obj}~bytewax.operators.windowing.WindowOut.
    {py:obj}~bytewax.operators.windowing.WindowMetadata now is
    branched into its own stream and is no longer part of the single
    downstream. All window operator emitted items are labeled with the
    unique window ID they came from to facilitate joining the data
    later.

  • Breaking change {py:obj}~bytewax.operators.windowing.fold_window
    now requires a merge argument. This handles whenever the session
    windower determines that two windows must be merged because a new
    item bridged a gap.

  • Breaking change The join_named and join_window_named operators
    have been removed because they did not support returning proper type
    information. Use {py:obj}~bytewax.operators.join or
    {py:obj}~bytewax.operators.windowing.join_window instead, which
    have been enhanced to properly type their downstream values.

  • Breaking change {py:obj}~bytewax.operators.join and
    {py:obj}~bytewax.operators.windowing.join_window have had their
    product argument replaced with mode. You now can specify more
    nuanced kinds of join modes.

  • Python interfaces are now provided for custom clocks and windowers.
    Subclass {py:obj}~bytewax.operators.windowing.Clock (and a
    corresponding {py:obj}~bytewax.operators.windowing.ClockLogic) or
    {py:obj}~bytewax.operators.windowing.Windower (and a corresponding
    {py:obj}~bytewax.operators.windowing.WindowerLogic) to define your
    own senses of time and window definitions.

  • Adds a {py:obj}~bytewax.operators.windowing.window operator to
    allow you to write more flexible custom windowing operators.

  • Session windows now work correctly with out-of-order data and joins.

  • All windowing operators now process items in timestamp order. The
    most visible change that this results in is that the
    {py:obj}~bytewax.operators.windowing.collect_window operator now
    emits collections with values in timestamp order.

  • Adds a {py:obj}~bytewax.operators.filter_map_value operator.

  • Adds a {py:obj}~bytewax.operators.enrich_cached operator for
    easier joining with an external data source.

  • Adds a {py:obj}~bytewax.operators.key_rm convenience operator to
    remove keys from a {py:obj}~bytewax.operators.KeyedStream.

What's Changed

New Contributors

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.19.0...v0.20.0

bytewax - v0.19.1

Published by whoahbot 7 months ago

Overview

  • Fixes a bug where using a system clock on certain architectures causes items to be dropped from windows.

What's Changed

New Contributors

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.19.0...v0.19.1

bytewax - v0.19.0

Published by whoahbot 7 months ago

Overview

  • Multiple operators have been reworked to avoid taking and releasing
    Python's global interpreter lock while iterating over multiple items.
    Windowing operators, stateful operators and operators like branch
    will see significant performance improvements.

    Thanks to @damiondoesthings for helping us track this down!

  • Breaking change FixedPartitionedSource.build_part,
    DynamicSource.build, FixedPartitionedSink.build_part and DynamicSink.build
    now take an additional step_id argument. This argument can be used when
    labeling custom Python metrics.

  • Custom Python metrics can now be collected using the prometheus-client
    library.

  • Breaking change The schema registry interface has been removed.
    You can still use schema registries, but you need to instantiate
    the (de)serializers on your own. This allows for more flexibility.
    See the confluent_serde and redpanda_serde examples for how
    to use the new interface.

  • Fixes bug where items would be incorrectly marked as late in sliding
    and tumbling windows in cases where the timestamps are very far from
    the align_to parameter of the windower.

  • Adds stateful_flat_map operator.

  • Breaking change Removes builder argument from stateful_map.
    Instead, the initial state value is always None and you can call
    your previous builder by hand in the mapper.

  • Breaking change Improves performance by removing the now: datetime argument from FixedPartitionedSource.build_part,
    DynamicSource.build, and UnaryLogic.on_item. If you need the
    current time, use:

from datetime import datetime, timezone

now = datetime.now(timezone.utc)
  • Breaking change Improves performance by removing the sched: datetime argument from StatefulSourcePartition.next_batch,
    StatelessSourcePartition.next_batch, UnaryLogic.on_notify. You
    should already have the scheduled next awake time in whatever
    instance variable you returned in
    {Stateful,Stateless}SourcePartition.next_awake or
    UnaryLogic.notify_at.

What's Changed

New Contributors

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.18.2...v0.19.0

bytewax - v0.18.2

Published by whoahbot 8 months ago

Overview

  • Fixes a bug that prevented the deletion of old state in recovery stores.
  • Better error messages on invalid epoch and backup interval parameters.
  • Fixes bug where dataflow will hang if a source's next_awake is set far in the future.

What's Changed

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.18.1...v0.18.2

What's Changed

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.18.1...v0.18.2

bytewax - v0.18.1

Published by whoahbot 9 months ago

Overview

What's Changed

New Contributors

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.18.0...v0.18.1

bytewax - v0.18.0

Published by whoahbot 10 months ago

Overview

  • Support for schema registries, through bytewax.connectors.kafka.registry.RedpandaSchemaRegistry and bytewax.connectors.kafka.registry.ConfluentSchemaRegistry.

  • Custom Kafka operators in bytewax.connectors.kafka.operators:
    input, output, deserialize_key, deserialize_value, deserialize,
    serialize_key, serialize_value and serialize.

  • Breaking change KafkaSource now emits a special KafkaSourceMessage to allow access to all data on consumed messages. KafkaSink now consumes KafkaSinkMessage to allow setting additional fields on produced messages.

  • Non-linear dataflows are now possible. Each operator method returns
    a handle to the Streams it produces; add further steps via calling
    operator functions on those returned handles, not the root
    Dataflow. See the migration guide for more info.

  • Auto-complete and type hinting on operators, inputs, outputs,
    streams, and logic functions now works.

  • A ton of new operators: collect_final, count_final,
    count_window, flatten, inspect_debug, join, join_named,
    max_final, max_window, merge, min_final, min_window,
    key_on, key_assert, key_split, merge, unary. Documentation
    for all operators are in bytewax.operators now.

  • New operators can be added in Python, made by grouping existing
    operators. See bytewax.dataflow module docstring for more info.

  • Breaking change Operators are now stand-alone functions; import bytewax.operators as op and use e.g. op.map("step_id", upstream, lambda x: x + 1).

  • Breaking change All operators must take a step_id argument now.

  • Breaking change fold and reduce operators have been renamed to
    fold_final and reduce_final. They now only emit on EOF and are
    only for use in batch contexts.

  • Breaking change batch operator renamed to collect, so as to
    not be confused with runtime batching. Behavior is unchanged.

  • Breaking change output operator does not forward downstream its
    items. Add operators on the upstream handle instead.

  • next_batch on input partitions can now return any Iterable, not
    just a List.

  • inspect operator now has a default inspector that prints out items
    with the step ID.

  • collect_window operator now can collect into sets and dicts.

  • Adds a get_fs_id argument to {Dir,File}Source to allow handling
    non-identical files per worker.

  • Adds a TestingSource.EOF and TestingSource.ABORT sentinel values
    you can use to test recovery.

  • Breaking change Adds a datetime argument to
    FixedPartitionSource.build_part, DynamicSource.build_part,
    StatefulSourcePartition.next_batch, and
    StatelessSourcePartition.next_batch. You can now use this to
    update your next_awake time easily.

  • Breaking change Window operators now emit WindowMetadata objects
    downstream. These objects can be used to introspect the open_time
    and close_time of windows. This changes the output type of windowing
    operators from: (key, values) to (key, (metadata, values)).

  • Breaking change IO classes and connectors have been renamed to
    better reflect their semantics and match up with documentation.

  • Moves the ability to start multiple Python processes with the
    -p or --processes to the bytewax.testing module.

  • Breaking change SimplePollingSource moved from
    bytewax.connectors.periodic to bytewax.inputs since it is an
    input helper.

  • SimplePollingSource's align_to argument now works.

What's Changed

New Contributors

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.17.1...v0.18.0

bytewax - v0.17.2

Published by whoahbot about 1 year ago

Overview

  • Fixes error message creation, and updates error messages when creating recovery partitions.

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.17.1...v0.17.2

bytewax - v0.17.1

Published by whoahbot about 1 year ago

Overview

  • Adds the batch operator to Dataflows. Calling Dataflow.batch
    will batch incoming items until either a batch size has been reached
    or a timeout has passed.

  • Adds the SimplePollingInput source. Subclass this input source to
    periodically source new input for a dataflow.

  • Re-adds GLIBC 2.27 builds to support older linux distributions.

What's Changed

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.17.0...v0.17.1

bytewax - v0.17.0

Published by whoahbot about 1 year ago

v0.17.0

Changed

  • Breaking change Recovery system re-worked. Kafka-based recovery
    removed. SQLite recovery file format changed; existing recovery DB
    files can not be used. See the module docstring for
    bytewax.recovery for how to use the new recovery system.

  • Dataflow execution supports rescaling over resumes. You can now
    change the number of workers and still get proper execution and
    recovery.

  • epoch-interval has been renamed to snapshot-interval

  • The list-parts method of PartitionedInput has been changed to
    return a List[str] and should only reflect the available
    inputs that a given worker has access to. You no longer need
    to return the complete set of partitions for all workers.

  • The next method of StatefulSource and StatelessSource has
    been changed to next_batch and should return a List of elements,
    or the empty list if there are no elements to return.

Added

  • Added new cli parameter backup-interval, to configure the length of
    time to wait before "garbage collecting" older recovery snapshots.

  • Added next_awake to input classes, which can be used to schedule
    when the next call to next_batch should occur. Use next_awake
    instead of time.sleep.

  • Added bytewax.inputs.batcher_async to bridge async Python libraries
    in Bytewax input sources.

  • Added support for linux/aarch64 and linux/armv7 platforms.

Removed

  • KafkaRecoveryConfig has been removed as a recovery store.

What's Changed

New Contributors

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.16.2...v0.17.0

bytewax - v0.16.2

Published by whoahbot over 1 year ago

Overview

  • Add support for Windows builds - thanks @zzl221000!
  • Adds a CSVInput subclass of FileInput

What's Changed

New Contributors

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.16.1...v0.16.2

bytewax - v0.16.1

Published by whoahbot over 1 year ago

Overview

  • Add a cooldown for activating workers to reduce CPU consumption.
  • Add support for Python 3.11.

What's Changed

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.16.0...v0.16.1

bytewax - v0.16.0

Published by whoahbot over 1 year ago

Overview

  • Breaking change Reworked the execution model. run_main and cluster_main
    have been moved to bytewax.testing as they are only supposed to be used
    when testing or prototyping.
    Production dataflows should be ran by calling the bytewax.run
    module with python -m bytewax.run <dataflow-path>:<dataflow-name>.
    See python -m bytewax.run -h for all the possible options.
    The functionality offered by spawn_cluster are now only offered by the
    bytewax.run script, so spawn_cluster was removed.

  • Breaking change {Sliding,Tumbling}Window.start_at has been
    renamed to align_to and both now require that argument. It's not
    possible to recover windowing operators without it.

  • Fixes bugs with windows not closing properly.

  • Fixes an issue with SQLite-based recovery. Previously you'd always
    get an "interleaved executions" panic whenever you resumed a cluster
    after the first time.

  • Add SessionWindow for windowing operators.

  • Add SlidingWindow for windowing operators.

  • Breaking change Rename TumblingWindowConfig to TumblingWindow

  • Add filter_map operator.

  • Breaking change New partition-based input and output API. This
    removes ManualInputConfig and ManualOutputConfig. See
    bytewax.inputs and bytewax.outputs for more info.

  • Breaking change Dataflow.capture operator is renamed to
    Dataflow.output.

  • Breaking change KafkaInputConfig and KafkaOutputConfig have
    been moved to bytewax.connectors.kafka.KafkaInput and
    bytewax.connectors.kafka.KafkaOutput.

  • Deprecation warning The KafkaRecovery store is being deprecated
    in favor of SqliteRecoveryConfig, and will be removed in a future
    release.

What's Changed

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.15.0...v0.16.0

bytewax - v0.15.1

Published by whoahbot over 1 year ago

Overview

  • Fixes an issue with SQLite-based recovery. Previously you'd always
    get an "interleaved executions" panic whenever you resumed a cluster
    after the first time.

What's Changed

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.14.0...v0.15.1

bytewax - v0.15.0

Published by davidselassie over 1 year ago

Overview

  • Fixes issue with multi-worker recovery. If the cluster crashed
    before all workers had completed their first epoch, the cluster
    would resume from the incorrect position. This requires a change to
    the recovery store. You cannot resume from recovery data written
    with an older version.

What's Changed

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.14.0...v0.15.0

bytewax - v0.14.0

Published by davidselassie almost 2 years ago

Overview

  • Dataflow continuation now works. If you run a dataflow over a finite
    input, all state will be persisted via recovery so if you re-run the
    same dataflow pointing at the same input, but with more data
    appended at the end, it will correctly continue processing from the
    previous end-of-stream.

  • Fixes issue with multi-worker recovery. Previously resume data was
    being routed to the wrong worker so state would be missing.

  • The above two changes require that the recovery format has been
    changed for all recovery stores. You cannot resume from recovery
    data written with an older version.

  • Adds an introspection web server to dataflow workers.

  • Adds collect_window operator.

What's Changed

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.13.1...v0.14.0

bytewax - v0.13.1

Published by miccioest almost 2 years ago

Overview

  • Added Google Colab support.

What's Changed

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.13.0...v0.13.1

bytewax - v0.13.0

Published by whoahbot almost 2 years ago

Overview

  • Added tracing instrumentation and configurations for tracing backends.

What's Changed

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.12.0...v0.13.0

bytewax - v0.12.0

Published by whoahbot almost 2 years ago

Overview

  • Fixes bug where window is never closed if recovery occurs after last
    item but before window close.

  • Recovery logging is reduced.

  • Breaking change Recovery format has been changed for all recovery stores.
    You cannot resume from recovery data written with an older version.

  • Adds a DynamoDB and BigQuery output connector.

What's Changed

New Contributors

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.7.1...v0.12.0

bytewax - v0.11.2

Published by whoahbot about 2 years ago

  • Performance improvements. ✨

  • Support SASL and SSL for bytewax.inputs.KafkaInputConfig.

What's Changed

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.11.1...0.11.2

Package Rankings
Top 6.75% on Proxy.golang.org
Top 3.43% on Pypi.org
Badges
Extracted from project README
Actions Status PyPI Bytewax User Guide
Related Projects