Build and manage real-life ML, AI, and data science projects with ease!
APACHE-2.0 License
Published by savingoyal over 1 year ago
With this release, Metaflow users can get notified on Slack when their workflows succeed or fail on Argo Workflows. Using this feature is quite straightforward:
First, obtain a Slack webhook URL, which looks like
https://hooks.slack.com/services/T0XXXXXXXXX/B0XXXXXXXXX/qZXXXXXX
Then deploy your workflow with the notification flags:
python flow.py argo-workflows create --notify-on-error --notify-on-success --notify-slack-webhook-url <slack-webhook-url>
You can also set
METAFLOW_ARGO_WORKFLOWS_CREATE_NOTIFY_SLACK_WEBHOOK_URL=<slack-webhook-url>
in your environment instead of specifying --notify-slack-webhook-url on the CLI every time.
I deployed my workflow following the instructions above, but I haven’t received any notifications yet?
This issue may very well happen if you are running Kubernetes v1.24 or newer.
Since v1.24, Kubernetes stopped automatically creating a secret for every serviceAccount. Argo Workflows relies on the existence of these secrets to run lifecycle hooks responsible for the emission of these notifications.
Follow these steps to explicitly create a secret for the service account that is responsible for executing Argo Workflows steps:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: default-sa-token # change according to the name of the sa
  annotations:
    kubernetes.io/service-account.name: default # replace with your sa
type: kubernetes.io/service-account-token
EOF
$ kubectl edit sa default -n mynamespace
...
apiVersion: v1
kind: ServiceAccount
metadata:
  creationTimestamp: "2023-05-05T20:58:58Z"
  name: default
  namespace: jobs-default
  resourceVersion: "6739507"
  uid: 4a708eff-d6ba-4dd8-80ee-8fb3c4c1e1c7
secrets:
- name: default-sa-token # should match the secret above
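Before changing cluster configuration, it can be worth confirming that the webhook URL itself accepts messages. The following is a minimal sketch and not part of Metaflow: it assumes a standard Slack incoming webhook, which accepts a JSON body with a text field over HTTP POST. The function names are hypothetical.

```python
import json
import urllib.request


def build_slack_test_request(webhook_url, text="Metaflow notification test"):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_test_message(webhook_url, text="Metaflow notification test"):
    """Send the test message and return the HTTP status code."""
    request = build_slack_test_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling send_slack_test_message with your real webhook URL should make a message appear in the target channel. If it does, the webhook is fine and the missing notifications more likely stem from the service account secret described above.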
In case you need any assistance or have feedback for us, ping us at chat.metaflow.org or open a GitHub issue.
metaflow configure kubernetes by @saikonen in https://github.com/Netflix/metaflow/pull/1405
Full Changelog: https://github.com/Netflix/metaflow/compare/2.8.6...2.9.0
Published by savingoyal over 1 year ago
With this release, Metaflow users can architect sequences of workflows that conduct data across teams, all the way from ETL and data warehouse to final ML outputs. Detailed documentation and a blog post to follow very shortly! Keep watching this space.
metaflow configure kubernetes by @saikonen in https://github.com/Netflix/metaflow/pull/1405
Full Changelog: https://github.com/Netflix/metaflow/compare/2.8.6...2.9.0
Published by savingoyal over 1 year ago
With this release, Metaflow users can attach existing persistent volume claims to Metaflow tasks running on a Kubernetes cluster.
To use this functionality, list your persistent volume claims and their mount points using the persistent_volume_claims argument of the @kubernetes decorator: @kubernetes(persistent_volume_claims={"pvc-claim-name": "mount-point", "another-pvc-claim-name": "another-mount-point"}).
Here is an example:
from metaflow import FlowSpec, step, kubernetes, current
import os

class MountPVCFlow(FlowSpec):

    @kubernetes(persistent_volume_claims={"test-pvc-feature-claim": "/mnt/testvol"})
    @step
    def start(self):
        print('testing PVC')
        mount = "/mnt/testvol"
        file = f"zeros_run_{current.run_id}"
        with open(os.path.join(mount, file), "w+") as f:
            f.write("\0" * 50)
            f.flush()
        print(f"mount folder contents: {os.listdir(mount)}")
        self.next(self.end)

    @step
    def end(self):
        print("finished")

if __name__=="__main__":
    MountPVCFlow()
Full Changelog: https://github.com/Netflix/metaflow/compare/2.8.5...2.8.6
Published by romain-intel over 1 year ago
Improvements
The previous release resulted in disabling a sequence of user operations that worked previously:
This release restores the previous behavior.
Full Changelog: https://github.com/Netflix/metaflow/compare/2.8.4...2.8.5
Published by savingoyal over 1 year ago
Features
Improvements
It is typical for the user code in a Metaflow step to download assets from an object store, e.g. S3. Examples include serialized models and raw input data, such as unstructured media or structured Parquet files. The amount of data loaded in a task is typically 10-100GB, allowing even terabytes to be handled in a foreach.
To reduce IO bottlenecks in such tasks, we provide an optimized client for S3, metaflow.S3, that makes it possible to download data using all available network bandwidth. Notably, in a modern instance the available network bandwidth can be higher than the local disk bandwidth. Consider: SATA 3.0 provides 6Gbit/s whereas a large instance can have 20Gbit/s network throughput. Even Gen3 NVMe provides just 16Gbit/s. To benefit from the full network bandwidth, local disk IO must be bypassed. The metaflow.S3 client accomplishes this by relying on the page cache: nominally, files are downloaded to a temporary directory on disk, but in practice all data stays in the page cache. This assumes that the downloaded data fits in memory, which can be ensured with a high enough @resources(memory=) setting.
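To make the bandwidth comparison concrete, here is a small back-of-the-envelope sketch using the figures quoted above. It computes ideal transfer times for a 100GB download and ignores protocol overhead, latency, and contention, so real numbers will be worse. The helper name is hypothetical.

```python
def transfer_seconds(size_gbytes, bandwidth_gbit_per_s):
    """Ideal transfer time in seconds, ignoring overhead and latency."""
    size_gbit = size_gbytes * 8  # 1 byte = 8 bits
    return size_gbit / bandwidth_gbit_per_s


# Bandwidth figures (Gbit/s) taken from the paragraph above.
links = {"SATA 3.0 disk": 6, "Gen3 NVMe disk": 16, "instance network": 20}

for name, gbit_per_s in links.items():
    seconds = transfer_seconds(100, gbit_per_s)
    print(f"{name}: about {seconds:.0f} s for 100 GB")
```

Under these idealized assumptions the network path finishes in roughly 40 seconds, while a SATA-bound path needs more than two minutes, which is why the disk has to be kept out of the data path.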
The above setup, which can provide excellent IO performance in general, has a small gotcha: the instance needs to have enough local disk space to back all the data, although no data actually hits the disk. Increasingly, instances may have more memory than local disk space available, so this superfluous requirement becomes a problem. This puts users in a strange situation: the instance has enough RAM to hold all the data in memory, and there are ways to download it quickly from S3, but the lack of local disk space (which is not even needed) makes it impossible to access the data.
Kubernetes supports mounting a tmpfs filesystem on the fly. Using this feature, the user can create a memory-backed file system that serves as temporary space for downloaded data. This removes the need to deal with any local disks. One can simply use a minimal root filesystem, which greatly simplifies the infrastructure setup.
With this release, we introduce a new config option, METAFLOW_TEMPDIR, which, if defined, is used as the default metaflow.S3(tmproot). If METAFLOW_TEMPDIR is not defined, tmproot='.' as before. In addition, a few new attributes are introduced for the @kubernetes decorator:
Attribute (default) | Default behavior | Override semantics |
---|---|---|
use_tmpfs=False | tmpfs disabled | use_tmpfs=True enables tmpfs |
tmpfs_tempdir=True | sets METAFLOW_TEMPDIR=tmpfs_path | tmpfs_tempdir=False doesn't set METAFLOW_TEMPDIR |
tmpfs_size=None | sets tmpfs size to 50% of @resources(memory) | tmpfs size in megabytes |
tmpfs_path=None | use /metaflow_temp as tmpfs_path | custom mount point |
@kubernetes(memory=100000, use_tmpfs=True)
In this case, at most 50GB is available for tmpfs and it is used by S3 by default. Note that tmpfs only consumes the amount of memory corresponding to the data stored, so there is no downside in setting a large size by default.
@kubernetes(memory=100000, tmpfs_size=100000)
Let tmpfs use all available memory. Note that use_tmpfs=True doesn’t have to be specified redundantly.
@kubernetes(memory=100000, tmpfs_size=10000, tmpfs_path='/data', tmpfs_tempdir=False)
Full control over settings - metaflow.S3 doesn’t use the tmpfs volume in this case.
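The defaults and overrides in the table can be summarized in code. The sketch below is a hypothetical helper that mirrors the documented semantics; it is not Metaflow's implementation. It assumes that memory and tmpfs_size are both in megabytes and that setting tmpfs_size implies use_tmpfs=True, as the second example suggests.

```python
def resolve_tmpfs(memory_mb, use_tmpfs=False, tmpfs_tempdir=True,
                  tmpfs_size=None, tmpfs_path=None):
    """Return the effective tmpfs settings for a task (illustrative only)."""
    # Assumption: an explicit size enables tmpfs even without use_tmpfs=True.
    enabled = use_tmpfs or tmpfs_size is not None
    if not enabled:
        return {"enabled": False, "size_mb": None, "path": None, "env": {}}

    # Default size is 50% of the task's memory.
    size_mb = tmpfs_size if tmpfs_size is not None else memory_mb // 2
    # Default mount point is /metaflow_temp.
    path = tmpfs_path or "/metaflow_temp"
    # By default, METAFLOW_TEMPDIR points at the tmpfs mount.
    env = {"METAFLOW_TEMPDIR": path} if tmpfs_tempdir else {}
    return {"enabled": True, "size_mb": size_mb, "path": path, "env": env}


print(resolve_tmpfs(100000, use_tmpfs=True))
print(resolve_tmpfs(100000, tmpfs_size=10000, tmpfs_path="/data",
                    tmpfs_tempdir=False))
```

The first call corresponds to the 50GB default case, and the second to the full-control case where metaflow.S3 does not pick up the tmpfs volume because METAFLOW_TEMPDIR is left unset.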
Besides metaflow.S3, the user may want to use the tmpfs volume for their own use cases. In particular, many modern ML libraries require a local cache. To support these use cases, tmpfs_path is exposed through the current object, as current.tempdir.
This allows the user to leverage the volume straightforwardly:
AutoModelForSeq2SeqLM.from_pretrained(
    model_path,
    cache_dir=current.tempdir,
    device_map='auto',
    load_in_8bit=True,
)
With this release, you can access current.run and current.task within a running flow, allowing for use cases like:
from metaflow import current
# add tags from inside a run
current.run.add_tag('foobar')
The previous release broke backward compatibility in cases where the metaflow client object is deserialized from an older version of Metaflow. This release preserves the functionality and provides explicit compatibility guarantees going forward.
METAFLOW_S3_ENDPOINT_URL as a part of airflow by @valayDave in https://github.com/Netflix/metaflow/pull/1368
run and task object. by @romain-intel in https://github.com/Netflix/metaflow/pull/1384
Full Changelog: https://github.com/Netflix/metaflow/compare/2.8.3...2.8.4
Published by savingoyal over 1 year ago
Features
Improvements
It is typical for the user code in a Metaflow step to download assets from an object store, e.g. S3. Examples include serialized models and raw input data, such as unstructured media or structured Parquet files. The amount of data loaded in a task is typically 10-100GB, allowing even terabytes to be handled in a foreach.
To reduce IO bottlenecks in such tasks, we provide an optimized client for S3, metaflow.S3, that makes it possible to download data using all available network bandwidth. Notably, in a modern instance the available network bandwidth can be higher than the local disk bandwidth. Consider: SATA 3.0 provides 6Gbit/s whereas a large instance can have 20Gbit/s network throughput. Even Gen3 NVMe provides just 16Gbit/s. To benefit from the full network bandwidth, local disk IO must be bypassed. The metaflow.S3 client accomplishes this by relying on the page cache: nominally, files are downloaded to a temporary directory on disk, but in practice all data stays in the page cache. This assumes that the downloaded data fits in memory, which can be ensured with a high enough @resources(memory=) setting.
The above setup, which can provide excellent IO performance in general, has a small gotcha: the instance needs to have enough local disk space to back all the data, although no data actually hits the disk. Increasingly, instances may have more memory than local disk space available, so this superfluous requirement becomes a problem. The issue is further amplified by the fact that, as of today, it is impossible to add ephemeral volumes on the fly on AWS Batch. This puts users in a strange situation: the instance has enough RAM to hold all the data in memory, and there are ways to download it quickly from S3, but the lack of local disk space (which is not even needed) makes it impossible to access the data.
AWS Batch supports mounting a tmpfs filesystem on the fly. Using this feature, the user can create a memory-backed file system that serves as temporary space for downloaded data. This removes the need to deal with any local disks. One can simply use a minimal root filesystem, which greatly simplifies the infrastructure setup.
With this release, we introduce a new config option, METAFLOW_TEMPDIR, which, if defined, is used as the default metaflow.S3(tmproot). If METAFLOW_TEMPDIR is not defined, tmproot='.' as before. In addition, a few new attributes are introduced for the @batch decorator:
Attribute (default) | Default behavior | Override semantics |
---|---|---|
use_tmpfs=False | tmpfs disabled | use_tmpfs=True enables tmpfs |
tmpfs_tempdir=True | sets METAFLOW_TEMPDIR=tmpfs_path | tmpfs_tempdir=False doesn't set METAFLOW_TEMPDIR |
tmpfs_size=None | sets tmpfs size to 50% of @resources(memory) | tmpfs size in megabytes |
tmpfs_path=None | use /metaflow_temp as tmpfs_path | custom mount point |
@batch(memory=100000, use_tmpfs=True)
In this case, at most 50GB is available for tmpfs and it is used by S3 by default. Note that tmpfs only consumes the amount of memory corresponding to the data stored, so there is no downside in setting a large size by default.
@batch(memory=100000, tmpfs_size=100000)
Let tmpfs use all available memory. Note that use_tmpfs=True doesn’t have to be specified redundantly.
@batch(memory=100000, tmpfs_size=10000, tmpfs_path='/data', tmpfs_tempdir=False)
Full control over settings - metaflow.S3 doesn’t use the tmpfs volume in this case.
Besides metaflow.S3, the user may want to use the tmpfs volume for their own use cases. In particular, many modern ML libraries require a local cache. To support these use cases, tmpfs_path is exposed through the current object, as current.tempdir.
This allows the user to leverage the volume straightforwardly:
AutoModelForSeq2SeqLM.from_pretrained(
    model_path,
    cache_dir=current.tempdir,
    device_map='auto',
    load_in_8bit=True,
)
With this release, Metaflow client objects support key autocompletion in IPython notebooks:
from metaflow import Flow, Metaflow
Metaflow().flows
>>> [Flow('HelloFlow'), Flow('MovieStatsFlow')]
flow = Flow('HelloFlow') # No autocomplete here
flow._ipython_key_completions_()
>>>
['1680815181013681',
'1680815178214737',
'1680432265121345',
'1680430310127401']
run = flow["1680815178214737"]
run._ipython_key_completions_()
>>> ['end', 'hello', 'start']
step = run["hello"]
step._ipython_key_completions_()
>>> ['2']
task = step["2"]
task._ipython_key_completions_()
>>> ['name']
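The _ipython_key_completions_ method used above is a hook that IPython looks for when completing obj["<TAB>. As a rough illustration of the protocol, here is a minimal sketch with a hypothetical container class; it is not how the Metaflow client is implemented.

```python
class RunIndex:
    """Hypothetical container keyed by run ID, standing in for a client object."""

    def __init__(self, runs):
        self._runs = dict(runs)

    def __getitem__(self, run_id):
        return self._runs[run_id]

    def _ipython_key_completions_(self):
        # IPython calls this hook to suggest keys when completing index["<TAB>
        return list(self._runs)


index = RunIndex({"1680815181013681": "run A", "1680815178214737": "run B"})
print(index._ipython_key_completions_())
```

Any object that supports item access and returns its valid keys from this method gets key completion in IPython and Jupyter without further setup.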
With this release, Metaflow flows should execute a tad bit faster since a few network calls to Metaflow's metadata service are now cached. Expect continued further improvements in flow execution times over the next few releases.
With this release, Metaflow card creation will handle non-JSON parseable types gracefully by replacing the column values with UnsupportedType : <TYPENAME>.
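The idea of substituting a placeholder for values that cannot be serialized can be sketched as follows. This is an illustrative helper with a hypothetical name, not the code used by Metaflow cards; it only assumes the placeholder format quoted above.

```python
import json


def json_safe(value):
    """Return value unchanged if JSON can encode it, else a placeholder string."""
    try:
        json.dumps(value)
        return value
    except (TypeError, ValueError):
        return f"UnsupportedType : {type(value).__name__}"


# A row mixing encodable and non-encodable values.
row = {"name": "a", "count": 3, "payload": object(), "blob": b"\x00"}
safe_row = {key: json_safe(val) for key, val in row.items()}
print(json.dumps(safe_row))
```

With this approach one problematic column no longer prevents the rest of the table from rendering.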
pandas.DataFrame for default card by @valayDave in https://github.com/Netflix/metaflow/pull/1344
Full Changelog: https://github.com/Netflix/metaflow/compare/2.8.2...2.8.3
Published by savingoyal over 1 year ago
With this release, the Metaflow tutorials can now be executed within the Metaflow sandboxes, making it trivial to evaluate whether Metaflow is a good fit for your organization without committing to deploying the necessary cloud infrastructure upfront.
step-functions trigger or argo-workflows trigger
With this release, if the Metaflow config (in ~/.metaflow_config) includes a reference to the deployed Metaflow UI (assigned to METAFLOW_UI_URL), the user-facing logs in the terminal will indicate the direct link to the relevant run view in the Metaflow UI.
logs command in cases where the step/task hasn't finished by @romain-intel in https://github.com/Netflix/metaflow/pull/1315
Full Changelog: https://github.com/Netflix/metaflow/compare/2.8.1...2.8.2
Published by savingoyal over 1 year ago
task.metadata_dict when a task executes on AWS Batch
With this release, task.metadata_dict will include the fields ec2-instance-id, ec2-instance-type, ec2-region, and ec2-availability-zone whenever the Metaflow task is executed on AWS Batch and the task container has access to the EC2 metadata magic URL.
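As a sketch of how these fields might be consumed, the helper below picks the four EC2 keys out of a metadata dictionary. A plain dict with made-up values stands in for task.metadata_dict here, and the helper name is hypothetical. Fields that are absent, for example when the task did not run on AWS Batch, come back as None.

```python
EC2_FIELDS = (
    "ec2-instance-id",
    "ec2-instance-type",
    "ec2-region",
    "ec2-availability-zone",
)


def ec2_summary(metadata_dict):
    """Extract the EC2-related fields, using None for any that are missing."""
    return {field: metadata_dict.get(field) for field in EC2_FIELDS}


# Made-up sample standing in for task.metadata_dict.
sample = {
    "ec2-instance-id": "i-0123456789abcdef0",
    "ec2-instance-type": "m5.xlarge",
    "ec2-region": "us-west-2",
    "ec2-availability-zone": "us-west-2a",
    "attempt": "0",
}
print(ec2_summary(sample))
```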
run or resume
With this release, if the Metaflow config (in ~/.metaflow_config) includes a reference to the deployed Metaflow UI (assigned to METAFLOW_UI_URL), the user-facing logs in the terminal will indicate the direct link to the relevant run view in the Metaflow UI.
Published by savingoyal over 1 year ago
With this release, we are introducing an integration with Apache Airflow similar to our integrations with AWS Step Functions and Argo Workflows where Metaflow users can easily deploy & schedule their DAGs by simply executing
python myflow.py airflow create mydag.py
which will create an Airflow DAG for them. With this feature, Metaflow users can now enjoy all the features of Metaflow on top of Apache Airflow - including a more user-friendly and productive development API for data scientists and data engineers, without needing to change anything in your existing pipelines or operational playbooks, as described in its announcement blog post. To learn how to deploy and operate the integration, see Using Airflow with Metaflow.
When running on Airflow, Metaflow code works exactly as it does locally: No changes are required in the code. With this integration, Metaflow users can inspect their flows deployed on Apache Airflow as before and debug and reproduce results from Apache Airflow on their local laptop or within a notebook. All tasks are run on Kubernetes respecting the @resources decorator as if the @kubernetes decorator was added to all steps, as explained in Executing Tasks Remotely.
The main benefits of using Metaflow with Airflow are:
Published by romain-intel over 1 year ago
metaflow_extensions, add an empty __init__.py file. by @romain-intel in https://github.com/Netflix/metaflow/pull/1276
Full Changelog: https://github.com/Netflix/metaflow/compare/2.7.22...2.7.23
Published by romain-intel over 1 year ago
Full Changelog: https://github.com/Netflix/metaflow/compare/2.7.21...2.7.22
Published by romain-intel almost 2 years ago
Full Changelog: https://github.com/Netflix/metaflow/compare/2.7.20...2.7.21
Published by romain-intel almost 2 years ago
If you are using the unsupported Metaflow Extensions mechanism, you may have to change them slightly. Please see https://github.com/Netflix/metaflow-extensions-template/blob/master/CHANGES.md for more details.
Full Changelog: https://github.com/Netflix/metaflow/compare/2.7.19...2.7.20
Published by savingoyal almost 2 years ago
Full Changelog: https://github.com/Netflix/metaflow/compare/2.7.18...2.7.19
Published by romain-intel almost 2 years ago
Full Changelog: https://github.com/Netflix/metaflow/compare/2.7.17...2.7.18
Published by romain-intel almost 2 years ago
Full Changelog: https://github.com/Netflix/metaflow/compare/2.7.16...2.7.17
Published by romain-intel almost 2 years ago
Full Changelog: https://github.com/Netflix/metaflow/compare/2.7.15...2.7.16
Published by romain-intel almost 2 years ago
._orig access for submodules for MF extensions by @romain-intel in https://github.com/Netflix/metaflow/pull/1174
Full Changelog: https://github.com/Netflix/metaflow/compare/2.7.14...2.7.15
Published by romain-intel almost 2 years ago
Full Changelog: https://github.com/Netflix/metaflow/compare/2.7.13...2.7.14
Published by romain-intel about 2 years ago
cmd extension point to allow MF extensions to extend it by @romain-intel in https://github.com/Netflix/metaflow/pull/1143
kubernetes_conn_id in Airflow integration by @valayDave in https://github.com/Netflix/metaflow/pull/1153
Image.from_matplotlib by @valayDave in https://github.com/Netflix/metaflow/pull/1147
Full Changelog: https://github.com/Netflix/metaflow/compare/2.7.12...2.7.13