The Virtual Feature Store. Turn your existing data infrastructure into a feature store.
v0.12.0 · Published by sdreyer 8 months ago
- offline_store_spark_runner.py by @epps in https://github.com/featureform/featureform/pull/1213
- get_dynamodb method by @epps in https://github.com/featureform/featureform/pull/1309
Full Changelog: https://github.com/featureform/featureform/compare/v0.11.0...v0.12.0
v0.11.0 · Published by sdreyer 11 months ago
Full Changelog: https://github.com/featureform/featureform/compare/v0.10.3...v0.11.0
v0.10.1 · Published by sdreyer about 1 year ago
- pytest Coverage by @epps in https://github.com/featureform/featureform/pull/940
Full Changelog: https://github.com/featureform/featureform/compare/v0.10.0...v0.10.1
v0.10.0 · Published by sdreyer over 1 year ago
- A brand-new Dashboard UI and enhanced functionality
- Vector database support in local and hosted mode for Weaviate and Pinecone
- API improvements for data science development
- Updated documentation and bugfixes
We're excited to bring you a more visually appealing Dashboard with new functionality for both users and administrators, including metadata management for resource tags, previewing transformation results, and clear visibility of transformation logic.
- Assign tags to resources directly through the dashboard UI
- Edit resource metadata from the dashboard
- Preview datasets directly from the dashboard
- Better formatting for Python and SQL transformations
You can now register Weaviate and Pinecone as providers!
Read all files from a directory into a dataframe with ff.register_directory() (see the sketch below)
Inference Stores are now optional in Local Mode: if an inference store is not specified, it will default to local mode.
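Here is a minimal sketch of the directory registration mentioned above; the parameter names are assumptions, not the exact API, so check the docs for the real signature:
import featureform as ff

# Hypothetical example: parameter names may differ from the actual API.
docs = ff.register_directory(
    name="docs",
    path="data/docs/",  # each file in the directory becomes a row
    description="All files in data/docs as a single dataframe",
)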
Full Changelog: https://github.com/featureform/featureform/compare/v0.9.0...v0.10.0
v0.9.0 · Published by sdreyer over 1 year ago
You can use Featureform to define and orchestrate data pipelines that generate embeddings, and Featureform can write them into Redis for nearest-neighbor lookup. This also allows users to version, re-use, and manage embeddings declaratively.
redis = ff.register_redis(
    name="redis",
    description="Example inference store",
    team="Featureform",
    host="0.0.0.0",
    port=6379,
)
import openai

docs = spark.register_file(...)

@spark.df_transform(
    inputs=[docs],
)
def embed_docs(docs):
    docs["embedding"] = docs["text"].map(lambda txt: openai.Embedding.create(
        model="text-embedding-ada-002",
        input=txt,
    )["data"])
    return docs
@ff.entity
class Article:
    embedding = ff.Embedding(
        embed_docs[["id", "embedding"]],
        dims=1024,
        variant="test-variant",
        vector_db=redis,
    )
client.Nearest(Article.embedding, "id_123", 25)
You can already interact with sources as dataframes; this release adds the same functionality to training sets as well.
import featureform as ff

client = ff.Client(...)
df = client.training_set("fraud", "simple").dataframe()
print(df.head())
Featureform supports cron syntax for scheduling transformations to run. This release revamps this functionality to make it more stable and efficient, and also adds more verbose error messages.
@snowflake.sql_transform(schedule="0 * * * *")
def avg_transaction_price():
    return "SELECT user, AVG(price) FROM {{transaction}} GROUP BY user"
Featureform schedules and runs your transformations for you. We support running Pandas directly: Featureform spins up a Kubernetes job to run it. This isn't a replacement for distributed processing frameworks like Spark (which we also support), but it's a great option for teams that are already using Pandas for production.
aws_creds = ff.AWSCredentials(
    aws_access_key_id="<aws_access_key_id>",
    aws_secret_access_key="<aws_secret_access_key>",
)

s3 = ff.register_s3(
    name="s3",
    credentials=aws_creds,
    bucket_path="<s3_bucket_path>",
    bucket_region="<s3_bucket_region>",
)

pandas_k8s = ff.register_k8s(
    name="k8s",
    description="Native featureform kubernetes compute",
    store=s3,
    team="featureform-team",
)

src = pandas_k8s.register_file(...)

@pandas_k8s.df_transform(inputs=[src])
def transform(src):
    return src.groupby("CustomerID")["TransactionAmount"].mean()
v0.8.1 · Published by sdreyer over 1 year ago
Full Changelog: https://github.com/featureform/featureform/compare/v0.8.0...v0.8.1
v0.8.0 · Published by ahmadnazeri over 1 year ago
from featureform import Client

client = Client()  # presumes $FEATUREFORM_HOST is set
client.apply(insecure=False)  # `insecure=True` for Docker (Quickstart only)

# Primary source as a dataframe
transactions_df = client.dataframe(
    transactions, limit=2
)  # Using the ColumnSourceRegistrar instance directly with a limit of 2 rows

# SQL transformation source as dataframe
avg_user_transaction_df = client.dataframe(
    "average_user_transaction", "quickstart"
)  # Using the source name and variant without a limit, which fetches all rows

print(transactions_df.head())
"""
"transactionid" "customerid" "customerdob" "custlocation" "custaccountbalance" "transactionamount" "timestamp" "isfraud"
0 T1 C5841053 10/1/94 JAMSHEDPUR 17819.05 25.0 2022-04-09T11:33:09Z False
1 T2 C2142763 4/4/57 JHAJJAR 2270.69 27999.0 2022-03-27T01:04:21Z False
"""
Full Changelog: https://github.com/featureform/featureform/compare/v0.7.3...v0.8.0
v0.7.3 · Published by sdreyer over 1 year ago
- timestamp_column when registering features/labels
Full Changelog: https://github.com/featureform/featureform/compare/v0.7.2...v0.7.3
v0.7.2 · Published by sdreyer over 1 year ago
Full Changelog: https://github.com/featureform/featureform/compare/v0.7.1...v0.7.2
v0.7.1 · Published by sdreyer over 1 year ago
v0.7.0 · Published by sdreyer over 1 year ago
Featureform has added a new way to define entities, features, and labels. This new API, which takes inspiration from Python ORMs, makes it easier for data scientists to define and manage their features and labels in code.
Example
transactions = postgres.register_table(
    name="transactions",
    table="Transactions",  # This is the table's name in Postgres
)

@postgres.sql_transformation()
def average_user_transaction():
    return "SELECT CustomerID as user_id, avg(TransactionAmount) " \
           "as avg_transaction_amt from {{transactions.default}} GROUP BY user_id"

@ff.entity
class User:
    avg_transactions = ff.Feature(
        average_user_transaction[["user_id", "avg_transaction_amt"]],
        type=ff.Float32,
        inference_store=redis,
    )
    fraudulent = ff.Label(
        transactions[["customerid", "isfraud"]], variant="quickstart", type=ff.Bool
    )

ff.register_training_set(
    "fraud_training",
    label="fraudulent",
    features=["avg_transactions"],
)
You can read more in the docs.
A highly requested feature was to feature-ize incoming data at serving time. For example, you may have an on-demand feature that turns a user comment into an embedding, or one that processes an incoming image.
On-demand feature that turns a comment into an embedding at serving time:
@ff.ondemand_feature
def text_to_embedding(serving_client, params, entities):
    return bert_transform(params["comment"])
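A hypothetical serving call for the feature above; the exact client signature may differ, but params is the dict the on-demand function indexes into:
client = ff.Client(...)

# Hypothetical: computes text_to_embedding at serving time with the passed params.
embedding = client.features(
    ["text_to_embedding"],
    entities={},
    params={"comment": "Great product, fast shipping!"},
)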
You can learn more in the docs.
All features, labels, transformations, and training sets now have tags and properties arguments. properties is a dict and tags is a list.
client.register_training_set("CustomerLTV_Training", "default", label="ltv", features=["f1", "f2"], tags=["revenue"], properties={"visibility": "internal"})
You can read more in the docs.
Featureform has a local mode that allows users to define, manage, and serve their features when working locally on their laptops. It doesn't require anything to be deployed. Historically, it would re-generate training sets and features on each run; with 0.7, we cache results by default to decrease iteration time.
Full Changelog: https://github.com/featureform/featureform/compare/v0.6.4...v0.7.0
v0.6.4 · Published by sdreyer over 1 year ago
v0.6.3 · Published by sdreyer over 1 year ago
v0.6.2 · Published by sdreyer over 1 year ago
Full Changelog: https://github.com/featureform/featureform/compare/v0.6.1...v0.6.2
v0.6.1 · Published by sdreyer over 1 year ago
Full Changelog: https://github.com/featureform/featureform/compare/v0.6.0...v0.6.1
v0.6.0 · Published by sdreyer over 1 year ago
Featureform has had support for Spark on EMR and Spark on Databricks for a while. We’ve generalized our Spark implementation to handle all versions of Spark using any of S3, GCS, Azure Blob Store, or HDFS as a backing store!
spark_creds = ff.SparkCredentials(
    master=master_ip_or_local,
    deploy_mode="client",
    python_version=cluster_py_version,
)

gcp_creds = ff.GCPCredentials(
    project_id=project_id,
    credentials_path=path_to_gcp_creds,
)

gcs = ff.register_gcs(
    name=gcs_provider_name,
    credentials=gcp_creds,
    bucket_name="bucket_name",
    bucket_path="directory/",
)

spark = ff.register_spark(
    name=spark_provider_name,
    description="A Spark deployment we created for the Featureform quickstart",
    team="featureform-team",
    executor=spark_creds,
    filestore=gcs,
)
databricks = ff.DatabricksCredentials(
    host=host,
    token=token,
    cluster_id=cluster,
)

azure_blob = ff.register_blob_store(
    name="blob",
    account_name=os.getenv("AZURE_ACCOUNT_NAME", None),
    account_key=os.getenv("AZURE_ACCOUNT_KEY", None),
    container_name=os.getenv("AZURE_CONTAINER_NAME", None),
    root_path="testing/ff",
)

spark = ff.register_spark(
    name="spark-databricks-azure",
    description="A Spark deployment we created for the Featureform quickstart",
    team="featureform-team",
    executor=databricks,
    filestore=azure_blob,
)
spark_creds = ff.SparkCredentials(
    master=master_ip_or_local,
    deploy_mode="client",
    python_version=cluster_py_version,
)

aws_creds = ff.AWSCredentials(
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID", None),
    aws_secret_access_key=os.getenv("AWS_SECRET_KEY", None),
)

s3 = ff.register_s3(
    name="s3-quickstart",
    credentials=aws_creds,
    bucket_path=os.getenv("S3_BUCKET_PATH", None),
    bucket_region=os.getenv("S3_BUCKET_REGION", None),
)

spark = ff.register_spark(
    name="spark-generic-s3",
    description="A Spark deployment we created for the Featureform quickstart",
    team="featureform-team",
    executor=spark_creds,
    filestore=s3,
)
spark_creds = ff.SparkCredentials(
    master=os.getenv("SPARK_MASTER", "local"),
    deploy_mode="client",
    python_version="3.7.16",
)

hdfs = ff.register_hdfs(
    name="hdfs_provider",
    host=host,
    port="9000",
    username="hduser",
)

spark = ff.register_spark(
    name="spark-hdfs",
    description="A Spark deployment we created for the Featureform quickstart",
    team="featureform-team",
    executor=spark_creds,
    filestore=hdfs,
)
You can read more in the docs.
A highly requested feature was to add a lineage link between models and their features & training sets. Now, when you serve a feature or training set, you can include an optional model argument.
client.features("review_text", entities={"order": "df8e5e994bcc820fcf403f9a875201e6"}, model="sentiment_analysis")
client.training_set("CustomerLTV_Training", "default", model="linear_ltv_model")
It can then be viewed via the CLI & the Dashboard.
You can learn more in the docs.
Backup and recovery was originally exclusive to our enterprise offering. It is our goal to open-source everything in the product that isn’t related to governance, though we often first pilot new features with clients as we nail down the API.
> python backup/create_secret.py --help
Usage: create_secret.py [OPTIONS] COMMAND [ARGS]...
Generates a Kubernetes secret to store Featureform backup data.
Use this script to generate the Kubernetes secret, then apply it with:
`kubectl apply -f backup_secret.yaml`
Options:
  -h, --help  Show this message and exit.

Commands:
  azure  Create secret for azure storage containers
  gcs    Create secret for GCS buckets
  s3     Create secret for S3 buckets
helm upgrade featureform featureform/featureform [FLAGS] --set backup.enable=true --set backup.schedule=<schedule>
Where <schedule> is in cron syntax; for example, an hourly backup would look like:
"0 * * * *"
Recovering from a backup is simple. In backup/restore, edit the .env-template file with your cloud provider name and credentials, then rename it to .env. A specific snapshot can be used by filling in the SNAPSHOT_NAME variable in the .env file. After that, run recover.sh in that directory.
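As a sketch, the resulting .env might look like the following; every variable name here besides SNAPSHOT_NAME is an assumption, so defer to the names in .env-template:
# Hypothetical .env for an S3-backed snapshot; variable names other than
# SNAPSHOT_NAME are illustrative placeholders.
CLOUD_PROVIDER="S3"
AWS_ACCESS_KEY_ID="<aws_access_key_id>"
AWS_SECRET_ACCESS_KEY="<aws_secret_access_key>"
SNAPSHOT_NAME="<snapshot_name>"  # optional: restore a specific snapshot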
You can learn more in the docs.
Prior to this release, if you were to rotate a key and/or change a credential you’d have to create a new provider. We made things immutable to avoid people accidentally overwriting each other's providers; however, this blocked the ability to rotate keys. Now, provider changes work as an upsert.
For example if you had registered Databricks and applied it like this:
databricks = ff.DatabricksCredentials(
    host=host,
    token=old_token,
    cluster_id=cluster,
)
You could change it by simply changing the config and re-applying it.
databricks = ff.DatabricksCredentials(
    host=host,
    token=new_token,
    cluster_id=cluster,
)
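Then re-apply your definitions. For example, if they live in a file named definitions.py (the file name here is illustrative):
featureform apply definitions.py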
Prior to this release, you could only search resources from the dashboard. We’ve added the same functionality into the CLI. Our goal is to stay as close to feature parity between the dashboard and CLI as possible.
Featureform has historically made all resources immutable to solve a variety of different problems. This includes upstreams changing and breaking downstreams. Over the next couple releases we expect to dramatically pull back on forcing immutability while still avoiding the most common types of problems.
Featureform apply now works as an upsert. For providers specifically, you can change most of their fields. This also makes it possible to rotate secrets and change credentials, as outlined earlier in these release notes.
Older deployments of Snowflake used an Account Locator rather than an Organization/Account pair to connect. For those deployments, you can now use our register_snowflake_legacy method.
ff.register_snowflake_legacy(
    name="snowflake_docs",
    description="Example training store",
    team="Featureform",
    username=snowflake_username,
    password=snowflake_password,
    account_locator=snowflake_account_locator,
    database=snowflake_database,
    schema=snowflake_schema,
)
You can learn more in the docs.
Pandas on K8s is still an experimental feature that we're continuing to expand on. Previously, you could only specify container limits globally; now, for especially heavy or light transformations, you can get more granular with your specifications as follows:
resource_specs = K8sResourceSpecs(
    cpu_request="250m",
    cpu_limit="500m",  # CPU in millicores
    memory_request="50Mi",  # memory in mebibytes
    memory_limit="100Mi",
)

@k8s.df_transformation(
    inputs=[("transactions", "v2")],
    resource_specs=resource_specs,
)
def transform(transactions):
    pass
You can learn more in the docs.
v0.5.1 · Published by sdreyer over 1 year ago
Full Changelog: https://github.com/featureform/featureform/compare/v0.5.0...v0.5.1
v0.5.0 · Published by sdreyer over 1 year ago
Full Changelog: https://github.com/featureform/featureform/compare/v0.4.0...v0.5.0
v0.4.6 · Published by sdreyer over 1 year ago
Full Changelog: https://github.com/featureform/featureform/compare/v0.4.5...v0.4.6