aistore

AIStore: scalable storage for AI applications

MIT License


aistore -

Published by alex-aizman about 4 years ago

Highlights

  • (new) ETL offload: support for running custom extract-transform-load workloads on (and by) the storage cluster;
  • (new) TensorFlow integration to support existing training clients that use S3 API - done via tar2tf ETL offload that handles on-the-fly TFRecord/tf.Example conversion;
  • List objects v2: optimized list-objects to greatly reduce response times;
  • (new) Query objects: extends list-objects with advanced filtering capabilities;
  • (new) Downloader: an option to keep AIS bucket in-sync with a (downloaded) destination;
  • (new) Information Center (IC), to improve visibility and manageability of the asynchronous batch operations (such as global rebalance, n-way mirroring, erasure coding, ETL, and more);
  • (new) role-based authentication;
  • Distributed Shuffle (dSort) - performance improvements;
  • multi-checksumming, with per-dataset configurable checksum and (new) support for cryptographic checksums.

And also:

  • performance optimizations, CLI usability improvements, erasure coding optimizations, automated no-downtime rebalancing for erasure-coded buckets, refactoring, cleanup, and stability fixes across the board.

Downloader

Skip already downloaded/existing objects, limit download speed, support Azure Cloud, option to synchronize Cloud into AIS bucket, numerous CLI improvements.

  • New API (and CLI) option to keep Cloud bucket and AIS bucket in-sync - #760, !2322
  • Throttle download - #726
  • Download an entire bucket (an option that specifies a range or list of objects to download can now be omitted) - #759
  • Store 3rd party Cloud metadata (version, md5) as part of the AIS object's own metadata; use Cloud metadata for multi-versioning (latest version) and data protection - #701
  • Progress Bar when downloading from Cloud - #773
  • Downloader to support Azure Cloud - #763
  • CLI: download prefix-ed objects - !2204
  • Fix re-downloading a cloud bucket (skip downloading when have identical local replica) - !2221, !2236
  • Downloading a Cloud bucket can now be done only to an AIS bucket that has an associated Cloud backend - !2241
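
The skip-if-identical behavior described above can be pictured as a comparison between the Cloud metadata stored with the local object and the metadata of the remote object. The following is a minimal sketch under stated assumptions: the `CloudMeta` record and the `needs_download` helper are hypothetical names introduced for illustration and are not part of any AIS API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CloudMeta:
    """Hypothetical record of 3rd-party Cloud metadata kept with a local object."""
    version: str
    md5: str


def needs_download(local: Optional[CloudMeta], remote: CloudMeta) -> bool:
    """Return True when the local replica is missing or differs from the remote object."""
    if local is None:
        return True  # no local replica yet
    # Any mismatch in version or MD5 means the local copy is stale.
    return local.version != remote.version or local.md5 != remote.md5
```

With a check of this kind, re-running a bucket download touches only objects that are new or have changed in the Cloud.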

Distributed Shuffle (dSort)

Reduce/optimize CPU and memory usage. Refactor and stabilize.

  • CLI usability and improvements - #768
  • Reduce memory usage - !2197
  • Number of workers per mountpath to optimize disk utilization - !2263
  • CLI: Add support for alternative output shard name formats - !2205
  • Use MessagePack instead of JSON for intra-cluster communications - !2262

Authentication server (AuthN)

Replace the old basic authentication with role-based authentication. Allow a single AuthN server to manage any number of AIS clusters. Add support for both HTTP and HTTPS AIS clusters. When AuthN is enabled, more API endpoints now require a token issued by AuthN (previously, all GET requests worked without any authentication).

  • Use BuntDB to persist all authentication data (instead of previously used separate JSON files) - !2146, !2178
  • Remove (obsolete) user Cloud credentials management - !2146
  • Support multiple AIS clusters with automatic HTTP/HTTPS selection - !2153
  • CLI: new AuthN management commands: add/remove/show user/show cluster - !2153
  • Introduce user roles (admin/cluster owner/bucket owner/read-only) - !2213
  • When AuthN is deployed, the majority of requests to an AIS cluster must carry a valid AuthN token (previously only PUT operations) - !2284

List and Query objects

Revised and fast list-objects. Reduce memory usage. Use MessagePack. Employ bigger pages to speed up listing operations.

Experimental support for caching: list-objects results can now be reused across multiple users/requests.

  • Massive speed-up via streamable listing - #850, #856, #862, #851, !2494
  • list-objects API is now always paged; remove -fast option as obsolete - !2539
  • Use MessagePack for intra-cluster communications; optionally, employ MessagePack for client <=> cluster requests as well - !2568
  • Additional options to control list-objects content: only-cached, include-misplaced - !2613
  • Rename page marker as continuation token and fix the paging semantics accordingly - !2592
  • Use bigger pages (10,000 by default) for AIS buckets; use 10K-size pages for Cloud buckets for only-cached option - !2645
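
Always-paged listing with a continuation token means a client loops until the token comes back empty. The sketch below illustrates that loop against an in-memory stand-in; `list_all`, `make_fake_bucket`, and the convention that the token is the last name returned are assumptions made for illustration, not the AIS wire format.

```python
import bisect
from typing import Callable, Iterator, List, Tuple

# A page fetcher takes (continuation_token, page_size) and returns
# (object_names, next_token); an empty next_token means "no more pages".
PageFetcher = Callable[[str, int], Tuple[List[str], str]]


def list_all(fetch_page: PageFetcher, page_size: int = 10_000) -> Iterator[str]:
    """Yield every object name by following continuation tokens page by page."""
    token = ""
    while True:
        names, token = fetch_page(token, page_size)
        yield from names
        if not token:
            break


def make_fake_bucket(names: List[str]) -> PageFetcher:
    """In-memory stand-in for a cluster; the token is the last name returned."""
    ordered = sorted(names)

    def fetch_page(token: str, page_size: int) -> Tuple[List[str], str]:
        # Resume strictly after the last name handed out in the previous page.
        start = bisect.bisect_right(ordered, token) if token else 0
        page = ordered[start:start + page_size]
        more = start + page_size < len(ordered)
        return page, (page[-1] if more and page else "")

    return fetch_page
```

Because the token encodes a position rather than a page number, a listing can resume correctly even if a request is retried.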

Query objects

New API that extends list-objects with added support for filtering and selection (a so-called inner and outer SELECT).

  • Add init and next API - #754, !2399
  • Use MessagePack instead of JSON (client side) - !2672
  • Add support for querying Cloud buckets - !2521

Data protection

No more hardcoded xxhash as AIS checksum for objects: any checksum can be selected from a list that currently also includes MD5, SHA, CRC, and can be easily extended.

  • Multiple per-bucket configurable checksums - #722, !2154, !2187
  • SHA-256 and SHA-512 - !2190
  • Self-healing: automatic restore of a corrupted object from EC slices and/or mirrored replicas - !2196
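
A per-bucket configurable checksum boils down to selecting a hash function by name instead of hardcoding one. The following sketch uses only the Python standard library; the type names ("md5", "sha256", "sha512", "crc32") and the helpers `compute_checksum` and `verify` are illustrative assumptions and do not necessarily match the identifiers AIS uses. xxhash is omitted because it is not in the standard library.

```python
import hashlib
import zlib
from typing import Callable, Dict


def _crc32(data: bytes) -> str:
    """CRC-32 of the data as 8 lowercase hex digits."""
    return format(zlib.crc32(data) & 0xFFFFFFFF, "08x")


# Registry of supported checksum types; adding a type means adding one entry.
CHECKSUMS: Dict[str, Callable[[bytes], str]] = {
    "md5": lambda d: hashlib.md5(d).hexdigest(),
    "sha256": lambda d: hashlib.sha256(d).hexdigest(),
    "sha512": lambda d: hashlib.sha512(d).hexdigest(),
    "crc32": _crc32,
}


def compute_checksum(kind: str, data: bytes) -> str:
    """Compute the checksum of the given type, or raise ValueError if unknown."""
    try:
        return CHECKSUMS[kind](data)
    except KeyError:
        raise ValueError(f"unsupported checksum type: {kind!r}") from None


def verify(kind: str, data: bytes, expected: str) -> bool:
    """Return True when the data matches the expected checksum value."""
    return compute_checksum(kind, data) == expected.lower()
```

A failed `verify` is the kind of signal that would trigger restoring an object from EC slices or mirrored replicas.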

CLI

Numerous improvements and bug fixes. In particular, new command-line options, shorter commands, more readable output, improved TAB-TAB support.

  • Show target uptime in show cluster - #744
  • PUT object from stdin (ais put object bck/obj -) - #748
  • s3:// and gs:// are aliases for aws:// and gcp:// - !1789
  • Rename register as join (as in: join new cluster node) - !1988
  • TAB-TAB and output improvements - #649, #772, !1888, !1857
  • User-provided checksum and end-to-end data protection - #779
  • Improve show cluster to display a single JSON output - #810
  • Add --chunk-size option for PUT object - !2164
  • Improve show object command - !2185
  • Add search command - !2400
  • All ais start xaction <name> commands are now ais start <name> - !2448
  • Run LRU on a list of specified buckets - allow user to temporarily override bucket's own LRU configuration - !2493
  • Improve set props command to show what's actually changed - !2479
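
The provider aliases listed above (s3:// for aws://, gs:// for gcp://) amount to a small normalization step when parsing a bucket reference. The sketch below shows one way to do it; the function name `parse_bucket_uri`, the set of known providers (including the "azure" scheme), and the default of "ais" when no provider is given are assumptions made for illustration.

```python
from typing import Tuple

# Aliases taken from the release notes: s3:// -> aws://, gs:// -> gcp://.
ALIASES = {"s3": "aws", "gs": "gcp"}
# Assumed set of provider schemes; "azure" in particular is a guess at the scheme name.
KNOWN_PROVIDERS = {"ais", "aws", "gcp", "azure"}


def parse_bucket_uri(uri: str) -> Tuple[str, str, str]:
    """Split 'provider://bucket/object' into (provider, bucket, object_name)."""
    provider, sep, rest = uri.partition("://")
    if not sep:
        # No scheme given: assume an AIS bucket (assumption).
        provider, rest = "ais", uri
    provider = ALIASES.get(provider, provider)
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    bucket, _, object_name = rest.partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name in {uri!r}")
    return provider, bucket, object_name
```

Normalizing aliases at parse time keeps the rest of the code working with a single canonical provider name.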

Erasure Coding (EC)

  • Fix sending calculated slices on PUT objects - !2419
  • CLI: improve EC stats output - #823
  • Improve user experience on PUT - !2366
  • CLI: added options --parity-slices and --data-slices for the ais ec-encode command - !2387
  • Automatically enable EC when user starts erasure-coding of a given bucket (via start xaction or set props CLI, for instance) - !2377

Information Center (IC)

To monitor asynchronous operations (jobs) efficiently, AIStore employs what we call the Information Center (IC) - a group of gateways that “own” all the currently running (as well as already finished) jobs in the cluster. Those jobs, codenamed eXtended actions, or xactions, include global rebalance, n-way mirroring, erasure coding, ETL-type distributed workloads, and more. IC continuously monitors all async operations by coordinating with other clustered nodes.

  • Cluster-wide ID for cluster-wide xactions - !2294, !2551
  • Intra-cluster notifications for xactions - !2304, !2326, !2321, !2334, !2378, !2355, !2346
  • 3 (three) IC members by default - !2561
  • Support list- and query-objects caching - !2570
  • Always keep IC members in-sync with respect to currently running and finished async ops - !2639, !2648

Extract-Transform-Load (ETL) locally

  • In-cluster ETL v1.0 - #842, !2659, !2660, !2651
  • Target and ETL affinity - !2451
  • CLI: add support for ETL - !2453
  • List all transformations - !2498
  • aisloader: add support for ETL (for benchmarking) - !2573

AIS loader (aisloader)

Support TAR generating and reading. Support ETL benchmarking via included echo (at https://hub.docker.com/repository/docker/aistore/transformer_echo), md5, and tar2tf ETL containers.

  • Add TAR reader - !2585
  • Add support for standard AIS_ENDPOINT environment variable (options --port and --ip are still supported) - !2642
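
Supporting both AIS_ENDPOINT and the older --ip/--port options implies some resolution order on the client side. The sketch below shows one plausible order, which is an assumption and not documented behavior: explicit options win, then the environment variable, then a default. The helper name `resolve_endpoint`, the "http" scheme, and the default of localhost:8080 are likewise assumptions.

```python
import os
from typing import Mapping, Optional


def resolve_endpoint(
    ip: Optional[str] = None,
    port: Optional[int] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Pick the cluster endpoint: explicit options, then AIS_ENDPOINT, then a default."""
    if ip is not None or port is not None:
        # Explicit options take precedence (assumed precedence).
        return f"http://{ip or 'localhost'}:{port or 8080}"
    endpoint = env.get("AIS_ENDPOINT", "").strip()
    if endpoint:
        return endpoint.rstrip("/")
    return "http://localhost:8080"  # assumed default
```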

Local Playground + Kubernetes (for developers)

  • Add minikube based Kubernetes development environment - !2456, !2558, !2508
  • Enable Kubernetes-based testing on GitLab CI - !2510, !2562
  • Enable Kubernetes based tests on Jenkins - !2609, !2685

Build & Release

  • Scripts for automating release management; in particular, scripts to upload released AIS binaries - !2597
  • An option to build aisnode (AIS target and AIS proxy) Alpine Linux-based minimal-footprint docker image - !2709

Miscellaneous

Make names of used environment variables consistent. Introduce $trash directory to keep deleted buckets for a while. Safer and better node startup: assorted APIs are now accessible only after the node is up and running.

Extend Local Playground for developers: add K8s minikube.

  • Rename a bunch of environment variables used by ais/aisloader/cli for consistency - !2133
  • Extend create bucket API (allow setting props) - #782, !2266
  • Add special $trash directory to hold deleted buckets - !2351
  • Add minikube dev deployment - !2456
  • Node startup vs availability of assorted APIs - !2601, !2624

aistore -

Published by alex-aizman over 4 years ago

Highlights

AIStore v3.1 is a significant upgrade with new capabilities that include:

  • remote AIS clustering and unified global namespace
  • Azure Cloud as the 3rd supported Cloud provider (in addition to S3 and Google)
  • Amazon S3 API

And also:

  • TensorFlow integration (to transparently handle TFRecord and tf.Example formats)
  • performance optimizations
  • CLI usability improvements
  • erasure coding optimizations
  • automated no-downtime rebalancing for erasure-coded buckets
  • refactoring, cleanup, and stability fixes across the board

Core

  • remote AIS clustering, unified global namespace: #602, #667, !1937, !1954, !1958, !1959, !1963, !1964, !1965, !1966
  • Azure Cloud: !1856
  • Amazon S3 API: #690, #691
  • TensorFlow integration: #642, !2099
  • evict range, delete range, and prefetch range operations are now asynchronous: #641, !1778, !1785
  • cluster startup stability fixes and improvements: #707, !2084, !2047
  • new environment variable AIS_PRIMARY_ID: #706, !2033
  • EC rebalance speedup and improvements: #558, #670, !1765
  • return 503 (Service Unavailable) when a node is starting up but not ready yet: !2020
  • return 403 (Forbidden) when operation on object, bucket, or cluster is not permitted: !2121
  • new bucket property creation_date: !2010
  • new bucket property backend_bck for an AIS bucket connected to a Cloud one - it contains the name of the parent Cloud bucket: !2096
  • control-plane cluster-wide 2PC transactions to create, rename, destroy buckets, change bucket properties, etc.: !1852, !1862, !1876, !1844, !1825
  • new config option to avoid starting global rebalance at cluster startup (rebalance.dont_run_time): !2048
  • improved HTTPS support by all AIS built-in clients and components: !2106
  • new and extended bucket access permissions: !2121
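
The cluster-wide 2PC transactions mentioned above follow the general two-phase commit pattern: every node first agrees to the change, and only then is it applied everywhere. The sketch below illustrates the generic pattern for a bucket-creation request; the `Participant` class and `create_bucket_2pc` function are hypothetical and do not reflect the actual AIS control-plane code.

```python
from typing import Dict, List, Optional, Tuple


class Participant:
    """Hypothetical cluster node taking part in a two-phase commit."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy
        self.pending: Optional[Tuple[str, Dict[str, str]]] = None
        self.buckets: Dict[str, Dict[str, str]] = {}

    def begin(self, bucket: str, props: Dict[str, str]) -> bool:
        """Phase 1: stage the change and vote; False means 'cannot commit'."""
        if not self.healthy or bucket in self.buckets:
            return False
        self.pending = (bucket, dict(props))
        return True

    def commit(self) -> None:
        """Phase 2: apply the staged change."""
        assert self.pending is not None
        bucket, props = self.pending
        self.buckets[bucket] = props
        self.pending = None

    def abort(self) -> None:
        """Drop the staged change without applying it."""
        self.pending = None


def create_bucket_2pc(nodes: List[Participant], bucket: str, props: Dict[str, str]) -> bool:
    """Create a bucket on all nodes or on none of them."""
    prepared: List[Participant] = []
    for node in nodes:
        if node.begin(bucket, props):
            prepared.append(node)
        else:
            # One 'no' vote aborts the transaction on every prepared node.
            for p in prepared:
                p.abort()
            return False
    for node in nodes:
        node.commit()
    return True
```

The point of the two phases is that a single unavailable node leaves the rest of the cluster unchanged rather than partially updated.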

Config

  • new EC rebalance tunable batch_size: !1922
  • move client-related timeouts to a separate config section (client): !1901

Downloader

  • improve object downloading (retrying and checking for existence): !2024, !2026
  • fix downloading timeout issue for big objects: !2057
  • improve/extend CLI job info (error list, ETA, progress): #725, !2069, !2061, !2062
  • new CLI option to limit concurrency while downloading: !2088
  • download list of objects from GCP: !2114
  • support HTTPS links on the clients' side: !2119

CLI

  • remote AIS cluster support: #683
  • remove --provider flag in favor of provider://bucket_name syntax: !1763
  • simplify ls command by moving subcommands to show command: !1786
  • new command wait to wait until an xaction, dSort job, or download job finishes: #645
  • new command cat to show object's content: #646
  • new commands attach remote and detach remote (cluster): !1968
  • new commands attach mountpath and detach mountpath: !1986
  • new command set primary: !2053
  • rename compose command as concat: !1745
  • add --dry-run flag for put, evict, delete, and prefetch commands: #636, !1828
  • make ais put more intuitive when generating object names from file paths: #640
  • ranged prefetch/evict/delete operation uses the same pattern rules as dSort and downloader: !1793
  • add bucket namespaces: #602, !1943
  • command and flags renamings and regrouping, TAB-TAB completion improvements: #649, !1745, !1786, !1763, !1818, !1988, !2006
  • fix various panics when processing TAB completions: !1923

AIS FS

  • fix object listing (ls) for large buckets: #644

Rebalance

  • multiple fixes, improvements

Documentation

  • revise/extend AIStore Authentication Server (AuthN)
  • add numerous CLI usage examples
  • extend and revise Downloader sections
  • document CLI to attach, detach and show remote clusters
  • revise sections describing cloud providers; add Azure
  • rewrite AIStore overview
  • cluster rebalance: update docs and CLI

AuthN

  • to support Kubernetes secrets, read security settings from an environment variable: !2130

aistore -

Published by alex-aizman over 4 years ago

Highlights

  • new on-disk layout optimized for per-bucket management policies, namespace partitioning, and cloud provider isolation
    • in addition to checksum, all metadata is now versioned to support backward compatibility when (and if) there are any future changes
    • global (cluster-wide) control structures - cluster map and bucket metadata - are now uniformly GUID-protected and compressed
    • bucket metadata, in particular, exists in multiple protected copies on data drives of all storage targets
  • added AIS as the 3rd fully supported Cloud Provider (in addition to Amazon S3 and Google Cloud)
  • global (cluster-wide) rebalancing:
    • improved, optimized, and enhanced rebalancing logic
    • revised to run stage by (enumerated) stage whereby the stages get synchronized across all targets
    • added support for erasure-coded buckets
    • stabilized long-running operation in the presence of network failures, drive faults, cluster partitioning, administrative restarts
    • will retransmit any migrating object (or EC slice of an object) that didn't get acknowledged
  • resilvering: support erasure-coded buckets
  • CLI: usability improvements, APPEND, dSort configuration
  • AIS FS: namespace caching, config reload/refresh at runtime

Core

  • new on-disk layout (#580, #578, #594)
  • LOM on-disk (#604)
  • bucket groups and namespaces (!1616, !1608, !1607, !1598, !1597, !1593)
  • AIS cluster to cluster connectivity, AIS as a new Cloud Provider (#584)
  • Smap and BMD cluster-wide consistency (#542, !1159, !1154, !1549)
  • rebalance erasure coded buckets (#577, !1651)
  • erasure coding: improve and optimize on-disk metadata representation (!1468)
  • configuration changes: versioning (!1461), Cloud Provider (!1572, !1594)
  • reliable register (join)/unregister node (!1648)
  • improve AWS versioning support (!1471)
  • better and more reliable out-of-space handling (!1696)
  • memory management and Slab allocation; small-size allocator and its usage for LOM (!1685)
  • rebalance/intra-cluster transport: optimize-out heap allocations (!1650)

CLI

  • usability improvements (!1163)
  • new bucket summary (!1505)
  • APPEND API (#612, !1701)
  • allow overriding dSort configuration (!1692)
  • revise show xaction (!1704)

AIS FS: FUSE-based mountable filesystem to access objects as files

  • directory caching to optimize POSIX lookups (#563, #566, !1469)
  • config reload/refresh without unmounting (#568)

Rebalance

  • ACK and retransmit (#583)
  • support containerized deployments (#570)
  • recommence interrupted rebalance upon startup (!1661)
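
ACK-and-retransmit during rebalance means the sender keeps track of what has not been acknowledged and sends it again. A minimal sketch of that bookkeeping follows; the function name `send_with_retransmit`, the `send` callback that returns True on acknowledgment, and the fixed number of rounds are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Set


def send_with_retransmit(
    objects: Iterable[str],
    send: Callable[[str], bool],
    max_rounds: int = 3,
) -> Set[str]:
    """Send each object, retransmitting unacknowledged ones; return what is still unACKed."""
    pending = list(objects)
    for _ in range(max_rounds):
        if not pending:
            break
        # Keep only the objects whose transmission was not acknowledged this round.
        pending = [name for name in pending if not send(name)]
    return set(pending)
```

Anything left in the returned set after the final round would need to be reported or retried later, for example when an interrupted rebalance is resumed at startup.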

API/SDK

  • HEAD object request will now return erasure coding info as well (!1550)
  • fast bucket list (ls) now supports paging (!1475)
  • (bucket, provider, namespace) triplet structure used across numerous API calls (!1598, !1608, !1616)

Development

  • make and build: enhancements and improvements to consolidate most (and the most often used) build, run, and test operations (#564, !1466, !1483, !1498, !1512)
  • add support for Darwin (OSX/Mac) (!1526)

Kubernetes; containerized deployments

  • revise node labeling; fix aisnode container start script (!1725)
  • demo infrastructure for GTC; assorted fixes (!1699)
  • single-node-aistore: docker image for easy and fast turn-key single-host deployments

Documentation

  • on-disk layout
  • multiple corrections and additions