Priam

Co-Process for backup/recovery, Token Management, and Centralized Configuration management for Cassandra.

APACHE-2.0 License

Priam - PostRestoreHook logging improvements

Published by sumanth-pasupuleti over 6 years ago

Reduces overly verbose logging by changing the start and stop execution messages from info to debug, and logs the postrestorehook command.

Priam - Add post restore hook as part of restore process

Published by sumanth-pasupuleti over 6 years ago

PostRestoreHook is executed once the files are downloaded as part of the restore process, before starting C*.
PostRestoreHook has several configuration options:

CONFIG_POST_RESTORE_HOOK_ENABLED - indicates whether postrestorehook is enabled.
CONFIG_POST_RESTORE_HOOK - the command, with arguments, to be executed as the postrestorehook. Priam waits for this hook to complete before starting C*.
CONFIG_POST_RESTORE_HOOK_HEARTBEAT_FILENAME - heartbeat file that postrestorehook emits. Priam keeps tabs on this file to make sure postrestorehook is making progress. If it is not, Priam kills the existing process (if it still exists) and spawns a new postrestorehook process.
CONFIG_POST_RESTORE_HOOK_DONE_FILENAME - 'done' file that postrestorehook creates when it finishes executing.
CONFIG_POST_RESTORE_HOOK_TIMEOUT_IN_DAYS - maximum time that Priam waits before killing the postrestorehook process (if it has not already completed).
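
As an illustration of the heartbeat and 'done' file contract described above, here is a minimal sketch of what a hook command might do on its side. It is not taken from Priam: the function names, the notion of "steps", and the file paths are all assumptions made for the example. The only behavior drawn from the notes is that the hook refreshes a heartbeat file while it makes progress and creates a 'done' file when it completes.

```python
import os


def touch(path):
    """Create the file if it is missing and refresh its modification time."""
    with open(path, "a"):
        os.utime(path, None)


def run_hook(steps, heartbeat_file, done_file):
    """Hypothetical hook body: run each step, then signal progress.

    `steps` is a list of zero-argument callables standing in for the real
    post-restore work (an assumption for this sketch).
    """
    # Remove a stale 'done' file left behind by an earlier attempt.
    if os.path.exists(done_file):
        os.remove(done_file)
    for step in steps:
        step()
        # Refresh the heartbeat after every step so a watcher can see progress.
        touch(heartbeat_file)
    # Signal completion only after every step has succeeded.
    touch(done_file)
```

A script like this would be passed as the CONFIG_POST_RESTORE_HOOK command, with the two paths matching the heartbeat and 'done' filename settings.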
Priam - Mark snapshot as a failure if there is an issue with uploading a file.

Published by arunagrawal84 over 6 years ago

(#680): Mark snapshot as a failure if there is an issue with uploading a file. This ensures that we fail fast, in contrast to the previous behavior, where the snapshot would "ignore" any failure in uploading a file and still be marked as "success".

Since that was not truly a "success", marking it as a "failure" is the right thing to do. Also, meta.json should be uploaded only in case of "success" and not in case of "failure", since the presence of "meta.json" marks the backup as successful.

The case for fail-fast: if an issue occurs, say, at the start of the backup, it makes more sense to fail fast than to keep uploading other files (wasting bandwidth and backup resources). The remediation step for a backup failure is to take a full snapshot again anyway.
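
The fail-fast behavior can be sketched roughly as follows. This is an illustrative model only, not Priam's implementation: `take_snapshot`, `UploadError`, and the `upload` callable are hypothetical names, and meta.json is written to a local directory here purely to keep the example self-contained.

```python
import json
import os


class UploadError(Exception):
    """Raised when any file in the snapshot fails to upload."""


def take_snapshot(files, upload, dest_dir):
    """Upload every file; write meta.json only if all uploads succeed."""
    uploaded = []
    for name in files:
        try:
            upload(name)
        except Exception as exc:
            # Fail fast: stop at the first failure instead of uploading the rest.
            raise UploadError(f"snapshot failed while uploading {name}") from exc
        uploaded.append(name)
    # The presence of meta.json marks the backup as successful,
    # so it is written last and only on the success path.
    meta_path = os.path.join(dest_dir, "meta.json")
    with open(meta_path, "w") as fh:
        json.dump(uploaded, fh)
    return meta_path
```

Because meta.json is produced only after the loop finishes, a failed snapshot never leaves behind a marker that would make it look complete.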

Priam - Mark snapshot as a failure if there is an issue with uploading a file.

Published by arunagrawal84 over 6 years ago

(#679) Mark snapshot as a failure if there is an issue with uploading a file. This ensures that we fail fast, in contrast to the previous behavior, where the snapshot would "ignore" any failure in uploading a file and still be marked as "success".

Since that was not truly a "success", marking it as a "failure" is the right thing to do. Also, meta.json should be uploaded only in case of "success" and not in case of "failure", since the presence of "meta.json" marks the backup as successful.

The case for fail-fast: if an issue occurs, say, at the start of the backup, it makes more sense to fail fast than to keep uploading other files (wasting bandwidth and backup resources). The remediation step for a backup failure is to take a full snapshot again anyway.

Priam - Change the default location of backup verification file and backup.status

Published by arunagrawal84 over 6 years ago

  • (#678): Change the default location of backup status and downloaded meta.json as part of backup verification
Priam - Change the default location of backup verification file and backup.status

Published by arunagrawal84 over 6 years ago

  • (#677) Change the default location of backup status and downloaded meta.json as part of backup verification
Priam - Graceful Shutdown of Cassandra

Published by jolynch over 6 years ago

New Features

  • (#665) Cassandra Process Manager can be configured to stop gracefully using the new gracefulDrainHealthWaitSeconds option. If this option is set to a non-negative integer (>= 0), then before calling the shutdown script, Priam fails healthchecks (InstanceState.isHealthy) for the configured number of seconds, then issues a nodetool drain with a 30s timeout (since drain can hang), and finally calls the provided stop script. By default this is set to -1, which disables the feature for backwards compatibility. This is useful if you want to gracefully drain Cassandra clients off a node before running drain (which kills the Native/Thrift server and resets any TCP connections that were established; in-flight requests can get dropped), then run drain to safely stop Cassandra, and then call your stop script. If your service discovery system does not integrate with Priam's health system, or your stop script already does all these things, leave this functionality disabled.
  • (#665) The /v1/cassadmin/stop HTTP API call now takes an optional force parameter (e.g. /v1/cassadmin/stop?force=true), which skips the graceful path for that particular stop; the default value is false.
  • (#650) Enable auth on the jmx port via jmxUsername and jmxPassword options. By default these are null and not provided.
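
The graceful stop sequence described above can be modeled with a short sketch. This is not Priam's code: the function and its callback parameters (`fail_healthchecks`, `drain`, `run_stop_script`, `sleep`) are hypothetical stand-ins, used only to show the order of operations and how a negative wait value or force=true bypasses the graceful path.

```python
import time


def stop_cassandra(graceful_drain_health_wait_seconds, force,
                   fail_healthchecks, drain, run_stop_script,
                   sleep=time.sleep):
    """Model of the stop sequence; returns True if the graceful path ran."""
    # A negative wait (the default of -1) disables the graceful path,
    # and force=True skips it for this particular stop.
    graceful = graceful_drain_health_wait_seconds >= 0 and not force
    if graceful:
        # 1. Fail healthchecks so clients can move off the node.
        fail_healthchecks()
        sleep(graceful_drain_health_wait_seconds)
        # 2. Drain with a bounded timeout, since drain can hang.
        drain(timeout_seconds=30)
    # 3. The stop script is always called last.
    run_stop_script()
    return graceful
```

Note that a value of 0 still takes the graceful path in this model: healthchecks are failed and drain is issued, just without a waiting period in between.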

Bug Fixes

  • (#662) Update commons-io, aws-java-sdk, snakeyaml

Breaking changes

  • (#665) If you previously implemented ICassandraProcess internally, the stop method has been refactored to take a boolean force parameter. If you implement this interface, you can supply false to preserve the previous behavior.
Priam - Graceful Shutdown of Cassandra

Published by jolynch over 6 years ago

New Features

  • (#664) Cassandra Process Manager can be configured to stop gracefully using the new gracefulDrainHealthWaitSeconds option. If this option is set to a non-negative integer (>= 0), then before calling the shutdown script, Priam fails healthchecks (InstanceState.isHealthy) for the configured number of seconds, then issues a nodetool drain with a 30s timeout (since drain can hang), and finally calls the provided stop script. By default this is set to -1, which disables the feature for backwards compatibility. This is useful if you want to gracefully drain Cassandra clients off a node before running drain (which kills the Native/Thrift server and resets any TCP connections that were established; in-flight requests can get dropped), then run drain to safely stop Cassandra, and then call your stop script. If your service discovery system does not integrate with Priam's health system, or your stop script already does all these things, leave this functionality disabled.
  • (#664) The /v1/cassadmin/stop HTTP API call now takes an optional force parameter (e.g. /v1/cassadmin/stop?force=true), which skips the graceful path for that particular stop; the default value is false.
  • (#650) Enable auth on the jmx port via jmxUsername and jmxPassword options. By default these are null and not provided.

Bug Fixes

  • (#659) Fix to Snapshotstatus to actually contain bkupMetadata
  • (#661) Update commons-io, aws-java-sdk, snakeyaml

Breaking changes

  • (#664) If you previously implemented ICassandraProcess internally, the stop method has been refactored to take a boolean force parameter. If you implement this interface, you can supply false to preserve the previous behavior.
Priam - Backup status bug fix

Published by tulumvinh over 6 years ago

Eliminate assumption that existence of an element in a data structure means successful backup.

Priam - Autoremediate Refactor

Published by jolynch over 6 years ago

New Features

  • (#639) backup.status is now a variable
  • (#647) SDB clients standardized

Bugs

  • (#658) Autostart functionality now uses timers instead of ratelimiters so that
    the first autostart does not start until an interval after the first start.
  • (#632) Duplicate slf4j bindings excluded
  • (#643) Gracefully shut down quartz
Priam - Autoremediate Refactor

Published by jolynch over 6 years ago

  • Autostart functionality now uses timers instead of ratelimiters so that the first autostart does not start until an interval after the first start.
Priam - Metrics for Cassandra Process Manager and bug fixes

Published by vinaykumarchella almost 7 years ago

New Features

  • Cassandra Process Manager and Monitor now record metrics when C* is stopped, started or auto-started with recent autorestart functionality.
  • Location of backup status file is now configurable via configuration priam.backup.status.location.
  • SDBInstance for token management with default binding to us-east-1 but configurable via priam.sdb.instanceIdentity.region.

Bugs

  • Exclude duplicate slf4j module binding.
  • Shut down quartz at application stop
Priam - Gradle 4.4 and Autoremediate Bugfixes

Published by jolynch almost 7 years ago

  • Gradle 4.4 Support
  • Autostart functionality now only sets shouldCassandraBeAlive flag from
    the start api to prevent a race against the stop API in the monitoring
    thread.
Priam - Priam autoremediates dead Cassandra

Published by jolynch almost 7 years ago

Bugs

  • None

New Features

  • Priam will now automatically restart Cassandra if it fails. If you use
    Priam to stop Cassandra (via the API), it will not automatically restart
    Cassandra until a subsequent start via the API. You can control this
    via the priam.remediate.dead.cassandra.rate configuration option. If
    negative, auto-remediation is disabled; if zero, Priam auto-remediates
    immediately on any failure; and if a positive integer, auto-remediation
    waits for that number of seconds between restarts. The default is 3600
    seconds (one hour).
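
The three cases for priam.remediate.dead.cassandra.rate can be summarized in a small sketch. The helper name `remediation_delay` is hypothetical and not part of Priam; it only encodes the negative, zero, and positive cases described above.

```python
def remediation_delay(rate_seconds):
    """Map the configured rate to a restart delay in seconds.

    Returns None when auto-remediation is disabled, 0 for an immediate
    restart, or the number of seconds to wait between restarts.
    """
    if rate_seconds < 0:
        # Negative values disable auto-remediation entirely.
        return None
    # Zero restarts immediately; a positive value is the wait between restarts.
    return rate_seconds
```

For example, `remediation_delay(-1)` means Priam would never restart Cassandra on its own, while `remediation_delay(600)` corresponds to waiting ten minutes between restarts.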

Breaking Changes

  • None
Priam - SNS Notifications

Published by arunagrawal84 almost 7 years ago

  • New feature: SNS notification service for backups/snapshots. Disabled by default. Use priam.backup.notification.topic.arn to enable this.
  • Set broadcast_rpc_address in cassandra.yaml to private-IP of the EC2 instance. This will allow COPY_TO, COPY_FROM functionality to work.
  • Bug Fix: Potential non-determinism in the event notification.
  • Bug Fix: Filtered CFs in backups were left on the disk.