10 changes: 9 additions & 1 deletion README.md
@@ -54,6 +54,7 @@ You can run the termination handler on any Kubernetes cluster running on AWS, in
- EC2 Instance Rebalance Recommendation
- EC2 Auto-Scaling Group Termination Lifecycle Hooks to take care of ASG Scale-In, AZ-Rebalance, Unhealthy Instances, and more!
- EC2 Status Change Events
- EC2 Scheduled Change events from AWS Health
- Helm installation and event configuration support
- Webhook feature to send shutdown or restart notification messages
- Unit & Integration Tests
@@ -265,7 +266,7 @@ $ aws sqs create-queue --queue-name "${SQS_QUEUE_NAME}" --attributes file:///tmp

#### 4. Create Amazon EventBridge Rules

Here are AWS CLI commands to create Amazon EventBridge rules so that ASG termination events, Spot Interruptions, Instance state changes and Rebalance Recommendations are sent to the SQS queue created in the previous step. This should really be configured via your favorite infrastructure-as-code tool like CloudFormation or Terraform:
Here are AWS CLI commands to create Amazon EventBridge rules so that ASG termination events, Spot Interruptions, Instance state changes, Rebalance Recommendations, and AWS Health Scheduled Changes are sent to the SQS queue created in the previous step. This should really be configured via your favorite infrastructure-as-code tool like CloudFormation or Terraform:

```
$ aws events put-rule \
@@ -295,6 +296,13 @@ $ aws events put-rule \

$ aws events put-targets --rule MyK8sInstanceStateChangeRule \
--targets "Id"="1","Arn"="arn:aws:sqs:us-east-1:123456789012:MyK8sTermQueue"

$ aws events put-rule \
--name MyK8sScheduledChangeRule \
--event-pattern "{\"source\": [\"aws.health\"],\"detail-type\": [\"AWS Health Event\"]}"
```

Review comment (Contributor):

Is there a reason this event pattern isn't more specific? Should it be limited to EC2 scheduled changes, since the code ignores all other events?

```
{
  "source": [
    "aws.health"
  ],
  "detail-type": [
    "AWS Health Event"
  ],
  "detail": {
    "service": [
      "EC2"
    ],
    "eventTypeCategory": [
      "scheduledChange"
    ]
  }
}
```

Apologies for commenting on a closed PR. I'm trying to upgrade NTH and need to understand the new event rules that should be added.

Review comment:

You should also add "region": ["us-east-1"], or whichever region your cluster runs in. Since AWS Health events are global, you at least want to filter to EC2 scheduled changes for the region you're using.

Review comment (Contributor, Author):

@gabegorelick, can you open an issue with your thoughts on how we should constrain the rule? Also, if you'd like to open a PR updating the README and run some tests, that would be great!

```
$ aws events put-targets --rule MyK8sScheduledChangeRule \
--targets "Id"="1","Arn"="arn:aws:sqs:us-east-1:123456789012:MyK8sTermQueue"
```
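The review comments on this hunk ask whether the `MyK8sScheduledChangeRule` pattern should also constrain `detail.service`, `detail.eventTypeCategory` and `region`. The sketch below compares the README's broad pattern with that narrower suggestion against two sample events. It is a deliberately simplified model of EventBridge content filtering (exact string values listed in arrays, nested objects matched recursively) and is not how EventBridge itself is implemented; the RDS event and the `us-east-1` value are illustrative assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// matches reports whether event satisfies pattern under a simplified model of
// EventBridge matching: every pattern key must exist in the event, nested
// objects recurse, and an array lists the allowed exact values for a field.
// Other operators (prefix, anything-but, numeric, exists) and array-valued
// event fields are NOT modeled.
func matches(pattern, event map[string]interface{}) bool {
	for key, want := range pattern {
		got, ok := event[key]
		if !ok {
			return false
		}
		switch w := want.(type) {
		case map[string]interface{}:
			sub, ok := got.(map[string]interface{})
			if !ok || !matches(w, sub) {
				return false
			}
		case []interface{}:
			found := false
			for _, v := range w {
				if v == got { // pattern values are strings; mismatched types compare false
					found = true
					break
				}
			}
			if !found {
				return false
			}
		default:
			return false
		}
	}
	return true
}

func mustParse(s string) map[string]interface{} {
	var m map[string]interface{}
	if err := json.Unmarshal([]byte(s), &m); err != nil {
		panic(err)
	}
	return m
}

// Pattern used by MyK8sScheduledChangeRule in the README.
const broadPattern = `{"source": ["aws.health"], "detail-type": ["AWS Health Event"]}`

// Narrower pattern along the lines suggested in review (region is an example).
const narrowPattern = `{
  "source": ["aws.health"],
  "detail-type": ["AWS Health Event"],
  "region": ["us-east-1"],
  "detail": {"service": ["EC2"], "eventTypeCategory": ["scheduledChange"]}
}`

const ec2Event = `{"source": "aws.health", "detail-type": "AWS Health Event", "region": "us-east-1",
  "detail": {"service": "EC2", "eventTypeCategory": "scheduledChange"}}`

// Hypothetical non-EC2 Health event that the handler would reject anyway.
const rdsEvent = `{"source": "aws.health", "detail-type": "AWS Health Event", "region": "us-east-1",
  "detail": {"service": "RDS", "eventTypeCategory": "scheduledChange"}}`

func main() {
	broad, narrow := mustParse(broadPattern), mustParse(narrowPattern)
	for _, ev := range []struct{ name, body string }{{"EC2", ec2Event}, {"RDS", rdsEvent}} {
		e := mustParse(ev.body)
		fmt.Printf("%s: broad=%v narrow=%v\n", ev.name, matches(broad, e), matches(narrow, e))
	}
}
```

Under this model the broad pattern forwards both events to SQS, where `scheduledEventToInterruptionEvents` rejects the RDS one with an "are not supported" error; the narrower pattern filters it out before it reaches the queue.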

#### 5. Create an IAM Role for the Pods
1 change: 1 addition & 0 deletions go.mod
@@ -13,6 +13,7 @@ require (
	go.opentelemetry.io/otel v0.20.0
	go.opentelemetry.io/otel/exporters/metric/prometheus v0.20.0
	go.opentelemetry.io/otel/metric v0.20.0
	go.uber.org/multierr v1.7.0
	golang.org/x/crypto v0.0.0-20210513164829-c07d793c2f9a // indirect
	golang.org/x/net v0.0.0-20210405180319-a5a99cb37ef4 // indirect
	golang.org/x/sys v0.0.0-20210608053332-aa57babbf139
6 changes: 6 additions & 0 deletions go.sum
@@ -608,8 +608,12 @@ go.starlark.net v0.0.0-20200306205701-8dd3e2ee1dd5/go.mod h1:nmDLcffg48OtT/PSW0H
go.uber.org/atomic v1.3.2/go.mod h1:gD2HeocX3+yG+ygLZcrzQJaqmWj9AIm7n08wl/qW/PE=
go.uber.org/atomic v1.4.0/go.mod h1:gD2HeocX3+yG+ygLZcrzQJaqmWj9AIm7n08wl/qW/PE=
go.uber.org/atomic v1.5.0/go.mod h1:sABNBOSYdrvTF6hTgEIbc7YasKWGhgEQZyfxyTvoXHQ=
go.uber.org/atomic v1.7.0 h1:ADUqmZGgLDDfbSL9ZmPxKTybcoEYHgpYfELNoN+7hsw=
go.uber.org/atomic v1.7.0/go.mod h1:fEN4uk6kAWBTFdckzkM89CLk9XfWZrxpCo0nPH17wJc=
go.uber.org/multierr v1.1.0/go.mod h1:wR5kodmAFQ0UK8QlbwjlSNy0Z68gJhDJUG5sjR94q/0=
go.uber.org/multierr v1.3.0/go.mod h1:VgVr7evmIr6uPjLBxg28wmKNXyqE9akIJ5XnfpiKl+4=
go.uber.org/multierr v1.7.0 h1:zaiO/rmgFjbmCXdSYJWQcdvOCsthmdaHfr3Gm2Kx4Ec=
go.uber.org/multierr v1.7.0/go.mod h1:7EAYxJLBy9rStEaz58O2t4Uvip6FSURkq8/ppBp95ak=
go.uber.org/tools v0.0.0-20190618225709-2cfd321de3ee/go.mod h1:vJERXedbb3MVM5f9Ejo0C68/HhF8uaILCdgjnY+goOA=
go.uber.org/zap v1.10.0/go.mod h1:vwi/ZaCAaUcBkycHslxD9B2zi4UTXhF60s6SWpuDF0Q=
go.uber.org/zap v1.13.0/go.mod h1:zwrFLgMcdUuIBviXEYEH1YKNaOBnKXsx2IPda5bBwHM=
@@ -916,6 +920,8 @@ gopkg.in/yaml.v2 v2.4.0 h1:D8xgwECY7CYvx+Y2n4sBz93Jn9JRvxdiyyo8CTfuKaY=
gopkg.in/yaml.v2 v2.4.0/go.mod h1:RDklbk79AGWmwhnvt/jBztapEOGDOx6ZbXqjP6csGnQ=
gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c h1:dUUwHk2QECo/6vqA44rthZ8ie2QXMNeKRTHCNY2nXvo=
gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
gopkg.in/yaml.v3 v3.0.0-20210107192922-496545a6307b h1:h8qDotaEPuJATrMmW04NCwg7v22aHH28wwpauUhK9Oo=
gopkg.in/yaml.v3 v3.0.0-20210107192922-496545a6307b/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
gotest.tools/v3 v3.0.2/go.mod h1:3SzNCllyD9/Y+b5r9JIKQ474KzkZyqLqEfYqMsX94Bk=
gotest.tools/v3 v3.0.3 h1:4AuOwCGf4lLR9u3YOe2awrHygurzhO/HeQ6laiA6Sx0=
gotest.tools/v3 v3.0.3/go.mod h1:Z7Lb0S5l+klDB31fvDQX8ss/FlKDxtlFlw3Oa8Ymbl8=
2 changes: 1 addition & 1 deletion pkg/monitor/sqsevent/asg-lifecycle-event.go
@@ -57,7 +57,7 @@ type LifecycleDetail struct {
	LifecycleTransition string `json:"LifecycleTransition"`
}

func (m SQSMonitor) asgTerminationToInterruptionEvent(event EventBridgeEvent, message *sqs.Message) (*monitor.InterruptionEvent, error) {
func (m SQSMonitor) asgTerminationToInterruptionEvent(event *EventBridgeEvent, message *sqs.Message) (*monitor.InterruptionEvent, error) {
	lifecycleDetail := &LifecycleDetail{}
	err := json.Unmarshal(event.Detail, lifecycleDetail)
	if err != nil {
2 changes: 1 addition & 1 deletion pkg/monitor/sqsevent/ec2-state-change-event.go
@@ -50,7 +50,7 @@ type EC2StateChangeDetail struct {

const instanceStatesToDrain = "stopping,stopped,shutting-down,terminated"

func (m SQSMonitor) ec2StateChangeToInterruptionEvent(event EventBridgeEvent, message *sqs.Message) (*monitor.InterruptionEvent, error) {
func (m SQSMonitor) ec2StateChangeToInterruptionEvent(event *EventBridgeEvent, message *sqs.Message) (*monitor.InterruptionEvent, error) {
	ec2StateChangeDetail := &EC2StateChangeDetail{}
	err := json.Unmarshal(event.Detail, ec2StateChangeDetail)
	if err != nil {
2 changes: 1 addition & 1 deletion pkg/monitor/sqsevent/rebalance-recommendation-event.go
@@ -46,7 +46,7 @@ type RebalanceRecommendationDetail struct {
	InstanceID string `json:"instance-id"`
}

func (m SQSMonitor) rebalanceRecommendationToInterruptionEvent(event EventBridgeEvent, message *sqs.Message) (*monitor.InterruptionEvent, error) {
func (m SQSMonitor) rebalanceRecommendationToInterruptionEvent(event *EventBridgeEvent, message *sqs.Message) (*monitor.InterruptionEvent, error) {
	rebalanceRecDetail := &RebalanceRecommendationDetail{}
	err := json.Unmarshal(event.Detail, rebalanceRecDetail)
	if err != nil {
122 changes: 122 additions & 0 deletions pkg/monitor/sqsevent/scheduled-change-event.go
@@ -0,0 +1,122 @@
// Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License"). You may
// not use this file except in compliance with the License. A copy of the
// License is located at
//
// http://aws.amazon.com/apache2.0/
//
// or in the "license" file accompanying this file. This file is distributed
// on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
// express or implied. See the License for the specific language governing
// permissions and limitations under the License.

package sqsevent

import (
	"encoding/json"
	"fmt"
	"time"

	"github.com/aws/aws-node-termination-handler/pkg/monitor"
	"github.com/aws/aws-node-termination-handler/pkg/node"
	"github.com/aws/aws-sdk-go/service/sqs"
	"github.com/rs/zerolog/log"
)

/* Example AWS Health Scheduled Change EC2 Event:
{
  "version": "0",
  "id": "7fb65329-1628-4cf3-a740-95fg457h1402",
  "detail-type": "AWS Health Event",
  "source": "aws.health",
  "account": "account id",
  "time": "2016-06-05T06:27:57Z",
  "region": "us-east-1",
  "resources": ["i-12345678"],
  "detail": {
    "eventArn": "arn:aws:health:region::event/id",
    "service": "EC2",
    "eventTypeCode": "AWS_EC2_DEDICATED_HOST_NETWORK_MAINTENANCE_SCHEDULED",
    "eventTypeCategory": "scheduledChange",
    "startTime": "Sat, 05 Jun 2016 15:10:09 GMT",
    "eventDescription": [{
      "language": "en_US",
      "latestDescription": "A description of the event will be provided here"
    }],
    "affectedEntities": [{
      "entityValue": "i-12345678",
      "tags": {
        "stage": "prod",
        "app": "my-app"
      }
    }]
  }
}
*/

// AffectedEntity holds information about an entity that is affected by a Health event
type AffectedEntity struct {
	EntityValue string `json:"entityValue"`
}

// ScheduledChangeEventDetail holds the event details for AWS Health scheduled EC2 change events from Amazon EventBridge
type ScheduledChangeEventDetail struct {
	EventTypeCategory string           `json:"eventTypeCategory"`
	Service           string           `json:"service"`
	AffectedEntities  []AffectedEntity `json:"affectedEntities"`
}

func (m SQSMonitor) scheduledEventToInterruptionEvents(event *EventBridgeEvent, message *sqs.Message) []InterruptionEventWrapper {
	scheduledChangeEventDetail := &ScheduledChangeEventDetail{}
	interruptionEventWrappers := []InterruptionEventWrapper{}

	if err := json.Unmarshal(event.Detail, scheduledChangeEventDetail); err != nil {
		return append(interruptionEventWrappers, InterruptionEventWrapper{nil, err})
	}

	if scheduledChangeEventDetail.Service != "EC2" {
		err := fmt.Errorf("events from Amazon EventBridge for service (%s) are not supported", scheduledChangeEventDetail.Service)
		return append(interruptionEventWrappers, InterruptionEventWrapper{nil, err})
	}

	if scheduledChangeEventDetail.EventTypeCategory != "scheduledChange" {
		err := fmt.Errorf("events from Amazon EventBridge with EventTypeCategory (%s) are not supported", scheduledChangeEventDetail.EventTypeCategory)
		return append(interruptionEventWrappers, InterruptionEventWrapper{nil, err})
	}

	for _, affectedEntity := range scheduledChangeEventDetail.AffectedEntities {
		nodeInfo, err := m.getNodeInfo(affectedEntity.EntityValue)
		if err != nil {
			interruptionEventWrappers = append(interruptionEventWrappers, InterruptionEventWrapper{nil, err})
			continue
		}

		// Begin drain immediately for scheduled change events to avoid disruptions in cases such as degraded hardware
		interruptionEvent := monitor.InterruptionEvent{
			EventID:              fmt.Sprintf("aws-health-scheduled-change-event-%x", event.ID),
			Kind:                 SQSTerminateKind,
			AutoScalingGroupName: nodeInfo.AsgName,
			StartTime:            time.Now(),
			NodeName:             nodeInfo.Name,
			InstanceID:           nodeInfo.InstanceID,
			Description:          fmt.Sprintf("AWS Health scheduled change event received. Instance %s will be interrupted at %s \n", nodeInfo.InstanceID, event.getTime()),
		}
		interruptionEvent.PostDrainTask = func(interruptionEvent monitor.InterruptionEvent, n node.Node) error {
			if errs := m.deleteMessages([]*sqs.Message{message}); errs != nil {
				return errs[0]
			}
			return nil
		}
		interruptionEvent.PreDrainTask = func(interruptionEvent monitor.InterruptionEvent, n node.Node) error {
			if err := n.TaintScheduledMaintenance(interruptionEvent.NodeName, interruptionEvent.EventID); err != nil {
				log.Err(err).Msgf("Unable to taint node with taint %s:%s", node.ScheduledMaintenanceTaint, interruptionEvent.EventID)
			}
			return nil
		}

		interruptionEventWrappers = append(interruptionEventWrappers, InterruptionEventWrapper{&interruptionEvent, nil})
	}

	return interruptionEventWrappers
}
2 changes: 1 addition & 1 deletion pkg/monitor/sqsevent/spot-itn-event.go
@@ -48,7 +48,7 @@ type SpotInterruptionDetail struct {
	InstanceAction string `json:"instance-action"`
}

func (m SQSMonitor) spotITNTerminationToInterruptionEvent(event EventBridgeEvent, message *sqs.Message) (*monitor.InterruptionEvent, error) {
func (m SQSMonitor) spotITNTerminationToInterruptionEvent(event *EventBridgeEvent, message *sqs.Message) (*monitor.InterruptionEvent, error) {
	spotInterruptionDetail := &SpotInterruptionDetail{}
	err := json.Unmarshal(event.Detail, spotInterruptionDetail)
	if err != nil {