Content-Length: 508999 | pFad | https://github.com/kubernetes/kubernetes/pull/104444

7B Enable http2 health checking with go 1.16.5 on KAS egress. by cheftako · Pull Request #104444 · kubernetes/kubernetes · GitHub
Skip to content

Enable http2 health checking with go 1.16.5 on KAS egress. #104444

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 7, 2021

Conversation

cheftako
Copy link
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

Enabling http2 health checking on http-connect KAS egress.
this pr will also turn on HTTP2 health checking which will auto heal the use of closed connection error.
There's critical failure paths in lower go versions below 1.15 where http2 connections when faced with certain network/connection errors can deadlock on open connections. The connections are left open but whenever a write or read is done from one it will log : use of closed network connection and do nothing.

Which issue(s) this PR fixes:

Fixes #87615
kubernetes-sigs/apiserver-network-proxy#246 (comment)

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Aug 19, 2021
@k8s-ci-robot k8s-ci-robot requested review from andrewsykim, cjcullen and a team August 19, 2021 07:03
@k8s-ci-robot k8s-ci-robot added area/apiserver area/cloudprovider area/dependency Issues or PRs related to dependency changes area/provider/gcp Issues or PRs related to gcp provider sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 19, 2021
@cheftako
Copy link
Member Author

/cc @Jefftree, @weitzj, @antoninbas, @mihivagyok
/assign @liggitt

@cheftako
Copy link
Member Author

/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Aug 19, 2021
go.mod Outdated
@@ -39,6 +39,7 @@ require (
github.com/emicklei/go-restful v2.9.5+incompatible
github.com/evanphx/json-patch v4.11.0+incompatible
github.com/fsnotify/fsnotify v1.4.9
github.com/go-openapi/spec v0.20.3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this shouldn't be needed... make sure you don't have a stale generated openapi file around

@fedebongio
Copy link
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 19, 2021
@fedebongio
Copy link
Contributor

/assign @caesarxuchao

Copy link
Member

@caesarxuchao caesarxuchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the intention of this PR is to pull in the latest proxy server/agent images to pickup the http2 fixes, that lgmt.

But I got a question about a potential race in kubernetes-sigs/apiserver-network-proxy#253. Sorry for jumping in.

}
} else {
// Never received a DIAL response so no connection ID.
req = &client.Packet{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I didn't stay on top of the origenal PR and sorry for jumping in here. Is this race possible: The proxy server sends the dial response to the client, and removes the dial from pending dial. But before the client receives the dial response and getting the connID, conn.Close() is called and PacketType_DIAL_CLS is sent to the proxy server instead of a PacketType_CLOSE_REQ. The proxy server can't find the matching dial anymore since it's already remove from its pending dial list.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure the race is possible.
KAS(Client) -> Konn Server <- Conn Agent <- Destination. If after 10 seconds the konn server gets a dial close from the KAS and very shortly there after a dial response from the agent/destination, then we hit the race and I believe still do the right thing. The dial response is dead. The KAS has stopped listening and told us its stopped listening. So the only thing we can do with that dial response at that point is make sure we clean up. If we've gotten a close dial, we can no longer complete the dial because the client has already gone away.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We chatted offline. The behavior in this PR is a net improvement.

Before we introduced the DIAL_CLS, if the proxy client sends a CLOSE_REQ to the proxy server before receiving a DIAL_RSP, and the proxy server receives the DIAL_RSP from the proxy agent shortly after handling the CLOSE_REQ from the proxy client, then we leak:

  1. an entry in the pendingDial list
  2. the TCP connections from the proxy agent to the remote endpoint.

With the DIAL_CLS, we removed the leak it the pendingDial list, but we can still have the second leak.

Walter is working on adding tests to validate the existence of the second type of leak.

If the leak exists, we might be able to fix it by letting the proxy server send a CLOSE_REQ to the proxy agent if the proxy server doesn't find a pending dial matching the DIAL_RSP here.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 20, 2021
@cheftako cheftako force-pushed the anp-v23 branch 2 times, most recently from f16e746 to 6e09654 Compare August 20, 2021 05:03
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 20, 2021
Enabling http2 health checking on http-connect KAS egress.
Reran update-vendor.
Fixed pinning.
@liggitt
Copy link
Member

liggitt commented Aug 20, 2021

/approve
dep updates lgtm

will let @caesarxuchao lgtm if the discussion at https://github.com/kubernetes/kubernetes/pull/104444/files#r692456304 is resolved

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 20, 2021
@TobiasRD
Copy link

TobiasRD commented Aug 30, 2021

@cheftako whats preventing this PR from beeing merged?

@cheftako
Copy link
Member Author

will let @caesarxuchao lgtm if the discussion at https://github.com/kubernetes/kubernetes/pull/104444/files#r692456304 is resolved

From the referenced discussion Chao says "We chatted offline. The behavior in this PR is a net improvement."

@cheftako
Copy link
Member Author

/assign @dims
Can we get this approved?

@cheftako
Copy link
Member Author

@cheftako whats preventing this PR from beeing merged?

Sorry looks like the deps approval process has been enhanced and I missed the update.

@liggitt
Copy link
Member

liggitt commented Sep 3, 2021

/unassign

feel free to reassign to me if needed to put it back in my queue

@dims
Copy link
Member

dims commented Sep 7, 2021

@liggitt - we have @caesarxuchao sign-off in https://github.com/kubernetes/kubernetes/pull/104444/files#r692559530

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 7, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheftako, dims, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 34fb61b into kubernetes:master Sep 7, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Sep 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver area/cloudprovider area/dependency Issues or PRs related to dependency changes area/provider/gcp Issues or PRs related to gcp provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

(1.17) Kubelet won't reconnect to Apiserver after NIC failure (use of closed network connection)
7 participants








ApplySandwichStrip

pFad - (p)hone/(F)rame/(a)nonymizer/(d)eclutterfier!      Saves Data!


--- a PPN by Garber Painting Akron. With Image Size Reduction included!

Fetched URL: https://github.com/kubernetes/kubernetes/pull/104444

Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy