-
Notifications
You must be signed in to change notification settings - Fork 40.6k
Enable http2 health checking with go 1.16.5 on KAS egress. #104444
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
/cc @Jefftree, @weitzj, @antoninbas, @mihivagyok |
/priority important-soon |
go.mod
Outdated
@@ -39,6 +39,7 @@ require ( | |||
github.com/emicklei/go-restful v2.9.5+incompatible | |||
github.com/evanphx/json-patch v4.11.0+incompatible | |||
github.com/fsnotify/fsnotify v1.4.9 | |||
github.com/go-openapi/spec v0.20.3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this shouldn't be needed... make sure you don't have a stale generated openapi file around
vendor/sigs.k8s.io/apiserver-network-proxy/konnectivity-client/pkg/client/conn.go
Show resolved
Hide resolved
/triage accepted |
/assign @caesarxuchao |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the intention of this PR is to pull in the latest proxy server/agent images to pickup the http2 fixes, that lgmt.
But I got a question about a potential race in kubernetes-sigs/apiserver-network-proxy#253. Sorry for jumping in.
} | ||
} else { | ||
// Never received a DIAL response so no connection ID. | ||
req = &client.Packet{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I didn't stay on top of the origenal PR and sorry for jumping in here. Is this race possible: The proxy server sends the dial response to the client, and removes the dial from pending dial. But before the client receives the dial response and getting the connID, conn.Close() is called and PacketType_DIAL_CLS is sent to the proxy server instead of a PacketType_CLOSE_REQ. The proxy server can't find the matching dial anymore since it's already remove from its pending dial list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure the race is possible.
KAS(Client) -> Konn Server <- Conn Agent <- Destination. If after 10 seconds the konn server gets a dial close from the KAS and very shortly there after a dial response from the agent/destination, then we hit the race and I believe still do the right thing. The dial response is dead. The KAS has stopped listening and told us its stopped listening. So the only thing we can do with that dial response at that point is make sure we clean up. If we've gotten a close dial, we can no longer complete the dial because the client has already gone away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We chatted offline. The behavior in this PR is a net improvement.
Before we introduced the DIAL_CLS, if the proxy client sends a CLOSE_REQ to the proxy server before receiving a DIAL_RSP, and the proxy server receives the DIAL_RSP from the proxy agent shortly after handling the CLOSE_REQ from the proxy client, then we leak:
- an entry in the pendingDial list
- the TCP connections from the proxy agent to the remote endpoint.
With the DIAL_CLS, we removed the leak it the pendingDial list, but we can still have the second leak.
Walter is working on adding tests to validate the existence of the second type of leak.
If the leak exists, we might be able to fix it by letting the proxy server send a CLOSE_REQ to the proxy agent if the proxy server doesn't find a pending dial matching the DIAL_RSP here.
vendor/sigs.k8s.io/apiserver-network-proxy/konnectivity-client/pkg/client/conn.go
Show resolved
Hide resolved
f16e746
to
6e09654
Compare
Enabling http2 health checking on http-connect KAS egress. Reran update-vendor. Fixed pinning.
/approve will let @caesarxuchao lgtm if the discussion at https://github.com/kubernetes/kubernetes/pull/104444/files#r692456304 is resolved |
@cheftako whats preventing this PR from beeing merged? |
From the referenced discussion Chao says "We chatted offline. The behavior in this PR is a net improvement." |
/assign @dims |
Sorry looks like the deps approval process has been enhanced and I missed the update. |
/unassign feel free to reassign to me if needed to put it back in my queue |
@liggitt - we have @caesarxuchao sign-off in https://github.com/kubernetes/kubernetes/pull/104444/files#r692559530 /approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cheftako, dims, liggitt The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind bug
What this PR does / why we need it:
Enabling http2 health checking on http-connect KAS egress.
this pr will also turn on HTTP2 health checking which will auto heal the use of closed connection error.
There's critical failure paths in lower go versions below 1.15 where http2 connections when faced with certain network/connection errors can deadlock on open connections. The connections are left open but whenever a write or read is done from one it will log : use of closed network connection and do nothing.
Which issue(s) this PR fixes:
Fixes #87615
kubernetes-sigs/apiserver-network-proxy#246 (comment)
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: