
Still seeing the issue for endpoints staying out of sync #126578

Closed
kedar700 opened this issue Aug 7, 2024 · 16 comments · Fixed by #127417
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

kedar700 commented Aug 7, 2024

What happened?

Issue #125638 and its fix were supposed to resolve the problem where endpoints stay out of sync, but we are still seeing it:

I0807 14:01:51.613700       2 endpoints_controller.go:348] "Error syncing endpoints, retrying" service="test1/test-qa" err="endpoints informer cache is out of date, resource version 10168236546 already processed for endpoints test1/test-qa"
I0807 14:01:51.624576       2 endpoints_controller.go:348] "Error syncing endpoints, retrying" service="test1/test-qa" err="endpoints informer cache is out of date, resource version 10168236546 already processed for endpoints test1/test-qa"
I0807 14:01:51.645704       2 endpoints_controller.go:348] "Error syncing endpoints, retrying" service="test1/test-qa" err="endpoints informer cache is out of date, resource version 10168236546 already processed for endpoints test1/test-qa"
I0807 14:01:51.686942       2 endpoints_controller.go:348] "Error syncing endpoints, retrying" service="test1/test-qa" err="endpoints informer cache is out of date, resource version 10168236546 already processed for endpoints test1/test-qa"
I0807 14:01:51.768648       2 endpoints_controller.go:348] "Error syncing endpoints, retrying" service="test1/test-qa" err="endpoints informer cache is out of date, resource version 10168236546 already processed for endpoints test1/test-qa"
I0807 14:01:51.808043       2 endpoints_controller.go:348] "Error syncing endpoints, retrying" service="test1/test2-qa" err="endpoints informer cache is out of date, resource version 10168250766 already processed for endpoints test1/test2-qa"
I0807 14:01:51.930345       2 endpoints_controller.go:348] "Error syncing endpoints, retrying" service="test1/test-qa" err="endpoints informer cache is out of date, resource version 10168236546 already processed for endpoints test1/test-qa"

I also wrote a small script that finds the Endpoints objects that are out of sync with their corresponding EndpointSlices:

from kubernetes.client import CoreV1Api, DiscoveryV1Api
from hubspot_kube_utils.client import build_kube_client
import json
from datetime import datetime

def extract_ips_from_endpoint(endpoint):
    ips = set()
    if endpoint.subsets:
        for subset in endpoint.subsets:
            if subset.addresses:
                ips.update(addr.ip for addr in subset.addresses)
            if subset.not_ready_addresses:
                ips.update(addr.ip for addr in subset.not_ready_addresses)
    return ips

def extract_ips_from_endpoint_slice(slice):
    if not slice.endpoints:
        return set()
    return set(address for endpoint in slice.endpoints
               for address in (endpoint.addresses or []))

def compare_endpoints_and_slices(core_client, discovery_client):
    all_mismatches = []

    try:
        namespaces = core_client.list_namespace()
    except Exception as e:
        print(f"Error listing namespaces: {e}")
        return all_mismatches

    for ns in namespaces.items:
        namespace = ns.metadata.name
        print(f"Processing namespace: {namespace}")

        try:
            endpoints = core_client.list_namespaced_endpoints(namespace)
        except Exception as e:
            print(f"Error listing endpoints in namespace {namespace}: {e}")
            continue

        for endpoint in endpoints.items:
            name = endpoint.metadata.name

            try:
                slices = discovery_client.list_namespaced_endpoint_slice(namespace, label_selector=f"kubernetes.io/service-name={name}")
            except Exception as e:
                print(f"Error listing endpoint slices for service {name} in namespace {namespace}: {e}")
                continue

            endpoint_ips = extract_ips_from_endpoint(endpoint)
            slice_ips = set()

            for slice in slices.items:
                slice_ips.update(extract_ips_from_endpoint_slice(slice))

            if endpoint_ips != slice_ips:
                mismatch = {
                    "namespace": namespace,
                    "service_name": name,
                    "endpoint_ips": list(endpoint_ips),
                    "slice_ips": list(slice_ips),
                    "missing_in_endpoint": list(slice_ips - endpoint_ips),
                    "missing_in_slice": list(endpoint_ips - slice_ips)
                }
                all_mismatches.append(mismatch)

        print(f"Completed processing namespace: {namespace}")
        print("---")

    return all_mismatches

def save_to_json(data, cluster_name):
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{cluster_name}_mismatches_{timestamp}.json"

    with open(filename, 'w') as f:
        json.dump(data, f, indent=2)

    print(f"Mismatch data for cluster {cluster_name} saved to {filename}")

def main():
    clusters = ["test"]
    all_cluster_mismatches = {}

    for cluster_name in clusters:
        print(f"Processing cluster: {cluster_name}")

        try:
            kube_client = build_kube_client(host="TEST", token="TOKEN")

            core_client = CoreV1Api(kube_client)
            discovery_client = DiscoveryV1Api(kube_client)

            mismatches = compare_endpoints_and_slices(core_client, discovery_client)

            all_cluster_mismatches[cluster_name] = mismatches

            save_to_json(mismatches, cluster_name)

            print(f"Completed processing cluster: {cluster_name}")
            print(f"Total mismatches found in this cluster: {len(mismatches)}")
        except Exception as e:
            print(f"Error processing cluster {cluster_name}: {e}")


if __name__ == "__main__":
    main()

What did you expect to happen?

I expect the endpoints to eventually sync and reflect the most up-to-date information.

How can we reproduce it (as minimally and precisely as possible)?

I just deployed the newer patch to our cluster, and that has resulted in endpoints never being updated once their status goes out of sync.

Anything else we need to know?

No response

Kubernetes version

Client Version: v1.29.7
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.7

Cloud provider

OS version

almalinux-9

Install tools

Container runtime (CRI) and version (if applicable)

cri-o

Related plugins (CNI, CSI, ...) and versions (if applicable)

No response

kedar700 added the kind/bug label on Aug 7, 2024
k8s-ci-robot added the needs-sig and needs-triage labels on Aug 7, 2024
k8s-ci-robot (Contributor) commented:

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

HirazawaUi (Contributor) commented:

/sig network

k8s-ci-robot added the sig/network label and removed the needs-sig label on Aug 7, 2024
tnqn (Member) commented Aug 7, 2024

@kedar700 from the logs, the endpoints controller was retrying test1/test-qa in the expected way. It has retried 5 times and the next retry should be after 320ms, which should be long enough for the informer cache to get the updated endpoints. If there are no more logs related to the failure, I think it may have synced it successfully.
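
For reference, here is a minimal sketch of the per-item exponential backoff that client-go work queues apply, assuming the default 5ms base delay (the exact limiter the endpoints controller configures may differ); the doubling pattern is consistent with the roughly 10ms/20ms/40ms/80ms/160ms gaps between the retries in the log above:

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Illustration only: the same key gets an exponentially growing delay on each
	// failure (5ms, 10ms, 20ms, ...), capped at the configured maximum.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second)
	for i := 1; i <= 7; i++ {
		fmt.Printf("failure %d -> next retry in %v\n", i, limiter.When("test1/test-qa"))
	}
}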

The Python script used to detect the failure seems to capture a diff between Endpoints and EndpointSlices at a single moment, but a mismatch can appear whenever there is any Pod change in the cluster, and technically the script compares Endpoints from one moment with EndpointSlices from another moment.

So did you see the endpoints still not synced after the endpoints controller stopped retrying?

kedar700 (Author) commented:

Yeah, I still see it and continue to see similar messages. Let me post some additional log messages after the ones I posted above.

mengqiy (Member) commented Aug 14, 2024

Observed the same issue in 1.29 and 1.30.
I keep seeing the same error when new pods are added behind a service, and the endpoints never catch up after all pods have settled.
I0809 03:28:48.648274 11 endpoints_controller.go:355] "Error syncing endpoints, retrying" logger="endpoints-controller" service="kube-system/kube-proxy" err="endpoints informer cache is out of date, resource version 1340462 already processed for endpoints kube-system/kube-proxy"

tnqn (Member) commented Aug 15, 2024

@kedar700 @mengqiy could you provide the complete kube-controller-manager log and the YAMLs of the related Endpoints and EndpointSlices?

MikeZappa87 commented:

/assign @tnqn

Are you able to triage this since you are already looking at this?

kedar700 (Author) commented:

Yes, I can get that over by EOD today.

tnqn (Member) commented Aug 16, 2024

Are you able to triage this since you are already looking at this?

Sure, I'm waiting for more logs and the content of the endpoints to understand how it happened. Currently there are only retry logs, which are expected in some cases.

M00nF1sh (Contributor) commented Aug 28, 2024

@tnqn
We found a customer that triggered an edge case in your change from #125675, which causes the Endpoints object to never be updated.

They have a mutating webhook on Endpoints objects that automatically injects a label, let's say `dummy-k: value`.

1. During the service sync, the endpoints controller compares the Service's labels against the Endpoints' labels; they don't match because the Endpoints object carries the extra `dummy-k: value`.
2. The endpoints controller updates the Endpoints' labels to match the Service's labels, removing the `dummy-k: value` label.
3. Their mutating webhook automatically adds the `dummy-k: value` label back as part of the update request. From the API server's point of view, the Endpoints object has not changed, so the update request succeeds but returns the same resourceVersion as before.
4. The change in "Fix endpoints status out-of-sync when the pod state changes rapidly" (#125675) marks that resourceVersion as stale.
5. No future Pod events are handled correctly, because step 3 did not result in a new resourceVersion.

I think we should fix this by checking the resourceVersion returned by the update request, and only marking the old resourceVersion as stale when a new resourceVersion was actually generated.
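
To make that concrete, here is a rough, hypothetical sketch of the proposed check (the helper name, the markStale callback, and the surrounding wiring are illustrative placeholders, not the controller's actual code):

package endpointsfix

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clientset "k8s.io/client-go/kubernetes"
)

// updateAndTrackStale updates an Endpoints object and only records the pre-update
// resourceVersion as stale when the API server actually produced a new one. If a
// mutating webhook reverts the controller's change, the server performs a no-op
// update and returns the same resourceVersion; in that case the cached object is
// still current and must not be marked stale, otherwise every later sync fails with
// "endpoints informer cache is out of date".
func updateAndTrackStale(ctx context.Context, client clientset.Interface, current, desired *v1.Endpoints, markStale func(staleResourceVersion string)) error {
	updated, err := client.CoreV1().Endpoints(desired.Namespace).Update(ctx, desired, metav1.UpdateOptions{})
	if err != nil {
		return err
	}
	if updated.ResourceVersion != current.ResourceVersion {
		// The object really changed server-side, so the informer cache is now behind.
		markStale(current.ResourceVersion)
	}
	return nil
}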

Not sure whether this is the same case others in this thread encountered :D

tnqn (Member) commented Aug 28, 2024

@M00nF1sh thanks for explaining the use case and the analysis. I did analyze the code after the issue was opened and suspected the only possibility was that the returned resourceVersion somehow equals the one in the request, but I didn't see how that could happen after the controller has already checked that the desired endpoints and the current endpoints are different:

// When comparing the subsets, we ignore the difference in ResourceVersion of Pod to avoid unnecessary Endpoints
// updates caused by Pod updates that we don't care, e.g. annotation update.
if !createEndpoints &&
	endpointSubsetsEqualIgnoreResourceVersion(currentEndpoints.Subsets, subsets) &&
	apiequality.Semantic.DeepEqual(compareLabels, service.Labels) &&
	capacityAnnotationSetCorrectly(currentEndpoints.Annotations, currentEndpoints.Subsets) {
	logger.V(5).Info("endpoints are equal, skipping update", "service", klog.KObj(service))
	return nil
}

Now I understand this could happen when the endpoints appear different on the client side but are mutated to be the same on the server side. Do you plan to create a PR since you have proposed a fix?

@kedar700 @mengqiy could you confirm whether this is also your case, i.e. there is a mutating webhook for Endpoints objects that could make the endpoints differ from what the endpoints controller expects?

M00nF1sh (Contributor) commented Aug 28, 2024

@tnqn
Yeah, I can do a fix for this.

thockin (Member) commented Aug 29, 2024

@tnqn @M00nF1sh - can we assign to you?

tnqn (Member) commented Aug 30, 2024

@thockin Sure, I'm already assigned.

@M00nF1sh Given that we've had three user reports on this issue, I think it's important to address it as soon as possible. Could you please let me know when you plan to create the PR? If you're currently occupied, I'm happy to handle it and submit the PR.

M00nF1sh (Contributor) commented Sep 3, 2024

@tnqn
Sorry, I was occupied by some internal incidents last week. I just created a WIP PR, #127103, for it.
I will test it in kops for this specific webhook scenario in the following two days and then remove the WIP flag.

tnqn (Member) commented Sep 4, 2024

/assign @M00nF1sh








