[jira] [Created] (FLINK-9567) Flink does not release resource in Yarn Cluster mode

Shimin Yang created FLINK-9567:

             Summary: Flink does not release resource in Yarn Cluster mode
                 Key: FLINK-9567
             Project: Flink
          Issue Type: Bug
          Components: Cluster Management, YARN
    Affects Versions: 1.5.0
            Reporter: Shimin Yang

After the Job Manager restarts in Yarn Cluster mode, Flink does not release task manager containers in certain cases. From what I have observed, the cause is that the instance variable *numPendingContainerRequests* in the *YarnResourceManager* class is not decremented, because the requested containers have not yet been received. Then, after the Job Manager restarts, *numPendingContainerRequests* is increased again by the number of task executors.

The callback function *onContainersAllocated* returns excess containers immediately only if *numPendingContainerRequests* <= 0. As a result, the number of containers keeps growing while only a few of them actually act as task managers.
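To make the bookkeeping problem concrete, here is a small toy model of the counter behaviour described above. It is only a sketch based on my reading, not the actual Flink code: the class *PendingRequestModel*, its fields other than *numPendingContainerRequests*, the method *requestContainers*, and the container names are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, simplified model of the pending-request counter; not Flink's implementation. */
class PendingRequestModel {
    int numPendingContainerRequests = 0;
    final List<String> heldContainers = new ArrayList<>();
    final List<String> returnedContainers = new ArrayList<>();

    /** Assumed stand-in for requesting n task executor containers from YARN. */
    void requestContainers(int n) {
        numPendingContainerRequests += n;
    }

    /** Keeps a container while requests are pending; returns it only when the counter is <= 0. */
    void onContainersAllocated(List<String> containers) {
        for (String container : containers) {
            if (numPendingContainerRequests > 0) {
                numPendingContainerRequests--;
                heldContainers.add(container);
            } else {
                returnedContainers.add(container);
            }
        }
    }

    /** Replays the scenario: requests before the restart are never cleared. */
    static PendingRequestModel simulate() {
        PendingRequestModel rm = new PendingRequestModel();
        rm.requestContainers(2); // initial job start, containers not yet received
        rm.requestContainers(2); // restart: counter grows again, old requests still counted
        // YARN eventually answers all outstanding requests.
        rm.onContainersAllocated(Arrays.asList("c1", "c2", "c3", "c4"));
        return rm;
    }

    public static void main(String[] args) {
        PendingRequestModel rm = simulate();
        System.out.println("held=" + rm.heldContainers.size()
                + " returned=" + rm.returnedContainers.size());
    }
}
```

In this model the job only needs two task executors, yet all four containers are kept and none is returned, because the counter was still positive when they arrived.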

I think it is important to clear the *numPendingContainerRequests* variable after the Job Manager restarts, but it is not clear to me how to do that. Currently, the only place where *numPendingContainerRequests* is decreased is *onContainersAllocated*. Would it be acceptable to add a method that operates on the *numPendingContainerRequests* variable? Also, the *ExecutionGraph* restart logic has no handle to the YarnResourceManager.

This message was sent by Atlassian JIRA