Stream processing and batch processing are fundamentally different. By its nature, batch processing operates on bounded data sets, with operations such as in-memory sorting and joins, so an operator must wait for the relevant input data to arrive before it can compute. This does not mean the data is concentrated on a single node; the computation is still distributed. Flink applies dedicated optimizations to batch execution plans. For storing large data sets, it uses its own managed-memory mechanism (which can tap off-heap memory to make more memory available than the JVM heap alone): data is kept in memory as long as it fits, and is spilled to disk when memory runs short.
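To make the "keep it in memory, spill when memory runs short" idea concrete, here is a hypothetical, self-contained sketch (not Flink's actual sorter; `SpillSort` and `MEMORY_BUDGET` are invented names): an external merge sort that holds at most a fixed number of records in memory, spills each sorted run to a temporary file, and then merges the runs, which is conceptually how a managed-memory batch sorter handles data sets larger than available memory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class SpillSort {
    static final int MEMORY_BUDGET = 4; // max records kept in memory at once

    // Sort one in-memory run and write it to a temp file ("spill to disk").
    static Path spillRun(List<Integer> run) throws IOException {
        Collections.sort(run);
        Path file = Files.createTempFile("run", ".txt");
        try (BufferedWriter w = Files.newBufferedWriter(file)) {
            for (int v : run) {
                w.write(Integer.toString(v));
                w.newLine();
            }
        }
        return file;
    }

    public static List<Integer> externalSort(List<Integer> input) {
        try {
            // Phase 1: fill the memory budget, then spill sorted runs.
            List<Path> runs = new ArrayList<>();
            List<Integer> buffer = new ArrayList<>();
            for (int v : input) {
                buffer.add(v);
                if (buffer.size() == MEMORY_BUDGET) {
                    runs.add(spillRun(buffer));
                    buffer = new ArrayList<>();
                }
            }
            if (!buffer.isEmpty()) runs.add(spillRun(buffer));

            // Phase 2: k-way merge of the on-disk runs with a min-heap
            // holding one head element per run, stored as {value, runIndex}.
            List<BufferedReader> readers = new ArrayList<>();
            for (Path run : runs) readers.add(Files.newBufferedReader(run));
            PriorityQueue<int[]> heap =
                new PriorityQueue<>(Comparator.comparingInt(e -> e[0]));
            for (int i = 0; i < readers.size(); i++) {
                String line = readers.get(i).readLine();
                if (line != null) heap.add(new int[]{Integer.parseInt(line), i});
            }
            List<Integer> result = new ArrayList<>();
            while (!heap.isEmpty()) {
                int[] head = heap.poll();
                result.add(head[0]);
                String next = readers.get(head[1]).readLine();
                if (next != null) heap.add(new int[]{Integer.parseInt(next), head[1]});
            }
            for (BufferedReader r : readers) r.close();
            for (Path run : runs) Files.deleteIfExists(run);
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(9, 3, 7, 1, 8, 2, 6, 4, 5, 0);
        // prints [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] despite the 4-record budget
        System.out.println(externalSort(data));
    }
}
```

The real engine works on serialized binary records in managed memory segments rather than boxed integers, but the spill-and-merge structure is the same.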
Regarding fault tolerance, the current checkpoint mechanism applies only to stream processing; a batch job recovers by replaying its complete (bounded) input data set and re-executing the failed computation. When a TaskManager fails, Flink removes it from the cluster and the Tasks running on it fail, but the consequences differ between the two modes: for stream processing, a Task failure triggers a restart of the entire job, whereas for batch processing it may trigger only a partial restart.
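As a sketch of how this recovery behavior is configured, the fragment below uses keys in the style of Flink's `flink-conf.yaml`; the exact key names and defaults vary across Flink versions, so they should be checked against the documentation for the version in use.

```yaml
# Restart a failed job up to 3 times, waiting 10 s between attempts.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s

# Restart only the pipelined region affected by the failure rather than
# the whole job; for batch jobs this enables partial restarts.
jobmanager.execution.failover-strategy: region
```

With the `region` failover strategy, tasks connected by blocking (batch-style) data exchanges form separate failover regions, which is why a batch Task failure need not restart the entire job.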