[LG] Is your batch size the problem? Revisiting the Adam-SGD gap in language modeling
[Max Planck Institute for Intelligent Systems]
https://arxiv.org/abs/2506.12543