Recently, attention-based models have been used extensively in image captioning and are expected to ground the generated words in the correct image regions. However, at each time step of the decoding process, attention-based models usually use the hidden state of the current input to attend to the image regions. Under this setting, these models suffer from a “deviated focus” problem: they calculate the attention weights based on the previous words instead of the word to be generated, which impairs both grounding and captioning performance. In this paper, we propose Prophet Attention, which takes a form similar to self-supervision. In the training stage, this module uses future information to calculate the “ideal” attention weights over image regions. These weights are then used to regularize the “deviated” attention, so that image regions are grounded with the correct words. Prophet Attention introduces no additional model parameters and no extra computation at inference, so it can be easily incorporated into existing systems. Experiments on the Flickr30k Entities and MSCOCO datasets show that Prophet Attention consistently outperforms baselines in both automatic metrics and human evaluations. Notably, we set a new state of the art on both benchmark datasets and achieve 1st place on the leaderboard of the online MSCOCO benchmark.
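
As a rough illustration of the training-time idea, the toy sketch below computes one attention distribution from a decoder state built on previous words and another from the word to be generated, then penalizes the gap between them. The dot-product scoring, the L1 distance, the function names, and all numeric values are assumptions made for this sketch; the abstract does not specify any of them, and this is not presented as the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, regions):
    """Attention weights of one query vector over image-region features.

    Dot-product scoring is an assumption of this sketch.
    """
    scores = [sum(q * r for q, r in zip(query, region)) for region in regions]
    return softmax(scores)

def prophet_regularizer(deviated, ideal):
    """Distance between the two attention distributions.

    L1 is assumed here; the abstract does not name the distance used.
    """
    return sum(abs(d - i) for d, i in zip(deviated, ideal))

# Hypothetical toy data: 3 image regions with 2-d features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_prev = [2.0, 0.0]    # decoder state derived from the previous words
y_future = [0.0, 2.0]  # representation of the word to be generated (training only)

# The same attention function is used for both queries, so this sketch
# adds no parameters of its own.
deviated = attend(h_prev, regions)   # what the model attends to by default
ideal = attend(y_future, regions)    # attention computed with future information
reg_loss = prophet_regularizer(deviated, ideal)  # added to the captioning loss in training
```

At inference time only `attend(h_prev, regions)` would be evaluated, since the future word is unavailable; this matches the claim that no extra inference computation is needed.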