MacCap is a novel zero-shot image captioning framework trained on text only, combining noise-injection training with visual subregion aggregation. It leverages the fine-grained visual alignments in the multimodal embedding space to achieve efficient zero-shot captioning. MacCap improves zero-shot captioning performance on popular captioning benchmarks under various settings.
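To make the noise-injection idea concrete, the following is a minimal sketch and not MacCap's actual implementation. It assumes that text embeddings are unit-normalized vectors and that noise is zero-mean isotropic Gaussian with a hypothetical standard deviation `sigma`. The function names `l2_normalize` and `inject_noise` are illustrative only.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def inject_noise(text_embedding, sigma, rng):
    """Perturb a text embedding with Gaussian noise, then renormalize.

    Assumption: during text-only training, the perturbed text embedding
    stands in for an image embedding, so the decoder learns to tolerate
    the gap between the two modalities. `sigma` is a hypothetical
    hyperparameter, not a value taken from the paper.
    """
    noisy = [x + rng.gauss(0.0, sigma) for x in text_embedding]
    return l2_normalize(noisy)


# Usage: perturb a toy 3-dimensional "text embedding".
rng = random.Random(0)
embedding = l2_normalize([1.0, 2.0, 3.0])
noisy_embedding = inject_noise(embedding, sigma=0.1, rng=rng)
```

Renormalizing after the perturbation keeps the noisy vector on the unit sphere, which matches the common practice of comparing multimodal embeddings by cosine similarity. Whether MacCap renormalizes in this way is an assumption of the sketch.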