Voicebox is a generative AI model for speech developed by Meta that can generalize, with state-of-the-art performance, to speech-generation tasks it was not specifically trained to perform. It can synthesize speech in six languages and also perform noise removal, content editing, style conversion, and diverse sample generation. The model is based on Flow Matching, a method that has been shown to improve upon diffusion models. On zero-shot text-to-speech, Voicebox outperforms VALL-E, the current state-of-the-art English model, in both intelligibility and audio similarity, while being as much as 20 times faster.
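
To give a rough sense of what a Flow Matching objective looks like, here is a minimal NumPy sketch. It assumes the optimal-transport conditional path commonly used in the flow matching literature, where a noise sample `x0` is moved linearly toward a data sample `x1`. The function names, the `SIGMA_MIN` value, and the toy shapes are illustrative assumptions, not details taken from Voicebox itself, and the network that would predict the vector field is left out entirely.

```python
import numpy as np

# Assumed small constant controlling the residual noise at t = 1 (illustrative value).
SIGMA_MIN = 1e-4


def ot_path_sample(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return a point x_t on the straight-line path from noise x0 toward data x1,
    together with the target vector field u_t a model would regress onto.

    x0, x1: arrays of shape (batch, ...); t: array of shape (batch,) with values in [0, 1].
    """
    # Reshape t so it broadcasts over every non-batch dimension.
    t = t.reshape(-1, *([1] * (x1.ndim - 1)))
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    # Time derivative of x_t with respect to t; constant along the path.
    u_t = x1 - (1.0 - sigma_min) * x0
    return x_t, u_t


def flow_matching_loss(v_pred, u_t):
    """Mean squared error between a predicted vector field and the target field."""
    return float(np.mean((v_pred - u_t) ** 2))
```

In a training loop under these assumptions, one would draw `x0` from a standard normal, draw `t` uniformly from [0, 1], feed `x_t` and `t` to a model, and minimize `flow_matching_loss` between the model output and `u_t`. At inference, samples would be produced by integrating the learned vector field from `t = 0` to `t = 1` with an ODE solver.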