An image conditioned on the prompt "an astronaut riding a horse, by Hiroshige", generated by Stable Diffusion, a large-scale text-to-image model released in 2022.

Before the rise of deep learning, there were some attempts to build text-to-image models, but they were limited to effectively creating collages by arranging existing component images, such as those from a database of clip art. The more tractable inverse problem, image captioning, saw a number of successful deep learning approaches prior to the first text-to-image models.

The first modern text-to-image model, alignDRAW, was introduced in 2015 by researchers from the University of Toronto. The model extended the previously introduced DRAW architecture (which used a recurrent variational autoencoder with an attention mechanism) to be conditioned on text sequences. Images generated by alignDRAW were blurry and not photorealistic, but the model was able to generalize to objects not represented in the training data (such as a red school bus), and it appropriately handled novel prompts such as "a stop sign is flying in blue skies", showing that it was not merely "memorizing" data from the training set.

Subsequent researchers became the first to use generative adversarial networks for the text-to-image task. With models trained on narrow, domain-specific datasets, they were able to generate "visually plausible" images of birds and flowers from text captions like "an all black bird with a distinct thick, rounded bill". A model trained on the more diverse COCO dataset produced images which were "from a distance... encouraging", but which lacked coherence in their details.

One of the first text-to-image models to capture widespread public attention was OpenAI's DALL-E, announced in January 2021. A successor capable of generating more complex and realistic images, DALL-E 2, was unveiled in April 2022. Following the rise of text-to-image models, nascent AI-powered text-to-video platforms such as Runway have emerged, demonstrating potential utility for video editing and generation.

Text-to-image models have been built using a variety of architectures. The text encoding step may be performed with a recurrent neural network such as a long short-term memory (LSTM) network, though transformer models have since become a more popular option. For the image generation step, conditional generative adversarial networks have been commonly used, with diffusion models also becoming a popular option in recent years.
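
The two-step architecture described in this article, a text encoder followed by a conditional image generator, can be illustrated with a minimal, untrained sketch. It is not the method of any model named here: the hashed bag-of-words encoder is a stand-in for a learned LSTM or transformer, the single random linear layer is a stand-in for the generator of a conditional generative adversarial network, and all names and sizes (`encode_text`, `ToyConditionalGenerator`, `EMBED_DIM`, and so on) are hypothetical. The sketch only shows the data flow, in which the text embedding is concatenated with random noise before being mapped to pixels.

```python
import hashlib
import math
import random
from typing import List

# All sizes below are hypothetical and chosen only to keep the toy small.
EMBED_DIM = 8   # length of the text embedding
NOISE_DIM = 4   # length of the random noise vector
IMG_SIDE = 4    # the toy "image" is IMG_SIDE x IMG_SIDE grayscale pixels


def encode_text(prompt: str) -> List[float]:
    """Stand-in for a learned text encoder (LSTM or transformer).

    Hashes each word into one of EMBED_DIM buckets, then L2-normalizes
    the bucket counts. An empty prompt yields the all-zero vector.
    """
    vec = [0.0] * EMBED_DIM
    for word in prompt.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[digest[0] % EMBED_DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class ToyConditionalGenerator:
    """Untrained one-layer generator: pixels = tanh(W [noise; text])."""

    def __init__(self, seed: int = 0) -> None:
        rng = random.Random(seed)
        in_dim = NOISE_DIM + EMBED_DIM
        out_dim = IMG_SIDE * IMG_SIDE
        # Random weights; a real generator would learn these adversarially.
        self.weights = [
            [rng.gauss(0.0, 0.5) for _ in range(in_dim)]
            for _ in range(out_dim)
        ]

    def generate(self, text_embedding: List[float],
                 noise: List[float]) -> List[List[float]]:
        # Conditioning by concatenation: the text embedding is appended
        # to the noise vector, so the output depends on both.
        x = list(noise) + list(text_embedding)
        flat = [
            math.tanh(sum(w * xi for w, xi in zip(row, x)))
            for row in self.weights
        ]
        # Reshape the flat pixel list into IMG_SIDE rows.
        return [flat[r * IMG_SIDE:(r + 1) * IMG_SIDE]
                for r in range(IMG_SIDE)]


def text_to_image(prompt: str, generator: ToyConditionalGenerator,
                  seed: int = 0) -> List[List[float]]:
    """Run both steps: encode the prompt, then generate pixels in [-1, 1]."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]
    return generator.generate(encode_text(prompt), noise)
```

Calling `text_to_image("a red school bus", ToyConditionalGenerator())` returns a 4 by 4 grid of values between -1 and 1. Because the weights are random, the output carries no visual meaning; changing the prompt or the noise seed changes the grid, which is the property a trained conditional generator exploits.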