Wouldn't it be nice if machines could understand the content of images and communicate this understanding as effectively as humans? Such technology would be immensely powerful, be it for helping a visually impaired user navigate a world built by the sighted, assisting an analyst in extracting relevant information from a surveillance feed, educating a child playing a game on a touch screen, providing information to a spectator at an art gallery, or interacting with a robot. As computer vision and natural language processing techniques mature, we are closer to achieving this dream than we have ever been. Visual Question Answering (VQA) is one step in this direction. Given an image and a natural language question about the image (e.g., "What kind of store is this?", "How many people are waiting in the queue?", "Is it safe to cross the street?"), the machine's task is to automatically produce an accurate natural language answer ("bakery", "5", "yes"). In this talk, I will present our VQA dataset, VQA models, and open research questions in free-form and open-ended VQA. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image, and more complex reasoning, than a system producing generic image captions. Answering any possible question about an image is one of the 'holy grails' of AI, requiring the integration of vision, language, and reasoning. I will end with a teaser about the next step: Visual Dialog. Instead of answering individual questions about an image in isolation, can we build machines that can hold a sequential natural language conversation with humans about visual content?