BreakingNews: Article Annotation by Image and Text Processing

Arnau Ramisa*, Fei Yan*, Francesc Moreno-Noguer, Krystian Mikolajczyk

The BreakingNews Dataset

To foster research on multi-modal news article analysis, we propose the BreakingNews dataset, which includes images, captions, geo-location information and comments. The dataset contains approximately 100,000 news articles from several major newspapers and media agencies, collected between 1 January and 31 December 2014. Every article includes at least one image, and the articles cover a wide variety of topics, including sports, politics, arts, healthcare and local news. The copyright of all text and images resides with the original owners.

If you find this dataset useful, please cite:

Related publications:

*) Equal contribution.
‡) In collaboration with Fei Yan, Francesc Moreno-Noguer and Krystian Mikolajczyk.

Downloads

Due to copyright restrictions, we offer the dataset as a list of URLs for the articles and images, as well as files with pre-computed features. For a complete description of all downloads, see the README file.
UPDATE: Arnau Ramisa is no longer at IRI; for further information on the dataset, please contact Francesc Moreno-Noguer.
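
Because the dataset is distributed as lists of URLs rather than as the files themselves, the images and articles have to be fetched locally before use. The following is a minimal sketch of that step, not an official download tool. It assumes a plain-text list with one URL per line, which may not match the actual file layout; the README is the authoritative description. The function names (read_url_list, local_name, download_all) and the hash-based file naming are hypothetical choices made for illustration.

```python
import hashlib
import http.client
import os
import urllib.request
from urllib.parse import urlparse


def read_url_list(lines):
    """Return the URLs found in an iterable of text lines.

    Assumption: one URL per line; blank lines and lines starting
    with '#' are skipped.
    """
    urls = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def local_name(url):
    """Build a stable local file name: SHA-1 of the URL plus its extension."""
    ext = os.path.splitext(urlparse(url).path)[1] or ".bin"
    return hashlib.sha1(url.encode("utf-8")).hexdigest() + ext


def download_all(urls, out_dir, timeout=30):
    """Fetch each URL into out_dir and return a list of (url, error) failures.

    Files that already exist are skipped, so an interrupted run can be resumed.
    """
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for url in urls:
        target = os.path.join(out_dir, local_name(url))
        if os.path.exists(target):
            continue  # already fetched on a previous run
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                data = resp.read()
            with open(target, "wb") as fh:
                fh.write(data)
        except (OSError, ValueError, http.client.HTTPException) as err:
            # Links to news content often expire; record and move on.
            failed.append((url, str(err)))
    return failed
```

A typical call would open the URL list, pass the file object to read_url_list, and hand the result to download_all together with an output directory. Since the articles date from 2014, some links are likely to be dead, which is why failures are collected and returned instead of aborting the run.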

†) Order as in "Image URLs"