================================ BreakingNews Dataset: Features ================================

INTRODUCTION

The full BreakingNews dataset is only available to researchers working at academic institutions, for research (non-commercial) purposes. However, to facilitate access and reduce computation, we provide a public collection of features pre-computed from the original articles, stored as MATLAB v7.3 .mat files. To load the data in Python you can use h5py (see the NOTES at the end of this README).

For convenience, each row of every feature matrix corresponds to one article image in the dataset. Since some articles contain multiple images, the vector of a feature that applies to the whole article is repeated once per image. The file "image_urls.tsv" gives the order of the images and articles in the feature matrices (the NOTES include a short sketch for pairing matrix rows with this file).

If you use this data in your work, please cite this article:

Arnau Ramisa, Fei Yan, Francesc Moreno-Noguer, Krystian Mikolajczyk; BreakingNews: Article Annotation by Image and Text Processing. arXiv:1603.07141 [cs.CV]. http://arxiv.org/abs/1603.07141

Finally, under certain exceptional conditions, the authors are open to computing new features for the data upon request. If you would like to request a new feature type, please send an email to the authors explaining your case.

FEATURES FROM FULL ARTICLE TEXT

We release three feature types generated from the full text of the articles: a bag of words (BoW) with the counts of the 115,427 words in the vocabulary (the valid unique tokens in the training set), and two word2vec [1] based embeddings: the element-wise max and mean of all the word vectors in an article (the NOTES include a sketch of how such embeddings can be computed). In the bag of words, tokens are sorted by decreasing frequency. The word2vec model was learned on the training part of the dataset. In the future we plan to release a representation based on lists of automatic entity and topic annotations.

Files: [text_article_bow.mat, text_article_w2v_max.mat, text_article_w2v_mean.mat, tmp_dictionary.txt in features_mat.zip]

We also release the word2vec model. [w2v.zip]

FEATURES FROM ARTICLE IMAGES

The article images of the dataset are represented by the ReLU activations of the FC7 layer of the VGG19 [2] and PLACES [3] convolutional neural networks (CNNs).

Files: [vision_vgg19_relu7.mat, vision_places_relu7.mat in more_features.zip]

FEATURES FROM ARTICLE IMAGE CAPTIONS

As with the full article text, the image captions are represented by a bag of words and two word2vec embeddings. The BoW vocabulary is computed from the captions in the training set, while the w2v model is the same one used for the full text.

Files: [meta_bow_capt_caption.mat, meta_w2v_caption_max.mat, meta_w2v_caption_mean.mat, tmp_caption_dictionary.txt in features_mat.zip]

FEATURES FROM RELATED IMAGES

Groups of related images are represented as the average of the FC7 ReLU activations over the five related images, both for the VGG19 and the PLACES CNNs.

Files: [vision_places_relu7_related_avg.mat, vision_vgg19_relu7_related_avg.mat in more_features.zip]

METADATA

We provide some metadata for each article.

- Geo-location data: relevant geo-location coordinates for approximately 60% of the articles (those that had this information available). The data is stored as a cell array of latitude-longitude pairs (see the NOTES for a loading sketch).
Files: [meta_location.mat in features_mat.zip]

- Article source: the source of the article (the news agency that posted it online), represented as a string with the following possible values: [BBC News, The Irish Independent, The Sydney Morning Herald, The Guardian, The Telegraph, Yahoo News, Washington Post]
Files: [meta_source.mat in features_mat.zip]

- Time of publication: the time of publication of each article, stored as a string in the W3C date and time format [4].
Files: [meta_timepub.mat in features_mat.zip]

- Original URL of the article: provided as a CSV file with columns "Article_id,month,day,file,url". Unfortunately, some URLs may have changed since the creation of this dataset.
File: [index.csv in index.zip]

More files with additional metadata can be found in features_mat.zip; please reach out if you need clarification.

NOTES

- MATLAB v7.3 files are in fact HDF5 files, so they can be loaded using standard HDF5 libraries. To load them in Python follow these steps (you will need to install the "h5py" and "numpy" libraries):

>>> import h5py
>>> import numpy as np
>>> M = h5py.File('text_article_w2v_max.mat', 'r')  # replace with the .mat file you want
>>> A = np.array(M['features'])                     # A now contains the feature matrix
>>> M.close()

- To load the BoW sparse matrices from Python, follow these steps (you will also need the "scipy" library):

>>> import h5py
>>> from scipy.sparse import csc_matrix
>>> M = h5py.File('text_article_bow.mat', 'r')
>>> A = csc_matrix((M['features']['data'], M['features']['ir'], M['features']['jc']))
>>> M.close()
# You can now convert A to whichever sparse format best suits your needs [5]
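- To relate the rows of a feature matrix back to articles and images, you can pair them with the lines of "image_urls.tsv". The sketch below is untested and assumes that each line of the TSV describes one image, in the same order as the matrix rows; check the actual column layout of the file before relying on it:

>>> import h5py
>>> import numpy as np
>>> rows = [line.split('\t') for line in open('image_urls.tsv').read().splitlines()]
>>> M = h5py.File('text_article_w2v_mean.mat', 'r')
>>> A = np.array(M['features'])   # if the dimensions look swapped, transpose with A = A.T
>>> M.close()
>>> print(rows[0], A[0, :5])      # row i of A corresponds to the image described by rows[i]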
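- MATLAB cell arrays (e.g., the geo-location data in meta_location.mat) are stored in v7.3 files as datasets of HDF5 object references, which have to be dereferenced through the file handle. A minimal, untested sketch; it assumes the cell array is stored under the variable name 'features' (use list(M.keys()) to check the actual name in the file):

>>> import h5py
>>> import numpy as np
>>> M = h5py.File('meta_location.mat', 'r')
>>> refs = np.array(M['features']).ravel()   # assumed variable name; flat list of cell references
>>> first = np.array(M[refs[0]])             # dereference the first cell
>>> M.close()
>>> print(first)                             # e.g., an array of latitude-longitude pairs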
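- For reference, article-level word2vec embeddings like the released ones are the element-wise mean and max of the word vectors of an article. The sketch below uses the gensim library and assumes the released model (w2v.zip) is in the standard word2vec C binary format under the hypothetical file name 'w2v.bin'; it is only an illustration and not necessarily the exact pipeline (tokenization, vocabulary filtering) used to produce the released features:

>>> import numpy as np
>>> from gensim.models import KeyedVectors
>>> w2v = KeyedVectors.load_word2vec_format('w2v.bin', binary=True)  # assumed file name/format
>>> text = "example article text goes here"                          # any article text
>>> vecs = np.array([w2v[t] for t in text.lower().split() if t in w2v])
>>> emb_mean = vecs.mean(axis=0)   # analogous to text_article_w2v_mean
>>> emb_max = vecs.max(axis=0)     # analogous to text_article_w2v_max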
REFERENCES

[1] https://code.google.com/archive/p/word2vec/
[2] https://gist.github.com/ksimonyan/3785162f95cd2d5fee77#file-readme-md
[3] http://places.csail.mit.edu/model/placesCNN.tar.gz
[4] https://www.w3.org/TR/NOTE-datetime
[5] http://docs.scipy.org/doc/scipy/reference/sparse.html