Cristian Consonni

Ph.D. in Computer Science, free software activist, physicist and storyteller

Datasets: Temporal Evolution of Templates on Wikipedia

This work was the B.Sc. thesis of Mattia Lago, supervised by Prof. Alberto Montresor. We analyzed the temporal evolution of templates in the Italian and English Wikipedias by counting how the number of occurrences of each template changed over time.

Code

The code is available under the MIT license on GitHub.

Italian Wikipedia (itwiki)

These datasets were produced by analyzing the Italian Wikipedia dump of 2015-10-20, which contains the complete page edit history in .bz2 format.

  • template_count_it.tar.7z (544MB compressed, 9.0GB uncompressed, md5sum: 57ff71be1e81ce069bf6407596ff23e7). This dataset contains the number of occurrences of each template in each revision of each page in Italian Wikipedia. The archive contains a CSV file with the following fields:
    1. page_id: (numerical) identifier of the page
    2. page_title: page title
    3. rev_id: (numerical) identifier of the article revision
    4. timestamp: revision timestamp
    5. dictionary: a (Python) dictionary containing the counts of the templates appearing in that revision. Keys are the template names, values are the counts.

Extract of the file.

page_id,page_title,rev_id,timestamp,dictionary
2,Armonium,3,20010914101928,{}
...
2,Armonium,73738102,20150710080500,"{'Nota disambigua': 1, 'Thesaurus BNCF': 1, 'Portale': 1, 'Controllo di autorità': 1, 'Strumento musicale': 1, 'Interprogetto': 1}"
3,Antropologia,4,20020111200304,{}
...
3,Antropologia,71799348,20150404230808,"{'Nota disambigua': 1, 'Scienze sociali': 1, 'Thesaurus BNCF': 1, 'Zoologia': 1, 'Interprogetto': 1, 'Portale': 1, 'Controllo di autorità': 1, 'Scienze etnoantropologiche': 1}"
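
Because the dictionary column holds a Python literal rather than JSON, it needs a dedicated parsing step. The following is a minimal sketch of one way to read the file, assuming the layout shown in the extract above and a YYYYMMDDHHMMSS timestamp; the function name read_template_counts and the in-memory SAMPLE are illustrative and not part of the published code.

```python
import ast
import csv
import io
from datetime import datetime

# Two rows copied from the extract above; the real file is about 9 GB,
# so it should be streamed row by row rather than loaded at once.
SAMPLE = '''page_id,page_title,rev_id,timestamp,dictionary
2,Armonium,3,20010914101928,{}
2,Armonium,73738102,20150710080500,"{'Nota disambigua': 1, 'Thesaurus BNCF': 1, 'Portale': 1, 'Controllo di autorità': 1, 'Strumento musicale': 1, 'Interprogetto': 1}"
'''


def read_template_counts(handle):
    """Yield one record per revision with typed fields (hypothetical helper)."""
    for row in csv.DictReader(handle):
        yield {
            "page_id": int(row["page_id"]),
            "page_title": row["page_title"],
            "rev_id": int(row["rev_id"]),
            # Assumption: timestamps are 14-digit YYYYMMDDHHMMSS strings.
            "timestamp": datetime.strptime(row["timestamp"], "%Y%m%d%H%M%S"),
            # literal_eval parses the Python dict without executing code.
            "templates": ast.literal_eval(row["dictionary"]),
        }


rows = list(read_template_counts(io.StringIO(SAMPLE)))
for record in rows:
    print(record["rev_id"], record["timestamp"].date(), sum(record["templates"].values()))
```

For the real file, pass an open file handle (opened with encoding="utf-8" and newline="") instead of the StringIO object. ast.literal_eval is used here rather than eval because it only accepts literals.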
  • redirects_it.tar.7z (74KB compressed, 257KB uncompressed, md5sum: 4ccaca5cc86657f3a36cb6f974d13a61). This dataset lists the redirects for each template in Italian Wikipedia. The archive contains a CSV file with the following fields:
    1. template: template name
    2. redirect: destination of the redirect
    3. rev_id: (numerical) identifier of the page revision
    4. timestamp: revision timestamp.

Extract of the file.

template,redirect,rev_id,timestamp
1461 Trabzon,Calcio 1461 trabzon,53804499,20121109120449
3TeamBracket,Torneo semifinali con 3 squadre,68884284,20141028102030
404,Collegamento interrotto,33230901,20100701081710
Aa,Avvisoavvisi,21929679,20090207224952
AA,Avvisoavvisi,63955793,20140204021440
Abbreviazione aeronautica,Abbreviazioni aeronautiche,56951406,20130304162116
ABK,Abcasia,52746688,20120923230717
AC,Avvisocommento,45522329,20111209160525
Accountbot,Accountbot,45939459,20111228164505
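
One possible use of the redirect list is to fold template aliases into their destination before comparing counts. The sketch below assumes that template names in the two datasets match exactly, and it keeps only the most recent redirect per template, ignoring when the redirect was created. The helper names load_redirects and merge_aliases and the example counts are hypothetical.

```python
import csv
import io
from collections import Counter

# Three rows copied from the extract above.
SAMPLE = """template,redirect,rev_id,timestamp
404,Collegamento interrotto,33230901,20100701081710
Aa,Avvisoavvisi,21929679,20090207224952
AA,Avvisoavvisi,63955793,20140204021440
"""


def load_redirects(handle):
    """Map each template name to its most recent redirect destination."""
    latest = {}
    for row in csv.DictReader(handle):
        name = row["template"]
        # Timestamps are fixed-width digit strings, so comparing them as
        # strings orders them chronologically.
        if name not in latest or row["timestamp"] > latest[name][0]:
            latest[name] = (row["timestamp"], row["redirect"])
    return {name: target for name, (_, target) in latest.items()}


def merge_aliases(counts, redirects):
    """Sum the counts of templates that redirect to the same destination."""
    merged = Counter()
    for name, n in counts.items():
        merged[redirects.get(name, name)] += n
    return dict(merged)


redirects = load_redirects(io.StringIO(SAMPLE))
# Hypothetical per-revision counts, made up for illustration only.
merged = merge_aliases({"Aa": 2, "AA": 1, "Portale": 1}, redirects)
print(merged)
```

A time-aware analysis would instead apply a redirect only to revisions whose timestamp is later than the redirect's own timestamp.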

License

The code is released under the MIT license and is available on GitHub. The datasets were extracted from Wikipedia dumps and carry the same license (CC-BY-SA 2.5).

How to cite

If you reuse this dataset, please cite it as:

Mattia Lago, Cristian Consonni, Alberto Montresor. Temporal evolution of templates on Wikipedia. (Cite using WebCite®, cite using perma.cc/VR45-24JP)


Questions?

For further info send me an e-mail.