As of December 2018, this page is no longer maintained. Please visit my new page: people.cs.aau.dk/~matteo

I wanted to extract the text of all the links in a Wikipedia article. Especially for important pages this list can be extremely long, so I wrote a few lines of JavaScript that extract exactly that.

The script identifies all links in the body that have a title attribute, plus all the entries in the table of contents (TOC). From these it removes the links whose title contains a colon (:), since those are service URLs. The collected texts that are longer than 2 characters are then stored in the titles variable. Note that this is jQuery syntax, which we can use because Wikipedia already loads jQuery.

var titles = [];
$('#bodyContent a[title], .toctext')
    .not('[title*=":"]')
    .each(function(){
        if($(this).text().length > 2) {
            titles.push($(this).text())
        }
    });
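
The same filtering rules can also be written as a plain function, which makes them easy to try outside the browser. This is only a sketch: the name filterTitles and the {title, text} input shape are hypothetical stand-ins for the jQuery selection above, not part of the original script.

```javascript
// Hypothetical helper mirroring the rules of the jQuery snippet.
// Each item stands for one selected element: `title` is its title
// attribute (undefined for TOC entries), `text` is its visible text.
function filterTitles(links) {
    var result = [];
    links.forEach(function (link) {
        // A colon in the title marks a service link (e.g. "Help:Contents").
        var isServiceLink = typeof link.title === 'string' &&
            link.title.indexOf(':') !== -1;
        if (!isServiceLink && link.text.length > 2) {
            result.push(link.text);
        }
    });
    return result;
}

var sampleLinks = [
    { title: 'Alan Turing', text: 'Alan Turing' },
    { title: 'Help:Contents', text: 'Help' },
    { title: 'Pi', text: 'Pi' },
    { text: 'Early life' }
];
console.log(filterTitles(sampleLinks)); // keeps 'Alan Turing' and 'Early life'
```

Here 'Help' is dropped because of the colon in its title, and 'Pi' because its text is not longer than 2 characters.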

We can also retrieve titles from the footnotes in the page. We assume that the important part is the text between double quotes, since these usually are titles of books, articles and so on. Thus we use a JavaScript regular expression to extract only the part of the footnote link text that is surrounded by quotes.

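// Capture the text between double quotes in a footnote link.
var rexp = /"(.+)"/;
$('.external.text')
    .each(function(){
        var text = $(this).text();
        var match = rexp.exec(text);
        // Fall back to the whole text when the footnote has no quotes.
        var title = match ? match[1] : text;
        if(title.length > 2) {
            titles.push(title);
        }
    });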
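
One detail worth knowing about the quote pattern: .+ is greedy, so when a footnote contains several quoted parts it captures everything from the first quote to the last one. The sketch below compares that with a non-greedy variant that collects each quoted part separately. Both function names and the sample footnote are made up for illustration, and the non-greedy version is an alternative, not what the script above does.

```javascript
// Greedy: a single capture from the first quote to the last quote.
function extractQuotedGreedy(text) {
    var match = /"(.+)"/.exec(text);
    return match ? match[1] : null;
}

// Non-greedy alternative (an assumption, not the original behaviour):
// collects every quoted segment on its own.
function extractQuotedAll(text) {
    var re = /"(.+?)"/g;
    var parts = [];
    var match;
    while ((match = re.exec(text)) !== null) {
        parts.push(match[1]);
    }
    return parts;
}

var footnote = 'Doe, J. "First Title". In "Second Title". Example Press.';
console.log(extractQuotedGreedy(footnote)); // First Title". In "Second Title
console.log(extractQuotedAll(footnote));    // the two titles, separately
```

For footnotes with a single quoted title the two approaches give the same text.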

Since we are handling variables anyway, we can also sort the list of titles.

  titles.sort();
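
Note that sort() without arguments compares strings by their UTF-16 code units, so every capitalized title ends up before every lowercase one, and repeated links stay in the list. If that is not what you want, a possible variation is sketched below. The helper name sortUnique is hypothetical, and ignoring case is my assumption about what a convenient order looks like.

```javascript
// Hypothetical helper: returns a sorted copy of the list without
// duplicates, ordering the titles case-insensitively.
function sortUnique(list) {
    var seen = {};
    var unique = list.filter(function (t) {
        if (Object.prototype.hasOwnProperty.call(seen, t)) {
            return false;
        }
        seen[t] = true;
        return true;
    });
    return unique.sort(function (a, b) {
        var x = a.toLowerCase();
        var y = b.toLowerCase();
        return x < y ? -1 : (x > y ? 1 : 0);
    });
}

var sampleTitles = ['banana', 'Zebra', 'apple', 'Zebra'];
console.log(sampleTitles.slice().sort()); // Zebra, Zebra, apple, banana
console.log(sortUnique(sampleTitles));    // apple, banana, Zebra
```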

Now the variable lives somewhere inside the browser, but we want an actionable deliverable: a file. To turn a JavaScript variable into a file we can use the following code, borrowed from another source:

(function(console){
    console.save = function(data, filename){
        if(!data){
            console.error('Console.save: No data')
            return;
        }
        if(!filename) {
            filename ='console.json'
        }

        if(typeof data ==="object"){
            data = JSON.stringify(data,undefined,4)
        }

        var blob = new Blob([data], {type: 'text/json'}),
                e    = document.createEvent('MouseEvents'),
                a    = document.createElement('a');
        a.download = filename ;
        a.href = window.URL.createObjectURL(blob) ;
        a.dataset.downloadurl =  ['text/json', a.download, a.href].join(':') ;
        e.initMouseEvent('click',true,false, window,0,0,0,0,0,false,false,false,false,0,null) ;
        a.dispatchEvent(e);
    }
})(console)

At this point the following line of JavaScript will, as if by magic, ask your browser to download a .json file containing your variable:

console.save( titles, "wikipedia-titles-filename.json" )

In order to automatically save the file with a meaningful filename we can use something like the last part of the current page URL:

    var fileName = (location.href.split('/').reverse()[0]) + '-titles.json';

    console.save(titles, fileName);
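
Taking the last piece of location.href also keeps any fragment or query string, so a URL ending in Alan_Turing#Early_life yields a filename with the # part in it. A slightly more defensive variant might look like the sketch below. The helper titlesFileName is hypothetical, and stripping the fragment and query and decoding percent-escapes are my additions.

```javascript
// Hypothetical helper: derives a download filename from a page URL.
function titlesFileName(href) {
    // Drop the fragment and the query string before looking at the path.
    var path = href.split('#')[0].split('?')[0];
    // Ignore empty segments, e.g. the one produced by a trailing slash.
    var segments = path.split('/').filter(function (s) {
        return s.length > 0;
    });
    var last = segments.length > 0 ? segments[segments.length - 1] : 'page';
    var name;
    try {
        // Turn percent-escapes such as %C3%A9 back into readable characters.
        name = decodeURIComponent(last);
    } catch (e) {
        // Malformed escape sequence: keep the raw segment.
        name = last;
    }
    return name + '-titles.json';
}

console.log(titlesFileName('https://en.wikipedia.org/wiki/Alan_Turing#Early_life'));
// Alan_Turing-titles.json
```

In the browser it would be called as console.save(titles, titlesFileName(location.href)).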

Now we can concatenate all of these into a single JavaScript file and use a link to that external file to create the code for a bookmarklet like this:

javascript: (function(){
    var titlesExtractorUrl = 'https://disi.unitn.it/~lissandrini/files/wikipedia-extractor-complete.js';
    var jsCodeExtract = document.createElement('script');

    jsCodeExtract.setAttribute('src', titlesExtractorUrl);

    document.body.appendChild(jsCodeExtract);
 }());

You can then compress it with https://jscompress.com/ or https://javascript-minifier.com/, but mind the gap! The compressed output uses double quotes (") to delimit strings. I suggest replacing them with single quotes when you turn the code into a bookmarklet, because it goes inside the href attribute of a link like the one at the end of this page.
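
The quote swap can be done by hand, or with a tiny helper like the sketch below. The function name toBookmarklet and the example.org URL are placeholders of mine. The approach is deliberately naive: it assumes the string literals in the minified code contain no single quotes and no escaped double quotes, which holds for the loader above but not for arbitrary code.

```javascript
// Hypothetical helper: turns minified code into a bookmarklet URL.
// Assumption: no string literal in `code` contains a single quote or an
// escaped double quote, so swapping the delimiters does not change meaning.
function toBookmarklet(code) {
    var singleQuoted = code.trim().replace(/"/g, "'");
    return 'javascript:' + singleQuoted;
}

// Example input in the style a minifier would produce (placeholder URL).
var minified = '(function(){var s=document.createElement("script");' +
    's.setAttribute("src","https://example.org/extractor.js");' +
    'document.body.appendChild(s)})();';

console.log(toBookmarklet(minified));
```

The resulting string contains no double quotes, so it can be placed inside a double-quoted href attribute as it is.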

Here is the Wikipedia link Title Extractor bookmarklet. Drag this link to your bookmark bar!