What kind of language do British parliamentarians use? We scraped, parsed and vectorised a sample of recent debates from the House of Commons. We then applied a k-means clustering algorithm to these vectors, and created a word cloud for each cluster.
The image contains 24 word clouds, representing the 24 categories into which a sample of roughly 130,000 statements from UK House of Commons parliamentarians, all made between 2006 and the present day, were partitioned by the clustering algorithm. Each cloud contains ten words; the larger the word, the more representative it is of the cluster. The colouring is also meaningful: red words have meanings more closely aligned with remarks by Labour politicians; blue words, with those of Conservatives; and yellow words, with the sentiments of Liberal Democrats (contributions from all other parties were disregarded in this analysis, for simplicity). The brightness of a colour reflects the degree to which its meaning aligns with statements of politicians from the party it signifies, and grey words are spoken roughly equally often by all parties.
In a typical debate, a politician poses a question, which is then answered by one or more members of an opposing party, or sometimes of the same party. Occasionally, the Speaker intervenes to police etiquette, curtail digressions, or indeed exchange pleasantries with members of parliament (the exchange below is from 6th November, 2014):
Bob Blackman: Yesterday I had the honour of captaining the House of Commons bridge team in our annual match against the other place. I am happy to report that we successfully retained the Jack Perry trophy with an outstanding victory. Sadly, I was the only sitting Member participating and I had to enlist a number of ex-MPs—former distinguished Members of this House—to join me. Even more sadly, UK Sport refuses to recognise bridge, chess and other mind sports as sports. May we have an early debate, in Government time, on ensuring that there is recognition of those mind sports, which are important for sporting purposes in schools and for older people, so that we can encourage participation in them?
Mr Speaker: I am sure the House is pleased to learn of the hon. Gentleman’s prowess and distinction at the bridge table. It is a prowess and distinction of which I was hitherto unaware, but I am now better informed.
The text for this analysis comes from the publication Hansard, which provides transcriptions of speeches and debates in both houses of parliament in the United Kingdom. These records are all free for public perusal on the web, although I agree with the economist and data scientist Andrew Whitby, who in another Hansard-related blog post from 2013 lamented the lack of a workable bulk-download option for the data. (Generously, he includes in his blog post a link to a text file containing a large list of URLs to XML files containing historical Hansard records, which he was obliged to collate himself.)
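Our goal was to construct word clouds from a representative collection of extracts from the corpus, to give a quick sense of the main themes discussed. I will outline the steps here in brief, before explaining them below in more detail:
1. Identify and download a sample of House of Commons debate transcripts.
2. Parse each transcript into Remarks, together with speaker metadata.
3. Vectorise each Remark and cluster the resulting vectors with k-means.
4. Create a word cloud for each cluster from its centroid vector.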
We've made the Python code we used for Steps 1, 2 and 4 available on a public GitHub repo. We've not made the code for Step 3 available, since it uses the underlying document vectors, which we don't expose via our API.
The first step to creating the word clouds was to identify and download a sample of the parliamentary debates. Transcripts of "Debates and Oral Answers" for the House of Commons can be searched by date starting from this webpage. Here is an example URL for the transcript from 15th June 2016:
The characters in red encode the date and the page number. We wrote a Python script to iterate over all such dates and attempt to download the associated webpage, starting at the beginning of 1988, the earliest date from which the records are searchable at the above link. The script:
Running the above script gave us around 3,500 HTML files, starting in around 2006. I wanted to download all transcripts since 1988, but this failed. It seems that the URL format above is only valid for pages starting in 2006.
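The download loop described above can be sketched roughly as follows. This is a minimal illustration, not the script from the repo: `URL_TEMPLATE` is a deliberately fake placeholder rather than the real Hansard URL pattern, the helper names are hypothetical, and the `fetch` function is passed in so that the sketch runs without network access. It also tries only one page per date, whereas the real URLs encode a page number as well.

```python
from datetime import date, timedelta
from typing import Callable, Dict, Iterator, Optional

# Hypothetical placeholder, NOT the real Hansard URL pattern.
URL_TEMPLATE = "https://example.invalid/debates/{yymmdd}/page-{page:04d}.htm"


def iter_dates(start: date, end: date) -> Iterator[date]:
    """Yield every calendar day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def build_url(day: date, page: int = 1) -> str:
    """Fill the placeholder template with a date and a page number."""
    return URL_TEMPLATE.format(yymmdd=day.strftime("%y%m%d"), page=page)


def download_all(
    start: date, end: date, fetch: Callable[[str], Optional[str]]
) -> Dict[date, str]:
    """Try one URL per day and keep only the days for which fetch returns a page."""
    pages = {}
    for day in iter_dates(start, end):
        html = fetch(build_url(day))
        if html is not None:  # None means no page was found for that date
            pages[day] = html
    return pages
```

In practice `fetch` could wrap `urllib.request.urlopen` and return `None` when the request fails, which is how dates without a transcript (or dates before the URL format became valid) would be skipped.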
The next step was to create a parser to extract and format text, in suitable chunks, together with metadata. We rolled our own parser using Python's BeautifulSoup module, specifying that each chunk should be a "Remark", by which we mean an uninterrupted block of text spoken by one politician, terminating when another politician begins to speak. The parser also extracts the name of each speaker and their title, if present (e.g. Mr., Mrs., Sir, ...), as well as either their constituency and party affiliation, or their position in cabinet. The parliamentarian's political party is omitted if they have a cabinet position, perhaps because the editors feel it ought to be clear to a human reader which party is governing at any recent point in history. This makes our job harder, and means that we were unable to programmatically extract the political parties for the most senior politicians. The output of applying the parser to each of the 3,500 HTML files in the corpus was a three-column CSV file with around 130,000 lines, one for each Remark. The first column contained a unique identifier, the second the speaker metadata in JSON format, and the third the plain text. In particular, each Remark is labelled not only by a speaker but also, wherever the parser could extract it, by a political party.
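To make the output format concrete, here is a small sketch of writing and reading such a three-column file with Python's standard `csv` and `json` modules. The speakers, the metadata field names (`name`, `title`, `party`) and the helper functions are invented for illustration; the actual parser's field names may differ.

```python
import csv
import io
import json


def write_remarks(remarks, handle):
    """Write one CSV row per Remark: identifier, speaker metadata (JSON), text."""
    writer = csv.writer(handle)
    for identifier, (speaker, text) in enumerate(remarks):
        writer.writerow([identifier, json.dumps(speaker), text])


def read_remarks(handle):
    """Read the rows back, decoding the JSON metadata column."""
    return [(int(i), json.loads(meta), text) for i, meta, text in csv.reader(handle)]


# Invented example data; the field names are assumptions, not the parser's own.
remarks = [
    ({"name": "A. Member", "title": "Mrs", "party": "Labour"},
     'May we have a debate on "mind sports", in Government time?'),
    ({"name": "B. Minister", "title": "Sir", "party": None},  # party not extracted
     "I will look into it."),
]

buffer = io.StringIO()  # stands in for a file on disk
write_remarks(remarks, buffer)
buffer.seek(0)
rows = read_remarks(buffer)
```

Storing the metadata as a JSON string in a single column keeps the file at three columns even when some fields, such as the party of a cabinet member, are missing.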
A warning to the would-be user: The parser does not perfectly extract the political party in all cases; sometimes it also includes extraneous bits of surrounding text. This means it is very likely our analysis failed to find the party for some MPs, thus distorting the calculations used to create the red/blue/yellow party affiliation colour coding shown in the word cloud image above. Use at your own risk!
The next step was to vectorise the Remark contained in each line of the above CSV file using our vectoriser, obtaining a CSV file of 130,000 vectors, one for each Remark. Those curious can read more about the mathematical and linguistic ideas underpinning the vectorisation process in this earlier Lateral blog post by Benjamin Wilson. We then clustered these vectors into groups using an implementation of the k-means clustering algorithm from the open source Python library scikit-learn. We experimented with different values for the number of clusters and found that 30 to 50 gave results that looked reasonable, although it's clear from the image at the beginning of the post that some clusters are duplicated. Every sklearn.cluster.KMeans object was initialised with the default parameters listed on the documentation page, except that we set n_init=20 instead of 10. Among other things, the fitted k-means object provides a centroid vector for each cluster.
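As a rough illustration of the clustering call, the sketch below runs `sklearn.cluster.KMeans` with `n_init=20` on random stand-in data. The random vectors, their dimension, the choice of four clusters and the fixed `random_state` are all assumptions made so that the example is small and reproducible; the real input was the 130,000 Remark vectors, which are not public.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the Remark vectors: 200 random 16-dimensional vectors (assumption).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 16))

# Default parameters except n_init=20; random_state is added here only
# to make this sketch reproducible.
kmeans = KMeans(n_clusters=4, n_init=20, random_state=0)
labels = kmeans.fit_predict(vectors)  # cluster index assigned to each vector
centroids = kmeans.cluster_centers_   # one centroid vector per cluster
```

A larger `n_init` reruns k-means from more random starting points and keeps the best run, which makes the result less sensitive to an unlucky initialisation at the cost of extra computation.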
Having clustered the above vectors, and hence obtained a centroid vector for each cluster, we wrote a Python script to create the word clouds seen in the image above. The rendering itself uses a very slightly modified version of Andreas Müller's lovely open source Python module word_cloud. The input to the Python script was:
For each centroid vector c, we compute the words in the word clouds as follows. Note that in all cases the "nearness" of two vectors v and w is measured by their cosine similarity, that is, the dot product of their normalisations v_norm and w_norm.
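The nearness measure can be written down in a few lines. The sketch below computes cosine similarity as the dot product of normalised vectors and uses it to rank candidate words against a centroid. The toy two-dimensional word vectors and the `nearest_words` helper are hypothetical, and this shows only the ranking idea, not the full procedure used to build the clouds.

```python
import math


def normalise(v):
    """Scale v to unit length (v must not be the zero vector)."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def cosine_similarity(v, w):
    """Dot product of the normalisations v_norm and w_norm."""
    v_norm, w_norm = normalise(v), normalise(w)
    return sum(a * b for a, b in zip(v_norm, w_norm))


def nearest_words(centroid, word_vectors, count=10):
    """Return the `count` words whose vectors are nearest to the centroid."""
    ranked = sorted(
        word_vectors,
        key=lambda word: cosine_similarity(centroid, word_vectors[word]),
        reverse=True,
    )
    return ranked[:count]


# Toy data (assumption): three words in a two-dimensional space.
word_vectors = {"health": [1.0, 0.1], "hospital": [0.9, 0.3], "tax": [0.0, 1.0]}
centroid = [1.0, 0.0]
top = nearest_words(centroid, word_vectors, count=2)
```

Because both vectors are normalised first, only their direction matters: a long vector and a short vector pointing the same way have similarity 1.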
The raw output of the above word cloud computation, together with the texts of the five most important documents for each cluster, is given in this text file on the GitHub repo. We hope it's of interest to mavens of data science and of politics!