I submitted my PhD thesis over a month ago now (on the 11th of September) and I’ve still not recovered properly from the experience. Perhaps that’s to be expected after 5 years of it. At some point I’ll have to try to write something coherent about what it has been like, but all I can really say at the moment is that I still stand by my advice that embarking on PhD research is a bad idea for almost everyone. Anyway, as a way of trying to put it all into perspective I wrote a few scripts to visualize my thesis and the process of writing it, so I’ve collected a few of these here.
The first of these is pretty simple to do, since I just collected some word frequency data and fed that into Wordle:
This next graph shows how the number of lines in my thesis document slowly increased over time. The flat period for a year at the beginning really represents starting small bits of chapters and then realizing that much of the work and analysis would have to be redone:
(In case you’re wondering, the thesis was about 50000 words in the end, which corresponds to about 40000 lines of the LyX document, since the LyX format is very verbose – the line count does roughly correspond to how the thesis as a whole progressed, though.)
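For anyone wanting to draw a similar graph, here is a minimal sketch of how the line counts could be pulled out of a git history. It is not the script used for the graph above, and it makes some assumptions: the thesis lives in a single file (the name `thesis.lyx` is hypothetical), the `git` command is on the path, and the helper names `git`, `parse_log` and `line_history` are made up for illustration.

```python
import subprocess

THESIS_FILE = "thesis.lyx"  # hypothetical file name, adjust to suit


def git(*args, repo="."):
    """Run a git command in `repo` and return its standard output as text."""
    result = subprocess.run(
        ["git", "-C", repo, *args],
        check=True,
        capture_output=True,
        encoding="utf-8",
        errors="replace",  # tolerate stray non-UTF-8 bytes in old revisions
    )
    return result.stdout


def parse_log(log_output):
    """Turn `git log --format='%H %ct'` output into (commit, unix_time) pairs."""
    pairs = []
    for line in log_output.splitlines():
        if line.strip():
            commit, timestamp = line.split()
            pairs.append((commit, int(timestamp)))
    return pairs


def line_history(path=THESIS_FILE, repo="."):
    """Return (unix_time, line_count) for each commit touching `path`, oldest first."""
    log = git("log", "--reverse", "--format=%H %ct", "--", path, repo=repo)
    history = []
    for commit, timestamp in parse_log(log):
        try:
            content = git("show", f"{commit}:{path}", repo=repo)
        except subprocess.CalledProcessError:
            continue  # the file does not exist at this commit (e.g. it was deleted)
        history.append((timestamp, len(content.splitlines())))
    return history
```

Calling `line_history()` from inside the repository gives a list of (timestamp, line count) pairs that can be handed to any plotting tool.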
Throughout writing the thesis I wondered what the graph of citations would look like, but didn’t have time to do anything about it until after submitting. I was hoping I could use Google Scholar (or some similar online archive) to discover the “A cites B” relationship, but there isn’t an API for it at the moment, and I didn’t think webscraping these data would be worth it. However, I kept all the papers I could find in PDF format in my thesis git repository, consistently named as papers/[BIBTEX-KEY].pdf, so it was simple to write a short Python script which searched for each paper’s title in the text of every other paper. This approach will miss quite a lot of relationships, since pdftotext doesn’t work satisfactorily on many of the papers, some have OCR errors, and so on, but I’m pleased that it seems to have extracted so many of them:
The colours indicate how recently each paper was published, from purple (1967) to red (2009). The script outputs the relationships in graphviz’s dot format, and that image was rendered with “neato”. I excluded any apparently unconnected papers. In case you’re interested in the rather shoddy script, I’ve put it online.
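The title-matching idea can be sketched in a few lines of Python. This is an illustration of the technique rather than the script mentioned above: it assumes the text of each paper has already been extracted (for example with pdftotext) into a dictionary keyed by BibTeX key, it leaves out the colouring by year, and the function names `normalise`, `find_citations` and `to_dot` are hypothetical.

```python
import re


def normalise(text):
    """Lowercase and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def find_citations(titles, texts):
    """Guess "A cites B" pairs by looking for B's title in A's text.

    titles: dict mapping BibTeX key -> paper title
    texts:  dict mapping BibTeX key -> extracted full text of that paper
    Returns a sorted list of (citing_key, cited_key) tuples.
    """
    norm_titles = {key: normalise(title) for key, title in titles.items()}
    edges = []
    for citing, text in texts.items():
        body = normalise(text)
        for cited, title in norm_titles.items():
            # Skip self-matches: every paper contains its own title.
            if cited != citing and title and title in body:
                edges.append((citing, cited))
    return sorted(edges)


def to_dot(edges):
    """Render the edges as a directed graph in graphviz's dot format."""
    lines = ["digraph citations {"]
    lines += [f'  "{citing}" -> "{cited}";' for citing, cited in edges]
    lines.append("}")
    return "\n".join(lines)
```

Normalising whitespace and punctuation before matching helps with titles that are broken across lines in the extracted text, but a plain substring search like this will still miss titles mangled by OCR and can produce false matches for very short titles.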
Finally, I thought it might be nice to include a section of one of the images from my thesis to add a flavour of what I’ve been doing – this shows the primary paths of some neurons which were traced with my “Simple Neurite Tracer” tool and registered with CMTK: