Guest post by Joseph Greene, Research Repository UCD
How accurate are our download statistics?
From a ‘little experiment’ to an international collaboration
In February last year, we reached the one millionth download of items hosted in the Research Repository UCD. We use these download statistics quite frequently to promote the repository: sending monthly reports to authors on their papers, School/College-level reports to heads, top items to the Research intranet, infographics delivered by CLLs, and even providing data for School quality reviews. Because these statistics have attracted the interest of so many authors and provide a basic measure of return on investment for the repository, it became important to know just how accurate they really are.
The problem with any website’s usage statistics is that they are clouded with robot (yes, robot) usage. Numerous organisations (Google, the Internet Archive, link checkers and the like) use computer programs to systematically crawl the web, following every link they find, so that they can offer you their particular web-based service, be it a search index, a web archive, etc. So do content spammers, phishers and other less virtuous organisations and individuals.
Good robots follow rules and ‘announce’ themselves; bad robots (and some lazy but good ones) do not, and often go to great lengths to mask themselves as humans. Telling them apart from genuine human users is three parts science, two parts dark art. I first described this at the CONUL Annual Conference last year, and over the course of a few months in late 2015 I researched the art and science of web robot detection. This gave me the skills to perform an empirical test on the Research Repository and determine once and for all (with 96% certainty – thanks to the UCD Maths Support Centre!) how accurate our web robot detection is.
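To make the ‘announcing themselves’ idea concrete, here is a minimal first-pass filter of the kind most repository statistics systems start from: matching the user-agent string against a list of known robot patterns. The patterns below are illustrative only, not UCD’s actual exclusion list, and on their own they only catch the well-behaved robots; the disguised ones need behavioural checks on top.

```python
import re

# Illustrative sample patterns only (real systems use much longer,
# maintained lists). Well-behaved robots identify themselves like this.
ROBOT_UA_PATTERNS = [
    r"googlebot",
    r"ia_archiver",          # the Internet Archive's crawler
    r"crawler|spider|bot",   # generic self-identifiers
]

robot_ua_re = re.compile("|".join(ROBOT_UA_PATTERNS), re.IGNORECASE)

def looks_like_robot(user_agent: str) -> bool:
    """First-pass check: does the user agent announce itself as a robot?

    Bad robots fake a browser user agent, so a real pipeline also looks
    at behaviour: request rate, ignoring robots.txt, never fetching
    images or stylesheets, and so on.
    """
    return bool(robot_ua_re.search(user_agent or ""))

print(looks_like_robot("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # True
print(looks_like_robot("Mozilla/5.0 (Windows NT 10.0) Firefox/45"))  # False
```

The science is the pattern list; the dark art is everything the disguised robots force you to add afterwards.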
The results were surprising: 85% of our unfiltered downloads come from robots. This is very high, and interestingly, the only other study (an unpublished white paper) found similar results for 20 repositories in the UK. We successfully detect 94% of these robot downloads and discount them from our statistics, matching the best detection rates reported in the research. The full results of this study, including a detailed description of robot detection in all the major repository software, will be published in July in the peer-reviewed journal Library Hi Tech.
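Those two percentages combine in a way worth spelling out. A back-of-the-envelope calculation (using a round one million raw downloads purely for illustration; these derived figures are mine, not from the study) shows how much robot traffic a 94% detection rate still leaves inside the reported numbers:

```python
# Figures from the text: 85% of raw downloads are robots,
# and 94% of those robot downloads are detected and discounted.
raw_downloads = 1_000_000          # illustrative round number
robot_share = 0.85
detection_rate = 0.94

robot_dl = raw_downloads * robot_share               # 850,000 robot hits
human_dl = raw_downloads - robot_dl                  # 150,000 human hits
undetected_robots = robot_dl * (1 - detection_rate)  # 51,000 slip through

reported = human_dl + undetected_robots              # what gets published
residual_share = undetected_robots / reported        # robot fraction left

print(f"Reported downloads: {reported:,.0f}")
print(f"Robot share still in the reported figures: {residual_share:.1%}")
```

Because robots dominate the raw traffic so heavily, even a 6% miss rate leaves a noticeable robot residue in the filtered statistics, which is why pushing detection from good to excellent matters.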
Once I had this dataset of robot downloads, my curiosity got the better of me and I did some experiments to see which repository software does better on its own, DSpace or EPrints, in comparison with the statistics system we use, the University of Minho DSpace Stats Add-on. Through contacts I made during my initial research, I formed a four-person panel to present at Open Repositories 2016, with representatives from IRUS-UK, COAR and bepress Digital Commons, where I’ll present the results of this experiment (spoiler alert: we’re much more successful at UCD than DSpace and EPrints on their own!).
I took the data one step further in order to offer evidence-based suggestions for how to improve DSpace and EPrints. In this experiment, I compare two possible technical modifications to DSpace and EPrints that, combined, could improve the accuracy of their usage statistics by 19% and 20% respectively. I’ll present this at OR2016’s developer track.
What started as a ‘little experiment’ has turned into an international collaboration. I’ve since been invited to be a member of the COUNTER Technical Advisory Group, for my sins. The question that remains is: if we could achieve comparable usage statistics for Open Access resources, how would they compare to the closed-access versions of the same publications? In other words, is the OA version used more or less than the subscription version? Think I need to do another little experiment…