Wednesday, 10 September 2014

Does citizen science invite sabotage?

Q: Does citizen science invite sabotage?

A: No.

Ok, you may want a longer version. There's a paper on crowdsourcing competitions that has lost some important context in doing the rounds of media outlets. For example, on Australia's ABC, 'Citizen science invites sabotage':
'a study published in the Journal of the Royal Society Interface is urging caution at this time of unprecedented reliance on citizen science. It's found crowdsourced research is vulnerable to sabotage. [...] MANUEL CEBRIAN: Money doesn't really matter, what matters is that you can actually get something - whether that's recognition, whether that's getting a contract, whether that's actually positioning an idea, for instance in the pro and anti-climate change debate - whenever you can actually get ahead.'
The fact that the research is studying crowdsourcing competitions, which are fundamentally different to other forms of crowdsourcing that do not have a 'winner takes all' dynamic, is not mentioned. It also does not mention the years of practical and theoretical work on task validation which makes it quite difficult for someone to get enough data past various controls to significantly alter the results of crowdsourced or citizen science projects.

You can read the full paper for free, but even the title, Crowdsourcing contest dilemma, and the abstract make the very specific scope of their study clear:
Crowdsourcing offers unprecedented potential for solving tasks efficiently by tapping into the skills of large groups of people. A salient feature of crowdsourcing—its openness of entry—makes it vulnerable to malicious behaviour. Such behaviour took place in a number of recent popular crowdsourcing competitions. We provide game-theoretic analysis of a fundamental trade-off between the potential for increased productivity and the possibility of being set back by malicious behaviour. Our results show that in crowdsourcing competitions malicious behaviour is the norm, not the anomaly—a result contrary to the conventional wisdom in the area. Counterintuitively, making the attacks more costly does not deter them but leads to a less desirable outcome. These findings have cautionary implications for the design of crowdsourcing competitions.
And from the paper itself:

'We study a non-cooperative situation where two players (or firms) compete to obtain a better solution to a given task. [...] The salient feature is that there is only one winner in the competition. [...] In scenarios of ‘competitive’ crowdsourcing, where there is an inherent desire to hurt the opponent, attacks on crowdsourcing strategies are essentially unavoidable.'
From Crowdsourcing contest dilemma by Victor Naroditskiy, Nicholas R. Jennings, Pascal Van Hentenryck and Manuel Cebrian. Published 20 August 2014, doi: 10.1098/rsif.2014.0532, J. R. Soc. Interface 6 October 2014 vol. 11 no. 99 20140532.
I don't know about you, but 'an inherent desire to hurt the opponent' doesn't sound like the kinds of cooperative crowdsourcing projects we tend to see in citizen science or cultural heritage crowdsourcing. The study is interesting, but it is not generalisable to 'crowdsourcing' as a whole.

If you're interested in crowdsourcing competitions, you may also be interested in: On the trickiness of crowdsourcing competitions: some lessons from Sydney Design from May 2013. 

Tuesday, 9 September 2014

Helping us fly? Machine learning and crowdsourcing

Moon Machine by Bernard Brussel-Smith via Serendip-o-matic
Over the past few years we've seen an increasing number of projects that take the phrase 'human-computer interaction' literally (or perhaps turn HCI into human-computer integration), organising tasks done by people and by computers into a unified system. One of the most obvious benefits of crowdsourcing on digital platforms has been the ability to coordinate the distribution and validation of tasks, but now data classified by people through crowdsourcing is being fed into computers to improve machine learning so that computers can learn to recognise images almost as well as we do. I've outlined a few projects putting this approach to work below. Of course, this creates new challenges for the future - what do cultural heritage crowdsourcing projects do when all the fun tasks like image tagging and text transcription can be done by computers? After all, Fast Company reports 'at least one Zooniverse project, Galaxy Zoo Supernova, has already automated itself out of existence'. More positively, assuming we can find compelling reasons for people to spend time with cultural heritage collections, how do machine learning and task coordination free us to fly further?

The Public Catalogue Foundation has taken tags created through Your Paintings Tagger and turned them over to computers. As they explain in The art of computer image recognition, the results are impressive: 'Using the 3.5 million or so tags provided by taggers, the research team at Oxford 'educated' image-recognition software to recognise the top tagged terms. Professor Zisserman explains this is a three stage process. Firstly, gather all paintings tagged by taggers with a particular subject (e.g. ‘horse’). Secondly, use feature extraction processes to build an ‘object model’ of a horse (a set of characteristics a painting might have that would indicate that a horse is present). Thirdly, run this algorithm over the Your Paintings database and rank paintings according to how closely they match this model.'
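
To make the three stages a little more concrete, here's a toy sketch in Python. It is not the Oxford team's method: I'm assuming each painting has already been reduced to a small feature vector, using the mean of the tagged paintings' vectors as a stand-in 'object model', and ranking by cosine similarity. The names (rank_by_tag, features, tags) and the numbers are all made up for illustration.

```python
from math import sqrt


def mean_vector(vectors):
    """Average a list of equal-length feature vectors, column by column."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_tag(features, tags, term):
    """Rank every painting by how closely it matches the model for `term`."""
    # Stage 1: gather the paintings that taggers labelled with the term.
    tagged = [pid for pid, terms in tags.items() if term in terms]
    if not tagged:
        return []
    # Stage 2: build a crude 'object model' (assumption: the mean vector).
    model = mean_vector([features[pid] for pid in tagged])
    # Stage 3: score the whole collection against the model, best first.
    return sorted(features, key=lambda pid: cosine(features[pid], model),
                  reverse=True)


# Hypothetical feature vectors and crowd tags; p4 has no tags at all.
features = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.8, 0.2, 0.1],
    "p3": [0.0, 0.1, 0.9],
    "p4": [0.7, 0.3, 0.0],
}
tags = {"p1": {"horse", "field"}, "p2": {"horse"}, "p3": {"ship"}}

ranking = rank_by_tag(features, tags, "horse")
print(ranking)
```

The point of the toy data is that the untagged painting p4 ends up ranked above the 'ship' painting p3, which is the payoff of the approach: tags from people let the computer suggest paintings nobody has tagged yet.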

The BBC World Service archive ‘used an open-source speech recognition toolkit to listen to every programme and convert it to text’, extracted keywords or tags from the transcripts then got people to check the correctness of the data created: ‘As well as listening to programmes in the archive, users can view the automatic tags and vote on whether they’re correct or incorrect or add completely new tags. They can also edit programme titles and synopses, select appropriate images and name the voices heard’. From Algorithms and Crowd-Sourcing for Digital Archives by Tristan Ferne. See also What we learnt by crowdsourcing the World Service archive by Yves Raimond, Michael Smethurst, Tristan Ferne on 15 September 2014: 'we believe we have shown that a combination of automated tagging algorithms and crowdsourcing can be used to publish a large archive like this quickly and efficiently'.
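
As a rough illustration of how votes on automatic tags might be turned into decisions, here's a minimal Python sketch. I don't know how the World Service archive actually weighed votes, so the rule here (a minimum number of votes plus a 60% agreement threshold) and the names review_tags, min_votes and threshold are my own assumptions.

```python
from collections import Counter


def review_tags(votes, min_votes=3, threshold=0.6):
    """Turn (tag, is_correct) votes into accepted/rejected/undecided."""
    tallies = {}
    for tag, is_correct in votes:
        tallies.setdefault(tag, Counter())[is_correct] += 1

    decisions = {}
    for tag, counts in tallies.items():
        total = counts[True] + counts[False]
        if total < min_votes:
            # Too few people have looked at this tag to decide either way.
            decisions[tag] = "undecided"
        elif counts[True] / total >= threshold:
            decisions[tag] = "accepted"
        elif counts[False] / total >= threshold:
            decisions[tag] = "rejected"
        else:
            decisions[tag] = "undecided"
    return decisions


# Hypothetical votes on three machine-generated tags for one programme.
votes = [
    ("radio", True), ("radio", True), ("radio", True),
    ("opera", False), ("opera", False), ("opera", True), ("opera", False),
    ("cricket", True),
]
decisions = review_tags(votes)
print(decisions)
```

With these made-up votes 'radio' is accepted, 'opera' is rejected and 'cricket' stays undecided until more people have voted on it.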

And of course the Zooniverse is working on this. From their Milky Way project blog, New MWP paper outlines the powerful synergy between citizens scientists, professional scientists, and machine learning: '...a wonderful synergy that can exist between citizen scientists, professional scientists, and machine learning. The example outlined with the Milky Way Project is that citizens can identify patterns that machines cannot detect without training, machine learning algorithms can use citizen science projects as input training sets, creating amazing new opportunities to speed-up the pace of discovery. A hybrid model of machine learning combined with crowdsourced training data from citizen scientists can not only classify large quantities of data, but also address the weakness of each approach if deployed alone.'

If you're interested in the theory, an early discussion of human input into machine learning is in Quinn and Bederson's 2011 Human Computation: A Survey and Taxonomy of a Growing Field. More recently, the SOCIAM: The Theory and Practice of Social Machines project is looking at 'a new kind of emergent, collective problem solving, in which we see (i) problems solved by very large scale human participation via the Web, (ii) access to, or the ability to generate, large amounts of relevant data using open data standards, (iii) confidence in the quality of the data and (iv) intuitive interfaces', including 'citizen science social machines'. If you're really keen, you can get a sense of the state of the field from various conference papers, including ICML ’13 Workshop: Machine Learning Meets Crowdsourcing and ICML ’14 Workshop: Crowdsourcing and Human Computing. There's also a mega-list of academic crowdsourcing conferences and workshops, though it doesn't include much on the tiny corner of the world that is crowdsourcing in cultural heritage.

NB: this post is a bit of a marker so I've somewhere to put thoughts on machine learning and human-computer integration as I finish my thesis; I'll update this post as I collect more references. Do you know of examples I've missed, or implications we should consider? Comment here or on twitter to start the conversation... 

Thursday, 7 August 2014

Who loves your stuff? How to collect links to your site

If you've ever wondered who's using content from your site or what people find interesting, here are some ways to find out, using the Design Museum's URL as an example.

'Links to your site' via Google Webmaster Tools

Reddit - plug your URL in after /domain/

Wikipedia - plug your URL in after target=*
Depending on your topic coverage you may want to look at other language Wikipedias.

Pinterest - plug your URL in after /source/

Twitter - search for the URL with quotes around it e.g. ""
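
If you want to check several sites regularly, the lookups above can be generated with a few lines of Python. This is only a sketch: the URL patterns are my assumptions about how each service's public address is built (they may well change), link_lookup_urls is a made-up helper name, and example.org is a placeholder rather than a real museum domain.

```python
from urllib.parse import urlencode


def link_lookup_urls(domain):
    """Build 'who links to this site?' lookup URLs for a bare domain."""
    return {
        # Reddit: the domain goes after /domain/
        "reddit": f"https://www.reddit.com/domain/{domain}/",
        # Wikipedia: the domain goes after target=* (English Wikipedia here;
        # swap the 'en' subdomain for other languages)
        "wikipedia": "https://en.wikipedia.org/w/index.php?" + urlencode(
            {"title": "Special:LinkSearch", "target": f"*.{domain}"},
            safe=":*",
        ),
        # Pinterest: the domain goes after /source/
        "pinterest": f"https://www.pinterest.com/source/{domain}/",
        # Twitter: search for the domain with quotes around it
        "twitter": "https://twitter.com/search?" + urlencode(
            {"q": f'"{domain}"'}
        ),
    }


urls = link_lookup_urls("example.org")
for site, url in urls.items():
    print(f"{site}: {url}")
```

Paste each printed address into a browser; nothing here calls an API or needs a login.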

If you can see one particular page shooting up in your web stats, you could try a reverse image search on TinEye to see where it's being referenced.

What am I missing? I'd love to hear about similar links and methods for other sites - tell me in the comments or on twitter @mia_out.

Update: in a similar vein, Tim Sherratt launched a new experiment called Trove Traces the same day, to 'explore how Trove newspapers are used' by listing pages that link to articles.

Update 2: Desi Gonzalez tried out some of these techniques and put together a great post on 'Thoughts on what museums can learn from Reddit, Yelp, and what @briandroitcour calls vernacular criticism'.
You might also be interested in: Can you capture visitors with a steampunk arm?