# Crowdsourcing tips

Based on my own experience with Amazon Mechanical Turk and valuable advice from my lab members.
### Overall turking recipe

1. Create a Turk task
2. Debug it on the MTurk sandbox
3. Figure out pricing that pays people reasonably
4. Launch the task on "real" MTurk and find good workers

#### Sandbox and pricing

1. Create MTurk sandbox [requester](https://requestersandbox.mturk.com/) and [worker](https://workersandbox.mturk.com/) accounts
2. Upload a small batch of HITs (~10)
3. Do 2 or 3 HITs yourself (and share the sandbox link with co-authors), and record the average/median time it takes all of you
4. Using those times, set a price per HIT that works out to roughly $12-15/hour (the goal is to stay above minimum wage); see the pricing sketch below

#### Launching the task and finding good workers

1. Run a pilot task on a small number of examples, with a higher number of workers/HIT than your final task will use, to ensure wide participation.
    - For categorical tasks: consider selecting examples you already know the answer to, for easier grading of workers.
    - For free-text tasks: you can sample between 100-500 HITs.
2. Assess the quality of each worker's responses:
    - For categorical tasks: you can set up an autograder based on the responses you expect (if you have your own answers); see the grading sketch below.
    - For free-text tasks: download the CSV of results and scan through HITs, sorted by WorkerId.
3. While scanning, make two lists of WorkerIds:
    - Qualified workers, i.e., workers who did well on your task.
    - Unqualified workers, i.e., workers who didn't do well, were spamming, etc.
4. Create two qualifications on MTurk (see the qualification sketch below):
    - For qualified workers: call it "GreatAtMyTask"
    - For unqualified workers: call it "PreviouslyDoneMyTask"; the idea is to avoid upsetting workers by not telling them they're bad at something.
5. Assign unqualified workers to the "PreviouslyDoneMyTask" qualification and qualified workers to the "GreatAtMyTask" qualification.
6. Create a copy of your pilot task (see the relaunch sketch below):
    - Set the prerequisites to "GreatAtMyTask" (and any other qualifications)
    - Reduce the number of workers/HIT back to what you originally intended
7. If needed, re-run a small "qualification" batch and repeat steps 1-6. Make sure to disallow both good and bad workers from doing this qualification task.
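To make the pricing step concrete, here is a minimal sketch of the arithmetic in Python; the timing and rate numbers are hypothetical placeholders for your own sandbox measurements.

```python
# Back out a fair per-HIT reward from pilot timing. All numbers here are
# hypothetical; plug in the median time you measured in the sandbox.
median_seconds_per_hit = 90        # median completion time from sandbox runs
target_hourly_rate = 14.0          # aiming for ~$12-15/hour

hits_per_hour = 3600 / median_seconds_per_hit
reward_per_hit = target_hourly_rate / hits_per_hour
print(f"Pay at least ${reward_per_hit:.2f} per HIT")  # $0.35 for these numbers
```

The median is often a safer basis than the mean here, since a single interrupted HIT can inflate the average time considerably.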
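For the categorical-task autograder in step 2, a sketch like the following works on the batch-results CSV you download from the requester site. MTurk's CSVs expose your uploaded fields as `Input.*` columns and worker responses as `Answer.*` columns; the specific names `Input.gold_label` and `Answer.category`, as well as the accuracy thresholds, are assumptions you would adapt to your task.

```python
import pandas as pd

# Grade the pilot batch CSV downloaded from the MTurk requester site.
# "Input.gold_label" and "Answer.category" are hypothetical column names;
# substitute the fields from your own task.
results = pd.read_csv("pilot_batch_results.csv")
results["correct"] = results["Answer.category"] == results["Input.gold_label"]

# Per-worker accuracy and HIT count.
per_worker = (
    results.groupby("WorkerId")["correct"]
    .agg(accuracy="mean", n_hits="count")
    .sort_values("accuracy", ascending=False)
)
print(per_worker)

# Hypothetical thresholds: tune them to your task, and hand-check
# borderline workers before assigning qualifications.
qualified = per_worker.query("accuracy >= 0.8 and n_hits >= 5").index.tolist()
unqualified = per_worker.query("accuracy < 0.5").index.tolist()
```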
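Steps 4-5 can be done by hand in the requester UI, but they can also be scripted against the MTurk API via `boto3`. A sketch, assuming you have AWS credentials configured and the worker-ID lists from the grading step (the IDs below are placeholders):

```python
import boto3

# Use the sandbox endpoint while debugging; drop endpoint_url for production.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

def make_qual(name, description):
    resp = mturk.create_qualification_type(
        Name=name, Description=description, QualificationTypeStatus="Active"
    )
    return resp["QualificationType"]["QualificationTypeId"]

great_qual = make_qual("GreatAtMyTask", "Did well on our pilot task.")
done_qual = make_qual("PreviouslyDoneMyTask", "Already took part in our pilot.")

# Placeholder WorkerId lists; in practice these come from the grading step.
qualified = ["A1EXAMPLEWORKER1", "A2EXAMPLEWORKER2"]
unqualified = ["A3EXAMPLEWORKER3"]

for worker_id in qualified:
    mturk.associate_qualification_with_worker(
        QualificationTypeId=great_qual, WorkerId=worker_id,
        IntegerValue=1, SendNotification=False,
    )
for worker_id in unqualified:
    mturk.associate_qualification_with_worker(
        QualificationTypeId=done_qual, WorkerId=worker_id,
        IntegerValue=1, SendNotification=False,
    )
```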
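Continuing from the previous sketch (same client and qualification type IDs), step 6's relaunch can attach the qualifications as requirements so that only "GreatAtMyTask" workers can even see the task. The title, reward, timing values, and `question.xml` file are placeholders:

```python
# Relaunch sketch: require GreatAtMyTask, exclude PreviouslyDoneMyTask.
with open("question.xml") as f:  # your HTMLQuestion/ExternalQuestion XML
    question_xml = f.read()

mturk.create_hit(
    Title="My annotation task",
    Description="Label examples (qualified workers only).",
    Reward="0.35",                    # from the pricing step
    MaxAssignments=3,                 # back to the intended workers/HIT
    AssignmentDurationInSeconds=600,
    LifetimeInSeconds=86400,
    Question=question_xml,
    QualificationRequirements=[
        {   # only workers we qualified can discover and take the HIT
            "QualificationTypeId": great_qual,
            "Comparator": "Exists",
            "ActionsGuarded": "DiscoverPreviewAndAccept",
        },
        {   # pilot workers who didn't qualify are screened out
            "QualificationTypeId": done_qual,
            "Comparator": "DoesNotExist",
            "ActionsGuarded": "DiscoverPreviewAndAccept",
        },
    ],
)
```

For the re-run qualification batch in step 7, you would instead list both qualification types with a `DoesNotExist` comparator, so that neither the already-qualified nor the already-unqualified workers can take it.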
### Some other notes

- **Avoid rejecting as much as you can**, and do NOT mass-reject workers/HITs. Many Turkers are honestly doing their job and want to do good work, and their livelihood can be severely hindered by mass rejections (rejections drive their stats down and can prevent them from taking future HITs). If you truly suspect you are dealing with a spammer or bot, you can soft-block them and *maybe* reject one or two HITs as a warning sign. But chances are the Turker was trying to do the task and just didn't understand it, and they should still be compensated for their time. This is why it's important to pilot and qualify workers, so that you can avoid having to soft-block or reject them later.
- **Avoiding ChatGPT-generated answers**. A recent issue is that some workers may use ChatGPT for free-text answers. It's quite hard to detect, but you can be very clear in your instructions that workers should not use AI systems to do the HITs, and you can use soft- or hard-blocking if the issue persists.
- **Be responsive to emails**. People often do Turking as their full-time job, which means the stakes are very high for them. If they cannot understand or submit a HIT, they might email you (sometimes in anger), but almost every time, responding with empathy and apologies will lead them to be super nice back. Don't forget to thank them; after all, they chose to take your HIT out of all the other HITs they could do!
- **Filtering for English-speaking Turkers**. Restricting to Turkers located in the U.S. as a filter for English proficiency is not very effective (not all US-based people speak English, and non-US workers can use VPNs). Instead, friends over at TurkOpticon recommend using a qualification task to filter for English proficiency.

### Background

I compiled this high-level recipe based on valuable advice from my current and former labmates who Turk (including [Emily Allaway](https://www.aclweb.org/anthology/people/e/emily-allaway/) and [Hannah Rashkin](https://homes.cs.washington.edu/~hrashkin/)) and on my own experience Turking (specifically, on the [ATOMIC](https://mosaickg.apps.allenai.org/kg_atomic), [SocialIQa](https://leaderboard.allenai.org/socialiqa/submissions/get-started), and [Social Bias Frames](https://homes.cs.washington.edu/~msap/social-bias-frames/) projects).

### See also

- [A Needle in a Haystack: An Analysis of High-Agreement Workers on MTurk for Summarization (Zhang et al., 2023)](https://aclanthology.org/2023.acl-long.835)
- [Chris Callison-Burch's Crowdsourcing class](http://crowdsourcing-class.org/)