SAN FRANCISCO, April 20, 2017 /PRNewswire/ -- Artificial Intelligence systems may one day drive a car, help cure diseases, or simply choose the movie we watch at night. However, the work to get there is tedious and time-consuming, according to the just-completed third annual Data Scientist Survey conducted by CrowdFlower, the essential human-in-the-loop AI platform for data science and machine learning teams.
The survey of nearly 200 data scientists found that the tasks they hate the most, cleaning, labeling and categorizing data, are the ones on which they have to spend the most time. For example, data scientists spend 500% more time cleaning, labeling and categorizing data than they spend mining the data. In fact, those surveyed said they spend twice as much time on these laborious tasks as they do creating and building algorithms.
The reason? It is twofold. First, the lack of high-quality training data is the single biggest reason AI systems fail, according to the results of the survey. In fact, it is so critical that respondents said they'd rather break their leg than delete their training data. Second, data scientists have concerns about the integrity of the training data and worry that, if they aren't careful, the wrong training data could bias an AI system because it could be influenced by human prejudices around things such as religion, race or gender.
"There is a tremendous amount of hard work that is needed to make an AI system deliver on its promise and at the core is getting the training data right," said Robin Bordoli, CEO of CrowdFlower. "Cleaning, labeling and categorizing data isn't sexy or fun, but it's critical. Data scientists know it and that's why they are spending the bulk of their time doing the work they hate. The reality is that algorithms are far from perfect, however, with higher quality training data – created by human intelligence – we can generate business value even with these imperfect algorithms."
As AI systems increasingly enter the mainstream, their usefulness is often defined by the quality of the training data used. A machine can process complex mathematical equations or structured data in milliseconds, but it needs training data to learn how to handle more abstract tasks, such as flagging inappropriate content or distinguishing between objects in images. Higher-quality initial training data will improve the accuracy of an algorithm's initial output, but ongoing training data is required to keep improving the algorithm's results.
Among the other insights gleaned from AI experts:
- Ethical issues: AI and ethics is an issue that bears close watching in the coming years. While the potential of AI replacing human-staffed jobs is an issue according to 42% of respondents, the biggest issue in their eyes is the impact of human bias in training data. More than 63% of those surveyed said that they are concerned that human bias and prejudices such as race, religion or demographics will corrupt the data used to teach AI systems. Another 42% expressed skepticism that we can avoid the programming of biases and are concerned about the 'impossibility of programming a commonly agreed upon moral code.'
- Job satisfaction: Data scientists love their jobs, even if they hate the grunt work. More than 90% of those surveyed said they were happy doing their jobs. In fact, nearly 50% said they were thrilled. Additionally, 63% of those surveyed agreed with the oft-quoted claim that data scientist is the sexiest job in the industry.
- Demand for data scientists: While the field of data science is still relatively new, there is no question that the job market for data scientists is red hot. Even though the majority of respondents have been in their jobs for less than 5 years, they are getting called all the time about new opportunities. Over half of the respondents are contacted at least once per week with a job offer and nearly 30% receive calls multiple times each week.
To view the full report, please visit: http://crwdflr.com/2oMPCzh
CrowdFlower is the essential human-in-the-loop AI platform for data science teams. CrowdFlower helps customers generate high-quality, customized training data for their machine learning initiatives, or automate a business process with easy-to-deploy models and integrated human-in-the-loop workflows. The CrowdFlower software platform supports a wide range of use cases including self-driving cars, intelligent personal assistants, medical image labeling, content categorization, customer support ticket classification, social data insight, CRM data enrichment, product categorization, and search relevance.
Headquartered in San Francisco and backed by Canvas Venture Fund, Trinity Ventures, and Microsoft Ventures, CrowdFlower serves data science teams at Fortune 500 and fast-growing data-driven organizations across a wide variety of industries. For more information, visit www.crowdflower.com.