README.txt: Introducing Into the Dataverse, the article series
Illustration by: Riely McFarlane
Data holds immense value in today’s economic, political, and social realms as a tool for monitoring, regulation, surveillance, convenience, profit generation, and knowledge creation. The use of data to better understand the world around us has a long history. Traditionally, data collection and analysis have been costly and time-consuming, generally only providing a snapshot into a given area of inquiry due to the difficulty of storing and interpreting large amounts of data. Yet recent technological advances in computing power — coupled with the near-perpetual connectedness offered by the internet and mobile devices — have enabled data collection and analysis to be performed at scale, often referred to as ‘big data’ practices.[1] Among other technological breakthroughs, these developments have facilitated rapid advances in artificial intelligence (AI) capabilities and applications.
While ‘data’ can be used to refer to a number of phenomena, such as weather patterns or the production of goods, this Brookfield Institute series focuses specifically on the collection of digital data generated by human activity. This includes the active, passive, and manual collection of data generated from activities such as exercising, traveling, and shopping. To gain insight into these practices requires an understanding of the types of data individuals are generating and sharing, both consciously and unconsciously, how this data is actively or passively collected, and what impacts this may have.
About this series
As the significance of data and the technology it enables grows, so does the need to mitigate its consequences and understand how best to leverage its capabilities for public and private benefit. The question of how to balance the opportunities data enables against the risks it poses to privacy and equity is gaining momentum among civic advocacy groups, government, corporations, academia, and the general public. This overview paints a picture of current data sharing and collection practices to enable policymakers, industry, advocacy groups, and citizens to make more informed decisions about the associated benefits and risks.
This series provides an overview of the available literature exploring current data collection practices associated with common daily activities, emerging activism, and government interventions. Specifically, this report will provide an overview of active and passive data generation and collection practices performed by public, private, and not-for-profit actors in the following domains:
The domains in this series have been selected to encompass a range of common activities and situations an individual may be involved in throughout an average day.
There is a significant gap in research about Canadian data collection activities on a granular scale. This lack of knowledge regarding data collection practices within Canada hinders the ability of policymakers, civil society organizations, and the private sector to respond appropriately to the challenges and harness unrealized benefits. Current discussions regarding data collection practices are largely happening in the U.S. and Europe, and for this reason this literature review draws heavily on international sources. However, many of the applications and services mentioned within this series are also widely used in Canada; therefore, the data activities and effects may be similar in the Canadian context.
This series has been developed to provide a broad understanding of the current landscape of active and passive data collection practices. Following this series, BII+E will employ ethnographic methods to deliver insights into how individuals in Canada are currently sharing their data, what data is being collected and by whom, and how they perceive data privacy. This novel approach will ground policy discussions in the actual experiences and perceptions of Canadians. More details can be found on the project page. This project is part of our broader workstreams on AI + Society and Inclusive Innovative Economies, which study emerging technologies and their impact on Canadians.
Context
Artificial intelligence applications, specifically those that rely on machine learning[2] or deep learning[3], require large amounts of high-quality training data[4] to learn relationships and achieve the desired output as efficiently as possible. Regardless of the technique[5] used to train an AI system, the quality, quantity, structure, and makeup of training data are key determinants of how ML and deep learning models will perform in a real environment.
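To make this point concrete, the minimal sketch below (using scikit-learn; the tooling is our assumption, as the source names none) trains the same supervised model on a small, noisy dataset and on a larger, cleaner one. The gap in test accuracy illustrates how the quantity and quality of training data shape performance.

```python
# A minimal sketch, not from the source: the same classifier trained on
# small/noisy versus large/clean synthetic data performs very differently.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

for n_samples, noise in [(100, 0.3), (10_000, 0.05)]:
    # flip_y injects label noise, a stand-in for low-quality training data
    X, y = make_classification(n_samples=n_samples, n_features=20,
                               flip_y=noise, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{n_samples:>6} samples, {noise:.0%} label noise -> "
          f"test accuracy {acc:.2f}")
```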
The increasing use of AI systems to aid decision-making affects individuals' lives across a variety of domains, including whether a person is hired for a job,[6] is recommended a certain product through targeted advertising,[7] or receives a lower life insurance premium.[8] Examining the data these systems were trained on helps us better understand the effects they have on individuals.
As the use of AI systems, and data-driven decision-making more generally, becomes more widespread, the importance of data will only increase. It is therefore crucial for governments, civic advocacy groups, industry, and individuals to understand the opportunities big data practices afford, while also considering the challenges these practices present.
Opportunities associated with data collection practices
Our increased ability to collect, store, and analyze data provides a range of opportunities and benefits. For the public sector, this includes using large quantities of data to generate new insights used to inform policy making, improve government service design and delivery, and decrease costs in areas such as medical diagnosis,[9] child protection,[10] and adjudication.[11]
In the private sector, opportunities are arising for businesses to use data to inform their strategies, target consumers, and develop innovative goods and services. Some businesses have gained a competitive edge through data collection and analytics. Amazon, for example, collects data on consumer behaviour to inform its recommendation algorithms, allowing it to customize suggestions[12] for individual customers. Even brick-and-mortar retail stores are harnessing novel data-based technology to monitor foot traffic and optimize customer interactions.[13]
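As a rough illustration of how behavioural data can drive recommendations, the toy sketch below implements simple item-based collaborative filtering. It is not Amazon's actual system; the customers, items, and scoring rule are invented for illustration.

```python
# A toy sketch (not Amazon's real engine) of collaborative filtering:
# recommend items bought by customers whose baskets overlap with yours.
from collections import Counter

# purchase history: customer -> set of items bought (fabricated data)
purchases = {
    "ana":   {"kettle", "teapot", "mug"},
    "ben":   {"kettle", "mug", "coffee grinder"},
    "chloe": {"teapot", "mug", "tea sampler"},
}

def recommend(customer: str, k: int = 2) -> list[str]:
    """Rank unseen items by how much their buyers' baskets overlap."""
    own = purchases[customer]
    scores = Counter()
    for other, basket in purchases.items():
        if other == customer:
            continue
        overlap = len(own & basket)   # similarity = number of shared items
        for item in basket - own:     # only suggest items not yet bought
            scores[item] += overlap
    return [item for item, _ in scores.most_common(k)]

print(recommend("ana"))  # ['coffee grinder', 'tea sampler']
```

The design choice here is deliberate: the recommendation quality depends entirely on how much purchase data the retailer has collected, which is precisely why this data is valuable.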
Civil society organizations can also utilize data about individual characteristics, behaviours, and habits to generate beneficial interventions in their communities. Examples of this include developing a system to detect food bank dependency and need for additional supports among repeat users,[14] and personalizing mental health care.[15]
Challenges associated with data collection practices
However, current data collection practices have raised a number of social, political, and regulatory challenges related to privacy, consent, security, and bias. In recent years, growing media coverage, public debate, advocacy, and political discussion have pointed to the lack of policies governing the collection and use of individuals' data, which creates gaps in oversight and citizen protections.
At the root of this discussion is the issue of consent. While regulation requiring consent exists, there is debate surrounding the impact of these laws.[16] The Government of Canada's Personal Information Protection and Electronic Documents Act (PIPEDA)[17] requires organizations to obtain consumers' consent to collect their data and to disclose how that data will be used, shared, and managed. However, it is questionable whether individuals are truly able to give informed consent, which requires a clear understanding of how organizations will collect, use, and share their data.
Beyond the issue of consent, privacy is a major concern for many consumers, as individuals become more aware that information such as their location data, search history, and genetic information is being collected by application providers and other organizations. Privacy policies generate the expectation that organizations will take measures to protect the privacy of individuals and their data. For this reason, many organizations anonymize the data they collect.[18] However, in July 2019, a group of researchers found that 99.98 percent of Americans could be correctly re-identified in any anonymized dataset using just 15 demographic attributes.[19] Privacy challenges like these are likely to keep emerging as technology advances and the amount of data individuals share grows.
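The sketch below illustrates the basic mechanic behind such re-identification: linking an "anonymized" record back to a named individual by matching demographic quasi-identifiers against a public dataset. It is a toy example with fabricated records, not the generative-model approach the researchers actually used.

```python
# A toy illustration (not the Rocher et al. method, which uses generative
# models) of re-identification: an "anonymized" record is linked to a name
# by matching demographic quasi-identifiers. All records are fabricated.

anonymized_health_records = [
    {"zip": "M5V", "birth_year": 1984, "sex": "F", "diagnosis": "asthma"},
    {"zip": "K1A", "birth_year": 1990, "sex": "M", "diagnosis": "diabetes"},
]

public_voter_roll = [
    {"name": "J. Doe",   "zip": "M5V", "birth_year": 1984, "sex": "F"},
    {"name": "A. Smith", "zip": "K1A", "birth_year": 1990, "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

def reidentify(record):
    """Return voters whose quasi-identifiers match the anonymized record."""
    return [v["name"] for v in public_voter_roll
            if all(v[k] == record[k] for k in QUASI_IDENTIFIERS)]

for rec in anonymized_health_records:
    matches = reidentify(rec)
    if len(matches) == 1:  # a unique match re-identifies the individual
        print(f"{matches[0]} -> {rec['diagnosis']}")
```

The more attributes available for matching, the more likely any given record is unique; the study cited above found that 15 demographic attributes are enough to make re-identification near certain.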
The issue of privacy ties in closely to data security. Once individuals provide their consent, they typically expect the organization collecting their data to use and manage it securely, so that their information is not used by other organizations or for purposes for which it was not intended. However, many high-profile breaches have called this expectation into question. One of the most prominent in recent years was the Cambridge Analytica[20] scandal, in which the personal data of approximately 87 million Facebook users was acquired without their knowledge or permission and used to inform election campaign strategies in countries such as the United States, Kenya, and India.[21]
Another major challenge associated with data collection practices is bias. Pre-existing biases related to race, ethnicity, religion, gender, sexual orientation, age, or disability can be baked into a dataset, consciously or unconsciously, by virtue of who collected it,[22] the processes through which it was gathered, and how it is used.[23] Datasets can also be biased in how representative they are of the population: who has and has not been captured in the data. This does not necessarily stem from pre-existing biases; it can result from individuals lacking access to the technologies used to collect the data, or never consenting to collection in the first place. In many cases, data is repurposed to train the algorithms underlying AI systems, a purpose for which it was not originally collected. Notable examples include historical police data used to train predictive policing algorithms[24] and Flickr photos used to train facial recognition algorithms.[25] The use of biased data has produced largely unintended social consequences in AI-driven hiring tools[26] and facial recognition systems,[27] and its impacts can be far-reaching, affecting individuals' access to services and opportunities.[28]
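The toy calculation below, with fabricated numbers, shows one way representational bias hides in plain sight: a model evaluated on a skewed population can report high overall accuracy while failing badly on the underrepresented group.

```python
# A minimal sketch with fabricated numbers: overall accuracy masks a
# group-level failure when one group dominates the evaluation data.

# (group, prediction_correct) for a skewed test set:
# 95 people from group A, only 5 from group B.
results = (
    [("A", True)] * 90 + [("A", False)] * 5
    + [("B", True)] * 2 + [("B", False)] * 3
)

def accuracy(rows):
    return sum(correct for _, correct in rows) / len(rows)

print(f"overall accuracy: {accuracy(results):.0%}")           # 92%
for group in ("A", "B"):
    subset = [r for r in results if r[0] == group]
    print(f"group {group} accuracy: {accuracy(subset):.0%}")  # 95% vs 40%
```

Aggregate metrics alone can therefore conceal exactly the kinds of harms described above, which is why disaggregated evaluation by group matters.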
The first article in this series will explore data collection practices that occur at home: the place where most individuals start and end their days.
Technology and policy related to this topic are constantly evolving. If you think we have missed something or see an error, please contact Sarah Villeneuve (sarah.villeneuve@ryerson.ca). If you want to get involved in subsequent phases of this project, apply here.
[1] For example, see Dans, Enrique. 2018. “How Analytics Has Given Netflix the Edge Over Hollywood.” Forbes, May 27, 2018. https://www.forbes.com/sites/enriquedans/2018/05/27/how-analytics-has-given-netflix-the-edge-over-hollywood/.
[2] Machine learning (ML) refers to the ability of a program to detect patterns in data and continuously improve its pattern-recognition capabilities to identify trends and make predictions from data to uncover past, present, and future events (Berube, 2018).
[3] Deep learning is a type of machine learning that layers algorithms to develop an artificial neural network capable of autonomously recognizing patterns within data in order to perform tasks that previously required human-level knowledge and reasoning.
[4] “Training data refers to a data set that has been collected, prepared, and provided to the model for the purpose of teaching prior to active deployment.” Malli, Nisa, Melinda Jacobs, and Sarah Villeneuve. 2018. “Intro to AI for Policymakers: Understanding the Shift.” The Brookfield Institute for Innovation + Entrepreneurship. https://brookfieldinstitute.ca/report/intro-to-ai-for-policymakers.
[5] This includes supervised learning, semi-supervised learning, reinforcement learning, and unsupervised learning. For more information, see page 5 of: Malli, Nisa, Melinda Jacobs, and Sarah Villeneuve. 2018. “Intro to AI for Policymakers: Understanding the Shift.” The Brookfield Institute for Innovation + Entrepreneurship. https://brookfieldinstitute.ca/report/intro-to-ai-for-policymakers.
[6] Hoobanoff, Jamie. 2019. “The Potential (and Limits) of Artificial Intelligence in HR and What It Means for Your Business.” The Globe and Mail, July 2, 2019. https://www.theglobeandmail.com/business/careers/leadership/article-the-potential-and-limits-of-artificial-intelligence-in-hr-and-what/.
[7] Tran, Kevin. 2018. “Google Is Capitalizing on AI in Marketing.” Business Insider. July 12, 2018. https://www.businessinsider.com/google-uses-ai-to-enhance-ad-campaigns-2018-7.
[8] Adriano, Lyle. 2018. “Manulife Re-Enters Market – Turns to AI.” Insurance Business Magazine Canada. June 20, 2018. https://www.insurancebusinessmag.com/ca/technology/manulife-reenters-market–turns-to-ai-103781.aspx.
[9] Ubelacker, Sheryl. 2019. “Transforming Health Care with AI: Tons of Potential, but Not without Pitfalls.” CTV News, April 8, 2019. https://www.ctvnews.ca/health/transforming-health-care-with-ai-tons-of-potential-but-not-without-pitfalls-1.4370124.
[10] Hurley, Dan. 2018. “Can an Algorithm Tell When Kids Are in Danger?” New York Times, January 2, 2018. https://www.nytimes.com/2018/01/02/magazine/can-an-algorithm-tell-when-kids-are-in-danger.html.
[11] Piovesan, Carole, and Vivian Ntiri. 2018. “Adjudicating by Algorithm: The Risks and Benefits of Artificial Intelligence in Judicial Decision-Making.” The Advocates’ Journal, Spring 2018: 42–45.
[12] Rejoiner. 2019. “Amazon’s Recommendation Engine: The Secret To Selling More Online.” 2019. http://rejoiner.com/resources/amazon-recommendations-secret-selling-online/.
[13] Stofan, Daniel. 2018. “Location Analytics: Exploring Foot Traffic at Your New Business Location with GoodVision.” Medium. September 22, 2018. https://medium.com/goodvision/goodvision-location-analytics-exploring-foot-traffic-at-your-new-business-location-805b8919b0ea.
[14] DataKind. 2019. “Identifying Food Bank Dependency Early.” DataKind. June 2019. https://www.datakind.org/projects/identifying-food-bank-dependency-early.
[15] DataKind. 2016. “De-Siloing Data to Help Improve the Lives of Those Suffering from Mental Illness.” DataKind. December 2016. https://www.datakind.org/projects/de-siloing-data-to-help-improve-the-lives-of-those-suffering-from-mental-illness.
[16] For example, see Thomson, Iain. 2019. “Talk about Unintended Consequences: GDPR Is an Identity Thief’s Dream Ticket to Europeans’ Data.” The Register, August 9, 2019. https://www.theregister.co.uk/2019/08/09/gdpr_identity_thief/.
[17] Office of the Privacy Commissioner of Canada. 2019. “The Personal Information Protection and Electronic Documents Act (PIPEDA).” Canada: Office of the Privacy Commissioner of Canada. https://www.priv.gc.ca/en/privacy-topics/privacy-laws-in-canada/the-personal-information-protection-and-electronic-documents-act-pipeda/.
[18] For example, see: Google. n.d. “How Google Anonymizes Data.” Google Privacy & Terms. https://policies.google.com/technologies/anonymization?hl=en.
[19] Rocher, Luc, Julien M. Hendrickx, and Yves-Alexandre de Montjoye. 2019. “Estimating the Success of Re-Identifications in Incomplete Datasets Using Generative Models.” Nature Communications 10. Accessed August 30, 2019. https://www.nature.com/articles/s41467-019-10933-3.
[20] Cadwalladr, Carole, and Emma Graham-Harrison. 2018. “Revealed: 50 Million Facebook Profiles Harvested for Cambridge Analytica in Major Data Breach.” The Guardian, March 17, 2018. https://www.theguardian.com/news/2018/mar/17/cambridge-analytica-facebook-influence-us-election.
[21] For more information, see: DNA webdesk. 2018. “BBC Documentary Clip Goes Viral, Shows Congress Poster in Office of Cambridge Analytica’s Ex-CEO Alexander Nix.” DNA, March 29, 2018. https://www.dnaindia.com/india/report-bbc-documentary-clip-goes-viral-shows-congress-poster-in-office-of-cambridge-analytica-s-ex-ceo-alexander-nix-2598837.
[22] “With emerging technologies we [may] assume that racial [and societal, cultural, et al.] bias will be more scientifically rooted out. Yet, rather than challenging or overcoming the cycles of inequity, technical fixes too often reinforce or even deepen the status quo,” in Benjamin, Ruha. 2019. Race After Technology: Abolitionist Tools for the New Jim Code. Wiley.
[23] Bowker, Geoffrey C. 2013. “Data Flakes: An Afterword to ‘Raw Data’ Is an Oxymoron.” In “Raw Data” Is an Oxymoron, 167–172. Cambridge, MA: The MIT Press.
[24] Rieland, Randy. 2018. “Artificial Intelligence Is Now Used to Predict Crime. But Is It Biased?” Smithsonian.com. March 5, 2018. https://www.smithsonianmag.com/innovation/artificial-intelligence-is-now-used-predict-crime-is-it-biased-180968337/.
[25] Liao, Shannon. 2019. “IBM Didn’t Inform People When It Used Their Flickr Photos for Facial Recognition Training.” The Verge. March 12, 2019. https://www.theverge.com/2019/3/12/18262646/ibm-didnt-inform-people-when-it-used-their-flickr-photos-for-facial-recognition-training.
[26] Dastin, Jeffrey. 2018. “Amazon Scraps Secret AI Recruiting Tool That Showed Bias against Women.” Reuters, October 9, 2018. https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G.
[27] Lohr, Steve. 2018. “Facial Recognition Is Accurate, If You’re a White Guy.” The New York Times, February 9, 2018. https://www.nytimes.com/2018/02/09/technology/facial-recognition-race-artificial-intelligence.html.
[28] Eubanks, Virginia. 2018. Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor. St. Martin’s Press. https://us.macmillan.com/books/9781250074317; Madden, Mary, Michele Gilman, Karen Levy, and Alice Marwick. 2016. “The Class Differential in Big Data and Privacy Vulnerability.” Data & Society. https://datasociety.net/output/the-class-differential-in-big-data-and-privacy-vulnerability/.
For media enquiries, please contact Nina Rafeek Dow, Marketing + Communications Specialist at the Brookfield Institute for Innovation + Entrepreneurship.