Ryan McGrady/Ethan Zuckerman: Your personal and family videos are stuff for AI and so a privacy risk
From The Conversation
AMHERST, Mass.
The promised artificial intelligence revolution requires data. Lots and lots of data. OpenAI and Google have begun using YouTube videos to train their text-based AI models. But what does the YouTube archive actually include?
Our team of digital media researchers at the University of Massachusetts Amherst collected and analyzed random samples of YouTube videos to learn more about that archive. We published an 85-page paper about that dataset and set up a website called TubeStats for researchers and journalists who need basic information about YouTube.
Now, we’re taking a closer look at some of our more surprising findings to better understand how these obscure videos might become part of powerful AI systems. We’ve found that many YouTube videos are meant for personal use or for small groups of people, and a significant proportion were created by children who appear to be under 13.
Most people’s experience of YouTube is algorithmically curated: Up to 70% of the videos users watch are recommended by the site’s algorithms. Recommended videos are typically popular content such as influencer stunts, news clips, explainer videos, travel vlogs and video game reviews, while content that is not recommended languishes in obscurity.
Some YouTube content emulates popular creators or fits into established genres, but much of it is personal: family celebrations, selfies set to music, homework assignments, video game clips without context and kids dancing. The obscure side of YouTube – the vast majority of the estimated 14.8 billion videos created and uploaded to the platform – is poorly understood.
Illuminating this aspect of YouTube – and social media generally – is difficult because big tech companies have become increasingly hostile to researchers.
We’ve found that many videos on YouTube were never meant to be shared widely. We documented thousands of short, personal videos that have few views but high engagement – likes and comments – implying a small but highly engaged audience. These were clearly meant for a small audience of friends and family. Such social uses of YouTube contrast with videos that try to maximize their audience, suggesting another way to use YouTube: as a video-centered social network for small groups.
Other videos seem intended for a different kind of small, fixed audience: recorded classes from pandemic-era virtual instruction, school board meetings and work meetings. While not what most people think of as social uses, they likewise imply that their creators have a different expectation about the audience for the videos than creators of the kind of content people see in their recommendations.
Fuel for the AI Machine
It was with this broader understanding that we read The New York Times exposé on how OpenAI and Google turned to YouTube in a race to find new troves of data to train their large language models. An archive of YouTube transcripts makes an extraordinary dataset for text-based models.
There is also speculation, fueled in part by an evasive answer from OpenAI’s chief technology officer Mira Murati, that the videos themselves could be used to train AI text-to-video models such as OpenAI’s Sora.
The New York Times story raised concerns about YouTube’s terms of service and, of course, the copyright issues that pervade much of the debate about AI. But there’s another problem: How could anyone know what an archive of more than 14 billion videos, uploaded by people all over the world, actually contains? It’s not entirely clear that Google knows or even could know if it wanted to.
Kids as Content Creators
We were surprised to find an unsettling number of videos featuring kids or apparently created by them. YouTube requires uploaders to be at least 13 years old, but we frequently saw children who appeared to be much younger than that, typically dancing, singing or playing video games.
In our preliminary research, our coders determined nearly a fifth of random videos with at least one person’s face visible likely included someone under 13. We didn’t take into account videos that were clearly shot with the consent of a parent or guardian.
Our current sample size of 250 is relatively small – we are working on coding a much larger sample – but the findings thus far are consistent with what we’ve seen in the past. We’re not aiming to scold Google. Age validation on the internet is infamously difficult and fraught, and we have no way of determining whether these videos were uploaded with the consent of a parent or guardian. But we want to underscore what is being ingested by these large companies’ AI models.
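To give a rough sense of why a sample of 250 counts as "relatively small," the sketch below computes a 95% confidence interval for a sample proportion. It rests on assumptions that are not in our paper: that all 250 coded videos had a visible face, and that "nearly a fifth" corresponds to a hypothetical count of 50. The Wilson score interval is simply one common choice for this kind of estimate, not necessarily the method used in the study.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to a confidence level of roughly 95%.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts: 50 of 250 coded videos (an assumed stand-in for "nearly a fifth").
low, high = wilson_interval(50, 250)
print(f"95% interval: {low:.1%} to {high:.1%}")
```

Under those assumed counts, the interval spans about five percentage points on either side of the estimate, which is why coding a larger sample matters.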
Small Reach, Big Influence
It’s tempting to assume OpenAI is using highly produced influencer videos or TV newscasts posted to the platform to train its models, but previous research on large language model training data shows that the most popular content is not always the most influential in training AI models. A virtually unwatched conversation between three friends could have much more linguistic value in training a chatbot language model than a music video with millions of views.
Unfortunately, OpenAI and other AI companies are quite opaque about their training materials: They don’t specify what goes in and what doesn’t. Most of the time, researchers can infer problems with training data through biases in AI systems’ output. But when we do get a glimpse at training data, there’s often cause for concern. For example, Human Rights Watch released a report on June 10, 2024, that showed that a popular training dataset includes many photos of identifiable kids.
The history of big tech self-regulation is filled with moving goal posts. OpenAI in particular is notorious for asking for forgiveness rather than permission and has faced increasing criticism for putting profit over safety.
Concerns over the use of user-generated content for training AI models typically center on intellectual property, but there are also privacy issues. YouTube is a vast, unwieldy archive, impossible to fully review.
A subset of professionally produced videos could conceivably serve as an AI company’s first training corpus. But without strong policies in place, any company that ingests more than the popular tip of the iceberg is likely including content that violates the Federal Trade Commission’s Children’s Online Privacy Protection Rule, which prevents companies from collecting data from children under 13 without notice.
With last year’s executive order on AI and at least one promising proposal on the table for comprehensive privacy legislation, there are signs that legal protections for user data in the U.S. might become more robust.
When The Wall Street Journal’s Joanna Stern asked OpenAI CTO Mira Murati whether OpenAI trained its text-to-video generator Sora on YouTube videos, Murati said she wasn’t sure.
Have You Unwittingly Helped Train ChatGPT?
The intentions of a YouTube uploader simply aren’t as consistent or predictable as those of someone publishing a book, writing an article for a magazine or displaying a painting in a gallery. But even if YouTube’s algorithm ignores your upload and it never gets more than a couple of views, it may be used to train models like ChatGPT and Gemini.
As far as AI is concerned, your family reunion video may be just as important as those uploaded by influencer giant MrBeast or CNN.
Ryan McGrady is a senior researcher at the Initiative for Digital Public Infrastructure at the University of Massachusetts at Amherst.
Ethan Zuckerman is an associate professor of public policy, communication and information at UMass Amherst.
Disclosure statement
Ethan Zuckerman says: “My work – and the work we refer to in this article – is supported by the MacArthur Foundation, the Ford Foundation, the Knight Foundation and the National Science Foundation. I am on the board of several nonprofit organizations, including Global Voices, but none are directly connected to politics.”
Ryan McGrady does not work for, consult, own shares in or receive funding from any company or organization that would benefit from this article, and has disclosed no relevant affiliations.
And you can wear a beret
“[Amherst is] the last place in America where you can find people who think politically correct is a compliment … probably the only place in the United States where men can wear berets and not get beaten up.”
Madeleine Blais (born 1946), journalist and professor at UMass Amherst, in In These Girls, Hope Is a Muscle (1995).
Amherst is best known now for UMass and Amherst and Hampshire colleges. The Connecticut Valley is thick with colleges, from Connecticut up to Dartmouth College, in Hanover, N.H.
Patricia A. Marshall, Robert J. Awkward, Stephanie Teixeira: Mass. is exemplar of getting free stuff for colleges
Via The New England Journal of Higher Education, a service of The New England Board of Higher Education (nebhe.org), on whose advisory board New England Diary editor Robert Whitcomb used to sit.
In just over a year, Massachusetts public colleges and universities have galvanized a statewide movement to adopt more comprehensive use of Open Educational Resources (OER). How did state and campus leaders achieve such momentum?
By way of background, OER includes teaching, learning and research materials in any medium—digital or otherwise—that reside in the public domain or have been released under an open license that permits no-cost access, use, adaptation and redistribution by others with no or limited restrictions.
There had been prior nascent efforts to increase the utilization of OER in Massachusetts including: the launch of the Open Education Initiative at UMass Amherst, the MA #Go Open Project funded by a TAACCCT grant and the creation of a MA Community College OER Hub.
These initiatives served as watershed moments in the journey to begin to make more faculty, staff, administrators and students aware of the utility of OER as a learning approach and as an effective way of reducing rapidly rising textbook costs. For example, the efforts at the University of Massachusetts Amherst have benefitted nearly 13,000 students and have resulted in $1.8 million in savings. The Go Open community college program involved 9,000 students and 115 faculty members resulting in savings of $1.2 million for students, and the launch of the MA Community College OER Hub brought a repository for newly created open educational resources.
However, these efforts were accelerated when the statewide Student Advisory Council (SAC) presented a resolution to the state Board of Higher Education (BHE) in April 2018 asking the board to recognize OER as an approach to generate textbook cost savings for students and calling on the state Department of Higher Education (DHE) to explore and identify opportunities for implementing OER on a broader scale. Further, SAC noted that it would continue its advocacy for and support of OER.
The equity angle
During this same timeframe, Higher Education Commissioner Carlos E. Santiago was nurturing the development of what is now known as the Equity Agenda for Massachusetts public higher education. The Equity Agenda, officially adopted by the BHE in December 2018, aims to significantly raise the enrollment, attainment and long-term success outcomes among underrepresented student populations.
The goals of OER to reduce student textbook costs align with the Equity Agenda to increase persistence and completion of underrepresented students by: having a positive impact on student learning, addressing increasing interest among key stakeholders (e.g., students, public higher education institutions and faculty), responding to rising costs since textbook costs have risen by 88% over the last decade (OER State Policy Playbook, 2018), and addressing increasing interest in the Legislature.
________________________________________
A Textbook Case of Unaffordability
In a Florida Virtual Campus Survey conducted in 2012 and again in 2016, 20,000 public higher education students were asked what the cost of required textbooks had caused them to do in their academic careers.
Here are some of the results:
• Not purchase the required textbook: Two out of three
• Not register for a specific course: One out of two
• Take fewer courses: One out of two
• Earn a poor grade: One out of three
• Drop a course: One out of four
• Fail a course: One out of five
________________________________________
OER performs
This led the DHE to increase its involvement in OER beginning with awarding two direct OER Performance Incentive Fund (PIF) grants of $150,000 to the Massachusetts OER Collaborative, comprising UMass Amherst, Worcester State University, Northern Essex Community College and Holyoke Community College; and $100,000 to the Viking OER Textbook Affordability Initiative at Salem State University. In addition, two indirect OER PIF grants were distributed to Northern Essex Community College for its Competency-Based Pathways in Early Education for $198,414 and to Massasoit Community College for its Early College Strategies to Enhance Learning for $59,525.
In late fall 2018, Commissioner Santiago established an OER Working Group to convene, study, evaluate and make recommendations to him and the BHE that addressed:
• The need to identify lower-cost educational resources for students
• The BHE’s goals of increasing access and affordability, closing performance gaps and increasing completion
• The issue of addressing equity for underserved, low-income and first-generation students, especially students of color
• Enhancing instructor effectiveness while lowering costs for students
The OER Working Group convened in November 2018, co-chaired by Marilyn Billings, who heads the Office for Scholarly Research Communications at UMass Amherst, and Susan Tashjian, coordinator of instructional technology at Northern Essex Community College. The OER Working Group was staffed by Robert Awkward and the work overseen by Patricia A. Marshall, both at the DHE and both authors of this NEJHE piece. The OER Working Group consisted of 21 members representing all higher education segments and geographic locations in Massachusetts and included faculty, librarians, administrators, students and external representatives, including union, bookstore and employer reps.
First, a survey
To begin this initiative, the DHE partnered with the Massachusetts OER Collaborative to create and distribute a statewide OER survey to establish a baseline on OER utilization. The survey response rate was 100% and it provided very useful information on the state of OER in Massachusetts. The following are highlights from the 2018 OER Prevalence Survey:
• 71% of Massachusetts public higher education institutions had some level of OER activity
• Although the number of courses served varied, 11 to 20 was the most prevalent number of courses using OER, resulting in student savings of $10,000 to $100,000 for about half of the institutions (47%)
• English, math and biology were the highest-enrolled courses and the courses with the most OER use
• Faculty select their textbook individually or as a common textbook
• The most prevalent deterrents to faculty adoption of OER were:
  • Too hard to find what I need (25%)
  • Not enough resources for my subject (19%)
  • Not enough high-quality resources (17%)
The survey data was used not only to inform the work of the OER Working Group, but also to inform the Massachusetts OER Collaborative as it designed OER training for faculty across the state. Nearly 500 faculty attended five successful regional training sessions at UMass Amherst, Worcester State University, Northern Essex Community College, Roxbury Community College and Bridgewater State University.
After the kickoff meeting of the OER Working Group in November 2018, the work was divided into five subcommittees to fulfill the mission. The subcommittees included: Faculty Development, Infrastructure, Marketing Communications, Policy & Legislative, and Stakeholders. The subcommittees began meeting and working in December and met continuously until they submitted their subcommittee reports in April 2019.
Meanwhile, the Student Advisory Council continued its efforts to support and encourage greater utilization of OER across the state as it had promised, holding a Legislative Advocacy Day in January 2019, a Public Higher Education Advocacy Day in March 2019 and an OER Photo Campaign (during the international Open Education Week) in the spring of 2019.
A timeline
By April 2019, the five subcommittees had completed their work and submitted their reports to create a draft full report, which was reviewed and revised by the OER Working Group. The draft full report was used to provide an update on OER to the BHE’s Academic Affairs Committee. In addition to sharing the research and findings with the committee, it contained time-sequenced recommendations.
The short-term recommendations called for adopting a statewide OER definition, designating a statewide coordinator, establishing a statewide advisory council, encouraging and supporting continued student advocacy for OER and identifying OER courses in course management systems.
The mid-term recommendations included: providing OER faculty professional development, actively promoting the use of OER for graduate and continuing education and expanding a unified OER repository to make the discovery of local content easier.
In the long term, it called for increasing funding to address campus technology challenges and encouraging the consideration of OER in faculty tenure and promotion.
During the summer, DHE staff finalized the full report and sent it to public higher education presidents and chancellors to obtain their insights, ideas and perspectives on the findings and recommendations, and on how these would affect their campuses. The feedback received was incorporated into the final full report to the commissioner. After his review, the commissioner recommended that the full report, along with a motion to accept it and implement its recommendations, be submitted to the BHE’s Academic Affairs Committee at its Oct. 15 meeting. After a useful and engaged discussion, including active participation by the two student members on the Academic Affairs Committee, the motion was approved unanimously. The Academic Affairs Committee brought the final report and motion to the BHE on Oct. 22, where it was again approved unanimously, including active support by the student voting member of the BHE.
This OER initiative has been an exciting, multipronged effort that has actively engaged stakeholders from the grassroots and actively partnered with students. The utilization of a broad, diverse, representative working group to develop thoughtful and useful recommendations for BHE consideration and action was key to achieving useful and effective outcomes. Finally, the opportunity to coordinate these efforts with other campuses and with PIF grantees, and to work with OER advocacy groups and other states, has been rewarding to everyone involved. Nicole Allen, director of education for the Scholarly Publishing and Academic Resources Coalition (SPARC), a national OER advocacy organization, noted that “Massachusetts is an exemplar for state policy action.”
Ultimately, the largest beneficiaries of this work will be the students of Massachusetts, for whom reducing the cost of textbooks and other ancillary learning materials will significantly reduce direct, out-of-pocket expenses.
The quality of student learning will also increase. The national student success initiative Achieving the Dream conducted a study comparing the use of OER to traditional textbooks at 32 community colleges in four states. According to the study, “more than 60 percent of students reported that the overall quality of their learning experience in an OER course was higher than in a typical non-OER course.” This is the power of collective action focused on a shared goal. The Massachusetts DHE is proud to be an active participant in this institutional change effort on behalf of the students at our public colleges and universities.
Patricia A. Marshall is deputy commissioner for academic affairs & student success at the Massachusetts Department of Higher Education. Robert J. Awkward is director of learning outcomes assessment at the department. Stephanie Teixeira is former Massachusetts Student Advisory Council chair. Visit here to view the final OER report and recommendations.
UMass Amherst surges to ninth in sustainability ranking
The New England Council reports:
The University of Massachusetts at Amherst recently announced that it has been ranked ninth in the nation for Sustainable Universities by the Association for the Advancement of Sustainability in Higher Education’s (AASHE) Sustainability Tracking Assessment and Rating System (STARS). The STARS program recognizes sustainability accomplishments in areas such as academics, research, engagement, operations, and administration.
In 2015, the university was rated 29th in the STARS Campus Sustainability Index among U.S. doctorate-granting institutions. However, with the creation of the School of Earth and Sustainability, the design and construction of the John W. Olver Design Building – the largest and most technologically advanced academic contemporary wood structure in the U.S. – and the decision to be the first major public university to divest its endowment from direct holdings in fossil fuels, the university has significantly increased its STARS score from 68.18 to 75.77, resulting in a leap of 20 places from the 2015 rating.
Chancellor Kumble R. Subbaswamy said, “This new STARS score reflects the university’s continuing commitment to excellence in sustainability. UMass Amherst is a leader in best practices for energy efficient construction and sustainable food use, conducting world-class research and preparing a new generation of students to be inspired stewards of our planet.”
Read more on the UMass Amherst web site.
Buying small colleges
That the University of Massachusetts at Amherst is taking over the campus of tiny and bankrupt Mount Ida College, in Newton, is a sign of the times. The fact is that there are too many small private colleges in a time of a smaller cohort of college-age kids and ever more intense competition for student money. It’s probably tougher in New England than in most of the country because the region has a famed collection of very distinguished and well-endowed private colleges (most famously four of the eight Ivy League institutions and MIT) and generally improving, and expanding, state university systems to lure customers.
So now the flagship of the UMass system will get a physical site in Greater Boston. In the deal, UMass Amherst will assume $55 million to $70 million in debt from Mount Ida and then use the campus, which has dorms, labs, library and sports fields, as a place from which students can work on internships and engage in academic collaborations with other Boston-area colleges as well as with businesses. UMass Amherst also said that having the campus will boost fundraising by providing a site closer to rich alumni and others in the great wealth-creating machine centered in Boston, Cambridge and along Route 128. Understandably UMass Boston feels dissed.
Watch for more such takeovers of small colleges by larger ones. I’m glad they’ll still be used for education, though that might turn out to be mostly vocational.