The Atlantic exposes the music datasets powering AI—and makes them searchable
Reporter Alex Reisner identified four training datasets containing over 21 million tracks, revealing how AI companies quietly harvest music from YouTube, Spotify, and free archives.
What matters
- Atlantic reporter Alex Reisner identified four music training datasets totaling over 21 million tracks and made them publicly searchable.
- Two datasets contain 12 million and 9 million tracks; the other two each hold over 100,000 songs.
- Google and Stability AI have confirmed in research papers that they trained on some of these datasets.
- Three datasets are distributed as lists of links to YouTube and Spotify; developers then fetch the audio with automated tools that, according to Reisner, violate the platforms' terms of service.
- Music in the Free Music Archive dataset is free to stream for personal use but requires a license for commercial use.
What happened
Atlantic reporter Alex Reisner identified four datasets of music being used to train AI models and made them fully searchable for the public. Two of the sets are enormous—one containing 12 million tracks and another 9 million. The remaining two are smaller but still substantial, each holding over 100,000 songs.
According to Reisner, the datasets have been downloaded thousands of times. While it's impossible to know exactly who has used them, both Google and Stability AI have confirmed in research papers that they trained on some of this data.
Three of the four datasets are distributed not as audio files but as lists of links to songs hosted on YouTube or Spotify. AI developers then use automated tools to download the actual audio—tools that can bypass logins, advertisements, and other mechanisms designed to compensate creators. As Reisner notes, such tools violate the terms of service of those platforms.
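To make the "list of links" format concrete, here is a minimal Python sketch that tallies the entries of such a list by hosting platform. It only parses URLs and downloads nothing. The sample URLs, the `PLATFORM_HOSTS` mapping, and the function names are illustrative assumptions, not entries or tooling from the datasets Reisner identified.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical sample of a link-list dataset (placeholder IDs, not real entries).
SAMPLE_LINKS = [
    "https://www.youtube.com/watch?v=EXAMPLE_ID_1",
    "https://youtu.be/EXAMPLE_ID_2",
    "https://open.spotify.com/track/EXAMPLE_ID_3",
    "https://example.org/audio/track4.mp3",
]

# Assumed mapping from hostname to a platform label.
PLATFORM_HOSTS = {
    "www.youtube.com": "youtube",
    "youtube.com": "youtube",
    "youtu.be": "youtube",
    "open.spotify.com": "spotify",
}


def platform_of(url: str) -> str:
    """Return the platform label for a URL, or 'other' if the host is unknown."""
    host = (urlparse(url).hostname or "").lower()
    return PLATFORM_HOSTS.get(host, "other")


def count_by_platform(urls):
    """Count how many links in the list point at each platform."""
    return Counter(platform_of(u) for u in urls)


if __name__ == "__main__":
    for platform, count in count_by_platform(SAMPLE_LINKS).items():
        print(f"{platform}: {count}")
```

A tally like this shows only where a dataset's audio is hosted. It says nothing about who later downloaded it or how.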
Some sources, like the Free Music Archive dataset, are free to stream for personal use but require licensing for commercial applications—raising questions about whether AI training constitutes commercial use.
Why it matters
This is the first time the public can search and verify whether specific songs or artists appear in the training data behind AI music models. That transparency matters for several reasons:
- Artist rights: Musicians and rights holders can now check if their work was used without consent or compensation.
- Legal exposure: Three of the datasets are link lists whose audio must be downloaded from YouTube or Spotify, using tools that Reisner says violate those platforms' terms of service. That adds fuel to ongoing litigation over AI training data.
- Industry precedent: With Google and Stability AI already confirming use, the database provides concrete evidence for lawmakers, regulators, and courts examining how AI companies source creative works.
- Public accountability: With the data searchable, scrutiny no longer depends on disclosure by AI companies, which have largely stayed silent. Journalists, researchers, and artists can now investigate independently.
The scale is striking: over 21 million tracks across the four datasets, downloaded thousands of times by parties that may include major tech companies.
Public reaction
No strong public signal was available from Reddit or other discussion forums at the time of this report. The story is still developing, and community discussion may emerge as artists and rights holders begin searching the database for their own work.
What to watch
- Whether artists or labels use the searchable database to file new lawsuits or join existing ones.
- How Google and Stability AI respond to the confirmed use of these datasets in their research.
- Whether other AI companies disclose their training data in response to growing public pressure.
- Potential regulatory action, particularly in jurisdictions where scraping streaming platforms for commercial AI training may violate data or copyright laws.
- Whether The Atlantic's database prompts similar investigative efforts into training data for other modalities (text, image, video).
Sources
- The Verge — The Atlantic created a searchable database of the music used to train AI
- Wilson's Media — The Atlantic created a searchable database of the music used to train AI
- GNNHD — The Atlantic created a searchable database of the music used to train AI
- Amkio — The Atlantic created a searchable database of the music used to train AI
Open questions
- Will artists discover their work in the datasets and pursue legal action?
- How will AI companies that used these datasets respond to increased public scrutiny?
- Will this investigation prompt similar transparency efforts for text and image training data?
What to do next
Developers
Search The Atlantic's database to check if any music you've released or manage appears in the identified training datasets.
Understanding whether your work is in these datasets can inform decisions about licensing, legal action, or public statements.
Founders
Audit your AI training data sources and document provenance for any music datasets used.
With confirmed use by major companies now public, founders should ensure their training data practices are defensible and transparent.
PMs
Review whether your AI products' music features rely on models trained with these datasets and prepare compliance talking points.
The searchable database makes it easy for journalists and users to trace training data back to specific products, increasing reputational risk.
Investors
Assess portfolio companies' exposure to music training data litigation and ask for documentation of data sourcing practices.
Confirmed use of potentially improperly licensed datasets by Google and Stability AI signals broad legal risk across the AI music space.
Operators
If your organization licenses music for any purpose, cross-reference your catalog against The Atlantic's searchable database.
Identifying overlap between licensed and AI-trained catalogs can help quantify potential revenue loss or inform licensing negotiations.
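A cross-reference of this kind might be sketched as follows in Python. It assumes you have already compiled dataset entries by hand as (artist, title) pairs, because The Atlantic's database is a search page, not an API. The sample records, `normalize`, `track_key`, and `find_overlap` are all hypothetical names introduced for illustration.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(cleaned.split())


def track_key(artist: str, title: str) -> tuple:
    """Build a comparison key from an artist name and a track title."""
    return (normalize(artist), normalize(title))


def find_overlap(catalog, dataset_entries):
    """Return catalog tracks whose normalized key also appears in the dataset."""
    dataset_keys = {track_key(artist, title) for artist, title in dataset_entries}
    return [
        (artist, title)
        for artist, title in catalog
        if track_key(artist, title) in dataset_keys
    ]


# Hypothetical catalog and hand-compiled dataset entries (fictional names).
CATALOG = [
    ("Example Artist", "First Song"),
    ("Zoë Example", "Café Nights"),
    ("Another Act", "Unlisted Tune"),
]
DATASET_ENTRIES = [
    ("zoe example", "Cafe Nights!"),
    ("Example Artist", "first song"),
    ("Someone Else", "Other Track"),
]

if __name__ == "__main__":
    for artist, title in find_overlap(CATALOG, DATASET_ENTRIES):
        print(f"{artist} - {title}")
```

Matching on normalized strings is deliberately crude. It can miss remixes, alternate spellings, and featured-artist variants, and it can flag unrelated tracks that share a common title. Treat any hit as a prompt for manual review, not as proof of use.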
Testing notes
Caveats
- This is a journalistic investigation and public database, not a developer tool, API, or model release. While the public can search the database, it does not constitute a testable product in the traditional sense.