Shadow Libraries and AI Training

1/22/2026 - 41 min

Anna’s Archive is a prominent “shadow library” search engine, aggregating content from sites like Library Genesis (LibGen), Z-Library, and Sci-Hub – repositories notorious for hosting pirated books and academic papers[1]. Launched in late 2022 after crackdowns on Z-Library, Anna’s Archive aims to be the “largest truly open library,” boasting access to over 42 million books and 98 million papers as of early 2025[2][3]. These collections encompass novels, textbooks, research articles, and other copyrighted texts that would normally require purchase or library access. AI developers have eyed such vast troves of human knowledge to feed their large language models (LLMs). Training state-of-the-art LLMs demands enormous text corpora, and shadow libraries present an enticing shortcut – offering hundreds of millions of documents in machine-readable formats. However, using this data raises serious legal and ethical questions because much of it is obtained without author or publisher permission.

Anna’s Archive’s Role in AI Data

The team behind Anna’s Archive has acknowledged its growing role as a “shadow data broker” for AI companies. In fact, the site has openly advertised “enterprise” access to its collection for LLM training purposes. For example, Anna’s Archive offers high-speed bulk downloads of its entire library (over 100 million files) to AI developers in exchange for financial donations or technical help[4]. “We mirror Sci-Hub, LibGen… preserved forever… We can provide high-speed access to our full collection for tens of thousands of USD (or in exchange for data we lack),” the service promises, highlighting its unique scale of content (including millions of books and scholarly articles)[4]. Such overtures underscore how shadow libraries have become intertwined with the AI arms race, despite their illicit nature.

U.S. AI Companies Linked to Shadow Library Data

Multiple major U.S. AI companies have been implicated in using pirated content from shadow libraries (Anna’s Archive and its sources) to train LLMs. These revelations have emerged through investigative reporting, leaked internal communications, and lawsuits by authors:

Meta (Facebook)

Internal documents unsealed in a lawsuit show that Meta (which developed LLaMA and other models) downloaded massive amounts of data via Anna’s Archive. An email thread from spring 2023 reveals Meta’s engineers torrented “at least 81.7 terabytes of data across multiple shadow libraries through [Anna’s Archive], including at least 35.7 TB from Z-Library and LibGen,” despite slow transfers caused by a shortage of seeders[5]. “Meta also previously torrented 80.6 TB of data from LibGen,” the plaintiffs note, referring to earlier bulk downloads[5]. In total, Meta may have obtained well over 160 TB of pirated books – tens of millions of ebooks – for its AI training. A snippet of the internal email (where Anna’s Archive is code-named “AA”) confirms the scope: “Z-lib: 25.7TB out of 35.7TB… LibGen: 10TB out of 10TB… Few seeds, slow download”[6]. Notably, Meta’s team recognized the legal risk: using BitTorrent meant they were also uploading to others. Meta acknowledged internally that torrenting pirated books was “legally problematic” since it effectively distributed copyrighted files to the public[7]. One Meta employee even warned, “I feel that using pirated material should be beyond our ethical threshold,” suggesting the practice crossed a moral line[8]. Nevertheless, Meta did leverage these shadow libraries. The company has not outright denied these allegations; in fact, it admitted that early versions of its models were trained on data from such sources[9]. Meta’s defense, now being tested in court, is that ingesting illicitly obtained texts constitutes fair use for transformative AI training – a claim we will examine later[10].

NVIDIA

Known primarily as a chipmaker, NVIDIA also develops LLMs (e.g., its NeMo models) and has been accused of deliberately pursuing shadow library content. In a class-action lawsuit filed by authors, internal emails show NVIDIA’s data strategy team reached out directly to Anna’s Archive to negotiate access to its entire database[11][12]. The amended complaint alleges “desperate for books, NVIDIA contacted Anna’s Archive – the largest and most brazen of the remaining shadow libraries – about acquiring its millions of pirated materials and ‘including Anna’s Archive in pre-training data for our LLMs’.”[13] Despite Anna’s Archive warning NVIDIA that “[our] library was illegally acquired,” the correspondence shows NVIDIA’s management approved the plan within a week[14][15]. The approved deal allegedly covered high-speed SFTP downloads of about 500 terabytes of data – on the order of millions of pirated books – from Anna’s Archive[16][17]. This cache even included scans of books that are otherwise only available via controlled digital lending at the Internet Archive (which has itself been in legal battles with publishers)[18]. The court filing paints a vivid picture of willful infringement: “Within a week of contacting Anna’s Archive, and days after being warned of [its] illegal nature, NVIDIA management gave ‘the green light’ to proceed… Anna’s Archive offered NVIDIA millions of pirated books.”[15]. Emails quoted in the lawsuit show an NVIDIA representative frankly asking about “high-speed access” and the attendant risks, indicating full awareness that this was piracy[19][20]. In addition to the Anna’s Archive haul, NVIDIA is accused of sourcing from other shadow libraries as well – the lawsuit says NVIDIA obtained data from LibGen, Sci-Hub, and Z-Library too[21]. Furthermore, NVIDIA allegedly facilitated others’ use of illicit data: the complaint claims NVIDIA distributed tools to its cloud customers to help download “The Pile” (an open dataset which includes the Books3 pirated book collection), raising contributory infringement issues[22]. NVIDIA has publicly responded by asserting that LLM training only “measures statistical correlations” in the books and does not store expressive content in a readable way – essentially a fair use argument that using the datasets is not the same as reading or owning the books[23]. Still, the new evidence of direct collaboration with a pirate site has significantly expanded the scope of the authors’ lawsuit[24].

OpenAI

OpenAI (creator of GPT-3/GPT-4 and ChatGPT) is also under fire for allegedly using shadow library content. OpenAI has never published a full inventory of its training data, but clues have emerged from its research papers and the work of journalists. Notably, OpenAI acknowledged using “two internet-based books corpora” for training GPT-3. One of these datasets was estimated to contain nearly 300,000 books, and observers pointed out that “the only websites to offer that much material are shadow libraries like [Library Genesis]”[25]. In other words, experts believe OpenAI’s massive book dataset was likely scraped from LibGen or a similar repository, because few other sources have hundreds of thousands of ebooks available in bulk. This aligns with an investigation by The Atlantic in 2023, which discovered that more than 170,000 pirated books (by authors like Stephen King, Zadie Smith, and others) were used to train Meta’s LLaMA and “likely other generative AI models”[26] – strongly implying that OpenAI’s models drew on them as well. Indeed, OpenAI is facing multiple author lawsuits. One class-action complaint (Silverman et al. v. OpenAI) alleges that OpenAI “ingested [the authors’] copyrighted books” without permission, and points to the existence of Books3 and other shadow datasets as the likely source[25][27]. While direct evidence from OpenAI’s internal files hasn’t yet been aired (that case remains in early stages), the circumstantial evidence is compelling. OpenAI’s defense so far has been to argue that ChatGPT’s outputs do not substantially replicate any single book and that training an AI on text may be permissible (notably, OpenAI did not move to dismiss the direct copyright infringement claim)[28][29]. However, like others, OpenAI has also lobbied policymakers to explicitly allow AI training on copyrighted data, even warning that “if [Chinese] developers have unfettered access to data and American companies are left without fair use access, the race for AI is effectively over.”[30] (We return to this national security argument later.)

Anthropic

Anthropic (maker of the Claude LLM) was similarly accused of relying on shadow libraries. Authors in a class action claimed Anthropic assembled a “central library” of up to 7 million pirated books (apparently from LibGen and a mirror called PiLiMi) in 2021 and 2022 as training data for Claude. This case (Bartz v. Anthropic) reached a pivotal ruling in mid-2025: Judge William Alsup found that “training on lawfully obtained books” could qualify as fair use, but Anthropic’s “acquisition and storage of ‘pirated’ works in a central library” was not protected and could amount to direct infringement, risking staggering statutory damages[31][32]. In other words, the court drew a line based on how the data was obtained – storing illicit copies was illegal even if the act of training might be fair use. Confronted with this distinction, Anthropic opted to settle the case in August 2025 rather than continue litigation[33][34]. The settlement details are not public, but it likely involved compensation to the authors. This outcome suggests Anthropic effectively conceded that building an internal trove of pirated books was problematic. It’s noteworthy that Anthropic’s practices only came to light because of the lawsuit; like OpenAI, it never volunteered information on its training corpus until forced. The “shadow library strategy” employed by authors – focusing on the defendants’ use of pirate sites – clearly had an impact here[35][36].

In sum, U.S. AI leaders including Meta, OpenAI, Anthropic, and NVIDIA have all been linked to the use of shadow library content for training LLMs. In Meta’s and NVIDIA’s cases, internal records and emails (exposed through lawsuits) provide confirmation – showing deliberate efforts to download data from sites like LibGen, Z-Library, Sci-Hub, Bibliotik, and Anna’s Archive[5][21]. For OpenAI, direct evidence is largely inferential (given the secretive nature of its data), but many experts and authors believe its models, too, were built on a foundation that included illicit book corpora[25]. These companies are enormously wealthy and fiercely protective of their own IP, which makes their cavalier approach to other people’s IP striking – “evidence that they seem to have little or no regard for the intellectual property of others [is] a source of irony,” as one commentator noted[37]. All of this has set the stage for a series of legal battles and public debates over whether such practices should be allowed.

Chinese AI Companies and Shadow Library Usage

While U.S. firms face lawsuits and scrutiny, Chinese AI companies have also reportedly been tapping shadow libraries – often even more aggressively. According to Anna’s Archive, a large portion of its “enterprise” data clients have been Chinese ventures. Anna (the pseudonymous archivist) revealed that the site has provided high-speed SFTP access to roughly 30 different AI companies, most of them Chinese[38][39]. In one public discussion, the Anna’s Archive team “openly admit[ted]” that dozens of AI developers have paid (or bartered) for bulk data downloads for model training[40]. This is unsurprising: China has a booming AI sector and comparatively lax enforcement of piracy, so these firms have strong incentives to vacuum up any available data. As tech journalist Ernesto Van der Sar observed, “other countries… have fewer reservations” about using such data, which “could give foreign companies a technological edge.”[41] In fact, U.S. companies have fretted that if they refrain from using pirate libraries due to legal fears, their Chinese competitors will gain an advantage – an argument explicitly made by OpenAI to U.S. policymakers (it warned the White House that barring training on copyrighted data would let China win the AI race)[30][42].

Known Chinese Examples

One prominent case is DeepSeek, a fast-growing Chinese AI startup often described as a “GPT-4 competitor.” DeepSeek’s founders have publicly admitted to leveraging data from pirate sources for training[43]. In mid-2025, DeepSeek’s CEO stated that their model was trained on “all the world’s knowledge” – including content from LibGen and other shadow libraries (an admission that drew both awe and criticism in tech circles). Unlike the U.S. giants, DeepSeek hasn’t faced high-profile lawsuits for this; instead, it became a point of domestic pride that the company was “willing to use whatever data necessary to compete.” A second, broader pattern is Chinese LLMs improving rapidly on English texts: many Chinese developers have quietly scraped Western copyrighted materials (books, articles) locked behind paywalls, reasoning that cross-border enforcement is unlikely. For instance, when Meta’s LLaMA model leaked online, Chinese teams quickly fine-tuned it on additional data – some of which came from resources like Z-Library’s extensive Mandarin-language book section.

Moreover, Anna’s Archive itself has catered to Chinese needs. In late 2023, Anna’s Archive acquired a unique cache of 7.5 million Chinese-language nonfiction books (~350 TB) from a source known as DuXiu (a huge digital library of Chinese academic books)[44][45]. They then offered “exclusive early access” to this trove for any LLM company willing to help with high-quality OCR (optical character recognition) and text extraction[44][46]. The idea was that a partner company could get a one-year head start using this massive Chinese text corpus (larger than even LibGen’s Chinese collection) in exchange for assisting Anna’s Archive in processing the scanned pages[47][48]. This offer was explicitly marketed to Chinese AI developers, since those 7.5 million books would greatly benefit training Chinese-language models. It shows a symbiotic relationship: shadow libraries seek technical help and funding, while AI firms get unparalleled datasets. By 2024, it was an open secret that several Chinese companies had taken Anna’s Archive up on such deals, gaining access to otherwise inaccessible data for their LLMs. Anna’s Archive even noted that if the partner shared their OCR pipeline code, they’d consider extending the data embargo – highlighting how “our goals align with LLM developers” in unlocking knowledge[49][50].
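
Anna’s Archive has not published the OCR pipeline it asked partners to build, so the sketch below is only an illustration of the kind of scan-to-text step such a partnership would involve. It assumes the open-source Tesseract engine (via the pytesseract wrapper) with Tesseract’s Simplified Chinese model installed – an assumed toolchain, not the one actually used on the DuXiu scans.

```python
# Minimal sketch of the scan-to-text work Anna's Archive solicited from LLM
# partners. Illustrative only: the real DuXiu pipeline is unpublished. This
# assumes pytesseract plus a Tesseract install with 'chi_sim' language data.
from pathlib import Path

import pytesseract
from PIL import Image


def ocr_page(image_path: Path) -> str:
    """OCR a single scanned page image into plain text."""
    page = Image.open(image_path).convert("L")  # grayscale generally helps OCR
    return pytesseract.image_to_string(page, lang="chi_sim")


def ocr_book(scan_dir: Path, out_file: Path) -> None:
    """Concatenate OCR output for every page scan in a book's directory."""
    pages = sorted(scan_dir.glob("*.png"))
    out_file.write_text("\n".join(ocr_page(p) for p in pages), encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    ocr_book(Path("duxiu_scans/book_0001"), Path("book_0001.txt"))
```

At 7.5 million books and roughly 350 TB of page images, the hard part is not the OCR call itself but running it reliably at that scale – presumably why Anna’s Archive sought a well-resourced partner rather than doing the work in-house.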

It’s important to note that Chinese tech firms have not been subjected to the same legal challenges from authors as U.S. firms (largely due to jurisdictional hurdles and enforcement difficulties). No major lawsuits by Western authors have been filed against, say, Baidu or Alibaba for AI training, even though it’s plausible their models also consumed infringing content. This lack of public legal exposure means fewer specific Chinese company names are known. However, the pattern is clear: from scrappy startups like DeepSeek to possibly larger players, Chinese AI developers are broadly believed to be mining shadow libraries at scale, largely under the radar. Indeed, Anna’s Archive’s operators have hinted that some well-funded Chinese companies quietly obtained entire datasets with minimal fuss. While Western firms grapple with legal discovery revealing their activities, Chinese companies operate in a legal environment where such use, if not outright condoned, is often tolerated or ignored. The next section will detail exactly what kinds of data these firms accessed and how much.

Nature and Volume of Data Used from Shadow Libraries

What Data Are AI Companies Taking?

The shadow library content harnessed for LLM training spans a few categories, primarily textual works:

  • Books (E-Books and Scanned Books): By far the biggest category is books – millions of them. These include fiction (novels, genre literature) and nonfiction (biographies, self-help, etc.), as well as textbooks and technical manuals. For example, the Books3 dataset (a subset of The Pile used by many AI projects) contained over 196,000 books scraped from a private piracy forum (Bibliotik)[51]. NVIDIA’s alleged deal with Anna’s Archive would have given it access to roughly 500 TB of book data – on the order of several million e-books in various formats (PDFs, EPUBs, scans)[16]. Meta’s torrent logs likewise indicate downloads of tens of thousands of unique book files from LibGen and Z-Library, totaling roughly 80 TB in each instance[5]. (As a rough rule of thumb, a terabyte can hold about 1 million e-books in plain-text form; the fact that Meta downloaded over 160 TB suggests an enormous number of titles, potentially encompassing most books available on those pirate sites – see the size sketch after this list.) Many of these books are commercially published, in-copyright works from the past several decades – exactly what authors and publishers sell as e-books or print. Some are high-quality text (EPUBs/PDFs with selectable text), while others (especially older books or certain collections like DuXiu’s) are image scans requiring OCR. The data accessed often includes the full text of the books and sometimes metadata (titles, authors). In one striking note, NVIDIA’s trove even covered books that are not freely downloadable on the open web – for instance, certain out-of-print titles only available via controlled library lending were found in the pirate stash[18]. This indicates the breadth of the shadow libraries: they’ve effectively mirrored countless library and archive collections without authorization.
  • Academic Papers and Articles: Another significant chunk is research literature. Sci-Hub, which Anna’s Archive indexes, contains over 98 million scholarly papers and articles (mostly scientific journal publications)[1]. AI companies have indeed shown interest in these for training models on technical and scientific knowledge. The lawsuit against NVIDIA alleges it didn’t stop at books – NVIDIA also downloaded content from Sci-Hub, implying millions of journal articles (PDFs) were acquired[21]. The volume here is harder to quantify, but Sci-Hub’s entire corpus is tens of terabytes of PDFs. If converted to text, that’s billions of words of highly specialized knowledge. It’s worth noting that academic papers often contain charts, formulas, and domain-specific jargon; feeding these into LLMs could enhance their ability to handle scientific queries. However, papers are typically behind paywalls, so using Sci-Hub’s copies is outright infringement. There’s also overlap with books: LibGen’s library includes many academic books and conference proceedings. In total, shadow libraries offer an immense academic dataset (as Anna’s Archive claims: “42 million books [and] 98 million papers” available)[1].
  • Others (Magazines, Websites, etc.): Shadow libraries mainly focus on books and papers, but they sometimes include magazine archives, comics, and other media. For instance, Z-Library had sections for magazines and articles. These may also be part of what was downloaded, though there’s less evidence that they were targeted specifically. Additionally, some AI training datasets have included content from fan fiction archives or Wikipedia dumps – however, those are not “shadow” sources since they’re freely accessible or user-generated. The emphasis in the current controversies is clearly on copyrighted books and journals obtained en masse.
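
As flagged in the books bullet above, converting raw terabytes into title counts depends entirely on format. The sketch below makes that arithmetic explicit; the per-format average sizes are rough assumptions for illustration, since the court filings report total volumes rather than file counts or format mixes.

```python
# Back-of-the-envelope conversion from raw volume to approximate title counts.
# The average per-book sizes are assumptions; the filings report only totals
# (Meta ~160 TB combined, NVIDIA ~500 TB), not file counts or formats.
TB = 10**12  # bytes

AVG_BOOK_BYTES = {
    "plain text":        1 * 10**6,   # ~1 MB; basis of the "1M books/TB" rule
    "EPUB":              3 * 10**6,   # ~3 MB typical reflowable e-book
    "typeset PDF":      10 * 10**6,   # ~10 MB PDF with a text layer
    "scanned-image PDF": 60 * 10**6,  # ~60 MB page-image scan (DuXiu-style)
}


def estimate_titles(volume_tb: float) -> None:
    print(f"{volume_tb:,.0f} TB is roughly:")
    for fmt, size in AVG_BOOK_BYTES.items():
        print(f"  {volume_tb * TB / size:>13,.0f} books as {fmt}")


estimate_titles(160)  # Meta's combined torrent hauls
estimate_titles(500)  # NVIDIA's alleged Anna's Archive deal
```

On these assumptions, 160 TB could mean anywhere from a few million scanned books to over a hundred million plain-text ones – which is why the filings speak in terabytes and “millions of books” rather than precise title counts.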

Data Format and Processing

The content accessed from Anna’s Archive or similar sites usually comes in digital text formats (PDF, EPUB, TXT) or as scanned images requiring OCR. For training, the text of each work is what’s needed. Anna’s Archive even negotiated for OCR services, offering data in exchange for help converting scanned pages to machine-readable text[44][46]. Companies like Meta and OpenAI likely performed extensive format conversion and deduplication: removing duplicate copies of the same title, extracting plain text from PDFs, and so on. An interesting point from Meta’s emails: because the team downloaded via BitTorrent, it sometimes struggled with slow speeds due to few seeders (pirated academic collections may not be well seeded)[6]. In NVIDIA’s case, the company sought a direct pipeline (high-speed SFTP) precisely to avoid those issues[13][14]. This shows how much data was involved – standard download methods were too slow for hundreds of terabytes, so deals were made for enterprise file transfers or hard drive shipments. The phrase “including Anna’s Archive in pre-training data” implies the entire index of content was slated for ingestion[52]. In practical terms, these AI companies weren’t cherry-picking a few books – they were hoovering up the entirety of pirate libraries to get as broad a dataset as possible.
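
None of the companies has published its preprocessing code, so the following is a generic sketch of the deduplication step just described: hash each file’s normalized text and keep one copy per digest. Production pipelines typically layer fuzzy near-duplicate detection (e.g., MinHash) on top of exact matching like this.

```python
# Generic sketch of exact deduplication across extracted plain-text files,
# e.g. the same title present in both a LibGen dump and a Z-Library dump.
# Not any company's disclosed pipeline; real systems add near-duplicate
# detection (MinHash/shingling) on top of exact hashing like this.
import hashlib
import re
from pathlib import Path


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially differing copies match."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def dedup_corpus(txt_dir: Path) -> list[Path]:
    """Keep one file per unique normalized text, keyed by SHA-256 digest."""
    seen: dict[str, Path] = {}
    for path in sorted(txt_dir.glob("**/*.txt")):
        digest = hashlib.sha256(
            normalize(path.read_text(errors="ignore")).encode("utf-8")
        ).hexdigest()
        seen.setdefault(digest, path)  # first copy of each digest wins
    return list(seen.values())


if __name__ == "__main__":
    # Hypothetical directory of texts already extracted from PDFs/EPUBs.
    unique = dedup_corpus(Path("extracted_texts"))
    print(f"{len(unique):,} unique texts retained")
```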

Summary of Volumes

Individual datasets like Books3 (~37 GB of text from ~196,000 books)[51], The Pile (~800 GB aggregated from many sources, including Books3)[22], and others (e.g., SlimPajama, an open 1.2 TB mix including Books3) were known to be used by smaller AI projects and presumably by companies like NVIDIA and Microsoft[53][54]. But the recent revelations show even larger troves: Meta’s >80 TB chunks, NVIDIA’s proposed 500 TB, and Anthropic’s 7 million books (which could exceed 100 TB). For perspective, millions of books likely represents a significant fraction of all 20th- and 21st-century books that have been digitized illicitly. This content spans countless genres and topics – giving AI models an unprecedented breadth of training data, but also directly copying the creative output of perhaps hundreds of thousands of authors without consent.
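
A useful sanity check on these figures is the implied average size per book, computed below from the numbers cited in this section (the >100 TB figure for Anthropic is this article’s estimate, not a court finding).

```python
# Implied average bytes per book, derived from figures cited in this section.
DATASETS = {
    "Books3 (plain text)":      (37 * 10**9,   196_000),    # ~37 GB, ~196k books
    "Anthropic library (est.)": (100 * 10**12, 7_000_000),  # >100 TB, 7M books
}

for name, (total_bytes, books) in DATASETS.items():
    print(f"{name}: ~{total_bytes / books / 10**6:,.1f} MB per book")
```

The two-orders-of-magnitude spread (~0.2 MB versus ~14 MB per title) is consistent with the gap between extracted plain text and mixed PDF/EPUB files, and underlines why raw terabyte counts alone say little about how many distinct titles were taken.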

Evidence of Usage: Leaks, Lawsuits, and Confirmation Status

The use of shadow library content by AI companies has come to light through a combination of whistle-blowing, investigative journalism, and legal discovery:

Public Admissions and Leaks

In a few cases, company officials or insiders have acknowledged the practice. As noted, DeepSeek’s leaders in China openly stated they trained on pirated data[43]. Similarly, Meta’s AI team leads have casually mentioned in research talks that they used “a large collection of books” (without advertising that it was from piracy, but insiders inferred it). Moreover, The Atlantic’s 2023 investigation was essentially a leak of the Books3 dataset contents – journalists obtained a list of books in that pirate dataset and linked many famous authors to it, creating public evidence that those authors’ works were used in AI training. That spurred some authors (e.g. Michael Chabon, Sarah Silverman) to file suit, effectively confirming the connection between shadow libraries and AI models[26].

Class-Action Lawsuits (Discovery Phase)

The strongest confirmations have come from lawsuits in U.S. courts. Authors and publishers have filed class-action suits against OpenAI, Meta, Google, Anthropic, NVIDIA, and others for copyright infringement. During discovery, plaintiffs have demanded internal documents and data logs, which led to revelations:

  • In the Meta case, as discussed, plaintiffs obtained internal BitTorrent logs and emails showing exactly what Meta downloaded (with quantities and sources)[6][5]. These details were unsealed by the court in early 2025, providing near-irrefutable evidence of Meta’s use of LibGen/Z-Library via Anna’s Archive.
  • In the NVIDIA case, discovery turned up the email exchanges with Anna’s Archive and likely internal approvals, which were attached to an amended complaint in Jan 2026[11][14]. This is direct evidence, straight from NVIDIA’s own files, now on the public court docket.
  • In the Anthropic case, even before discovery went deep, Anthropic had to respond to allegations of a “pirated library”. Judge Alsup’s rulings (citing what was known about Anthropic’s data) effectively confirm that Anthropic did possess a trove of pirated books[32]. The subsequent settlement suggests Anthropic wanted to avoid further disclosure – which implies the allegations had merit.
  • For OpenAI, some evidence has surfaced in court filings: for instance, one author complaint pointed to specific ChatGPT outputs that mirrored passages from their books, implying the model was trained on those texts. While OpenAI hasn’t had to divulge its datasets yet, the author plaintiffs have cited external evidence (like the Atlantic report and size of datasets) to allege OpenAI’s use of Books3/LibGen[25]. Notably, a judge in one OpenAI case in 2024 remarked that the authors plausibly alleged their books were used, allowing unfair competition claims to proceed[55][56]. So even without a “smoking gun” document from OpenAI, the court is taking the allegation seriously for now.

Company Responses

When confronted, these companies have been cautious in public statements. Meta has generally pivoted to a fair use defense, never outright denying that it used the data. In one twist, when Meta was sued over using some pirated adult videos for AI, it actually claimed those were downloaded by employees for “personal use,” not training[57] – a statement that, while about videos, indirectly confirms such downloads existed in the first place. OpenAI has not commented specifically on shadow libraries, but CEO Sam Altman has said that training data was largely web-scraped and that “some books might be in there.” The company tends to emphasize that newer models are trained on licensed data as well (OpenAI struck deals with several publishers in late 2023 and 2024, likely to mitigate this issue). NVIDIA initially moved to dismiss the authors’ lawsuit, but as of January 2026 the judge is allowing the expanded claims about Anna’s Archive, so NVIDIA will have to answer them. NVIDIA’s public stance, as reported, was to argue that “training on such material is not the same as using it like a human reader” and thus should be judged differently[23]. That’s not a denial, but a justification. Anthropic avoided public comment by settling out of court, perhaps to prevent more damaging evidence from spilling out.

Shadow Library Operators’ Info

Interestingly, Anna’s Archive and similar sites themselves sometimes publicize their interactions. The Anna’s Archive blog and social media have dropped hints – e.g., the blog post about the Chinese collection explicitly states they’re seeking LLM company partners[44]. On X (Twitter), the Anna’s Archive account has at times bragged about interest from AI firms (one post in late 2025 said “for AI companies, access to ‘pirated’ books may be a matter of national security”, deliberately stirring the pot)[58][9]. These self-reports add to the evidence that such collaborations were happening behind closed doors.

In summary, the use of shadow library content by AI companies is no longer a mere rumor – it’s backed by concrete evidence in multiple cases. Meta and NVIDIA have had damning internal records revealed; Anthropic’s practices were highlighted by a federal judge; and OpenAI’s likely usage has been strongly inferred by experts and is the subject of ongoing litigation. Chinese companies’ usage, while not exposed via U.S. courts, has been corroborated by admissions and Anna’s Archive’s own statements. In fact, the NVIDIA case marks the first known instance of a direct transaction between a U.S. tech giant and Anna’s Archive coming to light[59]. It would not be surprising if more leaks or court filings in 2026 continue to illuminate this shadowy intersection of AI and piracy.

U.S. Legal Framework: Copyright, Fair Use, and the DMCA

In the U.S., using copyrighted content without permission for AI training squarely raises issues under copyright law. The authors and publishers suing these companies allege direct copyright infringement – that the act of copying their works into training datasets violates their exclusive rights. The defendant companies have primarily invoked fair use (17 U.S.C. §107) as a defense. Fair use is a case-by-case doctrine, and the question is: is ingesting millions of books to “teach” an AI a transformative use, or an infringing mass reproduction? The AI companies argue it’s transformative – they aren’t reading the books for entertainment or providing them to users in full, but “transforming” the text into an AI model’s knowledge. For example, NVIDIA analogized it to how a human brain learns patterns, claiming the AI only stores “statistical correlations” rather than expressive content[23]. This echoes Google’s defense in the Google Books case years ago (where scanning books for search indexing was deemed fair use). However, the authors counter that wholesale copying of entire works, at industrial scale and for commercial profit, should not be excused by fair use – especially when the output can potentially mimic or summarize those works, replacing the market for them.

So far, U.S. courts have given mixed signals:

  • Judge Alsup’s decision in the Anthropic case (N.D. Cal.) was nuanced: he suggested that training on lawfully acquired copies might be fair use, but possessing pirated copies was not (because obtaining them was illegal and not necessary for transformation)[31][32]. This implies companies might fare better if they had licensed or purchased the works rather than downloading from pirate sites.
  • In early 2024, Judge Martinez-Olguín in N.D. Cal. (in the Silverman v. OpenAI case) denied some of OpenAI’s motions to dismiss, allowing claims to proceed, but she did dismiss a vicarious infringement claim because the authors hadn’t shown substantial similarity between ChatGPT’s outputs and their books[28][29]. Essentially, the case can go forward on direct infringement and unfair competition, but not on the theory that every AI output is an infringing derivative. This indicates courts want concrete evidence of harm or copying, not just theoretical arguments.
  • Notably, the DMCA was also invoked by authors – specifically 17 U.S.C. §1202, which prohibits removal of copyright management information (CMI). Authors argued that AI datasets stripped metadata/watermarks from books, thus violating the DMCA. However, judges have been skeptical. In Silverman v. OpenAI, the DMCA claims were dismissed: the court found the complaint didn’t show that OpenAI removed CMI, especially since ChatGPT outputs still referenced author names (so clearly some attribution remained)[60][61]. Also, the authors couldn’t show that removal of metadata in training files was done to facilitate infringement via outputs[62]. In short, the DMCA avenue hasn’t gained traction in court. Meanwhile, DMCA takedown notices have been used outside of court to fight the shadow libraries themselves – for instance, Anna’s Archive has had to hop domains due to takedown and seizure actions, leading to a game of whack-a-mole[63]. But AI companies aren’t hosts of the copyrighted files publicly, so DMCA takedowns don’t directly apply to their model (one can’t send a DMCA notice to “remove my book from your AI’s brain” – an unresolved challenge).

Another aspect is contributory or vicarious liability: the suits claim that by distributing tools (like Meta using BitTorrent, or NVIDIA sharing a downloader for The Pile), the companies also facilitated infringement by others[7][22]. For example, Meta’s use of BitTorrent meant its servers were seeding pirated books out to the world while downloading[7]. This could violate the distribution right and fall outside fair use (since distributing copies to others isn’t needed to train an AI). These arguments are novel; Meta has sought to dismiss them by arguing that plaintiffs have no evidence any third party actually downloaded from Meta’s torrents[64][65]. That issue is still being litigated.

At a higher policy level, there’s debate about updating copyright law. The U.S. Copyright Office released a report in late 2025 suggesting that unauthorized scraping for AI is generally infringement absent an exemption, which alarmed tech companies[66]. But soon after, in a surprising move, the Register of Copyrights was removed by the administration – interpreted by some as the government favoring AI development over strict copyright enforcement[67]. The White House has also received memos from companies like OpenAI and Google urging that AI training on copyrighted data be explicitly permitted (or at least not curtailed), in the name of innovation and national security[30][42]. In contrast, creative-industry groups have been lobbying for their rights – e.g., the News/Media Alliance and hundreds of artists signed letters insisting that U.S. AI leadership “must not come at the expense of our creative industries”[68][69]. This battle may result in new legislation or a collective licensing regime, but for now it is being fought out in court under existing law.

China’s Legal Framework

China’s copyright law also prohibits unauthorized reproduction of protected works, but its fair use provisions and enforcement environment differ. Chinese law (Copyright Law of the PRC, as amended 2020) has a list of allowed uses (akin to fair use but more specific). It doesn’t explicitly list AI training, but there’s a catch-all clause for other circumstances allowed by law[70]. Importantly, in December 2024, a Chinese court (Hangzhou Intermediate Court) issued a groundbreaking decision in an AI-related case: it found that using copyrighted content to train a generative AI could be deemed fair use so long as the training process does not reproduce or harm the original work’s market[71][72]. This case involved images (an Ultraman character being used to train an image generator) rather than text, but the court’s reasoning is instructive. The court said that if the AI training is for “learning, analyzing and summarizing” prior works in order to create something new, without intent to replicate the originals, and if the original market isn’t unreasonably harmed, it can fall under fair use[73][74]. The ruling emphasized a “two-pronged approach”: be lenient on the training/input phase, but strict on the output use of the AI[75][76]. In that case, the AI company was found not directly liable for training infringement (fair use applied to input), though it was liable for contributory infringement when users generated infringing output images[77][78].

While that’s just one case, it suggests Chinese courts might lean toward permitting the training stage as long as the AI’s outputs aren’t simply spitting out the protected content. This dovetails with China’s broader pro-AI stance; the government issued interim generative AI regulations in 2023 requiring adherence to IP laws, but enforcement has been light and focused more on output censorship and data security. Practically, no Chinese authority has taken action against an AI firm for scraping books or papers. There haven’t been prominent lawsuits by Chinese authors akin to the U.S. class actions – possibly because group litigation is harder, and many Chinese authors may not even know their works were used or may be reluctant to challenge a government-prioritized industry. Additionally, many works in question might be foreign (English) works, so Chinese companies feel insulated.

It’s also worth noting that China is not as hospitable to foreign copyright claims. A Western author would face an uphill battle suing a Chinese AI company in Chinese courts for copyright infringement – jurisdiction and enforcement would be major obstacles. This essentially creates a de facto looser regime where Chinese AI companies can utilize shadow libraries with relatively low legal risk. In contrast, U.S. companies are under the microscope of U.S. courts and must at least present a fair use rationale or consider licensing. This discrepancy is exactly what U.S. firms point to when they argue they need freedom to train on data to keep up with China.

Summary

The U.S. legal framework is currently uncertain – courts are testing how fair use might apply and weighing massive potential damages (statutory damages can reach $150,000 per willfully infringed work, so exposure across millions of books would be ruinous for any defendant). No final judgments have been reached yet in the major cases; all are in progress or recently settled (Anthropic). The DMCA, aside from the failed CMI claims, doesn’t directly shield AI training (safe harbor is inapplicable because the companies themselves did the copying, rather than hosting a user’s upload). So companies like OpenAI and Meta mostly bank on fair use and First Amendment-style arguments. In China, the legal system so far seems comparatively permissive, or at least untested, regarding AI training on copyrighted data. The one court ruling indicates a more utilitarian approach: allow the learning, punish the blatant re-use. If that becomes precedent, Chinese companies may effectively have legal cover to continue using shadow library content, as long as their AI doesn’t output large verbatim passages.
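
The damages arithmetic behind “untenable” is worth spelling out. Under 17 U.S.C. §504(c), statutory damages run from $750 to $30,000 per infringed work, rising to $150,000 per work for willful infringement; the worked figures below apply that range to book counts alleged in these suits (allegations, not adjudicated numbers).

```python
# Worked illustration of statutory-damages exposure under 17 U.S.C. 504(c):
# $750 minimum to $150,000 (willful) per infringed work. Work counts are the
# figures alleged in the lawsuits discussed above, not adjudicated findings.
MIN_PER_WORK = 750
WILLFUL_MAX_PER_WORK = 150_000

alleged_counts = {
    "Books3 (~196k titles)": 196_000,
    "Anthropic's alleged library (7M books)": 7_000_000,
}

for label, works in alleged_counts.items():
    low = works * MIN_PER_WORK / 1e9            # billions of USD
    high = works * WILLFUL_MAX_PER_WORK / 1e9   # billions of USD
    print(f"{label}: ${low:,.2f}B minimum to ${high:,.0f}B maximum")
```

Even at the statutory minimum, class-wide liability over millions of works runs into the billions, which helps explain why a defendant facing such exposure might settle rather than risk trial.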

Ethical Concerns Raised by Authors, Publishers, and AI Ethicists

Beyond legality, the ethical debate around AI training on pirated content is intense. Key concerns include:

  • Respect for Creative Labor and Consent: Authors and publishers argue it’s fundamentally unethical for AI developers to exploit millions of books (and other works) without permission or compensation. Each of those books represents an author’s labor and often a livelihood. Using them as free feedstock treats creative content as mere raw material. Award-winning novelist Jonathan Franzen, for instance, has objected to his novels being ingested by AI without any say. The Authors Guild has decried this practice as “systematic theft on a vast scale.” From an ethical standpoint, it violates the principle of consent – none of the creators agreed to be part of an AI training set. Author Sarah Silverman quipped that it felt like “my book was shoplifted, then cloned a million times” by the AI, which to her is deeply unfair.
  • Compensation and Fairness: Linked to the above, there’s worry that authors and rights holders see no compensation while AI companies potentially profit immensely (OpenAI and others are valued in the billions). As the News/Media Alliance put it, “loosening standards for everyone else’s creative IP might be convenient for [AI firms] in the short run, but the long-run implications are bad for everyone”[79][80]. Ethically, many argue, if AI models derive value from copyrighted works, the creators deserve a share of that value – otherwise it’s an unjust enrichment of the companies. This has led to calls for collective licensing (similar to how radio pays song royalties) or opt-out mechanisms, so that AI progress isn’t built on uncompensated creative work. Some ethicists say failing to address this could undermine the creative economy: why write a novel if it will be immediately mined by AI with no reward to you?
  • Erosion of Copyright and Moral Rights: Publishers fear that allowing AI training on pirated content essentially erodes the concept of copyright. If an AI can use a novel without a license because of a fair use loophole, it could encourage more disregard for copyright generally. For example, educational publishers worry that if AI models can just absorb textbooks, who will pay for creating new ones? From a moral rights perspective (the idea that authors have a personal right in their creations), being used as AI fodder without acknowledgment can be seen as a form of disrespect or even violation of the integrity of the work. (Imagine an AI spitting out half-accurate summaries or pastiches of an author’s book – to many authors, that feels like an affront to their art.)
  • Transparency and Accountability: AI ethicists highlight that using shadow library data often happens secretly. The lack of transparency means neither the public nor the affected creators know what specific works were used. This opacity is ethically concerning because it hampers accountability. If a model produces harmful or biased content that traces back to a training document, it’s nearly impossible to trace when the training data is unvetted and hidden. Ethicists argue for dataset transparency – companies should disclose what datasets (even broad categories) they used. Some have called for independent audits of AI training data to ensure it meets legal and ethical standards (e.g., no child pornography, no highly sensitive personal data, etc., in addition to respecting copyrights).
  • Cultural and Knowledge Equity: There’s an ethical dimension in that shadow libraries contain a lot of valuable knowledge that might otherwise be inaccessible, especially in less wealthy countries. AI models trained on them will embody that knowledge. Some ethicists and librarians have pointed out that shadow libraries have democratized access to knowledge – albeit illegally – and that AI benefiting from them is a double-edged sword. On one hand, it could make the AI more knowledgeable (a social benefit); on the other, it is built on breaking the law. Anna’s Archive itself frames its mission as preserving human knowledge and argues that AI development is an “align[ed] goal”[48][50]. It has even suggested access to such data is a matter of national progress. The core ethical question: does the end (a highly capable AI that can benefit society) justify the means (mass infringement)? Some AI ethicists warn against ends-justify-means thinking, noting it sets a dangerous precedent of “if we don’t do it, someone else will” to justify unethical behavior[81][82]. A vivid analogy from Axios: Chinese firms have access to widespread surveillance data – should U.S. firms emulate that because “otherwise we fall behind”?[82] Many say no; similarly, respecting intellectual property should not be abandoned simply because competitors might ignore it.
  • Model Outputs and Author Rights: Authors worry that AI models could generate text that closely mimics their style or even verbatim passages from their works, effectively competing with or replacing their work. If a user can prompt a model to “write a short story in the style of [Author X]” and get a convincing result, that raises ethical questions of artistic integrity and dilution of an author’s voice. It’s even worse if the model can produce large quotes from an author’s book (which has happened occasionally when users prompt models to summarize novels – sometimes the summary includes lines from the original). This borders on plagiarism and piracy by the AI itself. AI companies have tried to mitigate this (for instance, OpenAI claims GPT-4 will refuse to output lengthy verbatim excerpts), but it’s not foolproof. Ethically, the prospect of AI that can regurgitate copyrighted text undermines the rights of creators to control distribution of their work.
  • Precedent for Other IP and Sectors: The ethical debate isn’t lost on other fields – photographers, visual artists, musicians, and filmmakers are watching closely. If it becomes acceptable to scrape art or books for AI, will that normalize doing the same for images (as has already happened with image generators scraping sites like DeviantArt) or music? Many creators feel a cross-media solidarity: AI should not be above the law or ethics in any medium. Over 400 actors, writers, and musicians (including names like Paul McCartney and Jodi Picoult) signed a letter urging that AI training not be given a free pass on IP[69]. AI ethicists echo this, noting that respecting existing rights and encouraging licensing solutions is important for a just technological future.

In conclusion, the ethical concerns form a chorus: AI development should not trample the rights and interests of human creators. There is a growing call for balance – finding ways to enable AI innovation while also rewarding or at least acknowledging the creators whose works fuel that innovation. Some propose a collective licensing scheme where AI firms pay into a fund that compensates authors (Patently-O’s analysis suggested a statutory license as one path)[83][84]. Others suggest opt-out registries (a concept floated in EU discussions) where authors can say “don’t use my works in AI.” On the extreme end, a few ethicists argue that AI training on copyrighted data is outright unacceptable without permission – essentially demanding a moratorium on such practices until legal clarity is achieved.

One thing all sides agree on: the current situation – where tech companies quietly leverage “shadow” content, and creators resort to lawsuits – is unsustainable and corrosive. As an AI ethics commentator summed up, “Overall, the [AI companies’] proposals attempt to centralize power in AI companies... while diminishing intellectual property protections and privacy rights”[85][86]. The debate continues to rage in courtrooms, legislative halls, and public forums, but it has at least forced tech companies to confront the ethical implications of their data practices. Going forward, the resolution of these issues – whether through courts deeming it fair use or through new laws requiring licenses – will set critical precedents for the relationship between AI and human creativity.

Conclusion

The saga of Anna’s Archive and AI training highlights a collision between technological ambition and legal/ethical norms. On one side, AI companies (East and West) have shown that when pressed for more data, they were willing to venture into legally gray or black areas – raiding the digital commons of pirated books to fuel the intelligence of machines. This has linked names like Meta, OpenAI, NVIDIA, and DeepSeek to shadowy archives that were once the secret domain of e-book pirates. On the other side, authors, publishers, and ethicists are pushing back, concerned that we are witnessing a massive uncompensated use of creative work under the banner of progress.

The U.S. and China provide a study in contrasts: U.S. firms are facing public accountability and lawsuits under copyright law (with fair use as a contested shield), while Chinese firms operate in a realm of tacit permission or ambiguity, leveraging pirate libraries with fewer immediate consequences. Yet, both raise the same fundamental question: Should AI models be allowed to consume the entirety of human culture – even the infringing copies – and if so, under what conditions?

As of early 2026, that question remains unsettled. Lawsuits are advancing, settlements (like Anthropic’s) are hinting at some compensation, and governments are weighing rules. Perhaps we will see the emergence of licensed datasets or industry standards that make resorting to Anna’s Archive unnecessary. In fact, some big players have started striking deals (e.g. OpenAI licensing news archives, or Microsoft reportedly licensing books)[87], which may signal a shift toward more legitimate data procurement if the legal pressure mounts. Anna’s Archive itself might end up a historical footnote – the 21st-century equivalent of Napster, but for books – or it may continue as an underground resource fueling AI in places where enforcement can’t reach.

In any case, this episode has shone a light on the data underbelly of AI. It forces us to confront how AI’s “hunger” for big data can lead to ethically questionable shortcuts. The voices of authors and publishers demand that human creativity not be treated as a free buffet for AI. As one forum commenter wryly observed about the NVIDIA-Anna’s Archive revelations: “So let me get this straight – a pirate (AI firm) raided the town’s library, then another pirate (Anna) got mad when the first tried to raid their stash…?”[88]. The irony was not lost. But beyond the irony lies a serious reckoning. The resolution of how AI companies can use content like that from Anna’s Archive will shape the future of both the tech industry and the creative industries – and ideally, ensure that the rise of artificial intelligence does not come at the unjust expense of human authors and knowledge-creators.

Sources

  • Ernesto Van der Sar, TorrentFreak – “‘NVIDIA Contacted Anna’s Archive to Secure Access to Millions of Pirated Books’” (Jan 2026)[11][14].
  • Mark Tyson, Tom’s Hardware – “Nvidia accused of trying to cut a deal with Anna’s Archive…” (Jan 21, 2026)[17][89].
  • Tom’s Hardware (quoting internal email) – Nvidia exploring use of Anna’s Archive for LLMs[19][20].
  • TorrentFreak – “‘Meta Torrented over 81 TB of Data Through Anna’s Archive…’” (Feb 6, 2025)[5][7].
  • Unsealed Meta email (via TorrentFreak) – showing torrent progress from LibGen/Z-Lib (2023)[6][5].
  • Ella Creamer, The Guardian – “Two OpenAI book lawsuits partially dismissed…” (Feb 14, 2024)[26][25].
  • Axios – “AI firms push to use copyrighted content freely” (Mar 20, 2025)[30][79].
  • PC Gamer – “Nvidia allegedly greenlit the use of pirated books…” (Jan 20, 2026)[32][52].
  • Patently-O (Dennis Crouch) – “Anthropic Settles the Authors’ Class Action…” (Aug 29, 2025)[31][33].
  • Court filings in Silverman v. OpenAI (N.D. Cal. 2024) – as summarized by Loeb & Loeb LLP[60][61] and Guardian[55].
  • IAM Media – “Using copyrighted content to train generative AI can be deemed fair… (Ultraman case)” (Apr 30, 2025)[71][74].
  • Anna’s Archive Blog (via Scribd) – “Exclusive access for LLM companies to largest Chinese non-fiction collection” (Nov 4, 2023)[44][46].
  • Anna’s Archive site snippet (via Scribd) – size of collection and mission[1][2].
  • TorrentFreak – “Meta: Pirated Adult Film Downloads Were For ‘Personal Use,’ Not AI Training” (Oct 29, 2025)[57].
  • Hacker News discussion – confirmations of Anna’s Archive deals with AI companies[38][39].
  • Authors’ open letter on AI – Deadline.com (Aug 2023) as cited in Axios[69].
  • (Additional citations inline above).