How did modern LLMs scrape fanfiction sites?

Large language models (LLMs) are the foundation for AI text generators, which were "trained" on data in order to create artificial neural networks. The most well-known dataset is hosted by the Common Crawl, a non-profit that provides an open repository of web data to anyone who wants it, for free. In order to create the dataset, the Common Crawl scraped the internet for writing and made it publicly accessible. Its archive began in 2008 and is currently updated every two months. In order to create generative text AI programs, programmers used the Common Crawl dataset to underpin artificial neural networks, which are called LLMs.

The most well-known LLM is GPT, which was created by the company OpenAI. OpenAI used the Common Crawl dataset in GPT's development, and it is currently using it as it develops further versions of its successful use case, ChatGPT. OpenAI released the GPT API to the public in 2021. This API is the basis for many other text-based LLMs, which means that the current crop of "stochastic parrot" text-generator AI programs is supported by the Common Crawl via the GPT API and, technically speaking, built on a massive corpus of fanfiction.

In 2019, the Archive of Our Own had 32 billion words of fanfic available, calculated from around five million pieces of fanwork. I was unable to find a good source for how many words are on AO3 now, but I wouldn't be surprised if it was much, much more than 50 billion words. Again, for comparison (as these are absurdly huge numbers), there are currently 4.2 billion English words on Wikipedia. For our purposes, it's worth knowing that most, if not all, of those 32 billion words of fanfic available in 2019 are in the Common Crawl dataset that was used in OpenAI's GPT LLM. Nobody was told this was happening; many fic writers still don't know that their work was scraped at all. While the Crawl's data exists in a publicly available index, it is extremely difficult to access if you don't have the ability to understand and execute code at a fairly high level. The average internet user can only assume that if they had publicly available writing online, their writing ended up caught in the Crawl. So while some folks understood that the AO3 had likely been Crawled, nobody had done the digging to figure out if it was really being used.

How does Sudowrite link to Omega Verse fic?

A few weeks ago, Sudowrite, a GPT-based LLM, released its product for public beta. Unlike the call and response of ChatGPT, Sudowrite was built to facilitate fiction writing. Users can sign up and use their account to generate words that may or may not resemble a story shape. Additionally, users can paste their original words into the writing tool and the generator will offer options for what should come next. It is a highly advanced language generator focused on creating stories. And it used billions of words from the Archive of Our Own to develop its models.

In a series of more and more unhinged experiments, Wired was able to prove that Sudowrite had not only been trained on AO3, but was able to replicate stories that developed within its derivative, transformative culture. This rather ingenious and tongue-in-cheek piece of reporting revealed that Sudowrite could be prompted to generate a story within recognizable Omega Verse strictures. I am NOT getting into what constitutes an Omega Verse fic, and if you go looking for that information yourself I am not responsible for what you learn. The point is that this style of writing and the various tropes involved in writing within the Omega Verse are localized to online fanfiction communities, and were actually developed on AO3 itself. It is a culture-specific style of writing that has only recently made its way into mainstream, if non-traditional, publishing outlets.

The only way that Sudowrite would be able to generate recognizable Omega Verse stories is if it had been trained on so much fanfiction that the impact of fic was unignorable within the LLM's programming. I spoke to a Sudowrite customer representative via chat who confirmed that they trained their network on OpenAI's large language models and "their own models," and reiterated that these models were trained on online text published from 2011 through 2019. Once again, in 2019, the AO3 had 32 billion words. Using fic in an LLM deliberately aimed at writers is antithetical to fandom culture at large, and deeply disrespectful to the people who have written and distributed fic online, for free, for years.
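The "digging" described here is possible because the Common Crawl publishes a per-snapshot URL index (the CDX index API at index.commoncrawl.org), which is exactly the kind of tooling an average internet user would struggle with. Below is a minimal sketch of how one might check whether AO3 pages appear in a crawl snapshot; the crawl label `CC-MAIN-2019-35` and the sample result record are illustrative assumptions, not actual query results.

```python
# Sketch: querying the Common Crawl CDX index for captures of AO3 pages.
# Each line returned by the index API is a standalone JSON object that
# points at the WARC archive file holding the captured page.
import json
from urllib.parse import urlencode

def cdx_query_url(crawl: str, url_pattern: str) -> str:
    """Build a CDX index query URL for one crawl snapshot.

    `crawl` is a snapshot label such as "CC-MAIN-2019-35" (assumed here
    for illustration); `url_pattern` may end in * to match a site prefix.
    """
    base = f"https://index.commoncrawl.org/{crawl}-index"
    return base + "?" + urlencode({"url": url_pattern, "output": "json"})

def parse_cdx_line(line: str) -> dict:
    """Pull the fields we care about out of one JSON result line."""
    record = json.loads(line)
    return {
        "url": record.get("url"),
        "timestamp": record.get("timestamp"),
        "warc_file": record.get("filename"),  # WARC file containing the page
    }

if __name__ == "__main__":
    # Build (but don't fetch) a query for captured AO3 work pages.
    print(cdx_query_url("CC-MAIN-2019-35", "archiveofourown.org/works/*"))
    # A hypothetical result line, shaped like real CDX JSON output:
    sample = ('{"url": "https://archiveofourown.org/works/123", '
              '"timestamp": "20190821000000", '
              '"filename": "crawl-data/CC-MAIN-2019-35/example.warc.gz"}')
    print(parse_cdx_line(sample))
```

The point stands either way: checking requires writing and running code against a specialized index, which is why most fic writers never knew their work was in the dataset.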