Reddit blocks Internet Archive’s Wayback Machine from scraping data: What is it?


The strife between content platforms and AI companies is intensifying. In the latest turn of events, Reddit, the popular social media platform and forum, has claimed that AI companies have been scraping its data through the Internet Archive’s Wayback Machine. In response, the platform has announced that it is starting to block the Internet Archive from indexing most of its content. This means the Wayback Machine will not be able to crawl user posts, comments, or profile details. The company will, however, still allow the tool to crawl its homepage, Reddit.com, so the Internet Archive will only be able to archive content visible there.

The development comes at a time when Reddit is tightening its grip on user data. Reddit is not against AI firms training their models on its posts, but it requires them to pay first. Earlier, the platform had said that it would not restrict ‘good faith actors’ such as the Internet Archive. It has now changed its stance after sensing that such services are being used to help AI firms circumvent licence fees. The social media company claims to have evidence that some AI firms are exploiting the Wayback Machine to sidestep its policies and scrape user content.

The Story of the Internet Archive

Founded in 1996, the Internet Archive is a US-based non-profit organisation that operates the digital library website archive.org. The platform offers free access to an array of digitised media, including websites, software applications, music, audiovisual material and printed works. The organisation is a proponent of a free and open internet, meaning it is committed to offering universal access to all knowledge.


While users from anywhere can upload or download digital materials from its data clusters, a vast majority of its data is gathered automatically through its web crawlers that work towards preserving as much public content as possible.

The Internet Archive was founded by Brewster Kahle, a computer engineer and entrepreneur. It stemmed from his desire to create a comprehensive, publicly accessible record of the internet. Much of Kahle’s motivation can be traced to the mid-1990s, when he saw how quickly content vanished: websites disappeared after redesigns, servers went offline, and no historical record of early digital works was being kept. This was also the time when he made a fortune by selling his internet search company WAIS to AOL.

Kahle wanted this information to be available for future generations to study and reference, and he has long advocated free and lasting access to human knowledge. Although the Internet Archive is best known for the Wayback Machine, Kahle also wanted to preserve books, audio, video, software and other formats in digital form, essentially protecting them from physical damage. He came to understand that, for all its vastness, the internet is fragile, and that without active preservation much of it could vanish at any time.

“The opportunity before all of us is living up to the dream of the Library of Alexandria and then taking it a step further – universal access to all knowledge. Interestingly, it is now technically doable,” Kahle has been quoted as saying by multiple media publications. According to the digital librarian, the Internet Archive is not only a library or archive but also a cultural safeguard. It is essentially a way to ensure that knowledge persists despite technological shifts.


What is the Wayback Machine?

The Wayback Machine is a digital archive created by the Internet Archive. It was launched in 2001 and essentially offers a means for users to see how websites or digital content looked in the past. Kahle developed it to offer universal access to knowledge, as the tool preserves archived copies of now-defunct websites and web pages. As of today, users can explore over 866 billion web pages that have been saved over time.

The Wayback Machine’s software has been developed to crawl the web and download all publicly available information and data files on web pages. However, the crawlers cannot gather everything, because some data is restricted by publishers and some databases are inaccessible. To work around the limits of these partially cached databases, the Internet Archive introduced Archive-It.org, which enables institutions and creators to voluntarily preserve digital content. When it comes to storage capacity, the Wayback Machine began by growing at 12 terabytes per month; its first 100 TB rack became operational in 2004, and by November 2024 it had accumulated over 100 petabytes of data.

For years, the Wayback Machine complied with the robots.txt exclusion standard and even removed previously archived pages when a site blocked it. In 2017, however, it shifted to requiring explicit removal requests. Reportedly, around the same time, it began disregarding robots.txt on US government and military sites, a policy that was later broadened.
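For readers curious how a robots.txt exclusion works in practice, here is a minimal sketch in Python using the standard library's `urllib.robotparser`. The rules, the crawler name `example-archive-bot` and the `example.com` URLs are hypothetical and written only for illustration; they are not Reddit's actual robots.txt or the Internet Archive's real user agent. Python's parser uses simple path-prefix matching, so the sketch blocks forum and profile paths rather than trying to express "homepage only" exactly.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only (not any real site's file).
# The named crawler is kept out of forum and profile pages; every other
# crawler is allowed everywhere.
EXAMPLE_RULES = """\
User-agent: example-archive-bot
Disallow: /r/
Disallow: /user/

User-agent: *
Allow: /
"""


def build_parser(rules_text: str) -> RobotFileParser:
    """Load robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, agent: str, url: str) -> bool:
    """Return True if the rules permit `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


parser = build_parser(EXAMPLE_RULES)

for agent, url in [
    ("example-archive-bot", "https://example.com/"),
    ("example-archive-bot", "https://example.com/r/news/"),
    ("other-bot", "https://example.com/r/news/"),
]:
    print(agent, url, is_allowed(parser, agent, url))
```

A file like this is only a request: it works when a crawler chooses to check it before fetching a page, which is why a crawler's policy on honouring robots.txt matters.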

Who uses the Wayback Machine?

Since its launch, the Wayback Machine has been extensively used by researchers, scholars, and journalists to access defunct sites, track content changes, and hold public figures accountable. Reportedly, Wikipedia editors use it heavily for citation preservation, and it has partnerships such as the 2020 Cloudflare integration, which allows automatic archiving for sites using Cloudflare’s Always Online service.


There have been numerous legal disputes concerning the Wayback Machine, including mixed court rulings over the admissibility of its archives as evidence. Reportedly, some patent offices in the US and Europe accept its timestamps as proof of prior art. Legal challenges have also arisen from privacy and copyright claims. Despite its vast reservoir of information, the Wayback Machine is also affected by censorship and access restrictions. For instance, archive.org is blocked in China, was temporarily blocked in Russia, and has faced takedown requests from governments and corporations.





