Show HN: Reddit Archiving Tool

Inspired by the ongoing call to action by the Internet Archive team over at /r/DataHoarder [1], I've decided to try to preserve all cybersecurity-related subreddits. [2]

For people who don't know what's going on: the attempt to monetize the Reddit API is likely to lead to a lot of moderators quitting the platform, and a lot of subreddits could be set to private and/or have their threads deleted. At least that's the fear coming out of the ongoing moderator strike.

In my case I learned a LOT from Reddit's discussions about malware, exploits and how they work, and without those I certainly wouldn't be where I am today ... so I'm trying to preserve them.

As the ArchiveTeam Warrior only scrapes the HTML directly to the Web Archive, I'm trying to preserve the data itself directly as JSON files, with the intent to store it on IPFS later (I was inspired a couple of days ago by the-eye-team's effort to archive RARBG on IPFS).

I just wanted to let people here know about the tool. In case you want to archive your favorite subreddits, feel free to modify it.

There are some limitations, though: listings (new/hot/top/search) are all limited to 1000 entries, which means that the discovery of old threads is quite limited. Keyword search helps here, because each keyword gets its own listing of up to 1000 entries. In my case I'm searching for a lot of keywords (like CVE, RCE, vulnerability etc.) in order to discover more threads.

Would love to hear feedback. Currently it's just a quick-and-dirty prototype, because the threat of my favorite subreddits going dark is quite immediate. I tried to remove as much noise from the schema as possible, and the tool only archives the subreddit threads and comments, with the idea of scraping the linked websites/blog articles at a later point in time.

[1] https://ift.tt/Ta3bEkq
[2] https://ift.tt/y0mMcBb

June 11, 2023 at 01:57AM
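
For anyone who wants to adapt the keyword-search idea described above, here is a minimal sketch of how it could look. This is not the tool's actual code: the function names (`build_search_url`, `fetch_json`, `discover_threads`), the User-Agent string and the `max_pages` cap are all made up for illustration, and it assumes Reddit's public `search.json` listing endpoint with `after`-based pagination is reachable without authentication.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterable, Optional


def build_search_url(subreddit: str, keyword: str, after: Optional[str] = None) -> str:
    """Build a search listing URL restricted to a single subreddit."""
    params = {"q": keyword, "restrict_sr": "1", "sort": "new", "limit": "100"}
    if after:
        # "after" is the pagination cursor returned by the previous page.
        params["after"] = after
    query = urllib.parse.urlencode(params)
    return f"https://www.reddit.com/r/{subreddit}/search.json?{query}"


def fetch_json(url: str) -> dict:
    """Fetch one listing page. The User-Agent value is a placeholder."""
    request = urllib.request.Request(url, headers={"User-Agent": "archive-sketch/0.1"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def discover_threads(
    subreddit: str,
    keywords: Iterable[str],
    fetch: Callable[[str], dict] = fetch_json,
    max_pages: int = 10,  # assumed cap: 10 pages x 100 entries per keyword
) -> Dict[str, dict]:
    """Run one paginated search per keyword and merge results by thread id."""
    threads: Dict[str, dict] = {}
    for keyword in keywords:
        after: Optional[str] = None
        for _ in range(max_pages):
            listing = fetch(build_search_url(subreddit, keyword, after))
            data = listing.get("data", {})
            for child in data.get("children", []):
                post = child.get("data", {})
                if "id" in post:
                    # Keep the first copy; later keywords often return duplicates.
                    threads.setdefault(post["id"], post)
            after = data.get("after")
            if not after:
                break  # no further pages for this keyword
    return threads
```

Calling something like `discover_threads("netsec", ["CVE", "RCE", "vulnerability"])` (the subreddit name is only an example) would return a dict keyed by thread id, which could then be written out as JSON files. A real run would also need rate limiting and error handling, both of which are left out here.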