
Help Wanted: Needed archive script, public facing


I'm looking to make a script that has the same features as archive.today. It occurs to me that archive.today has saves of almost all of our posts' links. I'm worried about keeping all our eggs in one basket, so I want to host one myself.

Anybody have any good downloadable scripts, or want to write one (PHP, Perl)?

It needs to allow people to submit links to archive, so it should be secure (no JavaScript and such).
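
Roughly, the submission side could be as small as this (just a sketch; the archive_queue table and the DB credentials are placeholders):

    <?php
    // sketch: validate a submitted URL server-side before queueing it,
    // so the form needs no client-side JavaScript at all
    $url = isset($_POST['url']) ? trim($_POST['url']) : '';

    if (!filter_var($url, FILTER_VALIDATE_URL)
            || !preg_match('#^https?://#i', $url)) {
        http_response_code(400);
        exit('Invalid URL.');
    }

    // "archive_queue" is a placeholder table name
    $db = new PDO('mysql:host=localhost;dbname=archive', 'user', 'pass');
    $stmt = $db->prepare('INSERT INTO archive_queue (url, submitted_at) VALUES (?, NOW())');
    $stmt->execute(array($url));
    echo 'Queued for archiving.';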

Thanks


Post Information
Title: Help Wanted: Needed archive script, public facing
Author: redpillschool
Upvotes: 8
Comments: 18
Date: 18 August 2015 09:32 PM UTC
Subreddit: TheRedPill
Link: https://theredarchive.com/post/35669
Original Link: https://old.reddit.com/r/TheRedPill/comments/3hi0f9/help_wanted_needed_archive_script_public_facing/

Comments

[–]cageypenguin 2 points (13 children)

My actual career consists mostly of automating solutions to archive and migrate web data for one of the world's largest web hosts.

This is something that I could easily write in PHP (and in fact can reuse code I've already written and tested), but I don't do front-end (HTML) work, and I don't want to manage the hosting aspect of it.

Even with some smart code to ignore ads, compress images, etc., a project like this has the potential to run away with disk usage, meaning many shared hosts with their "unlimited" disk usage (really a fair-use policy) would have a fucking fit.

Would be open to working with others in the community on it.

[–]redpillschool [S, Mod] 0 points (3 children)

It would be moderated, and I have hosting, so I'm not worried about it running away or being abused by others; it would only be used on TRP posts.

[–]cageypenguin 0 points (2 children)

Okay, that clears up those issues.

From there we need to decide the basics of what you want this script to do. Will it be feature-rich or simple? i.e., what, if any, extra features beyond simply archiving a URL would be available to the user (moderator tools, authorization, etc.)?

If it's simple enough, a quick proof of concept would be easy enough to deploy.

I'm currently traveling for business and busy with other projects anyway, so it wouldn't be a priority for me, but I could set aside a few hours in the near future to dig through my libraries and set up such a POC, should no one else in the community want to take up the cause before I become available (1.5-2 weeks out).

[–]redpillschool [S, Mod] 0 points (1 child)

Keep it simple: users can add new archives, URLs are easy to link, and I can moderate with a simple index view and a single delete password I can share with my mods.

MySQL is preferable.
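
For that spec, the storage and moderation side might look like this (a sketch; the table layout, names, and credentials are all placeholders):

    <?php
    // sketch: storage plus the single-password delete from the spec above
    $db = new PDO('mysql:host=localhost;dbname=archive', 'user', 'pass');
    $db->exec('
        CREATE TABLE IF NOT EXISTS archives (
            id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
            url        VARCHAR(2048) NOT NULL,
            html_path  VARCHAR(255)  NOT NULL,
            created_at DATETIME      NOT NULL
        )
    ');

    // one shared delete password for the mod team (placeholder value)
    define('MOD_PASSWORD', 'change-me');
    if (isset($_POST['delete_id'], $_POST['password'])
            && $_POST['password'] === MOD_PASSWORD) {
        $stmt = $db->prepare('DELETE FROM archives WHERE id = ?');
        $stmt->execute(array((int)$_POST['delete_id']));
    }

    // the moderation index view is then just:
    // SELECT * FROM archives ORDER BY created_at DESC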

[–]cageypenguin 0 points (0 children)

Sounds good, simple is better.

MySQL is the best choice anyway.

[–]mwandazimu 0 points (6 children)

I wouldn't mind contributing if this is written in PHP. I've had tons of experience working on spiders/crawlers.

[–]cageypenguin 0 points (5 children)

Maybe we can collab through an anonymous GitHub account or something to prevent doxxing.

I've been thinking about some of the challenges with this, and the biggest one that comes to mind is how we archive sites that use some stupid JS or something that causes the content to be dynamically generated.

i.e. the actual content only appears in the generated source after being parsed by a browser.

I know these kinds of things aren't that common, but they exist, and all the crawlers/scrapers I have written so far use raw sockets; I don't use any kind of JS parser.

Another (more solvable) challenge is rewriting any static resource URLs to dynamic ones so that cached content doesn't link back to the original content, defeating the purpose of the archive. This is simple enough and I already have lots of code for it, but rewriting tends to be a little wonky in my experience; you can't always account for every edge case with a few regexes.

Have you thought about these challenges before? Have you worked with or written any libs that might alleviate these problems?
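
For context, the kind of rewrite pass in question could look roughly like this (a sketch; resolveUrl and archiveResource are hypothetical helpers that would absolutize a URL and download/store the resource):

    <?php
    // sketch: rewrite src/href attributes so the cached page points at
    // archived copies instead of the live site; resolveUrl() and
    // archiveResource() are hypothetical helpers (absolutize a URL,
    // download/store the resource and return its local path)
    function rewriteResourceUrls($html, $baseUrl)
    {
        return preg_replace_callback(
            '#(src|href)\s*=\s*"([^"]+)"#i',
            function ($m) use ($baseUrl) {
                $absolute = resolveUrl($baseUrl, $m[2]);
                return $m[1] . '="' . archiveResource($absolute) . '"';
            },
            $html
        );
    }
    // note: a pass like this misses single-quoted attributes, srcset,
    // and CSS url(...) references -- the "wonky" cases mentioned above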

[–]mwandazimu 0 points (4 children)

I actually hadn't even thought about that. From a little research I found http://casperjs.org/, which seems to be able to do just that.

I imagine we can pull the full source/screenshot with CasperJS and then run the results through https://github.com/tijsverkoyen/CssToInlineStyles and a couple of minifiers to keep the file sizes and counts to a minimum. At the end of it we could remove all the JavaScript, base64 the images, and save the results as a static HTML file.
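
The strip-and-inline step at the end could be as simple as this (a sketch; it assumes the rendered HTML is already on disk, that image URLs are absolute, and it skips real MIME detection):

    <?php
    // sketch: strip scripts and inline images as base64 data URIs,
    // then save a static snapshot
    $html = file_get_contents('rendered.html'); // output of the casperjs step

    // drop <script> blocks entirely
    $html = preg_replace('#<script\b[^>]*>.*?</script>#is', '', $html);

    // inline each image as a data URI (no ad filtering yet)
    $html = preg_replace_callback(
        '#<img\b([^>]*?)src\s*=\s*"([^"]+)"#i',
        function ($m) {
            $data = @file_get_contents($m[2]);
            if ($data === false) {
                return $m[0]; // leave the tag alone if the fetch fails
            }
            $mime = 'image/png'; // real code would detect the type
            return '<img' . $m[1] . 'src="data:' . $mime . ';base64,'
                 . base64_encode($data) . '"';
        },
        $html
    );

    file_put_contents('snapshot.html', $html);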

What are your thoughts on that?

[–]cageypenguin 0 points (3 children)

From a little research I found http://casperjs.org/

This is actually pretty interesting; using PhantomJS, it would parse with WebKit. I only took a quick look at the docs, but is getting the generated HTML really as simple as a single call?

The only caveat is that it needs to be built from source. If the hosting that /u/redpillschool has is shared hosting, and not a VPS or something, we wouldn't be able to build it.

I like the idea of minified source code and stripped scripts. I haven't thought enough about base64-encoding all the images inline to have an opinion on it. My first instinct was to first convert images above a certain size to a lossy format to save space, but base64 could be done after the fact.

I'm more concerned with coming up with a way to strip images that don't matter, e.g. banner and side ads.
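
Any such stripping would probably start as a crude heuristic, something like this (a sketch; the blocklist is purely illustrative and the dimension check only catches classic banner sizes):

    <?php
    // sketch: crude ad-stripping heuristic -- drop images served from
    // known ad hosts or with classic banner dimensions (468x60, 728x90)
    $html = file_get_contents('snapshot.html');

    function looksLikeAd($imgTag)
    {
        $adHosts = array('doubleclick.net', 'googlesyndication.com', 'adnxs.com');
        foreach ($adHosts as $host) {
            if (stripos($imgTag, $host) !== false) {
                return true;
            }
        }
        return (bool) preg_match('#width="(?:468|728)"\s+height="(?:60|90)"#i', $imgTag);
    }

    $html = preg_replace_callback('#<img\b[^>]*>#i', function ($m) {
        return looksLikeAd($m[0]) ? '' : $m[0];
    }, $html);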

[–]blacwidonsfw 0 points (0 children)

For the JavaScript issue I use Selenium (the Python version), which loads the JavaScript as well. You can run it with PhantomJS very easily if you want. I could write this system as well, but I don't have much experience with PHP and strongly prefer Python, especially for this scripting/scraping stuff.

[–]mwandazimu 0 points (1 child)

I just installed it on my dev server and it seems to be pretty easy to work with.

Essentially you can write a script that takes arguments and can be run from the command line. I wrote a quick one that will grab the page's source, make a screenshot of it, and then output the results on the CLI. This could be captured using exec(), as sketched below.

/u/redpillschool can you let us know if you have root access to your server?

Finding banners and ads is going to be another issue.
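
The capture side might look like this (a sketch; grab.js is a hypothetical CasperJS script that prints the rendered source to stdout, e.g. via casper's getHTML()):

    <?php
    // sketch: run a casperjs script from PHP and capture its stdout;
    // grab.js (hypothetical) would casper.start() the URL, then echo
    // the rendered source from this.getHTML()
    $url = 'https://example.com/';
    $cmd = 'casperjs grab.js ' . escapeshellarg($url);

    exec($cmd, $outputLines, $exitCode);
    if ($exitCode !== 0) {
        die('casperjs failed with exit code ' . $exitCode);
    }
    $renderedHtml = implode("\n", $outputLines);
    file_put_contents('rendered.html', $renderedHtml);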

[–]redpillschool [S, Mod] 0 points (0 children)

I'm afraid there's no root access. This should be doable with a small PHP script.
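
Without root (so no building PhantomJS/CasperJS), the fetch step would have to fall back to plain HTTP, at the cost of losing JS-generated content. A sketch with cURL (the user agent string is a placeholder):

    <?php
    // sketch: plain-PHP fetch for shared hosting with no root access;
    // no JS rendering, so dynamically generated content is lost
    function fetchPage($url)
    {
        $ch = curl_init($url);
        curl_setopt_array($ch, array(
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_MAXREDIRS      => 5,
            CURLOPT_TIMEOUT        => 30,
            CURLOPT_USERAGENT      => 'TRP-Archiver/0.1', // placeholder
        ));
        $html = curl_exec($ch);
        curl_close($ch);
        return $html === false ? null : $html;
    }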

[–]One_friendship_plz 0 points (1 child)

I can do the front-end. If you guys open-source this, PM me.

[–]cageypenguin 0 points (0 children)

Open source eh..

TRP already has a bullseye on it. Open source is too much of a security risk. Some beta white-knight script-kiddie SJW orbiter might find a hole in the code, and wipe every archive we have just to gain the approval of his feminist overlord(s).

The source should be closed, and repo access should be given to TRP community members who want to contribute.

[–]Lt_Muffintoes 1 point (1 child)

Something I find particularly frustrating here is how people say dumb shit, then delete it later like a lil bitch. A permanent comment archive that forces people to take responsibility for what they say, like men should, would be nice.

The obvious caveat being situations with potential doxxing.

[–]One_friendship_plz 0 points (0 children)

I delete my comments when I realize I didn't contribute to the discussion and went off on a little rant that doesn't benefit anyone and only derails the topic.

I delete my threads when I realize they don't contribute anything to anyone.

[–][deleted] 1 point

[permanently deleted]

[–][deleted] 1 point

[permanently deleted]


