Sciencemadness Discussion Board
Author: Subject: Lengthy, time-consuming overhaul of forum software
streety
Hazard to Others
***




Posts: 100
Registered: 14-5-2018
Member Is Offline


[*] posted on 28-8-2018 at 14:17


Melgar, I think your selection process is sensible and phpBB probably is the best of the bunch. I just find that somewhat depressing.

I'm glad you're thinking of switching from sql functions to ruby. Hopefully you will be able to use the python functions I put together a while ago.

Quote:
My thought was that phpBB has been going strong for a very long time, and will probably continue to be maintained in the future. So even if we make changes to the code in places, then if we move on like Polverone has, our successors here won't be stuck with a huge mess on their hands.


Quote:
phpBB has a nice system for developing extensions


I haven't taken a good look at the phpBB code or the extension system so the situation may be far better than I fear.

From looking at the database structure, though, I do worry that, given all the added complexity we don't need and the historical baggage, phpBB will be a mess not for whoever comes after us but for us from day one. From there it will only get worse.

The mention of Zend by JJay reminds me that before working more with python I used Zend for probably a year or so. At the time PHP had a very bad reputation for 'spaghetti code' and the release of the Zend framework, and others like it, seemed to be the turning point. phpBB 3 looks like it was under development long before the PHP community really got its act together.

I think we need hands-on experience to know whether phpBB is the right choice for us, and a simple extension would be a good first step.

Melgar
Anti spam agent
*******




Posts: 1999
Registered: 23-2-2010
Location: NYC
Member Is Online

Mood: Aromatic

[*] posted on 30-8-2018 at 13:35


Here is the latest working link to the phpBB test site

Although it's true that phpBB is based on an older development philosophy, that isn't necessarily a bad thing for us. For one, it means that the databases have quite a bit in common even if there are a lot of new tables that correspond with new features. I attempted to set up a working installation of phpBB 2.x, but it seems to have been designed under php4. Documents seem to indicate that php5 can work with some fixes, but there were issues relating to whether older versions of php can work with nginx.

Along the way, I got more familiar with the phpBB table system, and concluded that it's not as intimidating as it looked. For one, the table structure isn't that different between versions, and the differences mostly come down to things that were added when new features were added. I don't necessarily know what every field does, but they mostly seem to relate to user options and permissions. So I can just register a user with basic permissions, save the values that get stored in these fields, and then populate them with these default values as users get transferred over. This would preserve things like post count, register date, signature, etc.
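
A rough Ruby sketch of that defaults-plus-overrides idea follows. The column names on both sides (`uid`, `postnum`, `user_regdate`, and so on) and the sample default values are assumptions for illustration; they have not been checked against either schema.

```ruby
# Values captured from a freshly registered phpBB user (hypothetical sample).
PHPBB_USER_DEFAULTS = {
  'user_type'  => 0,
  'group_id'   => 2,
  'user_lang'  => 'en',
  'user_style' => 1
}.freeze

# Assumed mapping from XMB member columns to phpBB user columns.
XMB_TO_PHPBB = {
  'uid'      => 'user_id',
  'username' => 'username',
  'regdate'  => 'user_regdate',
  'postnum'  => 'user_posts',
  'sig'      => 'user_sig'
}.freeze

# Start from the saved defaults, then overwrite with whatever the XMB row provides.
# XMB columns without a known counterpart are simply dropped.
def convert_member(xmb_row)
  mapped = XMB_TO_PHPBB.each_with_object({}) do |(xmb_col, phpbb_col), out|
    out[phpbb_col] = xmb_row[xmb_col] if xmb_row.key?(xmb_col)
  end
  PHPBB_USER_DEFAULTS.merge(mapped)
end
```

Keeping the defaults in one hash means a field whose purpose is still unknown only has to be recorded once, not understood.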

Another benefit of phpBB's long history is that there used to be a time when encrypted passwords were stored as MD5 hashes. This means that there are still plenty of phpBB installations out there where some of the passwords are MD5-hashed, if a user hasn't logged in for maybe a decade. So phpBB already has code built in that will check if a password is MD5-hashed. Then, if the password hash matches the one stored in the database, it will automatically hash the password again using a stronger algorithm and overwrite the MD5 password hash with that one. It's hard to overstate how convenient this feature is for preserving forum security during a transition.
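
The upgrade-on-login idea can be sketched in Ruby as below. This is not phpBB's actual code: PBKDF2 from Ruby's standard library stands in for whatever stronger algorithm phpBB really uses, and the `pbkdf2$salt$hash` storage format and all function names are made up for the example.

```ruby
require 'digest'
require 'openssl'
require 'securerandom'

ITERATIONS = 100_000 # illustrative work factor

# A bare 32-character hex string is treated as a legacy MD5 hash.
def legacy_md5?(stored)
  stored.match?(/\A[0-9a-f]{32}\z/i)
end

# Stand-in "strong" hash: PBKDF2-HMAC-SHA256 with a random salt.
def strong_hash(password, salt = SecureRandom.hex(16))
  raw = OpenSSL::PKCS5.pbkdf2_hmac(password, salt, ITERATIONS, 32, OpenSSL::Digest.new('SHA256'))
  "pbkdf2$#{salt}$#{raw.unpack1('H*')}"
end

# Recompute with the stored salt and compare.
# (A production version should use a constant-time comparison.)
def verify_strong(password, stored)
  _prefix, salt, _hash = stored.split('$')
  strong_hash(password, salt) == stored
end

# Returns [login_ok, hash_to_store]. On a successful legacy login the
# second element is a freshly computed strong hash to write back.
def check_and_upgrade(password, stored)
  if legacy_md5?(stored)
    return [false, stored] unless Digest::MD5.hexdigest(password) == stored.downcase
    [true, strong_hash(password)]
  else
    [verify_strong(password, stored), stored]
  end
end
```

The key point is that the upgrade can only happen at login, because that is the only moment the plain-text password is available.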

Another thing that works in our favor is that phpBB uses bbcode, and although the variant it uses is slightly different, they're probably about 90% the same. I wrote some helper classes and a few Ruby functions for converting XMB records to phpBB records. The code isn't that polished yet, but I like how there is just about nothing that Ruby doesn't let you do if you think it would help you solve a problem. It makes it really convenient for banging out quick scripts that you might only use once, but that are a huge help at the time. You can see my progress here:

https://github.com/toldani/sm-transition/blob/master/sql_hel...

I've mostly set up the environment for interfacing with the databases, and also written two functions. One is a hashing function that's generated from the email address and is used by phpBB to ensure that new users register with an email address that hasn't been used already, or get a notice that they've already registered or something. The other is a function to replace the "rquote" tag that's used here, with the quote tag system that phpBB uses. They serve the exact same function, but with phpBB, you just use the regular quote tag but with options specified.
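
For anyone following along, here is a minimal sketch of what those two functions might look like. It is not the code from the repository. It assumes the phpBB email hash is the CRC32 of the lower-cased address followed by its length (worth checking against the phpBB source), and it assumes XMB's tag looks like `[rquote=POSTID&tid=THREADID&author=NAME]` and that phpBB accepts a `post_id` option on its quote tag.

```ruby
require 'zlib'

# Assumed phpBB-style email hash: CRC32 of the lower-cased address,
# immediately followed by the address length in bytes.
def email_hash(email)
  "#{Zlib.crc32(email.downcase)}#{email.bytesize}"
end

# Assumed XMB form: [rquote=POSTID&tid=THREADID&author=NAME]...[/rquote]
RQUOTE_OPEN  = /\[rquote=(\d+)&(?:amp;)?tid=\d+&(?:amp;)?author=([^\]]+)\]/i
RQUOTE_CLOSE = /\[\/rquote\]/i

# Rewrite rquote tags as regular quote tags carrying the author and post id.
def convert_rquotes(text)
  text.gsub(RQUOTE_OPEN) { %([quote="#{$2}" post_id=#{$1}]) }
      .gsub(RQUOTE_CLOSE, '[/quote]')
end
```

Because the hash ignores case, two registrations that differ only in the capitalization of the address collide, which is the point of the duplicate check.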

My recommendation is that we continue with phpBB, but try to find a good balance between transferring data using scripts and just setting things up in the phpBB control panel. As an example, I'd put forth moderator privileges. I don't fully understand how they work in phpBB, but I don't think I need to. We'd just transfer every user over with the privileges of a normal registered user, and then use the admin control panel to find and reinstate mod/admin privileges for whoever should have them.

I read your blog post about optimizing the phpBB search function. Your problem is that MySQL hasn't properly implemented full-text indexing until relatively recently. Prior to its implementation, I would have recommended transitioning the database to Postgres and implementing proper full-text indexing there. But since MySQL seems to have implemented full-text indexing in version 5.6, then all you need to do is upgrade that, then set up a proper index. Here's a good place to start:

https://dev.mysql.com/doc/refman/5.6/en/innodb-fulltext-inde...

I've actually decided to use MySQL for the phpBB installation. I've been studying it a lot lately, and I've come to realize that many of the limitations I thought it had have been overcome in recent years. And there are definitely benefits to using the same database system on both ends.

Table Conversion

Ultimately, there are only maybe five tables that are important to transfer over on the backend.

xmb_posts: Most important. Message text will have to be processed significantly. Some HTML and bbcode tags will need to be modified, but it's less than I thought. The "rquote" tags will need to be fixed, but I've got that taken care of. "file" tags work the same way in phpBB, just need to be replaced with "attachment".

xmb_threads: This data mainly connects the other tables together, and allows for the grouping of posts into threads. Not much work.

xmb_members: Very important. Stores users, registration dates, post counts, encrypted passwords, etc. Moderate amount of work. Since phpBB can accommodate XMB hashed passwords (MD5) and will even automatically upgrade password security when users log in, this can allow for the most seamless, secure transition possible.

xmb_attachments: Quite important. phpBB stores attachments differently than XMB, in a way that uses the filesystem rather than the SQL tables to hold raw data. By mounting a cloud storage bucket as a volume, this can allow phpBB to store attachments indefinitely without ever having to worry about running out of disk space. The conversion of this table is very straightforward once the filesystem is in place.

xmb_u2u: Also very important. These messages will probably have to be processed similarly to posts. This presents some difficulty, as the test database I have has been wiped of u2u messages. I can write the code, but I won't know whether it works until it's been tested.

Features Being Researched

Post Edit Time Limit: We want to make sure people can't go back and delete their old posts. You might already be able to implement this in phpBB, but I'd need to import the post data to test it properly. Suffice it to say, it's on the table and definitely possible.

Avatars and Smilies: Smilies are replaced on-the-fly in messages, so it's just a matter of someone going in and wiping the phpBB smilies table, then replacing them with whatever images we want. Avatars can be added easily enough, although they can be distracting. I thought we could go with an old-fashioned theme, and convert avatars to grayscale.

Redirecting Links: This is really easy to do in Nginx, which is the open-source web server that I've set up on the test site. The only real alternative is Apache, which I personally hate. Doing it this way though, it's easy to redirect old links to sciencemadness threads, so links continue to work to this site from the rest of the internet.

Theme: If someone wants to help with this project, setting up and configuring themes would really help. I haven't looked into that at all, mainly since I wanted to make sure the backend stuff was possible first. But that's been falling into place more and more lately.

Donations and Such: I had a few people ask about helping or donating money or such. I really didn't want to accept anything from anyone yet, and don't plan to unless there's been enough progress to justify it. But if anyone wants to agree to donate specific amounts of money if and when certain goals get met, I'm sure that would help keep me motivated. I've probably spent about $40-$50 and about a week and a half of work counting research, IT, and programming. Sorry for not emphasizing more of the technical reasons for selecting phpBB, but I tried to cover a bunch of them in this thread. I'm also getting to the point where it's starting to come together on the backend, so expect significant visible progress to be made in the near future.




The first step in the process of learning something is admitting that you don't know it already.

I'm givin' the spam shields max power at full warp, but they just dinna have the power! We're gonna have to evacuate to new forum software!
WGTR
International Hazard
*****




Posts: 816
Registered: 29-9-2013
Location: Online
Member Is Offline

Mood: Outline

[*] posted on 30-8-2018 at 17:06


Thanks for all your work Melgar. Both myself and probably everybody else (even the ones that are lurking silently) appreciate it a lot. It sounds like you're making some real headway with this, and I'm hopeful (without putting pressure on you) that we'll be ready to migrate after so many years.

SM is a great forum for people of all different backgrounds to get together and discuss mad science. I'm not so sure that there's anything else out there that is really like it. There's a lot of information contained within the threads of the past 16 years or so, and hopefully more to come.





streety
Hazard to Others
***




Posts: 100
Registered: 14-5-2018
Member Is Offline


[*] posted on 31-8-2018 at 17:23


I took a look at the phpBB extension system and it is much better than I thought it would be. Between that and your great idea to target an earlier version and then upgrade, it is looking much better than I thought.

I don't think anything like what I did previously for search will be needed. phpBB already comes with support for several options built in. We can try MySQL full text search first and if it doesn't work just change the option in the phpBB admin interface.

Other tables to consider are forums and polls.

j_sum1
Super Moderator
*******




Posts: 4187
Registered: 4-10-2014
Location: Oz
Member Is Online

Mood: Metastable, and that's good enough.

[*] posted on 31-8-2018 at 18:23


I am really encouraged by all of this.

I'd love to hear Polverone's thoughts.
He has not dropped off the scene: he logged in and dealt to some spam this morning.
Melgar
Anti spam agent
*******




Posts: 1999
Registered: 23-2-2010
Location: NYC
Member Is Online

Mood: Aromatic

[*] posted on 3-9-2018 at 20:09


I may as well give an update. I've been able to pull in users from the XMB table pretty easily, and preserve just about everything. I'm going to make it a point to keep ids the same, so that thread and post ids will carry over. Same with user ids and attachment ids. One of the troubles I'm running into is that everything seems to be cross-referenced all over the place to the point of significant redundancy, and it's hard to set up the linking when the tables you need to link to haven't been imported yet. But fortunately, phpBB has such a similar underlying table structure to XMB that all the important id columns will carry over. This means that redirecting can be done on a URL just by changing the syntax, and thus wouldn't require any sophisticated lookups.
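
To illustrate why shared ids make redirects a pure syntax change, here is a small Ruby sketch of the mapping. In practice this would be an Nginx rewrite rule; the Ruby version just shows the logic. The old `/whisper/viewthread.php?tid=N` shape comes from links in this thread, while the `pid` parameter and the `/talk/viewtopic.php?t=N` target are assumptions.

```ruby
require 'uri'

# Map an old XMB thread URL to the assumed phpBB path, reusing the same ids.
# Returns nil for anything that is not a viewthread.php link.
def phpbb_path_for(old_url)
  uri = URI.parse(old_url)
  return nil unless uri.path.end_with?('/viewthread.php') && uri.query

  params = URI.decode_www_form(uri.query).to_h
  pid = params['pid'].to_s
  tid = params['tid'].to_s
  if pid.match?(/\A\d+\z/)
    # Link straight to the post, with an anchor so the browser scrolls to it.
    "/talk/viewtopic.php?p=#{pid}#p#{pid}"
  elsif tid.match?(/\A\d+\z/)
    "/talk/viewtopic.php?t=#{tid}"
  end
end
```

No database lookup is needed anywhere in that function, which is the whole benefit of keeping the ids identical.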

The code I've been writing in ruby is kind of haphazard, and is the inevitable result of reverse-engineering. It's worked to pull in users though. Next step is properly linking attachments, which shouldn't be difficult. There's a slight difference in the way that thumbnails are handled in phpBB for images, but it's pretty minor. Next up would be threads, then posts. But the attachments table needs to be there first, since importing threads and posts would need to pull data from the attachments table.

I also figured out how to disable editing posts after a certain amount of time. Turns out it was just somewhere in the admin settings, and was pretty simple. I'm using the latest version of phpBB, incidentally, not going with my earlier plan to start with phpBB 2.X and then upgrade. They're really not all that different on the backend, and the table structure is less intimidating now that I've identified the important parts of it. Not to mention, the documentation is updated for new versions, and there's a public wiki that describes table structures:

https://wiki.phpbb.com/Table.phpbb_attachments

Also, I've gotten a lot of good information by reading the comments in the source code.




The first step in the process of learning something is admitting that you don't know it already.

I'm givin' the spam shields max power at full warp, but they just dinna have the power! We're gonna have to evacuate to new forum software!
Melgar
Anti spam agent
*******




Posts: 1999
Registered: 23-2-2010
Location: NYC
Member Is Online

Mood: Aromatic

[*] posted on 16-9-2018 at 18:58


So I've been working pretty extensively on reverse-engineering XMB's database format. Fortunately phpBB has some internal tools to reparse all the bbcode once everything's all imported.

I imported all the users and all the attachments. That seemed to work well. Everyone will be able to keep things like their signatures, post counts, even things like location and mood. The only things that might not carry over are maybe some user preferences that don't have counterparts in phpBB. Also, time zones, just because it seemed like too much work having to import the actual timezone abbreviations into phpBB rather than the hour offsets XMB uses. So everyone will just have to fix their timezones manually.

One thing I'm not sure how to handle is the fact that it used to be possible for people to use HTML in their posts. So there are a bunch of old posts like this one:

https://www.sciencemadness.org/whisper/viewthread.php?tid=9&...

Now, I can pretty easily parse that and make it show whatever it's supposed to show. I just have no idea what that is. There seem to be weird bits of html embedded all over the place on this site. Since I'm doing the work anyway, I'd rather parse it and have it be fixed when the conversion goes live. But I've been seeing a bunch of really weird malformed HTML too, which I don't know what to make of. BBCode tags seem to carry over perfectly fine, but the two forums seem to handle attachments differently. I guess I have to convert all the posts over before I'll know if that's a problem. I've been using really slow, unoptimized Ruby code to do conversions, but I doubt that's a problem. It might take a day to finish running, but I don't think anyone would mind.

Also, you Europeans sure do have a lot of special characters that can screw up SQL insertion.




The first step in the process of learning something is admitting that you don't know it already.

I'm givin' the spam shields max power at full warp, but they just dinna have the power! We're gonna have to evacuate to new forum software!
Jackson
Hazard to Self
**




Posts: 95
Registered: 22-5-2018
Location: U S of A
Member Is Offline

Mood: :) Happy about new glassware :)

[*] posted on 16-9-2018 at 19:21


Would this finally fix the spam?
fusso
International Hazard
*****




Posts: 794
Registered: 23-6-2017
Location: Toaru city, Toaru nation, Asia, Earth, ∥ universe
Member Is Offline

Mood: anti-chemophobia:mad:

[*] posted on 16-9-2018 at 21:42


Quote: Originally posted by Melgar  
One thing I'm not sure how to handle is the fact that it used to be possible for people to use HTML in their posts. So there are a bunch of old posts like this one:

https://www.sciencemadness.org/whisper/viewthread.php?tid=9&...
Isn't imageshack notorious for deleting/privatizing uploaded images?



List of materials made by ScienceMadness users:
https://docs.google.com/document/u/1/d/1AoI2VA5L4bmFw2HwXS2O...
Melgar
Anti spam agent
*******




Posts: 1999
Registered: 23-2-2010
Location: NYC
Member Is Online

Mood: Aromatic

[*] posted on 17-9-2018 at 06:33


Quote: Originally posted by Jackson  
Would this finally fix the spam?

Yes. It'd also fix some other security holes.

As for imageshack, none of the links seem to work, so there's nothing I can do about that. Still, it seems like this software used to allow embedded HTML in posts and no longer does, and I'm not sure to what extent I should try to parse it and reencode it as bbcode.




The first step in the process of learning something is admitting that you don't know it already.

I'm givin' the spam shields max power at full warp, but they just dinna have the power! We're gonna have to evacuate to new forum software!
streety
Hazard to Others
***




Posts: 100
Registered: 14-5-2018
Member Is Offline


[*] posted on 26-9-2018 at 17:40


I think I signed up here after the backup you are using was created so I'm not in the database. I've created an account. Given how thoroughly defeated the image based CAPTCHAs are it might as well be deactivated.

I've said this already in the 'tired of spam' thread but I'll say it again here: I doubt switching to phpBB by itself will help with the spam. It should make maintenance easier and prevent future security issues but handling the spam is a somewhat separate issue. Hopefully this just means using one or more of the already developed anti-spam extensions. Or, at worst, implementing our own which would still be easier than the equivalent on the current forum software.

For the tasks remaining might it be useful to use the issues feature on the github repository? What do you have on your list?

Taking a quick look I noticed http://35.185.63.230/talk/viewtopic.php?p=396775#p396775 Not sure why that isn't displaying as an image.
Melgar
Anti spam agent
*******




Posts: 1999
Registered: 23-2-2010
Location: NYC
Member Is Online

Mood: Aromatic

[*] posted on 28-9-2018 at 19:26


Quote: Originally posted by streety  
I think I signed up here after the backup you are using was created so I'm not in the database. I've created an account. Given how thoroughly defeated the image based CAPTCHAs are it might as well be deactivated.

My experience has been that the bots are only programmed to solve CAPTCHAs on their default settings. If you change the fonts and the background pictures that the CAPTCHA software uses to generate images, that's enough to trip up nearly all of them. I haven't had a single bot register on that site yet, and not for a lack of Chinese and Russian IP addresses requesting pages from the server. Presumably, we would try and see if anything works, rather than give up immediately?

Quote:
I've said this already in the 'tired of spam' thread but I'll say it again here: I doubt switching to phpBB by itself will help with the spam. It should make maintenance easier and prevent future security issues but handling the spam is a somewhat separate issue. Hopefully this just means using one or more of the already developed anti-spam extensions. Or, at worst, implementing our own which would still be easier than the equivalent on the current forum software.

My experience has been that the more popular an anti-spam software package is, the more spambot programmers there are out there trying to write scripts to defeat them. Makes sense, no? Right now, this site doesn't have any anti-spam measures put in place at all, which would probably account for the enormous volume of spam.

Quote:
For the tasks remaining might it be useful to use the issues feature on the github repository? What do you have on your list?

Well, it's only been me so far, and I've been writing mainly in SQL and Ruby. I've been doing more reverse-engineering than anything. As a result, I'll build up a collection of functions until I start seeing an appropriate use for object-oriented programming, then I'll tidy up my code and organize it more into classes with inheritance and so forth. Since you've expressed some interest now, I updated the README.md to explain what I have so far. But I don't know if you'd want to work in Ruby or not.

Quote:
Taking a quick look I noticed http://35.185.63.230/talk/viewtopic.php?p=396775#p396775 Not sure why that isn't displaying as an image.

Yeah, there were some glitches in the automatic phpBB bbcode parser when I used it on the XMB post content. I think it had to do with the fact that it parsed links before it parsed bbcode. phpBB actually stores posts in a sort of XML, where bbcode start tags are surrounded by <s> and </s> and bbcode end tags are surrounded by <e> and </e>. It takes a little getting used to, but there's really not much to it. The image you pointed out is fixed now, I believe. It takes about a minute or two to run a regex find/replace on all the posts in the database, so once I find one pattern that isn't working right, I can just fix all of them at once.
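
A simplified Ruby sketch of that stored form is below, handling only a few non-nested tags. The `<s>`/`<e>` wrapping is as described above; the outer `<r>`/`<t>` wrappers and the upper-case element names are assumptions about how phpBB lays out the rest, and the real parser handles far more cases.

```ruby
require 'cgi'

SIMPLE_TAGS = %w[b i u].freeze

# Convert raw bbcode text into an approximation of the stored XML form.
# Only non-nested [b], [i] and [u] pairs are recognized in this sketch.
def to_stored_form(text)
  escaped = CGI.escapeHTML(text) # literal <, > and & must not break the XML
  pattern = /\[(#{SIMPLE_TAGS.join('|')})\](.*?)\[\/\1\]/m
  converted = escaped.gsub(pattern) do
    tag, inner = $1, $2
    "<#{tag.upcase}><s>[#{tag}]</s>#{inner}<e>[/#{tag}]</e></#{tag.upcase}>"
  end
  # Assumed convention: <t> for plain text, <r> when any markup was found.
  converted == escaped ? "<t>#{escaped}</t>" : "<r>#{converted}</r>"
end
```

Since the original bbcode survives inside the `<s>` and `<e>` elements, a regex pass over the stored text can still find and repair a broken pattern across every post at once.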

What I'm working on is less of an automatic transfer script and more a set of tools that facilitate the conversion of data between the two message board formats. Right now, it'd work by transferring all the posts over with a few modifications, then use the phpBB parser on the bbcode. Then it'd have to fix all the glitches with that. But now that I understand the phpBB post format a lot better, I'm leaning towards just doing all the parsing on the fly, during the initial transfer. That way, I could also transfer the older messages from back when HTML was enabled in posts, by parsing it and converting it back to bbcode.

So yeah, I guess that means the next major thing on my agenda is writing a more robust text parser that can deal with basic HTML as well as bbcode. But I plan to write that whole thing in Ruby. My mind is drawing a blank right now, but I'm sure there's a lot to do in PHP. You'd probably need access to the server to do very much though, and I'd need your public key for that.
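
As a starting point for that parser, here is a deliberately naive Ruby sketch that maps a handful of basic HTML tags to bbcode with regular expressions. The tag list is an assumption about what shows up in the old posts, and regexes will not cope with the malformed HTML mentioned earlier; a robust version would need a real tokenizer.

```ruby
# Ordered list of [pattern, replacement] pairs for basic HTML.
HTML_TO_BBCODE = [
  [/<br\s*\/?>/i, "\n"],
  [/<(b|i|u)>(.*?)<\/\1>/im, '[\1]\2[/\1]'],
  [/<a\s+href=["']([^"']+)["'][^>]*>(.*?)<\/a>/im, '[url=\1]\2[/url]'],
  [/<img\s+[^>]*src=["']([^"']+)["'][^>]*>/i, '[img]\1[/img]']
].freeze

# Apply each rule in turn; anything unrecognized is left untouched
# so it can be found and reviewed later.
def html_to_bbcode(html)
  HTML_TO_BBCODE.reduce(html) do |text, (pattern, replacement)|
    text.gsub(pattern, replacement)
  end
end
```

Leaving unknown markup in place, instead of stripping it, makes it easy to search the converted posts for leftover angle brackets and add rules as new patterns turn up.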

Falling asleep at the keyboard here, sorry. Later.




The first step in the process of learning something is admitting that you don't know it already.

I'm givin' the spam shields max power at full warp, but they just dinna have the power! We're gonna have to evacuate to new forum software!
streety
Hazard to Others
***




Posts: 100
Registered: 14-5-2018
Member Is Offline


[*] posted on 29-9-2018 at 04:44


It certainly would be nice if small changes to the CAPTCHA were enough to defeat most/all of the spambots. I logged in and took a look at the admin interface. It seems any admin can change to different plugins including a Q&A CAPTCHA with custom questions. I couldn't see any options to change the font or background image with the currently used plugin but there are definitely options.

It is probably a good idea to have options in place before the board goes live. I assume that after a burst of activity in migrating the board over access to the server might again be limited.

I tested out the recommendation for the visually impaired of sending a message to the admin. I didn't receive anything. Did you? If not, that is something we should fix. We need to make sure that goes to someone who can promptly act on it.

I would not give too much significance to no spam signups despite visits by multiple IP addresses. The bots hunting around for vulnerabilities on random IP addresses are probably different to the bots posting spam on forum boards indexed in the search engines.

What is your workflow when you make changes to the conversion script? It sounds like you are not deleting the phpBB database and starting again. Rather than have another person working on the same server it might be best if I set up a local server and work from that. Then any changes when tested can be applied to the running server.

I'll send you my public key and start posting issues for remaining steps.
j_sum1
Super Moderator
*******




Posts: 4187
Registered: 4-10-2014
Location: Oz
Member Is Online

Mood: Metastable, and that's good enough.

[*] posted on 29-9-2018 at 05:17


It has been mentioned before but I thought it worth saying it again.

One potential method for weeding out bots involves having an invisible field during registration. A human will always leave it blank. A bot will generally fill in everything. So any accounts where this field is non-empty can be instantly deregistered.

This device combined with a captcha will likely eliminate most of the spam problem IMO.
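
The server-side half of that check is tiny. A minimal Ruby sketch, assuming the submitted form arrives as a hash of strings and using a made-up field name (`website_url`) for the hidden input:

```ruby
# Hypothetical name of the field that is hidden from humans with CSS.
HONEYPOT_FIELD = 'website_url'

# A registration is suspicious if the hidden field contains anything
# other than whitespace. A missing field counts as empty.
def likely_bot?(form_params)
  !form_params.fetch(HONEYPOT_FIELD, '').to_s.strip.empty?
end
```

Giving the hidden field a plausible name is part of the trick, since a bot has no reason to skip something that looks like an ordinary profile field.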
streety
Hazard to Others
***




Posts: 100
Registered: 14-5-2018
Member Is Offline


[*] posted on 29-9-2018 at 05:35


I've started posting issues to address at https://github.com/toldani/sm-transition/issues

I think we are less likely to overlook anything in that todo list format than here where the discussion of each issue is interlaced.

By the same thinking it may be useful to compile a list of all the different options for handling spam.
fusso
International Hazard
*****




Posts: 794
Registered: 23-6-2017
Location: Toaru city, Toaru nation, Asia, Earth, ∥ universe
Member Is Offline

Mood: anti-chemophobia:mad:

[*] posted on 29-9-2018 at 14:02


Could we develop our own forum software based on XMB?

[Edited on 29/09/18 by fusso]




List of materials made by ScienceMadness users:
https://docs.google.com/document/u/1/d/1AoI2VA5L4bmFw2HwXS2O...
streety
Hazard to Others
***




Posts: 100
Registered: 14-5-2018
Member Is Offline


[*] posted on 29-9-2018 at 14:27


It really depends on what exactly you mean by develop our own forum software based on XMB.

Adding a basic feature on a single page would be possible and using XMB as a base would make sense. I wrote a modification adding a chemistry inspired CAPTCHA question. Melgar independently did the same.

Adding multiple features would be possible but it wouldn't take much before starting from scratch would make more sense than using XMB as a base.

The XMB software doesn't do very much so it would actually be fairly easy to write an equivalent piece of software. At one point, looking at the complexity of phpBB, I semi-seriously suggested this. Melgar has now gotten so much of the migration done that, to my mind, he is clearly on the right path.
Melgar
Anti spam agent
*******




Posts: 1999
Registered: 23-2-2010
Location: NYC
Member Is Online

Mood: Aromatic

[*] posted on 29-9-2018 at 17:24


I set up Q&A captcha in phpBB. I initially put in questions like "what element has the symbol 'C'?", but spambots often work by entering the question into Google and parsing the top answer. And Google is really good at answering those types of easy questions. Now I made them questions like "What does the 'C' mean in 'CO2'?"

Right now, I'm coming at this from a reverse-engineering approach, so I'm looking through XMB data, both raw and on this board. I'd like to be able to parse the HTML from when HTML was enabled on the XMB boards. I'm also looking into downloading the images that are hosted on external servers and saving them locally, so we don't lose those pictures if the hosts go down or people let their accounts lapse.

I looked into the issues you posted, and closed all but the one that would actually take a lot of time. That being improving parsing from the XMB post text to phpBB's format. I've also imported signatures and post/thread titles without removing the escape marks from in front of single and double quotes, so we just need to make sure to remember to run those fields through the unescape function when importing data.
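
For reference, the unescape step amounts to something like the following Ruby sketch. The function name is illustrative, and it assumes the only escapes present are a backslash in front of a single quote, a double quote, or another backslash.

```ruby
# Remove the backslash from \' \" and \\ sequences, leaving other text alone.
def unescape_quotes(text)
  text.gsub(/\\(['"\\])/, '\1')
end
```

Running it exactly once per field matters: a second pass would also strip any backslashes that were legitimately part of the original text.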




The first step in the process of learning something is admitting that you don't know it already.

I'm givin' the spam shields max power at full warp, but they just dinna have the power! We're gonna have to evacuate to new forum software!
RogueRose
International Hazard
*****




Posts: 1129
Registered: 16-6-2014
Member Is Offline


[*] posted on 30-9-2018 at 06:28


Have you looked at pulling all the images off of photobucket and other image hosting sources (and anything else hosted off-site for that matter) and pull them down onto a local drive?

I've been browsing a lot of threads checking out where the images are hosted, what the HTTP links link to and other things, and there is A LOT of data that isn't stored locally on the site.

What are your plans for this? Have you figured out a way to do this? If you need help, I've figured out a way to do this; it may not be the best, but I think it would work pretty well with a little bit of manual manipulation.

So many of these free hosting sites end up just disappearing overnight or they shut down people's accounts for no reason. Maybe saving that data would be a good place to start and if you need someone else to do it, I'd be happy to do it.
Melgar
Anti spam agent
*******




Posts: 1999
Registered: 23-2-2010
Location: NYC
Member Is Online

Mood: Aromatic

[*] posted on 30-9-2018 at 09:24


Quote: Originally posted by RogueRose  
Have you looked at pulling all the images off of photobucket and other image hosting sources (and anything else hosted off-site for that matter) and pull them down onto a local drive?

Yeah, the method of downloading images is slightly different for each host. The tricky ones are the ones that give you the bbcode to paste into whatever forum you're using. Photobucket does this. What their code actually does is show a thumbnail inline, and then make that a link that takes you to a web page with the large-size version of that image on it, along with some ads. In browsers, you can right click on images and go to "copy image address". Then, the idea is to compare the two urls. These were the thumbnail and full-sized image, respectively:

http://i454.photobucket.com/albums/qq261/Arkoma_USA/celestro...
http://i454.photobucket.com/albums/qq261/Arkoma_USA/celestro...

Next step is to come up with a "regular expression" or "regex" to match the one that's going to show up in posts, which is the thumbnail. This image came from post #332030. You can get a post's id by quoting it. When you see the "[rquote=..." tag, the number right after the = sign is the post id. But anyway, the following is a Ruby regular expression that matches photobucket thumbnails:

Code:
/https?:\/\/\w+\.photobucket\.com\/[^\[]+\/th_.+?\.(?:jpg|gif|png|jpeg)/i


I'm not sure if anyone here uses regular expressions much, but they're really useful for finding text that fits a specific pattern. With that regular expression, I can find every instance of a Photobucket thumbnail. You'll see that the full-sized image is almost identical, but without that "th_" prefacing the image name. It's pretty straightforward to just strip out the "th_" and then download the image from the internet. If you try to do it in a browser, it won't work, but apparently it does work using "curl", which is a common tool used for downloading stuff off the internet from a command prompt. Then I'd just save that data as a new attachment, and change all the Photobucket tags to attachment tags. The initial work is kind of annoying, but once there's a system in place, it's pretty easy to apply it to the whole database.
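
Here's a rough sketch of the thumbnail-to-full-size step in Ruby, in case it helps anyone follow along. It reuses the regex above as-is. The helper names (full_size_url, photobucket_images, download) are made up for illustration, and I'm assuming the "th_" always sits at the very start of the file name. The curl flags are just the standard ones for "follow redirects, fail on HTTP errors, write to a file", not anything specific to Photobucket.

```ruby
# Regex from the post above: matches Photobucket thumbnail URLs.
THUMB_RE = /https?:\/\/\w+\.photobucket\.com\/[^\[]+\/th_.+?\.(?:jpg|gif|png|jpeg)/i

# Turn a thumbnail URL into the full-sized URL by dropping the "th_"
# prefix from the file name (the last path segment).
# Assumption: the prefix is always at the start of the file name.
def full_size_url(thumb_url)
  thumb_url.sub(%r{/th_([^/]+)\z}, '/\1')
end

# Return [thumbnail, full_size] URL pairs for every thumbnail in a post.
def photobucket_images(post_text)
  post_text.scan(THUMB_RE).map { |thumb| [thumb, full_size_url(thumb)] }
end

# Hypothetical download helper: shells out to curl and saves to dest.
# Returns true if curl exited successfully.
def download(url, dest)
  system('curl', '-fsSL', '-o', dest, url)
end
```

Calling photobucket_images on a post's raw text gives back each thumbnail URL next to the URL to fetch, so the thumbnail can later be swapped for an attachment tag once the download succeeds.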

If anyone wants to help in this process, try and look for posts that have formatting glitches and link to them. Like this one, for example:

https://www.sciencemadness.org/whisper/viewthread.php?tid=89...

If you look, the signature has single and double quotes escaped in it (preceded by a backslash), which is obviously a glitch. Those backslashes shouldn't be there. I'm going to need a really glitchy sample of posts to test my parsing tools on, so if you have some free time, try to find the posts with the most convoluted mixtures of bbcode and/or HTML that exist in the database, and post them here.
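
For the escaped-quote glitch specifically, the fix could look something like the sketch below. The function name unescape_quotes is hypothetical, and this assumes each quote picked up exactly one stray backslash. If some fields were escaped more than once, a single pass would still leave backslashes behind, so that needs checking against real data.

```ruby
# Remove one backslash in front of each single or double quote.
# Backslashes that are not followed by a quote are left alone.
def unescape_quotes(text)
  text.gsub(/\\(['"])/, '\1')
end
```

Running a signature or thread title through this on import should turn \' back into ' and \" back into " without touching anything else.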

Thanks!




The first step in the process of learning something is admitting that you don't know it already.

I'm givin' the spam shields max power at full warp, but they just dinna have the power! We're gonna have to evacuate to new forum software!
streety

[*] posted on 30-9-2018 at 10:54


Quote: Originally posted by Melgar  
I set up Q&A captcha in phpBB. I initially put in questions like "what element has the symbol 'C'?", but spambots often work by entering the question into Google and parsing the top answer. And Google is really good at answering those types of easy questions. Now I made them questions like "What does the 'C' mean in 'CO2'?"

Right now, I'm coming at this from a reverse-engineering approach, so I'm looking through XMB data, both raw and on this board. I'd like to be able to parse the HTML from when HTML was enabled on the XMB boards. I'm also looking into downloading the images that are hosted on external servers and saving them locally, so we don't lose those pictures if the hosts go down or people let their accounts lapse.

I looked into the issues you posted, and closed all but the one that would actually take a lot of time: improving the parsing of XMB post text into phpBB's format. I've also imported signatures and post/thread titles without removing the escape marks from in front of single and double quotes, so we just need to remember to run those fields through the unescape function when importing data.


There are still problems with some of the closed issues. I don't know if it makes more sense to re-open these or create new ones. I'll post in these and you can decide what to do.

Search seems to be broken now. I think it is the issue I mentioned a while ago about the phpBB implementation being inadequate for large sites.

The redirects I think will take several iterations to complete.
Melgar

[*] posted on 30-9-2018 at 12:59


I've been working on getting Sphinx working with phpBB. I've managed to query it from a shell, and the results it returns are really good. They're practically instant, and there are lots of options for customizing the weighting of different parameters. The way it works, phpBB queries Sphinx directly, outside of MySQL, to get its results. For some reason those queries have been timing out, even though Sphinx's log files show that it's receiving the search params. This shouldn't be a hard problem to fix, but it's been eluding me since this morning. In the worst-case scenario, I can build the Sphinx extensions into MySQL so that the indexing is done natively, but that would require compiling MySQL from source, so I'm trying not to go that route. Sphinx does yield good search results very quickly; there's just something misconfigured that's preventing the results from getting back to phpBB.

I reopened the search issue, and yeah, you're right, I should have reopened it as soon as I started trying to switch to using sphinx. I closed it when I got MySQL fulltext search working. It did work for most queries, but got hung up on a few, and I didn't think that was good enough to ever go live.




Melgar

[*] posted on 1-10-2018 at 05:46


So, I was trying to think of cool features that I could add to this site that would accelerate people's interest in switching forum software, and came across this:

https://www.glmol.com/

Try clicking and dragging around any one of those four boxes. Pretty cool, no? This might not work with all mobile devices, but should work in most other browsers. We could make it so boxes like that could be embedded in posts, and scrape PubChem to pull in any molecular structure data, given a specific enough chemical name.
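
As a rough idea of the PubChem side, here's a sketch that goes through PubChem's PUG REST interface instead of scraping pages, which is my substitution and not something that's been set up yet. The function names are hypothetical, and I'm assuming a 3D SDF record is what the embedded viewer would want. Whichever viewer we end up with would need to accept that format.

```ruby
require 'erb'
require 'net/http'
require 'uri'

PUBCHEM_BASE = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug'

# Build the PUG REST URL for a compound's 3D structure, looked up by name.
# The name is percent-encoded so spaces and punctuation survive in the path.
def pubchem_3d_sdf_url(chemical_name)
  encoded = ERB::Util.url_encode(chemical_name.strip)
  "#{PUBCHEM_BASE}/compound/name/#{encoded}/SDF?record_type=3d"
end

# Fetch the SDF text, or return nil if PubChem has no match for the name
# (or the request fails with any other non-success status).
def fetch_3d_sdf(chemical_name)
  response = Net::HTTP.get_response(URI(pubchem_3d_sdf_url(chemical_name)))
  response.is_a?(Net::HTTPSuccess) ? response.body : nil
end
```

A nil result would be the cue to fall back to plain text in the post, which covers the "specific enough chemical name" problem when a name doesn't resolve.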

There's more information on the open-source components here:

https://web.chemdoodle.com/docs/chemdoodle-json-format/

[Edited on 10/1/18 by Melgar]



