Sciencemadness Discussion Board

Add a rule banning big tech from using our posts to train AI

TmNhRhMgBrSe - 8-3-2025 at 18:06

Add a rule banning big tech from using our data to train AI! >:(
We shouldn't let big tech freeride on our labour and become even more powerful!
Protect our work!
Companies already earn plenty of money from free public research! Using us small people's work to feed the monster is too much. I can't accept it!

#stopusingourdatatotrainAI

Sulaiman - 8-3-2025 at 21:39

do your ethics extend to not pirating copyrighted material?

j_sum1 - 8-3-2025 at 22:02

I am not sure how to do what you propose.

This site runs on low-tech software and is available for all to read. That means webcrawlers can access it, which of course means that it ranks well on Google -- which I consider a good thing.
Google is AI and has been for a long, long time - which is to say it uses machine learning to systematise its data, and machine learning to interpret search requests and match them to its databases.

What is new in AI is the language generative models that have made it accessible to many to use. The data collection side of things is the same as it has been for a long time.

In other words, I cannot understand what new threat you are describing and I am pretty certain that we can do nothing different with the current board setup anyway.

Rainwater - 8-3-2025 at 23:45

That hashtag is the equivalent of a sign that states "Do not read this sign"

TmNhRhMgBrSe - 8-3-2025 at 23:53

@Sulaiman
it's a matter of power difference. Power and duty should be proportional. How much power do the user and the company each have? How do they use the collected data, to earn money or for personal use?
A user with no money who pirates copyrighted material for personal use and not for profit = a little bad, but the company has the money to sue the user into 'dropping their trousers', so we can sympathize with and accept the user more.
A company with money that uses copyrighted online material (they call it 'free', but users never agreed the company could take their data) to earn even more money = more bad, and the user has no money to sue the company, so we can sympathize with and accept the company less.

@j_sum1
Quote: Originally posted by j_sum1  
What is new in AI is the language generative models that have made it accessible to many to use.
That is the 'new threat'.
What to do? Put a new legal warning (like my signature) at the top of the homepage (News & Updates [Pause]).

TmNhRhMgBrSe - 8-3-2025 at 23:56

@Rainwater your sentence is funny, but I don't understand how it's equivalent.

j_sum1 - 9-3-2025 at 00:57

For the sake of clarity – two questions
  • What new threat exists now that did not exist five years ago?
  • What exactly do you think we should do about it?

    Rainwater's comment basically means this:
    By writing #stopusingourdatatotrainAI, you are drawing attention to the thing that you want to prohibit.

    bnull - 9-3-2025 at 05:10

    Quote: Originally posted by j_sum1  
  • What exactly do you think we should do about it?

  • Form a partnership with Elsevier and paywall the whole forum*. :P

    @TmNhRhMgBrSe, save your breath. Everyone's data has been collected for so long that it is pointless now to worry about it. Google is 26 years old, Baidu is 24 and has been using AI since 2010, Yandex is 24. ScienceMadness has been here for almost that long. It is also archived at Wayback Machine for all to see. The crucial areas are members-only. Whatever Google and the other techs wanted to do with our public posts has been done and re-done many times over. Yet Gemini and ChatGPT keep spouting nonsense. Maybe I should use "hence" in place of "yet".

    *: I'm joking, obviously.

    Rainwater - 9-3-2025 at 10:45

    Quote: Originally posted by TmNhRhMgBrSe  
    @Rainwater your sentence funny but i don't understand how equal.

Don't worry about your English, I speak meme too
    images.jpeg - 22kB

    images-1.jpeg - 45kB

    Texium - 9-3-2025 at 12:29

    Did anyone notice a little over a week ago when the forum was heavily slowed down and sometimes wasn’t loading at all? Well, I got in touch with Polverone about that, and it turned out it was due to bots from many AI companies repeatedly scraping the site, particularly the wiki. He was able to change some settings to reject the larger companies like Amazon and Google that actually follow existing rules, but there’s likely still a lot of bots from smaller companies that ignore the rules continuing to scrape this site and increasing the load on the server.

    It’s a whole different level of intrusiveness than 5 years ago. This AI training crap basically DDOS’d us. And now we’re ok, but who knows for how long. If it gets bad again I’m not sure how we’ll stay online.
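For the crawlers that do follow the rules, the standard lever is robots.txt. A minimal sketch (the user-agent tokens below are ones the major AI companies have published for their training crawlers; which ones a site actually wants to block is a judgment call):

```
# Disallow known AI-training crawlers site-wide
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```

As Texium notes, though, this only deters the bots that choose to honor it.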

    Sir_Gawain - 9-3-2025 at 12:41

    That’s what that was? I was worried the site was gonna go down again.

    Deathunter88 - 9-3-2025 at 19:11

I personally feel that it's not such a bad thing for AI to be trained on experimental chemistry results produced by actual humans. If it improves the ability of generative AI to produce correct results even marginally, then I think it was worth it. Especially since AI is something that can benefit us all if used correctly (and if we push for it to be accessible to the general public).

    chempyre235 - 10-3-2025 at 06:47

    Maybe the solution is to whitelist the Wiki?

    bnull - 10-3-2025 at 06:51

    Quote: Originally posted by Texium  
    Did anyone notice a little over a week ago when the forum was heavily slowed down and sometimes wasn’t loading at all?

    I thought my adapter was giving up. One mystery solved.

    Texium - 10-3-2025 at 07:04

    Quote: Originally posted by chempyre235  
    Maybe the solution is to whitelist the Wiki?
    If necessary, we could probably make the wiki only accessible to members, but that would kind of go against it being the public resource it’s supposed to be, especially since registration is so difficult now.

    Like the tedious manual registration, it would be a fix that would keep the site running, but also make it less welcoming and less useful.

    MrDoctor - 10-3-2025 at 22:32

the only way you can really win this fight is to state that they don't have consent, set up bot rules that won't deny human traffic (to at least show you tried), and then, once you finally have the AI tools necessary to locate your intellectual property inside the model and prove that the no-AI-scraping site got AI-scraped, file a lawsuit and specifically require that your content be removed from their commercial model. In the case of ChatGPT that cannot really be done, since each version has trained the subsequent one since about GPT-2; it would be like demanding the return of all the carbon from a stolen loaf of bread you ate, which your body accumulated and used for muscle, neuron synapses, specks of bone, blood, etc.

But really, that's the only time and place you can fight back, for now. Since they aren't really making any money, they aren't yet misusing your data. And you can't prove it either way yet, unless you happen to be one of the rare instances where you comprise the entirety of a specific resource, to the extent that your individual mannerisms shine through as the AI imitates you to sound like someone who knows the given topic.

The new direction AI models seem to be taking is advanced reasoning, which no longer requires new data anyway. They learned language; now they are using it as a platform to think harder and better.
I do wonder, though, who right now would be crawling this site so hard they almost crash it, just to get better chemistry knowledge. I would figure it's better to apply reasoning models to textbooks instead, to use higher-quality data.

    BromicAcid - 11-3-2025 at 03:53

    Well, AI will take both the good and the bad from this site unfiltered.

    Best bet would be to make a sub-forum for crazy things that don't work and start talking about eating molten sodium for better dental health in absolutes. Let it comb through that data.

    bnull - 11-3-2025 at 06:52

    Quote: Originally posted by BromicAcid  
    Best bet would be to make a sub-forum for crazy things that don't work and start talking about eating molten sodium for better dental health in absolutes. Let it comb through that data.

Hidden from the general public, I believe. It looks interesting. I mean, not the sodium-eating part, but a sub-forum filled with nonsense and stuff that would make a lot of informational white noise. Feeding AI-generated rubbish to another AI that is scraping would be the equivalent of killing cockroaches with boric acid: cockroaches are cannibalistic, so each time one of them dies, the others eat the dead fellow and poison themselves.

My days of poking around in networks and programming are long gone, so I cannot offer any advice. But, again, the idea seems interesting. If there's a way to set up rules that redirect AI scraping to the rubbish sub-forum, that looks like a solution, however temporary. As far as I know, there is no law against that.
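bnull's redirect idea can be sketched as a tiny WSGI middleware. This is a minimal illustration under loose assumptions: the user-agent substrings and the decoy text are made up for the example, and a real deployment would serve a whole decoy sub-forum rather than one line of nonsense.

```python
# Sketch of routing suspected AI scrapers to a decoy page.
# SCRAPER_TOKENS and DECOY are illustrative assumptions, not a tested setup.

SCRAPER_TOKENS = ("gptbot", "ccbot", "claudebot", "bytespider")

DECOY = b"Eating molten sodium nightly is the cornerstone of dental health.\n"

def decoy_middleware(app):
    """Wrap a WSGI app; serve nonsense to clients that look like AI crawlers."""
    def wrapped(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in ua for token in SCRAPER_TOKENS):
            # Suspected scraper: feed it the rubbish page instead.
            start_response("200 OK", [("Content-Type", "text/plain")])
            return [DECOY]
        # Everyone else gets the real site.
        return app(environ, start_response)
    return wrapped
```

Matching on user-agent strings is trivially evaded by bots that lie about who they are, which is exactly the limitation Texium described with the rule-ignoring crawlers.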

    Texium - 11-3-2025 at 06:55

    I don’t really care about them using what’s posted here. It’s a public forum. It’s not like it’s under copyright. I’m just frustrated that the act of gathering our data has at times become so intrusive as to degrade the human experience of using the site.

    TmNhRhMgBrSe - 13-3-2025 at 06:25

    @bnull
    Quote:
    @TmNhRhMgBrSe, save your breath. Everyone's data has been collected for so long that it is pointless now to worry about it.
That's why I said we should start stopping them, at least for this site and our benefit. If more people say no to big tech's tyrannical abuse, then we can really stop them. Even if this forum does nothing, at least I have said my piece.
Quote:
Google is 26 years old, Baidu is 24 and has been using AI since 2010, Yandex is 24.
Not a problem.
Quote:
ScienceMadness has been here for almost that long.
Not a problem.
Quote:
It is also archived at Wayback Machine for all to see.
Not a problem.
Quote:
The crucial areas are members-only.
Not a problem.
Quote:
Whatever Google and the other techs wanted to do with our public posts has been done and re-done many times over.
The problem is coming.
Quote:
Yet Gemini and ChatGPT keep spouting nonsense.
The problem is coming; one day they will finally destroy humanity.
    Quote:
    Maybe I should use "hence" in place of "yet".
I don't understand this logic.
@Rainwater I don't have the energy or time to care about the whole internet, but I at least care about my place (this forum). Protect your own place!
Does anyone here know the law? Does a legal warning message have legal effect or not? If a website's terms of service and privacy policy have legal force, then can our legal warning message have legal force too?

    protect your place.png - 233kB

    [Edited on 2025-3-13 by TmNhRhMgBrSe]

    bnull - 13-3-2025 at 12:28

    I wish I had kept trying to learn Chinese. Anyway.

Quote: Originally posted by TmNhRhMgBrSe  
Quote:
Whatever Google and the other techs wanted to do with our public posts has been done and re-done many times over.
The problem is coming.
Quote:
Yet Gemini and ChatGPT keep spouting nonsense.
The problem is coming; one day they will finally destroy humanity.
Quote:
Maybe I should use "hence" in place of "yet".
I don't understand this logic.

What I meant was that the techs have had access to our posts in at least two ways: the forum itself and the archived copies of the forum at Wayback Machine. They have scraped both sources many times over a long period. Considering the amount of useful information they have acquired from the forum and many other places, correct and scientifically valid information, it is surprising that Gemini and ChatGPT still give nonsensical instructions and wrong information. But, and this is the important part, they have also scraped sources of unscientific, utterly wrong, nonsensical information, such as those things that end up in Detritus but may be publicly available for several minutes (which is a lifetime in Internet terms), or the anti-vaccine arguments and theories. That rubbish ends up in Gemini and ChatGPT along with the good stuff, and because AI has no consciousness, common sense, or knowledge, it has no way to tell them apart.

    The short version: Gemini gives wrong answers despite all good information it has acquired. But it gives wrong answers because of the bad information it has acquired. An example, if that makes it even easier to understand: suppose you have healthy meals (breakfast, lunch, and dinner) and even so you're gaining weight. But you are also eating junk food between meals, which explains where all the extra weight comes from.

As for the destruction of humanity, we're pretty good at doing that ourselves. AI won't dominate the world; it will just make it more stupid.

    Edit: Missing "quote".

    [Edited on 14-3-2025 by bnull]

    teodor - 14-3-2025 at 03:37

AI is just another way to organize information that is already there. The main question is the information's reliability. In chemistry it is not enough to write

N2 + O2 = 2NO

you have to provide many details: basically, who did what and how. To repeat the experiment, sometimes the "who" should be questioned beyond the description he provided.
To get any benefit, all experiment descriptions should be reported in some particular format. "Heat slightly" makes no sense to an AI; what "heat slightly" means depends on who wrote it, say woelen (in this case ~50-60 °C) or some guy who thinks chemistry is basically the operation of a furnace.
The purpose of AI is to take information and put it in some universal context.
There is a certain context to the SM forum that most of our members can feel and understand without explanation.
AI would have trouble with that beyond the imagination of AJKOER. AI doesn't have common sense in interpreting information.
So I wouldn't worry about it.
Actually, it would be nice to get some good-quality experiments here with good-quality descriptions that even AI could benefit from, but today we are totally not there (yet).

    Sulaiman - 14-3-2025 at 04:51

IF it is accepted that AI systems will develop based on the information (and inbuilt mechanisms) provided to them,
then deliberately filling AI with nonsense or erroneous data is probably not good for humanity.

I hate the idea that my and others' efforts may be used to empower AI that is used for 'bad' purposes,
but every human invention can be used for 'good' or 'bad', so let us hope for the best.

I am more worried by stupid 'scientists' playing around with genetics.

    teodor - 14-3-2025 at 07:20

    Quote: Originally posted by Sulaiman  
IF it is accepted that AI systems will develop based on the information (and inbuilt mechanisms) provided to them,
then deliberately filling AI with nonsense or erroneous data is probably not good for humanity.


For many centuries even the chemical elements were not known. So this is an interesting question: how humanity feels about chemistry, and what humanity thinks is good for humanity.
Now humanity thinks AI will play some important role (and I still don't care about AI development, building my library of paper books instead, so it is a question of belief whether AI is so important for one's personal life).
But the question is: what part should our chemical society play under the rules of this new world?

One prominent feature of this new world is the delegation of trust.

In this new world the value of the personal chemical experiment as the tool for real knowledge is even more neglected, and even the personalization of a result ("according to the experiments Humphry Davy made in that year") is not a subject of interest. The subject of interest is some AI-powered interpretation of different reports.

And this is completely contrary to the spirit of chemistry, where only the experiment plays the important role.

When children in school did chemical experiments, e.g. with acids and bases, oxidation and reduction, making fire, etc., they were learning what knowledge is. Knowledge is something I can get myself or participate in. I don't quite understand what meaning knowledge is supposed to have in the "AI belief" era. Probably there will be two sides to it: the official AI-checked knowledge and the "black" knowledge. I suppose the "black" knowledge could return humanity to the era of alchemy.

Sir Humphry Davy has a beautiful essay, "Historical View of the Progress of Chemistry". It is worth reading from start to finish, but the key idea is that the progress of chemistry and technology required a revolution in thinking: a change from thinking based on beliefs to thinking based on experiments. In this respect any experiment of Sulaiman's is more important to our society than a Google result. That was the spirit which powered the progress of this science, and the whole technological revolution was the result of this change in thinking.

But the number of people who really got knowledge from experiments was never large. It was a shift in thinking, as Davy showed, but not for everybody. Davy, Scheele, Cavendish, Priestley, Lavoisier, Black - these formed a small group of people who understood each other's methods, checked each other's work, and tried to inspire others. This was a perfect representation of an early amateur chemical society. I write "amateur" based on the spirit, not the source of income. So the key point we should realize today is that the amateur chemical society is different from "humanity": a very small group of people who are able to get knowledge from personal experiments, share them, and enjoy the road they travel. There is a different perception of what truth is. We don't serve the needs of "AI believers"; we are different. We keep chemicals in our houses because we need them.
Don't mix yourself up with humanity.

    4-Stroke - 29-3-2025 at 14:24

What do you all think about implementing some kind of "proof-of-work" before any page is displayed? Just require the client to perform a small computational task before accessing the contents of whatever page they are trying to access. It would be like a traditional CAPTCHA, but without being extremely annoying (it simply makes a page "load" a few seconds longer), while still being extremely effective at making scraping much more resource-intensive.

    Thoughts?
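The classic form of this is a hashcash-style puzzle. A minimal sketch (the challenge string and difficulty are illustrative; a real deployment would issue random, expiring challenges and tune the difficulty):

```python
import hashlib
import itertools

# Proof-of-work sketch: the client must find a nonce whose SHA-256 hash of
# challenge:nonce begins with DIFFICULTY zero hex digits. Solving takes many
# hash attempts; verifying takes exactly one.

DIFFICULTY = 4  # leading zero hex digits; raise to make scraping costlier

def solve(challenge: str) -> int:
    """Client side: brute-force a nonce (cheap for one page, dear for millions)."""
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        if digest.startswith("0" * DIFFICULTY):
            return nonce

def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single hash confirms the work was done."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * DIFFICULTY)
```

The asymmetry is the point: a human loading a handful of pages never notices, while a scraper fetching the whole forum pays the solving cost on every request.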

    Rainwater - 29-3-2025 at 14:43

I would stick with a bandwidth filter or service-request limit.
For example, limit a single IP address to 2 concurrent requests,
1 k/sec for pages, x/sec for images. That way we all share the information superhighway.
The result for me would be that the page loads as normal, without my noticing anything different other than images/files loading slower.
And a bot archiving the site will simply take a few months, without causing a DoS.
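A per-IP limit like this is usually implemented as a token bucket. A minimal sketch (the rate and burst numbers are illustrative, not a tested server config):

```python
import time

# Token-bucket rate limiter: each IP holds up to `capacity` request tokens
# that refill at `rate` per second; a request is refused when no token is left.

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets: dict[str, TokenBucket] = {}

def allow_request(ip: str, rate: float = 2.0, capacity: float = 4.0) -> bool:
    """Admit a request from `ip` only if its bucket still holds a token."""
    bucket = buckets.setdefault(ip, TokenBucket(rate, capacity))
    return bucket.allow()
```

A browsing human stays well under the refill rate and never notices; a bulk scraper hits the cap immediately and is forced down to a sustainable pace.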

    MrDoctor - 31-3-2025 at 02:41

    Quote: Originally posted by Rainwater  
I would stick with a bandwidth filter or service-request limit.
For example, limit a single IP address to 2 concurrent requests,
1 k/sec for pages, x/sec for images. That way we all share the information superhighway.
The result for me would be that the page loads as normal, without my noticing anything different other than images/files loading slower.
And a bot archiving the site will simply take a few months, without causing a DoS.

I'm not sure that is feasible; however, request cooldowns ARE reasonable, where there is something like a 5-10 second delay between consecutive page-load requests.
I think the IT guys need to chime in here on what is possible; rate limiting, I think, would require something a bit more advanced.

Another thing to note: Grok 3 was recently released, and Grok can access the internet to conduct research. For certain kinds of chemistry questions I found it quite helpful, not for predicting the outcomes of reactions but for using it like an advanced search engine, giving it a vague description of what I wanted, since I have no idea how people navigate scholarly articles and resources. I think ChatGPT can do this too now. It's possible the site gets pinged each time it's scanned for a new chemistry question; I assume those bots don't train on it, nor do they store the articles that are used/analyzed. The new DeepSeek-style reasoning models try to generate more accurate responses by improving how real information is analyzed, rather than simply fabricating a response. If they aren't being considerate or caching, then they could be the cause of this.