Reddit usernames like ‘SolidGoldMagikarp’ are somehow causing the chatbot to give bizarre responses.

      • icerunner@kbin.social · 6 points · 2 years ago

        I believe, if this sort of generative AI is going to be trustworthy in the future, we need some sort of external verification system so we can make our own trust judgements based on the data used to train the system. For example, if a system is trained including 4chan as a data source, I’m going to trust it less than if it wasn’t trained using that source.
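
        As a rough sketch of what such verification could look like, imagine models shipping a machine-readable manifest of training sources that anyone could check against their own blocklist (the manifest format and field names below are invented for illustration):

        ```python
        # Hypothetical training-data manifest checked against a user's own
        # trust judgements. The schema is invented for illustration.
        TRAINING_MANIFEST = {
            "model": "example-llm-v1",
            "sources": ["wikipedia", "books", "reddit", "4chan"],
        }

        MY_BLOCKLIST = {"4chan"}  # sources I personally trust less

        def passes_trust_check(manifest: dict, blocklist: set) -> bool:
            """True only if no declared source is on the blocklist."""
            return not set(manifest["sources"]) & blocklist

        print(passes_trust_check(TRAINING_MANIFEST, MY_BLOCKLIST))  # False
        ```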

        I don’t think big business yet realises how important the training data is but, as soon as they do, they will want the AI companies to provide guarantees about the sanity and appropriateness of the training data.

        • pgm_01@kbin.social · 3 points · 2 years ago

          That’s how human intelligence works. We assign a value to the source of the information. The fact that these AIs seem to have been trained without that explains why they “lie” so much: they simply reconstruct patterns without weighting any of them by where they came from.

          For example, take the claim “President Biden will launch a ground invasion of Russia.” If the New York Times, BBC, and CNN are all reporting it, we would give that information a higher likelihood of being true than if it appeared only on random blogs. However, if the random blogs reporting it belonged to reputable reporters or bloggers on military and international affairs, we would rate the information as more likely correct than if it came from Bob’s Bigfoot and Alien Sightings Index.
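
          A toy sketch of that kind of source weighting (all outlet names and trust scores here are made up for illustration):

          ```python
          # Crude source-weighted credence: a claim is scored by the most
          # trusted outlet reporting it. Names and weights are invented.
          SOURCE_TRUST = {
              "nytimes.com": 0.9,
              "bbc.co.uk": 0.9,
              "cnn.com": 0.85,
              "military-affairs-blog.example": 0.6,
              "bobs-bigfoot-index.example": 0.05,
          }

          def credence(reporting_sources):
              # Unknown sources get a low default weight.
              return max(SOURCE_TRUST.get(s, 0.2) for s in reporting_sources)

          print(credence(["nytimes.com", "bbc.co.uk", "cnn.com"]))  # 0.9
          print(credence(["bobs-bigfoot-index.example"]))           # 0.05
          ```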

          Without the ability to check the accuracy of its source data, any generative AI can be corrupted. If you fed an art AI photos of the Statue of Liberty but kept telling it they were the Eiffel Tower, then when asked to draw the Eiffel Tower it would spit out the Statue of Liberty. Right now, without the ability to assess the accuracy of a response, the chat-based AIs are garbage for most of the use cases companies are deploying them in.

        • FaceDeer@kbin.social · 2 points · 2 years ago

          Whether 4chan is a good data source or not depends on what you intend to use the AI for. If you want it to interact with users on a web forum or in a similar context, then 4chan data would likely be very useful indeed.

          Bear in mind that, as long as it’s properly labelled, “bad” data is still useful as an example of bad data. A common example is with image AIs, where people can give negative prompts like “ugly” and “blurry” to tell the AI to make images that are not like that.
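
          For instance, a minimal sketch of negative prompting with the Hugging Face diffusers library (the checkpoint name is just one common example, not the only option):

          ```python
          # Minimal negative-prompt sketch using Hugging Face diffusers.
          # Requires a CUDA GPU; the checkpoint is one common example.
          import torch
          from diffusers import StableDiffusionPipeline

          pipe = StableDiffusionPipeline.from_pretrained(
              "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
          ).to("cuda")

          image = pipe(
              prompt="portrait photo of an astronaut",
              negative_prompt="ugly, blurry",  # steer generation away from these
          ).images[0]
          image.save("astronaut.png")
          ```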

      • hiyaaaaa23@kbin.social · 2 points · 2 years ago

        No. There’s a Computerphile video about it, but iirc r/counting was accidentally left in the training data set for part of the training process.
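
        For the curious, you can inspect that era’s vocabulary yourself with OpenAI’s tiktoken library; the glitch-token write-ups reported strings like “ SolidGoldMagikarp” (with the leading space) mapping to a single token:

        ```python
        # Inspect the GPT-2/GPT-3 era BPE vocabulary with tiktoken.
        import tiktoken

        enc = tiktoken.get_encoding("r50k_base")   # pre-ChatGPT encoding
        tokens = enc.encode(" SolidGoldMagikarp")  # note the leading space
        print(tokens, len(tokens))  # reported as a single token ID in this vocab
        ```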

  • kittykabal@kbin.social · 12 points · 2 years ago

    for anyone who was hoping to have fun with this, i regret to inform you that this is from back in February, which may as well be the 1990s in AI development time. none of this works on the current ChatGPT, and the nature of the problem is now better known to researchers.

  • saucyloggins@lemmy.world · 8 points · 2 years ago

    I’m not surprised they used Reddit data for training. I am a bit shocked, though, at how fucking lazy and haphazard they were with the data.

    There are only logical arguments for anonymizing the data, which it looks like they didn’t do. It’s such a massive privacy risk not to, and it also puts the company at legal risk. Who knows what other bizarre info it’ll leak.

    • FaceDeer@kbin.social · 2 points · 2 years ago

      Setting aside the silliness of anonymizing data that’s already wide open to the public: if you were to anonymize the usernames you’d end up producing a worse AI, because the literal username of the person in question is often significant to the context of what’s being written. Think of all the “relevant username” comments, for example. People make puns about usernames, berate people for having offensive usernames, and so forth. If those usernames were all replaced with anonymized substitutes, the AI would be training on nonsense.
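
      As a toy illustration of what that substitution does (the hashing scheme here is just one assumed approach, not anything any particular pipeline actually uses):

      ```python
      # Toy pseudonymization: stable but meaningless replacement names.
      import hashlib

      def pseudonymize(username: str) -> str:
          digest = hashlib.sha256(username.encode()).hexdigest()[:8]
          return f"user_{digest}"

      comment = "Relevant username, u/SolidGoldMagikarp!"
      print(comment.replace("SolidGoldMagikarp", pseudonymize("SolidGoldMagikarp")))
      # The pun no longer parses, so the model would train on nonsense.
      ```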

  • readbeanicecream@kbin.social · 7 points · 2 years ago

    … trained on quite raw data, which included like a load of weird Reddit stuff, …

    Probably all we need to know, right there.

  • knexcar@kbin.social · 7 points · 2 years ago

    TheNitromeFan

    I haven’t thought about Nitrome in a long time! They made some amazing Flash games.