Dmitri Soshnikov
September 08, 2023

FidoNet and Generative AI: A New Approach to Museumification of Historical Content Resources

Presented at the IEEE HISTELCON 2023 conference.

Transcript

  1. FIDONET: Cybernetic immortality
    FidoNet and Generative AI: A New Approach to Museumification of Historical Content Resources
    Vasiliy Burov, Dmitry Soshnikov
  2. FidoNet, BBS, etc.
    • Founded in 1984 by Tom Jennings
    • Very popular in the 1990s
    • Specific culture & communities
    • Writing style, quoting
  3. Museification of traditional content
    • Museums have learned to present traditional content as physical artifacts
    • and to make replicas of ancient books so that visitors can leaf through them
    • But this is only static content…
  4. Museification of digital content
    Sites / Documents / Programs
    User-generated content
    • FidoNet, Usenet, IRC: this is not only the frozen content itself, but also the style and dynamics… How can we recreate it?
  5. Goal: FidoNet cybernetic immortality
    • Train a language model capable of generating an infinite stream of potential FidoNet messages
    • What does this show?
      • The spirit of FidoNet is still alive
      • The idea of cybernetic immortality, and that some form of it is already possible
    • Problems
      • Obtaining the dataset
      • Choosing and training a base language model
      • Making an exhibit
  6. Dataset
    Source                              Years      Size (original)     Size (cleaned)
    Fido7 Usenet Archives               2013-2015  16 GB (compressed)  -
    Private Archives (JAM)              2001-2004  100 MB              88 MB
    English Usenet fido group archives  1997-2002  1.7 GB              0.8 GB
    ExecPC BBS Archives (en)            1997-1999  500 MB              500 MB

    • Datasets are very difficult to find, because archives were kept on the different storage media available at the time
    • Google's Usenet archives cannot be scraped
    • There is no single point of aggregation (separate echoes on different BBS systems / backbones)
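The slide does not say how the archives were cleaned. Purely as an illustration, here is a minimal Python sketch of one plausible cleaning pass, assuming each message has already been decoded to text. It drops FidoNet control ("kludge") lines, which start with the SOH character (0x01), and `SEEN-BY:` routing lines, while keeping the body, tearline and origin line. The function name `clean_message` is hypothetical and not taken from the presentation.

```python
SOH = "\x01"  # FidoNet control ("kludge") lines begin with this character


def clean_message(raw: str) -> str:
    """Drop kludge and SEEN-BY lines; keep body, tearline and origin line."""
    # FidoNet messages commonly use a bare CR as the line separator.
    lines = raw.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    kept = []
    for line in lines:
        if line.startswith(SOH) or line.startswith("SEEN-BY:"):
            continue  # routing/control information, not message text
        kept.append(line.rstrip())  # also trim trailing whitespace
    # Remove blank lines left at the start or end after filtering.
    return "\n".join(kept).strip("\n")
```

Keeping the tearline and origin line is a deliberate choice in this sketch: the generated sample on the Results slide includes both, which suggests they were part of the training text.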
  7. Base language model selection
    Model      Size                 Comment
    LSTM       100K - 1M            Training from scratch
    GPT-2      124M – 774M – 1.5B
    ruGPT 3    117M – 760M – 1.3B
    ruGPT 3.5  13B
    LLaMA      7B+

    12 hours of training for 1 epoch on an Nvidia A100 80 GB GPU
    The choice is a compromise between training time and the dataset size required to avoid overfitting
  8. Results
    Topic: UFO
    From: RON TAYLOR
    To: JACK SARGEANT
    Subj: UFOs

    JS> RT> You are certainly entitled to your opinion as an engineer or scientist, to
    JS> RT> the extent that you're entitled to an opinion with which you are
    JS> RT> absolutely convinced. However, if you are a skeptic, you shouldnt
    JS> RT> make your point clearly and firmly, not to "sell your ideas to me".
    JS>If you think of a UFO as merely a "flying disk" without an ET engineer behind the
    JS>scenes, you are free to believe that. I believe that the UFO is
    JS>real, but I am not content to just speculate about its nature.
    Because I'm not a skeptic and there are other people in this conference that are skeptic's for the most part...
    -Ron
    * QMPro 1.02 42-7029 * Why are there SO many atheists? Because God lets them.
    --- WILDMAIL!/WC v4.12
    * Origin: CrimeBytes:Take A MegaByte Out Of Crime! (305)592-9831 1:135/5.0)

    https://huggingface.com/estonto/fido-gpt

    • Generated text is not present in the training dataset
    • Quoting style is correctly reproduced (including name abbreviations)
    • Names are often present in the training dataset => overfitting on names due to small dataset size/variability
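The `JS> RT>` prefixes in the sample follow the FidoNet convention of marking quoted lines with the initials of the quoted author, with prefixes stacking for nested quotes. The minimal Python sketch below shows how such prefixes could be produced. The helper names `initials` and `quote` are hypothetical, and real message editors differ in spacing and line-wrapping details.

```python
def initials(name: str) -> str:
    """Return the uppercase initials of a full name, e.g. 'Ron Taylor' -> 'RT'."""
    return "".join(word[0].upper() for word in name.split())


def quote(text: str, author: str) -> str:
    """Prefix every non-empty line with 'XX> ', XX being the author's initials."""
    prefix = f"{initials(author)}> "
    return "\n".join(
        prefix + line if line.strip() else line
        for line in text.split("\n")
    )


# Quoting a line twice stacks the prefixes, as in the sample message.
first = quote("You are certainly entitled to your opinion", "Ron Taylor")
nested = quote(first, "Jack Sargeant")
print(nested)  # JS> RT> You are certainly entitled to your opinion
```

A language model that reproduces this pattern has to keep the initials consistent with the From/To header names, which is what the slide's second bullet refers to.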
  9. Implementation
    [Architecture diagram; labels: Client Web App, Cloud Server (GPU), Messenger App; Client Web App, Pre-generation]
    http://soshnikov.com/art/fidoci
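The "Pre-generation" label suggests that messages can be generated ahead of time and served without a GPU at request time. The slide gives no further detail, so the following Python sketch only illustrates that assumption: a pool of messages is produced once and then cycled through endlessly. The names `pregenerate` and `endless_stream` are hypothetical, and `generate_one` stands in for the actual model call.

```python
import itertools
from typing import Callable, Iterator, List


def pregenerate(generate_one: Callable[[], str], count: int) -> List[str]:
    """Run the (expensive) generator `count` times before serving starts."""
    return [generate_one() for _ in range(count)]


def endless_stream(pool: List[str]) -> Iterator[str]:
    """Cycle through the pre-generated pool forever; no model call is made here."""
    if not pool:
        raise ValueError("pool must contain at least one message")
    return itertools.cycle(pool)
```

The trade-off in this sketch is that the stream eventually repeats, whereas calling a GPU-backed server for each request, as the other half of the diagram seems to show, could produce new messages every time.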
  10. Further work
    • Alternative approach: generating conversations from a dialogue between different conversational models with different personalities
    • Training models for other languages
    • Implementing user interaction through chatbots