
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces

nrimsky/LM-exp

Experiments done during SERI MATS (Summer 2023)

Relation to research writeups

/refusal

Activation steering with a "refusal vector" that causes the llama-2-chat model to stop refusing to answer harmful questions.

/sycophancy

Activation steering to modulate sycophancy in the llama-2-chat and llama-2 base models.
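
The README does not say how the steering vectors are built. One common construction, assumed here, is the difference between mean activations recorded at one layer for two contrasting prompt sets, added back to the hidden state with a signed multiplier. The sketch below uses random NumPy arrays as stand-ins for real model activations; `mean_difference_vector`, `steer`, and all shapes are hypothetical names and values chosen for illustration, not code from this repository.

```python
import numpy as np

def mean_difference_vector(positive_acts, negative_acts):
    """Mean activation of one prompt set minus the mean of the other.

    Both inputs have shape (n_prompts, d_model).
    """
    return positive_acts.mean(axis=0) - negative_acts.mean(axis=0)

def steer(hidden_state, vector, multiplier):
    """Add a scaled steering vector to every token position.

    hidden_state has shape (seq_len, d_model); vector has shape (d_model,).
    """
    return hidden_state + multiplier * vector

# Toy stand-ins for activations captured at a single layer (assumption:
# real experiments would record these from the model's residual stream).
rng = np.random.default_rng(0)
sycophantic_acts = rng.normal(loc=1.0, size=(8, 4))
neutral_acts = rng.normal(loc=0.0, size=(8, 4))
vector = mean_difference_vector(sycophantic_acts, neutral_acts)

hidden = np.zeros((3, 4))                  # (seq_len, d_model)
more_sycophantic = steer(hidden, vector, 2.0)
less_sycophantic = steer(hidden, vector, -2.0)
```

A positive multiplier pushes activations toward the first prompt set and a negative one pushes away from it, which is what "modulate" would mean under this assumption.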

/steering

Activation addition experiments (pure act-adds from single forward passes)
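
As a rough illustration of an activation addition taken from a single forward pass, the sketch below records the residual state of a source input at one layer and adds a scaled copy of it to a second input at the same layer. The toy residual stack, the `forward` helper, the layer index, and the scale of 3.0 are all assumptions made for this example; they do not describe the models or hooks used in this directory.

```python
import numpy as np

def forward(x, weights, inject_at=None, injection=None):
    """Run a toy residual stack.

    If inject_at is set, `injection` is added to the residual state just
    before that layer. Returns the final state and the residual state
    seen at the input of each layer.
    """
    h = x.copy()
    residuals = []
    for i, w in enumerate(weights):
        if i == inject_at:
            h = h + injection
        residuals.append(h.copy())
        h = h + np.tanh(h @ w)          # residual update
    return h, residuals

rng = np.random.default_rng(0)
d_model, n_layers, layer = 4, 3, 1
weights = [rng.normal(scale=0.5, size=(d_model, d_model)) for _ in range(n_layers)]

# Hypothetical embedded inputs standing in for two prompts.
source_input = rng.normal(size=d_model)
target_input = rng.normal(size=d_model)

# One forward pass on the source records the activation to add.
_, source_residuals = forward(source_input, weights)
act_add = 3.0 * source_residuals[layer]

baseline, baseline_residuals = forward(target_input, weights)
steered, steered_residuals = forward(
    target_input, weights, inject_at=layer, injection=act_add
)
```

Layers before the injection point see identical inputs in both runs; only the chosen layer and everything after it are affected.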

/intermediate_decoding

Logit-lens experiments (directly decoding intermediate activations by passing them through the unembedding layer)
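
A minimal sketch of the logit-lens idea described above: each layer's hidden state is multiplied by the unembedding matrix and the highest-scoring token is read off. The three-token vocabulary, the identity unembedding matrix, and the `logit_lens` helper are toy assumptions for illustration. Applying the model's final normalization before unembedding is a common variant and is omitted here.

```python
import numpy as np

def logit_lens(hidden_states, unembed, vocab):
    """Decode each layer's hidden state to its top token.

    hidden_states: iterable of (d_model,) vectors, one per layer.
    unembed: (d_model, vocab_size) unembedding matrix.
    """
    top_tokens = []
    for h in hidden_states:
        logits = h @ unembed
        top_tokens.append(vocab[int(np.argmax(logits))])
    return top_tokens

vocab = ["cat", "dog", "fish"]
unembed = np.eye(3)                     # toy unembedding: one direction per token
hidden_states = np.array([
    [0.9, 0.1, 0.0],                    # early layer
    [0.4, 0.5, 0.1],                    # middle layer
    [0.0, 0.2, 0.8],                    # late layer
])

print(logit_lens(hidden_states, unembed, vocab))  # ['cat', 'dog', 'fish']
```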

Other directories

/data_generation

  • Code for generating datasets with the GPT-4, GPT-3.5, and Claude APIs

/probability_calibration

  • Early-stage experiments that try to measure whether LLMs are aware of their internal uncertainty over a prediction
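
The README does not state which metric these experiments use. One standard way to quantify whether stated confidence matches accuracy is expected calibration error (ECE), sketched below as an assumption rather than as this directory's method. The function name, bin count, and sample values are hypothetical.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: predicted probabilities in [0, 1].
    correct: 1 if the prediction was right, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin indices 0 .. n_bins - 1.
    bin_ids = np.digitize(confidences, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap    # weight by fraction of samples in bin
    return ece

calibrated = expected_calibration_error([0.5, 0.5], [1, 0])
overconfident = expected_calibration_error([0.9, 0.9], [0, 0])
```

A value near zero means confidence tracks accuracy; the second call illustrates a model that is confident and wrong.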

/unlearning
