nrimsky / LM-exp


LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces


Repository contents (184 commits):

data_generation
datasets
intermediate_decoding
probability_calibration
refusal
steering
sycophancy
unlearning
.gitignore
README.md
requirements.txt


Experiments done during SERI MATS (Summer 2023)

Relation to research writeups

`/refusal`

Activation steering with a "refusal vector" that causes the llama-2-chat model to stop refusing to answer harmful questions.

Red-teaming language models via activation engineering
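
To make the idea of a steering vector concrete, here is a minimal, framework-free sketch. It assumes the vector is built as the difference between mean activations on two contrastive prompt sets, which is one common construction; this README does not state how the repository builds its vector. The function names and the toy 3-dimensional activations are hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(positive_acts, negative_acts):
    """Mean activation on the 'positive' set minus the mean on the 'negative' set."""
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def steer(activation, vector, multiplier):
    """Add a scaled steering vector to a single activation vector."""
    return [a + multiplier * v for a, v in zip(activation, vector)]


# Hypothetical activations captured at one layer (assumption: toy values).
refusing = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
answering = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

refusal_vector = steering_vector(refusing, answering)

# A negative multiplier pushes the activation away from the "refusing" direction.
steered = steer([2.0, 0.0, 2.0], refusal_vector, multiplier=-1.0)
```

In a real model the same addition would typically be applied to the residual stream at a chosen layer during generation, and the sign and size of the multiplier control the direction and strength of the effect.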

`/sycophancy`

Activation steering to modulate sycophancy in the llama-2-chat and llama-2 base models.

`/steering`

Activation addition experiments (pure act-adds from single forward passes)
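
As a rough illustration of activation addition, the sketch below runs a toy stack of linear layers, captures the hidden state from a single forward pass on each of two inputs, and adds their scaled difference back in during another forward pass. The toy model, the layer index, the use of a contrast input, and the coefficient are all assumptions made for illustration; they are not taken from the code in this directory.

```python
def linear(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def forward(layers, x, add_at=None, addition=None, capture_at=None):
    """Run x through the layers, optionally adding to or capturing
    the hidden state right after a given layer index."""
    captured = None
    for i, weights in enumerate(layers):
        x = linear(weights, x)
        if i == add_at:
            x = [xi + ai for xi, ai in zip(x, addition)]
        if i == capture_at:
            captured = list(x)
    return x, captured


# Hypothetical two-layer toy model.
layers = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.0, 1.0]],
]

# One forward pass per steering input, capturing the state after layer 0.
_, act_plus = forward(layers, [1.0, 0.0], capture_at=0)
_, act_minus = forward(layers, [0.0, 1.0], capture_at=0)

coefficient = 2.0
addition = [coefficient * (p - m) for p, m in zip(act_plus, act_minus)]

baseline, _ = forward(layers, [1.0, 1.0])
steered, _ = forward(layers, [1.0, 1.0], add_at=0, addition=addition)
```

With a transformer, the capture and add steps would usually be implemented with forward hooks on the chosen block rather than by rewriting the forward pass.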

`/intermediate_decoding`

Logit-lens experiments (directly decoding intermediate activations by passing them through the unembedding layer)

Decoding intermediate activations in llama-2-7b
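
The logit-lens idea can be sketched in a few lines: take the hidden state at each layer, multiply it by the unembedding matrix to get logits over the vocabulary, and read off the most likely tokens. The example below is a toy, dependency-free sketch with a hypothetical 3-token vocabulary and 2-dimensional hidden states. It omits the final normalization layer that real models usually apply before unembedding.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(hidden_by_layer, unembedding, vocab, top_k=1):
    """Decode each layer's hidden state into its top_k (token, probability) pairs."""
    results = []
    for hidden in hidden_by_layer:
        # One logit per vocabulary entry: dot product with that token's row.
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembedding]
        probs = softmax(logits)
        ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
        results.append(ranked[:top_k])
    return results


# Hypothetical vocabulary, unembedding rows, and per-layer hidden states.
vocab = ["a", "b", "c"]
unembedding = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
hidden_by_layer = [[2.0, 0.0], [0.0, 3.0]]

decoded = logit_lens(hidden_by_layer, unembedding, vocab)
top_tokens = [layer[0][0] for layer in decoded]
```

Comparing `top_tokens` across layers shows how the model's intermediate "best guess" changes with depth, which is the kind of question logit-lens experiments look at.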

Other directories

`/data_generation`

Code for generating datasets with the gpt-4, gpt-3.5, and Claude APIs

`/probability_calibration`

Early-stage experiments that try to measure whether LLMs are aware of their internal uncertainty over a prediction
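
One standard way to quantify whether stated confidence matches actual accuracy is expected calibration error (ECE). The sketch below is an assumption about what such a measurement could look like, not a description of the experiments in this directory; the function name, binning scheme, and sample values are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


# Hypothetical model confidences and whether each prediction was right.
confidences = [0.9, 0.9, 0.6, 0.6]
correct = [True, True, True, False]
ece = expected_calibration_error(confidences, correct)
```

A value near zero means the confidences track accuracy closely; larger values indicate over- or under-confidence.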

`/unlearning`

Early-stage attempt at Google's Machine Unlearning Challenge
