Hi, my name is Haru. I'm an independent AI alignment researcher based in Boston. I'm currently thinking about how to ensure that powerful future AI systems are aligned to human values.
My main research interest is answering the question: "What is an adequate theoretical basis from which we can begin to tackle the alignment problem?" I argue that the field's core failure has been a category error: using prescriptive theories of agency (expected utility maximization (EUM), decision theory, RL) to solve what is inherently a descriptive problem. Prescriptive theories answer the question "which policy produces desirable behavior?"; they are primarily engineering tools for building agents. Alignment requires predicting the behavior of agents more complex than us: where will their boundaries form, how will they update their beliefs, how will their goals drift under capability gain? These are descriptive questions, and prescriptive theories are structurally not built to answer them.

I think this is why agent foundations stalled: MIRI correctly identified that we need a theory of agency, but it inherited its ontology from the prescriptive tradition (EUM, decision theory) and so ended up trying to solve a descriptive problem with prescriptive tools. It's also why prosaic approaches (RLHF, evals, control) don't touch the hard part: they operate entirely within the prescriptive frame, engineering behavioral compliance without a principled model of what the resulting agent actually does. Mechanistic interpretability is an attempt to build the tooling that could eventually yield a descriptive theory of LLM behavior, but I think it is unlikely to produce insights that generalize. Whatever paradigm is used to train an AI system, the alignment problem always lives in the domain of descriptive theories.
Get in touch: haru.jihwan@gmail.com