Lectures
Language Model and Formal Language
ACL tutorial 4
big promises
- understanding reasoning abilities
- generalization abilities
- inductive bias
- tractability
- algorithms
- connect phenomena to human languages
Neural language models: their expressivity can be characterized via formal languages.
Useful tools:
- RNN: sequential automata
- Transformer: circuits
Definition
A language model $p$ is a probability distribution over $\Sigma^*$.
Two language models $p$ and $q$ are equal if $p(y) = q(y)$ for all $y \in \Sigma^*$.
Any language model can be written as an autoregressive one.
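Concretely, the standard autoregressive factorization (notation is mine, following the usual convention):

```latex
p(y) \;=\; p(\mathrm{EOS} \mid y)\,\prod_{t=1}^{|y|} p\bigl(y_t \mid y_{<t}\bigr)
```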
Tightness: for an arbitrary language model, $\sum_{y \in \Sigma^*} p(y) < 1$ can happen. If $\sum_{y \in \Sigma^*} p(y) = 1$, the model is tight.
A minimal non-tight example: set the probability of EOS to 0, so that generation never stops and no finite string receives probability mass (see the toy computation below).
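A toy numerical check (my own sketch, not from the tutorial): a memoryless model over the alphabet {a} with per-step EOS probability `p_eos` puts total mass 1 on $\Sigma^*$ when `p_eos > 0`, and total mass 0 when `p_eos = 0`.

```python
# Toy illustration of (non-)tightness for a memoryless autoregressive LM
# over the alphabet {"a"} plus EOS. p_eos is the per-step EOS probability.
def total_mass(p_eos: float, max_len: int = 10_000) -> float:
    """Approximate sum over all finite strings y of p(y), summing by length."""
    p_a = 1.0 - p_eos
    return sum((p_a ** n) * p_eos for n in range(max_len + 1))

print(total_mass(0.5))   # ~1.0 -> tight
print(total_mass(0.0))   # 0.0  -> non-tight: generation never stops
```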
Special features of automata: parity, hierarchy detection
Simplest RNN: the Elman RNN.
RNNs can act both as language models and as recognizers (classifying strings as grammatical or not).
RNNs used as language models (TODO: photo).
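A minimal sketch of an Elman RNN used as a language model (illustrative only; the dimensions and parameter names are my own, not from the tutorial):

```python
import numpy as np

# Elman RNN language model: h_t = tanh(W h_{t-1} + U x_t + b),
# next-token distribution p(. | y_<t) = softmax(E h_t).
rng = np.random.default_rng(0)
V, D = 4, 8                       # vocabulary size (incl. EOS), hidden size
W = rng.normal(size=(D, D))
U = rng.normal(size=(D, V))
b = np.zeros(D)
E = rng.normal(size=(V, D))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def step(h, token_id):
    x = np.eye(V)[token_id]                 # one-hot input symbol
    h_new = np.tanh(W @ h + U @ x + b)      # Elman update
    return h_new, softmax(E @ h_new)        # new state, p(next token | prefix)

h = np.zeros(D)
for tok in [1, 2, 3]:                       # a toy prefix
    h, p_next = step(h, tok)
print(p_next)                               # distribution over next token / EOS
```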
Infinite-state automata: RNNs with unbounded precision and unbounded computation time.
A Turing machine is equivalent to a two-stack pushdown automaton (the tape is split into two stacks).
Theorem: infinite-precision RNNs with unbounded computation time are Turing complete.
This implies (TODO: photo) that the hidden states need to encode the stacks.
Stack = neuron: the contents of a stack can be packed into a single neuron's activation (see the sketch below).
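A sketch of the encoding idea (my illustration, in the spirit of the Siegelmann–Sontag construction): a binary stack is stored as a base-4 fraction in a single activation, so push and pop become affine maps plus a threshold.

```python
# Encode a binary stack in one scalar "neuron": the stack b1 b2 ... (top first)
# is stored as sum_i (2*b_i + 1) / 4^i, so push/pop are affine operations.
def push(s: float, bit: int) -> float:
    return s / 4.0 + (2 * bit + 1) / 4.0

def top(s: float) -> int:
    return 1 if s >= 0.5 else 0            # top bit readable with one threshold

def pop(s: float) -> float:
    return 4.0 * s - (2 * top(s) + 1)

s = 0.0                                    # empty stack
for bit in [1, 0, 1]:
    s = push(s, bit)
print(top(s))                              # 1 (last pushed bit)
s = pop(s)
print(top(s))                              # 0 (next bit down)
```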
Drawbacks of this idealization:
- in practice, numbers have finite precision
- RNNs stop within finite time
Improved theoretical model: RNNs with bounded computation time and finite precision.
Theorem: finite-precision, real-time RNNs are equivalent to FSAs (no stacks; the training and inference procedures are set aside).
One direction is clear:
- any finite-precision RNN defines only a finite number of states
- it can thus be represented by an FSA
- this gives an upper bound: such RNNs can do at most what finite-state automata can
=> Elman RNNs can simulate deterministic FSAs (and any non-deterministic FSA can be converted into a deterministic one).
Main differences (see the sketch below):
- automaton: tabular transition rules
- RNN: matrix multiplications
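A sketch of how tabular transitions become matrix multiplications (my own toy DFA, not from the tutorial): encode the current state as a one-hot vector and each symbol's transition table as a 0/1 matrix.

```python
import numpy as np

# DFA for "even number of a's" over {a, b}; column j of each matrix is the
# one-hot encoding of delta(q_j, symbol).
T = {
    "a": np.array([[0, 1],      # a swaps q0 <-> q1
                   [1, 0]]),
    "b": np.eye(2, dtype=int),  # b leaves the state unchanged
}

def accepts(word: str) -> bool:
    state = np.array([1, 0])            # one-hot encoding of the start state q0
    for sym in word:
        state = T[sym] @ state          # one "RNN step" is a matrix multiplication
        # a real Elman RNN would additionally apply a saturating nonlinearity
    return bool(state[0])               # accept iff we end in q0 (even # of a's)

print(accepts("abab"))   # True  (two a's)
print(accepts("aaab"))   # False (three a's)
```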
Can any simple family of automata be simulated efficiently by RNNs?
TODO: photo
Some FSAs with |Q| states can be compressed into an RNN with O(log |Q|) hidden states.
Takeaways:
- stacks can be represented by linear functions.
- an RNN can land on any rung of the NC hierarchy depending on:
    - whether unbounded processing time is allowed
    - whether unbounded precision is allowed
QAs:
- RNNs' poor trainability leads to poor performance at learning recursion.
How to turn an RNN into a classifier
the missing bit:
Probabilistic finite-state automata also compute probabilities.
Beyond Binary States: LSTMs and counting
Counter machine:
TODO: photo, position on NC hierarchy
No model currently reaches the computational power of a Turing machine. Some advanced models (e.g., ChatGPT) happen to recognize some programs.
LSTM as a counter machine: the memory cell works as the counter. Set (TODO: some parameters) to 1, and the memory-cell update then acts as a counter (see the sketch below).
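A hand-wired sketch (my own illustration; the particular gate values are assumptions, not the tutorial's construction): saturating the forget and input gates near 1 turns the cell state into a counter of #a − #b, the key ingredient for recognizing a^n b^n.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One LSTM cell with hand-picked values so the cell state c counts (#a - #b).
def count_lstm(word: str) -> float:
    c = 0.0
    for sym in word:
        x = +1.0 if sym == "a" else -1.0
        f = sigmoid(20.0)          # forget gate ~ 1: keep the running count
        i = sigmoid(20.0)          # input gate  ~ 1: always add the update
        g = np.tanh(2.0 * x)       # candidate ~ +1 for "a", ~ -1 for "b"
        c = f * c + i * g          # cell update acts as an (approximate) counter
    return c

print(round(count_lstm("aaabbb")))  # 0 -> counts match, consistent with a^n b^n
print(round(count_lstm("aaabb")))   # 1 -> counts do not match
```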
Transformers and formal language theory (encoder part)
Boolean circuits, definition:
- a directed acyclic computational graph
- input: binary (0 or 1)
- nodes: basic boolean functions
- output: binary (0 or 1)
Runtime without parallelism: essentially the number of gates.
Definition (depth): the longest path from input to output; depth corresponds to runtime with parallelism.
We need something that depends on the input length (a family of circuits, one per length).
formal models: Boolean circuits
- AC^0
- TC^0
- NC^1
(what is ACC here)
formal models: first-order logic (how did this appear?)
TODO: photo, how the NC hierarchy interacts with language classes
Do circuits recognize parity? (Parity is the classic separating example: it is not in AC^0.)
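A small sketch of why parity fits in NC^1 (my own illustration, not from the tutorial): XOR over n bits can be computed by a balanced tree of fan-in-2 gates, which has depth O(log n).

```python
# Parity of n bits via a balanced XOR tree: fan-in-2 gates, depth O(log n).
def parity_tree(bits):
    layer, depth = list(bits), 0
    while len(layer) > 1:
        nxt = [layer[i] ^ layer[i + 1] for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:                 # an odd leftover bit passes through
            nxt.append(layer[-1])
        layer, depth = nxt, depth + 1
    return layer[0], depth

bits = [1, 0, 1, 1, 0, 0, 1, 1]
print(parity_tree(bits))                   # (1, 3): parity 1, depth log2(8) = 3
```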
Decisions to make when formalizing Transformers: how to treat attention and softmax.
uniformity:
dependence on the number of parameters
unique hard attention:
linear temporal logic
TODO: photo, example of $L = \Sigma^* ab \Sigma^*$
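For reference, a reconstruction of the standard linear temporal logic formula for this language (the tutorial's exact formula may differ): the strings containing the factor ab are exactly those satisfying an "eventually" formula.

```latex
% "Somewhere in the string, an a is immediately followed by a b."
\varphi \;=\; \mathbf{F}\,\bigl(a \wedge \mathbf{X}\, b\bigr)
\qquad
L(\varphi) \;=\; \Sigma^{*}\, a\, b\, \Sigma^{*}
```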
Adding sequentiality to Transformers: chain-of-thought (decoder part)
- during training, the model sees the whole sentence at once
- during inference, it generates one token at a time
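A sketch of this asymmetry (illustrative only; the stand-in `model` and all names are my own, not a real decoder): training scores every position in one parallel pass, while generation is forced to be sequential.

```python
import numpy as np

VOCAB, EOS = 5, 0
rng = np.random.default_rng(0)
W = rng.normal(size=(VOCAB, VOCAB))

def model(tokens):
    # Stand-in "decoder": bigram-style next-token logits for every position.
    return np.stack([W[t] for t in tokens])

def train_pass(sentence):
    # Training: logits for all positions come from one parallel pass.
    return model(sentence)                       # shape (len(sentence), VOCAB)

def generate(prompt, max_len=10):
    # Inference: strictly sequential, one new token per step.
    out = list(prompt)
    for _ in range(max_len):
        tok = int(np.argmax(model(out)[-1]))     # only the last position is used
        if tok == EOS:
            break
        out.append(tok)
    return out

print(train_pass([1, 2, 3]).shape)               # (3, 5)
print(generate([1]))
```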
so far, only deterministic automata
non-deterministic
takeaways
QAs:
- Turing machines can move the head both left-to-right and right-to-left; if we take that into account, we might ...
Reference