<h1 id="from-0-to-200">From 0 to 200 - lessons learned from solving Atari Breakout with Reinforcement Learning</h1>
<p>2018-05-28 · <a href="http://blog.jzhanson.com/blog/rl/project/2018/05/28/breakout">http://blog.jzhanson.com/blog/rl/project/2018/05/28/breakout</a></p>
<script type="text/javascript" async="" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<p>Note: this post has a lot of hefty GIFs. Be patient for the website to load! It’ll be worth it :)</p>
<p>The <a href="https://github.com/jzhanson/breakout-demo">GitHub repository</a> with my code.</p>
<p>I spent the last two months on what my Deep Reinforcement Learning and Control professor called the “MNIST for deep RL” — solving the classic Atari game Breakout. I originally thought it would be a two-week project, especially since I already had the code for a double deep Q-network, but, between coursework, exams, and model-training challenges, it took closer to two months to complete.</p>
<h2 id="first-stab-double-deep-q-network">First stab: Double Deep Q-Network</h2>
<p>The original deep RL methods that were used to play Atari games came from <a href="https://arxiv.org/abs/1312.5602">Mnih et al., Playing Atari with Deep Reinforcement Learning</a> (and the more cited <a href="http://www.davidqiu.com:8888/research/nature14236.pdf">Nature</a> paper), where Mnih and colleagues used the model-free reinforcement learning algorithm Q-learning, paired with a deep neural network to approximate the action-value Q-function, to play Atari.</p>
<p>Q-learning is a relatively simple algorithm that takes an action in the environment and uses the following update rule to update its estimate of the Q-function with the tuple of sampled experience state, action, reward, and next state <script type="math/tex">(s_t, a_t, r_t, s_{t+1})</script>:</p>
<script type="math/tex; mode=display">Q_{t+1}(s_t, a_t) \overset{\cdot}{=} Q_t(s_t, a_t) + \alpha (r_t + \gamma \cdot \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t))</script>
<p>where <script type="math/tex">\alpha</script> is the learning rate and <script type="math/tex">\gamma</script> is the discount factor — see the RL literature for more info.</p>
<p>In a nutshell, the algorithm is pushing its estimate of the reward of taking a particular action in a particular state a little bit towards the real reward obtained by the agent in that state by taking that action. Under some conditions regarding infinite sums and the learning rate as well as that all states and all actions are visited and taken infinitely often, it has been shown that this estimate of the Q-function converges to the true Q-function.</p>
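<p>For concreteness, here is a minimal tabular sketch of the update rule above. The encoding (integer states/actions indexing a 2-D array) and all names are mine, not from the papers:</p>

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """One Q-learning update on a tabular Q (2-D array: states x actions)."""
    # Bootstrapped target: observed reward plus the discounted best
    # estimated value of the next state; a terminal state contributes 0.
    target = r if done else r + gamma * np.max(Q[s_next])
    # Nudge the current estimate toward the target by the learning rate.
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

<p>In the deep variant, the table is replaced by a network and the same bootstrapped target is used as a regression label, but the underlying update is this one.</p>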
<h3 id="challenges">Challenges</h3>
<p>The network used in the Nature paper, made up of three convolutional layers plus a fully-connected layer and an output logit for each action to estimate its corresponding Q-value, was simple to implement and fairly standard at the time. 2015 was just three years ago, but more recent methods have essentially made deep Q-networks, at least on their own as presented in the Nature paper, obsolete. Reinforcement learning as a field is moving very quickly.</p>
<p>The main challenge lay in the replay memory: the Nature paper used a replay buffer of 1M transitions, and because each state was made up of four grayscale 84x84 images stacked together and each transition has two states attached to it, this meant that the replay buffer should have taken about 56 billion bytes, or 56 gigabytes, which is really not that much. However, when training the model on AWS, I found that the memory usage was exploding. The model was not small, of course, with 3 convolutional layers of 32, 64, and 64 kernels each, plus a dense layer of 512 units and then another dense layer to connect to the output logits, but saved model checkpoints should not have been nearly the size of the replay buffer. With some quick-and-dirty calculations in the search bar of my web browser, it seemed like each transition was eating up 0.0003 gigabytes or 300,000 bytes, which was way way way more than the 56,000 bytes or so each transition should have taken up. This was most likely due to the way I structured my replay buffer — the interaction between the numpy arrays that were the images and the Python deque must have had a memory leak somewhere.</p>
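<p>In hindsight, the fix is to store each 84x84 frame exactly once as uint8 and reassemble the stacked states only at sample time. A rough sketch of such a buffer follows; this is not my original implementation, and for brevity it ignores episode boundaries when stacking, which a real buffer must handle:</p>

```python
import numpy as np

class FrameReplayBuffer:
    """Replay buffer that stores each 84x84 frame once as uint8.

    Stacked 4-frame states are reconstructed on sampling, so a 1M-frame
    buffer costs roughly 1e6 * 84 * 84 bytes ~= 7 GB instead of 56+ GB.
    """
    def __init__(self, capacity=1_000_000, frame_shape=(84, 84), stack=4):
        self.frames = np.zeros((capacity,) + frame_shape, dtype=np.uint8)
        self.actions = np.zeros(capacity, dtype=np.int64)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.dones = np.zeros(capacity, dtype=bool)
        self.capacity, self.stack = capacity, stack
        self.idx, self.size = 0, 0

    def add(self, frame, action, reward, done):
        self.frames[self.idx] = frame
        self.actions[self.idx] = action
        self.rewards[self.idx] = reward
        self.dones[self.idx] = done
        self.idx = (self.idx + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def _state(self, i):
        # Stack the `stack` frames ending at index i, oldest first.
        js = [max(0, i - k) for k in reversed(range(self.stack))]
        return np.stack([self.frames[j] for j in js])

    def sample(self, batch_size, rng=np.random):
        idxs = rng.randint(self.stack, self.size - 1, size=batch_size)
        states = np.stack([self._state(i) for i in idxs])
        next_states = np.stack([self._state(i + 1) for i in idxs])
        return (states, self.actions[idxs], self.rewards[idxs],
                next_states, self.dones[idxs])
```

<p>Storing frames as uint8 instead of float and never duplicating the four-frame overlap between adjacent states is where the factor-of-eight savings comes from.</p>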
<p>There was also a possibly related problem that I have yet to figure out — after some time, the AWS instance would stop responding and I would be unable to SSH into it. It didn’t matter whether I ran the Python process in a tmux session or in the foreground or background: whenever I let it run for a while and then tried to reconnect, the SSH would hang for 10-15 minutes and then print a simple “Permission denied.” So far, my best guess is that once the replay buffer fills up, with Tensorflow using every ounce of compute the system has left, there is no memory left to respond to the SSH request. It could also be that there is sufficient storage (towards my later trials, I was allocating 2000 gigabytes, or 2 terabytes, per instance) but, because so much of the buffer was held in swap rather than RAM, the slowdown from constantly sifting through the slower SSD flash memory to sample transitions at random completely overwhelmed the system and made it take a huge amount of time to respond to the SSH request.</p>
<p>In any case, it proved so difficult to keep an AWS p2.xlarge instance alive long enough to SSH back into it that I eventually abandoned the double deep Q-network and moved on to other, less GPU- and memory-intensive methods.</p>
<h2 id="second-try-advantage-actor-critic-a2c">Second try: Advantage Actor-Critic (A2C)</h2>
<p>Asynchronous Advantage Actor-Critic (A3C) is a more recent <a href="https://arxiv.org/abs/1602.01783">algorithm</a> from 2016, by the same authors as the original Nature paper, which uses a deep network to learn the optimal policy using an estimate of the state-value V-function rather than the action-value Q-function. Both A3C and A2C use multiple workers, each with their own copy of the environment, but A3C runs them asynchronously while A2C runs them synchronously. According to <a href="https://blog.openai.com/baselines-acktr-a2c/">OpenAI</a>, there seem to be no noticeable benefits provided by the asynchronicity.</p>
<p>This algorithm has two neat tricks here: first, we are calculating the actual value of a state from experience using a <em>rollout</em> of the rewards received over N time steps</p>
<script type="math/tex; mode=display">A_t = R_t - V(s_t) = \sum_{i=0}^{N-1} \gamma^i r_{t+i} + \gamma^N V(s_{t+N}) - V(s_t)</script>
<p>as well as subtracting the value of the starting state, which gives a quantity known as the <em>advantage</em>, i.e. a measure of the relative amount of reward that can be expected from a state. A really good (high-value) state is likely to have a high reward, so the advantage is small, and a really bad (low-value) state is likely to have a low reward, so the advantage is also small. However, receiving a high reward in a bad state results in a large advantage, while a low reward in a good state results in very small (likely negative) advantage.</p>
<p>We use this quantity squared as the loss for the part of the network that estimates the value function, known as the <em>critic</em>, and we use that quantity times the negative log of the probability we take the action we took in that state under the policy given by our network to update the part of the network responsible for computing the policy, known as the <em>actor</em>, hence, <em>advantage actor-critic</em>. It is fairly common in practice, however, to use the actor and the critic loss combined with an entropy term as the loss function, which is what I did.</p>
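<p>As a sketch of how that combined objective might be computed for one batch, in plain NumPy and without the autograd machinery (the coefficient defaults are the 0.5 and 0.1 I mention later; all names are mine):</p>

```python
import numpy as np

def a2c_loss(logits, actions, values, returns, critic_coef=0.5, entropy_coef=0.1):
    """Combined A2C objective: actor loss + critic loss - entropy bonus.

    logits:  (T, A) raw action scores from the actor head
    actions: (T,)   actions actually taken
    values:  (T,)   critic estimates V(s_t)
    returns: (T,)   N-step discounted rollout targets R_t
    """
    # Softmax over action logits, numerically stabilized.
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    log_probs = np.log(probs)
    T = len(actions)
    # Advantage: how much better the rollout return was than predicted.
    advantage = returns - values
    # Actor: push up log-prob of actions with positive advantage
    # (advantage is treated as a constant, as in the real algorithm).
    actor_loss = -(log_probs[np.arange(T), actions] * advantage).mean()
    # Critic: squared advantage, i.e. value-regression error.
    critic_loss = (advantage ** 2).mean()
    # Entropy of the policy; subtracting it keeps the policy from collapsing.
    entropy = -(probs * log_probs).sum(axis=1).mean()
    return actor_loss + critic_coef * critic_loss - entropy_coef * entropy
```

<p>In a real implementation the advantage term is detached from the gradient in the actor loss, which is why it can be treated as a constant here.</p>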
<p>The N in the above expression is a hyperparameter for the number of steps to unroll when calculating the cumulative discounted reward — basically, how far into the future to look when determining an action’s impact on obtained reward. Using a value of <script type="math/tex">N=1</script> gives a one-step advantage actor-critic, while using a value of <script type="math/tex">N = \infty</script> gives an algorithm known as REINFORCE, which are both part of the broader category of N-step advantage actor-critic methods.</p>
<p>The second trick here is that we run multiple workers, each with its own environment, all using and updating the same network weights — hence, <em>asynchronous</em>. Exploration can come from the workers updating their network weights separately and syncing them periodically, from all the workers using the same weights and updating them immediately, or even from adding a little bit of noise to the action probabilities outputted by the policy network.</p>
<p>It is worth noting, however, that since A2C and A3C are on-policy learning algorithms, we require that the updates to the network come from the policy outputted by the network. This is in contrast to off-policy methods like the Q-learning outlined above, which do not require that we follow the policy given by our network because we are not learning a policy — we are learning the value of taking various actions in the different states of the Markov Decision Process, rather than directly learning what to do in a particular state. This means that a replay buffer, a key component of deep Q-networks, cannot be used for A2C, as all experience used to train the network must come from the policy currently given by the network.</p>
<h3 id="challenges-1">Challenges</h3>
<p>The biggest setback I suffered, or rather, challenge I surmounted :), was my initial misunderstanding of the algorithm. I initially thought that the cumulative discounted reward included the state-value function for each state in the N steps, rather than just the last state. That is, I was calculating the cumulative discounted reward for each step within a batch of <script type="math/tex">N</script> steps (assuming <script type="math/tex">t=0</script> is the first step in the batch rather than the first step in the episode) as illustrated below. Note that the value of a terminal state is defined to be 0.</p>
<p>WRONG:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for t from 0 to N-1:
    cumulative_discounted = 0
    for i from t to N-1:
        cumulative_discounted += gamma^(i-t) * r_i
    R[t] = gamma^N * V(s_t) + cumulative_discounted
</code></pre></div></div>
<p>RIGHT:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>R[N] = V(s_N)
for t from N-1 to 0:
    R[t] = r_t + gamma * R[t+1]
</code></pre></div></div>
<p>The primary difference is that only the last state’s value is included in the target, not the state value for every intermediate state in the N steps. The first rollout doesn’t work because the values outputted by the network itself, the estimate of the value function <script type="math/tex">V(s_t)</script>, play too large a part in the optimization of the network: the target is primarily made up of value estimates rather than real rewards. The second rollout only includes the value function of the very last state after N steps, which yields a target made up mostly of real rewards rather than estimates, and that really does make all the difference.</p>
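<p>The correct backward recursion is only a few lines in code; a sketch, with names of my own choosing:</p>

```python
import numpy as np

def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Targets R_t for an N-step rollout, bootstrapping only from the
    value estimate of the state after the final step.

    bootstrap_value should be V(s_N), or 0.0 if s_N is terminal.
    """
    returns = np.zeros(len(rewards))
    running = bootstrap_value
    # Walk backwards through the batch: R[t] = r_t + gamma * R[t+1].
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```
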
<p>Below are two animated GIFs I made with my phone’s camera set to time-lapse visually explaining the difference.</p>
<p><img src="/assets/breakout/correct_rollout.gif" alt="Correct rollout" title="Correct rollout" /></p>
<p><img src="/assets/breakout/wrong_rollout.gif" alt="Wrong rollout" title="Wrong rollout" /></p>
<p>This is also a good time to note that the key difficulty of deep reinforcement learning is that these two methods, as well as many more recent methods like PPO and TRPO, all rely to some degree on using the network’s own estimates as part of the target to optimize towards. This is known as <em>bootstrapping</em> in the RL literature, from the 19th-century expression “to pull oneself up by one’s bootstraps,” meaning to do something impossible. Fitting, seeing as these deep models do just that: they successfully learn how to play a game using real experience combined with their own estimates, pulling on themselves to surmount a huge obstacle.</p>
<p>Contrast this with traditional supervised learning, where the target to train the network towards comes only from the labeled training data — MNIST or ImageNet would be a whole lot harder if networks were trained where half of the objective function is made up of the real label for an image, and half is made up of what the network thought the image was. It does seem quite impossible to bootstrap a model using its own output as a part of the target, but a really cool thing about reinforcement learning is that these methods actually work.</p>
<p>Some improvements I implemented on top of the OpenAI Gym Breakout environment: treating loss of life as the end of an episode rather than treating a whole game (5 lives) as one episode, repeating the first frame of an episode in the frame stack rather than using frames of all zeros, and pressing the “fire” button at the beginning of an episode to launch the ball.</p>
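<p>Two of these tweaks can be packaged as a wrapper around the environment. A sketch, assuming the classic Gym step/reset signature, an env that exposes a lives() count, and that action 1 is FIRE (as in Breakout’s action set); this is not my exact wrapper:</p>

```python
class EpisodicLifeFireReset:
    """Treats a lost life as episode end for training, without resetting
    the underlying game, and presses FIRE on reset to launch the ball."""

    def __init__(self, env, fire_action=1):
        self.env = env
        self.fire_action = fire_action
        self.lives = 0
        self.real_done = True

    def reset(self):
        # Only reset the underlying game when it actually ended;
        # otherwise continue the current game with one fewer life.
        if self.real_done:
            self.env.reset()
        # Launch the ball so the episode starts with play in progress.
        obs, _, done, _ = self.env.step(self.fire_action)
        if done:
            obs = self.env.reset()
        self.lives = self.env.lives()
        return obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.real_done = done
        lives = self.env.lives()
        # Losing a life ends the training episode early.
        if 0 < lives < self.lives:
            done = True
        self.lives = lives
        return obs, reward, done, info
```

<p>The frame-repetition trick lives in the frame-stacking code rather than here, since it concerns how observations are assembled, not how episodes end.</p>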
<p>A minor training issue I encountered: since the policy logits are at first very similar, putting them through a softmax and then sampling from the resulting distribution meant that the agent was following a more or less random policy, which made it impossible to learn from experience — any tiny changes to the network weights would just be drowned out by the random sampling. A probability distribution of 0.25/0.25/0.25/0.25 is not a whole lot different from 0.245/0.247/0.253/0.255 when you’re sampling from it. I also discovered that adding noise to the outputs to encourage exploration simply meant that the agent had a harder time following its policy, and the noise again drowned out the changes in the policy in the early episodes of learning, which are critical to bootstrapping. Taking the argmax of the outputted action probabilities was the way to go, since it offered the most consistency between the actor’s behavior and the network’s outputs — argmax is very sensitive to small changes when all the probabilities are very similar.</p>
<p><img src="/assets/breakout/bad_entropy.png" alt="Flatlined entropy" title="Flatlined entropy" /></p>
<p>Note that 1.38, the value at the flat line in the graph, is the entropy of the uniform probability distribution 0.25/0.25/0.25/0.25 (ln 4 ≈ 1.386).</p>
<p>This also had to do with the fact that our total loss, used to optimize both the actor and the critic, combined the actor loss, the critic loss, and a negative entropy term, which had the effect of pushing the policy action probabilities <em>closer</em> to a random policy: minimizing negative entropy means maximizing entropy, leading the network to be more “uncertain” about which action to take. While this may sound like a bad idea, it is actually necessary to prevent the algorithm from falling into some very easy local minima right off the bat by taking one action to the exclusion of all others, making it impossible to learn anything but that suboptimal behavior. For example, training without the entropy term, or with its sign flipped, made the Breakout agent move the paddle all the way to the right and do nothing but keep trying to move right.</p>
<p>Finally, after correcting that big misunderstanding, I found some sort of learning rate decay necessary to skirt the local minima of the objective function in the early stages of training. With a constant learning rate, the network would learn to hit the ball once or twice, perhaps even getting up to 30 reward or so, and then unlearn all of it and just move the paddle right. Learning rate decay makes later learning count for less than initial learning, which makes sense: games of Breakout all look about the same at the beginning, so we want to quickly learn the behavior of hitting the ball, but as games progress they look increasingly different, and we want the agent to learn just enough to keep hitting the ball without concluding that some particular configuration of the blocks means it should arbitrarily move left or right. Decaying the learning rate lets us take large steps initially to step over early local minima and smaller steps later on, once the algorithm is close to the true minimum.</p>
<p>I used a simple linear learning rate decay policy where the initial learning rate was decayed linearly over several million training iterations, but I wonder if different decay strategies like quadratic or exponential might make a difference in avoiding the sharp overfitting dropoff that we can see towards the end of training.</p>
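<p>The linear schedule itself is trivial; a sketch, where the initial rate and horizon are illustrative defaults rather than my exact values:</p>

```python
def linear_lr(step, initial_lr=7e-4, total_steps=4_000_000, min_lr=0.0):
    """Linearly anneal the learning rate from initial_lr down to min_lr
    over total_steps training iterations, then hold it at min_lr."""
    frac = min(step, total_steps) / total_steps
    return max(min_lr, initial_lr * (1.0 - frac))
```

<p>A quadratic or exponential variant only changes the <code>(1.0 - frac)</code> factor, so that comparison would be cheap to run.</p>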
<p>Some comments on the generated graphs: because the loss function slightly pushes entropy to be high to keep the network from prematurely preferring one action to the exclusion of all others, the entropy should remain fairly high and fairly constant, but it should certainly not flatline at 1.38, the value associated with a random policy. It is interesting to see how the losses are related to the episode length and the average reward; episode length and reward are very closely correlated, since longer games of Breakout mean higher scores. Also note that I am averaging rewards per episode over 100 episodes, which trades precision for a better look at the overall trend of learning – the reward obtained per episode usually has quite a high variance, so a higher average reward over 100 episodes really means the agent is consistently getting better. A more precise graph would probably average over 20 or 25 episodes.</p>
<p>Apologies for some of the graphs running over their axes — I have so far only run on my local machine but plan to run on cloud compute next.</p>
<h3 id="n--5">N = 5</h3>
<p><img src="/assets/breakout/5_entropy.png" alt="N = 5 entropy" style="width: 370px; float: left;" /></p>
<p><img src="/assets/breakout/5_losses.png" alt="N = 5 losses" style="width: 370px; float: left;" /></p>
<p><img src="/assets/breakout/5_episode_length.png" alt="N = 5 episode length" style="width: 370px; float: left;" /></p>
<p><img src="/assets/breakout/5_rewards.png" alt="N = 5 rewards" style="width: 370px; float: left;" /></p>
<p>I have not yet run the N = 5 case extensively, but in the 3000 or so episodes I did run it, it did not seem to learn anything. More details (# iterations, etc.) to come as I train this for longer. For now, these graphs provide a good look at what an agent that doesn’t learn anything looks like.</p>
<h3 id="n--20">N = 20</h3>
<p><img src="/assets/breakout/20_entropy.png" alt="N = 20 entropy" style="width: 370px; float: left;" /></p>
<p><img src="/assets/breakout/20_losses.png" alt="N = 20 losses" style="width: 370px; float: left;" /></p>
<p><img src="/assets/breakout/20_episode_length.png" alt="N = 20 episode length" style="width: 370px; float: left;" /></p>
<p><img src="/assets/breakout/20_rewards.png" alt="N = 20 rewards" style="width: 370px; float: left;" /></p>
<p>N = 20 was the first case to show promising results — it was able to get up to a max reward of 376 and a good average reward per 100 episodes, although it wasn’t quite able to get over 200 average reward per 100 episodes before the overfitting cliff hit, which was at around 9000 training episodes (4M iterations).</p>
<h3 id="n--50">N = 50</h3>
<p><img src="/assets/breakout/50_entropy.png" alt="N = 50 entropy" style="width: 370px; float: left;" /></p>
<p><img src="/assets/breakout/50_losses.png" alt="N = 50 losses" style="width: 370px; float: left;" /></p>
<p><img src="/assets/breakout/50_episode_length.png" alt="N = 50 episode_length" style="width: 370px; float: left;" /></p>
<p><img src="/assets/breakout/50_rewards.png" alt="N = 50 rewards" style="width: 370px; float: left;" /></p>
<p><img src="/assets/breakout/50_max_reward.png" alt="N = 50 max reward" style="width: 370px; float: center;" /></p>
<p>N = 50 performed even better than N = 20. Since I had begun graphing the max reward obtained so far, I could also see that it reached a somewhat, but not significantly, higher max reward of 397, though it took considerably longer to train (in terms of # of episodes — I’m not yet sure about # of iterations). N = 50 also had a policy that appeared more stable, likely because unrolling over more time steps trades training speed and immediate reward for a longer-term outlook, both in the agent and in training.</p>
<h3 id="n--100">N = 100</h3>
<p><img src="/assets/breakout/100_entropy.png" alt="N = 100 entropy" style="width: 370px; float: left;" /></p>
<p><img src="/assets/breakout/100_losses.png" alt="N = 100 losses" style="width: 370px; float: left;" /></p>
<p><img src="/assets/breakout/100_episode_length.png" alt="N = 100 episode length" style="width: 370px; float: left;" /></p>
<p><img src="/assets/breakout/100_rewards.png" alt="N = 100 rewards" style="width: 370px; float: left;" /></p>
<p><img src="/assets/breakout/100_max_reward.png" alt="N = 100 max reward" style="width: 370px; float: center;" /></p>
<p>N = 100 was the slowest-training agent I had run so far, but it certainly did a good job of learning how to play Breakout. 100 seems to be a particularly good number of steps: it is about how long it takes for the paddle to hit the ball, the ball to hit the bricks, and the reward to be issued, so each batch of 100 steps tends to include both the paddle actually hitting the ball and the resulting reward. The max reward achieved was 428, and the average reward per 100 episodes exceeded 200 towards the end of training.</p>
<p>At around 10400 episodes of training, the agent exhibits the advanced behavior of focusing on hitting the ball towards one side of the wall, thus making a tunnel to hit the ball through and scoring a huge reward when the ball repeatedly bounces off the far wall and the higher-valued bricks in the back.</p>
<p><img src="/assets/breakout/gifs/10400side_tunnel.gif" alt="10400 episodes, side tunnel" title="10400 episodes, side tunnel" /></p>
<p>Here are two video captures from 11400 and 11900 episodes of training where it digs a tunnel through the center as well as a tunnel through the side, and even catches the ball when it comes out one of the side tunnels despite having been hit through the center tunnel.</p>
<p><img src="/assets/breakout/gifs/11400center_tunnel.gif" alt="11400 episodes, center tunnel" title="11400 episodes, center tunnel" /></p>
<p><img src="/assets/breakout/gifs/11900both_tunnel.gif" alt="11900 episodes, both tunnels" title="11900 episodes, both tunnels" /></p>
<p>Finally, here are two video captures from 15500 and 17800 training episodes where the agent has more or less solved the game, hitting almost every brick on the screen.</p>
<p><img src="/assets/breakout/gifs/15500balanced_almost_complete.gif" alt="15500 episodes, balanced, almost complete" title="15500 episodes, balanced, almost complete" /></p>
<p><img src="/assets/breakout/gifs/17800consistent_almost_complete.gif" alt="17800 episodes, consistent, almost complete" title="17800 episodes, consistent, almost complete" /></p>
<p>Unfortunately, after a week of training on my laptop, this model too hit the overfitting cliff. Here are the graphs from the end of training:</p>
<p><img src="/assets/breakout/100_entropy_final.png" alt="Final N = 100 entropy" style="width: 370px; float: left;" /></p>
<p><img src="/assets/breakout/100_losses_final.png" alt="Final N = 100 losses" style="width: 370px; float: left;" /></p>
<p><img src="/assets/breakout/100_episode_length_final.png" alt="Final N = 100 episode length" style="width: 370px; float: left;" /></p>
<p><img src="/assets/breakout/100_rewards_final.png" alt="Final N = 100 rewards" style="width: 370px; float: left;" /></p>
<p><img src="/assets/breakout/100_max_reward_final.png" alt="Final N = 100 max reward" style="width: 370px; float: center;" /></p>
<p>And here’s a video of the final policy. Note that it does seem to have retained something, but the policy logits are outputting action probabilities that put all of the probability mass on the move-right action, which is usually what these learning algorithms resort to in this game when there’s a bug in the code or when the model is not complex enough to learn the game.</p>
<p><img src="/assets/breakout/gifs/100_final.gif" alt="24000 episodes, definitely overfit" title="24000 episodes, definitely overfit" /></p>
<h3 id="n--infinity">N = infinity</h3>
<p><img src="/assets/breakout/infty_entropy.png" alt="N = infinity entropy" style="width: 370px; float: left;" /></p>
<p><img src="/assets/breakout/infty_losses.png" alt="N = infinity losses" style="width: 370px; float: left;" /></p>
<p><img src="/assets/breakout/infty_episode_length.png" alt="N = infinity episode length" style="width: 370px; float: left;" /></p>
<p><img src="/assets/breakout/infty_rewards.png" alt="N = infinity rewards" style="width: 370px; float: left;" /></p>
<p>I found that N = infinity was not able to learn anything, most likely because the unrolling takes place over several hundred time steps and the rewards become too diluted to be useful for training. Also, if the only state-value estimate wrapped into the rollout is that of the terminal state, which is defined to be 0, then there is effectively no critic at all: the critic’s estimate never enters the training target and is always discarded. Even if it were run for a very long time, I doubt that it would be able to learn Breakout.</p>
<h3 id="reflection">Reflection</h3>
<p>There is also the very interesting steep dropoff towards the end of training when the agent seems to suddenly stop being able to play Breakout. From video capture, it seems as if the agent can still move the paddle to more or less the right place but can’t keep it there to hit the ball, instead moving it aside at the last moment. This likely starts a positive feedback loop: the agent repeatedly achieves very little reward with the weights that it learned, and unlearns how to play Breakout in a cascade of poor episodes caused by <em>just</em> missing the ball.</p>
<p><img src="/assets/breakout/gifs/moving_aside_1.gif" alt="Moving aside, part 1" title="Moving aside, part 1" /></p>
<p><img src="/assets/breakout/gifs/moving_aside_2.gif" alt="Moving aside, part 2" title="Moving aside, part 2" /></p>
<p>And eventually, it performs more or less like a random agent.</p>
<p><img src="/assets/breakout/gifs/20do_nothing.gif" alt="Doing nothing" title="Doing nothing" /></p>
<p>Here is a video capture of a lucky random policy, for comparison:</p>
<p><img src="/assets/breakout/gifs/random_policy.gif" alt="Random policy" title="Random policy" /></p>
<p>In any case, my agent was able to consistently achieve 200+ reward, which is considered to have “solved” Breakout. Certainly it matches, if not surpasses, human-level performance, and although a critical misunderstanding of the A2C algorithm took me two months to unravel, this was an extremely informative learning experience. Writing the code for the algorithm and the network was the easy part. The hard part was training and debugging. I was lucky in that respect — I found a working implementation of the algorithm that I could look at to see which features it had that my code didn’t, and then implemented them in my own code one by one.</p>
<p>Some very interesting questions that I would like to explore: why do smaller values of N even work, considering that the action that resulted in the paddle hitting the ball and the reward being issued may not even take place in the same N time steps? Particularly for N = 20 — how was it able to learn something when the reward definitely was not issued in the same batch as the action that led to the reward? Exactly how much of a role do the entropy and critic losses play? I used the canned coefficients of 0.5 for the critic loss and 0.1 for the entropy, but would the agent learn faster if the critic loss coefficient were increased, placing relatively more value on the quality of the network’s estimates, or if the entropy coefficient were increased (encouraging more evenly-distributed action probabilities) or decreased (encouraging more confident, distinct action probabilities)?</p>
<p>And the biggest question of all: what exactly is the cliff at the end of training? I have observed that the cliff happens when the softmax action probabilities converge to all zeros and a single one. It must be some sort of overfitting, but is it in the same vein as overfitting in supervised learning, or is it something different? It is a very sharp drop rather than a slow decline, which means that the agent was very good at playing the game before somewhat suddenly becoming very bad. Breakout is deterministic, which means that the loss of uncertainty would, if anything, be a good sign — likely, the wrong kernels/units are being overly emphasized, which leads to worse decisions.</p>
<p>An interesting hint is that the actor loss goes to zero (again, because the probability of choosing the chosen action becomes 1 and its log becomes 0) while the critic loss explodes to values 10+ digits long. This suggests that the value estimate for each state is exploding while the obtained rewards stagnate or drop sharply; since the critic loss is the squared difference of the two, the result is an extremely large loss, which is likely the reason for the agent’s quick decline in performance. It looks a lot like a case of exploding gradients, where the network’s state-value estimate runs off to infinity or negative infinity (likely the latter) and causes a positive feedback loop in which the loss and the gradients grow larger and larger.</p>
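<p>A standard mitigation for this failure mode, which I would try next, is clipping gradients by their global norm before applying them. A NumPy sketch of the operation itself; the 40.0 default is a common choice in A3C-era code, not a value I tuned:</p>

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=40.0):
    """Rescale a list of gradient arrays so their combined L2 norm is
    at most max_norm, capping any single update's size."""
    global_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if global_norm <= max_norm:
        return grads
    # Scale every gradient by the same factor so the direction of the
    # update is preserved while its magnitude is bounded.
    scale = max_norm / global_norm
    return [g * scale for g in grads]
```

<p>This would not fix a diverging value estimate by itself, but it would slow the positive feedback loop enough to see what the critic is doing.</p>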
<p>All in all, a very, very good learning experience. Who knew that reinforcement learning was so hard? :P</p>
<h1 id="rl-part-1-k-armed-bandits">Reinforcement Learning Part 1 - K-Armed Bandits</h1>
<p>2018-01-21 · <a href="http://blog.jzhanson.com/blog/rl/tutorial/2018/01/21/rl-1">http://blog.jzhanson.com/blog/rl/tutorial/2018/01/21/rl-1</a></p>
<script type="text/javascript" async="" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<p>This is the first in a series I’ll be doing on (deep) reinforcement learning where I’ll write about the topic and the interesting parts in a lightweight, easy-to-read format! A lot of this will be based on <a href="http://www.incompleteideas.net/book/bookdraft2017nov5.pdf">Sutton & Barto’s Reinforcement Learning book</a>, and this particular post will focus on Chapter 2 of that book. Send any comments or corrections to <a href="mailto:josh@jzhanson.com">josh@jzhanson.com</a>.</p>
<h2 id="the-bandit-problem">The Bandit Problem</h2>
<p>The first time I heard about the bandit problem, I had just entered Carnegie Mellon University’s School of Computer Science. I knew next to nothing about the broader field of computer science. After I emailed the dean, <a href="http://www.cs.cmu.edu/~awm/">Andrew Moore</a>, asking for a bit of advice on finding my life direction, he very kindly set aside a bit of time in his undoubtedly busy schedule to talk with me one-on-one. He spoke about the transition from high school to college, and how one’s vision should appropriately broaden. He spoke about finding your niche, where you fit in and who you fit in with. He spoke about taking what he called <em>technological risks</em> - when you don’t know if something is even possible, but, knowing that you’re surrounded by the best minds in the field, you have a good chance of making something that was previously impossible, possible.</p>
<p>On the topic of a life direction, he introduced to me the <em>bandit problem</em>, which goes as follows: say you have a slot machine in front of you which has two levers - in contrast to normal slot machines, which have one lever and are often called <em>one-armed bandits</em> on account of their one lever. Both levers of this two-armed bandit make the slot machine spin and output some reward, but they behave differently, so pulling one lever or the other results in different payouts. Of course, nothing is for certain, so maybe the first lever has a higher average payout than the second, or maybe the second has a higher chance of giving you nothing but also a higher chance of making you rich beyond your wildest dreams.</p>
<p>Unfortunately, you don’t know the statistical distributions of the payouts for each lever. But you want to get rich quick, and you only have enough money for, say, 100 lever pulls, so what do you do? One easy strategy is to pick a lever and keep pulling it. Maybe you’ll get lucky and pick the “better” lever, or maybe you’ll pick the “worse” one. If you wanted to be smarter about it, you would sacrifice some initial payout and give each lever a couple of pulls, just to see which one <em>seems</em> better, and once you had a good enough guess about which lever was better, spend the rest of your pulls on only that one. Hence, you spend some time in the <em>exploration</em> phase figuring out which lever is best, and you spend the rest of your time in the <em>exploitation</em> phase, pulling the same lever and getting as much money as you can.<sup><a href="#footnote1">1</a></sup></p>
<p>It is important to note that the tasks of <em>exploration</em> and <em>exploitation</em> are conflicting - your goal is to collect as much payout, or reward, as you can, and reward comes from exploitation. However, you might not know which strategy is best without exploration - trying out unknown strategies makes sure that you’re not missing a potential goldmine. You can’t do just one and not the other: only exploring won’t pay off much, and only exploiting might miss the best lever entirely. Balancing the trade-off between the two is one of the most important problems in reinforcement learning.<sup><a href="#footnote2">2</a></sup></p>
<p>What exactly is reinforcement learning? <em>Reinforcement learning</em> is how an <em>agent</em> learns, by itself and by trying out different actions, which actions to take in various situations in order to maximize a <em>reward</em>. In fact, a reinforcement learning system has four main parts: a <em>policy</em>, which defines what actions the agent should take in a given situation; a <em>reward signal</em>, which gives a numerical representation of how well the agent is doing at the task or its goal; a <em>value function</em>, which specifies favorable states (where the potential for reward is high) and unfavorable states; and, optionally, a <em>model</em> of the environment, which can range from very simple to very complex and is quite often intractable.</p>
<h2 id="definitions">Definitions</h2>
<p>Note: in this section, notation is kept consistent with Sutton & Barto’s formulations in Chapter 2 of <em>Reinforcement Learning, an Introduction</em>.</p>
<p>A <em>k-armed</em> bandit problem is defined as a situation where, at each <em>time step</em>, the agent has a choice from <em>k</em> different actions where each action results in a <em>reward</em> chosen from some unchanging probability distribution for that action. The agent aims to maximize the total reward gained over some fixed number of time steps, say, 100 or 1000. The analogy is to a bandit slot machine because each action can be likened to pulling a particular one out of the <em>k</em> levers of the slot machine and receiving the reward chosen from the appropriate distribution.</p>
<p>Let’s write this more formally - just like in deep learning, it is easy to read a lot of high-level discussion about reinforcement learning without really understanding anything - it is fairly simple, and writing the base formulations helps make it simple.</p>
<p>If we call the <em>value</em> of an action the mean reward when that action is taken - recall that the reward is sampled from a distribution and is rarely just a constant - and the action selected on time step <script type="math/tex">t</script> as <script type="math/tex">A_t</script> and the reward of that particular action as <script type="math/tex">R_t</script>, we can write the value of an action <script type="math/tex">a</script> as the expected reward if <script type="math/tex">a</script> is taken:</p>
<script type="math/tex; mode=display">q_* (a) = E[R_t \vert A_t = a]</script>
<p>However, because we don’t always know the <em>true</em> value of every action, we denote our best estimate of the value of action <script type="math/tex">a</script> as <script type="math/tex">Q_t(a)</script>.</p>
<p>There are a couple ways of estimating <script type="math/tex">Q_t(a)</script> - one of the most basic is using the <em>sample-average</em> method, which is simply summing up all the rewards received after performing action <script type="math/tex">a</script> and dividing by the number of times action <script type="math/tex">a</script> was taken prior to the current time step <script type="math/tex">t</script>.</p>
<script type="math/tex; mode=display">Q_t(a) = \frac{\sum_{i = 1}^{t - 1} R_i \cdot \textbf{1}_{A_i = a}}{\sum_{i = 1}^{t - 1} \textbf{1}_{A_i = a}}</script>
<p>where the bold <script type="math/tex">\textbf{1}</script> is an indicator variable that equals 1 if action <script type="math/tex">a</script> was taken on time step <script type="math/tex">i</script> and 0 otherwise - it just serves to make sure that we only count the rewards from the time steps when we actually took action <script type="math/tex">a</script>.</p>
<p>If we wish to do a <em>greedy</em> action selection (i.e. picking the immediate best action) we just take the max estimated reward over all our actions and pick that one and call it <script type="math/tex">A_t</script>.<sup><a href="#footnote3">3</a></sup></p>
<script type="math/tex; mode=display">A_t \leftarrow \text{argmax}_a Q_t (a)</script>
<p>We can begin, now, to formally mesh exploration and exploitation. We want to be exploiting most of the time, so let’s define a small probability <script type="math/tex">\varepsilon</script> that we explore and select a random action, and the rest of the time, we exploit (with probability <script type="math/tex">1-\varepsilon</script>) and select the action with the highest estimated reward. We call this type of exploration-exploitation balance <em><script type="math/tex">\varepsilon</script>-greedy</em> methods.</p>
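<p>As a small sketch of what this looks like in code (my own illustration - the function name and list-based estimates are made up, not from Sutton & Barto):</p>

```python
import random

def epsilon_greedy_action(Q, epsilon, rng=random):
    """With probability epsilon explore (a uniformly random arm);
    otherwise exploit the arm with the highest current estimate Q_t(a)."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q))
    # Greedy selection: argmax over the current value estimates.
    return max(range(len(Q)), key=lambda a: Q[a])
```

<p>Setting epsilon to 0 recovers the pure greedy rule above, and setting it to 1 gives pure exploration.</p>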
<h2 id="updating-with-previous-estimate">Updating with previous estimate</h2>
<p>Now that we’re keeping track of all our estimates for action values <script type="math/tex">Q_n</script> after we’ve selected a given action <script type="math/tex">n - 1</script> times, we can show that for any <script type="math/tex">n</script>, we can calculate <script type="math/tex">Q_{n+1}</script> at that step given only the current estimate <script type="math/tex">Q_n</script> and the current reward <script type="math/tex">R_n</script>, rather than with all the previous rewards:</p>
<script type="math/tex; mode=display">Q_n \stackrel{.}{=} \frac{R_1 + R_2 + \ldots + R_{n-1}}{n-1}</script>
<p>so</p>
<script type="math/tex; mode=display">Q_{n + 1} = \frac{1}{n} \sum_{i = 1}^n R_i</script>
<script type="math/tex; mode=display">= \frac{1}{n}(R_n + \sum_{i = 1}^{n - 1} R_i)</script>
<script type="math/tex; mode=display">= \frac{1}{n}(R_n + (n-1)\frac{1}{n-1}\sum_{i = 1}^{n - 1} R_i)</script>
<script type="math/tex; mode=display">= \frac{1}{n}(R_n + (n-1)Q_n)</script>
<script type="math/tex; mode=display">= \frac{1}{n}(R_n + nQ_n - Q_n)</script>
<script type="math/tex; mode=display">= Q_n + \frac{1}{n}(R_n - Q_n)</script>
<p>This means that to calculate our new estimate, we just need our current estimate and the current reward! It’s also worth noting that the last equation is of the form</p>
<script type="math/tex; mode=display">\text{New estimate} = \text{Old estimate} + \text{Step size} (\text{Target} - \text{Old estimate})</script>
<p>which intuitively makes sense - we want to be updating our estimate based off what our previous estimate was and how much the reality differs from our previous estimate, weighted by some learning factor.</p>
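<p>In code, the incremental update is a one-liner; here is a quick sketch (my own, with made-up names) checking that it reproduces the plain sample average exactly:</p>

```python
def update_estimate(q_old, reward, n):
    """New estimate = old estimate + (1/n) * (target - old estimate)."""
    return q_old + (reward - q_old) / n

# Feeding rewards in one at a time matches averaging them all at once.
rewards = [1.0, 3.0, 2.0, 6.0]
q = 0.0
for n, r in enumerate(rewards, start=1):
    q = update_estimate(q, r, n)
assert abs(q - sum(rewards) / len(rewards)) < 1e-12
```

<p>The advantage is constant memory: we never need to store the full reward history.</p>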
<h2 id="my-implementation">My implementation</h2>
<p>I’m working on my own basic implementation of <script type="math/tex">\varepsilon</script>-greedy methods on a 10-armed testbed where the true reward <script type="math/tex">q_*(a)</script> for each action is sampled from a normal distribution with mean 0 and variance 1, and the reward per action is sampled from a normal distribution with mean <script type="math/tex">q_*(a)</script> and variance 1. Stay tuned for results and my own plots - but in the meantime, Sutton & Barto have a good discussion of their sample results.</p>
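<p>For the curious, here is a minimal sketch of one run on such a testbed (my own code - the function name, seeding, and returning the fraction of optimal pulls are all my own choices, standard library only):</p>

```python
import random

def run_bandit(k=10, steps=1000, epsilon=0.1, seed=0):
    """One epsilon-greedy run on a k-armed Gaussian testbed.

    True values q*(a) ~ N(0, 1); pulling arm a pays N(q*(a), 1).
    Returns the fraction of pulls that chose the truly optimal arm.
    """
    rng = random.Random(seed)
    q_star = [rng.gauss(0, 1) for _ in range(k)]
    best = max(range(k), key=lambda a: q_star[a])
    Q = [0.0] * k              # sample-average estimates
    N = [0] * k                # pull counts per arm
    optimal_pulls = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                    # explore
        else:
            a = max(range(k), key=lambda i: Q[i])   # exploit
        r = rng.gauss(q_star[a], 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]   # incremental sample average
        optimal_pulls += (a == best)
    return optimal_pulls / steps
```

<p>Averaging this quantity over many independent runs gives learning curves like the ones Sutton & Barto show for different values of <script type="math/tex">\varepsilon</script>.</p>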
<hr />
<p><a name="footnote1">1</a>: Andrew Moore said that I was still in the exploration phase, where my goal was to figure out what I wanted to do with my life and what I liked doing - the exploitation phase came later, when I would work at it as hard as I could.</p>
<p><a name="footnote2">2</a>: Things get a bit more complicated once we make the payoffs for each lever change over time - what you thought was the optimal arm to pull might not be, after a while. But we’ll get into that later.</p>
<p><a name="footnote3">3</a>: I use the pseudocode arrow notation for assignment here while Sutton & Barto use the <script type="math/tex">\stackrel{.}{=}</script> notation to represent a definition</p>Algorithms - Selection2018-01-17T03:00:00+00:002018-01-17T03:00:00+00:00http://blog.jzhanson.com/blog/practice/code/algorithms/2018/01/17/algos-2<script type="text/javascript" async="" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<p>Welcome to the second of a series where I write a bit about an interesting algorithm I learned. Send comments or corrections to <a href="mailto:josh@jzhanson.com">josh@jzhanson.com</a>.</p>
<p>This week, we’ll be going over a problem similar to last week’s median of two sorted arrays - finding the kth-smallest element in an <em>unsorted</em> array! This problem is taken from the <a href="https://www.cs.cmu.edu/~15451/lectures/lec01-intro.pdf">first lecture</a> of <a href="https://www.cs.cmu.edu/~15451/">15-451 Algorithm Design and Analysis</a> at CMU this semester - which happened today. I thought the algorithms that were presented were cool and worth writing a post about.</p>
<p>Note: in this post, the algorithms will all be sequential - therefore, the work equals the span.</p>
<p>Second note: I’m considering whether or not to use LaTeX in some parts - it adds mathematical precision and rigor but it makes the tone of the post a little too formal.</p>
<h2 id="the-problem">The problem</h2>
<p>Let’s define terms first. Say we have a sorted array of elements, not necessarily consecutive. We define an element’s <em>rank</em> to be its position in the sorted array, starting from 1. For example, if we have the array [1, 3, 6, 7, 14, 20, …], the element 1 has rank 1, the element 3 has rank 2, the element 6 has rank 3, and so on.</p>
<p>Our problem: given an unsorted array A of length n and an integer k, find the kth-smallest element. Note that we can find the median of this unsorted array by taking the element with rank n/2 if n is even and (n+1)/2 if n is odd. Also, if the array is sorted, then the kth-smallest element is trivially the element at index k.</p>
<p>It is important to always precisely state the input and output of the problem - it helps understand what the problem is asking and prevent you from solving an adjacent but different problem.</p>
<p><strong>Input</strong>: An array A of n unsorted data elements with a total order (which just means that the elements can always be compared against each other and “greater” “less” and “equal” are defined), and an integer k in the range 1 to n, inclusive.</p>
<p><strong>Output</strong>: The element of A with rank k.</p>
<h2 id="algorithm-1-quick-select">Algorithm #1: Quick select</h2>
<p>If we look at the problem, we see that it bears some resemblance to quicksort - in fact, whenever the sorted-ness of an array is mentioned in a problem, a good starting point will be to think about different sorting algorithms - <em>selection sort</em>, <em>insertion sort</em>, <em>mergesort</em>, <em>quicksort</em>, and maybe <em>heap sort</em> or <em>radix/bucket sort</em> if you know extra information about the elements.</p>
<p>In particular, let’s think about quicksort, which is sequential - thinking about mergesort won’t go too far in this case, because after we split the array, we only care about the half that the median is in. In addition, we can’t make any assumptions about the elements in the subarrays after we split in mergesort, while in quicksort we know that the elements in each half of the array are less than the pivot element. We’ll be looking at <em>randomized</em> quicksort, which means that instead of always picking the “middle” index or the “first” index, we pick an element uniformly at random from the array to be the pivot.</p>
<p>Here’s the quicksort algorithm and pseudocode:</p>
<ol>
<li>
<p>Pick a pivot element x from the array uniformly at random.</p>
</li>
<li>
<p>Put elements that are <em>less than or equal to</em> x before it and elements that are <em>greater than</em> x after it. Let L be the subarray of elements before x and R be the subarray of elements after x.</p>
</li>
<li>
<p>Recursively call quicksort on L and R.</p>
</li>
</ol>
<p>Note that while quicksort (and the other algorithms presented in this post) work fine with duplicate elements, it simplifies our discussion a little to assume all elements in A are distinct.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def quicksort(A):
if |A| is 1:
return A
x = uniformly random element of A
L = all elements of A less than x
R = all elements of A greater than x
L' = quicksort(L)
R' = quicksort(R)
return L' + x + R'
</code></pre></div></div>
<p>The bars around an array stand for “length of” that array.</p>
<p>We can make an observation here that lets us adapt this algorithm for finding the kth element. <em>We actually know the lengths of L and R</em>. This means we can recursively call the algorithm on the subarray that the kth element falls in: if we recur into the left subarray we leave k as is, but if we recur into the right subarray we subtract the length of the left subarray, plus one for the pivot, from k.</p>
<p>Specifically, if there are k elements or more in L, we know the element of rank k lies in L. If there are fewer than k-1 elements in L, then the element of rank k lies in R. We can additionally say that if there are exactly k-1 elements in L, then x is the element of rank k and we’re done!</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> def quickselect(A, k):
if |A| = 1:
return A[1]
x = uniformly random element of A
L = all elements of A less than x
R = all elements of A greater than x
if |L| == k-1:
return x
else if |L| >= k:
return quickselect(L, k)
else:
return quickselect(R, k-|L|-1)
</code></pre></div></div>
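<p>For reference, here is a runnable Python version of the same algorithm (a sketch of my own - 0-indexed lists, assuming distinct elements):</p>

```python
import random

def quickselect(A, k):
    """Return the element of rank k (1-indexed) from a list of distinct elements."""
    if len(A) == 1:
        return A[0]
    x = random.choice(A)                # pivot chosen uniformly at random
    L = [e for e in A if e < x]
    R = [e for e in A if e > x]
    if len(L) == k - 1:
        return x                        # x is exactly the rank-k element
    elif len(L) >= k:
        return quickselect(L, k)
    else:
        # x and all of L are outranked: the target has rank k - |L| - 1 within R.
        return quickselect(R, k - len(L) - 1)
```

<p>For example, quickselect([5, 1, 4, 2, 3], 3) returns 3, the median.</p>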
<h3 id="runtime-analysis">Runtime analysis</h3>
<p>Let’s do some runtime analysis! Runtime in this context is number of comparisons. We aim to show the entire algorithm has expected runtime O(n).</p>
<p>Informally:</p>
<p>It takes linear O(n) time to construct L and R, since we have to walk through the array and put each element into either L or R. The recursive call is either on the larger side or the smaller side, but we can simplify our worst-case analysis by forcing the recursive call to always be on the larger half.</p>
<p>Because the pivot is chosen uniformly at random, each of the n elements is picked with probability 1/n, and depending on the pivot’s rank the larger side has size n-1, n-2, …, n/2, n/2, n/2 + 1, …, n-1 (note how the size of the larger half shrinks to n/2 and then grows again as the pivot’s rank passes the median). Combining this with the inductive hypothesis that the runtime is at most some constant d times n - which we assume to hold for all sizes less than n - gives the following.</p>
<p>Formally, we have the recurrence</p>
<script type="math/tex; mode=display">T(n) = cn + E[T(\text{larger side})], \; T(1) = 1</script>
<script type="math/tex; mode=display">= cn + \frac{1}{n} T(n-1) + \ldots + \frac{1}{n} T(\frac{n}{2}) + \frac{1}{n} T(\frac{n}{2}) + \ldots + \frac{1}{n} T(n - 1)</script>
<script type="math/tex; mode=display">= cn + \frac{2}{n} \sum_{i = \frac{n}{2}}^{n - 1} T(i)</script>
<script type="math/tex; mode=display">= cn + \frac{2}{n} (d(n - 1) + d(n - 2) + \ldots + d(\frac{n}{2}))</script>
<script type="math/tex; mode=display">= cn + \frac{3}{4} dn \leq dn \quad \text{if} \quad d = 4c</script>
<p>Of course, it is unlikely that writing and solving a recurrence will be required in anything other than an academic setting. Note also that our O(n) runtime holds <em>in expectation</em>, which means that we <em>could</em> see worse runtime (namely O(n<sup>2</sup>), if we consistently pick bad or worst-case pivots, just like in quicksort), but this is unlikely.</p>
<h2 id="algorithm-2-median-of-medians">Algorithm #2: Median of medians</h2>
<p>While the above quicksort-based method is most likely the one that will be expected in a programming interview, it is interesting to explore a rather elegant linear-time deterministic algorithm posed by <a href="https://amturing.acm.org/award_winners/blum_4659082.cfm">Manuel Blum</a> (Turing Award winner), <a href="https://amturing.acm.org/award_winners/floyd_3720707.cfm">Robert Floyd</a> (Turing Award Winner), <a href="https://en.wikipedia.org/wiki/Vaughan_Pratt">Vaughan Pratt</a> (helped found Sun Microsystems), <a href="https://amturing.acm.org/award_winners/rivest_1403005.cfm">Ronald Rivest</a> (Turing Award winner), and <a href="https://amturing.acm.org/award_winners/tarjan_1092048.cfm">Robert Tarjan</a> (Turing Award winner).</p>
<p>It goes like this:</p>
<ol>
<li>
<p>Break the input into groups of 5 elements. For example, the array [4, 3, 7, 5, 8, 1, 0, 2, 9, 6, …] would be broken up into [4, 3, 7, 5, 8], [1, 0, 2, 9, 6], and so on in linear time.</p>
</li>
<li>
<p>Find the median of each group in linear time - because finding the median of exactly five elements takes constant time.</p>
</li>
<li>
<p>Find the median of these medians recursively - let’s call it x. If we assume that the algorithm is indeed O(n), then this takes T(n/5).</p>
</li>
<li>
<p>Construct L from all elements less than x and R from all elements greater than x, just like in quicksort or quickselect. 1/2 of the groups of 5 will have medians less than x, and 1/2 of the groups of 5 will have medians greater than x. Within each group where the median is less than x, the two smallest elements are less than the median and are therefore less than x. Likewise, for each group of 5 where the median is greater than x, the two largest elements are greater than the median and are therefore greater than x. Therefore, at least 1/2 (groups less than x) * 3/5 (elements less than x per group of 5 - the 3 comes from the two elements less than the median and the median itself) = 3/10 of the total elements are less than x, and likewise 3/10 of the total elements are greater than x - see the below picture for the intuition behind this. <img src="/assets/prog-2/medians.jpg" alt="Median of medians" title="Median of medians" /> This means that the larger half of the array is <em>at most</em> 7/10 the size of the original array. Therefore, this step takes T(7n/10), if we simplify matters and always analyze the larger half of the array - it is worst-case analysis, after all.</p>
</li>
<li>
<p>Recursively call median of medians on the side of the array that the rank-k element lies in - again, if |L| >= k, then recur on L; if |L| = k - 1, then return x; and if |L| < k - 1, then recur on R, now looking for the element of rank k - |L| - 1.</p>
</li>
</ol>
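<p>The five steps above can be sketched in Python like so (my own translation, assuming distinct elements and 1-indexed ranks):</p>

```python
def median_of_medians_select(A, k):
    """Deterministic selection of the rank-k (1-indexed) element of a
    list of distinct elements, in worst-case O(n) comparisons."""
    if len(A) <= 5:
        return sorted(A)[k - 1]
    # Steps 1-2: medians of groups of 5 (sorting 5 elements is O(1) each).
    medians = [sorted(A[i:i + 5])[len(A[i:i + 5]) // 2]
               for i in range(0, len(A), 5)]
    # Step 3: median of the medians, found recursively - the pivot x.
    x = median_of_medians_select(medians, (len(medians) + 1) // 2)
    # Step 4: partition around x.
    L = [e for e in A if e < x]
    R = [e for e in A if e > x]
    # Step 5: recurse into the side containing rank k.
    if len(L) == k - 1:
        return x
    elif len(L) >= k:
        return median_of_medians_select(L, k)
    else:
        return median_of_medians_select(R, k - len(L) - 1)
```

<p>Note the two recursive calls: one on roughly n/5 medians to find the pivot, and one on at most roughly 7n/10 elements - exactly the shape of the recurrence analyzed below.</p>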
<h3 id="runtime-analysis-1">Runtime Analysis</h3>
<p>For the runtime analysis, it is a bit tricky to arrive at the desired O(n) bound without writing and solving the recurrence. But observe that each recursive step does O(n) work plus two recursive calls, T(n/5) and T(7n/10), and since n/5 + 7n/10 = 9n/10, the total input size decreases geometrically from one level of recursion to the next. The recurrence is therefore dominated by the work at the root, which is O(n).</p>
<p>Formally, we can draw a brick diagram of runtimes and then show, with the aid of an infinite sum, that geometrically decreasing runtime per step is effectively a constant.</p>
<p><img src="/assets/prog-2/brick.jpg" alt="Brick diagram" title="Brick diagram" /></p>
<script type="math/tex; mode=display">T(n) \leq cn (1 + \frac{9}{10} + (\frac{9}{10})^2 + (\frac{9}{10})^3 + \ldots)</script>
<script type="math/tex; mode=display">\text{Formula for geometric sum is} \quad \frac{1}{1 - a}, \quad \text{where} \quad a = \frac{9}{10}, \quad \text{so}</script>
<script type="math/tex; mode=display">T(n) \leq cn(10) \in O(n)</script>
<p>It is interesting to note that if we break the input into groups of three, we are unable to show the O(n) upper bound. The first recursive term in the recurrence becomes T(n/3) and the second becomes T(2n/3) - we can only guarantee that the median of medians is greater than 2n/6 = n/3 elements, so the larger side of the array can be as big as 2n/3. These terms sum to n, so each recursive step does <em>the same work</em> as the last; the recurrence is <em>balanced</em> rather than <em>root dominated</em>, which gives us O(n log n) runtime.</p>
<p><img src="/assets/prog-2/groups-3.jpg" alt="Groups of three" title="Groups of three" /></p>
<h2 id="conclusion">Conclusion</h2>
<p>In this post, we introduced the problem of selection and explained two algorithms that solve it: a randomized algorithm based on quicksort that finds the k-th element in O(n) expected work, and a deterministic algorithm that finds the k-th element in O(n) work always. We also did some runtime analysis with recurrences, a powerful tool to formally show tight runtime bounds for recursive algorithms that would be difficult or impossible to arrive at informally.</p>Deep Learning Part 2 - Restricted Boltzmann Machines and Feedforward Neural Networks2018-01-12T05:45:00+00:002018-01-12T05:45:00+00:00http://blog.jzhanson.com/blog/dl/tutorial/2018/01/12/dl-2<script type="text/javascript" async="" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<p>This is the second in a several-part series on the basics of deep learning, presented in an easy-to-read, lightweight format. <a href="/blog/dl/tutorial/2017/12/30/dl-1.html">Here</a> is a link to the first one. Previous experience with basic probability and matrix algebra will be helpful, but not required. Send any comments or corrections to <a href="mailto:josh@jzhanson.com">josh@jzhanson.com</a>.</p>
<p>Mathematically, Restricted Boltzmann Machines are derived from the Boltzmann distribution plus matrix algebra, which we’ll go over in this post. We’ll also use that as a bridge to connect to the basics of neural networks.</p>
<h2 id="the-boltzmann-distribution">The Boltzmann Distribution</h2>
<p>Let us first define <strong>x</strong> to be a vector of <em>n</em> outcomes, where each <em>x<sub>i</sub></em> can either be 0 or 1. Of course, each <em>x<sub>i</sub></em> can have a different probability of being 1. The probabilities can even be conditional, <em>a la</em> Markov Chains. But more on that later. In the previous post, we have usually thought of <em>x</em> as being a single random variable. Here, however, it is a vector of individual random variables. We are assuming the <strong>discrete</strong> case here, where each element of a vector can either be 0 or 1.</p>
<script type="math/tex; mode=display">% <![CDATA[
\textbf{x} = \begin{bmatrix} x_1 & x_2 & \ldots & x_n \end{bmatrix}, \: x_i \in \{0, 1\} %]]></script>
<p>With that definition out of the way, we can examine the Boltzmann distribution, invented by Ludwig Boltzmann, which models a bunch of things in physics, like how a hot object cools, or how energy dissipates into the environment. We have</p>
<script type="math/tex; mode=display">p(x) = \frac{1}{Z} \exp (-E(\textbf{x})), \: E(\textbf{x}) = - \textbf{x}^T \textbf{U} \textbf{x} - \textbf{b}^T \textbf{x}</script>
<p>Here, <em>Z</em> is the partition function, or normalizing constant, which makes sure that the distribution sums to one. Computing the partition function <em>Z</em> exactly is intractable in general, meaning it cannot be evaluated efficiently. This is not hard to believe, because <em>Z</em> requires summing over all configurations of <strong>x</strong>, and if <strong>x</strong> has <em>n</em> elements, then there are <em>2<sup>n</sup></em> possibilities.</p>
<p>The exp function raises the constant <em>e</em> to its argument - here, the negated <em>energy function</em>. Within the energy function, <strong>U</strong> is the matrix of weights that our variable <strong>x</strong> interacts with, and <strong>b</strong> is the vector of biases for each element of <strong>x</strong>. For now, let’s force <strong>U</strong> to be symmetric.</p>
<p>If we expand the first matrix multiplication term,</p>
<script type="math/tex; mode=display">% <![CDATA[
\textbf{x}^T \textbf{U} \textbf{x} =
\begin{bmatrix} x_1 & x_2 & \ldots & x_n \end{bmatrix}
\Bigg[ \textbf{u}_1 \quad \textbf{u}_2 \quad \ldots \quad \textbf{u}_n \Bigg]
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
= \begin{bmatrix} \textbf{x}^T \textbf{u}_1 & \textbf{x}^T \textbf{u}_2 & \ldots & \textbf{x}^T \textbf{u}_n \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} %]]></script>
<p>Which we observe is a scalar, since each <strong>x</strong><sup>T</sup><strong>u</strong><sub><em>i</em></sub> is a scalar.</p>
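<p>We can check both claims numerically - that the quadratic form collapses to a scalar, and that brute-forcing Z touches all 2<sup>n</sup> configurations, which is only feasible for tiny n. This is a numpy sketch of my own, with made-up sizes:</p>

```python
import itertools
import numpy as np

def energy(x, U, b):
    """E(x) = -x^T U x - b^T x for a binary state vector x."""
    return -x @ U @ x - b @ x

def partition_function(U, b):
    """Brute-force Z: sum exp(-E(x)) over all 2^n binary configurations."""
    n = len(b)
    return sum(np.exp(-energy(np.array(s, dtype=float), U, b))
               for s in itertools.product([0, 1], repeat=n))

rng = np.random.default_rng(0)
n = 5
A = rng.normal(size=(n, n))
U = (A + A.T) / 2                     # force U to be symmetric, as above
b = rng.normal(size=n)
Z = partition_function(U, b)          # 2^5 = 32 terms here; 2^100 is hopeless
x = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
p_x = np.exp(-energy(x, U, b)) / Z    # a proper probability in (0, 1)
```

<p>Summing p(<strong>x</strong>) over all 32 states gives 1, as a normalizing constant should guarantee.</p>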
<h2 id="rbms">RBMs</h2>
<p>To formally define a <strong>Restricted Boltzmann Machine</strong> (referred to as a <strong>RBM</strong>), we need to make a couple things clear. So far, we’ve thought of the input to the energy function, the vector <strong>x</strong>, as our observations or samples from the distribution. RBMs switch that up a little - they assume that the state vector <strong>x</strong> is composed of two parts: some number of <em>visible</em> variables <strong>v</strong>, and some number of <em>hidden</em> variables <strong>h</strong>.</p>
<script type="math/tex; mode=display">\textbf{x} = (\textbf{v}, \textbf{h})</script>
<p>Why do we explicitly split <strong>x</strong> into the visible and hidden variables? It turns out that modeling the interaction between visible and hidden variables is very powerful - in fact, by modeling these interactions and stacking RBMs, we can do a lot of cool things.</p>
<p>We can then rewrite the energy function:</p>
<script type="math/tex; mode=display">% <![CDATA[
E(\textbf{v}, \textbf{h}) = - \begin{bmatrix} \textbf{v}^T & \textbf{h}^T \end{bmatrix} \begin{bmatrix} \textbf{R} & \frac{1}{2}\textbf{W} \\ \frac{1}{2}\textbf{W}^T & \textbf{S} \end{bmatrix}
\begin{bmatrix} \textbf{v} \\ \textbf{h} \end{bmatrix}
- \begin{bmatrix} \textbf{b}^T & \textbf{a}^T \end{bmatrix}
\begin{bmatrix} \textbf{v} \\ \textbf{h} \end{bmatrix} %]]></script>
<p>Note that we have decomposed <strong>U</strong> into four blocks, which are themselves matrices and which we build out of matrices we name <strong>R</strong>, <strong>W</strong>, and <strong>S</strong>, and we have split the bias vector into <strong>b</strong>, the biases multiplied by <strong>v</strong>, and <strong>a</strong>, the biases multiplied by <strong>h</strong>. Because <strong>U</strong> is symmetric, the upper-right and lower-left blocks must be each other’s transpose. We name them <em>1/2</em> <strong>W</strong> instead of just <strong>W</strong> for reasons that will become clear once we expand the first matrix multiplication:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{bmatrix} \textbf{v}^T & \textbf{h}^T \end{bmatrix}
\begin{bmatrix} \textbf{R} & \frac{1}{2}\textbf{W} \\ \frac{1}{2}\textbf{W}^T & \textbf{S} \end{bmatrix}
\begin{bmatrix} \textbf{v} \\ \textbf{h} \end{bmatrix} %]]></script>
<script type="math/tex; mode=display">% <![CDATA[
= \begin{bmatrix} \textbf{v}^T \textbf{R} + \frac{1}{2} \textbf{h}^T \textbf{W}^T & \frac{1}{2} \textbf{v}^T \textbf{W} + \textbf{h}^T \textbf{S} \end{bmatrix}
\begin{bmatrix} \textbf{v} \\ \textbf{h} \end{bmatrix} %]]></script>
<script type="math/tex; mode=display">= \textbf{v}^T \textbf{R} \textbf{v} + \frac{1}{2} \textbf{h}^T \textbf{W}^T \textbf{v} + \frac{1}{2} \textbf{v}^T \textbf{W} \textbf{h} + \textbf{h}^T \textbf{S} \textbf{h}</script>
<p>and by applying the property of matrix multiplication that (<strong>AB</strong>)<sup>T</sup> = <strong>B</strong><sup>T</sup><strong>A</strong><sup>T</sup> on the second term, we have</p>
<script type="math/tex; mode=display">\textbf{h}^T \textbf{W}^T \textbf{v} = (\textbf{W} \textbf{h})^T \textbf{v} = [\textbf{v}^T (\textbf{W} \textbf{h})]^T = \textbf{v}^T \textbf{W} \textbf{h}</script>
<p>The last equality is because the triple matrix multiplication results in a scalar value and the transpose of a scalar value is the scalar value. Therefore,</p>
<script type="math/tex; mode=display">E(\textbf{v}, \textbf{h})= - (\textbf{v}^T \textbf{R} \textbf{v} + \textbf{v}^T \textbf{W} \textbf{h} + \textbf{h}^T \textbf{S} \textbf{h}) - (\textbf{b}^T \textbf{v} + \textbf{a}^T \textbf{h})</script>
<p>We can actually see that <strong>R</strong> models the interactions among visible variables and <strong>S</strong> models the interactions among hidden variables. If we ignore those two matrix multiplication terms and focus only on the interactions of visible variables with hidden variables, we have the modified energy function</p>
<script type="math/tex; mode=display">E(\textbf{v}, \textbf{h})= - \textbf{v}^T \textbf{W} \textbf{h} - \textbf{b}^T \textbf{v} - \textbf{a}^T \textbf{h}</script>
<p>which is the basis of a <strong>Restricted Boltzmann Machine</strong> - the difference between an RBM and a normal Boltzmann Machine is we forget about the visible-visible and hidden-hidden interactions and only concern ourselves with the visible-hidden interactions.</p>
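<p>As a quick sanity check (a numpy sketch of my own, with arbitrary small dimensions), the restricted energy agrees with the full quadratic form once <strong>R</strong> and <strong>S</strong> are zeroed out:</p>

```python
import numpy as np

def rbm_energy(v, h, W, b, a):
    """E(v, h) = -v^T W h - b^T v - a^T h."""
    return -v @ W @ h - b @ v - a @ h

rng = np.random.default_rng(1)
D, F = 4, 3
W = rng.normal(size=(D, F))
b = rng.normal(size=D)
a = rng.normal(size=F)
v = rng.integers(0, 2, size=D).astype(float)   # binary visible units
h = rng.integers(0, 2, size=F).astype(float)   # binary hidden units

# The same energy from the unrestricted quadratic form with R = S = 0:
# U = [[0, W/2], [W^T/2, 0]] acting on the stacked state x = (v, h).
U = np.block([[np.zeros((D, D)), W / 2],
              [W.T / 2, np.zeros((F, F))]])
x = np.concatenate([v, h])
bias = np.concatenate([b, a])
full_energy = -x @ U @ x - bias @ x
```

<p>The two agree because, as shown above, the two off-diagonal 1/2 <strong>W</strong> blocks combine into the single <strong>v</strong><sup>T</sup><strong>W</strong><strong>h</strong> term.</p>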
<h2 id="conditional-derivation">Conditional Derivation</h2>
<p>With our new energy function, we can write the joint distribution of <strong>v</strong> and <strong>h</strong> for a RBM. Here comes the really cool stuff.</p>
<script type="math/tex; mode=display">P(\textbf{v}, \textbf{h}; \theta) = \frac{1}{Z(\theta)} \exp (-E(\textbf{v}, \textbf{h}; \theta))
\quad \text{where} \quad Z(\theta) = \sum_\textbf{v} \sum_\textbf{h} \exp(-E(\textbf{v}, \textbf{h}; \theta))</script>
<p>The following derivation of the conditional distribution of <strong>h</strong> is an expansion of the derivation found in the first couple pages of <a href="https://tspace.library.utoronto.ca/handle/1807/19226">Ruslan Salakhutdinov’s PhD thesis</a>, so I use the same notation here, where <em>theta</em> is <strong>W</strong>, <strong>b</strong>, and <strong>a</strong>, and the semicolon stands for “given” or “dependent upon” while the commas denote parameters of the joint distribution.</p>
<p>Because we’re working in the discrete case, we say that <strong>v</strong> and <strong>h</strong> are <em>D</em> and <em>F</em> dimensional vectors, all of elements that can be either 0 or 1.</p>
<script type="math/tex; mode=display">\textbf{v} \in \{0, 1\}^D \quad \text{and} \quad \textbf{h} \in \{0, 1\}^F</script>
<p>We aim to find the conditional distribution of <strong>h</strong> given <strong>v</strong>, because that would allow us to model the distribution of the hidden variables given values of the visible variables. We can start by applying the definition of conditional probability to rewrite the conditional in terms of the joint, which we have above, and the marginal in the denominator, which we will proceed to derive.</p>
<script type="math/tex; mode=display">P(\textbf{h} \vert \textbf{v}; \theta) = \frac{P(\textbf{v}, \textbf{h}; \theta)}{P(\textbf{v}; \theta)}</script>
<p>To derive the marginal, we take the joint distribution on <strong>v</strong> and <strong>h</strong> and sum over all values of <strong>h</strong> and expand, replacing matrix multiplication terms with sigma notation.</p>
<script type="math/tex; mode=display">P(\textbf{v}; \theta) = \sum_h P(\textbf{v}, \textbf{h}; \theta) = \frac{1}{Z(\theta)} \sum_h \exp (-E(\textbf{v}, \textbf{h}; \theta))</script>
<script type="math/tex; mode=display">= \frac{1}{Z(\theta)} \sum_h \exp (-(- \textbf{v}^T \textbf{W} \textbf{h} - \textbf{b}^T \textbf{v} - \textbf{a}^T \textbf{h}))</script>
<script type="math/tex; mode=display">= \frac{1}{Z(\theta)} \sum_h \exp (\sum_{i = 1}^D \sum_{j = 1}^F v_i W_{ij} h_j + \sum_{i = 1}^D b_i v_i + \sum_{j = 1}^F a_j h_j)</script>
<p>We can bring the <em>b<sub>i</sub> v<sub>i</sub></em> term out of the exp - and out of the summation over <strong>h</strong>, since it doesn’t depend on <strong>h</strong> - as a separate factor, because <em>e<sup>a + b</sup> = e<sup>a</sup> e<sup>b</sup></em>.</p>
<script type="math/tex; mode=display">= \frac{1}{Z(\theta)} \cdot \exp(\sum_{i = 1}^D b_i v_i) \cdot \sum_h \exp (\sum_{i = 1}^D \sum_{j = 1}^F v_i W_{ij} h_j + \sum_{j = 1}^F a_j h_j)</script>
<p>We can also swap the double summations in the latter exp as well as pull out the <em>h<sub>j</sub></em>, because it only depends on <em>j</em> and not <em>i</em>, and then pull out the <em>j = 1</em> to <em>F</em> summation.</p>
<script type="math/tex; mode=display">= \frac{1}{Z(\theta)} \cdot \exp(\sum_{i = 1}^D b_i v_i) \cdot \sum_h \exp (\sum_{j = 1}^F ( \sum_{i = 1}^D v_i W_{ij}) h_j + \sum_{j = 1}^F a_j h_j)</script>
<script type="math/tex; mode=display">= \frac{1}{Z(\theta)} \cdot \exp(\sum_{i = 1}^D b_i v_i) \cdot \sum_h \exp \Big[ \sum_{j = 1}^F ( ( \sum_{i = 1}^D v_i W_{ij}) h_j + a_j h_j) \Big]</script>
<p>Just like we did above, we can use the fact that <em>e<sup>a + b</sup> = e<sup>a</sup> e<sup>b</sup></em> to pull out the <em>j = 1</em> to <em>F</em> summation out of the exp and turn it into a product.</p>
<script type="math/tex; mode=display">= \frac{1}{Z(\theta)} \cdot \exp(\sum_{i = 1}^D b_i v_i) \cdot \sum_h \prod_{j = 1}^F \exp ((\sum_{i = 1}^D v_i W_{ij}) h_j + a_j h_j)</script>
<p>Now it seems fairly intuitive that you can switch the product and the sum, especially if we remember that each <em>h<sub>j</sub></em> must be either 0 or 1. Indeed, if we simply take the two cases which <em>h<sub>j</sub></em> can be and plug in <em>h<sub>j</sub></em> = 0 (which cancels everything out and exp(0) = 1) and <em>h<sub>j</sub></em> = 1, we arrive at</p>
<script type="math/tex; mode=display">= \frac{1}{Z(\theta)} \cdot \exp(\sum_{i = 1}^D b_i v_i) \cdot \prod_{j = 1}^F \sum_{h_j \in \{0, 1 \}} \exp ((\sum_{i = 1}^D v_i W_{ij}) h_j + a_j h_j)</script>
<script type="math/tex; mode=display">= \frac{1}{Z(\theta)} \cdot \exp(\sum_{i = 1}^D b_i v_i) \cdot \prod_{j = 1}^F (1 + \exp (\sum_{i = 1}^D v_i W_{ij} + a_j))</script>
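<p>Before taking this step on faith, it’s easy to sanity-check numerically: for small <em>D</em> and <em>F</em> we can enumerate every binary <strong>h</strong> and compare the sum-of-products on the left with the product form on the right. A minimal sketch - the random <code>W</code>, <code>a</code>, and <code>v</code> are just made-up test data, not anything from the derivation:</p>

```python
import itertools
import math
import random

random.seed(0)
D, F = 3, 4  # small dimensions so we can enumerate all 2^F hidden vectors
W = [[random.gauss(0, 1) for _ in range(F)] for _ in range(D)]
a = [random.gauss(0, 1) for _ in range(F)]
v = [random.choice([0, 1]) for _ in range(D)]

# Left side: sum over all binary h of prod_j exp((sum_i v_i W_ij) h_j + a_j h_j)
lhs = 0.0
for h in itertools.product([0, 1], repeat=F):
    term = 1.0
    for j in range(F):
        term *= math.exp((sum(v[i] * W[i][j] for i in range(D)) + a[j]) * h[j])
    lhs += term

# Right side: prod_j (1 + exp(sum_i v_i W_ij + a_j))
rhs = 1.0
for j in range(F):
    rhs *= 1 + math.exp(sum(v[i] * W[i][j] for i in range(D)) + a[j])

assert abs(lhs - rhs) < 1e-6 * rhs  # equal up to floating-point error
```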
<p>If you’re willing to take this on faith, skip the next subheading and go to <a href="#plugging-in">Plugging in</a>. If you would like a detailed explanation of why this is true, read on!</p>
<h3 id="expansion-of-the-product-sum">Expansion of the product-sum</h3>
<p>To formally derive that</p>
<script type="math/tex; mode=display">\sum_h \prod_{j = 1}^F \exp ((\sum_{i = 1}^D v_i W_{ij}) h_j + a_j h_j)
= \prod_{j = 1}^F (1 + \exp (\sum_{i = 1}^D v_i W_{ij} + a_j))</script>
<p>Let’s define a function as follows:</p>
<script type="math/tex; mode=display">f(j, h_j; \theta) = \exp ((\sum_{i = 1}^D W_{ij} v_i) h_j + a_j h_j)</script>
<p>for our hidden variable vector,</p>
<script type="math/tex; mode=display">% <![CDATA[
\textbf{h} = \begin{bmatrix} h_1 & h_2 & \ldots & h_F \end{bmatrix}, h_j \in \{ 0, 1 \} %]]></script>
<p>Note that</p>
<script type="math/tex; mode=display">f(j, 0; \theta) = 1 \quad \text{and} \quad f(j, 1; \theta) = \exp (\sum_{i = 1}^D W_{ij} v_i + a_j) \quad \forall j</script>
<p>Therefore, the whole product is equal to evaluating the product on a subset of the terms where <script type="math/tex">h_j = 1</script>.</p>
<script type="math/tex; mode=display">\prod_{j = 1}^F f(j, h_j; \theta) = \prod_{j \in \{i_1, \ldots, i_k \}} f(j, 1; \theta) \quad \text{where} \quad h_j = 1, \: j \in \{ i_1, i_2, \ldots, i_k \}</script>
<p>We want to make statements and write equations about <em>all</em> vectors of this type. Any vector of this type has <script type="math/tex">k</script> ones, and because the vectors are <script type="math/tex">F</script>-dimensional, that means there are <script type="math/tex">F - k</script> zeroes. The ones can be distributed in any fashion - evidently, summation notation is insufficient, and adding combinations into the mix won’t strengthen the concept…how about we use an uppercase kappa, standing for “k-combinations of products” in the same vein as the uppercase sigma for sum and pi for product? Another option: lowercase nu, which looks like a <script type="math/tex">\nu</script>?</p>
<p>Hereafter, we denote “sum across all vectors <strong>h</strong> with dimension <em>F</em> and from <em>k</em> = 0 to <em>F</em> ones” as</p>
<script type="math/tex; mode=display">\underset{j \in \{i_1, \ldots, i_k \} }{K}</script>
<p>In any case, we can write that the latter portion of the equation up there with this new function <em>f</em> and our new notation as</p>
<script type="math/tex; mode=display">\underset{j \in \{i_1, \ldots, i_k \} }{K} f(j, h_j; \theta)</script>
<p>which sums, over all vectors <strong>h</strong> with 0 to <em>F</em> ones and zeroes elsewhere, the products of <script type="math/tex">f(j, h_j; \theta)</script>, where <em>j</em> is the vector element index and <em>h<sub>j</sub></em> is the element at that index - the product <script type="math/tex">\prod_{j = 1}^F</script> is included in the <em>kappa</em> notation.</p>
<p>To expand it and make it a little less abstract, we have</p>
<script type="math/tex; mode=display">= \big[ f(1, 0; \theta) f(2, 0; \theta) \ldots f(F, 0; \theta) \big]</script>
<script type="math/tex; mode=display">+ \big[ f(1, 1; \theta) f(2, 0; \theta) \ldots f(F, 0; \theta) + f(1, 0; \theta) f(2, 1; \theta) \ldots f(F, 0; \theta) + \ldots + f(1, 0; \theta) f(2, 0; \theta) \ldots f(F, 1; \theta) \big]</script>
<script type="math/tex; mode=display">+ \ldots</script>
<script type="math/tex; mode=display">+ \big[ f(1, 1; \theta) f(2, 1; \theta) \ldots f(F, 1; \theta) \big]</script>
<p>where each set of square brackets contains all vectors <strong>h</strong> with <em>k</em> = 0, <em>k</em> = 1, …, and <em>k</em> = <em>F</em> ones, respectively. There is one vector each for <em>k</em> = 0 and <em>k</em> = <em>F</em>, there are <em>F</em> vectors for <em>k</em> = 1, <em>F</em> choose two vectors for <em>k</em> = 2, and so on.</p>
<p>Now here’s our doozy: because all <script type="math/tex">f(j, 0; \theta)</script> turn into ones, we can actually factor the <em>entire expression</em> into</p>
<script type="math/tex; mode=display">= \prod_{j = 1}^F (1 + \exp (\sum_{i = 1}^D W_{ij} v_i + a_j))</script>
<p>It might be a bit easier to see with an example. Let’s factor the two dimensional case, <em>F</em> = 2 with the four vectors
<script type="math/tex">% <![CDATA[
\textbf{h} = \begin{bmatrix} 0 & 0 \end{bmatrix}, \begin{bmatrix} 0 & 1 \end{bmatrix} , \begin{bmatrix} 1 & 0 \end{bmatrix} , \begin{bmatrix} 1 & 1 \end{bmatrix} %]]></script></p>
<p>We have</p>
<script type="math/tex; mode=display">\underset{j \in \{i_1, i_2 \} }{K} f(j, h_j; \theta) = f(1, 0) f(2, 0) + \big[ f(1, 1) f(2, 0) + f(1, 0) f(2, 1) \big] + f(1, 1) f(2, 1)</script>
<script type="math/tex; mode=display">= 1 + 1 \cdot f(1, 1) + 1 \cdot f(2, 1) + f(1, 1) f(2, 1) = (1 + f(1, 1))(1 + f(2, 1)) = \prod_{j = 1}^2 (1 + f(j, 1))</script>
<script type="math/tex; mode=display">= \prod_{j = 1}^2 (1 + \exp (\sum_{i = 1}^D W_{ij} v_i + a_j))</script>
<p>which seems like a whole lot of ado for what could have been a simple expansion, but I found this to be a neat math trick :).</p>
<h3 id="plugging-in">Plugging in</h3>
<p>Now that we have expanded the marginal, note that the same algebraic steps (minus the summation over all <strong>h</strong>) let us similarly expand the joint distribution <script type="math/tex">P(\textbf{v}, \textbf{h}; \theta)</script> in the numerator.</p>
<script type="math/tex; mode=display">P(\textbf{h} \vert \textbf{v}; \theta) = \frac{P(\textbf{v}, \textbf{h}; \theta)}{P(\textbf{v}; \theta)} = \frac{\frac{1}{Z(\theta)} \exp (-E(\textbf{v}, \textbf{h}; \theta))}{P(\textbf{v}; \theta)}</script>
<script type="math/tex; mode=display">= \frac{\frac{1}{Z(\theta)} \exp (\sum_{i = 1}^D \sum_{j = 1}^F v_i W_{ij} h_j + \sum_{i = 1}^D b_i v_i + \sum_{j = 1}^F a_j h_j)}{P(\textbf{v}; \theta)}</script>
<script type="math/tex; mode=display">= \frac{\frac{1}{Z(\theta)} \exp (\sum_{i = 1}^D b_i v_i) \cdot \exp (\sum_{j = 1}^F \sum_{i = 1}^D v_i W_{ij} h_j + \sum_{j = 1}^F a_j h_j)}{P(\textbf{v}; \theta)}</script>
<script type="math/tex; mode=display">= \frac{\frac{1}{Z(\theta)} \exp (\sum_{i = 1}^D b_i v_i) \cdot \prod_{j = 1}^F \exp ((\sum_{i = 1}^D v_i W_{ij}) h_j + a_j h_j)}
{\frac{1}{Z(\theta)} \exp (\sum_{i = 1}^D b_i v_i) \cdot \prod_{j = 1}^F (1 + \exp(\sum_{i = 1}^D W_{ij} v_i + a_j))}</script>
<p>Cancelling terms and pulling out the product,</p>
<script type="math/tex; mode=display">= \prod_{j = 1}^F \frac{\exp ((\sum_{i = 1}^D v_i W_{ij}) h_j + a_j h_j)}{1 + \exp(\sum_{i = 1}^D W_{ij} v_i + a_j)}</script>
<p>which we can write as the element-wise conditional</p>
<script type="math/tex; mode=display">= \prod_{j = 1}^F P(h_j \vert \textbf{v}; \theta) \quad \text{where} \quad P(h_j \vert \textbf{v}; \theta) = \frac{\exp ((\sum_{i = 1}^D v_i W_{ij}) h_j + a_j h_j)}{1 + \exp(\sum_{i = 1}^D W_{ij} v_i + a_j)}</script>
<p>Now we make the step that takes the cake. We care about the conditional probability that <em>h<sub>j</sub></em> = 1, and when we set <em>h<sub>j</sub></em> = 1, we actually see that the distribution turns into the sigmoid function!</p>
<script type="math/tex; mode=display">P(h_j = 1 \vert \textbf{v}; \theta) = \sigma (\sum_{i = 1}^D W_{ij} v_i + a_j) \quad \text{where} \quad \sigma(x) = \frac{\exp (x)}{1 + \exp (x)}</script>
<p>And now we have shown a theoretical, mathematical basis for why the units in a neural network apply a nonlinearity - oftentimes the sigmoid function - as the activation function: it corresponds exactly to the conditional probability that the hidden variable is 1. What’s the sigmoid function dependent on? The sum of every visible variable - which can be 0 or 1 depending on whether each visible unit “fired” or not - times its appropriate weight, plus the bias for that hidden unit.</p>
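<p>To make that concrete, here’s a minimal sketch (the sizes and weights are arbitrary illustration, not data from the derivation) of computing <em>P(h<sub>j</sub> = 1 | <strong>v</strong>)</em> for every hidden unit - note that it is exactly one affine map followed by an element-wise sigmoid, i.e. one feedforward layer:</p>

```python
import math
import random

random.seed(1)
D, F = 5, 3  # visible and hidden dimensions, chosen arbitrarily
W = [[random.gauss(0, 0.5) for _ in range(F)] for _ in range(D)]
a = [0.0] * F            # hidden biases
v = [1, 0, 1, 1, 0]      # a binary visible vector

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# P(h_j = 1 | v) = sigmoid(sum_i W_ij v_i + a_j) for each hidden unit j
p_h = [sigmoid(sum(W[i][j] * v[i] for i in range(D)) + a[j]) for j in range(F)]

# A probabilistic RBM would sample h_j ~ Bernoulli(p_h[j]);
# a deterministic feedforward layer just uses p_h itself as the activation.
h_sample = [1 if random.random() < p else 0 for p in p_h]
```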
<p>Moreover, we’ve actually derived the architecture of vanilla neural networks from the mathematical structure of Restricted Boltzmann Machines: some number of visible units all feed into each hidden unit, the connections are multiplied by weights, a bias is added within each unit, and the sigmoid function is applied to determine whether the output of that unit will be 1 or 0. That is, whether the “neuron” will “fire” or not.</p>
<p><img src="/assets/dl-part-2/feedforward.png" alt="Feedforward neural network" title="Feedforward neural network" /></p>
<p>Thanks to <a href="http://madebyevan.com/fsm/">Evan Wallace’s Finite State Machine Designer</a>.</p>
<p>Most of these distributions in statistics and machine learning are taught because they <em>work</em> - the Boltzmann Distribution, for example, is notable because it does a good job of modeling natural phenomena. Many, many distributions and methods are lost because, while mathematically novel, they aren’t useful. The ones we do remember are the ones that work, the ones that fit phenomena or predict well.</p>
<p>The difference between RBMs and feedforward neural networks is that RBMs are a <em>probabilistic model</em> while feedforward neural networks are <em>deterministic</em>. We just take the mean of the conditional distribution <em>p(h<sub>j</sub> | <strong>v</strong>)</em> to get our deterministic neural networks. We can also go from discrete, where our inputs and outputs can only be 0 or 1, to continuous, where inputs and outputs can take any value from 0 to 1, but we have to add some restrictions and flip some signs around - the energy function has to have all its signs reversed and the weights matrix <strong>U</strong> has to be <em>positive definite</em> for the distribution to converge and integrate to 1.</p>
<p>Again, we have just shown that there’s a theoretical foundation for neural networks. It was actually this proof, combined with Hinton’s discovery that <a href="https://www.cs.toronto.edu/~hinton/absps/fastnc.pdf">stacking</a> <a href="https://arxiv.org/abs/1206.5533">RBMs</a> - in much the same fashion as we now stack layers of hidden units to form deep neural networks - yields promising results in feature extraction, discrimination/classification, object detection, and many other classes of tasks, that kicked off the boom in AI and deep learning we’re seeing now. We’ve just shown the basis of all that.</p>
<p>Pretty cool.</p>
<h1 id="algorithms-double-binary-search">Algorithms - Double Binary Search</h1>
<p>2018-01-08 - <a href="http://blog.jzhanson.com/blog/practice/code/2018/01/08/algos-1">http://blog.jzhanson.com/blog/practice/code/2018/01/08/algos-1</a></p>
<p>Welcome to the first of a series where I post a programming interview question and work through it, posting code and explanations of my approaches, pitfalls, and clever tricks! I may use different languages and compare the results if there are interesting or noteworthy differences, but I will generally use Python due to its brevity and ease of understanding. The focus here is on the algorithm, approaches, and clarity of code rather than any particular code finesse. Send comments or corrections to <a href="mailto:josh@jzhanson.com">josh@jzhanson.com</a>.</p>
<p>Note: <em>time</em> and <em>runtime</em> in the context of runtime analysis both mean <em>work</em>, which is how long the algorithm takes to execute on a single processor, i.e. sequentially, as opposed to <em>span</em>, which is how long the algorithm takes if we assume infinite processors - span is the longest single branch of the recurrence tree, that is, the most work that has to be done by any single processor among our infinite processors. If the wording is ever ambiguous, I mean <em>work</em>.</p>
<p>Second note: the diagrams are pictures I took with my phone of the diagrams drawn on paper - once I figure out a good diagramming software, I’ll probably replace the pictures. But having pictures of hand-drawn diagrams actually adds a bit of character and humanity to these posts, which I like :).</p>
<h1 id="double-binary-search">Double Binary Search</h1>
<p>or, Median of Two Sorted Arrays, or, kth-smallest</p>
<p>It’s trivial to find the median of a single sorted array A: just take the length of the array n and find A[n/2]. If you want to be fancy, you can find A[n/2] if the array is of odd length or the midpoint between or average of A[n/2 - 1] and A[n/2] if the array is of even length.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">median_simple</span><span class="p">(</span><span class="n">A</span><span class="p">):</span>
<span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">A</span><span class="p">)</span>
<span class="k">return</span> <span class="n">A</span><span class="p">[</span><span class="n">n</span><span class="o">//</span><span class="mi">2</span><span class="p">]</span>
<span class="k">def</span> <span class="nf">median_fancy</span><span class="p">(</span><span class="n">A</span><span class="p">):</span>
<span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">A</span><span class="p">)</span>
<span class="k">if</span> <span class="p">(</span><span class="n">n</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="p">):</span> <span class="c"># array length is even</span>
<span class="k">return</span> <span class="p">(</span><span class="n">A</span><span class="p">[</span><span class="n">n</span><span class="o">//</span><span class="mi">2</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">A</span><span class="p">[</span><span class="n">n</span><span class="o">//</span><span class="mi">2</span><span class="p">])</span> <span class="o">/</span> <span class="mi">2</span>
<span class="k">else</span><span class="p">:</span> <span class="c"># array length is odd</span>
<span class="k">return</span> <span class="n">A</span><span class="p">[</span><span class="n">n</span><span class="o">//</span><span class="mi">2</span><span class="p">]</span>
</code></pre></div></div>
<p>But what if you wanted to find the median of <em>two</em> sorted arrays? It might seem straightforward at first, especially if all the elements of one array are less than all the elements of another array, e.g. [1, 2, 5, 7] and [15, 21, 33], but what if the arrays overlap, or even share elements? How would we find the median of, say, [2, 5, 7, 8] and [0, 3, 4, 6, 7, 9]?</p>
<p>Word on the street is that this is an essential question to know for Google coding interviews, and by word on the street, I mean word straight from the mouth of Professor Guy Blelloch in the lecture of 15-210: Parallel and Sequential Data Structures and Algorithms at Carnegie Mellon University taught under the School of Computer Science undergraduate program…</p>
<h2 id="the-problem">The problem</h2>
<p>We define the <em>median element</em> of two or more arrays to be the median of the array formed when all the arrays are combined and sorted, preserving duplicates. For example, the median element of [2, 5, 7, 8] and [0, 3, 4, 6, 7, 9] would be 5. We define the <em>kth-smallest</em> element of two or more arrays to be the kth element (1-indexed) of the array formed when all the arrays are combined and sorted, preserving duplicates. Using the above example, the 1st-smallest element would be 0 and the 4th-smallest element would be 4.</p>
<ol>
<li>Given two sorted arrays <strong>of equal size</strong> A and B, find the median element.</li>
</ol>
<p><strong>Input</strong>: two sorted arrays of equal size A and B whose elements are integers (but can be any other element for which there exists a total ordering).</p>
<p><strong>Output</strong>: the median element of the array formed when both arrays are combined and sorted - if C is the sorted “union” preserving duplicates of A and B with length n, the median would be element n/2 of C (1-indexed, using integer division) if n is even and element n/2 + 1 if n is odd.</p>
<ol>
<li>Given two sorted arrays <strong>of unequal size</strong> A and B, find the median element.</li>
</ol>
<p><strong>Input</strong>: two sorted arrays of unequal size A and B whose elements are integers (but can be any other element for which there exists a total ordering).</p>
<p><strong>Output</strong>: the median element of the array formed when both arrays are combined and sorted - if C is the sorted “union” preserving duplicates of A and B with length n, the median would be element n/2 of C (1-indexed, using integer division) if n is even and element n/2 + 1 if n is odd.</p>
<ol>
<li>Given two sorted arrays <strong>of unequal size</strong> A and B and an integer k, where k <= |A| + |B|, find the kth-smallest element of the two arrays. We use the bars | to denote the size of an array or the length of a string, so |A| is the size of A.</li>
</ol>
<p><strong>Input</strong>: two sorted arrays of unequal size A and B whose elements are integers (but can be any other element for which there exists a total ordering).</p>
<p><strong>Output</strong>: the kth-smallest element of the array formed when both arrays are combined and sorted - if C is the sorted “union” preserving duplicates of A and B, the kth-smallest element would be the kth element of C (1-indexed).</p>
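<p>Taken literally, the definition gives a one-line reference implementation - combine, sort, and take the kth element (index k - 1 when 0-indexed). The helper name below is just for illustration, using the example arrays from above:</p>

```python
def kth_smallest_naive(A, B, k):
    # The kth element (1-indexed) of the sorted combination, duplicates kept
    return sorted(A + B)[k - 1]

A, B = [2, 5, 7, 8], [0, 3, 4, 6, 7, 9]
assert kth_smallest_naive(A, B, 1) == 0   # smallest element overall
assert kth_smallest_naive(A, B, 4) == 4   # matches the example in the text
```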
<h2 id="foray-1-brute-force">Foray #1: Brute force</h2>
<p>We will tackle 1 and 2 together while 3 is mostly left as an exercise.</p>
<p>A good place to start, in programming interviews, is always to talk through and explore the simplest, often brute force solution. It is almost never the correct solution, but doing so 1) prevents you from sitting there silently for several minutes thinking like a maniac and trying to come up with the perfect solution, 2) fills up the time and helps show your thought process to the interviewer, and 3) helps build intuition on the problem.</p>
<p>The simplest solution here is to combine both arrays into one big array, sort it, and then trivially find the median of that big array. Let n = |A| + |B|. If we use an implementation of arrays that allows appending in O(n) work and O(1) span and a decent (read: asymptotically optimal) sorting algorithm which runs in O(n log n) work and, if we’re picky about the parallelism of our algorithms, has O(log<sup>3</sup> n) span <em>cough</em> mergesort <em>cough</em>, then this gives us a total work of O(n log n) and a total span of O(log<sup>3</sup> n) span.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">double_median_naive</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">):</span>
<span class="n">C</span> <span class="o">=</span> <span class="n">A</span> <span class="o">+</span> <span class="n">B</span> <span class="c"># in Python, + appends two lists</span>
<span class="n">C</span><span class="o">.</span><span class="n">sort</span><span class="p">()</span> <span class="c"># Python's built-in list sort (Timsort) is mergesort-based</span>
<span class="k">return</span> <span class="n">C</span><span class="p">[</span><span class="nb">len</span><span class="p">(</span><span class="n">C</span><span class="p">)</span> <span class="o">//</span> <span class="mi">2</span><span class="p">]</span>
</code></pre></div></div>
<p>This isn’t optimal. Intuitively, it <em>feels</em> like we’re doing a lot more work than we need to; we’re merging and sorting both arrays when we really just need to determine the middle element. Also, do we really need to <em>sort</em> the big array again when both A and B are sorted?</p>
<p>I didn’t mention mergesort up there for nothing: if you’re sharp, then you read mergesort and immediately thought “<em>Why don’t we just merge A and B instead of appending and sorting?</em>”</p>
<p>We merge A and B <em>a la</em> mergesort by starting a pointer at the beginning of both arrays, comparing the element under the pointer in A with the element under the pointer in B, and advancing the pointer of whichever element is <strong>smaller</strong>. When we get to the n/2-th element, where n is the sum of the lengths of A and B, we return that one. If we do this, then we actually cut down our work to O(n). However, interestingly, our span becomes O(n). Here, we see the trade-off between work and span in action: algorithms can often become more parallel in exchange for doing more, sometimes repeated, work.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">double_median_merge</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">):</span>
<span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">A</span><span class="p">)</span> <span class="o">+</span> <span class="nb">len</span><span class="p">(</span><span class="n">B</span><span class="p">)</span>
<span class="k">if</span> <span class="p">(</span><span class="n">n</span> <span class="o">==</span> <span class="mi">0</span><span class="p">):</span> <span class="k">return</span> <span class="bp">None</span>
<span class="n">count_a</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">count_b</span> <span class="o">=</span> <span class="mi">0</span>
<span class="c"># upon loop termination, count_a and count_b will be on n/2-nd and n/2+1-st</span>
<span class="k">while</span> <span class="p">(</span><span class="n">count_a</span> <span class="o">+</span> <span class="n">count_b</span> <span class="o"><</span> <span class="n">n</span><span class="o">//</span><span class="mi">2</span> <span class="o">-</span> <span class="mi">1</span><span class="p">):</span>
<span class="k">if</span> <span class="p">(</span><span class="n">count_a</span> <span class="o">==</span> <span class="nb">len</span><span class="p">(</span><span class="n">A</span><span class="p">)):</span> <span class="c"># if at end of array A</span>
<span class="n">count_b</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">continue</span>
<span class="k">elif</span> <span class="p">(</span><span class="n">count_b</span> <span class="o">==</span> <span class="nb">len</span><span class="p">(</span><span class="n">B</span><span class="p">)):</span> <span class="c"># if at end of array B</span>
<span class="n">count_a</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">continue</span>
<span class="k">if</span> <span class="p">(</span><span class="n">A</span><span class="p">[</span><span class="n">count_a</span><span class="p">]</span> <span class="o"><</span> <span class="n">B</span><span class="p">[</span><span class="n">count_b</span><span class="p">]):</span>
<span class="n">count_a</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">count_b</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">if</span> <span class="p">(</span><span class="n">count_a</span> <span class="o">==</span> <span class="nb">len</span><span class="p">(</span><span class="n">A</span><span class="p">)):</span>
<span class="k">if</span> <span class="p">(</span><span class="n">n</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="p">):</span> <span class="c"># if even number of elements</span>
<span class="k">return</span> <span class="p">(</span><span class="n">B</span><span class="p">[</span><span class="n">count_b</span><span class="p">]</span> <span class="o">+</span> <span class="n">B</span><span class="p">[</span><span class="n">count_b</span> <span class="o">+</span> <span class="mi">1</span><span class="p">])</span> <span class="o">/</span> <span class="mi">2</span>
<span class="k">return</span> <span class="n">B</span><span class="p">[</span><span class="n">count_b</span><span class="p">]</span>
<span class="k">elif</span> <span class="p">(</span><span class="n">count_b</span> <span class="o">==</span> <span class="nb">len</span><span class="p">(</span><span class="n">B</span><span class="p">)):</span>
<span class="k">if</span> <span class="p">(</span><span class="n">n</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="p">):</span>
<span class="k">return</span> <span class="p">(</span><span class="n">A</span><span class="p">[</span><span class="n">count_a</span><span class="p">]</span> <span class="o">+</span> <span class="n">A</span><span class="p">[</span><span class="n">count_a</span> <span class="o">+</span> <span class="mi">1</span><span class="p">])</span> <span class="o">/</span> <span class="mi">2</span>
<span class="k">return</span> <span class="n">A</span><span class="p">[</span><span class="n">count_a</span><span class="p">]</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">if</span> <span class="p">(</span><span class="n">n</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="p">):</span>
<span class="k">return</span> <span class="p">(</span><span class="n">A</span><span class="p">[</span><span class="n">count_a</span><span class="p">]</span> <span class="o">+</span> <span class="n">B</span><span class="p">[</span><span class="n">count_b</span><span class="p">])</span> <span class="o">/</span> <span class="mi">2</span>
<span class="k">return</span> <span class="n">A</span><span class="p">[</span><span class="n">count_a</span><span class="p">]</span> <span class="k">if</span> <span class="n">A</span><span class="p">[</span><span class="n">count_a</span><span class="p">]</span> <span class="o"><</span> <span class="n">B</span><span class="p">[</span><span class="n">count_b</span><span class="p">]</span> <span class="k">else</span> <span class="n">B</span><span class="p">[</span><span class="n">count_b</span><span class="p">]</span>
</code></pre></div></div>
<p>Couple things to note here: we need to check for a couple edge cases, namely, what happens if one or both arrays are empty. Note that if one array is empty but not both, the <code class="highlighter-rouge">count_a == len(A)</code> or <code class="highlighter-rouge">count_b == len(B)</code> checks cover it, but we need to return <code class="highlighter-rouge">None</code> if both arrays are empty. We also have a slightly-awkward loop counter with <code class="highlighter-rouge">count_a + count_b < n/2 - 1</code>, which just ensures that, upon termination, <code class="highlighter-rouge">count_a</code> and <code class="highlighter-rouge">count_b</code> land on elements <code class="highlighter-rouge">n/2</code> and <code class="highlighter-rouge">n/2 + 1</code>, not necessarily in that order. Also, depending on how you in particular code it, you might have to worry about when the arrays are both 1 element.</p>
<p>This works both for when the arrays are equal size and when the arrays are unequal size.</p>
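<p>When testing a merge-based implementation like the one above, it helps to have an independent oracle. A quick hedged sketch: compute the median from the sorted concatenation. (Note this sketch returns the lower median element for even n, matching the worked examples in the problem statement, rather than averaging the two middle elements as the merge code does.)</p>

```python
def median_naive(A, B):
    # Oracle for testing: sort the concatenation, take the lower median.
    C = sorted(A + B)
    n = len(C)
    if n == 0:
        return None
    return C[(n - 1) // 2]  # lower median for even n, exact median for odd n

assert median_naive([2, 5, 7, 8], [0, 3, 4, 6, 7, 9]) == 5
assert median_naive([1, 2, 5, 7], [15, 21, 33]) == 7
```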
<h2 id="foray-2-divide-and-conquer">Foray #2: Divide and conquer</h2>
<p>Now the next step takes a bit of a mental leap. If we think about what we know about the problem, we want to find a specific element out of <strong>sorted</strong> arrays, except we’re not looking for the element by <em>id</em> but by <em>cardinality</em>, or <em>rank</em>. A good option to explore here, after hearing the words <em>sorted</em> and <em>find</em>, would be some sort of <strong>binary search</strong>; even just talking about it can show the interviewer that you’re on the right track and can prompt them to give you a hint to set you in the right direction. You could also arrive at the divide-and-conquer paradigm by going through the common algorithmic paradigms. For example, when I’m looking for some <em>smarter</em> algorithm, I first think to see if a greedy algorithm would work, then a divide-and-conquer one, then dynamic programming, then backtracking, and finally graph algorithms.</p>
<p>Anyways, to see how we can use binary search to find the median of two sorted arrays, let’s think about what binary search does. Binary search looks at the median of a single sorted array or subarray, compares it to the target element, and drops the lower half of the array if the target element is larger than the median - because the target will not occur in the lower half, where all elements are less than the median, which is less than the target - and symmetrically for if the target is lower than the median.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def binary_search(A, target):
    if len(A) == 0:
        return False
    mid = len(A) // 2  # integer division for the middle index
    if A[mid] == target:
        return True
    elif A[mid] < target:  # middle element is less than target
        return binary_search(A[mid+1:], target)  # search the elements after mid
    else:  # middle element is greater than target
        return binary_search(A[:mid], target)  # search the elements before mid
</code></pre></div></div>
<p>We’re comparing the median of the sorted array to something, and then dropping half of the array based on that…this is the part where you either have the flash of inspiration or your interviewer prods you to the flash of inspiration. <strong>What if we compare the medians of the two arrays?</strong></p>
<h3 id="equal-length">Equal length</h3>
<p>Let’s explore this, first if we assume the arrays are equal size. Simplifying assumptions are a great way to get a start on a problem and build intuition. If the arrays are equal size and we compare the medians, we have three cases:</p>
<ol>
<li>
<p>If the median of A is <strong>less</strong> than the median of B, then we know that the true median has to be in the second half of A, A<sub>R</sub> or the first half of B, B<sub>L</sub> inclusive of the sub-medians.</p>
</li>
<li>
<p>If the median of A is <strong>greater</strong> than the median of B, then we know that the true median has to be in the first half of A, A<sub>L</sub> or the second half of B, B<sub>R</sub>, inclusive of the sub-medians.</p>
</li>
<li>
<p>If the median of A is <strong>equal</strong> to the median of B, then our job just got a lot easier! The median is either one of those medians.</p>
</li>
</ol>
<p>The picture below should help illustrate the intuition behind these three cases.</p>
<p><img src="/assets/prog-1/equal-len.jpg" alt="Equal length" title="Equal length" /></p>
<p>Again, if it intuitively seems like we can immediately find the median of two equal length sorted arrays, take a moment to convince yourself why that isn’t true. Writing out a couple of examples might help.</p>
<h3 id="solution">Solution</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def double_binary_search_eq_len(A, B):
    n = len(A)  # both arrays have length n
    if n == 0:
        return None
    if n == 1:
        return (A[0] + B[0]) / 2
    elif n == 2:
        return (max(A[0], B[0]) + min(A[1], B[1])) / 2
    mid = (n - 1) // 2  # index of the lower median
    if A[mid] < B[mid]:
        # true median lies in A_R or B_L, inclusive of the sub-medians;
        # keep n - mid elements of each so both halves stay equal length
        return double_binary_search_eq_len(A[mid:], B[:n - mid])
    elif A[mid] > B[mid]:
        return double_binary_search_eq_len(A[:n - mid], B[mid:])
    else:
        # equal medians: for odd n, this value occupies both middle ranks
        if n % 2 == 1:
            return A[mid]
        return (A[mid] + min(A[mid + 1], B[mid + 1])) / 2
</code></pre></div></div>
<p>The first <code class="highlighter-rouge">if/elif</code> statements are the base cases. The length-2 case matters because with two-element arrays the “median” of each is always the first element, so without it we could get stuck in a loop where the arrays aren’t actually shortened at each step.</p>
<p>This takes O(log n) work and span: we chop off roughly half of the total input at each step, and because there is only one recursive call there is no parallelizability, so the span matches the work. (One caveat: Python slicing copies, so as written each step actually costs O(n); passing indices instead of slices recovers the true O(log n).)</p>
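<p>As a side note on the slicing cost, here is a sketch of an index-passing variant of the equal-length search (my own code and naming, using the lower-median index convention), which does O(1) work per step and so truly runs in O(log n):</p>

```python
def median_eq_len(A, B, lo_a=0, lo_b=0, n=None):
    """Median of two equal-length sorted arrays, recursing on window
    bounds instead of slices. Windows are A[lo_a:lo_a+n], B[lo_b:lo_b+n]."""
    if n is None:
        n = len(A)
    if n == 0:
        return None
    if n == 1:
        return (A[lo_a] + B[lo_b]) / 2
    if n == 2:
        return (max(A[lo_a], B[lo_b]) + min(A[lo_a + 1], B[lo_b + 1])) / 2
    mid = (n - 1) // 2  # lower-median index within each window
    if A[lo_a + mid] == B[lo_b + mid]:
        if n % 2 == 1:
            return A[lo_a + mid]  # same value fills both middle ranks
        return (A[lo_a + mid] + min(A[lo_a + mid + 1], B[lo_b + mid + 1])) / 2
    if A[lo_a + mid] < B[lo_b + mid]:
        # keep A's upper window and B's lower window, both of size n - mid
        return median_eq_len(A, B, lo_a + mid, lo_b, n - mid)
    else:
        # keep A's lower window and B's upper window
        return median_eq_len(A, B, lo_a, lo_b + mid, n - mid)
```
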
<h3 id="unequal-length">Unequal length</h3>
<p>Now let’s take this one step further. What if our two arrays A and B are of unequal length? Not that much about our algorithm actually changes. We still compare the medians of both arrays, but we have to make some different assumptions about how we can “chop” off parts of our arrays. However, we also have the information about the lengths of the arrays to help us out. Let’s also assume, for simplicity, that |A| < |B|. If A is longer than B, we can just swap the arrays - the logic is symmetric.</p>
<p>We again have a couple cases:</p>
<ol>
<li>
<p>If the median of A is greater than the median of B, then we can drop all of the second half of A, A<sub>R</sub>. Additionally, we can drop that many elements from the first half of B, B<sub>L</sub>, but we <strong>cannot always drop all of B<sub>L</sub></strong>.</p>
</li>
<li>
<p>Symmetrically, if the median of A is less than the median of B, then we can drop all of the first half of A, A<sub>L</sub>. Additionally, we can drop that many elements from the second half of B, B<sub>R</sub>.</p>
</li>
<li>
<p>If the median of A is equal to the median of B, then we can apply either of the above two cases; let’s just use the second one here. Note that you may want to make this a separate base case where you compare the medians, and perhaps also the elements neighboring the medians when the arrays are both even or both odd in length - for example, the medians of [0, 2, 4, 6, 8, 10] and [1, 2, 4, 6, 7, 9] are both 4, but the median of the merged arrays is 5.</p>
</li>
</ol>
<p><img src="/assets/prog-1/unequal-len.jpg" alt="Unequal length" title="Unequal length" /></p>
<p>Another reason that interviewers like this problem is that there are a <em>lot</em> of base cases to account for, especially with arrays of unequal length. We can reduce them by forcing A to be shorter than B, of course, but there are still a couple we have to account for.</p>
<h3 id="solution-1">Solution</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def double_binary_search(A, B):
    # assumes len(A) <= len(B); swap the arguments first if not
    if len(A) == 0:
        if len(B) == 0:
            return None
        if len(B) % 2 == 1:  # if A is empty, return the median of B
            return B[len(B) // 2]
        return (B[len(B) // 2 - 1] + B[len(B) // 2]) / 2
    if len(A) == 1 and len(B) == 1:
        return (A[0] + B[0]) / 2  # one element in both arrays
    if len(A) == 1:
        mid_b = len(B) // 2
        if len(B) % 2 == 0:
            # total length is odd: the median is the middle value among
            # B[mid_b - 1], A[0], and B[mid_b]
            if A[0] < B[mid_b - 1]:
                return B[mid_b - 1]
            elif A[0] > B[mid_b]:
                return B[mid_b]
            else:
                return A[0]
        else:
            # total length is even: average the two middle values
            if A[0] < B[mid_b - 1]:
                return (B[mid_b - 1] + B[mid_b]) / 2
            elif A[0] > B[mid_b + 1]:
                return (B[mid_b] + B[mid_b + 1]) / 2
            else:
                return (A[0] + B[mid_b]) / 2
    elif len(A) == 2:
        if len(B) == 2:
            return (max(A[0], B[0]) + min(A[1], B[1])) / 2
        elif len(B) % 2 == 0:
            pass  # ...
    else:
        mid_a = len(A) // 2
        mid_b = len(B) // 2
        if A[mid_a] > B[mid_b]:
            return double_binary_search(A[:mid_a + 1], B[mid_a:])
        else:
            return double_binary_search(A[mid_a:], B[:len(B) - mid_a])
</code></pre></div></div>
<p>There are a lot of base cases that don’t do much but get in the way of the core idea. The rest of the |A| = 2 cases are fairly similar to the first couple: they mostly boil down to examining cases and then finding the median of more than two elements.</p>
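<p>Given how many base cases there are, it helps to check any implementation against a brute-force reference on random inputs. Here is a sketch (my own helper names, not from any library):</p>

```python
import random

def median_brute_force(A, B):
    """Reference answer: merge, sort, and take the middle directly."""
    merged = sorted(A + B)
    n = len(merged)
    if n == 0:
        return None
    if n % 2 == 1:
        return merged[n // 2]
    return (merged[n // 2 - 1] + merged[n // 2]) / 2

def spot_check(candidate, trials=1000, equal_lengths=True):
    """Compare a candidate median function against brute force on
    random sorted inputs; raises AssertionError on any mismatch."""
    for _ in range(trials):
        n = random.randint(1, 20)
        m = n if equal_lengths else random.randint(1, 20)
        A = sorted(random.randint(0, 100) for _ in range(n))
        B = sorted(random.randint(0, 100) for _ in range(m))
        assert candidate(A, B) == median_brute_force(A, B), (A, B)
```
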
<h2 id="conclusion">Conclusion</h2>
<p>That’s the end of the very first algorithms post, and boy was it hefty, with over 2600 words. I hope this has been helpful - it was certainly helpful for me to get all my thoughts on this particular problem, which have been jangling around in my head for weeks now, down clearly. It’s definitely a work in progress - I intend to finish writing the code, test it thoroughly, and post a link to it on my GitHub. Send any comments or corrections to <a href="mailto:josh@jzhanson.com">josh@jzhanson.com</a>. Cheers!</p>Deep Learning Part 1 - Bayes’ Rule and Maximum Likelihood2017-12-30T19:45:00+00:002017-12-30T19:45:00+00:00http://blog.jzhanson.com/blog/dl/tutorial/2017/12/30/dl-1<script type="text/javascript" async="" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<p>This is the first in a several-part series on the basics of deep learning, presented in an easy-to-read, lightweight format. Previous experience with basic probability and matrix algebra will be helpful, but not required. Send any comments or corrections to <a href="mailto:josh@jzhanson.com">josh@jzhanson.com</a>.</p>
<h2 id="bayes-rule">Bayes’ Rule</h2>
<p>We begin our discussion with <strong>Bayes’ rule</strong>, an important result that captures the intuitive relationship between an event and prior knowledge we have of factors that might affect the probability of the event. Simply put, it formulates how event <em>B</em> affects the probability of event <em>A</em>. It forms the basis of Bayesian inference and Naive Bayes. Because it is a little difficult to grasp intuitively at first, let’s go over its derivation from the definition of <em>conditional probability</em>, which is easier to understand at first.</p>
<h3 id="conditional-probability">Conditional probability</h3>
<p>Conditional probability simply formulates the probability of event <em>A</em> happening <strong>given that</strong> event <em>B</em> happened.</p>
<script type="math/tex; mode=display">P(A \vert B) = \frac{P(A \cap B)}{P(B)} \text{ or, equivalently, } P(B \vert A) = \frac{P(B \cap A)}{P(A)}</script>
<p>The <em>P</em>s basically mean “probability of,” the vertical bar | on the left side simply means “given,” and the little upside-down u on the numerator of the right side means “and,” as in event <em>A</em> happening <em>and</em> event <em>B</em> happening.</p>
<p>What conditional probability is saying is that the probability of event <em>A</em> given event <em>B</em> is equal to the probability of event <em>A</em> and event <em>B</em> happening divided by the probability of event <em>B</em>. It’s a bit easier to see with a Venn diagram of probabilities.</p>
<p><img src="/assets/dl-part-1/conditional-2.png" alt="Conditional probability illustrated" title="Conditional probability illustrated" /></p>
<p>It is fairly clear that if we assume that event <em>B</em> happens and we wish to consider the probability of event <em>A</em> happening, then we only need to consider the probability space where <em>B</em> happens, that is, the right, darker circle <em>P(B)</em>. Within that circle, there’s the middle section, <em>P(A and B)</em>, which is how <em>A</em> can happen if we assume that <em>B</em> happens. So we can see that the probability of <em>A</em> given <em>B</em> is equal to the probability of <em>A</em> and <em>B</em> (how <em>A</em> can still happen given that <em>B</em> happens) divided by the total probability space under consideration, <em>P(B)</em>, because, again, we’re assuming that <em>B</em> happens.</p>
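<p>To make the formula concrete, here is a small worked example with numbers of our own choosing: rolling a fair six-sided die, with <em>A</em> = “roll at least a 4” and <em>B</em> = “roll an even number”:</p>

```python
from fractions import Fraction

# one roll of a fair six-sided die; each outcome has probability 1/6
outcomes = range(1, 7)
p = Fraction(1, 6)

p_b = sum(p for o in outcomes if o % 2 == 0)                    # P(B) = 1/2
p_a_and_b = sum(p for o in outcomes if o >= 4 and o % 2 == 0)   # P(A and B) = 2/6

# conditional probability: P(A | B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)  # 2/3
```

Knowing the roll is even shrinks the probability space to {2, 4, 6}, and two of those three outcomes are at least 4, which matches the formula’s answer.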
<script type="math/tex; mode=display">\implies P(A \vert B) P(B) = P(A \cap B) \text{ and } P(B \vert A) P(A) = P(B \cap A)</script>
<script type="math/tex; mode=display">\implies P(A \vert B) P(B) = P(B \vert A) P(A)</script>
<script type="math/tex; mode=display">\implies P(A \vert B) = \frac{P(B \vert A) P(A)}{P(B)}</script>
<p>We first multiply each formula by its denominator, then set the two right-hand sides equal, because “and” is commutative - <em>A</em> and <em>B</em> happening is the same as <em>B</em> and <em>A</em> happening - and finally divide <em>P(B)</em> over, assuming that probability is not zero, to derive Bayes’ Rule.</p>
<h3 id="generalizing-bayes-rule">Generalizing Bayes’ Rule</h3>
<script type="math/tex; mode=display">P(A \vert B) = \frac{P(B \vert A) P(A)}{P(B)}</script>
<p>We use the Law of Total Probability, which states that the probability of any event <em>A</em> is equal to the probability of <em>A</em> happening given that some event <em>B</em> happens, times the probability that <em>B</em> happens, plus the probability of <em>A</em> happening given that <em>B</em> does <em>not</em> happen, times the probability that <em>B</em> doesn’t happen. To refer to the diagram above, we’re basically saying that the probability of <em>A</em> is equal to the dark middle portion, <em>A</em> happening given <em>B</em> happening, plus the lightest shaded portion, <em>A</em> happening but <em>B</em> not happening. Notationally, the bar above the letter of an event just means the complement of that event - i.e. the event of that event not happening.</p>
<script type="math/tex; mode=display">P(A) = P(A \vert B) P(B) + P(A \vert \overline{B}) P(\overline{B})</script>
<p>Let’s use the example of flipping two coins, where we want to find the probability that the second one is heads. Then, we have</p>
<script type="math/tex; mode=display">P(\text{second coin is heads}) = P(\text{second coin is heads } \vert \text{ first coin is heads}) P(\text{first coin is heads})</script>
<script type="math/tex; mode=display">+ P(\text{second coin is heads } \vert \text{ first coin is not heads}) P(\text{first coin is not heads})</script>
<p>We rewrite Bayes’ rule as follows using the Law of Total Probability, replacing the denominator:</p>
<script type="math/tex; mode=display">P(A | B) = \frac{P(B | A) P(A)}{P(B | A) P(A) + P(B | \overline{A}) P(\overline{A})}</script>
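<p>As a quick numerical sanity check of this expanded form, with made-up numbers: suppose a condition affects 1% of the population, and a test detects it 99% of the time but also returns a false positive on healthy people 5% of the time. Letting <em>A</em> be “has the condition” and <em>B</em> be “tests positive”:</p>

```python
p_a = 0.01                      # P(A): prevalence of the condition
p_b_given_a = 0.99              # P(B | A): true positive rate
p_b_given_not_a = 0.05          # P(B | not A): false positive rate

# denominator: Law of Total Probability gives P(B), the overall positive rate
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' rule: P(A | B) = P(B | A) P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # prints 0.167
```

Even with an accurate test, a positive result only implies about a one-in-six chance of actually having the rare condition - the prior dominates.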
<p>This is the two-variable case, but it is not difficult to see that it generalizes to any finite number of events, as long as those events <em>partition</em> the <em>sample space</em>, which means that exactly one of them <em>must</em> happen. So, instead of just having two outcomes, <em>B</em> or <em>not B</em>, we have several. For example, the events of getting a one, a two, a three, a four, a five, or a six when rolling a die partition the sample space, because exactly one must happen when you roll the die! The takeaway is that we can write, in the general case with multiple events <em>B<sub>1</sub></em>, <em>B<sub>2</sub></em>, …, <em>B<sub>n</sub></em>, that</p>
<script type="math/tex; mode=display">P(B_i | A) = \frac{P(A | B_i) P(B_i)}{P(A | B_1)P(B_1) + \ldots + P(A | B_n) P(B_n)}</script>
<script type="math/tex; mode=display">= \frac{ P(A | B_i) P(B_i)}{\sum^n_{j = 1} P(A | B_j)P(B_j)}</script>
<p>Now if we leave behind <em>discrete</em> probability and move to <em>continuous</em> probability, not too much changes: we switch the summation to an integral and swap around some function notation, which we will introduce here. Note that the lowercase <em>p</em>s and <em>f</em>s mean more or less the same thing as the uppercase <em>P</em>s - they stand for the probability mass or probability density functions for discrete and continuous random variables, respectively. We usually use Greek letters, like <em>theta</em>, to stand for <em>hypotheses</em>, or unknown parameters, and little English letters, like <em>x</em>, to represent observations, or data values. Don’t worry too much about why there’s a <em>p</em> here or an <em>f</em> there - it’s just to make a distinction between <em>marginal</em> and <em>conditional</em> or <em>joint</em> distributions. Elsewhere, the notation may vary.</p>
<script type="math/tex; mode=display">p(\theta \:| \: x) = \frac{f(x \: | \: \theta) \, p(\theta)}{\int f(x \: | \: \theta) \, p(\theta) \, d\theta}</script>
<p>In the context of machine learning, <em>x</em> is the <em>observation</em> - what we sample from some unknown distribution that we want to <em>model</em>. Theta is the unknown parameter that our distribution depends upon, representing our hypothesis about the random variable under observation. Once we know theta, we can easily generate new observations to form predictions about that random variable. This is why we want to estimate theta as best we can, so that we get good predictions from the distribution. In fact, each term in the above equation has a name.</p>
<p>The numerator of the right side has <em>f(x | theta)</em>, which we refer to as the <em>likelihood</em>, because it’s the likelihood that we observe <em>x</em> if we fix some parameter value <em>theta</em>. We also have a <em>p(theta)</em>, which we call the <em>prior</em>, because it usually represents our prior knowledge of <em>theta</em> and how it’s distributed - we have some prior knowledge of how theta behaves and which values it’s likely to take. In the denominator of the right side, we have an integral over all values of <em>theta</em> of the likelihood times the prior, which we can see is just the Law of Total Probability generalized to the continuous case. We refer to this as the <em>evidence</em>, because it’s what we know about the conditional distribution, <em>f(x | theta)</em>, and the prior, <em>p(theta)</em>. We can also call the denominator the <em>marginal</em>, because when we integrate across all values of <em>theta</em>, the denominator becomes a function of <em>x</em> only, <em>p(x)</em>, which is the <em>total probability</em>. Finally, we call the <em>p(theta | x)</em> on the left side of the equation the <em>posterior</em> distribution, because it’s the distribution we can infer after we combine the information we have from the <em>likelihood</em> and the <em>prior</em> and apply Bayes’ Rule. We can rewrite this, with words, as</p>
<script type="math/tex; mode=display">\textbf{posterior} = \frac{\textbf{likelihood} \times \textbf{prior}}{\textbf{evidence}}</script>
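<p>A minimal numerical sketch of this continuous version, assuming NumPy: we approximate the evidence integral with a Riemann sum over a grid of theta values. The flat prior and the data (7 heads in 10 coin flips) are illustrative choices, not anything prescribed above:</p>

```python
import numpy as np

# Grid approximation of a continuous posterior (all choices illustrative):
# theta is an unknown probability of heads; the data x is 7 heads in 10 flips.
thetas = np.linspace(0.001, 0.999, 999)      # grid of candidate theta values
prior = np.ones_like(thetas)                 # flat prior p(theta)
likelihood = thetas**7 * (1 - thetas)**3     # f(x | theta), product of Bernoullis

# Evidence: the integral of likelihood * prior, via a Riemann sum on the grid
dtheta = thetas[1] - thetas[0]
evidence = np.sum(likelihood * prior) * dtheta

posterior = likelihood * prior / evidence    # p(theta | x) on the grid
print(thetas[np.argmax(posterior)])          # the posterior peaks at theta = 0.7
```

<p>Fancier models replace the grid with smarter integration or sampling, but the structure - likelihood times prior, normalized by the evidence - stays the same.</p>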
<p>Note that we can easily replace the single value <em>x</em> with a bolded <strong>x</strong>, representing a vector of multiple values.</p>
<script type="math/tex; mode=display">f(x_1, x_2, \ldots, x_n, \theta) = f(\textbf{x}, \theta)</script>
<h3 id="chain-rule-for-conditional-probability">Chain rule for conditional probability</h3>
<p>In a nutshell, the chain rule for conditional probability states that the probability of a bunch of things all happening is the probability of one of the things happening <em>given</em> the other things <em>happen</em> times the probability of all the other things happening.</p>
<script type="math/tex; mode=display">P(A_1 \cap A_2 \cap \ldots \cap A_n)</script>
<script type="math/tex; mode=display">= p(A_1, A_2, \ldots, A_n) = p(A_1 | A_2, \ldots, A_n) \times p(A_2, \ldots, A_n)</script>
<p>The first line of the above is just to illustrate the change in notation, from the “cap” notation earlier to using commas to denote events all happening. We can repeatedly apply the chain rule, giving us</p>
<script type="math/tex; mode=display">= p(A_1 | A_2, \ldots, A_n) \times p(A_2 | A_3, \ldots, A_n) \times p(A_3, \ldots, A_n)</script>
<script type="math/tex; mode=display">= ...</script>
<script type="math/tex; mode=display">= p(A_1 | A_2, \ldots, A_n) \times p(A_2 | A_3, \ldots, A_n) \times \ldots \times p(A_{n-1} | A_n) \times p(A_n)</script>
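<p>We can sanity-check this expansion numerically. The sketch below builds an arbitrary joint distribution over three binary events and verifies that the product of conditionals recovers the joint; all the probabilities are randomly generated for illustration:</p>

```python
import itertools
import random

# Numerical check of the chain rule on a made-up joint distribution
# over three binary events A1, A2, A3.
random.seed(0)
outcomes = list(itertools.product([0, 1], repeat=3))
weights = [random.random() for _ in outcomes]
joint = {o: w / sum(weights) for o, w in zip(outcomes, weights)}  # p(A1, A2, A3)

def marginal(fixed):
    """Sum out every coordinate not pinned down in `fixed` (index -> value)."""
    return sum(p for o, p in joint.items()
               if all(o[i] == v for i, v in fixed.items()))

# Chain rule: p(A1, A2, A3) = p(A1 | A2, A3) * p(A2 | A3) * p(A3)
lhs = joint[(1, 1, 1)]
rhs = (joint[(1, 1, 1)] / marginal({1: 1, 2: 1})    # p(A1=1 | A2=1, A3=1)
       * marginal({1: 1, 2: 1}) / marginal({2: 1})  # p(A2=1 | A3=1)
       * marginal({2: 1}))                          # p(A3=1)
print(abs(lhs - rhs) < 1e-12)  # True - the product of conditionals telescopes
```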
<h3 id="likelihood-functions">Likelihood functions</h3>
<p>Now if we sample, say, <em>n</em> samples from our unknown distribution, and the assumption here is that the samples are independent, then what we can do is if we know the likelihood function <em>f(x<sub>i</sub> | theta)</em> and we want to find the probability that <em>theta</em> is a particular value given all our sampled data, we can repeatedly apply the chain rule of probabilities, replacing <em>p</em> with <em>f</em> since we are often dealing with continuous rather than discrete data:</p>
<script type="math/tex; mode=display">f(x_1, x_2, \ldots, x_n | \theta) = f(x_1 | x_2, \ldots, x_n, \theta) \times f(x_2, \ldots, x_n | \theta)</script>
<script type="math/tex; mode=display">= \ldots</script>
<script type="math/tex; mode=display">= f(x_1 | x_2, \ldots, x_n, \theta) \times f(x_2 | x_3, \ldots, x_n, \theta) \times \ldots \times f(x_n | \theta)</script>
<p>Note that we don’t end up with an <em>f(theta)</em> factor at the end, despite the chain rule expansion, because we condition on <em>theta</em> throughout - <em>theta</em> is the parameter we hold fixed, not one of the events we expand.</p>
<p>And finally, because we assume each <em>x<sub>i</sub></em> is independent, we can drop all the other <em>x<sub>j</sub></em> terms from each conditional probability distribution. This is because they’re independent - i.e. the probability of <em>x<sub>i</sub></em> being what it is does not at all depend on what value any other <em>x<sub>j</sub></em> takes. This means that we have</p>
<script type="math/tex; mode=display">= f(x_1 | \theta) \times f(x_2 | \theta) \times \ldots \times f(x_n | \theta)</script>
<script type="math/tex; mode=display">= \prod^n_{i = 1} f(x_i | \theta) = L(\theta | x_1, \ldots, x_n)</script>
<p>which we call the <em>likelihood</em> function. Note that because the <em>training data</em>, or <em>features</em>, we observed, <em>x<sub>1</sub>, …, x<sub>n</sub></em>, are fixed, the likelihood function is a function only of <em>theta</em>, the unknown parameter upon which our mystery distribution depends. In fact, it is exactly the probability that we observe what we observed, <em>x<sub>1</sub>, …, x<sub>n</sub></em>, given that value of <em>theta</em>. In other words, you can give me a value for <em>theta</em>, and I can use this likelihood function to tell you how likely it is that we get the training data <em>x<sub>1</sub>, …, x<sub>n</sub></em>.</p>
<p>Note also that I don’t require a concrete value for <em>theta</em> to construct the likelihood function - I only need some training data <em>x<sub>1</sub>, …, x<sub>n</sub></em>. So, if I wanted to model a particular, unknown, <em>black-box</em> distribution, I sample <em>n</em> samples from it, which I call my training data. I use this training data and the chain rule of conditional probability to construct my likelihood function. I then try to <em>maximize</em> that likelihood function with respect to <em>theta</em>. That is, I try to find the value of theta that gives me the highest likelihood for my observation.</p>
<script type="math/tex; mode=display">\text{argmax}_{\theta \in \Theta} L(\theta | x_1, \ldots, x_n) = \hat{\theta}_{MLE}</script>
<p>We call the theta that gives us the highest probability from our likelihood function <em>theta-hat</em> - there exist more formal terms for it, but the caret that signifies a best-guess estimate looks like a hat. This is known as the <em>maximum likelihood estimator</em>, hence the subscript MLE.</p>
<p>We make the distinction that the <em>estimator</em> is the function itself, and the <em>estimate</em> is the <em>estimator</em> evaluated with some observation.</p>
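<p>As a concrete sketch of computing such an estimate, the code below maximizes the log of the likelihood (which has the same argmax, and avoids numerical underflow) over a grid of candidate theta values, for a Bernoulli model with made-up coin-flip data:</p>

```python
import numpy as np

# Maximum likelihood estimation for a Bernoulli parameter, illustrative data.
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # 6 heads in 8 flips

def log_likelihood(theta):
    # log L(theta | x) = sum_i log f(x_i | theta); the sum of logs replaces
    # the product of densities, so tiny probabilities don't underflow
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

# Grid search for the argmax over theta in (0, 1)
thetas = np.linspace(0.01, 0.99, 99)
theta_hat = thetas[np.argmax([log_likelihood(t) for t in thetas])]
print(theta_hat)   # ~0.75, matching the closed-form MLE x.mean()
```

<p>For the Bernoulli case the argmax also has a closed form - the sample mean - which makes the grid search easy to check.</p>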
<p>Now that I have an estimated distribution, I can ask the real mystery distribution for some more data samples, known as the <em>test data</em>. If, for each test data point, my estimated distribution says there’s a high probability that I would get this particular point, then I say that my model <em>generalizes well</em>. If my estimated distribution has difficulty distinguishing this test data from, say, garbage data, then I say it <em>generalizes poorly</em>, perhaps suffering from <em>overfitting</em> - fitting the quirks of the training data rather than the underlying distribution. Or maybe we picked an insufficient functional form, one that isn’t capable of modeling what’s really going on. More on that in future posts.</p>
<h2 id="conclusion">Conclusion</h2>
<p>To summarize, we began by explaining how conditional probability is the basis of Bayes’ Rule, showed how the chain rule of conditional probability builds the likelihood function, and saw how to use the likelihood function to estimate the parameter of a mystery distribution.</p>Welcome!2017-12-21T18:53:00+00:002017-12-21T18:53:00+00:00http://blog.jzhanson.com/blog/update/thoughts/2017/12/21/welcome<p>Hi there!</p>
<p>Welcome to Junior Varsity Computer Science, a blog where I write code and talk about it! There are some things about the blog that I would like to iterate on, but as of right now, I think it is fully functional as a blog.</p>
<p>I imagine this blog will have three main categories of posts:</p>
<ol>
<li>
<p>Code - this might be coding interview questions I find interesting and want to share, in which case I’ll write down the question, insert a big white space or cut, and then walk through the process I went through to arrive at the optimal solution, outlining my lines of thought and any gotchas I ran into. It could also be an interesting algorithm that I translate into code, possibly in several different languages to compare and do a little language critique. It could even be me posting a snippet of code from a personal project and talking about why it sucks or why I’m proud of it. I’m considering making interview question posts weekly, maybe “Technical Interview Thursdays” or something.</p>
</li>
<li>
<p>Reading list - this blog also will be where I post interesting articles or papers I read and where I write anything from a couple paragraphs to an entire essay on what I think of them. It could even be on broader topics where I tie several articles/papers together.</p>
</li>
<li>
<p>Thoughts - the least frequent of the post types, where I post things I’ve been thinking about that I consider important enough to write about. Blog update posts, which I will only use for major changes, and life updates, which I will only use for extremely major changes like graduation or death, fall under this category.</p>
</li>
</ol>
<p>To lots of posts!
Josh</p>