<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.5">Jekyll</generator><link href="http://blog.jzhanson.com/feed.xml" rel="self" type="application/atom+xml" /><link href="http://blog.jzhanson.com/" rel="alternate" type="text/html" /><updated>2024-07-15T04:56:32+00:00</updated><id>http://blog.jzhanson.com/feed.xml</id><title type="html">Josh Zhanson</title><subtitle>Hi, I'm Josh, a data scientist specializing in NLP. You can reach me at josh at jzhanson.com. © 2024 Joshua Zhanson</subtitle><entry><title type="html">Project diary - Blaster rifle and defibrillator for Galaxy Awaits Star Wars LARP</title><link href="http://blog.jzhanson.com/blog/project/2024/07/07/galaxy-awaits-blaster-defib.html" rel="alternate" type="text/html" title="Project diary - Blaster rifle and defibrillator for Galaxy Awaits Star Wars LARP" /><published>2024-07-07T18:00:00+00:00</published><updated>2024-07-07T18:00:00+00:00</updated><id>http://blog.jzhanson.com/blog/project/2024/07/07/galaxy-awaits-blaster-defib</id><content type="html" xml:base="http://blog.jzhanson.com/blog/project/2024/07/07/galaxy-awaits-blaster-defib.html"><![CDATA[<p>This is a quick blog post about two props I made for <a href="http://thegalaxyawaits.org/index.php">Galaxy Awaits LARP</a>, which I traveled to in February of this year. It was a lot of fun, but I had to leave early on Saturday afternoon. I’d love to attend again as the same character or possibly even a lightsaber-wielder.</p>

<h1 id="blaster-rifle">Blaster rifle</h1>

<p><img src="/assets/sw_larp/blaster_rifle.jpeg" alt="A NERF dart blaster propped up against a wall, painted black and metallic gunmetal grey" title="A NERF dart blaster propped up against a wall, painted black and metallic gunmetal grey" /></p>

<p>I used the Nerf Elite 2.0 Echo blaster, which had a chunky, military look that I liked. Unfortunately, it kind of sucks. The magazine action is not very smooth, and unintuitively, you have to pull the priming slider back to be able to release the magazine. But the vents on the side of the barrel, as well as the central barrel itself and magazine, looked fantastic.</p>

<p>I started by splitting the blaster up into its four parts and covering the orange barrel tips (the one shown, as well as the one that remains after detaching the front barrel part) with masking tape. I sprayed Vallejo Black airbrush primer in a few coats all over the blaster pieces, and it went on smoothly without needing dilution and was reasonably resistant to overspray. Since this black primer looked pretty good, and I was going to lighten it with a metallic grey, I decided not to spray a real black paint over it. You could totally do it if you wanted the black to be darker.</p>

<p>I used Army Painter “Gun Metal”, a metallic medium-dark grey miniature paint. To add a weathered appearance to the parts, I put a few drops of paint on my (gloved) fingers, then smeared it on around all the raised parts of the blaster. Kind of like drybrushing, but with my fingers. The first few times, I used too much paint, so it was more like painting with my fingers instead of drybrushing, and I had to work a bit to spread the paint around and stop it from pooling. Use less paint than you think, and go over it more times! In the end, the effect was really cool and added a lot of depth to the blaster.</p>

<p>To paint the gunmetal parts on the magazine, the barrel, the slide, and some accents on the stock, I applied masking tape (sometimes very intricately, tearing it into tiny pieces to follow the complex geometry of the barrel), then airbrushed a few layers of the Army Painter Gun Metal. I had a tough time stopping the paint and primer from rubbing off where the slide scraped against the body of the blaster and other moving parts, but I wanted to paint at least a coat of primer under the slide. I tried to move the parts as little as possible, but the paint and primer did get scratched off in the end.</p>

<p>I used U.S. Art Supply Matte Airbrush Sealer to apply a protective topcoat. As I mentioned in the <a href="/blog/project/2024/07/07/pittsburgh-topographic-map.html">Pittsburgh topographic map post</a>, it dries and gunks up the airbrush very easily. Forget about using a double action - even a single action is really hard to get unclogged, and you only get a minute or two of spraying time before the flow starts to drop. It does seem like this matte sealer is pretty resistant to overspray. I definitely oversprayed some parts and it looked gluey while drying but was unnoticeable after it dried.</p>

<p>I attached a shoulder strap from a camera bag or a sling bag I bought at the thrift store to the attachment points on the NERF blaster using a carabiner and a keyring. The front attachment point was a little too small to get a carabiner around, so I had to get creative with a keyring. Sadly, after a weekend of walking around, the sharp keyring had rubbed off the paint in that attachment point, and a few other high-friction points had lost their paint.</p>

<p>Still, it was a fun project that taught me how to use the airbrush, and the blaster rifle is still a really cool looking prop!</p>

<h1 id="defibrilator">Defibrillator</h1>

<p><img src="/assets/sw_larp/defib.jpeg" alt="A prop defibrillator made of a yellow &quot;frog box&quot; with a lightning bolt vinyl sticker and white wires connected to black paddles" title="A prop defibrillator made of a yellow &quot;frog box&quot; with a lightning bolt vinyl sticker and white wires connected to black paddles" /></p>

<p>I made a defibrillator from <a href="https://www.thingiverse.com/thing:4094861">Nibb31’s frog box</a>, 3D printed in yellow filament with black clasps, plus a random peripheral I found in the junk pile at work for the white wires. The paddles are <a href="https://www.thingiverse.com/thing:5896464">Riv3’s paddles</a>. I trimmed off the USB ends of the peripheral, stuck them into the paddles, and hot glued them in. I think superglue or something similar would hold more solidly. The paddles are also a little small - I’d print them at 120% or even 150% for a nice comfy hold.</p>

<p>The paddles don’t fit inside the box, but that can be easily remedied with a larger box. I never came up with a good way to get the wires out of the box so it could be closed, or to attach the white peripheral to the inside of the box. Probably the best approach would be to drill holes in the side of the box and pass the wires through, then screw the white peripheral into the bottom of the box.</p>

<h1 id="other-props-i-made">Other props I made</h1>

<p>I also 3D printed 20 or so of <a href="https://www.thingiverse.com/thing:1777350">WaveSkyLord’s stocking test tubes</a> to be prop vials for medicines and such, in white PLA with PVA water dissolvable supports for the lids.</p>

<p>For the rank badges, I printed <a href="https://www.thingiverse.com/thing:1904178">SkyRider007’s Imperial cadet rank badge</a>, since that was what aligned with the “Ensign” rank for Galaxy Awaits LARP. I used MatterHackers Silky white filament (shiny-ish, to mimic metal) for the bottom and the second extruder to print MatterHackers Ruby Red (a deep, slightly translucent red) for the top. Hot-gluing safety pins to the back was way too fragile, so I ended up using a Command strip to affix a magnetic name tag holder to it. The magnetic name tag holder might feel strong, but it loses its strength really quickly through more than one layer of fabric, or one layer of thicker fabric. I almost couldn’t find a place to attach it to my tactical vest.</p>]]></content><author><name></name></author><category term="blog" /><category term="project" /><summary type="html"><![CDATA[This is a quick blog post about two props I made for Galaxy Awaits LARP which I traveled to in February of this year. It was a lot of fun, but I had to leave early on Saturday afternoon. I’d love to attend again as the same character or possibly even a lightsaber-wielder.]]></summary></entry><entry><title type="html">Project diary - Paul Chao’s Pittsburgh Topographic Map</title><link href="http://blog.jzhanson.com/blog/project/2024/07/07/pittsburgh-topographic-map.html" rel="alternate" type="text/html" title="Project diary - Paul Chao’s Pittsburgh Topographic Map" /><published>2024-07-07T17:00:00+00:00</published><updated>2024-07-07T17:00:00+00:00</updated><id>http://blog.jzhanson.com/blog/project/2024/07/07/pittsburgh-topographic-map</id><content type="html" xml:base="http://blog.jzhanson.com/blog/project/2024/07/07/pittsburgh-topographic-map.html"><![CDATA[<p>It’s been a little over six years since I’ve updated this blog. Since then, I’ve graduated from college, started and finished a Masters program, and started a full-time job.</p>

<p>My job has a makerspace, and I’ve been learning to 3D print and use the laser cutter. I’ve wanted to make a 3D topographic map for a while, and it just so happens that <a href="https://www.pauliechao.com/2017/09/pittsburgh-laser-cut-map.html">Paul Chao</a> had already done the hard work of getting the contour data and was kind enough to post his DXF files and Rhino .3dm files for anyone to use.</p>

<p><a href="https://drive.google.com/file/d/1yemA6dzq6vbkMVHU07FNFRGl9qZDNaWz/view?usp=sharing">Here are my Adobe Illustrator files.</a></p>

<h1 id="opening-the-files">Opening the files</h1>

<p>At first, the makerspace recommended using Microsoft Visio to open and send files to the Universal laser cutter. Microsoft Visio is a flowchart and diagramming software with a very impoverished web version and no easy way to install the software locally. While it was technically capable of opening .dxf files, it was very difficult to scale them and nigh-impossible to manipulate the vector contours.</p>

<p>Because of the difficulty of using Microsoft Visio, I gave up on the project for a few months, until the makerspace discovered that a copy of Adobe Illustrator was installed on the computer connected to the laser cutter. I had some experience with Adobe Illustrator from a graphic design class in high school, so this was a fantastic development that let me actually do the project instead of just thinking about it.</p>

<p>I set the canvas size in Illustrator to 24” x 18”, the size of the laser cutter bed, and imported the .dxf files. I scaled the files by 50%, because that was a nice round number to work with as I was doing test cuts and experimenting with the files. I was planning to increase the size of the final product, probably to around a 60% or 70% scale instead so the city would stretch from end to end of the 24” x 18” laser cutter bed, but I forgot. My final map is around 18 inches wide.</p>

<p>Paul’s files were separated roughly into vector cut and raster engrave layers set up for the laser cutter he was using. For my makerspace’s laser cutter, I had to change the vectors to be cut into pure red (#FF0000) and to be 0.001 pt wide. The raster engrave layers could be any color, so I changed them into pure black (#000000) and left them as is, after scaling.</p>

<p>I made a few changes to the files. Layer 0 had the rivers cut out, so I removed those vectors to make Layer 0 a solid cutout of the city limits outline. Layer 2 also only had part of the city, so I had to use a trial version of Rhino3D to open the .3dm file Paul provided and extract the contours, import them into Illustrator, manually line them up with the existing Layer 2, and look back and forth between Layer 3 and Layer 2 to see which contours were actually part of Layer 2. I made some small additions to the later layers of the Westwood/Oakwood/East Carnegie peninsula, and I also added a small Layer 7 with some tiny pieces and inconsistencies with Paul’s Layer 6 that didn’t make sense to me. I stared hard at the <a href="https://pittsburghpa.gov/innovation-performance/interactive-maps">City of Pittsburgh’s interactive topographic map</a> to see if Paul got the contours around the edges right. There is a little hook-like piece at the south border of the city that I think should be an elevation instead of a depression – it can be hard to tell whether the area between the border and the first contour line is a depression or an elevation, i.e. whether it slopes down or up.</p>

<p>For the later layers, which have a lot of small pieces, I found it really helpful during assembly to add an additional layer to the files with the outline of the city, and roughly place each little piece where it would go on the map. Then, I separated the cut and engrave layers into smaller batches of 3-5 pieces each. Because the Universal Laser Systems (ULS) software was smart enough to only cut the areas that were marked with black or red lines, and not the whole canvas, I was able to position each batch separately on the wood and save some wood that way.</p>

<p>One interesting thing to note is that there is a gap in the City of Pittsburgh, a little south of the South Side Flats, between Knoxville on the west and Mt. Oliver on the east. There’s no road data here, so it showed up as a blank spot on the map, but it didn’t look like the contours were affected, at least not in a major way. Luckily, it’s small enough that it’s not noticeable unless you’re looking for it.</p>

<h1 id="wood-materials">Wood materials</h1>

<p>I used 3.30mm plywood for this, and for each map, I went through around 3-5 sheets of 24” x 18” plywood, even trying to be efficient with the placement of the smaller pieces. I liked the wood texture of the plywood more than the medium density fiberboard that the makerspace had.</p>

<h1 id="painting-the-rivers">Painting the rivers</h1>

<p>The first thing I did was cut out the base layer (Layer 0) and the layer after that (Layer 1) and position them on top of each other. I used a pencil to trace an outline around Layer 1 onto Layer 0, and when I went to paint, I made sure I painted over the pencil markings and then some.</p>

<p>Paul used a really vibrant blue to paint the rivers. I had a tough time finding such a vibrant blue in the student-grade acrylic paints that my makerspace had, and it’s really hard (impossible, perhaps, without a deep understanding of color theory, and even then only with some trickery) to mix paints into a color more saturated than the paints you started with. I ended up using a miniatures paint, which was newer and less chunky and dried-out than the old student-grade acrylics in the makerspace, and also had a higher pigment density. I believe I used Army Painter Crystal Blue – it might have been Voidshield Blue, but I think that one was too light for the water. I know the river water is not at all that color, but I really liked how that high-saturation, high-luminosity medium blue stood out against the yellow-brown of the wood and the laser-cut edges.</p>

<p>Since the miniatures paint was pretty thick with pigment, I only needed one thick layer, with a little more paint to touch up some spots afterwards. I did a quick and dirty job, focusing on getting solid color coverage without leaving pools of paint, and the paint interacted well with the wood.</p>

<h1 id="cutting--pasting">Cutting &amp; pasting</h1>

<p>Laser cutting and raster engraving in particular TAKES A LONG TIME. I spent 20 hours or so on the first map, and I was running the laser cutter for 12-16 of those hours.</p>

<p>Laser cutting vectors (and etching vectors - <a href="https://www.daniellewethington.com/laser-cut-paint-fills-a-simple-technique/">some people</a> call this “scoring” instead of “engraving”) is reasonably fast, but it still took 10-20 minutes for the laser to just cut out the outline of the city of Pittsburgh.</p>

<p>Cutting and raster engraving Layer 1 took 50 minutes to a little over an hour. Raster engraving takes a long time because the laser head has to move all the way across the width of the cut area, no matter how much there actually is to engrave. But you can change the size and thickness of the lines when raster engraving, so it scales better than scoring, except at very small scales, where the design can become patchy.</p>

<p>I used the “General Medium Woods” setting on the laser cutter, with the red cut and black raster engrave powers cranked up all the way to +50%.</p>

<p>Even smaller batches of 3-5 pieces took 10-15 minutes to cut out. And there are a lot of pieces, especially in the later layers. Luckily, I had a lot of gluing to do in between waiting for the laser to finish, and I had laser cutting to do while waiting for the glue to dry.</p>

<p><img src="/assets/pgh-topo/pgh-topo-1.jpg" alt="An in-progress laser cut wooden topographic map, held by clamps to a tabletop" title="An in-progress laser cut wooden topographic map, held by clamps to a tabletop" /></p>

<p>I used Gorilla Glue wood glue with the aid of some clamps to glue the pieces together. I tried a few different ways of spreading this glue, including using a toothbrush and a small paint spreader. Both were ok. I liked how the spreader could move a lot more glue, but the toothbrush did a better job of making the glue “soak into” the wood.</p>

<p>I found this glue to have a very strong cohesion, to the point where I’d spread it onto the wood, go to spread some other part of a larger piece, and it would “shrink back” and pull itself into blobs instead of evenly coating the wood. This was an advantage for the edges, because I’d often spread a little bit of glue over the edges and it would “pull back” like honey out of a spout, but that didn’t always happen. It also advertises a working time of up to 15 minutes, but I found I had to hustle on the bigger pieces, because the glue seemed less visible and certainly less effective after 5 minutes or so of sitting out in the open.</p>

<p>When applying the wood glue, I was careful not to leave drops of it on the edges of the piece. I used either a finger or a paper clip to “grab” the excess glue drops off the irregular edges as I was applying it. The glue says you can sand it after it dries, but it would have been awkward with the geometry of the laser cut edges, and I wanted them to have a nice uniform burnt edge. When the glue dries in a drop, it dries a dark, barely translucent brown. If a little bit of glue was squeezed out between two layers and dried, I was able to use a paper clip to scratch off the excess, which would sometimes leave a little bright mark where the glue broke off.</p>

<p>The key part of working with wood glue is to clamp your pieces together at as many edges as you can. Even if you only get a little bit on, the glue will soak into the wood and form a very strong bond if you can press the pieces together. Since I was working on a large table, I set the map on a corner of the table, and I could only clamp two sides at a time and maybe the center of the piece, if it was small enough and I maneuvered it onto the corner of the table. I tried to use heavy things like crescent wrenches and monkey wrenches to weigh down the edges that I couldn’t clamp, and I’m sure it was better than nothing, but it also wasn’t very effective. Sometimes, if one edge wasn’t properly clamped, it would stick up, and I would put some glue on the scraper, work it under that edge, then clamp that edge and wait for it to dry again.</p>

<p>The wood glue’s instructions said to let it dry while clamped for 20-30 minutes. While I was waiting for the glue to dry, I was laser cutting other pieces and labeling them with their layer number and batch number so I wouldn’t lose track of them. I usually found that 20 minutes was sufficient, and sometimes as few as 15 or even 10 minutes while clamped.</p>

<p>For the smaller pieces in the later layers, I often found it was sufficient to just press down hard for 5-10 seconds and leave it, especially for the pieces smaller than a fingernail. Clamping small pieces is tough, because the force of the clamp can often cause the piece to shift a little bit, misaligning the roads in an annoying way.</p>

<p>Cutting and gluing the wood was the vast majority of the project time. Probably 12-16 hours of the 20 hours was spent doing at least one of the two, and often both at the same time. I started making my second version of the topographic map while I was waiting for the glue to dry on some of the last layers of the first one.</p>

<h1 id="spraying--sealing">Spraying &amp; sealing</h1>

<p>Since I had some U.S. Art Supply matte airbrush sealer from <a href="/blog/project/2024/07/07/galaxy-awaits-blaster-defib.html">my Star Wars blaster rifle</a>, I decided to seal the first topographic map using it. However, I had the same problems spraying it out of the single-action airbrush as I had in the other project. U.S. Art Supply matte airbrush sealer is really gunky, so within a minute or two of spraying, it would clog up the airbrush nozzle, the intake straw and well, or even the little air hole at the top of the little paint bottle. De-clogging it was also a pain. I used the U.S. Art Supply airbrush cleaner, then isopropyl alcohol (yes, I was wearing a respirator; yes, I know breathing IPA is bad) with the aid of a paper clip to clean out the clogs, since it seemed like there was a good amount of gunk that required physical force to remove.</p>

<p>Other than the clogging problem, the U.S. Art Supply matte airbrush sealer worked ok. Since it’s a matte sealer, it dries into a bumpy surface rather than a smooth surface. You’re not supposed to let it pool, but if you do, then it starts to provide a little more of an even, glossy look, and it didn’t have too much of a white cast. Even though the Amazon product page says that it’s good to use on wood, I’m not sure if it’s really the best choice.</p>

<p><img src="/assets/pgh-topo/pgh-topo-2.jpg" alt="Two topographic maps side by side on a table. The one on the left is lighter and has less contrast than the one on the right." title="Two topographic maps side by side on a table. The one on the left is lighter and has less contrast than the one on the right." /></p>

<p>Left is unsealed, right is sealed.</p>

<p>For the second topographic map, I bought spray-on Shellac. It worked well – in truth, about the same as U.S. Art Supply matte airbrush sealer, but the Shellac rattle can had no clogging problems. There’s less Shellac in the spray can than you think. Both sealers darkened the wood and gave it a nice burnt look.</p>

<p><img src="/assets/pgh-topo/pgh-topo-3.jpg" alt="Two topographic maps on a table. One is closer to the camera and is lighter than the other." title="Two topographic maps on a table. One is closer to the camera and is lighter than the other." /></p>

<p>When spraying, I started with even coverage. In the later layers, I paid special attention to the valleys, which didn’t get as much spray as the peaks or the rivers (where the runoff would collect). I’m not sure the valleys got a solid layer of sealer before the peaks built up a glaze and I had to stop spraying so they wouldn’t get too “candied”.</p>

<h1 id="mounting">Mounting</h1>

<p>I didn’t have any good ideas on how to mount it, except by using the white “velcro” Command strips used for picture hanging. Two or three of the “large” ones did the trick. The map is about as heavy as you’d expect, or maybe a little lighter.</p>

<h1 id="takeaways">Takeaways</h1>

<p><img src="/assets/pgh-topo/pgh-topo-4.jpg" alt="A close-up of a finished laser cut wooden topographic map of the city of Pittsburgh, facing south." title="A close-up of a finished laser cut wooden topographic map of the city of Pittsburgh, facing south." /></p>

<p>The differences in the slope of the topographic map are noticeably sharper on the Mount Washington Incline.</p>

<p>This was a really fun project that taught me a lot about laser cutting wood and assembling multiple layers of wood into a 3D sculpture.</p>

<p>If I did it again, I would:</p>

<ul>
  <li>
<p>strongly consider scoring / etching the roads (blue, #0000FF) instead of raster engraving them to save time.</p>
  </li>
  <li>
<p>I’m now happy with the files and the edits I made, so hopefully I wouldn’t make any errors. On the first map I made, the Waterworks Mall is one layer short; on the second, it is one layer too tall, because clamping made the roads misalign with the layer below and I added a layer to correct it. The first map also has a “flat top” Westwood/Oakwood/East Carnegie, and the second has more detail, but I still think I missed a layer.</p>
  </li>
  <li>
    <p>spend more time spraying sealer into the valleys, because the peaks will naturally collect overspray while the valleys won’t.</p>
  </li>
</ul>]]></content><author><name></name></author><category term="blog" /><category term="project" /><summary type="html"><![CDATA[It’s been a little over six years since I’ve updated this blog. Since then, I’ve graduated from college, started and finished a Masters program, and started a full-time job.]]></summary></entry><entry><title type="html">From 0 to 200 - lessons learned from solving Atari Breakout with Reinforcement Learning</title><link href="http://blog.jzhanson.com/blog/rl/project/2018/05/28/breakout.html" rel="alternate" type="text/html" title="From 0 to 200 - lessons learned from solving Atari Breakout with Reinforcement Learning" /><published>2018-05-28T17:00:00+00:00</published><updated>2018-05-28T17:00:00+00:00</updated><id>http://blog.jzhanson.com/blog/rl/project/2018/05/28/breakout</id><content type="html" xml:base="http://blog.jzhanson.com/blog/rl/project/2018/05/28/breakout.html"><![CDATA[<script type="text/javascript" async="" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>

<p>Note: this post has a lot of hefty GIFs. Be patient for the website to load! It’ll be worth it :)</p>

<p>The <a href="https://github.com/jzhanson/breakout-demo">GitHub repository</a> with my code.</p>

<p>I spent the last two months on what my Deep Reinforcement Learning and Control professor called the “MNIST for deep RL” — solving the classic Atari game Breakout. I originally thought it would be a two-week project, especially since I already had the code for a double deep Q-network, but between coursework, exams, and challenges training the model, it took closer to two months to complete.</p>

<h2 id="first-stab-double-deep-q-network">First stab: Double Deep Q-Network</h2>

<p>The original deep RL methods that were used to play Atari games came from <a href="https://arxiv.org/abs/1312.5602">Mnih et al., Playing Atari with Deep Reinforcement Learning</a> (and the more cited <a href="http://www.davidqiu.com:8888/research/nature14236.pdf">Nature</a> paper), where Mnih and colleagues used the model-free reinforcement learning algorithm Q-learning, paired with a deep neural network to approximate the action-value Q-function, to play Atari.</p>

<p>Q-learning is a relatively simple algorithm that takes an action in the environment and uses the following update rule to update its estimate of the Q-function with a sampled experience tuple of state, action, reward, and next state \((s_t, a_t, r_t, s_{t+1})\):</p>

\[Q_{t+1}(s_t, a_t) \overset{\cdot}{=} Q_t(s_t, a_t) + \alpha (r_t + \gamma \cdot \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t))\]

<p>where \(\alpha\) is the learning rate and \(\gamma\) is the discount factor — see the RL literature for more info.</p>

<p>In a nutshell, the algorithm is pushing its estimate of the value of taking a particular action in a particular state a little bit towards the reward the agent actually obtained plus the discounted value of the best next action. Under standard conditions on the learning rate (its values must sum to infinity while their squares sum to a finite value) and provided every state-action pair is visited infinitely often, this estimate of the Q-function has been shown to converge to the true Q-function.</p>
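<p>As a concrete illustration, the tabular version of this update rule can be written in a few lines of Python (a toy sketch for exposition; the deep-RL version replaces the table with a neural network):</p>

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, n_actions=4):
    """One tabular Q-learning step: nudge Q(s, a) toward the bootstrapped target."""
    # max over a' of Q(s', a'): the greedy estimate of future return from s'
    best_next = max(Q[(s_next, a2)] for a2 in range(n_actions))
    td_target = r + gamma * best_next
    # Move the current estimate a fraction alpha toward the target
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

# Example: a fresh table, one transition with reward 1.0
Q = defaultdict(float)
q_update(Q, s=0, a=1, r=1.0, s_next=2)
```

<p>Repeating this update over enough sampled transitions is what drives the estimate toward the fixed point described above.</p>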

<h3 id="challenges">Challenges</h3>

<p>The network used in the Nature paper made of three convolutional layers plus a fully-connected layer and an output logit for each action to estimate its corresponding Q-value, was simple to implement and fairly standard at the time. 2015 was just three years ago, but more recent methods have essentially made deep Q-networks, at least on their own as presented in the Nature paper, obsolete. Reinforcement learning as a field is moving very quickly.</p>

<p>The main challenge lay in the replay memory: the Nature paper used a replay buffer of 1M transitions, and because each state was made up of four grayscale 84x84 images stacked together and each transition has two states attached to it, this meant that the replay buffer should have taken about 56 billion bytes, or 56 gigabytes, which is really not that much. However, when training the model on AWS, I found that the memory usage was exploding. The model was not small, of course, with 3 convolutional layers of 32, 64, and 64 kernels each, plus a dense layer of 512 units and then another dense layer to connect to the output logits, but saved model checkpoints should not have been nearly the size of the replay buffer. With some quick-and-dirty calculations in the search bar of my web browser, it seemed like each transition was eating up 0.0003 gigabytes or 300,000 bytes, which was way way way more than the 56,000 bytes or so each transition should have taken up. This was most likely due to the way I structured my replay buffer — the interaction between the numpy arrays that were the images and the Python deque must have had a memory leak somewhere.</p>
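<p>For what it’s worth, a standard way to keep such a buffer near its theoretical size (sketched below with hypothetical names; this is not the code I actually ran) is to preallocate one uint8 array that stores each 84x84 frame once and rebuild the four-frame stacks only at sample time, instead of keeping stacked float copies in a deque:</p>

```python
import numpy as np

class FrameReplayBuffer:
    """Replay buffer that stores each grayscale frame once, as uint8.

    Keeping stacked float states per transition duplicates every frame many
    times (each frame appears in up to 8 stacked states across state and
    next-state); storing raw uint8 frames keeps a 1M-capacity buffer near
    1e6 * 84 * 84 bytes, about 7 GB.
    """
    def __init__(self, capacity=1_000_000, h=84, w=84, stack=4):
        self.frames = np.zeros((capacity, h, w), dtype=np.uint8)
        self.actions = np.zeros(capacity, dtype=np.int64)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.dones = np.zeros(capacity, dtype=bool)
        self.capacity, self.stack = capacity, stack
        self.next_idx, self.size = 0, 0

    def add(self, frame, action, reward, done):
        i = self.next_idx
        self.frames[i], self.actions[i] = frame, action
        self.rewards[i], self.dones[i] = reward, done
        self.next_idx = (i + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def _stacked(self, idx):
        # Reconstruct the stack-frame state ending at idx on the fly
        idxs = [(idx - k) % self.capacity for k in reversed(range(self.stack))]
        return self.frames[idxs].astype(np.float32) / 255.0

    def sample(self, batch_size):
        # Skip the first stack-1 indices so every sampled state is complete
        # (episode boundaries and wraparound are ignored in this sketch)
        idxs = np.random.randint(self.stack - 1, self.size - 1, size=batch_size)
        states = np.stack([self._stacked(i) for i in idxs])
        next_states = np.stack([self._stacked(i + 1) for i in idxs])
        return states, self.actions[idxs], self.rewards[idxs], next_states, self.dones[idxs]
```

<p>The float conversion happens only on the sampled minibatch, so per-transition overhead in storage stays close to the back-of-the-envelope numbers above.</p>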

<p>There was also a possibly related problem that I have yet to figure out — after some time, the AWS instance would stop responding and I would be unable to SSH into it. It didn’t matter whether I ran the Python process in a tmux session or in the foreground or background: whenever I let it run for a while and then tried to reconnect, the SSH would hang for 10-15 minutes and then print a simple “Permission denied.” So far, my best guess is that the replay buffer fills up and, with TensorFlow using up every ounce of compute the system has left, there is no memory left to respond to the SSH request. It could also be that there was sufficient memory (in my later trials, I was allocating 2000 gigabytes, or 2 terabytes, per instance), but because so much of it was held in swap, the slowdown brought on by constantly sifting through the slower SSD flash memory to sample transitions at random from the replay memory completely overwhelmed the system and made it take a huge amount of time to respond to the SSH request.</p>

<p>In any case, it proved so difficult to keep an AWS p2.xlarge instance alive long enough to SSH back into it that I eventually abandoned the double deep Q-network and moved on to other, less GPU- and memory-intensive methods.</p>

<h2 id="second-try-advantage-actor-critic-a2c">Second try: Advantage Actor-Critic (A2C)</h2>

<p>Asynchronous Advantage Actor-Critic (A3C) is a more recent <a href="https://arxiv.org/abs/1602.01783">algorithm</a> from 2016, by the same authors as the original Nature paper, which uses a deep network to learn the optimal policy using an estimate of the state-value V-function rather than the action-value Q-function. Both A3C and A2C use multiple workers, each with their own copy of the environment, but A3C runs them asynchronously while A2C runs them synchronously. According to <a href="https://blog.openai.com/baselines-acktr-a2c/">OpenAI</a>, the asynchronicity seems to provide no noticeable benefit.</p>

<p>This algorithm has two neat tricks. First, we calculate the actual value of a state from experience using a <em>rollout</em> of the rewards received over N time steps</p>

\[A_t = R_t - V(s_t) = \sum_{i=0}^{N-1} \gamma^i r_{t+i} + \gamma^N V(s_{t+N}) - V(s_t)\]

<p>as well as subtracting the value of the starting state, which gives a quantity known as the <em>advantage</em>, i.e. a measure of the relative amount of reward that can be expected from a state. A really good (high-value) state is likely to have a high reward, so the advantage is small, and a really bad (low-value) state is likely to have a low reward, so the advantage is also small. However, receiving a high reward in a bad state results in a large advantage, while a low reward in a good state results in very small (likely negative) advantage.</p>

<p>We use this quantity squared as the loss for the part of the network that estimates the value function, known as the <em>critic</em>, and we use that quantity times the negative log of the probability we take the action we took in that state under the policy given by our network to update the part of the network responsible for computing the policy, known as the <em>actor</em>, hence, <em>advantage actor-critic</em>. It is fairly common in practice, however, to use the actor and the critic loss combined with an entropy term as the loss function, which is what I did.</p>
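<p>As a rough scalar sketch of that combined loss, for a single time step (using the 0.5 critic and 0.1 entropy coefficients mentioned later in this post; the function and parameter names are my own):</p>

```python
import math

def a2c_loss(advantage, action_prob, probs, critic_coef=0.5, entropy_coef=0.1):
    """Combined A2C loss for one step (scalar sketch, not the real batched code).

    advantage:   A_t, treated as a constant in the actor term
    action_prob: pi(a_t | s_t), probability of the action actually taken
    probs:       the full action distribution, for the entropy bonus
    """
    actor_loss = -math.log(action_prob) * advantage
    critic_loss = advantage ** 2
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Subtracting entropy (a negative-entropy term) pushes the policy toward
    # higher entropy when the total loss is minimized.
    return actor_loss + critic_coef * critic_loss - entropy_coef * entropy
```
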

<p>The N in the above expression is a hyperparameter for the number of steps to unroll when calculating the cumulative discounted reward — basically, how far into the future to look when determining an action’s impact on obtained reward. Using a value of \(N=1\) gives one-step advantage actor-critic, while using a value of \(N = \infty\) gives an algorithm known as REINFORCE; both are special cases of the broader family of N-step advantage actor-critic methods.</p>

<p>The second trick here is that we run multiple workers, each with its own environment, all using and updating the same network weights — hence, <em>asynchronous</em>. Exploration can come from the workers updating their network weights separately and syncing them periodically, from all the workers using the same weights and updating them immediately, or even from adding a little bit of noise to the action probabilities outputted by the policy network.</p>

<p>It is worth noting, however, that since A2C and A3C are on-policy learning algorithms, we require that the updates to the network come from the policy that is outputted by the network. This is in contrast to off-policy methods like the Q-learning outlined above, which do not require that we follow the policy given by our network because we are not learning a policy — we are learning the value of taking various actions in the different states of the Markov Decision Process, rather than directly learning what to do in a particular state. This means that a replay buffer, a key component of deep Q-networks, cannot be used for A2C, as all experience used to train the network must come from the policy currently given by the network.</p>

<h3 id="challenges-1">Challenges</h3>

<p>The biggest setback I suffered, or rather, challenge I surmounted :), was my initial misunderstanding of the algorithm. I initially thought that the cumulative discounted reward included the state-value function for each state in the N steps, rather than just the last state. That is, I was calculating the cumulative discounted reward for each step within a batch of \(N\) steps (taking \(t=0\) to be the first step in the batch rather than the first step of the episode) as illustrated below. Note that the value of a terminal state is defined to be 0.</p>

<p>WRONG:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for t from 0 to N-1:
  cumulative_discounted = 0
  for i from t to N-1:
    cumulative_discounted += gamma^(i-t) * r_i
  R[t] = gamma^N * V(s_t) + cumulative_discounted
</code></pre></div></div>

<p>RIGHT:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>R[N] = V(s_N)
for t from N-1 down to 0:
  R[t] = r_t + gamma * R[t+1]
</code></pre></div></div>

<p>The primary difference is that only the last state’s value is included in the target, not the state value for every intermediate state in the N steps. The first rollout doesn’t work because the values outputted by the network itself, the estimates of the value function \(V(s_t)\), play too large a part in the optimization of the network — the target is primarily comprised of value estimates, rather than real rewards. The second rollout only includes the value function of the very last state after N steps, which results in a target made up more of real rewards than estimates, and that really does make all the difference.</p>
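<p>The corrected backward recursion is easy to make runnable. This is my own sketch, assuming the bootstrap value \(V(s_N)\) has already been computed (and is 0 for a terminal state):</p>

```python
def n_step_returns(rewards, bootstrap_value, gamma):
    """Compute targets R[t] = r_t + gamma * R[t+1], seeded with R[N] = V(s_N).

    Only the final state's value estimate enters the target; every other
    term is a real observed reward.
    """
    returns = [0.0] * len(rewards)
    running = bootstrap_value  # V(s_N), or 0.0 if s_N is terminal
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```
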

<p>Below are two animated GIFs I made with my phone’s camera set to time-lapse visually explaining the difference.</p>

<p><img src="/assets/breakout/correct_rollout.gif" alt="Correct rollout" title="Correct rollout" /></p>

<p><img src="/assets/breakout/wrong_rollout.gif" alt="Wrong rollout" title="Wrong rollout" /></p>

<p>This is also a good time to note that the key difficulty of deep reinforcement learning is that these two methods, as well as many more recent methods like PPO and TRPO, all rely to a certain degree on using the network’s own estimates as part of the target to optimize towards. This is known as <em>bootstrapping</em> in the RL literature, coming from the 19th-century expression “to pull yourself over a fence by your bootstraps,” meaning to do something impossible. Fitting, seeing as these deep models are able to do just that — successfully learn how to play a game using real experience combined with their own estimates, pulling on themselves to surmount a huge obstacle.</p>

<p>Contrast this with traditional supervised learning, where the target to train the network towards comes only from the labeled training data — MNIST or ImageNet would be a whole lot harder if networks were trained where half of the objective function is made up of the real label for an image, and half is made up of what the network thought the image was. It does seem quite impossible to bootstrap a model using its own output as a part of the target, but a really cool thing about reinforcement learning is that these methods actually work.</p>

<p>Some improvements I implemented to the OpenAI Gym Breakout environment included treating loss of life as the end of an episode, rather than the end of a full game (5 lives), repeating the first frame of an episode in the frame-stacking rather than using frames of all zeros, and pressing the “fire” button at the beginning of an episode to launch the ball.</p>

<p>A minor training issue I encountered: since the policy logits are at first very similar, putting them through a softmax and then sampling from the resulting distribution meant that the agent was following a more or less random policy, which made it impossible to learn from experience — any tiny changes to the network weights would just be drowned out by the random sampling. A probability distribution of 0.25/0.25/0.25/0.25 is not a whole lot different from 0.245/0.247/0.253/0.255 when you’re sampling from it. I also discovered that adding noise to the outputs to encourage exploration simply meant that the agent had a harder time following its policy, and the noise again drowned out the changes in the policy in the early episodes of learning, which are critical to bootstrapping. Taking the argmax of the outputted action probabilities was the way to go, since it offered the most consistency between the actor’s behavior and the network’s outputs — argmax is very sensitive to small changes when all the probabilities are very similar.</p>

<p><img src="/assets/breakout/bad_entropy.png" alt="Flatlined entropy" title="Flatlined entropy" /></p>

<p>Note that 1.38, the value at the flat line in the graph, is the entropy for the probability distribution 0.25/0.25/0.25/0.25.</p>
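<p>This is easy to verify: the entropy of a uniform distribution over Breakout’s four actions is \(\ln 4 \approx 1.386\). A quick check in Python (the function name is mine):</p>

```python
import math

def entropy(probs):
    """Shannon entropy in nats: -sum(p * log p)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

print(entropy([0.25] * 4))  # entropy of a uniform 4-action policy, i.e. ln(4)
```
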

<p>This also had to do with the fact that our total loss, used to optimize both the actor and the critic, combined the actor loss, the critic loss, and a negative entropy term, which actually had the effect of pushing the policy action probabilities <em>closer</em> to a random policy: minimizing negative entropy means maximizing entropy, leading the network to be more “uncertain” about which action to take. While this may sound like a bad idea, it is actually necessary to prevent the algorithm from falling into some very easy local minima right off the bat by taking the same action to the exclusion of all others, making it impossible to learn anything but that suboptimal behavior. For example, training without the entropy term, or with the entropy term’s sign flipped, made the Breakout agent do nothing but move the paddle all the way to the right.</p>

<p>Finally, after correcting that big misunderstanding, I found some sort of learning rate decay necessary in order to skirt the local minima of the objective function in the early stages of training. If we kept the learning rate constant, the network would learn to hit the ball once or twice, perhaps even getting up to 30 reward or so, and then unlearn all of it and just move the paddle right. Learning rate decay, however, lets the network value later learning less than initial learning. This makes sense: games of Breakout all look about the same at the beginning, and we want to quickly learn the behavior of hitting the ball; but as games progress, they tend to look different, and we want the agent to learn just enough to keep hitting the ball but not so much that it thinks some configuration of the blocks means it should arbitrarily move left or right. Decaying the learning rate allows us to initially take large steps to step over early local minima, and smaller steps later on once the algorithm is close to the true minimum.</p>

<p>I used a simple linear learning rate decay policy where the initial learning rate was decayed linearly over several million training iterations, but I wonder if different decay strategies like quadratic or exponential might make a difference in avoiding the sharp overfitting dropoff that we can see towards the end of training.</p>
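<p>For reference, a minimal linear decay schedule of the kind described; the function signature and the numbers in the test are my own illustration, not the exact values I trained with:</p>

```python
def linear_lr(step, initial_lr, total_steps, final_lr=0.0):
    """Linearly anneal the learning rate from initial_lr to final_lr.

    After total_steps, the rate stays clamped at final_lr.
    """
    frac = min(step, total_steps) / total_steps
    return initial_lr + frac * (final_lr - initial_lr)
```
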

<p>Some comments on the generated graphs: because the loss function slightly pushes entropy to be high to keep the network from prematurely preferring one action to the exclusion of all others, the entropy should remain fairly high and fairly constant, but it should certainly not flatline at 1.38, which is the value associated with a random policy. It is interesting to see how the losses are related to the episode length and the average reward; episode length and rewards are very closely correlated, since longer games of Breakout mean higher scores. Also note that I am averaging rewards per episode over 100 episodes, which trades precision for a better look at the overall trend of learning – the reward obtained per episode usually has quite a high variance, so a higher average reward per 100 episodes really means that the agent is consistently getting better. A more precise graph would probably use average reward per 20 or 25 episodes.</p>
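<p>The smoothing described above is just a trailing window mean over the last 100 episodes; a quick sketch (names are mine):</p>

```python
from collections import deque

def smoothed_rewards(episode_rewards, window=100):
    """Average each episode's reward with up to `window` - 1 previous episodes.

    Early episodes are averaged over however many episodes exist so far.
    """
    recent = deque(maxlen=window)  # deque drops the oldest entry when full
    out = []
    for r in episode_rewards:
        recent.append(r)
        out.append(sum(recent) / len(recent))
    return out
```
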

<p>Apologies for some of the graphs running over their axes — I have so far only run on my local machine but plan to run on cloud compute next.</p>

<h3 id="n--5">N = 5</h3>

<p><img src="/assets/breakout/5_entropy.png" alt="N = 5 entropy" style="width: 370px; float: left;" /></p>

<p><img src="/assets/breakout/5_losses.png" alt="N = 5 losses" style="width: 370px; float: left;" /></p>

<p><img src="/assets/breakout/5_episode_length.png" alt="N = 5 episode length" style="width: 370px; float: left;" /></p>

<p><img src="/assets/breakout/5_rewards.png" alt="N = 5 rewards" style="width: 370px; float: left;" /></p>

<p>I have not yet run the N = 5 case extensively, but in the 3000 or so episodes I did run it, it did not seem to learn anything. More details (# iterations, etc.) to come as I train this for longer. For now, these graphs provide a good look at what an agent that doesn’t learn anything looks like.</p>

<h3 id="n--20">N = 20</h3>

<p><img src="/assets/breakout/20_entropy.png" alt="N = 20 entropy" style="width: 370px; float: left;" /></p>

<p><img src="/assets/breakout/20_losses.png" alt="N = 20 losses" style="width: 370px; float: left;" /></p>

<p><img src="/assets/breakout/20_episode_length.png" alt="N = 20 episode length" style="width: 370px; float: left;" /></p>

<p><img src="/assets/breakout/20_rewards.png" alt="N = 20 rewards" style="width: 370px; float: left;" /></p>

<p>N = 20 was the first case to show promising results — it was able to get up to a max reward of 376 and a good average reward per 100 episodes, although it wasn’t quite able to get over 200 average reward per 100 episodes before the overfitting cliff hit, which was at around 9000 training episodes (4M iterations).</p>

<h3 id="n--50">N = 50</h3>

<p><img src="/assets/breakout/50_entropy.png" alt="N = 50 entropy" style="width: 370px; float: left;" /></p>

<p><img src="/assets/breakout/50_losses.png" alt="N = 50 losses" style="width: 370px; float: left;" /></p>

<p><img src="/assets/breakout/50_episode_length.png" alt="N = 50 episode_length" style="width: 370px; float: left;" /></p>

<p><img src="/assets/breakout/50_rewards.png" alt="N = 50 rewards" style="width: 370px; float: left;" /></p>

<p><img src="/assets/breakout/50_max_reward.png" alt="N = 50 max reward" style="width: 370px; float: center;" /></p>

<p>N = 50 performed even better than N = 20. Since I began graphing the max reward obtained so far, we can see that N = 50 reached a somewhat, but not significantly, higher max reward of 397, though it took significantly longer to train (in terms of number of episodes; I’m not yet sure about number of iterations). N = 50 also had a policy that appeared more stable, likely because unrolling over more time steps trades training speed and immediate reward for a more long-term outlook, both in the agent and in training.</p>

<h3 id="n--100">N = 100</h3>

<p><img src="/assets/breakout/100_entropy.png" alt="N = 100 entropy" style="width: 370px; float: left;" /></p>

<p><img src="/assets/breakout/100_losses.png" alt="N = 100 losses" style="width: 370px; float: left;" /></p>

<p><img src="/assets/breakout/100_episode_length.png" alt="N = 100 episode length" style="width: 370px; float: left;" /></p>

<p><img src="/assets/breakout/100_rewards.png" alt="N = 100 rewards" style="width: 370px; float: left;" /></p>

<p><img src="/assets/breakout/100_max_reward.png" alt="N = 100 max reward" style="width: 370px; float: center;" /></p>

<p>N = 100 was the slowest-training agent that I had run so far, but it certainly did a good job of learning how to play Breakout. This is likely because 100 is close to the number of steps it takes for the paddle to hit the ball, the ball to hit the bricks, and the reward to be issued, which makes it a particularly good rollout length: each batch of 100 steps would include both the paddle actually hitting the ball and the reward being issued. The max reward achieved was 428, and the average reward per 100 episodes exceeded 200 towards the end of training.</p>

<p>At around 10400 episodes of training, the agent exhibits the advanced behavior of focusing on hitting the ball towards one side of the wall, making a tunnel to hit the ball through and scoring a huge reward when the ball repeatedly bounces off the far wall and the higher-valued bricks in the back.</p>

<p><img src="/assets/breakout/gifs/10400side_tunnel.gif" alt="10400 episodes, side tunnel" title="10400 episodes, side tunnel" /></p>

<p>Here are two video captures from 11400 and 11900 episodes of training where it digs a tunnel through the center as well as a tunnel through the side, and even catches the ball when it comes out of one of the side tunnels despite having been hit through the center tunnel.</p>

<p><img src="/assets/breakout/gifs/11400center_tunnel.gif" alt="11400 episodes, center tunnel" title="11400 episodes, center tunnel" /></p>

<p><img src="/assets/breakout/gifs/11900both_tunnel.gif" alt="11900 episodes, both tunnels" title="11900 episodes, both tunnels" /></p>

<p>Finally, here are two video captures from 15500 and 17800 training episodes where the agent has more or less solved the game, hitting almost every brick on the screen.</p>

<p><img src="/assets/breakout/gifs/15500balanced_almost_complete.gif" alt="15500 episodes, balanced, almost complete" title="15500 episodes, balanced, almost complete" /></p>

<p><img src="/assets/breakout/gifs/17800consistent_almost_complete.gif" alt="17800 episodes, consistent, almost complete" title="17800 episodes, consistent, almost complete" /></p>

<p>Unfortunately, after a week of training on my laptop, this model too hit the overfitting cliff. Here are the graphs from the end of training:</p>

<p><img src="/assets/breakout/100_entropy_final.png" alt="Final N = 100 entropy" style="width: 370px; float: left;" /></p>

<p><img src="/assets/breakout/100_losses_final.png" alt="Final N = 100 losses" style="width: 370px; float: left;" /></p>

<p><img src="/assets/breakout/100_episode_length_final.png" alt="Final N = 100 episode length" style="width: 370px; float: left;" /></p>

<p><img src="/assets/breakout/100_rewards_final.png" alt="Final N = 100 rewards" style="width: 370px; float: left;" /></p>

<p><img src="/assets/breakout/100_max_reward_final.png" alt="Final N = 100 max reward" style="width: 370px; float: center;" /></p>

<p>And here’s a video of the final policy. Note that it does seem to have retained something, but the policy logits are outputting action probabilities with essentially all of the mass on the move-right action, which is usually what these learning algorithms resort to in this game when there’s a bug in the code or when they are not complex enough to learn how to play the game.</p>

<p><img src="/assets/breakout/gifs/100_final.gif" alt="24000 episodes, definitely overfit" title="24000 episodes, definitely overfit" /></p>

<h3 id="n--infinity">N = infinity</h3>

<p><img src="/assets/breakout/infty_entropy.png" alt="N = infinity entropy" style="width: 370px; float: left;" /></p>

<p><img src="/assets/breakout/infty_losses.png" alt="N = infinity losses" style="width: 370px; float: left;" /></p>

<p><img src="/assets/breakout/infty_episode_length.png" alt="N = infinity rewards" style="width: 370px; float: left;" /></p>

<p><img src="/assets/breakout/infty_rewards.png" alt="N = infinity rewards" style="width: 370px; float: left;" /></p>

<p>I found that N = infinity was not able to learn anything, most likely because the unrolling takes place over several hundred time steps and the rewards become too diluted to provide a useful training signal. Also, if the only estimated state-value wrapped into the rollout is that of the terminal state, which is defined to be 0, then the critic is effectively removed from training: its estimate never contributes to the target. Even if it were run for a very long time, I doubt that it would be able to learn Breakout.</p>

<h3 id="reflection">Reflection</h3>

<p>There is also the very interesting steep dropoff towards the end of training, when the agent seems to suddenly stop being able to play Breakout. From video capture, it seems as if the agent can still move the paddle to more or less the right place, but can’t keep it there to hit the ball, instead moving it aside at the last moment. This likely starts a positive feedback loop: the agent repeatedly achieves very little reward with the weights that it learned, leading it to unlearn how to play Breakout in a cascade of poor episodes caused by <em>just</em> missing the ball.</p>

<p><img src="/assets/breakout/gifs/moving_aside_1.gif" alt="Moving aside, part 1" title="Moving aside, part 1" /></p>

<p><img src="/assets/breakout/gifs/moving_aside_2.gif" alt="Moving aside, part 2" title="Moving aside, part 2" /></p>

<p>And eventually, it performs more or less like a random agent.</p>

<p><img src="/assets/breakout/gifs/20do_nothing.gif" alt="Doing nothing" title="Doing nothing" /></p>

<p>Here is a video capture of a lucky random policy, for comparison:</p>

<p><img src="/assets/breakout/gifs/random_policy.gif" alt="Random policy" title="Random policy" /></p>

<p>In any case, my agent was able to consistently achieve 200+ reward, which is considered to have “solved” Breakout. It certainly matches, if not surpasses, human-level performance, and although a critical misunderstanding of the A2C algorithm took me two months to unravel, this was an extremely informative learning experience. Writing the code for the algorithm and the network was the easy part; the hard part was training and debugging. I was lucky in that respect — I found a working implementation of the algorithm that I could look at to see which features it had that my code didn’t, and then implement them in my own code one by one.</p>

<p>Some very interesting questions that I would like to explore: why do smaller values of N even work, considering that the action that resulted in the paddle hitting the ball and the reward being issued may not even take place in the same N time steps? Particularly for N = 20 — how was it able to learn something when the reward definitely was not issued in the same batch as the action that led to it? And exactly how much of a role do the entropy and critic losses play? I used the canned coefficients of 0.5 for the critic loss and 0.1 for the entropy, but would the agent learn faster if the critic loss coefficient were increased, placing relatively more weight on the quality of the network’s estimates, or if the entropy coefficient were increased (encouraging more evenly-distributed action probabilities) or decreased (encouraging more confident, distinct action probabilities)?</p>

<p>And the biggest question of all: what exactly is the cliff at the end of training? I have observed that the cliff happens when the softmax action probabilities converge to all zeros and a single one. It must be some sort of overfitting, but is it in the same vein as overfitting in supervised learning, or is it something different? It is a very sharp drop rather than a slow decline, which means that the agent was very good at playing the game before somewhat suddenly becoming very bad. Breakout is deterministic, which means that the loss of uncertainty would be a good sign — likely, the wrong kernels/units are being overly emphasized, which leads to worse decisions.</p>

<p>An interesting hint is that the actor loss goes to zero (again, because the probability of choosing the action that it chose becomes 1, and the log of that becomes 0) but the critic loss explodes, becoming something around 10+ digits long. This suggests that the value estimate for each state is exploding while the obtained rewards stagnate or drop sharply, and since the critic loss is the squared difference of the two, it results in an extremely large loss, which is likely the reason for the agent’s quick decline in performance. This seems quite like a case of exploding gradients, where the network’s state-value estimate goes to infinity or negative infinity (likely the latter) and causes a positive feedback loop where the loss and the gradients get larger and larger.</p>

<p>All in all, a very very good learning experience. Who knew that reinforcement learning was so hard? :P</p>]]></content><author><name></name></author><category term="blog" /><category term="rl" /><category term="project" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Reinforcement Learning Part 1 - K-Armed Bandits</title><link href="http://blog.jzhanson.com/blog/rl/tutorial/2018/01/21/rl-1.html" rel="alternate" type="text/html" title="Reinforcement Learning Part 1 - K-Armed Bandits" /><published>2018-01-21T17:00:00+00:00</published><updated>2018-01-21T17:00:00+00:00</updated><id>http://blog.jzhanson.com/blog/rl/tutorial/2018/01/21/rl-1</id><content type="html" xml:base="http://blog.jzhanson.com/blog/rl/tutorial/2018/01/21/rl-1.html"><![CDATA[<script type="text/javascript" async="" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>

<p>This is the first in a series I’ll be doing on (deep) reinforcement learning where I’ll write about the topic and the interesting parts in a lightweight, easy-to-read format! A lot of this will be based off <a href="http://www.incompleteideas.net/book/bookdraft2017nov5.pdf">Sutton &amp; Barto’s Reinforcement Learning book</a>, and this particular post will be focusing on Chapter 2 from that book. Send any comments or corrections to <a href="mailto:josh@jzhanson.com">josh@jzhanson.com</a>.</p>

<h2 id="the-bandit-problem">The Bandit Problem</h2>

<p>The first time I heard about the bandit problem, I had just entered Carnegie Mellon University’s School of Computer Science. I knew next to nothing about the broader field of computer science. After I emailed the dean, <a href="http://www.cs.cmu.edu/~awm/">Andrew Moore</a>, asking for a bit of advice on finding my life direction, he very kindly set aside a bit of time in his undoubtedly busy schedule to talk with me one-on-one. He spoke about the transition from high school to college, and how one’s vision should appropriately broaden. He spoke about finding your niche, where you fit in and who you fit in with. He spoke about taking what he called <em>technological risks</em> - when you don’t know if something is even possible, but, knowing that you’re surrounded by the best minds in the field, you have a good chance of making something that was previously impossible, possible.</p>

<p>On the topic of a life direction, he introduced to me the <em>bandit problem</em>, which goes as follows: say you have a slot machine in front of you which has two levers - in contrast to normal slot machines, which have one lever and are often called <em>one-armed bandits</em> on account of their one lever. Say the two different levers of this two-armed bandit in front of you both make the slot machine spin and output some reward, but they do so differently, so that pulling one lever or the other results in different payouts. Of course, nothing is for certain, so maybe the first lever has a higher average payout than the second one, or maybe the second one has a higher chance to give you nothing but also a higher chance to make you rich beyond your wildest dreams.</p>

<p>Unfortunately, you don’t know the statistical distributions of the payouts for each lever. But you want to get rich quick, and you only have enough money for, say, 100 lever pulls, so what do you do? One easy strategy is to pick a lever, and keep pulling that one. Maybe you’ll get lucky and pick the “better” lever, or maybe you’ll pick the “worse” lever. If you wanted to be smarter about it, you would sacrifice some initial payout and give each lever a couple pulls, just to see which one <em>seems</em> better, and once you had a good enough guess about which lever was better, spend the rest of your time only pulling that one. Hence, you spend some time in the <em>exploration</em> phase figuring out which lever is the best, and you spend the rest of your time in the <em>exploitation</em> phase, pulling the same lever and getting as much money as you can.<sup><a href="#footnote1">1</a></sup></p>

<p>It is important to note that the tasks of <em>exploration</em> and <em>exploitation</em> are conflicting - your goal is to get as much payout, or reward, as you can, and you get as much money as you can by exploitation. However, you might not know which strategy is best without exploration - exploring might make you try out unknown strategies to make sure that you’re not missing a potential goldmine. You can’t do just one and not the other - only exploring won’t pay off as much, and only exploiting might miss the best lever to pull. Finding the trade-off between the two is one of the most important parts of reinforcement learning.<sup><a href="#footnote2">2</a></sup></p>

<p>What exactly is reinforcement learning? <em>Reinforcement learning</em> is how an <em>agent</em> learns, by itself and by trying out different actions, which actions to take in various situations in order to maximize a <em>reward</em>. A reinforcement learning system has four main parts: a <em>policy</em>, which defines what actions the agent should take in a given situation; a <em>reward signal</em>, which gives a numerical representation of how well the agent is doing at the task or its goal; a <em>value function</em>, which specifies favorable states (where the potential for reward is high) and unfavorable states; and, optionally, a <em>model</em> of the environment, which can range from very simple to very complex and is quite often intractable.</p>

<h2 id="definitions">Definitions</h2>

<p>Note: in this section, notation is kept consistent with Sutton &amp; Barto’s formulations in Chapter 2 of <em>Reinforcement Learning, an Introduction</em>.</p>

<p>A <em>k-armed</em> bandit problem is defined as a situation where, at each <em>time step</em>, the agent has a choice from <em>k</em> different actions where each action results in a <em>reward</em> chosen from some unchanging probability distribution for that action. The agent aims to maximize the total reward gained over some fixed number of time steps, say, 100 or 1000. The analogy is to a bandit slot machine because each action can be likened to pulling a particular one out of the <em>k</em> levers of the slot machine and receiving the reward chosen from the appropriate distribution.</p>
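<p>Under the definitions above, such a bandit can be sketched in a few lines of Python; the class name and interface are my own, and the reward distributions here are the Gaussian ones used in the testbed discussed later:</p>

```python
import random

class KArmedBandit:
    """Each arm pays out from a fixed (stationary) normal distribution."""

    def __init__(self, k, seed=None):
        self.rng = random.Random(seed)
        # True action values q_*(a), hidden from the agent.
        self.q_star = [self.rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, a):
        """Return a reward sampled from N(q_*(a), 1)."""
        return self.rng.gauss(self.q_star[a], 1.0)
```
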

<p>Let’s write this more formally - just like in deep learning, it is easy to read a lot of high-level discussion about reinforcement learning without really understanding anything. The topic is fairly simple, and writing out the base formulations helps keep it that way.</p>

<p>If we call the <em>value</em> of an action the mean reward when that action is taken - recall that the reward is sampled from a distribution and is rarely just a constant - and the action selected on time step \(t\) as \(A_t\) and the reward of that particular action as \(R_t\), we can write the value of an action \(a\) as the expected reward if \(a\) is taken:</p>

\[q_* (a) = E[R_t \vert A_t = a]\]

<p>However, because we don’t always know the <em>true</em> value of every action, we denote our best estimate of the value of action \(a\) as \(Q_t(a)\).</p>

<p>There are a couple ways of estimating \(Q_t(a)\) - one of the most basic is using the <em>sample-average</em> method, which is simply summing up all the rewards received after performing action \(a\) and dividing by the number of times action \(a\) was taken prior to the current time step \(t\).</p>

\[Q_t(a) = \frac{\sum_{i = 1}^{t - 1} R_i \cdot \textbf{1}_{A_i = a}}{\sum_{i = 1}^{t - 1} \textbf{1}_{A_i = a}}\]

<p>Where the bold \(\textbf{1}\) is just a random indicator variable that equals 1 if action \(a\) was taken on time step \(i\) and 0 otherwise, which just serves to make sure that we’re only working with the rewards when we actually took action \(a\).</p>

<p>If we wish to do a <em>greedy</em> action selection (i.e. picking the immediate best action) we just take the max estimated reward over all our actions and pick that one and call it \(A_t\).<sup><a href="#footnote3">3</a></sup></p>

\[A_t \leftarrow \text{argmax}_a Q_t (a)\]

<p>We can begin, now, to formally mesh exploration and exploitation. We want to be exploiting most of the time, so let’s define a small probability \(\varepsilon\) that we explore and select a random action, and the rest of the time, we exploit (with probability \(1-\varepsilon\)) and select the action with the highest estimated reward. We call this type of exploration-exploitation balance <em>\(\varepsilon\)-greedy</em> methods.</p>
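<p>A sketch of \(\varepsilon\)-greedy selection over the estimates \(Q_t(a)\) (the function name and signature are mine):</p>

```python
import random

def epsilon_greedy(Q, eps, rng=random):
    """With probability eps pick a uniformly random action (explore);
    otherwise pick argmax_a Q[a] (exploit)."""
    if rng.random() < eps:
        return rng.randrange(len(Q))
    return max(range(len(Q)), key=lambda a: Q[a])
```
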

<h2 id="updating-with-previous-estimate">Updating with previous estimate</h2>

<p>Now that we’re keeping track of all our estimates for action values \(Q_n\) after we’ve selected a given action \(n - 1\) times, we can show that for any \(n\), we can calculate \(Q_{n+1}\) at that step given only the current estimate \(Q_n\) and the current reward \(R_n\), rather than with all the previous rewards:</p>

\[Q_n \stackrel{.}{=} \frac{R_1 + R_2 + \ldots + R_{n-1}}{n-1}\]

<p>so</p>

\[Q_{n + 1} = \frac{1}{n} \sum_{i = 1}^n R_i\]

\[= \frac{1}{n}(R_n + \sum_{i = 1}^{n - 1} R_i)\]

\[= \frac{1}{n}(R_n + (n-1)\frac{1}{n-1}\sum_{i = 1}^{n - 1} R_i)\]

\[= \frac{1}{n}(R_n + (n-1)Q_n)\]

\[= \frac{1}{n}(R_n + nQ_n - Q_n)\]

\[= Q_n + \frac{1}{n}(R_n - Q_n)\]

<p>This means that to calculate our new estimate, we just need our current estimate and the current reward! It’s also worth noting that the last equation is of the form</p>

\[\text{New estimate} = \text{Old estimate} + \text{Step size} (\text{Target} - \text{Old estimate})\]

<p>which intuitively makes sense - we want to be updating our estimate based off what our previous estimate was and how much the reality differs from our previous estimate, weighted by some learning factor.</p>
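<p>This equivalence is easy to check numerically. Below is a short Python sketch (the function names are mine, for illustration) comparing the incremental update against the plain sample average:</p>

```python
def incremental_estimates(rewards):
    """Running estimates via Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)."""
    q = 0.0
    estimates = []
    for n, r in enumerate(rewards, start=1):
        q = q + (1.0 / n) * (r - q)  # new = old + step size * (target - old)
        estimates.append(q)
    return estimates


def sample_averages(rewards):
    """Plain sample-average estimates, recomputed from scratch each step."""
    return [sum(rewards[:n]) / n for n in range(1, len(rewards) + 1)]
```

<p>Both functions produce identical sequences of estimates, but the incremental version uses constant memory per action instead of storing every past reward.</p>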

<h2 id="my-implementation">My implementation</h2>

<p>I’m working on my own basic implementation of \(\varepsilon\)-greedy methods on a 10-armed testbed where the true reward \(q_*(a)\) for each action is sampled from a normal distribution with mean 0 and variance 1, and the reward per action is sampled from a normal distribution with mean \(q_*(a)\) and variance 1. Stay tuned for results and my own plots - but for the meantime, Sutton &amp; Barto have a good discussion of their sample results.</p>
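<p>For a sense of what that implementation might look like, here is a minimal, self-contained sketch of one \(\varepsilon\)-greedy run on the 10-armed testbed (an illustrative sketch, not my final implementation; all names are placeholders):</p>

```python
import random


def run_bandit(n_arms=10, steps=1000, epsilon=0.1, seed=0):
    """One epsilon-greedy run on the testbed described above.

    True values q*(a) ~ N(0, 1); each reward ~ N(q*(a), 1).
    Returns the fraction of steps on which the optimal arm was chosen.
    """
    rng = random.Random(seed)
    q_true = [rng.gauss(0, 1) for _ in range(n_arms)]
    best = max(range(n_arms), key=lambda a: q_true[a])
    q_est = [0.0] * n_arms  # Q_t(a), sample-average estimates
    counts = [0] * n_arms   # number of times each arm was pulled
    optimal = 0
    for _ in range(steps):
        if rng.random() < epsilon:  # explore: random arm
            a = rng.randrange(n_arms)
        else:                       # exploit: greedy arm
            a = max(range(n_arms), key=lambda i: q_est[i])
        reward = rng.gauss(q_true[a], 1)
        counts[a] += 1
        q_est[a] += (reward - q_est[a]) / counts[a]  # incremental update
        optimal += (a == best)
    return optimal / steps
```

<p>Averaging this fraction over a few thousand independent runs gives learning curves like the ones in Sutton &amp; Barto.</p>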

<hr />
<p><a name="footnote1">1</a>: Andrew Moore said that I was still in the exploration phase, where my goal was to figure out what I wanted to do with my life and what I liked doing - the exploitation phase came later, when I would work at it as hard as I could.</p>

<p><a name="footnote2">2</a>: Things get a bit more complicated once we make the payoffs for each lever change over time - what you thought was the optimal arm to pull might not be, after a while. But we’ll get into that later.</p>

<p><a name="footnote3">3</a>: I use the pseudocode arrow notation for assignment here while Sutton &amp; Barto use the \(\stackrel{.}{=}\) notation to represent a definition</p>]]></content><author><name></name></author><category term="blog" /><category term="rl" /><category term="tutorial" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Algorithms - Selection</title><link href="http://blog.jzhanson.com/blog/practice/code/algorithms/2018/01/17/algos-2.html" rel="alternate" type="text/html" title="Algorithms - Selection" /><published>2018-01-17T03:00:00+00:00</published><updated>2018-01-17T03:00:00+00:00</updated><id>http://blog.jzhanson.com/blog/practice/code/algorithms/2018/01/17/algos-2</id><content type="html" xml:base="http://blog.jzhanson.com/blog/practice/code/algorithms/2018/01/17/algos-2.html"><![CDATA[<script type="text/javascript" async="" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>

<p>Welcome to the second of a series where I write a bit about an interesting algorithm I learned. Send comments or corrections to <a href="mailto:josh@jzhanson.com">josh@jzhanson.com</a>.</p>

<p>This week, we’ll be going over a problem similar to last week’s median of two sorted arrays - finding the kth-smallest element in an <em>unsorted</em> array! This problem is taken from the <a href="https://www.cs.cmu.edu/~15451/lectures/lec01-intro.pdf">first lecture</a> of <a href="https://www.cs.cmu.edu/~15451/">15-451 Algorithm Design and Analysis</a> at CMU this semester - which happened today. I thought the algorithms that were presented were cool and worth writing a post about.</p>

<p>Note: in this post, the algorithms will be all sequential - therefore, the work equals the span.</p>

<p>Second note: I’m considering whether or not to use LaTeX in some parts - it adds mathematical precision and rigor but it makes the tone of the post a little too formal.</p>

<h2 id="the-problem">The problem</h2>

<p>Let’s define terms first. Say we have a sorted array of elements, not necessarily consecutive. We define an element’s <em>rank</em> to be its position in the sorted array, starting from 1. For example, if we have the array [1, 3, 6, 7, 14, 20, …], the element 1 has rank 1, the element 3 has rank 2, the element 6 has rank 3, and so on.</p>

<p>Our problem: given an unsorted array A of length n and an integer k, find the kth-smallest element. Note that we can find the median of this unsorted array by taking the element with rank n/2 if n is even and n/2 + 1 if n is odd. Also, if the array is sorted, then the kth smallest element is trivially the element with index k.</p>

<p>It is important to always precisely state the input and output of the problem - it helps you understand what the problem is asking and prevents you from solving an adjacent but different problem.</p>

<p><strong>Input</strong>: An array A of n unsorted data elements with a total order (which just means that the elements can always be compared against each other and “greater,” “less,” and “equal” are defined), and an integer k in the range 1 to n, inclusive.</p>

<p><strong>Output</strong>: The element of A with rank k.</p>

<h2 id="algorithm-1-quick-select">Algorithm #1: Quick select</h2>

<p>If we look at the problem, we see that it bears some resemblance to quicksort - in fact, whenever the sorted-ness of an array is mentioned in a problem, a good starting point is to think about different sorting algorithms - <em>selection sort</em>, <em>insertion sort</em>, <em>mergesort</em>, <em>quicksort</em>, and maybe <em>heap sort</em> or <em>radix/bucket sort</em> if you know extra information about the elements.</p>

<p>In particular, let’s think about quicksort, which is sequential - thinking about mergesort won’t go too far in this case, because after we split the array, we only care about the half that the median is in. In addition, we can’t make any assumptions about the elements in the subarrays after we split in mergesort, while in quicksort we know that the elements in each half of the array are less than the pivot element. We’ll be looking at <em>randomized</em> quicksort, which means that instead of always picking the “middle” index or the “first” index, we pick an element uniformly at random from the array to be the pivot.</p>

<p>Here’s the quicksort algorithm and pseudocode:</p>

<ol>
  <li>
    <p>Pick a pivot element x from the array uniformly at random.</p>
  </li>
  <li>
    <p>Put elements that are <em>less than or equal to</em> x before it and elements that are <em>greater than</em> x after it. Let L be the subarray of elements before x and R be the subarray of elements after x.</p>
  </li>
  <li>
    <p>Recursively call quicksort on L and R.</p>
  </li>
</ol>

<p>Note that while quicksort (and the other algorithms presented in this post) work fine with duplicate elements, it simplifies our discussion a little to assume all elements in A are distinct.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def quicksort(A):
    if |A| &lt;= 1:
        return A
    x = uniformly random element of A
    L = all elements of A less than x
    R = all elements of A greater than x
    L' = quicksort(L)
    R' = quicksort(R)
    return L' + x + R'

</code></pre></div></div>

<p>The bars around an array stand for “length of” that array.</p>

<p>We can make an observation here that lets us adapt this algorithm for finding the kth element: <em>we actually know the lengths of L and R</em>. This means that we can recursively call the algorithm on the subarray that the kth element falls in. If we recur into the left subarray, we leave k as is; if we recur into the right subarray, we subtract the length of the left subarray plus one (for the pivot) from k.</p>

<p>Specifically, if there are k elements or more in L, we know the element of rank k lies in L. If there are less than k-1 elements in L, then the element of rank k lies in R. We can additionally say that if there are exactly k-1 elements in L, then x is the element of rank k and we’re done!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def quickselect(A, k):
    if |A| = 1:
        return A[1]
    x = uniformly random element of A
    L = all elements of A less than x
    R = all elements of A greater than x
    if |L| == k-1:
        return x
    else if |L| &gt;= k:
        return quickselect(L, k)
    else:
        return quickselect(R, k-|L|-1)

</code></pre></div></div>
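<p>The pseudocode above translates almost line-for-line into runnable Python (a sketch assuming distinct elements and 1-indexed ranks):</p>

```python
import random


def quickselect(a, k):
    """Return the element of rank k (1-indexed) in the list a.

    Assumes the elements of a are distinct.
    """
    if len(a) == 1:
        return a[0]
    x = random.choice(a)  # pivot chosen uniformly at random
    left = [e for e in a if e < x]
    right = [e for e in a if e > x]
    if len(left) == k - 1:      # the pivot itself has rank k
        return x
    elif len(left) >= k:        # rank-k element lies in left
        return quickselect(left, k)
    else:                       # skip left and the pivot
        return quickselect(right, k - len(left) - 1)
```
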

<h3 id="runtime-analysis">Runtime analysis</h3>

<p>Let’s do some runtime analysis! Runtime in this context is number of comparisons. We aim to show the entire algorithm has expected runtime O(n).</p>

<p>Informally:</p>

<p>It takes linear O(n) time to construct L and R, since we have to walk through the array and put each element into either L or R. The recursive call is either on the larger side or the smaller side, but we can simplify our worst-case analysis by forcing the recursive call to always be on the larger half.</p>

<p>Suppose, as an inductive hypothesis, that the expected runtime is at most \(dn\) for some constant \(d\), for all input sizes smaller than \(n\). Each of the n elements is chosen as the pivot with probability 1/n, and depending on the pivot’s rank, the size of the larger side is n-1, n-2, …, n/2, n/2, n/2 + 1, …, n-1 (note how the size of the larger side shrinks to n/2 and then grows again as the pivot’s rank passes the median).</p>

<p>Formally, we have the recurrence</p>

\[T(n) = cn + E[T(\text{larger side})], \; T(1) = 1\]

\[= cn + \frac{1}{n} T(n-1) + \ldots + \frac{1}{n} T(\frac{n}{2}) + \frac{1}{n} T(\frac{n}{2}) + \ldots + \frac{1}{n} T(n - 1)\]

\[= cn + \frac{2}{n} \sum_{i = \frac{n}{2}}^{n - 1} T(i)\]

\[\leq cn + \frac{2}{n} (d(n - 1) + d(n - 2) + \ldots + d(\frac{n}{2}))\]

\[\leq cn + \frac{3}{4} dn \leq dn \quad \text{if} \quad d = 4c\]

<p>Of course, it is unlikely that writing and solving a recurrence will be required in anything other than an academic setting. Note also that our O(n) runtime is <em>in expectation</em>, which means that we <em>could</em> see worse runtime (namely, O(n<sup>2</sup>), if we consistently pick bad or the worst pivots, just like in quicksort), but this is unlikely.</p>

<h2 id="algorithm-2-median-of-medians">Algorithm #2: Median of medians</h2>

<p>While the above quicksort-based method is most likely the one that will be expected in a programming interview, it is interesting to explore a rather elegant linear-time deterministic algorithm posed by <a href="https://amturing.acm.org/award_winners/blum_4659082.cfm">Manuel Blum</a> (Turing Award winner), <a href="https://amturing.acm.org/award_winners/floyd_3720707.cfm">Robert Floyd</a> (Turing Award Winner), <a href="https://en.wikipedia.org/wiki/Vaughan_Pratt">Vaughan Pratt</a> (helped found Sun Microsystems), <a href="https://amturing.acm.org/award_winners/rivest_1403005.cfm">Ronald Rivest</a> (Turing Award winner), and <a href="https://amturing.acm.org/award_winners/tarjan_1092048.cfm">Robert Tarjan</a> (Turing Award winner).</p>

<p>It goes like this:</p>

<ol>
  <li>
    <p>Break the input into groups of 5 elements. For example, the array [4, 3, 7, 5, 8, 1, 0, 2, 9, 6, …] would be broken up into [4, 3, 7, 5, 8], [1, 0, 2, 9, 6], and so on in linear time.</p>
  </li>
  <li>
    <p>Find the median of each group in linear time - because finding the median of exactly five elements takes constant time.</p>
  </li>
  <li>
    <p>Find the median of these medians recursively - let’s call it x. If we assume that the algorithm is indeed O(n), then this takes T(n/5).</p>
  </li>
  <li>
    <p>Construct L from all elements less than or equal to x and R from all elements greater than x, just like in quicksort or quickselect. 1/2 of the groups of 5 will have medians less than x, and 1/2 of the groups of 5 will have medians greater than x. Within each group where the median is less than x, the two smallest elements are less than the median and are therefore less than x. Likewise, for each group of 5 where the median is greater than x, the two largest elements are greater than the median and are therefore greater than x. Therefore, at least 1/2 (groups less than x) * 3/5 (elements less than x per group of 5 - the 3 comes from the two elements less than the median and the median itself) = 3/10 of the total elements are less than x, and likewise 3/10 of the total elements are greater than x - see the below picture for the intuition behind this. <img src="/assets/prog-2/medians.jpg" alt="Median of medians" title="Median of medians" /> This means that the larger half of the array is <em>at most</em> 7/10 the size of the original array. Therefore, this step takes T(7n/10), if we simplify matters and always analyze the larger half of the array - it is worst-case analysis, after all.</p>
  </li>
  <li>
<p>Recursively call median of medians on the side of the array that the rank-k element lies in - again, if |L| &gt;= k, then recur on L; if |L| = k - 1, then pick x; and if |L| &lt; k - 1, then recur on R with k replaced by k - |L| - 1.</p>
  </li>
</ol>
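<p>The five steps above can be transcribed directly into Python (a sketch assuming distinct elements and 1-indexed ranks, with no attention paid to constant factors):</p>

```python
def median_of_medians(a, k):
    """Deterministic selection: return the element of rank k (1-indexed) in a.

    Assumes the elements of a are distinct.
    """
    if len(a) <= 5:
        return sorted(a)[k - 1]  # tiny inputs: sort directly
    # Steps 1-2: break into groups of 5 and take each group's median.
    groups = [a[i:i + 5] for i in range(0, len(a), 5)]
    medians = [sorted(g)[len(g) // 2] for g in groups]
    # Step 3: find the median of the medians recursively.
    x = median_of_medians(medians, (len(medians) + 1) // 2)
    # Step 4: partition around x, as in quickselect.
    left = [e for e in a if e < x]
    right = [e for e in a if e > x]
    # Step 5: recurse into the side that rank k falls in.
    if len(left) == k - 1:
        return x
    elif len(left) >= k:
        return median_of_medians(left, k)
    else:
        return median_of_medians(right, k - len(left) - 1)
```
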

<h3 id="runtime-analysis-1">Runtime Analysis</h3>

<p>For the runtime analysis, it is a bit tricky to arrive at the desired O(n) bound without writing and solving the recurrence. The key observation is that each recursive step does O(n) work plus two recursive calls, T(n/5) and T(7n/10), whose arguments sum to 9n/10 &lt; n. The total input size therefore decreases geometrically from one level of the recursion to the next, so the recurrence is dominated by the work at the root: O(n).</p>

<p>Formally, we can draw a brick diagram of runtimes and then show, with the aid of an infinite sum, that geometrically decreasing runtime per step is effectively a constant.</p>

<p><img src="/assets/prog-2/brick.jpg" alt="Brick diagram" title="Brick diagram" /></p>

\[T(n) \leq cn (1 + \frac{9}{10} + (\frac{9}{10})^2 + (\frac{9}{10})^3 + \ldots)\]

\[\text{Formula for geometric sum is} \quad \frac{1}{1 - a}, \quad \text{where} \quad a = \frac{9}{10}, \quad \text{so}\]

\[T(n) \leq cn(10) \in O(n)\]

<p>It is interesting to note that if we break the input into groups of three, we are unable to show the O(n) upper bound. The first recursive term in the recurrence becomes T(n/3) and the second becomes T(2n/3) - we can only guarantee that the median of medians is greater than 2n/6 = n/3 elements, so the larger side of the array is at most 2n/3. The two arguments now sum to n, so each recursive step does <em>the same work</em> as the last: the recurrence is <em>balanced</em> rather than <em>root dominated</em>, which gives us O(n log n) runtime.</p>

<p><img src="/assets/prog-2/groups-3.jpg" alt="Groups of three" title="Groups of three" /></p>

<h2 id="conclusion">Conclusion</h2>

<p>In this post, we introduced the problem of selection and explained two algorithms that solve it: a randomized algorithm based on quicksort that finds the k-th element in O(n) expected work, and a deterministic algorithm that finds the k-th element in O(n) work always. We also did some runtime analysis with recurrences, a powerful tool to formally show tight runtime bounds for recursive algorithms that would be difficult or impossible to arrive at informally.</p>]]></content><author><name></name></author><category term="blog" /><category term="practice" /><category term="code" /><category term="algorithms" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Deep Learning Part 2 - Restricted Boltzmann Machines and Feedforward Neural Networks</title><link href="http://blog.jzhanson.com/blog/dl/tutorial/2018/01/12/dl-2.html" rel="alternate" type="text/html" title="Deep Learning Part 2 - Restricted Boltzmann Machines and Feedforward Neural Networks" /><published>2018-01-12T05:45:00+00:00</published><updated>2018-01-12T05:45:00+00:00</updated><id>http://blog.jzhanson.com/blog/dl/tutorial/2018/01/12/dl-2</id><content type="html" xml:base="http://blog.jzhanson.com/blog/dl/tutorial/2018/01/12/dl-2.html"><![CDATA[<script type="text/javascript" async="" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>

<p>This is the second in a several-part series on the basics of deep learning, presented in an easy-to-read, lightweight format. <a href="/blog/dl/tutorial/2017/12/30/dl-1.html">Here</a> is a link to the first one. Previous experience with basic probability and matrix algebra will be helpful, but not required. Send any comments or corrections to <a href="mailto:josh@jzhanson.com">josh@jzhanson.com</a>.</p>

<p>Mathematically, Restricted Boltzmann Machines are derived from the Boltzmann distribution plus matrix algebra, which we’ll go over in this post. We’ll also use that as a bridge to connect to the basics of neural networks.</p>

<h2 id="the-boltzmann-distribution">The Boltzmann Distribution</h2>

<p>Let us first define <strong>x</strong> to be a vector of <em>n</em> outcomes, where each <em>x<sub>i</sub></em> can either be 0 or 1. Of course, each <em>x<sub>i</sub></em> can have a different probability of being 1. The probabilities can even be conditional, <em>a la</em> Markov Chains. But more on that later. In the previous post, we have usually thought of <em>x</em> as being a single random variable. Here, however, it is a vector of individual random variables. We are assuming the <strong>discrete</strong> case here, where each element of a vector can either be 0 or 1.</p>

\[\textbf{x} = \begin{bmatrix} x_1  &amp;  x_2  &amp;  \ldots  &amp;  x_n \end{bmatrix}, \: x_i \in \{0, 1\}\]

<p>With that definition out of the way, we can examine the Boltzmann distribution, invented by Ludwig Boltzmann, which models a bunch of things in physics, like how a hot object cools, or how energy dissipates into the environment. We have</p>

\[p(x) = \frac{1}{Z} \exp (-E(\textbf{x})), \: E(\textbf{x}) = - \textbf{x}^T \textbf{U} \textbf{x} - \textbf{b}^T \textbf{x}\]

<p>Here, <em>Z</em> is the partition function or normalizing constant, which makes sure that the distribution sums to one. Computing <em>Z</em> exactly is intractable in general, because it requires summing over all possible configurations of <strong>x</strong> - if <strong>x</strong> has <em>n</em> elements, there are <em>2<sup>n</sup></em> of them.</p>

<p>The exp function raises the constant <em>e</em> to its argument - here, the negated <em>energy function</em>. Within the energy function, <strong>U</strong> is the matrix of weights that our variable <strong>x</strong> interacts with, and <strong>b</strong> is the vector of biases for each element of <strong>x</strong>. For now, let’s force <strong>U</strong> to be symmetric.</p>

<p>If we expand the first matrix multiplication term,</p>

\[\textbf{x}^T \textbf{U} \textbf{x} =
  \begin{bmatrix} x_1  &amp;  x_2  &amp;  \ldots  &amp;  x_n \end{bmatrix}
  \Bigg[ \textbf{u}_1  \quad  \textbf{u}_2  \quad  \ldots  \quad  \textbf{u}_n \Bigg]
  \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}

  = \begin{bmatrix} \textbf{x}^T \textbf{u}_1  &amp;  \textbf{x}^T \textbf{u}_2  &amp;  \ldots  &amp;  \textbf{x}^T \textbf{u}_n \end{bmatrix}
  \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}\]

<p>Which we observe is a scalar, since each <strong>x</strong><sup>T</sup><strong>u</strong><sub><em>i</em></sub> is a scalar.</p>

<h2 id="rbms">RBMs</h2>

<p>To formally define a <strong>Restricted Boltzmann Machine</strong> (referred to as a <strong>RBM</strong>), we need to make a couple things clear. So far, we’ve thought of the input to the energy function, the vector <strong>x</strong>, as our observations or samples from the distribution. RBMs switch that up a little - they assume that the state vector <strong>x</strong> is composed of two parts: some number of <em>visible</em> variables <strong>v</strong>, and some number of <em>hidden</em> variables <strong>h</strong>.</p>

\[\textbf{x} = (\textbf{v}, \textbf{h})\]

<p>Why do we explicitly split <strong>x</strong> into the visible and hidden variables? It turns out that modeling the interaction between visible and hidden variables is very powerful - in fact, by modeling these interactions and stacking RBMs, we can do a lot of cool things.</p>

<p>We can then rewrite the energy function:</p>

\[E(\textbf{v}, \textbf{h}) = - \begin{bmatrix} \textbf{v}^T  &amp;  \textbf{h}^T \end{bmatrix}  \begin{bmatrix} \textbf{R}  &amp;  \frac{1}{2}\textbf{W} \\ \frac{1}{2}\textbf{W}^T  &amp;  \textbf{S} \end{bmatrix}
  \begin{bmatrix} \textbf{v} \\ \textbf{h} \end{bmatrix}

  - \begin{bmatrix} \textbf{b}^T  &amp;  \textbf{a}^T \end{bmatrix}
  \begin{bmatrix} \textbf{v} \\ \textbf{h} \end{bmatrix}\]

<p>Note that we have decomposed <strong>U</strong> into four quarters, which are themselves matrices and which we compose out of matrices we name <strong>R</strong>, <strong>W</strong>, and <strong>S</strong>, and we have split the bias vector into two parts, <strong>b</strong> and <strong>a</strong>, which are multiplied by <strong>v</strong> and <strong>h</strong> respectively. Because <strong>U</strong> is symmetric, the upper-right and lower-left quarters must be each other’s transpose. We name them <em>1/2</em> <strong>W</strong> instead of just <strong>W</strong> for reasons that will become clear once we expand the first matrix multiplication:</p>

\[\begin{bmatrix} \textbf{v}^T  &amp;  \textbf{h}^T \end{bmatrix}
  \begin{bmatrix} \textbf{R}  &amp;  \frac{1}{2}\textbf{W} \\ \frac{1}{2}\textbf{W}^T  &amp;  \textbf{S} \end{bmatrix}
  \begin{bmatrix} \textbf{v} \\ \textbf{h} \end{bmatrix}\]

\[= \begin{bmatrix} \textbf{v}^T \textbf{R} + \frac{1}{2} \textbf{h}^T \textbf{W}^T  &amp;  \frac{1}{2} \textbf{v}^T \textbf{W} + \textbf{h}^T \textbf{S} \end{bmatrix}
  \begin{bmatrix} \textbf{v} \\ \textbf{h} \end{bmatrix}\]

\[= \textbf{v}^T \textbf{R} \textbf{v} + \frac{1}{2} \textbf{h}^T \textbf{W}^T \textbf{v} + \frac{1}{2} \textbf{v}^T \textbf{W} \textbf{h} + \textbf{h}^T \textbf{S} \textbf{h}\]

<p>and by applying the property of matrix multiplication that (<strong>AB</strong>)<sup>T</sup> = <strong>B</strong><sup>T</sup><strong>A</strong><sup>T</sup> on the second term, we have</p>

\[\textbf{h}^T \textbf{W}^T \textbf{v} = (\textbf{W} \textbf{h})^T \textbf{v} = [\textbf{v}^T (\textbf{W} \textbf{h})]^T = \textbf{v}^T \textbf{W} \textbf{h}\]

<p>The last equality is because the triple matrix multiplication results in a scalar value and the transpose of a scalar value is the scalar value. Therefore,</p>

\[E(\textbf{v}, \textbf{h})= - (\textbf{v}^T \textbf{R} \textbf{v} + \textbf{v}^T \textbf{W} \textbf{h} + \textbf{h}^T \textbf{S} \textbf{h}) - (\textbf{b}^T \textbf{v} + \textbf{a}^T \textbf{h})\]

<p>We can actually see that <strong>R</strong> models the interactions among visible variables and <strong>S</strong> models the interactions among hidden variables. If we ignore those two matrix multiplication terms and focus only on the interactions of visible variables with hidden variables, we have the modified energy function</p>

\[E(\textbf{v}, \textbf{h})= - \textbf{v}^T \textbf{W} \textbf{h} - \textbf{b}^T \textbf{v} - \textbf{a}^T \textbf{h}\]

<p>which is the basis of a <strong>Restricted Boltzmann Machine</strong> - the difference between an RBM and a normal Boltzmann Machine is we forget about the visible-visible and hidden-hidden interactions and only concern ourselves with the visible-hidden interactions.</p>
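<p>To make this concrete, the restricted energy function is only a few lines of code. Below is a small sketch in plain Python (the parameter values are made up for illustration):</p>

```python
def rbm_energy(v, h, W, b, a):
    """E(v, h) = -v^T W h - b^T v - a^T h, the restricted energy above.

    v, h are 0/1 lists; W is a D x F weight matrix; b, a are bias vectors.
    """
    D, F = len(v), len(h)
    vWh = sum(v[i] * W[i][j] * h[j] for i in range(D) for j in range(F))
    bv = sum(b[i] * v[i] for i in range(D))
    ah = sum(a[j] * h[j] for j in range(F))
    return -vWh - bv - ah


# A tiny example: D = 2 visible units, F = 3 hidden units.
W = [[1.0, 0.0, -1.0],
     [0.5, 2.0, 0.0]]
b = [0.1, -0.2]
a = [0.0, 0.3, 0.0]
```
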

<h2 id="conditional-derivation">Conditional Derivation</h2>

<p>With our new energy function, we can write the joint distribution of <strong>v</strong> and <strong>h</strong> for a RBM. Here comes the really cool stuff.</p>

\[P(\textbf{v}, \textbf{h}; \theta) = \frac{1}{Z(\theta)} \exp (-E(\textbf{v}, \textbf{h}; \theta))
  \quad \text{where} \quad Z(\theta) = \sum_\textbf{v} \sum_\textbf{h} \exp(-E(\textbf{v}, \textbf{h}; \theta))\]

<p>The following derivation of the conditional distribution of <strong>h</strong> is an expansion of the derivation found in the first couple pages of <a href="https://tspace.library.utoronto.ca/handle/1807/19226">Ruslan Salakhutdinov’s PhD thesis</a>, so I use the same notation here, where <em>theta</em> is <strong>W</strong>, <strong>b</strong>, and <strong>a</strong>, and the semicolon stands for “given” or “dependent upon” while the commas denote parameters of the joint distribution.</p>

<p>Because we’re working in the discrete case, we say that <strong>v</strong> and <strong>h</strong> are <em>D</em> and <em>F</em> dimensional vectors, all of elements that can be either 0 or 1.</p>

\[\textbf{v} \in \{0, 1\}^D \quad \text{and} \quad \textbf{h} \in \{0, 1\}^F\]

<p>We aim to find the conditional distribution of <strong>h</strong> given <strong>v</strong>, because that would allow us to model the distribution of the hidden variables given values of the visible variables. We can start by applying the definition of conditional probability to rewrite the conditional in terms of the joint, which we have above, and the marginal in the denominator, which we will proceed to derive.</p>

\[P(\textbf{h} \vert \textbf{v}; \theta) = \frac{P(\textbf{v}, \textbf{h}; \theta)}{P(\textbf{v}; \theta)}\]

<p>To derive the marginal, we take the joint distribution on <strong>v</strong> and <strong>h</strong> and sum over all values of <strong>h</strong> and expand, replacing matrix multiplication terms with sigma notation.</p>

\[P(\textbf{v}; \theta) = \sum_h P(\textbf{v}, \textbf{h}; \theta) = \frac{1}{Z(\theta)} \sum_h \exp (-E(\textbf{v}, \textbf{h}; \theta))\]

\[= \frac{1}{Z(\theta)} \sum_h \exp (-(- \textbf{v}^T \textbf{W} \textbf{h} - \textbf{b}^T \textbf{v} - \textbf{a}^T \textbf{h}))\]

\[= \frac{1}{Z(\theta)} \sum_h \exp (\sum_{i = 1}^D \sum_{j = 1}^F v_i W_{ij} h_j + \sum_{i = 1}^D b_i v_i + \sum_{j = 1}^F a_j h_j)\]

<p>We can bring out the <em>b<sub>i</sub> v<sub>i</sub></em> term out of the exp and the outer summation as a product, because <em>e<sup>a + b</sup> = e<sup>a</sup> e<sup>b</sup></em>.</p>

\[= \frac{1}{Z(\theta)} \cdot \exp(\sum_{i = 1}^D b_i v_i) \cdot \sum_h \exp (\sum_{i = 1}^D \sum_{j = 1}^F v_i W_{ij} h_j + \sum_{j = 1}^F a_j h_j)\]

<p>We can also swap the double summations in the latter exp as well as pull out the <em>h<sub>j</sub></em>, because it only depends on <em>j</em> and not <em>i</em>, and then pull out the <em>j = 1</em> to <em>F</em> summation.</p>

\[= \frac{1}{Z(\theta)} \cdot \exp(\sum_{i = 1}^D b_i v_i) \cdot \sum_h \exp (\sum_{j = 1}^F ( \sum_{i = 1}^D v_i W_{ij}) h_j + \sum_{j = 1}^F a_j h_j)\]

\[= \frac{1}{Z(\theta)} \cdot \exp(\sum_{i = 1}^D b_i v_i) \cdot \sum_h \exp \Big[ \sum_{j = 1}^F ( ( \sum_{i = 1}^D v_i W_{ij}) h_j + a_j h_j) \Big]\]

<p>Just like we did above, we can use the fact that <em>e<sup>a + b</sup> = e<sup>a</sup> e<sup>b</sup></em> to pull out the <em>j = 1</em> to <em>F</em> summation out of the exp and turn it into a product.</p>

\[= \frac{1}{Z(\theta)} \cdot \exp(\sum_{i = 1}^D b_i v_i) \cdot \sum_h \exp \Big[ \sum_{j = 1}^F (( \sum_{i = 1}^D v_i W_{ij}) h_j + a_j h_j) \Big]\]

\[= \frac{1}{Z(\theta)} \cdot \exp(\sum_{i = 1}^D b_i v_i) \cdot \sum_h \prod_{j = 1}^F \exp ((\sum_{i = 1}^D v_i W_{ij}) h_j + a_j h_j)\]

<p>Now it seems fairly intuitive that you can switch the product and the sum, especially if we remember that each <em>h<sub>j</sub></em> must be either 0 or 1. Indeed, if we simply take the two cases which <em>h<sub>j</sub></em> can be and plug in <em>h<sub>j</sub></em> = 0 (which cancels everything out and exp(0) = 1) and <em>h<sub>j</sub></em> = 1, we arrive at</p>

\[= \frac{1}{Z(\theta)} \cdot \exp(\sum_{i = 1}^D b_i v_i) \cdot \prod_{j = 1}^F \sum_{h_j \in \{0, 1 \}} \exp ((\sum_{i = 1}^D v_i W_{ij}) h_j + a_j h_j)\]

\[= \frac{1}{Z(\theta)} \cdot \exp(\sum_{i = 1}^D b_i v_i) \cdot \prod_{j = 1}^F (1 + \exp (\sum_{i = 1}^D v_i W_{ij} + a_j))\]

<p>If you’re willing to take this on faith, skip the next subheading and go to <a href="#plugging-in">Plugging in</a>. If you would like a detailed explanation of why this is true, read on!</p>

<h3 id="expansion-of-the-product-sum">Expansion of the product-sum</h3>

<p>To formally derive that</p>

\[\sum_h \prod_{j = 1}^F \exp ((\sum_{i = 1}^D v_i W_{ij}) h_j + a_j h_j)
= \prod_{j = 1}^F (1 + \exp (\sum_{i = 1}^D v_i W_{ij} + a_j))\]

<p>Let’s define a function as follows:</p>

\[f(j, h_j; \theta) = \exp ((\sum_{i = 1}^D W_{ij} v_i) h_j + a_j h_j)\]

<p>for our hidden variable vector,</p>

\[\textbf{h} = \begin{bmatrix} h_1 &amp; h_2 &amp; \ldots &amp; h_F \end{bmatrix}, h_j \in \{ 0, 1 \}\]

<p>Note that</p>

\[f(j, 0; \theta) = 1 \quad \text{and} \quad f(j, 1; \theta) = \exp (\sum_{i = 1}^D W_{ij} v_i + a_j) \quad \forall j\]

<p>Therefore, the whole product is equal to evaluating the product on a subset of the terms where \(h_j = 1\).</p>

\[\prod_{j = 1}^F f(j, h_j; \theta) = \prod_{j \in \{i_1, \ldots, i_k \}} f(j, 1; \theta) \quad \text{where} \quad h_j = 1, \: j \in \{ i_1, i_2, \ldots, i_k \}\]

<p>We want to make statements and write equations about <em>all</em> vectors of this type. Any vector of this type has \(k\) ones, and because the vectors are \(F\)-dimensional, there are \(F - k\) zeroes. The ones can be distributed in any fashion - evidently, summation notation is insufficient, and adding combinations into the mix won’t strengthen the concept…how about we use an uppercase kappa, standing for “k-combinations of products” in the same vein as the uppercase sigma for sum and pi for product? Another option: lowercase nu, which looks like a \(\nu\)?</p>

<p>Hereafter, we denote “sum across all vectors <strong>h</strong> with dimension <em>F</em> and from <em>k</em> = 0 to <em>F</em> ones” as</p>

\[\underset{j \in \{i_1, \ldots, i_k \} }{K}\]

<p>In any case, we can write that the latter portion of the equation up there with this new function <em>f</em> and our new notation as</p>

\[\underset{j \in \{i_1, \ldots, i_k \} }{K} f(j, h_j; \theta)\]

<p>which denotes summing, over all vectors <strong>h</strong> with 0 to <em>F</em> ones and all other elements zero, the products of \(f(j, h_j; \theta)\), where <em>j</em> is the vector element index and <em>h<sub>j</sub></em> is the element at that index - the product \(\prod_{j = 1}^F\) is included in the <em>kappa</em> notation.</p>

<p>To expand it and make it a little less abstract, we have</p>

\[= \big[ f(1, 0; \theta) f(2, 0; \theta) \ldots f(F, 0; \theta) \big]\]

\[+ \big[ f(1, 1; \theta) f(2, 0; \theta) \ldots f(F, 0; \theta) + f(1, 0; \theta) f(2, 1; \theta) \ldots f(F, 0; \theta) + \ldots + f(1, 0; \theta) f(2, 0; \theta) \ldots f(F, 1; \theta) \big]\]

\[+ \ldots\]

\[+ \big[ f(1, 1; \theta) f(2, 1; \theta) \ldots f(F, 1; \theta) \big]\]

<p>where between each set of square brackets is all vectors <strong>h</strong> with <em>k</em> = 0, <em>k</em> = 1, and <em>k</em> = <em>F</em> ones. There is one vector each for <em>k</em> = 0 and <em>k</em> = <em>F</em> and there are <em>F</em> vectors for <em>k</em> = 1, and <em>F</em> choose two vectors for <em>k</em> = 2, and so on.</p>

<p>Now here’s our doozy: because all \(f(j, 0; \theta)\) turn into ones, we can actually factor the <em>entire expression</em> into</p>

\[= \prod_{j = 1}^F (1 + \exp (\sum_{i = 1}^D W_{ij} v_i + a_j))\]

<p>It might be a bit easier to see with an example. Let’s factor the two dimensional case, <em>F</em> = 2 with the four vectors
\(\textbf{h} = \begin{bmatrix} 0 &amp; 0 \end{bmatrix}, \begin{bmatrix} 0 &amp; 1 \end{bmatrix} , \begin{bmatrix} 1 &amp; 0 \end{bmatrix} , \begin{bmatrix} 1 &amp; 1 \end{bmatrix}\)</p>

<p>We have</p>

\[\underset{j \in \{i_1, i_2 \} }{K} f(j, h_j; \theta) = f(1, 0) f(2, 0) + \big[ f(1, 1) f(2, 0) + f(1, 0) f(2, 1) \big] + f(1, 1) f(2, 1)\]

\[= 1 + 1 \cdot f(1, 1) + 1 \cdot f(2, 1) + f(1, 1) f(2, 1) = (1 + f(1, 1))(1 + f(2, 1)) = \prod_{j = 1}^2 (1 + f(j, 1))\]

\[= \prod_{j = 1}^2 (1 + \exp (\sum_{i = 1}^D W_{ij} v_i + a_j))\]

<p>which seems like a whole lot of ado for what could have been a simple expansion, but I found this to be a neat math trick :).</p>
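<p>We can also sanity-check the factorization numerically. Below is a quick brute-force comparison with small made-up parameters (the dimensions and all values of <em>W</em>, <em>v</em>, and <em>a</em> are arbitrary): summing the product over every one of the \(2^F\) hidden vectors agrees with the factored product.</p>

```python
import itertools
import math
import random

random.seed(0)
D, F = 3, 4  # small made-up dimensions
W = [[random.uniform(-1, 1) for _ in range(F)] for _ in range(D)]
v = [1, 0, 1]
a = [random.uniform(-1, 1) for _ in range(F)]

# activation of hidden unit j: sum_i W_ij v_i + a_j
act = [sum(W[i][j] * v[i] for i in range(D)) + a[j] for j in range(F)]

# brute force: sum over all 2^F hidden vectors of prod_j exp(act_j * h_j)
brute = sum(
    math.prod(math.exp(act[j] * h[j]) for j in range(F))
    for h in itertools.product([0, 1], repeat=F)
)

# factored form: prod_j (1 + exp(act_j))
factored = math.prod(1 + math.exp(act[j]) for j in range(F))

assert abs(brute - factored) < 1e-9
```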

<h3 id="plugging-in">Plugging in</h3>

<p>Now that we have expanded the marginal, note that because we never manipulate the summation over all <strong>h</strong> except in the last step, we can similarly expand the joint distribution \(P(\textbf{v}, \textbf{h}; \theta)\) using the same steps.</p>

\[P(\textbf{h} \vert \textbf{v}; \theta) = \frac{P(\textbf{v}, \textbf{h}; \theta)}{P(\textbf{v}; \theta)} = \frac{\frac{1}{Z(\theta)} \exp (-E(\textbf{v}, \textbf{h}; \theta))}{P(\textbf{v}; \theta)}\]

\[= \frac{\frac{1}{Z(\theta)} \exp (\sum_{i = 1}^D \sum_{j = 1}^F v_i W_{ij} h_j + \sum_{i = 1}^D b_i v_i + \sum_{j = 1}^F a_j h_j)}{P(\textbf{v}; \theta)}\]

\[= \frac{\frac{1}{Z(\theta)} \exp (\sum_{i = 1}^D b_i v_i) \cdot \exp (\sum_{j = 1}^F \sum_{i = 1}^D v_i W_{ij} h_j + \sum_{j = 1}^F a_j h_j)}{P(\textbf{v}; \theta)}\]

\[= \frac{\frac{1}{Z(\theta)} \exp (\sum_{i = 1}^D b_i v_i) \cdot \prod_{j = 1}^F \exp ((\sum_{i = 1}^D v_i W_{ij}) h_j + a_j h_j)}{\frac{1}{Z(\theta)} \exp (\sum_{i = 1}^D b_i v_i) \cdot \prod_{j = 1}^F (1 + \exp(\sum_{i = 1}^D W_{ij} v_i + a_j))}\]

<p>Cancelling terms and pulling out the product,</p>

\[= \prod_{j = 1}^F \frac{\exp ((\sum_{i = 1}^D v_i W_{ij}) h_j + a_j h_j)}{1 + \exp(\sum_{i = 1}^D W_{ij} v_i + a_j)}\]

<p>which we can write as the element-wise conditional</p>

\[= \prod_{j = 1}^F P(h_j \vert \textbf{v}; \theta) \quad \text{where} \quad P(h_j \vert \textbf{v}; \theta) = \frac{\exp ((\sum_{i = 1}^D v_i W_{ij}) h_j + a_j h_j)}{1 + \exp(\sum_{i = 1}^D W_{ij} v_i + a_j)}\]

<p>Now for the payoff. We care about the conditional probability that <em>h<sub>j</sub></em> = 1, and when we set <em>h<sub>j</sub></em> = 1, the distribution turns into the sigmoid function!</p>

\[P(h_j = 1 \vert \textbf{v}; \theta) = \sigma (\sum_{i = 1}^D W_{ij} v_i + a_j) \quad \text{where} \quad \sigma(x) = \frac{\exp (x)}{1 + \exp (x)}\]

<p>And now we have shown a mathematical, theoretical basis for why the units in a neural network apply a nonlinearity - oftentimes the sigmoid function - as the activation function: it corresponds exactly to the conditional probability that the hidden variable is 1. What does the sigmoid function depend on? The sum of every visible variable - which can be 0 or 1 depending on whether each visible unit “fired” or not - times its appropriate weight, plus the bias for that hidden unit.</p>
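<p>We can check this correspondence numerically. Here is a small brute-force verification with made-up parameters (all of <em>W</em>, <em>b</em>, <em>a</em>, and <em>v</em> below are arbitrary): marginalizing the unnormalized joint over every hidden vector reproduces the sigmoid of each hidden unit’s input.</p>

```python
import itertools
import math

# toy RBM parameters (made up): D = 3 visible units, F = 2 hidden units
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
b = [0.2, -0.1, 0.4]
a = [0.1, -0.3]
v = [1, 0, 1]
D, F = len(v), len(a)

def unnorm(v, h):
    # exp(-E(v, h)) without the partition function
    e = sum(v[i] * W[i][j] * h[j] for i in range(D) for j in range(F))
    e += sum(b[i] * v[i] for i in range(D))
    e += sum(a[j] * h[j] for j in range(F))
    return math.exp(e)

def sigmoid(x):
    return math.exp(x) / (1 + math.exp(x))

total = sum(unnorm(v, h) for h in itertools.product([0, 1], repeat=F))
for j in range(F):
    # brute-force P(h_j = 1 | v): sum over hidden vectors with h_j = 1
    p = sum(unnorm(v, h) for h in itertools.product([0, 1], repeat=F)
            if h[j] == 1) / total
    act = sum(W[i][j] * v[i] for i in range(D)) + a[j]
    assert abs(p - sigmoid(act)) < 1e-12
```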

<p>Moreover, we’ve actually derived the architecture of vanilla neural networks from the mathematical structure of Restricted Boltzmann Machines: some number of visible units all feed into each hidden unit, the connections are multiplied by weights, a bias is added within each unit, and the sigmoid function is applied to determine whether the output of that unit will be 1 or 0 - that is, whether the “neuron” will “fire” or not.</p>

<p><img src="/assets/dl-part-2/feedforward.png" alt="Feedforward neural network" title="Feedforward neural network" /></p>

<p>Thanks to <a href="http://madebyevan.com/fsm/">Evan Wallace’s Finite State Machine Designer</a>.</p>

<p>Most of these distributions in statistics and machine learning are taught because they <em>work</em> - the Boltzmann Distribution, for example, is notable because it does a good job of modeling natural phenomena. Many, many distributions and methods are lost because, while mathematically novel, they aren’t useful. The ones we do remember are the ones that work, the ones that fit phenomena or predict well.</p>

<p>The difference between RBMs and feedforward neural networks is that RBMs are a <em>probabilistic model</em> while feedforward neural networks are <em>deterministic</em>. We just take the mean of the first conditional distribution <em>p(h<sub>j</sub> | <strong>v</strong>)</em> to get our deterministic neural networks. We can also go from discrete, where our inputs and outputs can only be 0 or 1, to continuous, where inputs and outputs can take any value from 0 to 1, but we have to add some restrictions and flip some signs around - the energy function has to have all its signs reversed and the weights matrix <strong>U</strong> has to be <em>positive definite</em> for the distribution to converge and integrate to 1.</p>
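<p>To make the “take the mean” step concrete, here is a sketch contrasting a stochastic RBM hidden unit with its deterministic feedforward counterpart - the weights, bias, and visible vector below are made up for illustration:</p>

```python
import math
import random

def sigmoid(x):
    return math.exp(x) / (1 + math.exp(x))

# made-up weights, bias, and visible vector for a single hidden unit
W_col = [0.5, -1.0, 2.0]   # one column of W: weights into this hidden unit
a_j = 0.3                  # bias of this hidden unit
v = [1, 0, 1]

activation = sum(w * vi for w, vi in zip(W_col, v)) + a_j
p = sigmoid(activation)    # P(h_j = 1 | v)

stochastic_h = 1 if random.random() < p else 0   # RBM: sample the unit's state
deterministic_h = p                              # feedforward net: output the mean
```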

<p>Again, we have just shown that there’s a theoretical foundation for neural networks. It was this proof, combined with Hinton’s discovery that <a href="https://www.cs.toronto.edu/~hinton/absps/fastnc.pdf">stacking</a> <a href="https://arxiv.org/abs/1206.5533">RBMs</a> - in much the same fashion as we now stack layers of hidden units to form deep neural networks - yielded promising results in feature extraction, discrimination/classification, object detection, and many other classes of tasks, that kicked off the boom in AI and deep learning that we’re seeing now. We’ve just shown the basis of all that.</p>

<p>Pretty cool.</p>]]></content><author><name></name></author><category term="blog" /><category term="dl" /><category term="tutorial" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Algorithms - Double Binary Search</title><link href="http://blog.jzhanson.com/blog/practice/code/2018/01/08/algos-1.html" rel="alternate" type="text/html" title="Algorithms - Double Binary Search" /><published>2018-01-08T20:00:00+00:00</published><updated>2018-01-08T20:00:00+00:00</updated><id>http://blog.jzhanson.com/blog/practice/code/2018/01/08/algos-1</id><content type="html" xml:base="http://blog.jzhanson.com/blog/practice/code/2018/01/08/algos-1.html"><![CDATA[<p>Welcome to the first of a series where I post a programming interview question and work through it, posting code and explanations of my approaches, pitfalls, and clever tricks! I may use different languages and compare the results if there are interesting or noteworthy differences, but I will generally use Python due to its brevity and ease of understanding. The focus here is on the algorithm, approaches, and clarity of code rather than any particular code finesse. Send comments or corrections to <a href="mailto:josh@jzhanson.com">josh@jzhanson.com</a>.</p>

<p>Note: <em>time</em> and <em>runtime</em> in the context of runtime analysis both mean <em>work</em>, which is how long the algorithm takes to execute on a single processor, i.e. sequentially, as opposed to <em>span</em>, which is how long the algorithm takes if we assume infinite processors - span is the longest single branch of the recurrence tree, that is, the most work that has to be done by any single processor among our infinite processors. If the wording is ever ambiguous, I mean <em>work</em>.</p>

<p>Second note: the diagrams are pictures I took with my phone of the diagrams drawn on paper - once I figure out a good diagramming software, I’ll probably replace the pictures. But having pictures of hand-drawn diagrams actually adds a bit of character and humanity to these posts, which I like :).</p>

<h1 id="double-binary-search">Double Binary Search</h1>
<p>or, Median of Two Sorted Arrays, or, kth-smallest</p>

<p>It’s trivial to find the median of a single sorted array A: just take the length of the array n and find A[n/2]. If you want to be fancy, you can return A[n/2] if the array is of odd length, or the average of A[n/2 - 1] and A[n/2] if the array is of even length.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">median_simple</span><span class="p">(</span><span class="n">A</span><span class="p">):</span>
    <span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">A</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">A</span><span class="p">[</span><span class="n">n</span> <span class="o">//</span> <span class="mi">2</span><span class="p">]</span>          <span class="c1"># // is integer division</span>

<span class="k">def</span> <span class="nf">median_fancy</span><span class="p">(</span><span class="n">A</span><span class="p">):</span>
    <span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">A</span><span class="p">)</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">n</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="p">):</span>   <span class="c1"># array length is even</span>
        <span class="k">return</span> <span class="p">(</span><span class="n">A</span><span class="p">[</span><span class="n">n</span> <span class="o">//</span> <span class="mi">2</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">A</span><span class="p">[</span><span class="n">n</span> <span class="o">//</span> <span class="mi">2</span><span class="p">])</span> <span class="o">/</span> <span class="mi">2</span>
    <span class="k">else</span><span class="p">:</span>              <span class="c1"># array length is odd</span>
        <span class="k">return</span> <span class="n">A</span><span class="p">[</span><span class="n">n</span> <span class="o">//</span> <span class="mi">2</span><span class="p">]</span>
</code></pre></div></div>

<p>But what if you wanted to find the median of <em>two</em> sorted arrays? It might seem straightforward at first, especially if all the elements of one array are less than all the elements of another array, e.g. [1, 2, 5, 7] and [15, 21, 33], but what if the arrays overlap, or even share elements? How would we find the median of, say, [2, 5, 7, 8] and [0, 3, 4, 6, 7, 9]?</p>

<p>Word on the street is that this is an essential question to know for Google coding interviews, and by word on the street, I mean word straight from the mouth of Professor Guy Blelloch in the lecture of 15-210: Parallel and Sequential Data Structures and Algorithms at Carnegie Mellon University taught under the School of Computer Science undergraduate program…</p>

<h2 id="the-problem">The problem</h2>

<p>We define the <em>median element</em> of two or more arrays to be the median of the array formed when all the arrays are combined and sorted, preserving duplicates. For example, the median element of [2, 5, 7, 8] and [0, 3, 4, 6, 7, 9] would be 5 - here we take the lower of the two middle elements, 5 and 6, since the combined length is even (averaging the two middles is the other common convention, and some of the code below uses it). We define the <em>kth-smallest</em> element of two or more arrays to be the element at index k - 1 (that is, the kth element counting from one) of the array formed when all the arrays are combined and sorted, preserving duplicates. Using the above example, the 1st-smallest element would be 0 and the 4th-smallest element would be 4.</p>
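<p>These definitions are easy to pin down with a tiny brute-force reference (the function names here are my own); it uses 1-indexed k and takes the lower middle element when the combined length is even:</p>

```python
def kth_smallest_brute(A, B, k):
    # k counts from 1: k = 1 is the overall smallest element, i.e. index k - 1
    C = sorted(A + B)
    return C[k - 1]

def median_brute(A, B):
    # lower middle element when the combined length is even
    C = sorted(A + B)
    return C[(len(C) - 1) // 2]

A, B = [2, 5, 7, 8], [0, 3, 4, 6, 7, 9]
assert kth_smallest_brute(A, B, 1) == 0
assert kth_smallest_brute(A, B, 4) == 4
assert median_brute(A, B) == 5
```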

<ol>
  <li>Given two sorted arrays <strong>of equal size</strong> A and B, find the median element.</li>
</ol>

<p><strong>Input</strong>: two sorted arrays of equal size A and B whose elements are integers (but can be any other element for which there exists a total ordering).</p>

<p><strong>Output</strong>: the median element of the array formed when both arrays are combined and sorted - if C is the sorted “union” preserving duplicates of A and B with length n, then since |A| = |B| makes n even, we return the average of the two middle elements, C[n/2 - 1] and C[n/2].</p>

<ol>
  <li>Given two sorted arrays <strong>of unequal size</strong> A and B, find the median element.</li>
</ol>

<p><strong>Input</strong>: two sorted arrays of unequal size A and B whose elements are integers (but can be any other element for which there exists a total ordering).</p>

<p><strong>Output</strong>: the median element of the array formed when both arrays are combined and sorted - if C is the sorted “union” preserving duplicates of A and B with length n, the median is the middle element C[(n - 1)/2] (integer division) if n is odd, and the average of the two middle elements C[n/2 - 1] and C[n/2] if n is even.</p>

<ol>
  <li>Given two sorted arrays <strong>of unequal size</strong> A and B and an integer k, where k &lt;= |A| + |B|, find the kth-smallest element of the two arrays. We use the bars | to denote the size of an array or the length of a string, so |A| is the size of A.</li>
</ol>

<p><strong>Input</strong>: two sorted arrays of unequal size A and B whose elements are integers (but can be any other element for which there exists a total ordering), and an integer k with 1 &lt;= k &lt;= |A| + |B|.</p>

<p><strong>Output</strong>: the kth-smallest element of the array formed when both arrays are combined and sorted - if C is the sorted “union” preserving duplicates of A and B, the kth-smallest element is C[k - 1], the kth element counting from one.</p>

<h2 id="foray-1-brute-force">Foray #1: Brute force</h2>

<p>We will tackle 1 and 2 together while 3 is mostly left as an exercise.</p>

<p>A good place to start, in programming interviews, is always to talk through and explore the simplest, often brute force solution. It is almost never the correct solution, but doing so 1) prevents you from sitting there silently for several minutes thinking like a maniac and trying to come up with the perfect solution, 2) fills up the time and helps show your thought process to the interviewer, and 3) helps build intuition on the problem.</p>

<p>The simplest solution here is to combine both arrays into one big array, sort it, and then trivially find the median of that big array. Let n = |A| + |B|. If we use an implementation of arrays that allows appending in O(n) work and O(1) span and a decent (read: asymptotically optimal) sorting algorithm which runs in O(n log n) work and, if we’re picky about the parallelism of our algorithms, has O(log<sup>3</sup> n) span <em>cough</em> mergesort <em>cough</em>, then this gives us a total work of O(n log n) and a total span of O(log<sup>3</sup> n).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">double_median_naive</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">):</span>
    <span class="n">C</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">A</span> <span class="o">+</span> <span class="n">B</span><span class="p">)</span>      <span class="c1"># in Python, + concatenates two lists</span>
    <span class="k">return</span> <span class="n">C</span><span class="p">[</span><span class="nb">len</span><span class="p">(</span><span class="n">C</span><span class="p">)</span> <span class="o">//</span> <span class="mi">2</span><span class="p">]</span>  <span class="c1"># sorted() uses Timsort, a mergesort hybrid</span>
</code></pre></div></div>

<p>This isn’t optimal. Intuitively, it <em>feels</em> like we’re doing a lot more work than we need to; we’re merging and sorting both arrays when we really just need to determine the middle element. Also, do we really need to <em>sort</em> the big array again when both A and B are sorted?</p>

<p>I didn’t mention mergesort up there for nothing: if you’re sharp, then you read mergesort and immediately thought “<em>Why don’t we just merge A and B instead of appending and sorting?</em>”</p>

<p>We merge A and B <em>a la</em> mergesort by starting a pointer at the beginning of both arrays, comparing the element under the pointer in A with the element under the pointer in B, and advancing the pointer of whichever element is <strong>smaller</strong>. When we get to the n/2-th element, where n is the sum of the lengths of A and B, we return that one. If we do this, then we actually cut down our work to O(n). However, interestingly, our span becomes O(n). Here, we see the trade-off between work and span in action: algorithms can often become more parallel in exchange for doing more, sometimes repeated, work.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">double_median_merge</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">):</span>
    <span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">A</span><span class="p">)</span> <span class="o">+</span> <span class="nb">len</span><span class="p">(</span><span class="n">B</span><span class="p">)</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">n</span> <span class="o">==</span> <span class="mi">0</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">None</span>
    <span class="n">count_a</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">count_b</span> <span class="o">=</span> <span class="mi">0</span>

    <span class="k">def</span> <span class="nf">advance</span><span class="p">():</span>
        <span class="c1"># return the next-smallest unconsumed element, moving one pointer</span>
        <span class="k">nonlocal</span> <span class="n">count_a</span><span class="p">,</span> <span class="n">count_b</span>
        <span class="k">if</span> <span class="n">count_b</span> <span class="o">==</span> <span class="nb">len</span><span class="p">(</span><span class="n">B</span><span class="p">)</span> <span class="ow">or</span> <span class="p">(</span><span class="n">count_a</span> <span class="o">&lt;</span> <span class="nb">len</span><span class="p">(</span><span class="n">A</span><span class="p">)</span> <span class="ow">and</span> <span class="n">A</span><span class="p">[</span><span class="n">count_a</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">B</span><span class="p">[</span><span class="n">count_b</span><span class="p">]):</span>
            <span class="n">count_a</span> <span class="o">+=</span> <span class="mi">1</span>
            <span class="k">return</span> <span class="n">A</span><span class="p">[</span><span class="n">count_a</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span>
        <span class="n">count_b</span> <span class="o">+=</span> <span class="mi">1</span>
        <span class="k">return</span> <span class="n">B</span><span class="p">[</span><span class="n">count_b</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">n</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">1</span><span class="p">):</span>            <span class="c1"># odd: skip n // 2 elements, return the middle</span>
        <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span> <span class="o">//</span> <span class="mi">2</span><span class="p">):</span>
            <span class="n">advance</span><span class="p">()</span>
        <span class="k">return</span> <span class="n">advance</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span> <span class="o">//</span> <span class="mi">2</span> <span class="o">-</span> <span class="mi">1</span><span class="p">):</span>   <span class="c1"># even: average the two middle elements</span>
        <span class="n">advance</span><span class="p">()</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">advance</span><span class="p">()</span> <span class="o">+</span> <span class="n">advance</span><span class="p">())</span> <span class="o">/</span> <span class="mi">2</span>
</code></pre></div></div>

<p>A couple of things to note here: we still need the edge case where both arrays are empty, returning <code class="language-plaintext highlighter-rouge">None</code>, and when one array is exhausted, <code class="language-plaintext highlighter-rouge">advance()</code> can only move the other pointer - the <code class="language-plaintext highlighter-rouge">count_b == len(B)</code> and <code class="language-plaintext highlighter-rouge">count_a &lt; len(A)</code> checks cover that. The off-by-one bookkeeping is the fiddly part: for odd <em>n</em> we discard the <code class="language-plaintext highlighter-rouge">n // 2</code> smallest elements and return the next one, while for even <em>n</em> we discard <code class="language-plaintext highlighter-rouge">n // 2 - 1</code> and average the next two. Walking through a tiny example, like two one-element arrays, is a good way to convince yourself the counts are right.</p>

<p>This works both for when the arrays are equal size and when the arrays are unequal size.</p>
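<p>We can back this claim with a quick randomized check against the brute-force version - the merge-based median is re-implemented compactly here so the snippet is self-contained:</p>

```python
import random

def median_brute(A, B):
    # reference: concatenate, sort, take the middle
    # (average the two middles when the length is even)
    C = sorted(A + B)
    n = len(C)
    if n == 0:
        return None
    if n % 2 == 1:
        return C[n // 2]
    return (C[n // 2 - 1] + C[n // 2]) / 2

def median_merge(A, B):
    # walk the merge of A and B just far enough to reach the middle
    n = len(A) + len(B)
    if n == 0:
        return None
    i = j = 0
    def advance():
        nonlocal i, j
        if j == len(B) or (i < len(A) and A[i] < B[j]):
            i += 1
            return A[i - 1]
        j += 1
        return B[j - 1]
    if n % 2 == 1:
        for _ in range(n // 2):
            advance()
        return advance()
    for _ in range(n // 2 - 1):
        advance()
    return (advance() + advance()) / 2

random.seed(1)
for _ in range(1000):
    A = sorted(random.randrange(100) for _ in range(random.randrange(8)))
    B = sorted(random.randrange(100) for _ in range(random.randrange(8)))
    assert median_merge(A, B) == median_brute(A, B)
```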

<h2 id="foray-2-divide-and-conquer">Foray #2: Divide and conquer</h2>

<p>Now the next step takes a bit of a mental leap. If we think about what we know about the problem, we want to find a specific element out of <strong>sorted</strong> arrays, except we’re not looking for the element by <em>id</em> but by <em>cardinality</em>, or <em>rank</em>. A good option to explore, after hearing the words <em>sorted</em> and <em>find</em>, would be some sort of <strong>binary search</strong> - even just talking about it can show the interviewer that you’re on the right track and can prompt them to give you a hint to set you in the right direction. You could also arrive at the divide-and-conquer paradigm by going through the common algorithmic paradigms. For example, when I’m looking for some <em>smarter</em> algorithm, I first think to see if a greedy algorithm would work, then a divide-and-conquer one, then dynamic programming, then backtracking, and finally graph algorithms.</p>

<p>Anyways, to see how binary search might help us find the median of two sorted arrays, let’s think about what binary search does. Binary search looks at the median of a single sorted array or subarray, compares it to the target element, and drops the lower half of the array if the target element is larger than the median - the target cannot occur in the lower half, where every element is less than the median, which is in turn less than the target - and symmetrically drops the upper half if the target is lower than the median.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">binary_search</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">target</span><span class="p">):</span>
    <span class="k">if</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">A</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">):</span> <span class="k">return</span> <span class="bp">False</span>
    <span class="n">mid</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">A</span><span class="p">)</span> <span class="o">//</span> <span class="mi">2</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">A</span><span class="p">[</span><span class="n">mid</span><span class="p">]</span> <span class="o">==</span> <span class="n">target</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">True</span>
    <span class="k">elif</span> <span class="p">(</span><span class="n">A</span><span class="p">[</span><span class="n">mid</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">target</span><span class="p">):</span>                    <span class="c1"># median is less than target</span>
        <span class="k">return</span> <span class="n">binary_search</span><span class="p">(</span><span class="n">A</span><span class="p">[</span><span class="n">mid</span><span class="o">+</span><span class="mi">1</span><span class="p">:],</span> <span class="n">target</span><span class="p">)</span>  <span class="c1"># all elements after mid</span>
    <span class="k">else</span><span class="p">:</span>                                      <span class="c1"># median is greater than target</span>
        <span class="k">return</span> <span class="n">binary_search</span><span class="p">(</span><span class="n">A</span><span class="p">[:</span><span class="n">mid</span><span class="p">],</span> <span class="n">target</span><span class="p">)</span>   <span class="c1"># all elements before mid</span>
</code></pre></div></div>

<p>We’re comparing the median of the sorted array to something, and then dropping half of the array based on that…this is the part where you either have the flash of inspiration or your interviewer prods you to the flash of inspiration. <strong>What if we compare the medians of the two arrays?</strong></p>

<h3 id="equal-length">Equal length</h3>

<p>Let’s explore this, first if we assume the arrays are equal size. Simplifying assumptions are a great way to get a start on a problem and build intuition. If the arrays are equal size and we compare the medians, we have three cases:</p>

<ol>
  <li>
    <p>If the median of A is <strong>less</strong> than the median of B, then we know that the true median has to be in the second half of A, A<sub>R</sub>, or the first half of B, B<sub>L</sub>, inclusive of the sub-medians.</p>
  </li>
  <li>
    <p>If the median of A is <strong>greater</strong> than the median of B, then we know that the true median has to be in the first half of A, A<sub>L</sub> or the second half of B, B<sub>R</sub>, inclusive of the sub-medians.</p>
  </li>
  <li>
    <p>If the median of A is <strong>equal</strong> to the median of B, then our job just got a lot easier! The shared value sits right in the middle of the combined array, and in code we can simply fold this tie into case 1.</p>
  </li>
</ol>

<p>The picture below should help illustrate the intuition behind these three cases.</p>

<p><img src="/assets/prog-1/equal-len.jpg" alt="Equal length" title="Equal length" /></p>

<p>Again, if it intuitively seems like we can immediately find the median of two equal-length sorted arrays, take a moment to convince yourself why that isn’t true. Writing out a couple of examples might help.</p>
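<p>For instance, a tempting shortcut - just combining the two individual medians - does not give the combined median (the arrays here are made up to break it):</p>

```python
A = [1, 2, 8, 9]
B = [0, 2, 3, 4]
median_A = (A[1] + A[2]) / 2               # 5.0
median_B = (B[1] + B[2]) / 2               # 2.5
naive_guess = (median_A + median_B) / 2    # 3.75
C = sorted(A + B)                          # [0, 1, 2, 2, 3, 4, 8, 9]
true_median = (C[3] + C[4]) / 2            # 2.5
assert naive_guess != true_median
```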

<h3 id="solution">Solution</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">double_binary_search_eq_len</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">):</span>
    <span class="k">if</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">A</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="ow">and</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">B</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">None</span>
    <span class="k">if</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">A</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span> <span class="ow">and</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">B</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">):</span>
        <span class="k">return</span> <span class="p">(</span><span class="n">A</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="n">B</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">/</span> <span class="mi">2</span>
    <span class="k">elif</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">A</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span><span class="p">)</span> <span class="ow">and</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">B</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span><span class="p">):</span>
        <span class="k">return</span> <span class="p">(</span><span class="nb">max</span><span class="p">(</span><span class="n">A</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">B</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">+</span> <span class="nb">min</span><span class="p">(</span><span class="n">A</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">B</span><span class="p">[</span><span class="mi">1</span><span class="p">]))</span> <span class="o">/</span> <span class="mi">2</span>

    <span class="c1"># compare the true medians, averaging the middle pair when n is even</span>
    <span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">A</span><span class="p">)</span>
    <span class="n">mid</span> <span class="o">=</span> <span class="n">n</span> <span class="o">//</span> <span class="mi">2</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">n</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="p">):</span>
        <span class="n">med_a</span> <span class="o">=</span> <span class="p">(</span><span class="n">A</span><span class="p">[</span><span class="n">mid</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">A</span><span class="p">[</span><span class="n">mid</span><span class="p">])</span> <span class="o">/</span> <span class="mi">2</span>
        <span class="n">med_b</span> <span class="o">=</span> <span class="p">(</span><span class="n">B</span><span class="p">[</span><span class="n">mid</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">B</span><span class="p">[</span><span class="n">mid</span><span class="p">])</span> <span class="o">/</span> <span class="mi">2</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">med_a</span> <span class="o">=</span> <span class="n">A</span><span class="p">[</span><span class="n">mid</span><span class="p">]</span>
        <span class="n">med_b</span> <span class="o">=</span> <span class="n">B</span><span class="p">[</span><span class="n">mid</span><span class="p">]</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">med_a</span> <span class="o">==</span> <span class="n">med_b</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">med_a</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">med_a</span> <span class="o">&gt;</span> <span class="n">med_b</span><span class="p">):</span>
        <span class="n">A</span><span class="p">,</span> <span class="n">B</span> <span class="o">=</span> <span class="n">B</span><span class="p">,</span> <span class="n">A</span>    <span class="c1"># the logic is symmetric, so make A's median the smaller</span>
    <span class="c1"># keep A's upper half and B's lower half; when n is even, also keep one</span>
    <span class="c1"># neighbor of each median so the answer can't be chopped off</span>
    <span class="n">keep</span> <span class="o">=</span> <span class="n">mid</span> <span class="o">-</span> <span class="mi">1</span> <span class="k">if</span> <span class="n">n</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span> <span class="k">else</span> <span class="n">mid</span>
    <span class="k">return</span> <span class="n">double_binary_search_eq_len</span><span class="p">(</span><span class="n">A</span><span class="p">[</span><span class="n">keep</span><span class="p">:],</span> <span class="n">B</span><span class="p">[:</span><span class="n">n</span> <span class="o">-</span> <span class="n">keep</span><span class="p">])</span>
</code></pre></div></div>

<p>The first <code class="language-plaintext highlighter-rouge">if/elif</code> statement is the base case - if both arrays are length 2, the recursive step would no longer actually shorten the arrays and we could get stuck in a loop, so we compute the median of the four merged elements directly.</p>

<p>This takes O(log n) work and span, because we chop off roughly half of our total input size at each iteration, and because there is only one recursive call, there is no parallelizability to exploit.</p>
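<p>To sanity-check the idea, here is a self-contained sketch of one way to implement the equal-length routine - comparing true medians, averaged when the length is even, with the length-1 base case included - compared against the brute-force approach of merging and taking the middle. The helper name <code class="language-plaintext highlighter-rouge">eq_len_median</code> is mine, not the code from the post:</p>

```python
import statistics

def eq_len_median(A, B):
    """Median of the merge of two sorted, equal-length arrays (sketch)."""
    n = len(A)
    if n == 1:
        return (A[0] + B[0]) / 2
    if n == 2:
        return (max(A[0], B[0]) + min(A[1], B[1])) / 2
    # "true" medians, averaging the middle pair when n is even
    med_a = A[n // 2] if n % 2 else (A[n // 2 - 1] + A[n // 2]) / 2
    med_b = B[n // 2] if n % 2 else (B[n // 2 - 1] + B[n // 2]) / 2
    if med_a == med_b:
        return med_a
    if med_a > med_b:            # symmetric, so make A the lower-median array
        A, B = B, A
    # keep A's upper half and B's lower half; for even n, keep one extra
    # neighbor of each median so the answer can't be chopped off
    keep = n // 2 - 1 if n % 2 == 0 else n // 2
    return eq_len_median(A[keep:], B[:n - keep])

print(eq_len_median([1, 3, 5, 7], [2, 4, 6, 8]))        # 4.5
print(statistics.median([1, 3, 5, 7] + [2, 4, 6, 8]))   # 4.5
```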

<h3 id="unequal-length">Unequal length</h3>

<p>Now let’s take this one step further. What if our two arrays A and B are of unequal length? Not much about the algorithm actually changes: we still compare the medians of both arrays, but we have to be more careful about how much we can “chop” off of each array. We also have the lengths of the arrays to help us out. For simplicity, assume |A| &lt; |B|; if A is longer than B, we can just swap the arrays - the logic is symmetric.</p>

<p>We again have a couple cases:</p>

<ol>
  <li>
    <p>If the median of A is greater than the median of B, then we can drop all of the second half of A, A<sub>R</sub>. Additionally, we can drop that many elements from the first half of B, B<sub>L</sub>, but we <strong>cannot always drop all of B<sub>L</sub></strong>.</p>
  </li>
  <li>
    <p>Symmetrically, if the median of A is less than the median of B, then we can drop all the first half of A, A<sub>L</sub>. Additionally, we can drop that many elements from the second half of B, from B<sub>R</sub>.</p>
  </li>
  <li>
    <p>If the median of A is equal to the median of B, then we can apply either of the above two cases; let’s just use the second one here. Be careful, though: if you take the “median” of an even-length array to be its lower middle element, equal medians do not immediately give the answer, so you may want a separate base case that also compares the elements neighboring the medians. For example, the lower-middle “median” of both [0, 2, 4, 6, 8, 10] and [1, 2, 4, 6, 7, 9] is 4, but the median of the merged arrays is 5.</p>
  </li>
</ol>
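<p>We can check that last example directly with Python’s <code class="language-plaintext highlighter-rouge">statistics</code> module:</p>

```python
import statistics

# The example from case 3: taking the lower-middle element as the "median"
# of an even-length array, both arrays have "median" 4...
A = [0, 2, 4, 6, 8, 10]
B = [1, 2, 4, 6, 7, 9]
print(A[len(A) // 2 - 1], B[len(B) // 2 - 1])   # 4 4

# ...but the median of the merged arrays is 5.
print(statistics.median(A + B))                  # 5.0
```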

<p><img src="/assets/prog-1/unequal-len.jpg" alt="Unequal length" title="Unequal length" /></p>

<p>Another reason that interviewers like this problem is that there are a <em>lot</em> of base cases to account for, especially with arrays of unequal length. We can reduce them by forcing A to be shorter than B, of course, but there are still a couple we have to account for.</p>

<h3 id="solution-1">Solution</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">double_binary_search</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">):</span>
    <span class="k">if</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">A</span><span class="p">)</span> <span class="o">&gt;</span> <span class="nb">len</span><span class="p">(</span><span class="n">B</span><span class="p">)):</span>
        <span class="n">A</span><span class="p">,</span> <span class="n">B</span> <span class="o">=</span> <span class="n">B</span><span class="p">,</span> <span class="n">A</span>          <span class="c1"># force A to be the shorter array</span>
    <span class="k">if</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">A</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">):</span>
        <span class="k">if</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">B</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">):</span>
            <span class="k">return</span> <span class="bp">None</span>
        <span class="n">mid</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">B</span><span class="p">)</span> <span class="o">//</span> <span class="mi">2</span>
        <span class="k">if</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">B</span><span class="p">)</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="p">):</span>      <span class="c1"># if A is empty, return median of B</span>
            <span class="k">return</span> <span class="p">(</span><span class="n">B</span><span class="p">[</span><span class="n">mid</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">B</span><span class="p">[</span><span class="n">mid</span><span class="p">])</span> <span class="o">/</span> <span class="mi">2</span>
        <span class="k">return</span> <span class="n">B</span><span class="p">[</span><span class="n">mid</span><span class="p">]</span>
    <span class="k">if</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">A</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span> <span class="ow">and</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">B</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">):</span>
        <span class="k">return</span> <span class="p">(</span><span class="n">A</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="n">B</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">/</span> <span class="mi">2</span>       <span class="c1"># if one element in both arrays</span>
    <span class="k">if</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">A</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">):</span>
        <span class="n">mid</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">B</span><span class="p">)</span> <span class="o">//</span> <span class="mi">2</span>
        <span class="k">if</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">B</span><span class="p">)</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="p">):</span>      <span class="c1"># merged length is odd: one middle element</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">A</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">B</span><span class="p">[</span><span class="n">mid</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]):</span>
                <span class="k">return</span> <span class="n">B</span><span class="p">[</span><span class="n">mid</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span>
            <span class="k">elif</span> <span class="p">(</span><span class="n">A</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&gt;</span> <span class="n">B</span><span class="p">[</span><span class="n">mid</span><span class="p">]):</span>
                <span class="k">return</span> <span class="n">B</span><span class="p">[</span><span class="n">mid</span><span class="p">]</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="k">return</span> <span class="n">A</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
        <span class="k">else</span><span class="p">:</span>                          <span class="c1"># merged length is even: average two middles</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">A</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">B</span><span class="p">[</span><span class="n">mid</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]):</span>
                <span class="k">return</span> <span class="p">(</span><span class="n">B</span><span class="p">[</span><span class="n">mid</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">B</span><span class="p">[</span><span class="n">mid</span><span class="p">])</span> <span class="o">/</span> <span class="mi">2</span>
            <span class="k">elif</span> <span class="p">(</span><span class="n">A</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&gt;</span> <span class="n">B</span><span class="p">[</span><span class="n">mid</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]):</span>
                <span class="k">return</span> <span class="p">(</span><span class="n">B</span><span class="p">[</span><span class="n">mid</span><span class="p">]</span> <span class="o">+</span> <span class="n">B</span><span class="p">[</span><span class="n">mid</span> <span class="o">+</span> <span class="mi">1</span><span class="p">])</span> <span class="o">/</span> <span class="mi">2</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="k">return</span> <span class="p">(</span><span class="n">A</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="n">B</span><span class="p">[</span><span class="n">mid</span><span class="p">])</span> <span class="o">/</span> <span class="mi">2</span>

    <span class="k">elif</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">A</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span><span class="p">):</span>
        <span class="k">if</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">B</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span><span class="p">):</span>
            <span class="k">return</span> <span class="p">(</span><span class="nb">max</span><span class="p">(</span><span class="n">A</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">B</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">+</span> <span class="nb">min</span><span class="p">(</span><span class="n">A</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">B</span><span class="p">[</span><span class="mi">1</span><span class="p">]))</span> <span class="o">/</span> <span class="mi">2</span>
        <span class="k">elif</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">B</span><span class="p">)</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="p">):</span>
        <span class="c1"># ...
</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">mid_a</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">A</span><span class="p">)</span> <span class="o">//</span> <span class="mi">2</span>
        <span class="n">mid_b</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">B</span><span class="p">)</span> <span class="o">//</span> <span class="mi">2</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">A</span><span class="p">[</span><span class="n">mid_a</span><span class="p">]</span> <span class="o">&gt;</span> <span class="n">B</span><span class="p">[</span><span class="n">mid_b</span><span class="p">]):</span>
            <span class="c1"># drop A's upper half and the same number of elements from the bottom of B</span>
            <span class="n">dropped</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">A</span><span class="p">)</span> <span class="o">-</span> <span class="p">(</span><span class="n">mid_a</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">double_binary_search</span><span class="p">(</span><span class="n">A</span><span class="p">[:</span><span class="n">mid_a</span> <span class="o">+</span> <span class="mi">1</span><span class="p">],</span> <span class="n">B</span><span class="p">[</span><span class="n">dropped</span><span class="p">:])</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="c1"># drop A's lower half and the same number of elements from the top of B</span>
            <span class="k">return</span> <span class="n">double_binary_search</span><span class="p">(</span><span class="n">A</span><span class="p">[</span><span class="n">mid_a</span><span class="p">:],</span> <span class="n">B</span><span class="p">[:</span><span class="nb">len</span><span class="p">(</span><span class="n">B</span><span class="p">)</span> <span class="o">-</span> <span class="n">mid_a</span><span class="p">])</span>

</code></pre></div></div>

<p>There are a lot of base cases that don’t do much but get in the way of the core idea. The rest of the |A| = 2 cases are fairly similar to the first few: they mostly boil down to examining cases and then finding the median of a handful of elements.</p>
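<p>If you’d rather avoid the case explosion entirely, the same problem can also be solved with a single binary search over the <em>partition point</em> of the shorter array. This is the standard “partition” formulation rather than the algorithm described above, but it makes a handy cross-check for all of those base cases; the function name here is my own:</p>

```python
def median_partition(A, B):
    # binary search over how many elements of A land in the merged left half
    if len(A) > len(B):
        A, B = B, A
    la, lb = len(A), len(B)
    half = (la + lb + 1) // 2            # size of the merged left half
    lo, hi = 0, la
    while lo <= hi:
        i = (lo + hi) // 2               # elements of A on the left
        j = half - i                     # elements of B on the left
        a_left = A[i - 1] if i > 0 else float('-inf')
        a_right = A[i] if i < la else float('inf')
        b_left = B[j - 1] if j > 0 else float('-inf')
        b_right = B[j] if j < lb else float('inf')
        if a_left <= b_right and b_left <= a_right:   # valid split found
            if (la + lb) % 2 == 1:
                return max(a_left, b_left)
            return (max(a_left, b_left) + min(a_right, b_right)) / 2
        elif a_left > b_right:
            hi = i - 1                   # took too many from A
        else:
            lo = i + 1                   # took too few from A
    raise ValueError("inputs must be sorted")

print(median_partition([1, 3], [2]))       # 2
print(median_partition([1, 2], [3, 4]))    # 2.5
```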

<h2 id="conclusion">Conclusion</h2>

<p>That’s the end of the very first algorithms post, and boy was it hefty, at over 2600 words. I hope this has been helpful - it was certainly helpful for me to get my thoughts on this particular problem, which had been jangling around in my head for weeks, down in one clear place. It’s definitely a work in progress - I intend to finish writing the code, test it thoroughly, and post a link to it on my GitHub. Send any comments or corrections to <a href="mailto:josh@jzhanson.com">josh@jzhanson.com</a>. Cheers!</p>

<h2 id="2022-11-13-bonus---thanks-aleksandar-bosnjak">2022-11-13 Bonus - thanks Aleksandar Bosnjak!</h2>

<p>We reasoned about the time complexity of the algorithm with our discussion of <em>work</em> and <em>span</em>, but what’s the space complexity of the code we wrote? How much space will our recursive calls take?</p>

<details>
  <summary>Click to see the answer</summary>
  O(n) (where n is the size of the input lists A and B)! Why? Because every time we make a recursive call of double_binary_search, we're creating new Python lists! double_binary_search(A[:mid_a+1], B[mid_a:]) and double_binary_search(A[mid_a:], B[:len(B)-mid_a]) both create two new lists. Granted, those lists are each half the size of the starting lists, but the infinite sum (1/2 + 1/4 + ...) still adds up to 1. Our sum isn't exactly infinite. In fact, we only have O(log n) levels. But that still means that we're using O(n) space, which sucks for a searching algorithm.
</details>
<p><br /></p>

<p>How can we improve this code?</p>

<details>
  <summary>Click to see the answer</summary>
  We can rewrite the function double_binary_search to take four additional arguments: the "start" and "end" indexes of A and B for each function call. Then, instead of slicing A and B and passing the sliced lists as arguments in a recursive call, we pass A and B, and the start and end indexes. This is a straightforward change, but it's also fairly involved and left as an exercise to the reader. :)  How can you compute the mid indexes cleanly? Are there any additional special cases by rewriting the code this way? And are there any other improvements you can make to the code? After you revise double_binary_search to pass start and end indexes as arguments, you can be sure that you really understand the solution. Happy coding! :)
</details>
<p><br /></p>
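<p>To illustrate the index-passing pattern without spoiling the exercise, here is an ordinary binary search written with start/end indexes instead of slices - the same trick applies to <code class="language-plaintext highlighter-rouge">double_binary_search</code>. This helper is my own example, not code from the post:</p>

```python
# Index-based recursion avoids copying: instead of passing x[mid:] (a new
# list), we pass the same list plus the half-open range [lo, hi) to search.
def contains(x, target, lo=0, hi=None):
    if hi is None:
        hi = len(x)
    if lo >= hi:
        return False
    mid = (lo + hi) // 2
    if x[mid] == target:
        return True
    if x[mid] < target:
        return contains(x, target, mid + 1, hi)   # same list object, new bounds
    return contains(x, target, lo, mid)

print(contains([1, 3, 5, 7, 9], 7))   # True
print(contains([1, 3, 5, 7, 9], 4))   # False
```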

<p>Here’s a quick way to check that Python does indeed create a new list when you slice one and pass the slice as an argument, but passes a reference to the same list when you pass the original list as an argument.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">]</span>

<span class="k">def</span> <span class="nf">recursive</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'at start address'</span><span class="p">,</span> <span class="nb">id</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="n">x</span><span class="p">)</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">return</span>
    <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span>
    <span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="s">'before recursive'</span><span class="p">)</span>
    <span class="n">recursive</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">:])</span>
    <span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="s">'after recursive'</span><span class="p">)</span>

<span class="n">recursive</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</code></pre></div></div>

<p>Running this script, we see the output</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>at start address 140496339661248 [1, 2, 3, 4, 5]
[-1, 2, 3, 4, 5] before recursive
at start address 140496350644224 [2, 3, 4, 5]
[-1, 3, 4, 5] before recursive
at start address 140496350709184 [3, 4, 5]
[-1, 4, 5] before recursive
at start address 140496350709120 [4, 5]
[-1, 5] before recursive
at start address 140496350709056 [5]
[-1] before recursive
at start address 140496350708992 []
[-1] after recursive
[-1, 5] after recursive
[-1, 4, 5] after recursive
[-1, 3, 4, 5] after recursive
[-1, 2, 3, 4, 5] after recursive
</code></pre></div></div>

<p>As we can see, the address of the list x is different in each call. The “after recursive” printouts also show that the list x in each calling frame is not changed by the recursive call - only that call’s own sliced copy of x is.</p>

<p>Now look at this other test script</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">recursive_keep</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">depth</span><span class="o">=</span><span class="mi">0</span><span class="p">):</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'at start depth'</span><span class="p">,</span> <span class="n">depth</span><span class="p">,</span> <span class="s">'address'</span><span class="p">,</span> <span class="nb">id</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="n">x</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">depth</span> <span class="o">==</span> <span class="mi">5</span><span class="p">:</span>
        <span class="k">return</span>
    <span class="n">x</span><span class="p">[</span><span class="n">depth</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'x before recursive_keep'</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>
    <span class="n">recursive_keep</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">depth</span><span class="o">=</span><span class="n">depth</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'x after recursive_keep'</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>

<span class="n">recursive_keep</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</code></pre></div></div>

<p>Running this script gives us the output</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>at start depth 0 address 140612129747392 [-1, 2, 3, 4, 5]
x before recursive_keep [-1, 2, 3, 4, 5]
at start depth 1 address 140612129747392 [-1, 2, 3, 4, 5]
x before recursive_keep [-1, -1, 3, 4, 5]
at start depth 2 address 140612129747392 [-1, -1, 3, 4, 5]
x before recursive_keep [-1, -1, -1, 4, 5]
at start depth 3 address 140612129747392 [-1, -1, -1, 4, 5]
x before recursive_keep [-1, -1, -1, -1, 5]
at start depth 4 address 140612129747392 [-1, -1, -1, -1, 5]
x before recursive_keep [-1, -1, -1, -1, -1]
at start depth 5 address 140612129747392 [-1, -1, -1, -1, -1]
x after recursive_keep [-1, -1, -1, -1, -1]
x after recursive_keep [-1, -1, -1, -1, -1]
x after recursive_keep [-1, -1, -1, -1, -1]
x after recursive_keep [-1, -1, -1, -1, -1]
x after recursive_keep [-1, -1, -1, -1, -1]
</code></pre></div></div>

<p>This time, the address of the list x is the same in each recursive call. Not only that, the list x in each stack frame is changing as we change values in x to -1.</p>]]></content><author><name></name></author><category term="blog" /><category term="practice" /><category term="code" /><summary type="html"><![CDATA[Welcome to the first of a series where I post a programming interview question and work through it, posting code and explanations of my approaches, pitfalls, and clever tricks! I may use different languages and compare the results if there are interesting or noteworthy differences, but I will generally use Python due to its brevity and ease of understanding. The focus here is on the algorithm, approaches, and clarity of code rather than any particular code finesse. Send comments or corrections to josh@jzhanson.com.]]></summary></entry><entry><title type="html">Deep Learning Part 1 - Bayes’ Rule and Maximum Likelihood</title><link href="http://blog.jzhanson.com/blog/dl/tutorial/2017/12/30/dl-1.html" rel="alternate" type="text/html" title="Deep Learning Part 1 - Bayes’ Rule and Maximum Likelihood" /><published>2017-12-30T19:45:00+00:00</published><updated>2017-12-30T19:45:00+00:00</updated><id>http://blog.jzhanson.com/blog/dl/tutorial/2017/12/30/dl-1</id><content type="html" xml:base="http://blog.jzhanson.com/blog/dl/tutorial/2017/12/30/dl-1.html"><![CDATA[<script type="text/javascript" async="" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>

<p>This is the first in a several-part series on the basics of deep learning, presented in an easy-to-read, lightweight format. Previous experience with basic probability and matrix algebra will be helpful, but not required. Send any comments or corrections to <a href="mailto:josh@jzhanson.com">josh@jzhanson.com</a>.</p>

<h2 id="bayes-rule">Bayes’ Rule</h2>

<p>We begin our discussion with <strong>Bayes’ rule</strong>, an important result that captures the intuitive relationship between an event and prior knowledge we have of factors that might affect the probability of the event. Simply put, it formulates how event <em>B</em> affects the probability of event <em>A</em>. It forms the basis of Bayesian inference and Naive Bayes. Because it is a little difficult to grasp intuitively at first, let’s go over its derivation from the definition of <em>conditional probability</em>, which is easier to understand at first.</p>

<h3 id="conditional-probability">Conditional probability</h3>

<p>Conditional probability simply formulates the probability of event <em>A</em> happening <strong>given that</strong> event <em>B</em> happened.</p>

\[P(A \vert B) = \frac{P(A \cap B)}{P(B)} \text{ or, equivalently, } P(B \vert A) = \frac{P(B \cap A)}{P(A)}\]

<p>The <em>P</em>s basically mean “probability of,” the vertical bar | on the left side simply means “given,” and the little upside-down u on the numerator of the right side means “and,” as in event <em>A</em> happening <em>and</em> event <em>B</em> happening.</p>

<p>What conditional probability is saying is that the probability of event <em>A</em> given event <em>B</em> is equal to the probability of event <em>A</em> and event <em>B</em> happening divided by the probability of event <em>B</em>. It’s a bit easier to see with a Venn diagram of probabilities.</p>

<p><img src="/assets/dl-part-1/conditional-2.png" alt="Conditional probability illustrated" title="Conditional probability illustrated" /></p>

<p>It is fairly clear that if we assume that event <em>B</em> happens and we wish to consider the probability of event <em>A</em> happening, then we only need to consider the probability space where <em>B</em> happens, that is, the right, darker circle <em>P(B)</em>. Within that circle, there’s the middle section, <em>P(A and B)</em>, which is how <em>A</em> can happen if we assume that <em>B</em> happens. So we can see that the probability of <em>A</em> given <em>B</em> is equal to the probability of <em>A</em> and <em>B</em> (how <em>A</em> can still happen given that <em>B</em> happens) divided by the total probability space under consideration, <em>P(B)</em>, because, again, we’re assuming that <em>B</em> happens.</p>
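<p>As a quick numerical sanity check (my own example, not part of the derivation): roll one fair die, and let <em>A</em> be “the roll is greater than 3” and <em>B</em> be “the roll is even.”</p>

```python
from fractions import Fraction

# A = "roll is greater than 3", B = "roll is even" for one fair die roll
rolls = range(1, 7)
p_b = Fraction(sum(1 for r in rolls if r % 2 == 0), 6)            # 1/2
p_a_and_b = Fraction(sum(1 for r in rolls if r > 3 and r % 2 == 0), 6)  # 1/3

# P(A | B) = P(A and B) / P(B): of the even rolls {2, 4, 6}, two exceed 3
print(p_a_and_b / p_b)   # 2/3
```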

\[\implies P(A \vert B) P(B) = P(A \cap B) \text{ and } P(B \vert A) P(A) = P(B \cap A)\]

\[\implies P(A \vert B) P(B) = P(B \vert A) P(A)\]

\[\implies P(A \vert B) = \frac{P(B \vert A) P(A)}{P(B)}\]

<p>We first multiply both formulas through by their denominators, then set the two equal to each other - “and” is commutative, so <em>A</em> and <em>B</em> happening is the same as <em>B</em> and <em>A</em> happening - and finally divide by <em>P(B)</em>, assuming that probability is not zero, to arrive at Bayes’ rule.</p>

<h3 id="generalizing-bayes-rule">Generalizing Bayes’ Rule</h3>

\[P(A \vert B) = \frac{P(B \vert A) P(A)}{P(B)}\]

<p>We use the Law of Total Probability, which states that the probability of any event <em>A</em> is equal to the probability of <em>A</em> happening given that some event <em>B</em> happens, times the probability that <em>B</em> happens, plus the probability of <em>A</em> happening given that <em>B</em> does <em>not</em> happen, times the probability that <em>B</em> doesn’t happen. Referring to the diagram above, we’re basically saying that the probability of <em>A</em> is equal to the dark middle portion, <em>A</em> happening given <em>B</em> happening, plus the lightest shaded portion, <em>A</em> happening but <em>B</em> not happening. Notationally, the bar above the letter of an event just means the complement of that event - i.e. the event of that event not happening.</p>

\[P(A) = P(A \vert B) P(B) + P(A \vert \overline{B}) P(\overline{B})\]

<p>Let’s use the example of flipping two coins, where we want to find the probability that the second one is heads. Then, we have</p>

\[P(\text{second coin is heads}) = P(\text{second coin is heads } \vert \text{ first coin is heads}) P(\text{first coin is heads})\]

\[+ P(\text{second coin is heads } \vert \text{ first coin is not heads}) P(\text{first coin is not heads})\]

<p>We rewrite Bayes’ rule as follows using the Law of Total Probability, replacing the denominator:</p>

\[P(A | B) = \frac{P(B | A) P(A)}{P(B | A) P(A) + P(B | \overline{A}) P(\overline{A})}\]

<p>This is for the two-variable case, but it is not difficult to see that it generalizes to any finite number of events, say, when several outcomes <em>partition</em> the <em>sample space</em>, which means that exactly one of these events <em>must</em> happen. So, instead of just having two outcomes, <em>B</em> or <em>not B</em>, we have several. For example, the events of getting a one, a two, a three, a four, a five, or a six when rolling a die partition the sample space, because exactly one of them must happen when you roll the die! The takeaway is that we can write in the general case, with multiple events <em>B<sub>1</sub></em>, <em>B<sub>2</sub></em>, …, <em>B<sub>n</sub></em>, that</p>

\[P(B_i | A) = \frac{P(A | B_i) P(B_i)}{P(A | B_1)P(B_1) + \ldots + P(A | B_n) P(B_n)}\]

\[= \frac{ P(A | B_i) P(B_i)}{\sum^n_{j = 1} P(A | B_j)P(B_j)}\]
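<p>Here is the general formula in action on a small made-up example: we pick one of two coins uniformly at random - a fair coin and a coin that lands heads 3/4 of the time - flip it once, and see heads. The hypotheses “fair” and “biased” partition the sample space:</p>

```python
from fractions import Fraction

priors = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}        # P(B_i)
likelihoods = {"fair": Fraction(1, 2), "biased": Fraction(3, 4)}   # P(heads | B_i)

# denominator: total probability of seeing heads
evidence = sum(likelihoods[c] * priors[c] for c in priors)         # 5/8

# Bayes' rule: P(biased | heads)
posterior = likelihoods["biased"] * priors["biased"] / evidence
print(posterior)   # 3/5
```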

<p>Now if we leave behind <em>discrete</em> probability and move to <em>continuous</em> probability, not too much changes: we switch the summation to an integral and swap around some function notation, which we will introduce here. Note that the lowercase <em>p</em>s and <em>f</em>s mean more or less the same thing as the uppercase <em>P</em>s - they stand for the probability mass or probability density functions for discrete and continuous random variables, respectively. We usually use Greek letters, like <em>theta</em>, to stand for <em>hypotheses</em>, or unknown parameters, and little English letters, like <em>x</em>, to represent observations, or data values. Don’t worry too much about why there’s a <em>p</em> here or an <em>f</em> there - it’s just to make a distinction between <em>marginal</em> and <em>conditional</em> or <em>joint</em> distributions. Elsewhere, the notation may vary.</p>

\[p(\theta \:| \: x) = \frac{f(x \: | \: \theta) \, p(\theta)}{\int f(x \: | \: \theta) \, p(\theta) \, d\theta}\]

<p>In the context of machine learning, <em>x</em> is the <em>observation</em> - what we sample from some unknown distribution that we want to <em>model</em>. Theta is the unknown parameter that our distribution depends upon, representing our hypothesis about the random variable under observation. Once we know theta, we can easily generate new observations to form predictions. This is why we want to estimate theta as best we can - a good estimate of theta gives us a good predictive distribution. In fact, each term in the above equation has a name.</p>

<p>The numerator of the right side has <em>f(x | theta)</em>, which we refer to as the <em>likelihood</em>, because it’s the likelihood that we observe <em>x</em> if we fix some parameter value <em>theta</em>. We also have <em>p(theta)</em>, which we call the <em>prior</em>, because it represents our prior knowledge of <em>theta</em> and how it’s distributed - which values it’s likely to take before we see any data. In the denominator of the right side, we have an integral over all values of <em>theta</em> of the likelihood times the prior, which we can see is just the Law of Total Probability generalized to the continuous case. We refer to this as the <em>evidence</em>, because it summarizes what we know from the conditional distribution, <em>f(x | theta)</em>, and the prior, <em>p(theta)</em>. We can also call the denominator the <em>marginal</em>, because when we integrate across all values of <em>theta</em>, the denominator becomes a function of <em>x</em> only, <em>p(x)</em>, which is the <em>total probability</em>. Finally, we call the <em>p(theta | x)</em> on the left side of the equation the <em>posterior</em> distribution, because it’s the distribution we infer after we combine the information from the <em>likelihood</em> and the <em>prior</em> and apply Bayes’ Rule. We can rewrite this, in words, as</p>

\[\textbf{posterior} = \frac{\textbf{likelihood} \times \textbf{prior}}{\textbf{evidence}}\]
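<p>Because the evidence integral is usually the hard part to compute exactly, a common trick is to evaluate likelihood times prior on a grid of <em>theta</em> values and normalize, approximating the integral with a sum. Here is a sketch for a coin with unknown heads-probability <em>theta</em>, a uniform prior, and made-up data of 7 heads in 10 flips.</p>

```python
# Grid approximation of posterior = likelihood * prior / evidence for a
# coin's unknown heads-probability theta (hypothetical data: 7 of 10 heads).
n_grid = 1001
thetas = [i / (n_grid - 1) for i in range(n_grid)]
heads, flips = 7, 10

prior = [1.0] * n_grid  # p(theta): uniform over [0, 1]
lik = [t ** heads * (1 - t) ** (flips - heads) for t in thetas]  # f(x | theta)

# Evidence: the integral becomes a sum over the grid points
evidence = sum(l * p for l, p in zip(lik, prior))
posterior = [l * p / evidence for l, p in zip(lik, prior)]

# The posterior peaks at the empirical frequency of heads
peak = thetas[max(range(n_grid), key=lambda i: posterior[i])]
print(peak)  # 0.7
```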

<p>Note that we can easily replace the single value <em>x</em> with a bolded <strong>x</strong>, representing a vector of multiple values.</p>

\[f(x_1, x_2, \ldots, x_n, \theta) = f(\textbf{x}, \theta)\]

<h3 id="chain-rule-for-conditional-probability">Chain rule for conditional probability</h3>

<p>In a nutshell, the chain rule for conditional probability states that the probability of a bunch of things all happening is the probability of one of the things happening <em>given</em> the other things <em>happen</em> times the probability of all the other things happening.</p>

\[P(A_1 \cap A_2 \cap \ldots \cap A_n)\]

\[= P(A_1, A_2, \ldots, A_n) = P(A_1 | A_2, \ldots, A_n) \times P(A_2, \ldots, A_n)\]

<p>The first line of the above is just to illustrate the change in notation, from the “cap” notation earlier to using commas to denote events all happening. We can repeatedly apply the chain rule, giving us</p>

\[= P(A_1 | A_2, \ldots, A_n) \times P(A_2 | A_3, \ldots, A_n) \times P(A_3, \ldots, A_n)\]

\[= \ldots\]

\[= P(A_1 | A_2, \ldots, A_n) \times P(A_2 | A_3, \ldots, A_n) \times \ldots \times P(A_{n-1} | A_n) \times P(A_n)\]
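<p>A concrete example may help (the urn is hypothetical): the probability of drawing three red balls in a row, without replacement, from an urn with 5 red and 3 blue balls can be computed by the chain rule, one conditional factor per draw, and checked against a direct counting argument.</p>

```python
from math import comb

# Chain rule: P(A1, A2, A3) = P(A1) * P(A2 | A1) * P(A3 | A1, A2),
# where A_k is "the k-th draw is red". After each red draw, one fewer
# red ball (and one fewer ball total) remains.
chain = (5 / 8) * (4 / 7) * (3 / 6)

# Direct computation: choose 3 of the 5 reds, over all 3-ball subsets
direct = comb(5, 3) / comb(8, 3)

print(abs(chain - direct) < 1e-12)  # True: both give 60/336
```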

<h3 id="likelihood-functions">Likelihood functions</h3>

<p>Now suppose we draw <em>n</em> samples from our unknown distribution, under the assumption that the samples are independent. If we know the likelihood function <em>f(x<sub>i</sub> | theta)</em>, then to see how likely our sampled data is for a particular value of <em>theta</em>, we can repeatedly apply the chain rule of probabilities, replacing <em>p</em> with <em>f</em> since we are often dealing with continuous rather than discrete data:</p>

\[f(x_1, x_2, \ldots, x_n \: | \: \theta) = f(x_1 | x_2, \ldots, x_n, \theta) \times f(x_2, \ldots, x_n \: | \: \theta)\]

\[= \ldots\]

\[= f(x_1 | x_2, \ldots, x_n, \theta) \times f(x_2 | x_3, \ldots, x_n, \theta) \times \ldots \times f(x_n | \theta)\]

<p>Note that there is no <em>f(theta)</em> factor at the end of the expansion: every term stays conditioned on <em>theta</em>, because we are expanding the distribution of the data <em>given</em> the parameter, not the joint distribution of the data and the parameter.</p>

<p>And finally, because we assume each <em>x<sub>i</sub></em> is independent, we can drop all the other <em>x<sub>j</sub></em> terms from each conditional probability distribution. This is because they’re independent - i.e. the probability of <em>x<sub>i</sub></em> being what it is does not at all depend on what value any other <em>x<sub>j</sub></em> takes. This means that we have</p>

\[= f(x_1 | \theta) \times f(x_2 | \theta) \times \ldots \times f(x_n | \theta)\]

\[= \prod^n_{i = 1} f(x_i | \theta) = L(\theta | x_1, \ldots, x_n)\]

<p>which we call the <em>likelihood</em> function. Note that because the <em>training data</em>, or <em>features</em>, we observed, <em>x<sub>1</sub>, …, x<sub>n</sub></em>, are fixed, the likelihood function is only a function of <em>theta</em>, the unknown parameter upon which our mystery distribution depends. In fact, it is exactly the probability that we observe what we observed, <em>x<sub>1</sub>, …, x<sub>n</sub></em>, given that value of <em>theta</em>. In other words, you can give me a value for <em>theta</em>, and I can use this likelihood function to tell you how likely it is that we would observe the training data <em>x<sub>1</sub>, …, x<sub>n</sub></em>.</p>
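<p>To make this concrete, here is a minimal sketch of a likelihood function, assuming (hypothetically) that the data are i.i.d. draws from a Gaussian with unknown mean <em>theta</em> and known standard deviation 1: each factor is the density <em>f(x<sub>i</sub> | theta)</em>, and the likelihood is their product over the fixed training data.</p>

```python
from math import exp, pi, sqrt

# Per-point density f(x | theta): Gaussian with mean theta, sigma = 1
def f(x, theta):
    return exp(-0.5 * (x - theta) ** 2) / sqrt(2 * pi)

# L(theta | x_1, ..., x_n): product of the per-point densities.
# The data xs are fixed, so this is a function of theta alone.
def likelihood(theta, xs):
    L = 1.0
    for x in xs:
        L *= f(x, theta)
    return L

xs = [1.2, 0.8, 1.1]  # hypothetical fixed training data
print(likelihood(1.0, xs) > likelihood(3.0, xs))  # True: theta = 1.0 fits better
```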

<p>Note also that I don’t require a concrete value for <em>theta</em> to construct the likelihood function - I only need some training data <em>x<sub>1</sub>, …, x<sub>n</sub></em>. So, if I wanted to model a particular, unknown, <em>black-box</em> distribution, I sample <em>n</em> samples from it, which I call my training data. I use this training data and the chain rule of conditional probability to construct my likelihood function. I then try to <em>maximize</em> that likelihood function with respect to <em>theta</em>. That is, I try to find the value of theta that gives me the highest likelihood for my observation.</p>

\[\text{argmax}_{\theta \in \Theta} L(\theta | x_1, \ldots, x_n) = \hat{\theta}_{MLE}\]

<p>We call the theta that gives us the highest probability from our likelihood function <em>theta-hat</em> - there exist more formal terms for it, but the caret that signifies a best-guess estimate looks like a hat. This is known as the <em>maximum likelihood estimator</em>, hence the subscript MLE.</p>
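<p>A brute-force sketch of the maximization, assuming (hypothetically) Bernoulli coin-flip data: evaluate the likelihood on a grid of <em>theta</em> values and take the argmax. For Bernoulli data the MLE is known in closed form to be the sample mean, which the grid search recovers.</p>

```python
# Hypothetical Bernoulli training data: 7 ones (heads) out of 10 flips
xs = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

# L(theta | x_1, ..., x_n): a factor of theta for each 1, (1 - theta) for each 0
def likelihood(theta, xs):
    L = 1.0
    for x in xs:
        L *= theta if x == 1 else 1 - theta
    return L

# Grid search for the maximizing theta (theta-hat)
grid = [i / 1000 for i in range(1001)]
theta_hat = max(grid, key=lambda t: likelihood(t, xs))
print(theta_hat)  # 0.7, the sample mean
```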

<p>We make the distinction that the <em>estimator</em> is the function itself, and the <em>estimate</em> is the <em>estimator</em> evaluated with some observation.</p>

<p>Now that I have an estimated distribution, I can ask the real mystery distribution for some more data samples, known as the <em>test data</em>. If, for each test data point, my estimated distribution says there’s a high probability that I would get this particular point, then I say that my model <em>generalizes well</em>. If my estimated distribution has difficulty distinguishing this test data from, say, garbage data, then I say it <em>generalizes poorly</em> - perhaps it latched onto quirks of the training data, known as <em>overfitting</em>, or perhaps we picked an insufficient functional form, one that isn’t capable of modeling what’s really going on, known as <em>underfitting</em>. More on that in future posts.</p>

<h2 id="conclusion">Conclusion</h2>

<p>To summarize, we began by explaining how conditional probability is the basis of Bayes’ Rule, how the chain rule of conditional probability makes a likelihood function, and how to use the likelihood function to find the parameter of a mystery distribution.</p>]]></content><author><name></name></author><category term="blog" /><category term="dl" /><category term="tutorial" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Welcome!</title><link href="http://blog.jzhanson.com/blog/update/thoughts/2017/12/21/welcome.html" rel="alternate" type="text/html" title="Welcome!" /><published>2017-12-21T18:53:00+00:00</published><updated>2017-12-21T18:53:00+00:00</updated><id>http://blog.jzhanson.com/blog/update/thoughts/2017/12/21/welcome</id><content type="html" xml:base="http://blog.jzhanson.com/blog/update/thoughts/2017/12/21/welcome.html"><![CDATA[<p>Hi there!</p>

<p>Welcome to Junior Varsity Computer Science, a blog where I write code and talk about it! There are some things about the blog that I would like to iterate on, but as of right now, I think it is fully functional as a blog.</p>

<p>I imagine this blog will have three main categories of posts:</p>
<ol>
  <li>
    <p>Code - this might be coding interview questions I find interesting and want to share, in which case I’ll write down the question, insert a big white space or cut, and then walk through the process I went through to arrive at the optimal solution, outlining my lines of thought and any gotchas I ran into. It could also be an interesting algorithm that I translate into code, possibly in several different languages to compare and do a little language critique. It could even be me posting a snippet of code from a personal project and talking about why it sucks or why I’m proud of it. I’m considering making interview question posts weekly, maybe “Technical Interview Thursdays” or something.</p>
  </li>
  <li>
    <p>Reading list - this blog also will be where I post interesting articles or papers I read and where I write anything from a couple paragraphs to an entire essay on what I think of them. It could even be on broader topics where I tie several articles/papers together.</p>
  </li>
  <li>
    <p>Thoughts - the least frequent of the post types, where I post things I’ve been thinking about that I consider important enough to write about. Blog update posts, which I will only use for major changes, and life updates, which I will only use for extremely major changes like graduation or death, fall under this category.</p>
  </li>
</ol>

<p>To lots of posts!
Josh</p>]]></content><author><name></name></author><category term="blog" /><category term="update" /><category term="thoughts" /><summary type="html"><![CDATA[Hi there!]]></summary></entry></feed>