CollegeHumor, like many websites that want to reduce database requests and speed up processing, uses memcached as a caching layer. This article will explain our software implementation and discuss a number of things we learned along the way. If you read my blog you will know I’m a PHP guy. CollegeHumor is also coded in PHP. None of the concepts covered in this post will be specific to any particular programming language. There will be a few simple PHP examples.

Let’s jump right into the class setup… Our cache class is a completely static class and can’t be instantiated on its own. When our application starts up a create_instance function is called which will create a connection to memcache in a static object stored in our class. Shortly after the instance is created, our config file is loaded and we add the servers that make up our memcache pool. Really nothing fancy happening thus far…

Besides the create_instance and the add_server functions our class only has three main public functions; get, set and delete. When I say main, those are really the only three functions we use throughout our application, everything else is for internal use by the cache class. There are also a few logging and stats functions which aren’t important for this article.

Set Function

public static function set($key, $value, $expire = -1) {
    if ($expire == -1)
        return true;

    // there was no key passed, to be safe we will create one
    if (!strlen($key))
        $key = md5($value);

    $value = self::falsify($value);

    self::instance_set($key, $value); // set the instance cache
    self::$memcache->set($key, self::dogpile_set($value, $expire), MEMCACHE_COMPRESSED, $expire + self::EXTENDED_LIFE); // set memcache
}

Let me explain what’s going on and why… The set function takes 3 parameters: $key, $value and $expire.

1) Check the $expire value. If the expiration is less than zero (we default to -1) we assume this wasn’t meant to be cached and we just return true.

2) Make sure there was a $key passed. If there isn’t a key, we play it safe and set $key equal to the MD5 of the $value. Otherwise, there is a chance of key collisions. We definitely don’t want that.

3) We run the $value through a function called “falsify”. We found that storing the actual value of “false” caused a number of issues. The first being that we interface very closely with our database layer. If our get function returns false, the database layer assumes it didn’t find a value in cache and does a query (more on the solution to this in the get section). Second, the get function will return false when it fails to find data. The solution was the falsify function. It looks at the value and if it’s false the value is turned into a special string we can later identify. Something like “<-F4LS3->”. Basically a value that would never normally be stored. Later on in the get function we will use this to identify the difference between a stored false and not finding results.

4) We created a local instance cache that stores any data we get/set in a static array. This helps prevent unnecessary subsequent requests for data we may have already retrieved. We do this by passing the key/value to an instance_set function which stores the data by its key.

5) Sites with high loads are often subject to the dogpile effect. Basically, if something falls out of cache and two or more users make the same request and receive a miss from memcache, they end up performing the same operation (usually a database query) to refresh the data. For smaller sites the extra database work is fairly minor. But larger sites under heavy load often can’t afford this. Think about something falling out of cache on the homepage of a large site. This could generate hundreds of the same query, locking up tables and eventually bring the database to a crawl. I have seen it, it’s not that awesome.

The way we deal with this problem is by passing the value and expiration time to a function called set_dogpile. This function is simple and you will understand its purpose later on… All this function does is create an array with the key “v” and stores the $value and a key called “t” where we store the $expire and then returns the array.

private static function dogpile_set($data, $expire) {
    return array('t' => time() + $expire, 'v' => $data);
}

6) Finally we can write our data to memcache! We set the key to the $key we were originally passed and the value to the array that was returned from dogpile_set. The expiration time is slightly more complicated. Our cache class has a constant variable called “extended_life”. We use this in conjunction with the dogpile effect prevention. Our extended life is set to 300 seconds. This will be better explained in the get section below, bare with me for now. Set the expiration to the $expire value passed in + the extended_life.

*Its important to note that the version of data stored in the instance cache (Step 4) is the original data, not the dogpile array.

Get Function

public static function get($key, &$found = null) {
    if(is_array($key)) {
        // Not used for the purpose of this article
        // $value = self::get_complex($key);
        } else {
            $value = self::get_simple($key);
        }

        if($value === false) {
            $found = false;
           
            return false;
        } else {
   
        $found = true;

        if(is_array($value))
            return $value; // complex response (not used in this article)
        else
            return self::defalsify($value); // simple response
    }
}

The get function takes two parameters: The first is the $key and the second is an optional referenced parameter called &$found (default to null).

Our get function can take two types of keys. Either a single key request as a string or multiple key requests as an array. The first thing we do in the get function is determine the key type. If it’s a string we send the key off to the “get_simple” function. If it’s an array it goes to the “get_complex” function. For simplicity reasons we are only going to focus on the get_basic function.

If you do explore get_complex on your own, keep in mind that memcache can take an array of keys and retrieve those results in a single request. You will also need to keep track of what is and isn’t found from the instance cache. That way we can do lookups from memcache on the missed keys. There is an example of get_complex at the very end of this article, it should be easy to understand after you grasp get_basic.

Get Simple

private static function get_simple($key) {
    $value = self::instance_get($key);

    // instance cache returned nothing, look it up
    if(!$value) {
        $value = self::dogpile_get($key, self::$memcache->get($key));

    if(!empty($value)) {
        // results were found in memcache, lets add it to the instance cache
        self::instance_set($key, $value);
    } else {
        // nothing was found, return
        return false;
        }
    }

    return $value;
}

1) First the key is passed to the “get_instance” function. This function checks our static instance array to see if we have already performed a get or set for that key. If data is found, return it, otherwise return false.

We are back in the get_basic function now. If the value returned from get_instance isn’t false, we return the value back to “get”.

2) If the value returned from get_instance is false we need to ask memcache for the data. You should do that now.

Whatever the response is from memcache we will pass the key and the data to the dogpile_get function.

private static function dogpile_get($key, $data) {
    if(!empty($data)) {
        $value = $data['v'];

        if($data['t'] > 0) {
            if(time() >= $data['t']) {
                $data['t'] = time() + self::DELAY; // Update the cache time

                // set the stale value back to memcache for a short 'delay' so no one else tries to write the same data
                self::$memcache->set($key, $data, MEMCACHE_COMPRESSED, self::DELAY);

                return false;
            }
        }

        return $value;
    }

    return false;
}

a. First we need to make sure the value isn’t an empty array. If it fails that check we assume the data is corrupt or snuck into memcache and we return false. If everything looks good, extract the value from the “v” index and the time from the “t” index.

b. Make sure the “t” value is greater then 0. If its not, return the “v” value.

c. Check if the current time is greater than or equal to the “t” value. This is where our dogpile magic happens. The extended_life we set earlier gives us an extra few minutes to let the application determine what’s stale and what’s valid. The “t” value is the actual time we want the data to expire even thought we told memcache to store it for an extra few minutes. Since we got this far, it means the data is meant to expire. To prevent everyone and their mother from reloading the cache we are going to temporarily store the old stale data back in memcache while this user has the opportunity to refresh it.

We need to update the “t” value of the data array that was passed in to be the current time + $delay. Delay is a constant integer; we have out set to 30 seconds (change as needed). This delay is our dogpile buildup prevention. You will see how in a moment… Now we write the stale data back to memcache. Set the key to the $key that was passed in, the value to our updated $data array and the expiration to $delay. Now for the next 30 seconds all subsequent requests will receive the stale data (shouldn’t be a big deal for most sites). The user who found the stale data now has 30 seconds to complete an update. If she fails to do so, the next user to make the request will have that opportunity. Return false, because we want to the request to think we found nothing and fetch fresh data.

3) Now that we have our response and have dealt with dogpiling we need to analyze the response. If the value from get_simple is false we need to set &$found to false and return false. The found variable does exactly as you probably guessed. It tells us if the get function really found something (even if the value was stored as false). It’s a referenced variable, so we can easily use it outside of the cache class.

If the response is not false we set &$found to true and pass the data off to the defalsify function, which compares the value to our constant “false identifier” (<-F4LS3->). If the value equals the identifier, return false. Now the “get” function can finally return the results.

Delete Function

The delete functionality is simple compared to the get/set functionality. The goal is to delete the data from the instance cache and from memcache. This is easily shown with an example.

public static function delete($keys) {
    if(!is_array($keys))
        $keys = array($keys);

    if($keys) {
        foreach($keys as $key) {
            self::$memcache->delete($key);
            self::instance_delete($key);
        }
    }
}

Conclusion

I hope this article gives you a better insight into some more advanced caching techniques. I, nor CollegeHumor claim to have invented any of these techniques or methods. This was a high level overview of how our caching layer functions and I hope someone learned something from it. I know it might be difficult to follow all the operations, unfortunately I cant post the entire class code. But, if you have any questions about implementation, caching or anything PHP, feel free to contact me or post a comment.

Get Complex

private static function get_complex($keys) {
    $results = array();
    $missing = array();
    $num_keys = count($keys);

    // lets see if we can find any of these in the instance cache first
    for($x = 0; $x < $num_keys; $x++) {
        $key = $keys[$x];
        $value = self::instance_get($key);
       
        if(!$value) {          
            // nothing was found in the instance cache, lets create a list to look them up in memcache         
            $missing[] = $key;     
        } else {           
            // we found what we were looking for in instance cache!            
            $results[$key] = self::defalsify($value);      
        }  
    }  
   
    if(empty($missing))        
        return $results; // we found everything we need in instance cache  
       
    // look up anything thats missing in memcache  
    $values = self::$memcache->get($missing);

    if(!empty($values)) {
        foreach($values as $key => $value) {
            $value = self::dogpile_get($key, $value);

            if($value) {
                $results[$key] = self::defalsify($value);

                self::instance_set($key, $value);
            }

        }
    } else {
        // we didnt find what we needed in memcache
        return false;
    }

    return $results;
}