Crossing the Drawbridge: The Life and Times of Dale “Intern” Jr.

At risk of this becoming a paean to Startup-dom, let me begin by first thanking the various ping-pong maestros that this company employs for elevating my game. I thought I was the best ping-pong player this side of the Mississippi until I met these guys, and they put me in my place.

Now, to shine a little light on the work I did while at Drawbridge, let me tell you that not once did I fetch coffee. When I was assigned to do “menial” work, it was to help Len, the front-end king, port all our docs into static HTML/CSS. This turned out to be a great exercise because Len got me to start using vi, a command-line text editor (for those having trouble switching, MacVim helps tremendously with the transition: it mixes vi with normal Mac GUI controls).

I worked directly with Paul, who at the time was Drawbridge’s “entire global sales force”, to assemble target lists. Last week I spotted some of the prospects I’d dredged up on a slide deck presented to the board of directors. It’s not like this was my magnum opus being featured in NYC’s best gallery or anything, but my heart definitely swelled with pride a little.

Although these are but a small sampling of the varied projects I worked on, the common theme should be clear: I was no Excel monkey. While many of my Dartmouth compatriots were busy formatting PowerPoints at Bank X and selling their souls to the Investment Bank Gods, I was dropped into the heart of a vibrant startup where I worked on meaningful projects and was able to observe some of the smartest people in Silicon Valley tackle problems and carry a vision to fruition, one painstaking model release or sales call at a time.

Xiang and Heedong: it may be true that nobody will remember the day we stood basking in the glory of victory on that podium after go-kart racing, but I, for one, will never forget the raw passion with which we raced and won. Fight on, gentlemen!

Thanks Drawbridge!

final haiku:

Fellow Drawbridgers

Thanks for putting up with me

It was the bee’s knees

actually here’s the last haiku:

Drawbridge poker nights

John will take all your money

but no one knows how

(especially him)

this has been Brian Joseff the Intern, over and out.


    The Pairing Algorithm: A High-Level Overview of Our Special Sauce

    By Tin Kyaw

    The proliferation of smartphones, tablets and other mobile devices enables a person to access the Internet via multiple devices everywhere, every day.  In theory, the ability to reach users on more devices offers marketers better and richer targeting opportunities.  In practice, however, the lack of data sharing between browsers and mobile applications creates a virtual boundary, which makes building a user model from cross-platform data without a login ID especially challenging.

    At Drawbridge, we have developed a statistical algorithm to match users’ cookies with their devices.  Using our algorithm, we are able to match, within levels of confidence, the device ids belonging to a person represented by a cookie in the browser and vice versa.  We run our algorithm daily to match device ids in our mobile advertising network with the desktop cookies we receive from our partners and build a database storing the device ids and the cookies associated with them.

    Currently, our algorithm works as follows:

    1. Match device ids with cookies using features common in both types of requests
    2. Find and penalize ‘noisy’ features
    3. Compute score
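The three steps above can be illustrated with a toy sketch. This is not Drawbridge's actual algorithm: the feature names, the fan-out based penalty, and the additive score are all assumptions chosen only to show the shape of "match on shared features, down-weight noisy ones, then score".

```python
from collections import defaultdict
from math import log


def pair_scores(device_obs, cookie_obs):
    """Score (device_id, cookie_id) pairs from (id, feature) observations.

    The weighting scheme is hypothetical; the source does not describe it.
    """
    devices_by_feature = defaultdict(set)
    cookies_by_feature = defaultdict(set)
    for device_id, feature in device_obs:
        devices_by_feature[feature].add(device_id)
    for cookie_id, feature in cookie_obs:
        cookies_by_feature[feature].add(cookie_id)

    scores = defaultdict(float)
    # Step 1: only features seen in both request types can produce candidate pairs.
    for feature in devices_by_feature.keys() & cookies_by_feature.keys():
        devices = devices_by_feature[feature]
        cookies = cookies_by_feature[feature]
        # Step 2: a feature shared by many ids is "noisy", so its weight shrinks
        # as the number of candidate pairs it generates grows.
        fanout = len(devices) * len(cookies)
        weight = 1.0 / (1.0 + log(fanout))
        # Step 3: a pair's score is the sum of the weights of its shared features.
        for device_id in devices:
            for cookie_id in cookies:
                scores[(device_id, cookie_id)] += weight
    return dict(scores)


# Hypothetical observations: "home_ip" is seen by one device and one cookie,
# "cafe_ip" by many.
device_obs = [("d1", "home_ip"), ("d1", "cafe_ip"), ("d2", "cafe_ip"), ("d3", "cafe_ip")]
cookie_obs = [("c1", "home_ip"), ("c1", "cafe_ip"), ("c2", "cafe_ip")]
scores = pair_scores(device_obs, cookie_obs)
```

In this toy data the pair (d1, c1) shares a quiet feature and therefore outscores pairs that only share the crowded one.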


    Our matching algorithm is written as a series of Hadoop MapReduce jobs, and we store the data in Apache Avro to keep it compact and efficient to process.  Each run of the algorithm takes as input the mobile application and desktop cookie data from the previous 60 days.  We try to include as much data as possible in each run in order to maximize the precision of the algorithm and the coverage of our device id to cookie database.

    Today we record over 2.5 billion user activities across both mobile and desktop platforms, totaling over 1.5 terabytes in storage per day.  Our 83-node Hadoop cluster is also responsible for a myriad of other reporting and optimization jobs.  To fit 60 days’ worth of log data and run the algorithm in a reasonable amount of time without overrunning the cluster, we aggregate our logs daily, reducing the dataset from 1.5 terabytes to about 40 gigabytes per day.

    As part of our matching algorithm, we have to perform data joins over massive datasets, and we found that the performance of those joins degrades exponentially as the sizes of the datasets increase.  If one of the datasets is small enough, one could load it entirely into memory in each mapper and perform an in-memory join with records from the large dataset as they are loaded into the mappers.  However, when neither dataset fits entirely into memory, which is typically the case with our datasets, a map-side in-memory join becomes infeasible.

    We implemented our own reduce-side in-memory join algorithm in order to perform those massive joins.  Following are the steps in our reduce-side in-memory join algorithm:

    1. Split the smaller dataset based on join keys into partitions each of which can be loaded into memory in each reducer.  This can be achieved by using a custom partitioner while generating this smaller dataset.
    2. Stream the larger dataset through the regular map-reduce steps while using the same partitioner to ensure the join keys from the larger dataset will be mapped to the same reducers where the same keys from the smaller dataset will be loaded into memory if present.
    3. In each reducer, do a simple in-memory lookup to join the key from the smaller dataset with the key from the larger dataset.
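The three steps above can be simulated in a single process to show the idea. This is a minimal sketch, not the Hadoop job itself: the function names are made up for illustration, the `crc32`-based partitioner stands in for a custom Hadoop partitioner, and buffering the large dataset per reducer stands in for the shuffle.

```python
from collections import defaultdict
from zlib import crc32


def partition(key, num_reducers):
    """Deterministic partitioner shared by both datasets."""
    return crc32(str(key).encode("utf-8")) % num_reducers


def split_small(small, num_reducers):
    """Step 1: split the smaller dataset into one in-memory table per reducer."""
    parts = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in small:
        parts[partition(key, num_reducers)][key].append(value)
    return parts


def reduce_side_join(small, large, num_reducers=4):
    small_parts = split_small(small, num_reducers)

    # Step 2: route each large record with the same partitioner, so it lands
    # on the reducer that holds the matching slice of the small dataset.
    # (In Hadoop the shuffle does this; here we simply buffer per reducer.)
    routed = [[] for _ in range(num_reducers)]
    for key, value in large:
        routed[partition(key, num_reducers)].append((key, value))

    # Step 3: each reducer joins with a plain in-memory lookup against
    # its own partition of the small dataset only.
    joined = []
    for reducer_id in range(num_reducers):
        lookup = small_parts[reducer_id]
        for key, large_value in routed[reducer_id]:
            for small_value in lookup.get(key, ()):
                joined.append((key, small_value, large_value))
    return joined


small = [("k1", "cookieA"), ("k2", "cookieB")]
large = [("k1", "dev1"), ("k3", "dev9"), ("k2", "dev2"), ("k1", "dev3")]
result = sorted(reduce_side_join(small, large))
```

The point of the design is that no reducer ever needs more than one partition of the smaller dataset in memory, so the partition count can be raised until each slice fits.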

    The performance improvement of our reduce-side in-memory join over regular map-reduce join is illustrated in the chart below:

    We have accumulated cookies that are matched to over 500 million devices in our current database and the number is growing rapidly.  Our matching algorithm can achieve over 60% precision and 60% recall in the cookie to device pairs we match according to the latest test with our data partner.  We are constantly striving to improve both the precision and runtime of our matching algorithm while working within the constraints of our cluster capacity.  We are currently looking to incorporate HBase in our matching algorithm to improve its runtime performance and we will be happy to provide another update on our findings in due time.  Please stay tuned!


      Jerry Ye wowing crowds at Hadoop Innovation Summit

      Our very own Jerry Ye, dashing data scientist by day, merciless signal extorter by night (actually he does this 24/7) has travelled down to beautiful San Diego for the Hadoop Innovation Summit to give a talk on how he does his machine learning magic. He’s scheduled for Thursday 2/21 at 2:10PM. If we get a video feed I will share it here.

      Here’s the abstract:

      Drawbridge is solving the problem of cross device ad targeting at scale.  After matching over 450 million devices, we built a large scale machine learning platform that trains on over 2 billion examples across 1 million dimensions to deliver more relevant ads to users regardless of the device they are on.  This talk will go over parts of our Hadoop infrastructure, give a high level introduction to our bridging algorithm for matching devices, and go over lessons learned from scaling up our machine learning platform.

      And, here’s his mug:

      [Photo of Jerry Ye]


        The Modeling Dance: Scale, Quality, and Speed

        By Xiang Li, Data Scientist

        Here at Drawbridge, scale is the key to our success. We build all kinds of models that predict click-through rates, conversion rates, winning rates, winning prices, and other metrics that enable us to make the optimal decision when trying to serve the most relevant ads to users at the lowest possible cost to us and our clients.

        Many of the problems we are dealing with are non-stationary in nature, making their statistics difficult to wrangle into a model; it’s like pinning spaghetti to the wall. This requires us to push model changes rapidly so we can catch the trend. Some of the non-stationary characteristics are captured by our online model updating procedure, which dynamically adjusts the model weights, and we address the other characteristics in our daily model release.

        This so-called “daily model release” is a bit of a misnomer. Every day we will roll out at least three new models, and test them against our existing models. At a very high level, there are two kinds of tests that we do: one is to verify that the model does exactly what we want it to do, and the second is to verify that the new model actually performs better than the old one. If the new models pass the test, they will be pushed into our production system. To reiterate, we are charging through this swift release cycle and testing process so we can ride the latest trends in the behavior of our users and the vicissitudes in the market conditions. What’s more, these new model pushes are conducted in addition to the multiple other new models we push every day to our experimental test buckets of data.
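The two kinds of tests described above can be pictured as a simple release gate. The sketch below is only an illustration under stated assumptions: the post does not say which checks or metrics are used, so the probability-range sanity check, the log-loss comparison, and all function names are hypothetical.

```python
from math import log


def log_loss(predict, examples):
    """Average negative log-likelihood of binary labels under the model."""
    eps = 1e-12
    total = 0.0
    for features, label in examples:
        p = min(max(predict(features), eps), 1.0 - eps)
        total -= label * log(p) + (1 - label) * log(1.0 - p)
    return total / len(examples)


def passes_sanity(predict, examples):
    """Test 1 (assumed form): the model outputs valid probabilities."""
    return all(0.0 <= predict(features) <= 1.0 for features, _ in examples)


def should_release(candidate, current, holdout):
    """Test 2 (assumed form): the candidate must beat the current model on held-out data."""
    if not passes_sanity(candidate, holdout):
        return False
    return log_loss(candidate, holdout) < log_loss(current, holdout)


# Hypothetical held-out click data: (features, clicked).
holdout = [({"x": 1}, 1), ({"x": 0}, 0), ({"x": 1}, 1), ({"x": 0}, 0)]
current = lambda features: 0.5
candidate = lambda features: 0.9 if features["x"] else 0.1
broken = lambda features: 1.5

release = should_release(candidate, current, holdout)
```

A candidate that fails the first check is rejected before any comparison is made, which mirrors the order of the two tests in the text.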

        Now that I have discussed our speed and agility, let me bring scale into the equation. Our swift release cycle is conducted on a playing field populated by massive amounts of data. To better understand the truly gargantuan data we deal with, feast your eyes on these stats:

        - Our models on average each have over 200K free parameters that capture virtually every type of information we could extract from our users, the devices we track, and the market. Our biggest model has almost 900K free parameters.
        - Every day we use more than 1.2 billion records each to update our models, and this must be done quickly and efficiently so we can push out models to production in a timely manner.
        - Every month we consume more than 30 billion records to build our manifold models.

        Numerous factors compound the difficulty of this herculean task. For example, we don’t want to sacrifice the quality of our model in exchange for the scale. Additionally, sometimes we need to update our models on the go and can’t wait for the daily release cycle.

        Want to know more about how we do it? Why don’t you submit your resume and talk to us?


          Just Point Me in the Right Indirection

          by Len Frenkel, Ajax Ninja

          “All problems in computer science can be solved by another level of indirection.”
          –      David Wheeler

          If we want a high performance website, we need to minimize the number of requests and the payload of each request. Sounds obvious, but what does that mean for CSS and JavaScript? To minimize the number of requests, we want to combine multiple files into one. To minimize the payload, we want to minify our CSS and JavaScript. We probably don’t want to have all our code in one file at development time, and we certainly don’t want to maintain minified code. That means we need a build step when our code goes to production.  That also means that we probably don’t want to have <link> and <script> tags directly in our code. While we’re at it, we may as well think about cache busting and using a CDN. Both require some kind of change to production URLs. Can we solve all these problems in one fell swoop? Why not?

          To recap, we want modularity in our development environment but a small number of files in production. We want readability at development time but minification in production. We want URLs that point to our development machine to magically point to a CDN in production. And finally, we want URLs for static files to vary with each release so that our users don’t get stale cached files. Indirection to the rescue.

          At Drawbridge, our modules are organized into packages. A package is defined by a JSON file. Here is the package for our front page:

          {
            "files": [
              "/.../front.css",
              "/.../Front.js"
            ],
            "dependencies": [
              "/.../widget.json",
              "/.../counter.json",
              "/.../splash.json",
              "/.../page.json"
            ]
          }

          The definition has two sections: “files” and “dependencies.” “Files” contains the list of files that this package brings to the table. The “dependencies” list consists of other packages that this package relies on. Instead of having <link> and <script> tags, our pages refer to these packages. Here is how our front page includes CSS and JavaScript:

          Inclusion::includeCss('/…/common.json');
          Inclusion::includeCss('/…/front.json');
          Inclusion::includeJs('/…/common.json');
          Inclusion::includeJs('/…/front.json');

          “Inclusion” is our tool that handles all the details. In a development environment, the above code generates the following:

          <link rel="stylesheet" type="text/css" href="http://test.assets.drawbrid.ge/…/imports.css" />
          <link rel="stylesheet" type="text/css" href="http://test.assets.drawbrid.ge/…/ext-all.css" />
          <link rel="stylesheet" type="text/css" href="http://test.assets.drawbrid.ge/…/extcustom.css" />
          <link rel="stylesheet" type="text/css" href="http://test.assets.drawbrid.ge/…/util.css" />
          <link rel="stylesheet" type="text/css" href="http://test.assets.drawbrid.ge/…/common.css" />
          <link rel="stylesheet" type="text/css" href="http://test.assets.drawbrid.ge/…/common.css" />
          <link rel="stylesheet" type="text/css" href="http://test.assets.drawbrid.ge/…/counter.css" />
          <link rel="stylesheet" type="text/css" href="http://test.assets.drawbrid.ge/…/splash.css" />
          <link rel="stylesheet" type="text/css" href="http://test.assets.drawbrid.ge/…/popup.css" />
          <link rel="stylesheet" type="text/css" href="http://test.assets.drawbrid.ge/…/checkbox.css" />
          <link rel="stylesheet" type="text/css" href="http://test.assets.drawbrid.ge/…/slideoutmessage.css" />
          <link rel="stylesheet" type="text/css" href="http://test.assets.drawbrid.ge/…/loginpopup.css" />
          <link rel="stylesheet" type="text/css" href="http://test.assets.drawbrid.ge/…/dropdownbutton.css" />
          <link rel="stylesheet" type="text/css" href="http://test.assets.drawbrid.ge/…/list.css" />
          <link rel="stylesheet" type="text/css" href="http://test.assets.drawbrid.ge/…/dropdownmenu.css" />
          <link rel="stylesheet" type="text/css" href="http://test.assets.drawbrid.ge/…/header.css" />
          <link rel="stylesheet" type="text/css" href="http://test.assets.drawbrid.ge/…/footer.css" />
          <link rel="stylesheet" type="text/css" href="http://test.assets.drawbrid.ge/…/page.css" />
          <link rel="stylesheet" type="text/css" href="http://test.assets.drawbrid.ge/…/front.css" />
          <script type="text/javascript" src="http://test.assets.drawbrid.ge/…/ext-all-debug.js"></script>
          <script type="text/javascript" src="http://test.assets.drawbrid.ge/…/BigInt.js"></script>
          <script type="text/javascript" src="http://test.assets.drawbrid.ge/…/Barrett.js"></script>
          <script type="text/javascript" src="http://test.assets.drawbrid.ge/…/RSA.js"></script>
          <script type="text/javascript" src="http://test.assets.drawbrid.ge/…/Util.js"></script>
          <script type="text/javascript" src="http://test.assets.drawbrid.ge/…/I18n.js"></script>
          <script type="text/javascript" src="http://test.assets.drawbrid.ge/…/Widget.js"></script>
          <script type="text/javascript" src="http://test.assets.drawbrid.ge/…/Animator.js"></script>
          <script type="text/javascript" src="http://test.assets.drawbrid.ge/…/Counter.js"></script>
          <script type="text/javascript" src="http://test.assets.drawbrid.ge/…/Pairing.js"></script>
          <script type="text/javascript" src="http://test.assets.drawbrid.ge/…/Splash.js"></script>
          <script type="text/javascript" src="http://test.assets.drawbrid.ge/…/Popup.js"></script>
          <script type="text/javascript" src="http://test.assets.drawbrid.ge/…/Checkbox.js"></script>
          <script type="text/javascript" src="http://test.assets.drawbrid.ge/…/SlideoutMessage.js"></script>
          <script type="text/javascript" src="http://test.assets.drawbrid.ge/…/LoginPopup.js"></script>
          <script type="text/javascript" src="http://test.assets.drawbrid.ge/…/DropdownButton.js"></script>
          <script type="text/javascript" src="http://test.assets.drawbrid.ge/…/List.js"></script>
          <script type="text/javascript" src="http://test.assets.drawbrid.ge/…/DropdownMenu.js"></script>
          <script type="text/javascript" src="http://test.assets.drawbrid.ge/…/Header.js"></script>
          <script type="text/javascript" src="http://test.assets.drawbrid.ge/…/Footer.js"></script>
          <script type="text/javascript" src="http://test.assets.drawbrid.ge/…/Page.js"></script>
          <script type="text/javascript" src="http://test.assets.drawbrid.ge/…/Front.js"></script>

          Phew! That’s a long list! Basically, it’s a de-duped post-order traversal of the dependency tree. Fortunately, the production list is quite a bit shorter:

          <link rel="stylesheet" type="text/css" href="http://assets.drawbrid.ge/www/d0e6932c2cef03e2f81a4370e3b888bd/common.css" />
          <link rel="stylesheet" type="text/css" href="http://assets.drawbrid.ge/www/d0e6932c2cef03e2f81a4370e3b888bd/front.css" />
          <script type="text/javascript" src="http://assets.drawbrid.ge/www/d0e6932c2cef03e2f81a4370e3b888bd/common.js"></script>
          <script type="text/javascript" src="http://assets.drawbrid.ge/www/d0e6932c2cef03e2f81a4370e3b888bd/front.js"></script>

          Our build process produces one minified CSS and JavaScript file for each top-level package and puts these files into a uniquely named directory. Note that the production server is different from development, which takes care of the CDN requirement. The production URLs point to the directory created by the build process, which achieves cache busting.  Our build process also automatically generates CSS sprites and filters our CSS to point to the generated sprites in production, but that’s a topic for another post.
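The de-duped post-order traversal that produces the development list can be sketched in a few lines. This is an illustration, not the Inclusion tool itself: the function name, the in-memory `packages` mapping, and the dependency contents of the sample packages are assumptions (the post only shows the definition of the front page package, with its paths elided).

```python
def resolve_files(package_name, packages):
    """Return the files of a package and its dependencies in de-duplicated post-order."""
    ordered = []
    seen_files = set()
    visited = set()

    def visit(name):
        if name in visited:  # each package contributes once; also guards against cycles
            return
        visited.add(name)
        package = packages[name]
        # Post-order: emit everything a package depends on before its own files.
        for dependency in package.get("dependencies", []):
            visit(dependency)
        for path in package.get("files", []):
            if path not in seen_files:
                seen_files.add(path)
                ordered.append(path)

    visit(package_name)
    return ordered


# Hypothetical package definitions, keyed by package file name.
packages = {
    "widget.json": {"files": ["widget.css", "Widget.js"], "dependencies": []},
    "counter.json": {"files": ["counter.css", "Counter.js"], "dependencies": ["widget.json"]},
    "splash.json": {"files": ["splash.css", "Splash.js"], "dependencies": ["widget.json"]},
    "front.json": {
        "files": ["front.css", "Front.js"],
        "dependencies": ["widget.json", "counter.json", "splash.json"],
    },
}

all_files = resolve_files("front.json", packages)
css_files = [path for path in all_files if path.endswith(".css")]
js_files = [path for path in all_files if path.endswith(".js")]
```

Because dependencies are emitted first, a shared package such as the widget appears once, ahead of everything that relies on it, and splitting the result by extension gives the separate CSS and JavaScript lists.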


            Building An Efficient Campaign Targeting Engine

            By Sanjay Agarwal, Engineering

            I am the engineering lead on building the Adserving platform at Drawbridge, and would like to share some insights on building an efficient campaign targeting engine for Adservers.

            Overview

            Our Adservers receive requests from a variety of supply partners and often have strict SLA requirements governing the response we return, which can be as aggressive as 75 milliseconds. Our Adserver is built using open source Java frameworks, augmented by proprietary technologies. One of the main problems we needed to solve when processing Ad requests was finding the campaigns eligible to run on a given Ad request, based on targeting criteria specified by Advertisers.

            At a high level, each request needs to process the following before returning a response:

            1. Extract a set of attributes related to the ad impression, such as device type, location, ad size, user attributes, etc
            2. Find the campaigns that qualify to serve on this impression. This is also known as applying the campaign targeting.
            3. Filter out campaigns that don’t have enough budget
            4. Filter out campaigns that have reached daily frequency cap set by Advertisers
            5. Predict revenue for these campaigns using prediction models
            6. Choose a campaign and return an ad response.

            We are going to focus on the in-house technology we built for applying campaign targeting for an Ad request.

            What is Campaign Targeting?

            Given Ad request level attributes (for example, device_type = iphone and country = US), we need to find the campaigns that are eligible to run on this impression. In this example, campaigns that either target device_type = iphone OR do not specify device_type targeting are eligible for device_type = iphone. The eligible campaigns must also target US OR not use country targeting at all. When a campaign doesn’t specify targeting for an attribute, we call it “Don’t Care” or DNTC for short.

            More formally,

            For a request with:

            • device_type = iphone
            • country = US

            Campaigns that satisfy each of the following criteria are eligible to serve on this request:

            • device_type = iphone OR device_type = DNTC
            • country = US OR country = DNTC
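The criteria above amount to a per-attribute check. The sketch below is a naive version that tests one campaign at a time; the function name and the way targeting is represented (a dictionary of allowed value sets, with a missing attribute meaning DNTC) are assumptions made for illustration.

```python
DNTC = None  # "Don't Care": the campaign does not target this attribute


def is_eligible(campaign_targeting, request):
    """A campaign is eligible if, for every request attribute, it either
    targets the request's value or does not target that attribute at all."""
    for attribute, value in request.items():
        allowed = campaign_targeting.get(attribute, DNTC)
        if allowed is not DNTC and value not in allowed:
            return False
    return True


request = {"device_type": "iphone", "country": "US"}

# Hypothetical campaigns.
targets_iphone_only = {"device_type": {"iphone"}}           # country is DNTC
targets_ios_in_us = {"device_type": {"iphone", "ipod"}, "country": {"US"}}
targets_canada = {"country": {"CA"}}                         # device_type is DNTC
```

Checking every campaign this way costs time proportional to the number of campaigns per request, which is the motivation for treating targeting as an index lookup instead.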

            We allow targeting on many attributes, some of which are:

            • device_type (iphone, ipod, android_phone, etc)
            • country, state, city
            • platform (app, mobile_web, pc_web)
            • carrier
            • make, model of device (such as Samsung, Galaxy)
            • User attributes, such as demo_age, demo_gender etc.
            • And many others ..

            Mapping Campaign Targeting to a Search Problem

            We mapped the targeting problem to a search problem, where each campaign is treated as a “Document” and each request is treated as a “Query”. Each request can be translated as a query, having a set of key-value pairs where key is attribute-name and value is its value.  Over time, new targeting attributes can be added to the system, which will just translate to a new key-value pair.

            We evaluated open source frameworks such as SOLR for executing this kind of search, but found it to be too CPU intensive for our needs. The reasons for this are:

            • SOLR is designed for full text search, and is not optimized for exact match queries
            • There is a ranking function built into the index, which we do not need. For Ad serving, we rank the campaigns after the search using revenue or profit models.
            • The query is represented as a string, which must be parsed before execution and adds to CPU consumption.

            To solve this search problem in an efficient way, we wrote an in-memory reverse index based search framework. It works as follows:

            • On server startup, build a reverse index of attribute values to docId (campaignId in our case).
            • Expose an API to pass in the query attributes in form of key-value pairs, creating a more efficient way to process the query parameters.
            • The index is maintained in-memory. If the campaign targeting is changed when the server is running, the server can incrementally load the changes and update the index online.
            • The index returns all documents that match the query. Documents are not ranked and any document that is returned is guaranteed to satisfy the search criteria.

            Search Framework Internals

            Let’s dive into the internals with an example.

            For campaign_id = 1

            Attribute Name    Attribute Value
            device_type       iphone
            country           DNTC
            platform_code     app
            carrier_name      ATT
            demo_gender       male


            For campaign_id = 2

            Attribute Name    Attribute Value
            device_type       iphone, ipod, ipad
            country           US
            platform_code     app, mobile_web
            carrier_name      DNTC
            demo_gender       DNTC


            For campaign_id = 3

            Attribute Name    Attribute Value
            device_type       DNTC
            country           DNTC
            platform_code     mobile_web
            carrier_name      ATT
            demo_gender       DNTC


            The search index will be built like the following on initialization, where for each attribute, we build a mapping of distinct attribute values pointing to a list of campaign_ids.

            device_type
            iphone -> 1,2
            ipod -> 2
            ipad -> 2
            DNTC -> 3

            country
            US -> 2
            DNTC -> 1,3

            platform_code
            app -> 1,2
            mobile_web -> 2,3
            DNTC -> NULL

            carrier_name
            ATT -> 1,3
            DNTC -> 2

            demo_gender
            male -> 1
            DNTC -> 2,3
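Building such an index from the three example campaigns could look like the sketch below. The production framework is written in Java and its API is not shown in this post, so the function name and data layout here are assumptions; the carrier value uses the spelling "ATT" from the index listing above.

```python
from collections import defaultdict

DNTC = "DNTC"
ATTRIBUTES = ["device_type", "country", "platform_code", "carrier_name", "demo_gender"]


def build_index(campaigns, attributes):
    """Map attribute -> attribute value -> set of campaign ids.

    A campaign that does not target an attribute is filed under DNTC.
    """
    index = {attribute: defaultdict(set) for attribute in attributes}
    for campaign_id, targeting in campaigns.items():
        for attribute in attributes:
            values = targeting.get(attribute) or [DNTC]
            for value in values:
                index[attribute][value].add(campaign_id)
    return index


# The three example campaigns; untargeted attributes are simply omitted.
campaigns = {
    1: {"device_type": ["iphone"], "platform_code": ["app"],
        "carrier_name": ["ATT"], "demo_gender": ["male"]},
    2: {"device_type": ["iphone", "ipod", "ipad"], "country": ["US"],
        "platform_code": ["app", "mobile_web"]},
    3: {"platform_code": ["mobile_web"], "carrier_name": ["ATT"]},
}

index = build_index(campaigns, ATTRIBUTES)
```

Every campaign appears exactly once under each attribute, either under its targeted values or under DNTC, which is what makes the later union and intersection steps correct.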

            Let’s say an Ad request maps to the following query

            device_type = iphone
            country = US
            platform_code = mobile_web
            carrier_name = ATT
            demo_gender = male

            The following algorithm is executed:

            1. Start with all active campaignIds in the system as the eligible set.
            2. For a request attribute type, take the union of the campaignIds indexed under the request’s attribute value and the campaignIds indexed under DNTC. This finds the campaignIds that either match the request directly or have not specified targeting for this attribute type.
            3. Intersect the campaignIds from step 2 with the current eligible set.
            4. Repeat steps 2 and 3 for all request attribute types.

            In this example, this is how we will find the eligible campaigns

            Start with eligible_campaign_ids = 1,2,3

            For device_type = iphone OR DNTC, matching campaignIds = 1,2,3. eligible_campaign_ids = 1,2,3
            For country = US OR DNTC, matching campaignIds = 1,2,3. eligible_campaign_ids = 1,2,3
            For platform_code = mobile_web OR DNTC, matching campaignIds = 2,3. eligible_campaign_ids = 2,3
            For carrier_name = ATT OR DNTC, matching campaignIds = 1,2,3. eligible_campaign_ids = 2,3
            For demo_gender = male OR DNTC, matching campaignIds = 1,2,3. eligible_campaign_ids = 2,3

            So, campaign_ids 2,3 are eligible to serve on this request. They will be further evaluated for budget and predicted revenue and a winner will be chosen based on a ranking function.
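The walkthrough can be reproduced with a short sketch of the union-then-intersect loop. It is an illustration rather than the framework's Java API: the function name is hypothetical, the index is copied by hand from the example listing, and the carrier attribute uses the index's name, carrier_name.

```python
DNTC = "DNTC"

# Index copied from the example: attribute -> value -> campaign ids.
INDEX = {
    "device_type": {"iphone": {1, 2}, "ipod": {2}, "ipad": {2}, DNTC: {3}},
    "country": {"US": {2}, DNTC: {1, 3}},
    "platform_code": {"app": {1, 2}, "mobile_web": {2, 3}},
    "carrier_name": {"ATT": {1, 3}, DNTC: {2}},
    "demo_gender": {"male": {1}, DNTC: {2, 3}},
}


def find_eligible(index, active_campaign_ids, request):
    """Intersect, attribute by attribute, the campaigns that match or don't care."""
    eligible = set(active_campaign_ids)
    for attribute, value in request.items():
        if attribute not in index:
            continue  # no campaign can target an unindexed attribute
        postings = index[attribute]
        # Union: campaigns targeting this value, plus campaigns with no targeting.
        matching = postings.get(value, set()) | postings.get(DNTC, set())
        # Intersection with the running eligible set.
        eligible &= matching
        if not eligible:
            break  # nothing left to filter
    return eligible


request = {
    "device_type": "iphone",
    "country": "US",
    "platform_code": "mobile_web",
    "carrier_name": "ATT",
    "demo_gender": "male",
}
eligible = find_eligible(INDEX, {1, 2, 3}, request)  # campaigns 2 and 3
```

Campaign 1 drops out at platform_code, since it targets only app, and the remaining attributes leave campaigns 2 and 3 in place.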

            Summary

            Using an efficient in-memory search index, we built a campaign targeting engine. The underlying framework is generic and can be used in other applications as well. It exposes Java APIs for indexing new documents and for searching with key-value pairs. We already use the framework for multiple applications in our systems.

            Check out our jobs page: http://drawbrid.ge/jobs


              Hello World!

              When I founded Drawbridge in 2011, it was in response to a problem that I had observed at AdMob and Google: targeting ads in mobile is really difficult.  All of the goodness that we could take advantage of on the desktop (3rd party cookies, lossless attribution, 3rd party data) was broken.  The core of this problem, we felt, was user identity: without knowing anything about the user, targeting and attribution were broken.

              This in turn created a number of silos that make conducting business in mobile difficult: for example, desktop and mobile operate in separate silos (so the same user cannot be reached across multiple devices).  In-app and mobile web are also siloed (so the same user on the same phone is treated differently when they are on an application vs. their web browser). There is even a wall between the mobile web and app marketplaces, so that proper attribution cannot be given to an advertiser who drove an app install. What could be a vibrantly diverse, functionally unified digital ecosystem has been reduced to a scattering of walled gardens, stunting user engagement, and limiting advertising efficacy.

              We created Drawbridge to break down these silos.

              Little did we know that so much technology would be involved in trying to realize this goal.

              To solve the problem of user identification, we built a large scale pairing system to identify users across devices (we even published our methodology here).  This week, we passed 350M devices paired (you can track our progress in real time here).

              To solve the problem of lossless conversion tracking, we developed a client and server side conversion tracking system called ConversionWorks, and open-sourced it to help the industry.

              We built Drawbridge for App Marketing to help developers, and Drawbridge for Cross-Screen Marketing to help brands.  These products allow for desktop-to-mobile retargeting and the use of 3rd party audience segments in mobile campaigns.

              We have made huge investments in machine learning to efficiently bid on real time exchanges, and we have even extended a post-conversion event API to our clients so that we can find more of their valuable users.

              So, what is this blog about?
              It’s about all of the above.  We are passionate about mobile advertising, cross-device advertising, and machine learning/pairing algorithms.  We love audience targeting and retargeting, conversion tracking and attribution, and helping our clients make more money in a cross-device world.  

              We will post on these and other subjects of interest.

              We hope you enjoy our blog… please subscribe for twitter alerts, and contribute ideas, questions and comments.

              Thank you for joining us on our cross-device journey!

              Kamakshi Sivaramakrishnan
              Drawbridge Founder and CEO
