Streaming at Scale: Netflix’s Real-Time Data Magic for Smarter Recommendations by Tulika Bhatt

Tulika Bhatt
Senior Software Engineer

Automatic Summary

The Magic of Streaming at Scale: Netflix's Impression History Service

Welcome, everyone! Today, we're diving into the fascinating world of streaming at scale, focusing on Netflix's Impression History Service. This crucial component powers Netflix's recommendation engine, enhancing user experience by delivering tailored content. In this article, we'll explore the core concepts, challenges, and insights gained from implementing this service.

Understanding Impressions: What Are They?

To set the stage, let's clarify what an impression is. Simply put, an impression is any image asset presented to a user. Think of it as the box art or screensaver you see when browsing Netflix. However, the definition becomes complex when we consider:

  • The percentage of box art displayed: How much needs to be shown for it to count as an impression?
  • User behavior: Do scrolling actions generate new impressions, or are they just repetitions?
  • Cached vs. fresh impressions: How do we differentiate these in our tracking?

Answering these questions helps establish a robust logic for counting impressions, which is vital as it feeds directly into our recommendation algorithms.

Why Impression History Matters

At Netflix, we utilize a blend of offline and online recommendation models to optimize user experiences. To train our offline models, we require datasets built from impressions, which serve as crucial inputs. In addition to training datasets, real-time impression history is vital for providing relevant recommendations based on user behavior.

The Scale of Impression Data

With approximately 300 million members, each profile generates thousands of impressions annually, leading to a massive quantity of data. Storing this data efficiently is both a challenge and a priority for maintaining a responsive service.

Statistical Representation of Impression History

To represent impression history effectively, we leverage the concept of Exponential Moving Average (EMA). This technique prioritizes recent events over older ones, ensuring that our recommendations stay relevant. Here’s how we calculate EMA:

  • Formula: each impression contributes a weight of α^((now − impression time) / window size) to the per-profile, per-video EMA.
  • Parameters:
    • α (Alpha): A smoothing factor between 0 and 1.
    • Window Size: Controls how quickly older impressions decay out of the EMA.

This methodology allows us to maintain control over how impression history influences our recommendations, balancing recent activity with historical data.
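
For intuition, here is a minimal sketch of the decayed-weight view of EMA described in the talk (illustrative Python, not Netflix's implementation; the alpha of 0.5 and one-day window are taken from the worked example in the transcript):

    from datetime import datetime, timedelta

    def impression_weight(now, impression_time, alpha=0.5, window=timedelta(days=1)):
        # Weight of a single impression, decaying exponentially with age measured in windows.
        age_in_windows = (now - impression_time) / window
        return alpha ** age_in_windows

    now = datetime(2024, 4, 1, 12, 0)
    print(impression_weight(now, now))                       # 1.0: a fresh impression
    print(impression_weight(now, now - timedelta(days=1)))   # 0.5: one window old
    print(impression_weight(now, now - timedelta(days=10)))  # ~0.001: nearly negligible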

Designing the Impression History Service

Our Impression History Service is designed to operate online with a response time of less than 100 milliseconds, supporting a seamless user experience. Key requirements include:

  • Real-time updates: A recently viewed impression must impact subsequent recommendations immediately.
  • Support for various decay windows: Different models may require varied look-back periods for EMA.
  • Handling over a trillion distinct rows: Our architecture must support extensive data without compromising performance.

System Design Choices

In designing the system, we evaluated two approaches:

  1. On-the-fly Computation: Compute EMA in real time from the impression data store. This offers full flexibility (any window, any formula) but is practical only for shorter windows due to performance constraints.
  2. Pre-computed EMA: Store calculated EMA values in a key-value store for quicker access. This approach handles larger windows but sacrifices flexibility, since any formula change requires recomputing and repopulating the store.

Ultimately, our solution combines both methods. For short windows, we compute EMA dynamically, while longer windows utilize pre-computed values, balancing flexibility and performance.
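
To make the hybrid routing concrete, here is a minimal sketch (my own illustration, not Netflix's code; the one-day threshold matches the split described in the transcript below, and the two helper functions are hypothetical stubs):

    from datetime import timedelta

    PRECOMPUTE_THRESHOLD = timedelta(days=1)   # assumed cut-over between the two paths

    def compute_ema_on_the_fly(profile_id, window):
        ...   # fetch raw impressions for the profile and apply the EMA formula directly

    def refresh_precomputed_ema(profile_id, window):
        ...   # read the stored EMA, decay it to request time, add impressions seen since

    def get_ema(profile_id, window):
        # Short windows are computed at request time; long windows use the precomputed store.
        if window <= PRECOMPUTE_THRESHOLD:
            return compute_ema_on_the_fly(profile_id, window)
        return refresh_precomputed_ema(profile_id, window)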

Insights and Learnings

Throughout the project, several valuable lessons emerged:

  • The complexity of serializing EMA values into our storage systems presented challenges that required alternative solutions.
  • Shifting from a library model to a microservice architecture simplified operational management, as it centralized system control.
  • Precomputing larger windows reduces flexibility; formula changes necessitate careful consideration before implementation.

These experiences not only enhanced our understanding of recommendation systems but also highlighted areas for potential innovation.

Conclusion

In summary, Netflix's Impression History Service exemplifies how crucial data management and real-time processing are in delivering a personalized streaming experience. By developing sophisticated methods of tracking and analyzing impressions, we continue to enhance the relevance of the recommendations every member sees.


Video Transcription

Welcome, everyone. Today I'll be talking about streaming at scale: Netflix's real-time magic for smarter recommendations. I'll be talking about a service called the Impression History Service, which is one of the major services that powers our recommendation engine. I'll dive into it and explain the core concepts behind the service, what the prerequisites are, and how we designed it to handle such huge loads. Let's look at the table of contents. We'll start with core concepts, some important principles to know before you design such a service. Then I'll look at the requirements and scale, and certain design considerations, and finally I'll focus on my learnings from this experience and what I plan to innovate on in the future.

Before we begin, the question of the hour is: what is an impression? To say it simply, an impression is any image asset that is presented to a user. For example, when you log on to the Netflix app or home page, you see a lot of small images; those images are impressions. Now that we know what impressions are, it is essential to refine this definition a bit further. For example, how much of the box art needs to be shown for it to be considered an impression? As you can see in this image, certain images are cut off and you cannot see the whole thing. Do we consider them impressions, or do we ignore them? Is there a certain percentage threshold?

Furthermore, whenever a person is using Netflix they will move up and down and scroll to and fro. So it becomes important to decide whether all of the images you see while making these movements count as new, legitimate impressions, or whether they are counted as repetitions of the older impressions. Then there is the fact that some impressions are cached on your user device, so you see the same image, whereas other impressions come from a fresh server call: you make a request to the Netflix servers and get new images and metadata. Do we need to differentiate between these two kinds of impressions? What's our strategy here?

The answers to all of these questions are really important because they help us define what our deduplication logic will be and how we set filters depending on the use case. We really want to get this right, because this is one of the fundamental datasets powering our recommendations and how our pages are created and arranged, so it is really important to get the definition of an impression correct. On this page you can see several different examples of impressions. It can be the big image you see when you log on to Netflix, or it could even be the screensaver displayed on your device when the screen is idle. And of course the multiple small images you see, which we call box art, are also considered impressions. Once you have answers to all of the above questions, your final impression object looks roughly like this.

It will be associated with a profile, so it has a profile ID. Then there is a video ID, or entity ID, which represents the ID of the entity being shown to you. Then there is the impression time, the time at which the impression happened. And then there are several metadata fields, such as the app or device version, or the location where the impression happened: which row, which page, and so on. A rough sketch of this object follows below.
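
Purely to illustrate that shape (the field names and example values here are hypothetical, not Netflix's actual schema), the impression object might look like this:

    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class Impression:
        profile_id: int
        entity_id: int                  # the video / entity being shown
        impression_time: datetime
        metadata: dict = field(default_factory=dict)   # device, app version, row, page, ...

    example = Impression(
        profile_id=1001,
        entity_id=11,
        impression_time=datetime(2024, 4, 1, 9, 30),
        metadata={"device": "tv", "app_version": "8.99", "row": 2, "page": "home"},
    )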

Moving on, why do we need impression history? At Netflix, we use a blend of offline and online recommendation models to give you the best possible home page so that users have a really good experience. In order to train the offline models, you need to create training datasets. For any supervised machine learning model, the training dataset consists of two things: features, which are the input, and labels, which are the output. Impressions serve as one of the inputs to our models, and the plays that result from those impressions serve as the output. So we create a training dataset that joins impressions with plays and use it to train our offline models. Secondly, for certain rows and a certain percentage of users, we have online recommendations: some rows are generated at compute time, while the page is being constructed.

For that, we definitely need raw impression history to feed into the model so it can run inference and provide the candidate set of videos shown on the page. Impressions are also used for a lot of different business use cases on the page, such as merchandising and live promotions. All of these are really important use cases, and they need a lot of impression history. To give you some context about the numbers: any profile will see thousands of impressions in a year, and if you multiply that by roughly 300 million members, the approximate number of current Netflix members, that becomes an enormous amount of impression history for a single year.

It is practically very expensive to store. Therefore, we need a way to condense this impression history using statistics, so that we can store the entire history of impressions in a clever way. The next question is: how do we represent impression history statistically? For that, we leverage something called EMA, which stands for exponential moving average. EMA is not a new concept; it is already used in the stock market world for time series where you weigh recent events more heavily than older events. We want similar characteristics in our impression time series, so we leverage EMA. EMA is calculated per profile ID, per video ID.

If you look at the formula on the right, that is how it is calculated: the contribution of each impression at request time is alpha raised to the power of (now - impression time) / window size. The impression time is, as the name suggests, the time when the impression happened, and "now" is the request time. Alpha is the smoothing factor, one of the variables used to describe the curve, and it usually ranges between zero and one. Similarly, the window size is the size of the window over which you are reducing this EMA time series. For example, the window size could be one year, where you are looking at the whole time series holistically over one year.

It could also be two hours, something like that. If you look at the formula, with a smaller alpha and a larger decay window the older data points continue to influence the EMA and decay slowly. Whereas if you want a very aggressive EMA, you would use a larger alpha and a smaller decay window, so the EMA leans towards more recent data points compared to older ones. Moving on, I've done a few calculations to show you the effect of alpha. Suppose we are calculating the EMA now for an impression that happened just right now; the exponent is (now - now) divided by one day, our window size, so the weight is alpha to the power zero, which is one.

Whereas if you take an impression that happened yesterday and put it into the formula, it becomes alpha to the power one, which is 0.5 for an alpha of 0.5. This shows that, for a window size of one day, an impression from the previous day holds roughly half the weight of an impression from today. Similarly, an impression from a hundred days ago gives you alpha to the power 100, which is a tiny number. One thing you can notice is that, technically, when we calculate EMAs we should look back over the entire history. But if you are computing it for a window size of one day and you look at an impression that happened a hundred days ago, its value becomes so minuscule that it is insignificant. So, to make things easier for us, we don't include such old values.

What we do instead is use something called a look-back window, which we set to roughly ten times the window size. For example, if your window is one day, the look-back window will be ten days, meaning we only consider impressions that happened in the last ten days. An impression from, say, last Sunday would carry a weight of about alpha to the power 10, which is a very small value, so this assumption lets us limit how far back we have to look. A small sketch of this on-the-fly computation with a look-back cutoff follows below.
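
To make this concrete, here is a small illustrative sketch (my own, not production code) that computes a per-video EMA as the sum of decayed impression weights, skipping anything older than ten window sizes:

    from datetime import datetime, timedelta
    from collections import defaultdict

    ALPHA = 0.5
    LOOKBACK_MULTIPLE = 10   # look back ~10x the window size, as described above

    def ema_on_the_fly(impressions, now, window=timedelta(days=1)):
        # impressions: list of (video_id, impression_time) pairs for one profile
        cutoff = now - LOOKBACK_MULTIPLE * window
        ema = defaultdict(float)
        for video_id, ts in impressions:
            if ts < cutoff:
                continue  # weight would be <= ALPHA**10, negligible
            ema[video_id] += ALPHA ** ((now - ts) / window)
        return dict(ema)

    now = datetime(2024, 4, 1, 12, 0)
    history = [(11, now), (11, now - timedelta(days=1)), (12, now - timedelta(days=30))]
    print(ema_on_the_fly(history, now))   # {11: 1.5}; the 30-day-old impression is dropped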

Okay, now that we have some idea of how we can condense impression history, let's deep dive into our service. Our service needs to be an online service with a response time of less than a hundred milliseconds. We use it for online inference while somebody is watching Netflix, so we don't want it to be delayed and hamper the experience. It has to be super quick, and it has to operate at the profile level. Further, it has to be a near-real-time service with the most recent updates: if you saw an impression a minute ago and you are computing the EMA right now, that impression from a minute ago should be reflected in the calculation.

So it has to be near real time. We also use this EMA in several different models for several different use cases, so we want EMA values for different decay windows. Some models need to look back over a longer period of time, so they take an EMA with a window size of one year; for other models we don't want such a long history, maybe just two months or even a week. Our service should be capable of providing all of that. To talk a little bit about scale, we have more than a trillion distinct rows representing impressions.

That data is about two petabytes in size, and almost 23% of profiles see updates daily, meaning new impressions arrive for them; on top of that, there are the write RPS and read RPS our service has to sustain. Now let's quickly look at the system design. We know our service needs to compute EMA per profile, which definitely hints towards a key-value data store. We have two options here. The first and simplest option is: when we want the EMA, we go to the impression table that contains all the impressions, fetch the impressions for that profile, compute things on the fly, and return the result to the user.

The second option is that we don't compute it on the fly; rather, we precompute these EMAs and store them in a key-value data store. Then, when a user requests EMAs, we just fetch them from the key-value store and return them. Let's look at the pros and cons of the two approaches. Choice one, where you do no precompute and calculate on the fly, gives you flexibility: you can compute EMA for any window and any formula you want. But realistically you can only use it for shorter windows.

Because if I say I want an EMA with a one-year window, the look-back becomes ten years, and you are not going to fetch ten years of impression history over the wire at request time and do all the computation; it would take far too long and you would likely hit out-of-memory errors. The other option is to use precomputed EMA. The positive side of this choice is that you can support larger windows: you are already computing the EMAs and storing them somewhere, so you can compute them for any large window size you want. The con is that you lose flexibility. If you want to change the value of alpha from 0.5 to 0.6, it is not simple.

You have to go and recompute everything and populate it into the KV store again, so it is not an easy process. In our use case we chose the best of both worlds and provide both options to our clients. For a window size of less than one day we compute the EMA on the fly, whereas for window sizes greater than one day we offer precomputed EMA. Even when we offer precomputed EMA, because we need near-real-time EMA values we do some math to refresh the EMAs so that we are always providing the latest value to our clients. We do the precomputation and the population of EMAs into the KV store using two Spark jobs. Now let's quickly look at how we calculate these EMA values for larger windows.

First of all, suppose we have profile 1001 and we are computing EMA for all the impressions that happened on April 1. We see there are three impressions: two for video ID 11 and one for video ID 12. We substitute them into the formula, and the three impressions reduce to two EMA values, one for video ID 11 and another for video ID 12.

Next, we fetch yesterday's EMA values and decay them to today's end-of-day value. Yesterday we already had two EMAs, for video IDs 11 and 13, so we take those and decay them to the end of today to bring yesterday's values up to date. The third step is to do a full outer join on profile ID and video ID and add the EMA values. Today we computed EMAs for 11 and 12; yesterday we had them for 11 and 13; so in the final table we have three EMAs: one for 11, one for 12, and one for 13.

If you look at the date column, it shows when these impressions last happened: the latest impressions, for video IDs 11 and 12, happened on April 1, while the one for video ID 13 happened on March 31. The update time tells you the timestamp up to which we have refreshed these EMA values, and the EMA column itself is just a map from window size to EMA value. Once we have computed these EMA values, we serialize them as protobuf, and a Spark job takes the serialized EMAs and sends them to our KV store. That is what happens on the write side; a compact sketch of the rollup follows below.
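
Here is a compact sketch of that three-step daily rollup in plain Python (the real pipeline uses Spark jobs; the single one-day window, variable names, and example values are assumptions, and in reality the stored value is a map of window sizes to EMAs):

    from datetime import datetime, timedelta

    ALPHA, WINDOW = 0.5, timedelta(days=1)

    def weight(now, ts):
        return ALPHA ** ((now - ts) / WINDOW)

    end_of_today = datetime(2024, 4, 1, 23, 59, 59)

    # Step 1: reduce today's raw impressions to per-video EMA contributions.
    todays_impressions = {11: [datetime(2024, 4, 1, 9, 0), datetime(2024, 4, 1, 18, 0)],
                          12: [datetime(2024, 4, 1, 11, 0)]}
    todays_ema = {vid: sum(weight(end_of_today, ts) for ts in times)
                  for vid, times in todays_impressions.items()}

    # Step 2: decay yesterday's stored EMAs forward to today's end-of-day timestamp.
    yesterdays_ema = {11: 0.8, 13: 0.3}      # as of 2024-03-31 23:59:59
    decay = ALPHA ** 1                       # exactly one window has elapsed
    decayed = {vid: v * decay for vid, v in yesterdays_ema.items()}

    # Step 3: full outer join on video ID and add the contributions.
    merged = {vid: todays_ema.get(vid, 0.0) + decayed.get(vid, 0.0)
              for vid in set(todays_ema) | set(decayed)}
    print(merged)   # EMAs for video IDs 11, 12, and 13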

On the read side, things are comparatively easier. For a short window, we compute things on the fly: when a request comes in, we go to the impression data store, get the latest impressions, substitute them into the formula, and send the results back. For a longer window, it is a two-step process. First, we go to our KV store, fetch the precomputed EMAs, and decay them to request time. Then, because we need to provide near-real-time EMAs, we go to the impression data store, find all the impressions that happened between the time the EMA was last precomputed and the request time, compute their contribution, add it to the decayed value, and return the latest near-real-time EMA to our clients.
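
A minimal sketch of that long-window refresh step (illustrative only; the function name, parameters, and one-year window are assumptions):

    from datetime import datetime, timedelta

    ALPHA = 0.5

    def refreshed_ema(precomputed_ema, last_precompute_time, recent_impressions,
                      request_time, window=timedelta(days=365)):
        # Decay the stored EMA to request time, then fold in impressions seen since.
        elapsed = (request_time - last_precompute_time) / window
        ema = precomputed_ema * (ALPHA ** elapsed)
        for ts in recent_impressions:          # impressions after the last precompute
            ema += ALPHA ** ((request_time - ts) / window)
        return ema

    now = datetime(2024, 4, 2, 10, 0)
    print(refreshed_ema(precomputed_ema=2.4,
                        last_precompute_time=datetime(2024, 4, 1, 23, 59, 59),
                        recent_impressions=[now - timedelta(hours=1)],
                        request_time=now))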

If you look at this, it is the actual system design, though obviously a simplified version of how things operate today. I'll go from right to left and explain what happens. You have a Netflix member browsing Netflix. As they browse, their impressions are logged and redirected to an impression stream. From the impression stream there are jobs that populate this data into an offline Iceberg table, and the stream is also used to build our time series data store. So we get two data stores out of this Kafka impression stream. Finally, there is a precompute-EMA Spark job that reads from the offline impression table, does the computation, and populates our KV store with EMA values.

Then there is the IHS service, which sits on top of these two data stores: the KV store holding the EMAs and the time series data store holding the raw impression history. And then you have the client. Whenever a client requests a shorter-window EMA, the service goes to the time series data store and computes the EMA on the fly. Whereas if the client requests a longer-window EMA, the service goes to the KV data store, fetches the precomputed EMA, refreshes it using the remaining impressions in the time series data store, and returns the response to the client.

That's all I have for the system design today. Now I'll quickly share some learnings. One of my learnings from this project concerns the step I mentioned where we serialize EMA values into protobuf and populate the KV store. That process was not easy. We had a Spark job that would batch the EMA values, call our KV API, and write them into the KV data store. What would happen is that the throughput would be so high that it would overwhelm our KV cluster, so we had to keep readjusting and tuning the Spark job to avoid overwhelming the cluster, and that was very hard to do. Ultimately we dropped that route and, thanks to the platform team at Netflix, built an alternative solution where we take these EMAs, write SST files, and load those files directly into RocksDB to serve our EMAs.

We are essentially taking the backdoor route here, and that was a good learning from this entire exercise. Another question was whether we should build a library or a microservice. A library means faster results; a microservice means one more hop. Earlier we had gone with the library model, but it was very hard to maintain operationally: any change meant relying on consumers to pick up property changes. So we scrapped that and went the microservice route. Another thing is that precomputing longer windows, which is what we do now, does make us lose flexibility. Whenever we have to make a formula change it is a big deal, and we normally don't do it unless there is a very solid reason for it.

Obviously, the entire system depends on whether our definition of an impression is solid. If the definition of an impression changes, that means updating the entire system built on it. Finally, right now we give the same weight to impressions happening anywhere on the home page, whether it's a screensaver impression, the big billboard impression, or impressions on other category pages. We currently use the same formula everywhere, but maybe we should adjust the formula according to where the impression happens, since no two impressions really have the same weight. This is something we are thinking about and experimenting with. That's all from my side today. Thank you all for listening to my presentation.

If you have any questions, comments, or feedback, please reach out to me now, or you can also connect with me on LinkedIn. Thank you so much, everyone.