Database Replication 101: A Beginner's Guide to Building High-Availability Systems

Ria Bhatia
Software Engineer
Tanya Sneh
Software Engineer
Automatic Summary

An Insightful Discussion on Database Replication: Breaking It Down

Recently, we had the opportunity to delve deep into a topic that excites countless tech enthusiasts: database replication. Thanks to the Women Tech Network, we received a platform to exchange our insights with a fantastic audience. In this article, we hope to distill the key points from our conversation for those who couldn't attend, while also refreshing the memory of those who did.

Introduction: The Presenters

As software engineers in the fintech industry, we, Ria Bhatia and Tanya Sneh, have had an enriching experience working with multiple database systems. We have been equally focused on data replication techniques, making sure we understand their intricacies. Over our careers, we have tinkered with different database systems, spanning cloud and on-prem options such as MySQL and NoSQL. We are enthusiastic learners and prefer to dig into the whys and hows, rather than just the whats.

Data Replication: Digging Deeper

At the heart of our discussion was the concept of database replication, a technological approach that deeply fascinates us. We explained what data replication is, how it operates, and why it's crucial for businesses.

Defining Data Replication

Basically, data replication refers to the process of duplicating data from one database (the primary) to others (the replicas). Think of it as having a backup dancer who steps in when the main performer cannot complete the performance. It serves as a safety mechanism when something goes awry with the primary database system.

However, duplicating data isn't just about keeping a backup. We discussed how replication helps keep data geographically closer to users, which reduces latency. Furthermore, having replicas allows the system to continue functioning even when one part fails. Finally, for read-intensive applications, replication makes it possible to scale out the number of machines without straining a single database.

Tackling the Techniques of Data Replication

We went through the various types and techniques of replication at length. Here are the main three:

  1. Single-Leader Replication: This technique involves a leader distributing writes to its followers. It can be done synchronously or asynchronously, and in some cases uses a combination of both to balance speed and consistency (a small sketch of this trade-off follows the list).
  2. Multi-Leader Replication: In this scenario, there are multiple leaders, each replicating writes to its own followers and to the other leaders. This distributes the write workload among several databases but can lead to conflict scenarios.
  3. Leaderless Replication: With this approach, every database can handle both read and write operations; clients send writes to all databases and wait for acknowledgment from a select few (a quorum).
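
To make the single-leader flow concrete, here is a minimal Python sketch of the synchronous versus asynchronous trade-off. The Leader and Follower classes and their method names are illustrative only, not taken from any particular database.

    import threading

    class Follower:
        def __init__(self, name):
            self.name = name
            self.data = {}

        def apply(self, key, value):
            # The follower applies the same change to its own copy.
            self.data[key] = value

    class Leader:
        def __init__(self, followers, synchronous=False):
            self.data = {}
            self.followers = followers
            self.synchronous = synchronous

        def write(self, key, value):
            # The leader always applies the change to its own copy first.
            self.data[key] = value
            if self.synchronous:
                # Synchronous: wait until every follower has applied the change
                # before acknowledging the client (slower, but no stale reads).
                for f in self.followers:
                    f.apply(key, value)
            else:
                # Asynchronous: acknowledge immediately and replicate in the
                # background (fast writes, but followers may briefly lag).
                threading.Thread(target=self._replicate, args=(key, value)).start()
            return "ack"

        def _replicate(self, key, value):
            for f in self.followers:
                f.apply(key, value)

    followers = [Follower("f1"), Follower("f2")]
    leader = Leader(followers, synchronous=False)
    leader.write("x", 5)      # the client gets "ack" right away
    print(followers[0].data)  # may still be empty until replication catches up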

Live Demonstration of Database Replication

To drive the point home, we showcased a live demonstration of database replication. Using two MySQL servers (one acting as a leader and the other as a follower), we showed how changes written to the leader were reflected on the follower.

Common Issues with Database Replication

Despite the clear advantages, we also discussed some of the common issues with database replication: namely replication lag, which can lead to stale data being read, and the conflicts that can arise in multi-leader setups.

Navigating the Challenges

With every system, there are challenges. We examined potential solutions such as read-your-writes consistency and monotonic read consistency to address replication lag. For the conflicts common in multi-leader scenarios, we broached possible ways to tackle them, like last-write-wins and manual conflict resolution.
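
As a rough illustration of read-your-writes consistency, the hypothetical router below sends a user's reads to the leader for a short window after that user has written, and to a replica otherwise. The function names and the 10-second window are our own assumptions, not part of any specific product.

    import random
    import time

    RECENT_WRITE_WINDOW = 10  # seconds; an arbitrary choice for this sketch

    last_write_at = {}  # user_id -> timestamp of that user's last write

    def record_write(user_id):
        # Called whenever a user writes through the leader.
        last_write_at[user_id] = time.time()

    def choose_node_for_read(user_id, leader, replicas):
        # If the user wrote recently, their reads go to the leader so they
        # always see their own changes; everyone else reads from a replica.
        wrote_recently = time.time() - last_write_at.get(user_id, 0) < RECENT_WRITE_WINDOW
        return leader if wrote_recently else random.choice(replicas)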

Best Practices for Data Replication

As we wrapped up our discussion, we highlighted some best practices when implementing a replication strategy: setting up monitoring and alerting, having a solid backup and recovery plan, and testing and validating your replication strategy before rolling it out.

We want to thank everyone who joined our discussion. We hope you feel more enlightened about data replication and feel motivated to explore the topic further. Feel free to reach out to us on LinkedIn and Twitter if you have any questions.

Here's to successful data replication!


Video Transcription

Hello. Hi, everyone. Thank you for coming and attending this talk. Firstly, we would like to thank the Women Tech Network team for giving us this wonderful opportunity. We are very excited to talk about data replication, share our learnings with you all, and hope this session is useful. Before we get to the main presentation, let us quickly introduce ourselves, starting with me. Hi, everybody. I'm Ria Bhatia and I'm currently working as a software engineer in the fintech industry. Over the course of my career, I have worked with multiple database systems, on cloud and on prem, MySQL and NoSQL, and have had a wonderful time learning new technologies and trying them out with different database systems. One thing which I really like, and which makes learning even more exciting for me, is understanding the whys and hows behind all these things. Apart from this, I'm also very passionate about organizing mentorship and outreach initiatives for women in STEM and giving back to the community in whatever way possible. When I'm not coding or working,

I love to play musical instruments like the ukulele and the harmonica, and I also like reading books. Over to Tanya.

Hi, everyone. I'm Tanya Sneh. I also work as a software engineer in a financial services firm, and I have worked on increasing the performance of database systems both on cloud and on prem. Apart from this, I keep learning about diverse fields in technology and have previously dabbled in robotics and game theory. I am passionate about increasing diversity in tech, and when I'm not working, you can find me either painting, reading, or planning my lecture.

Oh, we do. Yeah. OK. Um, while we sort that out, I will quickly go over what we're going to cover in this session.

We will dive deep into database replication techniques and also the best practices to keep in mind when doing so, and I hope it all comes together by the end of this presentation. So without further delay, let's get started. First things first: what is database replication? As I mentioned in my introduction,

I like to understand the why and the what, and that makes learning even more fun. So let's start with the why. Firstly, let's talk about why this data matters. It matters because it is crucial for unlocking insights into customer behavior: your likes and dislikes, your purchasing habits, what topping you like on your pizza. The problem, however, is that this data is constantly changing and coming in from different sources, and what often gets overlooked is its availability and consistency, which is what we need to focus on. You are constantly generating data, your team is sitting and making decisions based on that data, and if you cannot keep up with data coming from different sources, you are already left behind most of your competitors. So, to avoid this, we use data replication, which means keeping a copy of your data on different machines. It's like having a backup dancer in case your main dancer fails to perform: somebody who is always there if something goes wrong, and who stays in sync with the main database, that is, the primary database. The copy is called the replica.

OK. Uh, while Ria joins back, I shall continue. Um... yeah. Could you please uh move your presentation ahead?

Yes, I apologize for the network issues. Continuing: is data replication just a backup? No, data replication is not just a backup. It is also important for keeping data geographically closer to your users. What that means is, taking the Netflix example, suppose you are trying to connect from the Americas to a server in the UK. You would obviously experience some buffering, because the request is taking that much time to travel. No matter how efficient your query or your database is, the time the request takes to leave your system, go fetch the data, and come back cannot be reduced.

You can only reduce that if the server is located closer to your location, and that is how we reduce latency: by having replicas in multiple regions, one can connect to the closest one and easily have a low-buffering, better experience with respect to video streaming. This was just the Netflix example; many businesses keep data geographically closer to their users like this. Next, replication allows your system to continue working even if some parts have failed. Now, you have an application, you're connected to a server, and suppose something bad happens to that server: the server catches fire. What happens in this case? Of course, you might have a backup for it, but even that backup would take some time to get restored, and it won't have the latest data. Why? Because you are keeping that data separately and taking that backup from time to time, so the latest changes are lost because they were just written to that server. Having a replica instead solves the issue. If you have a replica with the same set of changes, which gets replicated as and when the leader, the primary database, is updated, the requests can directly be handled by that server while you sort out the fire situation.

So your system continues to work even if some parts have failed. Thirdly, replication allows us to scale the number of machines, especially for read-intensive applications. Right now you have one node and a certain set of people connected to it. As the number of users on your application grows, you need multiple nodes to serve those read requests, and putting all of that pressure on just one database could be disastrous for your system. Having multiple read replicas allows you to spread those requests and handle them without putting the load on just one node, because all of them have the same data and are already in sync. With this, we have talked about the why and what of data replication. For the techniques, over to Tanya.

Yeah. So, having understood how replication can help us, let us see what types and techniques of replication are available to choose from. The first technique is called single-leader replication. This is also called master-slave replication, where the leader can be thought of as a king delegating instructions to all of the followers. In this architecture, while all databases can cater to reads from the clients, only the leader can serve a write request. When it receives an instruction from the application, it makes changes to its own copy of the database and then forwards the same instruction to its followers, who apply the changes to their own respective copies. This raises an important question: when does the replication actually happen? The leader can go about replicating in two ways. First, let us look at asynchronous replication. We can understand this by looking at an example of what happens when we receive a transaction. In this example, the leader receives a transaction to write X = 5. It responds back to the client after making the change to itself, and then, at a later point in time, sends the instruction to all of its followers. This makes the system eventually consistent. What this means is that eventually all of the databases will have the same copy of the data, but the client cannot say so with certainty immediately after receiving back an acknowledgement. So what does this architecture mean for the client?

Well, one can expect the write to be very fast, because we just wait for it to happen on the leader. But because there could be a lag in the data reflecting correctly in the followers, the client can end up making stale reads. Also, if the leader goes down before having replicated the transactions, it can lead to a potential data loss situation, so proper failover mechanisms need to be in place to handle that. In order to avoid some of the pitfalls of asynchronous replication, the leader can also choose to replicate synchronously. In this case, when the leader receives the transaction, it first makes sure that all of the followers have performed the same write before sending back an acknowledgment to the client. Right off the bat, we can see that this makes the write very slow, because now all of the databases need to perform the operation. Also, if the leader keeps on waiting for a follower database that has gone down to actually write the value, we can potentially end up stalling the system altogether and the system becomes unavailable. Again, this can be handled with proper care.

So, in reality, there's no one-size-fits-all approach to choosing a synchronization approach for a single-leader database. What happens is that we usually end up using a combination of these two approaches: some databases are kept in sync with the leader, whereas the other databases are still replicating asynchronously. This helps us get the best of both worlds. But does that solve all of our problems? Well, not really. If your system is write-heavy, then you are sending all of the writes to one particular leader and increasing the workload on it tremendously. Also, if your system is geographically distributed and you just have one leader, then clients all over the world will be trying to contact this leader in order to perform a write, and for a client that is located geographically far away from that leader, the network-induced latency will be huge and the write becomes slow. The multi-leader approach can solve these issues. In this architecture, there can be multiple leaders that are potentially located geographically far apart, and the client can connect to the leader that is closest to it

in order to perform the write operation. This write will then be replicated to all of the followers of this leader, and it will also be replicated to the other leaders across the world. This helps us maintain a consistent state of the data across the globe. So with multi-leader replication, we distribute the write workload amongst various databases. But the situation is not all roses: with multiple databases accepting writes, there could be a situation of conflict generation. We will look at it later and figure out what it means and how it can be handled. OK. Yet another approach to replication is leaderless replication. In leaderless replication, there is no follower or leader; all of the databases are capable of performing both read and write operations. So what happens is that when a client has to perform a write, it sends the instruction to all of the databases and waits for acknowledgment from some of them. For instance, in this example, the client sent the write operation to all five nodes and is waiting for an acknowledgment back from three of the databases. A similar thing needs to be done when we perform a read operation as well.

This is also called quorum establishment, and it is an advanced concept that is out of the scope of this presentation. So people who are interested in leaderless replication should definitely check out quorums. With this, I hand over to Ria to demonstrate database replication.
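
For readers curious what a quorum looks like in practice, here is a toy Python sketch of leaderless reads and writes; all class and function names are our own. With n = 5 nodes, a write waits for w = 3 acknowledgments and a read consults r = 3 nodes, so because w + r > n the read set always overlaps the latest successful write.

    class Node:
        def __init__(self):
            self.store = {}  # key -> (version, value)

        def write(self, key, version, value):
            # Only accept the write if it is newer than what we already have.
            if key not in self.store or self.store[key][0] < version:
                self.store[key] = (version, value)
            return True

        def read(self, key):
            return self.store.get(key, (0, None))

    def quorum_write(nodes, key, version, value, w=3):
        # Send the write to every node and succeed once w have acknowledged.
        acks = sum(node.write(key, version, value) for node in nodes)
        return acks >= w

    def quorum_read(nodes, key, r=3):
        # Ask r nodes and keep the value carrying the highest version number.
        replies = [node.read(key) for node in nodes[:r]]
        return max(replies, key=lambda vv: vv[0])[1]

    nodes = [Node() for _ in range(5)]
    quorum_write(nodes, "x", version=1, value=5)
    print(quorum_read(nodes, "x"))  # -> 5, since the read quorum overlaps the write quorum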

Hi, thank you so much, Tanya. So now let's get to the fun part and see database replication live in action. We will see this using a demo. What we have done over here is that we have started two MySQL servers on different ports. So you see, this one is the master and this one is the slave. If you see that we have mentioned that...

Sorry to uh break you. You've only shared the Chrome tab. Can you share that one too? Yes.

Sure. OK. Uh, is it visible now? Yeah. Thank you, Tanya. OK. So this is our leader, or the master, and this is the slave, or the follower. What we have done is that we have started two MySQL servers on two different ports. MySQL does its replication in an asynchronous way, leader to slave, so that is what we will be demonstrating today. Now, let me show you what we have started. Is the next screen visible? Great. So you can see that I've started the master on this port, that is 3307, and the slave on 3308. Let's see the current state of our databases. This is a table which I've already created in my database, and if I want to see what is in this table, let me fire a select statement, and I see, OK, it has ID 1 and first name Tanya. Similarly, let's check this for our slave. OK, right now everything is in sync. Now let me create another record for my table, and let's see if things get replicated. What I'm basically doing is that I'm writing to the leader and I will be reading from the follower. So let's insert into test_replication,

that's the name of my table, values, and this time I would be sending it my name. OK? So let's check the state of our leader once, if everything got inserted. Great. Now let's check the state of our replica. See, we have our record over here also. So if we make some changes to the leader, those are getting replicated to our slave, that is our follower in this case. This is how replication works, and you can take advantage of it for setting up backups or keeping your data geographically close, all the benefits which we have discussed till now. One thing to note here is that till now we have discussed all the wonderful things about data replication. We have seen it in action as well, how database replication happens for MySQL; as I mentioned, it was asynchronous. But is everything perfect? Of course not. When you are designing large-scale applications, a lot can go wrong. There are many issues when dealing with database replication which need to be addressed, and at the same time there are trade-offs where you need to choose what fits your system in that place and at that time. So with this, let's talk about the two most common issues which one can face with database replication. One is replication lag.
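
For anyone who wants to try a setup like the one in the demo, here is a rough sketch using the mysql-connector-python package. The ports (3307 for the leader, 3308 for the follower) follow the demo; the table name, column names, and credentials are assumptions, and the two servers must already be configured as source and replica.

    import time
    import mysql.connector  # pip install mysql-connector-python

    # Connection details are assumptions; adjust them to your own setup.
    leader = mysql.connector.connect(host="127.0.0.1", port=3307,
                                     user="root", password="secret", database="demo")
    follower = mysql.connector.connect(host="127.0.0.1", port=3308,
                                       user="root", password="secret", database="demo")

    # Write to the leader only.
    cur = leader.cursor()
    cur.execute("INSERT INTO test_replication (id, first_name) VALUES (%s, %s)", (2, "Ria"))
    leader.commit()

    # Give asynchronous replication a moment to catch up, then read from the follower.
    time.sleep(1)
    fcur = follower.cursor()
    fcur.execute("SELECT id, first_name FROM test_replication")
    print(fcur.fetchall())  # the new row should appear once replication has applied it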

So, till now we have seen that there is this leader and there is the follower. We do a write to the leader and we are reading from the follower; that is a leader-follower, or master-slave, replication model. Now, this helps us keep any number of followers and scale the number of readers, so that in case your application is read-heavy, the readers can handle all those massive read requests for you. Having a synchronous architecture here won't suit us, because that kills the sense of replication; we want to be able to read from the followers without depending on the leader first replicating the data and then sending it back to the client. Asynchronous makes sense here: we accept the writes, we replicate them to the followers, and they come into sync as and when that happens.

The problem with this is that there could be a lag between the leader and the follower. By lag, I mean there could be a delay on the leader's side in replicating the changes and getting all the data into the followers. So let's understand this with a quick example. Suppose you're all set and ready to upload your new video to your favorite social media app, and you send your video to the leader. The leader says, OK, I have accepted your video and your video is live now. In the background, async replication is happening and it's updating its followers. But within a couple of seconds you check your video again, and this time your read request gets connected to the follower. Till now, the follower hasn't really caught up with the leader and it's still running behind.

It still doesn't have the latest changes, so the follower tells you that no, your video is not there. This is a potential replication lag scenario where, despite having received a confirmation, you don't see things which you have already committed, because you are connected to a different instance which is not really in sync.

In theory, there's no time limit to how long this delay could be, because it could be due to any issue: network issues in between, or one node working at full capacity and taking some time to replicate the changes. So from a few seconds to many minutes, there's no limit to how much replication lag you could experience. Another thing to consider here is what you are ready to trade off. Maybe as a client I want to see my latest updates immediately, but if my friend sees that same information 30 seconds later, it won't matter, right? Again, depending upon your business model, there is no right or wrong here; we are discussing a social media scenario over here. So for that, one solution could be read-your-writes consistency. What happens here is that if a user creates a post and then tries to read that post within a short period of time, all of that user's reads are directed to the leader only. Now again, this kills the essence of replication if the leader always handles all the reads: if you're making multiple changes and then all the reads are handled by the leader, then what is the point of having read replicas?

What we can do instead is that, for a short time span, like 10 seconds maybe, after you create or upload a post, your reads are directed to the leader, while other reads can be taken care of by the followers because they are not linked to this post. By the time async replication catches up and that window has passed, the follower would already be up to date with the leader for other users. Again, the writes would of course be handled by the leader, because that's our architecture over here, and other users would continue to read from the follower if they haven't created any post; your post they might see after some delay. If that suits your architecture, well and good, but there is no right or wrong over here. Another problem with replication lag is this: suppose on your leader the value of X is changing continuously, from X = 1 to X = 2 to X = 3. That data is on your leader, and in the background it is getting replicated to your followers. Now a user tries to read the latest value of X.

For him, the value of X is 2, but a couple of seconds later he sends another read request, and this time the value of X is 1. Imagine that you're reading a blog and the comments appear, then disappear, then appear, then disappear; something weird is happening with every refresh. This is not a very good experience for any user: you are reading data from the past. The solution for this is to make the reads stick to just one replica. Again, for a certain period of time, till the replication catches up, the reads would always be directed to that particular follower, so that users do not experience this travelling back in time where they are reading stale values and are not sure what the actual value is. For user two, if they have read from this follower, they stay sticky to this follower only, and their read requests would again be directed to this particular follower despite refreshing. This is called monotonic read consistency. Let's also discuss another common issue with database replication, that is, dealing with conflicts. Now, as Tanya mentioned while discussing the multi-leader architecture, this is a common issue which one can face. Let's understand what that exactly is.
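
One common way to get monotonic reads is to pin each user to a single replica, for example by hashing the user ID. A minimal sketch, with hypothetical names, might look like this:

    import hashlib

    def replica_for_user(user_id, replicas):
        # Hash the user ID so the same user always lands on the same replica,
        # which prevents "travelling back in time" between refreshes.
        digest = hashlib.sha256(str(user_id).encode()).hexdigest()
        return replicas[int(digest, 16) % len(replicas)]

    replicas = ["replica-1", "replica-2", "replica-3"]
    print(replica_for_user("user-42", replicas))  # same replica on every refresh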

So you have two leaders in different geographical locations, and both these leaders are capable of taking in writes. At one location, your colleague changes something in the database for customer A; suppose that customer's address is updated by this person in region one. Similarly, your colleague in region two edits a record for the same customer, but this time they change the phone number. Now they have both committed, each leader has said OK, well and good. When things get replicated across leaders, this is a potential conflict scenario, because we don't know which is the correct value. There is a possibility that your application is not in a state to handle the merge conflict, and in that scenario, figuring out which is the correct record is difficult. This is a conflict scenario. So one possible solution for this is last-write-wins. What that means is that the two colleagues have a certain time difference between their commits to the database, and the last write, the one committed latest, wins.

The problem with this, however, is that, for example, this person committed at one second and the other at one second plus a few milliseconds. The last write would be the latter, and the first person would feel that their writes, despite being committed and despite getting an acknowledgment back, have gotten lost. Another solution for this is manual conflict resolution. What that means is that we leave it up to our application code to decide which one is correct. Since the writes were performed on two different fields of the same record, our application could maybe merge them, or it could show both values to the client and let the client decide which one is correct, or another option would be that it creates a marker object in our database and later on somebody from the team decides which one is the correct one. This way we prevent data loss, but it requires manual intervention. There are some practices which one needs to follow when implementing replication successfully; over to Tanya for that.
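
To make last-write-wins concrete, the small sketch below resolves two conflicting versions of the same customer record by comparing commit timestamps; note how the earlier write is silently discarded, which is exactly the data-loss risk described above. All names and values are illustrative.

    def last_write_wins(version_a, version_b):
        # Each version carries the timestamp at which its leader committed it.
        # The later timestamp wins; the other write is silently dropped.
        return version_a if version_a["committed_at"] >= version_b["committed_at"] else version_b

    region_one = {"customer": "A", "address": "22 Baker Street", "committed_at": 1000.000}
    region_two = {"customer": "A", "phone": "+44 1234 567890", "committed_at": 1000.004}

    print(last_write_wins(region_one, region_two))  # region_two wins; the address update is lost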

Yeah. So, with all of these replication techniques in our arsenal, it is very important that you carefully choose what needs to be done, and there are some factors that we need to consider. First and foremost comes the workload: how much data are you generating and how frequently is it changing? This will determine the replication frequency of your system and the method that you choose. You also need to consider consistency requirements: does your system need to be strongly consistent across replicas at all times?

If that is the case, then you need to choose something that guarantees strong consistency. Also, system architecture becomes important: is your system distributed? Are you using a cloud-based or an on-premise system? All these factors will impact your replication approach, as they will determine any network latency that you might face. Now, once you have selected a replication method, it is time to implement it in the best way possible, and there are some best practices that you need to keep in mind. The first one is monitoring and alerting. Monitoring the replication process is critical to detecting any faults in your replicas as soon as possible; setting up alerts will notify you when a replica is down or when there are replication errors or discrepancies, and it will allow you to take action before something bigger happens. Next comes backup and recovery. This should be top of mind: your replication process is only as good as your ability to recover from a failure, so you should have a robust backup and recovery plan to ensure that your data is protected and recoverable in case of a failure. Then comes testing and validation. Before you actually go ahead and roll out your replication strategy, you should make sure to test and validate it thoroughly. This will identify any potential issues beforehand and ensure that your replication process runs smoothly in production.
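
As one hedged example of the monitoring practice, the sketch below polls a MySQL replica for its lag using SHOW REPLICA STATUS (older MySQL versions use SHOW SLAVE STATUS and Seconds_Behind_Master instead). The connection details and the alert threshold are assumptions; wire the alert into whatever paging or email system you actually use.

    import mysql.connector  # pip install mysql-connector-python

    LAG_THRESHOLD_SECONDS = 30  # arbitrary alert threshold for this sketch

    def check_replica_lag(host, port, user, password):
        conn = mysql.connector.connect(host=host, port=port, user=user, password=password)
        cur = conn.cursor(dictionary=True)
        # MySQL 8.0.22+; on older versions use "SHOW SLAVE STATUS" / "Seconds_Behind_Master".
        cur.execute("SHOW REPLICA STATUS")
        status = cur.fetchone()
        if status is None:
            print(f"{host}:{port} is not configured as a replica")
            return
        lag = status.get("Seconds_Behind_Source")
        if lag is None or lag > LAG_THRESHOLD_SECONDS:
            # Hook in your real alerting (pager, email, chat) here.
            print(f"ALERT: replica on {host}:{port} is lagging or stopped (Seconds_Behind_Source={lag})")
        else:
            print(f"replica on {host}:{port} is healthy, {lag} seconds behind")

    check_replica_lag("127.0.0.1", 3308, "root", "secret")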

With that, we come to the end of our presentation, and we thank all of you for attending. We hope that you gained some insight on data replication and are excited about using it in your own architecture. If there are any questions, we're happy to answer them, or we can also connect offline at the given Twitter and LinkedIn.

Thank you, everyone, for attending. Please feel free to ask any questions if you have any. Again, thanks to the Women Tech Network for giving us this wonderful opportunity. It was great to talk about one of the topics which we are really passionate about.

Yep. Ok. In that case, should we end this discussion?

Yes. Thank you so much. Everyone do connect if you have any questions.

Thank you. Bye bye.