Jayeeta Putatunda Transfer Learning With BERT: Building a Text Classification Model


Video Transcription

So the topic for today is transfer learning with BERT, or NLP 101 as I call it, because this is a subject that's really close to my heart. I really enjoy researching and working on NLP, and it's great to share some of what I've learned with you. A quick introduction: my name is Jayeeta Putatunda. I'm a data scientist with Indent US Inc. We are based out of New York and we do a lot of interesting data analytics projects, so it's been great researching and working with them. You can connect with me on my LinkedIn and Twitter accounts; follow me and ask me questions if you have any doubts, and I can address those as well. A quick note on the format: this is a 40-minute session, so we'll talk for about 30 to 35 minutes on the subject, and if you have any questions we'll take them at the end, with 5 to 10 minutes of Q&A. I will share the code through my GitHub link; I'll update it after the session, so get back to me if you have any questions about the code or need any clarifications. I'm always there to help.

Moving on, here's something I feel strongly about: data science is such a huge field, and NLP is a big area within it, so a lot of the time people feel that NLP is hard, and to start with, it definitely is. Let's take a look at this. What do you see in this image? It says, "I am a huge metal fan." How many readings do you get from this picture? I get two. If an algorithm hears the line "I'm a huge metal fan," does it think about a metal fan, the appliance, or does it think about a metal band and that you're a fan of that band? There is a lot of complexity in the English language, and I feel this is a great example; it helps me clear up how I think about where to start with NLP. Some of the major complexities of English are ambiguity and synonymy: a particular word can mean multiple things. It is very easy for us humans to understand and switch between these two contexts, but it's very difficult for an algorithm to distinguish between them, and algorithms need lots and lots of data to understand these nuanced differences and give us proper results.

So that's why they say NLP is hard. But I think we just need to go step by step and we will conquer it. So let's go ahead. We're going to talk a little bit about what natural language processing is; I know there are a lot of newcomers out there who want to jump into data science and explore the domain of natural language processing. So we'll cover what NLP is, where NLP stands today, where it is mostly applied in industry, and the basic preprocessing steps you have to do to make sure your model works well. Then we'll talk about transfer learning, BERT, and the basics, and we'll look at code that builds a classification model using BERT pretrained models. So let's jump into it. What is NLP? NLP is basically a subfield that sits at the intersection of several areas; you can see that in the Venn diagram shown here. There are a lot of components: linguistics comes in because it deals with language and literature, the grammatical rules.

What are the syntax rules of the English language? Then it's an amalgamation of natural language generation and natural language understanding, with a big chunk of deep learning and machine learning coming in, all creating the very niche subfield of natural language processing.

The major idea is that all of these fields come together to help machines understand and communicate back in free-flowing human speech, so that we can talk to them directly and they can give us quick feedback without us having to hand them manually tagged data for everything.

So let's see how that happens. Before that, just look at this chart and see how far we have come, the exponential growth from April 2018 to May 2020. Look at the jump in parameters. When I say parameters, those are the parameters that big companies like Google, OpenAI, Microsoft, Facebook, and Hugging Face are working with to create pretrained models. These models are very difficult to train; they take a lot of time and resources, a lot of GPU and CPU power, and multiple TPUs have to run to train one of these models. As we speak today, at the end of May 2020, OpenAI has released a new model with 175 billion parameters. Those are huge numbers, and we need to understand the significance of what we can achieve with this exponential growth. So where is NLP actually getting implemented across industry use cases? The first major use case that comes to my mind is machine translation. Say you want to translate English text into a different colloquial language; I'm from India, where people communicate in multiple languages. So how do you build one model and then use it to translate between multiple languages?

Or you translate Spanish audio, say, into German text. The classic example is Google Translate. Another area is chatbots. I'm sure you've interacted with chatbots a lot lately, and sometimes you may find them frustrating because you keep asking questions and it feels like they keep reiterating the same thing, and you're not getting the exact information you want from them.

But you have to know that when you're building a chatbot model, a lot of data goes into it: the intent classification, the knowledge-tree building, and the whole question-answering piece need a lot of data to make a chatbot as good as the human chat interaction you would want to have. We'll look at where we are in that domain at the end of this presentation. Another area is natural text generation. You see so many long articles; what if you just want an abstractive summary of one? You don't want to read through multiple pages of news, just a short summary. Getting that to work has been a big leap, and Google just released a paper yesterday about it, where they achieved strong accuracy with a new model they developed. So it's an ongoing field and the research is heavy in this area. Topic modeling is another area, along with topic clustering, entity extraction, and sentence chunking; these are very specific cases where NLP gets implemented. And another area is definitely text classification.

We will see an example of text classification today: if you have a news article and you want to classify whether it's political news or sports news, we will go through that example as well. So, moving ahead, how do you do it? Before we jump into transfer learning, I wanted to make sure you have a baseline understanding of some of the steps that always have to happen, because the concept of garbage in, garbage out applies. If you feed the model garbage text that's not clean and not structured, there is no way we're going to get a good model or good accuracy. So let's look at some of these areas. First, extra spaces: the example text says "God is great. I won a lottery," but it has a lot of stray spaces. How do you fix that? There are very simple ways of fixing this, and there are multiple libraries you can explore as well, like NLTK and spaCy, which give you a lot of prebuilt tools.

In the simplest case you can just split and strip the string, which takes out the trailing spaces as well as the extra spaces anywhere inside the sentence, and then join it again to get a clean sentence. The next step is tokenization, which is the basis of every natural language processing model. When we say tokenization, the best way to think about it is splitting your sentence into individual words. You can feed a whole sentence to a model, but the way it connects and builds vector representations in NLP is via word embeddings over tokens; we'll look into that as we go. For tokenization, the libraries I mentioned help a lot, and NLTK does a great job. Here, after tokenizing, we apply re.sub, which is a regex substitution that removes any extra characters you do not want; you can create your own specific regex patterns as needed. In the final output we've stripped out the exclamation marks and extra punctuation and get a clean token sequence for "God is great. I won a lottery." A minimal sketch of these two steps is below.
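The exact regex and the choice of NLTK here are my own, not necessarily what was on the slide; this is just a quick way to reproduce the idea.

```python
# Whitespace cleanup plus tokenization, assuming NLTK is installed.
import re
import nltk

nltk.download("punkt", quiet=True)

text = "   God  is   great! I  won a   lottery.   "

# 1. Collapse extra spaces: split on runs of whitespace and re-join.
clean = " ".join(text.split())

# 2. Tokenize, then strip anything that is not a word character.
tokens = [re.sub(r"[^\w]", "", tok) for tok in nltk.word_tokenize(clean)]
tokens = [tok for tok in tokens if tok]  # drop empty strings left over from punctuation

print(tokens)  # ['God', 'is', 'great', 'I', 'won', 'a', 'lottery']
```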

Spell checking is another very important area. Spelling errors lead to a lot of error in the model, because the model cannot recognize that two words are the same when there's a one- or two-character difference between them. So how do you handle it? There is Peter Norvig's spell checker; Norvig works at Google, and he created what I would call a very efficient baseline spell checker, so you should take a look at it. In the implementation here, correction is the function name, and when I call it on "speling" with one "l" missing, it gives me back the correct spelling. This ensures we are not feeding in bad data, or at least that we are fixing the basic words we know are incorrect. In the next call, "korrectud" should be "corrected" with a "c" and "ct", and it translates it into exactly that; it fixed the next misspelled word too; and the last word, "quintessential", was already correct, so nothing changed there. A short usage sketch follows.
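This assumes you have copied Norvig's short spell corrector (norvig.com/spell-correct.html) into a local module called spell.py exposing a correction() function; that module name is my own setup, not something that ships with a library.

```python
# Using Peter Norvig's correction() function, copied locally into spell.py.
from spell import correction

print(correction("speling"))         # -> 'spelling'   (one missing letter fixed)
print(correction("korrectud"))       # -> 'corrected'  (two-character fix)
print(correction("quintessential"))  # already correct, so it comes back unchanged
```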

The next step is contraction mapping, which is a very important thing to do in every NLP project. In natural text, say when you're talking with a chatbot or posting on social media, I'm sure all of us use contracted words rather than writing out the full versions. So how do you handle that? With contraction mapping we take contractions like "haven't", with the apostrophe, and map them out to the whole words. If you look at the result here, the input "Hey, I'm Jayeeta" becomes "Hey, I am Jayeeta"; it expands the contraction "I'm". Other cases are "won't", "we'll", "would've", and so on. The next step is stemming and lemmatization, which is a really important area; we will see that BERT does something similar to handle out-of-vocabulary words. How does it work? Say "game", "gaming", "games", and "gamed" all come from the same root word, which is "game". If we want to train a model and keep it efficient without repeating lots of word forms, the best way is to lemmatize or stem them, so the model knows they come from that particular root; even if the model has never seen, say, "gamer", it knows it is related to the word "game" and maps it into that region of the vector space. Here's a small sketch of both steps.
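This sketch assumes NLTK; the contraction dictionary is a tiny hand-written sample, not a complete mapping.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)

# 1. Contraction mapping with a toy dictionary (extend this for real projects).
contractions = {"i'm": "i am", "don't": "do not", "would've": "would have"}

def expand_contractions(text):
    return " ".join(contractions.get(word.lower(), word) for word in text.split())

print(expand_contractions("Hey, I'm Jayeeta"))  # Hey, i am Jayeeta

# 2. Stemming and lemmatization pull inflected forms back toward the root "game".
stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
for word in ["gaming", "games", "gamed"]:
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))
```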

Then there are stop words. Stop words are very general words, like articles, and there can be use-case-specific stop words as well. For example, if you're working with legal data, it has a lot of legal boilerplate terms that do not add any value to your particular model. So how do you handle that? You create a list of all the stop words you don't want to keep and filter them out. In the example here, "I" and "am" go away because they are very common words that don't serve any purpose for the meaning of the sentence, and the same goes for words like "this" and "how"; these kinds of words add no value to your model. A small sketch is below.
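This uses NLTK's built-in English stop-word list; the two extra "legal" stop words are hypothetical additions, just to show the idea of a domain-specific list.

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

stop_words = set(stopwords.words("english"))
stop_words.update({"hereinafter", "whereas"})  # hypothetical domain-specific additions

tokens = ["hey", "i", "am", "jayeeta", "and", "this", "is", "how", "you", "do", "it"]
print([tok for tok in tokens if tok not in stop_words])
# ['hey', 'jayeeta']  (the filler words drop out, the content words stay)
```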

Finally, there is a lot of case-based preprocessing as well, for example when you have a full sentence and you want to know whether a word is a pronoun or a verb. There are a lot of good tools in spaCy that you can explore for that, and they will give you nice annotations for all your words. Great. Now that we have a basic idea of the preprocessing pipeline for NLP, let's take a look at what transfer learning means. Here is a great quote by Andrew Ng; he's one of my favorite people, and I encourage you to check out his tutorials. There are a lot of good ones by him on Coursera and plenty of open YouTube videos. He said at the NIPS 2016 conference that transfer learning will be the next driver of ML success. He said that in 2016, and here in 2020 we know that the major models all these big companies build are based on transfer learning. The applications are huge, and the areas we can explore are huge; we'll see how. In a very quick definition, transfer learning is a technique where you train a model on one particular task for which you have a lot of data,

and then you use that model, or repurpose it, for a second related task. It doesn't have to be the exact same task; you can fine-tune it, which is what we are going to do today at the end of the session. Let's see how that works. Here is the very simple math of it. Say you have dataset 1, (X, Y), where X is a general image and Y is the classified object; the classes could be images of a tree, a train, a car, or anything you'd see in everyday scenery. This dataset comes from ImageNet, a huge dataset of millions of images collected and labeled over the years, so it's the perfect way to start your journey into these kinds of applications, whether computer vision or, by analogy, NLP. Now I'm going to put up a big chart; don't be afraid if you haven't seen this before, we'll go through it step by step and it's going to be really simple to understand. When you feed in X and Y, this network diagram shows the input going in, multiple layers that the whole dataset is trained through, and a final layer that gives you an output, Y hat.

Basically, it gives you a category prediction: whichever image you test it against, it gives you a classification based on all the data you fed into the model before. Now, the one thing that changes with transfer learning is this: say you now have dataset 2, also labeled (X, Y), and this X dataset is very small. A common example is classifying whether an image shows a rural or an urban scene. How do you describe a rural versus an urban image? In a rural image you'll have nice big lawns and open grounds, and in an urban one you'll have tall buildings; that's the general picture. But how many images do you think you can collect for this particular task? It's going to be very difficult to collect, label, and clean that much data. So how do you achieve a high level of accuracy in the first round? You reuse the model trained on ImageNet, because the baseline parameters remain the same, right?

The low-level features, like edges and curves, and some higher-level features, like whether there's a horse or open ground, are parameters that carry over from the previous dataset. What we are doing in step 5 of the figure is fine-tuning the last dense layer; I call it a dense layer because it's a one-dimensional vector. From there, we feed that layer our dataset 2 and fine-tune it so that the final output is now our two-category prediction, U and R for urban and rural. The idea is that you won't always have a full dataset, or the resources and time to train from scratch. So what do you do? You use a dataset that is general but close to the task you're trying to achieve, and you do transfer learning and fine-tune from there. A rough sketch of this pattern follows.
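In this Keras sketch, the ResNet50 backbone and the two-class head are my own choices to illustrate the idea, not the exact setup from the slide.

```python
import tensorflow as tf

# Reuse an ImageNet-pretrained backbone and keep its general low-level features frozen.
base = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False, pooling="avg", input_shape=(224, 224, 3))
base.trainable = False  # edges, curves, textures learned on dataset 1 stay as they are

# Replace only the final dense layer with a new two-class head: U (urban) vs R (rural).
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(small_urban_rural_dataset, epochs=5)  # dataset 2 can be small
```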

So why do we do transfer learning? I've spoken about this already, but here are three examples. Say you want to do speech recognition in Bengali, which is an Indian regional language; there aren't necessarily a lot of documents or recordings readily available for it. So what do we do? There is so much English, Hindi, and other-language audio already available, so why not use that data from sources like YouTube and then fine-tune a model to understand Bengali, because the underlying structure of spoken language remains similar. If we take the pretrained model and update the last layers with a tagged, clean Bengali dataset, it will give us a pretty accurate model on the first pass. Another example is the urban-versus-rural one we just talked about: we took ImageNet, which has a huge amount of data, and trained from there. The same holds for text: if you want to work with legal data and do document classification but you don't have a lot of legal text to train your model with, you can scrape general sources like news blogs, websites, Reddit, and Twitter, where general data is abundant, and that can help you build the whole structure. Let's see what we have next.

So, when to use it? We basically already covered this: when you have a scarcity of labeled data to start with. Creating labeled data is expensive; it takes time, and for very specific use cases it needs subject-matter experts to come in and label the data. The other condition is that the final goal of your task has to match some pre-existing model reasonably well. It cannot be too far away, and it cannot be the complete opposite; the input properties of task 1 and task 2 have to be similar. The transfer of knowledge only works if you are transferring from a general base to a niche task that comes from the same underlying domain. So let's see the difference between transfer learning and the traditional ML framework.

From this image it's very clear that in traditional ML we have multiple different tasks, a separate dataset for each of those tasks, and we train each one to create its own learning system. What do we do differently in transfer learning? You have a source task, like ImageNet with images of everything around you, and you create a knowledge base out of that. Then you have a target task, like classifying between rural and urban images, and you utilize that knowledge base and do a fine-tuning step (which is what the final arrow into the learning system shows) to create the final learning system, without building a separate model from scratch for each specific task.

So two things are happening here: we are reducing resources, and we are making sure we utilize already-trained models that exist so we don't double our work; we use what we already have. Now let's talk about BERT. I find this cartoon very funny. If you look at the models Google has come up with, they've named them in a funny way: there is ELMo, there is BERT; I guess they're fans of Sesame Street, and I am too. So here it says, "Hi, I am BERT, how may I help you?" Before we learn what BERT is, there is another very important concept we need a clear understanding of, which is word embeddings. Word embeddings are the base of any natural language model, be it BERT, be it word2vec, be it any other model you can think of. The simplest definition of a word embedding is that it's a feature-vector representation of a word. Let's see what that means. When you look at this diagram, what do you see? Can you point out a connection?

I see that "woman" and "queen" are connected, and similarly "man" and "king", and also other groupings like "slow", "slower", "slowest", which are different forms of a single word. There's also "mother", "boy", "daughter". So there are connections here, and we need to create these kinds of connections among many different words based on where they occur in a particular sentence or document set. The whole idea of word embeddings is to create that vector space, where you categorize and cluster words that behave similarly. Let's look at it here. In these examples (sorry if the image is a little stretched), you see "water" and "sea" at the top left are close together, then "climate", "wind", "ice" are roughly together, and "fish", "eggs", "meat" are roughly together. This is a 2D projection of a vector space; the vector space itself can have many dimensions, depending on how many parameters or features you have.

But to explain it very simply, the idea is to cluster words with similar features into one cluster, so the machine or the algorithm understands that they come from a similar background or have a similar kind of meaning. Take "heat", "electricity", "oil", "energy", "fuel": they don't mean the same thing, but they come from the same concept. Let's take a quick example in a 3D format. Here you see a 3D graph whose axes are "sky", "engine", and "wings". Where does "helicopter" lie? It falls between "sky" and "engine"; that sounds right, since it has an engine and it also flies in the sky. There's also "drone", which again falls between "engine" and "sky", more toward "engine" than "sky", and "rocket" sits near "engine" too. Now, the first zero in helicopter's vector, (0, 2, 4), means the model didn't find any co-occurrences of "helicopter" and "wings"; I guess helicopters do deserve wings, but apparently it's not prominent enough. It did find "wings" co-occurring with "eagle" and "sky": on the left-hand side, by the green arrows, "eagle" is marked (3, 0, 3), which means "eagle" falls between "wings" and "sky" and has zero weight for the word "engine". Here's a toy sketch of that idea.
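The numbers below are hand-written co-occurrence counts along the (sky, engine, wings) axes, only roughly matching the slide; real embeddings are learned, not counted by hand, but the geometry works the same way.

```python
import numpy as np

# Toy "embeddings": counts along the features (sky, engine, wings).
vectors = {
    "helicopter": np.array([2.0, 4.0, 0.0]),  # sky and engine, no wings
    "drone":      np.array([1.0, 4.0, 0.0]),
    "rocket":     np.array([2.0, 5.0, 0.0]),
    "eagle":      np.array([3.0, 0.0, 3.0]),  # sky and wings, no engine
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for word in ["drone", "rocket", "eagle"]:
    print(f"helicopter vs {word}: {cosine(vectors['helicopter'], vectors[word]):.2f}")
# helicopter ends up much closer to drone and rocket than to eagle
```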

That's how, in the vector space, words get mapped to different vector notations, and each of these dimensions becomes a feature in how we build a model at the end of the day. Great, moving forward. This is the "BERT mountain" by Chris McCormick; he has great YouTube tutorials and I encourage all of you to check him out. He explains this really well: BERT is the outcome of a lot of underlying technology that has been developing for a long time. RNNs, LSTMs, and bidirectional LSTMs are some of the baseline pieces, but the three at the top, attention, the transformer, and then BERT itself, are what BERT is built from. I know that's a lot; this is the point where I feel we should calm down, because there are a lot of heavy concepts when we try to learn about NLP.

But take a deep breath, don't get too nervous, and always think of transformers the way I do, like Optimus Prime: he's there to help us, not to scare us. So let's learn what a transformer is; at heart it's a fairly simple concept. Here you have German-language input. The idea of the transformer is that it takes each of those tokens, encodes them into an internal representation, and then produces an outcome through a decoder. So there is an encoder-decoder structure in every transformer; that's why they're called transformers: they transform one sequence into another by encoding each of the tokens from your input language into the output you want. Here we are translating German into English, for example to get "please come here." A hedged sketch of doing this with an off-the-shelf model is below.
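Purely to illustrate the encoder-decoder idea, here is a sketch using a public German-to-English model from the Hugging Face transformers library; the model choice and the German sentence are my own assumptions, not from the talk.

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-de-en"   # assumed off-the-shelf de->en model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Encode the German input, let the decoder generate English tokens, then decode them.
batch = tokenizer(["Bitte komm her."], return_tensors="pt", padding=True)
generated = model.generate(**batch)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
# -> something like "Please come here."
```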

The application seems simple, but there is a lot of difficulty behind it and a lot of internal neural network machinery actually working through it. So let's see what that means. BERT's full name carries some heavy words: Bidirectional Encoder Representations from Transformers. Let's break that down. To give you an overall view, this bidirectional encoder model is trained on Wikipedia, with almost 2,500 million words, which is a huge amount if you think about it, and also on BookCorpus, which has almost 800 million words. The key concept is that BERT learns from both directions. When word2vec came out it was transformative, because it tried to capture the context a word is used in, but it was still trained in one direction, reading left to right. BERT came in and said, no, we won't read a sentence only from left to right.

We will also read it from right to left, so we know exactly which words appear to the left and right of each position, and build the vector space from there. That's the whole concept of why it's called bidirectional: it creates the encoder representations and the transformer transforms them, which is the whole definition packed into the name BERT. The base architecture has 12 transformer blocks (layers), 12 attention heads, and about 110 million parameters; we'll talk very quickly about what attention heads mean. Below is a quick way to check those numbers for yourself.
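This check assumes the Hugging Face transformers library, which is a different package from the keras-bert one used in the demo later.

```python
from transformers import BertConfig

config = BertConfig.from_pretrained("bert-base-uncased")
print(config.num_hidden_layers)    # 12 transformer blocks
print(config.num_attention_heads)  # 12 attention heads per block
print(config.hidden_size)          # 768-dimensional hidden states
# BERT-Base has roughly 110 million trainable parameters in total.
```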

Those are huge numbers. BERT was the first deeply bidirectional model pretrained in an unsupervised way on plain text, basically trained on everything available in Wikipedia, which is huge because it's all the content that's out there. The core of BERT is its input representation and WordPiece embeddings: it tries to relate each piece to the other pieces of the whole structure so it can find a better meaning for what you're trying to express. Now take a look at this example and see if you get a hold of it. The first sentence says, "We went to the river bank." The second sentence is, "I need to go to the bank to make a deposit." We're using the same word in both, "bank," but it clearly has different contexts, which makes it mean different things in these two sentences. When we say "we went to the river bank," it's obviously a river bank.

So how does the model understand that these are two different contexts? With the bidirectional view, it sees that "bank" was used next to "river" in one case, and that "deposit" was used near "bank" in the other. Many other examples feed in to make that differentiation between the river sense and the deposit sense, which is a financial construct, and to give "bank" a meaning that is not one fixed vector but context-dependent vectors in different regions of the vector space. Remember the vector space we looked at: the word "bank" won't be represented as one feature; it will be represented across multiple features of the whole vector space, depending on context. I hope that makes it a little clearer; if you have any questions, let me know. Here's a small way you could see this for yourself.
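This sketch assumes the Hugging Face transformers library and PyTorch as stand-ins (not the keras-bert setup in the demo): the contextual vector for "bank" comes out different in the two sentences.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_vector(sentence):
    # Encode the sentence and pull out the hidden state of the "bank" token.
    enc = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    idx = tokens.index("bank")
    with torch.no_grad():
        out = model(**enc)
    return out.last_hidden_state[0, idx]

v1 = bank_vector("we went to the river bank")
v2 = bank_vector("i need to go to the bank to make a deposit")
print(torch.cosine_similarity(v1, v2, dim=0).item())  # noticeably below 1.0
```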

Moving ahead: BERT has three embedding layers, for tokens, segments, and positions; together they form the input representation. Take a look at this input: "my dog is cute" and "he likes playing." There are two additional things BERT does here: it adds a [CLS] token at the start and a [SEP] token as the separator. BERT is marking where the input starts and where one sentence ends and the next begins, along with punctuation like full stops, commas, and exclamation marks. This is also where WordPiece tokenization comes into play. If you look at the position embeddings, they run E0, E1, E2, and so on up to E10, one for each token position found in the input.

Now look at the segment embeddings: they read EA, EA, and so on for the first part and then EB for the second, so BERT categorizes the whole input you've given it into two sentences. (I see a question from Arena; I'll get back to you and we'll discuss it at the end.) Then let's see what the token embeddings mean. After the segment embedding marks the sentence-to-sentence relation, that this sentence is separate and that is another separate sentence, the token embeddings create an embedding for each individual token that was found. The sketch below shows what these inputs look like in practice.
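Again, this assumes the Hugging Face tokenizer as a stand-in for the WordPiece tokenization just described.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("my dog is cute", "he likes playing")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# [CLS] + sentence-A word pieces + [SEP] + sentence-B word pieces + [SEP]
print(enc["token_type_ids"])
# segment ids: 0 for every sentence-A position, 1 for every sentence-B position
# position embeddings are added inside the model, one per token position (E0, E1, ...)
```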

So how many embedding layers are there? Three, and a lot of work happens across these three embedding layers to finally build BERT's input layer. BERT is then pretrained on two NLP tasks; I'm trying to keep this as simple as possible, covering the idea first before we go into the code. The first task is masked language modeling, which is basically this example: "The man went to the [MASK] and bought a [MASK] of milk." The reason BERT masks words is that, because it reads bidirectionally, the model would otherwise be cheating: it would already see the words it is supposed to predict. That's the purpose of masking some of the words in the sentences and then testing the model to see if it can fill in the labels correctly, like "store" for the first blank and "gallon" for the second. You can try this objective directly with a pretrained model, as in the sketch below.
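This assumes the Hugging Face fill-mask pipeline with a pretrained BERT.

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("The man went to the [MASK] to buy a gallon of milk."):
    print(pred["token_str"], round(pred["score"], 3))
# words like "store" should show up near the top of the predictions
```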

The second pretraining task is next sentence prediction. If you say "the man went to a store" and the next sentence is "he bought a gallon of milk," that makes sense, right? So the label would be IsNext. These are the two pretraining objectives of BERT: to recover particular masked words that are missing from the context, and to decide whether one sentence plausibly follows another. In the second example, "the man went to the store" is followed by "penguins are flightless"; that is not a sensible next sentence, so the label is NotNext. OK, I guess we'll look at some of the code now and come back to this if needed. Can you see my screen, and is it big enough? Great. So, I'm using Google Colab; Colab is great, you should check it out, because you can use GPUs and TPUs, and BERT is a big model, so your laptop will freeze if you run it on a local CPU. I prefer working in Colab; you can change the hardware under Runtime, "Change runtime type," and make it a GPU or a TPU.

Choose based on how heavy the task is; for this simple classification I run it on a GPU. First we download the pretrained BERT checkpoint and load it through Keras, using the keras-bert package, and we mount our Google Drive so the content is accessible. We also download the dataset we want to use for fine-tuning, to see how well it works with the pretrained model. We're using TensorFlow version 1 here; version 2 has already been released, so you can work with that as well. The basic flow is that we download the pretrained model files, load the model from that checkpoint, and if you print the model summary you can see each layer of the BERT model: an embedding layer, encoder blocks doing multi-head attention, and then many more encoder layers. Depending on how deep you want the network to be, you'll have that many levels of encoder blocks.

The next step is to use a fairly small dataset to test whether the pretrained model makes sense for our data. We import a newsgroup-style dataset; you can download news data from any source you like. The label files hold the tags for each of the texts in the news data file, and when we load it we create train and test splits in the usual ML way. We can also see that the amount of data per tag is roughly balanced; there is no big class imbalance. Now, here is the one key step we do to fine-tune our model: we grab the last dense layer, the layer called "NSP-Dense", and on top of it we fit a new dense layer with a softmax activation function. I encourage you to look at tutorials that cover activation functions and which to use for your use case. A rough sketch of this step is below.
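This sketch assumes the keras-bert package and a downloaded BERT-Base checkpoint; the paths, sequence length, and the 20-class output layer are placeholders, not the exact values from the notebook.

```python
import keras
from keras_bert import load_trained_model_from_checkpoint

SEQ_LEN = 128
config_path = "uncased_L-12_H-768_A-12/bert_config.json"      # placeholder path
checkpoint_path = "uncased_L-12_H-768_A-12/bert_model.ckpt"   # placeholder path

# Load the pretrained model with training=True so the 'NSP-Dense' layer is present.
bert = load_trained_model_from_checkpoint(
    config_path, checkpoint_path, training=True, trainable=True, seq_len=SEQ_LEN)

inputs = bert.inputs[:2]                       # token ids + segment ids
dense = bert.get_layer("NSP-Dense").output     # pooled sentence-level representation
outputs = keras.layers.Dense(20, activation="softmax")(dense)  # assumed 20 news classes

model = keras.models.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["sparse_categorical_accuracy"])
# model.fit([token_ids, segment_ids], labels, epochs=..., batch_size=...)
```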

Then you compile those changes into a new model for the fine-tuning stage; you now have a second model object. Here I'm training it for 100 epochs, so it runs through the data 100 times with the sequence length and batch size I set in the code above. At the end, when you calculate the prediction metrics, it gives you about 0.52. Think about it: you have a small dataset, maybe a few hundred labeled rows, and you want to build a model; it's nearly impossible to start with that little data. But with the pretrained model, in the first round of iteration we got almost 52% accuracy, which for me is pretty good, a pretty big deal.

I expect that if we had a bit more data for fine-tuning, or if we did some hyperparameter tuning, we would get great results. At the end, I also tried some additional text from outside the dataset we tested on; I used a piece about some Democrats holding an hour-long talkathon, and the model classifies it as politics. So it's not too difficult; it's a step-by-step process of understanding what we want to achieve and working through these concepts. Now, I guess we're running out of time, but I'll quickly go through where we are with NLP today. Google has released a conversation transcript, a very funny one, from its new chatbot called Meena, and it can basically talk about anything; when they shared the video, they noted at the end that it made a multi-turn joke in an open domain, and they didn't find any of those words in the training dataset. So it basically came up with something new.

That's a huge advance for the NLP domain. And the newest one, OpenAI's GPT-3, released just about 15 days ago, has 175 billion parameters. That's a huge number, but there is a lot we still need to do to make sure it's efficient for the everyday use cases we want to apply it to. I'll end on a crucial note: we need to understand that creating a model from scratch is very resource-intensive. If you look at this chart that MIT published, training a single large AI model can emit as much carbon as five cars do over their entire lifetimes.

That's a big number, and we need to be aware that we shouldn't just create models for the sake of it, but for solving very particular problems, utilizing and re-utilizing already available methods so that we are not redoing the work. So, closing remarks. It's been a great discussion. NLP is hard, but I think it's hard because human intelligence has set a huge benchmark, and we need to do enough to get as close to it as possible. A few guidelines: bad data cleaning means everything gets harder, and always know your data better than anyone else, because if you don't know your data and you just start building a word model, it's not going to work and you'll be frustrated, wondering why. A lot of people say NLP is magic, but it's not; there is a lot of logic, a lot of neural network concepts, a lot of understanding of word embeddings, and it takes step-by-step learning at each stage.

These are some of the additional resources I've put up; if the video goes out, you can all have the resources. Great, thank you so much, everyone. If you have any quick questions I can take them now at the end. Thanks, and keep learning, one step at a time. Don't be afraid, and don't think it's not your cup of tea; it's everybody's cup of tea. We just need a one-step-at-a-time approach, and if you believe you can do it, everybody will believe you can do it. So just do it and don't overthink it. Is there any question I can quickly answer, since I guess we're running out of time? Great, thank you for such kind reviews. Is there any particular question? I'll make sure I upload the code to my GitHub account after the session; you can follow me there and use the code, and reach out if you have any questions. Yes, definitely, I can share the slides and the Google Colab notebook. For the question about the correction function: it is the function I mentioned earlier, developed by Peter Norvig of Google.

If you look at the code, I have mentioned the name of the author who built that code base, so it's a function you can call from their already open-source code; it's out there. Are there any other questions I can answer? Thank you so much, everyone; you've been so helpful and supportive. I think learning anything new needs a bit of lightness and fun, which is why I keep these Transformers cartoons and cute videos around; they remind me that it's not that difficult. Great, I guess one last question: do you think transfer learning is something we could use to replicate something like user personalization in voice assistants?

I think that makes a lot of sense, if you can build the context from even a small amount of voice recordings. Audio is an area that's still developing and there is a lot of work going on, but the application you mention is definitely doable; it would mean doing a lot of voice modulation and fine-tuning on top of those models. We'll see where it goes; there's definitely a lot of scope, and yes, it's very much doable. Thank you, everybody, for joining. I'm going to put up my LinkedIn and Twitter for you to ask any questions; connect with me, I'm happy to learn, and this is a great setting where we can learn from each other. Thank you, and keep learning. I guess that's the end of the session. Thank you.