Hands-on NLP with Hugging Face

Automatic Summary

Transform Your Knowledge of NLP with Hugging Face

Hello everyone, today I'm excited to present a hands-on workshop on "Natural Language Processing (NLP) with Hugging Face". I am currently working as a Machine Learning Research Engineer with a focus on NLP, AI robustness, and explainability, and I am keen to provide learners and NLP enthusiasts with the right resources.

Demystifying Natural Language Processing (NLP)

In my current role, I am developing an application that assesses the quality and trustworthiness of machine learning models. I also feel there is a need for more NLP resources in underrepresented languages beyond English. In light of this, I took the initiative to establish a community of Spanish-speaking NLP professionals called "NLP en Español", which translates to "NLP in Spanish".

Today, I'll be walking you through the process of training a Spanish language model called S-Beta using the Hugging Face libraries. It's crucial to remember that no matter how complex your model is, the quality of the data plays a major role in its effectiveness.

Getting Started: Hugging Face Libraries

  • Datasets: Our chosen dataset for this model is the Spanish Billion Words Corpus, an unannotated Spanish corpus of almost 1.5 billion words. The best part? You can access it, along with a plethora of other datasets, with just two lines of code using the Hugging Face datasets library.
  • Tokenizers: The next step is tokenizing the text, which involves splitting the text into words or sub-words and converting them into IDs. Hugging Face's tokenizers library ships several types of tokenizers, including Byte-Pair Encoding (BPE), byte-level BPE, WordPiece, and SentencePiece. For our model, we'll focus on byte-level BPE.
  • Transformers: The transformer architecture, with its capacity for parallelization and its general performance improvements, is a popular choice among NLP and computer vision researchers. Its key innovations are positional encoding and multi-headed attention. A compact sketch of the three libraries working together follows this list.
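As a rough sketch of how the three libraries fit together (the dataset identifier is the one discussed in the talk; the vocabulary size and special tokens are illustrative choices, not the workshop's exact values):

```python
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaConfig, RobertaForMaskedLM

# Datasets: load the corpus with two lines of code.
dataset = load_dataset("spanish_billion_words")

# Tokenizers: train a byte-level BPE tokenizer on the raw sentences.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    (example["text"] for example in dataset["train"]),
    vocab_size=52_000,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Transformers: define a RoBERTa-like model from a configuration (weights start untrained).
config = RobertaConfig(vocab_size=52_000)
model = RobertaForMaskedLM(config=config)
```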

Training Process

The training process involves initializing a model from its configuration, training it via the Trainer class, and saving it afterwards. Once trained, the model can be uploaded to Hugging Face's Model Hub so that others can reuse it for transfer learning or fine-tuning, as sketched below.
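As a rough sketch of that last step (the local directory and Hub repository names are made up for illustration, and the exact push_to_hub API depends on your transformers version):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load the freshly trained model and tokenizer from their local directory (illustrative path).
model = AutoModelForMaskedLM.from_pretrained("./sbeta")
tokenizer = AutoTokenizer.from_pretrained("./sbeta")

# Share both on the Hugging Face Model Hub (requires being logged in via `huggingface-cli login`).
model.push_to_hub("my-username/sbeta")
tokenizer.push_to_hub("my-username/sbeta")

# Anyone can now start from this checkpoint for transfer learning or fine-tuning.
model = AutoModelForMaskedLM.from_pretrained("my-username/sbeta")
```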

Online Resources and Books in NLP

Online platforms are a treasure trove of resources for deepening your knowledge of NLP. Websites like GitHub host substantial NLP content, including models, datasets, and other valuable resources. Participating in active discussion groups and following NLP-related content across platforms can do wonders for your NLP journey.


In conclusion, learning NLP and understanding its various nuances is a journey of its own. Aspiring NLP professionals can benefit from enrolling in relevant online courses, joining communities, working on real-life projects, and regularly interacting with experts in the field. Remember, there's nothing like learning by doing. So dive into the world of NLP and watch your knowledge and skills flourish!


Video Transcription

Thank you everyone for joining me today. I'm very excited to be part of the WomenTech Global Conference, and I'm going to give a workshop called "Hands-On NLP with Hugging Face". A little bit about me: I'm a machine learning research engineer, currently focused on NLP and also on AI robustness and explainability. My background is in mathematics and physics, and right now I'm working at neurocat, developing an application that assesses the quality and trustworthiness of machine learning models. Besides that, I'm from Spain and, as I already mentioned, I am very interested in NLP. I also think there should be many more NLP resources in Spanish and in other underrepresented languages that are not English, so I recently founded a community of Spanish-speaking NLP professionals called NLP en Español, which means "NLP in Spanish". I know that some of you here are from Spain, so let me know if you want more information after the talk. Today we're going to train a language model in Spanish called S-Beta, because it's going to be a RoBERTa-like language model. For that, we are going to use three Hugging Face libraries.

So: datasets, tokenizers and transformers. First of all, we need a dataset. Never forget that when you're training a machine learning model, the data is very important: it doesn't matter how fancy your model is, if the data isn't good you won't get a good model out of it. I chose the Spanish Billion Words Corpus. This is an unannotated Spanish corpus of almost 1.5 billion words, and it's in the datasets library because I added it last December. When you go to the Hugging Face datasets library you can see a whole list of datasets, and all of them have a dataset card, like the one we see here, that tells you which tasks it can be used for, the language it's in and much more useful information. The nice thing about this library is that you can access a lot of datasets with just two lines of code: first you import load_dataset, then you call it with the name of the dataset, and done, you have it. So let's see how many examples we have. Yeah, a lot of examples in this dataset.
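In code, the loading and a quick inspection look roughly like this (the dataset identifier is the one in the Hugging Face datasets library; the exact notebook may differ):

```python
from datasets import load_dataset

# Two lines: import the loader, then load the dataset by name.
dataset = load_dataset("spanish_billion_words")

# How many examples do we have?
print(dataset)                 # shows the splits and number of rows
print(len(dataset["train"]))   # number of sentences in the training split

# Print one example: the corpus is unannotated, so each row only has a "text" field.
print(dataset["train"][0])
```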

And if we print one of them, we see that it's unannotated, like I said before: the only field we have is a text, so it's all just sentences. Very well. Once we have chosen the dataset, we need to tokenize the text. What is tokenizing? When you tokenize the text, you split it into words or sub-words, depending on the type of tokenization you want, and then convert these words or sub-words into IDs, because of course, to train the model we need numbers. We're going to use the Hugging Face tokenizers library for this, and it has different types of tokenizers: Byte-Pair Encoding (BPE), byte-level BPE, WordPiece and SentencePiece. Here I've listed some of the models that are trained with each of these tokenizers, but we're going to focus on byte-level BPE, because that's the kind of tokenizer RoBERTa uses, and that's the model we're going to train. So let's see how we do this. As I said, we're going to use the Hugging Face tokenizers library, and everything is going to be automated.

The preprocessing is also automated: it truncates, pads and adds all the special tokens we need. The other nice thing about this library is that it's very fast, thanks to Rust. OK, so let's dive into the code. First, we need to import the tokenizer; as I said, it's a byte-level BPE tokenizer, and we initialize it just by calling the class. Since we are using a dataset from the Hugging Face library, and not a dataset made of, let's say, CSV files, we're not going to train the tokenizer with the train method but with train_from_iterator. This way we get batches even though we don't have files, which is better from a performance point of view. To train from an iterator we need an iterator, so we create it with this function here, which just iterates over the whole dataset and yields batches. Then we can call train_from_iterator, specifying the iterator, the size of the vocabulary and the special tokens we need.
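A hedged sketch of this step, from initializing the tokenizer through saving it (the batch size, vocabulary size and output directory are illustrative, not necessarily the values used in the talk):

```python
import os

from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

dataset = load_dataset("spanish_billion_words")   # the dataset from the previous snippet

# Initialize an (untrained) byte-level BPE tokenizer by calling the class.
tokenizer = ByteLevelBPETokenizer()

# The data lives in a datasets object, not in text/CSV files, so we feed the
# tokenizer batches of sentences through a generator for better performance.
def batch_iterator(batch_size=10_000):
    for i in range(0, len(dataset["train"]), batch_size):
        yield dataset["train"][i : i + batch_size]["text"]

# Train from the iterator, specifying the vocabulary size and the special tokens.
tokenizer.train_from_iterator(
    batch_iterator(),
    vocab_size=52_000,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Save the trained tokenizer files (vocab.json and merges.txt).
os.makedirs("sbeta", exist_ok=True)
tokenizer.save_model("sbeta")
```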

So: the beginning-of-sentence token, the end-of-sentence token, the padding token, the unknown token and the mask token. This will take a while; it depends on the size of the dataset, of course. For me it was around one hour. And when it's done, you save the tokenizer. As I said before, our model is going to be called S-Beta, so that's why that name appears in the path. So, let's see how the tokenizer works. If we encode a common sentence, say "Buenos días, me llamo María", which means "Good morning, my name is María", we see that the tokens are just the whole words, because those words are pretty common. But what happens if we pick more complicated words, like these three here? Then we see that these words weren't very frequent, or maybe didn't even exist in the dataset, so the tokenizer splits them into smaller sub-word tokens. That's the difference we see here.
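For instance, a hedged sketch of that comparison (the rare word is my own example; the actual splits depend on the trained vocabulary):

```python
from tokenizers import ByteLevelBPETokenizer

# Reload the tokenizer from the files saved above (illustrative directory name).
tokenizer = ByteLevelBPETokenizer("sbeta/vocab.json", "sbeta/merges.txt")

# Common words come out as single tokens ("Ġ" marks a leading space in byte-level BPE).
print(tokenizer.encode("Buenos días, me llamo María.").tokens)

# A rare word is split into several smaller sub-word tokens instead.
print(tokenizer.encode("esternocleidomastoideo").tokens)
```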

Very well. We have a dataset and we have our tokenizer, so now let's focus on the model. We have all heard about the Transformer architecture: its capacity for parallelization and its general performance improvements make it a very popular option right now among NLP researchers, and lately also among computer vision researchers. OK, so let's look at the architecture from the original paper, "Attention Is All You Need". As we can see, there is an encoder and a decoder, and the main pieces are the positional encoding, the multi-head attention, the masked multi-head attention, and then some simpler pieces like the add & norm blocks, the feed-forward layers, the final linear layer and the softmax.

The two key innovations in this architecture are the positional encoding and the multi-head attention. The multi-head attention performs all-to-all comparisons, which means we can fully parallelize the training on GPUs. This is much more computationally efficient than previous architectures like recurrent neural networks or LSTMs, because those architectures had to process the text token by token, whereas now we can do all-to-all comparisons at once. Moreover, the Transformer uses the ReLU activation function, which is much better than the sigmoid or the tanh: it's very easy to compute, the gradient doesn't saturate on the high end, and it's less sensitive to random initialization. And last but not least, the Transformer architecture made transfer learning very easy. This means we can fine-tune huge models pretrained by big companies. Most of these models are already available in the Hugging Face transformers library, and we can use them with only a couple of lines of code; no wonder the library has around 47,000 stars on GitHub.
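To make the "couple of lines of code" point concrete, here is a minimal, hedged example with a public pretrained checkpoint (the model ID is chosen purely for illustration; it is not the model trained in this workshop):

```python
from transformers import pipeline

# Load a pretrained masked-language model from the Hub and query it.
fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")
print(fill_mask("Hola, me llamo [MASK]."))
```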

OK, let's see how we can use this library. As I said before, we're going to pretrain, not train, sorry, a RoBERTa-like model. This model was first introduced in the paper I've linked here, "RoBERTa: A Robustly Optimized BERT Pretraining Approach". As the title says, it's based on Google's BERT model, which was introduced in 2018. What the authors did was modify key hyperparameters, remove the next-sentence pretraining objective, and train with much larger mini-batches and learning rates. Very good. For each of the models in the Hugging Face transformers library there is a configuration class, a tokenizer and the model classes themselves. So here we first configure the model by calling the RobertaConfig class, and we specify the size of the vocabulary (as before), the maximum position embeddings, the number of attention heads, the number of hidden layers and the type vocabulary size. After that we need to create the tokenizer; we already trained it, so now we just use the from_pretrained method and point it to our S-Beta tokenizer. Finally, we initialize the model from the configuration: we're not retraining it, we're not fine-tuning it, we're training it from scratch, so we initialize the model directly from the configuration.
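A sketch of that, assuming the tokenizer directory saved earlier (the configuration values are illustrative placeholders, not necessarily the exact ones used in the talk):

```python
from transformers import RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast

# Configure the RoBERTa-like model.
config = RobertaConfig(
    vocab_size=52_000,            # same vocabulary size as the tokenizer
    max_position_embeddings=514,  # maximum position of the embeddings
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

# Reuse the byte-level BPE tokenizer trained before by loading it from its directory.
tokenizer = RobertaTokenizerFast.from_pretrained("sbeta", model_max_length=512)

# Initialize the model directly from the configuration (training from scratch,
# so there are no pretrained weights to load).
model = RobertaForMaskedLM(config=config)
```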

Very well. Once we have initialized the model, we have to train it, and for this we are going to use the Trainer class from transformers. We need to specify a dataset, a data collator, the training arguments and the model, so let's go through them one by one. First, we need the training dataset: here we map over all the examples and encode them with our tokenizer, truncating and adding some padding. Second, we create the data collator; for this we use DataCollatorForLanguageModeling, passing the tokenizer and saying that it is a masked language model. Then we specify the training arguments: we again give the output directory (the same name as before), the number of training epochs, the batch size, the saving steps and some other arguments. Finally, we just call the train method, and after a couple of hours we can simply call the save_model method and save the model wherever we want. Very well, I see that you are active in the chat, very good.
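Putting those pieces together, here is a sketch of the training setup, assuming the `dataset`, `tokenizer` and `model` objects from the previous snippets (directory name, epochs and batch size are illustrative):

```python
from transformers import (DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# 1. Training dataset: encode every example with our tokenizer, truncating and padding.
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True,
                     padding="max_length", max_length=512)

train_dataset = dataset["train"].map(tokenize_function, batched=True,
                                     remove_columns=["text"])

# 2. Data collator for masked language modeling: it masks random tokens on the fly.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                                mlm_probability=0.15)

# 3. Training arguments: output directory, epochs, batch size, saving steps, etc.
training_args = TrainingArguments(
    output_dir="sbeta",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    save_steps=10_000,
    save_total_limit=2,
)

# 4. Train, then save the final model.
trainer = Trainer(model=model, args=training_args,
                  data_collator=data_collator, train_dataset=train_dataset)
trainer.train()
trainer.save_model("sbeta")
```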

Once the model is trained, we can of course upload it to the Hugging Face Model Hub, and that way everybody would be able to use it for fine-tuning or transfer learning. If we want to perform transfer learning with one of these models, we can use the Trainer class again with exactly the same options, but instead of initializing our model from scratch, we would load a pretrained one. And that would be it: that's how you train a model from scratch. I went a little bit fast, but that leaves you a lot of time to ask me questions; I can go deeper into some details of the code, or some technical details of the tokenizer or the Transformer, so feel free to ask in the chat. This example is on GitHub, I'm going to send you the link, and you can play with the notebook in Google Colab. You can also reach out to me by email, on LinkedIn or on Twitter if you have any questions, or if you tried something and it didn't work, or it failed, or it wasn't as accurate as you expected; no problem with that. In the repo I sent you the link to, you can also find the slides, and in the slides there are a lot of links to papers, resources and other useful material, so go ahead and take a look.

Sylvana also wants to know if I can recommend some online resources or books to help build knowledge in NLP. Yes, of course, let's see what I can share. There's a GitHub repository that I really like and I'm sharing it with you here: it has a collection of models, datasets and other resources for performing a lot of different NLP tasks, so it's definitely very recommendable. And of course, I also share a lot of NLP-related content on LinkedIn, so feel free to follow me there. Let's see other questions. OK: "In the Transformer slide you talked a lot about the multi-head attention, could you explain a little bit more about the positional encoding?" Very nice question, sure, let's go back to that slide. Here we see that the input is embedded and then goes through the positional encoding block. We know that when we speak, the order of the words matters a lot. In the architectures before the Transformer, let's say the sequential models like recurrent neural networks and LSTMs, this order was inherent, because they parse the text token by token.

But when the Transformer introduced the multi-head attention mechanism and this all-to-all comparison, it kind of lost that inherent ordering of the tokens. What the positional encoding does is add sine and cosine functions to the embeddings, so that when you compute the similarities in the multi-head attention (the dot products), the results are different depending on where the token was in the text. That's a high-level explanation of what the positional encoding does. Right now there's actually a lot of discussion about whether it's really important or even necessary, but I think that's a more difficult conversation for now. Anyway, in the NLP in Spanish community, for example, we have a Transformers study group, and just last week we were debating whether the positional encoding is actually relevant or just optional. So yeah, nice question.
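As a rough illustration of the sine and cosine idea (this follows the formulas from the original paper, not any particular library's implementation):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Positional encodings from "Attention Is All You Need":
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(seq_len)[:, np.newaxis]    # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]   # shape (1, d_model / 2)
    angles = positions / np.power(10000, dims / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

# Each row is added to the token embedding at that position, so the same word
# produces a different input vector depending on where it appears in the sentence.
print(sinusoidal_positional_encoding(seq_len=8, d_model=16).shape)  # (8, 16)
```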

Let's see more. Somebody from Spain tells me: "I would like to join the community, can you tell us a little bit more about it?" Thank you, sure, of course I can tell you a little bit more. This is part of the Languages at Hugging Face initiative, so it doesn't matter if you're not from Spain; this is actually interesting for everybody, because there are already communities for a lot of languages, maybe not all of them, but a lot. Our intention is to democratize NLP beyond English, because if you have ever looked for NLP resources, I'm pretty sure you have seen that almost everything is in English, and if you want to perform an NLP task, or you're looking for a model, or even want to fine-tune one, not even a model specific to your task, it's very complicated to find one in a language other than English. So in the community we share resources with each other and we also try to create them.

For example, we create datasets and upload them to the Hugging Face datasets library, or we train or fine-tune models and upload them to the Hugging Face Model Hub. And as I said, we also have a discussion group about Transformers. There are really nice communities out there, so if you're looking for this, you can check the "Languages at Hugging Face" thread and see if you find a community for your language. And of course, if you are interested in the Spanish community, as you said, you can just reach out to me on Twitter or on LinkedIn. There you go. Thank you very much, everybody, I'm happy you liked the hands-on part. I can take some more questions if anybody has any; I have two minutes. No? OK, very well. Oh yes, we have another question: "How would you recommend starting a journey with NLP? Did you do self-study or take some courses?" Perfect, thank you, Carolina. As I said at the beginning, I studied mathematics and physics, so I didn't study computer science and I didn't do a master's degree in machine learning or anything like that; everything I know, I've learned online.

There are really a lot of courses I can recommend. If you're looking for something more general in the deep learning domain, I would really recommend the Coursera courses by Andrew Ng, by DeepLearning.AI, sorry. For NLP in particular, they also have an introduction to NLP that is really nice. And then what I really, really recommend is to do projects. Of course you need to take some courses at the beginning, but you learn the most when you face real problems and have to figure out how to fix them, and that happens when you do real projects. So look for a topic you're interested in and think about how you could solve it, or how you could apply NLP to it. That's also going to help you build a portfolio, and if you later apply for a job in this field, it's of course nice to be able to show your projects. Also, join communities; I really recommend that. For example, there are a lot of communities of women in AI or women in tech.

They also share a lot of resources and content. Then attend webinars, attend workshops, attend conferences like the WomenTech Global Conference, and reach out to people. If you attend a talk that you find interesting, why not reach out to the speaker on LinkedIn and say, "hey, I really liked this and I would like to keep in contact and keep learning from you"? Of course, always send a note with the connection request. I really recommend doing that, because you're going to realize that people are actually nicer and more approachable than you think.

I've met a lot of incredible people just by reaching out on LinkedIn. So yeah, I hope that, even if it's a bit broad, that answered your question, Carolina. Thank you very much, I appreciate it; I'm very happy you liked the presentation. And what I was going to say before is that I'm really looking forward to seeing your models on the Hugging Face Model Hub. So if you train or fine-tune a model, feel free to send me the link and share it, and if you have any problems while training it, also reach out, no problem, I'm happy to help with any issue you might have. Thank you very much, everybody, for attending.