The Power of Collaboration Within the Data Science Pipeline by Catalina Herrera

Automatic Summary

Unlock the Power of Collaboration in the Data Science Pipeline

In our fast-paced world of technology, there's one attribute that stands out: passion. Passion fuels the drive to achieve optimal outcomes and makes it easier to enjoy every step of the journey. In the next few minutes, we'll delve into a largely unexplored territory that could greatly enhance your results: collaboration in the data science pipeline.

From behind the scenes of various industry roles, I've identified key areas of improvement for better collaboration between different sectors in the data science pipeline. Based on my experience and insights, my goal is to help you understand why fostering collaboration is crucial to success and how we can achieve it effectively.

The Data Science Silos: Breaking them Down

When dealing with data, it's common to encounter data silos, team silos, skill silos, and more. Unfortunately, these silos limit both our capacity for collaboration and our visibility. Without a full perspective on a problem, we often can't make the most of our potential. One way to gain that full perspective is through an integrated, collaborative approach.

In the data analytics field, these collaborations happen on different layers, each one having its own complexities and challenges. There are folks dealing with raw data from semiconductors, others handling consulting aspects from the software side, and others grappling with the best means to extract maximum value from this data.

The Data Science Pipeline: A Closer Look

Usually, a data science pipeline involves processing data from multiple sources to answer specific questions or support business operations. The journey begins with raw data acquisition, which is then processed into descriptive analytics. This presents a tangible 'story' of what the data says. The story then shifts to predictive analytics, where algorithms and computing power are used to predict future outcomes.

However, to ensure smooth execution and minimize negative outcomes, effective and constant communication is necessary, especially when each sector works in its own silo with specific objectives and goals.

The Role of Humans in Data Science

Even in a world of advanced technology and algorithms, the human factor remains a key tool in cohesive analysis and implementation. The need to keep humanity in the loop is crucial as algorithms and machine learning outcomes can have serious consequences if not properly vetted and validated by SMEs (Subject Matter Experts). This human factor ensures fairness, transparency, and bias mitigation, strengthening the framework of responsible AI.

Fostering a Culture of Collaboration

In an ever-evolving digital landscape, fostering a culture of collaboration in your organization can lead to significant performance improvements. By breaking down silos and bringing together the wide variety of backgrounds and skills, you embrace the collective intellect and drive towards a common goal.

Key Takeaways

  • Embrace the human factor in your data science pipeline.
  • Be aware of possible harmful outcomes from poor communication or siloed working conditions.
  • Include SMEs in validating models and algorithms to prevent bias and ensure transparency.
  • Understand the importance of communication in avoiding potential issues.
  • Collaboration is a powerful tool to maximize potential and achieve common goals.

My hope is that after reading this, you are better equipped to foster effective collaboration in your organization, be it small or large, for more streamlined and productive data science operations. It's a language we all need to speak fluently for future progress.


Video Transcription

Welcome, everyone. So, my quote today, right? Whatever you do in life, be passionate about it, because that's where you can actually optimize results, right? Enjoy the ride. And there is a lot we can do there to ensure we do it right, and for that, collaboration is a key component. So today I'm gonna kind of summarize, let's say, 15 years of professional life, right? I come from an educational background. I started as a professor back in Colombia; I was teaching in the engineering school. I am an electronic engineer. Then I moved to the States and started working at Texas Instruments as a yield engineer, meaning I was collecting millions and millions of rows of data coming from all of these semiconductor steps. And literally, I have been there, I know the behind-the-scenes of this, I have been doing the hands-on work of connecting all of these pieces for us to actually make sense out of it, right? So how are we gonna ensure that we can collaborate, why is it important, and why are we gonna be talking about it today? Right? So we're gonna connect points. I'm gonna share insights that I have been learning. I have been supporting the industry from the analytics perspective for many, many years, right? So after Texas Instruments, I have been on the consulting side, from the software perspective. So I am facing all the time the problems that we are all going through, right?

Trying to extract the most out of this data. So in this slide, I'm using an analogy: I'm gonna invite you to think about silos, but silos from different perspectives, right? When we are working with data, we are used to saying, yes, the data silo, or the team silo, the skill silo. But if you think about it, when you put all of those together, there are a lot of silos out there that are limiting the capacity for collaboration, right? And at the end, we don't have the full picture and the full perspective, and realistically, the more we know about a problem, the more we're gonna be able to do about it. So at the end of these few minutes, I hope you are gonna feel like the happy donkey on the right, with a vision, right? And an idea of a road map of why it's important and what kind of things we're gonna highlight here. So I'm gonna start with a little bit of background, right? Realistically, we have been talking about machine learning and AI for centuries, right? Not even decades, centuries. So I'm gonna highlight a couple of things here. So, the 1700s, right?

We had a very good foundation in all that math that is pretty much supporting all of these algorithms, right? We had Bayes' theorem, probability, non-Euclidean geometries. We have our lovely Ada Lovelace, who is actually considered the first person to write an algorithm for a computer, right? And this is centuries ago. Then we have Alan Turing, right, with the Turing machine, and he was already saving millions of lives back in 1945, right? And if you haven't seen the movie The Imitation Game, it's a phenomenal movie, by the way. So then you keep growing, right, into this concept, and then you have the actual AI term that comes into place after the McCarthy conference in the fifties. And then we push it, right, as humans. OK, how else are we gonna represent what we do on a daily basis? How are we gonna think about the way that we process vision, the way that we process speech, the way that we predict something? Like, what kind of patterns do we have to actually visualize and put together to be one step ahead, right?

So we have a long journey that these guys here are now using to go deeper into these techniques that we have, into deep learning and everything else that's going on there. But at the end, what you have, realistically, right, is what I call the perfect storm. So what is the perfect storm? Well, now we keep collecting data from all over the place, right? Data coming in, structured and unstructured. We have teams of people asking questions of this data in different ways, right? And think about the analogy from the beginning; we're gonna refer to that many times, those silos, right? The blinders. They are thinking about their own perspective, right? Data, data, data. But at the end, somebody is asking questions of this data, and we have beautiful support in better resources from the compute perspective, right? That generates what I call the perfect storm, which is what we continue to have: data from all over the place, structured and unstructured.

Now we have all that math foundation, right, that we can now use on better hardware resources. So we have more compute power, right? And that has been growing exponentially in the last two decades. And then we have better and better algorithms and techniques that we can apply to all the data that is coming from all over the place. But we are missing this key component, right: us, the humans. And that is where we can actually glue everything together in a way that makes sense, in a positive way, right? Because realistically, data, when it is just data without that human component asking questions and questioning, is this actually right, can be seriously harmful.

So I'm going to invite you to think about what is included in a data science pipeline, right? So you think about what needs to happen: you are trying to answer a question. Perhaps you have a business question, or a question based on what you're doing on a daily basis. And that usually is linked to a data source, and sometimes many data sources, right? And then you need to process that and create the data sample that is right for you to start with basic analytics, descriptive analytics, like, show me what this is telling me, right? Then you are gonna push that into a predictive side that is gonna use all these algorithms and all of this compute and everything we have available, where the outcome can be a machine learning model or a prediction or a category, in general an outcome, right, that somebody is gonna consume some way. Well, if we don't communicate, and if we don't talk to each other, and if we don't validate what that data or that model is doing, we may have very negative outcomes, right? This is a real situation. And the thing is, what's happening is that we are working in silos, like, back to the analogy, right?

So we have the horse blinders, we are working in silos, and I am focusing on answering my question. And my question is, OK, I am the data scientist and I'm gonna be creating this objective function, for example, and I'm gonna tell this bot: hey, your goal is to maximize your score, and to maximize your score, you are gonna hit every single green target that you have around. This algorithm is doing exactly what I told it to do, and it's killing it, right? It's going after every single green target that it has around it. What else is going on? Well, it's also killing everything else that is in between, right? So it's going for everything, because achieving common sense with AI is hard. It is hard, and we need to be aware of that, and we need to be aware of the potential bad outcomes that this can generate, right? So when we think about what we have behind the scenes from the technology perspective, there is a lot to consider, right? And you may have a completely different background. You are thinking about your business use case, your vertical, right? So different people have different questions for this data, and they come from different backgrounds and different experiences and different skills. And for some, it's easier to write R code or Python code to transform that data and extract what they need.

But for some others, that is not really where their focus is, and they still have questions for that data, right? So what's going on is that, from the technology perspective, all of these different colors at the end are generating these silos, right, back to the analogy at the beginning. So we are kind of blind a little bit, because there is not a fully integrated system, and realistically there never will be. I mean, we are all targeting different things. Some of us are thinking about cloud and cloud deployment. Some of us are still on on-prem SQLs and Oracles. Some of us are in between, right? There are different situations for everybody here. But at the end, these different colors are these different isolations. So when you think about the people behind this map and all of these colors, think about your question, right? OK, I need to answer this question. So, OK, first things first, we need to connect to this data and try to understand what it is telling us. Then we're gonna try to enrich that data and move it beyond a descriptive stage.

When you are pushing dashboards, or whatever BI tools you are using to generate those visualizations. And then you are actually pushing that into predictive analytics, where perhaps all of these algorithms and all of this compute power comes into place, where we can have better outcomes and become predictive, right? Then you have a machine learning algorithm that at the end is doing what it was told to do. So you still have a disconnection there, and these people are not talking to each other enough, right? Because it's not easy to communicate when you have such a siloed ecosystem, right?

Because it's not integrated. So what are the consequences that we are seeing? Well, there are some outcomes from the machine learning perspective that can be harmful, as we highlighted a couple of slides ago, right? So this communication is not happening in a smooth way.

I am not ensuring... like, a lot of these machine learning algorithms are going to production without even being reviewed by an SME: hey, is this thing doing what it's supposed to be doing, right? Are we considering what we should be considering here? How are we gonna ensure that we can talk to each other, so this thing grows in a natural way, in a way that we can actually ensure that smooth communication? Because today you may be thinking about this use case and that use case, but tomorrow it's gonna be that and that and that, and in the future you're gonna have a lot more demand for these use cases, because we are transitioning into digitizing our businesses, right? Everything is gonna be data, data, data, and it's gonna be even more data, data, data, right? So at the end, how are you going to ensure that you are keeping that human in the loop, that these outcomes from this machine learning are actually within what you think they are, and how are you gonna keep all of these guys able to communicate with each other and to validate what they are doing, right?
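The earlier green-target story can be made concrete with a toy sketch. Everything here is hypothetical (the tiny grid world, the cell labels, the penalty weight); the point is that a naive objective cannot distinguish a careful policy from a destructive one, while an SME-informed objective can.

```python
# Toy grid world for the "hit every green target" story. Cells an agent
# can visit are labeled; anything not in the dict is empty water.
world = {(0, 0): "green", (0, 1): "fish", (0, 2): "green"}

def naive_score(path, world):
    """Naive objective: reward green targets hit; collateral damage is invisible."""
    return sum(1 for cell in path if world.get(cell) == "green")

def sme_score(path, world, penalty=10):
    """Same reward, plus an SME-added penalty for harming non-targets."""
    hits = sum(1 for cell in path if world.get(cell) == "green")
    harm = sum(1 for cell in path if world.get(cell) not in ("green", None))
    return hits - penalty * harm

greedy_path = [(0, 0), (0, 1), (0, 2)]   # plows straight through the fish
careful_path = [(0, 0), (1, 1), (0, 2)]  # detours around it
```

Under the naive objective both paths score identically, so the agent has no reason to take the detour; the SME-informed objective makes the destructive path strictly worse.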

So you are preparing all that data, you are deploying that into a place where somebody has visibility into what is going on. Hey, is that model shifting? Is that model still accurate? Do we need to retrain that thing? Do we need to tell somebody that that thing is shifting, right? Because outcomes are actually affecting people on a daily basis, so we have to be responsible about it. So what happens when we actually talk to each other? What happens when we maximize those inputs from our experts in the industry, right, so we can actually get things working right? So this is a beautiful data science pipeline example that we are all gonna get, because we have all seen the ocean and all the pollution that is going on in the ocean right now. And when we think together about a problem, and we put together the ABCs, collaborate, and, even better, are able to talk the same language from the technology perspective, we can maximize the things that we can do together with this data. So in this example, very simple, you have drones flying around, collecting video and classifying plastic. Is this plastic? Yes, no, yes, no, yes, no. OK. Then you have another group of people, experts on wind and solar generation of power. OK.

Then you have another group of people actually building a mechanical arm that moves in response to those sensor reads from the wind and the solar. Now put all of these people together with passionate individuals who actually want to clean the ocean, let them collaborate, let them talk the same language, and put a technology layer on top of that that is gonna be resilient to change. So it doesn't matter if you are thinking about changing that ecosystem, or changing into cloud, or whatever those data sources are; it doesn't matter. My layer of collaboration and communication with third parties and with the experts, keeping that human in the loop, is gonna maximize my outcomes, right?

So we are working together, we are collaborating, we are able to think about the consequences, about what outcomes this machine learning can generate, right? So when you are thinking about your data strategies, when you are thinking about how I'm gonna put my business into analytics that is not only descriptive but also predictive, how I'm gonna make the most out of my data but in a responsible way, so I'm not generating an algorithm that is gonna be harmful for anybody, right, where we can actually have control over that, where we can have visibility into what is changing and what is not changing. Is that right? Is that what I expected it to be? And that brings me to a list of things that I want to ensure I share with you before the five minutes are gone, because, back to the picture at the beginning, right? So how are we gonna remove those blinders and be aware of the impact that a project can have, when we are able to collaborate, when we are able to keep that human in the loop? So machine learning is phenomenal, AI is phenomenal, big data is awesome, technology is phenomenal, but embrace your humanity, right? Let's be the humans that we are. And what that means is that we need to be aware that we work with people who perhaps bring a completely different set of backgrounds and skills. So remove those silos, right?

And let's work together, and that actually may require a cultural change, but you know, resonance causes change and synergy allows optimization. So be the human, be responsible. So responsible AI itself has to be a framework, a framework where you are allowing all of these people with different backgrounds and different skills to work together, right? Because at the end, that is gonna mitigate issues for you related to bias, to fairness, to transparency. All of that has to be part of your thinking when designing data strategies, right? So data is more than numbers, and things can go wrong, and data is gonna keep changing. You train the model with a subset of data, and that model is not gonna behave the same with the next month's data or the next year's data and so on. You need to keep updating all of that, and you need to be responsible about it, right?

So you need to ensure that you know whether that thing is doing what you think it is doing. Now, be very aware of the quality of the data, because at the end, we are pretty much digitizing ourselves as humans and as a society. So bias in the way that you do things, or outcomes that can be harmful: data quality is a very serious concept, right? And then, having the full picture: remove those blinders, ensure that you have the full picture. Remember the happy donkey, and include that SME to validate that the model is doing what you think it is doing, that it is actually behaving in a way the SME can validate, right? Keeping that human in the loop, notifying whoever needs to be notified if you see a data drift, if you see an outcome that was not expected, right? Be responsible. So thinking about the consequences, incorporating responsible AI and transparent AI, are fundamental concepts for your digital transformation strategy, right? And I have seen it over and over and over, for sure. Collectively we are more intelligent, like, no doubt. Let's embrace that humanity, right? Let's share, but in a way that is supported from the technology perspective as well, because this is gonna keep growing exponentially, because data continues to come from all over, right? Now, small wins go a long way.
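Keeping a human in the loop on drift doesn't have to start elaborate; a first pass can be as simple as comparing incoming data against what the model was trained on and notifying someone when it shifts. This is a minimal sketch with an assumed z-score cutoff, not a full monitoring stack.

```python
from statistics import mean, stdev

def drift_alert(train_values, live_values, z_threshold=3.0):
    """Flag drift when the live mean sits more than z_threshold standard
    errors from the training mean. The 3.0 default is an assumed,
    illustrative cutoff, not a universal rule."""
    mu, sigma = mean(train_values), stdev(train_values)
    standard_error = sigma / len(live_values) ** 0.5
    z = abs(mean(live_values) - mu) / standard_error
    return z > z_threshold
```

When this returns True, that is the cue to notify whoever needs to be notified and have an SME decide whether to retrain, exactly as described above.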

So pick the one thing you want to start with, but think about the outcomes, right? So, one day at a time, one thing at a time. And if you think about it, people, your teams, are the best assets we have in our organizations. So, inclusive collaboration: planning data projects together, drafting data collection together, practicing analyzing the data together. Pretty much, let's ensure that we are on the same page and that we all understand the power of collaboration in the data science pipeline. Thank you.

I appreciate that everybody was able to share this time with me. I hope you leave the room, the virtual room, with at least one or two things that you hadn't thought about before and now you're thinking about, or that I was able to share a couple of insights that are valuable for you. I appreciate your time today. Good luck, and feel free to send me a message if you wanna discuss a little further. Thank you, everybody.