by Prasad Bashapakala • 10 Jun, 2020
Companies all over the world, across a wide variety of industries, have been going through what people are calling a digital transformation. That is, businesses are taking traditional business processes such as hiring, marketing, pricing, and strategy, and using digital technologies to make them 10 times better.

Data Science has become an integral part of those transformations. With Data Science, organizations no longer have to make their important decisions based on hunches, best guesses, or small surveys. Instead, they’re analyzing large amounts of real data and basing their decisions on data-driven facts. That’s really what Data Science is all about: creating value through data.

This trend of integrating data into core business processes has grown significantly, with interest increasing more than fourfold over the past 5 years according to Google Search Trends. Data is giving companies a sharp advantage over their competitors. With more data and better Data Scientists to use it, companies can acquire information about the market that their competitors might not even know existed. It’s become a game of data or perish.

[Figure: Google search popularity of “Data Science” over the past 5 years. Generated by Google Trends.]

In today’s ever-evolving digital world, staying ahead of the competition requires constant innovation. Patents have gone out of style, while Agile methodology and catching new trends quickly are very much in. Organizations can no longer rely on their rock-solid methods of old. If a new trend like Data Science, Artificial Intelligence, or Blockchain comes along, it needs to be anticipated beforehand and adopted quickly.

The following are the 4 hottest Data Science trends for the year 2020. These are trends that have gathered increasing interest this year and will continue to grow in 2020.

(1) Automated Data Science

Even in today’s digital age, Data Science still requires a lot of manual work.
It involves storing data, cleaning data, visualizing and exploring data, and finally modeling data to get some actual results. That manual work is just begging for automation, which explains the rise of automated Data Science and Machine Learning. Nearly every step of the Data Science pipeline has been automated or is in the process of being automated.

Automated data cleaning has been heavily researched over the past few years. Cleaning big data often takes up most of a Data Scientist’s expensive time. Both startups and large companies such as IBM offer automation and tooling for data cleaning.

Another large part of Data Science, feature engineering, has undergone significant disruption. Featuretools offers a solution for automatic feature engineering. On top of that, modern Deep Learning techniques such as Convolutional and Recurrent Neural Networks learn their own features without the need for manual feature design.

Perhaps the most significant automation is occurring in the Machine Learning space. Both DataRobot and H2O have established themselves in the industry by offering end-to-end Machine Learning platforms, giving Data Scientists a very easy handle on data management and model building. AutoML, a method for automatic model design and training, also boomed over 2019 as these automated models surpassed the state of the art. Google, in particular, is investing heavily in Cloud AutoML.

In general, companies are investing heavily in building and buying tools and services for automated Data Science: anything to make the process cheaper and easier. At the same time, this automation also caters to smaller and less technical organizations, which can use these tools and services to get access to Data Science without building out their own team.

(2) Data Privacy and Security

Privacy and security are always sensitive topics in technology. All companies want to move fast and innovate, but losing the trust of their customers over privacy or security issues can be fatal.
So, they’re forced to make it a priority, at least to the bare minimum of not leaking private data.

Data privacy and security have become an incredibly hot topic over the past year as the issues are magnified by enormous public hacks. Just recently, on November 22, 2019, an exposed server with no security was discovered on Google Cloud. The server contained the personal information of 1.2 billion unique people, including names, email addresses, phone numbers, and LinkedIn and Facebook profile information. Even the FBI came in to investigate. It’s one of the largest data exposures of all time.

How did the data get there? Who does it belong to? Who is responsible for the security of that data? It was on a Google Cloud server, which really anyone could have created.

We can rest assured that the whole world won’t be taking down their LinkedIn and Facebook accounts after reading the news, but it does raise some eyebrows. Consumers are becoming more and more careful about whom they give their email address and phone number to.

A company that can guarantee the privacy and security of its customers’ data will find it far easier to convince customers to give it more data (by continuing to use its products and services). It also ensures that, should its government enact any laws requiring security protocols for customer data, it is already well prepared. Many companies are opting for SOC 2 compliance to have some proof of the strength of their security.

The entire Data Science process is fueled by data, but most of it isn’t anonymous. In the wrong hands, that data could be used to fuel global catastrophes and upset everyday people’s privacy and livelihood. Data isn’t just raw numbers; it represents and describes real people and real things.

As we see Data Science evolve, we’ll also see the transformation of the privacy and security protocols surrounding data.
That includes processes, laws, and different methods of establishing and maintaining the safety, security, and integrity of data. It won’t be a surprise if cybersecurity becomes the new buzzword of the year.

(3) Super-sized Data Science in the Cloud

Over the years that Data Science has grown from a niche into its own full-on field, the data available for analysis has also exploded in size. Organizations are collecting and storing more data than ever before.

The volume of data that a typical Fortune 500 company might need to analyze has gone far past what a personal computer can handle. A decent PC might have something like 64 GB of RAM with an 8-core CPU and 4 TB of storage. That works just fine for personal projects, but not so well when you work for a global company, such as a bank or retailer, that has data covering millions of customers.

That’s where cloud computing enters the field. Cloud computing offers the ability for anyone anywhere to access practically limitless processing power. Cloud vendors such as Amazon Web Services (AWS) offer servers with up to 96 virtual CPU cores and up to 768 GB of RAM. These servers can be set up in an autoscaling group where hundreds of them can be launched or stopped without much delay: computing power on demand.

[Figure: A Google Cloud data center]

Beyond just compute, cloud computing companies are also offering full-fledged platforms for Data Analytics. Google Cloud offers a platform called BigQuery, a serverless and scalable data warehouse that gives Data Scientists the ability to store and analyze petabytes of data, all in a single platform. BigQuery can also be connected to other GCP services for Data Science: Cloud Dataflow to create data streaming pipelines, Cloud Dataproc to run Hadoop or Apache Spark on the data, or BigQuery ML to build Machine Learning models on the huge datasets.

Everything from data to processing power is growing.
As Data Science matures, we might eventually see Data Science being done purely on the cloud due to the sheer volume of the data.

(4) Natural Language Processing

Natural Language Processing (NLP) has made its way firmly into Data Science after huge breakthroughs in Deep Learning research.

Data Science first began as an analysis of purely raw numbers, since numbers were the easiest kind of data to handle and collect in spreadsheets. If you needed to process any kind of text, it would usually need to be categorized or somehow converted into numbers.

Yet it’s quite challenging to compress a paragraph of text into a single number. Natural language and text contain so much rich data and information. We used to miss out on it because we lacked the ability to represent that information as numbers.

Huge advancements in NLP through Deep Learning are fueling the full-on integration of NLP into our regular Data Analysis. Neural Networks can now extract information from large bodies of text incredibly quickly. They’re able to classify text into different categories, determine the sentiment of text, and analyze the similarity of text data. In the end, all of that information can be stored in a single feature vector of numbers.

As a result, NLP becomes a powerful tool in Data Science. Huge datastores of text, not just one-word answers but full-on paragraphs, can be transformed into numerical data for standard analysis. We’re now able to explore datasets that are far more complex.

For example, imagine a news website that wants to see which topics are gaining more views. Without advanced NLP, all one could go on would be the keywords, or maybe just a hunch as to why a particular title worked well versus another. With today’s NLP, we’d be able to quantify the text on the website, comparing entire paragraphs of text or even webpages to gain much more comprehensive insights.
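
To make the idea of turning text into a feature vector concrete, here is a minimal sketch in plain Python. It uses TF-IDF weighting and cosine similarity, a classic technique swapped in for illustration, rather than the Deep Learning models discussed above. The three headlines, the function names, and the smoothed IDF formula are all assumptions made for this example, not something taken from a real news site or library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep runs of letters as tokens.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    # Build one numeric feature vector per document over a shared vocabulary.
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for toks in tokenized for w in set(toks))
    # Smoothed inverse document frequency (an assumed, common variant).
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1.0 for w in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[w] * idf[w] for w in vocab])
    return vocab, vectors


def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical headlines standing in for articles on a news website.
articles = [
    "Cloud computing costs fall as data centers scale up",
    "New data centers bring cheaper cloud computing to startups",
    "Local football team wins the championship final",
]

vocab, vecs = tfidf_vectors(articles)
sim_cloud = cosine(vecs[0], vecs[1])  # two cloud-related headlines
sim_sport = cosine(vecs[0], vecs[2])  # cloud headline vs. sports headline

print(f"cloud vs. cloud: {sim_cloud:.2f}")
print(f"cloud vs. sport: {sim_sport:.2f}")
```

The two cloud headlines share several words, so their vectors score as more similar than the cloud and sports pair, which share none. A neural model would likely capture meaning beyond exact word overlap, but the overall workflow (text in, vector out, then standard numerical analysis) would look much the same.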
For a technical overview of the most important advancements in NLP over the past few years, you can check out the guide by Victor Sanh.

Data Science as a whole is growing. As its capabilities grow, it’s embedding itself into every industry, both technical and non-technical, and every business, both small and large. As the field evolves over the long term, it wouldn’t be a surprise to see it democratized at a large scale, becoming available to many more people as a tool in our software toolbox.