Advanced Analytics with Spark, Second Edition: Free PDF Download

If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you will find the book's patterns useful for working on your own data applications (a PySpark-focused counterpart, Advanced Analytics with PySpark, is also available). The amount of data being generated today is staggering, and growing. Apache Spark has emerged as the de facto tool to analyze big data and is now a critical part of the data science toolbox.

In the chapter on recommending music with the Audioscrobbler data set, the artist data must first be parsed. A small number of the lines appear to be corrupted. These lines cause a NumberFormatException when parsed, and ideally they would not map to anything at all.

Option represents a value that might only optionally exist; it is like a simple collection of 1 or 0 values, corresponding to its Some and None subclasses. The artist alias file contains two IDs per line, separated by a tab, and is relatively small.
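A minimal sketch of this parsing pattern, assuming the raw files have been read as Dataset[String] values; the file paths below are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // Paths are placeholders; point them at wherever the data set was copied.
    val rawArtistData = spark.read.textFile("hdfs:///user/ds/artist_data.txt")
    val rawArtistAlias = spark.read.textFile("hdfs:///user/ds/artist_alias.txt")

    // Parse "ID<tab>name" lines; corrupted lines map to None, i.e., to nothing at all.
    val artistByID = rawArtistData.flatMap { line =>
      val (id, name) = line.span(_ != '\t')
      if (name.isEmpty) {
        None
      } else {
        try {
          Some((id.toInt, name.trim))
        } catch {
          case _: NumberFormatException => None
        }
      }
    }.toDF("id", "name")

    // The alias file (two tab-separated IDs per line) becomes a Map of
    // artist ID -> canonical artist ID, small enough to collect to the driver.
    val artistAlias = rawArtistAlias.flatMap { line =>
      val Array(artist, alias) = line.split('\t')
      if (artist.isEmpty) None else Some((artist.toInt, alias.toInt))
    }.collect().toMap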

The aliases data set should be applied to convert all artist IDs to a canonical ID, if a different canonical ID exists. A helper function is defined to do this, for later reuse. The artistAlias mapping created earlier could be referenced directly in a map function even though it is a local Map on the driver. However, it is not tiny, consuming about 15 megabytes in memory and at least several megabytes in serialized form. Because many tasks execute in one JVM, it is wasteful to send and store so many copies of the data.

Instead, we create a broadcast variable called bArtistAlias for artistAlias. This makes Spark send and hold in memory just one copy for each executor in the cluster. When there are thousands of tasks and many execute in parallel on each executor, this can save significant network traffic and memory.

Broadcast Variables

When Spark runs a stage, it creates a binary representation of all the information needed to run tasks in that stage; this is called the closure of the function that needs to be executed.

This closure includes all the data structures on the driver referenced in the function. Spark distributes it with every task that is sent to an executor on the cluster.

Broadcast variables are useful when many tasks need access to the same immutable data structure. DataFrame joins can take advantage of the same idea: when joining a large table with a small one, it is sometimes advantageous to simply broadcast the small table to every executor. This is called a broadcast hash join.
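A sketch of creating and using the broadcast variable while building the play-count DataFrame; rawUserArtistData and the space-separated line format are assumptions carried over from the chapter's data set:

    import org.apache.spark.broadcast.Broadcast
    import org.apache.spark.sql.{DataFrame, Dataset}

    // Assumes `spark` is the active SparkSession, spark.implicits._ is imported,
    // and artistAlias is the Map[Int, Int] built from the alias file.
    val bArtistAlias: Broadcast[Map[Int, Int]] = spark.sparkContext.broadcast(artistAlias)

    def buildCounts(rawUserArtistData: Dataset[String],
                    bArtistAlias: Broadcast[Map[Int, Int]]): DataFrame = {
      rawUserArtistData.map { line =>
        val Array(userID, artistID, count) = line.split(' ').map(_.toInt)
        // Substitute the canonical artist ID, if one exists.
        val finalArtistID = bArtistAlias.value.getOrElse(artistID, artistID)
        (userID, finalArtistID, count)
      }.toDF("user", "artist", "count")
    }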

The call to cache suggests to Spark that this DataFrame should be temporarily stored after being computed, and furthermore, kept in memory in the cluster. Without this, the DataFrame could be repeatedly recomputed from the original data each time it is accessed!
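For example, reusing the hypothetical buildCounts helper sketched above:

    // Cache the prepared counts so that repeated passes over them (such as the
    // iterations ALS makes) do not re-read and re-parse the original text file.
    val trainData = buildCounts(rawUserArtistData, bArtistAlias)
    trainData.cache()
    trainData.count()        // an action that materializes the cached data
    // ...build and evaluate models against trainData...
    // trainData.unpersist() // release the cached copy when it is no longer needed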

This one consumes about 120 MB across the cluster, which is surprisingly small. Given that about 24 million plays are stored here, a quick back-of-the-envelope calculation suggests that each user-artist-count entry consumes only about 5 bytes on average. However, the three 32-bit integers alone ought to consume 12 bytes.

This is one of the advantages of a DataFrame. Because the types of data stored are primitive 32-bit integers, their representation can be optimized in memory internally. Finally, we can build a model with ALS from org.apache.spark.ml.recommendation. The operation will likely take minutes or more, depending on your cluster. Compared to some machine learning models, whose final form may consist of just a few parameters or coefficients, this type of model is huge.

It contains a feature vector of 10 values for each user and product in the model, and in this case there are more than a million of them. The model contains these large user-feature and product-feature matrices as DataFrames of their own. The values in your results may be somewhat different, because the final model depends on a randomly chosen initial set of feature vectors. The default behavior of this and other components in MLlib, however, is to use the same set of random choices every time by defaulting to a fixed seed.

This is unlike other libraries, where the behavior of random elements is typically not fixed by default. So, here and elsewhere, a random seed is set with setSeed(Random.nextLong()). To see some feature vectors, display just one row of the user-feature DataFrame without truncating the wide display of the feature vector, as in the sketch below. The other methods invoked on ALS, like setAlpha, set hyperparameters whose values can affect the quality of the recommendations that the model makes.
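A sketch of the model-building step, using the user, artist, and count columns prepared earlier; the hyperparameter values are the initial, untuned ones discussed below:

    import org.apache.spark.ml.recommendation.ALS
    import scala.util.Random

    val model = new ALS().
      setSeed(Random.nextLong()).      // randomize explicitly instead of relying on the fixed default seed
      setImplicitPrefs(true).          // treat counts as implicit feedback, not ratings
      setRank(10).
      setRegParam(0.01).
      setAlpha(1.0).
      setMaxIter(5).
      setUserCol("user").
      setItemCol("artist").
      setRatingCol("count").
      setPredictionCol("prediction").
      fit(trainData)

    // Show one row of the user-feature matrix without truncating the wide vector.
    model.userFactors.show(1, truncate = false)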

These will be explained later. The more important first question is: is the model any good? Does it produce good recommendations? Take, for example, one particular user. Extract the IDs of artists that this user has listened to, collect them to the driver as a set of Int artist IDs, and print their names. The artists look like a mix of mainstream pop and hip-hop. A Jurassic 5 fan? The bad news is that, surprisingly, ALSModel does not have a method that directly computes top recommendations for a user.

Instead, we can score all artists for this user and return the few with the highest scores, as in the sketch below. Note that this method does not bother to filter out the IDs of artists the user has already listened to. Interpreting the artists a user has played as good recommendations is a problematic assumption, but it is about the best that can be done without any other data. For example, presumably this user likes many more artists than the 5 listed previously, and among the million-plus other artists, some are of more interest than the rest. What if a recommender were evaluated on its ability to rank good artists high in a list of recommendations?
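One way to do it by hand, as a sketch; the helper name makeRecommendations is ours, not part of the Spark API:

    import org.apache.spark.ml.recommendation.ALSModel
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, lit}

    def makeRecommendations(model: ALSModel, userID: Int, howMany: Int): DataFrame = {
      // Pair every known artist ID with this one user, score them all,
      // and keep the artists with the highest predicted scores.
      val toRecommend = model.itemFactors.
        select(col("id").as("artist")).
        withColumn("user", lit(userID))
      model.transform(toRecommend).
        select("artist", "prediction").
        orderBy(col("prediction").desc).
        limit(howMany)
    }

    // Example: makeRecommendations(model, someUserID, 5).show()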

This is one of several generic metrics that can be applied to a system that ranks things, like a recommender. To make this meaningful, some of the artist play data can be set aside and hidden from the ALS model-building process. Then, this held-out data can be interpreted as a collection of good recommendations for each user, ones that the recommender has not already been given.

The recommender is asked to rank all items in the model, and the ranks of the held-out artists are examined. Ideally, the recommender places all of them at or near the top of the list. In practice, we compute this by examining only a sample of all such pairs, because a potentially huge number of such pairs may exist.

The fraction of pairs where the held-out artist is ranked higher is its score. A score of 1.0 means the held-out artists were always ranked above the others, which is perfect, while 0.5 is what a random ordering would achieve. This metric is directly related to an information retrieval concept called the receiver operating characteristic (ROC) curve; the metric here equals the area under that curve, commonly known as AUC.
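To make the pair-counting idea concrete, here is a tiny driver-side sketch for a single user; the chapter's real implementation does this at scale across all users with DataFrames:

    // Scores the model gave to one user's held-out ("good") artists, and to a
    // sample of other artists. AUC is the fraction of (good, other) pairs in
    // which the good artist receives the higher score (ties count as misses here).
    def estimateAUC(heldOutScores: Seq[Double], otherScores: Seq[Double]): Double = {
      val comparisons = for {
        good  <- heldOutScores
        other <- otherScores
      } yield if (good > other) 1.0 else 0.0
      comparisons.sum / comparisons.size
    }

    // estimateAUC(Seq(0.9, 0.7), Seq(0.8, 0.2, 0.1)) == 5.0 / 6.0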

The AUC metric is also used in the evaluation of classifiers. Other related metrics for ranking systems include precision, recall, and mean average precision (MAP). MAP is also frequently used and focuses more narrowly on the quality of the top recommendations. Typically, data is divided into three subsets: training, cross-validation (CV), and test sets.

For now, a training set and a CV set will be sufficient to choose a model. In Chapter 4, this idea will be extended to include the test set. To use this approach, we must split the input data into a training set and a CV set.

The ALS model will be trained on the training data set only, and the CV set will be used to evaluate the model. The distinct artist IDs (duplicates removed) are also collected to the driver, as in the sketch below. Note that areaUnderCurve accepts a function as its third argument. Here, the transform method from ALSModel is passed in, but it will shortly be swapped out for an alternative.
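A sketch of the split, under the assumption that allData is the full user-artist-count DataFrame and that an areaUnderCurve helper like the one described here is defined:

    // Assumes `spark` is the active SparkSession; spark.implicits._ is needed for .as[Int].
    import spark.implicits._

    // Hold out 10% of the data for evaluation; cache both sides for repeated use.
    val Array(trainData, cvData) = allData.randomSplit(Array(0.9, 0.1))
    trainData.cache()
    cvData.cache()

    // Distinct artist IDs, collected to the driver and broadcast for the
    // areaUnderCurve helper described in the text.
    val allArtistIDs = allData.select("artist").as[Int].distinct().collect()
    val bAllArtistIDs = spark.sparkContext.broadcast(allArtistIDs)

    // val auc = areaUnderCurve(cvData, bAllArtistIDs, model.transform)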

The result is an AUC well above what random guessing would produce. Is this good? It is certainly higher than the 0.5 achieved by ranking artists randomly, and generally, the closer the AUC is to 1.0, the better. But is it an accurate evaluation? In fact, one common practice is to divide the data into k subsets of similar size, use k - 1 subsets together for training, and evaluate on the remaining subset. We can repeat this k times, using a different set of subsets each time.

This is called k-fold cross-validation. The validation API will be revisited in Chapter 4. It is useful to benchmark the model against a simpler approach; for example, consider recommending to every user the artists that are most listened to overall. This is not personalized, but it is simple and may be effective. Defining this benchmark as a function and calling it with its first argument, the training data, creates a partially applied function, which itself takes an argument (allData) in order to return predictions. The result of predictMostListened(trainData) is a function, as in the sketch below.
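A sketch of such a benchmark in curried form; the function name follows the text, but the body is a best-effort reconstruction rather than the book's exact code:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{coalesce, col, lit, sum}

    def predictMostListened(train: DataFrame)(allData: DataFrame): DataFrame = {
      // The "prediction" for every artist is simply its total play count in training.
      val listenCounts = train.
        groupBy("artist").
        agg(sum("count").as("prediction")).
        select("artist", "prediction")
      // Artists never seen in training receive a prediction of 0.
      allData.
        join(listenCounts, Seq("artist"), "left_outer").
        select(col("user"), col("artist"),
               coalesce(col("prediction"), lit(0.0)).as("prediction"))
    }

    // Supplying only the first argument yields a DataFrame => DataFrame function:
    // val auc = areaUnderCurve(cvData, bAllArtistIDs, predictMostListened(trainData))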

This suggests that nonpersonalized recommendations are already fairly effective according to this metric. Clearly, the model needs some tuning. Can it be made better?

Hyperparameter Selection

So far, the hyperparameter values used to build the ALSModel were simply given without comment. They are not learned by the algorithm and must be chosen by the caller. For example, the value set with setRank is the number of latent features in the model, or equivalently, the number of columns in the user-feature and product-feature matrices; in nontrivial cases, this is also their rank. Another, setMaxIter, controls the number of iterations that the factorization runs.

More iterations take more time but may produce a better factorization. Unlike the feature matrices, these hyperparameters are not learned by the model; they are parameters of the building process itself. The values used in the preceding list are not necessarily optimal. Choosing good hyperparameter values is a common problem in machine learning. The most basic way to choose values is to simply try combinations of values, evaluate a metric for each of them, and choose the combination that produces the best value of the metric.

These values are still something of a guess, but are chosen to cover a broad range of parameter values.

In the sketch below, each model's resources are freed immediately after its AUC is computed, and the results are sorted by the first value (AUC), descending, and printed. The for syntax here is a way to write nested loops in Scala: it is like a loop over alpha, inside a loop over regParam, inside a loop over rank. The differences are small in absolute terms, but they are still somewhat significant for AUC values. Interestingly, the parameter alpha seems consistently better at 40 than 1.
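A sketch of the grid search; trainData, cvData, bAllArtistIDs, and areaUnderCurve are assumed from earlier, and the candidate values are illustrative:

    import org.apache.spark.ml.recommendation.ALS
    import scala.util.Random

    // Try a small grid of hyperparameter combinations; these candidate values
    // are illustrative guesses covering a broad range.
    val evaluations =
      for (rank     <- Seq(5, 30);
           regParam <- Seq(1.0, 0.0001);
           alpha    <- Seq(1.0, 40.0))
      yield {
        val model = new ALS().
          setSeed(Random.nextLong()).
          setImplicitPrefs(true).
          setRank(rank).setRegParam(regParam).setAlpha(alpha).setMaxIter(20).
          setUserCol("user").setItemCol("artist").
          setRatingCol("count").setPredictionCol("prediction").
          fit(trainData)

        val auc = areaUnderCurve(cvData, bAllArtistIDs, model.transform)

        // Free up model resources immediately.
        model.userFactors.unpersist()
        model.itemFactors.unpersist()

        (auc, (rank, regParam, alpha))
      }

    // Sort by the first value (AUC), descending, and print.
    evaluations.sorted.reverse.foreach(println)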

For the curious, 40 was a value proposed as a default in one of the original ALS papers mentioned earlier. A higher regParam looks better too. This suggests the model is somewhat susceptible to overfitting, and so needs a higher regParam to resist trying to fit the sparse input given from each user too exactly.

As expected, 5 features is pretty low for a model of this size, and underperforms the model that uses 30 features to explain tastes. Of course, this process can be repeated for different ranges of values or more values.

It is a brute-force means of choosing hyperparameters. However, in a world where clusters with terabytes of memory and hundreds of cores are not uncommon, and with frameworks like Spark that can exploit parallelism and memory for speed, it becomes quite feasible.

It is not strictly required to understand what the hyperparameters mean, although it is helpful to know what normal ranges of values are like in order to start the search over a parameter space that is neither too large nor too tiny.

Making Recommendations

Proceeding for the moment with the best set of hyperparameters, what does the new model recommend for the example user? Interestingly, querying the original data set reveals that one of the recommended artists occurs a remarkable number of times, putting it nearly at the top of the overall play counts!

It is an example of how the practice of data science is often iterative, with discoveries about the data occurring at every stage. This model can be used to make recommendations for all users.

This could be useful in a batch process that recomputes a model and recommendations for users every hour or even less, depending on the size of the data and speed of the cluster. It is possible to recommend to one user at a time, as shown above, although each will launch a short-lived distributed job that takes a few seconds.

This may be suitable for rapidly recomputing recommendations for small groups of users: copy some distinct user IDs to the driver and make recommendations for each in turn, as in the sketch below. Here, the recommendations are just printed. They could just as easily be written to an external store like HBase, which provides fast lookup at runtime.
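A sketch, reusing the makeRecommendations helper sketched earlier and assuming allData and spark are in scope:

    // Assumes `spark` is the active SparkSession; .as[Int] needs spark.implicits._.
    import spark.implicits._

    // Copy some distinct users to the driver and compute top-5 recommendations
    // for each, using the makeRecommendations helper sketched earlier.
    val someUsers = allData.select("user").as[Int].distinct().take(100)

    someUsers.foreach { userID =>
      val recs = makeRecommendations(model, userID, 5)
      val recommendedArtists = recs.select("artist").as[Int].collect().mkString(", ")
      println(s"$userID -> $recommendedArtists")
    }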

Interestingly, this entire process could also be used to recommend users to artists. For example, a quick analysis of play counts reveals that one user played a single artist an astonishing number of times! That artist is the implausibly successful alterna-metal band System of a Down, who turned up earlier in recommendations. The play count must be spam or a data error, and it is another example of the types of real-world data problems that a production system would have to address. ALS is not the only possible recommender algorithm, but at this time, it is the only one supported by Spark MLlib.

ALS also supports a non-implicit variant, enabled with setImplicitPrefs(false). This is appropriate when data is rating-like rather than count-like; for example, when the data set is user ratings of artists on a 1-5 scale. The prediction column returned from ALSModel.transform is then an estimated rating, and in this case the simple RMSE (root mean squared error) metric is appropriate for evaluating the recommender, as in the sketch below. Later, other recommender algorithms may be available in Spark MLlib or other libraries.
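A sketch of the rating-style variant, with hypothetical trainRatings and cvRatings DataFrames of user, artist, and rating columns:

    import org.apache.spark.ml.evaluation.RegressionEvaluator
    import org.apache.spark.ml.recommendation.ALS

    val explicitModel = new ALS().
      setImplicitPrefs(false).          // treat the values as explicit ratings
      setUserCol("user").setItemCol("artist").
      setRatingCol("rating").setPredictionCol("prediction").
      fit(trainRatings)                 // hypothetical DataFrame of user, artist, rating

    // Drop NaN predictions produced for users or artists absent from training.
    val scored = explicitModel.transform(cvRatings).na.drop()

    val rmse = new RegressionEvaluator().
      setMetricName("rmse").
      setLabelCol("rating").
      setPredictionCol("prediction").
      evaluate(scored)

    println(s"RMSE = $rmse")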

In production, recommender engines often need to make recommendations in real time, because they are used in contexts like ecommerce sites where recommendations are requested frequently as customers browse product pages. It would be nicer to compute recommendations on the fly, as needed, but scoring with an ALS model means working with its large factor matrices, which is why each request launches a short distributed job. This is not true of other models, which afford much faster scoring. Projects like Oryx 2 attempt to implement real-time on-demand recommendations with libraries like MLlib by efficiently accessing the model data in memory.

The nineteenth-century scientist Sir Francis Galton found that large peas and people had larger-than-average offspring. However, the offspring were, on average, smaller than their parents. In terms of people: the child of a seven-foot-tall basketball player is likely to be taller than the global average, but still more likely than not to be less than seven feet tall.

As almost a side effect of his study, Galton plotted child versus parent size and noticed there was a roughly linear relationship between the two. Although maybe not perceived this way at the time, this line was, to me, an early example of a predictive model. The line links the two values and implies that the value of one suggests a lot about the value of the other. Today's regression and classification techniques do the same sort of thing, and to do so, both require a body of inputs and outputs to learn from. They need to be fed both questions and known answers.

For this reason, they are known as types of supervised learning. Classification and regression are the oldest and most well-studied types of predictive analytics.

Recommenders, the topic of Chapter 3, were comparatively intuitive to introduce, but they are also a relatively recent and separate subtopic within machine learning. The exciting thing about these algorithms is that, with all due respect to Mr. Galton, they can learn relationships far more sophisticated than a single line.

Vectors and Features

To explain the choice of the data set and algorithm featured in this chapter, and to begin to explain how regression and classification operate, it is necessary to briefly define the terms that describe their input and output.

Consider, for example, the problem of predicting tomorrow's high temperature given features like today's high and low temperatures, today's average humidity, the type of weather (clear, cloudy, or rainy), and the number of forecasters predicting a cold snap tomorrow. Each of these features can be quantified. The number of forecasters is, of course, an integer count. These features are not all of the same type.

The first two features are measured in degrees Celsius, but the third is unitless, a fraction. The fourth is not a number at all, and the fifth is a number that is always a nonnegative integer.

For purposes of discussion, this book will talk about features in two broad groups only: categorical features and numeric features. In this context, numeric features are those that can be quantified by a number and have a meaningful ordering. All of the features mentioned previously are numeric, except the weather type.

Terms like clear are not numbers, and have no ordering. It is meaningless to say that cloudy is larger than clear. This is a categorical feature, which instead takes on one of several discrete values.

Training Examples

A learning algorithm needs to train on data in order to make predictions. It requires a large number of inputs, and known correct outputs, from historical data. A feature vector provides an organized way to describe the input to a learning algorithm. The output, or target, of the prediction can also be thought of as a feature; here, it is a numeric feature. The entire training example can therefore be thought of as a feature vector together with its target, and the collection of all of these examples is known as the training set.

Note that regression problems are just those where the target is a numeric feature, and classification problems are those where the target is categorical. Decision Trees and Forests It turns out that the family of algorithms known as decision trees can naturally handle both categorical and numeric features. Building a single tree can be done in parallel, and many trees can be built in parallel at once. Decision tree—based algorithms have the further advantage of being comparatively intuitive to understand and reason about.

For example, I sit down to have morning coffee with milk. Before I commit to that milk and add it to my brew, I want to predict: is the milk spoiled? I might check if the use-by date has passed. Otherwise, I sniff the milk. If it smells funny, I predict yes, and otherwise no. Each decision leads to one of two results, which is either a prediction or another decision, as shown in the figure below. In this sense, it is natural to think of the process as a tree of decisions, where each internal node in the tree is a decision, and each leaf node is a final answer.

[Figure: Decision tree: is it spoiled?]

The preceding rules were ones I learned to apply intuitively over years of bachelor life; they seemed like rules that were both simple and also usefully differentiated cases of spoiled and nonspoiled milk. These are also properties of a good decision tree. That is a simplistic decision tree, and was not built with any rigor. To elaborate, consider another example. A robot has taken a job in an exotic pet store. It wants to learn, before the shop opens, which animals in the shop would make a good pet for a child.

The robot compiles information about each animal by examination: its name, weight, color, and other attributes, along with whether it makes a good pet.

[Table: the robot's observations, including animals such as Fido and Slither]

A simple first rule might predict a good pet based on a weight threshold alone. This rule predicts the correct value in five of nine cases. A quick glance suggests that we could improve the rule by lowering the weight threshold.

This gets six of nine examples correct. The heavy animals are now predicted correctly; the lighter animals are only partly correct. So, a second decision can be constructed to further refine the prediction for the lighter examples, those below the weight threshold.

It would be good to pick a feature that changes some of the incorrect Yes predictions to No. For example, there is one small green animal, sounding suspiciously like a snake, that the robot could predict correctly by deciding on color. Of course, decision rules could be added until all nine examples were correctly predicted.

The resulting decisions would likely be so specific to the nine example animals that the tree would not generalize well to new ones. Some balance is needed to avoid this phenomenon, known as overfitting. This is enough of an introduction to decision trees for us to begin using them with Spark. The remainder of the chapter will explore how to pick decision rules, how to know when to stop, and how to gain accuracy by creating a forest of trees.

Covtype Data Set

The data set used in this chapter is the well-known Covtype data set, available online as a compressed CSV-format data file, covtype.data.gz, with an accompanying info file, covtype.info. The data set records the types of forest-covering parcels of land in Colorado, USA.

The forest cover type is to be predicted from the rest of the features, of which there are 54 in total. This data set has been used in research and even a Kaggle competition.

Preparing the Data

Thankfully, the data is already in a simple CSV format and does not require much cleansing or other preparation to be used with Spark MLlib. Later, it will be of interest to explore some transformations of the data, but it can be used as is to start. The covtype.data file should first be decompressed and copied to storage that Spark can read, such as HDFS.

If you have the memory, specify --driver-memory 8g or similar when launching the Spark shell. The column names are given in the companion file, covtype.info. The CSV read, sketched below, also requests that the type of each column be inferred by examining the data.
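A sketch of the read, where the file path is an assumption and the column names follow the covtype.info description:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // The path is an assumption; point it at wherever covtype.data was copied.
    val dataWithoutHeader = spark.read.
      option("inferSchema", true).
      option("header", false).
      csv("hdfs:///user/ds/covtype.data")

    // Column names per covtype.info: 10 numeric features, 4 one-hot wilderness-area
    // columns, 40 one-hot soil-type columns, and the Cover_Type target.
    val colNames = Seq(
      "Elevation", "Aspect", "Slope",
      "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology",
      "Horizontal_Distance_To_Roadways",
      "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm",
      "Horizontal_Distance_To_Fire_Points"
    ) ++ (0 until 4).map(i => s"Wilderness_Area_$i") ++
      (0 until 40).map(i => s"Soil_Type_$i") ++
      Seq("Cover_Type")

    val data = dataWithoutHeader.toDF(colNames: _*).
      withColumn("Cover_Type", $"Cover_Type".cast("double"))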

It correctly infers that all of the columns are numbers, and more specifically, integers. Several of these columns are actually a one-hot (or 1-of-N) encoding of a single categorical feature: the categorical value is represented by N separate numeric columns, where exactly one of the N values is 1 and the others are 0. For example, a categorical feature for weather that can be cloudy, rainy, or clear would become three numeric features, where cloudy is represented by 1,0,0; rainy by 0,1,0; and so on. Another possible encoding simply assigns a distinct numeric value to each possible value of the categorical feature.

For example, cloudy may become 1.0, rainy 2.0, and so on. The original categorical values have no ordering, but when encoded as a number, they appear to. Treating the encoded feature as numeric leads to meaningless results because the algorithm is effectively pretending that rainy is somehow greater than, and two times larger than, cloudy.
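To make the two encodings concrete, here is a tiny illustration in plain Scala; the particular numeric codes are just for demonstration:

    // Two encodings of the same categorical "weather" feature.
    // One-hot / 1-of-N: each possible value becomes its own 0/1 column.
    val oneHot: Map[String, Array[Double]] = Map(
      "cloudy" -> Array(1.0, 0.0, 0.0),
      "rainy"  -> Array(0.0, 1.0, 0.0),
      "clear"  -> Array(0.0, 0.0, 1.0)
    )

    // Single numeric code: compact, but it invents an ordering that does not exist,
    // so treating it as a truly numeric feature can mislead the algorithm.
    val numericCode: Map[String, Double] = Map(
      "cloudy" -> 1.0,
      "rainy"  -> 2.0,
      "clear"  -> 3.0
    )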

So we see both types of encodings of categorical features in this data set. For performance reasons, or to match the format expected by the libraries of the day, which were built more for regression problems, data sets often contain data encoded in these ways. This will become apparent later. You can call data.show() to examine some rows of the data. Unlike in Chapter 3, though, there is no way to sanity-check the output by intuition alone: we would have no idea how to make up a new feature description of a new parcel of land in Colorado, or what kind of forest cover to expect from such a parcel.

Instead, we must jump straight to holding out some data for purposes of evaluating the resulting model. Before, the AUC metric was used to assess the agreement between held-out listening data and predictions from recommendations; the principle is the same here, although the metric will differ. The input DataFrame contains many columns, each holding one feature that could be used to predict the target column. Spark MLlib requires all of the inputs to be collected into a single column whose value is a Vector. This class is an abstraction for vectors in the linear algebra sense and contains only numbers.

For most intents and purposes, they work like a simple array of double values (floating-point numbers). Fortunately, the VectorAssembler class from org.apache.spark.ml.feature can do this work, as sketched a little further on.

Its key parameters are the columns to combine into the feature vector and the name of the new column containing the feature vector. The resulting feature vector is stored in a sparse representation: because most of the 54 values are 0, it stores only the nonzero values and their indices. Here, the transformation is just invoked directly, as in the sketch below, which is sufficient to build a first decision tree classifier model.
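A sketch, assuming trainData holds the 54 feature columns plus Cover_Type:

    import org.apache.spark.ml.feature.VectorAssembler

    // Combine every column except the target into a single "featureVector" column.
    val inputCols = trainData.columns.filter(_ != "Cover_Type")

    val assembler = new VectorAssembler().
      setInputCols(inputCols).
      setOutputCol("featureVector")

    val assembledTrainData = assembler.transform(trainData)
    assembledTrainData.select("featureVector").show(truncate = false)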

The classifier used here is DecisionTreeClassifier from org.apache.spark.ml.classification, with scala.util.Random supplying a seed. Because the model will later be used to predict new values of the target, it is given the name of a column to store predictions. The resulting model consists of a series of nested decisions about features, comparing feature values to thresholds. Here, for historical reasons, the features are referred to only by number, not name, unfortunately. Decision trees are able to assess the importance of input features as part of their building process.

That is, they can estimate how much each input feature contributes to making correct predictions. This information is simple to access from the model: pair the importance values (higher is better) with the column names and print them in order from most to least important, as in the sketch below. Elevation seems to dominate as the most important feature; most features are estimated to have virtually no importance when predicting the cover type! For example, it might be interesting to see what the model predicts on the training data and compare its prediction with the known correct cover type.
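A sketch of fitting the tree and inspecting feature importances, reusing assembledTrainData and inputCols from the previous sketch:

    import org.apache.spark.ml.classification.DecisionTreeClassifier
    import scala.util.Random

    val classifier = new DecisionTreeClassifier().
      setSeed(Random.nextLong()).
      setLabelCol("Cover_Type").
      setFeaturesCol("featureVector").
      setPredictionCol("prediction")

    val model = classifier.fit(assembledTrainData)
    println(model.toDebugString)   // the tree's nested decisions, features shown by number

    // Pair each importance (higher is better) with its column name,
    // sorted from most to least important.
    model.featureImportances.toArray.zip(inputCols).
      sorted.reverse.foreach(println)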

Eagle-eyed readers might note that the probability vectors actually have eight values even though there are only seven possible outcomes. The target cover-type values run from 1 to 7, so the vectors also contain a value at index 0, which always shows as probability 0.0 because no example takes that label. Based on this snippet, it looks like the model could use some work. Its predictions look like they are often wrong.

A MulticlassClassificationEvaluator can compute the fraction of predictions that are correct; this is the accuracy of this classifier. It can compute other related measures, like the F1 score. For purposes here, accuracy will be used to evaluate classifiers. Sometimes, however, it can be useful to look at the confusion matrix: a table with a row for each actual value of the target and a column for each predicted value, where each entry counts the examples with that combination. So, the correct predictions are the counts along the diagonal, and the incorrect predictions are everything else. Fortunately, Spark provides support code to compute the confusion matrix, as in the sketch below.
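A sketch of both computations, assuming assembledCVData is a held-out split passed through the same VectorAssembler:

    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
    import org.apache.spark.mllib.evaluation.MulticlassMetrics

    // Assumes `spark` is the active SparkSession; .as[...] needs spark.implicits._.
    import spark.implicits._

    val predictions = model.transform(assembledCVData)

    val evaluator = new MulticlassClassificationEvaluator().
      setLabelCol("Cover_Type").
      setPredictionCol("prediction")

    val accuracy = evaluator.setMetricName("accuracy").evaluate(predictions)
    val f1       = evaluator.setMetricName("f1").evaluate(predictions)
    println(s"accuracy = $accuracy, F1 = $f1")

    // Confusion matrix via the older RDD-based MLlib API, which expects
    // (prediction, label) pairs.
    val predictionAndLabels = predictions.
      select("prediction", "Cover_Type").
      as[(Double, Double)].rdd
    println(new MulticlassMetrics(predictionAndLabels).confusionMatrix)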

Ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications, this book describes techniques that can reduce data infrastructure costs and developer hours.

Who This Book Is For

If you are a developer, engineer, or an architect and want to learn how to use Apache Spark in a web-scale project, then this is the book for you.

It is assumed that you have prior knowledge of SQL querying. Basic programming knowledge of Scala, Java, R, or Python is all you need to get started with this book. Understanding the design and implementation best practices before you start your project will help you avoid common problems.

This book gives an insight into the engineering practices used to design and build real-world, Spark-based applications. The book's hands-on examples will give you the required confidence to work on any future projects you encounter in Spark SQL. It starts by familiarizing you with data exploration and data munging tasks using Spark SQL and Scala.

Extensive code examples will help you understand the methods used to implement typical use cases for various types of applications. You will get a walkthrough of the key concepts and terms that are common to streaming, machine learning, and graph applications. You will also learn key performance-tuning details, including Cost-Based Optimization (introduced in Spark 2.2). Finally, you will move on to learning how such systems are architected and deployed for a successful delivery of your project.

Style and approach

This book is a hands-on guide to designing, building, and deploying Spark SQL-centric production applications at scale.

Learn how to use, deploy, and maintain Apache Spark with this comprehensive guide, written by the creators of the open-source cluster-computing framework, with an emphasis on the improvements and new features in Spark 2.0.

Learn about the fastest-growing open source project in the world, and find out how it revolutionizes big data analytics.

About This Book

This exclusive guide covers how to get up and running with fast data processing using Apache Spark, and explores various possibilities with Apache Spark using real-world use cases. Want to perform efficient data processing in real time?

This book will be your one-stop solution.

Who This Book Is For

This guide appeals to big data engineers, analysts, architects, software engineers, and even technical managers who need to perform efficient data processing on Hadoop in real time. Basic familiarity with Java or Scala will be helpful.

What You Will Learn

Get an overview of big data analytics and its importance for organizations and data professionals. Delve into Spark to see how it is different from existing processing platforms. Understand the intricacies of various file formats, and how to process them with Apache Spark.

Introduce yourself to the deployment and usage of SparkR. Walk through the importance of graph computation and the graph processing systems available in the market. Check out a real-world example of Spark by building a recommendation engine with Spark using ALS. Use a telco data set to predict customer churn using random forests.

In Detail

The Spark juggernaut keeps on rolling and getting more and more momentum each day. Deploying its key capabilities is crucial, whether on a standalone framework or as part of an existing Hadoop installation configured with YARN and Mesos.

The next part of the journey after installation is using the key components: APIs, clustering, machine learning APIs, data pipelines, and parallel programming. It is important to understand why each framework component is key, how widely it is being used, its stability, and the pertinent use cases. Once we understand the individual components, we will take a couple of real-life advanced analytics examples, such as building a recommendation system and predicting customer churn.

The objective of these real-life examples is to give the reader confidence in using Spark for real-world problems.

Style and approach

With the help of practical examples and real-world use cases, this guide will take you from scratch to building efficient data applications using Apache Spark.

You will learn all about this excellent data processing engine in a step-by-step manner, taking one aspect of it at a time. This highly practical guide will cover how to work with data pipelines, DataFrames, clustering, Spark SQL, parallel programming, and other insightful topics, with the help of real-world use cases.
