Author: empty | Pages: 487 | Publisher: empty
This is a shared repository for Learning Apache Spark Notes. The PDF version can be downloaded from HERE. The first version was posted on GitHub in ChenFeng ([Feng2017]). This shared repository mainly contains the self-learning and self-teaching notes from Wenqiang during his IMA Data Science Fellowship. The reader is referred to the repository https://github.com/runawayhorse001/LearningApacheSpark for more details about the dataset and the .ipynb files.
In this repository, I try to use detailed demo code and examples to show how to use each of the main functions. If you find your work wasn't cited in this note, please feel free to let me know.
Although I am by no means a data mining programming and Big Data expert, I decided that it would be useful for me to share what I learned about PySpark programming in the form of easy tutorials with detailed examples. I hope those tutorials will be a valuable tool for your studies.
The tutorials assume that the reader has a preliminary knowledge of programming and Linux. This document is generated automatically by using sphinx.
About the authors
· Wenqiang Feng
- Sr. Data Scientist and PhD in Mathematics
- University of Tennessee at Knoxville
- Email: von198@gmail.com
· Biography
Wenqiang Feng is a Sr. Data Scientist at Machine Learning Lab, H&R Block. Before joining Block, Dr. Feng was a Data Scientist at Applied Analytics Group, DST (now SS&C). Dr. Feng's responsibilities include providing clients with access to cutting-edge skills and technologies, including Big Data analytic solutions, advanced analytic and data enhancement techniques, and modeling.
Dr. Feng has deep analytic expertise in data mining, analytic systems, machine learning algorithms, business intelligence, and applying Big Data tools to strategically solve industry problems in a cross-functional business. Before joining DST, Dr. Feng was an IMA Data Science Fellow at The Institute for Mathematics and its Applications (IMA) at the University of Minnesota. While there, he helped startup companies make marketing decisions based on deep predictive analytics.
Dr. Feng graduated from University of Tennessee, Knoxville, with a Ph.D. in Computational Mathematics and a Master's degree in Statistics. He also holds a Master's degree in Computational Mathematics from Missouri University of Science and Technology (MST) and a Master's degree in Applied Mathematics from the University of Science and Technology of China (USTC).
The work of Wenqiang Feng was supported by the IMA while he was working at IMA. However, any opinion, finding, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the IMA, UTK, DST and H&R Block.
1.2 Motivation for this tutorial
I was motivated by the IMA Data Science Fellowship project to learn PySpark. After that, I was impressed and attracted by PySpark, and I found that:
1. It is no exaggeration to say that Spark is the most powerful Big Data tool.
2. However, I still found that learning Spark was a difficult process. I had to Google it and identify which one was true, and it was hard to find detailed examples from which I could easily learn the full process in one file.
3. Good sources are expensive for a graduate student.
1.3 Copyright notice and license info
This Learning Apache Spark with Python PDF file is supposed to be a free and living document, which is why its source is available online at https://runawayhorse001.github.io/LearningApacheSpark/pyspark.pdf. But this document is licensed according to both MIT License and Creative Commons Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0) License.
When you plan to use, copy, modify, merge, publish, distribute or sublicense, please see the terms of those licenses for more details and give the corresponding credits to the author.
1.4 Acknowledgement
Here, I would like to thank Ming Chen, Jian Sun and Zhongbo Li at the University of Tennessee at Knoxville for the valuable discussion, and thank the generous anonymous authors for providing the detailed solutions and source code on the internet. Without their help, this repository would not have been possible. Wenqiang also would like to thank the Institute for Mathematics and Its Applications (IMA) at
1 Preface
1.2 Motivation for this tutorial
1.3 Copyright notice and license info
1.4 Acknowledgement
1.5 Feedback and suggestions
2 Why Spark with Python?
2.1 Why Spark?
2.2 Why Spark with Python (PySpark)?
3 Configure Running Platform
3.1 Run on Databricks Community Cloud
3.2 Configure Spark on Mac and Ubuntu
3.3 Configure Spark on Windows
3.4 PySpark With Text Editor or IDE
3.5 PySparkling Water: Spark + H2O
3.6 Set up Spark on Cloud
3.7 PySpark on Colaboratory
3.8 Demo Code in this Section
4 An Introduction to Apache Spark
4.1 Core Concepts
4.2 Spark Components
4.3 Architecture
4.4 How Spark Works?
5 Programming with RDDs
5.1 Create RDD
5.2 Spark Operations
5.3 rdd.DataFrame vs pd.DataFrame
6 Statistics and Linear Algebra Preliminaries
6.1 Notations
6.2 Linear Algebra Preliminaries
6.3 Measurement Formula
6.4 Confusion Matrix
6.5 Statistical Tests
7 Data Exploration
7.1 Univariate Analysis
7.2 Multivariate Analysis
8 Data Manipulation:Features
8.1 Feature Extraction
8.2 Feature Transform
8.3 Feature Selection
8.4 Unbalanced data: Undersampling
9 Regression
9.1 Linear Regression
9.2 Generalized linear regression
9.3 Decision tree Regression
9.4 Random Forest Regression
9.5 Gradient-boosted tree regression
10 Regularization
10.1 Ordinary least squares regression
10.2 Ridge regression
10.4 Elastic net
11 Classification
11.1 Binomial logistic regression
11.2 Multinomial logistic regression
11.3 Decision tree Classification
11.4 Random forest Classification
11.5 Gradient-boosted tree Classification
11.6 XGBoost: Gradient boosted tree Class if ca
11.7 Naive Bayes Classification
12 Clustering
12.1 K-Means Model
13 RFM Analysis
13.1 RFM Analysis Methodology
13.2 Demo
13.3 Extension
14 Text Mining
14.1 Text Collection
14.2 Text Preprocessing
14.3 Text Classification
14.4 Sentiment analysis
14.5 N-grams and Correlations
14.6 Topic Model: Latent Dirichlet Allocation
15 Social Network Analysis
15.1 Introduction
15.2 Co-occurrence Network
15.3 Appendix: matrix multiplication in PySpark
15.4 Correlation Network
16 ALS: Stock Portfolio Recommendations
16.1 Recommender systems
16.2 Alternating Least Squares
16.3 Demo
17 Monte Carlo Simulation
17.1 Simulating Casino Win
17.2 Simulating a Random Walk
18 Markov Chain Monte Carlo
18.1 Metropolis algorithm
18.2 A Toy Example of Metropolis
18.3 Demos
19 Neural Network
19.1 Feedforward Neural Network
20 Automation for Cloudera Distribution Hadoop
20.1 Automation Pipeline
20.2 Data Clean and Manipulation Automation
20.3 ML Pipeline Automation
20.4 Save and Load Pipeline Model
20.5 Ingest Results Back into Hadoop
21 Wrap PySpark Package
21.1 Package Wrapper
21.2 Package Publishing on PyPI
22 PySpark Data Audit Library
22.1 Install with pip
22.2 Install from Repo
22.3 Uninstall
22.4 Test
22.5 Auditing on Big Dataset
23 Zeppelin to jupyter notebook
23.1 How to Install
23.2 Converting Demos
24 My Cheat Sheet
25 PySpark API
25.1 Stat API
25.2 Regression API
25.3 Classification API
25.4 Clustering API
25.5 Recommendation API