Programming MapReduce with Scalding
About this ebook
This book is for developers who are willing to discover how to effectively develop MapReduce applications. Prior knowledge of Hadoop or Scala is not required; however, investing some time on those topics would certainly be beneficial.
Programming MapReduce with Scalding - Antonios Chalkiopoulos
Table of Contents
Programming MapReduce with Scalding
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Introduction to MapReduce
The Hadoop platform
MapReduce
A MapReduce example
MapReduce abstractions
Introducing Cascading
What happens inside a pipe
Pipe assemblies
Cascading extensions
Summary
2. Get Ready for Scalding
Why Scala?
Scala basics
Scala build tools
Hello World in Scala
Development editors
Installing Hadoop in five minutes
Running our first Scalding job
Submitting a Scalding job in Hadoop
Summary
3. Scalding by Example
Reading and writing files
Best practices to read and write files
TextLine parsing
Executing in the local and Hadoop modes
Understanding the core capabilities of Scalding
Map-like operations
Join operations
Pipe operations
Grouping/reducing functions
Operations on groups
Composite operations
A simple example
Typed API
Summary
4. Intermediate Examples
Logfile analysis
Completing the implementation
Exploring ad targeting
Calculating daily points
Calculating historic points
Generating targeted ads
Summary
5. Scalding Design Patterns
The external operations pattern
The dependency injection pattern
The late bound dependency pattern
Summary
6. Testing and TDD
Introduction to testing
MapReduce testing challenges
Development lifecycle with testing strategy
TDD for Scalding developers
Implementing the TDD methodology
Decomposing the algorithm
Defining acceptance tests
Implementing integration tests
Implementing unit tests
Implementing the MapReduce logic
Defining and performing system tests
Black box testing
Summary
7. Running Scalding in Production
Executing Scalding in a Hadoop cluster
Scheduling execution
Coordinating job execution
Configuring using a property file
Configuring using Hadoop parameters
Monitoring Scalding jobs
Using slim JAR files
Scalding execution throttling
Summary
8. Using External Data Stores
Interacting with external systems
SQL databases
NoSQL databases
Understanding HBase
Reading from HBase
Writing in HBase
Using advanced HBase features
Search platforms
Elastic search
Summary
9. Matrix Calculations and Machine Learning
Text similarity using TF-IDF
Setting a similarity using the Jaccard index
K-Means using Mahout
Other libraries
Summary
Index
Programming MapReduce with Scalding
Copyright © 2014 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: June 2014
Production reference: 1190614
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78328-701-7
www.packtpub.com
Credits
Author
Antonios Chalkiopoulos
Reviewers
Ahmad Alkilani
Włodzimierz Bzyl
Tanin Na Nakorn
Sen Xu
Commissioning Editor
Owen Roberts
Acquisition Editor
Llewellyn Rozario
Content Development Editor
Sriram Neelakantan
Technical Editor
Kunal Anil Gaikwad
Copy Editors
Sayanee Mukherjee
Alfida Paiva
Project Coordinator
Aboli Ambardekar
Proofreaders
Mario Cecere
Maria Gould
Indexers
Mehreen Deshmukh
Rekha Nair
Tejal Soni
Graphics
Sheetal Aute
Ronak Dhruv
Valentina Dsilva
Disha Haria
Production Coordinator
Conidon Miranda
Cover Work
Conidon Miranda
Cover Image
Sheetal Aute
About the Author
Antonios Chalkiopoulos is a developer living in London and a professional working with Hadoop and Big Data technologies. He has delivered a number of complex MapReduce applications in Scalding on a production HDFS cluster of more than 40 nodes. He is a contributor to Scalding and other open source projects, and he is interested in cloud technologies, NoSQL databases, distributed real-time computation systems, and machine learning.
He was involved in a number of Big Data projects before discovering Scala and Scalding. Most of the content of this book comes from his experience and knowledge accumulated while working with a great team of engineers.
I would like to thank Rajah Chandan for introducing Scalding to the team and for being the author of SpyGlass, and Stefano Galarraga for co-authoring chapters 5 and 6 and for being the author of ScaldingUnit. Both these libraries are presented in this book.
Saad, Gracia, Deepak, and Tamas, I've learned a lot working next to you all, and this book wouldn't be possible without all your discoveries. Finally, I would like to thank Christina for bearing with my writing sessions and supporting all my endeavors.
About the Reviewers
Ahmad Alkilani is a data architect specializing in the implementation of high-performance distributed systems, data warehouses, and BI systems. His career has been split between building enterprise applications and products using a variety of web and database technologies, including .NET, SQL Server, Hadoop, Hive, Scala, and Scalding. His recent interests include building real-time web and predictive analytics, as well as streaming and sketching algorithms.
Currently, Ahmad works at Move.com (http://www.realtor.com) and enjoys speaking at various user groups and national conferences. He is also an author on Pluralsight, with courses focused on Hadoop and Big Data, SQL Server 2014, and more, targeting the Big Data and streaming spaces.
You can find more information on Ahmad on his LinkedIn profile (http://www.linkedin.com/in/ahmadalkilani) or his Pluralsight author page (http://pluralsight.com/training/Authors/Details/ahmad-alkilani).
I would like to thank my family, especially my wonderful wife, Farah, and my beautiful son Maher for putting up with my long working hours and always being there for me.
Włodzimierz Bzyl works at the University of Gdańsk. His current interests include web-related technologies and NoSQL databases.
He has a passion for new technologies and introducing his students to them.
He enjoys contributing to open source software and spending time trekking in the Tatra mountains.
Tanin Na Nakorn is a software engineer who is enthusiastic about building consumer products and open source projects that make people's lives easier. He cofounded Thaiware, a software portal in Thailand, and GiveAsia, a donation platform in Singapore; he currently builds products at Twitter. You may find him expressing himself on his Twitter handle @tanin and helping on various open source projects at http://www.github.com/tanin47.
Sen Xu is a software engineer at Twitter; he was previously a data scientist at Inome Inc.
He worked on designing and building data pipelines on top of traditional RDBMS (MySQL, PostgreSQL, and so on) and key-value store solutions (Hadoop). His interests include Big Data analytics, text mining, record linkage, machine learning, and spatial data handling.
www.PacktPub.com
Support files, eBooks, discount offers, and more
You might want to visit www.PacktPub.com for support files and downloads related to your book.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com, and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
http://PacktLib.PacktPub.com
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print and bookmark content
On demand and accessible via web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.
Preface
Scalding is a relatively new Scala DSL that builds on top of the Cascading pipeline framework, offering a powerful and expressive architecture for MapReduce applications. Scalding provides a highly abstracted layer for design and implementation in a componentized fashion, allowing code reuse and development with the Test Driven Methodology.
Similar to other popular MapReduce technologies such as Pig and Hive, Cascading uses a tuple-based data model, and it is a mature and proven framework that many dynamic languages have built technologies upon. Instead of forcing developers to write raw map and reduce functions while mentally keeping track of key-value pairs throughout the data transformation pipeline, Scalding provides a more natural way to express code.
In simpler terms, programming raw MapReduce is like developing in a low-level programming language such as assembly. On the other hand, Scalding provides an easier way to build complex MapReduce applications and integrates with other distributed applications of the Hadoop ecosystem.
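To make the contrast concrete, the following is a minimal sketch that counts words in two ways using only plain Scala collections. It is not the Scalding API and it does not run on Hadoop; the object name `WordCountSketch` and all of its function names are hypothetical and chosen purely for illustration. The first version mimics the raw style, in which the developer tracks key-value pairs through explicit map, shuffle, and reduce steps, while the second expresses the same computation as a single pipeline of transformations, which is the style that pipeline frameworks aim for.

```scala
// Illustrative sketch only: plain Scala collections, not Scalding or Hadoop.
// All names here are hypothetical.
object WordCountSketch {

  // "Map" step: emit a (word, 1) pair for every word in a line.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // "Shuffle" step: gather all values that share the same key.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2)) }

  // "Reduce" step: collapse the values of one key into a single count.
  def reducePhase(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  // Raw style: the three steps are wired together by hand.
  def rawStyle(lines: Seq[String]): Map[String, Int] =
    shuffle(lines.flatMap(mapPhase)).map { case (word, counts) => reducePhase(word, counts) }

  // Pipeline style: the same logic as one chain of transformations.
  def pipelineStyle(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\s+").toSeq)
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit = {
    val lines = Seq("hello world", "hello scalding")
    println(rawStyle(lines))
    println(pipelineStyle(lines))
  }
}
```

Both functions return the same counts for the same input; the difference lies in how much key-value bookkeeping the developer has to hold in mind while reading or changing the code.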
This book aims to present MapReduce, Hadoop, and Scalding. It suggests design patterns and idioms, and it provides ample examples of real implementations for common use cases.
What this book covers
Chapter 1, Introduction to MapReduce, serves as an introduction to the Hadoop platform, MapReduce, and the concept of the pipeline abstraction that many Big Data technologies use. The first chapter outlines Cascading, which is a sophisticated framework that empowers developers to write efficient MapReduce applications.
Chapter 2, Get Ready for Scalding, lays the foundation for working with Scala, using build tools and an IDE, and setting up a local-development Hadoop system. It