Test-Driven Machine Learning
Ebook, 326 pages, 3 hours


About this ebook

This book is intended for data technologists (scientists, analysts, or developers) with previous machine learning experience who are also comfortable reading code in Python. This book is ideal for those looking for a way to deliver results quickly to enable rapid iteration and improvement.
Language: English
Release date: Nov 27, 2015
ISBN: 9781784396367

    Test-Driven Machine Learning - Justin Bozonier

    Table of Contents

    Test-Driven Machine Learning

    Credits

    About the Author

    About the Reviewers

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    Why subscribe?

    Free access for Packt account holders

    Preface

    What this book covers

    What you need for this book

    Who this book is for

    Conventions

    Reader feedback

    Customer support

    Downloading the example code

    Downloading the color images of this book

    Errata

    Piracy

    Questions

    1. Introducing Test-Driven Machine Learning

    Test-driven development

    The TDD cycle

    Red

    Green

    Refactor

    Behavior-driven development

    Our first test

    The anatomy of a test

    Given

    When

    Then

    TDD applied to machine learning

    Dealing with randomness

    Different approaches to validating the improved models

    Classification overview

    Regression

    Clustering

    Quantifying the classification models

    Summary

    2. Perceptively Testing a Perceptron

    Getting started

    Summary

    3. Exploring the Unknown with Multi-armed Bandits

    Understanding a bandit

    Testing with simulation

    Starting from scratch

    Simulating real world situations

    A randomized probability matching algorithm

    A bootstrapping bandit

    The problem with straight bootstrapping

    Multi-armed bandit throw down

    Summary

    4. Predicting Values with Regression

    Refresher on advanced regression

    Regression assumptions

    Quantifying model quality

    Generating our own data

    Building the foundations of our model

    Cross-validating our model

    Generating data

    Summary

    5. Making Decisions Black and White with Logistic Regression

    Generating logistic data

    Measuring model accuracy

    Generating a more complex example

    Test driving our model

    Summary

    6. You're So Naïve, Bayes

    Gaussian classification by hand

    Beginning the development

    Summary

    7. Optimizing by Choosing a New Algorithm

    Upgrading the classifier

    Applying our classifier

    Upgrading to Random Forest

    Summary

    8. Exploring scikit-learn Test First

    Test-driven design

    Planning our journey

    Creating a classifier chooser (it needs to run tests to evaluate classifier performance)

    Getting choosey

    Developing testable documentation

    Decision trees

    Summary

    9. Bringing It All Together

    Starting at the highest level

    The real world

    What we've accomplished

    Summary

    Index

    Test-Driven Machine Learning



    Copyright © 2015 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    First published: November 2015

    Production reference: 1231115

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham B3 2PB, UK.

    ISBN 978-1-78439-908-5

    www.packtpub.com

    Credits

    Author

    Justin Bozonier

    Reviewers

    Lars Marius Garshol

    Alexey Grigorev

    Commissioning Editor

    Dipika Gaonkar

    Acquisition Editors

    Divya Poojari

    Llewellyn Rozario

    Content Development Editor

    Nikhil Potdukhe

    Technical Editors

    Rupali R. Shrawane

    Copy Editor

    Yesha Gangani

    Project Coordinator

    Paushali Desai

    Proofreader

    Safis Editing

    Indexer

    Tejal Daruwale Soni

    Graphics

    Jason Monteiro

    Production Coordinator

    Melwyn Dsa

    Cover Work

    Melwyn Dsa

    About the Author

    Justin Bozonier is a data scientist living in Chicago. He is currently a Senior Data Scientist at GrubHub. He has led the development of their custom analytics platform and of their first real-time split-test analysis platform, which used Bayesian statistics. In addition, he has developed machine learning models for data mining as well as for prototyping product enhancements. Justin's software development expertise has earned him acknowledgements in the books Parallel Programming with Microsoft® .NET and Flow-Based Programming, Second Edition. He has also taught a workshop at PyData titled Simplified Statistics through Simulation.

    His previous work experience includes being an Actuarial Systems Developer at Milliman, Inc., contracting as a Software Development Engineer II at Microsoft, and working as a Sr. Data Analyst and Lead Developer at Cheezburger Network, among other roles.

    Savannah Bozonier—the best partner I've ever had in life. Time and again she has made room in her life so I can push myself to do things that take an immense amount of time. Things like writing this book.

    My friends and colleagues for their support and help which culminated in this book: Tom Hayden, Drew Fustin, and Andrew Slotnick.

    My mentors across the years—Chad Boyer, Kelly Leahy, Robert Ream, James Thigpen, and Loren Bast.

    My parents—I don't know what it's like to be told I can't do something. My life reflects that in every way.

    About the Reviewers

    Lars Marius Garshol has worked as a consultant, product developer, and open source developer for two decades. He added Unicode support to the Opera web browser, edited a number of ISO standards, and developed the query language tolog. Later, he worked as an enterprise architect and an R&D developer. He is the developer of Duke, an open source tool for identifying near-duplicate database records. He wrote Definitive XML Application Development, published in 2002. Currently he is a software engineer at Schibsted Products & Technology in Oslo, Norway. He's working on a book on Norwegian farmhouse ale.

    Alexey Grigorev is an experienced software developer and data scientist with five years of professional experience. In his day-to-day job, he actively uses R and Python for data cleaning, data analysis, and modeling. He believes that testing is not only an integral part of software development, but it is also very useful for building machine learning models.

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    For support files and downloads related to your book, please visit www.PacktPub.com.

    Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at for more details.

    At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

    https://www2.packtpub.com/books/subscription/packtlib

    Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

    Why subscribe?

    Fully searchable across every book published by Packt

    Copy and paste, print, and bookmark content

    On demand and accessible via a web browser

    Free access for Packt account holders

    If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

    Preface

    Consistent, steady improvement is the name of the game in machine learning. Sometimes you find yourself implementing an algorithm from scratch; sometimes you're pulling in libraries. You always need the option to try new algorithms and improve performance. At the same time, you need to know that performance doesn't degrade.

    You could just ask an expert about every change because testing stochastic algorithms seems impossible. That's just as terribly slow as it sounds. What if you could automate checking that your updated algorithms outperform your previous ones? What if you could design your code so that you could swap in an algorithm from another library or pit one that you wrote yourself against what you have? These are all reasons for this book.

    We'll be covering what test-driven development is and what value it brings to machine learning. We'll be using nosetests in Python 2.7 to develop our tests. For machine learning algorithms, we will be using statsmodels and scikit-learn. statsmodels has some great implementations of regression. scikit-learn is useful for its plethora of supported classification algorithms.
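One way to picture the automated check described above is a test that trains a previous model and an updated model on the same seeded data, then asserts that the updated one has the lower holdout error. The sketch below is only an illustration under stated assumptions: MeanModel, LineModel, generate_observations, and holdout_error are hypothetical names that do not come from the book, and the code uses only the standard library rather than statsmodels or scikit-learn.

```python
import random


def mean_squared_error(model, observations):
    """Average squared difference between predictions and actual values."""
    return sum((model.predict(x) - y) ** 2 for x, y in observations) / len(observations)


class MeanModel(object):
    """Hypothetical 'previous' model: always predicts the training mean."""

    def train(self, observations):
        self._mean = sum(y for _, y in observations) / float(len(observations))

    def predict(self, x):
        return self._mean


class LineModel(object):
    """Hypothetical 'updated' model: ordinary least squares on one variable."""

    def train(self, observations):
        n = float(len(observations))
        mean_x = sum(x for x, _ in observations) / n
        mean_y = sum(y for _, y in observations) / n
        covariance = sum((x - mean_x) * (y - mean_y) for x, y in observations)
        variance = sum((x - mean_x) ** 2 for x, _ in observations)
        self._slope = covariance / variance
        self._intercept = mean_y - self._slope * mean_x

    def predict(self, x):
        return self._slope * x + self._intercept


def generate_observations(count, seed):
    """Noisy points around y = 3x + 2; the seed makes the data repeatable."""
    rng = random.Random(seed)
    observations = []
    for _ in range(count):
        x = rng.uniform(0, 10)
        observations.append((x, 3 * x + 2 + rng.gauss(0, 1)))
    return observations


def holdout_error(model, seed=42):
    """Train on one seeded sample and score on a separate seeded sample."""
    training = generate_observations(200, seed)
    holdout = generate_observations(100, seed + 1)
    model.train(training)
    return mean_squared_error(model, holdout)


def given_linear_data_when_models_compared_test():
    previous_error = holdout_error(MeanModel())
    updated_error = holdout_error(LineModel())
    assert updated_error < previous_error, "Then the updated model should outperform the previous one."
```

Because both models expose the same train and predict methods, swapping in an implementation from another library would only need a thin wrapper with that interface. Fixing the seed keeps the comparison repeatable from run to run.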

    What this book covers

    Chapter 1, Introducing Test-Driven Machine Learning, explains what Test-Driven Development is, what it looks like, and how it is done in practice.

    Chapter 2, Perceptively Testing a Perceptron, develops a perceptron from scratch and defines its behavior even though it behaves non-deterministically.

    Chapter 3, Exploring the Unknown with Multi-armed Bandits, introduces multi-armed bandit problems, testing different algorithms, and iterating on their performance.

    Chapter 4, Predicting Values with Regression, uses statsmodels to implement regression and report on key performance metrics. We will also explore tuning the model.

    Chapter 5, Making Decisions Black and White with Logistic Regression, continues exploring regression and shows how to quantify the quality of this different type of regression model. We will use statsmodels again to create our regression models.

    Chapter 6, You're So Naïve, Bayes, helps us develop a Gaussian Naïve Bayes algorithm from scratch using test-driven development.

    Chapter 7, Optimizing by Choosing a New Algorithm, continues the work from Chapter 6, You're So Naïve, Bayes, and attempts to improve upon it using a new algorithm: Random Forests.

    Chapter 8, Exploring scikit-learn Test First, teaches you how to teach yourself. You probably already have a lot of experience with this. This chapter builds on that experience by showing you how to use the test framework to document scikit-learn.

    Chapter 9, Bringing It All Together, takes a business problem that requires a couple of different algorithms. Again, we will develop everything we need from scratch and mix our code with third-party libraries, completely test-driven.

    What you need for this book

    We will be using Python 2.7 in this book along with nosetests to unit test our software. In addition, we will be using statsmodels as well as scikit-learn.

    Who this book is for

    This book is for machine learning professionals who want to be able to test the improvements to their algorithms in isolation and in an automated fashion. This book is for any data scientist who wants to get started in Test-Driven Development with minimal religion and maximum value. This book is not for someone who wants to learn state-of-the-art Test-Driven Development. It is written with the idea that the majority of what can be learned from Test-Driven Development is remarkably simple. We will provide a relatively simple approach to it that readers can augment as they see fit.

    Conventions

    In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

    Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Notice that in my test, I instantiate a NumberGuesser object."

    A block of code is set as follows:

    def given_no_information_when_asked_to_guess_test():
        number_guesser = NumberGuesser()
        result = number_guesser.guess()
        assert result is None, "Then it should provide no result."
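For readers who want to run the sample test, here is a minimal sketch of a class that would make it pass. The body of NumberGuesser is an assumption made purely for illustration; it does not come from the book, which develops its own version in Chapter 1. Function names ending in _test should be picked up by nose's default naming pattern.

```python
class NumberGuesser(object):
    """Hypothetical minimal implementation: just enough to pass the first test."""

    def guess(self):
        # No information has been provided yet, so there is nothing to guess.
        return None


def given_no_information_when_asked_to_guess_test():
    number_guesser = NumberGuesser()
    result = number_guesser.guess()
    assert result is None, "Then it should provide no result."
```

Writing only enough code to satisfy the failing test is the Green step of the TDD cycle listed in the table of contents; further behavior would be driven by further tests.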

    When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

    for the_class, trained_observations in self._classifications.items():
        if len(trained_observations) <= 1:
            return None
        probability_of_observation_given_class[the_class] = self._probability_given_class(trained_observations, observation)


    Note

    Warnings or important notes appear in a box like this.

    Tip

    Tips and tricks appear like this.

    Reader feedback

    Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

    To send us general feedback, simply e-mail <feedback@packtpub.com>, and mention the book's title in the subject of your message.

    If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

    Customer support

    Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

    Downloading the example code

    You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

    Downloading the color images of this book

    We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from http://www.packtpub.com/sites/default/files/downloads/TestDrivenMachineLearning_ColorImages.pdf.

    Errata

    Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

    To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

    Piracy

    Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

    Please contact us
