
Contents

Prologue
The Search for Intelligent Machines
A Nature Inspired New Golden Age

Introduction
Who is this book for?
What will we do?
How will we do it?
Author's Note

Part 1 - How They Work
Easy for Me, Hard for You
A Simple Predicting Machine
Classifying is Not Very Different from Predicting
Training A Simple Classifier
Sometimes One Classifier Is Not Enough
Neurons, Nature's Computing Machines
Following Signals Through A Neural Network
Matrix Multiplication is Useful .. Honest!
A Three Layer Example with Matrix Multiplication
Learning Weights From More Than One Node
Backpropagating Errors From More Output Nodes
Backpropagating Errors To More Layers
Backpropagating Errors with Matrix Multiplication
How Do We Actually Update Weights?
Weight Update Worked Example
Preparing Data

Part 2 - DIY with Python
Python
Interactive Python = IPython
A Very Gentle Start with Python
Neural Network with Python
The MNIST Dataset of Handwritten Numbers

Part 3 - Even More Fun
Your Own Handwriting
Inside the Mind of a Neural Network
Creating New Training Data: Rotations

Epilogue

Appendix A:
A Gentle Introduction to Calculus
A Flat Line
A Sloped Straight Line
A Curved Line
Calculus By Hand
Calculus Not By Hand
Calculus without Plotting Graphs
Patterns
Functions of Functions
You can do Calculus!

Appendix B:
Do It with a Raspberry Pi
Installing IPython
Making Sure Things Work
Training And Testing A Neural Network
Raspberry Pi Success!

Prologue

The Search for Intelligent Machines

For thousands of years, we humans have tried to understand how our own intelligence works and replicate it in some kind of machine - thinking machines.

We've not been satisfied by mechanical or electronic machines helping us with simple tasks - flint sparking fires, pulleys lifting heavy rocks, and calculators doing arithmetic.

Instead, we want to automate more challenging and complex tasks like grouping similar photos, recognising diseased cells from healthy ones, and even putting up a decent game of chess. These tasks seem to require human intelligence, or at least a more mysterious deeper capability of the human mind not found in simple machines like calculators.

The idea of machines with this humanlike intelligence is so seductive and powerful that our culture is full of fantasies, and fears, about it - the immensely capable but ultimately menacing HAL 9000 in Stanley Kubrick's 2001: A Space Odyssey, the crazed action Terminator robots and the talking KITT car with a cool personality from the classic Knight Rider TV series.

When Garry Kasparov, the reigning world chess champion and grandmaster, was beaten by the IBM Deep Blue computer in 1997 we feared the potential of machine intelligence just as much as we celebrated that historic achievement.

So strong is our desire for intelligent machines that some have fallen for the temptation to cheat. The infamous mechanical Turk chess machine was merely a hidden person inside a cabinet!

A Nature Inspired New Golden Age

Optimism and ambition for artificial intelligence were flying high when the subject was formalised in the 1950s. Initial successes saw computers playing simple games and proving theorems. Some were convinced machines with human level intelligence would appear within a decade or so.

But artificial intelligence proved hard, and progress stalled. The 1970s saw a devastating academic challenge to the ambitions for artificial intelligence, followed by funding cuts and a loss of interest.

It seemed machines of cold hard logic, of absolute 1s and 0s, would never be able to achieve the nuanced organic, sometimes fuzzy, thought processes of biological brains.

After a period of not much progress an incredibly powerful idea emerged to lift the search for machine intelligence out of its rut. Why not try to build artificial brains by copying how real biological brains worked? Real brains with neurons instead of logic gates, softer more organic reasoning instead of the cold hard, black and white, absolutist traditional algorithms.

Scientists were inspired by the apparent simplicity of a bee or pigeon's brain compared to the complex tasks they could do. Brains a fraction of a gram seemed able to do things like steer flight and adapt to wind, identify food and predators, and quickly decide whether to fight or escape. Surely computers, now with massive cheap resources, could mimic and improve on these brains? A bee has around 950,000 neurons - could today's computers with gigabytes and terabytes of resources outperform bees?

But with traditional approaches to solving problems, these computers with massive storage and superfast processors couldn't achieve what the relatively miniscule brains in birds and bees could do.

Neural networks emerged from this drive for biologically inspired intelligent computing, and went on to become one of the most powerful and useful methods in the field of artificial intelligence. Today, Google's DeepMind, which achieves fantastic things like learning to play video games by itself, and for the first time beating a world master at the incredibly rich game of Go, has neural networks at its foundation. Neural networks are already at the heart of everyday technology - like automatic car number plate recognition and decoding the handwritten postcodes on your letters.

This guide is about neural networks, understanding how they work, and making your own neural network that can be trained to recognise human handwritten characters, a task that is very difficult with traditional approaches to computing.

Introduction

Who is this book for?

This book is for anyone who wants to understand what neural networks are. It's for anyone who wants to make and use their own. And it's for anyone who wants to appreciate the fairly easy but exciting mathematical ideas that are at the core of how they work.

This guide is not aimed at experts in mathematics or computer science. You won't need any special knowledge or mathematical ability beyond school maths.

If you can add, multiply, subtract and divide then you can make your own neural network. The most difficult thing we'll use is gradient calculus - but even that concept will be explained so that as many readers as possible can understand it.

Interested readers or students may wish to use this guide to go on further exciting excursions into artificial intelligence. Once you've grasped the basics of neural networks, you can apply the core ideas to many varied problems.

Teachers can use this guide as a particularly gentle explanation of neural networks and their implementation, to enthuse and excite students making their very own learning artificial intelligence with only a few lines of programming language code. The code has been tested to work with a Raspberry Pi, a small inexpensive computer very popular in schools and with young students.

I wish a guide like this had existed when I was a teenager struggling to work out how these powerful yet mysterious neural networks worked. I'd seen them in books, films and magazines, but at that time I could only find difficult academic texts aimed at people already expert in mathematics and its jargon.

All I wanted was for someone to explain it to me in a way that a moderately curious school student could understand. That's what this guide wants to do.

What will we do?

In this book we'll take a journey to making a neural network that can recognise human handwritten numbers.

We'll start with very simple predicting neurons, and gradually improve on them as we hit their limits. Along the way, we'll take short stops to learn about the few mathematical concepts that are needed to understand how neural networks learn and predict solutions to problems.

We'll journey through mathematical ideas like functions, simple linear classifiers, iterative refinement, matrix multiplication, gradient calculus, optimisation through gradient descent and even geometric rotations. But all of these will be explained in a really gentle clear way, and will assume absolutely no previous knowledge or expertise beyond simple school mathematics.

Once we've successfully made our first neural network, we'll take the idea and run with it in different directions. For example, we'll use image processing to improve our machine learning without resorting to additional training data. We'll even peek inside the mind of a neural network to see if it reveals anything insightful - something not many guides show you how to do!

We'll also learn Python, an easy, useful and popular programming language, as we make our own neural network in gradual steps. Again, no previous programming experience will be assumed or needed.

How will we do it?

The primary aim of this guide is to open up the concepts behind neural networks to as many people as possible. This means we'll always start an idea somewhere really comfortable and familiar. We'll then take small easy steps, building up from that safe place to get to where we have just enough understanding to appreciate something really cool or exciting about neural networks.

To keep things as accessible as possible we'll resist the temptation to discuss anything that is more than strictly required to make your own neural network. There will be interesting context and tangents that some readers will appreciate, and if this is you, you're encouraged to research them more widely.

This guide won't look at all the possible optimisations and refinements to neural networks. There are many, but they would be a distraction from the core purpose here - to introduce the essential ideas in as easy and uncluttered a way as possible.

This guide is intentionally split into three sections:

In part 1 we'll gently work through the mathematical ideas at work inside simple neural networks. We'll deliberately not introduce any computer programming, to avoid being distracted from the core ideas.

In part 2 we'll learn just enough Python to implement our own neural network. We'll train it to recognise human handwritten numbers, and we'll test its performance.

In part 3, we'll go further than is necessary to understand simple neural networks, just to have some fun. We'll try ideas to further improve our neural network's performance, and we'll also have a look inside a trained network to see if we can understand what it has learned, and how it decides on its answers.

And don't worry, all the software tools we'll use will be free and open source, so you won't have to pay to use them. And you don't need an expensive computer to make your own neural network. All the code in this guide has been tested to work on a very inexpensive £5/$4 Raspberry Pi Zero, and there's a section at the end explaining how to get your Raspberry Pi ready.

Author's Note

I will have failed if I haven't given you a sense of the true excitement and surprises in mathematics and computer science.

I will have failed if I haven't shown you how school level mathematics and simple computer recipes can be incredibly powerful - by making our own artificial intelligence mimicking the learning ability of human brains.

I will have failed if I haven't given you the confidence and desire to explore further the incredibly rich field of artificial intelligence.

I welcome feedback to improve this guide. Please get in touch at makeyourownneuralnetwork at gmail dot com, or on twitter @myoneuralnet.

You will also find discussions about the topics covered here at http://makeyourownneuralnetwork.blogspot.co.uk. There will be an errata of corrections there too.

Part 1 - How They Work

Take inspiration from all the small things around you.

Easy for Me, Hard for You

Computers are nothing more than calculators at heart. They are very very fast at doing arithmetic.

This is great for doing tasks that match what a calculator does - summing numbers to work out sales, applying percentages to work out tax, plotting graphs of existing data.

Even watching catch-up TV or streaming music through your computer doesn't involve much more than the computer executing simple arithmetic instructions again and again. It may surprise you, but reconstructing a video frame from the ones and zeros that are piped across the internet to your computer is done using arithmetic not much more complex than the sums we did at school.

Adding up numbers really quickly - thousands, or even millions, a second - may be impressive, but it isn't artificial intelligence. A human may find it hard to do large sums very quickly, but the process of doing it doesn't require much intelligence at all. It simply requires an ability to follow very basic instructions, and this is what the electronics inside a computer does.

Now let's flip things and turn the tables on computers!

Look at the following images and see if you can recognise what they contain:

You and I can look at a picture with human faces, a cat, or a tree, and recognise it. In fact we can do it rather quickly, and to a very high degree of accuracy. We don't often get it wrong.

We can process the quite large amount of information that the images contain, and very successfully process it to recognise what's in the image. This kind of task isn't easy for computers - in fact it's incredibly difficult.

Problem                                        Computer    Human

Multiply thousands of large numbers quickly    Easy        Hard

Find faces in a photo of a crowd of people     Hard        Easy

We suspect image recognition needs human intelligence - something machines lack, however complex and powerful we build them, because they're not human.

But it is exactly these kinds of problems that we want computers to get better at solving - because they're fast and don't get tired. And it's these kinds of hard problems that artificial intelligence is all about.

Of course computers will always be made of electronics, and so the task of artificial intelligence is to find new kinds of recipes, or algorithms, which work in new ways to try to solve these kinds of harder problem. Perhaps not perfectly, but well enough to give an impression of a humanlike intelligence at work.


Key Points:

- Some tasks are easy for traditional computers, but hard for humans. For example, multiplying millions of pairs of numbers.

- On the other hand, some tasks are hard for traditional computers, but easy for humans. For example, recognising faces in a photo of a crowd.

A Simple Predicting Machine

Let's start super simple and build up from there.

Imagine a basic machine that takes a question, does some "thinking" and pushes out an answer. Just like the example above with ourselves taking input through our eyes, using our brains to analyse the scene, and coming to the conclusion about what objects are in that scene. Here's what this looks like:

Computers don't really think, they're just glorified calculators remember, so let's use more appropriate words to describe what's going on:

A computer takes some input, does some calculation and pops out an output. The following illustrates this. An input of "3 x 4" is processed, perhaps by turning multiplication into an easier set of additions, and the output answer "12" pops out.

"That's not so impressive!" you might be thinking. That's ok. We're using simple and familiar examples here to set out concepts which will apply to the more interesting neural networks we look at later.

Let's ramp up the complexity just a tiny notch.

Imagine a machine that converts kilometres to miles, like the following:

Now imagine we don't know the formula for converting between kilometres and miles. All we know is that the relationship between the two is linear. That means if we double the number in miles, the same distance in kilometres is also doubled. That makes intuitive sense. The universe would be a strange place if that wasn't true!

This linear relationship between kilometres and miles gives us a clue about that mysterious calculation - it needs to be of the form miles = kilometres x c, where c is a constant. We don't know what this constant c is yet.

The only other clues we have are some examples pairing kilometres with the correct value for miles. These are like real world observations used to test scientific theories - they're examples of real world truth.


Truth Example    Kilometres    Miles

1                0             0

2                100           62.137

What should we do to work out that missing constant c? Let's just pluck a value at random and give it a go! Let's try c = 0.5 and see what happens.

Here we have miles = kilometres x c, where kilometres is 100 and c is our current guess at 0.5. That gives 50 miles.

Okay. That's not bad at all given we chose c = 0.5 at random! But we know it's not exactly right because our truth example number 2 tells us the answer should be 62.137.

We're wrong by 12.137. That's the error, the difference between our calculated answer and the actual truth from our list of examples. That is,

error = truth - calculated
      = 62.137 - 50
      = 12.137


So what next? We know we're wrong, and by how much. Instead of being a reason to despair, we use this error to guide a second, better, guess at c.

Look at that error again. We were short by 12.137. Because the formula for converting kilometres to miles is linear, miles = kilometres x c, we know that increasing c will increase the output.

Let's nudge c up from 0.5 to 0.6 and see what happens.

With c now set to 0.6, we get miles = kilometres x c = 100 x 0.6 = 60. That's better than the previous answer of 50. We're clearly making progress!

Now the error is a much smaller 2.137. It might even be an error we're happy to live with.

The important point here is that we used the error to guide how we nudged the value of c. We wanted to increase the output from 50, so we increased c a little bit.

Rather than try to use algebra to work out the exact amount c needs to change, let's continue with this approach of refining c. If you're not convinced, and think it's easy enough to work out the exact answer, remember that many more interesting problems won't have simple mathematical formulae relating the output and input. That's why we need more sophisticated methods - like neural networks.

Let's do this again. The output of 60 is still too small. Let's nudge the value of c up again, from 0.6 to 0.7.

Oh no! We've gone too far and overshot the known correct answer. Our previous error was 2.137 but now it's -7.863. The minus sign simply says we overshot rather than undershot - remember the error is (correct value - calculated value).

Ok so c = 0.6 was way better than c = 0.7. We could be happy with the small error from c = 0.6 and end this exercise now. But let's go on for just a bit longer. Why don't we nudge c up by just a tiny amount, from 0.6 to 0.61.

That's much much better than before. We have an output value of 61 which is only wrong by 1.137 from the correct 62.137.

So that last effort taught us that we should moderate how much we nudge the value of c. If the outputs are getting close to the correct answer - that is, the error is getting smaller - then don't nudge the changeable bit so much. That way we avoid overshooting the right value, like we did earlier.

Again without getting too distracted by exact ways of working out c, and to remain focussed on this idea of successively refining it, we could suggest that the correction is a fraction of the error. That's intuitively right - a big error means a bigger correction is needed, and a tiny error means we need the teeniest of nudges to c.

What we've just done, believe it or not, is walked through the very core process of learning in a neural network - we've trained the machine to get better and better at giving the right answer.

It is worth pausing to reflect on that - we've not solved a problem exactly in one step, like we often do in school maths or science problems. Instead, we've taken a very different approach by trying an answer and improving it repeatedly. Some use the term iterative and it means repeatedly improving an answer bit by bit.
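
Since part 2 of this guide uses Python, here is a minimal Python sketch of this try-and-nudge process, just to make it concrete. The moderating fraction of 0.5 and the variable names are illustrative choices, not part of the method itself.

```python
# Iteratively refine the constant c in miles = kilometres * c,
# nudging it by a fraction of the error at each step.

kilometres, true_miles = 100, 62.137    # our truth example number 2

c = 0.5                                 # the initial random guess
for step in range(5):
    predicted = kilometres * c          # the machine's current answer
    error = true_miles - predicted      # truth minus calculated
    c = c + 0.5 * (error / kilometres)  # nudge c by a fraction of the error
    print(f"step {step}: c = {c:.4f}, error was {error:.3f}")
```

Each pass halves the remaining error, so c settles towards the true conversion constant of about 0.6214 without us ever doing any algebra.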


Key Points:

- All useful computer systems have an input, and an output, with some kind of calculation in between. Neural networks are no different.

- When we don't know exactly how something works, we can try to estimate it with a model which includes parameters which we can adjust. If we didn't know how to convert kilometres to miles, we might use a linear function as a model, with an adjustable gradient.

- A good way of refining these models is to adjust the parameters based on how wrong the model is compared to known true examples.

Classifying is Not Very Different from Predicting

We called the above simple machine a predictor, because it takes an input and makes a prediction of what the output should be. We refined that prediction by adjusting an internal parameter, informed by the error we saw when comparing with a known true example.

Now look at the following graph showing the measured widths and lengths of garden bugs.

You can clearly see two groups. The caterpillars are thin and long, and the ladybirds are wide and short.

Remember the predictor that tried to work out the correct number of miles given kilometres? That predictor had an adjustable linear function at its heart. Remember, linear functions give straight lines when you plot their output against input. The adjustable parameter c changed the slope of that straight line.

What happens if we place a straight line over that plot?

We can't use the line in the same way we did before - to convert one number (kilometres) into another (miles) - but perhaps we can use the line to separate different kinds of things.

In the plot above, if the line was dividing the caterpillars from the ladybirds, then it could be used to classify an unknown bug based on its measurements. The line above doesn't do this yet because half the caterpillars are on the same side of the dividing line as the ladybirds.

Let's try a different line, by adjusting the slope again, and see what happens.

This time the line is even less useful! It doesn't separate the two kinds of bugs at all.

Let's have another go:

That's much better! This line neatly separates caterpillars from ladybirds. We can now use this line as a classifier of bugs.

We are assuming that there are no other kinds of bugs that we haven't seen - but that's ok for now, we're simply trying to illustrate the idea of a simple classifier.

Imagine next time our computer used a robot arm to pick up a new bug and measured its width and length. It could then use the above line to classify it correctly as a caterpillar or a ladybird.

Look at the following plot. You can see the unknown bug is a caterpillar because it lies above the line. This classification is simple but pretty powerful already!

We've seen how a linear function inside our simple predictors can be used to classify previously unseen data.

But we've skipped over a crucial element. How do we get the right slope? How do we improve a line we know isn't a good divider between the two kinds of bugs?

The answer to that is again at the very heart of how neural networks learn, and we'll look at this next.

Training A Simple Classifier

We want to train our linear classifier to correctly classify bugs as ladybirds or caterpillars. We saw above this is simply about refining the slope of the dividing line that separates the two groups of points on a plot of bug width and length.

How do we do this?

Rather than develop some mathematical theory upfront, let's try to feel our way forward by trying to do it. We'll understand the mathematics better that way.

We do need some examples to learn from. The following table shows two examples, just to keep this exercise simple.

Example    Width    Length    Bug

1          3.0      1.0       ladybird

2          1.0      3.0       caterpillar

We have an example of a bug which has width 3.0 and length 1.0, which we know is a ladybird. We also have an example of a bug which is longer at 3.0 and thinner at 1.0, which is a caterpillar.

This is a set of examples which we know to be the truth. It is these examples which will help refine the slope of the classifier function. Examples of truth used to teach a predictor or a classifier are called the training data.

Let's plot these two training data examples. Visualising data is often very helpful to get a better understanding of it, a feel for it, which isn't easy to get just by looking at a list or table of numbers.

Let's start with a random dividing line, just to get started somewhere. Looking back at our miles to kilometre predictor, we had a linear function whose parameter we adjusted. We can do the same here, because the dividing line is a straight line:

y = Ax

We've deliberately used the names y and x instead of length and width, because strictly speaking, the line is not a predictor here. It doesn't convert width to length, like we previously converted miles to kilometres. Instead, it is a dividing line, a classifier.

You may also notice that this y = Ax is simpler than the fuller form for a straight line, y = Ax + B. We've deliberately kept this garden bug scenario as simple as possible. Having a non-zero B simply means the line doesn't go through the origin of the graph, which doesn't add anything useful to our scenario.

We saw before that the parameter A controls the slope of the line. The larger A is, the larger the slope.

Let's go for A = 0.25 to get started. The dividing line is y = 0.25x. Let's plot this line on the same plot of training data to see what it looks like:

Well, we can see that the line y = 0.25x isn't a good classifier already, without the need to do any calculations. The line doesn't divide the two types of bug. We can't say "if the bug is above the line then it is a caterpillar" because the ladybird is above the line too.

So intuitively we need to move the line up a bit. We'll resist the temptation to do this by looking at the plot and drawing a suitable line. We want to see if we can find a repeatable recipe to do this, a series of computer instructions, which computer scientists call an algorithm.

Let's look at the first training example: the width is 3.0 and length is 1.0 for a ladybird. If we tested the y = Ax function with this example, where x is 3.0, we'd get

y = (0.25) * (3.0) = 0.75

The function, with the parameter A set to the initial randomly chosen value of 0.25, is suggesting that for a bug of width 3.0, the length should be 0.75. We know that's too small because the training data example tells us it must be a length of 1.0.

So we have a difference, an error. Just as before, with the miles to kilometres predictor, we can use this error to inform how we adjust the parameter A.

But before we do, let's think about what y should be again. If y was 1.0 then the line goes right through the point where the ladybird sits, at (x, y) = (3.0, 1.0). It's a subtle point, but we don't actually want that. We want the line to go above that point. Why? Because we want all the ladybird points to be below the line, not on it. The line needs to be a dividing line between ladybirds and caterpillars, not a predictor of a bug's length given its width.

So let's try to aim for y = 1.1 when x = 3.0. It's just a small number above 1.0. We could have chosen 1.2, or even 1.3, but we don't want a larger number like 10 or 100 because that would make it more likely that the line goes above both ladybirds and caterpillars, resulting in a separator that wasn't useful at all.

So the desired target is 1.1, and the error E is

error = (desired target - actual output)

Which is,

E = 1.1 - 0.75 = 0.35

Let's pause and remind ourselves what the error, the desired target and the calculated value mean visually.

Now, what do we do with this E to guide us to a better refined parameter A? That's the important question.

Let's take a step back from this task and think again. We want to use the error in y, which we call E, to inform the required change in parameter A. To do this we need to know how the two are related. How is A related to E? If we can know this, then we can understand how changing one affects the other.

Let's start with the linear function for the classifier:

y = Ax

We know that for initial guesses of A this gives the wrong answer for y, which should be the value given by the training data. Let's call the correct desired value t, for target value. To get that value t, we need to adjust A by a small amount. Mathematicians use the delta symbol Δ to mean "a small change in". Let's write that out:

t = (A + ΔA)x

Let's picture this to make it easier to understand. You can see the new slope (A + ΔA).

Remember the error E was the difference between the desired correct value and the one we calculate based on our current guess for A. That is, E was t - y.

Let's write that out to make it clear:

t - y = (A + ΔA)x - Ax

Expanding out the terms and simplifying:

E = t - y = Ax + (ΔA)x - Ax

E = (ΔA)x

That's remarkable! The error E is related to ΔA in a very simple way. It's so simple that I thought it must be wrong - but it was indeed correct. Anyway, this simple relationship makes our job much easier.

It's easy to get lost or distracted by that algebra. Let's remind ourselves of what we wanted to get out of all this, in plain English.

We wanted to know how much to adjust A by to improve the slope of the line so it is a better classifier, being informed by the error E. To do this we simply rearrange that last equation to put ΔA on its own:

ΔA = E / x

That's it! That's the magic expression we've been looking for. We can use the error E to refine the slope A of the classifying line by an amount ΔA.

Let's do it - let's update that initial slope.

The error was 0.35 and the x was 3.0. That gives ΔA = E / x as 0.35 / 3.0 = 0.1167. That means we need to change the current A = 0.25 by 0.1167. That means the new improved value for A is (A + ΔA), which is 0.25 + 0.1167 = 0.3667. As it happens, the calculated value of y with this new A is 1.1 - just as you'd expect, it's the desired target value.

Phew! We did it! All that work, and we have a method for refining that parameter A, informed by the current error.

Let's press on.

Now we're done with one training example, let's learn from the next one. Here we have a known true pairing of x = 1.0 and y = 3.0.

Let's see what happens when we put x = 1.0 into the linear function which is now using the updated A = 0.3667. We get y = 0.3667 * 1.0 = 0.3667. That's not very close to the training example with y = 3.0 at all.

Using the same reasoning as before - that we want the line to not cross the training data but instead be just above or below it - we can set the desired target value at 2.9. This way the training example of a caterpillar is just above the line, not on it. The error E is (2.9 - 0.3667) = 2.5333.

That's a bigger error than before, but if you think about it, all we've had so far for the linear function to learn from is a single training example, which clearly biases the line towards that single example.

Let's update the A again, just like we did before. The ΔA = E / x, which is 2.5333 / 1.0 = 2.5333. That means the even newer A is 0.3667 + 2.5333 = 2.9. That means for x = 1.0 the function gives 2.9 as the answer, which is what the desired value was.

That's a fair amount of working out, so let's pause again and visualise what we've done. The following plot shows the initial line, the line updated after learning from the first training example, and the final line after learning from the second training example.

Wait! What's happened? Looking at that plot, we don't seem to have improved the slope in the way we had hoped. It hasn't neatly divided the region between ladybirds and caterpillars.

Well, we got what we asked for. The line updates to give each desired value for y.

What's wrong with that? Well, if we keep doing this, updating for each training data example, all we get is that the final update simply matches the last training example closely. We might as well have not bothered with all previous training examples. In effect we are throwing away any learning that previous training examples might give us, and just learning from the last one.

How do we fix this?

Easy! And this is an important idea in machine learning. We moderate the updates. That is, we calm them down a bit. Instead of jumping enthusiastically to each new A, we take a fraction of the change ΔA, not all of it. This way we move in the direction that the training example suggests, but do so slightly cautiously, keeping some of the previous value which was arrived at through potentially many previous training iterations. We saw this idea of moderating our refinements before - with the simpler miles to kilometres predictor, where we nudged the parameter c as a fraction of the actual error.

This moderation has another very powerful and useful side effect. When the training data itself can't be trusted to be perfectly true, and contains errors or noise, both of which are normal in real world measurements, the moderation can dampen the impact of those errors or noise. It smooths them out.

Ok, let's rerun that again, but this time we'll add a moderation into the update formula:

ΔA = L (E / x)

The moderating factor is often called a learning rate, and we've called it L. Let's pick L = 0.5 as a reasonable fraction just to get started. It simply means we only update half as much as we would have done without moderation.

Running through that all again, we have an initial A = 0.25. The first training example gives us y = 0.25 * 3.0 = 0.75. A desired value of 1.1 gives us an error of 0.35. The ΔA = L (E / x) = 0.5 * 0.35 / 3.0 = 0.0583. The updated A is 0.25 + 0.0583 = 0.3083.

Trying out this new A on the training example at x = 3.0 gives y = 0.3083 * 3.0 = 0.9250. The line now falls on the wrong side of the training example because it is below 1.1 - but it's not a bad result if you consider it a first refinement step of many to come. It did move in the right direction away from the initial line.

Let's press on to the second training data example at x = 1.0. Using A = 0.3083 we have y = 0.3083 * 1.0 = 0.3083. The desired value was 2.9 so the error is (2.9 - 0.3083) = 2.5917. The ΔA = L (E / x) = 0.5 * 2.5917 / 1.0 = 1.2958. The even newer A is now 0.3083 + 1.2958 = 1.6042.

Let's visualise again the initial, improved and final lines, to see if moderating updates leads to a better dividing line between ladybird and caterpillar regions.

This is really good!

Even with these two simple training examples, and a relatively simple update method using a moderating learning rate, we have very rapidly arrived at a good dividing line y = Ax where A is 1.6042.

Let's not diminish what we've achieved. We've achieved an automated method of learning to classify from examples that is remarkably effective given the simplicity of the approach.

Brilliant!
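
Here is the whole of that training procedure as a short Python sketch, for readers who like to see the recipe written down as code. The numbers are the ones from the worked example above; the variable names are just illustrative.

```python
# Refine the slope A of the dividing line y = A * x using the
# moderated update rule: delta_A = L * (E / x).

A = 0.25                                   # initial slope
L = 0.5                                    # learning rate

training_data = [(3.0, 1.1), (1.0, 2.9)]   # (x, desired target t) pairs

for x, t in training_data:
    y = A * x                              # the line's current answer
    E = t - y                              # error against the desired target
    A = A + L * (E / x)                    # moderated update of the slope
    print(f"x = {x}: error = {E:.4f}, updated A = {A:.4f}")
```

Running this reproduces the updates worked out by hand: A moves from 0.25 to 0.3083 after the first example, and then to 1.6042 after the second.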


Key Points:

- We can use simple maths to understand the relationship between the output error of a linear classifier and the adjustable slope parameter. That is the same as knowing how much to adjust the slope to remove that output error.

- A problem with doing these adjustments naively is that the model is updated to best match the last training example only, discarding all previous training examples. A good way to fix this is to moderate the updates with a learning rate, so no single training example totally dominates the learning.

- Training examples from the real world can be noisy or contain errors. Moderating updates in this way helpfully limits the impact of these false examples.

Sometimes One Classifier Is Not Enough

The simple predictors and classifiers we've worked with so far - the ones that take some input, do some calculation, and throw out an answer - although pretty effective as we've just seen, are not enough to solve some of the more interesting problems we hope to apply neural networks to.

Here we'll illustrate the limit of a linear classifier with a simple but stark example. Why do we want to do this, and not jump straight to discussing neural networks? The reason is that a key design feature of neural networks comes from understanding this limit, so it's worth spending a little time on it.

We'll be moving away from garden bugs and looking at Boolean logic functions. If that sounds like mumbo jumbo jargon, don't worry. George Boole was a mathematician and philosopher, and his name is associated with simple functions like AND and OR.

Boolean logic functions are like language or thought functions. If we say "you can have your pudding only if you've eaten your vegetables AND if you're still hungry" we're using the Boolean AND function. The Boolean AND is only true if both conditions are true. It's not true if only one of them is true. So if I'm hungry, but haven't eaten my vegetables, then I can't have my pudding.

Similarly, if we say "you can play in the park if it's the weekend OR you're on annual leave from work" we're using the Boolean OR function. The Boolean OR is true if any, or all, of the conditions are true. They don't all have to be true like the Boolean AND function. So if it's not the weekend, but I have booked annual leave, I can indeed go and play in the park.

If we think back to our first look at functions, we saw them as a machine that took some inputs, did some work, and output an answer. Boolean logical functions typically take two inputs and output one answer:

Computers often represent true as the number 1, and false as the number 0. The following table shows the logical AND and OR functions using this more concise notation for all combinations of inputs A and B.

Input A    Input B    Logical AND    Logical OR

0          0          0              0

0          1          0              1

1          0          0              1

1          1          1              1

You can see quite clearly that the AND function is only true if both A and B are true. Similarly you can see that OR is true whenever any of the inputs A or B is true.

Boolean logic functions are really important in computer science, and in fact the earliest electronic computers were built from tiny electrical circuits that performed these logical functions. Even arithmetic was done using combinations of circuits which themselves were simple Boolean logic functions.

Imagine using a simple linear classifier to learn from training data whether the data was governed by a Boolean logic function. That's a natural and useful thing to do for scientists wanting to find causal links or correlations between some observations and others. For example, is there more malaria when it rains AND it is hotter than 35 degrees? Is there more malaria when either (Boolean OR) of these conditions is true?

Look at the following plot, showing the two inputs A and B to the logical function as coordinates on a graph. The plot shows that only when both are true, with value 1, is the output also true, shown as green. False outputs are shown red.

You can also see a straight line that divides the red from the green regions. That line is a linear function that a linear classifier could learn, just as we have done earlier.

We won't go through the numerical workings out as we did before, because they're not fundamentally different in this example.

In fact there are many variations on this dividing line that would work just as well, but the main point is that it is indeed possible for a simple linear classifier of the form y = ax + b to learn the Boolean AND function.

Now look at the Boolean OR function plotted in a similar way:

This time only the (0, 0) point is red, because it corresponds to both inputs A and B being false. All other combinations have at least one of A or B as true, and so the output is true. The beauty of the diagram is that it makes clear that it is possible for a linear classifier to learn the Boolean OR function, too.

There is another Boolean function called XOR, short for eXclusive OR, which only has a true output if either one of the inputs A or B is true, but not both. That is, when the inputs are both false, or both true, the output is false. The following table summarises this:

Input A    Input B    Logical XOR

0          0          0

0          1          1

1          0          1

1          1          0

Now look at a plot of this function on a grid with the outputs coloured:

This is a challenge! We can't seem to separate the red from the green regions with only a single straight dividing line.

It is, in fact, impossible to have a single straight line that successfully divides the red from the green regions for the Boolean XOR. That is, a simple linear classifier can't learn the Boolean XOR if presented with training data that was governed by the XOR function.

We've just illustrated a major limitation of the simple linear classifier. A simple linear classifier is not useful if the underlying problem is not separable by a straight line.

We want neural networks to be useful for the many many tasks where the underlying problem is not linearly separable - where a single straight line doesn't help.

So we need a fix.

Luckily the fix is easy. In fact the diagram below, which has two straight lines to separate out the different regions, suggests the fix - we use multiple classifiers working together. That's an idea central to neural networks. You can imagine already that many linear lines can start to separate off even unusually shaped regions for classification.
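
As a quick concrete check, the little Python sketch below compares a single linear boundary with a pair of boundaries on the four XOR input combinations. The particular lines chosen here (A + B = 1.5 on its own, and then the band between A + B = 0.5 and A + B = 1.5) are illustrative; no single straight line, wherever you put it, gets all four points right.

```python
# XOR truth table: output is 1 only when exactly one input is 1.
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def one_line(a, b):
    # a single linear classifier: which side of the line a + b = 1.5?
    return 1 if a + b > 1.5 else 0

def two_lines(a, b):
    # two linear classifiers working together: true only in the band
    # between the lines a + b = 0.5 and a + b = 1.5
    return 1 if 0.5 < a + b < 1.5 else 0

for (a, b), target in xor_data:
    print((a, b), "target:", target,
          "one line:", one_line(a, b),
          "two lines:", two_lines(a, b))
```

The single line misclassifies some of the points however you tilt or shift it, while the two lines together match the XOR column exactly.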



Before we dive into building neural networks made of many classifiers working together, let's go back to nature and look at the animal brains that inspired the neural network approach.

Key Points:

- A simple linear classifier can't separate data where that data itself isn't governed by a single linear process. For example, data governed by the logical XOR operator illustrates this.

- However the solution is easy: you just use multiple linear classifiers to divide up data that can't be separated by a single straight dividing line.

Neurons, Nature's Computing Machines

We said earlier that animal brains puzzled scientists, because even small ones like a pigeon brain were vastly more capable than digital computers with huge numbers of electronic computing elements, huge storage space, and all running at frequencies much faster than fleshy squishy natural brains.

Attention turned to the architectural differences. Traditional computers processed data very much sequentially, and in pretty exact concrete terms. There is no fuzziness or ambiguity about their cold hard calculations. Animal brains, on the other hand, although apparently running at much slower rhythms, seemed to process signals in parallel, and fuzziness was a feature of their computation.

Let's look at the basic unit of a biological brain - the neuron:

Neurons, although there are various forms of them, all transmit an electrical signal from one end to the other, from the dendrites along the axons to the terminals. These signals are then passed from one neuron to another. This is how your body senses light, sound, touch pressure, heat and so on. Signals from specialised sensory neurons are transmitted along your nervous system to your brain, which itself is mostly made of neurons too.

The following is a sketch of neurons in a pigeon's brain, made by a Spanish neuroscientist in 1889. You can see the key parts - the dendrites and the terminals.

How many neurons do we need to perform interesting, more complex, tasks?

Well, the very capable human brain has about 100 billion neurons! A fruit fly has about 100,000 neurons and is capable of flying, feeding, evading danger, finding food, and many more fairly complex tasks. This number, 100,000 neurons, is well within the realm of modern computers to try to replicate. A nematode worm has just 302 neurons, which is positively miniscule compared to today's digital computer resources! But that worm is able to do some fairly useful tasks that traditional computer programs of much larger size would struggle to do.

So what's the secret? Why are biological brains so capable, given that they are much slower and consist of relatively few computing elements when compared to modern computers? The complete functioning of brains, consciousness for example, is still a mystery, but enough is known about neurons to suggest different ways of doing computation, that is, different ways to solve problems.

So let's look at how a neuron works. It takes an electric input, and pops out another electrical signal. This looks exactly like the classifying or predicting machines we looked at earlier, which took an input, did some processing, and popped out an output.

So could we represent neurons as linear functions, just like we did before? Good idea, but no. A biological neuron doesn't produce an output that is simply a linear function of the input. That is, its output does not take the form output = (constant * input) + (maybe another constant).

Observations suggest that neurons don't react readily, but instead suppress the input until it has grown so large that it triggers an output. You can think of this as a threshold that must be reached before any output is produced. It's like water in a cup - the water doesn't spill over until it has first filled the cup. Intuitively this makes sense - the neurons don't want to be passing on tiny noise signals, only emphatically strong intentional signals. The following illustrates this idea of only producing an output signal if the input is sufficiently dialed up to pass a threshold.

A function that takes the input signal and generates an output signal, but takes into account some kind of threshold, is called an activation function. Mathematically, there are many such activation functions that could achieve this effect. A simple step function could do this:

You can see that for low input values, the output is zero. However once the threshold input is reached, the output jumps up. An artificial neuron behaving like this would be like a real biological neuron. The term used by scientists actually describes this well: they say that neurons fire when the input reaches the threshold.

We can improve on the step function. The S-shaped function shown below is called the sigmoid function. It is smoother than the cold hard step function, and this makes it more natural and realistic. Nature rarely has cold hard edges!

This smooth S-shaped sigmoid function is what we'll continue to use for making our own neural network. Artificial intelligence researchers will also use other, similar looking functions, but the sigmoid is simple and actually very common too, so we're in good company.

The sigmoid function, sometimes also called the logistic function, is

y = 1 / (1 + e^-x)

That expression isn't as scary as it first looks. The letter e is a mathematical constant 2.71828... It's a very interesting number that pops up in all sorts of areas of mathematics and physics, and the reason I've used the dots is because the decimal digits keep going on forever. Numbers like that have a fancy name, transcendental numbers. That's all very well and fun, but for our purposes you can just think of it as 2.71828. The input x is negated and e is raised to the power of that -x. The result is added to 1, so we have 1 + e^-x. Finally the inverse is taken of the whole thing, that is, 1 is divided by 1 + e^-x. That is what the mildly scary looking function above does to the input x to give us the output value y. So not so scary after all!

Just out of interest, when x is zero, e^-x is 1, because anything raised to a power of zero is 1. So y becomes 1 / (1 + 1), or simply 1/2, a half. So the basic sigmoid cuts the y axis at y = 1/2.
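
You can check these claims with a couple of lines of Python - a minimal sketch using only the standard library:

```python
from math import exp   # exp(x) computes e raised to the power x

def sigmoid(x):
    # the logistic function y = 1 / (1 + e^-x)
    return 1.0 / (1.0 + exp(-x))

print(sigmoid(0))      # 0.5 - the curve cuts the y axis at 1/2
print(sigmoid(-4))     # close to 0 for strongly negative inputs
print(sigmoid(4))      # close to 1 for strongly positive inputs
```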

There is another, very powerful reason for choosing this sigmoid function over the many many other S-shaped functions we could have used for a neuron's output. The reason is that this sigmoid function is much easier to do calculations with than other S-shaped functions, and we'll see why in practice later.

Let's get back to neurons, and consider how we might model an artificial neuron.

The first thing to realise is that real biological neurons take many inputs, not just one. We saw this when we had two inputs to the Boolean logic machine, so the idea of having more than one input is not new or unusual.

What do we do with all these inputs? We simply combine them by adding them up, and the resultant sum is the input to the sigmoid function which controls the output. This reflects how real neurons work. The following diagram illustrates this idea of combining inputs and then applying the threshold to the combined sum:

If the combined signal is not large enough then the effect of the sigmoid threshold function is to suppress the output signal. If the sum x is large enough, the effect of the sigmoid is to fire the neuron. Interestingly, if only one of the several inputs is large and the rest small, this may be enough to fire the neuron. What's more, the neuron can fire if some of the inputs are individually almost, but not quite, large enough, because when combined the signal is large enough to overcome the threshold. In an intuitive way, this gives you a sense of the more sophisticated, and in a sense fuzzy, calculations that such neurons can do.
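
In code, this artificial neuron so far is nothing more than "add up the inputs, then apply the sigmoid". The sketch below reuses the sigmoid() defined earlier; the input values are made up purely to illustrate the point about combined signals.

```python
def neuron_output(inputs):
    x = sum(inputs)      # combine all the incoming signals into one sum
    return sigmoid(x)    # squash the combined sum through the activation function

print(neuron_output([0.2]))            # one small signal barely moves the output
print(neuron_output([0.2, 0.9, 0.9]))  # several modest signals together fire the neuron harder
```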

The electrical signals are collected by the dendrites and these combine to form a stronger electrical signal. If the signal is strong enough to pass the threshold, the neuron fires a signal down the axon towards the terminals to pass on to the next neuron's dendrites. The following diagram shows several neurons connected in this way:

The thing to notice is that each neuron takes input from many before it, and also provides signals to many more, if it happens to be firing.

One way to replicate this from nature in an artificial model is to have layers of neurons, with each connected to every other one in the preceding and subsequent layer. The following diagram illustrates this idea:


You can see the three layers, each with three artificial neurons, or nodes. You can also see each node connected to every other node in the preceding and next layers.

That's great! But what part of this cool looking architecture does the learning? What do we adjust in response to training examples? Is there a parameter that we can refine like the slope of the linear classifiers we looked at earlier?

The most obvious thing is to adjust the strength of the connections between nodes. Within a node, we could have adjusted the summation of the inputs, or we could have adjusted the shape of the sigmoid threshold function, but that's more complicated than simply adjusting the strength of the connections between the nodes.

If the simpler approach works, let's stick with it! The following diagram again shows the connected nodes, but this time a weight is shown associated with each connection. A low weight will de-emphasise a signal, and a high weight will amplify it.


It's worth explaining the funny little numbers next to the weight symbols. The weight w2,3 is simply the weight associated with the signal that passed between node 2 in a layer and node 3 in the next layer. So w1,2 is the weight that diminishes or amplifies the signal between node 1 and node 2 in the next layer. To illustrate the idea, the following diagram shows these two connections between the first and second layer highlighted.


You might reasonably challenge this design and ask yourself why each node should connect to every other node in the previous and next layer. They don't have to, and you could connect them in all sorts of creative ways. We don't because the uniformity of this full connectivity is actually easier to encode as computer instructions, and because there shouldn't be any big harm in having a few more connections than the absolute minimum that might be needed for solving a specific task. The learning process will de-emphasise those few extra connections if they aren't actually needed.

What do we mean by this? It means that as the network learns to improve its outputs by refining the link weights inside the network, some weights become zero or close to zero. Zero, or almost zero, weights means those links don't contribute to the network because signals don't pass. A zero weight means the signals are multiplied by zero, which results in zero, so the link is effectively broken.

Key Points:

- Biological brains seem to perform sophisticated tasks like flight, finding food, learning language, and evading predators, despite appearing to have much less storage, and running much slower, than modern computers.

- Biological brains are also incredibly resilient to damage and imperfect signals compared to traditional computer systems.

- Biological brains, made of connected neurons, are the inspiration for artificial neural networks.

Following Signals Through A Neural Network

That picture of 3 layers of neurons, with each neuron connected to every other in the next and previous layers, looks pretty amazing.

But the idea of calculating how signals progress from the inputs through the layers to become the outputs seems a little daunting and, well, too much like hard work!

I'll agree that it is hard work, but it is also important to illustrate it working, so we always know what is really happening inside a neural network, even if we later use a computer to do all the work for us. So we'll try doing the workings out with a smaller neural network with only 2 layers, each with 2 neurons, as shown below:

Let's imagine the two inputs are 1.0 and 0.5. The following shows these inputs entering this smaller neural network.


Just as before, each node turns the sum of the inputs into an output using an activation function. We'll also use the sigmoid function y = 1 / (1 + e^-x) that we saw before, where x is the sum of incoming signals to a neuron, and y is the output of that neuron.

What about the weights? That's a very good question - what value should they start with? Let's go with some random weights:

w1,1 = 0.9
w1,2 = 0.2
w2,1 = 0.3
w2,2 = 0.8

Random starting values aren't such a bad idea, and it is what we did when we chose an initial slope value for the simple linear classifiers earlier on. The random value got improved with each example that the classifier learned from. The same should be true for neural network link weights.

There are only four weights in this small neural network, as that's all the combinations for connecting the 2 nodes in each layer. The following diagram shows all these numbers now marked.


Let's start calculating!

The first layer of nodes is the input layer, and it doesn't do anything other than represent the input signals. That is, the input nodes don't apply an activation function to the input. There is not really a fantastic reason for this other than that's how history took its course. The first layer of neural networks is the input layer and all that layer does is represent the inputs - that's it.

The first input layer was easy - no calculations to be done there.

Next is the second layer, where we do need to do some calculations. For each node in this layer we need to work out the combined input. Remember that sigmoid function, y = 1 / (1 + e^-x)? Well, the x in that function is the combined input into a node. That combination was the raw outputs from the connected nodes in the previous layer, but moderated by the link weights. The following diagram is like the one we saw previously, but now includes the need to moderate the incoming signals with the link weights.


So let's first focus on node 1 in layer 2. Both nodes in the first input layer are connected to it. Those input nodes have raw values of 1.0 and 0.5. The link from the first node has a weight of 0.9 associated with it. The link from the second has a weight of 0.3. So the combined moderated input is:

x = (output from first node * link weight) + (output from second node * link weight)

x = (1.0 * 0.9) + (0.5 * 0.3)

x = 0.9 + 0.15

x = 1.05

If we didn't moderate the signal, we'd have a very simple addition of the signals 1.0 + 0.5, but we don't want that. It is the weights that do the learning in a neural network, as they are iteratively refined to give better and better results.

So, we've now got x = 1.05 for the combined moderated input into the first node of the second layer. We can now, finally, calculate that node's output using the activation function y = 1 / (1 + e^-x). Feel free to use a calculator to do this. The answer is y = 1 / (1 + 0.3499) = 1 / 1.3499. So y = 0.7408.

That's great work! We now have an actual output from one of the network's two output nodes.

Let's do the calculation again with the remaining node, which is node 2 in the second layer. The combined moderated input x is:

x = (output from first node * link weight) + (output from second node * link weight)

x = (1.0 * 0.2) + (0.5 * 0.8)

x = 0.2 + 0.4

x = 0.6

So now we have x, we can calculate the node's output using the sigmoid activation function: y = 1 / (1 + 0.5488) = 1 / 1.5488. So y = 0.6457.

The following diagram shows the network's outputs we've just calculated:

That was a fair bit of work just to get two outputs from a very simplified network. I wouldn't want to do the calculations for a larger network by hand at all! Luckily computers are perfect for doing lots of calculations quickly and without getting bored.
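
Just to show there is no magic involved, here is the calculation we did by hand written as a small Python sketch, reusing the sigmoid() function from earlier. The weights and inputs are the ones from this example.

```python
inputs = [1.0, 0.5]        # the two input signals

# w[i][j] is the weight from input node i+1 to second-layer node j+1,
# so w[0][1] is w1,2 = 0.2 and w[1][0] is w2,1 = 0.3
w = [[0.9, 0.2],
     [0.3, 0.8]]

for j in range(2):         # for each node in the second layer
    x = inputs[0] * w[0][j] + inputs[1] * w[1][j]   # combined moderated input
    print(f"node {j+1}: x = {x:.2f}, output = {sigmoid(x):.4f}")
```

This prints x = 1.05 with output 0.7408 for the first node, and x = 0.60 with output 0.6457 for the second, matching the hand calculation.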

Even so, I don't want to write out computer instructions for doing the calculation for a network with more than 2 layers, and with maybe 4, 8 or even 100s of nodes in each layer. Even writing out the instructions would get boring, and I'd very likely make mistakes writing all those instructions for all those nodes and all those layers, never mind actually doing the calculations.

Luckily mathematics helps us be extremely concise in writing down the calculations needed to work out the output from a neural network, even one with many more layers and nodes. This conciseness is not just good for us human readers, but also great for computers too - because the instructions are much shorter and the execution is much more efficient too.

This concise approach uses matrices, which we'll look at next.

Matrix Multiplication is Useful .. Honest!

Matrices have a terrible reputation. They evoke memories of teeth-grindingly boring, laborious and apparently pointless hours spent at school doing matrix multiplication.

Previously we manually did the calculations for a 2 layer network with just 2 nodes in each layer. That was enough work, but imagine doing the same for a network with 5 layers and 100 nodes in each? Just writing out all the necessary calculations would be a huge task - all those combinations of combining signals, multiplied by the right weights, applying the sigmoid activation function, for each node, for each layer - argh!

So how can matrices help? Well, they help us in two ways. First, they allow us to compress writing all those calculations into a very simple short form. That's great for us humans, because we don't like doing a lot of work because it's boring, and we're prone to errors anyway. The second benefit is that many computer programming languages understand working with matrices, and because the real work is repetitive, they can recognise that and do it very quickly and efficiently.

In short, matrices allow us to express the work we need to do concisely and easily, and computers can get the calculations done quickly and efficiently.

Now we know why we're going to look at matrices, despite perhaps a painful experience with them at school, let's make a start and demystify them.

A matrix is just a table, a rectangular grid, of numbers. That's it. There's nothing much more complex about a matrix than that.

If you've used spreadsheets, you're already comfortable with working with numbers arranged in a grid. Some would call it a table. We can call it a matrix too. The following shows a spreadsheet with a table of numbers.

That's all a matrix is - a table or a grid of numbers - just like the following example of a matrix of size 2 by 3.

It is convention to use rows first then columns, so this isn't a "3 by 2" matrix, it is a "2 by 3" matrix.

Also, some people use square brackets around matrices, and others use round brackets like we have.

Actually, they don't have to be numbers, they could be quantities which we give a name to, but may not have assigned an actual numerical value to. So the following is a matrix, where each element is a variable which has a meaning and could have a numerical value, but we just haven't said what it is yet.

Now, matrices become useful to us when we look at how they are multiplied. You may remember how to do this from school, but if not we'll look at it again.

Here's an example of two simple matrices multiplied together.

You can see that we don't simply multiply the corresponding elements. The top left of the answer isn't 1 * 5, and the bottom right isn't 4 * 8.

Instead, matrices are multiplied using different rules. You may be able to work them out by looking at the above example. If not, have a look at the following, which highlights how the top left element of the answer is worked out.

You can see the top left element is worked out by following the top row of the first matrix, and the left column of the second matrix. As you follow these rows and columns, you multiply the elements you encounter and keep a running total. So to work out the top left element of the answer, we start to move along the top row of the first matrix where we find the number 1, and as we start to move down the left column of the second matrix we find the number 5. We multiply 1 and 5 and keep the answer, which is 5, with us. We continue moving along the row and down the column to find the numbers 2 and 7. Multiplying 2 and 7 gives us 14, which we keep too. We've reached the end of the rows and columns, so we add up all the numbers we kept aside, which is 5 + 14, to give us 19. That's the top left element of the result matrix.

That's a lot of words, but it is easier to see it in action. Have a go yourself. The following explains how the bottom right element is worked out.

You can see again that, following the corresponding row and column of the element we're trying to calculate (the second row and second column, in this example), we have (3 * 6) and (4 * 8), which is 18 + 32 = 50.

For completeness, the bottom left is calculated as (3 * 5) + (4 * 7) = 15 + 28 = 43. Similarly, the top right is calculated as (1 * 6) + (2 * 8) = 6 + 16 = 22.

The following illustrates the rule using variables, rather than numbers.

This is just another way of explaining the approach we take to multiplying matrices. By using letters, which could be any number, we've made even clearer the generic approach to multiplying matrices. It's a generic approach because it could be applied to matrices of different sizes too.

When we said it works for matrices of different sizes, there is an important limit. You can't just multiply any two matrices, they need to be compatible. You may have seen this already as you followed the rows across the first matrix, and the columns down the second. If the number of elements in the rows doesn't match the number of elements in the columns then the method doesn't work. So you can't multiply a 2 by 2 matrix by a 5 by 5 matrix. Try it - you'll see why it doesn't work. To multiply matrices, the number of columns in the first must be equal to the number of rows in the second.
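
If you'd like a computer to verify the worked example above, here's a minimal sketch using numpy, a Python library we'll meet properly in part 2 of this guide:

```python
import numpy

A = numpy.array([[1, 2],
                 [3, 4]])
B = numpy.array([[5, 6],
                 [7, 8]])

# numpy.dot() applies the row-by-column rule described above
print(numpy.dot(A, B))
# [[19 22]
#  [43 50]]
```

Try numpy.dot() on incompatible shapes, say a 2 by 2 and a 5 by 5 matrix, and numpy raises an error, just as the rule says it must.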

In some guides, you'll see this kind of matrix multiplication called a dot product or an inner product. There are actually different kinds of multiplication possible for matrices, such as a cross product, but the dot product is the one we want here.

Why have we gone down what looks like a rabbit hole of dreaded matrix multiplication and distasteful algebra? There is a very good reason .. hang in there!

Look what happens if we replace the letters with words that are more meaningful to our neural networks. The second matrix is a two by one matrix, but the multiplication approach is the same.


Magic!

The first matrix contains the weights between nodes of two layers. The second matrix contains the signals of the first input layer. The answer we get by multiplying these two matrices is the combined moderated signal into the nodes of the second layer. Look carefully, and you'll see this. The first node has the first input_1 moderated by the weight w1,1, added to the second input_2 moderated by the weight w2,1. These are the values of x before the sigmoid activation function is applied.

The following diagram shows this even more clearly.

This is really very useful!

Why? Because we can express all the calculations that go into working out the combined moderated signal, x, into each node of the second layer using matrix multiplication. And this can be expressed as concisely as:

X = W.I

That is, W is the matrix of weights, I is the matrix of inputs, and X is the resultant matrix of combined moderated signals into layer 2. Matrices are often written in bold to show that they are in fact matrices and don't just represent single numbers.

We now don't need to care so much about how many nodes there are in each layer. If we have more nodes, the matrices will just be bigger. But we don't need to write anything longer or larger. We can simply write W.I, even if I has 2 elements or 200 elements!

Now, if a computer programming language can understand matrix notation, it can do all the hard work of many calculations to work out the X = W.I, without us having to give it all the individual instructions for each node in each layer.

This is fantastic! A little bit of effort to understand matrix multiplication has given us a powerful tool for implementing neural networks without lots of effort from us.

What about the activation function? That's easy and doesn't need matrix multiplication. All we need to do is apply the sigmoid function y = 1 / (1 + e^-x) to each individual element of the matrix X.

This sounds too simple, but it is correct because we're not combining signals from different nodes here - we've already done that, and the answers are in X. As we saw earlier, the activation function simply applies a threshold and squishes the response to be more like that seen in biological neurons. So the final output from the second layer is:

O = sigmoid(X)

That O written in bold is a matrix, which contains all the outputs from the final layer of the neural network.

The expression X = W.I applies to the calculations between one layer and the next. If we have 3 layers, for example, we simply do the matrix multiplication again, using the outputs of the second layer as inputs to the third layer, but of course combined and moderated using more weights.
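
Those two short expressions really are all a computer needs for each layer. As a sketch, assuming numpy for the matrix multiplication and scipy's ready-made expit() function as the sigmoid, a layer of any size becomes a couple of lines:

```python
import numpy
from scipy.special import expit   # the sigmoid, applied element by element

def layer_output(W, I):
    # O = sigmoid(W.I) for one layer, whatever its size
    X = numpy.dot(W, I)           # combined moderated signals, X = W.I
    return expit(X)               # apply the activation to every element of X
```

The same two lines serve a layer with 2 nodes or 200; only the matrices grow, the code doesn't.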

Enough theory - let's see how it works with a real example, but this time we'll use a slightly larger neural network of 3 layers, each with 3 nodes.


Key Points:

- The many calculations needed to feed a signal forward through a neural network can be expressed as matrix multiplication.

- Expressing it as matrix multiplication makes it much more concise for us to write, no matter the size of neural network.

- More importantly, some computer programming languages understand matrix calculations, and recognise that the underlying calculations are very similar. This allows them to do these calculations more efficiently and quickly.

A Three Layer Example with Matrix Multiplication

We haven't worked through feeding signals through a neural network using matrices to do the calculations. We also haven't worked through an example with more than 2 layers, which is interesting because we need to see how we treat the outputs of the middle layer as inputs to the final third layer.

The following diagram shows an example neural network with 3 layers, each with 3 nodes. To keep the diagram clear, not all the weights are marked.


We'll introduce some of the commonly used terminology here too. The first layer is the input layer, as we know. The final layer is the output layer, as we also know. The middle layer is called the hidden layer. That sounds mysterious and dark, but sadly there isn't a mysterious dark reason for it. The name just stuck because the outputs of the middle layer are not necessarily made apparent as outputs, so are "hidden". Yes, that's a bit lame, but there really isn't a better reason for the name.

Let's work through that example network, illustrated in that diagram. We can see the three inputs are 0.9, 0.1 and 0.8. So the input matrix I is:

That was easy. That's the first input layer done, because that's all the input layer does - it merely represents the input.

Next is the middle hidden layer. Here we need to work out the combined (and moderated) signals to each node in this middle layer. Remember each node in this middle hidden layer is connected to every node in the input layer, so it gets some portion of each input signal. We don't want to go through the many calculations like we did earlier, we want to try this matrix method.

As we just saw, the combined and moderated inputs into this middle layer are X = W.I, where I is the matrix of input signals, and W is the matrix of weights. We have I, but what is W? Well, the diagram shows some of the (made up) weights for this example, but not all of them. The following shows all of them, again made up randomly. There's nothing particularly special about them in this example.

You can see the weight between the first input node and the first node of the middle hidden layer is w_1,1 = 0.9, just as in the network diagram above. Similarly you can see the weight for the link between the second node of the input and the second node of the hidden layer is w_2,2 = 0.8, as shown in the diagram. The diagram doesn't show the link between the third input node and the first hidden layer node, which we made up as w_3,1 = 0.1.

But wait, why have we written "input_hidden" next to that W? It's because W_input_hidden are the weights between the input and hidden layers. We need another matrix of weights for the links between the hidden and output layers, and we can call it W_hidden_output.

The following shows this second matrix W_hidden_output with the weights filled in as before. Again you should be able to see, for example, the link between the third hidden node and the third output node as w_3,3 = 0.9.

Great, we've got the weight matrices sorted.

Let's get on with working out the combined moderated input into the hidden layer. We should also give it a descriptive name so we know it refers to the combined input to the middle layer and not the final layer. Let's call it X_hidden.

X_hidden = W_input_hidden . I

We're not going to do the entire matrix multiplication here, because that was the whole point of using matrices. We want computers to do the laborious number crunching. The answer is worked out as shown below.

I used a computer to work this out, and we'll learn to do this together using the Python programming language in part 2 of this guide. We won't do it now as we don't want to get distracted by computer software just yet.

So we have the combined moderated inputs into the middle hidden layer, and they are 1.16, 0.42 and 0.62. And we used matrices to do the hard work for us. That's an achievement to be proud of!

Let's visualise these combined moderated inputs into the second hidden layer.

So far so good, but there's more to do. You'll remember those nodes apply a sigmoid activation function to make the response to the signal more like those found in nature. So let's do that:

O_hidden = sigmoid(X_hidden)

The sigmoid function is applied to each element in X_hidden to produce the matrix which has the output of the middle hidden layer.

Let's just check the first element to be sure. The sigmoid function is y = 1 / (1 + e^-x), so when x = 1.16, e^-1.16 is 0.3135. That means y = 1 / (1 + 0.3135) = 0.761.

You can also see that all the values are between 0 and 1, because this sigmoid doesn't produce values outside that range. Look back at the graph of the logistic function to see this visually.

Phew! Let's pause again and see what we've done. We've worked out the signal as it passes through the middle layer. That is, the outputs from the middle layer. Which, just to be super clear, are the combined inputs into the middle layer which then have the activation function applied. Let's update the diagram with this new information.

If this was a two layer neural network, we'd stop now as these are the outputs from the second layer. We won't stop because we have another third layer.

How do we work out the signal through the third layer? It's the same approach as the second layer, there isn't any real difference. We still have incoming signals into the third layer, just as we did coming into the second layer. We still have links with weights to moderate those signals. And we still have an activation function to make the response behave like those we see in nature. So the thing to remember is, no matter how many layers we have, we can treat each layer like any other, with incoming signals which we combine, link weights to moderate those incoming signals, and an activation function to produce the output from that layer. We don't care whether we're working on the 3rd or 53rd or even the 103rd layer, the approach is the same.

So let's crack on and calculate the combined moderated input into this final layer, X = W.I, just as we did before.

The inputs into this layer are the outputs from the second layer we just worked out, O_hidden. And the weights are those for the links between the second and third layers, W_hidden_output, not those we just used between the first and second. So we have:

X_output = W_hidden_output . O_hidden

So working this out just in the same way gives the following result for the combined moderated inputs into the final output layer.

The updated diagram now shows our progress feeding forward the signal from the initial input right through to the combined inputs to the final layer.

All that remains is to apply the sigmoid activation function, which is easy.

That's it! We have the final outputs from the neural network. Let's show this on the diagram too.

So the final output of the example neural network with three layers is 0.726, 0.708 and 0.778.

We've successfully followed the signal from its initial entry into the neural network, through the layers, and out of the final output layer.
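Here is a small Python sketch of the whole feed forward journey we just made, again using numpy and scipy ahead of part 2. The text only quotes a few of the made-up weights (0.9, 0.8 and 0.1 for the first matrix, 0.9 for the second), so take the full matrices below as an assumption, filled in to reproduce the answers worked out above.

import numpy
import scipy.special

I = numpy.array([[0.9], [0.1], [0.8]])          # the three input signals

W_input_hidden = numpy.array([[0.9, 0.3, 0.4],
                              [0.2, 0.8, 0.2],
                              [0.1, 0.5, 0.6]])
W_hidden_output = numpy.array([[0.3, 0.7, 0.5],
                               [0.6, 0.5, 0.2],
                               [0.8, 0.1, 0.9]])

X_hidden = numpy.dot(W_input_hidden, I)         # 1.16, 0.42, 0.62
O_hidden = scipy.special.expit(X_hidden)        # sigmoid: 0.761, 0.603, 0.650
X_output = numpy.dot(W_hidden_output, O_hidden)
O_output = scipy.special.expit(X_output)        # 0.726, 0.708, 0.778
print(O_output)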

What now?

The next step is to use the output from the neural network and compare it with the training example to work out an error. We need to use that error to refine the neural network itself so that it improves its outputs.

This is probably the most difficult thing to understand, so we'll take it gently and illustrate the ideas as we go.

Learning Weights From More Than One Node
Previously we refined a simple linear classifier by adjusting the slope parameter of the node's linear function. We used the error, the difference between what the node produced as an answer and what we know the answer should be, to guide that refinement. That turned out to be quite easy because the relationship between the error and the necessary slope adjustment was very simple to work out.

How do we update link weights when more than one node contributes to an output and its error? The following illustrates this problem.

Things were much simpler when we just had one node feeding into an output node. If we have two nodes, how do we use that output error?

It doesn't make sense to use all the error to update only one weight, because that ignores the other link and its weight. That error was there because more than one link contributed to it.

There is a tiny chance that only one link of many was responsible for the error, but that chance is extremely small. And if we did change a weight that was already correct, making it worse, it would get improved during the next iterations, so all is not lost.

One idea is to split the error equally amongst all contributing nodes, as shown next.
Another idea is to split the error, but not to do it equally. Instead we give more of the error to those contributing connections which had greater link weights. Why? Because they contributed more to the error. The following diagram illustrates this idea.

Here there are two nodes contributing a signal to the output node. The link weights are 3.0 and 1.0. If we split the error in a way that is proportionate to these weights, we can see that 3/4 of the output error should be used to update the first, larger, weight, and 1/4 of the error for the second, smaller, weight.

We can extend this same idea to many more nodes. If we had 100 nodes connected to an output node, we'd split the error across the 100 connections to that output node in proportion to each link's contribution to the error, indicated by the size of the link's weight.

You can see that we're using the weights in two ways. Firstly we use the weights to propagate signals forward from the input to the output layers in a neural network. We worked on this extensively before. Secondly we use the weights to propagate the error backwards from the output back into the network. You won't be surprised why the method is called backpropagation.

If the output layer had 2 nodes, we'd do the same for the second output node. That second output node will have its own error, which is similarly split across the connecting links. Let's look at this next.
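A two line check of that proportionate split, with the link weights 3.0 and 1.0 from the example above and a made-up output error of 0.8:

w1, w2 = 3.0, 1.0
e = 0.8                          # a made-up output error

print(e * w1 / (w1 + w2))        # 3/4 of the error, 0.6, for the larger weight
print(e * w2 / (w1 + w2))        # 1/4 of the error, 0.2, for the smaller weight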

Backpropagating Errors From More Output Nodes
The following diagram shows a simple network with 2 input nodes, but now with 2 output nodes.

Both output nodes can have an error, in fact that's extremely likely when we haven't trained the network. You can see that both of these errors need to inform the refinement of the internal link weights in the network. We can use the same approach as before, where we split an output node's error amongst the contributing links, in a way that's proportionate to their weights.

The fact that we have more than one output node doesn't really change anything. We simply repeat for the second output node what we already did for the first one. Why is this so simple? It is simple because the links into an output node don't depend on the links into another output node. There is no dependence between these two sets of links.

Looking at that diagram again, we've labelled the error at the first output node as e_1. Remember this is the difference between the desired output provided by the training data t_1 and the actual output o_1. That is, e_1 = (t_1 - o_1). The error at the second output node is labelled e_2.

You can see from the diagram that the error e_1 is split in proportion to the connected links, which have weights w_11 and w_21. Similarly, e_2 would be split in proportion to weights w_12 and w_22.

Let's write out what these splits are, so we're not in any doubt. The error e_1 is used to inform the refinement of both weights w_11 and w_21. It is split so that the fraction of e_1 used to update w_11 is

w_11 / (w_11 + w_21)

Similarly, the fraction of e_1 used to refine w_21 is

w_21 / (w_11 + w_21)

These fractions might look a bit puzzling, so let's illustrate what they do. Behind all these symbols is the very simple idea that the error e_1 is split to give more to the weight that is larger, and less to the weight that is smaller.

If w_11 is twice as large as w_21, say w_11 = 6 and w_21 = 3, then the fraction of e_1 used to update w_11 is 6 / (6 + 3) = 6/9 = 2/3. That should leave 1/3 of e_1 for the other smaller weight w_21, which we can confirm using the expression 3 / (6 + 3) = 3/9, which is indeed 1/3.

If the weights were equal, the fractions will both be half, as you'd expect. Let's see this just to be sure. Let's say w_11 = 4 and w_21 = 4, then the fraction is 4 / (4 + 4) = 4/8 = 1/2 for both cases.

Before we go further, let's pause and take a step back and see what we've done from a distance. We knew we needed to use the error to guide the refinement of some parameter inside the network, in this case the link weights. We've just seen how to do that for the link weights which moderate signals into the final output layer of a neural network. We've also seen that there isn't a complication when there is more than one output node, we just do the same thing for each output node. Great!

The next question to ask is what happens when we have more than 2 layers? How do we update the link weights in the layers further back from the final output layer?

Backpropagating Errors To More Layers
The following diagram shows a simple neural network with 3 layers, an input layer, a hidden layer and the final output layer.

Working back from the final output layer at the right hand side, we can see that we use the errors in that output layer to guide the refinement of the link weights feeding into the final layer. We've labelled the output errors more generically as e_output and the weights of the links between the hidden and output layer as w_ho. We worked out the specific errors associated with each link by splitting the error in proportion to the size of the weights themselves.

By showing this visually, we can see what we need to do for the new additional layer. We simply take those errors associated with the output of the hidden layer nodes e_hidden, and split those again proportionately across the preceding links between the input and hidden layers, w_ih. The next diagram shows this logic.

If we had even more layers, we'd repeatedly apply this same idea to each layer working backwards from the final output layer. The flow of error information makes intuitive sense. You can see again why this is called error backpropagation.

If we first used the error in the output of the output layer nodes, e_output, what error do we use for the hidden layer nodes, e_hidden? This is a good question to ask because a node in the middle hidden layer doesn't have an obvious error. We know from feeding forward the input signals that, yes, each node in the hidden layer does indeed have a single output. You'll remember that was the activation function applied to the weighted sum of the inputs to that node. But how do we work out the error?

We don't have the target or desired outputs for the hidden nodes. We only have the target values for the final output layer nodes, and these come from the training examples. Let's look at that diagram above again for inspiration! That first node in the hidden layer has two links emerging from it to connect it to the two output layer nodes. We know we can split the output error along each of these links, just as we did before. That means we have some kind of error for each of the two links that emerge from this middle layer node. We could recombine these two link errors to form the error for this node, as a second best approach, because we don't actually have a target value for the middle layer node. The following shows this idea visually.

You can see more clearly what's happening, but let's go through it again just to be sure. We need an error for the hidden layer nodes so we can use it to update the weights in the preceding layer. We call these e_hidden. But we don't have an obvious answer to what they actually are. We can't say the error is the difference between the desired target output from those nodes and the actual outputs, because our training data examples only give us targets for the very final output nodes.

The training data examples only tell us what the outputs from the very final nodes should be. They don't tell us what the outputs from nodes in any other layer should be. This is the core of the puzzle.

We could recombine the split errors for the links, using the error backpropagation we just saw earlier. So the error in the first hidden node is the sum of the split errors in all the links connecting forward from that same node. In the diagram above, we have a fraction of the output error e_output,1 on the link with weight w_11 and also a fraction of the output error e_output,2 from the second output node on the link with weight w_12.

So let's write this down.

e_hidden,1 = e_output,1 * w_11 / (w_11 + w_21)  +  e_output,2 * w_12 / (w_12 + w_22)

It helps to see all this theory in action, so the following illustrates the propagation of errors back into a simple 3 layer network with actual numbers.

Let's follow one error back. You can see the error 0.5 at the second output layer node being split proportionately into 0.1 and 0.4 across the two connected links which have weights 1.0 and 4.0. You can also see that the recombined error at the second hidden layer node is the sum of the connected split errors, which here are 0.9 and 0.4, to give 1.3.
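The numbers from that illustration are easy to check with a few lines of Python. Only the second output node's split is fully stated in the text; the 0.9 share of the first output node's error is taken as given, read off the diagram.

e2 = 0.5                                 # error at the second output node
w_a, w_b = 1.0, 4.0                      # its two link weights

print(e2 * w_a / (w_a + w_b))            # 0.1, the share for the first hidden node
print(e2 * w_b / (w_a + w_b))            # 0.4, the share for the second hidden node

# the second hidden node recombines its two incoming split errors
print(0.9 + 0.4)                         # 1.3, as in the diagram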

The next diagram shows the same idea applied to the preceding layer, working further back.

Key Points:

- Neural networks learn by refining their link weights. This is guided by the error, the difference between the right answer given by the training data and their actual output.

- The error at the output nodes is simply the difference between the desired and actual output.

- However, the error associated with internal nodes is not obvious. One approach is to split the output layer errors in proportion to the size of the connected link weights, and then recombine these bits at each internal node.

Backpropagating Errors with Matrix Multiplication
Can we use matrix multiplication to simplify all that laborious calculation? It helped earlier when we were doing loads of calculations to feed forward the input signals.

To see if error backpropagation can be made more concise using matrix multiplication, let's write out the steps using symbols. By the way, this is called trying to vectorise the process.

Being able to express a lot of calculations in matrix form makes it more concise for us to write down, and also allows computers to do all that work much more efficiently because they take advantage of the repetitive similarities in the calculations that need to be done.

The starting point is the errors that emerge from the neural network at the final output layer. Here we only have two nodes in the output layer, so these are e_1 and e_2.

Next we want to construct the matrix for the hidden layer errors. That might sound hard, so let's do it bit by bit. The first bit is the first node in the hidden layer. If you look at the diagrams above again, you can see that the first hidden node's error has two paths contributing to it from the output layer. Along these paths come the error signals e_1 * w_11 / (w_11 + w_21) and e_2 * w_12 / (w_12 + w_22). Now look at the second hidden layer node and we can again see two paths contributing to its error, e_1 * w_21 / (w_21 + w_11) and e_2 * w_22 / (w_22 + w_12). We've already seen how these expressions are worked out earlier.

So we have the following matrix for the hidden layer. It's a bit more complex than I'd like.

error_hidden = [ e_1 * w_11 / (w_11 + w_21)  +  e_2 * w_12 / (w_12 + w_22) ]
               [ e_1 * w_21 / (w_21 + w_11)  +  e_2 * w_22 / (w_22 + w_12) ]

It would be great if this could be rewritten as a very simple multiplication of matrices we already have available to us. These are the weight, forward signal and output error matrices. Remember the benefits are huge if we can do this.

Sadly we can't easily turn this into a super simple matrix multiplication like we could with the feeding forward of signals we did earlier. Those fractions in that big busy matrix above are hard to untangle! It would have been helpful if we could neatly split that busy matrix into a simple combination of the available matrices.


What can we do? We still really really want those benefits of using matrix multiplications to do the calculations efficiently.

Time to get a bit naughty!

Have a look again at that expression above. You can see that the most important thing is the multiplication of the output errors e_n with the linked weights w_ij. The larger the weight, the more of the output error is carried back to the hidden layer. That's the important bit. The bottom of those fractions is a kind of normalising factor. If we ignored that factor, we'd only lose the scaling of the errors being fed back. That is, e_1 * w_11 / (w_11 + w_21) would become the much simpler e_1 * w_11.

If we did that, then the matrix multiplication is easy to spot. Here it is:

error_hidden = [ w_11  w_12 ] . [ e_1 ]
               [ w_21  w_22 ]   [ e_2 ]

That weight matrix is like the one we constructed before, but has been flipped along a diagonal line so that the top right is now at the bottom left, and the bottom left is at the top right. This is called transposing a matrix, and is written as W^T.

Here are two examples of transposing a matrix of numbers, so we can see clearly what happens. You can see this works even when the matrix has a different number of rows from columns.

So we have what we wanted, a matrix approach to propagating the errors back:

error_hidden = W^T_hidden_output . error_output

This is great, but did we do the right thing cutting out that normalising factor? It turns out that this simpler feedback of the error signals works just as well as the more sophisticated one we worked out earlier. This book's blog has a post showing the results of several different ways of backpropagating the error. If our simpler way works really well, we'll keep it!

If we want to think about this more, we can see that even if overly large or small errors are fed back, the network will correct itself during the next iterations of learning. The important thing is that the errors being fed back respect the strength of the link weights, because that is the best indication we have of sharing the blame for the error.

We've done a lot of work, a huge amount!
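In Python with numpy, this simplified backpropagation really is a one-liner. The weights and errors below are made up for illustration.

import numpy

W_hidden_output = numpy.array([[2.0, 3.0],
                               [1.0, 4.0]])    # made-up link weights
errors_output = numpy.array([[0.8], [0.5]])    # made-up output errors

# the transpose .T flips the weight matrix so the errors flow backwards
errors_hidden = numpy.dot(W_hidden_output.T, errors_output)
print(errors_hidden)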


Key Points:

- Backpropagating the error can be expressed as a matrix multiplication.

- This allows us to express it concisely, irrespective of network size, and also allows computer languages that understand matrix calculations to do the work more efficiently and quickly.

- This means both feeding signals forward and error backpropagation can be made efficient using matrix calculations.

Take a well deserved break, because the next and final theory section is really very cool but does require a fresh brain.

How Do We Actually Update Weights?

We've not yet attacked the very central question of updating the link weights in a neural network. We've been working to get to this point, and we're almost there. We have just one more key idea to understand before we can unlock this secret.

So far, we've got the errors propagated back to each layer of the network. Why did we do this? Because the error is used to guide how we adjust the link weights to improve the overall answer given by the neural network. This is basically what we were doing way back with the linear classifier at the start of this guide.

But these nodes aren't simple linear classifiers. These slightly more sophisticated nodes sum the weighted signals into the node and apply the sigmoid threshold function. So how do we actually update the weights for links that connect these more sophisticated nodes? Why can't we use some fancy algebra to directly work out what the weights should be?

We can't do fancy algebra to work out the weights directly because the maths is too hard. There are just too many combinations of weights, and too many functions of functions of functions ... being combined when we feed forward the signal through the network. Think about even a small neural network with 3 layers and 3 neurons in each layer, like we had above. How would you tweak a weight for a link between the first input node and the second hidden node so that the third output node increased its output by, say, 0.5? Even if we did get lucky, the effect could be ruined by tweaking another weight to improve a different output node. You can see this isn't trivial at all.

To see how untrivial, just look at the following horrible expression showing an output node's output as a function of the inputs and the link weights, for a simple 3 layer neural network with 3 nodes in each layer. The input at node i is x_i, the weights for links connecting input node i to hidden node j is w_i,j, similarly the output of hidden node j is x_j, and the weights for links connecting hidden node j to output node k is w_j,k. That funny symbol Σ_a^b means sum the subsequent expression for all values between a and b.

o_k = sigmoid( Σ_j=1..3 w_j,k * sigmoid( Σ_i=1..3 w_i,j * x_i ) )

Yikes! Let's not untangle that.

Instead of trying to be too clever, we could just simply try random combinations of weights until we find a good one?

That's not always such a crazy idea when we're stuck with a hard problem. The approach is called a brute force method. Some people use brute force methods to try to crack passwords, and it can work if your password is an English word and not too long, because there aren't too many for a fast home computer to work through. Now, imagine that each weight could have 1000 possibilities between -1 and +1, like 0.501, -0.203 and 0.999 for example. Then for a 3 layer neural network with 3 nodes in each layer, there are 18 weights, so we have 18,000 possibilities to test. If we have a more typical neural network with 500 nodes in each layer, we have 500 million weight possibilities to test. If each set of combinations took 1 second to calculate, this would take us 16 years to update the weights after just one training example! A thousand training examples, and we'd be at 16,000 years!

You can see that the brute force approach isn't practical at all. In fact it gets worse very quickly as we add network layers, nodes or possibilities for weight values.

This puzzle resisted mathematicians for years, and was only really solved in a practical way as late as the 1960s and 70s. There are different views on who did it first or made the key breakthrough, but the important point is that this late discovery led to the explosion of modern neural networks which can carry out some very impressive tasks.

So how do we solve such an apparently hard problem? Believe it or not, you've already got the tools to do it yourself. We've covered all of them earlier. So let's get on with it.

The first thing we must do is embrace pessimism.

The mathematical expressions showing how all the weights result in a neural network's output are too complex to easily untangle. The weight combinations are too many to test one by one to find the best.

There are even more reasons to be pessimistic. The training data might not be sufficient to properly teach a network. The training data might have errors, so our assumption that it is the perfect truth, something to learn from, is then flawed. The network itself might not have enough layers or nodes to model the right solution to the problem.

What this means is we must take an approach that is realistic, and recognises these limitations. If we do that, we might find an approach which isn't mathematically perfect but does actually give us better results because it doesn't make false idealistic assumptions.


Let's illustrate what we mean by this. Imagine a very complex landscape with peaks and troughs, and hills with treacherous bumps and gaps. It's dark and you can't see anything. You know you're on the side of a hill and you need to get to the bottom. You don't have an accurate map of the entire landscape. You do have a torch. What do you do? You'll probably use the torch to look at the area close to your feet. You can't use it to see much further anyway, and certainly not the entire landscape. You can see which bit of earth seems to be going downhill and take small steps in that direction. In this way, you slowly work your way down the hill, step by step, without having a full map and without having worked out a journey beforehand.

The mathematical version of this approach is called gradient descent, and you can see why.

After you've taken a step, you look again at the surrounding area to see which direction takes you closer to your objective, and then you step again in that direction. You keep doing this until you're happy you've arrived at the bottom. The gradient refers to the slope of the ground. You step in the direction where the slope is steepest downwards.

Now imagine that complex landscape is a mathematical function. What this gradient descent method gives us is an ability to find the minimum without actually having to understand that complex function enough to work it out mathematically. If a function is so difficult that we can't easily find the minimum using algebra, we can use this method instead. Sure, it might not give us the exact answer, because we're using steps to approach an answer, improving our position bit by bit. But that is better than not having an answer at all. Anyway, we can keep refining the answer with ever smaller steps towards the actual minimum, until we're happy with the accuracy we've achieved.

What's the link between this really cool gradient descent method and neural networks? Well, if the complex difficult function is the error of the network, then going downhill to find the minimum means we're minimising the error. We're improving the network's output. That's what we want!


Let's look at this gradient descent idea with a super simple example so we can understand it properly.

The following graph shows a simple function y = (x - 1)^2 + 1. If this was a function where y was the error, we would want to find the x which minimises it. For a moment, pretend this wasn't an easy function but instead a complex difficult one.

To do gradient descent we have to start somewhere. The graph shows our randomly chosen starting point. Like the hill climber, we look around the place we're standing and see which direction is downwards. The slope is marked on the graph and in this case is a negative gradient. We want to follow the downward direction, so we move along x to the right. That is, we increase x a little. That's our hill climber's first step. You can see that we've improved our position and moved closer to the actual minimum.

Let's imagine we started somewhere else, as shown in the next graph.


This time, the slope beneath our feet is positive, so we move to the left. That is, we decrease x a little. Again you can see we've improved our position by moving closer to the actual true minimum. We can keep doing this until our improvements are so small that we can be satisfied that we've arrived at the minimum.

A necessary refinement is to change the size of the steps we take, to avoid overshooting the minimum and forever bouncing around it. You can imagine that if we've arrived 0.5 metres from the true minimum but can only take 2 metre steps, then we're going to keep missing the minimum, as every step we take in the direction of the minimum will overshoot. If we moderate the step size so it is proportionate to the size of the gradient, then when we are close we'll take smaller steps. This assumes that as we get closer to a minimum the slope does indeed get shallower. That's not a bad assumption at all for most smooth continuous functions. It wouldn't be a good assumption for crazy zigzaggy functions with jumps and gaps, which mathematicians call discontinuities.

The following illustrates this idea of moderating step size as the function gradient gets smaller, which is a good indicator of how close we are to a minimum.


By the way, did you notice that we increase x in the opposite direction to the gradient? A positive gradient means we reduce x. A negative gradient means we increase x. The graphs make this clear, but it is easy to forget and get it the wrong way around.

When we did this gradient descent, we didn't work out the true minimum using algebra because we pretended the function y = (x - 1)^2 + 1 was too complex and difficult. Even if we couldn't work out the slope exactly using mathematical precision, we could estimate it, and you can see this would still work quite well in moving us in the correct general direction.
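Here is a tiny Python sketch of that descent, with a made-up starting point and a made-up step moderation factor of 0.3 playing the role of the learning rate we'll meet shortly.

# gradient descent on y = (x - 1)**2 + 1, whose slope is 2(x - 1)
def slope(x):
    return 2 * (x - 1)

x = 3.5                        # randomly chosen starting point
for step in range(20):
    x = x - 0.3 * slope(x)     # step against the gradient
    print(x)                   # x creeps towards the minimum at x = 1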

This method really shines when we have functions of many parameters. So not just y depending on x, but maybe y depending on a, b, c, d, e and f. Remember the output function, and therefore the error function, of a neural network depends on many, many weight parameters. Often hundreds of them!

The following again illustrates gradient descent but with a slightly more complex function that depends on 2 parameters. This can be represented in 3 dimensions, with the height representing the value of the function.


You may be looking at that 3 dimensional surface and wondering whether gradient descent ends up in that other valley also shown at the right. In fact, thinking more generally, doesn't gradient descent sometimes get stuck in the wrong valley, because some complex functions will have many valleys? What's the wrong valley? It's a valley which isn't the lowest. The answer to this is yes, that can happen.

To avoid ending up in the wrong valley, or function minimum, we train neural networks several times starting from different points on the hill to ensure we don't always end up in the wrong valley. Different starting points means choosing different starting parameters, and in the case of neural networks this means choosing different starting link weights.

The following illustrates three different goes at gradient descent, with one of them ending up trapped in the wrong valley.

Let's pause and collect our thoughts.


Key Points:

- Gradient descent is a really good way of working out the minimum of a function, and it really works well when that function is so complex and difficult that we couldn't easily work it out mathematically using algebra.

- What's more, the method still works well when there are many parameters, something that causes other methods to fail or become impractical.

- This method is also resilient to imperfections in the data, we don't go wildly wrong if the function isn't quite perfectly described or we accidentally take a wrong step occasionally.

The output of a neural network is a complex difficult function with many parameters, the link weights, which influence its output. So can we use gradient descent to work out the right weights? Yes, as long as we pick the right error function.

The output function of a neural network itself isn't an error function. But we know we can turn it into one easily, because the error is the difference between the target training values and the actual output values.

There's something to watch out for here. Look at the following table of training and actual values for three output nodes, together with candidates for an error function.

Network   Target   Error               Error               Error
output    output   (target - actual)   |target - actual|   (target - actual)^2

0.4       0.5       0.1                0.1                 0.01

0.8       0.7      -0.1                0.1                 0.01

1.0       1.0       0                  0                   0

              Sum:  0                  0.2                 0.02


The first candidate for an error function is simply (target - actual). That seems reasonable enough, right? Well, if you look at the sum over the nodes to get an overall figure for how well the network is trained, you'll see the sum is zero!

What happened? Clearly the network isn't perfectly trained, because the first two node outputs are different to the target values. The sum of zero suggests there is no error. This happens because the positive and negative errors cancel each other out. Even if they didn't cancel out completely, you can see this is a bad measure of error.

Let's correct this by taking the absolute value of the difference. That means ignoring the sign, and is written |target - actual|. That could work, because nothing can ever cancel out. The reason this isn't popular is because the slope isn't continuous near the minimum and this makes gradient descent not work so well, because we can bounce around the V-shaped valley that this error function has. The slope doesn't get smaller closer to the minimum, so our steps don't get smaller, which means they risk overshooting.

The third option is to take the square of the difference, (target - actual)^2. There are several reasons why we prefer this third one over the second one, including the following:

- The algebra needed to work out the slope for gradient descent is easy enough with this squared error.

- The error function is smooth and continuous, making gradient descent work well. There are no gaps or abrupt jumps.

- The gradient gets smaller nearer the minimum, meaning the risk of overshooting the objective gets smaller if we use it to moderate the step sizes.

Is there a fourth option? Yes, you can construct all kinds of complicated and interesting cost functions. Some don't work well at all, some work well for particular kinds of problems, and some do work but aren't worth the extra complexity.
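You can reproduce the table's three candidate error sums with a few lines of Python:

targets = [0.5, 0.7, 1.0]
outputs = [0.4, 0.8, 1.0]

print(sum(t - o for t, o in zip(targets, outputs)))       # effectively 0, misleading
print(sum(abs(t - o) for t, o in zip(targets, outputs)))  # 0.2
print(sum((t - o)**2 for t, o in zip(targets, outputs)))  # 0.02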

Right, we're on the final lap now!

To do gradient descent, we now need to work out the slope of the error function with respect to the weights. This requires calculus. You may already be familiar with calculus, but if you're not, or just need a reminder, the Appendix contains a gentle introduction. Calculus is simply a mathematically precise way of working out how something changes when something else does. For example, how the length of a spring changes as the force used to stretch it changes. Here we're interested in how the error function depends on the link weights inside a neural network. Another way of asking this is, how sensitive is the error to changes in the link weights?

Let's start with a picture, because that always helps keep us grounded in what we're trying to achieve.

The graph is just like the one we saw before, to emphasise that we're not doing anything different. This time the function we're trying to minimise is the neural network's error. The parameter we're trying to refine is a network link weight. In this simple example we've only shown one weight, but we know neural networks will have many more.

The next diagram shows two link weights, and this time the error function is a 3 dimensional surface which varies as the two link weights vary. You can see we're trying to minimise the error, which is now more like a mountainous landscape with a valley.


It's harder to visualise that error surface as a function of many more parameters, but the idea to use gradient descent to find the minimum is still the same.

Let's write out mathematically what we want.

∂E / ∂w_jk

That is, how does the error E change as the weight w_jk changes? That's the slope of the error function that we want to descend towards the minimum.

Before we unpack that expression, let's focus for the moment only on the link weights between the hidden and the final output layers. The following diagram shows this area of interest highlighted. We'll come back to the link weights between the input and hidden layers later.

We'll keep referring back to this diagram to make sure we don't forget what each symbol really means as we do the calculus. Don't be put off by it, the steps aren't difficult and will be explained, and all of the concepts needed have already been covered earlier.

First, let's expand that error function, which is the sum of the differences between the target and actual values squared, and where that sum is over all the n output nodes.

E = Σ_n (t_n - o_n)^2


All we've done here is write out what the error function E actually is.

We can simplify this straight away by noticing that the output at a node n, which is o_n, only depends on the links that connect to it. That means for a node k, the output o_k only depends on weights w_jk, because those weights are for links into node k.

Another way of looking at this is that the output of a node k does not depend on weights w_jb, where b does not equal k, because there is no link connecting them. The weight w_jb is for a link connecting to output node b, not k.

This means we can remove all the o_n from that sum except the one that the weight w_jk links to, that is o_k. This removes that pesky sum totally! A nice trick worth keeping in your back pocket.

If you've had your coffee, you may have realised this means the error function didn't need to sum over all the output nodes in the first place. We've seen the reason is that the output of a node only depends on the connected links and hence their weights. This is often glossed over in many texts which simply state the error function without explaining it.

Anyway, we have a simpler expression.

∂E / ∂w_jk = ∂ / ∂w_jk (t_k - o_k)^2

Now, we'll do a bit of calculus. Remember, you can refer to the Appendix if you're unfamiliar with differentiation.

That t_k part is a constant, and so doesn't vary as w_jk varies. That is, t_k isn't a function of w_jk. If you think about it, it would be really strange if the truth examples which provide the target values did change depending on the weights! That leaves the o_k part, which we know does depend on w_jk, because the weights are used to feed forward the signal to become the outputs o_k.

We'll use the chain rule to break apart this differentiation task into more manageable pieces. Again, refer to the Appendix for an introduction to the chain rule.

∂E / ∂w_jk = (∂E / ∂o_k) * (∂o_k / ∂w_jk)

Now we can attack each simpler bit in turn. The first bit is easy, as we're taking a simple derivative of a squared function. This gives us the following.

∂E / ∂o_k = -2 (t_k - o_k)

The second bit needs a bit more thought, but not too much. That o_k is the output of the node k which, if you remember, is the sigmoid function applied to the weighted sum of the connected incoming signals. So let's write that out to make it clear.

o_k = sigmoid( Σ_j w_jk * o_j )

That o_j is the output from the previous hidden layer node, not the output o_k from the final layer.

How do we differentiate the sigmoid function? We could do it the long hard way, using the fundamental ideas in the Appendix, but others have already done that work. We can just use the well known answer, just like mathematicians all over the world do every day.

∂ sigmoid(x) / ∂x = sigmoid(x) * (1 - sigmoid(x))

Some functions turn into horrible expressions when you differentiate them. This sigmoid has a nice simple and easy to use result. It's one of the reasons the sigmoid is popular for activation functions in neural networks.

So let's apply this cool result to get the following.

∂E / ∂w_jk = -2 (t_k - o_k) * sigmoid( Σ_j w_jk o_j ) * (1 - sigmoid( Σ_j w_jk o_j )) * ∂( Σ_j w_jk o_j ) / ∂w_jk


What's that extra last bit? It's the chain rule again, applied to the sigmoid derivative, because the expression inside the sigmoid() function also needs to be differentiated with respect to w_jk. That too is easy, and the answer is simply o_j.

Before we write down the final answer, let's get rid of that 2 at the front. We can do that because we're only interested in the direction of the slope of the error function, so we can descend it. It doesn't matter if there is a constant factor of 2, 3 or even 100 in front of that expression, as long as we're consistent about which one we stick to. So let's get rid of it to keep things simple.

Here's the final answer we've been working towards, the one that describes the slope of the error function so we can adjust the weight w_jk.

∂E / ∂w_jk = - (t_k - o_k) * sigmoid( Σ_j w_jk o_j ) * (1 - sigmoid( Σ_j w_jk o_j )) * o_j

Phew! We did it!

That's the magic expression we've been looking for. The key to training neural networks.

It's worth a second look, and the colour coding helps show each part. The first part is simply the (target - actual) error we know so well. The sum expression inside the sigmoids is simply the signal into the final layer node, we could have called it i_k to make it look simpler. It's just the signal into a node before the activation squashing function is applied. That last part is the output from the previous hidden layer node j. It is worth viewing these expressions in these terms, because you get a feel for what is physically involved in that slope, and ultimately the refinement of the weights.

That's a fantastic result and we should be really pleased with ourselves. Many people find getting to this point really hard.


One almost final bit of work to do. That expression we slaved over is for refining the weights between the hidden and output layers. We now need to finish the job and find a similar error slope for the weights between the input and hidden layers.

We could do loads of algebra again, but we don't have to. We simply use that physical interpretation we just did and rebuild an expression for the new set of weights we're interested in. So this time:

- The first part, which was the (target - actual) error, now becomes the recombined backpropagated error out of the hidden nodes, just as we saw above. Let's call that e_j.

- The sigmoid parts can stay the same, but the sum expressions inside refer to the preceding layers, so the sum is over all the inputs moderated by the weights into a hidden node j. We could call this i_j.

- The last part is now the output of the first layer of nodes o_i, which happen to be the input signals.

This nifty way of avoiding lots of work is simply taking advantage of the symmetry in the problem to construct a new expression. We say it's simple, but it is a very powerful technique, wielded by some of the most big brained mathematicians and scientists. You can certainly impress your mates with this!

So the second part of the final answer we've been striving towards is as follows, the slope of the error function for the weights between the input and hidden layers.

∂E / ∂w_ij = - e_j * sigmoid( Σ_i w_ij o_i ) * (1 - sigmoid( Σ_i w_ij o_i )) * o_i

We've now got all these crucial magic expressions for the slope, and we can use them to update the weights after each training example as follows.

Remember the weights are changed in a direction opposite to the gradient, as we saw clearly in the diagrams earlier. We also moderate the change by using a learning factor, which we can tune for a particular problem. We saw this too when we developed linear classifiers, as a way to avoid being pulled too far wrong by bad training examples, but also to ensure the weights don't bounce around a minimum by constantly overshooting it. Let's say this in mathematical form.

new w_jk = old w_jk - α * (∂E / ∂w_jk)


The updated weight w_jk is the old weight adjusted by the negative of the error slope we just worked out. It's negative because we want to increase the weight if we have a positive slope, and decrease it if we have a negative slope, as we saw earlier. The symbol alpha, α, is a factor which moderates the strength of these changes to make sure we don't overshoot. It's often called a learning rate.

This expression applies to the weights between the input and hidden layers too, not just to those between the hidden and output layers. The difference will be the error gradient, for which we have the two expressions above.

Before we can leave this, we need to see what these calculations look like if we try to do them as matrix multiplications. To help us, we'll do what we did before, which is to write out what each element of the weight change matrix should be.

Δw_jk = e_k * o_k * (1 - o_k) * o_j

I've left out the learning rate α as that's just a constant and doesn't really change how we organise our matrix multiplication.

The matrix of weight changes contains values which will adjust the weight w_j,k linking node j in one layer with the node k in the next. You can see that the first bit of the expression uses values from the next layer (node k), and the last bit uses values from the previous layer (node j).

You might need to stare at the picture above for a while to see that the last part, the horizontal matrix with only a single row, is the transpose of the outputs from the previous layer O_j. The colour coding shows that the dot product is the right way around. If you're not sure, try writing out the dot product with these the other way around and you'll see it doesn't work.

So, the matrix form of these weight update matrices is as follows, ready for us to implement in a computer programming language that can work with matrices efficiently.

ΔW_jk = α * E_k * O_k * (1 - O_k) . O_j^T

That's it! Job done.

Key Points:

- A neural network's error is a function of the internal link weights.

- Improving a neural network means reducing this error, by changing those weights.

- Choosing the right weights directly is too difficult. An alternative approach is to iteratively improve the weights by descending the error function, taking small steps. Each step is taken in the direction of the greatest downward slope from your current position. This is called gradient descent.

- That error slope is possible to calculate using calculus that isn't too difficult.
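As a preview of part 2, this matrix form translates almost directly into numpy. The errors, outputs and learning rate below are made up for illustration.

import numpy

lr = 0.3                                      # learning rate, alpha
errors = numpy.array([[0.8], [0.5]])          # errors at this layer, E_k
outputs = numpy.array([[0.6], [0.7]])         # sigmoided outputs, O_k
outputs_prev = numpy.array([[0.4], [0.9]])    # outputs of the layer before, O_j

# delta W = alpha * E_k * O_k * (1 - O_k) . O_j^T
delta_w = lr * numpy.dot(errors * outputs * (1.0 - outputs), outputs_prev.T)
print(delta_w)    # one change per link between the two layers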

Weight Update Worked Example
Let's work through a couple of examples with numbers, just to see this weight update method working.

The following network is the one we worked with before, but this time we've added example output values from the first hidden node o_j=1 and the second hidden node o_j=2. These are just made up numbers to illustrate the method and aren't worked out properly by feeding forward signals from the input layer.

We want to update the weight w_11 between the hidden and output layers, which currently has the value 2.0.

Let's write out the error slope again.

∂E / ∂w_jk = - (t_k - o_k) * sigmoid( Σ_j w_jk o_j ) * (1 - sigmoid( Σ_j w_jk o_j )) * o_j

Let's do this bit by bit:

- The first bit (t_k - o_k) is the error e_1 = 1.5, just as we saw before.

- The sum inside the sigmoid functions Σ_j w_jk * o_j is (2.0 * 0.4) + (4.0 * 0.5) = 2.8.

- The sigmoid 1 / (1 + e^-2.8) is then 0.943. That middle expression is then 0.943 * (1 - 0.943) = 0.054.

- The last part is simply o_j, which is o_j=1 because we're interested in the weight w_11, where j = 1. Here it is simply 0.4.

Multiplying all these three bits together, and not forgetting the minus sign at the start, gives us -0.0324.

If we have a learning rate of 0.1, that gives us a change of - (0.1 * -0.0324) = +0.003. So the new w_11 is the original 2.0 plus 0.003 = 2.003.

This is quite a small change, but over many hundreds or thousands of iterations the weights will eventually settle down to a configuration so that the well trained neural network produces outputs that reflect the training examples.
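The whole calculation fits in a few lines of Python, which also lets us double check the arithmetic:

import math

e = 1.5                              # the error at the output node
x = (2.0 * 0.4) + (4.0 * 0.5)        # the sum into the sigmoid, 2.8
s = 1.0 / (1.0 + math.exp(-x))       # sigmoid, 0.943

slope = -e * s * (1 - s) * 0.4       # about -0.0324
print(2.0 - 0.1 * slope)             # the new w_11, about 2.003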

Preparing Data
In this section we're going to consider how we might best prepare the training data, prepare the initial random weights, and even design the outputs, to give the training process a good chance of working.

Yes, you read that right! Not all attempts at using neural networks will work well, for many reasons. Some of those reasons can be addressed by thinking about the training data, the initial weights, and designing a good output scheme. Let's look at each in turn.

Inputs
Have a look at the diagram below of the sigmoid activation function. You can see that if the inputs are large, the activation function gets very flat.

A very flat activation function is problematic, because we use the gradient to learn new weights. Look back at that expression for the weight changes. It depends on the gradient of the activation function. A tiny gradient means we've limited the ability to learn. This is called saturating a neural network. That means we should try to keep the inputs small.

Interestingly, that expression also depends on the incoming signal o_j, so we shouldn't make it too small either. Very very tiny values can be problematic too, because computers can lose accuracy when dealing with very very small or very very large numbers.

A good recommendation is to rescale inputs into the range 0.0 to 1.0. Some will add a small offset to the inputs, like 0.01, just to avoid having zero inputs, which are troublesome because they kill the learning ability by zeroing the weight update expression by setting that o_j = 0.
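As a sketch, if the raw inputs were values between 0 and 255, like the image pixels we'll meet in part 2, the rescaling could look like this:

import numpy

raw = numpy.array([0.0, 128.0, 255.0])

# squeeze into 0.01 to 1.0: scale to 0.0 - 0.99, then add the 0.01 offset
scaled = (raw / 255.0 * 0.99) + 0.01
print(scaled)        # roughly [0.01, 0.507, 1.0]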

Outputs
The outputs of a neural network are the signals that pop out of the last layer of nodes. If we're using an activation function that can't produce a value above 1.0, then it would be silly to try to set larger values as training targets. Remember that the logistic function doesn't even get to 1.0, it just gets ever closer to it. Mathematicians call this asymptotically approaching 1.0.

The following diagram makes clear that output values larger than 1.0 and below zero are simply not possible from the logistic activation function.

If we do set target values in these inaccessible, forbidden ranges, the network training will drive ever larger weights in an attempt to produce larger and larger outputs which can never actually be produced by the activation function. We know that's bad, as that saturates the network.

So we should rescale our target values to match the outputs possible from the activation function, taking care to avoid the values which are never really reached.

It is common to use a range of 0.0 to 1.0, but some do use a range of 0.01 to 0.99, because both 0.0 and 1.0 are impossible targets and risk driving overly large weights.

Random Initial Weights
The same argument applies here as with the inputs and outputs. We should avoid large initial weights because they cause large signals into an activation function, leading to the saturation we just talked about, and the reduced ability to learn better weights.

We could choose initial weights randomly and uniformly from a range -1.0 to +1.0. That would be a much better idea than using a very large range, say -1000 to +1000.

Can we do better? Probably.

Mathematicians and computer scientists have done the maths to work out a rule of thumb for setting the random initial weights given specific shapes of networks and with specific activation functions. That's a lot of specifics! Let's carry on anyway.

We won't go into the details of that working out, but the core idea is that if we have many signals into a node, which we do in a neural network, and these signals are already well behaved and not too large or crazily distributed, the weights should support keeping those signals well behaved as they are combined and the activation function applied. In other words, we don't want the weights to undermine the effort we put into carefully scaling the input signals. The rule of thumb these mathematicians arrive at is that the weights are initialised by randomly sampling from a range that is roughly the inverse of the square root of the number of links into a node. So if each node has 3 links into it, the initial weights should be in the range ±1/√3 = ±0.577. If each node has 100 incoming links, the weights should be in the range ±1/√100 = ±0.1.

Intuitively this makes sense. Some overly large initial weights would bias the activation function in one direction, and very large weights would saturate the activation functions. And the more links we have into a node, the more signals are being added together. So a rule of thumb that reduces the weight range if there are more links makes sense.

If you are already familiar with the idea of sampling from probability distributions, this rule of thumb is actually about sampling from a normal distribution with mean zero and a standard deviation which is the inverse of the square root of the number of links into a node. But let's not worry too much about getting this precisely right, because that rule of thumb assumes quite a few things which may not be true, such as an activation function like the alternative tanh() and a specific distribution of the input signals.

The following diagram summarises visually both the simple approach, and the more sophisticated approach with a normal distribution.
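A sketch of both approaches with numpy, for a node with 100 incoming links:

import numpy

links = 100

# simple approach: uniform over the range -1/sqrt(links) to +1/sqrt(links)
bound = 1.0 / numpy.sqrt(links)
w_simple = numpy.random.uniform(-bound, bound, links)

# more sophisticated: normal distribution, mean 0, sd 1/sqrt(links)
w_normal = numpy.random.normal(0.0, pow(links, -0.5), links)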


Whatever you do, don't set the initial weights to the same constant value, and especially not zero. That would be bad!

It would be bad because each node in the network would receive the same signal value, and the output out of each output node would be the same. If we then proceeded to update the weights in the network by backpropagating the error, the error would have to be divided equally. You'll remember the error is split in proportion to the weights. That would lead to equal weight updates, leading again to another set of equal valued weights. This symmetry is bad, because if the properly trained network should have unequal weights (extremely likely for almost all problems) then you'd never get there.

Zero weights are even worse because they kill the input signal. The weight update function, which depends on the incoming signals, is zeroed. That kills the ability to update the weights completely.

There are many other things you can do to refine how you prepare your input data, how you set your weights, and how you organise your desired outputs. For this guide, the above ideas are both easy enough to understand and also have a decent effect, so we'll stop there.


Key Points:

- Neural networks don't work well if the input, output and initial weight data is not prepared to match the network design and the actual problem being solved.

- A common problem is saturation, where large signals, sometimes driven by large weights, lead to signals that are at the very shallow slopes of the activation function. This reduces the ability to learn better weights.

- Another problem is zero value signals or weights. These also kill the ability to learn better weights.

- The internal link weights should be random and small, avoiding zero. Some will use more sophisticated rules, for example, reducing the size of these weights if there are more links into a node.

- Inputs should be scaled to be small, but not zero. A common range is 0.01 to 0.99, or -1.0 to +1.0, depending on which better matches the problem.

- Outputs should be within the range of what the activation function can produce. Values below 0 or above 1, inclusive, are impossible for the logistic sigmoid. Setting training targets outside the valid range will drive ever larger weights, leading to saturation. A good range is 0.01 to 0.99.


Part 2 - DIY with Python

To really understand something,
you need to make it yourself.

Start small, then grow.
In this section we'll make our own neural network.

We'll use a computer because, as you know from earlier, there will be many thousands of calculations to do. Computers are good at doing many calculations very quickly, without getting bored or losing accuracy.

We will tell a computer what to do using instructions that it can understand. Computers find human languages like English, French or Spanish hard to understand precisely and without ambiguity. In fact humans have trouble with precision and ambiguity when communicating with each other, so computers have very little hope of doing better!

Python
We'll be using a computer language called Python. Python is a good language to start with because it is easy to learn. It's also easy to read and understand someone else's Python instructions. It is also very popular, and used in many different areas, including scientific research, teaching, global scale infrastructures, as well as data analytics and artificial intelligence. Python is increasingly being taught in schools, and the extremely popular Raspberry Pi has made Python accessible to even more people, including children and students.

The Appendix includes a guide to setting up a Raspberry Pi Zero to do all the work we've covered in this guide to make your own neural network with Python. A Raspberry Pi Zero is an especially inexpensive small computer, and currently costs about £4 or $5. That's not a typo, it really does cost £4.

There is a lot that you can learn about Python, or any other computer language, but here we'll remain focussed on making our own neural network, and only learn just enough Python to achieve this.

Interactive Python = IPython
Rather than go through the error prone steps of setting up Python for your computer, and all the various extensions that help do mathematics and plot images, we're going to use a pre-packaged solution, called IPython.

IPython contains the Python programming language and several common numerical and data plotting extensions, including the ones we'll need. IPython also has the advantage of presenting interactive notebooks, which behave much like pen and paper notepads, ideal for trying out ideas and seeing the results, and then changing some of your ideas again, easily and without fuss. We avoid worrying about program files, interpreters and libraries, which can distract us from what we're actually trying to do, especially when they don't work as expected.

The ipython.org site gives you some options for where you can get pre-packaged IPython. I'm using the Anaconda package from www.continuum.io/downloads, shown below.

That site may have changed its appearance since I took that snapshot, so don't be put off. First go to the section for your computer, which might be a Windows computer, or an Apple Mac with OS X, or a Linux computer. Under the section for your computer, make sure you get the Python 3.5 version and not the 2.7.

Adoption of Python 3 is gathering pace, and is the future. Python 2.7 is well established, but we need to move to the future and start using Python 3 whenever we can, especially for new projects. Most computers have 64-bit brains, so make sure you get that version. Only computers roughly more than 10 years old are likely to need the legacy 32-bit version.

Follow the instructions on that site to get it installed on your computer. Installing IPython should be really easy, because it's designed to be easy, and shouldn't cause any problems.

A Very Gentle Start with Python
We'll assume you now have access to IPython, having followed the instructions for getting it installed.

Notebooks
Once we've fired it up and clicked "New Notebook", we're presented with an empty notebook like the following.

The notebook is interactive, meaning it waits for you to ask it to do something, does it, and then presents back your answer, and waits again for your next instruction or question. It's like a robot butler with a talent for arithmetic that never gets tired.

If you want to do something that is even mildly complicated, it makes sense to break it down into sections. This makes it easy to organise your thinking, and also to find which part of the big project went wrong. For IPython, these sections are called cells. The above IPython notebook has an initial empty cell, and you may be able to see the typing caret blinking, waiting for you to type your instructions into it.

Let's instruct the computer! Let's ask it to multiply two numbers, say 2 times 3. Let's type "2*3" into the cell and click the run cell button that looks like an audio play button. Don't include the speech marks. The computer should quickly work out what you mean by this, and present the result back to you as follows.

You can see the answer 6 is correctly presented. We've just issued our first instruction to a computer and successfully received a correct result. Our first computer program!

Don't be distracted by IPython labelling your question as "In [1]" and its answer as "Out [1]". That's just its way of reminding you what you asked (input) and what it replied with (output). The numbers are the sequence in which you asked and it responded, useful for keeping track if you find yourself jumping around your notebook adjusting and reissuing your instructions.

Simple Python
We really meant it when we said Python was an easy computer language. In the next ready cell, labelled "In [ ]", type the following code and click play. The word code is widely used to refer to instructions written in a computer language. If you find that moving the pointer to click the play button is too cumbersome, like I do, you can use the keyboard shortcut ctrl-enter instead.

print("Hello World!")

You should get a response which simply prints the phrase "Hello World!" as follows.

You can see that issuing the second instruction to print "Hello World!" didn't remove the previous cell with its instruction and output answer. This is useful when slowly building up a solution of several parts.

Now let's see what's going on with the following code, which introduces a key idea. Enter and run it in a new cell. If there is no new empty cell, click the button that looks like a plus sign, labelled "Insert Cell Below".

x = 10
print(x)
print(x + 5)

y = x + 7
print(y)

print(z)

The first line x = 10 looks like a mathematical statement which says x is 10. In Python this means that x is set to 10, that is, the value 10 is placed in a virtual box called x. It's as simple as the following diagram shows.

That 10 stays there until further notice. We shouldn't be surprised by the print(x), because we used the print instruction before. It should print the value of x, which is 10. Why doesn't it just print the letter "x"? Because Python is always eager to evaluate whatever it can, and x can be evaluated to the value 10, so it prints that. The next line print(x + 5) evaluates x + 5, which is 10 + 5 or 15, so we expect it to print 15.

The next bit y = x + 7 shouldn't be difficult to work out, if we follow this idea that Python evaluates whatever it can. We've told it to assign a value to a new box labelled y, but what value? The expression is x + 7, which is 10 + 7, or 17. So y holds the value 17, and the next line should print it.

What happens with the line print(z) when we haven't assigned a value to z, like we have with x and y? We get an error message which is polite and tells us about the error of our ways, trying to be as helpful as possible so we can fix it. I have to say, most computer languages have error messages which try to be helpful but don't always succeed.

The following shows the results of the above code, including the helpful polite error message, "name 'z' is not defined".


These boxes with labels like x and y, which hold values like 10 and 17, are called variables. Variables in computer languages are used to make a set of instructions generic, just like mathematicians use expressions like x and y to make general statements.

Automating Work
Computers are great for doing similar tasks many times. They don't mind, and they're very quick compared to humans with calculators!

Let's see if we can get a computer to print the first ten squared numbers, starting with 0 squared, 1 squared, then 2 squared and so on. We expect to see an output of something like 0, 1, 4, 9, 16, 25, and so on.

We could just do the calculation ourselves, and have a set of instructions like print(0), print(1), print(4), and so on. This would work, but we would have failed to get the computer to do the calculation for us. More than that, we would have missed the opportunity to have a generic set of instructions to print the squares of numbers up to any specified value. To do this we need to pick up a few more new ideas, so we'll take it gently.

Issue the following code into the next ready cell and run it.

list(range(10))

You should get a list of ten numbers, from 0 up to 9. This is great because we got the computer to do the work to create the list, we didn't have to do it ourselves. We are the master and the computer is our servant!

You may have been surprised that the list was from 0 to 9, and not from 1 to 10. This is because many computer related things start with 0 and not 1. It's tripped me up many times when I assumed a computer list started with 1 and not 0. Creating ordered lists is useful to keep count when performing calculations, or indeed applying iterative functions, many times.

You may have noticed we missed out the print keyword, which we used when we printed the phrase "Hello World!", but again didn't when we evaluated 2*3. Using the print keyword can be optional when we're working with Python in an interactive way, because it knows we want to see the result of the instructions we issued.

Averycommonwaytogetcomputerstodothingsrepeatedlyisbyusingcodestructurescalled
loops .Thewordloopdoesgiveyoutherightimpressionofsomethinggoinground,andround
potentiallyendlessly.Ratherthandefinealoop,itseasiesttoseeasimpleone.Enterandrun
thefollowingcodeinanewcell.

forninrange(10):
print(n)
pass
print(done)

There are three new things here, so let's go through them. The first line has the range(10) that we saw before. This creates a list of numbers from 0 to 9, as we saw before.

The for n in ... is the bit that creates a loop, and in this case it does something for every number in the list, and keeps count by assigning the current value to the variable n. We saw variables earlier, and this is just like assigning n = 0 during the first pass of the loop, then n = 1, then n = 2, until n = 9, which is the last item in the list.

The next line print(n) shouldn't surprise us by simply printing the value of n. We expect all the numbers in the list to be printed. But notice the indent before print(n). This is important in Python, as indents are used meaningfully to show which instructions are subservient to others, in this case the loop created by for n in .... The pass instruction signals the end of the loop, and the next line is back at normal indentation and not part of the loop. This means we only expect "done" to be printed once, and not ten times. The following shows the output, and it is as we expected.

It should be clear now that we can print the squares by printing n*n. In fact we can make the output more helpful by printing phrases like "The square of 3 is 9". The following code shows this change to the print instruction repeated inside the loop. Note how the variables are not inside quotes and are therefore evaluated.

for n in range(10):
    print("The square of", n, "is", n*n)
    pass
print("done")

The result is shown as follows.

This is already quite powerful! We can get the computer to potentially do a lot of work very quickly with just a very short set of instructions. We could easily make the number of loop iterations much larger by using range(50) or even range(1000) if we wanted. Try it!
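As a minimal sketch of that idea, we can hold the number of iterations in a variable, so the same loop works for any size. The name upper is our own choice here, not anything special to Python:

# the upper limit is a variable, so the same loop works for any size
upper = 20

for n in range(upper):
    print("The square of", n, "is", n*n)
    pass
print("done")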

Comments
Before we discover more wild and wonderful Python commands, have a look at the following simple code.

# the following prints out the cube of 2
print(2**3)

The first line begins with a hash symbol #. Python ignores any lines beginning with a hash. Rather than being useless, we can use such lines to place helpful comments into the code to make it clearer for other readers, or even ourselves if we came back to the code at a later time.

Trust me, you'll be thankful you commented your code, especially the more complex or less obvious bits of code. The number of times I've tried to decode my own code, asking myself "what was I thinking?" ...

Functions
We spent a lot of time earlier in part 1 of the guide working with mathematical functions. We thought of these as machines which take input, do some work, and pop out the result. And those functions stood in their own right, and could be used again and again.

Many computer languages, Python included, make it easy to create reusable computer instructions. Like mathematical functions, these reusable snippets of code stand on their own if you define them sufficiently well, and allow you to write shorter, more elegant code. Why shorter code? Because invoking a function by its name many times is better than writing out all the function code many times.

And what do we mean by "sufficiently well defined"? It means being clear about what kinds of input a function expects, and what kind of output it produces. Some functions will only take numbers as input, so you can't supply them with a word made up of letters.

Again, the best way to understand this simple idea of a function is to see a simple one and play with it. Enter the following code and run it.

# function that takes 2 numbers as input
# and outputs their average
def avg(x, y):
    print("first input is", x)
    print("second input is", y)
    a = (x + y) / 2.0
    print("average is", a)
    return a

Let's talk about what we've done here. The first two lines starting with # are ignored by Python, but for us they can be used as comments for future readers. The next bit, def avg(x, y), tells Python we are about to define a new reusable function. That's the def keyword. The avg bit is the name we've given it. It could have been called "banana" or "pluto", but it makes sense to use names that remind us what the function actually does. The bits in brackets (x, y) tell Python that this function takes two inputs, to be called x and y inside the forthcoming definition of the function. Some computer languages make you say what kind of objects these are, but Python doesn't do this; it just complains politely later when you try to abuse a variable, like trying to use a word as if it was a number, or other such insanity.

Now that we've signalled to Python that we're about to define a function, we need to actually tell it what the function is to do. This definition of the function is indented, as shown in the code above. Some languages use lots of brackets to make it clear which instructions belong to which parts of a program, but the Python designers felt that lots of brackets weren't easy on the eye, and that indentation made understanding the structure of a program instantly visual and easier. Opinions are divided, because people get caught out by such indentation, but I love it! It's one of the best human-friendly ideas to come out of the sometimes geeky world of computer programming!

The definition of the avg(x, y) function is easy to understand, as it uses only things we've seen already. It prints out the first and second numbers which the function gets when it is invoked. Printing these out isn't necessary to work out the average at all, but we've done it to make it really clear what is happening inside the function. The next bit calculates (x + y) / 2.0 and assigns the value to the variable named a. We again print the average just to help us see what's going on in the code. The last statement says return a. This is the end of the function and tells Python what to throw out as the function's output, just like the machines we considered earlier.

When we ran this code, it didn't seem to do anything. There were no numbers produced. That's because we only defined the function, but haven't used it yet. What has actually happened is that Python has noted this function and will keep it ready for when we want to use it.

In the next cell enter avg(2, 4) to invoke this function with the inputs 2 and 4. By the way, invoking a function is called calling a function in the world of computer programming. The output should be what we expect, with the function printing a statement about the two input values and the average it calculated. You'll also see the answer on its own, because calling the function in an interactive Python session prints out the returned value. The following shows the function definition and the results of calling it with avg(2, 4) and also bigger values (200, 301). Have a play and experiment with your own inputs.



You may have noticed that the function code which calculates the average divides the sum of the two inputs by 2.0 and not just 2. Why is this? Well, in older versions of Python (Python 2) the result of dividing by the integer 2 would be rounded down to the nearest whole number, because Python considered just 2 an integer. That would be fine for avg(2, 4) because 6/2 is 3, a whole number. But for avg(200, 301) the average is 501/2, which should be 250.5 but would be rounded down to 250. That is all just very silly I think, but it is worth thinking about if your own code isn't behaving quite right. Dividing by 2.0 tells Python we really want to stick with numbers that can have fractional parts, and we don't want it to round down to whole numbers.
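A small aside, because this behaviour depends on your Python version. In Python 3 the / operator always keeps the fractional part, and a separate // operator does the rounding down. The following quick experiment, run in a Python 3 session, shows the difference:

print(501 / 2)     # 250.5 in Python 3 (in Python 2 this would have been 250)
print(501 // 2)    # 250 - floor division rounds down in both versions
print(501 / 2.0)   # 250.5 in both versions

Dividing by 2.0 is harmless in both versions, so it's a safe habit either way.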

Let's take a step back and congratulate ourselves. We've defined a reusable function, one of the most important and powerful elements of both mathematics and computer programming.

We'll use reusable functions when we code our own neural network. For example, it makes sense to make a reusable function that does the sigmoid activation function calculation, so we can call on it many times.

Arrays
Arrays are just tables of values, and they come in really handy. Like tables, you refer to particular cells according to the row and column number. If you think of spreadsheets, you'll know that cells are referred to in this way, B1 or C5 for example, and the values in those cells can be used in calculations, C3+D7 for example.



When we get to coding our neural network, we'll use arrays to represent the matrices of input signals, weights and output signals too. And not just those; we'll use them to represent the signals inside the neural networks as they feed forward, and also the errors as they propagate backwards. So let's get familiar with them. Enter and run the following code.

import numpy

What does this do? The import command tells Python to bring additional powers from elsewhere, to add new tools to its belt. Sometimes these additional tools are part of Python but aren't made immediately ready to use, to keep Python lean and mean, only carrying extra stuff if you intend to use it. Often these additional tools are not core parts of Python, but are created by others as useful extras, contributed for everyone to use. Here we've imported an additional set of tools packaged into a module named numpy. Numpy is really popular, and contains some useful stuff, like arrays and an ability to do calculations with them.

In the next cell, enter and run the following code.

a = numpy.zeros([3, 2])
print(a)

This uses the imported numpy module to create an array of shape 3 by 2, with all the cells set to the value zero, and assigns the whole thing to a variable named a. We then print a, and can see this array full of zeros in what looks like a table with 3 rows and 2 columns.



Now let's modify the contents of this array and change some of those zeros to other values. The following code shows how you can refer to specific cells to overwrite them with new values. It's just like referring to spreadsheet cells or street map grid references.

a[0, 0] = 1
a[0, 1] = 2
a[1, 0] = 9
a[2, 1] = 12
print(a)

The first line updates the cell at row zero and column zero with the value 1, overwriting whatever was there before. The other lines are similar updates, with a final printout with print(a). The following shows us what the array looks like after these changes.

Now that we know how to set the value of cells in an array, how do we look them up without printing out the entire array? We've been doing it already. We simply use expressions like a[0,1] or a[1,0] to refer to the content of these cells, which we can print or assign to other variables. The following code shows us doing just this.

print(a[0, 1])
v = a[1, 0]
print(v)


You can see from the output that the first print instruction produced the value 2.0, which is what's inside the cell at a[0,1]. Next the value inside a[1,0] is assigned to the variable v, and this variable is printed. We get the expected 9.0 printed out.

The column and row numbering starts from 0 and not 1. The top left is at [0,0] not [1,1]. This also means that the bottom right is at [2,1] not [3,2]. This catches me out sometimes because I keep forgetting that many things in the computer world begin with 0 and not 1. If we tried to refer to a[3,2] we'd get an error message telling us we were trying to locate a cell which didn't exist. We'd get the same if we mixed up our columns and rows. Let's try accessing a[0,2], which doesn't exist, just to see what error message is reported.
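If you'd rather not have a red error splat in your notebook cell, the following small sketch catches that error instead, assuming the 3 by 2 array a from above:

# reading a cell outside the array raises an IndexError, which we can catch
try:
    print(a[0, 2])
except IndexError as e:
    print("error:", e)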

Arrays, or matrices, will be useful for neural networks because we can simplify the instructions to do the many, many calculations for feeding signals forward and errors backwards through a network. We saw this in part 1 of this guide.

Plotting arrays
Just like large tables or lists of numbers, looking at large arrays isn't that insightful. Visualising them helps us quickly get an idea of the general meaning. One way of plotting 2-dimensional arrays of numbers is to think of them as flat 2-dimensional surfaces, coloured according to the value at each cell in the array. You can choose how you turn a value inside a cell into a colour. You might choose to simply turn the value into a colour according to a colour scale, or you might colour everything white except values above a certain threshold, which would be black.

Let's try plotting the small 3 by 2 array we created above.

Before we can do this, we need to extend Python's abilities to plot graphics. We do this by importing additional Python code that others have written. You can think of this as borrowing food recipe books from your friend to add to your own bookshelf, so that your bookshelf now has extra content enabling you to prepare more kinds of dishes than you could before.


The following shows how we import graphics plotting capability.

import matplotlib.pyplot

The matplotlib.pyplot is the name of the new "recipe book" that we're borrowing. You might come across phrases like "import a module" or "import a library". These are just the names given to the additional Python code you're importing. If you get into Python you'll often import additional capabilities to make your life easier by reusing others' useful code. You might even create your own useful code to share with others!

One more thing: we need to be firm with IPython about plotting any graphics in the notebook, and not trying to plot them in a separate external window. We issue this explicit order as follows:

%matplotlib inline

We're ready to plot that array now. Enter and run the following code.

matplotlib.pyplot.imshow(a, interpolation="nearest")

The instruction to create a plot is imshow(), and the first parameter is the array we want to plot. That last bit, interpolation, is there to tell Python not to try to blend the colours to make the plot look smoother, which it does by default, trying to be helpful. Let's look at the output.

Exciting! Our first plot shows the 3 by 2 sized array as colours. You can see that the array cells which have the same value also have the same colour. Later we'll be using this very same imshow() instruction to visualise an array of values that we feed our neural network.


The matplotlib package has a very rich set of tools for visualising data. You should explore them to get a feel for the wide range of plots possible, and even try some of them. Even the imshow() instruction has many options for plotting for us to explore, such as using different colour palettes.
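As one example, and this is just a variation to experiment with rather than anything we'll need later, the cmap parameter selects a different colour palette:

# the same plot with a different colour palette
matplotlib.pyplot.imshow(a, interpolation="nearest", cmap="hot")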

Objects
We're going to look at one more Python idea called objects. Objects are similar to the reusable functions, because we define them once and use them many times. But objects can do a lot more than simple functions can.

The easiest way to understand objects is to see them in action, rather than talk loads about an abstract concept. Have a look at the following code.

# class for a dog object
class Dog:

    # dogs can bark()
    def bark(self):
        print("woof!")
        pass

    pass

Let's start with what we're familiar with first. You can see there is a function called bark() inside that code. It's not difficult to see that if we called the function, we'd get "woof!" printed out. That's easy enough.

Let's look around that familiar function definition now. You can see the class keyword, a name Dog, and a structure that looks like a function too. You can see the parallels with function definitions, which also have a name. The difference is that with functions we use the def keyword to define them, but with object definitions we use the keyword class.

Before we dive in and discuss what a class is, compared to an object, again have a look at some real but simple code that brings these abstract ideas to life.

sizzles = Dog()
sizzles.bark()

You can see the first line creates a variable called sizzles, and it seems to come from what looks like a function call. In fact that Dog() is a special function that creates an instance of the Dog class. We can see now how to create things from class definitions. These things are called objects. We've created an object called sizzles from the Dog class definition; we can consider this object to be a dog!

The next line calls the bark() function on the sizzles object. This is half familiar, because we've seen functions already. What's less familiar is that we're calling the bark() function as if it was a part of the sizzles object. That's because bark() is a function that all objects created from the Dog class have. It's there to see in the definition of the Dog class.

Let's say all that in simple terms. We created sizzles, a kind of Dog. That sizzles is an object, created in the shape of a Dog class. Objects are instances of a class.

The following shows what we have done so far, and also confirms that sizzles.bark() does indeed output a "woof!".

You might have spotted the self in the definition of the function as bark(self). This may seem odd, and it does to me. As much as I like Python, I don't think it is perfect. The reason that self is there is so that when Python creates the function, it assigns it to the right object. I think this should be obvious, because that bark() is inside the class definition so Python should know which objects to connect it to, but that's just my opinion.

Let's see objects and classes being used more usefully. Have a look at the following code.

sizzles = Dog()
mutley = Dog()

sizzles.bark()
mutley.bark()

Let's run it and see what happens.



This is interesting! We are creating two objects called sizzles and mutley. The important thing to realise is that they are both created from the same Dog() class definition. This is powerful! We define what the objects should look like and how they should behave, and then we create real instances of them.

That is the difference between classes and objects: one is a definition and one is a real instance of that definition. A class is a cake recipe in a book, an object is a cake made with that recipe. The following shows visually how objects are made from a class recipe.

What use are these objects made from a class? Why go to the trouble? It would have been simpler to just print the word "woof!" without all that extra code.

Well, you can see how it might be useful to have many objects all of the same kind, all made from the same template. It saves having to create each one separately in full. But the real benefit comes from objects having data and functions all wrapped up neatly inside them. The benefit is to us human coders. It helps us more easily understand more complicated problems if bits of code are organised around the objects to which they naturally belong. Dogs bark. Buttons click. Speakers emit sound. Printers print, or complain they're out of paper. In many computer systems, buttons, speakers and printers are indeed represented as objects, whose functions you invoke.

You'll sometimes see these object functions called methods. We've already done this above: we added a bark() function to the Dog class, and both the sizzles and mutley objects made from this class have a bark() method. You saw them both bark in the example!

Neural networks take some input, do some calculations, and produce an output. We also know they can be trained. You can see that these actions, training and producing an answer, are natural functions of a neural network. That is, functions of a neural network object. You'll also remember that neural networks have data inside them that naturally belongs there - the link weights. That's why we'll build our neural network as an object.

For completeness, let's see how we add data variables to a class, and some methods to view and change this data. Have a look at the following fresh class definition of a Dog class. There are a few new things going on here, so we'll look at them one at a time.

# class for a dog object
class Dog:

    # initialisation method with internal data
    def __init__(self, petname, temp):
        self.name = petname
        self.temperature = temp

    # get status
    def status(self):
        print("dog name is", self.name)
        print("dog temperature is", self.temperature)
        pass

    # set temperature
    def setTemperature(self, temp):
        self.temperature = temp
        pass

    # dogs can bark()
    def bark(self):
        print("woof!")
        pass

    pass

The first thing to notice is that we have added three new functions to this Dog class. We already had the bark() function; now we have new ones called __init__(), status() and setTemperature(). Adding new functions is easy enough to understand. We could have added one called sneeze() to go with bark() if we wanted.

But these new functions seem to have variable names inside the function names. That setTemperature is actually setTemperature(self, temp). The funny-named __init__ is actually __init__(self, petname, temp). What are these extra bits inside the brackets? They are the variables the function expects when it is called, called parameters. Remember that averaging function avg(x, y) we saw earlier? The definition of avg() made clear it expected 2 numbers. So the __init__() function needs a petname and a temp, and the setTemperature() function needs just a temp.

Now let's look inside these new functions. Let's look at that oddly named __init__() first. Why is it given such a weird name? The name is special: Python will call a function named __init__() when an object is first created. That's really handy for doing any work preparing an object before we actually use it. So what do we do in this magic initialisation function? We seem to only create 2 new variables, called self.name and self.temperature. Their values come from the variables petname and temp passed to the function. The "self." part means that the variables are part of this object itself - its own "self". That is, they belong only to this object, and are independent of another Dog object or general variables in Python. We don't want to confuse this dog's name with that of another dog! If this all sounds complicated, don't worry, it'll be easy to see when we actually run a real example.

Next is the status() function, which is really simple. It doesn't take any parameters, and simply prints out the Dog object's name and temperature variables.

Finally the setTemperature() function does take a parameter. You can see that it sets the self.temperature to the parameter temp supplied when this function is called. This means you can change the object's temperature at any time after you created the object. You can do this as many times as you like.

We've avoided talking about why all these functions, including that bark() function, have a self as the first parameter. It is a peculiarity of Python, which I find a bit irksome, but that's the way Python has evolved. What it does is make clear to Python that the function you're about to define belongs to the object, referred to as self. You would have thought it was obvious, because we're writing the function inside the class. You won't be surprised that this has caused debates amongst even expert Python programmers, so you're in good company in being puzzled.



Let's see all this in action to make it really clear. The following shows the new Dog class defined with these new functions, and a new Dog object called lassie being created with parameters setting out its name as "Lassie" and its initial temperature as 37.

You can see how calling the status() function on this lassie Dog object prints out the dog's name and its current temperature. That temperature hasn't changed since lassie was created.

Let's now change the temperature and check that it really has changed inside the object by asking for another update:

lassie.setTemperature(40)
lassie.status()

The following shows the results.



You can see that calling setTemperature(40) on the lassie object did indeed change the object's internal record of the temperature.

We should be really pleased, because we've learned quite a lot about objects, which some consider an advanced topic, and it wasn't that hard at all!

We've learned enough Python to begin making our neural network.

Neural Network with Python
We'll now start the journey to make our own neural network with the Python we've just learned. We'll progress along this journey in short, easy-to-tackle steps, and build up a Python program bit by bit.

Starting small, then growing, is a wise approach to building computer code of even moderate complexity.

After the work we've just done, it seems really natural to start to build the skeleton of a neural network class. Let's jump right in!

The Skeleton Code
Let's sketch out what a neural network class should look like. We know it should have at least three functions:

initialisation - to set the number of input, hidden and output nodes
train - refine the weights after being given a training set example to learn from
query - give an answer from the output nodes after being given an input

These might not be perfectly defined right now, and there may be more functions needed, but for now let's use these to make a start.

The code taking shape would be something like the following:

# neural network class definition
class neuralNetwork:

    # initialise the neural network
    def __init__(self):
        pass

    # train the neural network
    def train(self):
        pass

    # query the neural network
    def query(self):
        pass

That's not a bad start at all. In fact that's a solid framework on which to flesh out the working of the neural network in detail.

Initialising the Network
Let's begin with the initialisation. We know we need to set the number of input, hidden and output layer nodes. That defines the shape and size of the neural network. Rather than set these in stone, we'll let them be set when a new neural network object is created, by using parameters. That way we retain the choice to create new neural networks of different sizes with ease.

There's an important point underneath what we've just decided. Good programmers, computer scientists and mathematicians try to create general code rather than specific code whenever they can. It is a good habit, because it forces us to think about solving problems in a deeper, more widely applicable way. If we do this, then our solutions can be applied to different scenarios. What this means here is that we will try to develop code for a neural network which keeps as many useful options open as possible, and assumptions to a minimum, so that the code can easily be used for different needs. We want the same class to be able to create a small neural network as well as a very large one, simply by passing the desired size as parameters.

Don't forget the learning rate too. That's a useful parameter to set when we create a new neural network. So let's see what the __init__() function might look like:

# initialise the neural network
def __init__(self, inputnodes, hiddennodes, outputnodes, learningrate):
    # set number of nodes in each input, hidden, output layer
    self.inodes = inputnodes
    self.hnodes = hiddennodes
    self.onodes = outputnodes

    # learning rate
    self.lr = learningrate
    pass

Let's add that to our class definition of a neural network, and try creating a small neural network object with 3 nodes in each layer, and a learning rate of 0.3.

# number of input, hidden and output nodes
input_nodes = 3
hidden_nodes = 3
output_nodes = 3

# learning rate is 0.3
learning_rate = 0.3

# create instance of neural network
n = neuralNetwork(input_nodes, hidden_nodes, output_nodes, learning_rate)

That would give us an object, sure, but it wouldn't be very useful yet, as we haven't coded any functions that do useful work yet. That's ok; it is a good technique to start small and grow code, finding and fixing problems along the way.

Just to make sure we've not got lost, the following shows the IPython notebook at this stage, with the neural network class definition and the code to create an object.



What do we do next? Well, we've told the neural network object how many input, hidden and output layer nodes we want, but nothing has really been done about it.

Weights - The Heart of the Network
So the next step is to create the network of nodes and links. The most important part of the network is the link weights. They're used to calculate the signal being fed forward, the error as it's propagated backwards, and it is the link weights themselves that are refined in an attempt to improve the network.

We saw earlier that the weights can be concisely expressed as a matrix. So we can create:

A matrix for the weights for links between the input and hidden layers, W_input_hidden, of size (hidden_nodes by input_nodes).

And another matrix for the links between the hidden and output layers, W_hidden_output, of size (output_nodes by hidden_nodes).

Remember the convention earlier to see why the first matrix is of size (hidden_nodes by input_nodes) and not the other way around (input_nodes by hidden_nodes).


Remember from part 1 of this guide that the initial values of the link weights should be small and random. The following numpy function generates an array of values selected randomly between 0 and 1, where the size is (rows by columns).

numpy.random.rand(rows, columns)

All good programmers use internet search engines to find online documentation on how to use cool Python functions, or even to find useful functions they didn't know existed. Google is particularly good for finding stuff about programming. This numpy.random.rand() function is described in the numpy documentation, for example.

If we're going to use the numpy extensions, we need to import the library at the top of our code. Try this function, and confirm to yourself it works. The following shows it working for a (3 by 3) numpy array. You can see that each value in the array is random and between 0 and 1.

We can do better, because we've ignored the fact that the weights could legitimately be negative, not just positive. The range could be between -1.0 and +1.0. For simplicity, we'll simply subtract 0.5 from each of the above values to in effect have a range between -0.5 and +0.5. The following shows this neat trick working, and you can see some random values below zero.

We're ready to create the initial weight matrices in our Python program. These weights are an intrinsic part of a neural network, and live with the neural network for all its life, not a temporary set of data that vanishes once a function is called. That means they need to be part of the initialisation too, and accessible from other functions like the training and the querying.

The following code, with comments included, creates the two link weight matrices, using self.inodes, self.hnodes and self.onodes to set the right size for both of them.

# link weight matrices, wih and who
# weights inside the arrays are w_i_j, where link is from node i to node j in the next layer
# w11 w21
# w12 w22 etc
self.wih = (numpy.random.rand(self.hnodes, self.inodes) - 0.5)
self.who = (numpy.random.rand(self.onodes, self.hnodes) - 0.5)

Great work! We've implemented the very heart of a neural network, its link weight matrices!

Optional: More Sophisticated Weights
This bit is optional, as it is just a simple but popular refinement to initialising weights.

If you followed the earlier discussion in part 1 of this guide about preparing data and initialising weights, you will have seen that some prefer a slightly more sophisticated approach to creating the initial random weights. They sample the weights from a normal probability distribution centred around zero and with a standard deviation that is related to the number of incoming links into a node, 1/√(number of incoming links).

This is easy to do with the help of the numpy library. Again Google is really helpful in finding the right documentation. The numpy.random.normal() function helps us sample a normal distribution. The parameters are the centre of the distribution, the standard deviation, and the size of a numpy array if we want a matrix of random numbers instead of just a single number.

Our updated code to initialise the weights will look like this:

self.wih = numpy.random.normal(0.0, pow(self.hnodes, -0.5), (self.hnodes, self.inodes))
self.who = numpy.random.normal(0.0, pow(self.onodes, -0.5), (self.onodes, self.hnodes))

You can see we've set the centre of the normal distribution to 0.0. You can see the expression for the standard deviation related to the number of nodes in the next layer in Python form as pow(self.hnodes, -0.5), which is simply raising the number of nodes to the power of -0.5, that is, 1 over its square root. That last parameter is the shape of the numpy array we want.
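If you want to convince yourself the function does what we asked, the following quick check samples lots of values and prints their average and spread. The numbers 3 and 1000 are arbitrary choices for this experiment:

import numpy

# sample 1000 values from a normal distribution centred on 0.0,
# with standard deviation 1/sqrt(3), as if a node had 3 incoming links
sample = numpy.random.normal(0.0, pow(3, -0.5), 1000)
print(sample.mean())   # should be close to 0.0
print(sample.std())    # should be close to 0.577, which is 1/sqrt(3)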

Querying the Network
It sounds logical to start working on the code that trains the neural network next, by fleshing out the currently empty train() function. We'll delay doing that, and instead work on the simpler query() function. This will give us more time to gradually build confidence and get some practice at using both Python and these weight matrices inside the neural network object.


The query() function takes the input to a neural network and returns the network's output. That's simple enough, but to do that you'll remember that we need to pass the input signals from the input layer of nodes, through the hidden layer, and out of the final output layer. You'll also remember that we use the link weights to moderate the signals as they feed into any given hidden or output node, and we also use the sigmoid activation function to squish the signal coming out of those nodes.

If we had lots of nodes, we would have a horrible task on our hands, writing out the Python code for each of those nodes, doing the weight moderating, summing signals, applying the activation function. And the more nodes, the more code. That would be a nightmare!

Luckily we don't have to do that, because we worked out how to write all these instructions in a simple concise matrix form. The following shows how the matrix of weights for the links between the input and hidden layers can be combined with the matrix of inputs to give the signals into the hidden layer nodes:

X_hidden = W_input_hidden · I

The great thing about this was not just that it was easier for us to write down, but that programming languages like Python can also understand matrices and do all the real work quite efficiently, because they recognise the similarities between all the underlying calculations.

You'll be amazed at how simple the Python code actually is. The following applies the numpy library's dot product function for matrices to the link weights W_input_hidden and the inputs I.

hidden_inputs = numpy.dot(self.wih, inputs)

That's it!

That simple piece of Python does all the work of combining all the inputs with all the right link weights to produce the matrix of combined moderated signals into each hidden layer node. We don't have to rewrite it either if next time we choose to use a different number of nodes for the input or hidden layers. It just works!

This power and elegance is why we put the effort into trying to understand the matrix multiplication approach earlier.

To get the signals emerging from the hidden node, we simply apply the sigmoid squashing function to each of these emerging signals:

O_hidden = sigmoid(X_hidden)


That should be easy, especially if the sigmoid function is already defined in a handy Python library. It turns out it is! The scipy Python library has a set of special functions, and the sigmoid function is called expit(). Don't ask me why it has such a silly name. This scipy library is imported just like we imported the numpy library:

# scipy.special for the sigmoid function expit()
import scipy.special
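A quick way to get a feel for expit() is to try it at a few points. Its value at zero is exactly 0.5, and it flattens out towards 0 and 1 at the extremes:

import scipy.special

print(scipy.special.expit(0.0))     # 0.5, the midpoint of the sigmoid
print(scipy.special.expit(10.0))    # very close to, but not quite, 1.0
print(scipy.special.expit(-10.0))   # very close to, but not quite, 0.0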

Because we might want to experiment and tweak, or even completely change, the activation function, it makes sense to define it only once inside the neural network object when it is first initialised. After that we can refer to it several times, such as in the query() function. This arrangement means we only need to change this definition once, and not have to locate and change the code anywhere an activation function is used.

The following defines the activation function we want to use inside the neural network's initialisation section.

# activation function is the sigmoid function
self.activation_function = lambda x: scipy.special.expit(x)

What is this code? It doesn't look like anything we've seen before. What is that lambda? Well, this may look daunting, but it really isn't. All we've done here is create a function like any other, but we've used a shorter way of writing it out. Instead of the usual def definition, we use the magic lambda to create a function there and then, quickly and easily. The function here takes x and returns scipy.special.expit(x), which is the sigmoid function. Functions created with lambda are nameless, or anonymous as seasoned coders like to call them, but here we've assigned it to the name self.activation_function. All this means is that whenever someone needs to use the activation function, all they need to do is call self.activation_function().
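If the lambda form feels too magic, the following sketch shows the same thing written out with the usual def keyword; it behaves identically, and the name sigmoid_activation is our own invention:

# inside __init__(), the longer equivalent of the lambda
def sigmoid_activation(x):
    return scipy.special.expit(x)

self.activation_function = sigmoid_activation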

So, going back to the task at hand, we want to apply the activation function to the combined and moderated signals into the hidden nodes. The code is as simple as the following:

# calculate the signals emerging from hidden layer
hidden_outputs = self.activation_function(hidden_inputs)

That is, the signals emerging from the hidden layer nodes are in the matrix called hidden_outputs.

That got us as far as the middle hidden layer, what about the final output layer? Well, there isn't anything really different between the hidden and final output layer nodes, so the process is the same. This means the code is also very similar.


Have a look at the following code, which summarises how we calculate not just the hidden layer signals but also the output layer signals too.

# calculate signals into hidden layer
hidden_inputs = numpy.dot(self.wih, inputs)
# calculate the signals emerging from hidden layer
hidden_outputs = self.activation_function(hidden_inputs)

# calculate signals into final output layer
final_inputs = numpy.dot(self.who, hidden_outputs)
# calculate the signals emerging from final output layer
final_outputs = self.activation_function(final_inputs)

If we took away the comments, there are just four lines of code that do all the calculations we needed, two for the hidden layer and two for the final output layer.

The Code Thus Far
Let's take a breath and pause to check what the code for the neural network class we're building up looks like. It should look something like the following.

# neural network class definition
class neuralNetwork:

    # initialise the neural network
    def __init__(self, inputnodes, hiddennodes, outputnodes, learningrate):
        # set number of nodes in each input, hidden, output layer
        self.inodes = inputnodes
        self.hnodes = hiddennodes
        self.onodes = outputnodes

        # link weight matrices, wih and who
        # weights inside the arrays are w_i_j, where link is from node i to node j in the next layer
        # w11 w21
        # w12 w22 etc
        self.wih = numpy.random.normal(0.0, pow(self.hnodes, -0.5), (self.hnodes, self.inodes))
        self.who = numpy.random.normal(0.0, pow(self.onodes, -0.5), (self.onodes, self.hnodes))

        # learning rate
        self.lr = learningrate

        # activation function is the sigmoid function
        self.activation_function = lambda x: scipy.special.expit(x)

        pass

    # train the neural network
    def train(self):
        pass

    # query the neural network
    def query(self, inputs_list):
        # convert inputs list to 2d array
        inputs = numpy.array(inputs_list, ndmin=2).T

        # calculate signals into hidden layer
        hidden_inputs = numpy.dot(self.wih, inputs)
        # calculate the signals emerging from hidden layer
        hidden_outputs = self.activation_function(hidden_inputs)

        # calculate signals into final output layer
        final_inputs = numpy.dot(self.who, hidden_outputs)
        # calculate the signals emerging from final output layer
        final_outputs = self.activation_function(final_inputs)

        return final_outputs

That is just the class; aside from that, we should be importing the numpy and scipy.special modules right at the top of the code in the first notebook cell:

import numpy
# scipy.special for the sigmoid function expit()
import scipy.special

It is worth briefly noting that the query() function only needs the inputs_list. It doesn't need any other input.



That's good progress, and now we look at the missing piece, the train() function. Remember there are two phases to training: the first is calculating the output just as query() does it, and the second part is back propagating the errors to inform how the link weights are refined.

Before we go on to write the train() function for training the network with examples, let's just test that the code we have at this point works. Let's create a small network and query it with some random input, just to see it working. Obviously there won't be any real meaning to this; we're only doing this to use the functions we just created.

The following shows the creation of a small network with 3 nodes in each of the input, hidden and output layers, and queries it with a randomly chosen input of (1.0, 0.5, -1.5).
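In case the screenshot isn't to hand, the cell contents look something like this sketch, using the class and variables we defined above:

# create a small neural network and query it with an arbitrary input
n = neuralNetwork(input_nodes, hidden_nodes, output_nodes, learning_rate)
n.query([1.0, 0.5, -1.5])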

You can see the creation of a neural network object does need a learning rate to be set, even though we're not using it yet. That's because our definition of the neural network class has an initialisation function __init__() that does require it to be specified. If it's not set, the Python code will fail and throw you an error to chew on!

You can also see that the input is a list, which in Python is written inside square brackets. The output is also a list, with some numbers. Even if this output has no real meaning, because we haven't trained the network, we can be happy the thing worked with no errors.

Training the Network
Let's now tackle the slightly more involved training task. There are two parts to this:

The first part is working out the output for a given training example. That is no different to what we just did with the query() function.

The second part is taking this calculated output, comparing it with the desired output, and using the difference to guide the updating of the network weights.

We've already done the first part, so let's write that out:


# train the neural network
def train(self, inputs_list, targets_list):
    # convert inputs list to 2d array
    inputs = numpy.array(inputs_list, ndmin=2).T
    targets = numpy.array(targets_list, ndmin=2).T

    # calculate signals into hidden layer
    hidden_inputs = numpy.dot(self.wih, inputs)
    # calculate the signals emerging from hidden layer
    hidden_outputs = self.activation_function(hidden_inputs)

    # calculate signals into final output layer
    final_inputs = numpy.dot(self.who, hidden_outputs)
    # calculate the signals emerging from final output layer
    final_outputs = self.activation_function(final_inputs)

    pass

This code is almost exactly the same as that in the query() function, because we're feeding forward the signal from the input layer to the final output layer in exactly the same way.

The only difference is that we have an additional parameter, targets_list, in the function definition, because you can't train the network without the training examples, which include the desired or target answer.

def train(self, inputs_list, targets_list)

The code also turns the targets_list into a numpy array, just as the inputs_list is turned into a numpy array.

targets = numpy.array(targets_list, ndmin=2).T

Now we're getting closer to the heart of the neural network's working, improving the weights based on the error between the calculated and target output.

Let's do this in gentle, manageable steps.

First we need to calculate the error, which is the difference between the desired target output provided by the training example, and the actual calculated output. That's the difference between the matrices (targets - final_outputs), done element by element. The Python code for this is really simple, showing again the elegant power of matrices.

# error is the (target - actual)
output_errors = targets - final_outputs

We can calculate the back-propagated errors for the hidden layer nodes. Remember how we split the errors according to the connected weights, and recombine them for each hidden layer node. We worked out the matrix form of this calculation as:

errors_hidden = W_hidden_output^T · errors_output

The code for this is again simple, because of Python's ability to do matrix dot products using numpy.

# hidden layer error is the output_errors, split by weights, recombined at hidden nodes
hidden_errors = numpy.dot(self.who.T, output_errors)

So we have what we need to refine the weights at each layer. For the weights between the hidden and final layers, we use the output_errors. For the weights between the input and hidden layers, we use these hidden_errors we just calculated.

We previously worked out the expression for updating the weight for the link between a node j and a node k in the next layer in matrix form:

ΔW_jk = α * E_k * sigmoid(O_k) * (1 - sigmoid(O_k)) · O_j^T

The alpha α is the learning rate, and the sigmoid is the squashing activation function we saw before. Remember that the * multiplication is the normal element by element multiplication, and the · dot is the matrix dot product. That last bit, the matrix of outputs from the previous layer, is transposed. In effect this means the column of outputs becomes a row of outputs.

That should translate nicely into Python code. Let's do the code for the weights between the hidden and final layers first:

# update the weights for the links between the hidden and output layers
self.who += self.lr * numpy.dot((output_errors * final_outputs * (1.0 - final_outputs)), numpy.transpose(hidden_outputs))


That's a long line of code, but it relates directly back to that mathematical expression. The learning rate is self.lr, and is simply multiplied with the rest of the expression. The matrix multiplication is done by numpy.dot(), and its two elements are the part related to the error and sigmoids from the next layer, and the transposed outputs from the previous layer.

That += simply means increase the preceding variable by the next amount. So x += 3 means increase x by 3. It's just a shorter way of writing x = x + 3. You can use other arithmetic too, so x /= 3 means divide x by 3.

The code for the other weights, between the input and hidden layers, will be very similar. We just exploit the symmetry and rewrite the code, replacing the names so that they refer to the previous layers. Here is the code for both sets of weights, so you can see the similarities and differences:

# update the weights for the links between the hidden and output layers
self.who += self.lr * numpy.dot((output_errors * final_outputs * (1.0 - final_outputs)), numpy.transpose(hidden_outputs))

# update the weights for the links between the input and hidden layers
self.wih += self.lr * numpy.dot((hidden_errors * hidden_outputs * (1.0 - hidden_outputs)), numpy.transpose(inputs))

That's it!

It's hard to believe that all that work we did earlier with huge volumes of calculations, the effort we put into working out the matrix approach, working our way through the gradient descent method of minimising the network error - all that has amounted to the above short, concise code! In some ways that's the power of Python, but actually it is a reflection of our hard work to simplify something which could easily have been made complex and horrible.
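Before moving on, a quick smoke test doesn't hurt. The following throwaway call, with completely made-up input and target values, simply checks that the train() function runs without errors, assuming you've re-run the updated class definition and recreated the network object n from earlier:

# made-up input and target values, only to check train() runs cleanly
n.train([1.0, 0.5, -1.5], [0.99, 0.01, 0.5])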

The Complete Neural Network Code
We've completed the neural network class. Here it is for reference, and you can always get it from the following link to github, an online place for sharing code:

https://github.com/makeyourownneuralnetwork/makeyourownneuralnetwork/blob/master/part2_neural_network.ipynb

# neural network class definition
class neuralNetwork:

    # initialise the neural network
    def __init__(self, inputnodes, hiddennodes, outputnodes, learningrate):
        # set number of nodes in each input, hidden, output layer
        self.inodes = inputnodes
        self.hnodes = hiddennodes
        self.onodes = outputnodes

        # link weight matrices, wih and who
        # weights inside the arrays are w_i_j, where link is from node i to node j in the next layer
        # w11 w21
        # w12 w22 etc
        self.wih = numpy.random.normal(0.0, pow(self.hnodes, -0.5), (self.hnodes, self.inodes))
        self.who = numpy.random.normal(0.0, pow(self.onodes, -0.5), (self.onodes, self.hnodes))

        # learning rate
        self.lr = learningrate

        # activation function is the sigmoid function
        self.activation_function = lambda x: scipy.special.expit(x)

        pass

    # train the neural network
    def train(self, inputs_list, targets_list):
        # convert inputs list to 2d array
        inputs = numpy.array(inputs_list, ndmin=2).T
        targets = numpy.array(targets_list, ndmin=2).T

        # calculate signals into hidden layer
        hidden_inputs = numpy.dot(self.wih, inputs)
        # calculate the signals emerging from hidden layer
        hidden_outputs = self.activation_function(hidden_inputs)

        # calculate signals into final output layer
        final_inputs = numpy.dot(self.who, hidden_outputs)
        # calculate the signals emerging from final output layer
        final_outputs = self.activation_function(final_inputs)

        # output layer error is the (target - actual)
        output_errors = targets - final_outputs
        # hidden layer error is the output_errors, split by weights, recombined at hidden nodes
        hidden_errors = numpy.dot(self.who.T, output_errors)

        # update the weights for the links between the hidden and output layers
        self.who += self.lr * numpy.dot((output_errors * final_outputs * (1.0 - final_outputs)), numpy.transpose(hidden_outputs))

        # update the weights for the links between the input and hidden layers
        self.wih += self.lr * numpy.dot((hidden_errors * hidden_outputs * (1.0 - hidden_outputs)), numpy.transpose(inputs))

        pass

    # query the neural network
    def query(self, inputs_list):
        # convert inputs list to 2d array
        inputs = numpy.array(inputs_list, ndmin=2).T

        # calculate signals into hidden layer
        hidden_inputs = numpy.dot(self.wih, inputs)
        # calculate the signals emerging from hidden layer
        hidden_outputs = self.activation_function(hidden_inputs)

        # calculate signals into final output layer
        final_inputs = numpy.dot(self.who, hidden_outputs)
        # calculate the signals emerging from final output layer
        final_outputs = self.activation_function(final_inputs)

        return final_outputs

There's not a lot of code here, especially if you appreciate that this code can be used to create, train and query 3-layer neural networks for almost any task.

Next we'll work on our specific task of learning to recognise numbers written by humans.

The MNIST Dataset of Handwritten Numbers
Recognising human handwriting is an ideal challenge for testing artificial intelligence, because the problem is sufficiently hard and fuzzy. It's not clear and defined like multiplying lots and lots of numbers.

Getting computers to correctly classify what an image contains, sometimes called the image recognition problem, has withstood decades of attack. Only recently has good progress been made, and methods like neural networks have been a crucial part of these leaps forward.

To give you a sense of how hard the problem of image recognition is, we humans will sometimes disagree on what an image contains. We'll easily disagree about what a handwritten character actually is, particularly if the character was written in a rush or without care. Have a look at the following handwritten number. Is it a 4 or a 9?

There is a collection of images of handwritten numbers used by artificial intelligence researchers as a popular set to test their latest ideas and algorithms against. The fact that the collection is well known and popular means that it is easy to check how well our latest crazy idea for image recognition works compared to others. That is, different ideas and algorithms are tested against the same dataset.

That dataset is called the MNIST database of handwritten digits, and is available from the respected neural network researcher Yann LeCun's website http://yann.lecun.com/exdb/mnist/. That page also lists how well old and new ideas have performed in learning and correctly classifying these handwritten characters. We'll come back to that list several times to see how well our own ideas perform against professionals!

The format of the MNIST database isn't the easiest to work with, so others have helpfully created data files of a simpler format, such as this one http://pjreddie.com/projects/mnist-in-csv/. These files are called CSV files, which means each value is plain text separated by commas (comma separated values). You can easily view them in any text editor, and most spreadsheet or data analysis software will work with CSV files. They are pretty much a universal standard. This website provides two CSV files:


A training set http://www.pjreddie.com/media/files/mnist_train.csv

A test set http://www.pjreddie.com/media/files/mnist_test.csv

As the names suggest, the training set is the set of 60,000 labelled examples used to train the neural network. Labelled means the inputs come with the desired output, that is, what the answer should be.

The smaller test set of 10,000 is used to see how well our idea or algorithm works. This too contains the correct labels, so we can check to see if our own neural network got the answer right or not.

The idea of having separate training and test datasets is to make sure we test against data we haven't seen before. Otherwise we could cheat and simply memorise the training data to get a perfect, albeit deceptive, score. This idea of separating training from test data is common across machine learning.

Let's take a peek at these files. The following shows a section of the MNIST test set loaded into a text editor.

Whoah! That looks like something went wrong! Like one of those movies from the 80s where a computer gets hacked.

Actually all is well. The text editor is showing long lines of text. Those lines consist of numbers, separated by commas. That is easy enough to see. The lines are quite long, so they wrap around a few times. Helpfully this text editor shows the real line numbers in the margin, and we can see four whole lines of data, and part of the fifth one.

The content of these records, or lines of text, is easy to understand:

The first value is the label, that is, the actual digit that the handwriting is supposed to represent, such as a "7" or a "9". This is the answer the neural network is trying to learn to get right.

The subsequent values, all comma separated, are the pixel values of the handwritten digit. The size of the pixel array is 28 by 28, so there are 784 values after the label. Count them if you really want!

So that first record represents the number "5", as shown by the first value, and the rest of the text on that line are the pixel values for someone's handwritten number 5. The second record represents a handwritten "0", the third represents "4", the fourth record is "1" and the fifth represents "9". You can pick any line from the MNIST data files and the first number will tell you the label for the following image data.

But it is hard to see how that long list of 784 values makes up a picture of someone's handwritten number 5. We should plot those numbers as an image to confirm that they really are the colour values of a handwritten number.

Before we dive in and do that, we should download a smaller subset of the MNIST dataset. The MNIST data files are pretty big, and working with a smaller subset is helpful because it means we can experiment, trial and develop our code without a large dataset slowing our computers down. Once we've settled on an algorithm and code we're happy with, we can use the full dataset.

The following are the links to smaller subsets of the MNIST dataset, also in CSV format:

10 records from the MNIST test dataset
https://raw.githubusercontent.com/makeyourownneuralnetwork/makeyourownneuralnetwork/master/mnist_dataset/mnist_test_10.csv

100 records from the MNIST training dataset
https://raw.githubusercontent.com/makeyourownneuralnetwork/makeyourownneuralnetwork/master/mnist_dataset/mnist_train_100.csv

If your browser shows the data instead of downloading it automatically, you can manually save the file using File > Save As ..., or the equivalent action, in your browser.


Save the data files to a location that works well for you. I keep my data files in a folder called mnist_dataset, next to my IPython notebooks, as shown in the following screenshot. It gets messy if IPython notebooks and data files are scattered all over the place.

Before we can do anything with the data, like plotting it or training a neural network with it, we need to find a way to get at it from our Python code.

Opening a file and getting its content is really easy in Python. It's best to just show it and explain it. Have a look at the following code:

data_file = open("mnist_dataset/mnist_train_100.csv", 'r')
data_list = data_file.readlines()
data_file.close()

There are only three lines of code here. Let's talk through each one.

The first line uses a function open() to open a file. You can see the first parameter passed to the function is the name of the file. Actually, it is more than just the file name mnist_train_100.csv; it is the whole path, which includes the directory the file is in. The second parameter is optional, and tells Python how we want to treat the file. The 'r' tells Python that we want to open the file for reading only, and not for writing. That way we avoid accidentally changing or deleting the data. If we tried to write to that file and change it, Python would stop us and raise an error. What's that variable data_file? The open() function creates a file handle, a reference, to that file, and we've assigned it to a variable named data_file. Now that we've opened the file, any further actions, like reading from it, are done through that handle.

The next line is simple. We use the readlines() function associated with the file handle data_file to read all of the lines in the file into the variable data_list. The variable contains a list, where each item of the list is a string representing a line in the file. That's much more useful, because we can jump to specific lines just like we would jump to specific entries in a list. So data_list[0] is the first record, and data_list[9] is the tenth record, and so on.

By the way, you may hear people tell you not to use readlines(), because it reads the entire file into memory. They will tell you to read one line at a time and do whatever work you need with each line, and then move on to the next line. They aren't wrong; it is more efficient to work on a line at a time, and not read the entire file into memory. However, our files aren't that massive, and the code is easier if we use readlines(), and for us, simplicity and clarity are important as we learn Python.
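For reference, a line-at-a-time version would look something like the following sketch, using Python's with statement, which also closes the file for us:

# process the file one line at a time, without loading it all into memory
with open("mnist_dataset/mnist_train_100.csv", 'r') as data_file:
    for line in data_file:
        pass  # do whatever work is needed with each line here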

The last line closes the file. It is good practice to close and clean up after using resources like files. If we don't, they remain open and can cause problems. What problems? Well, some programs might not want to write to a file that was opened elsewhere, in case it led to an inconsistency. It would be like two people trying to write a letter on the same piece of paper! Sometimes your computer may lock a file to prevent this kind of clash. If you don't clean up after you use files, you can have a build-up of locked files. At the very least, closing a file allows your computer to free up memory it used to hold parts of that file.

Create a new empty notebook and try this code, and see what happens when you print out elements of that data_list list. The following shows this working.

You can see that the length of the list is 100. The Python len() function tells us how big a list is. You can also see the content of the first record, data_list[0]. The first number is "5", which is the label, and the rest of the 784 numbers are the colour values for the pixels that make up the image. If you look closely you can tell these colour values seem to range between 0 and 255. You might want to look at other records and see if that is true there too. You'll find that the colour values do indeed fall within the range from 0 to 255.


We did see earlier how we might plot a rectangular array of numbers using the imshow() function. We want to do the same here, but we need to convert that list of comma separated numbers into a suitable array. Here are the steps to do that:

Split that long text string of comma separated values into individual values, using the commas as the place to do the splitting.

Ignore the first value, which is the label, and take the remaining list of 28 * 28 = 784 values and turn them into an array which has a shape of 28 rows by 28 columns.

Plot that array!

Again, it is easiest to show the fairly simple Python code that does this, and talk through the code to explain in more detail what is happening.

First we mustn't forget to import the Python extension libraries which will help us with arrays and plotting:

import numpy
import matplotlib.pyplot
%matplotlib inline

Look at the following 3 lines of code, and keep an eye on which data is being used where.

all_values = data_list[0].split(',')
image_array = numpy.asfarray(all_values[1:]).reshape((28, 28))
matplotlib.pyplot.imshow(image_array, cmap='Greys', interpolation='None')

The first line takes the first record's data_list[0], that we just printed out, and splits that long string by its commas. You can see the split() function doing this, with a parameter telling it which symbol to split by. In this case that symbol is the comma. The results will be placed into all_values. You can print this out to check that it is indeed a long Python list of values.

The next line looks more complicated, because there are several things happening on the same line. Let's work out from the core. The core has that all_values list, but this time the square brackets [1:] are used to take all except the first element of this list. This is how we ignore the first label value and only take the rest of the 784 values. The numpy.asfarray() is a numpy function to convert the text strings into real numbers and to create an array of those numbers. Hang on - what do we mean by converting text strings into numbers? Well, the file was read as text, and each line or record is still text. Splitting each line by the commas still results in bits of text. That text could be the word "apple", "orange123" or "567". The text string "567" is not the same as a number 567. That's why we need to convert text strings to numbers, even if the text looks like numbers. The last bit, .reshape((28,28)), makes sure the list of numbers is wrapped around every 28 elements to make a square matrix 28 by 28. The resulting 28 by 28 array is called image_array. Phew! That was a fair bit happening in one line.
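One caution: very recent versions of numpy have removed asfarray(). If your version complains that it doesn't exist, the following equivalent does the same job and can stand in for it:

# equivalent to numpy.asfarray() for numpy versions that no longer provide it
image_array = numpy.asarray(all_values[1:], dtype=float).reshape((28, 28))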

The third line simply plots the image_array using the imshow() function, just like we saw earlier. This time we did select a greyscale colour palette with cmap='Greys' to better show the handwritten characters.

The following shows the results of this code:

You can see the plotted image is a 5, which is what the label said it should be. If we instead choose the next record, data_list[1], which has a label of 0, we get the following image.


You can tell easily that the handwritten number is indeed zero.

Preparing the MNIST Training Data
We've worked out how to get data out of the MNIST data files and disentangle it so we can make sense of it, and visualise it too. We want to train our neural network with this data, but we need to think just a little about preparing this data before we throw it at our neural network.

We saw earlier that neural networks work better if the input data, and also the output values, are of the right shape, so that they stay within the comfort zone of the network node activation functions.

The first thing we need to do is to rescale the input colour values from the larger range 0 to 255 to the much smaller range 0.01 to 1.0. We've deliberately chosen 0.01 as the lower end of the range to avoid the problems we saw earlier with zero valued inputs, because they can artificially kill weight updates. We don't have to choose 0.99 for the upper end of the input, because we don't need to avoid 1.0 for the inputs. It's only for the outputs that we should avoid the impossible to reach 1.0.

Dividing the raw inputs, which are in the range 0 to 255, by 255 will bring them into the range 0 to 1. We then need to multiply by 0.99 to bring them into the range 0.0 to 0.99. We then add 0.01 to shift them up to the desired range 0.01 to 1.00. The following Python code shows this in action:

scaled_input = (numpy.asfarray(all_values[1:]) / 255.0 * 0.99) + 0.01
print(scaled_input)

The output confirms that the values are now in the range 0.01 to 1.00.



So we've prepared the MNIST data by rescaling and shifting it, ready to throw at our neural network for both training and querying.

We now need to think about the neural network's outputs. We saw earlier that the outputs should match the range of values that the activation function can push out. The logistic function we're using can't push out numbers like -2.0 or 255. The range is between 0.0 and 1.0, and in fact you can't reach 0.0 or 1.0, as the logistic function only approaches these extremes without actually getting there. So it looks like we'll have to scale our target values when training.

But actually, we have a deeper question to ask ourselves. What should the output even be? Should it be an image of the answer? That would mean having 28x28 = 784 output nodes.

If we take a step back and think about what we're asking the neural network, we realise we're asking it to classify the image and assign the correct label. That label is one of 10 numbers, from 0 to 9. That means we should be able to have an output layer of 10 nodes, one for each of the possible answers, or labels. If the answer was "0" the first output layer node would fire and the rest should be silent. If the answer was "9" the last output layer node would fire and the rest would be silent. The following illustrates this scheme, with some example outputs too.



The first example is where the neural network thinks it has seen the number "5". You can see that the largest signal emerging from the output layer is from the node with label 5. Remember this is the sixth node, because we're starting from a label for zero. That's easy enough. The rest of the output nodes produce a small signal very close to zero. Rounding errors might result in an output of zero, but in fact you'll remember the activation function doesn't produce an actual zero.

The next example shows what might happen if the neural network thinks it has seen a handwritten "zero". Again the largest output by far is from the first output node, corresponding to the label "0".

The last example is more interesting. Here the neural network has produced the largest output signal from the last node, corresponding to the label "9". However it has a moderately big output from the node for "4". Normally we would go with the biggest signal, but you can see how the network partly believed the answer could have been "4". Perhaps the handwriting made it hard to be sure? This sort of uncertainty does happen with neural networks, and rather than see it as a bad thing, we should see it as a useful insight into how another answer was also a contender.

That's great! Now we need to turn these ideas into target arrays for the neural network training. You can see that if the label for a training example is "5", we need to create a target array for the output nodes where all the elements are small except the one corresponding to the label "5". That could look like the following: [0, 0, 0, 0, 0, 1, 0, 0, 0, 0].

In fact we need to rescale those numbers, because we've already seen how trying to get the neural network to create outputs of 0 and 1, which are impossible for the activation function, will drive large weights and a saturated network. So we'll use the values 0.01 and 0.99 instead, so the target for the label "5" should be [0.01, 0.01, 0.01, 0.01, 0.01, 0.99, 0.01, 0.01, 0.01, 0.01].

Have a look at the following Python code, which constructs the target array:

# output nodes is 10 (example)
onodes = 10
targets = numpy.zeros(onodes) + 0.01
targets[int(all_values[0])] = 0.99

The first line, other than the comment, simply sets the number of output nodes to 10, which is right for our example with ten labels.

The second line simply uses a convenient numpy function called numpy.zeros() to create an array filled with zeros. The parameter it takes is the size and shape of the array we want. Here we just want a simple one of length onodes, which is the number of nodes on the final output layer. We add 0.01 to fix the problem with zeros we just talked about.

The next line takes the first element of the MNIST dataset record, which is the training target label, and converts that string into an integer. Remember that the record is read from the source files as a text string, not a number. Once that conversion is done, that target label is used to set the right element of the targets list to 0.99. It looks neat, because a label of "0" will be converted to integer 0, which is the correct index into the targets[] array for that label. Similarly, a label of "9" will be converted to integer 9, and targets[9] is indeed the last element of that array.

Thefollowingshowsanexampleofthisworking:
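(The notebook shows this as a screenshot; here is a minimal sketch of the same thing, assuming a hypothetical record whose label is 5.)

all_values = ['5'] + ['0'] * 784     # hypothetical record, label '5'

onodes = 10
targets = numpy.zeros(onodes) + 0.01
targets[int(all_values[0])] = 0.99

print(targets)
# expected output:
# [ 0.01  0.01  0.01  0.01  0.01  0.99  0.01  0.01  0.01  0.01]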

Excellent, we have now worked out how to prepare the inputs for training and querying, and the outputs for training too.

Let's update our Python code to include this work. The following shows the code developed thus far. The code will always be available on github at the following link, but will evolve as we add more to it:

https://github.com/makeyourownneuralnetwork/makeyourownneuralnetwork/blob/master/part2_neural_network_mnist_data.ipynb

You can always see the previous versions as they've developed at the following history view:

https://github.com/makeyourownneuralnetwork/makeyourownneuralnetwork/commits/master/part2_neural_network_mnist_data.ipynb

# python notebook for Make Your Own Neural Network
# code for a 3-layer neural network, and code for learning the MNIST dataset
# (c) Tariq Rashid, 2016
# license is GPLv2

import numpy
# scipy.special for the sigmoid function expit()
import scipy.special
# library for plotting arrays
import matplotlib.pyplot
# ensure the plots are inside this notebook, not an external window
%matplotlib inline

# neural network class definition
class neuralNetwork:

    # initialise the neural network
    def __init__(self, inputnodes, hiddennodes, outputnodes, learningrate):
        # set number of nodes in each input, hidden, output layer
        self.inodes = inputnodes
        self.hnodes = hiddennodes
        self.onodes = outputnodes

        # link weight matrices, wih and who
        # weights inside the arrays are w_i_j, where link is from node i to node j in the next layer
        # w11 w21
        # w12 w22 etc
        self.wih = numpy.random.normal(0.0, pow(self.hnodes, -0.5), (self.hnodes, self.inodes))
        self.who = numpy.random.normal(0.0, pow(self.onodes, -0.5), (self.onodes, self.hnodes))

        # learning rate
        self.lr = learningrate

        # activation function is the sigmoid function
        self.activation_function = lambda x: scipy.special.expit(x)

        pass

    # train the neural network
    def train(self, inputs_list, targets_list):
        # convert inputs list to 2d array
        inputs = numpy.array(inputs_list, ndmin=2).T
        targets = numpy.array(targets_list, ndmin=2).T

        # calculate signals into hidden layer
        hidden_inputs = numpy.dot(self.wih, inputs)
        # calculate the signals emerging from hidden layer
        hidden_outputs = self.activation_function(hidden_inputs)

        # calculate signals into final output layer
        final_inputs = numpy.dot(self.who, hidden_outputs)
        # calculate the signals emerging from final output layer
        final_outputs = self.activation_function(final_inputs)

        # output layer error is the (target - actual)
        output_errors = targets - final_outputs
        # hidden layer error is the output_errors, split by weights, recombined at hidden nodes
        hidden_errors = numpy.dot(self.who.T, output_errors)

        # update the weights for the links between the hidden and output layers
        self.who += self.lr * numpy.dot((output_errors * final_outputs * (1.0 - final_outputs)), numpy.transpose(hidden_outputs))

        # update the weights for the links between the input and hidden layers
        self.wih += self.lr * numpy.dot((hidden_errors * hidden_outputs * (1.0 - hidden_outputs)), numpy.transpose(inputs))

        pass

    # query the neural network
    def query(self, inputs_list):
        # convert inputs list to 2d array
        inputs = numpy.array(inputs_list, ndmin=2).T

        # calculate signals into hidden layer
        hidden_inputs = numpy.dot(self.wih, inputs)
        # calculate the signals emerging from hidden layer
        hidden_outputs = self.activation_function(hidden_inputs)

        # calculate signals into final output layer
        final_inputs = numpy.dot(self.who, hidden_outputs)
        # calculate the signals emerging from final output layer
        final_outputs = self.activation_function(final_inputs)

        return final_outputs

# number of input, hidden and output nodes
input_nodes = 784
hidden_nodes = 100
output_nodes = 10

# learning rate is 0.3
learning_rate = 0.3

# create instance of neural network
n = neuralNetwork(input_nodes, hidden_nodes, output_nodes, learning_rate)

# load the mnist training data CSV file into a list
training_data_file = open("mnist_dataset/mnist_train_100.csv", 'r')
training_data_list = training_data_file.readlines()
training_data_file.close()

# train the neural network

# go through all records in the training data set
for record in training_data_list:
    # split the record by the ',' commas
    all_values = record.split(',')
    # scale and shift the inputs
    inputs = (numpy.asfarray(all_values[1:]) / 255.0 * 0.99) + 0.01
    # create the target output values (all 0.01, except the desired label which is 0.99)
    targets = numpy.zeros(output_nodes) + 0.01
    # all_values[0] is the target label for this record
    targets[int(all_values[0])] = 0.99
    n.train(inputs, targets)
    pass

You can see we've imported the plotting library at the top, added some code to set the size of the input, hidden and output layers, read the smaller MNIST training data set, and then trained the neural network with those records.

Why have we chosen 784 input nodes? Remember, that's 28 x 28, the pixels which make up the handwritten number image.

The choice of 100 hidden nodes is not so scientific. We didn't choose a number larger than 784 because the idea is that neural networks should find features or patterns in the input which can be expressed in a shorter form than the input itself. So by choosing a value smaller than the number of inputs, we force the network to try to summarise the key features. However if we choose too few hidden layer nodes, then we restrict the ability of the network to find sufficient features or patterns. We'd be taking away its ability to express its own understanding of the MNIST data. Given the output layer needs 10 labels, hence 10 output nodes, the choice of an intermediate 100 for the hidden layer seems to make sense.

It is worth making an important point here. There isn't a perfect method for choosing how many hidden nodes there should be for a problem. Indeed there isn't a perfect method for choosing the number of hidden layers either. The best approaches, for now, are to experiment until you find a good configuration for the problem you're trying to solve.

Testing the Network
Now that we've trained the network, at least on a small subset of 100 records, we want to test how well that worked. We do this against the second dataset, the test dataset.

We first need to get at the test records, and the Python code is very similar to that used to get the training data.

# load the mnist test data CSV file into a list
test_data_file = open("mnist_dataset/mnist_test_10.csv", 'r')
test_data_list = test_data_file.readlines()
test_data_file.close()

We unpack this data in the same way as before, because it has the same structure.

Before we create a loop to go through all the test records, let's just see what happens if we manually run one test. The following shows the first record from the test dataset being used to query the now trained neural network.
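(The notebook shows those steps as a screenshot; a minimal sketch of the same thing, assuming the test file has been read into test_data_list as above:)

# take the first record from the test data set
all_values = test_data_list[0].split(',')
# print the label, which is the first value
print(all_values[0])
# scale and shift the remaining values, and query the trained network
outputs = n.query((numpy.asfarray(all_values[1:]) / 255.0 * 0.99) + 0.01)
print(outputs)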



You can see that the label for the first record from the test dataset is 7. That's what we hope the neural network will answer when we query it.

Plotting the pixel values as an image confirms the handwritten number is indeed a 7.

Querying the trained network produces a list of numbers, the outputs from each of the output nodes. You can quickly see that one output value is much larger than the others, and is the one corresponding to the label 7. That's the eighth element, because the first one corresponds to the label 0.

It worked!

This is a real moment to savour. All our hard work throughout this guide was worth it!

We trained our neural network and we just got it to tell us what it thinks is the number represented by that picture. Remember that it hasn't seen that picture before, it wasn't part of the training dataset. So the neural network was able to correctly classify a handwritten character that it had not seen before. That is massive!


With just a few lines of simple Python we have created a neural network that learns to do something that many people would consider artificial intelligence - it learns to recognise images of human handwriting.

This is even more impressive given that we only trained on a tiny subset of the full training data set. Remember that training data set has 60,000 records, and we only trained on 100. I personally didn't think it would work!

Let's crack on and write the code to see how the neural network performs against the rest of the dataset, and keep a score so we can later see if our own ideas for improving the learning worked, and also to compare with how well others have done this.

It's easiest to look at the following code and talk through it:

# test the neural network

# scorecard for how well the network performs, initially empty
scorecard = []

# go through all the records in the test data set
for record in test_data_list:
    # split the record by the ',' commas
    all_values = record.split(',')
    # correct answer is first value
    correct_label = int(all_values[0])
    print(correct_label, "correct label")
    # scale and shift the inputs
    inputs = (numpy.asfarray(all_values[1:]) / 255.0 * 0.99) + 0.01
    # query the network
    outputs = n.query(inputs)
    # the index of the highest value corresponds to the label
    label = numpy.argmax(outputs)
    print(label, "network's answer")
    # append correct or incorrect to list
    if (label == correct_label):
        # network's answer matches correct answer, add 1 to scorecard
        scorecard.append(1)
    else:
        # network's answer doesn't match correct answer, add 0 to scorecard
        scorecard.append(0)
        pass
    pass

Before we jump into the loop which works through all the test dataset records, we create an empty list, called scorecard, which will be the scorecard that we update after each record.

You can see that inside the loop, we do what we did before: we split the text record by the commas to separate out the values. We keep a note of the first value as the correct answer. We grab the remaining values and rescale them so they're suitable for querying the neural network. We keep the response from the neural network in a variable called outputs.

Next is the interesting bit. We know the output node with the largest value is the one the network thinks is the answer. The index of that node, that is, its position, corresponds to the label. That's a long way of saying, the first element corresponds to the label 0, and the fifth element corresponds to the label 4, and so on. Luckily there is a convenient numpy function that finds the largest value in an array and tells us its position: numpy.argmax(). You can read about it online. If it returns 0 we know the network thinks the answer is zero, and so on.

That last bit of code compares the label with the known correct label. If they are the same, a 1 is appended to the scorecard, otherwise a 0 is appended.

I've included some useful print() commands in the code so that we can see for ourselves the correct and predicted labels. The following shows the results of this code, and also of printing out the scorecard.
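(A minimal sketch of printing the scorecard once the loop has finished; the notebook shows the output as a screenshot:)

# show the scorecard - a 1 for each correct answer, a 0 for each wrong one
print(scorecard)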


Not so great this time! We can see that there are quite a few mismatches. The final scorecard shows that out of ten test records, the network got 6 right. That's a score of 60%. That's not actually too bad given the small training set we used.

Let's finish off with some code to print out that test score as a fraction.

# calculate the performance score, the fraction of correct answers
scorecard_array = numpy.asarray(scorecard)
print("performance = ", scorecard_array.sum() / scorecard_array.size)

This is a simple calculation to work out the fraction of correct answers. It's the sum of the 1 entries on the scorecard divided by the total number of entries, which is the size of the scorecard. Let's see what this produces.

That produces the fraction 0.6, or 60% accuracy, as we expected.

Training and Testing with the Full Datasets
Let's add all this new code we've just developed to test the network's performance to our main program.

While we're at it, let's change the file names so that we're now pointing to the full training data set of 60,000 records, and the test data set of 10,000 records. We previously saved those files as mnist_dataset/mnist_train.csv and mnist_dataset/mnist_test.csv. We're getting serious now!

Remember, you can get the Python notebook online at github:

https://github.com/makeyourownneuralnetwork/makeyourownneuralnetwork/blob/master/part2_neural_network_mnist_data.ipynb

The history of that code is also available on github so you can see the code as it developed:

https://github.com/makeyourownneuralnetwork/makeyourownneuralnetwork/commits/master/part2_neural_network_mnist_data.ipynb


The result of training our simple 3-layer neural network against the full 60,000 training examples, and then testing it against the 10,000 records, gives us an overall performance score of 0.9473. That is very very good. Almost 95% accurate!

It is worth comparing this score of just under 95% accuracy against the industry benchmarks recorded at http://yann.lecun.com/exdb/mnist/. We can see that we're better than some of the historic benchmarks, and we are about the same performance as the simplest neural network approach listed there, which has a performance of 95.3%.

That's not bad at all. We should be very pleased that our very first go at a simple neural network achieves the kind of performance that a professional neural network researcher achieved.

By the way, it shouldn't surprise you that crunching through 60,000 training examples, each requiring a set of feedforward calculations from 784 input nodes, through 100 hidden nodes, and also doing an error feedback and weight update, all takes a while even for a fast modern home computer. My new laptop took about 2 minutes to get through the training loop. Yours may be quicker or slower.

Some Improvements: Tweaking the Learning Rate
A 95% performance score on the MNIST dataset with our first neural network, using only simple ideas and simple Python, is not bad at all, and if you wanted to stop here you would be entirely justified.

But let's see if we can make some easy improvements.

The first improvement we can try is to adjust the learning rate. We set it at 0.3 previously without really experimenting with different values.

Let's try doubling it to 0.6, to see if a boost will actually be helpful or harmful to the overall network learning. If we run the code we get a performance score of 0.9047. That's worse than before. So it looks like the larger learning rate leads to some bouncing around and overshooting during the gradient descent.

Let's try again with a learning rate of 0.1. This time the performance is an improvement at 0.9523. It's similar in performance to one listed on that website which has 1000 hidden nodes. We're doing well with much less!


What happens if we keep going and set an even smaller learning rate of 0.01? The performance isn't so good at 0.9241. So it seems having too small a learning rate is damaging. This makes sense because we're limiting the speed at which gradient descent happens, we're making the steps too small.

The following plots a graph of these results. It's not a very scientific approach because we should really do these experiments many times to reduce the effect of randomness and bad journeys down the gradient descent, but it is still useful to see the general idea that there is a sweet spot for learning rate.
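(As a sketch of how that graph might be produced - the scores are the ones we measured above, the plotting calls are just one straightforward way to do it, not necessarily the book's own notebook code, and matplotlib.pyplot is already imported at the top of our program:)

# learning rates tried so far, and the performance each one achieved
learning_rates = [0.01, 0.1, 0.3, 0.6]
performance = [0.9241, 0.9523, 0.9473, 0.9047]

matplotlib.pyplot.plot(learning_rates, performance)
matplotlib.pyplot.xlabel("learning rate")
matplotlib.pyplot.ylabel("performance")
matplotlib.pyplot.show()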

The plot suggested that between a learning rate of 0.1 and 0.3 there might be better performance, so let's try a learning rate of 0.2. The performance is 0.9537. That is indeed a tiny bit better than either 0.1 or 0.3. This idea of plotting graphs to get a better feel for what is going on is something you should consider in other scenarios too - pictures help us understand much better than a list of numbers!

So we'll stick with a learning rate of 0.2, which seems to be a sweet spot for the MNIST dataset and our neural network.

By the way, if you run this code yourself, your own scores will be slightly different because the whole process is a little random. Your initial random weights won't be the same as my initial random weights, and so your own code will take a different route down the gradient descent than mine.

Some Improvements: Doing Multiple Runs
The next improvement we can do is to repeat the training several times against the data set. Some people call each run through an epoch. So a training session with 10 epochs means running through the entire training data set 10 times. Why would we do that? Especially if the time our computers take goes up to 10 or 20 or even 30 minutes? The reason it is worth doing is that we're helping those weights do that gradient descent by providing more chances to creep down those slopes.

Let's try it with 2 epochs. The code changes slightly because we now add an extra loop around the training code. The following shows this outer loop, colour coded to help see what's happening.

# train the neural network

# epochs is the number of times the training data set is used for training
epochs = 2

for e in range(epochs):
    # go through all records in the training data set
    for record in training_data_list:
        # split the record by the ',' commas
        all_values = record.split(',')
        # scale and shift the inputs
        inputs = (numpy.asfarray(all_values[1:]) / 255.0 * 0.99) + 0.01
        # create the target output values (all 0.01, except the desired label which is 0.99)
        targets = numpy.zeros(output_nodes) + 0.01
        # all_values[0] is the target label for this record
        targets[int(all_values[0])] = 0.99
        n.train(inputs, targets)
        pass
    pass

The resulting performance with 2 epochs is 0.9579, a small improvement over just 1 epoch.

Just like we did with tweaking the learning rate, let's experiment with a few different numbers of epochs and plot a graph to visualise the effect this has. Intuition suggests the more training you do, the better the performance. Some of you will realise that too much training is actually bad, because the network overfits to the training data, and then performs badly against new data that it hasn't seen before. This overfitting is something to beware of across many different kinds of machine learning, not just neural networks.

Here's what happens:

You can see the results aren't quite so predictable. You can see there is a sweet spot around 5 or 7 epochs. After that performance degrades, and this may be the effect of overfitting. The dip at 6 epochs is probably a bad run with the network getting stuck in a bad minimum during gradient descent. Actually, I would have expected much more variation in the results because we've not done many experiments for each data point to reduce the effect of expected variations from what is essentially a random process. That's why I've left that odd point for 6 epochs in, to remind us that neural network learning is a random process at heart and can sometimes not work so well, and sometimes work really badly.

Another idea is that the learning rate is too high for larger numbers of epochs. Let's try this experiment again and tune down the learning rate from 0.2 to 0.1 and see what happens.

The peak performance is now up to 0.9628, or 96.28%, with 7 epochs.

The following graph shows the new performance with learning rate at 0.1 overlaid onto the previous one.

You can see that calming down the learning rate did indeed produce better performance with more epochs. That peak of 0.9689 represents an approximate error rate of 3%, which is comparable to the network benchmarks on Yann LeCun's website.

Intuitively it makes sense that if you plan to explore the gradient descent for much longer (more epochs), you can afford to take shorter steps (learning rate), and overall you'll find a better path down. It does seem that 5 epochs is probably the sweet spot for this neural network against this MNIST learning task. Again keep in mind that we did this in a fairly unscientific way. To do it properly you would have to do this experiment many times for each combination of learning rates and epochs to minimise the effect of randomness that is inherent in gradient descent.

Change Network Shape
One thing we haven't yet tried, and perhaps should have earlier, is to change the shape of the neural network. Let's try changing the number of middle hidden layer nodes. We've had them set to 100 for far too long!

Before we jump in and run experiments with different numbers of hidden nodes, let's think about what might happen if we do. The hidden layer is the layer where the learning happens. Remember the input nodes simply bring in the input signals, and the output nodes simply push out the network's answer. It's the hidden layer (or layers) which has to learn to turn the input into the answer. It's where the learning happens. Actually, it's the link weights before and after the hidden nodes that do the learning, but you know what I mean.


If we had too few hidden nodes, say 3, you can imagine there is no way there is enough space to learn whatever a network learns, to somehow turn all the inputs into the correct outputs. It would be like asking a car with 5 seats to carry 10 people. You just can't fit that much stuff inside. Computer scientists call this kind of limit a learning capacity. You can't learn more than the learning capacity, but you can change the vehicle, or the network shape, to increase the capacity.

What if we had 10,000 hidden nodes? Well, we won't be short of learning capacity, but we might find it harder to train the network because now there are too many options for where the learning should go. Maybe it would take 10,000s of epochs to train such a network.

Let's run some experiments and see what happens.
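(One way to run that experiment, as a sketch: the loop below is our own illustration, reusing the neuralNetwork class, the node and learning rate settings, and the training_data_list and test_data_list from the main program.)

# try several hidden layer sizes and record the test performance for each
for hidden_nodes in [5, 10, 100, 200, 500]:
    n = neuralNetwork(input_nodes, hidden_nodes, output_nodes, learning_rate)

    # train for one epoch
    for record in training_data_list:
        all_values = record.split(',')
        inputs = (numpy.asfarray(all_values[1:]) / 255.0 * 0.99) + 0.01
        targets = numpy.zeros(output_nodes) + 0.01
        targets[int(all_values[0])] = 0.99
        n.train(inputs, targets)
        pass

    # test against the test data, keeping a scorecard as before
    scorecard = []
    for record in test_data_list:
        all_values = record.split(',')
        correct_label = int(all_values[0])
        inputs = (numpy.asfarray(all_values[1:]) / 255.0 * 0.99) + 0.01
        label = numpy.argmax(n.query(inputs))
        scorecard.append(1 if label == correct_label else 0)
        pass

    scorecard_array = numpy.asarray(scorecard)
    print(hidden_nodes, "hidden nodes, performance =",
          scorecard_array.sum() / scorecard_array.size)
    pass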

You can see that for low numbers of hidden nodes the results are not as good as for higher numbers. We expected that. But the performance from just 5 hidden nodes was 0.7001. That is pretty amazing given that from so few learning locations the network is still about 70% right. Remember we've been running with 100 hidden nodes thus far. Just 10 hidden nodes gets us 0.8998 accuracy, which again is pretty impressive. That's 1/10th of the nodes we've been used to, and the network's performance jumps to 90%.

This point is worth appreciating. The neural network is able to give really good results with so few hidden nodes, or learning locations. That's a testament to their power.

As we increase the number of hidden nodes, the results do improve, but not as drastically. The time taken to train the network also increases significantly, because each extra hidden node means new network links to every node in the preceding and next layers, which all require lots more calculations! So we have to choose a number of hidden nodes with a tolerable run time. For my computer that's 200 nodes. Your computer may be faster or slower.

We've also set a new record for accuracy, with 0.9751 with 200 nodes. And a long run with 500 nodes gave us 0.9762. That's really good compared to the benchmarks listed on LeCun's website.

Looking back at the graphs you can see that the previous stubborn limit of about 95% accuracy was broken by changing the shape of the network.

Good Work!
Looking back over this work, we've created a neural network using only the simple concepts we covered earlier, and using simple Python.

And that neural network has performed so well, without any extra fancy mathematical magic, that its performance is very respectable compared to networks that academics and researchers make.

There's more fun in part three of this guide, but even if you don't explore those ideas, don't hesitate to experiment further with the neural network you've already made - try a different number of hidden nodes, or a different scaling, or even a different activation function, just to see what happens.

Final Code
The following shows the final code in case you can't access the code on github, or just prefer a copy here for easy reference.

# python notebook for Make Your Own Neural Network
# code for a 3-layer neural network, and code for learning the MNIST dataset
# (c) Tariq Rashid, 2016
# license is GPLv2

import numpy
# scipy.special for the sigmoid function expit()
import scipy.special
# library for plotting arrays
import matplotlib.pyplot
# ensure the plots are inside this notebook, not an external window
%matplotlib inline

# neural network class definition
class neuralNetwork:

    # initialise the neural network
    def __init__(self, inputnodes, hiddennodes, outputnodes, learningrate):
        # set number of nodes in each input, hidden, output layer
        self.inodes = inputnodes
        self.hnodes = hiddennodes
        self.onodes = outputnodes

        # link weight matrices, wih and who
        # weights inside the arrays are w_i_j, where link is from node i to node j in the next layer
        # w11 w21
        # w12 w22 etc
        self.wih = numpy.random.normal(0.0, pow(self.hnodes, -0.5), (self.hnodes, self.inodes))
        self.who = numpy.random.normal(0.0, pow(self.onodes, -0.5), (self.onodes, self.hnodes))

        # learning rate
        self.lr = learningrate

        # activation function is the sigmoid function
        self.activation_function = lambda x: scipy.special.expit(x)

        pass

    # train the neural network
    def train(self, inputs_list, targets_list):
        # convert inputs list to 2d array
        inputs = numpy.array(inputs_list, ndmin=2).T
        targets = numpy.array(targets_list, ndmin=2).T

        # calculate signals into hidden layer
        hidden_inputs = numpy.dot(self.wih, inputs)
        # calculate the signals emerging from hidden layer
        hidden_outputs = self.activation_function(hidden_inputs)

        # calculate signals into final output layer
        final_inputs = numpy.dot(self.who, hidden_outputs)
        # calculate the signals emerging from final output layer
        final_outputs = self.activation_function(final_inputs)

        # output layer error is the (target - actual)
        output_errors = targets - final_outputs
        # hidden layer error is the output_errors, split by weights, recombined at hidden nodes
        hidden_errors = numpy.dot(self.who.T, output_errors)

        # update the weights for the links between the hidden and output layers
        self.who += self.lr * numpy.dot((output_errors * final_outputs * (1.0 - final_outputs)), numpy.transpose(hidden_outputs))

        # update the weights for the links between the input and hidden layers
        self.wih += self.lr * numpy.dot((hidden_errors * hidden_outputs * (1.0 - hidden_outputs)), numpy.transpose(inputs))

        pass

    # query the neural network
    def query(self, inputs_list):
        # convert inputs list to 2d array
        inputs = numpy.array(inputs_list, ndmin=2).T

        # calculate signals into hidden layer
        hidden_inputs = numpy.dot(self.wih, inputs)
        # calculate the signals emerging from hidden layer
        hidden_outputs = self.activation_function(hidden_inputs)

        # calculate signals into final output layer
        final_inputs = numpy.dot(self.who, hidden_outputs)
        # calculate the signals emerging from final output layer
        final_outputs = self.activation_function(final_inputs)

        return final_outputs

# number of input, hidden and output nodes
input_nodes = 784
hidden_nodes = 200
output_nodes = 10

# learning rate
learning_rate = 0.1

# create instance of neural network
n = neuralNetwork(input_nodes, hidden_nodes, output_nodes, learning_rate)

# load the mnist training data CSV file into a list
training_data_file = open("mnist_dataset/mnist_train.csv", 'r')
training_data_list = training_data_file.readlines()
training_data_file.close()

# train the neural network

# epochs is the number of times the training data set is used for training
epochs = 5

for e in range(epochs):
    # go through all records in the training data set
    for record in training_data_list:
        # split the record by the ',' commas
        all_values = record.split(',')
        # scale and shift the inputs
        inputs = (numpy.asfarray(all_values[1:]) / 255.0 * 0.99) + 0.01
        # create the target output values (all 0.01, except the desired label which is 0.99)
        targets = numpy.zeros(output_nodes) + 0.01
        # all_values[0] is the target label for this record
        targets[int(all_values[0])] = 0.99
        n.train(inputs, targets)
        pass
    pass

# load the mnist test data CSV file into a list
test_data_file = open("mnist_dataset/mnist_test.csv", 'r')
test_data_list = test_data_file.readlines()
test_data_file.close()

# test the neural network

# scorecard for how well the network performs, initially empty
scorecard = []

# go through all the records in the test data set
for record in test_data_list:
    # split the record by the ',' commas
    all_values = record.split(',')
    # correct answer is first value
    correct_label = int(all_values[0])
    # scale and shift the inputs
    inputs = (numpy.asfarray(all_values[1:]) / 255.0 * 0.99) + 0.01
    # query the network
    outputs = n.query(inputs)
    # the index of the highest value corresponds to the label
    label = numpy.argmax(outputs)
    # append correct or incorrect to list
    if (label == correct_label):
        # network's answer matches correct answer, add 1 to scorecard
        scorecard.append(1)
    else:
        # network's answer doesn't match correct answer, add 0 to scorecard
        scorecard.append(0)
        pass
    pass

# calculate the performance score, the fraction of correct answers
scorecard_array = numpy.asarray(scorecard)
print("performance = ", scorecard_array.sum() / scorecard_array.size)


Part 3 - Even More Fun

If you don't play, you don't learn.

In this part of the guide we'll explore further ideas just because they're fun. They aren't necessary to understanding the basics of neural networks, so don't feel you have to understand everything here.

Because this is a fun extra section, the pace will be slightly quicker, but we will still try to explain the ideas in plain English.

Your Own Handwriting
Throughout this guide we've been using images of handwritten numbers from the MNIST dataset. Why not use your own handwriting?

In this experiment, we'll create our test dataset using our own handwriting. We'll also try using different styles of writing, and noisy or shaky images, to see how well our neural network copes.

You can create images using any image editing or painting software you like. You don't have to use the expensive Photoshop; the GIMP is a free open source alternative available for Windows, Mac and Linux. You can even use a pen on paper and photograph your writing with a smartphone or camera, or even use a proper scanner. The only requirement is that the image is square (the width is the same as the height) and that you save it in PNG format. You'll often find the saving format option under File > Save As, or File > Export, in your favourite image editor.

Here are some images I made.

The number 5 is simply my own handwriting. The 4 is done using a chalk rather than a marker. The number 3 is my own handwriting but deliberately with bits chopped out. The 2 is a very traditional newspaper or book typeface but blurred a bit. The 6 is a deliberately wobbly, shaky image, almost like a reflection in water. The last image is the same as the previous one but with noise added, to see if we can make the neural network's job even harder!

This is fun, but there is a serious point here. Scientists have been amazed at the human brain's ability to continue to function amazingly well after suffering damage. The suggestion is that neural networks distribute what they've learned across several link weights, which means if they suffer some damage, they can still perform fairly well. This also means that they can perform fairly well if the input image is damaged or incomplete. That's a powerful thing to have. That's what we want to test with the chopped-up 3 in the image set above.

We'll need to create smaller versions of these PNG images rescaled to 28 by 28 pixels, to match what we've used from the MNIST data. You can use your image editor to do this.

Python libraries again help us out with reading and decoding the data from common image file formats, including the PNG format. Have a look at the following simple code:

import scipy.misc

img_array = scipy.misc.imread(image_file_name, flatten=True)

img_data = 255.0 - img_array.reshape(784)
img_data = (img_data / 255.0 * 0.99) + 0.01

The scipy.misc.imread() function is the one that helps us get data out of image files such as PNG or JPG files. We have to import the scipy.misc library to use it. The flatten=True parameter turns the image into a simple array of floating point numbers, and if the image were coloured, the colour values would be flattened into greyscale, which is what we need.

The next line reshapes the array from a 28x28 square into a long list of values, which is what we need to feed to our neural network. We've done that many times before. What is new is the subtraction of the array's values from 255.0. The reason for this is that it is conventional for 0 to mean black and 255 to mean white, but the MNIST dataset has this the opposite way around, so we have to reverse the values to match what the MNIST data does.

The last line does the familiar rescaling of the data values so they range from 0.01 to 1.0.

Sample code to demonstrate the reading of PNG files is always online at github:

https://github.com/makeyourownneuralnetwork/makeyourownneuralnetwork/blob/master/part3_load_own_images.ipynb
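(Once img_data has been prepared like that, querying the trained network is the same as with the MNIST test records; as a minimal sketch:)

# query the trained network with our own image data
outputs = n.query(img_data)
# the index of the highest value corresponds to the label
label = numpy.argmax(outputs)
print("network says", label)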


We need to create a version of our basic neural network program that trains on the MNIST training dataset, but instead of testing with the MNIST test set, it tests against data created from our own images.

The new program is online at github:

https://github.com/makeyourownneuralnetwork/makeyourownneuralnetwork/blob/master/part3_neural_network_mnist_and_own_data.ipynb

Does it work? It does! The following summarises the results of querying with our own images.

You can see that the neural network recognised all of the images we created, including the deliberately damaged 3. Only the 6 with added noise failed.

Try it with your own images, especially handwritten ones, to prove to yourself that the neural network really does work.

And see how far you can go with damaged or deformed images. You'll be impressed with how resilient the neural network is.


Inside the Mind of a Neural Network
Neural networks are useful for solving the kinds of problems that we don't really know how to solve with simple crisp rules. Imagine writing a set of rules to apply to images of handwritten numbers in order to decide what the number was. You can imagine that wouldn't be easy, and our attempts probably not very successful either.

Mysterious Black Box
Once a neural network is trained, and performs well enough on test data, you essentially have a mysterious black box. You don't really know how it works out the answer, it just does.

This isn't always a problem if you're just interested in answers, and don't really care how they're arrived at. But it is a disadvantage of these kinds of machine learning methods - the learning doesn't often translate into understanding or wisdom about the problem the black box has learned to solve.

Let's see if we can take a peek inside our simple neural network to see if we can understand what it has learned, to visualise the knowledge it has gathered through training.

We could look at the weights, which is after all what the neural network learns. But that's not likely to be that informative. Especially as the way neural networks work is to distribute their learning across different link weights. This gives them an advantage in that they are resilient to damage, just like biological brains are. It's unlikely that removing one node, or even quite a few nodes, will completely damage the ability of a neural network to work well.

Here's a crazy idea.

Backwards Query
Normally, we feed a trained neural network a question, and out pops an answer. In our example, that question is an image of a human handwritten number. The answer is a label representing a number from 0 to 9.

What if we turned this around and did it backwards? What if we fed a label into the output nodes, and fed the signal backwards through the already trained network, until out popped an image from the input nodes? The following diagram shows the normal forward query, and this crazy reverse back query idea.


We already know how to propagate signals through a network, moderating them with link weights, and recombining them at nodes before applying an activation function. All this works for signals flowing backwards too, except that the inverse activation function is used. If y = f(x) was the forward activation, then the inverse is x = g(y). This isn't that hard to work out for the logistic function using simple algebra:

y = 1 / (1 + e^-x)

1 + e^-x = 1 / y

e^-x = (1 / y) - 1 = (1 - y) / y

-x = ln [ (1 - y) / y ]

x = ln [ y / (1 - y) ]

This is called the logit function, and the Python scipy.special library provides this function as scipy.special.logit(), just like it provides scipy.special.expit() for the logistic sigmoid function.

Before applying the logit() inverse activation function, we need to make sure the signals are valid. What does this mean? Well, you remember the logistic sigmoid function takes any value and outputs a value somewhere between 0 and 1, but not including 0 and 1 themselves. The inverse function must take values from the same range, somewhere between 0 and 1, excluding 0 and 1, and pop out a value that could be any positive or negative number. To achieve this, we simply take all the values at a layer about to have the logit() applied, and rescale them to the valid range. I've chosen the range 0.01 to 0.99.
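(A minimal sketch of that rescale-then-invert step follows; the helper name rescale_to_valid_range and the exact rescaling arithmetic are our own illustration, not the book's notebook code:)

import numpy
import scipy.special

# inverse of the sigmoid activation function
inverse_activation_function = lambda x: scipy.special.logit(x)

def rescale_to_valid_range(signals):
    # rescale an array of signals into the range 0.01 to 0.99,
    # so that logit() never sees 0.0 or 1.0
    signals = signals - numpy.min(signals)
    signals = signals / numpy.max(signals)
    signals = signals * 0.98 + 0.01
    return signals

# example: the output layer values for the label "0", fed backwards
final_outputs = numpy.asfarray([0.99, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01])
final_inputs = inverse_activation_function(rescale_to_valid_range(final_outputs))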


The code is always available online at github at the following link:

https://github.com/makeyourownneuralnetwork/makeyourownneuralnetwork/blob/master/part3_neural_network_mnist_backquery.ipynb

The Label "0"
Let's see what happens if we do a back query with the label "0". That is, we present values to the output nodes which are all 0.01, except for the first node, representing the label "0", where we use the value 0.99. In other words, the array [0.99, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01].

The following shows the image that pops out of the input nodes.

That is interesting!

That image is a privileged insight into the mind of a neural network. What does it mean? How do we interpret it?

The main thing we notice is that there is a round shape in the image. That makes sense, because we're asking the neural network what the ideal question is for an answer to be "0".

We also notice dark, light and some medium grey areas:

The dark areas are the parts of the question image which, if marked by a pen, make up supporting evidence that the answer should be a "0". These make sense, as they seem to form the outline of a zero shape.

The light areas are the bits of the question image which should remain clear of any pen marks to support the case that the answer is a "0". Again these make sense, as they form the middle part of a zero shape.

The neural network is broadly indifferent to the grey areas.

So we have actually understood, in a rough sense, what the neural network has learned about classifying images with the label "0".

That's a rare insight, as more complex networks with more layers, or more complex problems, may not have such readily interpretable results. You're encouraged to experiment and have a go yourself!

More Brain Scans
The following shows the results of back querying the rest of the digits.

Wow! Again some really interesting images. They're like ultrasound scans into the brain of the neural network.

Some notes about these images:

The "7" is really clear. You can see the dark bits which, if marked in the query image, strongly suggest a label "7". You can also see the additional "white" area which must be clear of any marking. Together these two characteristics indicate a "7".

The same applies to the "3" - there are dark areas which, if marked, indicate a "3", and there are white areas which must be clear.

The "2" and "5" are similarly clear too.

The "4" is interesting in that there is a shape which appears to have 4 quadrants, and excluded areas too.

The "8" is largely made up of a snowman-shaped white area, suggesting that an eight is characterised by markings kept out of these head and body regions.

The "1" is rather puzzling. It seems to focus more on areas which must be kept clear than on areas which must be marked. That's ok, it's what the network happens to have learned from the examples.

The "9" is not very clear at all. It does have a definite dark area and some finer shapes for the white areas. This is what the network has learned, and overall, when combined with what it has learned for the rest of the digits, allows the network to perform really well at 97.5% accuracy. We might look at this image and conclude that more training examples might help the network learn a clearer template for "9".

So there you have it - a rare insight into the workings of the mind of a neural network.


Creating New Training Data: Rotations
If you think about the MNIST training data, you realise that it is quite a rich set of examples of how people write numbers. There are all sorts of styles of handwriting in there, good and bad too.

The neural network has to learn as many of these variations as possible. It does help that there are many forms of the number 4 in there. Some are squished, some are wide, some are rotated, some have an open top and some are closed.

Wouldn't it be useful if we could create yet more such variations as examples? How would we do that? We can't easily collect thousands more examples of human handwriting. We could, but it would be very laborious.

A cool idea is to take the existing examples, and create new ones from those by rotating them clockwise and anticlockwise, by 10 degrees for example. For each training example we could have two additional examples. We could create many more examples with different rotation angles, but for now let's just try +10 and -10 degrees to see if the idea works.

Python's many extensions and libraries come to the rescue again. The scipy.ndimage.interpolation.rotate() function can rotate an array by a given angle, which is exactly what we need. Remember that our inputs are a one dimensional long list of length 784, because we've designed our neural networks to take a long list of input signals. We'll need to reshape that long list into a 28x28 array so we can rotate it, and then unroll the result back into a 784-long list of input signals before we feed it to our neural network.

The following code shows how we use the scipy.ndimage.interpolation.rotate() function, assuming we have the scaled_input array from before:

# scipy.ndimage for rotating image arrays
import scipy.ndimage.interpolation

# create rotated variations
# rotated anticlockwise by 10 degrees
inputs_plus10_img = scipy.ndimage.interpolation.rotate(scaled_input.reshape(28,28), 10, cval=0.01, reshape=False)
# rotated clockwise by 10 degrees
inputs_minus10_img = scipy.ndimage.interpolation.rotate(scaled_input.reshape(28,28), -10, cval=0.01, reshape=False)

You can see that the original scaled_input array is reshaped to a 28 by 28 array, then rotated. The reshape=False parameter prevents the library from being overly helpful and squishing the image so that it all fits after the rotation without any bits being clipped off. The cval is the value used to fill in array elements which didn't exist in the original image but have now come into view. We don't want the default value of 0.0, but instead 0.01, because we've shifted the range to avoid zeros being input to our neural network.
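(A minimal sketch of feeding those variations into training - this assumes scaled_input and targets from the training loop earlier, and the reshape(784) unrolls the rotated 28x28 arrays back into the long lists our network expects:)

# train on the original record ...
n.train(scaled_input, targets)
# ... and on the two rotated variations too
n.train(inputs_plus10_img.reshape(784), targets)
n.train(inputs_minus10_img.reshape(784), targets)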

Record 6 (the seventh record) of the smaller MNIST training set is a handwritten number "1". You can see the original and the two additional variations produced by the code in the diagram below.

You can see the benefits clearly. The version of the original image rotated +10 degrees provides an example where someone might have a style of writing that slopes their 1s backwards. Even more interesting is the version of the original rotated -10 degrees, which is clockwise. You can see that this version is actually straighter than the original, and in some sense a more representative image to learn from.

Let's create a new Python notebook with the original neural network code, but now with additional training examples created by rotating the originals 10 degrees in both directions. This code is available online at github at the following link:

https://github.com/makeyourownneuralnetwork/makeyourownneuralnetwork/blob/master/part2_neural_network_mnist_data_with_rotations.ipynb

An initial run with the learning rate set to 0.1, and only one training epoch, gave a performance of 0.9669. That's a solid improvement over 0.954 without the additional rotated training images. This performance is already amongst the better ones listed on Yann LeCun's website.

Let's run a series of experiments, varying the number of epochs, to see if we can push this already good performance up even more. Let's also reduce the learning rate to 0.01, because we are now creating much more training data, so we can afford to take smaller, more cautious learning steps, as we've extended the learning time overall.


Remember that we don't expect to get 100%, as there is very likely an inherent limit due to our specific neural network architecture or the completeness of our training data, so we may never hope to get above 98% or some other figure. By "specific neural network architecture" we mean the choice of nodes in each layer, the choice of hidden layers, the choice of activation function, and so on.

Here's the graph showing the performance as we vary the angle of the additional rotated training images. The performance without the additional rotated training examples is also shown for easy comparison.

You can see that with 5 epochs the best result is 0.9745, or 97.5% accuracy. That is a jump up again on our previous record.

It is worth noticing that for large angles the performance degrades. That makes sense, as large angles mean we create images which don't actually represent the numbers at all. Imagine a "3" on its side at 90 degrees. That's not a three anymore. So by adding training examples with overly rotated images, we're reducing the quality of the training by adding false examples. Ten degrees seems to be the optimal angle for maximising the value of additional data.

The performance for 10 epochs peaks at a record-breaking 0.9787, or almost 98%! That is really a stunning result, amongst the best for this kind of simple network. Remember we haven't done any fancy tricks to the network or data that some people will do; we've kept it simple, and still achieved a result to be very proud of.

Well done!


Epilogue

In this guide I hope that you have seen how some problems are easy for humans to solve, but hard for traditional computer approaches. Image recognition is one of these so-called artificial intelligence challenges.

Neural networks have enabled huge progress on image recognition, and a wide range of other kinds of hard problems too. A key part of their early motivation was the puzzle that biological brains, like pigeon or insect brains, appeared to be simpler and slower than today's huge supercomputers, and yet they could carry out complex tasks like flight, feeding and building homes. Those biological brains also seemed extremely resilient to damage, or to imperfect signals. Digital computers and traditional computing weren't either of these things.

Today, neural networks are a key part of some of the most fantastic successes in artificial intelligence. There is continued huge interest in neural networks and machine learning, especially deep learning - where a hierarchy of machine learning methods is used. In early 2016, Google's DeepMind beat a world master at the ancient game of Go. This is a massive milestone for artificial intelligence, because Go requires much deeper strategy and nuance than chess, for example, and researchers had thought a computer playing that well was years off. Neural networks played a key role in that success.

I hope you've seen how the core ideas behind neural networks are actually quite simple. And I hope you've had fun experimenting with neural networks too. Perhaps you've developed an interest to explore other kinds of machine learning and artificial intelligence.

If you've done any of these things, I've succeeded.


Appendix A:
A Gentle Introduction to Calculus

Imagine you're in a car cruising smoothly and relaxed at a constant 30 miles per hour. Imagine you then press the accelerator pedal. If you keep it pressed your speed increases to 35, 40, 50, and 60 miles per hour.

The speed of the car changes.

In this section we'll explore the idea of things changing - like the speed of a car - and how to work out that change mathematically. What do we mean, mathematically? We mean understanding how things are related to each other, so we can work out precisely how changes in one result in changes in another. Like car speed changing with the time on my watch. Or plant height changing with rain levels. Or the extension of a metal spring changing as we apply different amounts of pulling force.

This is called calculus by mathematicians. I hesitated about calling this section calculus because many people seem to think that is a hard scary subject to be avoided. That's a real shame, and the fault of bad teaching and terrible school textbooks.

By the end of this appendix, you'll see that working out how things change in a mathematically precise way - because that's all calculus really is - is not that difficult for many useful scenarios.

Even if you've done calculus or differentiation already, perhaps at school, it is worth going through this section because we will understand how calculus was invented back in history. The ideas and tools used by those pioneering mathematicians are really useful to have in your back pocket, and can be very helpful when trying to solve different kinds of problems in future.

If you enjoy a good historical punch-up, look up the drama between Leibniz and Newton, who both claimed to have invented calculus first!


A Flat Line
Let's first start with a very easy scenario to get ourselves settled and ready to go.

Imagine that car again, but cruising at a constant speed of 30 miles per hour. Not faster, not slower, just 30 miles per hour.

Here's a table showing the speed at various points in time, measured every half a minute.

Time (mins)    Speed (mph)
0.0            30
0.5            30
1.0            30
1.5            30
2.0            30
2.5            30
3.0            30

The following graph visualises this speed at those several points in time.


You can see that the speed doesn't change over time, that's why it is a straight horizontal line. It doesn't go up (faster) or down (slower), it just stays at 30 miles per hour.

The mathematical expression for the speed, which we'll call s, is

s = 30

Now, if someone asked how the speed changes with time, we'd say it didn't. The rate of change is zero. In other words, the speed doesn't depend on time. That dependency is zero.

We've just done calculus! We have, really!

Calculus is about establishing how things change as a result of other things changing. Here we are thinking about how speed changes with time.

There is a mathematical way of writing this.

∂s / ∂t = 0

What are those symbols? Think of the symbols as meaning "how speed changes when time changes", or "how does s depend on t".

So that expression is a mathematician's concise way of saying the speed doesn't change with time. Or put another way, the passing of time doesn't affect speed. The dependency of speed on time is zero. That's what the zero in the expression means. They are completely independent. Ok, ok, we get it!

In fact you can see this independence when you look again at the expression for the speed, s = 30. There is no mention of the time in there at all. That is, there is no symbol t hidden in that expression. So we don't need to do any fancy calculus to work out that ∂s / ∂t = 0, we can do it by simply looking at the expression. Mathematicians call this "by inspection".

Expressions like ∂s / ∂t, which explain a rate of change, are called derivatives. We don't need to know this for our purposes, but you may come across that word elsewhere.

Now let's see what happens if we press the accelerator. Exciting!

A Sloped Straight Line
Imagine that same car going at 30 miles per hour. We press the accelerator gently and the car speeds up. We keep it pressed, watch the speed dial on the dashboard, and note the speed every 30 seconds.

After 30 seconds the car is going at 35 miles per hour. After 1 minute the car is now going at 40 miles per hour. After 90 seconds it's going at 45 miles per hour, and after 2 minutes it's 50 miles per hour. The car keeps speeding up by 10 miles per hour every minute.

Here's the same information summarised in a table.

Time (mins)    Speed (mph)
0.0            30
0.5            35
1.0            40
1.5            45
2.0            50
2.5            55
3.0            60


Let's visualise this again.

You can see that the speed increases from 30 miles per hour all the way up to 60 miles per hour at a constant rate. You can see this is a constant rate because the increments in speed are the same every half a minute, and this leads to a straight line for speed.

What's the expression for the speed? Well, we must have speed 30 at time zero. And after that we add an extra 10 miles per hour every minute. So the expression is as follows.

speed = 30 + (10 * time)

Or using symbols,

s = 30 + 10t

You can see the constant 30 in there. And you can also see the (10 * time) which adds the 10 miles per hour for every minute. You'll quickly realise that the 10 is the gradient of that line we plotted. Remember the general form of straight lines is y = ax + b, where a is the slope, or gradient.

So what's the expression for how speed changes with time? Well, we've already been talking about it, speed increases by 10 mph every minute.

∂s / ∂t = 10

What this is saying is that there is indeed a dependency between speed and time. This is because ∂s / ∂t is not zero.

Remembering that the slope of a straight line y = ax + b is a, we can see by inspection that the slope of s = 30 + 10t will be 10.

Great work! We've covered a lot of the basics of calculus already, and it wasn't that hard at all.

Now let's hit that accelerator harder!

A Curved Line
Imagine I started the car from stationary and hit the accelerator hard and kept it pressed hard. Clearly the starting speed is zero because we're not moving initially.

Imagine we're pressing the accelerator so hard that the car doesn't increase its speed at a constant rate. Instead it increases its speed in a faster way. That means it is not adding 10 miles per hour every minute. Instead it is adding speed at a rate that itself goes up the longer we keep that accelerator pressed.

For this example, let's imagine the speed is measured every minute, as shown in this table.

Time (mins)    Speed (mph)
0              0
1              1
2              4
3              9
4              16
5              25
6              36
7              49
8              64

If you look closely you might have spotted that I've chosen to have the speed as the square of the time in minutes. That is, the speed at time 2 is 2² = 4, and at time 3 is 3² = 9, and at time 4 is 4² = 16, and so on.

The expression for this is easy to write too.

s = t²

Yes, I know this is a very contrived example of car speed, but it will illustrate really well how we might do calculus.

Let's visualise this so we can get a feel for how the speed changes with time.

You can see the speed zooming upwards at an ever faster rate. The graph isn't a straight line now. You can imagine that the speed would explode to very high numbers quite quickly. At 20 minutes the speed would be 400 miles per hour. And at 100 minutes the speed would be 10,000 miles per hour!

The interesting question is - what's the rate of change for the speed with respect to time? That is, how does the speed change with time?


This isn't the same as asking what the actual speed is at a particular time. We already know that, because we have the expression s = t².

What we're asking is this - at any point in time, what is the rate of change of speed? What does this even mean in this example where the graph is curved?

If we think back to our previous two examples, we saw that the rate of change was the slope of the graph of speed against time. When the car was cruising at a constant 30, the speed wasn't changing, so its rate of change was zero. When the car was getting faster steadily, its rate of change was 10 miles per hour every minute. And that 10 mph every minute was true for any point in time. At time 2 minutes the rate of change was 10 mph/min. And it was true at 4 minutes, and would be true at 100 minutes too.

Can we apply this same thinking to this curved graph? Yes we can, but let's go extra slowly here.

Calculus By Hand
Let's look more closely at what is happening at time 3 minutes.

At 3 minutes, the speed is 9 miles per hour. We know it will be faster after 3 minutes. Let's compare that with what is happening at 6 minutes. At 6 minutes, the speed is 36 miles per hour. We also know that the speed will be faster after 6 minutes.

But we also know that a moment after 6 minutes the speed increase will be greater than at an equivalent moment after 3 minutes. There is a real difference between what's happening at time 3 minutes and time 6 minutes.

Let's visualise this as follows.

You can see that the slope at time 6 minutes is steeper than at time 3 minutes. These slopes are the rate of change we want. This is an important realisation, so let's say it again. The rate of change of a curve at any point is the slope of the curve at that point.

But how do we measure the slope of a line that is curved? Straight lines were easy, but curvy lines? We could try to estimate the slope by drawing a straight line, called a tangent, which just touches that curved line in a way that tries to be at the same gradient as the curve just at that point. This is in fact how people used to do it before other ways were invented.

Let's try this rough-and-ready method, just so we have some sympathy with that approach. The following diagram shows the speed graph with a tangent touching the speed curve at time 6 minutes.


To work out the slope, or gradient, we know from school maths that we need to divide the height of the incline by the extent. In the diagram this height (speed) is shown as Δs, and the extent (time) is shown as Δt. That symbol Δ, called delta, just means a small change. So Δt is a small change in t.

The slope is Δs / Δt. We could choose any sized triangle for the incline, and use a ruler to measure the height and extent. With my measurements I just happen to have a triangle with Δs measured as 9.6, and Δt as 0.8. That gives the slope as follows:

slope = Δs / Δt = 9.6 / 0.8 = 12.0

We have a key result! The rate of change of speed at time 6 minutes is 12.0 mph per min.

You can see that relying on a ruler, and even trying to place a tangent by hand, isn't going to be very accurate. So let's get a tiny bit more sophisticated.

Calculus Not By Hand
Look at the following graph, which has a new line marked on it. It isn't the tangent, because it doesn't touch the curve at only a single point. But it does seem to be centred around time 3 minutes in some way.

In fact there is a connection to time 3 minutes. What we've done is chosen a time above and below this point of interest at t = 3. Here, we've selected points 2 minutes above and below t = 3 minutes. That is, t = 1 and t = 5 minutes.

Using our mathematical notation, we say we have a Δx of 2 minutes. And we have chosen points x - Δx and x + Δx. Remember that symbol Δ just means a small change, so Δx is a small change in x.

Why have we done this? It'll become clear very soon - hang in there just a bit.


If we look at the speeds at times x - Δx and x + Δx, and draw a line between those two points, we have something that very roughly has the same slope as a tangent at the middle point x. Have a look again at the diagram above to see that straight line. Sure, it's not going to have exactly the same slope as a true tangent at x, but we'll fix this.

Let's work out the gradient of this line. We use the same approach as before, where the gradient is the height of the incline divided by the extent. The following diagram makes clearer what the height and extent are here.

The height is the difference between the two speeds at x - Δx and x + Δx, that is, at 1 and 5 minutes. We know the speeds are 1² = 1 and 5² = 25 mph at these points, so the difference is 24. The extent is the very simple distance between x - Δx and x + Δx, that is, between 1 and 5, which is 4. So we have:

gradient = 24 / 4 = 6

The gradient of the line, which approximates the tangent at t = 3 minutes, is 6 mph per min.

Let's pause and have a think about what we've done. We first tried to work out the slope of a curved line by hand drawing a tangent. This approach will never be accurate, and we can't do it many many times because, being human, we'll get tired, bored, and make mistakes. The next approach doesn't need us to hand draw a tangent; instead we follow a recipe to create a different line which seems to have approximately the right slope. This second approach can be automated by a computer and done many times and very quickly, as no human effort is needed.

That's good, but not good enough yet!

That second approach is only an approximation. How can we improve it so it's not an approximation? That's our aim after all, to be able to work out how things change, the gradient, in a mathematically precise way.

This is where the magic happens! We'll see one of the neat tools that mathematicians have developed and had a bit too much fun with!

What would happen if we made the extent smaller? Another way of saying that is, what would happen if we made the Δx smaller? The following diagram illustrates several approximations or slope lines resulting from a decreasing Δx.


We've drawn the lines for Δx = 2.0, Δx = 1.0, Δx = 0.5 and Δx = 0.1. You can see that the lines are getting closer to the point of interest at 3 minutes. You can imagine that as we keep making Δx smaller and smaller, the line gets closer and closer to a true tangent at 3 minutes.

As Δx becomes infinitely small, the line becomes infinitely close to the true tangent. That's pretty cool!

This idea of approximating a solution and improving it by making the deviations smaller and smaller is very powerful. It allows mathematicians to solve problems which are hard to attack directly. It's a bit like creeping up to a solution from the side, instead of running at it head on!

Calculus without Plotting Graphs

We said earlier, calculus is about understanding how things change in a mathematically precise way. Let's see if we can do that by applying this idea of ever smaller Δx to the mathematical expressions that define these things - things like our car speed curves.

To recap, the speed is a function of the time, which we know to be s = t². We want to know how the speed changes as a function of time. We've seen that is the slope of s when it is plotted against t.

This rate of change ∂s / ∂t is the height divided by the extent of our constructed lines, but where the Δx gets infinitely small.

What is the height? It is (t + Δx)² - (t - Δx)², as we saw before. This is just s = t² where t is a bit below and a bit above the point of interest. That amount of bit is Δx.

What is the extent? As we saw before, it is simply the distance between (t + Δx) and (t - Δx), which is 2Δx.

We're almost there,

∂s / ∂t = ( (t + Δx)² - (t - Δx)² ) / 2Δx

Let's expand and simplify that expression,

∂s / ∂t = ( t² + 2tΔx + Δx² - (t² - 2tΔx + Δx²) ) / 2Δx = 4tΔx / 2Δx = 2t

We've actually been very lucky here because the algebra simplified itself very neatly.

So we've done it! The mathematically precise rate of change is ∂s / ∂t = 2t. That means for any time t, we know the rate of change of speed is ∂s / ∂t = 2t.

At t = 3 minutes we have ∂s / ∂t = 2t = 6. We actually confirmed that before, using the approximate method. For t = 6 minutes, ∂s / ∂t = 2t = 12, which nicely matches what we found before too.

What about t = 100 minutes? ∂s / ∂t = 2t = 200 mph per minute. That means after 100 minutes, the car is speeding up at a rate of 200 mph per minute.

Let's take a moment to ponder the magnitude and coolness of what we just did. We have a mathematical expression that allows us to precisely know the rate of change of the car speed at any time. And following our earlier discussion, we can see that changes in s do indeed depend on time.

We were lucky that the algebra simplified nicely, but the simple s = t² didn't give us an opportunity to try reducing the Δx in an intentional way. So let's try another example, where the speed of the car is only just a bit more complicated,

s = t² + 2t

What is the height now? It is the difference between s calculated at t + Δx and s calculated at t - Δx. That is, the height is ( (t + Δx)² + 2(t + Δx) ) - ( (t - Δx)² + 2(t - Δx) ).

What about the extent? It is simply the distance between (t + Δx) and (t - Δx), which is still 2Δx.

Let's expand and simplify that expression,

∂s / ∂t = ( (t + Δx)² + 2(t + Δx) - (t - Δx)² - 2(t - Δx) ) / 2Δx = ( 4tΔx + 4Δx ) / 2Δx = 2t + 2

That's a great result! Sadly the algebra again simplified a little too easily. It wasn't wasted effort, because there is a pattern emerging here which we'll come back to.

Let's try another example, which isn't that much more complicated. Let's set the speed of the car to be the cube of the time.

s = t³

Let's expand and simplify that expression,

∂s / ∂t = ( (t + Δx)³ - (t - Δx)³ ) / 2Δx = ( 6t²Δx + 2Δx³ ) / 2Δx = 3t² + Δx²

Now this is much more interesting! We have a result which contains a Δx, whereas before they were all cancelled out.

Well, remember that the gradient is only correct as the Δx gets smaller and smaller, infinitely small.

This is the cool bit! What happens to the Δx in the expression ∂s / ∂t = 3t² + Δx² as Δx gets smaller and smaller? It disappears! If that sounds surprising, think of a very small value for Δx. If you try, you can think of an even smaller one. And an even smaller one... and you could keep going forever, getting ever closer to zero. So let's just get straight to zero and avoid all that hassle.

That gives us the mathematically precise answer we were looking for:

∂s / ∂t = 3t²

That's a fantastic result, and this time we used a powerful mathematical tool to do calculus, and it wasn't that hard at all.

Patterns
Asfunasitistoworkoutderivativesusingdeltaslike
x
andseeingwhathappenswhenwe
makethemsmallerandsmaller,wecanoftendoitwithoutdoingallthatwork.

Seeifyoucanseeanypatterninthederivativesweveworkedoutsofar:


206

You can see that the derivative of a function of t is the same function, but with each power of t reduced by one. So t⁴ becomes t³, and t⁷ would become t⁶, and so on. That's really easy! And if you remember that t is just t¹, then in the derivative it becomes t⁰, which is 1.

Constant numbers on their own, like 3 or 4 or 5, simply disappear. Constant variables on their own, which we might call a, b or c, also disappear, because they too have no rate of change. That's why they're called constants.

But hang on, t² became 2t, not just t. And t³ became 3t², not just t². Well, there is an extra step where the power is used as a multiplier before it is reduced. So the 5 in 2t⁵ is used as an additional multiplier before the power is reduced: 5 * 2t⁴ = 10t⁴.

The following summarises this power rule for doing calculus:

for y = axⁿ, the derivative is ∂y/∂x = naxⁿ⁻¹

Let's try it on some more examples, just to practice this new technique.
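One way to practice is to work out each answer by hand with the power rule, and then let sympy confirm it. A small sketch, assuming sympy once more:

import sympy

t = sympy.symbols('t')

# apply the power rule by hand first, then check against sympy
for s in [t**4, 2*t**5, t**3 + 2*t, 3*t**2 + 4*t + 5]:
    print(s, '->', sympy.diff(s, t))

# for example, 2*t**5 -> 10*t**4, just as the rule predicts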

So this rule allows us to do quite a lot of calculus, and for many purposes it's all the calculus we need. Yes, it only works for polynomials, that is, expressions made of variables with powers like y = ax³ + bx² + cx + d, and not with things like sin(x) or cos(x). That's not a major flaw, because there are a huge number of uses for doing calculus with this power rule.

However, for neural networks we do need one extra tool, which we'll look at next.

Functions of Functions
Imagine a function

f = y²

where y is itself

y = x³ + x

We can write this as f = (x³ + x)² if we wanted to.


How does f change with y? That is, what is ∂f/∂y? This is easy, as we just apply the power rule we just developed, multiplying and reducing the power, so ∂f/∂y = 2y.

What about a more interesting question: how does f change with x? Well, we could expand out the expression f = (x³ + x)² and apply this same approach, but we can't naively apply the power rule to (x³ + x)² so that it simply becomes 2(x³ + x).

If we worked many of these out the long hard way, using the diminishing deltas approach like before, we'd stumble upon another set of patterns. Let's jump straight to the answer.

The pattern is this:

∂f/∂x = ∂f/∂y · ∂y/∂x

This is a very powerful result, and is called the chain rule.

You can see that it allows us to work out derivatives in layers, like onion rings, unpacking each layer of complexity. To work out ∂f/∂x we might find it easier to work out ∂f/∂y, and then also easier to work out ∂y/∂x. If these are easier, we can then do calculus on expressions that otherwise look quite impossible. The chain rule allows us to break a problem into smaller easier ones.

Let's look at that example again and apply this chain rule:

∂f/∂x = ∂f/∂y · ∂y/∂x

We now work out the easier bits. The first bit is ∂f/∂y = 2y. The second bit is ∂y/∂x = 3x² + 1. So recombining these bits using the chain rule we get

∂f/∂x = 2y · (3x² + 1)

We know that y = x³ + x, so we can have an expression with only x:

∂f/∂x = 2(x³ + x)(3x² + 1)

Magic!
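You can confirm the chain rule gave the right answer by letting sympy differentiate f directly, with y fully expanded in terms of x. Another illustrative sketch:

import sympy

x = sympy.symbols('x')
y = x**3 + x

f = y**2    # the same f = (x**3 + x)**2

direct = sympy.diff(f, x)       # differentiate the whole thing in one go
chain = 2*y * (3*x**2 + 1)      # our answer from the chain rule

# the difference simplifies to 0, so the two answers agree
print(sympy.simplify(direct - chain))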

You may challenge this and ask why we didn't just expand out f in terms of x first, and then apply simple power rule calculus to the resulting polynomial. We could have done that, but we wouldn't have illustrated the chain rule, which allows us to crack much harder problems.

Let's look at just one final example, because it shows us how to handle variables which are independent of other variables.

If we have a function

f = 2xy + 3x²z + 4z

where x, y and z are independent of each other. What do we mean by independent? We mean that x, y and z can be any value and don't care what the other variables are; they aren't affected by changes in the other variables. This wasn't the case in the previous example, where y was x³ + x, so y was dependent on x.

What is ∂f/∂x? Let's look at each part of that long expression. The first bit is 2xy, so the derivative is 2y. Why is this so simple? It's simple because y is not dependent on x. What we're asking when we say ∂f/∂x is how does f change when x changes. If y doesn't depend on x, we can treat it like a constant. That y might as well be another number like 2 or 3 or 10.


Let's carry on. The next bit is 3x²z. We can apply the power reduction rule to get 2 * 3xz, or 6xz. We treat z as just a boring constant number like 2 or 4 or maybe 100, because x and z are independent of each other. A change in x doesn't affect z.

The final bit, 4z, has no x in it at all, so it vanishes completely, because we treat it like a plain constant number like 2 or 4.

The final answer is

∂f/∂x = 2y + 6xz

The important thing in this last example is having the confidence to ignore variables that you know are independent. It makes doing calculus on quite complex expressions drastically simpler, and it is an insight we'll need lots when looking at neural networks.
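sympy treats symbols as independent unless you tell it otherwise, so it makes a handy check for this last example too. One more sketch:

import sympy

x, y, z = sympy.symbols('x y z')   # three independent variables

f = 2*x*y + 3*x**2*z + 4*z

# prints 2*y + 6*x*z, and the 4*z term has vanished, as expected
print(sympy.diff(f, x))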

You can do Calculus!
If you got this far, well done!

You have a genuine insight into what calculus really is, and how it was invented using approximations that get better and better. You can always try these methods on other tough problems that resist normal ways of solving them.

The two techniques we learned, reducing powers and the chain rule, allow us to do quite a lot of calculus, including understanding how neural networks really work, and why.

Enjoy your new powers!


Appendix B:
Do It with a Raspberry Pi

In this section we will aim to get IPython set up on a Raspberry Pi.

There are several good reasons for doing this:

● Raspberry Pis are fairly inexpensive and accessible to many more people than expensive laptops.

● Raspberry Pis are very open. They run the free and open source Linux operating system, together with lots of free and open source software, including Python. Open source is important because it is important to understand how things work, to be able to share your work, and to enable others to build on your work. Education should be about learning how things work, and making your own, and not be about learning to buy closed proprietary software.

● For these and other reasons, they are wildly popular in schools and at home for children who are learning about computing, whether it is software or building hardware projects.

● Raspberry Pis are not as powerful as expensive computers and laptops. So it is an interesting and worthy challenge to prove that you can still implement a useful neural network with Python on a Raspberry Pi.

I will use a Raspberry Pi Zero because it is even cheaper and smaller than the normal Raspberry Pis, and the challenge to get a neural network running is even more worthy! It costs about 4 UK pounds, or $5 US dollars. That wasn't a typo!

Here's mine, shown next to a 2 penny coin. It's tiny!



Installing IPython
We'll assume you have a Raspberry Pi powered up, with a keyboard, mouse, display and access to the internet all working.

There are several options for an operating system, but we'll stick with the most popular, which is the officially supported Raspbian, a version of the popular Debian Linux distribution designed to work well with Raspberry Pis. Your Raspberry Pi probably came with it already installed. If not, install it using the instructions at that link. You can even buy an SD memory card with it already installed, if you're not confident about installing operating systems.

This is the desktop you should see when you start up your Raspberry Pi.



You can see the menu button clearly at the top left, and some shortcuts along the top too.

We're going to install IPython so we can work with the more friendly notebooks through a web browser, and not have to worry about source code files and command lines.

To get IPython we do need to work with the command line, but we only need to do this once, and the recipe is really simple and easy.

Open the Terminal application, which is the icon shortcut at the top which looks like a black monitor. If you hover over it, it'll tell you it is the Terminal. When you run it, you'll be presented with a black box, into which you type commands, looking like the following.



Your Raspberry Pi is very good because it won't allow normal users to issue commands that make deep changes. You have to assume special privileges. Type the following into the terminal:

sudo su

You should see the prompt ending with a # hash character. It was previously a $ dollar sign. That shows you now have special privileges, and you should be a little careful what you type.

The following commands refresh your Raspberry Pi's list of current software, and then update the ones you've got installed, pulling in any additional software if it's needed.

apt-get update
apt-get upgrade

Unless you already refreshed your software recently, there will likely be software that needs to be updated. You'll see quite a lot of text fly by. You can safely ignore it. You may be prompted to confirm the update by pressing y.

Now that our Raspberry Pi is all fresh and up to date, issue the command to get IPython. Note that, at the time of writing, the Raspbian software packages don't contain a sufficiently recent version of IPython to work with the notebooks we created earlier and put on github for anyone to view and download. If they did, we would simply issue a simple apt-get install ipython3 ipython3-notebook or something like that.

If you don't want to run those notebooks from github, you can happily use the slightly older IPython and notebook versions that come from Raspberry Pi's software repository.

If we do want to run more recent IPython and notebook software, we need to use some pip commands in addition to the apt-get ones, to get more recent software from the Python Package Index. The difference is that the software is managed by Python (pip), not by your operating system's software manager (apt). The following commands should get everything you need.

apt-get install python3-matplotlib
apt-get install python3-scipy

pip3 install ipython
pip3 install jupyter
pip3 install matplotlib
pip3 install scipy
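If you want a quick check that the installs worked before going any further, the following one-liner is a suggestion of my own, not part of the official instructions. It should print the version numbers without any errors:

python3 -c "import IPython, matplotlib, scipy; print(IPython.__version__, matplotlib.__version__, scipy.__version__)"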

After a bit of text flying by, the job will be done. The speed will depend on your particular Raspberry Pi model, and your internet connection. The following shows my screen when I did this.



The Raspberry Pi normally uses a memory card, called an SD card, just like the ones you might use in your digital camera. They don't have as much space as a normal computer. Issue the following command to clean up the software packages that were downloaded in order to update your Raspberry Pi.

apt-get clean

That's it, job done. Restart your Raspberry Pi in case there was a particularly deep change, such as a change to the very core of your Raspberry Pi, like a kernel update. You can restart your Raspberry Pi by selecting the Shutdown option from the main menu at the top left, and then choosing Reboot, as shown next.

After your Raspberry Pi has started up again, start IPython by issuing the following command from the Terminal:

jupyter notebook

This will automatically launch a web browser with the usual IPython main page, from where you can create new IPython notebooks. Jupyter is the new software for running notebooks. Previously you would have used the ipython3 notebook command, which will continue to work for a transition period. The following shows the main IPython starting page.



That's great! So we've got IPython up and running on a Raspberry Pi.

You could proceed as normal and create your own IPython notebooks, but we'll demonstrate that the code we developed in this guide does run. We'll get the notebooks and also the MNIST dataset of handwritten numbers from github. In a new browser tab go to the link:

https://github.com/makeyourownneuralnetwork/makeyourownneuralnetwork

You'll see the github project page, as shown next. Get the files by clicking Download ZIP at the top right.



The browser will tell you when the download has finished. Open up a new Terminal and issue the following commands to unpack the files, and then delete the zip package to clear space.

unzip Downloads/makeyourownneuralnetwork-master.zip
rm -f Downloads/makeyourownneuralnetwork-master.zip

The files will be unpacked into a directory called makeyourownneuralnetwork-master. Feel free to rename it to a shorter name if you like, but it isn't necessary.

The github site only contains the smaller versions of the MNIST data, because the site won't allow very large files to be hosted there. To get the full set, issue the following commands in that same terminal to navigate to the mnist_dataset directory and then get the full training and test datasets in CSV format.

cd makeyourownneuralnetwork-master/mnist_dataset
wget -c http://pjreddie.com/media/files/mnist_train.csv
wget -c http://pjreddie.com/media/files/mnist_test.csv

The downloading may take some time depending on your internet connection, and the specific model of your Raspberry Pi.


You've now got all the IPython notebooks and MNIST data you need. Close the terminal, but not the other one that launched IPython.

Go back to the web browser with the IPython starting page, and you'll now see the new folder makeyourownneuralnetwork-master showing in the list. Click on it to go inside. You should be able to open any of the notebooks just as you would on any other computer. The following shows the notebooks in that folder.

Making Sure Things Work
Before we train and test a neural network, let's first check that the various bits, like reading files and displaying images, are working. Let's open the notebook called part3_mnist_data_set_with_rotations.ipynb which does these tasks. You should see the notebook open and ready to run as follows.



From the Cell menu select Run All to run all the instructions in the notebook. After a while, and it will take longer than it would on a modern laptop, you should get some images of rotated numbers.



That shows several things worked, including loading the data from a file, importing the Python extension modules for working with arrays and images, and plotting graphics.

Let's now Close and Halt that notebook from the File menu. You should close notebooks this way, rather than simply closing the browser tab.

Training And Testing A Neural Network
Now let's try training a neural network. Open the notebook called part2_neural_network_mnist_data. That's the version of our program that is fairly basic and doesn't do anything fancy like rotating images. Because our Raspberry Pi is much slower than a typical laptop, we'll turn down some of the parameters to reduce the amount of calculation needed, so that we can be sure the code works without wasting hours only to find that it doesn't.

I've reduced the number of hidden nodes to 10, and the number of epochs to 1. I've still used the full MNIST training and test datasets, not the smaller subsets we created earlier. Set it running with Run All from the Cell menu. And then we wait ...
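For reference, the only edits are to the configuration values near the top of the notebook. Assuming the variable names from the Python code we developed earlier in this guide, they end up looking something like this:

# turned down so training completes quickly on the slower Raspberry Pi
hidden_nodes = 10

# a single pass through the training data instead of several
epochs = 1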

Normally this would take about one minute on my laptop, but this completed in about 25 minutes. That's not too slow at all, considering this Raspberry Pi Zero costs 400 times less than my laptop. I was expecting it to take all night.



Raspberry Pi Success!
We've just proven that even with a 4 pound or $5 Raspberry Pi Zero, you can still work fully with IPython notebooks and create code to train and test neural networks. It just runs a little slower!

