
Artificial Intelligence and Machine Learning in Industry
Perspectives from Leading Practitioners

David Beyer


Artificial Intelligence and Machine Learning in Industry
by David Beyer
Copyright © 2017 O’Reilly Media Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA
95472.
O’Reilly books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles (http://oreilly.com/safari). For more
information, contact our corporate/institutional sales department: 800-998-9938 or
corporate@oreilly.com.

Editor: Shannon Cutt
Production Editor: Kristen Brown
Proofreader: Kristen Brown
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
March 2017: First Edition

Revision History for the First Edition


2017-03-20: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Artificial Intelli‐
gence and Machine Learning in Industry, the cover image, and related trade dress are
trademarks of O’Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the
information and instructions contained in this work are accurate, the publisher and
the author disclaim all responsibility for errors or omissions, including without limi‐
tation responsibility for damages resulting from the use of or reliance on this work.
Use of the information and instructions contained in this work is at your own risk. If
any code samples or other technology this work contains or describes is subject to
open source licenses or the intellectual property rights of others, it is your responsi‐
bility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-95933-6
[LSI]
Table of Contents

Artificial Intelligence and Machine Learning in Industry
Michael Osborne: Automation, Simplification, and Meeting the Technology Halfway
Arjun Singh: Helping Students Learn with Machine Learning
Jake Heller: The Future of Legal Practice
Aaron Kimball: Intelligent Microbes
Bryce Meredig: The Periodic Table as Training Data
Erik Andrejko: Transforming Agriculture with Machine Learning

Artificial Intelligence and Machine Learning in Industry

Just as the dust started to settle in the aftermath of Google’s stunning victory for artificial intelligence in the game of Go, researchers at
Carnegie Mellon University kicked it up once more by defeating
humans in poker. In so challenging the human metier, these and
other breakthroughs in speech, reasoning, and vision startle as
much as they impress. Taken together, they suggest a new normal of
rapid and sustained progress.
In time, the excitement of research segues to its application, setting
once-elegant abstractions against commercial realities. To date, a
growing number of businesses centered on AI are in the process of
stress-testing society, challenging its basic assumptions about labor
and the economy. In a recent report, McKinsey projects that about
half of today’s work could be automated by 2055. While similar
studies may disagree on specifics, precision in this case matters less
than the accuracy of their consensus: automation will fundamentally
reshape work and, by extension, industry.
The adoption of AI is, in a word, uneven. The choppy history of
other major technology shifts, from the steam engine to IT, would
suggest as much. The broader sweep of AI and its economic, social,
and political influence will face growing scrutiny from scholars and
policymakers alike. This report hopes to add to this discussion
through interviews with the entrepreneurs and executives on the
front lines of AI, machine learning, and industry.
To begin, Michael Osborne situates the report in its historical and
economic context, drawing on his research at Oxford. Arjun Singh follows with a discussion of AI in education; Jake Heller takes us on
a tour of machine learning and the law; Aaron Kimball illuminates
the otherwise hidden world of microbes and their commercial use;
Bryce Meredig describes the application of machine learning to
materials and the periodic table; and finally, Erik Andrejko discusses
his work at Climate Corporation and the role machine learning
plays in farming and agriculture.

Michael Osborne: Automation, Simplification, and Meeting the Technology Halfway
Michael Osborne is the Dyson Associate Professor in Machine Learn‐
ing at the University of Oxford. He is a faculty member of the Oxford-
Man Institute of Quantitative Finance and the codirector of the Oxford
Martin Programme on Technology and Employment.
Key Takeaways

1. The current wave of automation is of a piece with previous technological revolutions in the social and political debate it garners. Yet its potential impact on labor markets—and by extension, society—breaks with precedent.
2. The burden of automation will initially fall on the shoulders of
the least skilled. The accelerating pace of research in machine
learning and robotics, however, suggests that automation’s reach
will usurp functions of ever higher skill and complexity.
3. Jobs more immune to automation possess some mix of skills
that require social intelligence, creativity, and manual dexterity.
4. The varying degree to which firms can redesign tasks, jobs, and
processes to take advantage of automation is an important, if
often overlooked, driver of automation trends.

Let’s begin with your background.


I’m an engineer by background with a focus in machine learning.
My academic career to date has focused largely on designing algo‐
rithms that, in one form or another, automate human work. At
heart, they wrest decision-making away from people. Such algo‐
rithms can operate well beyond human capacity, examining, for
example, billions of data points apiece in search of an anomalous
signal.

This technical background, in time, offered a segue to my current
work, exploring the societal consequences of automating the pre‐
serve of human activity. Around 2013, I connected with the econo‐
mist Carl Frey, which led in turn to a joint paper and a renewed
focus—using machine learning itself as a tool to understand
machine learning as a crucial driver in industry and society at large.
Can you provide some context on the history of labor and its rela‐
tionship with automation?
Our assertion that work is increasingly vulnerable to automation
draws fierce pushback: that is, the historical antecedents to our
claim have largely been proven false. Any labor decline from break‐
throughs in automation has been consistently offset with a range of
new employment. So what’s different this time?
We believe the profound wave of machine learning currently sweep‐
ing through society will replace cognitive work, much as the Indus‐
trial Revolution of the 18th and 19th centuries replaced its manual
analog. As machines grow increasingly adept at automating cogni‐
tive labor, the human metier correspondingly declines. The present
shift need not reprise the historical pattern, in which humans re-
distributed to other work. As others have noted, a better analogy
invokes not humans themselves, but their equestrian companion.
Imagine, if you will, that you are a horse in the early 1900s. Despite
breathtaking revolutions in technology over the previous hundred
years (e.g., the telegraph overtaking the Pony Express, and railroads
cannibalizing horse-powered travel), you might be feeling pretty
happy about your prospects. In fact, the US horse population con‐
tinued to increase approximately sixfold between 1840 and 1900.
Your confidence in future job opportunities might begin to seem
like an idée fixe: equine labor is in some fundamental way resistant
to automation.
Such confidence would soon crumble under its own weight. By
1950, the US equine population declined to 10% of its 1900 level.
Society had crossed a Rubicon of sorts, beyond which machines
could outdo horses in every relevant dimension.
Our work examines how this scenario might unfold with human
instead of equine labor. Humans, for example, may do better at very
high-level emotional interactions. Yet it seems unlikely that such a
skill (or others like it) will find sufficient demand to maintain full employment. This isn’t a conclusive prediction, but rather a plausi‐
ble outcome worthy of our attention.
Assuming machines really do crowd out human labor, what
aspects of work are at risk?
This is the vital question. Absent the worst-case scenario, even mod‐
erate perturbations in the labor market can lead to major upheaval
in society. Our work suggests the automation burden rests most
heavily upon the shoulders of the least skilled—a tragic outcome
considering the difficulty of retraining.
We contend that new jobs will emerge from the dust of automation, but they might be a shadow of the jobs they replace. 21st-century work,
by and large, may not match the skill mix and volume for a healthy
replacement rate. In the absence of decisive education reform, a
growing list of occupations (e.g., truck drivers, auditors, clerks in
various retail situations—to name just a few) will fail to keep up.
The workforce dislocation might permanently disenfranchise a
meaningful swath of society, setting them adrift in an economy
without demand for their time and skill. This stands as one of the
key points we hope to convey to policy makers: these trends in auto‐
mation pose a real risk to already widening wealth inequality.
In the coming decades, which are the “safe” jobs more immune to
automation?
We found three loose groupings of skills that offer some degree of
protection from automation. The first of these is creativity. The abil‐
ity to generate novel ideas still remains generally out of reach for
machines. The second is social intelligence. While algorithms can
interact with humans via chatbots, for example, they still fall short at
higher-level social functions (e.g., negotiation or persuasion). The
final of these three guardrails, so to speak, centers on manual dex‐
terity—unstructured physical interaction in the world. This is fairly
difficult to automate even today. The upshot of our work suggests that jobs without at least one of the above bottlenecks face material risk of automation.

How far along is the AI research community in tackling these “bottlenecks”? And which of the three do you think will be the
first to succumb to machines?
If I were to rely merely on technical progress, I think we’ll see
advances in manipulation first, social intelligence second, and crea‐
tivity third. Advances in robotics continue to enable improved
object manipulation in obstructed environments. As it relates to
social intelligence, we’ve seen the reemergence of chatbots and algo‐
rithms with meaningful marks on the Turing Test. Finally, creativity
itself has found expression in machines over the past couple of
years, such as the DeepDream algorithms that can “paint” in a num‐
ber of artistic styles.
While research continues to shatter our expectations of the possible,
the technologies with the most immediate impact trace back to older
work. In terms of jobs, cutting-edge research matters less than the
evolving nature of work itself: what matters more is the means by
which jobs, and by extension industry, can be remodeled to exploit
state-of-the-art machines. It’s less about new technology, and more a
question of redesigning jobs to suit the technology already at hand.
Can you elaborate?
Consider the typing pool of the 1950s, in which groups of workers
were arrayed to take dictation and other miscellaneous tasks. These
occupations now seem but a distant memory. You might attribute
the demise of the typing pool to the invention of the word processor,
but word processors alone were insufficient as a drop-in replace‐
ment.
Firms eventually realized that while typing pools covered a wide
range of tasks, their cost outweighed the benefit of the alternative
(that is, whittling down the task of handling documents to a degree
that employees could manage themselves). This key rearchitecture
made the typing pool obsolete.
Which industries and which categories of labor will experience
the biggest impact from automation?
In a paper published in 2013, we described a novel approach to esti‐
mating the probability of computerisation for 702 occupations using
a Gaussian process classifier. Our work drew heavily from O*NET
data from the Department of Labor and involved some degree of hand labeling. In the final analysis, we found that 47% of the US
labor market faces the risk of automation.
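The aggregation behind a headline figure like this is simple to sketch: score each occupation, then weight by employment. The occupations, employment counts, and probabilities below are invented for illustration, and the 0.7 "high-risk" cutoff is our assumption, not the paper's data.

```python
# Toy version of the aggregation step: given per-occupation
# automation probabilities, estimate the share of employment at
# high risk. All figures are illustrative, not the paper's data.
occupations = [
    # (occupation, employment in thousands, P(computerisable))
    ("telemarketer",        200, 0.99),
    ("heavy truck driver", 1800, 0.79),
    ("registered nurse",   2900, 0.01),
    ("software developer", 1300, 0.04),
]

HIGH_RISK = 0.70  # assumed threshold separating "high risk" occupations

total_employment = sum(emp for _, emp, _ in occupations)
high_risk_employment = sum(emp for _, emp, p in occupations if p >= HIGH_RISK)
share_at_risk = high_risk_employment / total_employment

print(f"Employment at high risk: {share_at_risk:.0%}")
```

With real O*NET-derived probabilities across 702 occupations, the same weighted sum yields the study's aggregate estimate.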
Over a twenty-year horizon, we found the accommodation and food
services industries to be particularly high-risk; 87% of their current
employment faces the real threat of automation. As an example, res‐
taurants like Chili’s are replacing some of the tasks performed by
their waitstaff with tablets. At the same time, travel booking websites
like Airbnb portend profound shifts in the accommodations space.
In the UK alone, we’ve seen employment for travel agents drop by
50% in the last decade or so.
In the case of the transportation and warehousing industry, 75% of
employment is at risk—from forklift operators to hospital porters.
The transportation of goods already commonly occurs in highly
structured environments. For example, Amazon recently acquired
Kiva Systems, which astutely recognized that their robots don’t need
to fully solve the SLAM (simultaneous localization and mapping)
problem.
Instead, the robots can make effective use of barcodes strategically
placed on the warehouse floor for guidance, leaving humans the
more complex task of removing items from shelves. The robot, for
its part, simply moves the entire shelving unit as required. These
robots reduce, but don’t fully replace, human labor. But like a thou‐
sand cuts, such reductions add up over time.
If the biggest impact will be for jobs that are more amenable to
being restructured, how do we gauge the “restructurability” of a
given task?
Toward this end, we’re exploring automation around primary
healthcare delivery in the UK. Automation in healthcare is both
urgent and complicated: rising costs combined with the specter of
budget cuts in the UK demand some degree of automation. Our
work includes ethnographic surveys and other primary research. In
interviewing front-line staff, we seek to understand their views and
interactions with technology, as well as opportunities for efficiency.
Assessing the restructurability of a job requires a narrow aperture
and a nuanced understanding of the given occupation. That said,
even if a task can be automated, following through requires navigat‐
ing a web of stakeholders and norms. In the case of healthcare, for example, GP and patient associations chafe at certain kinds of auto‐
mation. The barriers, in other words, are many.
What are the most exciting directions you expect your research
and that of your peers to take in the next five or so years?
Within machine learning, most recent advances have occurred
within supervised learning tasks, requiring algorithms to be explic‐
itly taught (structured) tasks. I expect that in the next five or so
years, we’ll begin to make more progress on the more challenging
problems within unsupervised learning, in which an agent must
infer properties of the world from raw observations of it; and in
active and reinforcement learning, in which an agent is able to
request new data so as to optimally inform itself about the world.
These latter modes are much more closely akin to how humans
learn, and offer the most exciting prospects for artificial learning
agents.

Arjun Singh: Helping Students Learn with Machine Learning
Arjun is the cofounder and CEO of Gradescope, which he built while a
teaching assistant at UC Berkeley (BS EECS ’06, PhD CS ’16). He
worked under Pieter Abbeel on robotics and computer vision research,
including autonomous helicopter control, laundry-folding robots, and
robotic perception. As a six-time TA, Arjun enjoyed working on educa‐
tional technology to improve the experience of his students. In 2012, he
worked to integrate Berkeley’s homegrown MOOC platform into edX,
and was also the head TA for CS188x, one of Berkeley’s first MOOCs.
Key Takeaways

1. With educational material increasingly digitized, the application of machine learning can benefit students and teachers alike, whether through intelligent and automated grading, personalized learning, or other promising approaches.
2. Gradescope, one such startup in the quickly growing edtech
space, uses recent advancements in computer vision and deep
learning (e.g., LSTMs) to help teachers grade assignments more
efficiently.

3. Gradescope and its peers are building toward a world in which
students receive instant feedback and adaptive educational con‐
tent designed around their skill and understanding.

Let’s start with your background.


I’m originally from the Las Vegas area. I completed my undergrad in
EE & CS at Berkeley, as well as my PhD, where I focused on robotics
research under Pieter Abbeel.
How did Gradescope come about? What was the motivation?
I was a teaching assistant at Berkeley for a graduate course in artifi‐
cial intelligence a number of times. Each time, I faced renewed frus‐
tration with the grading process. It required a grading party of ten to
fifteen graduate students all huddled around a table for ten or more
hours—exhausting for everyone involved. To save time, we tried
scanning work and grading it online, in lieu of pen and paper. The
move online eliminated some of the more tedious aspects of grading
(e.g., adding up scores, writing the same thing over and over, flip‐
ping pages and so on). This minor reform reduced the total time
burden and, as an added bonus, made it harder for students to cheat.
With digitization behind us, we can get to the more interesting busi‐
ness of applying machine learning as an aid to real automation.
We’ve been hard at work on exactly that over the past few months.
Can you provide context and history for the application of
machine learning in the education world?
At the moment, one of the most widely deployed examples in the
field is likely automated essay scoring for standardized testing.
These technologies extract features from the student writing and use
standard classifiers to predict the score a human would give the
essay. Features can include word length, word count, words per sen‐
tence, spelling, and grammar quality (similar to what you might find
in Microsoft Word). More sophisticated approaches might review
how a particular sentence parses (i.e., do the words fit together in a
reasonable way?).
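The feature-extraction step these early scorers rely on can be sketched in a few lines. The feature set and the hand-supplied linear weights below are illustrative, not any vendor's actual model; a deployed system fits its weights to human-scored essays.

```python
import re

def essay_features(text):
    """Surface features of the kind early automated essay scorers
    used: length, vocabulary, and sentence statistics."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "word_count": len(words),
        "avg_word_length": sum(map(len, words)) / max(len(words), 1),
        "words_per_sentence": len(words) / max(len(sentences), 1),
        "unique_word_ratio": len({w.lower() for w in words}) / max(len(words), 1),
    }

def naive_score(features, weights, bias=0.0):
    """A linear model standing in for the 'standard classifier';
    real systems learn these weights from human-scored essays."""
    return bias + sum(weights[name] * value for name, value in features.items())
```

Such a model captures only the surface of writing quality, which is exactly why, as noted above, no one mistakes its output for a human grader's.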
WriteLab is the most advanced essay feedback system I’m aware of.
They have a very sophisticated system; it is focused not on scoring,
but on essay improvement. Overall, the most widely deployed sys‐
tems tend to be the least sophisticated. They work sufficiently well to suit their use case, yet no one would mistake their output for that
of a human grader.
On another front, we’re seeing progress toward scaling human
assessment in the evaluation of computer code. A common
approach in this vein involves clustering similar student responses
together, typically at the function and part-of-function level (on the
order of 3–10 lines of code). Rather than providing feedback to each
student piecemeal, the grader can comment on such a cluster once,
which then fans out feedback to the relevant students.
This approach has been applied by turns to a number of domains.
Powergrading, a project from Microsoft Research, is a notable
example. It first learns a similarity metric for short answer questions
from labeled data. Next, the system places responses into groups and
subgroups, allowing instructors to evaluate them all at once.
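A minimal sketch of this cluster-and-fan-out pattern follows, using naive text normalization as the grouping key where Powergrading-style systems learn a similarity metric from labeled data. The data structures and names are ours, not any system's API.

```python
from collections import defaultdict

def cluster_answers(answers):
    """Group (student, answer) pairs by a crude similarity key:
    lowercased, whitespace-normalized text. Real systems learn
    the similarity metric from labeled data instead."""
    clusters = defaultdict(list)
    for student, text in answers:
        key = " ".join(text.lower().split())
        clusters[key].append(student)
    return dict(clusters)

def fan_out(clusters, comments):
    """Send each per-cluster comment to every student in the cluster."""
    return {
        student: comments[key]
        for key, students in clusters.items()
        for student in students
    }
```

One instructor comment per cluster thus reaches every student whose response fell into that group.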
Outside of grading, personalized learning and intelligent tutoring
represent another important thread. The problem has elicited a
number of different solutions, but common to them all is the goal of
fully understanding a student’s skills and the knowledge required for
any given question. By understanding this dynamic, an intelligent
tutoring system can guide students down a path of materials and
questions, constantly updating its estimate of the student’s mastery.
How does Gradescope work and what’s the science behind it?
Machine learning will power our soon-to-be-released “assisted grad‐
ing” feature. The key insight underlying this feature recognizes that
students generally provide a bounded set of answers to a given ques‐
tion (e.g., one thousand students might answer a question in fifteen
ways total). The grading assistant thus allows instructors to grade
only these fifteen unique responses, rather than a full scan of the
entire thousand. As a very simple example, imagine the algebra
question: “What is x if 50 - x = 30?” Perhaps 800 students supply the
correct answer of “20.” However, 150 students might make a mistake
with the minus sign, and respond with “80,” and the other 50 stu‐
dents supply an assortment of other answers. Rather than cycling
through 800 times, the instructor can mark all the correct answers at
once. Furthermore, each incorrect response can still be addressed
individually. As a result, the grader can allot partial credit and sup‐
ply appropriate feedback for each response.
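Using the algebra question above, the bookkeeping behind grading unique responses can be sketched as follows. These structures are illustrative, not Gradescope's internals.

```python
from collections import Counter

# Student responses to "What is x if 50 - x = 30?"
responses = {"ada": "20", "bo": "80", "cy": "20", "di": "-20"}

# The instructor marks each *unique* response once: (points, feedback).
marks = {
    "20":  (2, "Correct."),
    "80":  (1, "Sign slip: x = 50 - 30. Partial credit."),
    "-20": (0, "Check which side of the equation x is on."),
}

# Marks fan out from unique responses to every matching student.
grades = {student: marks[answer] for student, answer in responses.items()}
unique_responses = Counter(responses.values())

print(f"{len(unique_responses)} unique responses for {len(responses)} students")
```

The instructor's effort scales with the number of unique responses rather than the number of students, while partial credit and per-response feedback are preserved.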

Broadly speaking, Gradescope focuses on handwritten work, which
follows directly from our use cases—in-class exams and complex
homework assignments. By breaking away from the digitized out‐
put, we unmoored ourselves from the mainstay of research in our
field. As a result, our current efforts draw heavily on computer
vision and handwriting recognition.
The first version of assisted grading is designed around short-
answer questions (i.e., at most a few words and short math ques‐
tions/answer pairs). We rely on deep learning, and more specifically,
LSTMs (long short-term memory networks), to recognize the hand‐
written work. LSTMs, which have recently become very popular for
a wide range of problems in computer vision and speech recogni‐
tion, are useful in cases involving long-term dependencies between
elements of a sequence. In the handwriting case, this ends up having
a big impact, as handwriting is often connected together, and accu‐
racy is greatly improved by recognizing full words at a time rather
than individual letters.
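For readers unfamiliar with LSTMs, the gating that carries context across a connected word can be seen in a toy, scalar version of one cell update. The weights here are placeholders, not a trained model; production recognizers use vector-valued cells trained on transcribed handwriting.

```python
import math

def lstm_step(x, h, c, W):
    """One scalar LSTM step. The forget/input/output gates decide how
    much past context (c) to keep, which is what lets the model carry
    information across a whole handwritten word rather than treating
    each letter independently."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    f = sigmoid(W["fx"] * x + W["fh"] * h + W["fb"])    # forget gate
    i = sigmoid(W["ix"] * x + W["ih"] * h + W["ib"])    # input gate
    o = sigmoid(W["ox"] * x + W["oh"] * h + W["ob"])    # output gate
    g = math.tanh(W["gx"] * x + W["gh"] * h + W["gb"])  # candidate memory
    c_new = f * c + i * g          # blend old memory with new input
    h_new = o * math.tanh(c_new)   # expose a gated view of the memory
    return h_new, c_new
```

Iterating this step over a sequence of pen-stroke or image features is what lets the network use the whole word as context when deciding each character.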
Once we are equipped with a digital representation, we group the
answers together. We then employ different methods for different
types of problems (i.e., we treat text-based short answer questions
differently from math-based short answer questions). We currently
ask the user to tell us the type of each question, but we are develop‐
ing methods to detect this automatically.
In all of our work, we pay particular attention to user trust. Grades,
by their very nature, are a sensitive matter, demanding accuracy and
fairness. This puts the burden on us to move as close as possible to
100% accuracy. We poured a lot of effort into the instructor/user
interface, letting them quickly accept or reject our suggestions,
which yields more grist for our training.
What were the key machine learning challenges you were forced
to address?
I’d start by noting that even though human graders don’t always ach‐
ieve complete accuracy, they expect it from us. From their point of
view, losing a point to an algorithmic error is unacceptable. So until
we meet or surpass human-level benchmarks in grading, we will
maintain a human-in-the-loop.
Specifically with regard to machine learning, one of the early chal‐
lenges we faced had to do with our handwriting algorithms. We realized that existing datasets didn’t fully meet our needs. In a certain
respect, they provided the complete opposite. That is, such datasets
consist of a small number of writers who produce a lot of work.
Grading, on the other hand, implies a lot of writers with limited out‐
put per writer. This mainly meant having to go through and label a
large amount of data. Specifically, we went through our existing
bank of exam submissions and transcribed the handwriting, so that
we could train the handwriting recognition system.
Going forward, what are some exciting new machine learning
approaches you hope to apply with the data you’re collecting?
Above, I mostly discussed our efforts in automating grading and
freeing up instructor time to teach instead of grade. We hope to go
beyond those efforts and apply machine learning to other parts of
the learning process as well.
When instructors grade on paper, they typically use the gradebook
to record a single number for each student per assignment. This
means they inadvertently discard a lot of very valuable data (namely,
the reason behind every point earned by every student on every
question in a course). Instead, with Gradescope, they develop a digi‐
tal rubric, a list of grading criteria with associated point values. As
they grade student work, they select a subset of the rubric items to
associate with each student’s answer. The software proceeds to com‐
pute a grade from the chosen rubric items.
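A digital rubric of this kind can be modeled in a few lines. The items and point values below are invented for illustration; a real rubric is authored by the instructor for each question.

```python
# A minimal digital rubric: criteria with point deltas.
rubric = {
    "sets_up_equation":  2.0,
    "correct_algebra":   3.0,
    "sign_error":       -1.0,
    "missing_units":    -0.5,
}

def grade(selected_items, max_points=5.0):
    """Compute a score from the rubric items a grader selected,
    clamped to [0, max_points]. Every selection is also a record of
    *why* the student earned or lost credit, not just a number."""
    raw = sum(rubric[item] for item in selected_items)
    return max(0.0, min(max_points, raw))
```

The selected items, retained per student per question, are precisely the data that a paper gradebook discards.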
Because grading now happens digitally, we store the previously lost
data. This data allows us to generate a far clearer picture of the stu‐
dent’s understanding. And unlike most other digital platforms that
exclude partial credit, our rubric approach leads to a more nuanced
understanding of student progress.
More broadly, if you had a magic wand and could apply machine
learning at will, how might you reshape education?
First, we could instantly grade all work, which would confer a num‐
ber of benefits. Students would get instant feedback, enabling them
to practice with an endless supply of questions per topic. Instruc‐
tors, for their part, would spend zero time grading, freed up, instead,
to focus on teaching and student interaction.
Second, we’d have a clear understanding of the state of every stu‐
dent’s understanding, data that would guide teachers at the student
level. They’d know, for example, whether a student might need additional practice to brush up on a particular concept. Stepping back, in
such a world, teachers could measure the effectiveness of their les‐
sons and tune them in response to incoming student data.
We’re already hard at work on the first problem—instant feedback
for every type of question. The second problem is a bit more chal‐
lenging, for a few reasons. The first reason has to do with data. Pub‐
lishers, for example, often want to measure the effectiveness of their
educational content—e.g., does this particular textbook chapter help
students learn effectively? To measure this, the publisher would
need to know not only whether a student read the chapter, but
when: was it before the assessment or after?
In the same vein, publishers might want to know whether the stu‐
dent also attended the lecture or watched the corresponding video,
in addition to the telemetry from the digital textbook. As publishers
increasingly switch to digital content, they’re building better data
profiles in turn. Some publishers even embed multiple choice quiz‐
zes to get a coarse measure of student understanding. Most, how‐
ever, don’t close the full feedback loop with access to the written
exam.
Further complicating matters is the lack of an accepted taxonomy of
concepts. For instance, how do you map one instructor’s teaching to
another for the sake of comparison? Perhaps their terminology dif‐
fers. In other words, if both instructors tagged every lecture, hand‐
out, and homework and exam question with the associated
concepts, there’s little guarantee they’ll match. As a matter of fact,
you can essentially guarantee that they won’t. Recently, researchers
have focused on ways to automatically tag questions with the
required “skills” (or concepts), and at the same time, measure the
student’s ability vis-à-vis said skills. This is currently one of the big‐
gest and more challenging areas of research in the space.

Jake Heller: The Future of Legal Practice


Jake Heller is the founder and CEO of Casetext. Previously, he was president of the Stanford Law Review and a managing editor of the Stanford Law & Policy Review; worked in the White House Counsel’s Office and the Massachusetts Governor’s Office of Legal Counsel; clerked on the First Circuit Court of Appeals; and was a litigation associate at Ropes & Gray LLP.

Key Takeaways

1. Contrary to common wisdom, the legal field has been at the vanguard of technology adoption, including a head start on machine learning applications.
2. The vast corpus of legal knowledge lends itself to machine
learning. Meaning in the field pivots on language and a complex
network of cases, opinions, briefs, and so on.
3. Casetext seeks to, first, “free” legal knowledge by exposing case
law that is otherwise paywalled, building a unique dataset as
lawyers interact with and annotate its content. Its latest product,
CARA, uses machine learning to enrich any legal document
with relevant research.
4. The intersection of machine learning and law presupposes a
bigger, albeit philosophical question: is the law computable?

Why don’t we start with your background and how you got to
Casetext?
I grew up in Silicon Valley, and have been coding from an early age.
My dad founded an internet company in our garage in ’94. As his
company grew, I worked alongside him on weekends, nights, and
summers, giving me a head start on web technology. And for the
longest time, I envisioned a career in programming. My passion for
code gave way to a keen interest in policy, through high school
speech and debate, and then, in turn, to law. At Stanford Law
School, I applied myself primarily to questions of technology law
and policy.
After graduating and a few years into my legal practice, I kept
returning to an idea that I started thinking about in law school. I
knew, both in theory and from personal experience, that lawyers
spend 20–30% of their time engaged in legal research—the task of
locating precisely the correct precedent, statutes, and regulations to
help you win your case (as a junior lawyer, it clocked in closer to
70%).
Finding the relevant precedent means combing through a rather
large search space—more than ten million cases containing over one
hundred million pages of text. Locating the precedent further entails
situating it in the broader context of the law. Even a precedent seem‐
ingly on point might prove misleading—perhaps the case was overturned or deemed irrelevant to your circumstances. It’s a
maddeningly difficult process.
To help accomplish this task, lawyers typically subscribe to legal
databases, such as LexisNexis, Westlaw, and others. The price tag
bites: $100 for a single search and $20 for access to a single docu‐
ment. Combined, companies in this space generate over ten billion
dollars in annual revenue.
The key idea behind Casetext is to remove the paywall to legal
knowledge while building a business through premium research
technologies. The incumbents are so expensive because their models
required them to hire thousands of human editors, which drove up
costs. Instead, Casetext builds on a mix of data science, natural lan‐
guage processing, and crowdsourcing, accomplishing with twenty
engineers what the others produce with twenty thousand editors.
What’s the history of machine intelligence in the law?
Common wisdom suggests that lawyers are somehow tech-
backward or late adopters. This couldn’t be further from the truth.
The law and its practitioners have been on the vanguard of technol‐
ogy. If you should happen to visit the Computer History Museum in
Mountain View, California, you would come across a LexisNexis ter‐
minal, an erstwhile internet for legal research well before the
modern public web.
The same is true of machine learning: lawyers made early use of its
power. It found its initial purchase in “e-discovery.” Short for
electronic discovery, e-discovery technologies help lawyers during
the “discovery” phase of litigation, where each side shares their
records with the other. E-discovery software allows lawyers to parse
records at large scale, as they hunt for a “smoking gun” email or an
incriminating presentation. Developed almost a decade ago, e-
discovery tools began including “predictive coding,” which marries
machine learning and e-discovery. Human reviewers (i.e., attorneys)
generate the initial flurry of training data by annotating documents
for relevance. The first batch of, say, ten thousand human-reviewed
documents enable prediction on the remaining million.
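The seed-then-predict loop described here can be sketched with a tiny naive Bayes text classifier. This is an illustrative toy rather than any vendor’s actual algorithm; the documents and labels below are invented:

```python
import math
from collections import Counter

def train(labeled_docs):
    """Count word frequencies per class from the human-reviewed seed set."""
    word_counts = {}
    class_counts = Counter()
    for text, label in labeled_docs:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = set(w for counts in word_counts.values() for w in counts)
    return word_counts, class_counts, vocab

def predict(model, text):
    """Pick the class with the highest log-probability (Laplace smoothing)."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total)  # log prior
        n_words = sum(word_counts[label].values())
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# A tiny invented seed set standing in for the 10,000 reviewed documents.
seed = [
    ("smoking gun email about the deal", "relevant"),
    ("email discussing the secret deal", "relevant"),
    ("quarterly budget spreadsheet", "irrelevant"),
    ("cafeteria lunch menu for friday", "irrelevant"),
]
model = train(seed)
label = predict(model, "another email about the deal")  # classified "relevant"
```

With ten thousand reviewed documents instead of four, the same loop would score the remaining million for relevance and surface candidates for human review.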
Machine learning, in addition, has long played a supporting role in
legal research. Bloomberg, Westlaw, and LexisNexis, for their part,
use it to improve their search results based on clicks, views, and
other behavior data from their clients. It has also been used to aid
their human editors in predicting whether a new case overturns an
older one.
More recently, what explains its widening use in the legal field?
On the heels of the economic collapse of 2008, a wave of budget-
constrained clients balked at paying for research technology. Law
firms went from “recovering” (i.e., billing their clients for) nearly all
their technology costs to settling for recovering only a fraction of
those costs. Worse yet, clients began to demand alternative fee struc‐
tures, shaking an industry accustomed to billable hours. Market
pressure gave way to soul searching, as law firms aspired to new effi‐
ciencies, abetted by novel technology. Casetext and other companies
in our category have given firms the capacity to do, as the saying goes,
more with less.
Can you elaborate on Casetext’s approach to machine learning?
Casetext provides a full-fledged legal research system that lets you
do a lot of the work you might do on LexisNexis, for example. You
can search the law, and you can read, bookmark, and cite cases. Such
functions all make use of machine learning to some extent. Our
most rigorous application of machine learning, however, powers
CARA, our “Case Analysis Research Assistant.”
CARA enables attorneys to simply drag-and-drop in a legal docu‐
ment, and in seconds, it will read and understand that document,
returning relevant research the attorney has so far missed. This ena‐
bles attorneys to make sure they aren’t missing any key precedents as
part of their research, or catch opposing counsel leaving something
critical out. CARA can even help predict the thrust of the opposi‐
tion’s brief, given what you’ve worked on so far.
CARA is powered in large part by a machine learning model that is
trained on the network of citations between legal cases. Legal writ‐
ing, by its nature, derives from precedent—so all cases, articles, and
briefs will vigorously cite prior precedent. Our citation network
draws from the massive collection of articles, cases, and briefs in our
database. Using machine learning, we began to discern the relation‐
ships that shape the law.
When considering what sorts of techniques to use, we explored the
literature on classical information retrieval sciences. From the litera‐
ture, we picked up an insight we could apply to legal research specif‐
ically: if every case that cites cases A, B, and C also cites cases D and
E, anybody citing A, B, and C would be wise to further consider D
and E. We developed a similar concept that we internally call “cita‐
tion bundles”—bundles of cases that we know to be related because
they are so often cited together.
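As a rough illustration of the idea, co-citation counts alone can drive a recommender. The sketch below is a simplified stand-in for CARA, with hypothetical case IDs:

```python
from collections import Counter
from itertools import combinations

def build_cocitations(documents):
    """Count how often each pair of cases is cited together in one document."""
    pair_counts = Counter()
    for cited in documents:
        for pair in combinations(sorted(cited), 2):
            pair_counts[pair] += 1
    return pair_counts

def suggest(pair_counts, cited_so_far, top_n=2):
    """Rank uncited cases by how often they co-occur with the cases already cited."""
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if a in cited_so_far and b not in cited_so_far:
            scores[b] += count
        if b in cited_so_far and a not in cited_so_far:
            scores[a] += count
    return [case for case, _ in scores.most_common(top_n)]

# Hypothetical documents, each a set of case citations.
documents = [
    {"A", "B", "C", "D"},
    {"A", "B", "D"},
    {"B", "C", "D"},
    {"A", "C", "E"},
]
pair_counts = build_cocitations(documents)
suggestions = suggest(pair_counts, {"A", "B", "C"}, top_n=1)
```

Here the recommender surfaces case D, which co-occurs with A, B, and C far more often than any other uncited case.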
However, we quickly realized that this method was not nuanced
enough for the precise activity of legal research. Over time, we
began to fold in more and more factors into the machine learning
core of CARA, including which topics the brief covers (derived from
latent semantic analysis), a relative weighting of the importance of
each citation relationship, the recency with which certain cases have
been cited, and literally over a hundred other factors.
At what point can you predict rulings ahead of time based on the
content of the briefs or the presiding judge, for example?
Our goal is not to predict how courts will rule, which is more the
concern of journalists and investors. (As far as investors go, the
Supreme Court adjudicates at most eighty cases a year, the vast
majority of which aren’t financially actionable—so the idea of using
machine learning to predict how the Supreme Court will decide
cases is often a waste of money for investors.)
Further, forecasting judicial decisions is rather complicated.
Machines, for example, have a hard time grokking ideology or a par‐
ticular political philosophy that might underpin a decision. No
machine could have predicted Bush v. Gore, for example, which no
human was surprised to find completely came down to the political
party of each individual justice. There’s an outfit called Fantasy
SCOTUS, where you can predict how the Supreme Court will decide
cases, and the crowd’s predictions are compared to a machine learn‐
ing model; to date, human players best the AI systems at predicting
Supreme Court rulings every time.
To venture into the abstract, do you think the law is fully
computable in theory? That is, do we even need human judges at all?
That’s a great question, and I’m going to stake out a position that is
controversial compared to many in legal tech. A lot of people in my
field believe that if machines could be taught to understand the
language and content of the law, then jurisprudence and other legal
matters could be restated as a series of “if, then” statements.
I disagree. First, the law is sometimes intentionally ambiguous. Con‐
gress, for example, might purposefully engineer ambiguity into leg‐
islation for any number of reasons, including that they don’t want to
take the political heat for a controversial position and are content to
“let things get worked out in the courts.”
Furthermore, at their core, most precedents turn on somewhat sub‐
jective weighing factors and standards. Take fair use, for example.
Determining whether a use of someone’s copyright is permissible
“fair use” constitutes a four-part test, and none of these factors
approach mathematical certainty. Each decision, when considering
these factors, may vary from judge to judge.
All that said, I will admit that the more well-defined areas of the law
(e.g., tax) might in fact be computable. Nonetheless, I think the most
interesting, difficult parts of law are by their design uncomputable
and will likely remain so for some time.

Aaron Kimball: Intelligent Microbes


Since 2014, Aaron Kimball has been the CTO of Zymergen, a company
sitting at the intersection of machine learning, automation, and biology,
serving customers in the industrial chemical market. He previously
worked at Cloudera on both Apache Hadoop and Sqoop, after which
he co-founded the retail analytics company WibiData. He completed a
bachelor’s degree in computer science at Cornell and a master’s at the
University of Washington.
Key Takeaways

1. Throughout history, humans have co-opted microbes to serve a
variety of applications, from artisanal fermentation to large-scale
industrial output.
2. Despite advances in genomic sciences, progress toward a full
understanding of microbes and their commercial use has
remained slow, gated by human intuition in a lengthy process of
trial and error.
3. With the goal of improving microbe performance, Zymergen
replaces human intuition with machine learning and manual lab
work with automation. In so doing, it can better discern the
complex relationship between microbial DNA and its associated
traits, making it possible to produce better microbes that can
then be applied to the production of various industrial
molecules.

Let’s start with your background.
I’ve been developing software professionally since 2008, initially at
Cloudera—where I was the first engineer. During my tenure, I
developed Apache Sqoop, in addition to working on Apache
Hadoop. In 2010, I cofounded WibiData, where we focused on
developing big data applications for the retail sector. I first met Josh,
Jed, and Zach at Zymergen in early 2014 and have been CTO here
ever since. My education centered entirely on computer science, first
at Cornell University, followed by my time at the University of
Washington. Through my work at Zymergen, I’ve gotten an on-the-
job education in biology.
Tell us a bit about the history of microbes and their application to
industry.
Microbes have been used for millennia. Although of course they
didn’t know it at the time, it’s what ancient civilizations used to brew
beer and wine—and what we still use today. By the 19th century,
Louis Pasteur discovered the throughline that connects microbes to
fermentation and to disease, revolutionizing the field of microbiol‐
ogy in the process. Over time, human civilization wielded the
microbe as a versatile chemical factory: artisanal applications even‐
tually gave way to industrial-scale ventures. In addition to drugs like
penicillin, many of the vitamins we buy for nutrition or the ingredi‐
ents commonly found on food labels are produced microbially. The
industrial chemicals space, as it is known, amounts to a multibillion
dollar global industry.
Despite their enormous value, microbes still conceal their share of
secrets. Decades have passed since Watson and Crick’s famous dis‐
covery, yet human understanding of biology remains limited.
Zymergen hopes to change that.
By combining recent advances in genomics, sequencing, automa‐
tion, and machine learning, we developed a platform that efficiently
and systematically explores the vast search space for biology. As a
result, many of the world’s biggest challenges—feeding a growing
global population, security, climate change, and materials for safer
cars—are poised to find solutions in biology.
To understand why this hasn’t been done before, consider the
magnitude of biology’s complexity. Scientists estimate there are 10^81
stars in the universe. The number is beyond comprehension, and yet it
pales in comparison to the 10^13,000 ways in which the genes of even
the simplest biological system—a microbe—can be altered.
the space exceeds human intuition and intellectual capacity would
be an understatement. Nevertheless, as is common with most scien‐
tific discovery, the history of microbe engineering is one of testing
human-generated ideas. As a result, progress has been marked by
epiphanies, the fruit of individual breakthroughs and error-prone
lab work, rather than predictable engineering.
Where does Zymergen and its approach fit in?
Zymergen takes a data-driven approach, replacing manual lab work
with automation, and human-generated hypotheses with machine
learning algorithms. The result is a platform that comprehensively
and systematically explores the search space for biology, generating
a growing library of data that delivers results with increasing effi‐
ciency. Just as Google PageRank replaced human-curated search
engines, Zymergen recognizes that scientists cannot efficiently query
the search space for microbes using intuition alone. Today, Zymer‐
gen uses its platform to engineer microbes to make industrial chem‐
icals. While the platform can support work in other living systems,
microbes offer a useful starting point.
As with wine and beer, microbes naturally convert feedstock into an
end product. Industry has long sought to repurpose this capability
for their own products, achieving only limited success. Even though
microbes are currently used for the production of some commodity
goods, cost and complexity have limited their application on a
broader scale.
What are the key problems Zymergen hopes to solve?
Zymergen focuses on improving the performance of microbes: gen‐
erating higher yield, productivity, or other metabolic measures when
applied to production of a particular molecule. Since we recognize
the limitations of human intuition, algorithms and machine learn‐
ing enable us to drive each successive experiment more effectively
and efficiently.
Today, human understanding of the relationship between genotype
(i.e., the DNA) and phenotype (i.e., the associated trait) can be
described as tenuous at best. For instance, changes to parts of the
genome previously believed to be unrelated to direct metabolism
(the sequential conversion of feedstock to intermediate molecules
and then to the desired output molecule) can have a material effect
on the actual metabolic processes.
Complicating matters, we have yet to develop a deterministic model
of DNA-to-phenotype expression that can be modeled in software.
Efforts to simulate the impact of DNA changes on the cell lack suffi‐
cient fidelity to predict whether the modified DNA will have posi‐
tive, adverse, or null impact.
Improving a phenotype requires us to, in effect, reprise evolution on
a vastly accelerated schedule. To that end, we run numerous trials
with subtle changes in each variation of DNA we design. In the pro‐
cess, we use high-throughput capabilities to employ a more system‐
atic approach to proposing and testing genomic edits, not one based
in ad hoc human intuition. Our approach further serves to identify
the consequences of changes to nonobvious parts of the genome.
The initial improvements come with a cost, however. Even as scien‐
tists identify performance increases, reaping marginal improvement
becomes increasingly difficult. Manipulating the narrow set of genes
directly involved in the relevant metabolic pathway can deliver a lot
of the early improvements. To counter diminishing returns, addi‐
tional marginal improvements require exploring the outer reaches
of our current understanding, the so-called “dark space” of the
genome, where the correlation between DNA and function remains
a mystery.
Using machine learning, we can extract patterns from a large num‐
ber of trials invisible to the human eye. For example, one of the
problems we need to address is that of “consolidation.” That is,
examining the aftermath of a series of trials will reveal some per‐
centage as “hits”: genetic edits found to improve cell function over
the common parent strain. But it wouldn’t be useful to construct a
new “master strain” through a simple union of the various “hits,”
because such amalgamation is more complex than it might seem:
some subsets of changes are additive in combination, while others
are neutral or even deleterious.
Addressing this problem means negotiating a combinatorially large
number of distinct subsets. Which subsets of the hits do we test,
knowing that creating these subsets is no easy task? We have had
good success using machine learning to predict useful subsets of
edits, thereby quickly narrowing the search space.
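To picture the consolidation search, the sketch below pairs a greedy forward selection with a made-up performance predictor containing one antagonistic pair of edits, avoiding enumeration of all 2^n subsets. Both the effect sizes and the interaction are invented; Zymergen’s actual models are not public:

```python
def greedy_consolidate(hits, predict_gain, max_size=10):
    """Forward selection: repeatedly add the edit whose inclusion most
    improves the predicted performance, stopping when nothing helps."""
    selected = set()
    best = predict_gain(selected)
    while len(selected) < max_size:
        candidates = [(predict_gain(selected | {h}), h) for h in hits - selected]
        if not candidates:
            break
        gain, h = max(candidates)
        if gain <= best:
            break  # no remaining edit improves the predicted strain
        selected.add(h)
        best = gain
    return selected, best

# Toy predictor: additive per-edit effects plus one antagonistic pair.
effects = {"e1": 3.0, "e2": 2.0, "e3": 1.0}

def predict_gain(subset):
    gain = sum(effects[h] for h in subset)
    if {"e1", "e2"} <= subset:  # e1 and e2 interfere with each other
        gain -= 4.0
    return gain

selected, best = greedy_consolidate(set(effects), predict_gain)
```

Greedy selection takes the strong edit e1 and the compatible e3 but skips e2, whose interference with e1 outweighs its individual benefit.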

Finally, we rely on machine learning to improve the Zymergen pro‐
cess writ large, efficiently managing factory capacity. Specifically, we
use it to plan which trials and DNA variations to run in our
production factory, and, moreover, in what priority.
What are some of the key challenges going forward from a
machine learning perspective?
Interpreting DNA is like staring at machine code for a computer
system with little understanding of the instruction set and a missing
reference manual. Biologists have made progress toward under‐
standing the syntax of DNA—identifying markers that delimit
where genes and other functional elements begin and others end,
but a complete picture of their complex interaction remains elusive.
In some well-studied species, such as E. coli, researchers have cre‐
ated detailed reference genomes with “annotations” that describe the
instruction set of delimiters, genes, promoters, and other functional
elements, along with links to known functional impacts. In the
microbial species used in industry, on the other hand, we’re stuck
with rather sparse annotation sets. This means that despite a high-
resolution understanding of the genome, we lack the rich feature set
needed for training.
We currently fill this gap by combining genome data with lab trial
performance test data. Unfortunately, this approach has its limita‐
tions. The results are often binary: editing gene X either is or isn’t useful.
Yet this simple binary answer obscures a more complex reality of
causation. In some cases, by using “ladders” of changes with titrated
strengths of effect, we can construct more linear gradients of cause
and effect. Nevertheless, to extrapolate from there requires a new
theory about the genome, which itself demands, at the very least,
higher-fidelity annotations.
Beyond the genome, our wet lab factory in itself resembles a com‐
plex system. Perturbing and testing genomic change chains together
an intricate process with hundreds of steps, the effects of which
remain invisible to the human eye. In effect, optimizing our factory
for higher throughput, lower cost, load balancing, and effectiveness
represents a challenge at the intersection of operations research and
biology. Understanding causality in process changes requires a
model that combines the lab environment in detail (as a function of
sensors in addition to process specifications) with test outcomes.
Figuring out the true tolerance of process changes on various steps
will come only through repeated trials and statistical analysis. In this
effort, we lean on process modeling techniques, such as root-cause
analysis. Over time, we develop control charts for our processes and
use them to model the effect of process deviations on outcomes.
Predicting whether a process change will be beneficial or not can
save us time and money, simultaneously adding to the precision of
our work.
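A minimal Shewhart-style control chart captures the idea: estimate limits from in-control runs, then flag any later reading that falls outside them. The process step and readings below are invented:

```python
from statistics import mean, stdev

def control_limits(baseline, k=3.0):
    """Shewhart-style limits: mean +/- k standard deviations of in-control runs."""
    mu, sigma = mean(baseline), stdev(baseline)
    return mu - k * sigma, mu + k * sigma

def flag_deviations(readings, limits):
    """Return the indices of readings outside the control limits."""
    lo, hi = limits
    return [i for i, x in enumerate(readings) if not lo <= x <= hi]

# Hypothetical incubation temperatures (degrees C) from in-control runs.
baseline = [30.0, 30.2, 29.8, 30.1, 29.9, 30.0, 30.1, 29.9]
limits = control_limits(baseline)
flagged = flag_deviations([30.1, 29.95, 31.5, 30.05], limits)
```

Only the third reading falls outside the three-sigma band, flagging that run for root-cause analysis.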

Bryce Meredig: The Periodic Table as Training Data
Bryce Meredig is cofounder and Chief Science Officer of Citrine Infor‐
matics. Citrine’s software aggregates and analyzes large volumes of sci‐
entific data to help customers rapidly invent and manufacture new
materials.
Key Takeaways

1. Materials science concerns the understanding and control of
matter and its properties toward practical applications. Work in
the field has traditionally counted on expertise in the sciences
(e.g., solid-state physics, chemistry, and so on).
2. Materials problems can be reframed as data problems. By
describing materials and their properties as scalars and vectors,
researchers and businesses can now compute what they used to
intuit. Citrine offers a platform to help companies solve materi‐
als problems using data and machine learning.
3. Academics and researchers are actively exploring the exciting
intersection of machine learning and physical theory. Citrine,
for their part, is actively working on incorporating physical pri‐
ors into their materials models.

Tell us a bit about yourself.


I’m a materials scientist by background. I studied the subject as a
Stanford undergrad, followed by a PhD at Northwestern, where I
focused on materials informatics—the materials analogue of bioin‐
formatics. The idea is to apply machine learning to materials data to
derive new knowledge, purely from the data itself. With my science
education complete, I returned to Stanford for an MBA, which
dovetailed with my current work as co-founder of Citrine
Informatics.
For those unfamiliar, can you provide some background on mate‐
rials science?
Materials scientists are concerned with understanding and control‐
ling the properties of matter, and specifically, matter with practical
applications. Energy materials are one such application, the domain
of batteries, photovoltaics, and so on. A materials scientist asks,
“What materials should we use to make such products achieve our
desired performance?”
Historically, materials scientists train in the fundamental concepts of
chemistry and physics, especially solid-state physics. This broad
education centers on the observation that materials phenomena
span a wide range of length scales. Our field considers matter and its
behavior all the way down to the atomic scale. How do atoms, for
example, sort into crystalline structures or molecules, and how do
these arrangements influence materials properties? Then, we must
also be concerned with the scale of everyday life: we have to manage
the behavior of materials that comprise an aircraft, for instance.
Commercial applications naturally follow. Companies like Boeing
and Airbus might worry about fashioning a collection of materials
into a fuselage. Alternatively, a company like Intel needs to drive
certain performance and cost improvements in their semiconduc‐
tors, a materials endeavor that spans the entire periodic table.
Prior to Citrine, what role did modern data science and machine
learning play in the field?
The domains of production and manufacturing led the way in the
use of data and statistics, through methods such as Six Sigma. More
fundamental R&D, in contrast, has traditionally been driven by
domain knowledge and expert intuition.
Tell us more about Citrine’s approach, in contrast.
Citrine’s aim is to function as the materials-centric data and analysis
platform across the manufacturing sector. Any company with mate‐
rials or chemistry-intensive products can use Citrine to improve
their decision-making process around materials, ranging from
materials selection for highly tailored use cases to novel material
design and discovery. As an example, a chemical giant like Dow
might want to design a polymer or molecule with novel properties
in a predictable and rational manner. Then, with the molecule
design in hand, the challenge shifts to scaling the manufacturing
process, which Citrine supports as well.
By and large, our customers arrive with specific design goals in
mind. They might want to design a lighter vehicle that achieves bet‐
ter fuel economy. Ultimately, these high-level goals are reducible to
the lower level chemistry and physics of materials science. Put dif‐
ferently, the existing vehicle comprises certain alloys with particular
mechanical properties. The question thus becomes how to substitute
the existing materials with lighter alternatives without sacrificing the
crucial mechanical properties.
Complicating matters for our customers, a typical industrial applica‐
tion must juggle dozens of targets and constraints. In some cases,
some of the necessary materials need to be created de novo or
modified from existing materials, in which case Citrine helps iden‐
tify the right mix of materials for the task.
Where does machine learning fit in?
Suppose an aerospace company requires materials for hypersonic
flight. Most materials, needless to say, have not been rigorously sub‐
jected to hypersonic conditions. Machine learning is ideally suited
to closing this knowledge gap. By learning how a few select materials
behave under hypersonic conditions, it can provide guidance about
others without such tests.
In reality, many of the problems our customers solve using Citrine
boil down to regression problems. Frequently, the materials proper‐
ties in question can be represented as scalars or vectors. Let’s con‐
sider thermoelectric materials to illustrate this point.
Thermoelectrics generate a voltage when subjected to a temperature
gradient, or vice versa. One commercial application involves har‐
vesting waste heat—the kind you’d find, for example, in the engine
compartment of a car. If, in principle, you could design a cheap,
easy-to-manufacture and high-efficiency thermoelectric material,
car makers would adopt it in a heartbeat. They could use it to help
capture engine heat and redirect it to battery charge, rather than los‐
ing it to dissipation.
A key challenge for good thermoelectric materials is that they must
conduct electrons, but not heat (i.e., phonons). Typically, these
two properties strongly correlate, but we might want to find unusual
materials that decouple the two effects.
Posed as such, this problem provides a great test case for machine
learning. We can train models to predict the key properties of a
thermoelectric—screening for the rare materials that combine low
thermal conductivity with high electrical conductivity. The alternate
approach would require a huge number of experiments, largely
driven by physical intuition, which is the usual strategy in materials
design.
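As a sketch of that screening loop, the snippet below uses a deliberately simple nearest-neighbor regressor over a hypothetical two-dimensional descriptor space. All feature vectors and property values are fabricated, and Citrine’s production models are far richer:

```python
def knn_predict(train, query, k=2):
    """Predict a property as the mean of the k nearest training materials
    (Euclidean distance in a hypothetical descriptor space)."""
    by_dist = sorted(train,
                     key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)))
    return sum(value for _, value in by_dist[:k]) / k

def screen(candidates, thermal_train, electrical_train, t_max, e_min):
    """Keep candidates predicted to decouple the two conductivities."""
    return [name for name, feats in candidates
            if knn_predict(thermal_train, feats) <= t_max
            and knn_predict(electrical_train, feats) >= e_min]

# Fabricated (descriptor, measured property) training pairs.
thermal_train = [((0.1, 0.9), 1.0), ((0.2, 0.8), 1.2),
                 ((0.9, 0.1), 8.0), ((0.8, 0.2), 7.5)]
electrical_train = [((0.1, 0.9), 900.0), ((0.2, 0.8), 850.0),
                    ((0.9, 0.1), 950.0), ((0.8, 0.2), 60.0)]

candidates = [("mat-1", (0.15, 0.85)), ("mat-2", (0.85, 0.15))]
hits = screen(candidates, thermal_train, electrical_train, t_max=2.0, e_min=500.0)
```

Candidates predicted to combine low thermal conductivity with high electrical conductivity survive the screen; everything else is filtered out before any experiment is run.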
Machine learning can be useful when applied to physics-based simu‐
lations as well. Such simulations are often directionally useful, but
simulating real world effects poses a challenge. We cannot easily
compose a set of tidy equations to completely describe the relevant
effects. In practice, we have found that outputs of physics-based
simulations can serve as useful input to machine learning programs,
along with experimental observations.
Can you elaborate on the limitations of such physical simula‐
tions?
One standard tool in physics-based simulation of materials is den‐
sity functional theory (DFT). DFT is an electronic structure method
that solves quantum mechanical equations to predict materials
properties. However, real-world materials behavior happens at the
scale of meters, not individual atoms (i.e., one tenth of one billionth
of a meter), and most DFT simulations are constrained to treating a
few hundred or few thousand atoms. In another limitation, DFT
models materials at zero temperature. We, of course, do not use
materials at absolute zero.
As a result of these constraints, we rely on considerable approxima‐
tions and extrapolations in the application of DFT to practical mate‐
rials problems. Machine learning, in contrast, can directly model
real-world materials behavior—that is, if we have the training data,
which, of course, may originate in DFT. Machine learning can, for
example, incorporate various finite-temperature effects in experi‐
ments left out of DFT. One would expect the ground-state electronic
structure of a material to correlate in some way with that material’s
properties at any temperature, in principle, and machine learning
can directly take advantage of this correlation in a black-box
fashion.
Can the machine learning inform or potentially modify the
underlying physical equations?
Tremendous potential exists at the intersection of physical theory
and machine learning, and indeed the promise of the field is
matched by growing interest. Citrine is actively working on the
integration of physics-based priors into our machine learning
framework. Academic researchers making exciting contributions in
these areas include Alán Aspuru-Guzik at Harvard, Kieron Burke at
UC Irvine, and Klaus-Robert Müller at TU Berlin. As it happens, a
recent workshop theme at UCLA’s well-known Institute for Pure
and Applied Mathematics was entitled “Understanding Many-
Particle Systems with Machine Learning.”
Can you share an example success in the application of machine
learning in your domain?
Citrine’s platform has played an important role in several successful,
significantly accelerated materials development efforts. While our
customers deem the industrial examples proprietary, we have, in
fact, published peer-reviewed case studies with academic collabora‐
tors.
In one example published in APL Materials, Citrine accurately pre‐
dicted the properties of a new thermoelectric material that was sub‐
sequently synthesized and characterized experimentally. The
material is noteworthy because it contains a high proportion of met‐
allic elements and exhibits surprising thermoelectric performance,
in spite of this traditional disadvantage.
We further demonstrated that this material draws from unexpected
regions of the periodic table, yet simultaneously shares other charac‐
teristics with well-known thermoelectrics. This outcome demonstrates
the power of applying machine learning to the physical sciences. It
builds upon and reinforces known principles,
and at the same time, helps domain experts generate completely
novel ideas.
Separately, in a Chemistry of Materials article, we established that
Citrine’s machine learning can reliably anticipate chemical systems
that will form a particular atomic arrangement known as the Heus‐
ler crystal structure. This paper is important for two main reasons:
First, it shows that machine learning can dramatically improve the
yield (i.e., efficiency of compound discovery), in some cases from
perhaps a few percent to over 80%. In fact, we reported the success‐
ful experimental synthesis of 12 novel Heusler compounds in this
single paper alone; the entire scientific community typically discov‐
ers about 50 Heuslers per year. Second, it reveals a case in which a
common chemical rule of thumb breaks down in a way that
machine learning informed by more features can address.
What are you most excited about in the next, say, five years
around materials science and the intersection with machine
learning?
I subscribe to the Fourth Paradigm idea, which suggests that data-
driven science is an emerging new paradigm of scientific inquiry,
complementary to theory, computational simulation, and experi‐
ment. I believe that folks in our field should apply data science and
machine learning to extract insights from data collected over the
previous decades. We finally have the computational horsepower
and the algorithms needed to unlock the mysteries trapped within
data all this time. I expect we’ll see a step function improvement in
the state of the art in materials science, driven by data-intensive
methods. We’re already starting to see this, actually, with our current
customers. Discoveries are falling out of the data, just as we
hypothesized that they would.

Erik Andrejko: Transforming Agriculture with Machine Learning
Erik Andrejko leads the data science and research organization at The
Climate Corporation, applying large-scale statistical machine learning
and data science to solve problems in a variety of domains including
climatology, agronomic modeling and geospatial applications. He has a
PhD in Mathematics from the University of Wisconsin–Madison.
Key Takeaways

1. Farming generates huge data volumes: a single crop in a single
country during a single season can produce upward of 150 bil‐
lion observations.
2. Agriculture presents a unique set of challenges for machine
learning, including the limited number of growing seasons and
complex causality chains.

3. Data science, as applied to agriculture, demands a multidiscipli‐
nary approach that melds mechanistic models with more com‐
mon machine learning practice.

Let’s start with your background.


I started out in computer science. In the process, I realized I really
enjoy mathematics, ultimately earning a PhD in pure math. These
days, I work with statistics, machine learning, and data science at the
Climate Corporation, where I build data products to help farmers
make more informed decisions to improve their productivity and
sustainability.
Could you describe the history of data and analysis in agricul‐
ture?
A wealth of what we now consider modern statistics originated in
efforts to design and analyze agricultural experiments, tracing back
to Fisher and the Rothamsted Experimental Station in the 1920s.
That tradition continues today in the modern context.
Today, agriculture generates more and more data from a variety of
sources: farming equipment, satellite images, weather, and so on.
So much so that farmers feel inundated with data but lack a
systematic way to process it into actionable insights.
Walk us through the kinds of data you work with at Climate Cor‐
poration.
We work with data that touches on everything from the soil to the
atmosphere, at very large volume. Consider that a single crop in a
single country during a single season can generate upward of 150
billion observations, when combining the environment, farmer
practices, and the measurements of the crop.
To understand how the environment impacts agriculture requires
understanding the impact from each of these data sources on a
layer-by-layer basis. Generally, it’s not enough to merely highlight
correlation. To succeed, we need to apply sophisticated statistical
and machine learning techniques in order to tease out root causes.
What kinds of experts comprise these teams?
Our organization is composed of teams representing a rather broad
range of domain scientists who work directly with specialists in
machine learning and statistics. Some of these teams focus on build‐
ing models to understand atmospheric processes, such as meteorol‐
ogists or atmospheric physicists. Others work in different specialties
—for example, soil hydrology, soil physics, and biogeochemistry. In
addition, we employ experts with a deep understanding of the crop,
such as crop physiologists and crop breeders. Finally, given the sorts
of data sources involved, we need scientists with experience analyz‐
ing physical and biological systems through remotely sensed plat‐
forms, including satellites or aircraft platforms.
How does machine learning in agriculture differ from more typi‐
cal applications?
I believe there are two meaningful differences. The first is the com‐
plexity of the domain and, by extension, a limited number of avail‐
able trials. A farmer will typically encounter about 40 growing
seasons over their career.
This limitation practically constrains what farmers can learn over
time, which in turn presents a unique challenge for machine learn‐
ing. There just isn’t much opportunity to experiment with new
approaches. In this context, the risk associated with a bad decision
multiplies. Getting it wrong even once can have material conse‐
quences for a farmer’s business and prospects.
Making matters worse, while we often develop and back-test mod‐
els, unlike our peers in other settings, we can’t rely on online testing.
For example, we can’t easily select models by performing a short-
term A/B test in an online setting, as it generally takes one growing
season to observe the predicted outcomes. Instead, we rely on care‐
fully designed field trials, which means actually venturing out into
the field to collect data for our tests.
Getting back to your original question, I believe the other key differ‐
ence is the visibility and transparency required of our models, as
opposed to other applications. As an example, consumer web mod‐
els tend to be driven primarily by the data and rarely expose their
logic to the end consumer. In contrast, our models come preloaded,
so to speak, with a plethora of background knowledge derived from
multiple scientific disciplines. Furthermore, when presented to the
farmer, we can expose not only the recommended course of action,
but the reasoning behind it.

We find that farmers are excellent at using model-based thinking to
consider counterfactuals—e.g., what would happen if I planted a
crop like last year, but with some permutation in the weather, like a
wetter early season? By accurately capturing conditional depend‐
ence, these models earn the trust of farmers with significant
domain expertise.
Can you elaborate on how your models bake in scientific knowl‐
edge?
Consider, for example, how precipitation affects crop yield. Cer‐
tainly, more precipitation is better than a drought, but excess water
is equally harmful. At present, we know precipitation affects crop
growth and development by interacting with the soil through a vari‐
ety of latent processes that ultimately link precipitation to crop yield.
This means a machine learning model that uses precipitation and
crop yield data can be enriched by feature engineering these well-
understood scientific principles.
Sometimes we express such scientific principles statistically. For
example, precipitation can be modeled as a stochastic process. We
might postulate that precipitation will impact a latent moisture state
informed by a relevant soil measurement. We can encode this back‐
ground information directly into the structure of the model to cap‐
ture relevant structure as part of the model itself. These models
typically can be trained much more quickly, while holding fixed the
amount of data.
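As a purely illustrative sketch (the rates, distributions, and variable names here are hypothetical, not Climate Corporation's actual models), precipitation can be treated as a stochastic process driving a latent soil-moisture state:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stochastic precipitation: wet days arrive as a Bernoulli process,
# and wet-day amounts are drawn from an exponential distribution.
days = 120
wet = rng.random(days) < 0.3                             # ~30% chance of rain
precip = np.where(wet, rng.exponential(8.0, days), 0.0)  # mm per day

# Latent moisture state: gains the day's precipitation, decays toward
# zero, and saturates at a soil-capacity ceiling (a toy "bucket" model).
capacity, decay = 100.0, 0.15
moisture = np.empty(days)
m = 30.0                                 # initial soil moisture
for t in range(days):
    m = min(capacity, m * (1.0 - decay) + precip[t])
    moisture[t] = m
```

Encoding the bucket structure directly, rather than asking a generic model to rediscover it from raw rainfall, is one simple way background knowledge can reduce the data needed for training.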
Can you elaborate on the kinds of machine learning approaches
you use in your work?
I’d preface this by noting we make extensive use of a wide variety of
models, and typically build composite models that draw from multi‐
ple approaches. This includes a large number of mechanistic mod‐
els, which are relatively uncommon in most machine learning
shops. Mechanistic models attempt to capture the underlying physi‐
cal understanding or the underlying causation using a mathematical
approach. Modeling the motion of billiard balls on a pool table
using Newton’s laws of motion is a good example of this.
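The billiard-ball analogy fits in a few lines: given Newton's laws and a friction coefficient (the values here are arbitrary), the trajectory follows deterministically:

```python
import numpy as np

def roll_ball(pos, vel, friction=0.5, dt=0.01, steps=200):
    """Mechanistic model: integrate Newton's second law for a ball
    decelerating under constant friction (acceleration opposes the
    direction of motion with fixed magnitude)."""
    pos = np.asarray(pos, dtype=float)
    vel = np.asarray(vel, dtype=float)
    for _ in range(steps):
        speed = np.linalg.norm(vel)
        if speed < 1e-12:                # ball has come to rest
            break
        acc = -friction * vel / speed    # friction opposes motion
        vel = vel + acc * dt
        pos = pos + vel * dt
    return pos, vel

final_pos, final_vel = roll_ball(pos=[0.0, 0.0], vel=[1.0, 0.0])
```

There is nothing learned here: every quantity comes from the stated physics, which is exactly what distinguishes a mechanistic model from a statistical one.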
Most people are familiar with atmospheric models, which model
atmospheric physics to help forecasters explain how two neighbor‐
ing regions in the atmosphere, with different temperature and pres‐
sure, will interact. These models, in other words, help predict what
will happen at the boundary layer between the two. We use mecha‐
nistic models like this in cases where the underlying physics is well
understood, as is the case with systems like the atmosphere and the
soil.
The mechanistic models are almost always coupled with a machine-
learned model either on the input side (e.g., a mechanistic soil mois‐
ture model that consumes a statistical rainfall forecast model) or on
the output side (e.g., a mechanistic soil moisture model used as an
input feature to a machine-learned crop yield model).
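The output-side coupling can be sketched in a few lines (the data, decay rate, and coefficients below are synthetic, purely to show the wiring, not any production model):

```python
import numpy as np

rng = np.random.default_rng(0)

def season_mean_moisture(precip_mm, decay=0.2, m0=30.0):
    """Mechanistic step: reduce a daily precipitation series to a
    season-mean latent soil-moisture value via a simple bucket model."""
    m, total = m0, 0.0
    for p in precip_mm:
        m = m * (1.0 - decay) + p
        total += m
    return total / len(precip_mm)

# Synthetic training set: one rainfall series and one yield per field.
n_fields = 50
rainfall = [rng.poisson(3, size=120) for _ in range(n_fields)]
moisture = np.array([season_mean_moisture(r) for r in rainfall])
yields = 2.5 * moisture + rng.normal(0.0, 1.0, size=n_fields)

# Machine-learned step: fit yield as a linear function of the
# mechanistic moisture feature via ordinary least squares.
X = np.column_stack([moisture, np.ones(n_fields)])
coef, *_ = np.linalg.lstsq(X, yields, rcond=None)
```

The mechanistic stage supplies a physically meaningful feature; the learned stage fits the relationship between that feature and the outcome.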
Coupling different types of models together helps us balance the
predictive power of models (what is likely to occur) with their
explanatory power (why something could occur), achieving a good
trade-off between the two.
Where do you see machine learning in agriculture going over the
next five years?
I think we are at an inflection point in the application of machine
learning technology in many fields. In particular across several
domains, machine-learned models are performing at the level of
human experts—and often exceeding it. This is particularly true
with the application of deep neural networks to things like image
classification in pathology and natural language translation.
As I see it, the challenge for machine learning in agriculture over the
next five years will be to adapt and apply these rapidly evolving tech‐
niques to the domain. In particular, it will be essential to connect the
large volumes of time-series environmental data (with geospatial
data collected from proximal and remote machinery) together with
the genetic data that describes the crop. The techniques that have
shown promise with these classes of data individually will need to be
adapted and extended to work with these types of data in an integra‐
tive context. I anticipate that we will see a number of successful
applications of deep neural networks to this and similar types of
problems.

About the Author
David Beyer is an investor with Amplify Partners, an early-stage VC
focused on the next generation of technical founders solving over-
the-horizon problems for the enterprise. He began his career in
technology as the cofounder and CEO of Chartio.com, a pioneering
provider of cloud-based data visualization and analytics.
