Łukasz Alwast
Dissertation submitted in partial fulfilment of the requirements for the degree of MSc in
Digital Anthropology (UCL) of the University of London in 2014
Word Count: 14 254
Note: This dissertation is an unrevised examination copy for consultation only and it
should not be quoted or cited without the permission of the Chairman of the Board of
Examiners for the MSc in Digital Anthropology (UCL)
Abstract
Over the past XX years, the term data science has swiftly moved into the
vernacular of scientific and technological vocabulary. As this happened, it signified
a larger phenomenon that is taking place in the sciences and society at large,
namely digitization and datafication of many of the aspects of the world that had
not been quantified and digitized before. This trend seems to have its own, new
acolytes - data scientists.
Heralded by the media as the "high-priests of
algorithms" and the "sexiest job of the XXI century", the phenomenon unravels a
more deeply grounded conversation about the establishment of a new profession
in the public milieu, the making of science and scientists, and the evolving nature
of handling and understanding data. Drawing on contributions from science and
technology studies (STS), organizational studies, anthropology, and Internet
studies, this work frames the research around the self-identity of a professional
and group perception of the authenticity and competence of interdisciplinary, XXI
century quantitative analysts.
Table of contents
1. Introduction
2. Methodology
3. Research questions
- Provisional selves
- Communities of practice
- Tools of practice
7. Discussion
8. Closing words
9. Bibliography
Acknowledgements
I would like to thank my supervisor Stefana Broadbent for her guidance, patience
and confidence in setting this piece of research on track. I always found her
enthusiasm contagious, which made this creative endeavour much more
invigorating, intriguing, and within my reach.
I am also grateful to Haidy Geismar, my course co-convenor and personal tutor,
whom I could always count on for thoughtful advice and a critical eye.
I also appreciate the help of Ciara Green, my course peer, who dedicated her time
to listen to my rants on data science and proved to be a good, critical listener.
Then there are of course my informants, with whom a number of in-depth
interviews allowed me to investigate my questions in sufficient depth.
And finally, thanks to my mom and dad, for always supporting me in whatever I
decided to pursue.
I. Introduction
Data, as it stands, surrounds us. For computational systems, we, as human
beings, are carriers, herders and interpreters of it. Data, after all, is the
foundation for deriving information - the particular means of insight and
intelligence that enables us to make informed, individual and collective
decisions. Or so we believe.
Some profound changes in this area have been happening over the past
30 years. With the advancement of computational technologies our ability to
collect, share and analyse data (and therefore, information) has changed to a
degree that is historically unprecedented. In fact, according to one of the
corporations that helped set up the infrastructure for this transition, IBM (2013),
"90% of the world's data has been produced in the last two years - and we are
yet to recognize how to harness its potential".
There is little doubt that, amongst other phenomena, technology shapes
our lives (Bijker et al., 1989), and so do we shape technology (Mackenzie &
Wajcman, 1985). After all, technology - the outcome of making
something (techne) - and science - the outcome of thoughtfully pursuing and
understanding (episteme) - are inherently linked to one another (Parry, 2014).
As Thomas Kuhn (1963) asserted many years ago, science is inherently about
data, and so, in as much, should technology be. And if science is the initiator of a new
way of understanding the world, it also creates opportunities for doing things
differently. This, unfortunately, often translates in popular discourse into a
simplification that scientific opportunities = money, or data = money, and
there are a number of larger and smaller pitfalls in seeing the world through
such a lens. This is, however, often the reality of technology and business
narratives, and this is why the fairly new concept of data science, and its
acolytes - data scientists - appear so worthy of investigation.
There are, of course, some limitations to what only a few months of
research can capture in trying to unpack such a large phenomenon. This is why
this research aspired to become an ethnographic snapshot, on the level of its
the making of science and the evolution of the data analyst role?
II. Methodology
Research for this study did not start with a pre-assumed research
question. It began with an iterative process of probing and exploring what would
be an appropriate angle to unravel one's interests, recognize compelling
questions and seek out relevant, transferable knowledge. Building on those
interests (the processes of long-term socio-technical development, broadly
understood innovation, and social perceptions and expectations of scientific
and technological change), I was looking for a phenomenon that would merge
those themes together - and data science and data scientists appeared as timely
and worthy candidates.
For the requirements of an ethnographic endeavour, however, this was not
going to be an easy task. Data scientist roles in organizations are still fairly
scarce,
outstripping each other in trying to acquire talent, and, if they are successful,
those individuals often work on some of the more critical aspects of
organizations' processes, in many cases highly confidential and sensitive. It was
therefore very difficult to convince any of the individuals I had in my network,
or in their network, to pursue an organizational ethnography and investigate
their organization, as a research field. Limiting the research to one organization
would also be a danger in itself, therefore a decision was made to conduct the
research with a number of informants from different organizations, focus on
them as individuals (rather than their organizations), and collate ethnographic
insight from an accessible field site, for which Data Science London appeared to
be a good candidate.
Inspiration for pursuing the research through such an approach emerged
from anthropological accounts of researchers who historically also tried to follow
either scientific or ICT-heavy communities, from Latour and Woolgar (1986
[1979]), Levy (1984), Latour (1987), Miller and Slater (2001), Biao (2006), Kelty
(2008), to Coleman (2013). Advised by my supervisor, I started exploring the
titles of data scientists, just months before the study began. Another group of
informants was composed of individuals who were data analysts in different
organizational settings, e.g. a post-doctoral astrophysicist, a statistician with 15
years of industrial experience, or an economic research fellow. The third group
were individuals actively engaged in shaping (Data Science London) or
researching the community.
During the period of this study - and throughout the digital anthropology
master's programme - I worked in an organization that was actively hiring and
building a data science team. However, due to the nature of being an actor in
an organizational setting, and the fact that the act of pursuing ethnographic
research might change the dynamics of my social position and relationships
within such an environment (Berg & Lune, 2011), I made a deliberate decision that
this organization would not be a field of the study. Without doubt, however, my
experiences and observations during that time helped inform how the
research was framed and which themes would be selected for deeper
investigation, thus influencing the discussion and reflection on this subject.
Thirdly, my participation in Data Science London meet-ups led to a
number of conversations, observed practices and behaviours that were very
attention throughout the year of this study (2013-2014). A number of new public
and private institutions legitimizing the term data science emerged both in the
UK and the US (e.g. Imperial College Data Science Institute and Data Science at
NYU) and led to a number of discussions and conversations around this topic - for
example, a meeting at Imperial College titled "A Data Scientist is a statistician
who lives in Shoreditch (?)" (Data Science Institute, 2014).
This study might also have benefited from applying ethnomethodological approaches to the subject; however, due to the confidentiality of
the work pursued by some of the questioned data scientists, and their limited
pool, this approach had to be abandoned and the research restricted to a number of semi-structured interviews. For the purpose of a more in-depth study, on a larger
group of informants, ethnomethodology would be highly recommended for
triangulation purposes.
This links
closely with the issue of biased group inclusivity, in-group favouritism and the
gender-biased perception of competence (Moss-Racusin et al., 2012) and
might also have impacted this study.
Values, beliefs and legacies of the open and free software movements - the
popularity of data science can be associated with the proliferation of tools
developed in the spirit of open software, which allowed tackling
increasingly sophisticated data questions. In fact, the Data Science London
organizers included in their mission statement the following claim:
alongside data science, also attracted significant and very interesting scholarly
work from researchers in Internet studies (Mayer-Schönberger & Cukier, 2013),
communications studies (Parks, 2014), digital anthropology (Boellstorff, 2013),
digital humanities (Manovich, 2011) and sociology (Ruppert, 2013).
This is why the literature review will be in its nature cross-cutting,
pointing to contributions and sources available amongst the different areas
discussed above. Deeper commentary will be given only to those positions that
have been identified and judged to be important for supporting the research
questions and revealing the larger picture of conversations that take place
within this topic.
The first part of the literature review will briefly introduce the term
data, and show an interesting historical trajectory of two disciplines, namely
statistics and computer science, which have always tackled data from a
perspective relevant for the profession of the data scientist. The latter part of
the literature review will introduce the body of literature on professional self-identity, organizational socialization and provisional selves. This will correspond
closely with the section on the characteristics and dynamics of communities of
practice, and the forms of practices, tools and behaviours that make a group and
the individuals within it both socialized and distinctive.
This changed as
computerization entered the space, and to some degree, the cost of setting up
data collecting infrastructures has decreased to a point that gave new
opportunities for entities that were not able to use these types of solutions
before. Needless to say, in extreme cases like the NSA or Google data farms
(Forbes, 2014), even today's data infrastructures can be very expensive to
operate.
Not surprisingly, individuals in the field of statistics have repeatedly been
asking themselves the question: what is the role of statistics in the data
revolution? Friedman himself, a statistics Professor at Stanford, argued over a
decade ago that the idea of learning from data has been around for a long time,
but the interest in analysing these large and complex data sets has only
recently [2000s] become so intense (Friedman, 2001: 5). He associated this
with the development of novel database management systems where large
quantities of data resided and which, as a result, gave fertile ground for data
mining approaches: the processes of analysing data for purposes other than those for
which it was collected (Friedman, 2001: 6), shifting the usual application of data
from transaction processing to decision-support.
This argument, dating back to 2001, is worth remembering when looking
at current conversations around Big Data, including a question Friedman posed
towards data mining - although data mining appears to be a viable commercial
enterprise, one can ask whether or not it qualifies as an intellectual discipline?
(Friedman, 2001: 7).
intellectual discipline, but in the future, almost certainly [it will be] and one
can predict a big intellectual and academic future for new data mining
methodologies will emerge (2001: 7). At that time, data mining packages were
already incorporating well-known procedures from the fields of machine
learning, pattern recognition, neural-networks and data visualization. And of
course, some questions remained unanswered - should statistics remain at what
it's good at (i.e. probabilistic inference based on mathematics), or ought it be
concerned with a set of problems, rather than tools?
An important remark in Friedman's argument was that statisticians will
first and foremost have to make peace with computing, as this is where the
data is. If computing is to become one of the fundamental research tools,
then the community will have to teach, or be sure that students learn, the
relevant Computer Science topics, and some basic paradigms of the field will
have to be modified (Friedman, 2001: 9). This thought neatly corresponds with
what Hal Varian (2014), Professor at iSchool at Berkeley and Chief Economist at
Google, argues about the modern training of economists - in particular,
econometricians and the type of skills and tools they need to start acquiring
from their computer science comrades.
This observation leads to conversations about career prospects in data
analysis roles. Friedman argued (2001: 9) that up until around the 2000s, if
someone was interested in data analysis, then statistics was one of the very few
(even remotely) appropriate fields to work in. In 2013/2014, this is no longer the
case. There are many other exciting data orientated sciences that are
competing [with statistics] for customers, students, jobs and even [our own]
statisticians (Friedman, 2001: 9). Even prominent statisticians are becoming
Cleveland
Change the instruments, and you will change the entire social theory that goes with
them
Source: Latour, 2009 in boyd & Crawford, 2012: 665
although increasingly growing over the last few years educational programmes
finishing with a degree in data science.
This argument corresponds interestingly with the work of Gil Eyal (2013),
sociologist at Columbia University, who argues that the sociology of professions is in
fact a story of the past and the sociology of expertise is a more timely and
comprehensive way to capture the changes of today's world. In his words, sociology of
expertise maintains an analytical distinction between experts and expertise as
two irreducible models of analysis, treating expertise neither as an attribution,
nor a set of skills, but as a network connecting actors, instruments, statements
and institutional arrangements (Eyal, 2013).
Coming from that, a straightforward question seems to be whether it is
the data or the scientist part of the role, which has a stronger influence on
professional identification, or if it is something different? Ibarra suggests that in
assuming new roles, people not only acquire new skills but also adopt the social
norms and rules that govern how they should conduct themselves (Schein, 1978 in
Ibarra, 1999: 765). Practices and social norms of scientists in a lab setting were
already well investigated in a seminal study by Latour and Woolgar in 1979
(1986).
particularly relevant for data scientists as these individuals often transition from
academic/research backgrounds into industry, where the dynamics and
challenges of the environment require different practices. Identities have long
been seen as constructed and negotiated in social interaction (Mead, 1934;
Goffman, 1959) and socialization is not a unilateral process imposing conformity
on the individual. It is a negotiated adaptation by which people strive to improve
identity claims by conveying images that signal how they view themselves or
hope to be viewed by others, but it is unclear to what degree they remain part
of their past, scientific role, and to what degree part of a new form of a data
analyst/lead scientist/consultant role, as is often expected of them. Not without
significance is the self-perception of authenticity; that is, the degree of
congruence between what one feels and communicates in public behaviour
about his or her character or competence (McIntosh 1989, in Ibarra 1999: 778).
For the data scientists, this is an area where the concept of situated learning
and communities of practice falls neatly into place.
V. Communities of practice
Communities of practice (CoP) is a concept developed at the beginning of
the 1990s by Jean Lave and Etienne Wenger, who proposed a new model of
learning, described at the time as situated learning theory (Lave & Wenger,
1991). The concept was a critique of earlier cognitivist theories of learning as
knowledge was said primarily not to be abstract and symbolic, but provisional,
mediated and socially constructed (Berger and Luckmann, 1966; Blackler, 1995).
Situated learning theory positions communities of practice as the context in
which an individual develops the practices - values, norms, relationships - and
identities appropriate to that community.
recognition and the ability to negotiate meaning, but does not necessarily entail
equality, respect or collaboration.
A particularly intriguing aspect of participation is how members of a
community gain status within it.
(1990) suggested that there is a distinction between a core, and peripheries, and
it is through continuous participation that one gains recognition or moves to the
centre. They have, however, deviated slightly from this opinion since then and
acknowledged that participation may involve learning trajectories which do not
lead to a comprehensive full participation (Handley et al., 2006: 644). This is
an important point to note in respect to the interviews from this study.
Another important aspect of CoP is identity. The concept of identity rests
on critical readings of social identity theory (Handley et al., 2006: 664); but,
according to Lave and Wenger (1991), learning is not simply about developing
one's knowledge and practice, but also involves a process of understanding who
we are and in which communities of practice we belong and are accepted. Two
main processes of identity construction in a workplace are identity-regulation
and identity-work. According to Handley et al. (2006: 644), the first process refers to
regulation originating from the organization (e.g. recruitment, induction and
promotion policies) and the employees' individual responses. The second process
of identity-work refers to employees' efforts to form, repair, maintain or revise
their perceptions of self, and this involves a negotiation between the
organization's efforts at identity-regulation (which the employee may, or may
not internalize) and the employee's sense of self, derived from current work as
well as other identities (Handley et al., 2006: 645) - all highly relevant for data
scientists in their working environments.
The third and final aspect of CoP is indeed practice, which according to
Brown and Duguid (2001: 203) is an undertaking or engaging fully in a task, job
or profession. After all, by participating in a community, newcomers develop an
The last two decades have seen significant changes to the way we use
computational tools in modern workspaces (Brynjolfsson & McAfee, 2014;
Pentland, 2014). The Internet, email, cloud services and mobile phones are only
a few manifestations of this phenomenon. Alongside consumer products, big changes
have also been happening in the world of artistic and scientific crafts - namely,
the worlds of design and science.
A good example from design is architects, who are less and less being
trained to be proficient at the drawing board, and instead master the use of
design software such as AutoCAD, Adobe Suite and the like. Another example
is surgeons, who increasingly need to become proficient in using tools that
allow them to conduct distant, robotic surgeries. These changes are also
reaching the sciences. The basic tools of scientific practice have changed too -
in many cases, today, a laptop connected to the Internet and appropriate
research software is enough to pursue multiple scientific inquiries.
This has
in cosmology there are enormous amounts of data to tackle; it's not a controlled
environment and there's pressure on taking as much data as there is possible - that's where
machine learning becomes useful, when you don't have enough information, or when you're
The recurring
scientists playing an active role in not only being the skilled technical
craftsman but also the digital champion or educator of this transition in skills
training.
- Researcher, exploring how people interact with technologies in health and life sciences
what would make a data scientist? exceptional programming skills, use of common
statistical software and an academic background in physical sciences or statistics
data scientists work less on the data collection side, or infrastructure (...) data science
is more about analysis - the front end between insight and data that refers to services and
data
they [data scientists] have to program, do statistics, follow the digital trail and know
what to do at the end of it (...) there is much about experience - humility about the data
in academia you're getting points for being clever, and in industry it works out what you do, or
it doesn't, it's important that you can do things quickly (...) in academia if you show how you did
it, they'll say 'umm... that's just linear regression'
when you leave academia you start learning C++ and Python (...) there seems to be a big
community movement to change tools - catch-up with the great tools of the outside world
This, and other training programmes in data science - e.g. those led by
General Assembly - are part of a larger system of training packages
complemented by earlier established on-line education courses promoted by
academic power-houses such as MIT and Stanford (with a number of on-line or
distance courses in Data Science, Machine Learning and Data Visualization). This
strongly affirms that data science is a career path that did not exist in the past
(at least, not under such a name). It is only in recent years that traditional
organizations of power and educational credibility - organizations such as iSchool
at Berkeley, NYU and Imperial College London - have opened research
programmes in data science and entered this growing field.
One cannot look at the emergence of the data scientist role without the
context, and ongoing evolution of the labour market and novel employment
opportunities associated with socio-economic and technological change. For a
long time, some of the best and financially most rewarding career paths for
students graduating in mathematics, physics, economics and engineering - in
general, quantitative degrees - were in big technology, engineering or financial
organizations.
centres. It goes without saying that post-graduate education - what still remains
at the core of the data scientist's training - has also changed over the years. More
and more PhD training opportunities have been offered at academic institutions
that were later not matched with further post-doc or tenured opportunities in
academia.
empirical backing of the Higgs Boson, in itself was home to hundreds of PhD and
post-doctoral scientists in fields ranging from physics, through to engineering
and computer science.
finished, they may have to consider moving into industry due to the limited
opportunities at other research institutions. This is said to be one of the reasons
why the label data science has found fertile ground - scientists needed to
re-brand themselves for the purpose of industry roles.
Amongst the informants of the study were both experienced individuals
who have gone down the path to becoming a data scientist and individuals
working at the gateways of this role. This provided the study with an interesting
perspective on where and when the transition to data science began, and
whether it could be a sustainable label for self-identification in a working
environment.
I heard about data science for the first time as a PhD student, during an industrial placement
I did at a *major web-company*
- Data Scientist [1], working in industry
I heard for the first time about data science when I was working at *major tech company* 4-5
years ago and tech-companies were starting to recruit data scientists
I heard about data science for the first time when I was doing a PhD,
and people were leaving for industry
I heard about data science probably for the first time with regards to Patil's &
Davenport's article in Harvard Business Review [the sexiest job of the 21st century]
academia is pressured for novelty (...) that's why each life scientist writes his own
code, because everybody in academia needs to come with their own solution, and industry is
more about finding the code that is the most efficient for the given task - and finance is
absolutely the best at it, and has for years been hiring some of the best of the best physicists
- Researcher, exploring how people interact with technologies in health and life sciences
It also seems that the term data scientist has strong origins in the tech-industry, in particular in places such as San Francisco and Seattle (home to the
largest tech-companies). To some degree, this should not be surprising, as data
science emerged from the recruiting practices of companies that really could
pursue Big Data - Microsoft, IBM, Facebook, Twitter, Google etc. - and it was (and
still often is) through the endeavours of these organizations that the term seems to be
receiving so much attention.
They
are the most involved in the transition of scientists into tech jobs, which also
fits the narrative of larger campaigns for STEM education.
- Researcher, exploring how people interact with technologies in health and life sciences
The type of projects which data scientists get involved in - unless they are
in strictly domain-specific areas such as banking, insurance or biological mapping -
often entail complexity that reaches far beyond what a simple computational system could frame, and are within the interplay of a number of dynamic socio-technical systems. Data-driven decision making, which seems to be at the heart
of data science for such areas as public health, transport, or public
services, requires expert knowledge from a range of disciplines and standpoints, often also taking into consideration social, political, scientific, usability
and aesthetic aspects of the developed solutions.
However, inter-disciplinary
I believe data science is more about the science bit, than the data (...) for example,
designers use data from a design perspective, but what they really do is design
Difficulty with
A statistician's way of thinking is being comfortable with uncertainty, and that's often
quite opposite of programmers, who in how they work need proof of correctness
I can plot results in R, MatLab, Gnuplot, when I speak to other statisticians - but
with designers, they need to be briefed in other ways (...) I met creatives who have no idea about
data science and just see it through visualization, but completely lose the science parts, and
that's a completely lost perspective
[Data Science London] This is a meetup for data scientists, data miners, statisticians, data
analysts, data engineers, data architects, data visualizers, data journalists, data science
practitioners, data consultants, academics, researchers, people from science and social
sciences, and in general people directly involved in data projects.
- Meet-up website
The formula of these events is mostly built around a set of data provided
either by a third-party, or by the organizers themselves. This data is, in many
cases, unstructured and requires a certain level of ingenuity and expertise to
make use of it. This data is then available to teams of hackers composed of
data scientists, software engineers, web developers, graphic designers and
others to devise, in most cases, a prototype for a data-product that in some
way would address a need in a novel and impactful way. This is the space where
interdisciplinary work takes place at its most extreme. Participants often do not
know each other before the event and have to form teams through conversation
at the beginning of the meeting. At many of these events, it is reiterated that
group-work leads to the best results, requiring an effort to reach out to people
who are usually outside of one's disciplinary background.
features that were respected. Individuals from different fields had to make an
effort to be accepted into the group, particularly if they did not have the data
literacy and ability to use the appropriate data science tools.
For a
It is
only in the last few years, if not months, that certain institutional frameworks for
recognition and accreditation have begun to be established (e.g. the Insight Data
Science Fellowships, Coursera or edX courses), and it seems quite clear that the
expectation for data science skills flourishes on the demand side of the market,
rather than the supply side. For that reason, it was worth exploring what
the skills and tools so much desired by industry recruiters were:
Search, Solr, Java, Pig, Map Reduce), expert SQL skills, exposure
and understanding of development tools such as Java; predictive
analytics and machine learning packages (and BA/BS in maths/
statistics/machine learning or equivalent)
Source: LinkedIn search results (2014)
operation tools (Map/Reduce, Hadoop, Hive) and analytical software (R, MatLab,
SAS). However, many of the study informants on several occasions underlined
that it is not the knowledge of these tools, per se, that is key to the data
scientist doing his or her job correctly, but the ability to apply them to a data
problem. That is precisely what distinguishes a skilled data scientist from an
aspiring one.
many people come into data science, or machine learning meet-ups, they're provided
with data, they'll run a simple algorithm and say 'this is the result; I'm a data scientist' (...)
that's not how this works
good ones [data scientists] know how to use proper tools for a given context,
others are just enthusiasts
data is now on everyone's mind (...) it's a bit of a frenzy (...) and if you don't use data you're
not innovating
- Researcher, exploring how people interact with technologies in health and life sciences
because it's a newish thing [data science], it seems attractive for executive staff (...) like,
'wow, he does data science' means he's doing innovation
a lot of people are trying to brand themselves as data scientists e.g. a quant wants to find a
job, or an Excel analyst tries to sell himself (...) there is a sense of tribalism, you know - I want
to brand myself as a data-scientist, these are the good network, and things I can pick-up and
progress in my career
VII. Discussion
Research and analysis conducted during this short study leads to a
convincing argument that we might be witnessing an establishment of a new
profession - emerging at the boundaries of engineering, computing and statistics
- sitting within a longer tradition of the evolution of the data analyst role. The
need for a new breed of researchers and data analysts has been expressed on
both sides of the market - amongst the scientific community and the industry - spearheaded mainly by technology companies from the usual US innovation
hubs. This phenomenon fits into the picture of a more universal change
happening in society: that is, the digitization and scientification of work practice.
This process might be often concealed under the messages of increasing
datafication of products, services and policy interventions, and marked by
additional slogans of data-driven analytics or evidence-based decision
making. This is, however, a manifestation of technological advancement and the
increasing consequences that the ICT revolution is having on subsequent areas of
our lives.
This case is also a good example for observing and evaluating how
professional identity is evolving within the community of people calling
themselves, or being labelled as, data scientists. This has much to do with the
attributes, beliefs, motives and experiences that they express and the identities
they construct in social interactions. This is particularly interesting for the data
scientist profession, as it is still in the making. By some it is viewed with
a degree of scepticism, by others it is eagerly taken on, whilst for many it still
remains nebulous. Additionally, the term carries an inbuilt semantic and linguistic
conflict between its parts - at what point was there ever science without data? As
this profession forms out of a stream of scientific training, it fits into the larger
conversation about the making of science and of scientists, science communication
and scientific management. This leaves us with two questions:
group create a new role, and how does the self-perception of authenticity
accord with what one feels and communicates around his or her competencies?
For the self-perceived data scientist, and for the industry recruiters who
also shape the perception of the profession, the role is strongly associated with
an advanced (Master's or PhD) degree in applied quantitative disciplines or
computer science. This is because the role of the data scientist seems to assume
a blend of computing and quantitative skills, backed by practice and
experience in conducting scientific work. This educational background is still
certified mainly by higher education institutions; however, online and
industry-led programmes have also been made available to enrich the training
base. This training is mostly focused on acquiring knowledge of Big
Data tools and their appropriate use in a changing organizational context. In
many cases the tools developed by industry in the last few years are the ones
most associated with the data scientist's role: Hadoop, Cassandra, Map/Reduce,
Hive and Pig all belong to the new generation of Big Data tools. Programming
languages (C++, Python, Java) and analytical tools (MATLAB, Stata
and R) join this group too.
These are the "technical craftsmanship tools" that are expected of data
scientists by the labour market and, to some degree, by the data scientists
themselves. As this research and analysis suggest, it is not the pure knowledge
of these tools that makes one a respected data scientist amongst peers, but the
ability to independently choose the appropriate tools for the given context and
the aptitude to skilfully interpret and communicate the findings. Pure
knowledge of these tools, training, or education does not yet seem to make one
a data scientist. The label is, rather, a consequence of the type of role one is
expected to pursue at the workplace and its organisational title. For example,
there are a number of individuals possessing the above traits who pursue
data scientist tasks but are not labelled as data scientists.
Due to this, the phenomenon corresponds closely with the question of
how the job of the data scientist fits the larger picture of the making of science
and the evolution of data analysis roles. There is a push in the public narrative,
supported by the tech-industry and backed by some policy decisions, that the
universal labels to capture data- and computing-literate individuals who are
comfortable applying the novel tools that the technology sector (and academia)
create to tackle the increasingly complex environment of data-generating
instruments.
A particularly interesting aspect that has emerged over the last few years is
the increasing emphasis on applying machine learning methodologies to vast
data streams. This reflects yet another phenomenon increasingly taking place in
organizations, namely the automation of work and the replacement of either
human physical labour with robotics or human cognitive labour with algorithms.
Taking this forward to data analysis roles, one can recognize a
tendency towards substituting the organisational resources of classical data analysts
(both in science and in industry) with computational solutions, and data
scientists, due to their role and training, are the ones bearing the torch. However,
due to the complexity of many of the issues at the centre of social expectations
- e.g. medical records and public health, environmental sensors and pollution
monitoring, mobility patterns and crisis management - a purely computational,
tech-led approach has often been believed to be misleading.
As in many other cases, domain knowledge is necessary in order to
recognize appropriate questions, frame the design of the research and soundly
interpret the results. For that reason, there is an increasing emphasis on
interdisciplinary skill-sets, creating teams whose members bring different skills
and competences, along with the different attributes, values, practices and
experiences associated with their roles. This can in some cases lead to enhanced
outputs, but in the process it creates tensions resulting from different professional
"philosophies of conduct" and approaches to addressing, interpreting and solving
problems. Data
and the
implications for the systems of knowing, and the meaning of learning. The ways
in which data scientists establish their professional identities - attributes,
beliefs, values, motives and experiences - as the profession grows
might have substantial implications for how decision making is conducted in
industry, business or the public sphere. For that reason, it is critical to make
sure the process of educating data scientists is comprehensive enough to
overcome the interpretative, socio-technical and political limitations of Big Data,
machine learning, and whatever comes next.
science, data scientists and the making of the next generation of data analyst
roles so important for further research.
Bibliography
Abbott, A. (1988). The System of Professions: An Essay on Division of Expert Labour.
Chicago: The University of Chicago Press.
Adams, M., & Kowalski, G. (1980). Professional Self-Identification Among Art Students.
Studies in Art Education vol. 21 no. 3, 31-39.
Anderson, C. (2008, 23 June). The End of Theory: The Data Deluge Makes the Scientific
Method Obsolete. Retrieved from Wired Magazine: http://archive.wired.com/
science/discoveries/magazine/16-07/pb_theory
Ashfort, B., & Humphrey, R. (1993). Emotional Labor in Service Roles: The Influence of
Identity. Academy of Management Review vol. 18 no. 1, 88-115.
Bandura, A. (1977). Self-efficacy: Toward a Unifying Theory of Behavioural Change.
Psychological Review vol. 84, no. 2, 191-215.
BBC. (2010). Joy of Stats (with Prof. Hans Rosling) [Motion Picture].
Becker, S., & Carper, J. (1956). The Development of Identification with an Occupation.
The American Journal of Sociology, 289-298.
Berger, L., & Luckmann, T. (1966). The Social Construction of Reality. New York:
Penguin Books.
Biao, X. (2006). Global "Body Shopping": An Indian Labour System in the Information
Technology Industry. Princeton, NJ: Princeton University Press.
Bijker, W., & Law, J. (2012). Shaping Technology/Building Society: Studies in
Sociotechnical Change. MIT.
Blacker, F. (1995). Knowledge, Knowledge Work and Organizations: An Overview and
Interpretation. Organization Studies, 1021-1046.
Boellstorff, T. (2013). Making big data, in theory. First Monday vol. 18, nr. 10.
boyd, d., & Crawford, K. (2012). Critical Questions For Big Data: Provocations for a
cultural, technological and scholarly phenomenon. Information, Communication
& Society vol. 15, iss. 5, 662-679.
Brown, J., & Duguid, P. (2001). Knowledge and Organization: A Social-Practice
Perspective . Organizational Science 12(2), 198-213.
Brynjolfsson, E., & McAfee, A. (2014). The Second Machine Age: Work Progress and
Prosperity in Time of Brilliant Technologies. New York: W.W. Norton & Company.
Burkholder, L. (1992). Philosophy and the Computer. Boulder, San Francisco and
Oxford: Westview Press.
Cleveland, S. W. (2001). Data Science: an Action Plan for Expanding the Technical Areas
of the Field of Statistics. International Statistical Review, 21-26.
Colleman, G. (2013). Coding Freedom: The Ethics and Aesthetics of Hacking. Princeton:
Princeton University Press.
Data Science Institute. (2014, September 5). Data Science Institute - Events. Retrieved
from Imperial College London: http://www3.imperial.ac.uk/data-science/events
50
Data Science London. (2014, September 5). About @DS_LDN. Retrieved from Data
Science London: http://datasciencelondon.org/data-science-london/
Davenport, T., & Patil, D. J. (2012). Data Scientist: The Sexiest Job of the 21st Century.
Harvard Business Review.
Diebold, F. (2012). "On the Origin(s) and Development of the Term Big Data". Working
Paper - Penn Economics.
Eyal, G. (2013) For a Sociology of Expertise: The Social Origins of the Autism Epidemic.
AJS vol. 118, no 4., 863-907
Forbes. (2014, September 5). Article: Blueprints Of NSA's Ridiculously Expensive Data
Center In Utah Suggest It Holds Less Info Than Thought. Retrieved from Forbes:
http://www.forbes.com/sites/kashmirhill/2013/07/24/blueprints-of-nsa-datacenter-in-utah-suggest-its-storage-capacity-is-less-impressive-than-thought/
Friedman, J. H. (2001). The Role of Statistics in the Data Revolution? International
Statistical Review 69, (1), 5-10.
Gillespie, T. (2014). The relevance of algorithms. In T. Gillespie, & B. P., Media
technologies: Essays on communication, materiality, and society (pp. 167-194).
Cambridge, MA: MIT Press.
Gitelman, L., & Jackson, V. (2013). Introduction. In L. (. Gitelman, "Raw Data" Is an
Oxymoron (pp. 9-23). Cambridge, MA: The MIT Press.
Goffman, E. (1959). The Presentation of Self in Everyday Life. Anchor Books .
Google. (2014, September 5). Company Overview. Retrieved from Google Company:
https://www.google.com/about/company/
Hall, R. (1968). Professionalization and bureacratization. American Sociological Review,
92-104.
Handley, K., Sturdy, A., Finchman, R., & Clark, T. (2006). Within and beyond
communities of practice: Making sense of learning through participation, identity
and practice. Journal of Management Studies, 641-653.
Haraway, D. (1988). Situated Knowledges: The Science Question in Feminism and the
Privilege of Partial Perspective. Feminist Studies vol. 14, no. 3, 575-599.
Harvey, D. (2007). A Brief History of Neoliberalism. Oxford: Oxford University Press.
Ibarra, H. (1999). Provisional selves: Experimenting with image and identity in
professional adaptation. Administrative Science Quaterly vol. 44 iss. 4, 764-791.
IBM. (2014, September 5). Apply new analytics tools to reveal new opportunities.
Retrieved from IBM Smarterplanet: http://www.ibm.com/smarterplanet/us/en/
business_analytics/article/it_business_intelligence.html
Insight Data Science Program. (2014). White Paper. San Francisco: Insight Data Science
Program.
Kelty, C. (2008). Two Bits - The Cultural Significance of Free Software. Durham and
London: Duke University Press.
Krause, E. (1971). The Sociology of Occupations. Boston: Little, Brown and Company.
Kuhn, T. (1962). The Structure of Scientific Revolutions. Chicago: University of Chicago
Press.
51
Latour, B. (1987). Science in Action: How to Follow Scientists and Engineers through
Society. Cambridge, MA: Harvard University Press.
Latour, B., & Woolgar, S. (1986 (1979)). Laboratory Life: The Construction of Scientific
Facts. Princeton, NJ: Princeton University Press.
Latour, B., Jensen, P., Venturini, T., Grauwin, S., & Boullier, D. (2012). The whole is
always smaller than its parts a digital test of Gabriel Tardes' monads. The
British Journal of Sociology vol. 63, iss. 4, 590-615.
Lave, J., & Wenger, E. (2008 (1991)). Communities of Practice: Learning, Meaning, and
Identity. Cambridge University Press.
Levy, S. (1984). Hackers: Heroes of the Computer Revolution. New York : Nerraw
Manijaime/Doubleday.
Lohr, S. (2012, August 11). How Big Data Became So Big. Retrieved from The New York
Times: http://www.nytimes.com/2012/08/12/business/how-big-data-becameso-big-unboxed.html?pagewanted=all&_r=0
Manovich, L. (2011). Trending: The Promises and the Challenges of Big Social Data. In
M. K. Gold, Debates in Digital Humanities. The University of Minnesota Press:
Minneapolis.
Mattman, C. A. (2013). A vision for data science. Nature vol. 493, 473 - 475.
Mayer-Schnberger, V., & Cukier, K. (2013). Big Data: A Revolution That Will
Transform How We Live, Work, and Think. Eamon Dolan/Houghton Mifflin
Harcourt.
McIntosh, P. (1989). Feeling like a fraud: Part II. Stone Center Working Paper no. 37,
Wellsey College.
McKinsey. (2011). Big data: The next frontier for innovation, competition, and
productivity. McKinsey.
Mead, G. H. (1934). Mind, Self, and Society . Chicago: University of Chicago Press.
Miller, D., & Slater, D. (2001). The Internet: An Ethnographic Approach. London:
Bloomsbury Academic.
Moss-Racusin, C., Dovidio, J. F., Brescoll, V., Grahama, M., & Handelsman, J. (2012).
Science facultys subtle gender biases favor male students. Proceedings of the
National Academy of Sciences of the United States of America (vol. 109 no. 41),
16474-16479.
O'Reilly. (2013, September 5). Retrieved from Strata Conference: http://
strataconf.com/
Parks, M. (2014). Big Data in Communication Research: Its Contents and Discontents.
Journal of Communication vol. 64, iss. 2, 355-360.
Parry, R. (2014, September 5). Episteme and Techne. Retrieved from The Stanford
Encyclopedia of Philosophy (Fall 2014 Edition): http://plato.stanford.edu/
archives/fall2014/entries/episteme-techne/
Pentland, A. (2014). Social Physics: How Good Ideas Spread The Lessons From a New
Science. New York: The Penguin Press.
52
53
54