大数据揭秘:数据科学家到底是干什么的?

钢铁苍穹
2016.01.17 00:19

原文链接:Big Data Uncovered: What Does A Data Scientist Really Do?

Big Data Uncovered: What Does A Data Scientist Really Do?
大数据揭秘:数据科学家到底是干什么的?

The world of Big Data and data science can often seem complex or even arcane from the outside looking in. In business, a lot of people by now probably understand the basics of what Big Data analysis involves – collecting the ever-growing amount of data we are generating and using it to come up with meaningful insights. But what does this actually involve on a day-to-day level for the professionals who get their hands dirty with the nuts and bolts?
在外界看来,大数据领域和数据科学常被认为是高深复杂甚至神秘的。商业领域中,很多人现在可能已经了解了大数据分析所包含的基本概念:对我们生成的不断增长的海量数据加以收集,发掘其中具有重要意义的信息。但从事具体研究的专业人士每天究竟都在做些什么呢?

To have a look under the hood of a job that some describe as the 'Sexiest Job Of The 21st Century' I spoke to leading data scientist Dr Steve Hanks to get an overview of what the work of a data scientist actually involves, and what sort of person is likely to be successful in the field.
有人将数据科学家称为“21 世纪最性感的工作”,为了揭开其神秘面纱,我与权威的数据科学家 Steve Hanks 博士进行了交谈,大体了解了数据科学家的工作包括哪些方面,以及哪种人更加适合这个领域。

Dr Hanks gained a PhD in computer science at Yale University, has spent 15 years as a professor of computer science and has worked at companies including Amazon, Yahoo and Microsoft. Today he is chief data scientist at Whitepages.com where he is responsible for overseeing the Contact Graph – a database containing contact information for over 200 million people. The database is searched around two billion times every month and is the company's primary business asset.
Hanks 在耶鲁大学获得计算机科学博士学位,曾担任计算机科学教授 15 年,并先后供职于亚马逊、雅虎和微软等公司。目前他是 Whitepages.com 的首席数据科学家,负责 Contact Graph 的监管工作。Contact Graph 是一个数据库,存储了超过 2 亿人的联系信息。这个数据库每月被搜索约 20 亿次,是该公司主要的业务资产。

This database has driven Whitepages' business since it was launched in 1997, and more recently the company has diversified into app development. Caller ID, its replacement mobile user interface, queries the main Whitepages database to give more complete information on who is calling, and to help cut nuisance and spam calls. The company also generates another revenue stream by providing its data to other companies to use in fraud prevention.
自 1997 年推出以来,这个数据库一直是 Whitepages 的业务驱动力。最近这家公司又开发出一款手机应用 Caller ID,它可替代手机用户界面,通过查询 Whitepages 的主数据库,提供更加完善的来电显示信息,还可以屏蔽骚扰电话和广告电话。此外,这一数据库还扩展出一条新的盈利途径,即为其他公司提供数据以用于预防诈骗。

Key capabilities of a data scientist
数据科学家的关键能力

The term "data scientist" can cover many roles across many industries and organizations, from academia to finance or government. Hanks leads a team of 12 to 15 members responsible for all of the analytics at Whitepages, and their skillsets and duties vary. However, he tells me, there are three key capabilities which every data scientist has to understand.
“数据科学家”这一术语可以代表学术、金融、政府等多种领域和组织中的多种角色。Hanks 所带领的团队有 12 至 15 名成员,他们共同负责 Whitepages 的所有数据分析工作,而各成员的技能和职责则各不相同。不过他告诉我,有三种能力是每个数据科学家必须具备的。

You have to understand that data has meaning
你必须清楚数据是有意义的

Hanks makes the point that we often overlook the fact that data means something and that it is important to understand that meaning. We have to look beyond the numbers and understand what they stand for if we are to gain any valid insights from them. Hanks points out: "It doesn't have anything to do with algorithms or engineering or anything like that. Understanding data is really an art, and it's really important."
Hanks 认为,我们经常忽视一个事实,即任何数据都是有意义的,关键在于理解这些意义。如果想要从数据中提炼出任何有效的信息,我们必须将目光超越数据本身,探寻其所表示的东西。Hanks 指出,这与算法、工程学或类似的技术无关,理解数据实际上是一种艺术,并且非常重要。

You have to understand the problem that you need to solve, and how the data relates to that
你必须清楚自己需要解决的问题以及数据与这些问题的关系

Here is where you open your tool-kit to find the right analytics approaches and algorithms to work with your data. Hanks talks about machine learning, which is very popular right now, but makes the point that there are hundreds of techniques for using data to solve problems (operations research, decision theory, game theory, control theory), all of which have been around for a very long time. Hanks says: "Once you understand the data and you understand the problem you're trying to solve, that's when you can match the algorithm and get a meaningful solution."
这表示你需要从所掌握的技能中找出合适的分析方法和算法来搞定你的数据。Hanks 谈到了当前非常流行的机器学习,他指出使用数据解决问题的方法有几百种之多,如运筹学、决策论、博弈论、控制论等,且这些方法均已出现了很长时间。Hanks 认为,一旦你理解了数据,理解了试图去解决的问题,便能够找到最合适的算法并提供理想的解决方案。

You have to understand the engineering
你必须了解工程学

The third capability is about understanding and delivering the infrastructure required to perform any analysis. In Hanks's words: "It doesn't do any good to solve the problem if you don't have the infrastructure in place to deliver the solution effectively, accurately and at the right time and place."
第三种能力是理解并搭建数据分析工作所需的基础设施。用 Hanks 的话来说:“如果没有相应的基础设施,无法在正确的时间和地点有效而准确地交付解决方案,那么即使解决了问题也毫无用处。”

Being a good data scientist is really about paying attention to all three of those capabilities. You have to pay attention to the data and what it means, understand the problems and know about matching algorithms to those problems, and you have to understand the engineering to come up with solutions.
要成为一名优秀的数据科学家,关键在于同时兼顾以上三种能力。你需要关注数据及其意义,理解问题并知道如何为问题匹配合适的算法,还需要了解工程实现,才能真正拿出解决方案。

At the same time it doesn't mean there's no room for specialization. Hanks makes the point that it is virtually impossible to be an expert in all three of those areas, not to mention all the sub-divisions of each of them. It is okay to specialize in one of these areas as long as you have an appreciation of all of them. Hanks tells me: "Even if you're primarily an algorithm person or primarily an engineer, if you don't understand the problem you're solving and what your data is, you're going to make bad decisions."
然而这并不表示没有专攻某一种能力的余地。Hanks 认为,实际上几乎不可能成为精通全部三个领域的专家,更何况这些领域各自又具有若干分支。只要对这些领域都有所了解,完全可以专门研究其中一个领域。Hanks 告诉我:“即使你以算法研究为主或以工程为主,如果没有理解所要解决的问题,没有搞清楚数据是什么,你就会做出糟糕的决策。”

Key qualities of a data scientist
数据科学家的关键品质

In terms of personal qualities, a curiosity about data is essential, as well as communications skills, says Hanks. "People on my team spend a lot of time talking to customers to figure out what problems they need to solve, or talking to data vendors to find out what they can provide. So you become a middle man and communication is very important."
就个人品质而言,对数据的好奇心是必不可少的,沟通技巧也同样重要。Hanks 说:“我的团队成员会花很长时间与客户沟通,弄清他们需要解决哪些问题,还会与数据供应商交流,了解他们能够提供什么。因此,你成了一个中间人,沟通就变得非常重要。”

Lots of different types of people go into data science, and Hanks explained to me that he was probably not a very typical example. However, in my experience there is no such thing as a typical data scientist. The key capabilities Hanks mentioned cover a broad range of skills, and people of different personality types and mindsets are attracted to the profession.
许许多多不同类型的人从事着数据科学行业,Hanks 对我解释说他可能并不是个很典型的例子。不过以我的经验来看,根本不存在所谓典型的数据科学家。Hanks 提及的关键能力涵盖了广泛的技能,而这个行业也吸引着具有不同个性和思维方式的人们。

"I just really loved the interplay," Hanks says. "From the beginning I was just totally fascinated. My first exposure to data science was probably in operations research, and I just loved the idea that you could take big data sets and use them to learn things, and improve things, and I found out that you really could use them to make a difference. I've found that fascinating for over 30 years now."
“我真的非常喜欢这种相互作用,”Hanks 说,“从一开始我就完全被迷住了。我最初接触数据科学大概是在运筹学领域,你可以利用大型数据集去了解事物、改进事物,这种想法让我深深着迷,并且我还发现,你真的可以利用数据带来改变。30 多年来,我一直为此着迷。”

Even after all that time in the business, though, problems still come up which have him scratching his head, and these serve as a great example of the sort of challenges data scientists find themselves struggling with on a day-to-day basis.
不过,即便在这一领域从业多年,他仍会遇到让自己摸不着头脑的问题,而这些问题正是数据科学家日常所面临挑战的绝佳例子。

"Just this morning I was working on something and one of the algorithms just wasn't doing what it was supposed to do – basically it was showing us a link between a particular person and a particular phone number which we just knew was incorrect. These problems can be very intermittent and very hard to diagnose.
“就在今天早上,我正在处理某项工作,发现其中一个算法没有按预期运行。简单来说,它显示某个人与某个电话号码之间存在关联,而我们明知这种关联是错误的。这类问题可能时有时无,非常难以诊断。

"We have very specific algorithms that are supposed to do very specific things, and when they don't we just have to take them apart and find out why not. The problem is these days they are very complex and have a lot of working pieces! I can be completely mystified, like I am right now … but we will get there – we always do! That's really the sort of challenge we face day to day – systems which just don't behave the way they are supposed to according to our schematics."
“我们有非常具体的算法来处理非常具体的事情,当算法不奏效的时候,我们只能把它们拆开来找出原因。问题是如今的算法非常复杂,由大量相互协作的部件组成!我有时会完全摸不着头脑,就像现在这样……但我们会搞定的,我们一直如此!这就是我们每天面对的挑战:系统并没有按照我们的设计方案运行。”

In the time that Hanks has been working with data he has seen huge changes in the field, from working on structured databases on mainframes, to distributed Hadoop networks, to the cloud-based, real-time data processing world of today. So where does he see the future taking analytics and Big Data?
在 Hanks 从事数据工作的这些年中,他见证了这个领域的巨大变化,从运行于大型机上的结构化数据库,到分布式 Hadoop 网络,再到今天基于云的实时数据处理。那么在他看来,数据分析和大数据的未来将走向何方呢?

The future of data science
数据科学的未来

Hanks sees a future of increased data streaming and real-time data processing, as opposed to huge batch processing of data. He believes that in this new world Hadoop MapReduce is less appropriate, and in his work he is starting to use other tools such as Scala and Akka.
Hanks 认为,未来数据流式处理和实时数据处理将越来越多,而不再是大规模的数据批处理。他相信,在这个崭新的时代,Hadoop MapReduce 将不再那么适用,他在工作中已开始使用 Scala 和 Akka 等其他工具。

One of the biggest challenges Hanks sees is keeping up with the fast development of new technologies and new algorithms. He believes that in order to be an effective data scientist you have to be holistic. It is relatively easy, he says, to become a specialist in MapReduce or a particular machine learning algorithm, but the challenge is keeping up with the general speed of development in data science. "It's a field that is just stunningly big and complex, and has incredible breadth and depth," Hanks tells me. "You have to understand all of the pieces but the field is getting so vast – that's going to be the challenge facing data scientists going into the future."
Hanks 眼中最大的挑战之一是要紧跟新技术和新算法快速发展的步伐。他认为,要成为一名出众的数据科学家,必须要具有全局观。他相信,成为 MapReduce 或某一机器学习算法领域的专家相对容易,更大的挑战在于紧跟数据科学整体的发展速度。“这是个非常庞大而复杂的领域,有着令人难以置信的广度和深度,”Hanks 告诉我,“你必须了解所有的组成部分,但这个领域正变得如此庞大,这将是数据科学家未来所面临的挑战。”


译者注:今年我的第一篇翻译练笔,希望读者多多批评指正。手感逐渐恢复中...