“I am Forrest Gump, I have a toothbrush, I have a lot of data and I scrub,” is how Josh Wills, data scientist at Cloudera, whimsically described his profession to John Furrier and Dave Vellante inside theCube, live from Strata 2013. He added that while he thinks of himself mostly as a mathematician, a data scientist is a lot like a data janitor.
John Furrier pointed out that data is now part of the developer community and asked which tools are best for scrubbing data. Wills explained that when it comes to developer tools, the conversation could be described as a “religious debate,” since the choice depends largely on each developer’s preference. Python, R and SAS are all good scripting languages, he said, and his personal choices are the first two. “Some kind of scripting language” is a basic tool for a data scientist, but there is no generally adopted best tool.
Turning to unstructured data that requires coding to work with, the need to analyze multiple data sets, and the available solutions, Wills expressed a preference for in-memory tools such as Spark and SAS, which provide a great way of exploring data. On sampling, he said larger samples are preferable to smaller ones, especially when preparing data sets for other people to analyze.
John Furrier asked about existing collaborative tools for data science and how they support teamwork, whether through the cloud or other vehicles. While such tools would be a great idea, Wills pointed out that nothing worth mentioning exists in this direction. He explained that although an inter-office, global collaboration solution is out of the question at this point, a lightweight tool allowing people in the same office to collaborate would be very useful. A tool that lets data scientists in one location share data analysis and data set preparation would be a great starting point.
One of the defining qualities of a data scientist is being relentless, Wills said. “If the tool does not answer my question, I google another tool.” A question without an answer is unacceptable to a data scientist.
Sharing the projects he works on at Cloudera and is excited about, Wills said he is currently involved in simplifying data science and making it easy to use, so that machine learning techniques become available to a general audience. A programmer or a statistician could then easily use data science in their daily work.
Josh Wills | Strata Data Conference 2013