Exploring the cosmos for the better part of half a century has generated voluminous stores of varied data. While that data was collected remotely, strides have been made back on Earth: cloud platforms can now elastically store, visually describe, and analyze these volumes of data. As space agencies plan ambitious new missions, JPL is taking small steps to unleash our data, visually educating and enabling our engineers to tackle our future with a data-driven understanding of the present. This talk will discuss how JPL is building a new data science team that uses cloud-based visual analytics to resolve mysteries whose questions and answers lie hidden in our data.
Exploring with the Big Data Scientist's Toolbox
In this hands-on tutorial, you’ll learn and apply the tools and techniques that data scientists use to explore big data for new information. Starting small, we’ll use Apache Nutch to collect data from the internet. We’ll then use cloud computing and Hadoop to start analyzing 100+ terabytes of data from the Common Crawl corpus, iterating through progressively smarter analyses.
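To give a flavor of the Hadoop processing step, here is a minimal Hadoop Streaming job in Python — the word-count "hello world" one might run over Common Crawl text on Elastic MapReduce. This is an illustrative sketch, not the tutorial's actual exercise; the `mapper`/`reducer` function names and the argv-based mode switch are choices made here for clarity.

```python
#!/usr/bin/env python
"""Minimal Hadoop Streaming word-count job (illustrative sketch).
Hadoop Streaming pipes input splits to this script on stdin and
collects tab-separated key/value pairs from stdout."""
import sys


def mapper(lines):
    """Emit a (word, 1) pair for every whitespace-delimited token."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(pairs):
    """Sum counts per word. Pairs must arrive grouped by key,
    which Hadoop's shuffle-and-sort phase guarantees."""
    current, total = None, 0
    for word, count in pairs:
        if word == current:
            total += count
        else:
            if current is not None:
                yield current, total
            current, total = word, count
    if current is not None:
        yield current, total


if __name__ == "__main__":
    # Streaming runs the same script as mapper and reducer;
    # here an argv flag picks the role ("map" is the default).
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    if mode == "map":
        for word, count in mapper(sys.stdin):
            print("%s\t%d" % (word, count))
    else:
        split = (line.rstrip("\n").split("\t") for line in sys.stdin)
        for word, total in reducer((w, int(c)) for w, c in split):
            print("%s\t%d" % (word, total))
```

Locally, the same pipeline can be simulated with `cat input.txt | python job.py map | sort | python job.py reduce`, which is a handy way to debug before paying for a cluster.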
In the second half of the day we’ll discuss what it means to be a data scientist. You’ll learn to apply the IPython stack to develop new analytical techniques, including sentiment analysis of unstructured text. We’ll discuss the importance of visualization and apply techniques that let our data speak for itself with tools like D3. We’ll conclude with methods for applying the day’s work to datasets of any size.
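As a taste of the sentiment-analysis portion, a lexicon-based scorer can be sketched in a few lines of plain Python. The `POSITIVE`/`NEGATIVE` word lists below are made up for illustration; real analysis would use a full sentiment lexicon or a trained model rather than this toy.

```python
"""Toy lexicon-based sentiment scorer (illustrative sketch only).
Counts positive and negative words and returns their balance."""
import re

# Tiny illustrative lexicons; a real analysis would use a published
# sentiment lexicon or a trained classifier instead.
POSITIVE = {"good", "great", "excellent", "love", "happy"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "sad"}


def sentiment(text):
    """Score text in [-1, 1]: +1 if all sentiment-bearing words are
    positive, -1 if all are negative, 0 if balanced or absent."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    hits = pos + neg
    return 0.0 if hits == 0 else (pos - neg) / float(hits)
```

Run as a Hadoop Streaming mapper over crawled pages, a scorer like this turns terabytes of raw text into a small table of scores — exactly the kind of reduced output that D3 can then visualize.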
Throughout the day, we’ll use Amazon’s cloud infrastructure, including Elastic MapReduce (EMR), EC2, and S3. Attendees should come with a computer, an Amazon Web Services account, and working knowledge of Python.
- Big Data Discussion
- Internet Data and Nutch
- Common Crawl Corpus
- The Hadoop Stack
- Processing with Hadoop
- Data Science Discussion
- Sentiment Analysis
- Data Visualization
- Integrating what we’ve learned