Startup Tools: NoSQL Databases

Big data is transforming the world. Here you will learn data mining and machine learning techniques to process large datasets and extract valuable knowledge from them.

The book

The book is based on Stanford Computer Science course CS246: Mining Massive Datasets (and CS345A: Data Mining).

The book, like the course, is designed at the undergraduate computer science level with no formal prerequisites. To support deeper explorations, most of the chapters are supplemented with further reading references.

The Mining of Massive Datasets book has been published by Cambridge University Press. You can get a 20% discount here.

By agreement with the publisher, you can download the book for free from this page. Cambridge University Press does, however, retain copyright on the work, and we expect that you will obtain their permission and acknowledge our authorship if you republish parts or all of it.

We welcome your feedback on the manuscript.

The MOOC (Massive Open Online Course)

We are launching an online course based on the Mining Massive Datasets book:

Mining Massive Datasets MOOC

The course starts September 29, 2014, and will run for 9 weeks, with 7 weeks of lectures. Additional information and registration.

The 2nd edition of the book (v2.1)

The following is the second edition of the book. There are three new chapters, on mining large graphs, dimensionality reduction, and machine learning. There is also a revised Chapter 2 that treats map-reduce programming in a manner closer to how it is used in practice.

Together with each chapter there is also a set of lecture slides that we use for teaching the Stanford CS246: Mining Massive Datasets course. Note that the slides do not necessarily cover all the material covered in the corresponding chapters.

Download the latest version of the book as a single big PDF file (511 pages, 3 MB).

Download the full version of the book with a hyper-linked table of contents that makes it easy to jump around: PDF file (513 pages, 3.69 MB).

Note to the users of provided slides: We would be delighted if you find our material useful for giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org/.

Comments and corrections are most welcome. Please let us know if you are using these materials in your course and we will list and link to your course.

Stanford big data courses

CS246

CS246: Mining Massive Datasets is a graduate-level course that discusses data mining and machine learning algorithms for analyzing very large amounts of data. The emphasis is on MapReduce as a tool for creating parallel algorithms that can process very large amounts of data.

CS341

CS341: Project in Mining Massive Data Sets is an advanced project-based course. Students work on data mining and machine learning algorithms for analyzing very large amounts of data. Both interesting big datasets and computational infrastructure (a large MapReduce cluster) are provided by the course staff. Generally, students take CS246 first, followed by CS341.

CS341 is generously supported by Amazon, which gives us access to its EC2 platform.

CS224W

CS224W: Social and Information Networks is a graduate-level course that covers recent research on the structure and analysis of large social and information networks, and on models and algorithms that abstract their basic properties. The class explores how to practically analyze large-scale network data and how to reason about it through models of network structure and evolution.

You can take Stanford courses!

If you are not a Stanford student, you can still take CS246 as well as CS224W or earn a Stanford Mining Massive Datasets graduate certificate by completing a sequence of four Stanford Computer Science courses. A graduate certificate is a great way to keep the skills and knowledge in your field current. More information is available at the Stanford Center for Professional Development (SCPD).

Supporting materials

If you are an instructor interested in using the Gradiance Automated Homework System with this book, start by creating an account for yourself here. Then, email your chosen login and the request to become an instructor for the MMDS book to [email protected]. You will then be able to create a class using these materials. Manuals explaining the use of the system are available here.

Students who want to use the Gradiance Automated Homework System for self-study can register here. Then, use the class token 1EDD8A1D to join the "omnibus class" for the MMDS book. See The Student Guide for more information.

Previous versions of the book

Version 1.0

The following materials are equivalent to the published book, with errata corrected to July 4, 2012.

Download the book as published here (340 pages, 2 MB).

About the Course

You have probably heard that this is the era of "Big Data". Stories about companies or scientists using data to recommend movies, discover who is pregnant based on credit card receipts, or confirm the existence of the Higgs boson regularly appear in Forbes, The Economist, The Wall Street Journal, and The New York Times. But how does one turn data into this type of insight? The answer is data analysis and applied statistics. Data analysis is the process of finding the right data to answer your question, understanding the processes underlying the data, discovering the important patterns in the data, and then communicating your results to have the biggest possible impact. There is a critical shortage of people with these skills in the workforce, which is why Hal Varian (Chief Economist at Google) says that being a statistician will be the sexy job for the next 10 years.

This course is an applied statistics course focusing on data analysis. The course will begin with an overview of how to organize, perform, and write up data analyses. Then we will cover some of the most popular and widely used statistical methods, like linear regression, principal components analysis, cross-validation, and p-values. Instead of focusing on mathematical details, the lectures will be designed to help you apply these techniques to real data using the R statistical programming language, interpret the results, and diagnose potential problems in your analysis. You will also have the opportunity to critique and assist your fellow classmates with their data analyses.
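
As a concrete taste of that workflow, here is a minimal sketch of fitting a linear regression and checking it with cross-validation. The course itself uses R; this stand-in uses Python with NumPy and scikit-learn, and the data is synthetic rather than a course dataset.

    # A minimal sketch of the fit-and-validate workflow described above.
    # The course itself uses R; this illustration uses Python instead,
    # and the dataset here is synthetic, not a course dataset.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))          # three made-up predictors
    y = X @ np.array([1.5, -2.0, 0.0]) + rng.normal(scale=0.5, size=100)

    model = LinearRegression()
    # 5-fold cross-validation estimates out-of-sample R^2 instead of
    # trusting the in-sample fit, which guards against overfitting.
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print("Per-fold R^2:", np.round(scores, 3))
    print("Mean R^2:", round(scores.mean(), 3))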

Recommended Background

Some familiarity with the R statistical programming language (http://www.r-project.org/) and proficiency in writing in English will be useful. At Johns Hopkins, this course is taken by first-year graduate students in Biostatistics.

Course Format

The course will consist of lecture videos broken into 8-10 minute segments. There will be two major data analysis projects that will be peer-graded with instructor quality control. Course grades will be determined by the data analyses, peer reviews, and bonus points for answering questions on the course message board.

FAQ

  • How is this course different from "Computing for Data Analysis"?

    This course will focus on how to plan, carry out, and communicate analyses of real data sets. While we will cover the basics of how to use R to implement these analyses, the course will not cover specific programming skills. Computing for Data Analysis will cover some statistical programming topics that will be useful for this class, but it is not a prerequisite for the course.

  • What resources will I need for this class?

    A computer with internet access on which the R software environment can be installed (recent Mac, Windows, or Linux computers are sufficient).

  • Do I need to buy a textbook?

    There is no standard textbook for data analysis. The course lectures will include pointers to free resources about specific statistical methods, data sources, and other tools for data analysis.

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
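
As a concrete illustration of that programming model, here is a self-contained sketch of the word-count pattern that Hadoop parallelizes across a cluster. Run locally, it simulates the map, shuffle/sort, and reduce steps in a single process; the function names are ours for illustration, not part of the Hadoop API.

    # A self-contained sketch of the MapReduce word-count pattern.
    # Hadoop runs the map and reduce phases in parallel across a cluster
    # and handles the shuffle/sort between them; this toy version does
    # everything in one process to show the data flow.
    from collections import defaultdict

    def map_phase(lines):
        # Map: emit a (word, 1) pair for every word in every input line.
        for line in lines:
            for word in line.split():
                yield word, 1

    def reduce_phase(pairs):
        # Shuffle/sort: group values by key, as the framework does
        # between the map and reduce phases.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        # Reduce: sum the counts for each word.
        return {word: sum(counts) for word, counts in groups.items()}

    if __name__ == "__main__":
        sample = ["the quick brown fox", "the lazy dog", "the fox"]
        print(reduce_phase(map_phase(sample)))
        # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}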

The project includes these modules:

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Other Hadoop-related projects at Apache include:

  • Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heat maps) and the ability to view MapReduce, Pig, and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro™: A data serialization system.
  • Cassandra™: A scalable multi-master database with no single points of failure.
  • Chukwa™: A data collection system for managing large distributed systems.
  • HBase™: A scalable, distributed database that supports structured data storage for large tables.
  • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout™: A scalable machine learning and data mining library.
  • Pig™: A high-level data-flow language and execution framework for parallel computation.
  • ZooKeeper™: A high-performance coordination service for distributed applications.

To get started, begin here:

  1. Learn about Hadoop by reading the documentation.
  2. Download Hadoop from the release page.
  3. Discuss Hadoop on the mailing list.

Please head to the releases page to download a release of Apache Hadoop.

A wide variety of companies and organizations use Hadoop for both research and production. Users are encouraged to add themselves to the Hadoop PoweredBy wiki page.

15 October, 2013: release 2.2.0 available

Apache Hadoop 2.x reaches GA milestone! Full information about this milestone release is available at Hadoop Releases.

25 August, 2013: release 2.1.0-beta available

Apache Hadoop 2.x reaches beta milestone! Full information about this milestone release is available at Hadoop Releases.

27 December, 2011: release 1.0.0 available

Hadoop reaches 1.0.0! Full information about this milestone release is available at Hadoop Releases.

March 2011 - Apache Hadoop takes top prize at Media Guardian Innovation Awards

Described by the judging panel as a "Swiss army knife of the 21st century", Apache Hadoop picked up the innovator of the year award for having the potential to change the face of media innovations.

See The Guardian web site

January 2011 - ZooKeeper Graduates

Hadoop's ZooKeeper subproject has graduated to become a top-level Apache project.

Apache ZooKeeper can now be found at http://zookeeper.apache.org/

September 2010 - Hive and Pig Graduate

Hadoop's Hive and Pig subprojects have graduated to become top-level Apache projects.

Apache Hive can now be found at http://hive.apache.org/

Pig can now be found at http://pig.apache.org/

May 2010 - Avro and HBase Graduate

Hadoop's Avro and HBase subprojects have graduated to become top-level Apache projects.

Apache Avro can now be found at http://avro.apache.org/

Apache HBase can now be found at http://hbase.apache.org/

July 2009 - New Hadoop Subprojects

Hadoop is getting bigger!

  • Hadoop Core is renamed Hadoop Common.
  • MapReduce and the Hadoop Distributed File System (HDFS) are now separate subprojects.
  • Avro and Chukwa are new Hadoop subprojects.

See the summary descriptions for all subprojects above. Visit the individual sites for more detailed information.

March 2009 - ApacheCon EU

In case you missed it.... ApacheCon Europe 2009

November 2008 - ApacheCon US

In case you missed it.... ApacheCon US 2008

July 2008 - Hadoop Wins Terabyte Sort Benchmark

One of Yahoo's Hadoop clusters sorted 1 terabyte of data in 209 seconds, beating the previous record of 297 seconds in the annual general-purpose (Daytona) terabyte sort benchmark. This is the first time that either a Java program or an open source program has won.

RICON West, Basho's distributed systems conference, was a huge success. It featured a closing keynote from Google Fellow Jeff Dean and speakers from The Weather Channel, Netflix, Twitter, and more. You can check out videos from RICON West and past RICON conferences on Basho's YouTube channel.

Basho is a distributed systems company that makes Riak, an open source distributed database, and Riak CS, open source cloud storage software. Both are architected for high availability, fault tolerance, and linear scalability. They are used by businesses ranging from gaming to mobile to health, including over 25 percent of the Fortune 50. Basho's solutions are ideal for companies that must always be able to store and access critical data. Common use cases include powering Web, mobile, and social applications; collecting client-facing monitoring data; and building public and private cloud platforms.

Multi-Datacenter Replication, available with Riak Enterprise and Riak CS Enterprise, allows data to be replicated easily across locations. Designed to maximize availability, replication lets enterprises predetermine the physical locations of specific data, addressing regulatory compliance and improving the end-user experience through low latency.

A Database for the Web

CouchDB is a database that completely embraces the web. Store your data with JSON documents. Access your documents with your web browser, via HTTP. Query, combine, and transform your documents with JavaScript. CouchDB works well with modern web and mobile apps. You can even serve web apps directly out of CouchDB. And you can distribute your data, or your apps, efficiently using CouchDB's incremental replication. CouchDB supports master-master setups with automatic conflict detection.

CouchDB comes with a suite of features, such as on-the-fly document transformation and real-time change notifications, that make web app development a breeze. It even comes with an easy-to-use web administration console, served up (you guessed it) directly out of CouchDB! We care a lot about distributed scaling: CouchDB is highly available and partition tolerant, but is also eventually consistent. And we care a lot about your data: CouchDB has a fault-tolerant storage engine that puts the safety of your data first.
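
As a small illustration of the HTTP interface described above, the sketch below creates a database, stores a JSON document, defines a JavaScript map/reduce view, and queries it, using only Python's standard library. It assumes a local CouchDB listening on the default port 5984 with no authentication enabled; the database, document, and view names are made up.

    # A sketch of CouchDB's plain-HTTP interface. Assumes a local
    # CouchDB on the default port 5984 with no authentication; the
    # database, document, and view names here are made up.
    import json
    import urllib.request

    BASE = "http://localhost:5984"

    def couch(method, path, body=None):
        # Every CouchDB operation is an HTTP request carrying JSON.
        data = json.dumps(body).encode() if body is not None else None
        req = urllib.request.Request(
            BASE + path, data=data, method=method,
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    print(couch("PUT", "/example_db"))                  # create a database
    print(couch("PUT", "/example_db/doc1",              # store a document
                {"type": "greeting", "text": "hello"}))
    print(couch("GET", "/example_db/doc1"))             # read it back

    # Views are JavaScript map (and optional reduce) functions stored in
    # a design document; CouchDB runs them server-side.
    design = {"views": {"by_type": {
        "map": "function(doc) { emit(doc.type, 1); }",
        "reduce": "_count"}}}
    print(couch("PUT", "/example_db/_design/app", design))
    print(couch("GET", "/example_db/_design/app/_view/by_type?group=true"))

Replication goes through the same interface: a POST to /_replicate naming a source and a target database kicks it off.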

See the introduction, technical overview, or one of the guides for more information.

Want to Contribute?

CouchDB is an open source project. And that means everything, from this website to the core of the database itself, has been contributed by passionate, helpful individuals (such as yourself) who liked what they saw and wanted to make something better of it. So if you like what you see, and want to make something better of it, we'd like to see your contributions. It doesn't matter how familiar you are with CouchDB, or whether you know how to program Erlang. There are plenty of ways to be helpful! Just ask!

Our community is our most valuable asset, and it could always do with a bit more love and attention. One of the first things you should do is actually use CouchDB: get to know it, read about it, evangelise it, and engage with the wider community. Arrange a meetup, give a talk, publicize your use, and let people know how you use CouchDB in the wild. You could also get stuck in on the user mailing list or IRC channel, helping new users with their problems. Or come join us on the developer mailing list and let us know how else you think you can help. There's bound to be someone to point you in the right direction.

Why don't you check out JIRA and help us triage some of those issues? Or maybe you'd like to help us keep the wiki up-to-date? If you're looking for something a little more technical, you could help us with our documentation, QA, packaging, mobile, or release efforts, perhaps? Just drop by on the developer mailing list and let us know what you want to do. There's enough room for any sort of contribution!

Do you want to contribute code? Great! There's lots of stuff to work on. Don't know Erlang? Join the Erlang list, and learn you some Erlang in a friendly environment! You can use JIRA to find easy, medium, and hard issues to work on. Or, if you'd prefer, just open a new issue, and attach your patch. Don't want to use JIRA? Fork us on GitHub and send a pull request. Why don't you check out the contributor workflow guide?

Download CouchDB 1.5.0
