The Industrial Internet: The Future Is Cloudy
This is the fourth post of a five-part weekly series exploring the Industrial Internet in the New Industrial Age. If you missed the beginning of the series, catch up by reading The Journey to the Industrial Internet, The Industrial Internet: The Future Is Big and The Industrial Internet: The Future Is Healthy.
Today, we gather and store very large amounts of data, and over time data sets grow so large that they become unmanageable. What does that mean exactly? It means that simply asking for data could return result sets that exceed local memory, crashing the very hardware we use to analyze the data. A simple query like “Show me all the jet engines that produced high exhaust temperatures while also showing intake airflow restrictions” could return terabytes of data. That is more data than a typical PC can handle: it would exhaust virtual memory and then crash the box.
Now the upper end of today’s data sets is so large that a new technology is needed to allow us to process the data to get meaning and insight. Enter: Hadoop®.
Hadoop is a technology developed by the large dot-com companies, specifically designed to address the issue of handling large data sets. After all, if you want to analyze the click-stream data for all users in the world, you need to store and process a massive amount of data. The fundamental idea behind Hadoop is to cluster large numbers of commodity computers together to act as one. Data is ingested into a Hadoop cluster and broken down into small packets (Hadoop calls them blocks) that are then replicated across nodes in the cluster. The assumption is that storage attached directly to each machine, rather than network-attached storage (think SAN), is cheaper but less reliable, so keeping copies of each packet on several computers (three copies is the default) gives you fault tolerance via redundancy in case any one node crashes. It also lets more than one computer serve requests for the same data, providing built-in load balancing.
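To make the idea of packets and replicas concrete, here is a minimal sketch in Python. It is not Hadoop's actual placement logic: the function names, the node names, the tiny block size and the simple round-robin placement are all assumptions made for illustration.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is a simplification chosen for this sketch;
    a real cluster uses a more careful policy.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def surviving_blocks(placement: dict[int, list[str]], failed_node: str) -> set[int]:
    """Return the blocks that still have at least one copy after a node fails."""
    return {
        block
        for block, holders in placement.items()
        if any(node != failed_node for node in holders)
    }


# Hypothetical example: a 1,000-byte "file", 256-byte blocks, five nodes.
data = b"x" * 1000
blocks = split_into_blocks(data, 256)
nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
placement = place_blocks(len(blocks), nodes, replication=3)
```

With three copies of every block, losing any single node in this sketch still leaves every block readable, which is the fault tolerance described above.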
But Hadoop is not a database; rather, it is a massively scalable file system that provides elastic scale. It is also a batch processing tool (think mainframe) for doing analysis. What does “elastic scale” mean? It means that adding more capacity is as simple as adding a new node to the cluster. It means that the cluster understands that a new node is available. It automatically takes advantage of the new hardware, making the system “bigger.”
Hadoop is a batch processing tool (think warehouse). Ingesting data is somewhat time consuming: a file is broken into smaller pieces, and those pieces are then replicated across many nodes. It is also very expensive to update or change a file, because the entire file and all its packets must be deleted from every node in the cluster that holds them. Then the file, with the changes, is re-ingested, which is another time-consuming operation.
Querying for data is also expensive, because the first thing you have to do is figure out where the data is. Remember that the file has been broken up into packets. A technique called “map-reduce” is used to map the query to the right nodes in the cluster (where the data resides) and then to execute the query or operation there. The data is replicated across many nodes, so the request may go out to all nodes and the first one with the answer wins. Of course, there is no guarantee that a query’s reach is just one data file, so the query may get broken down into many queries and distributed (mapped) to many nodes. The results in this case need to be assembled (reduced) before being returned to the user.
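The map and reduce steps can be sketched in a few lines of Python, using the jet engine question from earlier. This is a single-process illustration of the pattern, not Hadoop's API: the thresholds, the record layout, the node names and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical thresholds for "high exhaust temperature" and "restricted airflow".
EXHAUST_LIMIT_C = 900.0
AIRFLOW_FLOOR = 0.8  # fraction of nominal intake airflow


def map_partition(readings):
    """Map step: runs where the data lives and emits (engine_id, 1)
    for every reading that matches both conditions."""
    for engine_id, exhaust_c, airflow in readings:
        if exhaust_c > EXHAUST_LIMIT_C and airflow < AIRFLOW_FLOOR:
            yield engine_id, 1


def reduce_counts(mapped):
    """Reduce step: assemble the per-node outputs into one total per engine."""
    totals = defaultdict(int)
    for engine_id, count in mapped:
        totals[engine_id] += count
    return dict(totals)


# Each node holds its own slice of the readings: (engine_id, exhaust_c, airflow).
partitions = {
    "node-a": [("E1", 950.0, 0.70), ("E2", 850.0, 0.90)],
    "node-b": [("E1", 920.0, 0.75), ("E3", 910.0, 0.95)],
    "node-c": [("E2", 930.0, 0.60)],
}

# In a real cluster the map calls would run in parallel on separate machines;
# here they run one after another in a single process.
mapped = [pair for readings in partitions.values() for pair in map_partition(readings)]
result = reduce_counts(mapped)
```

Only the small per-engine totals travel back to the caller. The raw readings never leave the nodes that store them.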
You can see that ingesting, changing and querying for data is time consuming. In fact, it can take 30 seconds just to stage the request for data. So why use it? Well, the reason is that with all the data chunked into smaller pieces, and with the use of tens, hundreds or even thousands of nodes, you can process extremely large data sets, leveraging parallel processing power. Given that ability, it makes sense to pass the analytic to the data rather than query for data to process on a client machine. Doing so centralizes the processing and puts all the nodes in the cluster to work on the computation. Rather than asking for data, you send the processing to the data and get back results instead of large data sets.
All this is well and good, but a Hadoop cluster is not your typical IT network. It is the stuff that clouds are made of. A cloud is simply elastic computing and elastic storage. Clouds can be public or private. This means that you can use a third-party cloud, like Amazon's or Microsoft's, or a company can create its own internal cloud that leverages the same technology. So you see, the future is cloudy.
Note: This post is part four of a five-week series examining the journey of the Industrial Internet.