Big Data

What is Big Data?

  • Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone.
  • Gartner defines Big Data as high volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
  • According to IBM, 80% of data captured today is unstructured, from sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. All of this unstructured data is Big Data. 

Big data spans three dimensions: Volume, Velocity and Variety.

Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes - even petabytes - of information.
Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value.
Variety: Big data is any type of data - structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together.

What does Hadoop solve?

  • Organizations are discovering that important predictions can be made by sorting through and analyzing Big Data.
  • However, since 80% of this data is "unstructured", it must be formatted (or structured) in a way that makes it suitable for data mining and subsequent analysis.
  • Hadoop is the core platform for structuring Big Data, and solves the problem of making it useful for analytics purposes.

The Importance of Big Data and What You Can Accomplish

The real issue is not that you are acquiring large amounts of data. It's what you do with the data that counts. The hopeful vision is that organizations will be able to take data from any source, harness relevant data and analyze it to find answers that enable 1) cost reductions, 2) time reductions, 3) new product development and optimized offerings, and 4) smarter business decision making. For instance, by combining big data and high-powered analytics, it is possible to:
  • Determine root causes of failures, issues and defects in near-real time, potentially saving billions of dollars annually.
  • Optimize routes for many thousands of package delivery vehicles while they are on the road.
  • Analyze millions of SKUs to determine prices that maximize profit and clear inventory.
  • Generate retail coupons at the point of sale based on the customer's current and past purchases.
  • Send tailored recommendations to mobile devices while customers are in the right area to take advantage of offers.
  • Recalculate entire risk portfolios in minutes.
  • Quickly identify customers who matter the most.
  • Use clickstream analysis and data mining to detect fraudulent behavior.

Challenges

Many organizations are concerned that the amount of amassed data is becoming so large that it is difficult to find the most valuable pieces of information.
  • What if your data volume gets so large and varied you don't know how to deal with it?
  • Do you store all your data?
  • Do you analyze it all?
  • How can you find out which data points are really important?
  • How can you use it to your best advantage?
Until recently, organizations have been limited to using subsets of their data, or they were constrained to simplistic analyses because the sheer volumes of data overwhelmed their processing platforms. But, what is the point of collecting and storing terabytes of data if you can't analyze it in full context, or if you have to wait hours or days to get results? On the other hand, not all business questions are better answered by bigger data. You now have two choices:
  • Incorporate massive data volumes in analysis. If the answers you're seeking will be better provided by analyzing all of your data, go for it. High-performance technologies that extract value from massive amounts of data are here today. One approach is to apply high-performance analytics to analyze the massive amounts of data using technologies such as grid computing, in-database processing and in-memory analytics.
  • Determine upfront which data is relevant. Traditionally, the trend has been to store everything (some call it data hoarding) and only when you query the data do you discover what is relevant. We now have the ability to apply analytics on the front end to determine relevance based on context. This type of analysis determines which data should be included in analytical processes and what can be placed in low-cost storage for later use if needed.

Technologies

A number of recent technology advancements enable organizations to make the most of big data and big data analytics:
  • Cheap, abundant storage.
  • Faster processors.
  • Affordable open source, distributed big data platforms, such as Hadoop.
  • Parallel processing, clustering, MPP, virtualization, large grid environments, high connectivity and high throughputs.
  • Cloud computing and other flexible resource allocation arrangements.
The goal of all organizations with access to large data collections should be to harness the most relevant data and use it for better decision making.

Three Enormous Problems Big Data Tech Solves

But what’s less commonly talked about is why Big Data is such a problem beyond size and computing power. The reasons behind the conversation are the truly interesting part and need to be understood. Here you go…there are three trends that are driving the discussion and should be made painfully clear instead of lost in all the hype:
  • We’re digitizing everything. This is big data’s volume and comes from unlocking hidden data from common things all around us that were known before but weren’t quantified, stored, compared and correlated. Suddenly, there’s enormous value in the patterns of what was recently hidden from our view. Patterns offer understanding and a chance for prediction of what will happen next. These each are important and together are remarkably powerful.
  • There’s no time to intervene. This is big data’s velocity. All of that digital data creates massive historical records but also rich streams of information that are flowing constantly. When we take the patterns discovered in historical information and compare it to everything happening right now, we can either make better things happen or prevent the worst. This is revenue generating and life saving and all of the other wonderful things we hear about, but only if we have the systems in place to see it happening in the moment and do something about it. We can’t afford enough human watchers to do this, so the development of big data systems is the only way to get to better things when the data gives humans insufficient time to intervene.
  • Variation creates instability. This is big data’s variety. Data was once defined by what we could store and relate in tables of columns and rows. A world that’s digitized ignores those boundaries and is instead full of both structured and unstructured data. That creates a very big problem for systems that were built upon the old definition, which comprise just about everything around us. Suddenly, there’s data available that can’t be consumed or generated by a database. We either ignore that information or it ends up in places and formats that are unreadable to older systems. Gone is the ability to correlate unstructured information with that vast historical (but highly structured) data. When we can’t analyze and correlate well, we introduce instability into our world. We’re missing the big picture unless we build systems that are flexible and don’t require reprogramming the logic for every unexpected (and there will be many) change.
There you have it… The underlying reasons that big data matters and isn’t just hype (though there’s plenty of that). The digitization, lack of time for intervention, and instability that big data creates lead us to develop whole new ways of managing information that go well beyond Hadoop and distributed computing. It’s why big data presents such an enormous challenge and opportunity for software vendors and their customers, but only if these three challenges are the drivers and not opportunism.


BI vs. Big Data vs. Data Analytics By Example

Business Intelligence (BI) encompasses a variety of tools and methods that can help organizations make better decisions by analyzing “their” data. Therefore, Data Analytics falls under BI. Big Data, if used for the purpose of Analytics, falls under BI as well.

Let’s say I work for the Center for Disease Control and my job is to analyze the data gathered from around the country to improve our response time during flu season. Suppose we want to know about the geographical spread of flu for the last winter (2012). We run some BI reports and they tell us that the state of New York had the most outbreaks. Knowing that information, we might want to better prepare the state for the next winter. These types of queries examine past events, are the most widely used, and fall under the Descriptive Analytics category.
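As a toy illustration of a descriptive query like this, the sketch below aggregates reported flu cases per state with pandas. The file name and the "state"/"cases" columns are hypothetical placeholders, not part of the original example.

```python
import pandas as pd

# Hypothetical extract of reported flu cases for the 2012 winter.
flu = pd.read_csv("flu_cases_2012.csv")          # columns: state, cases, week

# Total reported cases per state, highest first (a descriptive query).
outbreaks_by_state = (
    flu.groupby("state")["cases"]
       .sum()
       .sort_values(ascending=False)
)
print(outbreaks_by_state.head())                 # e.g., New York at the top
```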

Now, we just purchased an interactive visualization tool and I am looking at a map of the United States depicting the concentration of flu in different states for the last winter. I click on a button to display the vaccine distribution. There it is; I visually detect a direct correlation between the intensity of the flu outbreak and the late shipment of vaccines. I notice that the shipments of vaccine for the state of New York were delayed last year. This gives me a clue to further investigate the case to determine if the correlation is causal. This type of analysis falls under Diagnostic Analytics (discovery).
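The same diagnostic check can also be done numerically. The sketch below is a rough illustration, assuming hypothetical per-state tables of outbreak intensity and vaccine shipment delay; the file and column names are made up.

```python
import pandas as pd

outbreaks = pd.read_csv("outbreak_intensity_2012.csv")   # columns: state, intensity
shipments = pd.read_csv("vaccine_shipments_2012.csv")    # columns: state, delay_days

merged = outbreaks.merge(shipments, on="state")

# A strong positive correlation here is only a clue, not proof of causation.
print(merged["intensity"].corr(merged["delay_days"]))
```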

We go to the next phase which is Predictive Analytics. PA is what most people in the industry refer to as Data Analytics. It gives us the probability of different outcomes and it is future-oriented. The US banks have been using it for things like fraud detection. The process of distilling intelligence is more complex and it requires techniques like Statistical Modeling. 

Back to our example: I hire a Data Scientist to help me create a model and apply the data to the model in order to identify causal relationships and correlations as they relate to the spread of flu for the winter of 2013. Note that we are now talking about the future. I can use my visualization tool to play around with some variables such as demand, vaccine production rate, quantity… to weigh the pluses and minuses of different decisions insofar as how to prepare for and tackle the potential problems in the coming months.
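For flavor, here is a deliberately simplified predictive-modeling sketch using scikit-learn. The training file, feature names and target are hypothetical placeholders; a real CDC model would involve far richer data and methods.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical historical data: one row per state and season.
hist = pd.read_csv("flu_history.csv")
X = hist[["production_rate", "delay_days", "demand"]]
y = hist["next_season_cases"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))        # crude goodness-of-fit check

# "What if" exploration: vary the inputs and inspect the predicted outcome.
print(model.predict([[0.9, 3, 120000]]))
```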

The last phase is Prescriptive Analytics, which is to integrate our tried-and-true predictive models into our repeatable processes to yield desired outcomes. An automated risk reduction system based on real-time data received from the sensors in a factory would be a good example of a use case.

Finally, here is an example of Big Data.
Suppose it’s December 2013 and it happens to be a bad year for the flu epidemic. A new strain of the virus is wreaking havoc, and a drug company has produced a vaccine that is effective in combating the virus. But the problem is that the company can’t produce it fast enough to meet demand. Therefore, the Government has to prioritize its shipments. Currently the Government has to wait a considerable amount of time to gather the data from around the country, analyze it, and take action. The process is slow and inefficient. The contributing factors include: not having computer systems fast enough to gather and store the data (velocity), not having computer systems that can accommodate the volume of data pouring in from all of the medical centers in the country (volume), and not having computer systems that can process images, e.g., x-rays (variety).
  
Big Data technology changed all of that. It solved the velocity-volume-variety problem. We now have computer systems that can handle “Big Data”. The Center for Disease Control may receive the data from hospitals and doctors’ offices in real time, and Data Analytics software that sits on top of the Big Data computer system could generate actionable items that give the Government the agility it needs in times of crisis.



Big Data Technology in Financial Services

 

The Financial Services Industry is among the most data-driven of industries. The regulatory environment that commercial banks and insurance companies operate within requires these institutions to store and analyze many years of transaction data, and the pervasiveness of electronic trading has meant that Capital Markets firms both generate and act upon hundreds of millions of market-related messages every day. For the most part, financial services firms have relied on relational technologies coupled with business intelligence tools to handle this ever-increasing data and analytics burden. It is, however, increasingly clear that while such technologies will continue to play an integral role, new technologies – many of them developed in response to the data analytics challenges first faced in e-commerce, internet search and other industries – have a transformative role in enterprise data management.
Consider a problem faced by every top-tier global bank: In response to new regulations, banks need to have a ‘horizontal view’ of risk within their trading arms. Providing this view requires banks to integrate data from different trade capture systems, each with their own data schemas, into a central repository for positions, counterparty information and trades. It’s not uncommon for traditional ETL-based approaches to take several days to extract, transform, cleanse and integrate such data. Regulatory pressure, however, dictates that this entire process be done many times every day. Moreover, various risk scenarios need to be simulated, and it’s not uncommon for the simulations themselves to generate terabytes of additional data every day. The challenge outlined is not only one of sheer data volumes but also of data variety, and of the timeliness in which such varied data needs to be aggregated and analyzed.

Now consider an opportunity that has largely remained unexploited: As data driven as financial services companies are, analysts estimate that somewhere between 80 and 90 percent of the data that banks have is unstructured, i.e., in documents and in text form. Technologies that enable businesses to marry this data with structured content present an enormous opportunity for improving business insight for financial institutions. Take for example, information stored in insurance claim systems. Much valuable information is captured in text form. The ability to parse text information and combine the extracted information with structured data in the claims database will not only enable a firm to provide a better customer experience, it also may enhance their fraud detection capabilities.

The above scenarios were used to illustrate a few of the challenges and potential opportunities in building a comprehensive data management vision. These and other data management related challenges and opportunities have been succinctly captured and classified by others under the ‘Four Vs’ of data – Volume, Velocity, Variety and Value.
The visionary bank needs to deliver business insights in context, on demand, and at the point of interaction by analyzing every bit of data available. Big Data technologies comprise the set of technologies that enable banks to deliver on that vision. To a large extent, these technologies are made feasible by the rising capabilities of commodity hardware, the vast improvements in storage technologies, and the corresponding fall in the price of computing resources. Given that most literature on Big Data relegates established technologies such as the RDBMS to the ‘has been’ heap, it is important that we stress that relational technologies continue to play a central role in data management for banks, and that Big Data technologies augment the current set of data management technologies used in banks. Later sections of this paper will expand on this thought and explain how relational technology is positioned in the Big Data technology continuum.
This paper broadly outlines Oracle’s perspective on Big Data in Financial Services starting with key industry drivers for Big Data. Big Data comprises several individual technologies, and the paper outlines a framework to uncover these component technologies, then maps those technologies to specific Oracle offerings, and concludes by outlining how Oracle solutions may address Big Data patterns in Financial Services.

What is Driving Big Data Technology Adoption in Financial Services?

There are several use cases for big data technologies in the financial services industry, and they will be referred to throughout the paper to illustrate practical applications of Big Data technologies. In this section we highlight three broad industry drivers that accelerate the need for Big Data technology in the Financial Services Industry.

Customer Insight

Up until a decade or so ago, it may be said that banks, more than any other commercial enterprise, owned the relationship with consumers. A consumer’s bank was the primary source of the consumer’s identity for all financial, and many non-financial, transactions. Banks were in firm control of the customer relationship, and the relationship was for all practical purposes as long-term as the bank wanted it to be. Fast forward to today, and the relationship is reversed. Consumers now have transient relationships with multiple banks: a current account at one that charges no fees, a savings account with a bank that offers high interest, a mortgage with one offering the best rate, and a brokerage account at a discount brokerage. Moreover, even collectively, financial institutions no longer monopolize a consumer’s financial transactions. New entrants – peer-to-peer services and the PayPals, Amazons, Googles and Walmarts of the world – have had the effect of disintermediating the banks. Banks no longer have a complete view of their customers’ preferences, buying patterns and behaviors. This problem is exacerbated by the fact that social networks now capture very valuable psychographic information – the consumer’s interests, activities and opinions.
The implication is that even if banks manage to integrate information from their own disparate systems, which in itself amounts to a gargantuan task, a fully customer-centric view may not be attained. Gaining a fuller understanding of a customer’s preferences and interests is a prerequisite for ensuring that banks can address customer satisfaction and for building more extensive and complete propensity models. Banks must therefore bring in external sources of information, information that is often unstructured. Valuable customer insight may also be gleaned from customer call records, customer emails and claims data, all of which are in textual format. Bringing together transactional data in CRM systems and payments systems, and unstructured data both from within and outside the firm, requires new technologies for data integration and business intelligence to augment the traditional data warehousing and analytics approach. Big Data technologies therefore play a pivotal role in enabling customer centricity in this new reality.

Regulatory Environment

The spate of recent regulations is unprecedented for any industry. Dodd-Frank alone adds hundreds of new regulations that affect banking and securities industries. For example, these demands require liquidity planning and overall asset and liability management functions to be fundamentally rethought. Point-in-time liquidity positions currently provided by static analysis of relevant financial ratios are no longer sufficient, and a more near real-time view is being required. Efficient allocation of capital is now seen as a major competitive advantage, and risk-adjusted performance calculations require new points of integration between risk and finance subject areas. Additionally, complex stress tests, which put enormous pressure on the underlying IT architecture, are required with increasing frequency and complexity. On the Capital Markets side, regulatory efforts are focused on getting a more accurate view of risk exposures across asset classes, lines of business and firms in order to better predict and manage systemic interplays. Many firms are also moving to a real-time monitoring of counterparty exposure, limits and other risk controls. From the front office all the way to the boardroom, everyone is keen on getting holistic views of exposures and positions and of risk-adjusted performance.

Explosive Data Growth

Perhaps the most obvious driver is that financial transaction volumes are growing, leading to explosive data growth in financial services firms. In Capital Markets, the pervasiveness of electronic trading has led to a decrease in the value of individual trades and an increase in the number of trades. The advent of high-turnover, low-latency trading strategies generates considerable order flow and an even larger stream of price quotes. Complex derivatives are complicated to value and require several data points to help determine, among other things, the probability of default, the value of LIBOR in the future, and the expected date of the next ‘Black Swan’ event. In addition, new market rules are forcing the OTC derivative market – the largest market by notional value – toward an electronic environment.
Data growth is not limited to capital markets businesses. The Capgemini/RBS Global Payments study for 2011 estimates that the global volume for electronic payments is about 260 billion and growing between 15 and 22% for developing countries. As the devices that consumers can use to initiate core transactions proliferate, so too does the number of transactions they make. Not only is the transaction volume increasing, the data points stored for each transaction are also expanding. In order to combat fraud and to detect security breaches, weblog data from banks’ Internet channels, geospatial data from smartphone applications, etc., have to be stored and analyzed along with core operations data. Up until the recent past, fraud analysis was usually performed over a small sample of transactions, but increasingly banks are analyzing entire transaction history data sets. Similarly, the number of data points for loan portfolio evaluation is also increasing in order to accommodate better predictive modeling.


Technology Implications

The technology ramifications of the broad industry trends outlined above are:

More data and more different data types: Rapid growth in structured and unstructured data from both internal and external sources requires better utilization of existing technologies and new technologies to acquire, organize, integrate and analyze data.

More change and uncertainty: Pre-defined, fixed schemas may be too restrictive when combining data from many different sources, and rapidly changing needs imply schema changes must be allowed more easily.

More unanticipated questions: Traditional BI systems work extremely well when the questions to be asked are known. But business analysts frequently don’t know all the questions they need to ask. The self-service ability to explore data, add new data, and construct analysis as required is an essential need for banks driven by analytics.

More real-time analytical decisions: Whether it is a front office trader or a back office customer service rep, business users demand real-time delivery of information. Event processors, real-time decision making engines and in-memory analytical engines are crucial to meeting these demands.

The Big Data Technology Continuum

So how do we address the technology implications summarized in the previous section? The two dimensional matrix below provides a convenient starting, albeit incomplete, framework for decomposing the high-level technology requirements for managing Big Data. The figure below depicts, along the vertical dimension, the degree to which data is structured: Data can be unstructured, semi-structured or structured. The second dimension is the lifecycle of data: Data is first acquired and stored, then organized and finally analyzed for business insight. But before we dive into the technologies, a basic understanding of key terminology is in order.

We define the structure in ‘structured data’ in alignment with what is expected in relational technologies – that the data may be organized into records identified by a unique key, with each record having the same number of attributes, in the same order. Because each record has the same number of attributes, the structure or schema need be defined once as metadata for the table, and the data itself need not have metadata embedded in it.

Semi-structured data also has structure, but the structure can vary from record to record. Records in semi-structured data are sometimes referred to as jagged records because each record may have a variable number of attributes and because the attributes themselves may be compound constructs, i.e., made up of sub-attributes as in an XML document. Because of the variability in structure, metadata for semi-structured data has to be embedded within the data: e.g., in the form of an XML schema or as name-value pairs that describe the names of attributes and their respective values, within the record. If the data contains tags or other markers to identify names and the positions of attributes within the data, the data can be parsed to extract these name-value pairs.
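A tiny sketch of what "jagged" records look like in practice, using JSON purely as an illustration (the records and field names are made up): each record carries its own name-value metadata, and records need not share the same attributes.

```python
import json

records = [
    '{"id": 1, "type": "claim", "amount": 1200.0}',
    '{"id": 2, "type": "claim", "amount": 800.0, "notes": "water damage", '
    '"adjuster": {"name": "J. Doe", "region": "NY"}}',
]

for raw in records:
    rec = json.loads(raw)                 # the metadata travels with the data
    for name, value in rec.items():       # name-value pairs, possibly nested
        print(rec["id"], name, value)
```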


By unstructured data, we mean data whose structure does not conform to the two other classifications discussed above. Strictly speaking, unstructured text data usually does have some structure -- e.g., the text in a call center conversation record has grammatical structure -- but the structure does not follow a record layout, nor are there any embedded metadata tags describing attributes. Of course, before unstructured data can be used to yield business insights, it has to be transformed into some form of structured data. One way to extract entities and relationships from unstructured text data is by using natural language processing (NLP). NLP extracts parts of speech such as nouns, adjectives and subject-verb-object relationships; commonly identifiable things such as places, company names, countries, phone numbers, products, etc.; and can also identify and score sentiments about products, people, etc. It’s also possible to augment these processors by supplying a list of significant entities to the parser for named entity extraction.
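As an illustration of the entity-extraction step, the sketch below uses spaCy as one example NLP library; this is not a product named by the text, and the small English model ("en_core_web_sm") must be installed separately. The sample call-center note is invented.

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the model has been downloaded

note = ("Customer in Albany called on March 3rd about a delayed claim "
        "payment of $2,400 from Acme Insurance.")

doc = nlp(note)
for ent in doc.ents:
    print(ent.text, ent.label_)      # places, dates, money, organizations, etc.
```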

However, these are not ‘either/or’ technologies. They are to be viewed as part of a data management continuum: each technology enjoys a set of distinct advantages depending on the phase in the lifecycle of data management and on the degree of structure within data it needs to handle, and so these technologies work together within the scope of an enterprise architecture.
The two points below are expanded on further along in this section, but they are called out here for emphasis:
The diagram does not imply that all data should end up in a relational data warehouse before analysis may be performed. Data needs to be organized for analysis, but the organized data may reside on any suitable technology for analysis.
As the diagram only uses two dimensions for decomposing the requirements, it does not provide a complete picture. For example, the diagram may imply that structured data is always best handled in a relational database. That’s not always the case, and the section on handling structured data explains what other technologies may come into play when we consider additional dimensions for analysis.

Handling Unstructured Data

Unstructured data within the bank may be in the form of claims data, customer call records, content management systems, emails and other documents. Content from external sources such as Facebook, Twitter, etc., is also unstructured. Often, it may be necessary to capture such unstructured data first before processing the data to extract meaningful content. File systems, of course, can handle any type of data as they simply store data. Distributed file systems are file systems architected for high performance and scalability. They exploit parallelism that is made possible because these file systems are spread over several physical computers (from tens to a few thousand nodes). Data captured in distributed file systems must later be organized (reduced, aggregated, enriched, and converted into semi-structured or structured data) as part of the data lifecycle.
Dynamic indexing engines are a relatively new class of databases in which no particular schema is enforced or defined. Instead, a ‘schema’ is dynamically built as data is ingested. In general, they work somewhat like web search engines in that they crawl over the data sources they are pointed at, extracting significant entities and establishing relationships between these entities using natural language parsing or other text mining techniques. The extracted entities and relationships are stored as a graph structure within the database. These engines therefore simultaneously acquire and organize unstructured data.

Handling Semi-Structured Data

Semi-structured data within the bank may exist as loan contracts, in derivatives trading systems, as XML documents and HTML files, etc. Unlike unstructured data, semi-structured data contains tags to mark significant entity values contained within it. These tags and corresponding values are key-value pairs. If the data is in such a format that these key-value pairs need to be extracted from within it, it may need to be stored on a distributed file system for later parsing and extraction into key-value databases. Key-value stores are one in a family of NoSQL database technologies -- some others being graph databases and document databases -- which are well suited for storing semi-structured data. Key-value stores do not generally support complex querying (joins and other such constructs) and may only support information retrieval using the primary key and, in some implementations, an optional secondary key. Key-value stores, like the file systems described in the previous section, are also often partitioned, enabling extremely high read and write performance. But unlike distributed file systems, where data can be written and read in large blocks, key-value stores support high performance for single-record reads and writes only.
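The toy sketch below illustrates the access pattern just described (not any particular product’s API): records are written and read one at a time by key, and a hash of the key decides which partition, or node, owns the record.

```python
import hashlib

NUM_NODES = 3
nodes = [dict() for _ in range(NUM_NODES)]    # stand-ins for physical nodes

def node_for(key: str) -> dict:
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[h % NUM_NODES]               # partition chosen by key hash

def put(key: str, value: dict) -> None:
    node_for(key)[key] = value                # single-record write

def get(key: str) -> dict:
    return node_for(key)[key]                 # single-record read by primary key

put("trade:10045", {"desk": "rates", "notional": 5_000_000})
print(get("trade:10045"))                     # no joins, no ad hoc queries
```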
That these newer non-relational systems offer extreme scale and/or performance is accepted. But this advantage comes at a price. As data is spread across multiple nodes for parallelism, there is an increased likelihood of node failures, especially when cheaper commodity servers are used to reduce the overall system cost. In order to mitigate the increased risk of node failures, these systems replicate data on two or often three nodes. The CAP theorem put forward by Prof. Eric Brewer states that such systems have to choose two from among the three properties of Consistency, Availability and Partition tolerance. Most implementations choose to sacrifice Consistency, the C in ACID, thereby redefining themselves as BASE systems (Basically Available, Soft-state, Eventually consistent).

Handling Structured Data

Banks have applications that generate many terabytes of structured data and have so far relied almost exclusively on relational technologies for managing this data. However, the Big Data technology movement has risen partly from the limitations of relational technology, and the most serious limitation may be summed up as: relational technologies were engineered to handle needs that are not always required. For example, relational systems can handle complex querying needs and they adhere to strict ACID properties. These capabilities are not always required, but because they are always “on”, there is an overhead associated with relational systems that sometimes constrains other, more desired properties such as performance and scalability. To make the argument more concrete, let’s take an example scenario: it wouldn’t be unusual for a medium-to-large-size bank to generate 5-6 terabytes of structured data in modeling the exposure profiles of its counterparties using Monte Carlo simulations (assuming 500,000 trades and 5,000 scenarios). Much more data would be generated if stress tests were also performed. What’s needed is a database technology that can handle huge data volumes with extremely fast read (by key) and write speeds. There is no need for strict ACID compliance; availability needs are lower than in, say, a payment transaction system; there are no complex queries to be executed against this data; and it would be more efficient for the application that generates the data (the Monte Carlo runs) to have local data storage. Although the data is structured, a relational database may not be the optimal technology here. Perhaps a NoSQL database, a distributed file system, or even a data grid (or some combination of technologies) may be faster and more cost effective in this scenario.
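To make that access pattern concrete, here is a drastically scaled-down sketch of the workload. The trade and scenario counts and the "pricing model" are placeholders; the point is simply that every result is written and read back by a (trade, scenario) key, with no joins or ad hoc queries.

```python
import random

N_TRADES, N_SCENARIOS = 1_000, 100     # far below the 500,000 x 5,000 cited above
results = {}                           # stand-in for a key-value / NoSQL store

for trade_id in range(N_TRADES):
    for scenario_id in range(N_SCENARIOS):
        exposure = max(0.0, random.gauss(1_000_000, 250_000))   # toy "model"
        results[(trade_id, scenario_id)] = exposure             # write by key

# Reads are also by key; strict ACID guarantees and complex SQL are not needed.
print(results[(42, 7)])
```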

While relational technologies may be challenged in meeting some of these demands, the model benefits tremendously from its structure. These technologies remain the best way to organize data in order to quickly and precisely answer complex business questions, especially when the universe of such questions is known. They remain the preferred technology for systems that have complex reporting needs. And for applications such as core banking and payments, where ACID properties and reliability are must-haves, few other technologies meet the demands of running these mission-critical, core systems.
Moreover, many limitations of relational technology implementations, such as scale and performance, are addressed in specific implementations of the technology, and we discuss the Oracle approach to extending the capabilities of the Oracle Database implementation – both in terms of its ability to scale and its ability to handle different types of data – in the next section.


Adding New Dimensions to the Decomposition Framework

In the previous sections we used the two dimensions shown in Figure 1 to uncover the technologies required to handle big data needs. In this section we outline two additional dimensions that can be used for further decomposition of technology requirements.

Handling Real-Time Needs

Real-time risk management, in-process transactional fraud prevention, security breach analytics, real-time customer care feedback and cross-selling analytics all necessitate the acquisition, organization and analysis of large amounts of data in real time. The rate at which data arrives and the timeliness in responding to incoming data is another dimension we may apply to the technology requirements decomposition framework. Acquiring data at extremely fast rates and performing analysis on such data in real time requires a different set of technologies than those previously discussed.
The three most common technologies for handling real-time analytics are Complex Event Processors, in-memory Distributed Data Grids and in-memory Databases. Complex Event Processing (CEP) engines provide a container for analytical applications that work on streaming data (market data, for example). Unlike in databases, queries are evaluated against the data in memory, continuously, as data arrives into the CEP engine. CEP engines are therefore an essential component of event-driven architectures. CEP engines have found acceptance in the front office for algorithmic trading, etc., but they have wide applicability across all lines of business and even in retail banking: for detecting events in real time as payment or core banking transactions are generated.
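The bare-bones sketch below is meant only to convey the CEP idea, not any specific engine’s API: the "query" is a rule over a sliding time window, evaluated in memory every time an event arrives. The payment events and the threshold rule are invented for illustration.

```python
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(seconds=60)
window = deque()                               # recent (timestamp, amount) events

def on_payment(ts: datetime, amount: float) -> None:
    window.append((ts, amount))
    while window and ts - window[0][0] > WINDOW:
        window.popleft()                       # expire events older than the window
    # Example rule: flag more than 5 payments within one minute (a toy threshold).
    if len(window) > 5:
        print(f"ALERT at {ts}: {len(window)} payments in the last minute")

now = datetime.now()
for i in range(8):
    on_payment(now + timedelta(seconds=5 * i), 99.0)
```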
Distributed data grid technologies play a complementary role to CEP engines. Data grids not only store data in memory (usually as objects, i.e., in semi-structured form), they also allow distributed processing on data in memory. More sophisticated implementations support event-based triggers for processing and MapReduce-style processing across nodes. Using a distributed data grid, transactions can be collected and aggregated in real time from different systems, and processing can be done on the aggregated set in real time. For example, a bank could integrate transactions from different channels in a distributed data grid and have real-time analytics run on the collected set for a superior multi-channel experience.

Reducing Data for Analysis

A guiding principle at the heart of the Big Data movement is that all data is valuable and that all data must be analyzed to extract business insight. But not all data sets contain an equal amount of value, which is to say that the value-density or "signal-to-noise ratio" of data sets within the bank differs. For example, data from social media feeds may be less value-dense than data from the bank’s CRM system. Value-density of data sets provides us with another dimension to apply to the framework for decomposing technology requirements for data management. For example, it may be far more cost effective to first store less value-dense data such as social media streams on a distributed file system than in a relational database. As you move from left to right in the framework, low value-density data should be aggregated, cleaned and reduced to high value-density data that is ready to be analyzed. This is not to say that the diagram implies that all data should eventually be stored in a relational data warehouse for analysis and visualization. Data can be organized in place: the output of a MapReduce process on Hadoop may remain in the HDFS file system itself, and analytical tools for Big Data should be able to present outputs from analysis of data residing in both non-relational and relational systems.
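The single-process sketch below illustrates the map-then-reduce pattern behind that reduction step (a real job would run distributed across a Hadoop cluster): noisy, low value-density posts are distilled into a small, high value-density aggregate. The post texts and product list are invented for illustration.

```python
from collections import Counter

posts = [
    "loving the new platinum card from mybank",
    "mybank mortgage rates are terrible",
    "switched my savings to mybank platinum card",
]

PRODUCTS = {"platinum card", "mortgage", "savings"}

def map_post(post):
    for product in PRODUCTS:
        if product in post:
            yield (product, 1)                 # map: emit key-value pairs

def reduce_counts(pairs):
    totals = Counter()
    for product, count in pairs:
        totals[product] += count               # reduce: aggregate by key
    return totals

pairs = (pair for post in posts for pair in map_post(post))
print(reduce_counts(pairs))                    # e.g., mention counts per product
```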

Challenges in Storing and Processing Data

The challenges that are involved in storing and processing of the data are listed below: 

Large volumes of data:

It is estimated that the data in the world amounts to around 2 zettabytes (10^21 bytes). The social networking giant Facebook hosts 10 billion photos, taking up one petabyte of storage. The logs generated by the Facebook site itself occupy terabytes of storage per day.

Processing: 

How to process the petabytes of data? 

RDBMS systems are capable of storing data on the order of gigabytes to terabytes, but they are not built for processing huge amounts of data.

What about grid computing? 

In grid computing, the data is stored on a single SAN storage system. For processing, the data is moved to N machines (the grid), where the computation is done. The problems with grid computing are:

  • A lot of data movement over the network between the storage and the computation machines. Moving just 1 TB of data can take around two hours (see the quick calculation below).
  • The developer has to write programs to handle the mechanics of data flow, coordination between the computing machines, and machine failures.
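As a back-of-the-envelope check of that data-movement figure, assuming roughly a 1 Gbps network link (about 125 MB/s of raw throughput):

```python
TB_IN_MB = 1_000_000            # decimal units, for simplicity
link_mb_per_s = 125             # ~1 Gbps

seconds = TB_IN_MB / link_mb_per_s
print(f"{seconds / 3600:.1f} hours to move 1 TB")   # roughly 2.2 hours
```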
Unstructured Data: 

An RDBMS can store only structured data in the form of tables; it imposes a schema on the data. You cannot store unstructured data in an RDBMS system.

Transfer rate: 

This is the rate at which you can read data from the disk. Around the 1990s, the size of a typical hard disk was about 1.2 GB and the transfer rate was about 4.4 MB/s, so reading the entire drive took around five minutes.

Now, in 2013, we have disks of size 1 TB with transfer rates of 100 to 150 MB/s. Reading the whole 1 TB of data from the disk takes around two and a half hours.

You can observe that there has been drastic growth in storage capacity but not in transfer rate. Two and a half hours is a long time to read one terabyte of data from disk; a quick check of the arithmetic follows.
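The same arithmetic, using the figures above, shows why full-drive scans keep getting slower even as drives get bigger (at 150 MB/s the second figure lands closer to the two-and-a-half hours quoted):

```python
drives = [
    ("~1990 drive: 1.2 GB @ 4.4 MB/s", 1_200, 4.4),
    ("2013 drive: 1 TB @ 100 MB/s", 1_000_000, 100),
]

for label, size_mb, rate_mb_per_s in drives:
    minutes = size_mb / rate_mb_per_s / 60
    print(f"{label}: about {minutes:.0f} minutes to read the full drive")
    # prints roughly 5 minutes for the old drive, ~167 minutes for the new one
```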

Seek time: 

Seeking is the process of moving the disk’s head to a particular place on the disk. Seek time is improving more slowly than the transfer rate.

In an RDBMS system, if you want to read a row in a table of 1 billion rows (assuming there are no indexes on the table), the system has to find the row by reading the data file from the start. This is where seek time causes problems.

Velocity: 

The rapid growth of data is causing data storage issues. An RDBMS system scales up (vertically), meaning you have to replace the hard disk with a bigger one to accommodate the new data.

 
