Big Data Use Cases: In this tutorial, we will discuss real-life case studies of Big Data, Hadoop, Apache Spark, and Apache Flink. It covers several assorted big data use cases where industry uses different Big Data tools (like Hadoop, Spark, Flink, etc.) to solve specific problems, so you can learn from real case studies how to tackle Big Data challenges. Flink use cases are mostly aimed at real-time analytics, Apache Spark use cases focus on computationally intensive machine learning workloads, and Apache Hadoop use cases concentrate on handling massive volumes of data efficiently.
Big Data Use Cases
This segment of the tutorial provides a detailed description of real-life big data use cases and big data applications in various domains:
Credit Card Fraud Detection
With so many people using credit cards nowadays, it has become essential to protect them from fraud. Identifying whether a requested transaction is fraudulent has become a major task for credit card companies.
A credit card transaction takes roughly 2-4 seconds to complete. Companies therefore need an innovative solution that can identify potentially fraudulent transactions within this short window and protect their customers from becoming victims.
An unusual number of clicks from the same IP address, or a pattern in the access times, is the most obvious and easily identified form of click fraud, yet it is surprising how many fraudsters still use this method, particularly for quick attacks. They may prefer to strike over a long weekend, when they figure you are not watching your log files carefully, clicking on your advertisement repeatedly so that when you return to work on Tuesday your account is significantly drained. Part of this activity can even be unintentional, as when a user simply reloads a page.
Similarly, if you make a transaction in Hyderabad today and the very next minute there is a transaction from your card in Malaysia, chances are that the second transaction is fraudulent and was not done by you. Companies therefore need to process the data in real time (Data in Motion analytics, DIM), analyze it against the individual's history within a very short period, and decide whether the transaction is fraudulent. Based on the severity, the company can then accept or decline the transaction.
To handle such data streams, we need streaming engines like Apache Flink. A streaming engine can consume real-time data streams very efficiently and process the data with low latency (minimal delay).
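As a rough illustration of this kind of check, the sketch below flags a transaction when the card would have had to travel impossibly fast between two purchases. It is plain Python rather than Flink, and the city coordinates, field names, and speed threshold are all illustrative assumptions; a production system would apply a rule like this inside a streaming job against each incoming event.

```python
from datetime import datetime, timedelta

# Illustrative threshold: faster than a commercial airliner is "impossible".
IMPOSSIBLE_SPEED_KMH = 900

# Hypothetical city coordinates (latitude, longitude) for the sketch.
CITY_COORDS = {
    "Hyderabad": (17.38, 78.48),
    "Kuala Lumpur": (3.14, 101.69),
}

def distance_km(city_a, city_b):
    # Rough flat-earth approximation (1 degree ~ 111 km);
    # good enough for a fraud heuristic sketch.
    lat1, lon1 = CITY_COORDS[city_a]
    lat2, lon2 = CITY_COORDS[city_b]
    return (((lat1 - lat2) ** 2 + (lon1 - lon2) ** 2) ** 0.5) * 111

def is_suspicious(prev_txn, new_txn):
    """Flag the new transaction if the card would have had to travel
    impossibly fast since the previous transaction."""
    hours = (new_txn["time"] - prev_txn["time"]).total_seconds() / 3600
    if hours <= 0:
        return True  # same instant in two places is always suspicious
    speed = distance_km(prev_txn["city"], new_txn["city"]) / hours
    return speed > IMPOSSIBLE_SPEED_KMH

t0 = datetime(2024, 1, 1, 12, 0)
prev_txn = {"city": "Hyderabad", "time": t0}
new_txn = {"city": "Kuala Lumpur", "time": t0 + timedelta(minutes=1)}
print(is_suspicious(prev_txn, new_txn))  # a 1-minute hop across countries -> True
```

In a real deployment the per-card "previous transaction" would live in the streaming engine's keyed state, updated as each event arrives.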
Sentiment Analysis
Sentiment analysis uncovers the meaning behind social data. A crucial task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature level: whether the opinion expressed in a document, a sentence, or about an individual feature is positive, negative, or neutral. Advanced sentiment classification looks, for instance, at emotional states such as “angry,” “sad,” and “happy.”
In sentiment analysis, text is processed to identify and understand customers' feelings and attitudes toward brands or topics in online discussions, i.e., what they think about a specific product or service, whether they are pleased with it or not, and so on.
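To make the positive/negative/neutral idea concrete, here is a minimal lexicon-based classifier in plain Python. The word lists are tiny illustrative assumptions; real systems use large lexicons or trained models, but the classification step follows the same logic.

```python
# Toy sentiment lexicons -- illustrative only.
POSITIVE = {"love", "great", "pleased", "happy", "excellent"}
NEGATIVE = {"hate", "bad", "angry", "sad", "terrible", "lost"}

def classify(text):
    """Classify a text as positive, negative, or neutral by
    counting lexicon hits."""
    words = text.lower().replace(".", "").replace(",", "").replace("!", "").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("I love this product, it is great!"))   # positive
print(classify("Terrible service, they lost my bag"))  # negative
```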
For example, when a company introduces a new product, it can find out what consumers think about it. Whether they are pleased with it, or would like some alterations, can be discovered by running sentiment analysis on Big Data. The company can then act on these opinions to alter or enhance the product, increasing sales and keeping customers happy with it.
Below is a real-world example of sentiment analysis:
A major airline company monitors tweets about its flights to see how customers feel about updates, new planes, entertainment, and so on. Nothing unusual there, except that they feed this information into their customer support platform and resolve issues in real time.
One notable instance occurred when a customer tweeted negatively about lost luggage before boarding his connecting flight. The airline picked up the negative tweet and offered him a free first-class upgrade on the way back. They also traced the luggage and told him where it was and where it would be handed over. Needless to say, he was pleasantly surprised and tweeted like a happy holidaymaker throughout the rest of his trip.
With Hadoop, you can mine Twitter, Facebook, and other social media chats or discussions for sentiment data about you and your competition, and use it to make pointed, real-time decisions that increase market share. With rapid analysis of customer sentiment from social media, a company can decide and act swiftly; it no longer has to wait for sales reports, as before, to run its business better.
Data Processing (Retail)
Let’s now look at an application for a leading retail customer in India. The customer regularly received invoice data of about 100 GB in XML format. Creating a report from this data with the traditional method took about 10 hours, and the client had to wait until the report was ready.
This traditional method was developed in C, and its running time made it an unviable solution that left the client unsatisfied. The invoice data, being in XML format, had to be converted into a structured format before the report could be created. This involved verifying the data and applying complicated business rules.
In a world where things are expected to be available whenever required, waiting 10 hours was not an acceptable solution. The client therefore approached the Big Data team of one of the companies with this problem, hoping for something better. The customer would even have accepted the time being reduced from 10 hours to 5 hours or so.
When the Big Data team worked on the problem and came back with a solution, the client was astonished: the report that used to take 10 hours could now be generated in just 10 minutes using Big Data and Hadoop. The team used a cluster of 10 nodes to process the incoming data, bringing the processing time down to just 10 minutes. That gives you an idea of the speed and efficiency Big Data offers today.
Orbitz is a leading travel company using modern technologies to change the way customers around the globe plan their travel. It manages the customer travel planning sites Orbitz, Ebookers, and CheapTickets.
It handles 1.5 million flight searches and 1 million hotel searches regularly, and the log data generated by this activity is approximately 500 GB in size. The raw records were stored for only a few days because data warehousing was expensive. Storing and analyzing such enormous data with traditional data warehouse infrastructure was becoming ever more expensive and time-consuming.
For example, to search for a hotel in a database using the traditional approach, extraction had to be done sequentially. Ranking and classifying hotels based on just the last 3 months’ data took 2 hours, which is not a satisfactory or viable solution today, when clients expect results at a single click.
This challenge was very serious and needed a solution to keep the company from losing its clients. Orbitz needed an efficient way to store and process this data, and it also needed to improve its hotel rankings. Orbitz then tried the Big Data and Hadoop approach. HDFS, MapReduce, and Hive were used to solve the challenge, with remarkable results. A Hadoop cluster provided a very cost-effective way to store huge amounts of raw records; on this cluster the data is cleaned and evaluated, and machine learning algorithms are run.
Where it previously took about 2 hours to produce search results over the last 3 months of hotel data, with Big Data the time shrank to just 26 minutes for the same result. Big Data could forecast hotel and flight search trends much faster, more efficiently, and more cheaply than the traditional approach.
Now let’s look at how Sears Holdings used Hadoop to target marketing promotions.
Sears is an American multinational department store chain with over 4,000 stores, millions of products, and 100 million consumers. As of 2012, it was the fourth-largest U.S. department store company by retail sales and the 12th-largest retailer in the United States, leading its competitor Macy’s in 2013 in terms of revenue.
With such a large number of stores and customers, Sears has collected over 2 PB of data so far. The problem arose when its legacy systems became unable to analyze these huge amounts of data for marketing and loyalty campaigns. Sears needed to tailor marketing campaigns, coupons, and offers down to the individual customer, but the legacy systems were incapable of supporting that, which was hurting revenues.
Increasing customer loyalty, and with it sales and profitability, was very important to Sears because of intense competition.
Sears’ traditional process for evaluating marketing campaigns for loyalty club members usually took six weeks on mainframe, Teradata, and SAS servers to evaluate just 10% of customer data. Here came the revolutionary implementation of Apache Hadoop, the high-scale, open-source data processing platform driving the big data trend.
With this new approach to Big Data, Sears migrated to Hadoop on 300 nodes of commodity servers. The new process running on Hadoop can be completed weekly with a 100% analysis of customer data: whereas the old models used 10% of the available data, the new models run on 100%. For specific online and mobile commerce scenarios, Sears can now perform regular evaluations, and reports can be developed in 3 days instead of 6 to 12 weeks. This move saved millions of dollars in mainframe and RDBMS costs and delivered 50 times better performance for Sears. Sears was even able to increase revenue through better, faster evaluation of customer data.
Market Basket Analysis
In the retail industry, inventory, pricing, and transaction data are spread across various sources. Business users need to gather this information to understand their products, set reasonable prices, decide which platforms to support so that their online users get good performance, and determine where to target ads.
Market Basket Analysis can give the retailer the information to understand a buyer’s purchase behavior: what he is looking for, and what other items he may be interested in buying along with this product.
A clear application of Market Basket Analysis is in the retail sector, where retailers have huge amounts of transaction data and often thousands of products. One of the familiar examples is Amazon and its recommendation system: “Customers who bought a particular item also bought items X, Y, and Z”. Market Basket Analysis is relevant to many other industries and use cases as well.
For example, a leading retail company in the fashion industry evaluated sales data from the last three years. There were more than 100 million receipts, and the results obtained can be used as indicators for defining new initiatives, identifying the best schemes for the layout of goods in stores, and so on.
This data allows the retailer to understand buyers’ needs, redesign the store’s layout accordingly, develop cross-promotional programs, or even attract new customers. By evaluating users’ buying patterns, retailers can identify which items are bought together; to make the store customer-friendly, these items can be placed together, and related campaigns can be run to attract new buyers.
The Market Basket Analysis algorithm can be customized to users’ needs. To boost sales, supermarkets aim to make their stores more customer-friendly, and business users can now deeply explore the effectiveness of marketing and campaigns.
Marketing and sales organizations across several industries are looking to evaluate, understand, and forecast purchase behavior in order to reduce customer attrition and maximize customer lifetime value (CLV). Selling additional products and services to existing customers over their lifetime is crucial for optimizing revenue and profitability. Market Basket Analysis association rules identify the products and services that customers usually purchase together, allowing organizations to offer and promote the right products to the right customers.
To execute this complex use case, Apache Spark is an excellent fit, providing a unified framework that handles a wide range of workloads. Machine learning algorithms are used to implement Market Basket Analysis, and Apache Spark offers MLlib, a powerful machine learning library. Spark runs iterative algorithms very efficiently.
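The association-rule idea behind Market Basket Analysis can be sketched in a few lines of plain Python. The baskets and thresholds below are toy assumptions; at real scale you would use a distributed implementation such as FP-Growth in Spark MLlib rather than counting pairs in memory.

```python
from itertools import combinations
from collections import Counter

# Toy transaction data -- illustrative only.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

# Count how often each item and each item pair appears across baskets.
pair_counts = Counter()
item_counts = Counter()
for basket in baskets:
    for item in basket:
        item_counts[item] += 1
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Emit rules "a -> b" that meet minimum support and confidence
# (thresholds are arbitrary for this sketch).
n = len(baskets)
for (a, b), count in pair_counts.items():
    support = count / n               # fraction of baskets containing both
    confidence = count / item_counts[a]  # estimate of P(b | a)
    if support >= 0.4 and confidence >= 0.7:
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

Running this prints the single rule `bread -> butter`, the only pair that clears both thresholds in the toy data: customers who buy bread usually buy butter too.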
Customer Churn Analysis
Churn analysis is the calculation of the rate of attrition in a company’s customer base. It involves identifying those consumers who are most likely to stop using your service or product.
No industry enjoys losing customers. In today’s market, all customer-facing industries face customer churn due to intense competition, and industries like retail, telecom, and banking are hit especially hard.
The best way to manage this is to forecast, well in advance, which subscribers are likely to churn, so that businesses can take the required measures to mitigate it and retain those customers.
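As a simple sketch of such forecasting, the rule-based scorer below flags at-risk customers from a few behavioral signals. The field names and thresholds are purely illustrative assumptions; in practice you would train a classifier (for example with Spark MLlib) on historical churn labels rather than hand-picking rules.

```python
# Hypothetical per-customer signals; all names and thresholds are
# assumptions made for illustration.
def churn_risk(customer):
    """Return a coarse churn-risk label from simple behavioral rules."""
    score = 0
    if customer["complaints_last_90d"] >= 3:
        score += 2  # repeated complaints are a strong churn signal
    if customer["months_since_last_purchase"] >= 6:
        score += 2  # long inactivity suggests disengagement
    if customer["negative_social_mentions"] > 0:
        score += 1  # negative sentiment adds some risk
    return "high" if score >= 3 else "low"

loyal = {"complaints_last_90d": 0, "months_since_last_purchase": 1,
         "negative_social_mentions": 0}
at_risk = {"complaints_last_90d": 4, "months_since_last_purchase": 8,
           "negative_social_mentions": 2}

print(churn_risk(loyal))    # low
print(churn_risk(at_risk))  # high
```

A business would run such scoring over its whole customer base and direct retention offers at the "high" group before those customers actually leave.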
Industries are also interested in finding the root cause of customer churn: they want to know why a customer is leaving and which factor matters most. To find the root cause, companies need to evaluate the following data, which may range from terabytes to petabytes:
· Companies need to go through billions of customer complaints, stored over years, and get them resolved quickly.
· Data from social media, where users write their opinions about the products they use, lets companies identify whether customers are satisfied with their products or not.
Let’s consider the example of call center analysis, where the data used consists of call records and transactional data. Many banks are combining this call center data with their transactional data warehouse to reduce churn, increase sales, improve customer monitoring alerts, and detect fraud.
Apache Flink offers an opportunity to tap into the many internal and external customer interaction and behavioral data points to detect, measure, and improve the desired but elusive goal of a consistent and rewarding customer experience.