Data mining

From Verify.Wiki

Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Continuous Innovation

Although data mining is a relatively new term, the technology is not. Companies have used powerful computers to sift through volumes of supermarket scanner data and analyze market research reports for years. However, continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy of analysis while driving down the cost.

Data mining is a process used by companies to turn raw data into useful information. By using software to look for patterns in large batches of data, businesses can learn more about their customers and develop more effective marketing strategies as well as increase sales and decrease costs. Data mining depends on effective data collection and warehousing as well as computer processing.[1]


For example, one Midwest grocery chain used the data mining capacity of Oracle software to analyze local buying patterns. They discovered that when men bought diapers on Thursdays and Saturdays, they also tended to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping on Saturdays. On Thursdays, however, they bought only a few items. The retailer concluded that they purchased the beer to have it available for the upcoming weekend. The grocery chain could use this newly discovered information in various ways to increase revenue. For example, they could move the beer display closer to the diaper display. They could also make sure beer and diapers were sold at full price on Thursdays.
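A pattern like "diapers and beer" is usually expressed as an association rule and scored with support, confidence, and lift. The sketch below is only an illustration: the baskets are invented for this example, and the function name `rule_metrics` is hypothetical. It is not the grocery chain's data or the method the Oracle software used.

```python
# Hypothetical market baskets, invented purely for illustration.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
]


def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(baskets)
    both = sum(1 for b in baskets if antecedent in b and consequent in b)
    ante = sum(1 for b in baskets if antecedent in b)
    cons = sum(1 for b in baskets if consequent in b)

    # Support: fraction of all baskets that contain both items.
    support = both / n
    # Confidence: of the baskets with the antecedent, the fraction that
    # also contain the consequent.
    confidence = both / ante if ante else 0.0
    # Lift: confidence relative to how common the consequent is overall.
    # A value above 1 suggests the items co-occur more often than chance.
    lift = confidence / (cons / n) if cons else 0.0
    return support, confidence, lift


support, confidence, lift = rule_metrics(baskets, "diapers", "beer")
print(f"support={support:.3f} confidence={confidence:.3f} lift={lift:.3f}")
```

On this toy data the rule "diapers -> beer" has a lift above 1, which is the kind of signal that would prompt a retailer to look more closely at how the two products are displayed and priced.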

Current hot topics

Data mining is now applied in nearly every domain. Current hot topics include, but are not limited to, the following.[2]

1. Text summarization - As information overload has grown and the quantity of data has increased, so has interest in automatic summarization. Many news-oriented applications rely on text summarization.

2. Title recommendation, Topic modeling - The goal is to predict titles for articles, websites, and similar content, which requires building a learning-based system using classification algorithms. In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents.

3. Semantic correction system - Somewhat complex but interesting. Retrieved text often contains semantic errors, which lead to wrong results. Applying semantic correction as a preprocessing step leads to better outcomes.

4. Syntactic correction system - Much needed nowadays. Non-native English speakers often make syntactic errors. Such a system can also be used as a preprocessing step in many projects. The algorithm should automatically detect such errors and suggest the correct grammar.

5. Search engine for Wikipedia - Wikipedia data is available as a dump file; see DBpedia for reference. Apply indexing techniques and build a small search engine for wiki pages. Wikipedia already provides this functionality, but you can work on a better user experience and on result optimization.

6. Twitter tweet classifier - Fairly easy and interesting. Create a learning system for categories such as sports, entertainment, business, politics, and Hollywood. Train a classifier (naive Bayes, SVM) and predict the category of incoming tweets.

7. Sentiment analysis for Twitter, reviews, and conversations - A few packages available in R can help with this job. One needs to add a few additional features on top of them to make the results more intuitive. NLTK, the Stanford NLP tools, and word2vec are also good open-source tools for this task.

8. Anomaly detection - Again a learning-based classification system. Train the classifier on mail that users have already marked as spam so that it can classify newly arriving mail. If users mark new mail as spam, retrain the classifier (there may be a better option).

9. Sarcasm detection - This can be a very interesting one. In sentiment analysis we identify users' sentiment about something; here we identify sarcasm expressed by users, for example sarcasm detection on Twitter. A related task is classifying fake users and insincere posts: mail service providers such as Gmail and Yahoo work hard to keep their users away from spam mail and spam users, and administrators of online discussion forums are keen to automatically delete spam, fake, and irrelevant posts.

10. Fraud detection - Some social media users intentionally create hype about particular products or stocks to drive their prices up. Identifying such fraudulent users and activity is also a challenging task.

11. Market Analysis - Coca-Cola continually hires third-party companies to process data related to it from Twitter and Facebook. The company launches creative campaigns and wants to constantly monitor whether each campaign is being accepted by the audience. Many companies try to understand the flaws in their processes by studying what their users and customers say about their products or services. Analysts are automating their work by building tools that read the news and try to predict the market situation for the next day. Sentiment analysis is still one of the hottest applications (and yours truly has been engaged in research on sentiment analysis for two years). You can read about risk analysis and predictive analysis to learn about the latest focus areas and advancements in these fields.

12. Robotics - Robots are no longer simply pre-programmed toys. They try to learn how to do their work from previous experience. From genetic algorithms to reinforcement learning, many areas of computer science are trying to solve these problems from multiple perspectives. We would love to sit in a car that drives itself if it proves that it can think on the fly. We want missiles to hit their targets even in unknown terrain with a totally different climate and unexpectedly high wind speeds.

13. Manufacturing, Automotive, Aviation - The focus is on improving manufacturing processes to optimize time and material and to ensure high-quality production on the assembly line. This extends beyond the factory and onto the road, where modern braking systems know how much pressure to apply to each tyre to stop your car in the most comfortable way. The air and space industry is working on developing aircraft performance models.
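Several of the topics above (the tweet classifier in item 6 and the spam filter in item 8) come down to training a text classifier on labeled examples. The following is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing, written from scratch so that it needs no third-party packages. The class name `NaiveBayesTextClassifier`, the whitespace tokenization, and the four training sentences are all assumptions made for illustration; a real system would use far more data and a library implementation.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Illustrative multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # documents seen per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocab = set()                       # all words seen in training

    def train(self, examples):
        """examples: iterable of (text, label) pairs."""
        for text, label in examples:
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, text):
        """Return the label with the highest log posterior score."""
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_counts.items():
            # Log prior: how common this label was in training.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label][word]
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data, invented for this sketch.
training = [
    ("great goal in the final match", "sports"),
    ("the team won the league title", "sports"),
    ("stocks fell as markets closed lower", "business"),
    ("the company reported strong quarterly profit", "business"),
]

clf = NaiveBayesTextClassifier()
clf.train(training)
print(clf.predict("the team scored a late goal"))
print(clf.predict("markets closed lower as profit fell"))
```

The same structure covers the spam case by swapping the labels for "spam" and "not spam" and retraining whenever users mark new mail.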


Verification history