A few things the Symbol Research team are reading. Complex Insight is curated by Phillip Trotter (www.linkedin.com/in/phillip-trotter) from Symbol Research
GDELT: Global Database of Events, Language, and Tone
Phillip Trotter's insight:
The Global Database of Events, Language, and Tone (GDELT) is an initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world over the last two centuries, down to the city level globally, to make all of this data freely available for open research, and to provide daily updates to create the first "realtime social sciences earth observatory." Nearly a quarter-billion georeferenced events capture global behavior in more than 300 categories covering 1979 to present, with daily updates. Data is available for data scientists to mine and analyze. See more at http://gdelt.utdallas.edu/
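As a rough illustration of how a data scientist might start mining GDELT-style records, here is a minimal Python sketch. It assumes the tab-delimited layout of the GDELT event files; the column positions used below are illustrative only, and the official GDELT codebook should be consulted for the authoritative field order.

```python
import csv
import io

# Illustrative tab-delimited event records: event id, date (YYYYMMDD),
# actor country code, CAMEO event code, average tone. Field positions
# are assumptions for this sketch, not the official GDELT layout.
SAMPLE = (
    "1001\t20130401\tUSA\t051\t3.4\n"
    "1002\t20130401\tFRA\t190\t-8.0\n"
    "1003\t20130402\tUSA\t042\t1.9\n"
)

def events_for_country(tsv_text, country):
    """Yield (event_id, date, event_code, tone) for a given actor country."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for event_id, date, actor, code, tone in reader:
        if actor == country:
            yield event_id, date, code, float(tone)

usa_events = list(events_for_country(SAMPLE, "USA"))
```

The same pattern scales to the real daily update files: stream each tab-delimited line, select the fields of interest, and filter or aggregate by country, date, or event category.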
Platfora has gained a lot of buzz in the Big Data analytics market primarily through word of mouth. Late last year the company...
Phillip Trotter's insight:
Platfora is an impressive platform, demonstrated at the recent Strata conference, that helps make data analytics easily accessible. Good article covering why the accessibility demonstrated by Platfora is key to large-scale data analytics adoption. Worth reading.
One of the emerging, and soon to be defining, characteristics of science research is the collection, usage and storage of immense amounts of data.
Phillip Trotter's insight:
Lessons being learned in scientific disciplines for big data workflows and analysis will be increasingly applicable to general business as large-scale data analysis migrates into multiple fields. Scientific workflow tools such as Kepler will begin to migrate into non-science fields because they have already solved a number of problems that mainstream IT is beginning to encounter.
Good article on Big Data by Jeff Kelly explaining why so-called Big Data technologies are being adopted over more traditional alternatives such as relational databases. Worth a quick read. Click on image or title to learn more.
Results from the Kalido survey in London – Big Data Analytics 2012. Key takeaways: only 15% had a Big Data Analytics project underway, and another 37% planned to do something within 1 year. However, 26% have no timeline in place. This maps with our internal surveys, where we are seeing companies with a lot of machine-based data (log files, call data records and the like) looking at the technologies, while general business is adopting a wait-and-see attitude until the technology matures, pans out and proves relevant. Click on the image or the title to learn more.
Vincent Huang has a great set of posts at Ericsson Labs on preserving privacy in big data analytics. Over the last few years there has been a growing trend in the international development world to highlight the use of big data for development, using mobile telephony call records, Twitter postings and the like as raw data sets that can be mined for indicators of social circumstances and systemic change. A key first step in this process is anonymizing the data so individuals cannot be easily identified from distilled results. Vincent's work discusses the requirements and algorithms used and is both a fascinating and essential read for anyone working in this area. Must-read posts if you are involved in big data for development or working with personally identifiable data sets. Click on the image or title to learn more.
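One widely used criterion in this anonymization space is k-anonymity: every combination of "quasi-identifier" attributes must be shared by at least k records, so no individual stands out. The sketch below is a minimal illustration of that check, not Vincent's algorithm, and the call-record fields are hypothetical.

```python
from collections import Counter

def satisfies_k_anonymity(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values appears
    in at least k records, so no individual is uniquely exposed."""
    groups = Counter(
        tuple(rec[q] for q in quasi_identifiers) for rec in records
    )
    return all(count >= k for count in groups.values())

# Hypothetical call records reduced to coarse quasi-identifiers.
records = [
    {"age_band": "20-29", "cell_area": "A", "calls": 14},
    {"age_band": "20-29", "cell_area": "A", "calls": 3},
    {"age_band": "30-39", "cell_area": "B", "calls": 7},
    {"age_band": "30-39", "cell_area": "B", "calls": 22},
]

ok_at_2 = satisfies_k_anonymity(records, ["age_band", "cell_area"], k=2)
ok_at_3 = satisfies_k_anonymity(records, ["age_band", "cell_area"], k=3)
```

Here each (age band, cell area) group contains exactly two records, so the data is 2-anonymous but not 3-anonymous; in practice attributes are coarsened (generalized) until the desired k is reached.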
With the recent Hurricane Sandy reminding us that Mother Nature rules, the storm has also served as a teaching tool for companies whose business is big data. Gigaom has a good article covering various companies applying data analytics in the context of the storm. Click on the image or title to learn more.
Titan is the latest supercomputer to be deployed at Oak Ridge National Laboratory (ORNL), although it's technically a significant upgrade rather than a brand new installation. Jaguar, the supercomputer being upgraded, featured 18,688 compute nodes - each with a 12-core AMD Opteron CPU. Titan takes the Jaguar base, maintaining the same number of compute nodes, but moves to 16-core Opteron CPUs paired with an NVIDIA Kepler K20 GPU per node. The result is 18,688 CPUs and 18,688 GPUs, all networked together to make a supercomputer that should be capable of landing at or near the top of the TOP500 list. Click on the image or the title to learn more.
I am not usually a fan of market analysts, and especially of market reports that claim market X is worth zigabillions with no qualification of market-sizing methodology, categorization, data sources or actual product adoption. However, John Rymer is a Forrester analyst who has a rather good "crabby old guy" rant on big data. Read the rant; it's well worth noting, as are the closing lines: "My hope is that we start zeroing in on the scenarios and the value to enterprises of the various big data solutions that address each scenario. By doing so, we'll unlock the value of big data a lot faster." Could not agree more. Click on the image or title to learn more.
Interesting article on CMSWire regarding insight and big data. What is the Big Data Fallacy? Data does provide information, and more data generally gives more information. However, the fallacy of big data is that more data doesn't lead to proportionately more information. In fact, the more data you have, the less information you gain as a proportion of the data: the return on extractable information diminishes asymptotically as your data volume increases. In nearly all realistic data sets (especially big data), the amount of information one can extract will be a very tiny fraction of the data volume: information << data. What about insights? How do they relate to information and data? All insights are information, but not all information provides insight. In order to provide insight, information must be: interpretable, relevant, novel. Click on the image or the title to learn more.
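The diminishing-returns claim can be made concrete with a toy model (a sketch of my own, not from the article): if a data set is generated from a fixed pool of underlying facts, the distinct facts you can extract are capped, so information as a share of data volume shrinks as the data grows.

```python
def information_ratio(n_records, n_distinct_facts=50):
    """Toy model: records drawn from a fixed pool of underlying facts.
    Extractable information (distinct facts observed) is capped at
    n_distinct_facts, so its share of the data volume shrinks."""
    stream = [i % n_distinct_facts for i in range(n_records)]
    return len(set(stream)) / n_records

ratios = {n: information_ratio(n) for n in (50, 500, 5000)}
# ratios → {50: 1.0, 500: 0.1, 5000: 0.01}
```

A 100x increase in data volume here yields no new information at all beyond the first 50 records; real data sets are less extreme, but the asymptotic shape is the point the article makes.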
Good article on ReadWriteWeb. At some point, most businesses ask one basic question: is there information out there that will help improve the business? Sometimes the answer is that we need lots of bits of information from lots of places. When this is the answer, data-as-a-service solutions might be worth a look. The article has a great stand-out example of how big data as a service can help small businesses. It's becoming pretty well known that big data and fast data analysis are being used by large hotels and chains to improve their yield management processes. LeisureLink is a company looking to bring a data-as-a-service platform to small and medium-size hotels, restaurants and bed and breakfasts. One of LeisureLink's clients is a 120-room property outside Myrtle Beach, SC, that "is great at hospitality, not necessarily at IT." Using LeisureLink's service, the property is able to tap into information about local Myrtle Beach hotels, see real-time pricing information on other properties and make adjustments accordingly. This is exactly what the larger hotels and chains have been doing for a while, but now this capability is available as a service to smaller, mid-market establishments thanks to LeisureLink. LeisureLink is a good example of the emergent data platforms that let small businesses use the so-called big data mining techniques that large companies employ. Worth a read; click on image or title to learn more.
This article states that in order for chains to improve their yield management processes, they must collect big data and analyse the data as fast as possible. It also suggests that Big Data can work for every business: if you get a yes answer to the question "Is there information out there that will help improve the business?", then finding a big data solution might be worth the effort. I believe that every company, whether big or small, should have the opportunity to improve its business by understanding the pricing strategies of its competitors. And it doesn't mean you have to spend a lot of money to find the answer; there's a good chance that someone out there has the valuable information you need.
Health care in the United States certainly needs an overhaul. The question is whether that overhaul will come from artificially intelligent doctors. IBM has launched partnerships with insurance giant WellPoint and the Sloan-Kettering Cancer Center in New York, and is expected to offer Watson commercially to hospitals within the next few years. The Wired article discusses the approach and the pros and cons as seen by some doctors. Worth a read for a quick intro to modern AI, combining data mining and big data topics, as an approach to treatment diagnostics in healthcare. Click on the image or the title to learn more.
Watch all the EMC Big Ideas videos at http://emc.im/MprmL3. Learn how EMC Solutions help create value by merging Big Data with Cloud Computing in this hand-dr... (Big Ideas: How Big is Big Data?)
Phillip Trotter's insight:
There are a bunch of interesting videos on big data from the likes of IBM, Cisco and now these from EMC. Worth a watch (if you haven't already overdosed on big data videos using cartoon presentation techniques).
We talk a lot about big data, but only analyze 1 percent of what’s available. In order to take advantage of the other 99 percent, we need to reconsider how we do big data.
Phillip Trotter's insight:
Good article by Gurjeet Singh, CEO of Ayasdi. I agree with his sentiment: "I'm very concerned that the growing hype around the term big data has set us all up for disappointment. Query-based analysis is fine for a certain class of problems, but it will never be able to deliver on the expectations the market has for big data. We are on the cusp of critical breakthroughs in cancer research, energy exploration, drug discovery, financial fraud detection and more. It would be a crime if the passion, interest and dollars invested to solve critical global problems like these were sidetracked by a 'big data bubble.'" Article is worth reading.
Interesting article in The New York Times on Jeff Hawkins (of Palm fame) and his new company Numenta. Their Grok product uses a complex artificial neuron model to build predictive analytics from sensed data. Worth reading. Click on the image or the title to learn more.
Get into a cab and it's safe to assume the driver knows the ins, outs, shortcuts and potential traffic tie-ups between you and your destination. That kind of knowledge comes from years of experience, and IBM is taking a similar tack that blends real-time data and historical information into a new breed of traffic prediction.
IBM is testing the new traffic-management technology in a pilot program in Lyon, France, that’s designed to provide the city’s transportation engineers with “real-time decision support” so they can proactively reduce congestion. Called Decision Support System Optimizer (DSSO), the technology uses IBM’s Data Expansion Algorithm to combine old and new data to predict future traffic flow. Over time the system “learns” from successful outcomes to fine-tune future recommendations. Click on the title or image to learn more.
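IBM's actual Data Expansion Algorithm isn't detailed in the article, but the general idea of combining old and new data for a forecast can be illustrated with a naive weighted blend. This is purely a sketch under my own assumptions; the weight here is fixed by hand, whereas a learning system like DSSO would tune it from observed outcomes.

```python
def predict_flow(historical_avg, live_reading, weight_live=0.6):
    """Blend a historical average with the latest live observation.
    weight_live is an illustrative tuning knob; a system that 'learns'
    from successful outcomes would adjust such weights automatically."""
    return weight_live * live_reading + (1 - weight_live) * historical_avg

# Historical 8am average for a road segment is 900 vehicles/hour, but
# the live sensor reads 1200: the forecast leans toward the live data.
forecast = predict_flow(historical_avg=900, live_reading=1200)
```

With these numbers the forecast comes out around 1080 vehicles/hour, between the two inputs but closer to the live reading; lowering `weight_live` would make the system more conservative about sudden sensor spikes.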
Thanks to Jed for spotting this. Barry Devlin's blog offers good insight into the emerging discussions around unified data management platforms that can scale from small to big data architectures. Barry's recent white paper on the Data Zoo (linked from the article) expands on the points made. Both are worth a read. Click on the image or the title for more info.
Over the last year we have been experimenting with different data processing platforms, including Hadoop, HPCC, several graph databases and NoSQL databases. We selected a hybrid of Storm and HPCC since, combined, they are a great match for our processing needs, and both have permissive open source licenses and good resources.
Many big data articles fail to mention that big data applications have varying requirements, and that the different available toolsets suit different applications and circumstances. One of the reasons we went with HPCC is that we have good C++ skills on our team and we liked the ECL language and system architecture. We could see how our target workflows would map (it helps that the LexisNexis team have addressed similar areas in the past, and it shows in the product), and we could easily see how to integrate the HPCC functionality with our C++, Java and Python tools.
If we were a pure Java-only house then Hadoop, Cascading and Hive would be a much better match for non-real-time processing. Tools like Storm are better architected for real-time event processing (hence our hybrid choice). One of the essential lessons and takeaways is that each domain and circumstance has different requirements: you have to play with the technologies, test them against workflows and see what works for your problem domain, performance requirements and team skills. For evaluating HPCC, a great resource is Arjuna's blog, where you will find worked examples, discussion of trade-offs and approaches, and a lot of good information. If you are interested in learning about a great Big Data platform, it's a good place to start. Click on the image or the title to learn more.
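The batch-versus-streaming distinction above can be sketched in a few lines of Python (an illustration of the two processing models, not of Storm's or Hadoop's actual APIs): a Storm-style system updates state per event as it arrives, while a batch system recomputes over the complete data set.

```python
from collections import defaultdict

class RunningCounter:
    """Stream-style processing: state is updated per event as it
    arrives, so results are available with low latency."""
    def __init__(self):
        self.counts = defaultdict(int)

    def on_event(self, key):
        self.counts[key] += 1
        return self.counts[key]  # current count, available immediately

events = ["login", "click", "click", "login", "click"]

counter = RunningCounter()
for evt in events:
    counter.on_event(evt)

# Batch-style equivalent: one full pass over the data set at the end.
batch_counts = {}
for evt in events:
    batch_counts[evt] = batch_counts.get(evt, 0) + 1
```

Both approaches arrive at the same counts; the difference is latency and where the state lives, which is exactly why a hybrid of a streaming engine and a batch platform can make sense for mixed workloads.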
Wired has an interesting article on mobile computing. As ARM seeks to put cell phone chips into our supercomputers, Intel is doing the reverse; the lines between mobile hardware and data center hardware are blurring. Intel's experimental Single-chip Cloud Computer project, or SCC, is a 48-core chip that acts as a "network" of processors on a single chip, with two cores per node. The nodes communicate with each other much the same way nodes in a data center cluster would. While these types of architectures are today used in big data applications that run in a server cloud environment, there are actually many cases where a hybrid model would make the most sense, including artificial intelligence, machine vision and augmented reality applications. Click on the image or the title to learn more.
Greenplum have open sourced their data collaboration platform Chorus as part of their commitment to OpenChorus (www.openchorus.org). Chorus is a Facebook-like environment for teams to collaborate around data. Licensed under the Apache 2.0 license; if you are building team tools around data processes, OpenChorus is probably worth a look. Click on the image or the title to learn more.
eWeek: Enterprises to Slowly Adopt Windows 8, Big Data Jobs to Grow: Gartner. Big data is also predicted to have a big impact on enterprises and the IT job market, with big data demand reaching 4.4 million jobs globally by 2014.
TechCrunch have an article explaining why investors are chasing the Big Data market and are likely to for some time. Splice Machine raised $4 million to develop its SQL engine for big data apps. MongoHQ raised $6 million for its database as a service. A third startup, BloomReach, announced $25 million in funding for its big data applications. These three companies provide examples of why the investor community will continue to invest in big data startups for many years to come. All reflect a changing dynamic — the rise of the big data app and the need for a new data infrastructure. These two converging trends now drive funding for a widening number of startups that make data functional inside and outside the enterprise. Data functionality, a term Gartner Research used in a report it published this week about how big data will drive $232 billion in IT spending through 2016, speaks to why investment will continue to flow into companies such as Splice Machine, MongoHQ and BloomReach. Click on the image or the title to learn more.
Interesting mainstream article from the BBC on Big Data and analytics. The BBC article is based on a recent report by Dynamica Markets which shows that one in five UK companies now quantify the value of corporate data on their balance sheets. Nothing new in terms of technical information, but some nice examples, good quotes and an interesting perspective; worth a quick read. Click on the image or the headline for more info.
The CUBRID blog is a great source of info. This article covers technical details on how to effectively analyze data from a platform perspective, and gives a good overview of working with large amounts of data, searching for meaningful data for visualization, and enabling predictive analysis. I also liked the graph showing how applications will change over the coming years from trend analysis to simulation.