Showing posts with label Hadoop. Show all posts

Tuesday, January 29, 2013

Long Data Is Still Big Data

You add the time dimension to Big Data and you get Long Data. Long Data is still Big Data.

Stop Hyping Big Data and Start Paying Attention to ‘Long Data’

crunching big numbers can help us learn a lot about ourselves. ..... But no matter how big that data is or what insights we glean from it, it is still just a snapshot: a moment in time. ..... as beautiful as a snapshot is, how much richer is a moving picture, one that allows us to see how processes and interactions unfold over time? ..... many of the things that affect us today and will affect us tomorrow have changed slowly over time ...... Datasets of long timescales not only help us understand how the world is changing, but how we, as humans, are changing it — without this awareness, we fall victim to shifting baseline syndrome. This is the tendency to shift our “baseline,” or what is considered “normal” — blinding us to shifts that occur across generations (since the generation we are born into is taken to be the norm). ..... Shifting baselines have been cited, for example, as the reason why cod vanished off the coast of the Newfoundland: overfishing fishermen failed to see the slow, multi-generational loss of cod since the population decrease was too slow to notice in isolation. ..... Fields such as geology and astronomy or evolutionary biology — where data spans millions of years — rely on long timescales to explain the world today. History itself is being given the long data treatment, with scientists attempting to use a quantitative framework to understand social processes through cliodynamics, as part of digital history. Examples range from understanding the lifespans of empires (does the U.S. as an “empire” have a time limit that policy makers should be aware of?) to mathematical equations of how religions spread (it’s not that different from how non-religious ideas spread today). ...... building a clock that can last 10,000 years .... the 26,000-year cycle for the precession of equinoxes ...... Just as big data scientists require skills and tools like Hadoop, long data scientists will need special skillsets. Statistics are essential, but so are subtle, even seemingly arbitrary pieces of knowledge such as how our calendar has changed over time
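
The remark about calendar changes is easy to make concrete. Below is a minimal Python sketch, not anything from the quoted article, of one such "seemingly arbitrary" detail: a date recorded in the old Julian calendar has to be shifted before it can sit on the same timeline as modern Gregorian dates. The function names are my own, and the conversion uses the standard Julian Day Number arithmetic. Python's `datetime.date` assumes the Gregorian rules extend backwards indefinitely (the "proleptic" Gregorian calendar), which is the assumption made here too.

```python
from datetime import date

# Julian Day Number of 0001-01-01 in the proleptic Gregorian calendar,
# which is the day that datetime.date numbers as ordinal 1.
GREGORIAN_DAY_ONE_JDN = 1721426


def julian_calendar_to_jdn(year, month, day):
    """Julian Day Number for a date written in the Julian calendar."""
    a = (14 - month) // 12      # 1 for January and February, else 0
    y = year + 4800 - a         # shift so the year starts in March
    m = month + 12 * a - 3      # March = 0, ..., February = 11
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def julian_to_gregorian(year, month, day):
    """Convert a Julian calendar date to a (proleptic) Gregorian date."""
    jdn = julian_calendar_to_jdn(year, month, day)
    return date.fromordinal(jdn - GREGORIAN_DAY_ONE_JDN + 1)


# The last Julian day before the 1582 reform, and a well-known 1917 date.
print(julian_to_gregorian(1582, 10, 4))
print(julian_to_gregorian(1917, 10, 25))
```

A long dataset that mixes sources from before and after a country's switch would need this kind of normalization before any time-series analysis, and the switch happened in different years in different countries.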

Sunday, January 15, 2012

Big Data

Big Data: Big News
Facebook And Big Data

After reading this, you will appreciate your Facebook stream just a little more.

O'Reilly Radar: What is big data?
Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures. ..... cost-effective approaches have emerged to tame the volume, velocity and variability of massive data. Within this data lie valuable patterns and information ...... Today's commodity hardware, cloud architectures and open source software bring big data processing into the reach of the less well-resourced. ...... analytical use, and enabling new products ...... Being able to process every item of data in reasonable time removes the troublesome need for sampling ...... by combining a large number of signals from a user's actions and those of their friends, Facebook has been able to craft a highly personalized user experience and create a new kind of advertising business. It's no coincidence that the lion's share of ideas and tools underpinning big data have emerged from Google, Yahoo, Amazon and Facebook. ....... The emergence of big data into the enterprise brings with it a necessary counterpart: agility. Successfully exploiting the value in big data requires experimentation and exploration. ........ Input data to big data systems could be chatter from social networks, web server logs, traffic flow sensors, satellite imagery, broadcast audio streams, banking transactions, MP3s of rock music, the content of web pages, scans of government documents, GPS trails, telemetry from automobiles, financial market data, the list goes on. ....... the three Vs of volume, velocity and variety are commonly used to characterize different aspects of big data. ........ Having more data beats out having better models ...... If you could run that forecast taking into account 300 factors rather than 6, could you predict demand better? ......... Many companies already have large amounts of archived data, perhaps in the form of logs, but not the capacity to process it. ...... 
data warehouses or databases such as Greenplum — and Apache Hadoop-based solutions ...... Apache Hadoop.. places no conditions on the structure of the data it can process. ...... First developed and released as open source by Yahoo, it implements the MapReduce approach pioneered by Google in compiling its search indexes. Hadoop's MapReduce involves distributing a dataset among multiple servers and operating on the data: the "map" stage. The partial results are then recombined: the "reduce" stage. ......... Hadoop is not itself a database or data warehouse solution, but can act as an analytical adjunct to one. ....... A MySQL database stores the core data. This is then reflected into Hadoop, where computations occur, such as creating recommendations for you based on your friends' interests. Facebook then transfers the results back into MySQL, for use in pages served to users. ............ the increasing rate at which data flows into an organization — has followed a similar pattern to that of volume. Problems previously restricted to segments of industry are now presenting themselves in a much broader setting. Specialized companies such as financial traders have long turned systems that cope with fast moving data to their advantage. Now it's our turn. ......... Online retailers are able to compile large histories of customers' every click and interaction: not just the final sales. Those who are able to quickly utilize that information, by recommending additional purchases, for instance, gain competitive advantage. The smartphone era increases again the rate of data inflow, as consumers carry with them a streaming source of geolocated imagery and audio data. ......... The importance lies in the speed of the feedback loop, taking data from input through to decision. ........ you wouldn't cross the road if all you had was a five-minute old snapshot of traffic location. ......... "streaming data," or "complex event processing." ...... 
when the input data are too fast to store in their entirety: in order to keep storage requirements practical some level of analysis must occur as the data streams in. ........ At the extreme end of the scale, the Large Hadron Collider at CERN generates so much data that scientists must discard the overwhelming majority of it — hoping hard they've not thrown away anything useful. The second reason to consider streaming is where the application mandates immediate response to the data. Thanks to the rise of mobile applications and online gaming this is an increasingly common situation. ........ The velocity of a system's outputs can matter too. The tighter the feedback loop, the greater the competitive advantage. ....... Rarely does data present itself in a form perfectly ordered and ready for processing. A common theme in big data systems is that the source data is diverse, and doesn't fall into neat relational structures. It could be text from social networks, image data, a raw feed directly from a sensor source. None of these things come ready for integration into an application. .......... the reality of data is messy. Different browsers send different data, users withhold information, they may be using differing software versions or vendors to communicate with you. And you can bet that if part of the process involves a human, there will be error and inconsistency. ....... Is this city London, England, or London, Texas? By the time your business logic gets to it, you don't want to be guessing. ...... a principle of big data: when you can, keep everything. There may well be useful signals in the bits you throw away. ....... documents encoded as XML are most versatile when stored in a dedicated XML store such as MarkLogic. Social network relations are graphs by nature, and graph databases such as Neo4J make operations on them simpler and more efficient. ....... a disadvantage of the relational database is the static nature of its schemas. 
In an agile, exploratory environment, the results of computations will evolve with the detection and extraction of more signals. Semi-structured NoSQL databases meet this need for flexibility: they provide enough structure to organize data, but do not require the exact schema of the data before storing it. ........ three forms: software-only, as an appliance or cloud-based. ...... IT is undergoing an inversion of priorities: it's the program that needs to move, not the data. .... Financial trading systems crowd into data centers to get the fastest connection to source data, because that millisecond difference in processing time equates to competitive advantage. ...... 80% of the effort involved in dealing with data is cleaning it up in the first place ...... data science, a discipline that combines math, programming and scientific instinct. ...... The art and practice of visualizing data is becoming ever more important in bridging the human-computer gap to mediate analytical insight in a meaningful way. ...... advice to businesses starting out with big data: first, decide what problem you want to solve.
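
The "map" and "reduce" stages described in the excerpt can be sketched in a few lines of Python. This is only a single-process illustration of the idea, assuming a word-count task of my own choosing; it does not use Hadoop or any Hadoop API, and the function names are hypothetical. In a real cluster each chunk would live on a different server and the map stage would run where the data is.

```python
from collections import defaultdict


def map_stage(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the intermediate pairs by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_stage(groups):
    """Reduce: recombine the partial results for each key into one total."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # Each chunk stands in for the slice of the dataset held by one server.
    pairs = [pair for chunk in chunks for pair in map_stage(chunk)]
    return reduce_stage(shuffle(pairs))


chunks = ["big data is big", "long data is still big data"]
counts = word_count(chunks)
print(counts["big"], counts["data"], counts["long"])
```

Because the map stage looks at each chunk independently, adding more servers lets it handle more data without changing the logic, which is the property the excerpt is pointing at.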

Facebook And Big Data

Big Data: Big News

ReadWriteWeb: Why Facebook's Data Sharing Matters
Facebook has cut a deal with political website Politico that allows the independent site machine-access to Facebook users' messages, both public and private, when a Republican Presidential candidate is mentioned by name. The data is being collected and analyzed for sentiment by Facebook's data team, then delivered to Politico to serve as the basis of data-driven political analysis and journalism. ..... Facebook could be the biggest, most dynamic census of human opinion and interaction in history. ....... Back in the middle of the last century, when US Census data and housing mortgage loan data were both made available for computer analysis and cross referencing for the first time, early data scientists were able to prove a pattern of racial discrimination by banks against people of color who wanted to buy houses in certain neighborhoods. The data illuminated the problem and made it undeniable, thus leading to legislation to prohibit such discrimination...... the relationship between data and knowledge generally in the emerging data-rich world....... David Weinberger .. "It's not simply that there are too many brickfacts [datapoints] and not enough edifice-theories. Rather, the creation of data galaxies has led us to science that sometimes is too rich and complex for reduction into theories. As science has gotten too big to know, we've adopted different ideas about what it means to know at all." ...... The world's largest social network, rich with far more signal than any of us could wrap our heads around, could help illuminate emergent qualities of the human experience that are only visible on the network level.
Google machine-reads all your Gmail emails. That is how it serves ads against them. I don't think that is a breach of privacy. I can imagine Facebook similarly machine-reading your private Facebook messages and updates. As long as individuals are not identified, that collective data is fair game. It has the potential to do tremendous good.

This also tells me Google is not the only major tech company trying to get on the Big Data train. Facebook is also well-positioned.

Curiously, Yahoo's new CEO has also said he will take Yahoo into the Big Data domain. He got the vocabulary right. I hope he can deliver. Yahoo also sits on some pretty Big Data.

Wednesday, November 30, 2011

Big Data: Big News

Those who think GOOG is a one trick search pony, checkout GFS, BigTable, MapReduce, Tenzing, etc. These are the building blocks of Big Data
Nov 30 via web


I am not the first to make this observation, and neither is the person quoted above. But it is obvious that Big Data is waiting in the wings. Big Data will gather buzz the way social has for a few years now.