Questions and answers on big data on Stack Overflow


Stack Overflow is a question-and-answer site. It is 100% free, and no registration is required.


Big Data is a blanket term for any collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization.


Data sets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies, software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. What is considered "big data" varies depending on the capabilities of the organization managing the set, and on the capabilities of the applications that are traditionally used to process and analyze the data set in its domain (from Wikipedia, the free encyclopedia).


Dataflow from Azure to Google Cloud Platform using NiFi

Sat, 23 March 2019 15:29 +0000 GMT

Why is a boolean field not working in Hive?

Sat, 23 March 2019 14:10 +0000 GMT

Is there any open source framework for implementing a partition/replica storage system?

Sat, 23 March 2019 11:19 +0000 GMT

MapReduce: how to chain Mapper >> Reduce >> Reduce

Fri, 22 March 2019 17:38 +0000 GMT

Mapper processing different number of lines

Fri, 22 March 2019 17:27 +0000 GMT

What is a Big Data exercise I can do? [on hold]

Fri, 22 March 2019 16:31 +0000 GMT

How to schedule jobs in Apache Hadoop for Hive & Spark scripts [on hold]

Fri, 22 March 2019 15:13 +0000 GMT

Example of session window with event time based watermark?

Fri, 22 March 2019 12:10 +0000 GMT

What's the difference between a watermark and a trigger in Flink?

Fri, 22 March 2019 11:07 +0000 GMT

Sqoop mysql error - communications link failure

Fri, 22 March 2019 11:04 +0000 GMT

Flink Session Window (based on EventTime) with Expiry Time? [on hold]

Fri, 22 March 2019 09:35 +0000 GMT

Get data from the last three months using Talend (Big Data Hive)

Fri, 22 March 2019 07:54 +0000 GMT

How should I handle binary buffers that vary dramatically in size to minimize overhead in accessing data?

Thu, 21 March 2019 21:29 +0000 GMT

Efficient Matrix Multiplication and Ranking for Collaborative Filtering

Thu, 21 March 2019 18:08 +0000 GMT

What are driver memory and executor memory in Spark? [duplicate]

Thu, 21 March 2019 05:47 +0000 GMT

Map on DataFrame takes too long [duplicate]

Wed, 20 March 2019 15:24 +0000 GMT

Integrating Hadoop Big Data Tools with a Data Warehouse

Wed, 20 March 2019 12:36 +0000 GMT

Flatten multiple tables into a single large table

Wed, 20 March 2019 10:25 +0000 GMT

R Studio - dealing with big data in modelling [on hold]

Tue, 19 March 2019 13:41 +0000 GMT

Which is best to go with: HBase, Hive or Spark SQL? [on hold]

Tue, 19 March 2019 13:02 +0000 GMT

Efficiently filtering a large dataset conditionally

Tue, 19 March 2019 05:23 +0000 GMT

ParseException: u"\nextraneous input '/' expecting {'SELECT', 'FROM', 'ADD',

Tue, 19 March 2019 02:50 +0000 GMT

NoSQL big DB retrieve performance

Mon, 18 March 2019 18:07 +0000 GMT

Why are big companies using other databases and not HDFS?

Mon, 18 March 2019 17:13 +0000 GMT

How to access Hive using C++ [closed]

Mon, 18 March 2019 14:03 +0000 GMT

List concatenation with big CSVs in R

Mon, 18 March 2019 12:03 +0000 GMT

More efficient way to extract and subtract rows R in different dataframes

Sun, 17 March 2019 17:36 +0000 GMT

Can I store an ordered queue in Spark?

Sat, 16 March 2019 12:33 +0000 GMT

Why doesn't a CTAS query in Hive give the expected result?

Fri, 15 March 2019 15:42 +0000 GMT

Searching for findings in a large amount of structured log data stored in object storage

Fri, 15 March 2019 09:36 +0000 GMT