Questions and answers on big data on Stack Overflow


Stack Overflow is a question-and-answer site. It is free to use, and no registration is required.


Big Data is a blanket term for any collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization.


Data sets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies, software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. What is considered "big data" varies depending on the capabilities of the organization managing the set, and on the capabilities of the applications that are traditionally used to process and analyze the data set in its domain (from Wikipedia, the free encyclopedia).


Hive / Hue view return all rows from newest partition (year / month / day)

Wed, 22 May 2019 19:28 +0000 GMT

Pyspark split df into chunks of n (n can be considered as number of rows/records)

Wed, 22 May 2019 18:27 +0000 GMT

How can I store data from arduino sensor to hadoop hdfs in real time

Wed, 22 May 2019 18:20 +0000 GMT

An Efficient way to Calculate loss function batchwise?

Wed, 22 May 2019 15:13 +0000 GMT

Big Data Platform for a healthcare application

Wed, 22 May 2019 13:40 +0000 GMT

Can I tabulate 1 million observations in R on AWS cloud? [duplicate]

Wed, 22 May 2019 10:59 +0000 GMT

Importing massive dataset in Neo4j where each entity has differing properties

Tue, 21 May 2019 17:07 +0000 GMT

Performing joins on indices in flat file storage systems

Tue, 21 May 2019 12:37 +0000 GMT

How to convert datetime in SQL to Scala data type?

Tue, 21 May 2019 12:35 +0000 GMT

Is it possible to install CDH on a RHEL7 server where Hadoop and a few other components are installed separately

Mon, 20 May 2019 07:34 +0000 GMT

Got wrong result when let -mapred.reduce.tasks larger than 1 when using hadoop streaming

Mon, 20 May 2019 02:23 +0000 GMT

SciKit Classification Metric

Sun, 19 May 2019 12:27 +0000 GMT

Increase Speed In SQL Big Data

Sun, 19 May 2019 11:27 +0000 GMT

Identify a node in undirected graph that has less number of connections with other nodes but the nodes that it is connected to has large degree [on hold]

Fri, 17 May 2019 21:13 +0000 GMT

How to add dynamic column with static value in hive

Fri, 17 May 2019 13:20 +0000 GMT

Which framework should be used to aggregate and joining the data of Kafka topics and store in to MySQL

Fri, 17 May 2019 08:41 +0000 GMT

How do I Map with a whole text document instead of line by line?

Fri, 17 May 2019 02:07 +0000 GMT

Impala Query Hangs after finished in Hue

Thu, 16 May 2019 18:44 +0000 GMT

Dataflow: Hot keys in a cross join

Thu, 16 May 2019 18:35 +0000 GMT

Approach for continuously processing large number of Computer Vision tasks?

Thu, 16 May 2019 15:48 +0000 GMT

YAMLLOADWARNING when use ccm in ubuntu

Thu, 16 May 2019 09:27 +0000 GMT

Discrepancies between unique users count in Google Analytics portal and result I am getting from BigQuery. Is something wrong with the query?

Thu, 16 May 2019 07:15 +0000 GMT

How to validate Cluster(analysis) for high dimensional Data (gene expression)

Thu, 16 May 2019 05:21 +0000 GMT

Link between realtime database & bigquery

Wed, 15 May 2019 19:24 +0000 GMT

spark-shell not running in windows 8 | Unable to load native hadoop library

Wed, 15 May 2019 15:33 +0000 GMT

how to measure the read and write time on hdfs using job spark?

Wed, 15 May 2019 13:01 +0000 GMT

Airflow Task Fails

Wed, 15 May 2019 12:50 +0000 GMT

Can I divide my training data and train every part instead of inputting my whole data at once?

Wed, 15 May 2019 10:22 +0000 GMT

Storing a large number of money totals and memory/storage implications - BigDecimal vs Integer and best practices?

Wed, 15 May 2019 10:11 +0000 GMT

reduce RDD having key as (String,String)

Wed, 15 May 2019 06:33 +0000 GMT