Questions and answers on big data on Stack Overflow


Stack Overflow is a question and answer site. It is 100% free, and no registration is required.


Big Data is a blanket term for any collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization.


Data sets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies, software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. What is considered "big data" varies depending on the capabilities of the organization managing the set, and on the capabilities of the applications that are traditionally used to process and analyze the data set in its domain (from Wikipedia, the free encyclopedia).


BigQuery - Caching possibly isn't working. How do I diagnose?

Tue, 23 July 2019 00:26 +0000 GMT

Which one is better considering performance: Hive on Tez or HiveContext in Apache Spark SQL? [on hold]

Mon, 22 July 2019 12:47 +0000 GMT

Is there any method for k-prototypes in pyspark? [on hold]

Mon, 22 July 2019 11:25 +0000 GMT

Why does the process rate of a MapReduce task start slow and then increase?

Mon, 22 July 2019 05:19 +0000 GMT

How do I do a flatMap on Spark DataFrame rows depending on conditions of multiple field values?

Mon, 22 July 2019 04:10 +0000 GMT

How to make knn algorithm more efficient

Sat, 20 July 2019 11:52 +0000 GMT

What dimensionality reduction method should be used for a multi-dimensional target - PCA or any other?

Fri, 19 July 2019 11:18 +0000 GMT

How does calling random.shuffle on a h5py dataset work?

Fri, 19 July 2019 09:40 +0000 GMT

Count function and Case

Fri, 19 July 2019 01:00 +0000 GMT

Determining inactive records in a table

Thu, 18 July 2019 21:08 +0000 GMT

Retain the latest record in a Hive table

Thu, 18 July 2019 05:19 +0000 GMT

What is the difference between MapR and Map Reduce?

Thu, 18 July 2019 04:27 +0000 GMT

How to read a StructType schema from an external text file in Apache Spark?

Wed, 17 July 2019 21:14 +0000 GMT

Best way to Drop Partitions using Presto + Hive

Wed, 17 July 2019 17:53 +0000 GMT

Memoryerror: in big data

Wed, 17 July 2019 15:34 +0000 GMT

Split dataset per rows into smaller files in R

Wed, 17 July 2019 14:27 +0000 GMT

System Design: How can we handle a Load Balancer crash?

Wed, 17 July 2019 04:41 +0000 GMT

Data transfer between two Kerberos-secured clusters

Tue, 16 July 2019 17:07 +0000 GMT

How to fix Zookeeper Nullpointer in my Kafka Connect Sink to HBase

Tue, 16 July 2019 15:04 +0000 GMT

Is big data the right field to master as a fresher, or should I move my career to application development? [closed]

Tue, 16 July 2019 09:26 +0000 GMT

How to update corrupted data in a particular bucket of a Hive table?

Tue, 16 July 2019 04:44 +0000 GMT

Spark convert csv with Excel formula and get only the value

Mon, 15 July 2019 12:40 +0000 GMT

Performance of using a UDF in SQL vs. a method in C# code

Mon, 15 July 2019 11:21 +0000 GMT

Apply custom schema to POST response JSON from REST API using Scala Spark

Mon, 15 July 2019 10:31 +0000 GMT

Why can't I get a variable from another connection in dolphindb?

Mon, 15 July 2019 06:41 +0000 GMT

How to parallelize video with Spark RDD

Mon, 15 July 2019 00:40 +0000 GMT

Connect Zeppelin with 2 Hive running on different servers?

Sun, 14 July 2019 07:37 +0000 GMT

Dealing with enormous DataFrames - in local environments versus HPC environments

Fri, 12 July 2019 18:50 +0000 GMT

How do I configure multiple hard disks for a dolphindb data node

Fri, 12 July 2019 17:56 +0000 GMT

Pig is not running in mapreduce mode (hadoop 3.1.1 + pig 0.17.0)

Fri, 12 July 2019 13:58 +0000 GMT