Questions and answers on big data on Stack Overflow


Stack Overflow is a question and answer site. It is 100% free, and no registration is required.


Big Data is a blanket term for any collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization.


Data sets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies, software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. What is considered "big data" varies depending on the capabilities of the organization managing the set, and on the capabilities of the applications that are traditionally used to process and analyze the data set in its domain (from Wikipedia, the free encyclopedia).


What we use to read csv files for big data in R? [duplicate]

Sat, 19 August 2017 18:27 +0000 GMT

Create new DF with values representing difference between two dataframes

Sat, 19 August 2017 10:46 +0000 GMT

Unbagging a dataset in pyspark

Fri, 18 August 2017 16:19 +0000 GMT

Saving wordCount to MongoDB server

Fri, 18 August 2017 12:32 +0000 GMT

What is Lineage In Spark?

Fri, 18 August 2017 08:34 +0000 GMT

How do I update User object at scale?

Fri, 18 August 2017 03:09 +0000 GMT

in Arules, return the smallest support items from a lot of rules

Thu, 17 August 2017 21:46 +0000 GMT

API + restart the services that restart is required

Thu, 17 August 2017 18:50 +0000 GMT

Extracting text after certain characters in string in hive

Thu, 17 August 2017 13:45 +0000 GMT

(imply): log level settings

Thu, 17 August 2017 11:43 +0000 GMT

In Hadoop, What is the relationship between replication factor and number of nodes in cluster?

Thu, 17 August 2017 10:33 +0000 GMT

Loading large pickle files into variable accesses hard disks instead of loading into RAM

Thu, 17 August 2017 08:52 +0000 GMT

spark job performing poorly while converting text files to parquet format

Thu, 17 August 2017 08:19 +0000 GMT

File "/usr/local/hadoop-2.8.1/hue/desktop/core/src/desktop/", line 59, in entry execute_from_command_line(sys.argv)

Thu, 17 August 2017 00:49 +0000 GMT

Calculate disease prevalence and gender distribution based on lab data

Wed, 16 August 2017 23:11 +0000 GMT

How would I count how many events have a certain list item present in Keen IO?

Wed, 16 August 2017 23:00 +0000 GMT

How do I count events with multiple boolean variables in Keen IO?

Wed, 16 August 2017 22:54 +0000 GMT

Hive :Date Datatype issue

Wed, 16 August 2017 18:03 +0000 GMT

Find unique entities with multiple UUID identifiers in redshift

Wed, 16 August 2017 17:42 +0000 GMT

Overview multiple Ambari instances

Wed, 16 August 2017 13:52 +0000 GMT

How is salience in Google natural language API determined?

Wed, 16 August 2017 09:56 +0000 GMT

What options do I have for Big Data - storing/querying a history of 600,000 users statistics? [on hold]

Wed, 16 August 2017 09:33 +0000 GMT

How can i proces big(70gb) log file and extract needed information on pc with 1gb ram? [on hold]

Tue, 15 August 2017 20:29 +0000 GMT

Commands not working after editing bashrc file in ubuntu

Tue, 15 August 2017 17:45 +0000 GMT

What happens to already replicated blocks(with block size=64mb) when block size is reduced to 32MB?

Tue, 15 August 2017 13:15 +0000 GMT

unable to load data from parquet files to hive external table

Tue, 15 August 2017 08:33 +0000 GMT

Azure Data Lake Large Lookup

Tue, 15 August 2017 07:46 +0000 GMT

Data Validation in Hive

Mon, 14 August 2017 14:45 +0000 GMT

How to load large dataset to python and perform matrix operations

Mon, 14 August 2017 01:09 +0000 GMT

Terms in Hadoop [closed]

Sun, 13 August 2017 06:14 +0000 GMT