Databricks Hosted Datasets

The data contained within this directory is hosted for users to build data pipelines using Apache Spark and Databricks.

Rdatasets

Rdatasets is a collection of 747 datasets that were originally distributed alongside the statistical software environment R and some of its add-on packages. The goal is to make these data more broadly accessible for teaching and statistical software development.

For the list of available datasets (csv and docs) and more information, please see the README file within the latest data subdirectory.

Versions

  • data-001 is from the git hash: aa0d6940a9

Airline On-Time Statistics and Delay Causes

Background

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released.

FAQ Information is available at http://www.rita.dot.gov/bts/help_with_data/aviation/index.html

Data Source

http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp

Usage Restrictions

Amazon Reviews datasets

The data20K and test4K datasets were created by Professor Julian McAuley at the University of California San Diego, with permission for their use in the databricks-datasets bucket by Databricks users.

Source: Image-based recommendations on styles and substitutes. J. McAuley, C. Targett, J. Shi, A. van den Hengel. SIGIR, 2015.

Flight Performance Datasets 1997-2008

http://stat-computing.org/dataexpo/2009/the-data.html

Planes dataset: http://stat-computing.org/dataexpo/2009/supplemental-data.html

Bike Sharing Dataset

Hadi Fanaee-T

Laboratory of Artificial Intelligence and Decision Support (LIAAD), University of Porto INESC Porto, Campus da FEUP Rua Dr. Roberto Frias, 378 4200 - 465 Porto, Portugal

Background

Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic. Through these systems, a user is able to easily rent a bike from a particular position and return it at another position. Currently, there are over 500 bike-sharing programs around the world, comprising more than 500 thousand bicycles. Today, there is great interest in these systems due to their important role in traffic, environmental, and health issues.

Apart from the interesting real-world applications of bike sharing systems, the characteristics of the data generated by these systems make them attractive for research. As opposed to other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most important events in the city could be detected by monitoring these data.

Dataset

The bike-sharing rental process is highly correlated to environmental and seasonal settings. For instance, weather conditions, precipitation, day of week, season, hour of the day, etc. can affect rental behavior. The core data set is the two-year historical log for 2011 and 2012 from the Capital Bikeshare system, Washington D.C., USA, which is publicly available at http://capitalbikeshare.com/system-data. We aggregated the data on an hourly and daily basis and then extracted and added the corresponding weather and seasonal information. Weather information was extracted from http://www.freemeteo.com.

Associated Tasks

* Regression:
	* Prediction of bike rental count, hourly or daily, based on the environmental and seasonal settings.

* Event and Anomaly Detection:
	* The count of rented bikes is also correlated to some events in the town, which are easily traceable via search engines.
	For instance, a query like "2012-10-30 washington d.c." in Google returns results related to Hurricane Sandy. Some of the important events are
	identified in [1]. Therefore the data can be used for validation of anomaly or event detection algorithms as well.

Files

* hour.csv : bike sharing counts aggregated on hourly basis. Records: 17379 hours
* day.csv : bike sharing counts aggregated on daily basis. Records: 731 days

Dataset characteristics

Both hour.csv and day.csv have the following fields, except hr, which is not available in day.csv:

- instant: record index
- dteday : date
- season : season (1: spring, 2: summer, 3: fall, 4: winter)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- hr : hour (0 to 23)
- holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- weekday : day of the week
- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0
- weathersit : 
	- 1: Clear, Few clouds, Partly cloudy
	- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
	- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
	- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are divided by 41 (max)
- atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
- hum: Normalized humidity. The values are divided by 100 (max)
- windspeed: Normalized wind speed. The values are divided by 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered
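The normalization above can be undone by multiplying each field by its maximum. A minimal sketch, using the scale factors listed in the field descriptions; the `denormalize` helper and sample row are invented here for illustration, not part of the dataset's tooling:

```python
# Scale factors taken from the field list above (temp/41, atemp/50, hum/100, windspeed/67).
SCALES = {"temp": 41.0, "atemp": 50.0, "hum": 100.0, "windspeed": 67.0}

def denormalize(record):
    """Return a copy of a row dict with temp/atemp/hum/windspeed back in raw units."""
    out = dict(record)
    for field, max_value in SCALES.items():
        if field in out:
            out[field] = out[field] * max_value
    return out

# Hypothetical row; note that cnt is the sum of casual and registered counts.
row = {"temp": 0.5, "hum": 0.8, "casual": 3, "registered": 13, "cnt": 16}
raw = denormalize(row)
assert raw["temp"] == 20.5                              # 0.5 * 41
assert row["casual"] + row["registered"] == row["cnt"]  # cnt = casual + registered
```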

License

Use of this dataset in publications must cite the following publication:

[1] Fanaee-T, Hadi, and Gama, Joao, “Event labeling combining ensemble detectors and background knowledge”, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, doi:10.1007/s13748-013-0040-3.

@article{
  year={2013},
  issn={2192-6352},
  journal={Progress in Artificial Intelligence},
  doi={10.1007/s13748-013-0040-3},
  title={Event labeling combining ensemble detectors and background knowledge},
  url={http://dx.doi.org/10.1007/s13748-013-0040-3},
  publisher={Springer Berlin Heidelberg},
  keywords={Event labeling; Event detection; Ensemble learning; Background knowledge},
  author={Fanaee-T, Hadi and Gama, Joao},
  pages={1-15}
}

Contact

For further information about this dataset please contact Hadi Fanaee-T (hadi.fanaee@fe.up.pt)

CAVIAR Test Case Scenarios: Clips from INRIA (1st Set)

This data set was obtained from http://homepages.inf.ed.ac.uk/rbf/CAVIAR/. The source of this data is: EC Funded CAVIAR project/IST 2001 37540.

Data Set Information

The dataset structure is as follows:

Clips from INRIA (1st Set) from the CAVIAR Test Case Scenarios [1]:

* /databricks-datasets/cctvVideos/train/
* /databricks-datasets/cctvVideos/test/

Derived from the above datasets

All other folders contain datasets derived from the above Clips from INRIA (1st Set) from the CAVIAR Test Case Scenarios, as described below.

/databricks-datasets/cctvVideos/mp4/  			# MP4 videos generated from the above videos
/databricks-datasets/cctvVideos/labels/			# Manually created labels categorizing suspicious images
/databricks-datasets/cctvVideos/train_images	# Hive-style partitioning of labelled images

MP4 version of videos

The MP4 videos stored in /databricks-datasets/cctvVideos/mp4/ were created by Databricks using the following command.

brew install ffmpeg

for x in *.MPG; do
	ffmpeg -i "$x" -strict experimental -f mp4 \
	       -vcodec libx264 -acodec aac \
	       -ab 160000 -ac 2 -preset slow \
	       -crf 22 "${x/.MPG/.mp4}"
done

Labels

Stored within /databricks-datasets/cctvVideos/labels/, these are manually created labels identifying which images (extracted from the training videos) are considered suspicious, per the blog post Identify Suspicious Behavior in Video with Databricks Runtime for Machine Learning.

Training Labeled Images

Stored within /databricks-datasets/cctvVideos/train_images; these are images labeled using Hive-style partitioning, where label=0 denotes non-suspicious images and label=1 denotes suspicious images, per the previously noted Labels section.
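With Hive-style partitioning, the label is encoded in the directory name rather than in the file. A minimal sketch of recovering it from a path; the file name `frame0042.jpg` is hypothetical, and only the label=0 / label=1 directory convention comes from this README:

```python
def label_from_path(path):
    """Extract the integer label from a Hive-style partitioned path."""
    for part in path.split("/"):
        if part.startswith("label="):
            return int(part.split("=", 1)[1])
    raise ValueError(f"no label= partition in {path!r}")

# Hypothetical image path under the partitioned training folder.
p = "/databricks-datasets/cctvVideos/train_images/label=1/frame0042.jpg"
assert label_from_path(p) == 1  # 1 = suspicious, per the Labels section
```

Spark readers apply the same convention automatically, exposing `label` as a column when reading the partitioned directory.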

Citation

Applicable citations:

  1. EC Funded CAVIAR project/IST 2001 37540, found at URL: http://homepages.inf.ed.ac.uk/rbf/CAVIAR/

Credit Card Fraud Dataset

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced; the positive class (frauds) accounts for 0.172% of all transactions.

The dataset contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. The feature ‘pcaVector’ is a vector of the principal components obtained with PCA; the only features which have not been transformed with PCA are ‘time’ and ‘amountRange’. The feature ‘time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature ‘amountRange’ is the approximate amount of the transaction, represented as an integer between 0 and 7 corresponding to the dollar ranges 0-1, 1-5, 5-10, 10-20, 20-50, 50-100, 100-200, and 200+, respectively. The feature ‘label’ is the response variable; it takes the value 1 in case of fraud and 0 otherwise.

Given the class imbalance ratio, we recommend measuring accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.

This dataset is a slightly modified version of the dataset collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available at http://mlg.ulb.ac.be/BruFence and http://mlg.ulb.ac.be/ARTML.

Please cite: Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.
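The amountRange encoding can be made concrete with a small lookup. A sketch assuming the 0-7 codes and dollar ranges described above; the `AMOUNT_RANGES` table and `describe_amount` helper are names invented here for illustration:

```python
# amountRange code -> approximate dollar range, per the dataset description.
AMOUNT_RANGES = {
    0: "0-1", 1: "1-5", 2: "5-10", 3: "10-20",
    4: "20-50", 5: "50-100", 6: "100-200", 7: "200+",
}

def describe_amount(code):
    """Translate an integer amountRange code (0-7) into its dollar range."""
    return AMOUNT_RANGES[code]

assert describe_amount(0) == "0-1"
assert describe_amount(7) == "200+"
```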

Data.gov Datasets

This folder houses data that is copied from http://www.data.gov/. This vast trove of data is published and maintained by the government of the United States.

We only provide a small subset of datasets that are published on the site and it’s worth exploring http://www.data.gov/ itself if you want to find other data to work with!

Datasets

This folder contains all of the datasets used in The Definitive Guide.

The datasets are as follows.

Flight Data

This data comes from the United States Bureau of Transportation. Please see the website for more information: https://www.rita.dot.gov/bts/help_with_data/aviation/index.html

Retail Data

Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197–208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).

The data was downloaded from the UCI Machine Learning Repository. Please see this page for more information: http://archive.ics.uci.edu/ml/datasets/Online+Retail

Bike Data

This data comes from the Bay Area Bike Share network. Please see this page for more information: http://www.bayareabikeshare.com/open-data

Sensor Data (Heterogeneity Human Activity Recognition Dataset)

Allan Stisen, Henrik Blunck, Sourav Bhattacharya, Thor Siiger Prentow, Mikkel Baun Kjærgaard, Anind Dey, Tobias Sonne, and Mads Møller Jensen “Smart Devices are Different: Assessing and Mitigating Mobile Sensing Heterogeneities for Activity Recognition” In Proc. 13th ACM Conference on Embedded Networked Sensor Systems (SenSys 2015), Seoul, Korea, 2015. [Web Link]

The data was downloaded from the UCI Machine Learning Repository. It is formally known as the Heterogeneity Human Activity Recognition Dataset. Please see this page for more information: https://archive.ics.uci.edu/ml/datasets/Heterogeneity+Activity+Recognition

On-Time Performance Datasets

The source airports dataset can be found at OpenFlights Airport, airline and route data.

The flights dataset, also known as the departuredelays dataset, can be found at Airline On-Time Performance and Causes of Flight Delays: On_Time Data.

Flowers (images)

This data set was obtained from

https://www.tensorflow.org/datasets/catalog/tf_flowers

The source of the data is:

Author: “The TensorFlow Team”, Title: “Flowers”, Url: “http://download.tensorflow.org/example_images/flower_photos.tgz”

Data Set Information

A large set of images of flowers.

License and/or Citation

All images in this archive are licensed under the Creative Commons By-Attribution License, available at: https://creativecommons.org/licenses/by/2.0/

Thanks to all of the photographers for making their work available; please be sure to credit them for any use, as per the license.

See the full list of photos and photographers in LICENSE.txt.

Citation:

@ONLINE{tfflowers,
  author = "The TensorFlow Team",
  title = "Flowers",
  month = "jan",
  year = "2019",
  url = "http://download.tensorflow.org/example_images/flower_photos.tgz"
}

Flowers

This data set was obtained from

https://www.tensorflow.org/datasets/catalog/tf_flowers

The source of the data is:

Author: “The TensorFlow Team”, Title: “Flowers”, Url: “http://download.tensorflow.org/example_images/flower_photos.tgz”

Data Set Information

A Delta table contains a large set of images of flowers. The ‘content’ column is a binary column of the images, the ‘label’ column is a string column of the labels, the ‘path’ column contains the DBFS path of the image, and the ‘size’ column contains the width and height of the image.

License and/or Citation

All images in this archive are licensed under the Creative Commons By-Attribution License, available at: https://creativecommons.org/licenses/by/2.0/

Thanks to all of the photographers for making their work available; please be sure to credit them for any use, as per the license.

(See the full list of photos and photographers in LICENSE.txt.)

Citation:

@ONLINE{tfflowers,
  author = "The TensorFlow Team",
  title = "Flowers",
  month = "jan",
  year = "2019",
  url = "http://download.tensorflow.org/example_images/flower_photos.tgz"
}

VEP Cache, RefSeq Transcripts, GRCh38

This data set was obtained from ftp://ftp.ensembl.org/pub/release-96/variation/VEP/homo_sapiens_refseq_vep_96_GRCh38.tar.gz.

The sources of the data are: Laurent Gil, Sarah E. Hunt, William McLaren (wm2@ebi.ac.uk), Anja Thormann, European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK

Data Set Information

Variant Effect Predictor cache for Assembly GRCh38, RefSeq transcripts (Ensembl release 96).

McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. The Ensembl Variant Effect Predictor. Genome Biology, 17(1):122, June 2016. doi:10.1186/s13059-016-0974-4

License and/or Citation

This data set has no restrictions: https://uswest.ensembl.org/info/about/legal/disclaimer.html

SafeGraph FootTraffic Dataset

This data set was obtained from http://databricks.com/notebooks/safegraph_patterns_simulated__1_-91d51.csv. The source of the data is simulated Monthly Foot Traffic Time Series in SafeGraph format.

Data Set Information

The Data Set Information details are on the SafeGraph page: Guide to Points of Interest Data.

License and/or Citation

This data set is derived from SafeGraph’s data schema.

IOT Device Data

This dataset was created by Databricks. It contains synthetically generated data in JSON and CSV formats, for example:

{"user_id": 12, "calories_burnt": 489.79998779296875, "num_steps": 9796, "miles_walked": 4.8979997634887695, "time_stamp": "2018-07-24 03:54:00.893775", "device_id": 10}
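Each JSON record is one device reading per line, so it can be decoded with a standard JSON parser. A minimal sketch using the sample record from above (the parsing code is illustrative, not part of the dataset):

```python
import json

# The sample device record shown in this README, one JSON object per line.
line = ('{"user_id": 12, "calories_burnt": 489.79998779296875, '
        '"num_steps": 9796, "miles_walked": 4.8979997634887695, '
        '"time_stamp": "2018-07-24 03:54:00.893775", "device_id": 10}')

record = json.loads(line)
assert record["user_id"] == 12
assert record["num_steps"] == 9796
```

In Spark, the same files can be read directly with `spark.read.json(...)`, which infers a schema like the one listed below.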

Data Set Information

Schema for data-device:

[StructField(id,LongType,false),  
 StructField(user_id,LongType,true),  
 StructField(device_id,LongType,true),  
 StructField(num_steps,LongType,true),  
 StructField(miles_walked,FloatType,true),  
 StructField(calories_burnt,FloatType,true),  
 StructField(timestamp,StringType,true),  
 StructField(value,StringType,true)]  

Schema for data-user:

[StructField(userid,IntegerType,true),
 StructField(gender,StringType,true),
 StructField(age,IntegerType,true),
 StructField(height,IntegerType,true),
 StructField(weight,IntegerType,true),
 StructField(smoker,StringType,true),
 StructField(familyhistory,StringType,true),
 StructField(cholestlevs,StringType,true),
 StructField(bp,StringType,true),
 StructField(risk,IntegerType,true)]

License and/or Citation

Copyright (2018) Databricks, Inc. This dataset is licensed under a Creative Commons Attribution 4.0 International License: https://creativecommons.org/licenses/by/4.0/.

Learning Spark - Example Data From The Book

This dataset holds the files for examples in the Learning Spark book. These examples are used throughout the book.

For more information, please see the README from the Learning Spark GitHub project.

License

The files in the Learning Spark GitHub project are licensed under the MIT license, as defined in https://github.com/databricks/learning-spark/blob/master/LICENSE.md.

Versions

  • data-001 is from the git hash: 13c39f22b1

Apache Spark

Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.

http://spark.apache.org/

Online Documentation

You can find the latest Spark documentation, including a programming guide, on the project web page and project wiki. This README file only contains basic setup instructions.

Building Spark

Spark is built using Apache Maven. To build Spark and its example programs, run:

build/mvn -DskipTests clean package

(You do not need to do this if you downloaded a pre-built package.) More detailed documentation is available from the project site, at “Building Spark”.

Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

./bin/spark-shell

Try the following command, which should return 1000:

scala> sc.parallelize(1 to 1000).count()

Interactive Python Shell

Alternatively, if you prefer Python, you can use the Python shell:

./bin/pyspark

And run the following command, which should also return 1000:

>>> sc.parallelize(range(1000)).count()

Example Programs

Spark also comes with several sample programs in the examples directory. To run one of them, use ./bin/run-example <class> [params]. For example:

./bin/run-example SparkPi

will run the Pi example locally.

You can set the MASTER environment variable when running examples to submit examples to a cluster. This can be a mesos:// or spark:// URL, “yarn” to run on YARN, “local” to run locally with one thread, or “local[N]” to run locally with N threads. You can also use an abbreviated class name if the class is in the examples package. For instance:

MASTER=spark://host:7077 ./bin/run-example SparkPi

Many of the example programs print usage help if no params are given.

Running Tests

Testing first requires building Spark. Once Spark is built, tests can be run using:

./dev/run-tests

Please see the guidance on how to run tests for a module, or individual tests.

A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported storage systems. Because the protocols have changed in different versions of Hadoop, you must build Spark against the same version that your cluster runs.

Please refer to the build documentation at “Specifying the Hadoop Version” for detailed guidance on building for a particular distribution of Hadoop, including building for particular Hive and Hive Thriftserver distributions.

Configuration

Please refer to the Configuration Guide in the online documentation for an overview on how to configure Spark.

Lending Club Statistics

This data set was obtained from https://www.lendingclub.com/info/download-data.action. The source of the data is: LendingClub, LendingClub Corporation, Dept. 34268, P.O. Box 39000, San Francisco, CA 94139

Data Set Information

These files contain complete loan data for all loans issued through the time period stated, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file containing loan data through the “present” contains complete loan data for all loans issued through the previous completed calendar quarter.

License and/or Citation

Lending Club’s website does not explicitly state which license it is sharing the data under. However, the page where one downloads the data explicitly states: “Want to slice and dice the data? Help yourself to the following exports of our loan databases.”

MNIST handwritten digits dataset

Data Source

LibSVM Datasets https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#mnist Chih-Chung Chang and Chih-Jen Lin, LIBSVM : a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.

Original Data Set Source

Yann LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998. MNIST database available at http://yann.lecun.com/exdb/mnist/
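The LibSVM distribution stores each example as a text line of the form `<label> <index>:<value> ...`. A minimal sketch of parsing one such line; the sample line and `parse_libsvm_line` helper are invented here for illustration (real MNIST lines carry many pixel features):

```python
def parse_libsvm_line(line):
    """Return (label, {feature_index: value}) for one LibSVM-format line."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        idx, val = item.split(":")
        features[int(idx)] = float(val)
    return label, features

# Hypothetical line: digit label 5 with three sparse pixel features.
label, feats = parse_libsvm_line("5 153:86 154:250 181:159")
assert label == 5.0
assert feats[154] == 250.0
```

In Spark, files in this format can also be loaded with the built-in `libsvm` data source (`spark.read.format("libsvm")`).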

20 Newsgroups Dataset – Binary Classification

This is a processed version of the 20 Newsgroups Dataset, saved in Parquet format.

Attribute Information

  • newsgroup:string, Name of Newsgroup
  • content:string, Document Content
  • relatedToSci:integer, 1/0 binary indicator to determine if article belongs to a sci newsgroup or not

List of Newsgroups:

  • alt.atheism
  • comp.graphics
  • comp.os.ms-windows.misc
  • comp.sys.ibm.pc.hardware
  • comp.sys.mac.hardware
  • comp.windows.x
  • misc.forsale
  • rec.autos
  • rec.motorcycles
  • rec.sport.baseball
  • rec.sport.hockey
  • sci.crypt
  • sci.electronics
  • sci.med
  • sci.space
  • soc.religion.christian
  • talk.politics.guns
  • talk.politics.mideast
  • talk.politics.misc
  • talk.religion.misc

Source

Original Owner and Donor

Tom Mitchell
School of Computer Science
Carnegie Mellon University
tom.mitchell@cmu.edu

Date Donated: September 9, 1999

You may use this material free of charge for any educational purpose, provided attribution is given in any lectures or publications that make use of this material.

NYC Taxi Dataset

This dataset was obtained from https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page. The source of the data is: The New York City Taxi Commission, Office of Legal Affairs, 33 Beaver Street, 22nd Floor, New York, NY 10004; Attn.: Records Access Officer

Data Set Information

This dataset contains aggregated data from the NYC Taxi and Limousine Commission on various indicators, trip counts, crash history, etc., as well as raw trip data from a variety of sources.

License and/or Citation

Public domain–this data is freely available without restriction from https://www1.nyc.gov/site/tlc/about/request-data.page

Combined Cycle Power Plant Data Set

Power Plant Sensor Readings Data Set

Source

http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

Summary

The example data is provided by UCI at the UCI Machine Learning Repository: Combined Cycle Power Plant Data Set. You can read the background on the UCI page, but in summary, a number of readings have been collected from sensors at a gas-fired power plant (also called a peaker plant), and the goal is to use those sensor readings to predict how much power the plant will generate.

Usage License

If you publish material based on databases obtained from this repository, then, in your acknowledgements, please note the assistance you received by using this repository. This will help others to obtain the same data sets and replicate your experiments. We suggest the following reference format for referring to this repository:

Pınar Tüfekci, Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods, International Journal of Electrical Power & Energy Systems, Volume 60, September 2014, Pages 126-140, ISSN 0142-0615.

Heysem Kaya, Pınar Tüfekci , Sadık Fikret Gürgen: Local and Global Learning Methods for Predicting Power of a Combined Gas & Steam Turbine, Proceedings of the International Conference on Emerging Trends in Computer and Electronics Engineering ICETCEE 2012, pp. 13-18 (Mar. 2012, Dubai)

Synthetic Retail Dataset

This dataset is a collection of files representing different dimensions and facts for a retail organization.

Provenance

This dataset was generated by Databricks.

Data Set Information

  • Sales Orders: sales_orders/sales_orders.json records the customers’ originating purchase order.
  • Purchase Orders: purchase_orders/purchase_orders.xml contains the raw materials that are being purchased.
  • Products: products/products.csv contains products that the company sells.
  • Goods Receipt: goods_receipt/goods_receipt.parquet contains the arrival time of purchased orders.
  • Customers: customers/customers.csv contains those customers who are located in the US and are buying the finished products.
  • Suppliers: suppliers/suppliers.csv contains suppliers that provide raw materials in the US.
  • Sales Stream: sales_stream/sales_stream.json/ is a folder containing JSON files for streaming purposes.
  • Promotions: promotions/promotions.csv contains additional benefits on top of normal purchases.
  • Active Promotions: active_promotions/active_promotions.parquet shows how customers are progressing towards becoming eligible for promotions.
  • Loyalty Segment: loyalty_segment/loyalty_segment.csv contains segmented customer data to appeal to all types of guests using targeted rewards and promotions.

License and/or Citation

Copyright (2020) Databricks, Inc. This dataset is licensed under a Creative Commons Attribution 4.0 International License: https://creativecommons.org/licenses/by/4.0/.

README

Introduction

Fire Calls-For-Service includes all fire unit responses to calls. Each record includes the call number, incident number, address, unit identifier, call type, and disposition. All relevant time intervals are also included. Because this dataset is based on responses, and since most calls involve multiple units, there are multiple records for each call number. Addresses are associated with a block number, intersection, or call box, not a specific address.

License

The data itself is available under an ODC Public Domain Dedication and License.

Additional Information

See https://data.sfgov.org/Public-Safety/Fire-Department-Calls-for-Service/nuek-vuh3

2013 SFO Customer Survey Data Set + Dictionary

SFO conducts a yearly comprehensive survey of our guests to gauge satisfaction with our facilities, services, and amenities. SFO compares results to previous surveys to look for areas of improvement and discover elements of the guest experience that are not satisfactory.

Source: https://data.sfgov.org/Transportation/2013-SFO-Customer-Survey-Data-Set-Dictionary/mjr8-p6m5

SMS Spam Collection v. 1

The SMS Spam Collection v.1 is a public set of labeled SMS messages that have been collected for mobile phone spam research. It contains one collection of 5,574 real, non-encoded English messages, tagged as either legitimate (ham) or spam.

Composition

This corpus was collected from free or free-for-research sources on the Internet:

  • A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. The identification of the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: http://www.grumbletext.co.uk/.
  • A subset of 3,375 randomly chosen ham messages from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available. The NUS SMS Corpus is available at: http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/.
  • A list of 450 SMS ham messages collected from Caroline Tagg’s PhD thesis, available at http://etheses.bham.ac.uk/253/1/Tagg09PhD.pdf. Finally, we have incorporated the SMS Spam Corpus v.0.1 Big, which has 1,002 SMS ham messages and 322 spam messages and is publicly available at: http://www.esp.uem.es/jmgomez/smsspamcorpus/.

You can find more useful information about the SMS Spam Collection v.1 at the following page of the UCI Repository.

http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

Usage

The collection is composed of just one file, where each line has the correct class (ham or spam) followed by the raw message.

ham   What you doing?how are you?
ham   Ok lar... Joking wif u oni...
ham   dun say so early hor... U c already then say...
ham   MY NO. IN LUTON 0125698789 RING ME IF UR AROUND! H*
ham   Siva is in hostel aha:-.
ham   Cos i was out shopping wif darren jus now n i called him 2 ask wat present he wan lor. Then he started guessing who i was wif n he finally guessed darren lor.
spam  FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now! ubscribe6GBP/ mnth inc 3hrs 16 stop?txtStop
spam  Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B
spam  URGENT! Your Mobile No 07808726822 was awarded a L2,000 Bonus Caller Prize on 02/09/03! This is our 2nd attempt to contact YOU! Call 0871-872-9758 BOX95QU
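The label-then-message layout above can be split on the first run of whitespace. A minimal sketch; the `parse_sms_line` helper is a name invented here, and it assumes only the "class followed by raw message" layout described above:

```python
def parse_sms_line(line):
    """Split one collection line into (label, message) on the first whitespace run."""
    tag, message = line.split(None, 1)
    return tag, message

tag, msg = parse_sms_line("ham   Ok lar... Joking wif u oni...")
assert tag == "ham"
assert msg == "Ok lar... Joking wif u oni..."
```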

We would appreciate:

If you find this collection useful, please make a reference to the paper below and to the web page http://dcomp.sor.ufscar.br/talmeida/smspamcollection/. Send us a message at talmeida < AT > ufscar.br or jmgomezh yahoo.es in case you make use of the corpus.

Publication and More Information

We offer a comprehensive study of this corpus in the following papers. These works present a number of interesting statistics, studies and baseline results for many traditional machine learning methods.

Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011. [preprint]

Gómez Hidalgo, J.M., Almeida, T.A., Yamakami, A. On the Validity of a New SMS Spam Collection. Proceedings of the 11th IEEE International Conference on Machine Learning and Applications (ICMLA'12), Boca Raton, FL, USA, 2012. [preprint]

Almeida, T.A., Gómez Hidalgo, J.M., Silva, T.P. Towards SMS Spam Filtering: Results under a New Dataset. International Journal of Information Security Science (IJISS), 2(1), 1-18, 2013. [Invited paper - full version]

About

The SMS Spam Collection has been created by Tiago A. Almeida and José María Gómez Hidalgo.

We would like to thank Min-Yen Kan and his team for making the NUS SMS Corpus available.

Sample of Million Song Dataset

Source

This data is a small subset of the Million Song Dataset. The original data was contributed by The Echo Nest. Prepared by T. Bertin-Mahieux <tb2332 ‘@’ columbia.edu>

Attribute Information

  • artist_id:string
  • artist_latitude:double
  • artist_longitude:double
  • artist_location:string
  • artist_name:string
  • duration:double
  • end_of_fade_in:double
  • key:int
  • key_confidence:double
  • loudness:double
  • release:string
  • song_hotnes:double
  • song_id:string
  • start_of_fade_out:double
  • tempo:double
  • time_signature:double
  • time_signature_confidence:double
  • title:string
  • year:double
  • partial_sequence:int
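The attribute list above can be captured as a simple field-to-type mapping for use when parsing raw records. A minimal Python sketch; the field names and types come directly from the list above, while the `cast_row` helper is an illustrative assumption, not part of the dataset's distribution:

```python
# Field -> type mapping for the Million Song Dataset sample,
# taken directly from the attribute list above.
MSD_SCHEMA = {
    "artist_id": str,
    "artist_latitude": float,
    "artist_longitude": float,
    "artist_location": str,
    "artist_name": str,
    "duration": float,
    "end_of_fade_in": float,
    "key": int,
    "key_confidence": float,
    "loudness": float,
    "release": str,
    "song_hotnes": float,
    "song_id": str,
    "start_of_fade_out": float,
    "tempo": float,
    "time_signature": float,
    "time_signature_confidence": float,
    "title": str,
    "year": float,
    "partial_sequence": int,
}

def cast_row(values):
    """Cast raw string values into typed fields, in schema order."""
    return {name: typ(v) for (name, typ), v in zip(MSD_SCHEMA.items(), values)}
```

The same mapping translates directly into a Spark schema if you load the data with a DataFrame reader.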

Citation

Using the dataset?

Please cite the following paper:

Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.

Acknowledgements

The Million Song Dataset was created under a grant from the National Science Foundation, project IIS-0713334. The original data was contributed by The Echo Nest, as part of an NSF-sponsored GOALI collaboration. Subsequent donations from SecondHandSongs.com, musiXmatch.com, and last.fm, as well as further donations from The Echo Nest, are gratefully acknowledged.

Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.

Fire Department Calls for Service

This data set was obtained from https://data.sfgov.org/Public-Safety/Fire-Department-Calls-for-Service/nuek-vuh3 as of 11/11/2019. The data is Open Data published by the San Francisco Fire Department and is updated daily (available at the prior link).

Data Set Information

Fire Calls-For-Service includes all fire unit responses to calls. Each record includes the call number, incident number, address, unit identifier, call type, and disposition. All relevant time intervals are also included. Because this data set is based on responses, and most calls involve multiple units, there are multiple records per call number. Addresses are associated with a block number, intersection, or call box, not a specific address.
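Since each call produces one record per responding unit, grouping by call number reconstructs whole incidents. A minimal Python sketch; the rows and the exact field names (`call_number`, `unit_id`, `call_type`) are illustrative assumptions rather than the data set's actual column names:

```python
from collections import defaultdict

# One record per responding unit; grouping by call number
# collapses the per-unit rows back into per-call incidents.
responses = [
    {"call_number": "201", "unit_id": "E01", "call_type": "Structure Fire"},
    {"call_number": "201", "unit_id": "T07", "call_type": "Structure Fire"},
    {"call_number": "202", "unit_id": "M12", "call_type": "Medical Incident"},
]

calls = defaultdict(list)
for r in responses:
    calls[r["call_number"]].append(r["unit_id"])

print(dict(calls))  # {'201': ['E01', 'T07'], '202': ['M12']}
```

On the real data the same grouping is typically done with a DataFrame `groupBy` on the call number column.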

License and/or Citation

This data set is licensed under the following license: Open Data Commons Public Domain Dedication and License (https://opendatacommons.org/licenses/pddl/1.0/)

TPC-H Data

The data in this directory was generated to run the TPC-H benchmark.

The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.

For more information, refer to the Transaction Processing Performance Council’s TPC-H page
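To illustrate the kind of business-oriented ad-hoc aggregation TPC-H exercises, here is a toy query loosely modeled on the pricing-summary style of TPC-H Query 1, run against an in-memory SQLite table. The column names follow the TPC-H `lineitem` schema; the rows are fabricated for the sketch:

```python
import sqlite3

# Toy TPC-H-style aggregation (loosely modeled on Query 1)
# over a tiny, fabricated lineitem table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE lineitem (
        l_returnflag TEXT,
        l_linestatus TEXT,
        l_quantity REAL,
        l_extendedprice REAL
    )
""")
conn.executemany(
    "INSERT INTO lineitem VALUES (?, ?, ?, ?)",
    [("N", "O", 17.0, 21168.23),
     ("N", "O", 36.0, 45983.16),
     ("R", "F", 8.0, 13309.60)],
)
rows = conn.execute("""
    SELECT l_returnflag, l_linestatus,
           SUM(l_quantity) AS sum_qty,
           SUM(l_extendedprice) AS sum_base_price
    FROM lineitem
    GROUP BY l_returnflag, l_linestatus
    ORDER BY l_returnflag, l_linestatus
""").fetchall()
print(rows)
```

The full benchmark queries add many more measures (discounts, taxes, averages) and run at much larger scale factors; this sketch only shows the shape of the workload.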

Travel Recommendations Data Set

A synthetic data set related to travel recommendations.

License and/or Citation

The dataset was generated using Databricks Labs Data Generator https://databrickslabs.github.io/dbldatagen/public_docs/index.html.

Seattle Temperature Recordings Data Set

This data set was obtained from https://w2.weather.gov/climate/index.php?wfo=sew. The source of the data is the National Weather Service. National Weather Service data is not subject to copyright protection.

Data Set Information

This is a historical weather data set containing all daily high and low temperature recordings in Seattle, WA between 01/01/2015 and 09/30/2018.

Attribute Information

  • date: The date of the temperature recording.
  • temp: The daily maximum or minimum temperature in Fahrenheit.
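With each row carrying a date and a single high or low temperature, simple scans answer basic questions. A minimal Python sketch; the values are made up for illustration (the real data spans 01/01/2015 to 09/30/2018):

```python
from datetime import date

# Illustrative rows in the (date, temp) shape described above;
# each date appears twice, once for the high and once for the low.
records = [
    (date(2015, 7, 1), 81.0),
    (date(2015, 7, 1), 58.0),
    (date(2015, 7, 2), 79.0),
]

# Highest recorded temperature and the date it occurred.
hottest = max(records, key=lambda r: r[1])
print(hottest)
```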

Wine Quality Data Set

Two datasets related to red and white variants of the Portuguese “Vinho Verde” wine.
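The UCI wine-quality files are semicolon-delimited CSVs with quoted headers. A minimal Python loading sketch using a fabricated two-row excerpt with a few of the real columns (the full files have eleven physicochemical inputs plus the quality score):

```python
import csv
import io

# Fabricated two-row excerpt in the UCI wine-quality file format:
# semicolon-delimited, quoted header names.
sample = (
    '"fixed acidity";"volatile acidity";"alcohol";"quality"\n'
    "7.4;0.7;9.4;5\n"
    "7.8;0.88;9.8;5\n"
)
reader = csv.DictReader(io.StringIO(sample), delimiter=";")
rows = [{k: float(v) for k, v in row.items()} for row in reader]

avg_quality = sum(r["quality"] for r in rows) / len(rows)
print(avg_quality)  # 5.0
```

When loading the real files with a CSV reader (including Spark's), remember to set the delimiter to `;` rather than the default comma.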

Provenance

This data set was obtained from http://archive.ics.uci.edu/ml/datasets/wine+quality. The source of the data is: Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez; A. Cerdeira, F. Almeida, T. Matos and J. Reis, Viticulture Commission of the Vinho Verde Region (CVRVV), Porto, Portugal, 2009.

License and/or Citation

This data set is licensed under the following license: see the citations below.

Applicable citations: Cortez, Paulo (2009). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

License

Unless otherwise noted (e.g. within the README for a given data set), the data is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0), which can be viewed at the following url: http://creativecommons.org/licenses/by/4.0/legalcode

Contributions and Requests

To request or contribute new datasets to this repository, please send an email to: hosted-datasets@databricks.com.

When making the request, include the README.md file you want to publish. Make sure the file includes information about the source of the data, the license, and how to get additional information. Please ensure the license for this data allows it to be hosted by Databricks and consumed by the public.