Benchmark data sizes: 500MB, 5GB, 50GB
Not only faster, but also uses less memory: https://h2oai.github.io/db-benchmark/
Most of the functions of dplyr and a bit more:
Manipulate
Filter
Sort
Compute
Group by
Bind rows / columns
Joining columns
Wide – Long format
Read + Write files
Almost everything you’d need to reshape data
If reading and writing files is all you want, use “vroom” instead
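A minimal sketch of the "Read + Write files" item, assuming data.table's fread() and fwrite(). The toy data and the temporary file path are made up for illustration.

```r
library(data.table)

# Hypothetical toy data, written to a temporary file for illustration
fruit <- data.table(Fruit = c("Orange", "Apple"), Weight = c(400, 300))
path <- tempfile(fileext = ".csv")

fwrite(fruit, path)      # write the table out as a CSV
fruit_in <- fread(path)  # read it back in as a data.table

print(fruit_in)
```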
tidyverse
Ethos: Do one thing at a time
Each function has an easy to understand name
Example:
Example dataset:
Fruit | Variety | Weight |
---|---|---|
Orange | Navel | 400 |
Apple | Jazz | 300 |
Apple | Jazz | 400 |
… | … | … |
Filter rows:
Example[Fruit == "Orange"]
Filter and add a column:
Example[Fruit == "Apple", Fruit_Variety := paste(Variety, Fruit)]
Just add a column:
Example[, BaseCost := (Weight/1000) * 2]
Perform a grouped calculation:
Example[, Fruit_Weight := mean(Weight), by = "Fruit"]
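The operations above can be tried on a small runnable version of the example dataset. This is only a sketch: the three rows are the ones shown in the table, and the object name oranges is made up for illustration.

```r
library(data.table)

# Toy data: the three rows shown in the example table
Example <- data.table(
  Fruit   = c("Orange", "Apple", "Apple"),
  Variety = c("Navel", "Jazz", "Jazz"),
  Weight  = c(400, 300, 400)
)

# Filtering in i returns a new data.table
oranges <- Example[Fruit == "Orange"]

# := adds columns by reference, so Example itself is modified
Example[Fruit == "Apple", Fruit_Variety := paste(Variety, Fruit)]  # NA for other fruit
Example[, BaseCost := (Weight / 1000) * 2]
Example[, Fruit_Weight := mean(Weight), by = "Fruit"]              # group mean on every row

print(Example)
```

Note that := changes Example in place, so no assignment with <- is needed for the last three lines.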
Why chain? Lots of concise alterations, and fewer objects saved
“I want to know all the types of fruit that will cost on average above $1 per item in order of most to least expensive”
Example2 <- Example[, Fruit_Weight := mean(Weight), by = "Fruit"][, ItemCost := (Weight/1000) * Cost.kg][ItemCost >= 1][order(-ItemCost)]
Tidyverse
Example2 <- Example %>%
  group_by(Fruit) %>%
  mutate(Fruit_Weight = mean(Weight)) %>%
  ungroup() %>%
  mutate(ItemCost = (Weight/1000) * Cost.kg) %>%
  filter(ItemCost >= 1) %>%
  arrange(desc(ItemCost))
data.table with pipes
Example dataset:
Fruit | Variety | Weight | Cost.kg |
---|---|---|---|
Orange | Navel | 400 | 2 |
Apple | Jazz | 300 | 4 |
Apple | Gala | 400 | 4.5 |
… | … | … | … |
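One way to combine data.table with pipes is sketched below. It assumes the magrittr package for %>% and uses the `.[...]` idiom to pass the piped table into data.table's bracket syntax. The three rows are the ones shown in the table, and copy() is added here so the original Example is left untouched.

```r
library(data.table)
library(magrittr)  # assumption: %>% comes from magrittr

# Toy data: the three rows shown in the example table
Example <- data.table(
  Fruit   = c("Orange", "Apple", "Apple"),
  Variety = c("Navel", "Jazz", "Gala"),
  Weight  = c(400, 300, 400),
  Cost.kg = c(2, 4, 4.5)
)

Example2 <- copy(Example) %>%   # copy() so := does not modify Example
  .[, Fruit_Weight := mean(Weight), by = "Fruit"] %>%
  .[, ItemCost := (Weight / 1000) * Cost.kg] %>%
  .[ItemCost >= 1] %>%
  .[order(-ItemCost)]           # most to least expensive

print(Example2)
```

Each step sits on its own line, as in a tidyverse pipeline, while the work is still done by data.table.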
tidyverse | data.table |
---|---|
filter(), mutate(), group_by() | Intrinsic: df[filter, mutate, by = ] |
summarise(df, sum(ColA), sd(ColB)) | df[, .(sum(ColA), sd(ColB))] |
arrange(df, ColA) | df[order(ColA)] |
select(df, ColA, ColB, ColD) | df[, .(ColA, ColB, ColD)] |
group_by(df, ColA) | df[, j, by = ColA] or df[, j, keyby = ColA] |
gather() | melt() |
spread() | dcast() |
full_join() | merge(all = TRUE) |
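The reshaping and joining equivalents can be sketched on a tiny table. The data, the ID column and the object names (long, wide, other, joined) are hypothetical, and only the data.table functions named above are used.

```r
library(data.table)

# Hypothetical toy data
df <- data.table(ID = 1:2, ColA = c(10, 20), ColB = c(1.5, 2.5))

# Wide to long (gather) and back again (spread)
long <- melt(df, id.vars = "ID", variable.name = "Col", value.name = "Value")
wide <- dcast(long, ID ~ Col, value.var = "Value")

# Full join: all = TRUE keeps IDs found in either table
other  <- data.table(ID = 2:3, ColD = c("x", "y"))
joined <- merge(df, other, by = "ID", all = TRUE)

print(joined)
```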
You: “I really want to use data.table to speed things up, but I don’t have time to learn it or alter my pre-existing code”
Me: Give dtplyr a try?
dtplyr allows you to write dplyr code that is automatically translated to the equivalent data.table code under the hood.
Just load the package with library(dtplyr) and use df2 <- lazy_dt(df) before performing your normal operations. At the end, use as.data.table() or as_tibble()
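A minimal sketch of that workflow, assuming dplyr and dtplyr are installed. The toy data frame and the name result are made up for illustration.

```r
library(data.table)
library(dplyr)
library(dtplyr)

# Hypothetical toy data frame
df <- data.frame(
  Fruit  = c("Orange", "Apple", "Apple"),
  Weight = c(400, 300, 400)
)

df2 <- lazy_dt(df)  # wrap once; nothing is computed yet

result <- df2 %>%
  group_by(Fruit) %>%
  summarise(Fruit_Weight = mean(Weight)) %>%
  as.data.table()   # this step runs the translated data.table code

print(result)
```

Calling show_query() on the pipeline instead of as.data.table() prints the generated data.table code, which is a handy way to check the translation.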
Some code doesn’t translate well between the two, so check your data frames often, especially when you first start using it
Be careful with ordering, grouping, and manipulating or joining.
Package functions that don’t run efficiently won’t be much faster when run within data.table, for example when the package creator wrote the code using tidyverse dependencies
Package information page:
https://rdatatable.gitlab.io/data.table/
Github wiki:
https://github.com/Rdatatable/data.table/wiki/Getting-started
A great comparison website between data.table and dplyr (I use this a lot):
https://atrebas.github.io/post/2019-03-03-datatable-dplyr/