Table of Contents
- Overview of Sources
- Aggregate Functions
- Window Functions
- Exclusive Functions (Analysis Specific)
- Date & Time Functions
- String Functions
- Linear and Quadratic Regression Models
- 4PL Formula
- 4PL Fitting Algorithm
- Appendix: Functions Table
Overview of Sources
Functionality | Category | Tool | Link |
Linear and Quadratic Regression Models | Charting | statsmodels | https://www.statsmodels.org/dev/examples/notebooks/generated/wls.html |
4PL Formula | Charting | Mathworks | https://it.mathworks.com/matlabcentral/fileexchange/38122-four-parameters-logistic-regression-there-and-back-again |
4PL Fitting Algorithm | Charting | lmfit package for python | https://lmfit.github.io/lmfit-py/model.html |
Window Functions | Transforms | DuckDB | |
Aggregate Functions | Transforms | DuckDB | https://duckdb.org/docs/sql/functions/aggregates |
Computed Columns | Transforms | DuckDB and Internal (Based off of Microsoft Excel functions) |
https://duckdb.org/docs/sql/functions/aggregates |
Aggregate Functions
Aggregate functions operate on multiple rows of data and return a single result, which makes them useful for summarizing large datasets.
- average: Returns the arithmetic mean of a set of numeric values.
- count: Counts the number of non-NULL entries.
- count nulls: Counts the number of NULL values in a dataset.
- max: Returns the largest value in a dataset.
- median: Finds the median value in a set of numbers.
- min: Returns the smallest value in a dataset.
- mode: Returns the most frequent value in a dataset.
- product: Returns the product of all values in a dataset (multiplies them together).
- standard deviation: Measures the amount of variation or dispersion in a set of values.
- standard error: Measures the accuracy with which a sample distribution represents a population.
- sum: Returns the sum of all non-NULL values.
- variance: Measures how far a set of numbers are spread out from their average value.
- geometric mean: Calculates the geometric mean, which is useful when dealing with percentages or rates of change.
- text concat: Concatenates (joins) multiple text values into one.
Window Functions
Window functions perform a calculation across a set of table rows that are somehow related to the current row. Unlike aggregate functions, window functions do not group the result into a single row.
- first: Returns the first value in the window of rows.
- lag: Returns the value of a preceding row at a given offset.
- last: Returns the last value in the window of rows.
- lead: Returns the value of the following row at a given offset.
- cume dist: Calculates the cumulative distribution of a value within its partition.
- dense rank: Assigns ranks to rows without gaps in ranking values.
- percent rank: Returns the relative rank of a row as a percentage of the total.
- rank: Assigns a rank to each row within a partition, with gaps if there are ties.
- row number: Assigns a unique sequential integer to rows within a partition.
Exclusive Functions (Analysis Specific)
These are functions that are specific to certain contexts like notebook environments or particular analyses.
- countblanks: Counts the number of blank or NULL values in a dataset.
- geomean: Computes the geometric mean in specific notebook environments.
Date & Time Functions
Date functions help to manipulate and analyze date and time values.
- datebucket: Aggregates date or timestamp values into defined time intervals or "buckets."
- dateconvert: Converts a date to another format or data type.
- datetif: Returns the difference between two dates in a specified unit (e.g., days, months).
- datetrunc: Truncates a date or time to a specified precision (e.g., month, day).
String Functions
- split: Splits a string into an array based on a specified delimiter.
- extract: Extracts a substring or a portion of a string.
Linear and Quadratic Regression Models
- Linear Regression: This is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. The linear regression model assumes a linear relationship between the variables. Both linear and quadratic regressions use the Statsmodels library in Python for implementation. For more information, visit the Statsmodels documentation.
- Quadratic Regression: An extension of linear regression, quadratic regression models the relationship between the dependent and independent variables as a quadratic (parabolic) function. This is particularly useful when the data displays a curved pattern rather than a straight line. Like linear regression, it is implemented using the Statsmodels library.
It's important to note that linear, quadratic, and 4PL regressions are currently used in the charting of scatter plots, helping to visualize trends within the plotted data. These regressions differ from the other product functions, which serve different roles in the platform.
4PL Formula
- 4PL Formula: The 4-Parameter Logistic (4PL) formula is commonly used in bioassay analysis. It fits a sigmoidal curve to the data, making it ideal for modeling biological systems with a limited response range. For more details on this, refer to the Mathworks page.
4PL Fitting Algorithm
- 4PL Fitting Algorithm: This algorithm fits the 4-Parameter Logistic curve to data by minimizing the residual sum of squares between observed and predicted values. It is typically used for nonlinear curve fitting and can be implemented using the lmfit package in Python, which provides advanced curve fitting capabilities. For more information, visit the lmfit documentation.
Appendix: Functions Table
In the table below, each function is categorized based on its role within SQL and data manipulation. Some functions may belong to multiple categories (e.g., Window, Aggregate, or Computed) due to their versatility in being used across different types of operations:
- Window Functions: These functions operate over a set of rows and return a value for each row in the result set, offering flexibility in calculations across data partitions.
- Aggregate Functions: These functions return a single result by processing multiple rows of input, often used for summarizing data (e.g., calculating totals, averages).
- Computed Functions: These are functions that perform specific calculations or manipulations on columns, often transforming data within queries.
Many functions, such as count and median, can belong to more than one category because they are designed to operate both as Window and Aggregate functions, depending on how they are used in queries. Other functions, such as datebucket and split, are strictly categorized as Computed functions, used to transform or manipulate specific columns in a dataset.
Name | Type |
average | Window and Aggregate |
count | Window and Aggregate |
count nulls | Window, Aggregate, and Computed |
cume dist | Window |
datebucket | Computed |
dateconvert | Computed |
datetif | Computed |
datetrunc | Computed |
dense rank | Window |
extract | Computed |
first | Window |
geometric mean | Aggregate and Computed |
lag | Window |
last | Window |
lead | Window |
max | Window and Aggregate |
median | Window, Aggregate, and Computed |
min | Window and Aggregate |
mode | Window, Aggregate, and Computed |
percent rank | Window |
product | Window and Aggregate |
rank | Window |
row number | Window |
split | Computed |
standard deviation | Window and Aggregate |
standard error | Window, Aggregate, and Computed |
sum | Window and Aggregate |
text concat | Aggregate |
variance | Window, Aggregate, and Computed |