Analysis Functions Overview

Achala
Achala
  • Updated

Table of Contents

  1. Overview of Sources
  2. Aggregate Functions
  3. Window Functions
  4. Exclusive Functions (Analysis Specific)
    1. Date & Time Functions
    2. String Functions
  5. Linear and Quadratic Regression Models
  6. 4PL Formula
  7. 4PL Fitting Algorithm
  8. Appendix: Functions Table

 

Overview of Sources

 

Functionality Category Tool Link
Linear and Quadratic Regression Models Charting statsmodels https://www.statsmodels.org/dev/examples/notebooks/generated/wls.html
4PL Formula Charting Mathworks https://it.mathworks.com/matlabcentral/fileexchange/38122-four-parameters-logistic-regression-there-and-back-again
4PL Fitting Algorithm Charting lmfit package for python https://lmfit.github.io/lmfit-py/model.html
Window Functions Transforms DuckDB

https://duckdb.org/docs/sql/functions/aggregates

https://duckdb.org/docs/sql/functions/window_functions

Aggregate Functions Transforms DuckDB https://duckdb.org/docs/sql/functions/aggregates
Computed Columns Transforms DuckDB and Internal (Based off of Microsoft Excel functions)

https://support.microsoft.com/en-us/office/excel-functions-by-category-5f91f4e9-7b42-46d2-9bd1-63f26a86c0eb

https://duckdb.org/docs/sql/functions/aggregates

https://duckdb.org/docs/sql/functions/char.html

https://duckdb.org/docs/sql/functions/date

 

Aggregate Functions

Aggregate functions operate on multiple rows of data and return a single result, which makes them useful for summarizing large datasets.

  1. average: Returns the arithmetic mean of a set of numeric values.
  2. count: Counts the number of non-NULL entries.
  3. count nulls: Counts the number of NULL values in a dataset.
  4. max: Returns the largest value in a dataset.
  5. median: Finds the median value in a set of numbers.
  6. min: Returns the smallest value in a dataset.
  7. mode: Returns the most frequent value in a dataset.
  8. product: Returns the product of all values in a dataset (multiplies them together).
  9. standard deviation: Measures the amount of variation or dispersion in a set of values.
  10. standard error: Measures the accuracy with which a sample distribution represents a population.
  11. sum: Returns the sum of all non-NULL values.
  12. variance: Measures how far a set of numbers are spread out from their average value.
  13. geometric mean: Calculates the geometric mean, which is useful when dealing with percentages or rates of change.
  14. text concat: Concatenates (joins) multiple text values into one.

 

Window Functions

Window functions perform a calculation across a set of table rows that are somehow related to the current row. Unlike aggregate functions, window functions do not group the result into a single row.

  1. first: Returns the first value in the window of rows.
  2. lag: Returns the value of a preceding row at a given offset.
  3. last: Returns the last value in the window of rows.
  4. lead: Returns the value of the following row at a given offset.
  5. cume dist: Calculates the cumulative distribution of a value within its partition.
  6. dense rank: Assigns ranks to rows without gaps in ranking values.
  7. percent rank: Returns the relative rank of a row as a percentage of the total.
  8. rank: Assigns a rank to each row within a partition, with gaps if there are ties.
  9. row number: Assigns a unique sequential integer to rows within a partition.

 

 

Exclusive Functions (Analysis Specific)

These are functions that are specific to certain contexts like notebook environments or particular analyses.

  1. countblanks: Counts the number of blank or NULL values in a dataset.
  2. geomean: Computes the geometric mean in specific notebook environments.

Date & Time Functions

Date functions help to manipulate and analyze date and time values.

 

  1. datebucket: Aggregates date or timestamp values into defined time intervals or "buckets."
  2. dateconvert: Converts a date to another format or data type.
  3. datetif: Returns the difference between two dates in a specified unit (e.g., days, months).
  4. datetrunc: Truncates a date or time to a specified precision (e.g., month, day).

String Functions

  1. split: Splits a string into an array based on a specified delimiter.
  2. extract: Extracts a substring or a portion of a string.

 

Linear and Quadratic Regression Models

  • Linear Regression: This is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. The linear regression model assumes a linear relationship between the variables. Both linear and quadratic regressions use the Statsmodels library in Python for implementation. For more information, visit the Statsmodels documentation.
  • Quadratic Regression: An extension of linear regression, quadratic regression models the relationship between the dependent and independent variables as a quadratic (parabolic) function. This is particularly useful when the data displays a curved pattern rather than a straight line. Like linear regression, it is implemented using the Statsmodels library.

It's important to note that linear, quadratic, and 4PL regressions are currently used in the charting of scatter plots, helping to visualize trends within the plotted data. These regressions differ from the other product functions, which serve different roles in the platform.

 

4PL Formula

  • 4PL Formula: The 4-Parameter Logistic (4PL) formula is commonly used in bioassay analysis. It fits a sigmoidal curve to the data, making it ideal for modeling biological systems with a limited response range. For more details on this, refer to the Mathworks page.

 

4PL Fitting Algorithm

  • 4PL Fitting Algorithm: This algorithm fits the 4-Parameter Logistic curve to data by minimizing the residual sum of squares between observed and predicted values. It is typically used for nonlinear curve fitting and can be implemented using the lmfit package in Python, which provides advanced curve fitting capabilities. For more information, visit the lmfit documentation.

Appendix: Functions Table

In the table below, each function is categorized based on its role within SQL and data manipulation. Some functions may belong to multiple categories (e.g., Window, Aggregate, or Computed) due to their versatility in being used across different types of operations:

  • Window Functions: These functions operate over a set of rows and return a value for each row in the result set, offering flexibility in calculations across data partitions.
  • Aggregate Functions: These functions return a single result by processing multiple rows of input, often used for summarizing data (e.g., calculating totals, averages).
  • Computed Functions: These are functions that perform specific calculations or manipulations on columns, often transforming data within queries.

Many functions, such as count and median, can belong to more than one category because they are designed to operate both as Window and Aggregate functions, depending on how they are used in queries. Other functions, such as datebucket and split, are strictly categorized as Computed functions, used to transform or manipulate specific columns in a dataset.

 

Name Type
average Window and Aggregate
count Window and Aggregate
count nulls Window, Aggregate, and Computed
cume dist Window
datebucket Computed
dateconvert Computed
datetif Computed
datetrunc Computed
dense rank Window
extract Computed
first Window
geometric mean Aggregate and Computed
lag Window
last Window
lead Window
max Window and Aggregate
median Window, Aggregate, and Computed
min Window and Aggregate
mode Window, Aggregate, and Computed
percent rank Window
product Window and Aggregate
rank Window
row number Window
split Computed
standard deviation Window and Aggregate
standard error Window, Aggregate, and Computed
sum Window and Aggregate
text concat Aggregate
variance Window, Aggregate, and Computed



Was this article helpful?

Have more questions? Submit a request