Analysis Functions Overview

Achala
Achala
  • Updated

Overview of Sources

Functionality Category Tool Link
Linear and Quadratic Regression Models Charting statsmodels https://www.statsmodels.org/dev/examples/notebooks/generated/wls.html
4PL Formula Charting Mathworks https://it.mathworks.com/matlabcentral/fileexchange/38122-four-parameters-logistic-regression-there-and-back-again
4PL Fitting Algorithm Charting lmfit package for python https://lmfit.github.io/lmfit-py/model.html
Window Functions Transforms DuckDB

https://duckdb.org/docs/sql/functions/aggregates

https://duckdb.org/docs/sql/functions/window_functions

Aggregate Functions Transforms DuckDB https://duckdb.org/docs/sql/functions/aggregates
Computed Columns Transforms DuckDB and Internal (Based off of Microsoft Excel functions)

https://support.microsoft.com/en-us/office/excel-functions-by-category-5f91f4e9-7b42-46d2-9bd1-63f26a86c0eb

https://duckdb.org/docs/sql/functions/aggregates

https://duckdb.org/docs/sql/functions/char.html

https://duckdb.org/docs/sql/functions/date

 

Aggregate Functions

Aggregate functions operate on multiple rows of data and return a single result, which makes them useful for summarizing large sets of data.

  1. average: Returns the arithmetic mean of a set of numeric values.
  2. count: Counts the number of non-NULL entries.
  3. count nulls: Counts the number of NULL values in a dataset.
  4. max: Returns the largest value in a dataset.
  5. median: Finds the median value in a set of numbers.
  6. min: Returns the smallest value in a dataset.
  7. mode: Returns the most frequent value in a dataset.
  8. product: Returns the product of all values in a dataset (multiplies them together).
  9. standard deviation: Measures the amount of variation or dispersion in a set of values.
  10. standard error: Measures the accuracy with which a sample distribution represents a population.
  11. sum: Returns the sum of all non-NULL values.
  12. variance: Measures how far a set of numbers are spread out from their average value.
  13. geometric mean: Calculates the geometric mean, which is useful when dealing with percentages or rates of change.
  14. text concat: Concatenates (joins) multiple text values into one.

Window Functions

Window functions perform a calculation across a set of table rows that are somehow related to the current row. Unlike aggregate functions, window functions do not group the result into a single row.

  1. first: Returns the first value in the window of rows.
  2. lag: Returns the value of a preceding row at a given offset.
  3. last: Returns the last value in the window of rows.
  4. lead: Returns the value of the following row at a given offset.
  5. cume dist: Calculates the cumulative distribution of a value within its partition.
  6. dense rank: Assigns ranks to rows without gaps in ranking values.
  7. percent rank: Returns the relative rank of a row as a percentage of the total.
  8. rank: Assigns a rank to each row within a partition, with gaps if there are ties.
  9. row number: Assigns a unique sequential integer to rows within a partition.

Exclusive Functions (Analysis Specific)

These are functions that are specific to certain contexts like notebook environments or particular analyses.

  1. countblanks: Counts the number of blank or NULL values in a dataset.
  2. geomean: Computes the geometric mean in specific notebook environments.

Date & Time Functions

Date functions help to manipulate and analyze date and time values.

  1. datebucket: Aggregates date or timestamp values into defined time intervals or "buckets."
  2. dateconvert: Converts a date to another format or data type.
  3. datetif: Returns the difference between two dates in a specified unit (e.g., days, months).
  4. datetrunc: Truncates a date or time to a specified precision (e.g., month, day).

String Functions

  1. split: Splits a string into an array based on a specified delimiter.
  2. extract: Extracts a substring or a portion of a string.

Search Functions

  1. vlookup: Locates a value in a specified column, then returns a value from a different column in the same located row

Linear and Quadratic Regression Models

  • Linear Regression: This is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. The linear regression model assumes a linear relationship between the variables. Both linear and quadratic regressions use the Statsmodels library in Python for implementation. For more information, visit the Statsmodels documentation.
  • Quadratic Regression: An extension of linear regression, quadratic regression models the relationship between the dependent and independent variables as a quadratic (parabolic) function. This is particularly useful when the data displays a curved pattern rather than a straight line. Like linear regression, it is implemented using the Statsmodels library.

It's important to note that linear, quadratic, and 4PL regressions are currently used in the charting of scatter plots, helping to visualize trends within the plotted data. These regressions differ from the other product functions, which serve different roles in the platform.

4PL Formula

  • 4PL Formula: The 4-Parameter Logistic (4PL) formula is commonly used in bioassay analysis. It fits a sigmoidal curve to the data, making it ideal for modeling biological systems with a limited response range. For more details on this, refer to the Mathworks page.

4PL Fitting Algorithm

  • 4PL Fitting Algorithm: This algorithm fits the 4-Parameter Logistic curve to data by minimizing the residual sum of squares between observed and predicted values. It is typically used for nonlinear curve fitting and can be implemented using the lmfit package in Python, which provides advanced curve fitting capabilities. For more information, visit the lmfit documentation.

Appendix: Functions Table

In the table below, each function is categorized based on its role within SQL and data manipulation. Some functions may belong to multiple categories (e.g., Window, Aggregate, or Computed) due to their versatility in being used across different types of operations:

  • Window Functions: These functions operate over a set of rows and return a value for each row in the result set, offering flexibility in calculations across data partitions.
  • Aggregate Functions: These functions return a single result by processing multiple rows of input, often used for summarizing data (e.g., calculating totals, averages).
  • Computed Functions: These are functions that perform specific calculations or manipulations on columns, often transforming data within queries.

Many functions, such as count and median, can belong to more than one category because they are designed to operate both as Window and Aggregate functions, depending on how they are used in queries. Other functions, such as datebucket and split, are strictly categorized as Computed functions, used to transform or manipulate specific columns in a dataset.

 

Name Type Implementing Function
average Window and Aggregate avg(x)
count Window and Aggregate  
count nulls Window, Aggregate, and Computed  
cume dist Window  
datebucket Computed  
dateconvert Computed  
datetif Computed  
datetrunc Computed  
dense rank Window  
extract Computed  
first Window  
geometric mean Aggregate and Computed geomean(x)
lag Window  
last Window  
lead Window  
max Window and Aggregate  
median Window, Aggregate, and Computed  
min Window and Aggregate  
mode Window, Aggregate, and Computed  
percent rank Window  
product Window and Aggregate  
rank Window  
row number Window  
split Computed  
standard deviation Window and Aggregate sample standard deviation 
stddev_samp(x)
standard error Window, Aggregate, and Computed stddev_samp(x) / sqrt(count(x))
sum Window and Aggregate  
text concat Aggregate  
variance Window, Aggregate, and Computed sample variance
var_samp(x)
vlookup Search  



Was this article helpful?

Have more questions? Submit a request