

news 2026/4/22 0:12:04

Statistical concepts

PMF and CDF

The PMF gives the value $P(X = x)$, while the CDF gives the value $P(X \le x)$, where $x$ ranges over all points on the real line.

Probability mass function (PMF)

$$p_X(x) = P(X = x)$$

is the probability that a discrete random variable takes each specific value. It is sometimes called the discrete density function. The probability mass function is usually the main way of defining a discrete probability distribution, and such functions exist for scalar or multivariate random variables whose domain is discrete (Wikipedia).

Cumulative distribution function (CDF)

$$F_X(x) = P(X \le x)$$

Also called the probability distribution function or simply the distribution function. For a continuous random variable it is the integral of the probability density function, and it completely describes the probability distribution of a real-valued random variable $X$ (Wikipedia).

Probability density function (PDF)

See: PDF 概率密度函數 Probability Density Function PDF (CSDN博客).

Central Limits

Suppose we have a set of independent random variables $X_i$ for $i = 1, \dots, n$ with

$$\mathrm{Mean}(X_i) = \mu, \qquad \mathrm{Var}(X_i) = V \quad \text{for all } i.$$

Then as $n$ becomes large, the sum

$$S_n = \sum_{i=1}^{n} X_i \to \mathrm{N}(n\mu, nV)$$

tends to become normally distributed.

Absence of Central Limits

Another case is where the moments are not defined or are infinite.

Randomness

Motivation

There are three main ways that randomness comes into data science:

1. The data themselves are often best understood as random.
2. When we want to reason under subjective uncertainty (for example in Bayesian approaches), unknown quantities can be represented as random. Often, when we make predictions, they will be probabilistic.
3. Many of the most effective, efficient and commonly used algorithms in data science, typically called Monte Carlo algorithms, exploit randomness.

Two notions are involved:

- Unpredictability
- Subjective uncertainty

The logistic map (also called the unimodal map) is a quadratic polynomial mapping (a recurrence relation). It is often used as a typical example of how complex, chaotic behaviour can arise from very simple nonlinear dynamical equations. It is an example of deterministic chaos: the rule is fully deterministic, yet its results are not easy to predict.
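The logistic map is, to some extent, a discrete-time demographic model: it can describe the evolution of a biological population, and it can be written as the one-dimensional nonlinear iteration

$$x_{n+1} = r x_n (1 - x_n).$$

Writing $t$ for the iteration time step and $\mu$ for the parameter, the same equation reads

$$x(t+1) = \mu\, x(t)\,(1 - x(t)),$$

where $x(t) \in [0, 1]$ for every $t$ and $\mu$ is an adjustable parameter. To guarantee that the map keeps $x(t)$ inside $[0, 1]$, we need $\mu \in [0, 4]$. Here $x(t)$ is the population at time $t$ as a fraction of the maximum possible population (the ratio of the existing population to the largest possible population).

As $\mu$ is varied, the equation shows different limiting behaviour (that is, different behaviour of $x(t)$ as $t \to \infty$):

- a stable point: $x(t)$ eventually settles on a single value;
- a periodic orbit: $x(t)$ jumps between two or more values;
- chaos: the long-run values of $x(t)$ never repeat and wander over some interval.

Chaos appears for most values of $\mu$ between about 3.57 and 4.

This nonlinear difference equation is intended to capture two effects:

- Reproduction: when the population is small, it grows at a rate proportional to the current population.
- Starvation (density-dependent mortality): the growth rate falls at a rate proportional to the environment's "carrying capacity" minus the current population.

As a demographic model, however, the logistic map has the problem that some initial conditions and parameter values (for example $\mu > 4$) lead to negative population sizes. This problem does not appear in the older Ricker model, which also exhibits chaotic dynamics.

[Figure: behaviour of the map for $0 < \mu < 1$]

Entropy

Another approach is to use factors external to the computer to generate randomness, such as the position and timing of mouse clicks. Here we consider using the time at which the code runs as the external factor, that is, the six digits after the decimal point of the current system-clock time (microsecond resolution).

R and Matlab: use the random-number generation functions provided by their packages.

Estimation of $\pi$ using Monte Carlo methods

Suppose we define $\pi$ as the area of a circle of radius 1 and want to estimate this number directly from that definition. We pick random values of $x$ and $y$ independently from a uniform distribution between 0 and 1, then let the random variable $Z$ equal 1 if the point $(x, y)$ falls within the unit quarter-circle and 0 otherwise. This $Z$ allows us to estimate $\pi$ because its expected value is $E[Z] = \pi/4$. We can then define a random variable $A_n$ to be the average of $n$ independent samples of $Z$. Formally:

$$A_n = \frac{1}{n}\sum_{i=1}^{n} Z_i = \frac{\pi}{4} + \varepsilon_n,$$

where $\varepsilon_n$ is a random error term.

Code operation

To deal with this error, we'll repeat the experiment $m$ times and make a list of all the estimates we get. We'll then arrange these results in ascending order and throw away a certain fraction $\alpha$ of the largest and smallest results. The remaining values should provide decent upper and lower bounds for an interval containing $\pi$.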
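As a rough guide to the size of the error term $\varepsilon_n$, note that each $Z_i$ takes only the values 0 and 1, with $P(Z_i = 1) = \pi/4$, so the spread of $A_n$ can be worked out directly. This is a sketch that assumes the $Z_i$ are independent; the symbol $\hat\pi_n = 4A_n$ is introduced here for illustration and is not part of the original notation.

$$
\begin{aligned}
\mathrm{Var}(Z) &= \frac{\pi}{4}\left(1 - \frac{\pi}{4}\right) \approx 0.1685, \\
\mathrm{Var}(A_n) &= \frac{\mathrm{Var}(Z)}{n}, \\
\mathrm{sd}(\hat\pi_n) &= 4\sqrt{\frac{\mathrm{Var}(Z)}{n}} \approx \frac{1.64}{\sqrt{n}}.
\end{aligned}
$$

With $n = 80000$, for example, this gives a standard deviation of about $0.0058$, so individual estimates of $\pi$ would typically differ from the true value in the third decimal place.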
    import numpy as np

    m = 100    # Number of estimates taken
    n = 80000  # Number of points used in each estimate

If we increase $n$ above, we should get a more accurate estimate of $\pi$ each time we run the experiment, while if we increase $m$, we'll get more accurate estimates of the endpoints of an interval containing $\pi$.

    # Generate a set of m estimates of the area of a unit-radius quarter-circle
    np.random.seed(42)  # Seed the random number generator
    A = np.zeros(m)     # A will hold our m estimates

    for i in range(0, m):
        for j in range(0, n):
            # Generate an (x, y) pair in the unit square
            x = np.random.rand()
            y = np.random.rand()

            # Decide whether the point lies in or on
            # the unit circle and set Z accordingly
            r = x**2 + y**2
            if r <= 1.0:
                Z = 1.0
            else:
                Z = 0.0

            # Add up the contribution to the current estimate
            A[i] = A[i] + Z

        # Convert the sum we've built to an estimate of pi
        A[i] = 4.0 * A[i] / float(n)

    # Calculate approximate 95% confidence interval for pi
    # based on our Monte Carlo estimates
    pi_estimates = np.sort(A)
    piLower = np.percentile(pi_estimates, 2.5)
    piUpper = np.percentile(pi_estimates, 97.5)
    print(f"We estimate that pi lies between {piLower:.3f} and {piUpper:.3f}.")

Standard distributions

Bernoulli

$$P(X = x) = p^{x}(1 - p)^{1 - x}, \qquad x = 0, 1; \quad 0 < p < 1$$

There are only two possible outcomes (a binary situation), for example success/failure or the two faces of a coin.

- Random Variable ($X$): in the context of the Bernoulli distribution, $X$ is the variable that takes the value 1 or 0, denoting the number of successes.
- Bernoulli Trial: an individual experiment or trial with only two possible outcomes.
- Bernoulli Parameter: the probability of success ($p$) in a Bernoulli distribution.
- Mean: $E[X] = \mu = p$
- Variance: $\mathrm{Var}[X] = E[X^{2}] - (E[X])^{2}$, so $\sigma^{2} = p(1 - p)$, also written $pq$ with $q = 1 - p$.

Applications of Bernoulli Distribution in Business Statistics

1. Quality Control: In manufacturing, every product undergoes quality checks. The Bernoulli distribution helps assess whether a product passes (success) or fails (failure) the quality standards. By analysing the probability of success, manufacturers can evaluate the overall quality of their production process and make improvements.
2. Market Research: The Bernoulli distribution is useful in surveys and market research when dealing with yes/no questions. For instance, when surveying customer satisfaction, responses are often categorised as satisfied (success) or dissatisfied (failure). Analysing these binary outcomes helps companies gauge customer sentiment.
3. Risk Assessment: In risk management, the Bernoulli distribution can be applied to model events with binary outcomes, such as a financial investment succeeding (success) or failing (failure). The probability of success serves as a key parameter for assessing the risk associated with specific investments or decisions.
4. Marketing Campaigns: Businesses use the Bernoulli distribution to measure the effectiveness of marketing campaigns. For instance, in email marketing, success might represent a recipient opening an email, while failure indicates not opening it. Analysing these binary responses helps refine marketing strategies and improve campaign success rates.

Difference between Bernoulli Distribution and Binomial Distribution

The Bernoulli distribution and the binomial distribution are both used to model random experiments with binary outcomes, but they differ in how they handle multiple trials or repetitions of these experiments.
| Basis | Bernoulli Distribution | Binomial Distribution |
| --- | --- | --- |
| Number of Trials | Single trial | Multiple trials |
| Possible Outcomes | 2 outcomes (1 for success, 0 for failure) | Each trial is a success or a failure; the count of successes can be 0, 1, …, n |
| Parameter | Probability of success is p | Probability of success in each trial is p and the number of trials is n |
| Random Variable | X can only be 0 or 1 | X can be any integer from 0 to n |
| Purpose | Describes single-trial events with success/failure | Models the number of successes in multiple trials |
| Example | Coin toss (Heads/Tails), Pass/Fail, Yes/No, etc. | Counting the number of successful free throws in a series of attempts, number of defective items in a batch, etc. |

Arithmetic with normally-distributed variables

Suppose we have two random variables, $X_1$ and $X_2$, that are independent and are both normally distributed with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$, respectively.

- $W = X_1 + X_2$ will also be normally distributed, with
  - mean: $\mu_W = \mu_1 + \mu_2$
  - variance: $\sigma_W^2 = \sigma_1^2 + \sigma_2^2$
- $Y = aX_1 + b$ will also be normally distributed, with
  - mean: $\mu_Y = a\mu_1 + b$
  - variance: $\sigma_Y^2 = a^2\sigma_1^2$

[Figures: PDF and CDF]

Cauchy

The Cauchy distribution has probability density function

$$f(x) = \frac{1}{\pi s\left(1 + \left((x - t)/s\right)^2\right)}$$

where $s$ is positive and $t$ can be any real number.

It has "heavy tails", which means that large values are so common that the Cauchy distribution lacks a well-defined mean and variance! But the parameter $t$ gives the location of the mode and median, which are well defined. The parameter $s$ determines the "width" of the distribution as measured using, for example, the distances between percentiles, which are also well defined.
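The lack of a well-defined mean can be seen numerically. The following is a minimal sketch, not taken from the original notes: it compares the running sample mean of normal and Cauchy samples, and checks that the median and inter-quartile range of the Cauchy sample are still well behaved. The helper names `running_mean` and `cauchy_samples`, the seed and the sample size are all illustrative assumptions.

```python
import numpy as np


def running_mean(samples):
    """Mean of the first k samples, for k = 1, ..., len(samples)."""
    return np.cumsum(samples) / np.arange(1, len(samples) + 1)


def cauchy_samples(rng, size, t=0.0, s=1.0):
    """Draw Cauchy samples with location t and scale s (s > 0)."""
    return t + s * rng.standard_cauchy(size)


rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n = 100_000

normal_rm = running_mean(rng.normal(loc=0.0, scale=1.0, size=n))
cauchy = cauchy_samples(rng, n, t=0.0, s=1.0)
cauchy_rm = running_mean(cauchy)

# The normal running mean settles near 0; the Cauchy one keeps jumping
# whenever an extreme value arrives, however large n becomes.
for k in (100, 1_000, 10_000, 100_000):
    print(f"n={k:>6}: normal mean {normal_rm[k - 1]: .4f}, "
          f"Cauchy mean {cauchy_rm[k - 1]: .4f}")

# Percentile-based summaries remain stable for the Cauchy sample:
# the median estimates t, and for this density the IQR equals 2 * s.
q25, q75 = np.percentile(cauchy, [25, 75])
print(f"Cauchy median {np.median(cauchy):.3f}, IQR {q75 - q25:.3f}")
```

The design choice here is to report percentiles rather than moments for the Cauchy sample, since only the former are guaranteed to converge.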
[Figures: PDF and CDF of the Cauchy distribution]

EDA: Exploratory data analysis

Motivation: EDA is about getting an intuitive understanding of the data, and as such different people will find different techniques useful.

Data quality

The first thing to understand is where the data come from and how accurate they are.

Star rating (this is based on experience rather than any formal theory):

| Stars | Meaning | Examples |
| --- | --- | --- |
| 4 | Numbers we can believe | Official statistics; well controlled laboratory experiments |
| 3 | Numbers that are reasonably accurate | Well conducted surveys / samples; field measurements; less well controlled experiments |
| 2 | Numbers that could be out by quite a long way | Poorly conducted surveys / samples; measurements of very noisy systems |
| 1 | Numbers that are unreliable | Highly biased / unrepresentative surveys / samples; measurements using biased / low-quality equipment |
| 0 | Numbers that have just been made up | Urban legends / memes; fabricated experimental data |

Univariate Data Vectors

Univariate case: one measurement per "thing", with each variable explored on its own.

Mathematically, we represent a univariate dataset as a length-$n$ vector:

$$x = (x_1, x_2, \dots, x_n)$$

The sample mean of a function $f(x)$ is

$$\langle f(x) \rangle = \frac{1}{n}\sum_{i=1}^{n} f(x_i) = \frac{1}{n}\left[f(x_1) + f(x_2) + \dots + f(x_n)\right]$$

Visualisation and Information

There is an important distinction in visualisations between

- Lossless ones, from which, if viewed at sufficiently high resolution, one could recover the original dataset;
- Lossy ones, where a given plot would be consistent with many different raw datasets.

Typically for complex data, choosing the lossy visualisation that loses the "right" information is key to successful visualisation.
Multivariate Exploratory Data Analysis

In real applications, we almost always have multiple features of different things measured, and so are in a multivariate rather than univariate situation.

Professional Skill

Data types

- Nominal or categorical (e.g. colours, car names): not ordered; cannot be added or compared; can be relabelled.
- Ordinal (e.g. small/medium/large): sometimes represented by numbers; can be ordered, but differences or ratios are not meaningful.
- Measurement: meaningful numbers, on which (some) operations make sense. They can be:
  - Discrete (e.g. publication year, number of cylinders): typically integer.
  - Continuous (e.g. height): precision limited only by measurement accuracy.

Measurements can be on an interval scale (e.g. temperature in degrees Celsius), a ratio scale (say, weights in kg), or a circular scale (time of day on the 24 hr clock), depending on the 0 value and on which operations yield meaningful results.

Summary Statistics

Measures of Central Tendency

Often, we are interested in what a typical value of the data is.

- The mean of the data is

  $$\mathrm{Mean}(x) = \langle x \rangle = \frac{1}{n}\sum_{i=1}^{n} x_i$$

- The median of the data is the value that sits in the middle when the data are sorted by value.
- A mode in data is a value of $x$ that is "more common" than those around it, or a "local maximum" in the density.
  - For discrete data this can be uniquely determined as the most common value.
  - For continuous data, modes need to be estimated, one aspect of a major strand in data science: estimating distributions.

Visualising

For the data, we estimate from the kernel density that there is one mode, and its location, and calculate the mean and median directly.

Example: The data are right-skewed, and as a consequence of this the mode is smallest and the mean is largest; we will consider this further. (Note that for a normal distribution all three would be equal.)
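The ordering mode < median < mean for right-skewed data can be checked on a synthetic sample. The sketch below is illustrative only: the gamma-distributed sample, the bandwidth of 0.3 and the helper `kde_mode` are assumptions made for this example and do not come from the original data set. The mean and median are computed directly, while the mode is taken as the maximiser of a Gaussian kernel density estimate on a grid.

```python
import numpy as np


def kde_mode(x, bandwidth, grid_size=512):
    """Estimate the mode as the grid point that maximises a Gaussian KDE."""
    grid = np.linspace(x.min(), x.max(), grid_size)
    u = (grid[:, None] - x[None, :]) / bandwidth
    # Unnormalised density: constant factors do not change the argmax.
    density = np.exp(-0.5 * u**2).sum(axis=1)
    return grid[np.argmax(density)]


rng = np.random.default_rng(1)
# Hypothetical right-skewed sample: gamma with shape 2 and scale 1.
x = rng.gamma(shape=2.0, scale=1.0, size=5000)

mean = x.mean()
median = np.median(x)
mode = kde_mode(x, bandwidth=0.3)

print(f"mode ~ {mode:.2f}, median ~ {median:.2f}, mean ~ {mean:.2f}")
```

For a left-skewed sample (for instance `-x`) the ordering would be reversed, which is a quick way to sanity-check the helper.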
Variance

| Property | Biased variance | Unbiased variance |
| --- | --- | --- |
| Denominator | $n$ | $n - 1$ |
| Purpose | Describes the dispersion of the sample | Estimates the variance of the population |
| Bias | Underestimates the population variance | Unbiased estimate of the population variance |
| Typical use | Data analysis; sample-based optimisation in machine learning | Population variance estimation in statistics |

When to use which:

- Biased variance: in machine learning the biased sample variance (denominator $n$) is usually computed, because the focus is on fitting the model to the sample rather than on inference about the population.
- Unbiased variance: in statistics and inference the unbiased variance (denominator $n - 1$) is needed, because it estimates the population parameter more accurately.

$$
\begin{aligned}
\mathrm{Var}(x) &= \left\langle (x - \langle x \rangle)^2 \right\rangle \\
&= \frac{1}{n}\sum_{i=1}^{n} (x_i - \langle x \rangle)^2 \\
&= \frac{1}{n}\sum_{i=1}^{n} \left(x_i^2 - 2x_i\langle x \rangle + \langle x \rangle^2\right) \\
&= \left(\frac{1}{n}\sum_{i=1}^{n} x_i^2\right) - 2\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)\langle x \rangle + \frac{1}{n}\left(\sum_{i=1}^{n} 1\right)\langle x \rangle^2 \\
&= \frac{1}{n}\left(\sum_{i=1}^{n} x_i^2\right) - \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)^2 \\
&= \langle x^2 \rangle - \langle x \rangle^2
\end{aligned}
$$

Unbiased Variance and Computation

$$
\begin{aligned}
\widehat{\mathrm{Var}}(x) &= \frac{n}{n - 1}\mathrm{Var}(x) \\
&= \frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \langle x \rangle)^2 \\
&= \frac{1}{n - 1}\left(\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right)
\end{aligned}
$$

By default, Python (NumPy's `np.var`) computes the biased variance, while R's `var` computes the unbiased sample variance.

"Natural" units

There are two commonly used quantities that have the same units as the data:

- mean: $\mu = \mathrm{Mean}(x)$
- standard deviation: $\sigma = \sqrt{\mathrm{Var}(x)}$

These two quantities let us define two transformations commonly applied to data:

- centring: $y_i = x_i - \mu$, which gives $\mathrm{Mean}(y) = 0$
- standardisation: $z_i = \dfrac{y_i}{\sigma}$, which gives $\mathrm{Var}(z) = 1$

Higher moments

In general, the $r$-th moment of the data is

$$m_r = \langle x^r \rangle$$

The $r$-th central moment of the data is

$$\mu_r = \langle (x - \mu)^r \rangle = \langle y^r \rangle$$

where the $y$'s are the centred versions of the data.

The $r$-th standardised moment of the data is

$$\tilde{\mu}_r = \left\langle \left(\frac{x - \mu}{\sigma}\right)^r \right\rangle = \langle z^r \rangle = \frac{\langle (x - \mu)^r \rangle}{\sigma^r} = \frac{\mu_r}{\sigma^r}$$

In theory, all higher moments are informative about the data, but in practice those with $r = 3$ and $r = 4$ are most commonly reported.

Standardised moment

$$M_k = \frac{\mu_k}{\sigma^k} = \frac{\text{central moment}}{(\text{standard deviation})^k}$$

- $M_k$: the $k$-th standardised moment (the same quantity as $\tilde{\mu}_k$ above).
- $\mu_k$: the $k$-th central moment.
- $\sigma$: the standard deviation.

Dividing by the $k$-th power of the standard deviation makes the moment dimensionless, which makes distributions easier to compare.

- First standardised moment, $M_1 = \dfrac{\mu_1}{\sigma}$: relates to the central position of the distribution, but is 0 when moments are taken about the mean.
- Second standardised moment, $M_2 = \dfrac{\mu_2}{\sigma^2}$: identically 1, because the data have already been scaled by the standard deviation.
- Third standardised moment, the skewness: $M_3 = \dfrac{\mu_3}{\sigma^3} = \tilde{\mu}_3 = \mathrm{Skew}(x)$. It describes the symmetry or degree of skew of the distribution:
  - $M_3 > 0$: the distribution is skewed to the right (longer right tail).
  - $M_3 < 0$: the distribution is skewed to the left (longer left tail).
  - $M_3 = 0$: the two sides balance, as they do for a symmetric distribution.

A larger (more positive) value of this quantity indicates right-skewness, meaning that more of the data's variability arises from values of $x$ larger than the mean.
Conversely, a smaller (more negative) value of this quantity indicates left-skewness, meaning that more of the data's variability arises from values of $x$ smaller than the mean. A value close to zero means that the variability of the data is similar either side of the mean (but does not imply an overall symmetric distribution).

- Fourth standardised moment, the kurtosis: $M_4 = \dfrac{\mu_4}{\sigma^4}$. It describes how peaked or flat the distribution is:
  - $M_4 > 3$: peaked distribution.
  - $M_4 < 3$: flat distribution.

Uses:

- Describing the shape of a distribution: skewness and kurtosis are the most commonly used standardised moments, used to study the symmetry and tail behaviour of data.
- Checking model assumptions: for example, judging whether data are consistent with a normal distribution.
- Comparing distributions: standardisation removes the effect of scale and units, so the shape of different data sets can be compared directly.

For the kurtosis:

- A value larger than 3 means that more of the variance of the data arises from the tails than would be expected if it were normally distributed.
- A value less than 3 means that less of the variance of the data arises from the tails than would be expected if it were normally distributed.
- A value close to 3 is consistent with, though not strong evidence for, a normal distribution.
- The difference between the kurtosis and 3 is called the excess kurtosis.

Quantiles and Order Statistics

- The $z$-th percentile, $P_z$, is the value of $x$ for which $z\%$ of the data is $\le x$.
- So the median is $\mathrm{median}(x) = P_{50}$.
- This is related to the ECDF: $P_z$ is the point at which the ECDF reaches $z/100$. [Figure: ECDF, not reproduced here]
- A measure of dispersal of the data is the inter-quartile range

  $$\mathrm{IQR}(x) = P_{75} - P_{25}$$

Density Estimation

Histograms

A histogram can be used to make an estimate of the probability density underlying a data set.

Given data $\{x_1, \dots, x_n\}$ and a collection of $q + 1$ bin boundaries $b = (b_0, b_1, \dots, b_q)$, chosen so that $b_0 < \min(x)$ and $\max(x) < b_q$, we can think of the histogram-based density estimate as a piecewise-constant (that is, constant on intervals) function arranged so that the value of the estimator in the interval $b_{a-1} \le x < b_a$ is

$$f(x \mid b) = \frac{1}{b_a - b_{a-1}}\left(\frac{\left|\{x_j \mid b_{a-1} \le x_j < b_a\}\right|}{n}\right)$$

where the second factor is the proportion of the $x_j$ that fall into the interval and $b_a - b_{a-1}$ is the width of the interval. These choices mean that the bar (of the histogram) above the interval has an area equal to the proportion of the data points $x_j$ that fall in that interval.

Estimating a Density with Kernels

$$\widehat{f}(x \mid w) = \frac{1}{n}\sum_{j=1}^{n} \frac{1}{w} K\left(\frac{x - x_j}{w}\right)$$

The main players in this formula are

- $K(x)$: the kernel, typically some bump-shaped function such as a Gaussian or a parabolic bump.
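  It should be normalised in the sense that

  $$\int_{-\infty}^{\infty} K(x)\,dx = 1$$

- $w$: the bandwidth, which sets the width of the bumps.

Kernel Density Estimation

KDE is a non-parametric method for estimating the probability density function (PDF) of a random variable. It gives a smooth description of the distribution of the data without relying on a specific distributional assumption (such as normality).

Goal

The goal of KDE is to estimate the underlying probability density function from a finite sample. Like a histogram, KDE describes the distribution of the data, but it is smoother than a histogram and does not depend on a particular choice of bins.

Core formula

Given $n$ data points $\{x_1, x_2, \dots, x_n\}$, the KDE estimate at position $x$ is

$$\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

- $\hat{f}(x)$: the density estimate at $x$.
- $K(\cdot)$: the kernel function, which defines how the smoothing weight is spread out.
- $h$: the bandwidth parameter, which controls the degree of smoothing (it plays the role of $w$ in the formula above).
- $x_i$: the data points.

Kernel function $K(\cdot)$

The kernel is a symmetric, non-negative function that integrates to 1; it is used to assign a weight around each point. Common kernels:

- Gaussian kernel: $K(u) = \dfrac{1}{\sqrt{2\pi}} e^{-u^2/2}$
- Uniform kernel: $K(u) = \dfrac{1}{2}$ if $|u| \le 1$, otherwise 0
- Triangular kernel: $K(u) = 1 - |u|$ if $|u| \le 1$, otherwise 0

Bandwidth $h$

The bandwidth controls how far each kernel spreads, and the choice of $h$ matters a great deal:

- $h$ too small: the estimate fluctuates too much (overfitting).
- $h$ too large: the estimate is too smooth (underfitting).

The core idea of KDE is to use the kernel $K(\cdot)$ to "cover" each data point smoothly. A kernel is centred on every data point, its width is set by the bandwidth $h$, and the results are added up to give a continuous probability density curve.

Comparison of KDE and histograms

| Feature | Histogram | KDE |
| --- | --- | --- |
| Bins | Data are divided into fixed-width bins | No fixed bins needed |
| Smoothness | The curve may be discontinuous, with sharp edges | The curve is continuous and smooth |
| Parameters | Bin width | Kernel function and bandwidth |
| Flexibility | Sensitive to bin position | More flexible; suited to complex distributions |

Applications

- Visualising the distribution of data: for example, looking at central tendency and the shape of the distribution.
- Anomaly detection: identifying data points that do not fit the estimated density.
- Probability density estimation: modelling feature distributions in machine learning and statistical modelling.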
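The two density estimators can be written out in a few lines of NumPy. This is a minimal sketch under stated assumptions, not a reference implementation: the function names (`kde`, `histogram_density` and the three kernel helpers), the synthetic normal sample, the evaluation grid and the bandwidth values are all chosen here for illustration. It follows the histogram formula and the core KDE formula, and prints how the estimate changes as $h$ goes from too small to too large.

```python
import numpy as np


def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)


def uniform_kernel(u):
    return np.where(np.abs(u) <= 1.0, 0.5, 0.0)


def triangular_kernel(u):
    return np.where(np.abs(u) <= 1.0, 1.0 - np.abs(u), 0.0)


def kde(x_eval, data, h, kernel=gaussian_kernel):
    """f_hat(x) = 1/(n h) * sum_i K((x - x_i) / h) at each point of x_eval."""
    u = (x_eval[:, None] - data[None, :]) / h
    return kernel(u).sum(axis=1) / (len(data) * h)


def histogram_density(data, edges):
    """Piecewise-constant estimate: (proportion in bin) / (bin width)."""
    counts, _ = np.histogram(data, bins=edges)
    return counts / len(data) / np.diff(edges)


rng = np.random.default_rng(2)
data = rng.normal(size=500)          # hypothetical sample
grid = np.linspace(-6.0, 6.0, 1201)  # evaluation points
dx = grid[1] - grid[0]

# Small h gives a spiky estimate, large h a flat one.
for h in (0.05, 0.4, 2.0):
    f_hat = kde(grid, data, h)
    print(f"h={h}: integral ~ {f_hat.sum() * dx:.3f}, peak ~ {f_hat.max():.3f}")

# Histogram estimate on 24 equal-width bins covering the data.
edges = np.linspace(-6.0, 6.0, 25)
hist = histogram_density(data, edges)
print(f"histogram total area = {(hist * np.diff(edges)).sum():.3f}")
```

Passing `kernel=uniform_kernel` or `kernel=triangular_kernel` to `kde` swaps the bump shape without changing the rest of the calculation, which is usually a smaller effect than changing the bandwidth.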


http://www.hkea.cn/news/14361107/