ocrmypdf: convert PDF files to editable (searchable) text
1. Install on Ubuntu/UOS
sudo apt install ocrmypdf tesseract-ocr-chi-sim
2. Convert
ocrmypdf -l eng+chi_sim --force-ocr input.pdf output.pdf
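To convert several PDFs with the same options, the command above can be scripted. The following is a minimal sketch rather than part of ocrmypdf: `build_ocrmypdf_command` and `convert_all` are hypothetical helper names, the `.ocr.pdf` output suffix is an arbitrary choice, and the `ocrmypdf` executable is assumed to be on PATH.

```python
import subprocess
from pathlib import Path


def build_ocrmypdf_command(src, dst, languages=("eng", "chi_sim")):
    """Build the same command line as above for one input/output pair."""
    return ["ocrmypdf", "-l", "+".join(languages), "--force-ocr", str(src), str(dst)]


def convert_all(folder):
    """Run ocrmypdf on every PDF in folder, writing NAME.ocr.pdf next to it."""
    for src in sorted(Path(folder).glob("*.pdf")):
        if src.name.endswith(".ocr.pdf"):
            continue  # skip files produced by an earlier run
        dst = src.with_name(src.stem + ".ocr.pdf")
        # check=True raises CalledProcessError if ocrmypdf exits with an error
        subprocess.run(build_ocrmypdf_command(src, dst), check=True)
```

Calling `convert_all(".")` would process the current directory; only `build_ocrmypdf_command` is needed if you just want to inspect the command first.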
head.xMin = FontBBox.xMin
head.yMin = FontBBox.yMin
head.xMax = FontBBox.xMax
head.yMax = FontBBox.yMax
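Several different fields control the extra whitespace around Source Han fonts on different platforms. One is LineGap, found by @孫志貴, which is why Source Han Sans set LineGap to 0 in V1.002. Another is FontBBox, which @厉向晨 found that Adobe's layout engine uses. Android, however, does not use that field; it uses the head table.
The cause is that glyphs such as the vertical dashes (two and three characters tall) inflate FontBBox and the head fields, and Android relies on head.yMin and head.yMax when drawing text.
For the special glyphs, see tamcy/CYanHeiHK.
1. Take the released font, modify it with Adobe's font tools, and rebuild it.
2. Call textview.setIncludeFontPadding(false) so that layout uses ascent/descent rather than top/bottom (that is, head.yMax/head.yMin).
To inspect a font's metrics: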
pip install font-line
font-line report xx.otf
For example, my modified font reports:
--- Metrics ---
[head] Units per Em: 1000
[head] yMax: 1221
[head] yMin: -488
[OS/2] TypoAscender: 880
[OS/2] TypoDescender: -120
[OS/2] WinAscent: 1160
[OS/2] WinDescent: 320
[hhea] Ascent: 1160
[hhea] Descent: -320
[hhea] LineGap: 0
[OS/2] TypoLineGap: 0
The ascent-related fields here are hhea.Ascent, OS/2 TypoAscender and OS/2 WinAscent; each is the value a different platform reads. Android uses the first two.
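When comparing several fonts it helps to pull these numbers out of the report text automatically. The sketch below assumes the report lines look exactly like the listing above (`[table] Field: value`); `parse_metrics` and `ascent_candidates` are hypothetical helpers, not part of font-line.

```python
import re

# Matches lines such as "[hhea] Ascent: 1160" or "[head] Units per Em: 1000".
# Table names may contain "/" (for example OS/2).
METRIC_LINE = re.compile(r"^\[([^\]]+)\]\s+(.+?):\s*(-?\d+)\s*$")


def parse_metrics(report_text):
    """Return {(table, field): int} for every metric line in the report text."""
    metrics = {}
    for line in report_text.splitlines():
        match = METRIC_LINE.match(line.strip())
        if match:
            table, field, value = match.groups()
            metrics[(table, field)] = int(value)
    return metrics


def ascent_candidates(metrics):
    """Collect the three ascent-related values as fractions of the em size."""
    upm = metrics[("head", "Units per Em")]
    keys = [("hhea", "Ascent"), ("OS/2", "TypoAscender"), ("OS/2", "WinAscent")]
    return {table + "." + field: metrics[(table, field)] / upm for table, field in keys}
```

For the listing above this gives 1.16 for hhea.Ascent and OS/2 WinAscent, and 0.88 for OS/2 TypoAscender.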
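When drawing text with Source Han fonts in an Android TextView, the text height looks odd. The main problems are:
(1) The text height is nearly 3 times that of normal text.
(2) Chinese text is top-aligned while English text is bottom-aligned.
The fonts used come from:
(1) Adobe's release: adobe-fonts/source-han-sans
(2) Google's release: googlei18n/noto-fonts
To make testing easier, I used a 100sp text size and printed the FontMetrics values: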
private void printFontMetrics() {
    Paint.FontMetrics metrics = mTextView.getPaint().getFontMetrics();
    Log.e("FONT", "metrics top=" + metrics.top + ",ascent=" + metrics.ascent
            + ",descent=" + metrics.descent + ",bottom=" + metrics.bottom
            + ",leading=" + metrics.leading
    );
}
The measured results roughly match the formulas in the following Python script:
#!/usr/bin/env python
from __future__ import print_function
import sys


class Metrics:
    def __init__(self):
        self.leading = 0.0
        self.top = 0.0
        self.ascent = 0.0
        self.descent = 0.0
        self.bottom = 0.0

    def elegant(self, size):
        self.leading = 0
        self.top = -size * 2500.0 / 2048
        self.ascent = -size * 1900.0 / 2048
        self.descent = size * 500.0 / 2048
        self.bottom = size * 1000.0 / 2048

    # (NotoSansHans)top = -180.7,ascent = -88.0,descent = 12.0,bottom = 104.700005,leading = 50.0
    def otf_sans_hans(self, size):
        # top bottom https://raw.githubusercontent.com/adobe-fonts/source-han-sans/1.000/Medium/cidfont.ps.CN
        self.leading = size * 0.5  # OS/2 TypoLineGap / 1000
        self.top = -size * 1.807  # head.yMax
        # https://github.com/adobe-fonts/source-han-sans/blob/1.000/Medium/features.CN
        self.ascent = -size * 0.88  # OS/2 TypoAscender / 1000
        self.descent = size * 0.12  # OS/2 TypoDescender / 1000
        self.bottom = size * 1.047  # head.yMin

    # (NotoSansSC)top = -180.7,ascent = -116.0,descent = 32.0,bottom = 104.700005,leading = 0.0
    def otf_sans_sc(self, size):
        self.leading = 0  # hhea.LineGap
        self.top = -size * 1.807  # head.yMax
        self.ascent = -size * 1.16  # hhea.Ascender
        self.descent = size * 0.32  # hhea.Descender
        self.bottom = size * 1.047  # head.yMin

    # (NotoSerifSC)top = -180.8,ascent = -115.100006,descent = 28.600002,bottom = 104.799995,leading = 0.0
    def otf_serif_sc(self, size):
        self.leading = 0
        self.top = -size * 1.808
        self.ascent = -size * 1.151
        self.descent = size * 0.286
        self.bottom = size * 1.048

    # top = -105.615234,ascent = -92.77344,descent = 24.414063,bottom = 27.09961,leading = 0
    def ttf(self, size):
        self.leading = 0
        self.top = -size * 2163.0 / 2048
        self.ascent = -size * 1900.0 / 2048
        self.descent = size * 500.0 / 2048
        self.bottom = size * 555.0 / 2048

    def printer(self):
        print("top = " + str(self.top)
              + ",ascent = " + str(self.ascent)
              + ",descent = " + str(self.descent)
              + ",bottom = " + str(self.bottom)
              + ",leading = " + str(self.leading))


if __name__ == '__main__':
    metrics = Metrics()
    size = int(sys.argv[1])
    print('elegant:')
    metrics.elegant(size)
    metrics.printer()
    print('NotoSansHans-Medium(1.000):')
    metrics.otf_sans_hans(size)
    metrics.printer()
    print('NotoSansHans-Medium(1.002):')
    metrics.otf_sans_hans(size)
    metrics.leading = 0
    metrics.printer()
    print('NotoSansSC-Medium(1.004):')
    metrics.otf_sans_sc(size)
    metrics.printer()
    print('Native-Medium:')
    metrics.ttf(size)
    metrics.printer()
    print('NotoSerifSC-Medium:')
    metrics.otf_serif_sc(size)
    metrics.printer()
"elegant" refers to calling textview.setElegantTextHeight(true), an API added in Android SDK 21. Its values come from the Android source (frameworks/base/core/jni/android/graphics/Paint.cpp):
static SkScalar getMetricsInternal(JNIEnv* env, jobject jpaint, Paint::FontMetrics *metrics) {
    const int kElegantTop = 2500;
    const int kElegantBottom = -1000;
    const int kElegantAscent = 1900;
    const int kElegantDescent = -500;
    const int kElegantLeading = 0;
    Paint* paint = GraphicsJNI::getNativePaint(env, jpaint);
    TypefaceImpl* typeface = GraphicsJNI::getNativeTypeface(env, jpaint);
    typeface = TypefaceImpl_resolveDefault(typeface);
    FakedFont baseFont = typeface->fFontCollection->baseFontFaked(typeface->fStyle);
    float saveSkewX = paint->getTextSkewX();
    bool savefakeBold = paint->isFakeBoldText();
    MinikinFontSkia::populateSkPaint(paint, baseFont.font, baseFont.fakery);
    SkScalar spacing = paint->getFontMetrics(metrics);
    // The populateSkPaint call may have changed fake bold / text skew
    // because we want to measure with those effects applied, so now
    // restore the original settings.
    paint->setTextSkewX(saveSkewX);
    paint->setFakeBoldText(savefakeBold);
    if (paint->getFontVariant() == VARIANT_ELEGANT) { // this is what the elegant setting affects
        SkScalar size = paint->getTextSize();
        metrics->fTop = -size * kElegantTop / 2048;
        metrics->fBottom = -size * kElegantBottom / 2048;
        metrics->fAscent = -size * kElegantAscent / 2048;
        metrics->fDescent = -size * kElegantDescent / 2048;
        metrics->fLeading = size * kElegantLeading / 2048;
        spacing = metrics->fDescent - metrics->fAscent + metrics->fLeading;
    }
    return spacing;
}
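The elegant branch ignores the font's own tables entirely: every value is a fixed constant in units of 1/2048 of the text size. A small Python port makes the resulting numbers and the returned spacing easy to check. This is a sketch of that one branch only; `elegant_metrics` is a hypothetical name.

```python
def elegant_metrics(size):
    """Python port of the VARIANT_ELEGANT branch; constants are in 1/2048 em."""
    top = -size * 2500 / 2048
    bottom = size * 1000 / 2048    # equals -size * kElegantBottom / 2048
    ascent = -size * 1900 / 2048
    descent = size * 500 / 2048    # equals -size * kElegantDescent / 2048
    leading = size * 0 / 2048
    return {
        "top": top,
        "ascent": ascent,
        "descent": descent,
        "bottom": bottom,
        "leading": leading,
        # same expression the C++ code assigns to spacing
        "spacing": descent - ascent + leading,
    }
```

At a text size of 100 this yields top = -122.0703125, ascent = -92.7734375, descent = 24.4140625, bottom = 48.828125 and spacing = 117.1875, whichever font is loaded.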
Another way to adjust the height is textview.setIncludeFontPadding(false). All it does is make BoringLayout compute the spacing from bottom - top or from descent - ascent:
if (includepad) {
    spacing = metrics.bottom - metrics.top;
    mDesc = metrics.bottom;
} else {
    spacing = metrics.descent - metrics.ascent;
    mDesc = metrics.descent;
}
mBottom = spacing;
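To see how much this flag changes the line height, the branch can be replayed in Python with the NotoSansSC values logged at 100sp. This is only a sketch: `line_spacing` is a hypothetical helper, and the metric values are rounded from the log output quoted in the script comments.

```python
# NotoSansSC metrics logged at 100sp (rounded from the FontMetrics log).
NOTO_SANS_SC_100SP = {"top": -180.7, "ascent": -116.0, "descent": 32.0, "bottom": 104.7}


def line_spacing(metrics, include_pad=True):
    """Mirror the BoringLayout choice between bottom - top and descent - ascent."""
    if include_pad:
        return metrics["bottom"] - metrics["top"]
    return metrics["descent"] - metrics["ascent"]


if __name__ == "__main__":
    print("includeFontPadding=true :", line_spacing(NOTO_SANS_SC_100SP, True))
    print("includeFontPadding=false:", line_spacing(NOTO_SANS_SC_100SP, False))
```

With padding included the line is about 285.4 px tall; without it, 148 px, so the flag removes the space contributed by head.yMax/head.yMin.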
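NotoSansHans is the V1.000 Simplified Chinese release of Source Han Sans; the latest release, V1.004, renames it NotoSansSC.
Taking the NotoSansSC parameters as an example, here is where each value comes from: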
def otf_sans_sc(self, size):
    self.leading = 0
    self.top = -size * 1.807
    self.ascent = -size * 1.16
    self.descent = size * 0.32
    self.bottom = size * 1.047
The leading, ascent and descent values come from the font's hhea table.
According to adobe-fonts/source-han-sans:
table hhea {
    Ascender 1160;
    Descender -320;
    LineGap 0;
} hhea;
it follows that (with 1000 units per em):
ascent = -size * hhea.Ascender / 1000
descent = -size * hhea.Descender / 1000
leading = size * hhea.LineGap / 1000
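As a worked check of these relations, the sketch below plugs the hhea values into the formulas at a 100sp text size. `metrics_from_hhea` is a hypothetical helper. Leading is taken with a positive sign, matching `size * kElegantLeading / 2048` in the Android code; with LineGap 0 the sign makes no difference.

```python
def metrics_from_hhea(size, ascender, descender, line_gap, units_per_em=1000):
    """Scale hhea values to Android-style metrics (y grows downward, so ascent < 0)."""
    return {
        "ascent": -size * ascender / units_per_em,
        "descent": -size * descender / units_per_em,
        "leading": size * line_gap / units_per_em,
    }


if __name__ == "__main__":
    # hhea values of Source Han Sans: Ascender 1160, Descender -320, LineGap 0
    print(metrics_from_hhea(100, 1160, -320, 0))
```

The result, ascent = -116.0 and descent = 32.0, agrees with the values logged for NotoSansSC.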
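top and bottom, however, do not come from the font's FontBBox but from the head.yMax and head.yMin fields. See "修正思源黑体在 Adobe 软件中文本选区过高的问题" (Fixing the overly tall text selection of Source Han Sans in Adobe software); note that the method in that answer has no effect on Android.
Looking at
https://raw.githubusercontent.com/adobe-fonts/source-han-sans/master/Medium/cidfont.ps.CN
we find:
/FontBBox {-1007 -1047 2927 1807} def
The four numbers are xMin/yMin/xMax/yMax, again 1000 times the coefficients above (1807 corresponds to top = -size * 1.807).
Following the elegant formula above, I think the values should be changed to: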
/FontBBox {-1007 -488 2927 1221} def
However, changing FontBBox with the font build tools has no effect on Android, which uses head.yMin and head.yMax.
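The suggested -488 and 1221 are the elegant constants (given in 1/2048 em) rescaled to this font's 1000 units per em. A minimal sketch of that arithmetic, with `elegant_bounds` as a hypothetical name:

```python
def elegant_bounds(units_per_em=1000):
    """Convert kElegantTop (2500) and kElegantBottom (-1000) from 1/2048 em to font units."""
    y_max = round(2500 * units_per_em / 2048)   # 1220.70... rounds to 1221
    y_min = -round(1000 * units_per_em / 2048)  # 488.28... rounds to 488
    return y_min, y_max


if __name__ == "__main__":
    print(elegant_bounds())
```

The same numbers are what go into head.yMin and head.yMax for Android.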
Run
ttx -i ./NotoSansSC-Medium.otf
to generate the font's ttx file.
Then edit the generated ttx file with vi, search for yMin, and find the following section:
<head>
<!-- Most of this table will be recalculated by the compiler -->
<tableVersion value="1.0"/>
<fontRevision value="1.004"/>
<checkSumAdjustment value="0x4386f026"/>
<magicNumber value="0x5f0f3cf5"/>
<flags value="00000000 00000011"/>
<unitsPerEm value="1000"/>
<created value="Mon Jun 15 05:07:55 2015"/>
<modified value="Mon Jun 15 05:07:55 2015"/>
<xMin value="-1007"/>
<yMin value="-1047"/>
<xMax value="2927"/>
<yMax value="1807"/>
<macStyle value="00000000 00000000"/>
<lowestRecPPEM value="3"/>
<fontDirectionHint value="2"/>
<indexToLocFormat value="0"/>
<glyphDataFormat value="0"/>
</head>
Change yMin/yMax to:
<yMin value="-488"/>
<yMax value="1221"/>
Then recompile:
ttx -b ./NotoSansSC-Medium.ttx
This produces ./NotoSansSC-Medium#1.otf.
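If several weights need the same change, the manual vi step can be scripted. The sketch below rewrites only the yMin/yMax elements inside the first `<head>` element and leaves the rest of the ttx text untouched. `patch_head_bounds` is a hypothetical helper, and it assumes the ttx file uses the `<yMin value="..."/>` form shown above.

```python
import re


def patch_head_bounds(ttx_text, y_min, y_max):
    """Return ttx_text with head.yMin/head.yMax replaced; other tables are untouched."""

    def patch_head(match):
        head = match.group(0)
        head = re.sub(r'<yMin value="-?\d+"/>', '<yMin value="%d"/>' % y_min, head)
        head = re.sub(r'<yMax value="-?\d+"/>', '<yMax value="%d"/>' % y_max, head)
        return head

    # Only the first <head>...</head> element is rewritten.
    return re.sub(r"<head>.*?</head>", patch_head, ttx_text, count=1, flags=re.DOTALL)


def patch_file(path, y_min=-488, y_max=1221):
    """Patch a ttx file in place (ttx files are assumed to be UTF-8 here)."""
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(patch_head_bounds(text, y_min, y_max))
```

After `patch_file("NotoSansSC-Medium.ttx")`, the `ttx -b` step is run as before.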
Font build tools:
Adobe Font Development Kit for OpenType
http://www.adobe.com/devnet/opentype/afdko/eula.html
<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<fontconfig>
<!-- SimSun (宋体) -->
<match target="pattern">
<test qual="any" name="family" compare="eq"><string>\5B8B\4F53</string></test>
<edit name="family" mode="assign" binding="same"><string>CESI_SS_GB18030</string></edit>
</match>
<match target="pattern">
<test qual="any" name="family" compare="eq"><string>simsun</string></test>
<edit name="family" mode="assign" binding="same"><string>CESI_SS_GB18030</string></edit>
</match>
<match target="pattern">
<test qual="any" name="family" compare="eq"><string>宋体</string></test>
<edit name="family" mode="assign" binding="same"><string>CESI_SS_GB18030</string></edit>
</match>
<!-- KaiTi (楷体) -->
<match target="pattern">
<test qual="any" name="family" compare="eq"><string>\6977\4F53</string></test>
<edit name="family" mode="assign" binding="same"><string>CESI_KT_GB18030</string></edit>
</match>
<match target="pattern">
<test qual="any" name="family" compare="eq"><string>kaiti</string></test>
<edit name="family" mode="assign" binding="same"><string>CESI_KT_GB18030</string></edit>
</match>
<match target="pattern">
<test qual="any" name="family" compare="eq"><string>楷体</string></test>
<edit name="family" mode="assign" binding="same"><string>CESI_KT_GB18030</string></edit>
</match>
<!-- FangSong (仿宋) -->
<match target="pattern">
<test qual="any" name="family" compare="eq"><string>\4EFF\5B8B</string></test>
<edit name="family" mode="assign" binding="same"><string>CESI_FS_GB18030</string></edit>
</match>
<match target="pattern">
<test qual="any" name="family" compare="eq"><string>fangsong</string></test>
<edit name="family" mode="assign" binding="same"><string>CESI_FS_GB18030</string></edit>
</match>
<match target="pattern">
<test qual="any" name="family" compare="eq"><string>仿宋</string></test>
<edit name="family" mode="assign" binding="same"><string>CESI_FS_GB18030</string></edit>
</match>
<!-- HeiTi (黑体) -->
<match target="pattern">
<test qual="any" name="family" compare="eq"><string>\9ED1\4F53</string></test>
<edit name="family" mode="assign" binding="same"><string>CESI_HT_GB18030</string></edit>
</match>
<match target="pattern">
<test qual="any" name="family" compare="eq"><string>heiti</string></test>
<edit name="family" mode="assign" binding="same"><string>CESI_HT_GB18030</string></edit>
</match>
<match target="pattern">
<test qual="any" name="family" compare="eq"><string>黑体</string></test>
<edit name="family" mode="assign" binding="same"><string>CESI_HT_GB18030</string></edit>
</match>
<!-- Microsoft YaHei (微软雅黑) -->
<match target="pattern">
<test qual="any" name="family" compare="eq"><string>\5FAE\8F6F\96C5\9ED1</string></test>
<edit name="family" mode="assign" binding="same"><string>Noto Sans CJK SC</string></edit>
</match>
<match target="pattern">
<test qual="any" name="family" compare="eq"><string>Microsoft Yahei</string></test>
<edit name="family" mode="assign" binding="same"><string>Noto Sans CJK SC</string></edit>
</match>
<match target="pattern">
<test qual="any" name="family" compare="eq"><string>微软雅黑</string></test>
<edit name="family" mode="assign" binding="same"><string>Noto Sans CJK SC</string></edit>
</match>
</fontconfig>
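The configuration above repeats one `<match>` block per alias, so it can also be generated from a table of target fonts and aliases. The following is a rough sketch under that assumption; `alias_rule` and `build_config` are hypothetical helpers, and the output should be reviewed before being installed.

```python
from xml.sax.saxutils import escape

RULE = (
    '<match target="pattern">\n'
    '<test qual="any" name="family" compare="eq"><string>%s</string></test>\n'
    '<edit name="family" mode="assign" binding="same"><string>%s</string></edit>\n'
    '</match>'
)


def alias_rule(alias, target):
    """One <match> block that maps the requested family `alias` to `target`."""
    return RULE % (escape(alias), escape(target))


def build_config(mapping):
    """mapping: {target family: [alias, ...]} -> complete fontconfig document."""
    rules = [alias_rule(alias, target)
             for target, aliases in mapping.items()
             for alias in aliases]
    return ('<?xml version="1.0"?>\n'
            '<!DOCTYPE fontconfig SYSTEM "fonts.dtd">\n'
            '<fontconfig>\n' + "\n".join(rules) + "\n</fontconfig>\n")


if __name__ == "__main__":
    print(build_config({
        "CESI_SS_GB18030": ["simsun", "宋体"],
        "Noto Sans CJK SC": ["Microsoft Yahei", "微软雅黑"],
    }))
```

Adding a new alias then only means extending the dictionary.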
First install an emoji font, using Noto Color Emoji (the noto-fonts-emoji package) as the example:
sudo pacman -S noto-fonts-emoji
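Among the Fontconfig configuration files, 70-no-bitmaps.conf disables bitmap fonts. Bitmap fonts are sometimes used as a fallback for missing fonts, which can make text render pixelated or too large. Keeping this file in /etc/fonts/conf.d/ disables bitmap fonts.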
sudo ln -s /usr/share/fontconfig/conf.avail/70-no-bitmaps.conf /etc/fonts/conf.d/70-no-bitmaps.conf
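Fontconfig sometimes treats certain emoji fonts as bitmap fonts too, so using this file also disables emoji fonts. If embedded bitmaps are disabled for all fonts, you can still enable them for specific fonts that do not work properly without them. For example, to enable them for Noto Color Emoji: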
gedit 64-enable-emoji.conf
<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "urn:fontconfig:fonts.dtd">
<fontconfig>
<match target="font">
<edit name="embeddedbitmap" mode="assign">
<bool>false</bool>
</edit>
</match>
<match target="font">
<test name="family" qual="any">
<string>Noto Color Emoji</string>
</test>
<edit name="embeddedbitmap">
<bool>true</bool>
</edit>
</match>
</fontconfig>
Place the file in /etc/fonts/conf.d/ (system-wide) or in ~/.config/fontconfig/conf.d/ (per user).
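Scaling bitmap fonts usually makes them blurry, and deleting /etc/fonts/conf.d/10-scale-bitmap-fonts.conf turns the scaling off. Doing so, however, also breaks the scaling of emoji fonts (such as Noto Color Emoji) and makes them huge. Since the other bitmap fonts were already disabled above, keep bitmap font scaling enabled.
Check whether /etc/fonts/conf.d/ contains 10-scale-bitmap-fonts.conf; if it does not, create a symlink to it there: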
sudo ln -s /usr/share/fontconfig/conf.default/10-scale-bitmap-fonts.conf /etc/fonts/conf.d/10-scale-bitmap-fonts.conf
fc-cache -fv
Close the applications that need to display emoji (browser, editor, terminal, and so on), reopen them, and type an emoji to see the result.
1. Every file has an inode number that is unique within its filesystem. List the inode numbers with:
ls -i
2. Use find together with rm to delete a file by its inode number. For example, to delete the file or directory with inode number 2236429:
find -inum 2236429 -exec rm -rf {} \;
This approach is suitable for deleting a single file, or for deleting files with garbled names one at a time.
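The same idea can be sketched in Python when the shell cannot even display the file name. Unlike the find command above, this version looks only at the direct entries of one directory and does not recurse; `remove_by_inode` is a hypothetical helper.

```python
import os
import shutil


def remove_by_inode(directory, inode):
    """Delete the direct entries of `directory` whose inode number equals `inode`."""
    with os.scandir(directory) as entries:
        matches = [entry for entry in entries if entry.inode() == inode]
    removed = []
    for entry in matches:
        if entry.is_dir(follow_symlinks=False):
            shutil.rmtree(entry.path)   # directories, like rm -rf
        else:
            os.remove(entry.path)       # regular files and symlinks
        removed.append(entry.name)
    return removed
```

For example, `remove_by_inode(".", 2236429)` returns the names it deleted, or an empty list if nothing matched.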